1.1 Example: Polynomial Curve Fitting


์ œ์ผ ๊ฐ„๋‹จํ•œ ํšŒ๊ท€๋ฌธ์ œ๋ฅผ ์˜ˆ์‹œ๋กœ ๋“ ๋‹ค.

  • The task is to predict a real-valued target variable t from a real-valued input variable x.

Examining the Training Data

Say we generate 10 training points: the input variable x is a real number between 0 and 1, and the target variable t is sin(2πx) slightly perturbed by a small amount of noise sampled from a Gaussian distribution.

The function sin(2πx) that we want to discover can be thought of as the underlying truth, or law, of the world. Individual real-world observations, however, are usually distorted by factors we cannot know, so the truth stays hidden; we express this as data with added noise. Put simply, passing an input x through the function should yield the answer t = sin(2πx), but what we actually observe is always shifted a little by some unknown factor.

\begin{aligned} \mathbf{x} & \equiv (x_1, \cdots, x_i, \cdots, x_N)^T, \quad x_i \in [0, 1] \\ \mathbf{t} & \equiv (t_1, \cdots, t_i, \cdots, t_N)^T, \quad t_i = \sin(2\pi x_i) + \epsilon_i, \quad \epsilon_i \sim N(\mu, \sigma^2) \end{aligned}

Our goal now is to use the observed training data to predict the target variable \hat{t} for a new input variable \hat{x}.

import numpy as np
import matplotlib.pyplot as plt
# making data
seed = 62
np.random.seed(seed)
N = 10
x = np.random.rand(N)
t = np.sin(2*np.pi*x) + np.random.randn(N) * 0.1
x_sin = np.linspace(0, 1)
t_sin = np.sin(2*np.pi*x_sin)
plt.plot(x_sin, t_sin, c='green')
plt.scatter(x, t)
plt.xlabel('x', fontsize=16)
plt.ylabel('t', rotation=0, fontsize=16)
plt.show()

Polynomial Function

Probability theory gives us a framework for measuring uncertainty in a precise, quantitative way, and decision theory tells us how to turn those probabilistic statements into predictions (decisions) that are optimal under an appropriate criterion. We could attack the uncertainty in our data with those two theories, but here we first approach the problem through a polynomial function:

\tag{1} y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j

M is called the degree of the polynomial and denotes the highest power in the expression. Note that although y(x, \mathbf{w}) is nonlinear in x, it is linear in the coefficients \mathbf{w}, so fitting it is a linear regression problem.

To recap briefly: we proceed as if we do not know the truth of the world, sin(2πx), and use the polynomial to guess at the truth behind the observed data. The coefficients \mathbf{w} of this linear model will be derived from the observed training data. How do we derive them?

By comparing the targets predicted by the polynomial with the actual target values, we can measure how far off the model is (its misfit). The function that does this is called the objective function, or error/loss function, and we achieve our goal of finding the coefficients by minimizing it.

Here we use one of the most widely used objective functions, the sum-of-squares error:

\tag{2} E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (y(x_n, \mathbf{w}) - t_n)^2

def error_function(pred, target):
    """Sum-of-squares error of equation (2)."""
    return (1/2)*((pred-target)**2).sum()

Python Code Solution for Polynomial

First we define the Vandermonde matrix V of shape (N, M+1), where, as before, M is the degree of the polynomial:

V = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^M \\ 1 & x_2 & x_2^2 & \cdots & x_2^M \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^M \end{bmatrix}

def vandermonde_matrix(x, m):
    """Return the (N, m+1) Vandermonde matrix with columns 1, x, ..., x**m."""
    return np.array([x**i for i in range(m+1)]).T
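As a side note (not in the original), NumPy ships the same construction as np.vander; with increasing=True the column order matches ours:

# the built-in equivalent, columns in increasing-power order
V_builtin = np.vander(x, 3 + 1, increasing=True)
assert np.allclose(V_builtin, vandermonde_matrix(x, 3))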

Let us look at the Vandermonde matrix of a degree-3 polynomial.

M = 3
V = vandermonde_matrix(x, M)
print(V.round(3))

# -----print result-----
# [[1.    0.034 0.001 0.   ]
#  [1.    0.489 0.239 0.117]
#  [1.    0.846 0.716 0.606]
#  [1.    0.411 0.169 0.07 ]
#  [1.    0.631 0.399 0.252]
#  [1.    0.291 0.085 0.025]
#  [1.    0.543 0.295 0.16 ]
#  [1.    0.228 0.052 0.012]
#  [1.    0.24  0.058 0.014]
#  [1.    0.953 0.909 0.867]]

The polynomial of equation (1) can now be written with this matrix:

y = V \cdot \mathbf{w} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} w_0 + w_1 x_1 + w_2 x_1^2 + \cdots + w_M x_1^M \\ w_0 + w_1 x_2 + w_2 x_2^2 + \cdots + w_M x_2^M \\ \vdots \\ w_0 + w_1 x_N + w_2 x_N^2 + \cdots + w_M x_N^M \end{bmatrix}

In code:

def polynomial_function(x, w, m):
    assert w.size == m+1, "number of coefficients must equal m+1"
    V = vandermonde_matrix(x, m)  # shape (x.size, M+1)
    return np.dot(V, w)

์ž„์˜์˜ ๊ณ„์ˆ˜๋ฅผ ์ดˆ๊ธฐํ™” ์‹œํ‚ค๊ณ  ๋‹คํ•ญ ํ•จ์ˆ˜ ๊ฐ’์„ ์‚ดํŽด๋ณธ๋‹ค.

np.random.seed(seed)
w = np.random.randn(M+1)
t_hat = polynomial_function(x, w, M)
print(t_hat.round(3))

# -----print result-----
# [-0.03  -0.208  0.016 -0.2   -0.177 -0.162 -0.204 -0.134 -0.14   0.197]
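For reference, we can already see how badly this randomly initialized polynomial fits, using the error_function defined above (a quick check, not in the original):

# sum-of-squares error of the random polynomial; large compared to a fitted one
print(round(error_function(t_hat, t), 3))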

Now rewrite the loss function in matrix form. To keep things compact, define the residual \mathbf{r} = \mathbf{t} - V\mathbf{w}:

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (y(x_n, \mathbf{w}) - t_n)^2 = \frac{1}{2} \Vert \mathbf{t} - V \mathbf{w} \Vert^2 = \frac{1}{2} \Vert \mathbf{r} \Vert^2

Our aim is to reduce this loss as far as possible, i.e. to find its minimum and read off the coefficients: \hat{\mathbf{w}} = \arg\min_{\mathbf{w}} E(\mathbf{w}). Because the loss is quadratic in \mathbf{w}, a unique solution exists where the first derivative vanishes. Applying the chain rule:

\tag{3} \frac{\partial E}{\partial \mathbf{w}} = \begin{bmatrix} \frac{\partial E}{\partial w_0} \\ \vdots \\ \frac{\partial E}{\partial w_M} \end{bmatrix} = \begin{bmatrix} \frac{\partial r_1}{\partial w_0} & \frac{\partial r_2}{\partial w_0} & \cdots & \frac{\partial r_N}{\partial w_0} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial r_1}{\partial w_M} & \frac{\partial r_2}{\partial w_M} & \cdots & \frac{\partial r_N}{\partial w_M} \end{bmatrix} \begin{bmatrix} \frac{\partial E}{\partial r_1} \\ \vdots \\ \frac{\partial E}{\partial r_N} \end{bmatrix} = \frac{\partial \mathbf{r}}{\partial \mathbf{w}} \cdot \frac{\partial E}{\partial \mathbf{r}} = -V^T(\mathbf{t} - V\mathbf{w}) = 0

Differentiating the residuals shows that the matrix at the front of equation (3), with entries \partial r_n / \partial w_j = -x_n^j, is the negated transpose of the Vandermonde matrix.
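As a sanity check on this derivation, the analytic gradient can be compared against finite differences; a minimal sketch, using only the functions defined above:

# compare -V^T (t - V w) with a centered finite-difference gradient
w0 = np.random.randn(M+1)
V = vandermonde_matrix(x, M)
grad_analytic = -V.T.dot(t - V.dot(w0))
eps = 1e-6
grad_fd = np.array([
    (error_function(polynomial_function(x, w0 + eps*np.eye(M+1)[j], M), t)
     - error_function(polynomial_function(x, w0 - eps*np.eye(M+1)[j], M), t)) / (2*eps)
    for j in range(M+1)
])
print(np.allclose(grad_analytic, grad_fd, atol=1e-4))  # expect True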

Setting the gradient to zero finally gives the closed-form solution:

\mathbf{w} = (V^T V)^{-1} V^T \mathbf{t}

def poly_solution(x, t, m):
    """Least-squares coefficients via the normal equations."""
    V = vandermonde_matrix(x, m)
    return np.linalg.inv(np.dot(V.T, V)).dot(V.T).dot(t)

์ด์ œ ๊ณ„์ˆ˜๋ฅผ ๊ตฌํ•ด๋ณธ๋‹ค.

print(f"Solution of coefficients are {poly_solution(x, t, M).round(3)}")

# -----print result-----
# Solution of coefficients are [ -0.245  11.722 -33.194  21.798]

In fact, numpy provides a handier routine for this:

from numpy.polynomial import polynomial as P
print(P.polyfit(x, t, M).round(3))

# -----print result-----
# [ -0.245  11.722 -33.194  21.798]
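One caveat, not from the original: explicitly inverting V^T V is numerically fragile as M grows. np.linalg.lstsq solves the same least-squares problem more stably:

w_lstsq, *_ = np.linalg.lstsq(vandermonde_matrix(x, M), t, rcond=None)
print(w_lstsq.round(3))  # matches poly_solution(x, t, M) up to numerical error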

Finding the Optimal Degree

The problem of finding the optimal coefficients is solved, so the remaining task is to find the optimal degree. We may set the degree of the polynomial to whatever we like, but which choice brings the curve closest to the truth, sin(2πx)? The process of finding it is called model comparison or model selection, and a knob like the degree M that a person sets by hand is called a hyperparameter.

In the previous chapter (Introduction) we said that generalization is the central goal of pattern recognition. Good generalization can be measured by how faithfully the model expresses the truth; here, that means asking how close the model comes to sin(2πx) on new data. How, then, do we measure that?

For the measurement, we sample 100 additional points to build a new data set and use it as the test set.

np.random.seed(seed)
N_test = 100 
x_test = np.random.rand(N_test)
t_test = np.sin(2*np.pi*x_test) + np.random.randn(N_test) * 0.1
plt.plot(x_sin, t_sin, c='green')
plt.scatter(x_test, t_test, c='red')
plt.xlabel('x', fontsize=16)
plt.ylabel('t', rotation=0, fontsize=16)
plt.show()
Each time we choose a degree M, we find the optimal coefficients on the training set and then use those coefficients to evaluate the loss on the training set and the test set separately. One standard measure is the RMS (root-mean-square) error:

E_{RMS} = \sqrt{2E(\mathbf{w}^{*})/N}

Here \mathbf{w}^{*} is the optimal coefficient vector at degree M and N is the number of data points. Dividing by N rescales data sets of different sizes onto an equal footing so they can be compared, and since the loss squares the differences between predictions and targets, taking the square root brings the error back to the same scale as the target variable.

def root_mean_square_error(error, n_samples):
    return np.sqrt(2*error/n_samples)

Over-fitting

Now that we have a way to measure, let us pick the optimal degree. Looping over degrees 0 through 9, we measure the RMS error on the training set and the test set.

def get_rms_error(t_hat, t, n_sample):
    """RMS error of predictions t_hat against targets t."""
    error = error_function(t_hat, t)
    return root_mean_square_error(error, n_sample)

all_w = []
all_rms_train = []
all_rms_test = []

for m in range(10):
    optimal_w = poly_solution(x, t, m)
    t_hat = polynomial_function(x, optimal_w, m)
    t_hat_test = polynomial_function(x_test, optimal_w, m)

    rms_train = get_rms_error(t_hat, t, N)  # N = 10
    rms_test = get_rms_error(t_hat_test, t_test, N_test)  # N_test = 100
    print(f"M={m} | rms_train: {rms_train:.4f} rms_test: {rms_test:.4f}")

    # Plot predicted line
    plt.plot(x_sin, t_sin, c="green", label="sin function")
    plt.plot(x_sin, polynomial_function(x_sin, optimal_w, m), c="red", label=f"model M={m}")
    plt.scatter(x, t)
    plt.xlim((0, 1))
    plt.ylim((-1.25, 1.25))
    plt.xlabel('x', fontsize=16)
    plt.ylabel('t', rotation=0, fontsize=16)
    plt.legend()
    plt.show()

    all_w.append(optimal_w)
    all_rms_train.append(rms_train)
    all_rms_test.append(rms_test)

Among them, let us look at the cases where the degree is 0, 1, 3, and 9.

# M=0 | rms_train: 0.6353 rms_test: 0.7221
# M=1 | rms_train: 0.4227 rms_test: 0.4508
# M=3 | rms_train: 0.0930 rms_test: 0.1238
# M=9 | rms_train: 0.0872 rms_test: 19.2855

์ฐจ์ˆ˜์˜ ์„ ํƒ์— ๋”ฐ๋ฅธ RMS error ์„ ๊ทธ๋ ค๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

plt.scatter(np.arange(10), all_rms_train, facecolors='none', edgecolors='b')
plt.plot(np.arange(10), all_rms_train, c='b', label='Training')
plt.scatter(np.arange(len(all_rms_test)), all_rms_test, facecolors='none', edgecolors='r')
plt.plot(np.arange(len(all_rms_test)), all_rms_test, c='r', label='Test')
plt.legend()
plt.xlim((-0.1, 10))
plt.ylim((-0.1, 1.2))
plt.ylabel("root-mean-squared Error", fontsize=16)
plt.xlabel("M", fontsize=16)
plt.show()

The gap in RMS error between the training set and the test set shrinks at first and then grows. In other words, the optimal degree is one where the train/test gap is small and the test-set error itself is reasonably low.
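Reading that degree off the test curve can be done mechanically; a small sketch:

best_m = int(np.argmin(all_rms_test))
print(f"Best degree by test-set RMS: M={best_m}")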

Now let us print the coefficients at each degree. As the degree M grows, the coefficients can be seen to blow up.

np.set_printoptions(precision=3)
for i in [0, 1, 3, 9]:
    print(f"coefficients at M={i} is {all_w[i]}")

# -----print result-----
# coefficients at M=0 is [0.149]
# coefficients at M=1 is [ 0.961 -1.739]
# coefficients at M=3 is [ -0.245  11.722 -33.194  21.798]
# coefficients at M=9 is [-5.400e+01  2.606e+03 -3.763e+04  2.686e+05 -1.111e+06  2.839e+06 -4.546e+06  4.438e+06 -2.411e+06  5.575e+05]

์˜ค๋ฒ„ํ”ผํŒ…์„ ํ”ผํ•˜๋Š” ๋ฐฉ๋ฒ•

What can be done to avoid over-fitting? For a model of given complexity, more data helps. Consider the example below: with the same degree (M=9), the curve learned from 15 training points and the one learned from 100 differ markedly.

np.random.seed(seed)
N1 = 15
N2 = 100 
x1, x2 = np.random.rand(N1), np.random.rand(N2)
t1 = np.sin(2*np.pi*x1) + np.random.randn(N1) * 0.1
t2 = np.sin(2*np.pi*x2) + np.random.randn(N2) * 0.1

optimal_w1 = poly_solution(x1, t1, m=9)
optimal_w2 = poly_solution(x2, t2, m=9)

# Plot predicted line
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

def plot(x, t, x_sin, t_sin, optimal_w, m, ax):
    ax.plot(x_sin, t_sin, c="green", label="sin function")
    ax.plot(x_sin, polynomial_function(x_sin, optimal_w, m), c="red", label=f"model N={len(x)}")
    ax.scatter(x, t)
    ax.set_xlim((0, 1))
    ax.set_ylim((-1.25, 1.25))
    ax.set_xlabel('x', fontsize=16)
    ax.set_ylabel('t', rotation=0, fontsize=16)
    ax.legend()

plot(x1, t1, x_sin, t_sin, optimal_w1, m=9, ax=ax1)
plot(x2, t2, x_sin, t_sin, optimal_w2, m=9, ax=ax2)

plt.show()

In short, the more data we have, the smaller the over-fitting problem becomes. Put the other way around, the larger the data set, the more complex (flexible) a model we can afford to fit.

By this point it is natural to think that solving a more complex problem requires building a more complex model. Later we will learn to find the model parameters (coefficients) by maximum likelihood, and we will come to understand over-fitting as a general property of maximum likelihood. We will also see that a Bayesian approach can resolve over-fitting.

Regularization

Before getting to those approaches, we first look at another technique called regularization: adding a penalty to the loss function so that the coefficients cannot keep growing. We modify equation (2) as follows.

\tag{4} E(\mathbf{w}) = \dfrac{1}{2} \Vert \mathbf{t} - V \mathbf{w} \Vert^2 + \frac{\lambda}{2} \Vert \mathbf{w} \Vert^2

Here \Vert \mathbf{w} \Vert^2 \equiv \mathbf{w}^T\mathbf{w} = w_0^2 + w_1^2 + \cdots + w_M^2, and the regularization coefficient \lambda is an additional hyperparameter that controls the relative weight of this constraint. Many regularization schemes exist; here we use the simplest one, adding the squared coefficients to the loss as the penalty. Solving equation (4) is just as easy as before:

\begin{aligned} \frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} &= -V^T(\mathbf{t} - V \mathbf{w}) + \lambda \mathbf{w} = 0 \\ \mathbf{w} &= (V^T V + \lambda I_{M+1})^{-1} V^T \mathbf{t} \end{aligned}

def ridge_solution(x, t, m, alpha=0):
    """Regularized (ridge) least-squares solution; alpha plays the role of lambda."""
    V = vandermonde_matrix(x, m)
    return np.linalg.inv(np.dot(V.T, V) + alpha * np.eye(m+1)).dot(V.T).dot(t)

M = 9
optimal_w1 = ridge_solution(x, t, m=M, alpha=1e-8)
optimal_w2 = ridge_solution(x, t, m=M, alpha=1.0)

# Plot predicted line
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

def plot_ridge(x, t, x_sin, t_sin, optimal_w, m, text, ax):
    ax.plot(x_sin, t_sin, c="green", label="sin function")
    ax.plot(x_sin, polynomial_function(x_sin, optimal_w, m), c="red", label=f"model M={m}")
    ax.scatter(x, t)
    ax.set_xlim((0, 1))
    ax.set_ylim((-1.25, 1.25))
    ax.set_xlabel('x', fontsize=16)
    ax.set_ylabel('t', rotation=0, fontsize=16)
    ax.legend()
    ax.annotate(text, (0.6, 0.5), fontsize=14)

plot_ridge(x, t, x_sin, t_sin, optimal_w1, m=M, text='lambda = 1e-8', ax=ax1)
plot_ridge(x, t, x_sin, t_sin, optimal_w2, m=M, text='lambda = 1.0', ax=ax2)

plt.show()
print(f"coefficients at lambda=1e-8 is {optimal_w1.round(3)}")
print(f"coefficients at lambda=1.0 is {optimal_w2.round(3)}")

# -----print result-----
# coefficients at lambda=1e-8 is [  0.104   0.223  33.063 -91.357   3.467  90.523  39.05  -58.172 -72.941 55.889]
# coefficients at lambda=1.0 is  [  0.364   0.321   0.074  -0.155  -0.312  -0.409  -0.465  -0.495  -0.507 -0.51 ]
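As a cross-check, scikit-learn's Ridge solves the same penalized least-squares problem; with fit_intercept=False (the bias column is already in the Vandermonde matrix) its coefficients should agree with ridge_solution. A sketch, assuming scikit-learn is installed:

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0, fit_intercept=False)  # penalizes all of w, like our lambda
ridge.fit(vandermonde_matrix(x, M), t)
print(ridge.coef_.round(3))  # compare with ridge_solution(x, t, M, alpha=1.0)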

The effect of the regularization coefficient is plain in the figures above: the larger \lambda is, the more strongly the coefficients w are restrained from growing, which lowers the model's complexity and guards against over-fitting. In statistics this technique is called a shrinkage method, because it shrinks the values of the coefficients; the particular case used in this example is ridge regression, and in the neural networks we will discuss later it goes by the name weight decay. That said, regularization should not always be cranked up as high as possible: in the figure above, the model with regularization coefficient 1.0 is under-fitted.

The figure below plots the RMS error as the regularization coefficient grows. Even though the degree is 9, the test error stays low across a wide range of \lambda.
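The original shows this comparison as a figure; a minimal sketch to reproduce it with the objects defined above (the exact curve depends on the seed):

# sweep the regularization coefficient at fixed degree M=9
lambdas = np.exp(np.linspace(-35, 0, 50))
rms_tr, rms_te = [], []
for lam in lambdas:
    w_lam = ridge_solution(x, t, m=9, alpha=lam)
    rms_tr.append(get_rms_error(polynomial_function(x, w_lam, 9), t, N))
    rms_te.append(get_rms_error(polynomial_function(x_test, w_lam, 9), t_test, N_test))
plt.plot(np.log(lambdas), rms_tr, c='b', label='Training')
plt.plot(np.log(lambdas), rms_te, c='r', label='Test')
plt.xlabel(r'$\ln \lambda$', fontsize=16)
plt.ylabel('root-mean-squared Error', fontsize=16)
plt.legend()
plt.show()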

(The graph may differ somewhat from the one in the book; since the seed differs, the data cannot be exactly the same as the book's.)
