# EECS 545: Machine Learning¶

## Lecture 05: Linear Regression II¶

• Instructor: Jacob Abernethy
• Date: January 25, 2015

Lecture Exposition Credit: Benjamin Bray and Zhe Du

## Outline for this Lecture¶

• Overfitting
• Regularized Least Squares
• Locally-Weighted Linear Regression
• Maximum Likelihood Interpretation of Linear Regression

• Required:
• [PRML], §3.2: The Bias-Variance Decomposition
• [PRML], §3.3: Bayesian Linear Regression
• Optional:
• [MLAPP], Chapter 7: Linear Regression

## Overfitting¶

### Overfitting: Degree of Linear Regression¶

In [2]:
regression_overfitting_degree(degree0=0, degree1=3, degree2=9, degree3=12)

### Overfitting: Dataset Size¶

In [3]:
regression_overfitting_datasetsize(size0=13, size1=50, size2=100, size3=500)

### Overfitting: Overall Performance¶

• On the left plot, we fix the dataset size and vary the polynomial degree
• On the right plot, we fix the polynomial degree and vary the dataset size
In [4]:
regression_overfitting_curve()

### Rule of Thumb to Choose the Degree¶

• For a small number of datapoints, use a low degree
• Otherwise, the model will overfit!
• As you obtain more data, you can gradually increase the degree
• Add more features to represent more data
• Warning: Your model is still limited by the finite amount of data available. The optimal model for finite data cannot be an infinite-dimensional polynomial!
• Use regularization to control model complexity.
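The plotting helper `regression_overfitting_degree` above is not shown in these notes; as a minimal numerical sketch of the same phenomenon (the toy dataset, seed, and function names here are illustrative, not the lecture's helpers), the snippet below fits polynomials of the same four degrees to a few noisy samples of a sine and compares training and test error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a few noisy samples of sin(2*pi*x)
x_train = rng.uniform(0, 1, 15)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 15)
x_test = rng.uniform(0, 1, 200)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

def poly_features(x, degree):
    # Design matrix Phi with columns x^0, x^1, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)

def fit_least_squares(x, t, degree):
    # Ordinary least squares via the pseudoinverse: w = Phi^+ t
    return np.linalg.pinv(poly_features(x, degree)) @ t

def rmse(x, t, w, degree):
    return np.sqrt(np.mean((poly_features(x, degree) @ w - t) ** 2))

degrees = [0, 3, 9, 12]
weights = {d: fit_least_squares(x_train, t_train, d) for d in degrees}
train_err = {d: rmse(x_train, t_train, weights[d], d) for d in degrees}
test_err = {d: rmse(x_test, t_test, weights[d], d) for d in degrees}
```

Training error can only shrink as the degree grows (a larger feature space contains the smaller ones), while test error typically bottoms out at a moderate degree and worsens beyond it.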

## Regularized Linear Regression¶

### Coefficients of Overfitting¶

• Before we move to regularized linear regression, let's first look at what happens to the coefficients $\vec{w}$ when the model overfits.
In [5]:
regression_overfitting_coeffs()
| | M=0 (Underfitting) | M=3 (Good) | M=9 (Overfitting) | M=12 (Overfitting) |
|---|---|---|---|---|
| w_0 | 9.22491 | 7.66854 | -0.240721 | 0.036502 |
| w_1 | | -75.2974 | 6.36637 | -1.282764 |
| w_2 | | 172.044 | -69.1889 | 19.592658 |
| w_3 | | -14.2807 | 397.481 | -170.650848 |
| w_4 | | | -1290.02 | 934.785369 |
| w_5 | | | 2328.31 | -3348.827201 |
| w_6 | | | -2093.26 | 7896.286428 |
| w_7 | | | 580.659 | -11973.474257 |
| w_8 | | | 247.134 | 10891.034834 |
| w_9 | | | -28.6505 | -4862.568457 |
| w_10 | | | | 131.114412 |
| w_11 | | | | 582.180371 |
| w_12 | | | | -28.827647 |


### Regularized Least Squares: Objective Function¶

• Recall that the objective function we minimized in the last lecture is $$E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2$$

• To penalize large coefficients, we add a penalization/regularization term and minimize the two together: $$E(\vec{w}) = \underbrace{ \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 }_{E_D(\vec{w})}+ \underbrace{\boxed{\frac{\lambda}{2} \left \| \vec{w} \right \|^2}}_{E_W(\vec{w})}$$ where $E_D(\vec{w})$ is the sum-of-squared-errors term and $E_W(\vec{w})$ is the regularization term.

• $\lambda$ is the regularization coefficient.

• If $\lambda$ is large, $E_W(\vec{w})$ will dominate the objective function. As a result, we focus more on minimizing $E_W(\vec{w})$: the resulting solution $\vec{w}$ tends to have a smaller norm, while the $E_D(\vec{w})$ term becomes larger.

### Regularized Least Squares: Derivation¶

• Based on what we derived in the last lecture, we can write the objective function as \begin{align} E(\vec{w}) &= \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \left \| \vec{w} \right \|^2 \\ &= \frac12 \vec{w}^T \Phi^T \Phi \vec{w} - \vec{t}^T \Phi \vec{w} + \frac12 \vec{t}^T \vec{t} + \frac{\lambda}{2}\vec{w}^T\vec{w} \end{align}

• The gradient is \begin{align} \nabla_\vec{w} E(\vec{w}) &= \Phi^T \Phi \vec{w} - \Phi^T \vec{t} + \lambda \vec{w}\\ &= (\Phi^T \Phi + \lambda I)\vec{w} - \Phi^T \vec{t} \end{align}

• Setting the gradient to 0, we will get the solution $$\boxed{ \hat{\vec{w}}=(\Phi^T \Phi + \lambda I)^{-1} \Phi^T \vec{t} }$$

• For the ordinary least squares solution $\hat{\vec{w}}=(\Phi^T \Phi)^{-1} \Phi^T \vec{t}$, we cannot guarantee that $\Phi^T \Phi$ is invertible. In regularized least squares, however, $\Phi^T \Phi + \lambda I$ is always invertible whenever $\lambda > 0$.
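A direct implementation of this closed form is a few lines of numpy (the toy dataset and function names below are illustrative; they are not the lecture's plotting helpers):

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge_fit(Phi, t, lam):
    # Regularized least squares: w = (Phi^T Phi + lam*I)^{-1} Phi^T t
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Toy data: degree-9 polynomial features on 10 noisy points
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)
Phi = np.vander(x, 10, increasing=True)

w_small_lam = ridge_fit(Phi, t, 0.1)
w_large_lam = ridge_fit(Phi, t, 10.0)
```

Increasing $\lambda$ shrinks the norm of the fitted coefficients, exactly as the discussion above predicts.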

### Regularized Least Squares: Example¶

In [6]:
regression_regularization_plot()

### Regularized Least Squares: Coefficients¶

• Let's look at how the coefficients change after we add regularization
In [7]:
regression_regularization_coeff()
| | $\lambda = 0$ | $\lambda = e^{1}$ | $\lambda = e^{10}$ |
|---|---|---|---|
| w_0 | 30.203406 | 13.701085 | 0.010221 |
| w_1 | 7133.542582 | 13.267902 | 0.013419 |
| w_2 | -31022.107050 | 12.423422 | 0.020149 |
| w_3 | 53507.324765 | 8.040454 | 0.029552 |
| w_4 | -48906.151251 | 0.865971 | 0.038200 |
| w_5 | 26564.237381 | -4.354079 | 0.034013 |
| w_6 | -9013.171136 | -1.827607 | -0.002813 |
| w_7 | 1929.741748 | 2.147727 | -0.054953 |
| w_8 | -253.351938 | -0.613354 | 0.023119 |
| w_9 | 18.620246 | 0.073751 | -0.003275 |
| w_10 | -0.586531 | -0.003284 | 0.000155 |

### Regularized Least Squares: Summary¶

• Simple modification of linear regression
• $\ell^2$ Regularization controls the tradeoff between fitting error and complexity.
• Small $\ell^2$ regularization results in complex models, but with risk of overfitting
• Large $\ell^2$ regularization results in simple models, but with risk of underfitting
• It is important to find a regularization strength that balances the two
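One common way to choose the regularization strength is to score a grid of $\lambda$ values on held-out data. A minimal sketch, where the split sizes, grid, and toy data are all our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data split into training and validation parts
x = rng.uniform(0, 1, 60)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 60)
Phi = np.vander(x, 10, increasing=True)
Phi_tr, t_tr = Phi[:40], t[:40]
Phi_val, t_val = Phi[40:], t[40:]

def ridge_fit(Phi, t, lam):
    # w = (Phi^T Phi + lam*I)^{-1} Phi^T t
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

lams = np.exp(np.arange(-20.0, 5.0))   # candidate lambdas on a log grid
val_mse = [np.mean((Phi_val @ ridge_fit(Phi_tr, t_tr, lam) - t_val) ** 2)
           for lam in lams]
best_lam = lams[int(np.argmin(val_mse))]
```

In practice one would use cross-validation rather than a single split, but the idea is the same: pick the $\lambda$ that predicts best on data the model was not fit to.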

## Locally-Weighted Linear Regression¶

### Locally-Weighted Linear Regression¶

• Main Idea: Given a new observation $\vec{x}$, we generate the coefficients $\vec{w}$ and the prediction $y(\vec{x}, \vec{w})$ by giving high weights to the neighbors of $\vec{x}$.

• Regular vs. Locally-Weighted Linear Regression

Linear Regression

1. Fit $\vec{w}$ to minimize $\sum_{n} (\vec{w}^T \phi(\vec{x}_n) - t_n )^2$, where $\{(\vec{x}_n, t_n)\}_{n=1}^N$ is the training dataset.
2. For every new observation $\vec{x}$ to be predicted, output $\vec{w}^T \phi(\vec{x})$
**Note**: **One** $\vec{w}$ for **all** observations to be predicted.

Locally-weighted Linear Regression

1. For **every** new observation $\vec{x}$ to be predicted, generate a weight $r_n$ for each training sample $(\vec{x}_n, t_n)$. (The closer $\vec{x}_n$ is to $\vec{x}$, the larger $r_n$ will be.)
2. Fit $\vec{w}$ to minimize $\sum_{n} r_n (\vec{w}^T \phi(\vec{x}_n) - t_n )^2$, where $\{(\vec{x}_n, t_n)\}_{n=1}^N$ is the training dataset.
3. Output $\vec{w}^T \phi(\vec{x})$
**Note**: **One** $\vec{w}$ for **each** observation to be predicted.

### Locally-Weighted Linear Regression: Weights¶

• The standard choice for weights $\vec{r}$ uses the Gaussian Kernel, with kernel width $\tau$ $$r_n = \exp\left( -\frac{|| \vec{x}_n - \vec{x} ||^2}{2\tau^2} \right)$$

• Choice of kernel width matters.

• The bell shape is the weight curve, which has maximum at the query point $\vec{x}$ and decreases as we move farther.
• The best kernel includes as many training points as can be accommodated by the model. Too large a kernel includes points that degrade the fit; too small a kernel neglects points that would increase confidence in the fit.

### Locally-Weighted Linear Regression: Derivation¶

• Recall that in regular linear regression, we have $$E(\vec{w}) = \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 = \left \| \Phi \vec{w}-\vec{t} \right \|^2$$

• In locally-weighted linear regression, we are to minimize $$E(\vec{w}) = \sum_{n=1}^{N} r_n (\vec{w}^T \phi(\vec{x}_n) - t_n )^2 = \sum_{n=1}^{N} (\sqrt{r_n} \vec{w}^T \phi(\vec{x}_n) - \sqrt{r_n} t_n )^2 = \left \| \sqrt{R} \Phi \vec{w}- \sqrt{R} \vec{t} \right \|^2$$ where $$R = \begin{bmatrix} r_1 & & & \\ & r_2 & & \\ & & \ddots & \\ & & & r_N \end{bmatrix}$$

• Recall that the solution to $\arg \min_{\vec{w}} \left \| \Phi \vec{w}-\vec{t} \right \|^2$ is $\Phi^\dagger \vec{t}$. Similarly, the solution to $\arg \min_{\vec{w}} \left \| \sqrt{R} \Phi \vec{w}- \sqrt{R} \vec{t} \right \|^2$ is $$\boxed{\hat{\vec{w}} = (\sqrt{R} \Phi)^\dagger \sqrt{R} \vec{t}}$$
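Putting the Gaussian-kernel weights and this weighted pseudoinverse together gives a short implementation (a sketch; the function name, toy data, and choice of $\tau$ are ours, not the lecture's):

```python
import numpy as np

def lwlr_predict(x_query, X, t, tau, degree=1):
    # Gaussian-kernel weights: r_n = exp(-||x_n - x||^2 / (2 tau^2))
    r = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))
    Phi = np.vander(X, degree + 1, increasing=True)
    sqrt_r = np.sqrt(r)
    # w = (sqrt(R) Phi)^+ sqrt(R) t
    w = np.linalg.pinv(sqrt_r[:, None] * Phi) @ (sqrt_r * t)
    phi_q = x_query ** np.arange(degree + 1)   # features of the query point
    return phi_q @ w

rng = np.random.default_rng(3)
X = np.linspace(0, 1, 50)
t = np.sin(2 * np.pi * X) + rng.normal(0, 0.1, 50)
pred = lwlr_predict(0.25, X, t, tau=0.1)
```

Note that with a very large $\tau$, every $r_n \approx 1$ and the method reduces to ordinary least squares on the whole dataset, which matches the "one $\vec{w}$ per query" discussion above.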

## Probabilistic Interpretation of Least Squares Regression¶

• We have derived the solution to least squares regression by minimizing an objective function. Now we provide a probabilistic perspective. Specifically, we will show that the solution to regular least squares is the maximum likelihood estimate of $\vec{w}$, and the solution to regularized least squares is the maximum a posteriori (MAP) estimate.

### Some Background¶

• Gaussian Distribution $$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right]$$

• Maximum Likelihood Estimation and Maximum a Posteriori Estimation (MAP)

• Suppose $t \sim p(t|\theta)$, where $\theta$ is an unknown parameter (such as a mean or variance) to be estimated.
• Given observation $\vec{t} = (t_1, t_2, \dots, t_N)$,
• The Maximum Likelihood Estimator is $$\theta_{ML} = \arg \max \prod_{n=1}^N p(t_n | \theta)$$
• If we have prior knowledge $p(\theta)$ about $\theta$, the MAP estimator maximizes the posterior probability of $\theta$: $$\theta_{MAP} = \arg \max p(\theta | \vec{t}) = \arg \max \; p(\theta) \prod_{n=1}^N p(t_n | \theta)$$
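As a concrete one-parameter illustration (our own toy example, not from the lecture): estimate the mean $\mu$ of a Gaussian with known precision $\beta$, under a $\mathcal{N}(0, \alpha^{-1})$ prior on $\mu$. The ML estimate is the sample mean, while the MAP estimate shrinks it toward the prior mean $0$:

```python
import numpy as np

rng = np.random.default_rng(4)

beta, alpha = 4.0, 1.0                     # noise precision, prior precision
t = rng.normal(2.0, beta ** -0.5, 25)      # 25 draws with true mean 2.0

# ML: maximize prod_n N(t_n | mu, beta^{-1})  ->  the sample mean
mu_ml = t.mean()

# MAP with prior mu ~ N(0, alpha^{-1}): mode of the posterior p(mu | t)
mu_map = beta * t.sum() / (len(t) * beta + alpha)
```

As $\alpha \to 0$ (an increasingly flat prior), the MAP estimate approaches the ML estimate, which previews the ML-vs-MAP comparison below.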

### Maximum Likelihood Estimator $\vec{w}_{ML}$¶

• We assume the signal+noise model for a single data point $(\vec{x}, t)$ is $$\begin{gather} t = \vec{w}^T \phi(\vec{x}) + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \end{gather}$$ where $\vec{w}^T \phi(\vec{x})$ is the true model and $\epsilon$ is the noise/perturbation.

• Since $\vec{w}^T \phi(\vec{x})$ is deterministic/non-random, we have $$t \sim \mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1})$$

• The likelihood function of $t$ is just the probability density function (PDF) of $t$: $$p(t|\vec{x},\vec{w},\beta) = \mathcal{N}(t|\vec{w}^T \phi(\vec{x}),\beta^{-1})$$

• For inputs $\mathcal{X}=(\vec{x}_1, \dots, \vec{x}_N)$ and target values $\vec{t}=(t_1,\dots,t_N)$, the data likelihood is $$p(\vec{t}|\mathcal{X},\vec{w},\beta) = \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) = \prod_{n=1}^N \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1})$$

• Notation Clarification

• $p(t|\vec{x},\vec{w},\beta)$ is the PDF of $t$, whose distribution is parameterized by $\vec{x},\vec{w},\beta$.
• $\mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1})$ is the Gaussian distribution with mean $\vec{w}^T \phi(\vec{x})$ and variance $\beta^{-1}$.
• $\mathcal{N}(t|\vec{w}^T \phi(\vec{x}),\beta^{-1})$ is the PDF of $t$, which has Gaussian distribution $\mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1})$

### Maximum Likelihood Estimator $\vec{w}_{ML}$¶

• Main Idea of Maximum Likelihood Estimate

• Given $\{ \vec{x}_n, t_n \}_{n=1}^N$, we want to find the $\vec{w}_{ML}$ that maximizes the data likelihood function $$\vec{w}_{ML} =\arg \max p(\vec{t}|\mathcal{X},\vec{w},\beta) =\arg \max \prod_{n=1}^N \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1})$$ and by derivation we will show that $\vec{w}_{ML}$ is equivalent to the least squares solution $\hat{\vec{w}} = \Phi^\dagger \vec{t}$.
• Intuition about Maximum Likelihood Estimation

• Finding the maximum likelihood estimate $\vec{w}_{ML} = \arg \max p(\vec{t}|\mathcal{X},\vec{w},\beta)$ is just finding the parameter $\vec{w}$ under which, for inputs $\mathcal{X}=(\vec{x}_1, \dots, \vec{x}_N)$, the observed targets $\vec{t}=(t_1,\dots,t_N)$ are the most likely to be generated among all possible $\vec{t}$.

### Maximum Likelihood Estimator $\vec{w}_{ML}$: Derivation¶

• Single data likelihood is $$p(t_n|\vec{x}_n,\vec{w},\beta) = \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) = \frac{1}{\sqrt{2 \pi \beta^{-1}}} \exp \left \{ - \frac{1}{2 \beta^{-1}} (t_n - \vec{w}^T \phi(x_n))^2 \right \}$$

• Single data log-likelihood is $$\ln p(t_n|\vec{x}_n,\vec{w},\beta) = - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2$$ We use the logarithm because the maximizer of $f(x)$ is the same as the maximizer of $\log f(x)$, and the logarithm converts the product into a summation, which makes life easier.

• Complete data log-likelihood is \begin{align} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) &= \ln \left[ \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) \right] = \sum_{n=1}^N \ln p(t_n|\vec{x}_n,\vec{w},\beta) \\ &= \sum_{n=1}^N \left[ - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \end{align}

• Maximum likelihood estimate $\vec{w}_{ML}$ is \begin{align} \vec{w}_{ML} &= \underset{\vec{w}}{\arg \max} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \sum_{n=1}^N \left[(\vec{w}^T \phi(x_n) - t_n)^2 \right] \end{align}

• Familiar? Recall the objective function we minimized in least squares is $E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2$, so we could conclude that $$\boxed{\vec{w}_{ML} = \hat{\vec{w}}_{LS} = \Phi^\dagger \vec{t}}$$
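The equivalence can be sanity-checked numerically (toy data and names are illustrative): the pseudoinverse solution should maximize the Gaussian log-likelihood, so random perturbations of it can only lower the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data with cubic polynomial features
X = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * X) + rng.normal(0, 0.2, 30)
Phi = np.vander(X, 4, increasing=True)
beta = 25.0                                   # assumed noise precision

w_ls = np.linalg.pinv(Phi) @ t                # least-squares solution

def log_likelihood(w):
    resid = Phi @ w - t
    return np.sum(-0.5 * np.log(2 * np.pi / beta) - 0.5 * beta * resid ** 2)

# Perturbing w_ls in random directions never increases the log-likelihood
perturbed = [log_likelihood(w_ls + 0.01 * rng.normal(size=4)) for _ in range(20)]
```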

### MAP Estimator $\vec{w}_{MAP}$¶

• The MAP estimator is obtained by \begin{align} \vec{w}_{MAP} &= \arg \max p(\vec{w}|\vec{t}, \mathcal{X},\beta) & & (\text{Posterior probability})\\ &= \arg \max \frac{p(\vec{w}, \vec{t}, \mathcal{X},\beta)}{p(\vec{t}, \mathcal{X}, \beta)} \\ &= \arg \max \frac{p(\vec{t}|\vec{w}, \mathcal{X},\beta)\, p(\vec{w}, \mathcal{X}, \beta)}{p(\vec{t}, \mathcal{X}, \beta)} \\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta)\, p(\vec{w}, \mathcal{X}, \beta) & & (p(\vec{t}, \mathcal{X}, \beta) \text{ does not depend on } \vec{w})\\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta)\, p(\vec{w})\, p(\mathcal{X})\, p(\beta) & & (\text{Independence}) \\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta)\, p(\vec{w}) & & (\text{Likelihood} \times \text{Prior}) \end{align} These steps are just applications of Bayes' theorem and the independence assumptions.
• The only difference from the ML estimator is the extra factor $p(\vec{w})$, the PDF encoding our prior belief about $\vec{w}$. Here, we assume $$\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$$
• ML vs. MAP
• Maximum Likelihood: We know nothing about $\vec{w}$ initially, and every $\vec{w}$ is equally likely
• Maximum a Posteriori: We know something about $\vec{w}$ initially, and certain $\vec{w}$ are more likely (depending on the prior $p(\vec{w})$). In other words, candidate $\vec{w}$'s are weighted by the prior.
• The assumption $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ makes sense because
• In regularized least squares
• We already know that large coefficients $\vec{w}$, which may lead to overfitting, should be avoided.
• The larger the regularization coefficient $\lambda$, the smaller $\left \| \vec{w} \right \|$ will be.
• When we use $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$
• $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ encodes the assumption that a $\vec{w}$ with a smaller norm $\left \| \vec{w} \right \|$ is more "likely" than a $\vec{w}$ with a bigger norm.
• When we increase $\alpha$, the variance shrinks, so a small $\left \| \vec{w} \right \|$ becomes much more likely

### MAP Estimator $\vec{w}_{MAP}$: Derivation¶

• $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ is a multivariate Gaussian with PDF $$p(\vec{w}) = \frac{1}{\left( \sqrt{2 \pi \alpha^{-1}} \right)^{M}} \exp \left \{ -\frac{\alpha}{2} \left \| \vec{w} \right \|^2 \right \}$$ where $M$ is the dimension of $\vec{w}$.

• So the MAP estimator is \begin{align} \vec{w}_{MAP} &= \underset{\vec{w}}{\arg \max} \ p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) = \underset{\vec{w}}{\arg \max} \left[\ln p(\vec{t}|\vec{w}, \mathcal{X},\beta) + \ln p(\vec{w}) \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 + \frac{\alpha}{2} \left \| \vec{w} \right \|^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \frac12 \sum_{n=1}^N (\vec{w}^T \phi(x_n) - t_n)^2 + \frac12 \frac{\alpha}{\beta} \left \| \vec{w} \right \|^2 \right] \end{align}

• Exactly the objective in regularized least squares, with $\lambda = \alpha/\beta$! So $$\boxed{ \vec{w}_{MAP} = \hat{\vec{w}}=\left(\Phi^T \Phi + \frac{\alpha}{\beta} I\right)^{-1} \Phi^T \vec{t} }$$
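This too can be checked numerically (toy setup and names are ours): the closed form above should minimize the regularized objective with $\lambda = \alpha/\beta$, so perturbing it can only increase the objective.

```python
import numpy as np

rng = np.random.default_rng(6)

X = rng.uniform(0, 1, 20)
t = np.sin(2 * np.pi * X) + rng.normal(0, 0.2, 20)
Phi = np.vander(X, 6, increasing=True)
alpha, beta = 2.0, 25.0                    # prior and noise precisions
lam = alpha / beta

# MAP estimate = ridge solution with lambda = alpha / beta
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(6), Phi.T @ t)

def objective(w):
    # (1/2) sum of squared errors + (lam/2) ||w||^2
    return 0.5 * np.sum((Phi @ w - t) ** 2) + 0.5 * lam * (w @ w)

# Perturbing w_map in random directions never decreases the objective
perturbed = [objective(w_map + 0.01 * rng.normal(size=6)) for _ in range(20)]
```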