EECS 545: Machine Learning

Lecture 05: Linear Regression II

  • Instructor: Jacob Abernethy
  • Date: January 25, 2015

Lecture Exposition Credit: Benjamin Bray and Zhe Du

Outline for this Lecture

  • Overfitting
  • Regularized Least Squares
  • Locally-Weighted Linear Regression
  • Maximum Likelihood Interpretation of Linear Regression

Reading List

  • Required:
    • [PRML], §3.2: The Bias-Variance Decomposition
    • [PRML], §3.3: Bayesian Linear Regression
  • Optional:
    • [MLAPP], Chapter 7: Linear Regression

Overfitting

Overfitting: Degree of Linear Regression

In [2]:
regression_overfitting_degree(degree0=0, degree1=3, degree2=9, degree3=12)

Overfitting: Dataset Size

In [3]:
regression_overfitting_datasetsize(size0=13, size1=50, size2=100, size3=500)

Overfitting: Overall Performance

  • On the left plot, we fix the dataset size and vary the polynomial degree
  • On the right plot, we fix the polynomial degree and vary the dataset size
In [4]:
regression_overfitting_curve()

Rule of Thumb to Choose the Degree

  • For a small number of datapoints, use a low degree
    • Otherwise, the model will overfit!
  • As you obtain more data, you can gradually increase the degree
    • Add more features to represent more data
    • Warning: Your model is still limited by the finite amount of data available. (The optimal model for finite data cannot be an infinite-dimensional polynomial!)
  • Use regularization to control model complexity; a small sketch of the degree rule of thumb follows below.
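A minimal sketch of the degree rule of thumb above, using synthetic data rather than the course's plotting helpers; all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in dataset: a noisy sine curve
x_train = rng.uniform(0, 1, 15)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
x_val = rng.uniform(0, 1, 100)
t_val = np.sin(2 * np.pi * x_val) + rng.normal(0, 0.3, x_val.size)

for degree in [0, 3, 9, 12]:
    # Least-squares polynomial fit; high degrees may be ill-conditioned on 15 points
    coeffs = np.polyfit(x_train, t_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - t_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - t_val) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  validation MSE={val_mse:.3f}")
```

Training error keeps decreasing as the degree grows, while validation error typically rises again for high degrees: the signature of overfitting.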

Regularized Linear Regression

Coefficients of Overfitting

  • Before we move to regularized linear regression, let's first look at what happens to the coefficients $\vec{w}$ when the model overfits.
In [5]:
regression_overfitting_coeffs()
| | M=0 (Underfitting) | M=3 (Good) | M=9 (Overfitting) | M=12 (Overfitting) |
|---|---|---|---|---|
| w_0 | 9.22491 | 7.66854 | -0.240721 | 0.036502 |
| w_1 | | -75.2974 | 6.36637 | -1.282764 |
| w_2 | | 172.044 | -69.1889 | 19.592658 |
| w_3 | | -14.2807 | 397.481 | -170.650848 |
| w_4 | | | -1290.02 | 934.785369 |
| w_5 | | | 2328.31 | -3348.827201 |
| w_6 | | | -2093.26 | 7896.286428 |
| w_7 | | | 580.659 | -11973.474257 |
| w_8 | | | 247.134 | 10891.034834 |
| w_9 | | | -28.6505 | -4862.568457 |
| w_10 | | | | 131.114412 |
| w_11 | | | | 582.180371 |
| w_12 | | | | -28.827647 |

Regularized Least Squares: Objective Function

  • Recall that the objective function we minimized in the last lecture was $$ E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 $$

  • To penalize large coefficients, we add a penalization/regularization term and minimize both together: $$ E(\vec{w}) = \underbrace{ \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 }_{E_D(\vec{w})}+ \underbrace{\boxed{\frac{\lambda}{2} \left \| \vec{w} \right \|^2}}_{E_W(\vec{w})} $$ where $E_D(\vec{w})$ is the sum-of-squared-errors term and $E_W(\vec{w})$ is the regularization term.

  • $\lambda$ is the regularization coefficient.

  • If $\lambda$ is large, $E_W(\vec{w})$ dominates the objective function. As a result, we focus more on minimizing $E_W(\vec{w})$: the resulting solution $\vec{w}$ tends to have a smaller norm, while the $E_D(\vec{w})$ term becomes larger.

Regularized Least Squares: Derivation

  • Based on what we derived in the last lecture, we can write the objective function as $$ \begin{align} E(\vec{w}) &= \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \left \| \vec{w} \right \|^2 \\ &= \frac12 \vec{w}^T \Phi^T \Phi \vec{w} - \vec{t}^T \Phi \vec{w} + \frac12 \vec{t}^T \vec{t} + \frac{\lambda}{2}\vec{w}^T\vec{w} \end{align} $$

  • The gradient is $$ \begin{align} \nabla_\vec{w} E(\vec{w}) &= \Phi^T \Phi \vec{w} - \Phi^T \vec{t} + \lambda \vec{w}\\ &= (\Phi^T \Phi + \lambda I)\vec{w} - \Phi^T \vec{t} \end{align} $$

  • Setting the gradient to 0, we will get the solution $$ \boxed{ \hat{\vec{w}}=(\Phi^T \Phi + \lambda I)^{-1} \Phi^T \vec{t} } $$

  • For the ordinary least squares solution $\hat{\vec{w}}=(\Phi^T \Phi)^{-1} \Phi^T \vec{t}$, we cannot guarantee that $\Phi^T \Phi$ is invertible. In regularized least squares, however, $\Phi^T \Phi + \lambda I$ is always invertible when $\lambda > 0$, since it is positive definite. A small numerical sketch of the closed-form solution follows.
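A minimal NumPy sketch of the boxed solution, assuming a design matrix $\Phi$ and target vector $\vec{t}$ have already been built; the dataset and function names are illustrative, not part of the lecture code:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Regularized least squares: w = (Phi^T Phi + lam*I)^{-1} Phi^T t."""
    M = Phi.shape[1]
    # Solve the linear system rather than forming the inverse explicitly
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Degree-9 polynomial features for 20 noisy samples of a sine curve
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)
Phi = np.vander(x, 10, increasing=True)      # columns 1, x, x^2, ..., x^9

w_unreg = ridge_fit(Phi, t, lam=0.0)         # unregularized: coefficients tend to blow up
w_reg = ridge_fit(Phi, t, lam=np.exp(-3))    # regularized: much smaller norm
print(np.linalg.norm(w_unreg), np.linalg.norm(w_reg))
```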

Regularized Least Squares: Example

In [6]:
regression_regularization_plot()

Regularized Least Squares: Coefficients

  • Let's look at how the coefficients change after we add regularization
In [7]:
regression_regularization_coeff()
| | $\lambda = 0$ | $\lambda = e^{1}$ | $\lambda = e^{10}$ |
|---|---|---|---|
| w_0 | 30.203406 | 13.701085 | 0.010221 |
| w_1 | 7133.542582 | 13.267902 | 0.013419 |
| w_2 | -31022.107050 | 12.423422 | 0.020149 |
| w_3 | 53507.324765 | 8.040454 | 0.029552 |
| w_4 | -48906.151251 | 0.865971 | 0.038200 |
| w_5 | 26564.237381 | -4.354079 | 0.034013 |
| w_6 | -9013.171136 | -1.827607 | -0.002813 |
| w_7 | 1929.741748 | 2.147727 | -0.054953 |
| w_8 | -253.351938 | -0.613354 | 0.023119 |
| w_9 | 18.620246 | 0.073751 | -0.003275 |
| w_10 | -0.586531 | -0.003284 | 0.000155 |

Regularized Least Squares: Summary

  • Simple modification of linear regression
  • $\ell^2$ Regularization controls the tradeoff between fitting error and complexity.
    • Small $\ell^2$ regularization results in complex models, but with risk of overfitting
    • Large $\ell^2$ regularization results in simple models, but with risk of underfitting
  • It is important to find a regularization strength that balances the two, e.g. by choosing $\lambda$ on a validation set (sketched below)
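One common recipe for that balance, shown here as a rough sketch with synthetic data (not the course's code): sweep $\lambda$ on a log scale and keep the value with the lowest held-out error.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    # Closed-form regularized least squares (same formula as in the derivation above)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

rng = np.random.default_rng(1)
x_tr, x_va = rng.uniform(0, 1, 20), rng.uniform(0, 1, 200)
t_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0, 0.2, x_tr.size)
t_va = np.sin(2 * np.pi * x_va) + rng.normal(0, 0.2, x_va.size)
Phi_tr = np.vander(x_tr, 10, increasing=True)   # degree-9 polynomial features
Phi_va = np.vander(x_va, 10, increasing=True)

# Sweep lambda on a log scale; small lambda risks overfitting, large lambda underfitting
lams = np.exp(np.linspace(-20, 5, 26))
val_errs = [np.mean((Phi_va @ ridge_fit(Phi_tr, t_tr, lam) - t_va) ** 2) for lam in lams]
print("best lambda:", lams[int(np.argmin(val_errs))])
```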

Locally-Weighted Linear Regression

Locally-Weighted Linear Regression

  • Main Idea: Given a new observation $\vec{x}$, we generate the coefficients $\vec{w}$ and prediction $y(\vec{x}, \vec{w})$ by giving high weights to the neighbours of $\vec{x}$.

  • Regular vs. Locally-Weighted Linear Regression

Linear Regression

1. Fit $\vec{w}$ to minimize $\sum_{n} (\vec{w}^T \phi(\vec{x}_n) - t_n )^2$, where $\{(\vec{x}_n, t_n)\}_{n=1}^N$ is the training dataset.
2. For every new observation $\vec{x}$ to be predicted, output $\vec{w}^T \phi(\vec{x})$
**Note**: **One** $\vec{w}$ for **all** observations to be predicted.

Locally-weighted Linear Regression

1. For **every** new observation $\vec{x}$ to be predicted, generate the weights $r_n$ for every training sample $(\vec{x}_n, t_n)$. (The closer $\vec{x}_n$ is to $\vec{x}$, the larger $r_n$ will be)
2. Fit $\vec{w}$ to minimize $\sum_{n} r_n (\vec{w}^T \phi(\vec{x}_n) - t_n )^2$, where $\{(\vec{x}_n, t_n)\}_{n=1}^N$ is the training dataset.
3. Output $\vec{w}^T \phi(\vec{x})$
**Note**: **One** $\vec{w}$ for **only one** observation to be predicted.

Locally-Weighted Linear Regression: Weights

  • The standard choice for weights $\vec{r}$ uses the Gaussian Kernel, with kernel width $\tau$ $$ r_n = \exp\left( -\frac{|| \vec{x}_n - \vec{x} ||^2}{2\tau^2} \right) $$

  • Choice of kernel width matters.

  • The bell shape is the weight curve, which has maximum at the query point $\vec{x}$ and decreases as we move farther.
  • The best kernel includes as many training points as can be accommodated by the model. Too large a kernel includes points that degrade the fit; too small a kernel neglects points that would increase confidence in the fit.

Locally-Weighted Linear Regression: Derivation

  • Recall that in regular linear regression, we have $$E(\vec{w}) = \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 = \left \| \Phi \vec{w}-\vec{t} \right \|^2$$

  • In locally-weighted linear regression, we are to minimize $$E(\vec{w}) = \sum_{n=1}^{N} r_n (\vec{w}^T \phi(\vec{x}_n) - t_n )^2 = \sum_{n=1}^{N} (\sqrt{r_n} \vec{w}^T \phi(\vec{x}_n) - \sqrt{r_n} t_n )^2 = \left \| \sqrt{R} \Phi \vec{w}- \sqrt{R} \vec{t} \right \|^2 $$ where $$R = \begin{bmatrix} r_1 & & & \\ & r_2 & & \\ & & \ddots & \\ & & & r_N \end{bmatrix} $$

  • Recall the solution to $\ \arg \min \left \| \Phi \vec{w}-\vec{t} \right \|^2 \ $ is $\ \Phi^\dagger \vec{t} \ $. Similarly, the solution to $\ \arg \min \left \| \sqrt{R} \Phi \vec{w}- \sqrt{R} \vec{t} \right \|^2 \ $ is $$ \boxed{\hat{\vec{w}} = (\sqrt{R} \Phi)^\dagger \sqrt{R} \vec{t}} $$
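A minimal sketch of the whole procedure (Gaussian kernel weights plus the weighted least-squares solve above); the data and function names are illustrative, not the course's plotting helpers:

```python
import numpy as np

def locally_weighted_predict(x_query, X, t, tau=0.1, degree=1):
    """Predict the target at x_query by solving a weighted least-squares problem."""
    Phi = np.vander(X, degree + 1, increasing=True)        # N x (degree+1) design matrix
    phi_q = np.vander(np.array([x_query]), degree + 1, increasing=True)[0]
    r = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))       # Gaussian kernel weights
    # w = (sqrt(R) Phi)^+ sqrt(R) t
    w, *_ = np.linalg.lstsq(np.sqrt(r)[:, None] * Phi, np.sqrt(r) * t, rcond=None)
    return phi_q @ w

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 50))
t = np.sin(2 * np.pi * X) + rng.normal(0, 0.1, X.size)
print(locally_weighted_predict(0.3, X, t, tau=0.1))        # roughly sin(2*pi*0.3) = 0.95
```

Note that a fresh $\vec{w}$ is solved for at every query point, and the kernel width `tau` plays the role discussed above.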

Probabilistic Interpretation of Least Squares Regression

  • We have derived the solution to least squares regression by minimizing an objective function. Now we provide a probabilistic perspective. Specifically, we will show that the solution to ordinary least squares is the maximum likelihood estimate of $\vec{w}$, and that the solution to regularized least squares is the maximum a posteriori (MAP) estimate.

Some Background

  • Gaussian Distribution $$ \mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] $$

  • Maximum Likelihood Estimation and Maximum a Posteriori Estimation (MAP)

    • Consider a distribution $t \sim p(t|\theta)$, where $\theta$ is some unknown parameter (such as the mean or variance) to be estimated.
    • Given observations $\vec{t} = (t_1, t_2, \dots, t_N)$,
      • The maximum likelihood (ML) estimator is $$ \theta_{ML} = \underset{\theta}{\arg \max} \prod_{n=1}^N p(t_n | \theta) $$
      • If we have some prior knowledge about $\theta$, encoded as a prior $p(\theta)$, the MAP estimator maximizes the posterior probability of $\theta$: $$ \theta_{MAP} = \underset{\theta}{\arg \max}\ p(\theta | \vec{t}) = \underset{\theta}{\arg \max} \left[ \prod_{n=1}^N p(t_n | \theta) \right] p(\theta) $$ (A quick worked example of the ML estimator follows this list.)
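As a quick sanity check of the ML definition, consider estimating the mean $\mu$ of a Gaussian with known variance $\sigma^2$ from observations $t_1, \dots, t_N$: $$ \mu_{ML} = \underset{\mu}{\arg\max} \sum_{n=1}^N \ln \mathcal{N}(t_n | \mu, \sigma^2) = \underset{\mu}{\arg\min} \sum_{n=1}^N (t_n - \mu)^2 = \frac{1}{N} \sum_{n=1}^N t_n, $$ i.e. the sample mean, obtained by setting the derivative with respect to $\mu$ to zero.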

Maximum Likelihood Estimator $\vec{w}_{ML}$

  • We assume the signal-plus-noise model for a single datum $(\vec{x}, t)$ is $$ \begin{gather} t = \vec{w}^T \phi(\vec{x}) + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \end{gather} $$ where $\vec{w}^T \phi(\vec{x})$ is the true model and $\epsilon$ is the noise/perturbation.

  • Since $\vec{w}^T \phi(\vec{x})$ is deterministic/non-random, we have $$ t \sim \mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1}) $$

  • The likelihood function of $t$ is just the probability density function (PDF) of $t$: $$ p(t|\vec{x},\vec{w},\beta) = \mathcal{N}(t|\vec{w}^T \phi(\vec{x}),\beta^{-1}) $$

  • For inputs $\mathcal{X}=(\vec{x}_1, \dots, \vec{x}_N)$ and target values $\vec{t}=(t_1,\dots,t_N)$, the data likelihood is $$ p(\vec{t}|\mathcal{X},\vec{w},\beta) = \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) = \prod_{n=1}^N \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) $$

  • Notation Clarification

    • $p(t|\vec{x},\vec{w},\beta)$ is the PDF of $t$, whose distribution is parameterized by $\vec{x},\vec{w},\beta$.
    • $\mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1})$ is Gaussian distribution with mean $\vec{w}^T \phi(\vec{x})$ and variance $\beta^{-1}$.
    • $\mathcal{N}(t|\vec{w}^T \phi(\vec{x}),\beta^{-1})$ is the PDF of $t$, which has Gaussian distribution $\mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1})$

Maximum Likelihood Estimator $\vec{w}_{ML}$

  • Main Idea of Maximum Likelihood Estimate

    • Given $\{ (\vec{x}_n, t_n) \}_{n=1}^N$, we want to find the $\vec{w}_{ML}$ that maximizes the data likelihood $$ \vec{w}_{ML} =\underset{\vec{w}}{\arg \max}\ p(\vec{t}|\mathcal{X},\vec{w},\beta) =\underset{\vec{w}}{\arg \max} \prod_{n=1}^N \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) $$ and by derivation we will show that $\vec{w}_{ML}$ is equivalent to the least squares solution $\hat{\vec{w}} = \Phi^\dagger \vec{t}$.
  • Intuition about Maximum Likelihood Estimation

    • Finding the maximum likelihood estimate $\vec{w}_{ML} = \arg \max p(\vec{t}|\mathcal{X},\vec{w},\beta)$ means finding the parameter $\vec{w}$ under which, for inputs $\mathcal{X}=(\vec{x}_1, \dots, \vec{x}_N)$, the observed targets $\vec{t}=(t_1,\dots,t_N)$ are the most likely to have been generated among all possible $\vec{t}$.

Maximum Likelihood Estimator $\vec{w}_{ML}$: Derivation

  • Single data likelihood is $$ p(t_n|\vec{x}_n,\vec{w},\beta) = \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) = \frac{1}{\sqrt{2 \pi \beta^{-1}}} \exp \left \{ - \frac{1}{2 \beta^{-1}} (t_n - \vec{w}^T \phi(x_n))^2 \right \} $$

  • Single data log-likelihood is $$ \ln p(t_n|\vec{x}_n,\vec{w},\beta) = - \frac12 \ln (2 \pi \beta^{-1}) - \frac{\beta}{2} (\vec{w}^T \phi(\vec{x}_n) - t_n)^2 $$ We work with the logarithm because the maximizer of $f(x)$ is the same as the maximizer of $\log f(x)$, and the logarithm turns products into sums, which simplifies the algebra.

  • Complete data log-likelihood is $$ \begin{align} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) &= \ln \left[ \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) \right] = \sum_{n=1}^N \ln p(t_n|\vec{x}_n,\vec{w},\beta) \\ &= \sum_{n=1}^N \left[ - \frac12 \ln (2 \pi \beta^{-1}) - \frac{\beta}{2} (\vec{w}^T \phi(\vec{x}_n) - t_n)^2 \right] \end{align} $$

  • Maximum likelihood estimate $\vec{w}_{ML}$ is $$ \begin{align} \vec{w}_{ML} &= \underset{\vec{w}}{\arg \max} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \sum_{n=1}^N \left[(\vec{w}^T \phi(x_n) - t_n)^2 \right] \end{align} $$

  • Familiar? Recall the objective function we minimized in least squares is $E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2$, so we could conclude that $$ \boxed{\vec{w}_{ML} = \hat{\vec{w}}_{LS} = \Phi^\dagger \vec{t}} $$
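A small numerical check of this equivalence, as a sketch with synthetic data (assuming SciPy is available; $\beta$ only scales the log-likelihood, so it does not change the maximizer):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, beta = 30, 25.0                                  # noise precision beta => variance 1/beta
x = rng.uniform(0, 1, N)
Phi = np.vander(x, 4, increasing=True)              # cubic polynomial features
w_true = np.array([0.5, -1.0, 2.0, 0.3])
t = Phi @ w_true + rng.normal(0, 1 / np.sqrt(beta), N)

# Least-squares solution via the pseudoinverse
w_ls = np.linalg.pinv(Phi) @ t

# Maximize the log-likelihood directly (i.e. minimize the negative log-likelihood)
def neg_log_lik(w):
    return 0.5 * N * np.log(2 * np.pi / beta) + 0.5 * beta * np.sum((Phi @ w - t) ** 2)

grad = lambda w: beta * Phi.T @ (Phi @ w - t)
w_ml = minimize(neg_log_lik, np.zeros(4), jac=grad).x
print(np.allclose(w_ls, w_ml, atol=1e-3))           # True, up to optimizer tolerance
```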

MAP Estimator $\vec{w}_{MAP}$

  • The MAP estimator is obtained by $$ \begin{align} \vec{w}_{MAP} &= \arg \max p(\vec{w}|\vec{t}, \mathcal{X},\beta) & & (\text{Posterior probability})\\ &= \arg \max \frac{p(\vec{w}, \vec{t}, \mathcal{X},\beta)}{p(\vec{t}, \mathcal{X}, \beta)} \\ &= \arg \max \frac{p(\vec{t}|\vec{w}, \mathcal{X},\beta)\, p(\vec{w}, \mathcal{X}, \beta)}{p(\vec{t}, \mathcal{X}, \beta)} \\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta)\, p(\vec{w}, \mathcal{X}, \beta) & & (p(\vec{t}, \mathcal{X}, \beta) \text{ does not depend on } \vec{w})\\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta)\, p(\vec{w})\, p(\mathcal{X})\, p(\beta) & & (\text{Independence}) \\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta)\, p(\vec{w}) & & (\text{Likelihood} \times \text{Prior}) \end{align} $$ These steps use only Bayes' theorem (plus the assumed independence of $\vec{w}$, $\mathcal{X}$, and $\beta$).
  • The only difference from the ML estimator is the extra factor $p(\vec{w})$, the PDF encoding our prior belief about $\vec{w}$. Here we assume $$ \vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I) $$
  • ML vs. MAP
    • Maximum Likelihood: We know nothing about $\vec{w}$ initially, and every $\vec{w}$ is equally likely
    • Maximum a Posteriori: We know something about $\vec{w}$ initially, and certain $\vec{w}$ are more likely (depending on the prior $p(\vec{w})$). In other words, different $\vec{w}$ are weighted differently.
  • Assumption $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ makes sense because
    • In regularized least squares
      • We already know that large coefficients $\vec{w}$, which may lead to overfitting, should be avoided.
      • The larger the regularization coefficient $\lambda$, the smaller $\left \| \vec{w} \right \|$ will be.
    • When we use $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$
      • $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ encodes the assumption that a $\vec{w}$ with a smaller norm $\left \| \vec{w} \right \|$ is more "likely" than a $\vec{w}$ with a bigger norm.
      • When we increase $\alpha$, the variance is smaller, so a small $\left \| \vec{w} \right \|$ becomes much more likely

MAP Estimator $\vec{w}_{MAP}$: Derivation

  • $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ is a multivariate Gaussian with PDF $$ p(\vec{w}) = \frac{1}{\left( \sqrt{2 \pi \alpha^{-1}} \right)^{M}} \exp \left \{ -\frac{\alpha}{2} \left \| \vec{w} \right \|^2 \right \} $$ where $M$ is the dimension of $\vec{w}$.

  • So the MAP estimator is $$ \begin{align} \vec{w}_{MAP} &= \underset{\vec{w}}{\arg \max} \ p(\vec{t}|\vec{w}, \mathcal{X},\beta)\, p(\vec{w}) = \underset{\vec{w}}{\arg \max} \left[\ln p(\vec{t}|\vec{w}, \mathcal{X},\beta) + \ln p(\vec{w}) \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac{\beta}{2} (\vec{w}^T \phi(\vec{x}_n) - t_n)^2 + \frac{\alpha}{2} \left \| \vec{w} \right \|^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac12 (\vec{w}^T \phi(\vec{x}_n) - t_n)^2 + \frac12 \frac{\alpha}{\beta} \left \| \vec{w} \right \|^2 \right] \end{align} $$

  • Exactly the objective of regularized least squares with $\lambda = \alpha/\beta$! So $$ \boxed{ \vec{w}_{MAP} = \hat{\vec{w}}=\left(\Phi^T \Phi + \frac{\alpha}{\beta} I\right)^{-1} \Phi^T \vec{t} } $$ A small numerical check follows.
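A minimal numerical check of this correspondence, as a sketch with illustrative values of $\alpha$ and $\beta$ (assuming SciPy is available):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                       # prior precision and noise precision
x = rng.uniform(0, 1, 20)
Phi = np.vander(x, 6, increasing=True)        # degree-5 polynomial features
t = np.sin(2 * np.pi * x) + rng.normal(0, 1 / np.sqrt(beta), x.size)

# Closed-form MAP estimate = regularized least squares with lambda = alpha / beta
lam = alpha / beta
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

# Same estimate obtained by minimizing the negative log-posterior numerically
neg_log_post = lambda w: 0.5 * beta * np.sum((Phi @ w - t) ** 2) + 0.5 * alpha * np.sum(w ** 2)
grad = lambda w: beta * Phi.T @ (Phi @ w - t) + alpha * w
w_num = minimize(neg_log_post, np.zeros(Phi.shape[1]), jac=grad).x
print(np.allclose(w_map, w_num, atol=1e-4))   # True, up to optimizer tolerance
```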