# EECS 545: Machine Learning

## Lecture 05: Linear Regression II

• Instructor: Jacob Abernethy
• Date: January 25, 2015

Lecture Exposition Credit: Benjamin Bray and Zhe Du

## Outline for this Lecture

• Overfitting
• Regularized Least Squares
• Locally-Weighted Linear Regression
• Maximum Likelihood Interpretation of Linear Regression

• Required:
  • [PRML], §3.2: The Bias-Variance Decomposition
  • [PRML], §3.3: Bayesian Linear Regression
• Optional:
  • [MLAPP], Chapter 7: Linear Regression

## Overfitting

### Overfitting: Degree of Linear Regression

In [2]:
regression_overfitting_degree(degree0=0, degree1=3,degree2=9,degree3=12)


### Overfitting: Dataset Size

In [3]:
regression_overfitting_datasetsize(size0 = 13, size1 = 50, size2 = 100, size3 = 500)


### Overfitting: Overall Performance

• On the left plot, we fix the dataset size and vary the polynomial degree
• On the right plot, we fix the polynomial degree and vary the dataset size
In [4]:
regression_overfitting_curve()


### Rule of Thumb to Choose the Degree

• For a small number of datapoints, use a low degree
• Otherwise, the model will overfit!
• As you obtain more data, you can gradually increase the degree
• Add more features to represent more data
• Warning: Your model is still limited by the finite amount of data available. The optimal model for finite data cannot be an infinite-dimensional polynomial!
• Use regularization to control model complexity.
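
The rule of thumb can be checked numerically. The following is a minimal sketch, not the notebook's own `regression_overfitting_degree` helper: it assumes synthetic data drawn from a noisy sine curve, and the names `make_data` and `rmse` are hypothetical. It fits polynomials of increasing degree with `numpy.polyfit` and compares the error on the training set with the error on held-out data.

```python
import numpy as np

def make_data(n, rng, noise=0.3):
    """Hypothetical toy data: a sine curve plus Gaussian noise."""
    x = np.sort(rng.uniform(0.0, 1.0, size=n))
    t = np.sin(2 * np.pi * x) + noise * rng.normal(size=n)
    return x, t

def rmse(coeffs, x, t):
    """Root-mean-square error of a fitted polynomial on (x, t)."""
    return np.sqrt(np.mean((np.polyval(coeffs, x) - t) ** 2))

rng = np.random.default_rng(0)
x_train, t_train = make_data(15, rng)    # small training set
x_test, t_test = make_data(200, rng)     # held-out data from the same source

train_err, test_err = {}, {}
for degree in (0, 3, 9):
    coeffs = np.polyfit(x_train, t_train, degree)   # least squares polynomial fit
    train_err[degree] = rmse(coeffs, x_train, t_train)
    test_err[degree] = rmse(coeffs, x_test, t_test)
    print(f"degree={degree}: train RMSE={train_err[degree]:.3f}, "
          f"test RMSE={test_err[degree]:.3f}")
```

The training error can only decrease as the degree grows, because every lower-degree polynomial is a special case of a higher-degree one. Only the held-out error reveals whether the extra flexibility is fitting noise.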

## Regularized Linear Regression

### Coefficients of Overfitting

• Before we move to regularized linear regression, let's first look at what happens to the coefficients $\vec{w}$ when the model overfits.
In [5]:
regression_overfitting_coeffs()

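| | M=0 (Underfitting) | M=3 (Good) | M=9 (Overfitting) | M=12 (Overfitting) |
|---|---|---|---|---|
| $w_0$ | 9.22491 | 7.66854 | -0.240721 | 0.036502 |
| $w_1$ | | -75.2974 | 6.36637 | -1.282764 |
| $w_2$ | | 172.044 | -69.1889 | 19.592658 |
| $w_3$ | | -14.2807 | 397.481 | -170.650848 |
| $w_4$ | | | -1290.02 | 934.785369 |
| $w_5$ | | | 2328.31 | -3348.827201 |
| $w_6$ | | | -2093.26 | 7896.286428 |
| $w_7$ | | | 580.659 | -11973.474257 |
| $w_8$ | | | 247.134 | 10891.034834 |
| $w_9$ | | | -28.6505 | -4862.568457 |
| $w_{10}$ | | | | 131.114412 |
| $w_{11}$ | | | | 582.180371 |
| $w_{12}$ | | | | -28.827647 |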

### Regularized Least Squares: Objective Function

• Recall that the objective function we minimized in the last lecture is $$E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2$$

• To penalize large coefficients, we add a penalization/regularization term and minimize the sum: $$E(\vec{w}) = \underbrace{ \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 }_{E_D(\vec{w})}+ \underbrace{\boxed{\frac{\lambda}{2} \left \| \vec{w} \right \|^2}}_{E_W(\vec{w})}$$ where $E_D(\vec{w})$ is the sum-of-squared-errors term and $E_W(\vec{w})$ is the regularization term.

• $\lambda$ is the regularization coefficient.

• If $\lambda$ is large, $E_W(\vec{w})$ dominates the objective function. The minimization then focuses on $E_W(\vec{w})$, so the resulting solution $\vec{w}$ tends to have a smaller norm, at the cost of a larger $E_D(\vec{w})$.

### Regularized Least Squares: Derivation

• Based on what we derived in the last lecture, we can write the objective function as \begin{align} E(\vec{w}) &= \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \left \| \vec{w} \right \|^2 \\ &= \frac12 \vec{w}^T \Phi^T \Phi \vec{w} - \vec{t}^T \Phi \vec{w} + \frac12 \vec{t}^T \vec{t} + \frac{\lambda}{2}\vec{w}^T\vec{w} \end{align}

• The gradient is \begin{align} \nabla_{\vec{w}} E(\vec{w}) &= \Phi^T \Phi \vec{w} - \Phi^T \vec{t} + \lambda \vec{w}\\ &= (\Phi^T \Phi + \lambda I)\vec{w} - \Phi^T \vec{t} \end{align}

• Setting the gradient to 0, we will get the solution $$\boxed{ \hat{\vec{w}}=(\Phi^T \Phi + \lambda I)^{-1} \Phi^T \vec{t} }$$

• The ordinary least squares solution $\hat{\vec{w} }=(\Phi^T \Phi)^{-1} \Phi^T \vec{t}$ requires $\Phi^T \Phi$ to be invertible, which is not guaranteed. In regularized least squares with $\lambda > 0$, however, $\Phi^T \Phi + \lambda I$ is always invertible: $\Phi^T \Phi$ is positive semidefinite, so every eigenvalue of $\Phi^T \Phi + \lambda I$ is at least $\lambda > 0$.
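
The closed-form solution can be sketched in a few lines of NumPy. This is an illustrative sketch under assumptions, not the notebook's implementation: the helper names `polynomial_features` and `ridge_fit` and the noisy-sine toy data are hypothetical. It solves the linear system $(\Phi^T \Phi + \lambda I)\vec{w} = \Phi^T \vec{t}$ rather than forming the inverse explicitly, and checks that the gradient vanishes at the solution.

```python
import numpy as np

def polynomial_features(x, degree):
    """Design matrix Phi with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(Phi, t, lam):
    """Solve (Phi^T Phi + lam * I) w = Phi^T t for w."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ t)

# Hypothetical toy data: noisy samples of a sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 12)
t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=x.size)
Phi = polynomial_features(x, degree=9)

norms, grad_norms = [], []
for lam in (1e-6, 1e-2, 1.0):
    w = ridge_fit(Phi, t, lam)
    # Gradient of the regularized objective, evaluated at the solution.
    grad = (Phi.T @ Phi + lam * np.eye(Phi.shape[1])) @ w - Phi.T @ t
    norms.append(np.linalg.norm(w))
    grad_norms.append(np.linalg.norm(grad))
    print(f"lambda={lam:g}: ||w||={norms[-1]:.3f}, ||grad||={grad_norms[-1]:.2e}")
```

Increasing $\lambda$ shrinks $\left\| \vec{w} \right\|$, which matches the discussion of the objective function above.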

### Regularized Least Squares: Example

In [6]:
regression_regularization_plot()


### Regularized Least Squares: Coefficients

• Let's look at how the coefficients change after we add regularization
In [7]:
regression_regularization_coeff()

| | $\lambda = 0$ | $\lambda = e^{1}$ | $\lambda = e^{10}$ |
|---|---|---|---|
| $w_0$ | 30.203406 | 13.701085 | 0.010221 |
| $w_1$ | 7133.542582 | 13.267902 | 0.013419 |
| $w_2$ | -31022.107050 | 12.423422 | 0.020149 |
| $w_3$ | 53507.324765 | 8.040454 | 0.029552 |
| $w_4$ | -48906.151251 | 0.865971 | 0.038200 |
| $w_5$ | 26564.237381 | -4.354079 | 0.034013 |
| $w_6$ | -9013.171136 | -1.827607 | -0.002813 |
| $w_7$ | 1929.741748 | 2.147727 | -0.054953 |
| $w_8$ | -253.351938 | -0.613354 | 0.023119 |
| $w_9$ | 18.620246 | 0.073751 | -0.003275 |
| $w_{10}$ | -0.586531 | -0.003284 | 0.000155 |

### Regularized Least Squares: Summary

• Simple modification of linear regression
• $\ell^2$ Regularization controls the tradeoff between fitting error and complexity.
• Small $\ell^2$ regularization results in complex models, but with risk of overfitting
• Large $\ell^2$ regularization results in simple models, but with risk of underfitting
• It is important to find an optimal regularization that balances between the two
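
One common way to look for that balance is to sweep $\lambda$ over a grid and keep the value with the lowest error on held-out data. The lecture does not prescribe a selection procedure, so the following is only a sketch under assumptions: the hold-out split, the grid of candidates, the noisy-sine data, and the helper names `sample`, `design`, and `ridge_fit` are all hypothetical.

```python
import numpy as np

def sample(n, rng):
    """Hypothetical toy data: noisy samples of a sine curve."""
    x = rng.uniform(0.0, 1.0, size=n)
    t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)
    return x, t

def design(x, degree=9):
    """Polynomial design matrix with columns 1, x, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(Phi, t, lam):
    """Regularized least squares: (Phi^T Phi + lam * I)^{-1} Phi^T t."""
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

rng = np.random.default_rng(1)
x_train, t_train = sample(15, rng)
x_val, t_val = sample(100, rng)          # held-out validation set

lambdas = np.exp(np.arange(-12.0, 3.0))  # candidates e^-12, ..., e^2
val_err = []
for lam in lambdas:
    w = ridge_fit(design(x_train), t_train, lam)
    val_err.append(np.mean((design(x_val) @ w - t_val) ** 2))

best_index = int(np.argmin(val_err))
best_lambda = lambdas[best_index]
print(f"best lambda = e^{np.log(best_lambda):.0f}, "
      f"validation MSE = {val_err[best_index]:.4f}")
```

The selected value depends on the data and on the grid, so it should be read as an estimate of a good $\lambda$, not as the optimum.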

## Locally-Weighted Linear Regression

### Locally-Weighted Linear Regression

• Main Idea: Given a new observation $\vec{x}$, we generate the coefficients $\vec{w}$ and prediction $y(\vec{x}, \vec{w})$ by giving high weights for neighbours of $\vec{x}$.

• Regular vs. Locally-Weighted Linear Regression

Linear Regression

1. Fit $\vec{w}$ to minimize $\sum_{n} (\vec{w}^T \phi(\vec{x}_n) - t_n )^2$, where $\{(\vec{x}_n, t_n)\}_{n=1}^N$ is the training dataset.
2. For every new observation $\vec{x}$ to be predicted, output $\vec{w}^T \phi(\vec{x})$.

**Note**: **One** $\vec{w}$ for **all** observations to be predicted.

Locally-weighted Linear Regression

1. For **every** new observation $\vec{x}$ to be predicted, generate the weights $r_n$ for every training sample $(\vec{x}_n, t_n)$. (The closer $\vec{x}_n$ is to $\vec{x}$, the larger $r_n$ will be.)
2. Fit $\vec{w}$ to minimize $\sum_{n} r_n (\vec{w}^T \phi(\vec{x}_n) - t_n )^2$, where $\{(\vec{x}_n, t_n)\}_{n=1}^N$ is the training dataset.
3. Output $\vec{w}^T \phi(\vec{x})$.

**Note**: **One** $\vec{w}$ for **only one** observation to be predicted.

### Locally-Weighted Linear Regression: Weights

• The standard choice for weights $\vec{r}$ uses the Gaussian Kernel, with kernel width $\tau$ $$r_n = \exp\left( -\frac{|| \vec{x}_n - \vec{x} ||^2}{2\tau^2} \right)$$

• Choice of kernel width matters.

• The bell shape is the weight curve, which has maximum at the query point $\vec{x}$ and decreases as we move farther.
• The best kernel includes as many training points as can be accommodated by the model. Too large a kernel includes points that degrade the fit; too small a kernel neglects points that increase confidence in the fit.
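
The effect of the kernel width can be seen by evaluating the Gaussian weights for a few values of $\tau$. This is a small sketch; the function name `gaussian_weights` and the four training inputs are hypothetical.

```python
import numpy as np

def gaussian_weights(X, x_query, tau):
    """r_n = exp(-||x_n - x||^2 / (2 tau^2)) for every row x_n of X."""
    sq_dist = np.sum((X - x_query) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * tau ** 2))

# Four hypothetical one-dimensional training inputs and a query point.
X = np.array([[0.0], [0.5], [1.0], [2.0]])
x_query = np.array([0.5])

for tau in (0.1, 1.0, 10.0):
    r = gaussian_weights(X, x_query, tau)
    print(f"tau={tau:>4}: weights={np.round(r, 4)}")
```

A training point that coincides with the query always gets weight 1. With a small $\tau$ the weights of the other points drop toward 0, so only close neighbours influence the fit; with a large $\tau$ all weights approach 1 and the method behaves like ordinary linear regression.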

### Locally-Weighted Linear Regression: Derivation

• Recall that in regular linear regression, we have $$E(\vec{w}) = \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 = \left \| \Phi \vec{w}-\vec{t} \right \|^2$$

• In locally-weighted linear regression, we minimize $$E(\vec{w}) = \sum_{n=1}^{N} r_n (\vec{w}^T \phi(\vec{x}_n) - t_n )^2 = \sum_{n=1}^{N} (\sqrt{r_n} \vec{w}^T \phi(\vec{x}_n) - \sqrt{r_n} t_n )^2 = \left \| \sqrt{R} \Phi \vec{w}- \sqrt{R} \vec{t} \right \|^2$$ where $R$ is the diagonal matrix of weights $$R = \begin{bmatrix} r_1 & & & \\ & r_2 & & \\ & & \ddots & \\ & & & r_N \end{bmatrix}$$ and $\sqrt{R}$ is the diagonal matrix with entries $\sqrt{r_n}$.

• Recall that the solution to $\arg \min \left \| \Phi \vec{w}-\vec{t} \right \|^2$ is $\Phi^\dagger \vec{t}$. Similarly, the solution to $\arg \min \left \| \sqrt{R} \Phi \vec{w}- \sqrt{R} \vec{t} \right \|^2$ is $$\boxed{\hat{\vec{w}} = (\sqrt{R} \Phi)^\dagger \sqrt{R} \vec{t}}$$
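
Putting the weights and the boxed solution together gives a complete prediction routine. The sketch below assumes scalar inputs and polynomial features of degree 1 (a local line); the function name `lwr_predict` and the toy data are hypothetical. Note that $\vec{w}$ is recomputed for every query point.

```python
import numpy as np

def lwr_predict(x_query, X, t, tau, degree=1):
    """Locally-weighted prediction at a scalar query point x_query.

    X and t are 1-D arrays of training inputs and targets.
    """
    Phi = np.vander(X, degree + 1, increasing=True)       # rows are phi(x_n)
    r = np.exp(-(X - x_query) ** 2 / (2.0 * tau ** 2))    # Gaussian weights r_n
    sqrt_r = np.sqrt(r)                                   # diagonal of sqrt(R)
    # w = (sqrt(R) Phi)^+ sqrt(R) t, using the pseudo-inverse.
    w = np.linalg.pinv(sqrt_r[:, None] * Phi) @ (sqrt_r * t)
    phi_query = np.vander(np.atleast_1d(x_query), degree + 1, increasing=True)[0]
    return phi_query @ w

# Hypothetical toy data: a nonlinear curve that one global line cannot fit.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 3.0, 40)
t = np.sin(2.0 * X) + 0.1 * rng.normal(size=X.size)

for x_query in (0.5, 1.5, 2.5):
    y = lwr_predict(x_query, X, t, tau=0.3)
    print(f"x={x_query}: prediction={y:.3f}, noiseless target={np.sin(2.0 * x_query):.3f}")
```

Because a new weighted least squares problem is solved per query, prediction cost grows with the size of the training set, which is the price paid for the added flexibility.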

## Probabilistic Interpretation of Least Squares Regression

• We have derived the solution to least squares regression by minimizing an objective function. Now we provide a probabilistic perspective. Specifically, we will show that the solution to ordinary least squares is the maximum likelihood estimate of $\vec{w}$, and that the solution to regularized least squares is the maximum a posteriori (MAP) estimate.

### Some Background

• Gaussian Distribution $$\mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right]$$

• Maximum Likelihood Estimation and Maximum a Posteriori Estimation (MAP)

• Consider a distribution $t \sim p(t|\theta)$, where $\theta$ is some unknown parameter (like the mean or variance) to be estimated.
• Given independent observations $\vec{t} = (t_1, t_2, \dots, t_N)$,
• The Maximum Likelihood Estimator is $$\theta_{ML} = \underset{\theta}{\arg \max} \prod_{n=1}^N p(t_n | \theta)$$
• If we have some prior knowledge about $\theta$, expressed as a prior $p(\theta)$, the MAP estimator maximizes the posterior probability of $\theta$: $$\theta_{MAP} = \underset{\theta}{\arg \max} \ p(\theta | \vec{t}) = \underset{\theta}{\arg \max} \ p(\theta) \prod_{n=1}^N p(t_n | \theta)$$
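
A scalar example may help make the two estimators concrete. The sketch below is an illustration under assumptions that are not part of the lecture: the observations are Gaussian with unknown mean $\theta$ and known precision $\beta$, and the prior is $\theta \sim \mathcal{N}(0, \alpha^{-1})$. Setting the derivative of the log-likelihood and of the log-posterior to zero gives $\theta_{ML} = \frac{1}{N}\sum_n t_n$ and $\theta_{MAP} = \frac{\beta \sum_n t_n}{N\beta + \alpha}$; the code checks both against a brute-force grid search.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, alpha = 4.0, 2.0            # assumed noise precision and prior precision
t = 1.5 + rng.normal(scale=beta ** -0.5, size=5)   # five noisy observations

# Closed-form estimators for this Gaussian model.
theta_ml = t.mean()
theta_map = beta * t.sum() / (t.size * beta + alpha)

# Brute-force check: evaluate both objectives on a fine grid of theta values.
grid = np.linspace(-5.0, 5.0, 100001)
sq_err = np.sum((t[:, None] - grid[None, :]) ** 2, axis=0)
log_lik = -0.5 * beta * sq_err                 # additive constants dropped
log_post = log_lik - 0.5 * alpha * grid ** 2   # log-likelihood + log-prior

theta_ml_grid = grid[np.argmax(log_lik)]
theta_map_grid = grid[np.argmax(log_post)]
print(f"ML : closed form {theta_ml:.4f}, grid search {theta_ml_grid:.4f}")
print(f"MAP: closed form {theta_map:.4f}, grid search {theta_map_grid:.4f}")
```

The MAP estimate is the ML estimate multiplied by $\frac{N\beta}{N\beta + \alpha} < 1$, so the zero-mean prior pulls it toward 0, and the pull weakens as more data arrive.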

### Maximum Likelihood Estimator $\vec{w}_{ML}$

• We assume the signal+noise model for a single data point $(\vec{x}, t)$ is $$\begin{gather} t = \vec{w}^T \phi(\vec{x}) + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \end{gather}$$ where $\vec{w}^T \phi(\vec{x})$ is the true model and $\epsilon$ is the perturbation/randomness.

• Since $\vec{w}^T \phi(\vec{x})$ is deterministic/non-random, we have $$t \sim \mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1})$$

• The likelihood function of $t$ is just the probability density function (PDF) of $t$: $$p(t|\vec{x},\vec{w},\beta) = \mathcal{N}(t|\vec{w}^T \phi(\vec{x}),\beta^{-1})$$

• For inputs $\mathcal{X}=(\vec{x}_1, \dots, \vec{x}_N)$ and target values $\vec{t}=(t_1,\dots,t_N)$, the data likelihood (assuming independent observations) is $$p(\vec{t}|\mathcal{X},\vec{w},\beta) = \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) = \prod_{n=1}^N \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1})$$

• Notation Clarification

• $p(t|\vec{x},\vec{w},\beta)$ is the PDF of $t$, whose distribution is parameterized by $\vec{x},\vec{w},\beta$.
• $\mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1})$ is the Gaussian distribution with mean $\vec{w}^T \phi(\vec{x})$ and variance $\beta^{-1}$.
• $\mathcal{N}(t|\vec{w}^T \phi(\vec{x}),\beta^{-1})$ is the PDF of $t$, which has the Gaussian distribution $\mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1})$.

### Maximum Likelihood Estimator $\vec{w}_{ML}$

• Main Idea of Maximum Likelihood Estimate

• Given $\{ \vec{x}_n, t_n \}_{n=1}^N$, we want to find $\vec{w}_{ML}$ that maximizes data likelihood function $$\vec{w}_{ML} =\arg \max p(\vec{t}|\mathcal{X},\vec{w},\beta) =\arg \max \prod_{n=1}^N \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1})$$ and by derivation we will show $\vec{w}_{ML}$ is equivalent to the least squares solution $\hat{\vec{w}} = \Phi^\dagger \vec{t}$.
• Intuition about Maximum Likelihood Estimation

• Finding the maximum likelihood estimate $\vec{w}_{ML} = \arg \max p(\vec{t}|\mathcal{X},\vec{w},\beta)$ amounts to finding the parameter $\vec{w}$ that makes the observed targets $\vec{t}=(t_1,\dots,t_N)$ as probable as possible, given the inputs $\mathcal{X}=(\vec{x}_1, \dots, \vec{x}_N)$.

### Maximum Likelihood Estimator $\vec{w}_{ML}$: Derivation

• Single data likelihood is $$p(t_n|\vec{x}_n,\vec{w},\beta) = \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) = \frac{1}{\sqrt{2 \pi \beta^{-1}}} \exp \left \{ - \frac{1}{2 \beta^{-1}} (t_n - \vec{w}^T \phi(x_n))^2 \right \}$$

• Single data log-likelihood is $$\ln p(t_n|\vec{x}_n,\vec{w},\beta) = - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2$$ We use the logarithm because it is strictly increasing, so the maximizer of $f(x)$ is the same as the maximizer of $\log f(x)$. The logarithm also converts the product into a sum, which makes life easier.

• Complete data log-likelihood is \begin{align} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) &= \ln \left[ \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) \right] = \sum_{n=1}^N \ln p(t_n|\vec{x}_n,\vec{w},\beta) \\ &= \sum_{n=1}^N \left[ - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \end{align}

• Maximum likelihood estimate $\vec{w}_{ML}$ is \begin{align} \vec{w}_{ML} &= \underset{\vec{w}}{\arg \max} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \sum_{n=1}^N \left[(\vec{w}^T \phi(x_n) - t_n)^2 \right] \end{align}

• Familiar? Recall the objective function we minimized in least squares is $E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2$, so we could conclude that $$\boxed{\vec{w}_{ML} = \hat{\vec{w}}_{LS} = \Phi^\dagger \vec{t}}$$
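
The equivalence can also be checked numerically. The sketch below assumes a cubic polynomial model with a hypothetical true weight vector `w_true` and noise precision `beta`; it evaluates the complete data log-likelihood from the derivation and confirms that no randomly perturbed $\vec{w}$ scores higher than $\Phi^\dagger \vec{t}$.

```python
import numpy as np

def log_likelihood(w, Phi, t, beta):
    """Complete data log-likelihood under Gaussian noise with precision beta."""
    resid = Phi @ w - t
    return -0.5 * t.size * np.log(2 * np.pi / beta) - 0.5 * beta * (resid @ resid)

# Hypothetical data generated from the signal+noise model.
rng = np.random.default_rng(0)
beta = 25.0
w_true = np.array([0.5, -1.0, 2.0, 0.3])
x = rng.uniform(-1.0, 1.0, size=50)
Phi = np.vander(x, w_true.size, increasing=True)
t = Phi @ w_true + rng.normal(scale=beta ** -0.5, size=x.size)

# Least squares solution via the pseudo-inverse.
w_ml = np.linalg.pinv(Phi) @ t
best = log_likelihood(w_ml, Phi, t, beta)

# Any perturbation of w_ml should lower the log-likelihood.
perturbed = [
    log_likelihood(w_ml + 0.1 * rng.normal(size=w_ml.size), Phi, t, beta)
    for _ in range(100)
]
print(f"log-likelihood at w_ML: {best:.3f}")
print(f"best perturbed value  : {max(perturbed):.3f}")
```

The value of $\beta$ changes the height of the log-likelihood but not the location of its maximum over $\vec{w}$, which is why it drops out of the derivation.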

### MAP Estimator $\vec{w}_{MAP}$

• The MAP estimator is obtained by \begin{align} \vec{w}_{MAP} &= \arg \max p(\vec{w}|\vec{t}, \mathcal{X},\beta) & & (\text{Posterior Probability})\\ &= \arg \max \frac{p(\vec{w}, \vec{t}, \mathcal{X},\beta)}{p(\mathcal{X}, \vec{t}, \beta)} \\ &= \arg \max \frac{p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}, \mathcal{X}, \beta)}{p(\mathcal{X}, \vec{t}, \beta)} \\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}, \mathcal{X}, \beta) & & (p(\mathcal{X}, \vec{t}, \beta) \text{ does not depend on} \ \vec{w})\\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) p(\mathcal{X}) p(\beta) & & (\text{Independence of } \vec{w}, \mathcal{X}, \beta) \\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) & & (\text{Likelihood} \times \text{Prior}) \end{align} where every $\arg \max$ is taken over $\vec{w}$. The steps above are just Bayes' theorem.
• The only difference from the ML estimator is the extra factor $p(\vec{w})$, the PDF of $\vec{w}$. This is our prior belief about $\vec{w}$. Here, we assume $$\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$$
• ML vs. MAP
  • Maximum Likelihood: We know nothing about $\vec{w}$ initially, and every $\vec{w}$ is equally likely.
  • Maximum a Posteriori: We know something about $\vec{w}$ initially, and certain $\vec{w}$ are more likely than others (depending on the prior $p(\vec{w})$). In other words, candidate $\vec{w}$ are weighted by the prior.
• The assumption $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ makes sense because
  • In regularized least squares
    • We already know that large coefficients $\vec{w}$, which may lead to overfitting, should be avoided.
    • As we increase the regularization coefficient $\lambda$, $\left \| \vec{w} \right \|$ becomes smaller.
  • When we use $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$
    • $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ encodes the assumption that a $\vec{w}$ with a smaller norm $\left \| \vec{w} \right \|$ is more "likely" than a $\vec{w}$ with a bigger norm.
    • When we increase $\alpha$, the variance $\alpha^{-1}$ shrinks, so a small $\left \| \vec{w} \right \|$ becomes much more likely.

### MAP Estimator $\vec{w}_{MAP}$: Derivation

• $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ is a multivariate Gaussian. Writing $D$ for the number of components of $\vec{w}$ (not the number of data points $N$), its PDF is $$p(\vec{w}) = \frac{1}{\left( \sqrt{2 \pi \alpha^{-1}} \right)^D} \exp \left \{ -\frac{1}{2 \alpha^{-1}} \sum_{d=1}^D w_d^2 \right \} = \frac{1}{\left( \sqrt{2 \pi \alpha^{-1}} \right)^D} \exp \left \{ -\frac{\alpha}{2} \left \| \vec{w} \right \|^2 \right \}$$

• So the MAP estimator is (dropping terms that do not depend on $\vec{w}$, and dividing by $\beta > 0$ in the last step) \begin{align} \vec{w}_{MAP} &= \underset{\vec{w}}{\arg \max} \ p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) = \underset{\vec{w}}{\arg \max} \left[\ln p(\vec{t}|\vec{w}, \mathcal{X},\beta) + \ln p(\vec{w}) \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 + \frac{\alpha}{2} \left \| \vec{w} \right \|^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac12 (\vec{w}^T \phi(x_n) - t_n)^2 + \frac12 \frac{\alpha}{\beta} \left \| \vec{w} \right \|^2 \right] \end{align}

• This is exactly the objective in regularized least squares, with $\lambda = \frac{\alpha}{\beta}$! So $$\boxed{ \vec{w}_{MAP} = \hat{\vec{w}}=\left(\Phi^T \Phi + \frac{\alpha}{\beta} I\right)^{-1} \Phi^T \vec{t} }$$
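
As a numerical sanity check, the boxed formula can be compared against the negative log-posterior directly. The sketch below assumes hypothetical values for $\alpha$ and $\beta$ and noisy-sine toy data; the function name `neg_log_posterior` is illustrative. It verifies that the gradient of the negative log-posterior vanishes at $\vec{w}_{MAP}$ and that random perturbations only increase the objective.

```python
import numpy as np

def neg_log_posterior(w, Phi, t, alpha, beta):
    """beta/2 * ||Phi w - t||^2 + alpha/2 * ||w||^2, constants dropped."""
    resid = Phi @ w - t
    return 0.5 * beta * (resid @ resid) + 0.5 * alpha * (w @ w)

# Hypothetical data and hyperparameters.
rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1.0, 1.0, size=30)
Phi = np.vander(x, 6, increasing=True)
t = np.sin(np.pi * x) + rng.normal(scale=beta ** -0.5, size=x.size)

# Regularized least squares with lambda = alpha / beta.
lam = alpha / beta
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

# Gradient of the negative log-posterior at w_map; it should be close to zero.
grad = beta * Phi.T @ (Phi @ w_map - t) + alpha * w_map
best = neg_log_posterior(w_map, Phi, t, alpha, beta)
perturbed = [
    neg_log_posterior(w_map + 0.1 * rng.normal(size=w_map.size), Phi, t, alpha, beta)
    for _ in range(100)
]
print(f"||gradient|| at w_MAP = {np.linalg.norm(grad):.2e}")
print(f"objective at w_MAP = {best:.4f}, smallest perturbed value = {min(perturbed):.4f}")
```

Only the ratio $\alpha / \beta$ matters for the location of the minimum: a stronger prior (larger $\alpha$) or noisier data (smaller $\beta$) both correspond to heavier regularization.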