EECS 545: Machine Learning

Lecture 05: Linear Regression II

  • Instructor: Jacob Abernethy
  • Date: January 25, 2015

Lecture Exposition Credit: Benjamin Bray and Zhe Du

Outline for this Lecture

  • Overfitting
  • Regularized Least Squares
  • Locally-Weighted Linear Regression
  • Maximum Likelihood Interpretation of Linear Regression

Reading List

  • Required:
    • [PRML], §3.2: The Bias-Variance Decomposition
    • [PRML], §3.3: Bayesian Linear Regression
  • Optional:
    • [MLAPP], Chapter 7: Linear Regression


Overfitting: Degree of Linear Regression

In [2]:
regression_overfitting_degree(degree0=0, degree1=3, degree2=9, degree3=12)

Overfitting: Dataset Size

In [3]:
regression_overfitting_datasetsize(size0=13, size1=50, size2=100, size3=500)

Overfitting: Overall Performance

  • On the left plot, we fix the dataset size and vary the polynomial degree
  • On the right plot, we fix the polynomial degree and vary the dataset size
In [4]:

Rule of Thumb to Choose the Degree

  • For a small number of datapoints, use a low degree
    • Otherwise, the model will overfit!
  • As you obtain more data, you can gradually increase the degree
    • Add more features to represent more data
    • Warning: Your model is still limited by the finite amount of data available. The optimal model for finite data cannot be an infinite-dimensional polynomial!
  • Use regularization to control model complexity.
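To make the rule of thumb concrete, here is a minimal sketch (with an assumed noisy-sine toy dataset, not the lecture's data) showing that training error keeps shrinking as the degree grows, which is why training error alone cannot detect overfitting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dataset: noisy samples of a sine curve
N = 15
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

for degree in (0, 3, 9):
    # np.polyfit returns the least-squares polynomial coefficients of the given degree
    coeffs = np.polyfit(x, t, degree)
    train_rmse = np.sqrt(np.mean((np.polyval(coeffs, x) - t) ** 2))
    print(f"degree {degree}: training RMSE = {train_rmse:.4f}")
```

Training error is non-increasing in the degree (the models are nested), so overfitting only becomes visible on held-out data.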

Regularized Linear Regression

Coefficients of Overfitting

  • Before we move to regularized linear regression, let's first look at what happens to the coefficients $\vec{w}$ when the model overfits.
In [5]:
|          | M=0 (Underfitting) | M=3 (Good) | M=9 (Overfitting) | M=12 (Overfitting) |
|----------|--------------------|------------|-------------------|--------------------|
| $w_0$    | 9.22491 | 7.66854  | -0.240721 | 0.036502      |
| $w_1$    |         | -75.2974 | 6.36637   | -1.282764     |
| $w_2$    |         | 172.044  | -69.1889  | 19.592658     |
| $w_3$    |         | -14.2807 | 397.481   | -170.650848   |
| $w_4$    |         |          | -1290.02  | 934.785369    |
| $w_5$    |         |          | 2328.31   | -3348.827201  |
| $w_6$    |         |          | -2093.26  | 7896.286428   |
| $w_7$    |         |          | 580.659   | -11973.474257 |
| $w_8$    |         |          | 247.134   | 10891.034834  |
| $w_9$    |         |          | -28.6505  | -4862.568457  |
| $w_{10}$ |         |          |           | 131.114412    |
| $w_{11}$ |         |          |           | 582.180371    |
| $w_{12}$ |         |          |           | -28.827647    |

Regularized Least Squares: Objective Function

  • Recall that the objective function we minimized in the last lecture was $$ E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 $$

  • To penalize large coefficients, we add a penalization/regularization term and minimize the two together: $$ E(\vec{w}) = \underbrace{ \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 }_{E_D(\vec{w})}+ \underbrace{\boxed{\frac{\lambda}{2} \left \| \vec{w} \right \|^2}}_{E_W(\vec{w})} $$ where $E_D(\vec{w})$ is the sum-of-squared-errors term and $E_W(\vec{w})$ is the regularization term.

  • $\lambda$ is the regularization coefficient.

  • If $\lambda$ is large, $E_W(\vec{w})$ dominates the objective function. As a result, minimization focuses on $E_W(\vec{w})$: the resulting solution $\vec{w}$ tends to have a smaller norm, while the $E_D(\vec{w})$ term becomes larger.

Regularized Least Squares: Derivation

  • Based on what we derived in the last lecture, we can write the objective function as $$ \begin{align} E(\vec{w}) &= \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \left \| \vec{w} \right \|^2 \\ &= \frac12 \vec{w}^T \Phi^T \Phi \vec{w} - \vec{t}^T \Phi \vec{w} + \frac12 \vec{t}^T \vec{t} + \frac{\lambda}{2}\vec{w}^T\vec{w} \end{align} $$

  • The gradient is $$ \begin{align} \nabla_\vec{w} E(\vec{w}) &= \Phi^T \Phi \vec{w} - \Phi^T \vec{t} + \lambda \vec{w}\\ &= (\Phi^T \Phi + \lambda I)\vec{w} - \Phi^T \vec{t} \end{align} $$

  • Setting the gradient to 0, we get the solution $$ \boxed{ \hat{\vec{w}}=(\Phi^T \Phi + \lambda I)^{-1} \Phi^T \vec{t} } $$

  • In ordinary least squares, whose solution is $\hat{\vec{w}}=(\Phi^T \Phi)^{-1} \Phi^T \vec{t}$, we cannot guarantee that $\Phi^T \Phi$ is invertible. In regularized least squares, however, $\Phi^T \Phi + \lambda I$ is always invertible when $\lambda > 0$.
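The boxed closed-form solution is straightforward to compute. Here is a minimal sketch; the helper `ridge_fit`, the toy noisy-sine data, and the chosen $\lambda$ values are all illustrative assumptions, not the lecture's code:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Regularized least squares: solve (Phi^T Phi + lam*I) w = Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Assumed toy data: 10 noisy sine samples with degree-9 polynomial features,
# so the (nearly) unregularized fit interpolates and its coefficients blow up.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)
Phi = np.vander(x, 10, increasing=True)   # columns: 1, x, ..., x^9

w_unreg = ridge_fit(Phi, t, 1e-12)        # (near) ordinary least squares
w_reg = ridge_fit(Phi, t, 1e-3)
print("||w|| nearly unregularized:", np.linalg.norm(w_unreg))
print("||w|| with lambda = 1e-3:  ", np.linalg.norm(w_reg))
```

Even a small $\lambda$ shrinks the coefficient norm dramatically, mirroring the coefficient tables in this lecture.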

Regularized Least Squares: Example

In [6]:

Regularized Least Squares: Coefficients

  • Let's look at how the coefficients change after we add regularization
In [7]:
|          | $\lambda = 0$ | $\lambda = e^{1}$ | $\lambda = e^{10}$ |
|----------|---------------|-------------------|--------------------|
| $w_0$    | 30.203406     | 13.701085 | 0.010221  |
| $w_1$    | 7133.542582   | 13.267902 | 0.013419  |
| $w_2$    | -31022.107050 | 12.423422 | 0.020149  |
| $w_3$    | 53507.324765  | 8.040454  | 0.029552  |
| $w_4$    | -48906.151251 | 0.865971  | 0.038200  |
| $w_5$    | 26564.237381  | -4.354079 | 0.034013  |
| $w_6$    | -9013.171136  | -1.827607 | -0.002813 |
| $w_7$    | 1929.741748   | 2.147727  | -0.054953 |
| $w_8$    | -253.351938   | -0.613354 | 0.023119  |
| $w_9$    | 18.620246     | 0.073751  | -0.003275 |
| $w_{10}$ | -0.586531     | -0.003284 | 0.000155  |

Regularized Least Squares: Summary

  • Simple modification of linear regression
  • $\ell^2$ Regularization controls the tradeoff between fitting error and complexity.
    • Small $\ell^2$ regularization results in complex models, but with risk of overfitting
    • Large $\ell^2$ regularization results in simple models, but with risk of underfitting
  • It is important to find a regularization strength that balances the two
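One common way to balance the two in practice is to sweep $\lambda$ over a log grid and keep the value with the lowest error on held-out data. A minimal sketch, with an assumed toy dataset and an assumed train/validation split:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    # Regularized least squares closed form: (Phi^T Phi + lam*I)^{-1} Phi^T t
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

# Assumed toy setup: noisy sine data, half for training and half for validation.
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 40)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
Phi = np.vander(x, 10, increasing=True)
Phi_tr, t_tr, Phi_va, t_va = Phi[:20], t[:20], Phi[20:], t[20:]

def val_mse(lam):
    w = ridge_fit(Phi_tr, t_tr, lam)
    return np.mean((Phi_va @ w - t_va) ** 2)

# Sweep lambda on a log grid; keep the value with the lowest validation error.
grid = [np.exp(k) for k in range(-20, 5)]
best = min(grid, key=val_mse)
print("selected lambda:", best)
```

Cross-validation is the more data-efficient version of the same idea; a single split suffices to show the mechanism.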

Locally-Weighted Linear Regression

Locally-Weighted Linear Regression

  • Main Idea: Given a new observation $\vec{x}$, we generate the coefficients $\vec{w}$ and prediction $y(\vec{x}, \vec{w})$ by giving high weights for neighbours of $\vec{x}$.

  • Regular vs. Locally-Weighted Linear Regression

Linear Regression

1. Fit $\vec{w}$ to minimize $\sum_{n} (\vec{w}^T \phi(\vec{x}_n) - t_n )^2$, where $\{(\vec{x}_n, t_n)\}_{n=1}^N$ is the training dataset.
2. For every new observation $\vec{x}$ to be predicted, output $\vec{w}^T \phi(\vec{x})$.
**Note**: **One** $\vec{w}$ for **all** observations to be predicted.

Locally-weighted Linear Regression

1. For **every** new observation $\vec{x}$ to be predicted, generate a weight $r_n$ for every training sample $(\vec{x}_n, t_n)$. (The closer $\vec{x}_n$ is to $\vec{x}$, the larger $r_n$ will be.)
2. Fit $\vec{w}$ to minimize $\sum_{n} r_n (\vec{w}^T \phi(\vec{x}_n) - t_n )^2$, where $\{(\vec{x}_n, t_n)\}_{n=1}^N$ is the training dataset.
3. Output $\vec{w}^T \phi(\vec{x})$.
**Note**: A **separate** $\vec{w}$ is fit for **each** observation to be predicted.

Locally-Weighted Linear Regression: Weights

  • The standard choice for weights $\vec{r}$ uses the Gaussian Kernel, with kernel width $\tau$ $$ r_n = \exp\left( -\frac{|| \vec{x}_n - \vec{x} ||^2}{2\tau^2} \right) $$

  • Choice of kernel width matters.

  • The bell-shaped curve shows the weights, which peak at the query point $\vec{x}$ and decay as we move farther away.
  • The best kernel width includes as many training points as the model can accommodate. Too large a kernel includes points that degrade the fit; too small a kernel neglects points that would increase confidence in the fit.

Locally-Weighted Linear Regression: Derivation

  • Recall that in regular linear regression, we have $$E(\vec{w}) = \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 = \left \| \Phi \vec{w}-\vec{t} \right \|^2$$

  • In locally-weighted linear regression, we minimize $$E(\vec{w}) = \sum_{n=1}^{N} r_n (\vec{w}^T \phi(\vec{x}_n) - t_n )^2 = \sum_{n=1}^{N} (\sqrt{r_n} \vec{w}^T \phi(\vec{x}_n) - \sqrt{r_n} t_n )^2 = \left \| \sqrt{R} \Phi \vec{w}- \sqrt{R} \vec{t} \right \|^2 $$ where $$R = \begin{bmatrix} r_1 & & & \\ & r_2 & & \\ & & \ddots & \\ & & & r_N \end{bmatrix} $$

  • Recall the solution to $\ \arg \min \left \| \Phi \vec{w}-\vec{t} \right \|^2 \ $ is $\ \Phi^\dagger \vec{t} \ $. Similarly, the solution to $\ \arg \min \left \| \sqrt{R} \Phi \vec{w}- \sqrt{R} \vec{t} \right \|^2 \ $ is $$ \boxed{\hat{\vec{w}} = (\sqrt{R} \Phi)^\dagger \sqrt{R} \vec{t}} $$
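The boxed weighted-pseudoinverse solution can be sketched in a few lines. The function below (`lwr_predict`, a hypothetical helper, with assumed toy data) fits a fresh $\vec{w}$ for each query point, exactly as the procedure above describes:

```python
import numpy as np

def lwr_predict(x_query, X, t, tau, degree=1):
    """Locally-weighted prediction at one query point:
    r_n = exp(-||x_n - x||^2 / (2 tau^2)),  w = (sqrt(R) Phi)^+ sqrt(R) t."""
    Phi = np.vander(X, degree + 1, increasing=True)    # polynomial features
    r = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))   # Gaussian kernel weights
    sqrt_r = np.sqrt(r)
    # Weighted least squares via the pseudoinverse of sqrt(R) Phi
    w = np.linalg.pinv(sqrt_r[:, None] * Phi) @ (sqrt_r * t)
    phi_q = x_query ** np.arange(degree + 1)           # features of the query point
    return w @ phi_q

# Assumed toy data: noisy sine samples; note a new w is fit for every query.
rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 1, 30))
t = np.sin(2 * np.pi * X) + rng.normal(scale=0.1, size=X.size)
print("prediction at x = 0.25:", lwr_predict(0.25, X, t, tau=0.1))
```

With a local linear fit ($\text{degree}=1$) and a moderate $\tau$, the prediction tracks the local trend of the data rather than a single global line.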

Probabilistic Interpretation of Least Squares Regression

  • We have derived the solution to least squares regression by minimizing an objective function. Now we will provide a probabilistic perspective. Specifically, we will show that the solution to regular least squares is the maximum likelihood estimate of $\vec{w}$, and the solution to regularized least squares is the maximum a posteriori (MAP) estimate.

Some Background

  • Gaussian Distribution $$ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] $$

  • Maximum Likelihood Estimation and Maximum a Posteriori Estimation (MAP)

    • Suppose $t \sim p(t|\theta)$, where $\theta$ is some unknown parameter (like the mean or variance) to be estimated.
    • Given observations $\vec{t} = (t_1, t_2, \dots, t_N)$,
      • The Maximum Likelihood Estimator is $$ \theta_{ML} = \arg \max_\theta \prod_{n=1}^N p(t_n | \theta) $$
      • If we have some prior knowledge about $\theta$, the MAP estimator maximizes the posterior probability of $\theta$: $$ \theta_{MAP} = \arg \max_\theta \ p(\theta | \vec{t}) $$
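As a one-parameter toy illustration (an assumption of these notes, not a lecture example): estimate the mean $\theta$ of Gaussian data with known variance. The ML estimate is the sample mean, while a Gaussian prior centered at 0 shrinks the MAP estimate toward 0:

```python
import numpy as np

# Assumed setup: t_n ~ N(theta, sigma^2) with sigma known, prior theta ~ N(0, alpha^{-1}).
rng = np.random.default_rng(4)
sigma2, alpha = 1.0, 2.0
t = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=20)
N = t.size

theta_ml = t.mean()   # maximizes prod_n p(t_n | theta); the sample mean

# MAP maximizes likelihood * prior; the Gaussian closed form shrinks the
# sample mean toward the prior mean 0.
theta_map = (N / sigma2) / (N / sigma2 + alpha) * t.mean()

print("theta_ML  =", theta_ml)
print("theta_MAP =", theta_map)   # pulled toward 0 by the prior
```

As $N$ grows, the shrinkage factor approaches 1 and the MAP estimate approaches the ML estimate: with enough data, the prior matters less.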

Maximum Likelihood Estimator $\vec{w}_{ML}$

  • We assume the signal-plus-noise model for a single datapoint $(\vec{x}, t)$ is $$ \begin{gather} t = \vec{w}^T \phi(\vec{x}) + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \end{gather} $$ where $\vec{w}^T \phi(\vec{x})$ is the true model and $\epsilon$ is the perturbation/noise.

  • Since $\vec{w}^T \phi(\vec{x})$ is deterministic/non-random, we have $$ t \sim \mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1}) $$

  • The likelihood function of $t$ is just probability density function (PDF) of $t$ $$ p(t|\vec{x},\vec{w},\beta) = \mathcal{N}(t|\vec{w}^T \phi(\vec{x}),\beta^{-1}) $$

  • For inputs $\mathcal{X}=(\vec{x}_1, \dots, \vec{x}_N)$ and target values $\vec{t}=(t_1,\dots,t_N)$, the data likelihood is $$ p(\vec{t}|\mathcal{X},\vec{w},\beta) = \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) = \prod_{n=1}^N \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) $$

  • Notation Clarification

    • $p(t|\vec{x},\vec{w},\beta)$ is the PDF of $t$, whose distribution is parameterized by $\vec{x},\vec{w},\beta$.
    • $\mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1})$ is the Gaussian distribution with mean $\vec{w}^T \phi(\vec{x})$ and variance $\beta^{-1}$.
    • $\mathcal{N}(t|\vec{w}^T \phi(\vec{x}),\beta^{-1})$ is the PDF of $t$, which has Gaussian distribution $\mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1})$.

Maximum Likelihood Estimator $\vec{w}_{ML}$

  • Main Idea of Maximum Likelihood Estimate

    • Given $\{ \vec{x}_n, t_n \}_{n=1}^N$, we want to find the $\vec{w}_{ML}$ that maximizes the data likelihood $$ \vec{w}_{ML} =\arg \max p(\vec{t}|\mathcal{X},\vec{w},\beta) =\arg \max \prod_{n=1}^N \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) $$ and we will show by derivation that $\vec{w}_{ML}$ is equivalent to the least squares solution $\hat{\vec{w}} = \Phi^\dagger \vec{t}$.
  • Intuition about Maximum Likelihood Estimation

    • Finding the maximum likelihood estimate $\vec{w}_{ML} = \arg \max p(\vec{t}|\mathcal{X},\vec{w},\beta)$ is just finding the parameter $\vec{w}$ under which the observed targets $\vec{t}=(t_1,\dots,t_N)$ are the most likely to have been generated from the inputs $\mathcal{X}=(\vec{x}_1, \dots, \vec{x}_N)$, among all possible $\vec{t}$.

Maximum Likelihood Estimator $\vec{w}_{ML}$: Derivation

  • Single data likelihood is $$ p(t_n|\vec{x}_n,\vec{w},\beta) = \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) = \frac{1}{\sqrt{2 \pi \beta^{-1}}} \exp \left \{ - \frac{1}{2 \beta^{-1}} (t_n - \vec{w}^T \phi(x_n))^2 \right \} $$

  • Single data log-likelihood is $$ \ln p(t_n|\vec{x}_n,\vec{w},\beta) = - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 $$ We use the logarithm because the maximizer of $f(x)$ is the same as the maximizer of $\log f(x)$, and the logarithm turns the product into a sum, which makes life easier.

  • Complete data log-likelihood is $$ \begin{align} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) &= \ln \left[ \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) \right] = \sum_{n=1}^N \ln p(t_n|\vec{x}_n,\vec{w},\beta) \\ &= \sum_{n=1}^N \left[ - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \end{align} $$

  • Maximum likelihood estimate $\vec{w}_{ML}$ is $$ \begin{align} \vec{w}_{ML} &= \underset{\vec{w}}{\arg \max} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac12 \ln 2 \pi \beta^{-1} - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \sum_{n=1}^N \left[(\vec{w}^T \phi(x_n) - t_n)^2 \right] \end{align} $$

  • Look familiar? Recall that the objective function we minimized in least squares was $E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2$ (the constant factor $\frac12$ does not change the minimizer), so we can conclude that $$ \boxed{\vec{w}_{ML} = \hat{\vec{w}}_{LS} = \Phi^\dagger \vec{t}} $$
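A quick numerical sanity check of this conclusion (with assumed toy data): gradient ascent on the log-likelihood converges to the same $\vec{w}$ as the pseudoinverse solution.

```python
import numpy as np

# Assumed toy data: a noisy line with features phi(x) = (1, x).
rng = np.random.default_rng(6)
x = np.linspace(0, 1, 20)
t = 1.5 * x + 0.3 + rng.normal(scale=0.1, size=x.size)
Phi = np.column_stack([np.ones_like(x), x])
beta = 100.0   # noise precision 1/sigma^2

w_ls = np.linalg.pinv(Phi) @ t   # least-squares / pseudoinverse solution

# Gradient ascent on the log-likelihood: grad = beta * Phi^T (t - Phi w).
# beta scales the gradient but does not move the maximizer.
w = np.zeros(2)
for _ in range(5000):
    w += 1e-4 * beta * (Phi.T @ (t - Phi @ w))

print("gradient ascent:", w)
print("pseudoinverse:  ", w_ls)
```

The two answers agree to numerical precision, as the derivation predicts.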

MAP Estimator $\vec{w}_{MAP}$

  • The MAP estimator is obtained by $$ \begin{align} \vec{w}_{MAP} &= \arg \max p(\vec{w}|\vec{t}, \mathcal{X},\beta) & & (\text{Posterior Probability})\\ &= \arg \max \frac{p(\vec{w}, \vec{t}, \mathcal{X},\beta)}{p(\vec{t}, \mathcal{X}, \beta)} \\ &= \arg \max \frac{p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}, \mathcal{X}, \beta)}{p(\vec{t}, \mathcal{X}, \beta)} \\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}, \mathcal{X}, \beta) & & (p(\vec{t}, \mathcal{X}, \beta) \text{ does not depend on } \vec{w})\\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) p(\mathcal{X}) p(\beta) & & (\text{Independence}) \\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) & & (\text{Likelihood} \times \text{Prior}) \end{align} $$ We are just applying Bayes' theorem in the steps above.
  • The only difference from ML estimator is we have an extra term of PDF of $\vec{w}$. This is the prior belief of $\vec{w}$. Here, we assume, $$ \vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I) $$
  • ML vs. MAP
    • Maximum Likelihood: We know nothing about $\vec{w}$ initially, so every $\vec{w}$ is equally likely.
    • Maximum a Posteriori: We know something about $\vec{w}$ initially, so certain $\vec{w}$ are more likely than others (depending on the prior $p(\vec{w})$). In other words, the candidate $\vec{w}$ are weighted.
  • Assumption $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ makes sense because
    • In regularized least squares
      • We already know that large coefficients $\vec{w}$, which may lead to overfitting, should be avoided.
      • The larger the regularization coefficient $\lambda$, the smaller $\left \| \vec{w} \right \|$ will be.
    • When we use $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$
      • $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ encodes the assumption that a $\vec{w}$ with a smaller norm $\left \| \vec{w} \right \|$ is more "likely" than a $\vec{w}$ with a bigger norm.
      • When we increase $\alpha$, the variance $\alpha^{-1}$ shrinks, so a small $\left \| \vec{w} \right \|$ becomes much more likely.

MAP Estimator $\vec{w}_{MAP}$: Derivation

  • $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ is a multivariate Gaussian over the $M$ components of $\vec{w}$, with PDF $$ p(\vec{w}) = \frac{1}{\left( \sqrt{2 \pi \alpha^{-1}} \right)^M} \exp \left \{ -\frac{1}{2 \alpha^{-1}} \left \| \vec{w} \right \|^2 \right \} $$

  • So the MAP estimator is $$ \begin{align} \vec{w}_{MAP} &= \underset{\vec{w}}{\arg \max} \ p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) = \underset{\vec{w}}{\arg \max} \left[\ln p(\vec{t}|\vec{w}, \mathcal{X},\beta) + \ln p(\vec{w}) \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac{\beta}{2} (\vec{w}^T \phi(x_n) - t_n)^2 + \frac{\alpha}{2} \left \| \vec{w} \right \|^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac12 (\vec{w}^T \phi(x_n) - t_n)^2 + \frac12 \frac{\alpha}{\beta} \left \| \vec{w} \right \|^2 \right] \end{align} $$

  • Exactly the objective in regularized least squares! So $$ \boxed{ \vec{w}_{MAP} = \hat{\vec{w}}=\left(\Phi^T \Phi + \frac{\alpha}{\beta} I\right)^{-1} \Phi^T \vec{t} } $$
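As a numerical sanity check (with assumed values for $\alpha$, $\beta$, and toy data), the closed-form solution with $\lambda = \alpha/\beta$ indeed zeroes the gradient of the regularized objective:

```python
import numpy as np

# Assumed prior precision alpha and noise precision beta (illustrative values).
alpha, beta = 0.5, 25.0
lam = alpha / beta   # the equivalent regularization coefficient

# Assumed toy data with degree-5 polynomial features.
rng = np.random.default_rng(5)
x = np.linspace(0, 1, 12)
t = np.sin(2 * np.pi * x) + rng.normal(scale=1 / np.sqrt(beta), size=x.size)
Phi = np.vander(x, 6, increasing=True)

# Closed-form MAP estimate = ridge solution with lambda = alpha / beta
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(6), Phi.T @ t)

# Sanity check: gradient of the regularized objective vanishes at w_map
grad = (Phi.T @ Phi + lam * np.eye(6)) @ w_map - Phi.T @ t
print("max |gradient| at w_MAP:", np.abs(grad).max())
```

A stronger prior (larger $\alpha$) or a noisier likelihood (smaller $\beta$) both increase $\lambda = \alpha/\beta$, tying the Bayesian knobs directly to the regularization strength.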