$$ \LaTeX \text{ command declarations here.} \newcommand{\R}{\mathbb{R}} \renewcommand{\vec}[1]{\mathbf{#1}} $$
In [1]:
%matplotlib inline
from Lec05 import *

EECS 545: Machine Learning

Lecture 05: Linear Regression II

  • Instructor: Jacob Abernethy
  • Date: January 25, 2015

Lecture Exposition Credit: Benjamin Bray and Zhe Du

Outline for this Lecture

  • Overfitting
  • Regularized Least Squares
  • Locally-Weighted Linear Regression
  • Maximum Likelihood Interpretation of Linear Regression

Reading List

  • Required:
    • [PRML], §3.2: The Bias-Variance Decomposition
    • [PRML], §3.3: Bayesian Linear Regression
  • Optional:
    • [MLAPP], Chapter 7: Linear Regression

In this lecture, we will first look at how the polynomial degree and the size of the training dataset affect overfitting in linear regression. To deal with overfitting, we introduce regularized least squares. When predicting the label of a new observation, if we want to rely more on nearby training data than on distant training data, we can use locally-weighted linear regression. Finally, we will show that ordinary and regularized linear regression can each be interpreted from a probabilistic perspective, using maximum likelihood estimation and maximum a posteriori estimation respectively.

Overfitting

Overfitting: Degree of Linear Regression

In [2]:
regression_overfitting_degree(degree0=0, degree1=3,degree2=9,degree3=12)

Remark

  • In the above plots, we try to predict the true sinusoidal curve hidden in the data with polynomial degrees 0, 3, 9 and 12.
  • In the first plot (degree=0), we only get a horizontal line. The learned curve can neither fit the training data nor match the true curve. This is called Underfitting.
  • In the second plot (degree=3), the learned curve fits both the data and the true curve well. This is a good degree for our setting.
  • In the third (degree=9) and fourth (degree=12) plots, as the degree increases, the training data are fitted better but the learned curve deviates further from the true curve. This is called Overfitting.
  • Explanations of why degree could impact the predicted curve will come in the Remark later.
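
The experiment behind these plots can be sketched as follows. This is a minimal sketch, not the code in `Lec05`: the data generator `make_data`, the noise level of 0.3, the random seed, and the helper `rmse` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only

def make_data(n, noise_std=0.3):
    """Hypothetical generator: noisy samples of sin(2*pi*x) on [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=n)
    return x, t

def rmse(y, t):
    """Root mean squared error between predictions y and targets t."""
    return np.sqrt(np.mean((y - t) ** 2))

x_train, t_train = make_data(13)
x_test, t_test = make_data(1000)

train_rmse, test_rmse = {}, {}
for degree in (0, 3, 9, 12):
    # Polynomial features 1, x, ..., x^degree
    Phi_train = np.vander(x_train, degree + 1, increasing=True)
    Phi_test = np.vander(x_test, degree + 1, increasing=True)
    # Ordinary least squares fit
    w, *_ = np.linalg.lstsq(Phi_train, t_train, rcond=None)
    train_rmse[degree] = rmse(Phi_train @ w, t_train)
    test_rmse[degree] = rmse(Phi_test @ w, t_test)
    print(f"degree={degree:2d}  train RMSE={train_rmse[degree]:.3f}  "
          f"test RMSE={test_rmse[degree]:.3f}")
```

Because the models are nested, the training RMSE cannot increase with the degree. The test RMSE is the quantity that reveals underfitting and overfitting.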

Overfitting: Dataset Size

In [3]:
regression_overfitting_datasetsize(size0 = 13, size1 = 50, size2 = 100, size3 = 500)

Remark

  • In the above plots, we try to predict the true sinusoidal curve hidden in the data with polynomial degree 12 and training dataset size 13, 50, 100 and 500.
  • In the first plot (size=13), overfitting occurs as we have seen just now.
  • Comparing the four plots, we can see that overfitting diminishes as the size increases. Although the learned curve passes through fewer and fewer of the data points (the training error increases), it matches the true curve better (the test error decreases).
  • Explanations of why the training dataset size impacts the predicted curve will come in the Remark later.
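
The same kind of sketch works with the degree fixed and the dataset size varied. Again, the generator `sample`, the seed, and the noise level `noise_std` are illustrative assumptions rather than the contents of `Lec05`.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumed seed
degree = 12                     # fixed model complexity
noise_std = 0.3                 # assumed noise level

def sample(n):
    """Hypothetical generator: noisy samples of sin(2*pi*x) on [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=n)

def features(x):
    """Polynomial features 1, x, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def rmse(y, t):
    return np.sqrt(np.mean((y - t) ** 2))

x_test, t_test = sample(2000)
train_rmse, test_rmse = {}, {}
for size in (13, 50, 100, 500):
    x_train, t_train = sample(size)
    w, *_ = np.linalg.lstsq(features(x_train), t_train, rcond=None)
    train_rmse[size] = rmse(features(x_train) @ w, t_train)
    test_rmse[size] = rmse(features(x_test) @ w, t_test)
    print(f"size={size:3d}  train RMSE={train_rmse[size]:.3f}  "
          f"test RMSE={test_rmse[size]:.3f}")
```

With 13 points a degree-12 polynomial can nearly interpolate the data, so the training RMSE is close to zero. With 500 points it can no longer do so, and the training RMSE approaches the noise level.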

Overfitting: Overall Performance

  • On the left plot, we fix the dataset size and vary the polynomial degree
  • On the right plot, we fix the polynomial degree and vary the dataset size
In [4]:
regression_overfitting_curve()

Remark

  • The plots show the root mean squared error (RMSE) on the training dataset and the test dataset as a function of the degree and of the dataset size.
  • NOTE: For simplicity of presentation, we divided the dataset into training set and test set. However, it's not legitimate to find the optimal hyperparameter based on the test set. We will talk about legitimate ways of doing this when we cover model selection and cross-validation.
  • Combining the last 10 plots, we can draw the following conclusions:
  • Degree
    • When the degree is very small, the regressor is not powerful enough to learn the underlying true model. Underfitting occurs: both training error and test error are high.
    • As the degree increases, the regressor becomes powerful enough to roughly learn the true model, but not powerful enough to also fit the noise. Underfitting diminishes: both training error and test error become smaller.
    • As the degree increases further, the regressor becomes so powerful that it can fit most of the training data perfectly. Since the data are noisy, the learned model then deviates from the true model. Overfitting occurs: the training error can become very small (sometimes even 0), while the test error increases.
  • Dataset Size
    • When the training dataset is small, a powerful regressor (i.e. one with high degree) can fit every single training sample, so it also fits the noise. This corresponds to the right end of the left plot (error versus degree).
    • As the training dataset grows, the power of the regressor stays finite (the degree is fixed), so it can no longer fit every single sample. Instead it learns a curve such that the samples fall on both sides roughly equally. Since the noise is assumed to be zero-mean Gaussian, which is symmetric about 0, this curve is close to the underlying true curve.
    • The training error and test error converge to about 0.3, the noise level set in the Python code (you can check this by examining the code). Note that RMSE is on the scale of the noise standard deviation, not its variance. This is not a coincidence and can be derived. We will cover this later when we study the bias-variance tradeoff.
  • $$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (y_n-t_n)^2}$$

Rule of Thumb to Choose the Degree

  • For a small number of datapoints, use a low degree
    • Otherwise, the model will overfit!
  • As you obtain more data, you can gradually increase the degree
    • Add more features to represent more data
    • Warning: Your model is still limited by the finite amount of data available. The optimal model for finite data cannot be an infinite-dimensional polynomial!
  • Use regularization to control model complexity.

Regularized Linear Regression

Coefficients of Overfitting

  • Before we move to regularized linear regression, let's first look at what happens to the coefficients $\vec{w}$ when overfitting occurs.
In [5]:
regression_overfitting_coeffs()
| | M=0 (Underfitting) | M=3 (Good) | M=9 (Overfitting) | M=12 (Overfitting) |
|---|---|---|---|---|
| w_0 | 9.22491 | 7.66854 | -0.240721 | 0.036502 |
| w_1 | | -75.2974 | 6.36637 | -1.282764 |
| w_2 | | 172.044 | -69.1889 | 19.592658 |
| w_3 | | -14.2807 | 397.481 | -170.650848 |
| w_4 | | | -1290.02 | 934.785369 |
| w_5 | | | 2328.31 | -3348.827201 |
| w_6 | | | -2093.26 | 7896.286428 |
| w_7 | | | 580.659 | -11973.474257 |
| w_8 | | | 247.134 | 10891.034834 |
| w_9 | | | -28.6505 | -4862.568457 |
| w_10 | | | | 131.114412 |
| w_11 | | | | 582.180371 |
| w_12 | | | | -28.827647 |

Remark

  • The table above corresponds to the coefficients (multiplied by 100 for better visualization) for different degrees.
  • We could see that when overfitting occurs, we get some really crazy and large numbers!
  • So one intuition to handle overfitting is to penalize large coefficients.
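
The growth of the coefficients can be reproduced with a small sketch. The data here (13 equally spaced inputs, noise level 0.3, fixed seed) are assumptions for illustration, so the numbers will differ from the table produced by `regression_overfitting_coeffs()`.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed
x = np.linspace(0.0, 1.0, 13)   # assumed training inputs
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)

max_abs_coeff = {}
for degree in (0, 3, 9, 12):
    Phi = np.vander(x, degree + 1, increasing=True)  # features 1, x, ..., x^degree
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)      # unregularized least squares
    max_abs_coeff[degree] = np.max(np.abs(w))
    print(f"degree={degree:2d}  max |w_j| = {max_abs_coeff[degree]:.3g}")
```

The largest coefficient magnitude grows quickly with the degree, which is what motivates penalizing $\left \| \vec{w} \right \|$.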

Regularized Least Squares: Objective Function

  • Recall that the objective function we minimized in the last lecture is $$ E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 $$

  • To penalize large coefficients, we add a penalization/regularization term and minimize the sum: $$ E(\vec{w}) = \underbrace{ \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 }_{E_D(\vec{w})}+ \underbrace{\boxed{\frac{\lambda}{2} \left \| \vec{w} \right \|^2}}_{E_W(\vec{w})} $$ where $E_D(\vec{w})$ is the sum-of-squared-errors term and $E_W(\vec{w})$ is the regularization term.

  • $\lambda$ is the regularization coefficient.

  • If $\lambda$ is large, $E_W(\vec{w})$ will dominate the objective function. As a result, the minimization focuses more on $E_W(\vec{w})$: the resulting solution $\vec{w}$ tends to have a smaller norm, and the $E_D(\vec{w})$ term will be larger.

Regularized Least Squares: Derivation

  • Based on what we have derived in last lecture, we could write the objective function as $$ \begin{align} E(\vec{w}) &= \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \left \| \vec{w} \right \|^2 \\ &= \frac12 \vec{w}^T \Phi^T \Phi \vec{w} - \vec{t}^T \Phi \vec{w} + \frac12 \vec{t}^T \vec{t} + \frac{\lambda}{2}\vec{w}^T\vec{w} \end{align} $$

  • The gradient is $$ \begin{align} \nabla_{\vec{w}} E(\vec{w}) &= \Phi^T \Phi \vec{w} - \Phi^T \vec{t} + \lambda \vec{w}\\ &= (\Phi^T \Phi + \lambda I)\vec{w} - \Phi^T \vec{t} \end{align} $$

  • Setting the gradient to 0, we will get the solution $$ \boxed{ \hat{\vec{w}}=(\Phi^T \Phi + \lambda I)^{-1} \Phi^T \vec{t} } $$

  • In the solution to ordinary least squares, $\hat{\vec{w} }=(\Phi^T \Phi)^{-1} \Phi^T \vec{t}$, we cannot guarantee that $\Phi^T \Phi$ is invertible. In regularized least squares, however, $\Phi^T \Phi + \lambda I$ is always invertible when $\lambda > 0$: $\Phi^T \Phi$ is positive semidefinite, so adding $\lambda I$ makes every eigenvalue at least $\lambda > 0$.
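
The closed-form solution translates directly into NumPy. The sketch below is an illustration under assumed data: the function name `ridge_fit` and the toy design matrix with a duplicated column are hypothetical, chosen so that $\Phi^T \Phi$ is singular while $\Phi^T \Phi + \lambda I$ is not.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Solve (Phi^T Phi + lam * I) w = Phi^T t for w."""
    M = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(M)
    # Solving the linear system is preferable to forming the inverse explicitly
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)  # assumed seed
x = rng.uniform(size=20)
# Duplicated feature column: Phi^T Phi is rank deficient, so OLS has no unique solution
Phi = np.column_stack([np.ones_like(x), x, x])
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.size)

lam = 0.1
w = ridge_fit(Phi, t, lam)

# The gradient (Phi^T Phi + lam * I) w - Phi^T t should vanish at the solution
grad = (Phi.T @ Phi + lam * np.eye(3)) @ w - Phi.T @ t
print("rank of Phi^T Phi:", np.linalg.matrix_rank(Phi.T @ Phi))
print("w =", w)
print("max |gradient| =", np.max(np.abs(grad)))
```

The two duplicated columns receive equal weights, because the regularized objective has a unique minimizer and is symmetric in those two coefficients.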

Regularized Least Squares: Different Norms

  • The $\ell^p$ norm of a vector $\vec{x}$ is defined as $$ \left \| \vec{x} \right \|_p = \left( \sum_{j=1}^{M} |x_j|^p \right)^{\frac{1}{p}} $$

  • For the regularized least squares above, we used $\ell^2$ norm. We could also use other $\ell^p$ norms for different regularizers and the objective function becomes $$ E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \left \| \vec{w} \right \|_p^p $$

    (Contour of Different p-norms)
  • Lasso regularization ($\ell^1$ regularization) tends to generate sparser solutions than ridge regression ($\ell^2$ regularization)
    (Image taken from [here](http://gerardnico.com/wiki/data_mining/lasso))

Remark

  • RSS is the residual sum of squares, which is the sum of squared errors $E_D(\vec{w})$ we use.
  • This plot illustrates intuitively why lasso tends to give sparser solutions than ridge regression.
  • Our objective function is $$ E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \left \| \vec{w} \right \|_p^p $$ To illustrate, let's look at a closely related constrained problem (not exactly equivalent; it is used for illustration only): $$ \begin{aligned} & {\text{minimize}} & & \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2\\ & \text{subject to} & & \frac{\lambda}{2} \left \| \vec{w} \right \|_p^p \leq C \end{aligned} $$
  • The objective function $\frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2$ is shown as the red contours in the plots. The optimal solution without the constraint is $\hat{\vec{w}}_{OLS}$, which is just the solution to ordinary least squares. The farther we move away from $\hat{\vec{w}}_{OLS}$, the larger the objective function becomes.
  • The feasible region satisfying the constraint is the cyan area in the plots. The solution must fall in this region. For lasso, it's a diamond; for ridge regression, it's a circle.
  • As we increase the value of the objective function, the contour expands. The point where the contour first touches the feasible cyan region is the optimal solution, because the solution should have the smallest possible objective value (be as close to $\hat{\vec{w}}_{OLS}$ as possible) while lying in the feasible region.
  • Since the lasso region is a diamond with four corners, the contours tend to touch a corner first. The corners lie on the axes, where at least one component is zero, so the lasso solution tends to be sparse. In contrast, the circular region in ridge regression has no corners, so it does not favor solutions on the axes.
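
The sparsity effect can also be seen algebraically in a special case. The sketch below assumes an orthonormal design, $\Phi^T \Phi = I$, which is not assumed anywhere else in this lecture. Under that assumption the objective separates by coordinate: the $\ell^2$ penalty $\frac{\lambda}{2}\|\vec{w}\|_2^2$ rescales each OLS coefficient by $1/(1+\lambda)$, while the $\ell^1$ penalty $\frac{\lambda}{2}\|\vec{w}\|_1$ soft-thresholds each OLS coefficient at $\lambda/2$. The function names and the example vector `w_ols` are hypothetical.

```python
import numpy as np

def ridge_orthonormal(w_ols, lam):
    """Ridge solution when Phi^T Phi = I: uniform shrinkage of the OLS solution."""
    return w_ols / (1.0 + lam)

def lasso_orthonormal(w_ols, lam):
    """Lasso solution when Phi^T Phi = I: soft-thresholding at lam / 2."""
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam / 2.0, 0.0)

w_ols = np.array([3.0, -0.4, 0.2, -1.5])  # hypothetical OLS coefficients
lam = 1.0

w_ridge = ridge_orthonormal(w_ols, lam)
w_lasso = lasso_orthonormal(w_ols, lam)

print("ridge:", w_ridge, "nonzeros:", np.count_nonzero(w_ridge))
print("lasso:", w_lasso, "nonzeros:", np.count_nonzero(w_lasso))
```

Ridge shrinks every coefficient but leaves all of them nonzero, whereas lasso sets the coefficients with $|w_j| \leq \lambda/2$ exactly to zero.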

Regularized Least Squares: Example

In [6]:
regression_regularization_plot()

Remark

  • In the second plot, we can see that after the regularization term is added, the learned curve looks much more like the true curve than the learned curve without regularization in the first plot.
  • However, in the third plot, when the regularization coefficient $\lambda$ is too large, the minimization mostly focuses on minimizing $\left \| \vec{w} \right \|$, so the learned curve deviates greatly from the true curve. (We will see in the next slide that the coefficients $\vec{w}$ are really small in this case.)
  • In the fourth plot, as we increase $\lambda$, the training error increases monotonically because we move farther away from minimizing the sum of squared errors $E_D(\vec{w})$. The test error has its minimum near $\lambda=1$, which is the balancing point in the tradeoff between minimizing $E_D(\vec{w})$ and minimizing the regularization term $E_W(\vec{w})$.

Regularized Least Squares: Coefficients

  • Let's look at how the coefficients change after we add regularization
In [7]:
regression_regularization_coeff()
| | $\lambda=0$ | $\lambda=e^{1}$ | $\lambda=e^{10}$ |
|---|---|---|---|
| w_0 | 30.203406 | 13.701085 | 0.010221 |
| w_1 | 7133.542582 | 13.267902 | 0.013419 |
| w_2 | -31022.107050 | 12.423422 | 0.020149 |
| w_3 | 53507.324765 | 8.040454 | 0.029552 |
| w_4 | -48906.151251 | 0.865971 | 0.038200 |
| w_5 | 26564.237381 | -4.354079 | 0.034013 |
| w_6 | -9013.171136 | -1.827607 | -0.002813 |
| w_7 | 1929.741748 | 2.147727 | -0.054953 |
| w_8 | -253.351938 | -0.613354 | 0.023119 |
| w_9 | 18.620246 | 0.073751 | -0.003275 |
| w_10 | -0.586531 | -0.003284 | 0.000155 |

Remark

  • From the table above, we can see the regularization term has effectively constrained those huge coefficients.
  • However, when $\lambda$ is large ($\lambda = e^{10}$), the coefficients are "over-regularized".
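
A sketch of how the coefficient norm responds to $\lambda$ is given below. The data, the seed, and the helper `ridge_fit` are assumptions, and $\lambda = e^{-10}$ stands in for $\lambda = 0$ to keep the linear system well conditioned, so the numbers will not match the table from `regression_regularization_coeff()`.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Regularized least squares: solve (Phi^T Phi + lam * I) w = Phi^T t."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)           # assumed seed
x = np.linspace(0.0, 1.0, 13)            # assumed training inputs
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)
Phi = np.vander(x, 11, increasing=True)  # degree-10 features: w_0, ..., w_10

norms = []
for log_lam in (-10, 1, 10):
    w = ridge_fit(Phi, t, np.exp(log_lam))
    norms.append(np.linalg.norm(w))
    print(f"lambda = e^{log_lam:<3d}  ||w|| = {norms[-1]:.4g}")
```

The norm of $\hat{\vec{w}}$ decreases as $\lambda$ grows. At $\lambda = e^{10}$ the coefficients are pushed so close to zero that the model can no longer follow the data.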

Regularized Least Squares: Summary

  • Simple modification of linear regression
  • $\ell^2$ Regularization controls the tradeoff between fitting error and complexity.
    • Small $\ell^2$ regularization results in complex models, but with risk of overfitting
    • Large $\ell^2$ regularization results in simple models, but with risk of underfitting
  • It is important to find an optimal regularization that balances between the two

Locally-Weighted Linear Regression

Locally-Weighted Linear Regression

  • Main Idea: Given a new observation $\vec{x}$, we generate the coefficients $\vec{w}$ and prediction $y(\vec{x}, \vec{w})$ by giving high weights for neighbours of $\vec{x}$.

  • Regular vs. Locally-Weighted Linear Regression

Linear Regression

1. Fit $\vec{w}$ to minimize $\sum_{n} (\vec{w}^T \phi(\vec{x}_n) - t_n )^2$, where $\{(\vec{x}_n, t_n)\}_{n=1}^N$ is the training dataset.
2. For every new observation $\vec{x}$ to be predicted, output $\vec{w}^T \phi(\vec{x})$.
**Note**: **One** $\vec{w}$ for **all** observations to be predicted.

Locally-weighted Linear Regression

1. For **every** new observation $\vec{x}$ to be predicted, generate the weights $r_n$ for every training sample $(\vec{x}_n, t_n)$. (The closer $\vec{x}_n$ is to $\vec{x}$, the larger $r_n$ will be)
2. Fit $\vec{w}$ to minimize $\sum_{n} r_n (\vec{w}^T \phi(\vec{x}_n) - t_n )^2$, where $\{(\vec{x}_n, t_n)\}_{n=1}^N$ is the training dataset.
3. Output $\vec{w}^T \phi(\vec{x})$.
**Note**: **One** $\vec{w}$ for **only one** observation to be predicted.

Regular vs. Locally-Weighted Linear Regression

  • For linear regression with degree=1, the learned curve for regular linear regression is a straight line, while the learned curve for locally-weighted linear regression is a curve. For queries/observations with different features (x-axis values), the weights $\vec{r}$ are different, and thus the solutions $\hat{\vec{w}}$ are also different. For degree-1 regression, $\hat{\vec{w}}$ corresponds to the intercept and slope of the learned line, so in locally-weighted linear regression the slope changes from query to query and the result is generally not a straight line.

Locally-Weighted Linear Regression: Weights

  • The standard choice for weights $\vec{r}$ uses the Gaussian Kernel, with kernel width $\tau$ $$ r_n = \exp\left( -\frac{|| \vec{x}_n - \vec{x} ||^2}{2\tau^2} \right) $$

  • Choice of kernel width matters.

  • The bell shape is the weight curve, which has maximum at the query point $\vec{x}$ and decreases as we move farther.
  • The best kernel includes as many training points as can be accommodated by the model. Too large a kernel includes points that degrade the fit; too small a kernel neglects points that increase confidence in the fit.
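
The Gaussian kernel weights can be computed as in the following sketch. The function name `gaussian_weights` and the toy inputs are hypothetical.

```python
import numpy as np

def gaussian_weights(X_train, x_query, tau):
    """Weights r_n = exp(-||x_n - x||^2 / (2 tau^2)) for each row x_n of X_train."""
    X_train = np.asarray(X_train, dtype=float)
    sq_dist = np.sum((X_train - x_query) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * tau ** 2))

# Four one-dimensional training inputs, stored as an (N, 1) array
X_train = np.array([[0.0], [0.5], [1.0], [2.0]])
x_query = np.array([0.5])

r_narrow = gaussian_weights(X_train, x_query, tau=0.5)
r_wide = gaussian_weights(X_train, x_query, tau=5.0)
print("tau = 0.5:", np.round(r_narrow, 4))
print("tau = 5.0:", np.round(r_wide, 4))
```

The weight equals 1 at the query point and decays with distance. A larger $\tau$ flattens the bell, so distant points keep weights close to 1 and the fit behaves more like ordinary linear regression.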

Remark

  • Note weight $\vec{r}$ and resulting $\hat{\vec{w}}$ depend on $\vec{x}$ (query point); we must solve linear regression for each query point $\vec{x}$.
  • Can be reformulated as a modified version of least squares problem.
  • The best kernel width requires hyperparameter tuning

Locally-Weighted Linear Regression: Derivation

  • Recall that in regular linear regression, we have $$E(\vec{w}) = \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 = \left \| \Phi \vec{w}-\vec{t} \right \|^2$$

  • In locally-weighted linear regression, we are to minimize $$E(\vec{w}) = \sum_{n=1}^{N} r_n (\vec{w}^T \phi(\vec{x}_n) - t_n )^2 = \sum_{n=1}^{N} (\sqrt{r_n} \vec{w}^T \phi(\vec{x}_n) - \sqrt{r_n} t_n )^2 = \left \| \sqrt{R} \Phi \vec{w}- \sqrt{R} \vec{t} \right \|^2 $$ of which $$R = \begin{bmatrix} r_1 & & & \\ & r_2 & & \\ & & \ddots & \\ & & & r_N \end{bmatrix} $$

  • Recall the solution to $\ \arg \min \left \| \Phi \vec{w}-\vec{t} \right \|^2 \ $ is $\ \Phi^\dagger \vec{t} \ $. Similarly, the solution to $\ \arg \min \left \| \sqrt{R} \Phi \vec{w}- \sqrt{R} \vec{t} \right \|^2 \ $ is $$ \boxed{\hat{\vec{w}} = (\sqrt{R} \Phi)^\dagger \sqrt{R} \vec{t}} $$

Remark

  • When $\sqrt{R}\Phi$ has linearly independent columns, the solution can be written as $$ \boxed{\hat{\vec{w}} = (\Phi^T R \Phi)^{-1} \Phi^T R \vec{t}} $$ Recall that when $A$ has linearly independent columns, $A^\dagger = (A^T A)^{-1} A^T$. Applying this with $A = \sqrt{R}\Phi$ gives $A^T A = \Phi^T R \Phi$ and $A^T \sqrt{R} \vec{t} = \Phi^T R \vec{t}$.
  • We omitted $\frac12$ in the above $E(\vec{w})$ because it does not affect the solution $\vec{w}$. We added $\frac12$ in the last lecture because it cancels the coefficient 2 generated by differentiating the quadratic expression.
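
The two forms of the solution can be compared numerically. The sketch below uses degree-1 features and assumed data; the function name `lwlr_fit`, the seed, and the kernel widths are illustrative choices.

```python
import numpy as np

def lwlr_fit(x_query, x_train, t_train, tau):
    """Fit a degree-1 locally-weighted regression around x_query."""
    Phi = np.column_stack([np.ones_like(x_train), x_train])   # features [1, x]
    r = np.exp(-(x_train - x_query) ** 2 / (2.0 * tau ** 2))  # Gaussian weights
    # Normal-equation form: (Phi^T R Phi) w = Phi^T R t, with R = diag(r)
    A = Phi.T @ (r[:, None] * Phi)
    b = Phi.T @ (r * t_train)
    return np.linalg.solve(A, b), Phi, r

rng = np.random.default_rng(0)  # assumed seed
x_train = np.linspace(0.0, 1.0, 50)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.1, size=50)

x_query = 0.25
w_normal, Phi, r = lwlr_fit(x_query, x_train, t_train, tau=0.1)

# Pseudo-inverse form: w = (sqrt(R) Phi)^+ sqrt(R) t
sqrt_r = np.sqrt(r)
w_pinv = np.linalg.pinv(sqrt_r[:, None] * Phi) @ (sqrt_r * t_train)

y_local = w_normal[0] + w_normal[1] * x_query
print("prediction at x = 0.25:", y_local)

# With a very wide kernel all weights are close to 1, so the fit approaches OLS
w_wide, _, _ = lwlr_fit(x_query, x_train, t_train, tau=1e3)
w_ols, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)
```

Note that `lwlr_fit` has to be called again for every new query point, since the weights and therefore $\hat{\vec{w}}$ depend on the query.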

Probabilistic Interpretation of Least Squares Regression

  • We have derived the solution to least squares regression by minimizing an objective function. Now we will provide a probabilistic perspective. Specifically, we will show that the solution to regular least squares is just the maximum likelihood estimate of $\vec{w}$, and that the solution to regularized least squares is the Maximum a Posteriori estimate.

Some Background

  • Gaussian Distribution $$ \mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] $$

  • Maximum Likelihood Estimation and Maximum a Posteriori Estimation (MAP)

    • Consider a distribution $t \sim p(t|\theta)$, where $\theta$ is some unknown parameter (like the mean or variance) to be estimated.
    • Given independent observations $\vec{t} = (t_1, t_2, \dots, t_N)$,
      • The Maximum Likelihood Estimator is $$ \theta_{ML} = \underset{\theta}{\arg \max} \prod_{n=1}^N p(t_n | \theta) $$
      • If we have some prior knowledge about $\theta$, expressed as a prior $p(\theta)$, the MAP estimator maximizes the posterior probability of $\theta$: $$ \theta_{MAP} = \underset{\theta}{\arg \max} \ p(\theta | \vec{t}) = \underset{\theta}{\arg \max} \ p(\theta) \prod_{n=1}^N p(t_n | \theta) $$
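
As a small worked example of the two estimators, consider estimating the mean $\theta$ of a Gaussian with known variance. The setup is an assumption for illustration: observations $t_n \sim \mathcal{N}(\theta, \sigma^2)$ with known $\sigma^2$, and a prior $\theta \sim \mathcal{N}(0, s_0^2)$. Setting the derivative of the log-likelihood (respectively log-posterior) to zero gives $\theta_{ML} = \frac1N \sum_n t_n$ and $\theta_{MAP} = \frac{\sum_n t_n}{N + \sigma^2 / s_0^2}$. The sketch checks both against a grid search.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed
sigma2 = 1.0      # assumed known noise variance
prior_var = 0.5   # assumed prior variance s_0^2 of theta
theta_true = 2.0  # hypothetical true mean

t = theta_true + rng.normal(0.0, np.sqrt(sigma2), size=5)
N = t.size

# Closed-form estimators
theta_ml = t.mean()
theta_map = t.sum() / (N + sigma2 / prior_var)

# Numerical check: maximize log-likelihood and log-posterior over a grid
grid = np.linspace(-5.0, 5.0, 100001)
log_lik = -0.5 * ((t[:, None] - grid[None, :]) ** 2).sum(axis=0) / sigma2
log_prior = -0.5 * grid ** 2 / prior_var
theta_ml_grid = grid[np.argmax(log_lik)]
theta_map_grid = grid[np.argmax(log_lik + log_prior)]

print("ML :", theta_ml, "(grid:", theta_ml_grid, ")")
print("MAP:", theta_map, "(grid:", theta_map_grid, ")")
```

The zero-mean prior pulls the MAP estimate toward 0 relative to the ML estimate. The pull weakens as $N$ grows, because the factor $N / (N + \sigma^2/s_0^2)$ approaches 1.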

Maximum Likelihood Estimator $\vec{w}_{ML}$

  • We assume the signal+noise model for a single data point $(\vec{x}, t)$ is $$ \begin{gather} t = \vec{w}^T \phi(\vec{x}) + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \end{gather} $$ where $\vec{w}^T \phi(\vec{x})$ is the true model and $\epsilon$ is the perturbation/randomness.

  • Since $\vec{w}^T \phi(\vec{x})$ is deterministic/non-random, we have $$ t \sim \mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1}) $$

  • The likelihood function of $t$ is just probability density function (PDF) of $t$ $$ p(t|\vec{x},\vec{w},\beta) = \mathcal{N}(t|\vec{w}^T \phi(\vec{x}),\beta^{-1}) $$

  • For inputs $\mathcal{X}=(\vec{x}_1, \dots, \vec{x}_N)$ and target values $\vec{t}=(t_1,\dots,t_N)$, the data likelihood is $$ p(\vec{t}|\mathcal{X},\vec{w},\beta) = \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) = \prod_{n=1}^N \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) $$

  • Notation Clarification

    • $p(t|\vec{x},\vec{w},\beta)$ is the PDF of $t$, whose distribution is parameterized by $\vec{x},\vec{w},\beta$.
    • $\mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1})$ is the Gaussian distribution with mean $\vec{w}^T \phi(\vec{x})$ and variance $\beta^{-1}$.
    • $\mathcal{N}(t|\vec{w}^T \phi(\vec{x}),\beta^{-1})$ is the PDF of $t$, which has the Gaussian distribution $\mathcal{N}(\vec{w}^T \phi(\vec{x}), \beta^{-1})$.

Maximum Likelihood Estimator $\vec{w}_{ML}$

  • Main Idea of Maximum Likelihood Estimate

    • Given $\{ \vec{x}_n, t_n \}_{n=1}^N$, we want to find $\vec{w}_{ML}$ that maximizes data likelihood function $$ \vec{w}_{ML} =\arg \max p(\vec{t}|\mathcal{X},\vec{w},\beta) =\arg \max \prod_{n=1}^N \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) $$ and by derivation we will show $\vec{w}_{ML}$ is equivalent to the least squares solution $\hat{\vec{w}} = \Phi^\dagger \vec{t}$.
  • Intuition about Maximum Likelihood Estimation

    • Finding the maximum likelihood estimate $\vec{w}_{ML} = \arg \max p(\vec{t}|\mathcal{X},\vec{w},\beta)$ means finding the parameter $\vec{w}$ under which, for the data $\mathcal{X}=(\vec{x}_1, \dots, \vec{x}_N)$, the observed $\vec{t}=(t_1,\dots,t_N)$ is the most likely outcome among all possible $\vec{t}$.

Maximum Likelihood Estimator $\vec{w}_{ML}$: Derivation

  • Single data likelihood is $$ p(t_n|\vec{x}_n,\vec{w},\beta) = \mathcal{N}(t_n|\vec{w}^T\phi(\vec{x}_n),\beta^{-1}) = \frac{1}{\sqrt{2 \pi \beta^{-1}}} \exp \left \{ - \frac{1}{2 \beta^{-1}} (t_n - \vec{w}^T \phi(\vec{x}_n))^2 \right \} $$

  • Single data log-likelihood is $$ \ln p(t_n|\vec{x}_n,\vec{w},\beta) = - \frac12 \ln \left( 2 \pi \beta^{-1} \right) - \frac{\beta}{2} (\vec{w}^T \phi(\vec{x}_n) - t_n)^2 $$ We use the logarithm because it is monotonically increasing, so the maximizer of $f(x)$ is the same as the maximizer of $\log f(x)$. The logarithm converts the product into a sum, which makes life easier.

  • Complete data log-likelihood is $$ \begin{align} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) &= \ln \left[ \prod_{n=1}^N p(t_n|\vec{x}_n,\vec{w},\beta) \right] = \sum_{n=1}^N \ln p(t_n|\vec{x}_n,\vec{w},\beta) \\ &= \sum_{n=1}^N \left[ - \frac12 \ln \left( 2 \pi \beta^{-1} \right) - \frac{\beta}{2} (\vec{w}^T \phi(\vec{x}_n) - t_n)^2 \right] \end{align} $$

  • Maximum likelihood estimate $\vec{w}_{ML}$ is $$ \begin{align} \vec{w}_{ML} &= \underset{\vec{w}}{\arg \max} \ln p(\vec{t}|\mathcal{X},\vec{w},\beta) \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac12 \ln \left( 2 \pi \beta^{-1} \right) - \frac{\beta}{2} (\vec{w}^T \phi(\vec{x}_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \max} \sum_{n=1}^N \left[ - \frac{\beta}{2} (\vec{w}^T \phi(\vec{x}_n) - t_n)^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \sum_{n=1}^N \left[(\vec{w}^T \phi(\vec{x}_n) - t_n)^2 \right] \end{align} $$

  • Familiar? Recall the objective function we minimized in least squares is $E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2$, so we could conclude that $$ \boxed{\vec{w}_{ML} = \hat{\vec{w}}_{LS} = \Phi^\dagger \vec{t}} $$
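
The equivalence can be checked numerically. In the sketch below, the data, the value of `beta`, and the function name `neg_log_likelihood` are assumptions for illustration. It evaluates the negative log-likelihood at the least squares solution and at randomly perturbed weight vectors.

```python
import numpy as np

def neg_log_likelihood(w, Phi, t, beta):
    """-ln p(t | X, w, beta) for the Gaussian noise model with precision beta."""
    resid = Phi @ w - t
    return 0.5 * t.size * np.log(2 * np.pi / beta) + 0.5 * beta * np.sum(resid ** 2)

rng = np.random.default_rng(0)  # assumed seed
beta = 25.0                     # assumed noise precision (noise std 0.2)
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta ** -0.5, size=x.size)
Phi = np.vander(x, 4, increasing=True)  # cubic polynomial features

# Least squares solution via the pseudo-inverse
w_ls = np.linalg.pinv(Phi) @ t
nll_at_ls = neg_log_likelihood(w_ls, Phi, t, beta)

# Every perturbation of w_ls should have a higher negative log-likelihood
worse_count = 0
for _ in range(100):
    w_other = w_ls + rng.normal(0.0, 0.1, size=w_ls.size)
    if neg_log_likelihood(w_other, Phi, t, beta) > nll_at_ls:
        worse_count += 1

print("NLL at least squares solution:", nll_at_ls)
print("perturbations with higher NLL:", worse_count, "of 100")
```

The value of $\beta$ changes the likelihood values but not the location of the maximum, which is why it drops out of the $\arg \max$ over $\vec{w}$.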

MAP Estimator $\vec{w}_{MAP}$

  • The MAP estimator is obtained by $$ \begin{align} \vec{w}_{MAP} &= \arg \max p(\vec{w}|\vec{t}, \mathcal{X},\beta) & & (\text{Posterior Probability})\\ &= \arg \max \frac{p(\vec{w}, \vec{t}, \mathcal{X},\beta)}{p(\mathcal{X}, \vec{t}, \beta)} \\ &= \arg \max \frac{p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}, \mathcal{X}, \beta)}{p(\mathcal{X}, \vec{t}, \beta)} \\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}, \mathcal{X}, \beta) & & (p(\mathcal{X}, \vec{t}, \beta) \text{ does not depend on} \ \vec{w})\\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) p(\mathcal{X}) p(\beta) & & (\text{Independence}) \\ &= \arg \max p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) & & (\text{Likelihood} \times \text{Prior}) \end{align} $$ These steps are just an application of Bayes' Theorem.
  • The only difference from ML estimator is we have an extra term of PDF of $\vec{w}$. This is the prior belief of $\vec{w}$. Here, we assume, $$ \vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I) $$
  • ML vs. MAP
    • Maximum Likelihood: We know nothing about $\vec{w}$ initially, and every $\vec{w}$ is equally likely.
    • Maximum a Posteriori: We know something about $\vec{w}$ initially, and certain $\vec{w}$ are more likely than others (depending on the prior $p(\vec{w})$). In other words, the candidate $\vec{w}$ are weighted.
  • Assumption $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ makes sense because
    • In regularized least squares
      • We already know that large coefficients $\vec{w}$, which may lead to overfitting, should be avoided.
      • The more we increase the regularization coefficient $\lambda$, the smaller $\left \| \vec{w} \right \|$ will be.
    • When we use $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$
      • $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ encodes the assumption that a $\vec{w}$ with a smaller norm $\left \| \vec{w} \right \|$ is more "likely" than a $\vec{w}$ with a bigger norm.
      • When we increase $\alpha$, the variance becomes smaller, so a small $\left \| \vec{w} \right \|$ becomes much more likely.

MAP Estimator $\vec{w}_{MAP}$: Derivation

  • $\vec{w} \sim \mathcal{N}(\vec{0}, \alpha^{-1}I)$ is a multivariate Gaussian. Writing $M$ for the number of components of $\vec{w}$ (not the number of samples $N$), its PDF is $$ p(\vec{w}) = \frac{1}{\left( \sqrt{2 \pi \alpha^{-1}} \right)^M} \exp \left \{ -\frac{1}{2 \alpha^{-1}} \sum_{j=1}^M w_j^2 \right \} $$

  • So the MAP estimator is $$ \begin{align} \vec{w}_{MAP} &= \underset{\vec{w}}{\arg \max} \ p(\vec{t}|\vec{w}, \mathcal{X},\beta) p(\vec{w}) = \underset{\vec{w}}{\arg \max} \left[\ln p(\vec{t}|\vec{w}, \mathcal{X},\beta) + \ln p(\vec{w}) \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac{\beta}{2} (\vec{w}^T \phi(\vec{x}_n) - t_n)^2 + \frac{\alpha}{2} \sum_{j=1}^M w_j^2 \right] \\ &= \underset{\vec{w}}{\arg \min} \left[ \sum_{n=1}^N \frac12 (\vec{w}^T \phi(\vec{x}_n) - t_n)^2 + \frac12 \frac{\alpha}{\beta} \left \| \vec{w} \right \|^2 \right] \end{align} $$

  • Exactly the objective in regularized least squares! So $$ \boxed{ \vec{w}_{MAP} = \hat{\vec{w}}=\left(\Phi^T \Phi + \frac{\alpha}{\beta} I\right)^{-1} \Phi^T \vec{t} } $$
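
A numerical check of this result is sketched below. The data, the values of `alpha` and `beta`, and the function name `neg_log_posterior` are assumptions for illustration. The sketch computes the regularized least squares solution with $\lambda = \alpha / \beta$ and verifies that the gradient of the negative log-posterior, $\beta \Phi^T (\Phi \vec{w} - \vec{t}) + \alpha \vec{w}$, vanishes there.

```python
import numpy as np

def neg_log_posterior(w, Phi, t, alpha, beta):
    """-ln[p(t | w, X, beta) p(w)], dropping constants that do not depend on w."""
    return 0.5 * beta * np.sum((Phi @ w - t) ** 2) + 0.5 * alpha * np.sum(w ** 2)

rng = np.random.default_rng(0)  # assumed seed
alpha, beta = 2.0, 25.0         # assumed prior precision and noise precision
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta ** -0.5, size=x.size)
Phi = np.vander(x, 4, increasing=True)  # cubic polynomial features

# Regularized least squares with lambda = alpha / beta
lam = alpha / beta
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

# Gradient of the negative log-posterior at w_map
grad = beta * Phi.T @ (Phi @ w_map - t) + alpha * w_map

# Random perturbations should never give a lower objective value
base = neg_log_posterior(w_map, Phi, t, alpha, beta)
worse_count = sum(
    neg_log_posterior(w_map + rng.normal(0.0, 0.1, size=w_map.size),
                      Phi, t, alpha, beta) > base
    for _ in range(100)
)
print("max |gradient| =", np.max(np.abs(grad)))
print("perturbations with higher objective:", worse_count, "of 100")
```

Only the ratio $\alpha / \beta$ matters for the location of the optimum: a stronger prior (larger $\alpha$) or noisier data (smaller $\beta$) both increase the effective regularization.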

Remark

  • In the above expression, $\frac{\alpha}{\beta}$ corresponds to the regularization coefficient $\lambda$ we used previously in regularized least squares.
  • Priors: Represent prior beliefs about acceptable values for model parameters.
  • Example: In linear regression, $\ell^2$ regularization can be interpreted as placing a Gaussian Prior on the regression coefficients.
  • All statistical models and machine learning algorithms make assumptions.
    • All reasoning is based on implicit assumptions.
    • A Bayesian will tell you that his prior is a way of explicitly stating those assumptions.
  • This can all get very philosophical, but...
    • Bayesian reasoning is best seen as a useful tool.
    • Many concepts in machine learning have Bayesian interpretations.
      • Choice of loss / error function, regularization, etc.
  • For a fully Bayesian take on machine learning, check out the Murphy textbook [MLAPP].
  • We will cover more about Bayesian reasoning when we move to Bayesian Linear Regression, a variant of linear regression that can be used for streaming data.