EECS 545: Machine Learning

Lecture 04: Linear Regression I

  • Instructor: Jacob Abernethy
  • Date: January 20, 2015

Lecture Exposition Credit: Benjamin Bray

Outline for this Lecture

  • Introduction to Regression
  • Solving Least Squares
    • Gradient Descent Method
    • Closed Form Solution

Reading List

  • Required:
    • [PRML], §1.1: Polynomial Curve Fitting Example
    • [PRML], §3.1: Linear Basis Function Models
  • Optional:
    • [MLAPP], Chapter 7: Linear Regression

Supervised Learning

  • Goal
    • Given data $X$ in feature space and the labels $Y$
    • Learn to predict $Y$ from $X$
  • Labels could be discrete or continuous
    • Discrete-valued labels: Classification
    • Continuous-valued labels: Regression

Notation

  • In this lecture, we will use
    • Let vector $\vec{x}_n \in \R^D$ denote the $n\text{th}$ data point. $D$ denotes the number of attributes in the dataset.
    • Let vector $\phi(\vec{x}_n) \in \R^M$ denote the features for data point $\vec{x}_n$. $\phi_j(\vec{x}_n)$ denotes the $j\text{th}$ feature of $\vec{x}_n$.
    • The feature vector $\phi(\vec{x}_n)$ is constructed in a preprocessing step and is usually some combination of transformations of $\vec{x}_n$. For example, $\phi(\vec{x}_n)$ could be the vector $[\vec{x}_n^T, \cos(\vec{x}_n)^T, \exp(\vec{x}_n)^T]^T$ (with $\cos$ and $\exp$ applied elementwise). If we do nothing to $\vec{x}_n$, then $\phi(\vec{x}_n)=\vec{x}_n$.
    • Continuous-valued label vector $\vec{t} \in \R^N$ (target values). $t_n \in \R$ denotes the target value for the $n\text{th}$ data point.

Notation: Example

  • The table below is a dataset describing the acceleration of an aircraft along a runway. Based on our notation above, we have $D=7$. Ignoring the header row, the target value $t_n$ is the first column of the $n\text{th}$ row and $\vec{x}_n$ consists of the remaining attribute columns of that row.
  • We could manipulate the data to construct our own features; a code sketch follows after this list. For example,
    • If we only choose the first three attributes as features, i.e. $\phi(\vec{x}_n)=\vec{x}_n[1:3]$, then $M=3$
    • If we let $\phi(\vec{x}_n)=[\vec{x}_n^T, \cos(\vec{x}_n)^T, \exp(\vec{x}_n)^T]^T$, then $M=3 \times D=21$
    • We could also let $\phi(\vec{x}_n)=\vec{x}_n$, then $M=D=7$. This choice will occur frequently in later lectures.
      (Example taken from [here](http://www.flightdatacommunity.com/linear-regression-applied-to-take-off/))
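A minimal sketch of these feature constructions in NumPy, assuming the attributes are already loaded into an array `X` of shape $(N, 7)$; the array name and the toy values are ours, not from the dataset above:

```python
import numpy as np

# Toy stand-in for the runway dataset: N = 5 samples, D = 7 attributes.
X = np.random.randn(5, 7)

# Feature choice 1: keep only the first three attributes, so M = 3.
Phi_first3 = X[:, :3]

# Feature choice 2: concatenate x, cos(x), exp(x) elementwise, so M = 3 * D = 21.
Phi_mixed = np.hstack([X, np.cos(X), np.exp(X)])

# Feature choice 3: identity features phi(x) = x, so M = D = 7.
Phi_identity = X

print(Phi_first3.shape, Phi_mixed.shape, Phi_identity.shape)  # (5, 3) (5, 21) (5, 7)
```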

Linear Regression

Linear Regression (1D Inputs)

  • Consider the 1D case (i.e. $D=1$)
    • Given a set of observations $x_1, \dots, x_N \in \R$
    • and corresponding target values $t_1, \dots, t_N$
  • We want to learn a function $y(x_n, \vec{w}) \approx t_n$ to predict future values: $$ y(x_n, \vec{w}) = w_0 + w_1 x_n + w_2 x_n^2 + \dots + w_{M-1} x_n^{M-1} = \sum_{k=0}^{M-1} w_k x_n^k = \vec{w}^T\phi(x_n) $$ where the coefficient vector is $\vec{w}=[w_0, w_1, w_2, \dots ,w_{M-1}]^T$ and the feature vector is $\phi(x_n)=[1, x_n, x_n^2, \dots, x_n^{M-1}]^T$ (here we add a bias term $\phi_0(x_n)=1$ to the features). A code sketch of this polynomial feature map follows below.
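A minimal sketch of the polynomial feature map and the prediction $y(x_n, \vec{w}) = \vec{w}^T\phi(x_n)$, assuming NumPy; the inputs and weights are made up for illustration:

```python
import numpy as np

def poly_features(x, M):
    """Map scalar inputs x to polynomial features [1, x, x^2, ..., x^(M-1)]."""
    x = np.atleast_1d(x)
    return np.stack([x ** k for k in range(M)], axis=-1)  # shape (N, M)

M = 4                                 # number of features (degree 3 plus bias)
x = np.array([0.0, 0.5, 1.0, 1.5])    # toy 1D inputs
w = np.array([0.1, -0.8, 1.6, -0.1])  # toy weights [w_0, ..., w_{M-1}]

Phi = poly_features(x, M)             # one row phi(x_n)^T per sample
y = Phi @ w                           # predictions y(x_n, w) = w^T phi(x_n)
print(y)
```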

Regression: Noisy Data

In [2]:
regression_example_draw(degree1=0,degree2=1,degree3=3, ifprint=True)
The expression for the first polynomial is y=-0.129
The expression for the second polynomial is y=-0.231+0.598x^1
The expression for the third polynomial is y=0.085-0.781x^1+1.584x^2-0.097x^3

Basis Functions

  • In the example above, we used polynomial basis functions.
  • In fact, we have many choices for the basis functions $\phi_j(\vec{x})$.
  • Different basis functions produce different features, and thus may give different prediction performance. A sketch of a few common choices follows the plot below.
In [3]:
basis_function_plot()
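A minimal sketch of a few common basis functions (polynomial, Gaussian, and sigmoidal) in the spirit of [PRML] §3.1; the particular centers and scales are arbitrary choices for illustration:

```python
import numpy as np

def polynomial_basis(x, j):
    # phi_j(x) = x^j
    return x ** j

def gaussian_basis(x, mu, s=0.2):
    # phi_j(x) = exp(-(x - mu)^2 / (2 s^2)), a local "bump" centered at mu
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2))

def sigmoid_basis(x, mu, s=0.2):
    # phi_j(x) = 1 / (1 + exp(-(x - mu) / s)), a smooth step centered at mu
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

x = np.linspace(-1, 1, 5)
print(polynomial_basis(x, 2))
print(gaussian_basis(x, mu=0.0))
print(sigmoid_basis(x, mu=0.0))
```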

Linear Regression (General Case)

  • The function $y(\vec{x}_n, \vec{w})$ is linear in parameters $\vec{w}$.
    • Goal: Find the best value for the weights $\vec{w}$.
    • For simplicity, add a bias term $\phi_0(\vec{x}_n) = 1$. $$ \begin{align} y(\vec{x}_n, \vec{w}) &= w_0 \phi_0(\vec{x}_n)+w_1 \phi_1(\vec{x}_n)+ w_2 \phi_2(\vec{x}_n)+\dots +w_{M-1} \phi_{M-1}(\vec{x}_n) \\ &= \sum_{j=0}^{M-1} w_j \phi_j(\vec{x}_n) \\ &= \vec{w}^T \phi(\vec{x}_n) \end{align} $$ where $\phi(\vec{x}_n) = [\phi_0(\vec{x}_n),\phi_1(\vec{x}_n),\phi_2(\vec{x}_n), \dots, \phi_{M-1}(\vec{x}_n)]^T$

Least Squares

Least Squares: Objective Function

  • We will find the solution $\vec{w}$ to linear regression by minimizing a cost/objective function.
  • When the objective function is the sum of squared errors (the sum of squared differences between the targets $t_n$ and the predictions $y(\vec{x}_n, \vec{w})$ over the entire training set), this approach is also called least squares.
  • The objective function is $$ E(\vec{w}) = \frac12 \sum_{n=1}^N (y(\vec{x}_n, \vec{w}) - t_n)^2 = \frac12 \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j\phi_j(\vec{x}_n) - t_n \right)^2 = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 $$ A short code sketch of this objective follows below.
    (Minimizing the objective function is equivalent to minimizing the sum of squares of the green segments.)
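A minimal sketch of the objective in NumPy, assuming a design-matrix-style array `Phi` of shape $(N, M)$, targets `t`, and weights `w`; all values below are toy numbers chosen for illustration:

```python
import numpy as np

def sse_objective(w, Phi, t):
    """E(w) = 1/2 * sum_n (w^T phi(x_n) - t_n)^2."""
    residuals = Phi @ w - t
    return 0.5 * np.dot(residuals, residuals)

# Toy example: N = 4 samples, M = 3 features.
Phi = np.array([[1.0, 0.0, 0.0],
                [1.0, 0.5, 0.25],
                [1.0, 1.0, 1.0],
                [1.0, 1.5, 2.25]])
t = np.array([0.1, 0.4, 1.1, 2.2])
w = np.array([0.0, 0.5, 0.5])

print(sse_objective(w, Phi, t))
```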

How to Minimize the Objective Function?

  • We will solve the least squares problem in two ways:
    • Gradient Descent Method: approach the solution step by step. We will show two variants of the iterative update:
      • Batch Gradient Descent
      • Stochastic Gradient Descent
    • Closed Form Solution

Method I: Gradient Descent—Gradient Calculation

  • To minimize the objective function, take the derivative w.r.t. the coefficient vector $\vec{w}$: $$ \begin{align} \nabla_\vec{w} E(\vec{w}) &= \frac{\partial}{\partial \vec{w}} \left[ \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 \right] \\ &= \sum_{n=1}^N \frac{\partial}{\partial \vec{w}} \left[ \frac12 \left( \vec{w}^T\phi(\vec{x}_n) - t_n \right)^2 \right] \\ \text{Applying the chain rule:} \quad &= \sum_{n=1}^N \left[ \frac12 \cdot 2 \cdot \left( \vec{w}^T\phi(\vec{x}_n) - t_n \right) \cdot \frac{\partial}{\partial \vec{w}} \vec{w}^T\phi(\vec{x}_n) \right] \\ &= \sum_{n=1}^N \left( \vec{w}^T\phi(\vec{x}_n) - t_n \right)\phi(\vec{x}_n) \end{align} $$
  • Since we are taking derivative of a scalar $E(\vec{w})$ w.r.t a vector $\vec{w}$, the derivative $\nabla_\vec{w} E(\vec{w})$ will be a vector.
  • For details about matrix/vector derivative, please refer to appendix attached in the end of the slide.

Method I-1: Gradient Descent—Batch Gradient Descent

  • Input: Given dataset $\{(\vec{x}_n, t_n)\}_{n=1}^N$
  • Initialize: $\vec{w}_0$, learning rate $\eta$
  • Repeat until convergence:
    • $\nabla_\vec{w} E(\vec{w}_\text{old}) = \sum_{n=1}^N \left( \vec{w}_\text{old}^T\phi(\vec{x}_n) - t_n \right)\phi(\vec{x}_n)$
    • $\vec{w}_\text{new} = \vec{w}_\text{old}-\eta \nabla_\vec{w} E(\vec{w}_\text{old})$
  • End
  • Output: $\vec{w}_\text{final}$
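A minimal sketch of batch gradient descent for this objective, assuming NumPy; the stopping rule (a fixed iteration budget here), learning rate, and toy data are illustrative choices rather than prescriptions from the slides:

```python
import numpy as np

def batch_gradient_descent(Phi, t, eta=0.01, n_iters=1000):
    """Minimize E(w) = 1/2 * ||Phi w - t||^2 with full-batch gradient steps."""
    N, M = Phi.shape
    w = np.zeros(M)                      # initialize w_0
    for _ in range(n_iters):             # "repeat until convergence" (fixed budget here)
        grad = Phi.T @ (Phi @ w - t)     # sum_n (w^T phi(x_n) - t_n) phi(x_n)
        w = w - eta * grad               # w_new = w_old - eta * gradient
    return w

# Toy data: fit y = 1 + 2x with features phi(x) = [1, x].
x = np.linspace(0, 1, 50)
Phi = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + 0.05 * np.random.randn(50)

print(batch_gradient_descent(Phi, t))    # roughly [1, 2]
```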

Method I-2: Gradient Descent—Stochastic Gradient Descent

Main Idea: Instead of computing the batch gradient (over the entire training set), compute the gradient for an individual training sample and update immediately.

  • Input: Given dataset $\{(\vec{x}_n, t_n)\}_{n=1}^N$
  • Initialize: $\vec{w}_0$, learning rate $\eta$
  • Repeat until convergence:
    • Random shuffle $\{(\vec{x}_n, t_n)\}_{n=1}^N$
    • For $n=1,\dots,N$ do:
      • $\nabla_{\vec{w}}E(\vec{w}_\text{old} | \vec{x}_n) = \left( \vec{w}_\text{old}^T\phi(\vec{x}_n) - t_n \right)\phi(\vec{x}_n)$
      • $\vec{w}_\text{new} = \vec{w}_\text{old}-\eta \nabla_{\vec{w}}E(\vec{w}_\text{old} | \vec{x}_n)$
    • End
  • End
  • Output: $\vec{w}_\text{final}$
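A minimal sketch of stochastic gradient descent in the same setting, again assuming NumPy; the number of epochs, learning rate, and toy data are illustrative assumptions:

```python
import numpy as np

def stochastic_gradient_descent(Phi, t, eta=0.01, n_epochs=100, seed=0):
    """Minimize E(w) with per-sample gradient updates, shuffling each epoch."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)                               # initialize w_0
    for _ in range(n_epochs):                     # "repeat until convergence" (fixed budget here)
        for n in rng.permutation(N):              # random shuffle of the training samples
            grad_n = (Phi[n] @ w - t[n]) * Phi[n] # gradient from sample n only
            w = w - eta * grad_n
    return w

# Same toy data as before: fit y = 1 + 2x with phi(x) = [1, x].
x = np.linspace(0, 1, 50)
Phi = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + 0.05 * np.random.randn(50)

print(stochastic_gradient_descent(Phi, t))        # roughly [1, 2]
```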

Method II: Closed Form Solution

Main Idea: Compute the gradient, set it to zero, and solve for $\vec{w}$ in closed form.

  • Objective Function $E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 = \frac12 \sum_{n=1}^N \left( \phi(\vec{x}_n)^T \vec{w} - t_n \right)^2$
  • Let $\vec{e} = [\phi(\vec{x}_1)^T \vec{w} - t_1,\quad \phi(\vec{x}_2)^T \vec{w} - t_2, \quad \dots ,\quad \phi(\vec{x}_N)^T \vec{w} - t_N]^T$. Then $$E(\vec{w}) = \frac12 \vec{e}^T \vec{e} = \frac12 \left\| \vec{e} \right\|^2$$
  • Look at $\vec{e}$: $$ \vec{e} = \begin{bmatrix} \phi(\vec{x}_1)^T\\ \vdots\\ \phi(\vec{x}_N)^T \end{bmatrix} \vec{w}- \begin{bmatrix} t_1\\ \vdots\\ t_N \end{bmatrix} \triangleq \Phi \vec{w}-\vec{t} $$ Here $\Phi \in \R^{N \times M}$ is called the design matrix. Each row represents one sample, and each column represents one feature: $$\Phi = \begin{bmatrix} \phi(\vec{x}_1)^T\\ \phi(\vec{x}_2)^T\\ \vdots\\ \phi(\vec{x}_N)^T \end{bmatrix} = \begin{bmatrix} \phi_0(\vec{x}_1) & \phi_1(\vec{x}_1) & \cdots & \phi_{M-1}(\vec{x}_1) \\ \phi_0(\vec{x}_2) & \phi_1(\vec{x}_2) & \cdots & \phi_{M-1}(\vec{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\vec{x}_N) & \phi_1(\vec{x}_N) & \cdots & \phi_{M-1}(\vec{x}_N) \\ \end{bmatrix} $$
  • From $E(\vec{w}) = \frac12 \left \| \vec{e} \right \|^2 = \frac12 \vec{e}^T \vec{e}$ and $\vec{e}=\Phi \vec{w}-\vec{t}$, we have $$ \begin{align} E(\vec{w}) & = \frac12 \left \| \Phi \vec{w}-\vec{t} \right \|^2=\frac12 (\Phi \vec{w}-\vec{t})^T (\Phi \vec{w}-\vec{t}) \\ & = \frac12 \vec{w}^T \Phi^T \Phi \vec{w} - \vec{t}^T \Phi \vec{w} + \frac12 \vec{t}^T \vec{t} \end{align} $$
  • So the derivative is $$\nabla_\vec{w} E(\vec{w}) = \Phi^T\Phi\vec{w} - \Phi^T \vec{t}$$
  • To minimize $E(\vec{w})$, we need to let $\nabla_\vec{w} E(\vec{w}) = \Phi^T\Phi \vec{w} - \Phi^T \vec{t} = 0$, which is also $\Phi^T\Phi \vec{w} = \Phi^T \vec{t}$
  • When $\Phi^T \Phi$ is invertible ($\Phi$ has linearly independent columns), we simply have $$ \begin{align} \hat{\vec{w}} &=(\Phi^T \Phi)^{-1} \Phi^T \vec{t} \\ &\triangleq \Phi^\dagger \vec{t} \end{align} $$ where $\Phi^\dagger$ is called the Moore-Penrose pseudoinverse of $\Phi$. A code sketch of this closed-form solution follows below.
  • We will discuss the case where $\Phi^T \Phi$ is non-invertible later.
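A minimal sketch of the closed-form solution, assuming NumPy; `np.linalg.solve`, `np.linalg.pinv`, and `np.linalg.lstsq` are standard NumPy routines, and the toy data are ours:

```python
import numpy as np

# Same toy setup: fit y = 1 + 2x with phi(x) = [1, x].
x = np.linspace(0, 1, 50)
Phi = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + 0.05 * np.random.randn(50)

# Normal equations: w = (Phi^T Phi)^{-1} Phi^T t (valid when Phi^T Phi is invertible).
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Equivalent (and numerically preferable) routes via the pseudoinverse / least squares.
w_pinv = np.linalg.pinv(Phi) @ t
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

print(w_normal, w_pinv, w_lstsq)   # all roughly [1, 2]
```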

Digression: Moore-Penrose Pseudoinverse

  • When we have a matrix $A$ that is non-invertible, or not even square, we may still want something like an inverse
  • For these situations we use $A^\dagger$, the Moore-Penrose pseudoinverse of $A$
  • In general, we can get $A^\dagger$ from the SVD: if we write $A \in \R^{m \times n}$ as $A = U_{m \times m} \Sigma_{m \times n} V_{n \times n}^T$, then $A^\dagger \in \R^{n \times m}$ is given by $A^\dagger = V \Sigma^\dagger U^T$, where $\Sigma^\dagger \in \R^{n \times m}$ is obtained by taking reciprocals of the non-zero entries of $\Sigma^T$.
  • In particular, when $A$ has linearly independent columns, $A^\dagger = (A^T A)^{-1} A^T$. When $A$ is invertible, $A^\dagger = A^{-1}$. A code sketch via the SVD follows below.
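A minimal sketch of computing the pseudoinverse via the SVD and checking it against NumPy's built-in `np.linalg.pinv`; the example matrix is an arbitrary rank-deficient choice:

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    """Moore-Penrose pseudoinverse: A^+ = V Sigma^+ U^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > tol, 1.0 / s, 0.0)    # reciprocals of the non-zero singular values
    return Vt.T @ np.diag(s_inv) @ U.T

# A rank-deficient 4x3 matrix (third column = first column + second column).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0]])

print(np.allclose(pinv_via_svd(A), np.linalg.pinv(A)))   # True
```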

Back to Closed Form Solution

  • From the previous derivation, we have $\hat{\vec{w}}=(\Phi^T \Phi)^{-1} \Phi^T \vec{t} \triangleq \Phi^\dagger \vec{t}$.
  • What if $\Phi^T \Phi$ is non-invertible? This corresponds to the case where $\Phi$ does not have linearly independent columns, i.e. some feature column of $\Phi$ is a linear combination of the other feature columns (a redundant feature).
  • We can still resolve this using the pseudoinverse.
  • To make $\nabla_\vec{w} E(\vec{w}) = \Phi^T\Phi \vec{w} - \Phi^T \vec{t} = 0$, we take $$ \hat{\vec{w}} = (\Phi^T\Phi)^\dagger \Phi^T \vec{t} = \Phi^\dagger \vec{t}$$ where $(\Phi^T\Phi)^\dagger\Phi^T = \Phi^\dagger$. This identity is left as an exercise. (Hint: use the SVD.)
  • We can now conclude that the optimal $\vec{w}$, in the sense of minimizing the sum of squared errors, is $$\boxed{\hat{\vec{w}} = \Phi^\dagger \vec{t}}$$ A code sketch with a rank-deficient design matrix follows below.
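A minimal sketch of the rank-deficient case, assuming NumPy; the design matrix below has a duplicated feature column, so $\Phi^T\Phi$ is singular, yet $\hat{\vec{w}} = \Phi^\dagger \vec{t}$ still minimizes the squared error:

```python
import numpy as np

# Design matrix with linearly dependent columns: the third column equals the second.
x = np.linspace(0, 1, 20)
Phi = np.column_stack([np.ones_like(x), x, x])    # Phi^T Phi is singular here
t = 1.0 + 2.0 * x

w_hat = np.linalg.pinv(Phi) @ t                   # w_hat = Phi^dagger t
print(w_hat)                                      # one minimizer among infinitely many
print(np.allclose(Phi @ w_hat, t))                # predictions still fit the targets
```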