# EECS 545: Machine Learning¶

## Lecture 04: Linear Regression I¶

• Instructor: Jacob Abernethy
• Date: January 20, 2015

Lecture Exposition Credit: Benjamin Bray

## Outline for this Lecture¶

• Introduction to Regression
• Solving Least Squares
• Closed Form Solution

• Required:
• [PRML], §1.1: Polynomial Curve Fitting Example
• [PRML], §3.1: Linear Basis Function Models
• Optional:
• [MLAPP], Chapter 7: Linear Regression

## Supervised Learning¶

• Goal
• Given data $X$ in feature space and the labels $Y$
• Learn to predict $Y$ from $X$
• Labels could be discrete or continuous
• Discrete-valued labels: Classification
• Continuous-valued labels: Regression

## Notation¶

• In this lecture, we will use
• Let vector $\vec{x}_n \in \R^D$ denote the $n\text{th}$ data. $D$ denotes number of attributes in dataset.
• Let vector $\phi(\vec{x}_n) \in \R^M$ denote features for data $\vec{x}_n$. $\phi_j(\vec{x}_n)$ denotes the $j\text{th}$ feature for data $x_n$.
• Feature $\phi(\vec{x}_n)$ is the artificial features which represents the preprocessing step. $\phi(\vec{x}_n)$ is usually some combination of transformations of $\vec{x}_n$. For example, $\phi(\vec{x})$ could be vector constructed by $[\vec{x}_n^T, \cos(\vec{x}_n)^T, \exp(\vec{x}_n)^T]^T$. If we do nothing to $\vec{x}_n$, then $\phi(\vec{x}_n)=\vec{x}_n$.
• Continuous-valued label vector $t \in \R^D$ (target values). $t_n \in \R$ denotes the target value for $i\text{th}$ data.

### Notation: Example¶

• The table below is a dataset describing the acceleration of an aircraft along a runway. Based on our notation above, we have $D=7$. Ignoring the header row, the target value $t_n$ is the first column, and $\vec{x}_n$ denotes the data on the $n\text{th}$ row, in the remaining columns.
• We could manipulate the data to have our own features. For example,
• If we only choose the first three attributes as features, i.e. $\phi(\vec{x}_n)=\vec{x}_n[1:3]$, then $M=3$
• If we let $\phi(\vec{x}_n)=[\vec{x}_n^T, \cos(\vec{x}_n)^T, \exp(\vec{x}_n)^T]^T$, then $M=3 \times D=21$
• We could also let $\phi(\vec{x}_n)=\vec{x}_n$, then $M=D=7$. This will occur frequently in later lectures.
(Example taken from [here](http://www.flightdatacommunity.com/linear-regression-applied-to-take-off/))
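The stacked feature map above can be sketched in NumPy. This is a hypothetical example (the helper `phi` and the zero-valued sample are illustrative, not taken from the aircraft dataset):

```python
import numpy as np

def phi(x):
    """Stacked feature map [x^T, cos(x)^T, exp(x)^T]^T for one sample x in R^D."""
    return np.concatenate([x, np.cos(x), np.exp(x)])

x_n = np.zeros(7)       # a hypothetical sample with D = 7 attributes
features = phi(x_n)
print(features.shape)   # (21,), since M = 3 * D = 21
```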

## Linear Regression¶

### Linear Regression (1D Inputs)¶

• Consider the 1D case (i.e. $D=1$)
• Given a set of observations $x_1, \dots, x_N \in \R$
• and corresponding target values $t_1, \dots, t_N$
• We want to learn a function $y(x_n, \vec{w}) \approx t_n$ to predict future values. $$y(x_n, \vec{w}) = w_0 + w_1 x_n + w_2 x_n^2 + \dots + w_{M-1} x_n^{M-1} = \sum_{k=0}^{M-1} w_k x_n^k = \vec{w}^T\phi(x_n)$$ where the weight vector $\vec{w}=[w_0, w_1, w_2, \dots ,w_{M-1}]^T$ and the feature vector $\phi(x_n)=[1, x_n, x_n^2, \dots, x_n^{M-1}]^T$ (here we add a bias term $\phi_0(x_n)=1$ to the features).
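A minimal sketch of fitting such a polynomial with NumPy, assuming toy data generated for illustration. `np.vander` builds the features $[1, x_n, \dots, x_n^{M-1}]$ and `np.linalg.lstsq` solves the resulting least squares problem:

```python
import numpy as np

# Toy 1D data, assumed for illustration: targets near t = 1 + 2x plus small noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = 1.0 + 2.0 * x + 0.01 * rng.standard_normal(20)

M = 4                                    # number of features (polynomial degree M - 1)
Phi = np.vander(x, M, increasing=True)   # rows are phi(x_n) = [1, x_n, x_n^2, x_n^3]
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

y = Phi @ w                              # predictions y(x_n, w) = w^T phi(x_n)
```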

### Regression: Noisy Data¶

In [2]:
regression_example_draw(degree1=0,degree2=1,degree3=3, ifprint=True)

The expression for the first polynomial is y=-0.129
The expression for the second polynomial is y=-0.231+0.598x^1
The expression for the third polynomial is y=0.085-0.781x^1+1.584x^2-0.097x^3


### Basis Functions¶

• In the example above, we used polynomial basis functions to construct the features.
• In fact, we have many choices for the basis functions $\phi_j(\vec{x})$.
• Different basis functions produce different features, and thus may give different prediction performance.
In [3]:
basis_function_plot()


### Linear Regression (General Case)¶

• The function $y(\vec{x}_n, \vec{w})$ is linear in parameters $\vec{w}$.
• Goal: Find the best value for the weights $\vec{w}$.
• For simplicity, add a bias term $\phi_0(\vec{x}_n) = 1$. \begin{align} y(\vec{x}_n, \vec{w}) &= w_0 \phi_0(\vec{x}_n)+w_1 \phi_1(\vec{x}_n)+ w_2 \phi_2(\vec{x}_n)+\dots +w_{M-1} \phi_{M-1}(\vec{x}_n) \\ &= \sum_{j=0}^{M-1} w_j \phi_j(\vec{x}_n) \\ &= \vec{w}^T \phi(\vec{x}_n) \end{align} of which $\phi(\vec{x}_n) = [\phi_0(\vec{x}_n),\phi_1(\vec{x}_n),\phi_2(\vec{x}_n), \dots, \phi_{M-1}(\vec{x}_n)]^T$

## Least Squares¶

### Least Squares: Objective Function¶

• We will find the solution $\vec{w}$ to linear regression by minimizing a cost/objective function.
• When the objective function is the sum of squared errors (the sum of squared differences between targets $t_n$ and predictions $y(\vec{x}_n, \vec{w})$ over the entire training set), this approach is also called least squares.
• The objective function is $$E(\vec{w}) = \frac12 \sum_{n=1}^N (y(\vec{x}_n, \vec{w}) - t_n)^2 = \frac12 \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j\phi_j(\vec{x}_n) - t_n \right)^2 = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2$$
(Minimizing the objective function is equivalent to minimizing the sum of squares of the green segments.)
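The objective can be computed directly from a design matrix whose rows are the feature vectors. A small sketch with hand-picked toy numbers (the function name `sse` is illustrative):

```python
import numpy as np

def sse(w, Phi, t):
    """E(w) = 1/2 * sum_n (w^T phi(x_n) - t_n)^2, with rows of Phi holding phi(x_n)^T."""
    e = Phi @ w - t
    return 0.5 * e @ e

# Hand-picked toy numbers: two samples, a bias feature and one input feature
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0]])
t = np.array([0.0, 2.0])
w = np.array([0.0, 1.0])
print(sse(w, Phi, t))   # errors are 0 and -1, so E(w) = 0.5
```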

### How to Minimize the Objective Function?¶

• We will solve the least squares problem with two approaches:
• Gradient Descent Method: approach the solution step by step; we will show two variants of the iteration (batch and stochastic)
• Closed Form Solution

• To minimize the objective function, take the derivative w.r.t. the coefficient vector $\vec{w}$: \begin{align} \nabla_\vec{w} E(\vec{w}) &= \frac{\partial}{\partial \vec{w}} \left[ \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 \right] \\ &= \sum_{n=1}^N \frac{\partial}{\partial \vec{w}} \left[ \frac12 \left( \vec{w}^T\phi(\vec{x}_n) - t_n \right)^2 \right] \\ \text{Applying chain rule:} &= \sum_{n=1}^N \left[ \frac12 \cdot 2 \cdot \left( \vec{w}^T\phi(\vec{x}_n) - t_n \right) \cdot \frac{\partial}{\partial \vec{w}} \vec{w}^T\phi(\vec{x}_n) \right] \\ &= \sum_{n=1}^N \left( \vec{w}^T\phi(\vec{x}_n) - t_n \right)\phi(\vec{x}_n) \end{align}
• Since we are taking the derivative of a scalar $E(\vec{w})$ w.r.t. a vector $\vec{w}$, the derivative $\nabla_\vec{w} E(\vec{w})$ is a vector.
• For details on matrix/vector derivatives, please refer to the appendix at the end of the slides.

#### Method I: Gradient Descent (Batch)¶

• Input: Given dataset $\{(\vec{x}_n, t_n)\}_{n=1}^N$
• Initialize: $\vec{w}_0$, learning rate $\eta$
• Repeat until convergence:
• $\nabla_\vec{w} E(\vec{w}_\text{old}) = \sum_{n=1}^N \left( \vec{w}_\text{old}^T\phi(\vec{x}_n) - t_n \right)\phi(\vec{x}_n)$
• $\vec{w}_\text{new} = \vec{w}_\text{old}-\eta \nabla_\vec{w} E(\vec{w}_\text{old})$
• End
• Output: $\vec{w}_\text{final}$
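The steps above can be sketched as follows, on an assumed toy problem with exact targets (the step size, iteration count, and a fixed number of iterations in place of a convergence test are illustrative choices):

```python
import numpy as np

def batch_gradient_descent(Phi, t, eta=0.1, n_iters=500):
    """Repeat w <- w - eta * grad, with the full gradient sum_n (w^T phi_n - t_n) phi_n."""
    w = np.zeros(Phi.shape[1])          # initialize w_0 at the origin
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ w - t)    # gradient over the entire training set
        w = w - eta * grad              # w_new = w_old - eta * gradient
    return w

# Assumed toy problem: exact targets generated from w = [1, 2]
Phi = np.column_stack([np.ones(10), np.linspace(0, 1, 10)])
t = Phi @ np.array([1.0, 2.0])
w_hat = batch_gradient_descent(Phi, t)  # converges close to [1, 2]
```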

#### Method I: Gradient Descent (Stochastic)¶

Main Idea: Instead of computing the batch gradient (over the entire training set), compute the gradient for an individual training sample and update immediately.

• Input: Given dataset $\{(\vec{x}_n, t_n)\}_{n=1}^N$
• Initialize: $\vec{w}_0$, learning rate $\eta$
• Repeat until convergence:
• Random shuffle $\{(\vec{x}_n, t_n)\}_{n=1}^N$
• For $n=1,\dots,N$ do:
• $\nabla_{\vec{w}}E(\vec{w}_\text{old} | \vec{x}_n) = \left( \vec{w}_\text{old}^T\phi(\vec{x}_n) - t_n \right)\phi(\vec{x}_n)$
• $\vec{w}_\text{new} = \vec{w}_\text{old}-\eta \nabla_{\vec{w}}E(\vec{w}_\text{old} | \vec{x}_n)$
• End
• End
• Output: $\vec{w}_\text{final}$
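The stochastic variant can be sketched similarly, on the same assumed toy problem; the per-sample update replaces the full-gradient step (epoch count and step size are again illustrative):

```python
import numpy as np

def sgd(Phi, t, eta=0.1, n_epochs=200, seed=0):
    """Update w from one randomly chosen sample at a time instead of the full batch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(t)):           # random shuffle each pass
            grad_n = (Phi[n] @ w - t[n]) * Phi[n]   # gradient for sample n only
            w = w - eta * grad_n
    return w

# Assumed toy problem: exact targets generated from w = [1, 2]
Phi = np.column_stack([np.ones(10), np.linspace(0, 1, 10)])
t = Phi @ np.array([1.0, 2.0])
w_hat = sgd(Phi, t)
```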

#### Method II: Closed Form Solution¶

Main Idea: Compute the gradient, set it to zero, and solve in closed form.

• Objective Function $E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 = \frac12 \sum_{n=1}^N \left( \phi(\vec{x}_n)^T \vec{w} - t_n \right)^2$
• Let $\vec{e} = [\phi(\vec{x}_1)^T \vec{w} - t_1,\quad \phi(\vec{x}_2)^T \vec{w} - t_2, \quad \dots ,\quad \phi(\vec{x}_N)^T \vec{w} - t_N]^T$; then $$E(\vec{w}) = \frac12 \vec{e}^T \vec{e} = \frac12 \left\| \vec{e} \right\|^2$$
• Look at $\vec{e}$: $$\vec{e} = \begin{bmatrix} \phi(\vec{x}_1)^T\\ \vdots\\ \phi(\vec{x}_N)^T \end{bmatrix} \vec{w}- \begin{bmatrix} t_1\\ \vdots\\ t_N \end{bmatrix} \triangleq \Phi \vec{w}-\vec{t}$$ Here $\Phi \in \R^{N \times M}$ is called the design matrix. Each row represents one sample; each column represents one feature: $$\Phi = \begin{bmatrix} \phi(\vec{x}_1)^T\\ \phi(\vec{x}_2)^T\\ \vdots\\ \phi(\vec{x}_N)^T \end{bmatrix} = \begin{bmatrix} \phi_0(\vec{x}_1) & \phi_1(\vec{x}_1) & \cdots & \phi_{M-1}(\vec{x}_1) \\ \phi_0(\vec{x}_2) & \phi_1(\vec{x}_2) & \cdots & \phi_{M-1}(\vec{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\vec{x}_N) & \phi_1(\vec{x}_N) & \cdots & \phi_{M-1}(\vec{x}_N) \\ \end{bmatrix}$$
• From $E(\vec{w}) = \frac12 \left \| \vec{e} \right \|^2 = \frac12 \vec{e}^T \vec{e}$ and $\vec{e}=\Phi \vec{w}-\vec{t}$, we have \begin{align} E(\vec{w}) & = \frac12 \left \| \Phi \vec{w}-\vec{t} \right \|^2=\frac12 (\Phi \vec{w}-\vec{t})^T (\Phi \vec{w}-\vec{t}) \\ & = \frac12 \vec{w}^T \Phi^T \Phi \vec{w} - \vec{t}^T \Phi \vec{w} + \frac12 \vec{t}^T \vec{t} \end{align}
• So the derivative is $$\nabla_\vec{w} E(\vec{w}) = \Phi^T\Phi\vec{w} - \Phi^T \vec{t}$$
• To minimize $E(\vec{w})$, we need to let $\nabla_\vec{w} E(\vec{w}) = \Phi^T\Phi \vec{w} - \Phi^T \vec{t} = 0$, which is also $\Phi^T\Phi \vec{w} = \Phi^T \vec{t}$
• When $\Phi^T \Phi$ is invertible ($\Phi$ has linearly independent columns), we simply have \begin{align} \hat{\vec{w}} &=(\Phi^T \Phi)^{-1} \Phi^T \vec{t} \\ &\triangleq \Phi^\dagger \vec{t} \end{align} of which $\Phi^\dagger$ is called the Moore-Penrose Pseudoinverse of $\Phi$.
• We will discuss the case where $\Phi^T \Phi$ is non-invertible later.
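A minimal numerical check of the closed form, on an assumed toy design matrix with linearly independent columns (so $\Phi^T \Phi$ is invertible):

```python
import numpy as np

# Assumed toy design matrix with linearly independent columns,
# so Phi^T Phi is invertible and the closed form applies directly
Phi = np.column_stack([np.ones(10), np.linspace(0, 1, 10)])
t = Phi @ np.array([1.0, 2.0])

# Normal equations: solve (Phi^T Phi) w = Phi^T t
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Same answer via the Moore-Penrose pseudoinverse
w_pinv = np.linalg.pinv(Phi) @ t
```

Solving the linear system directly is generally preferred over forming $(\Phi^T \Phi)^{-1}$ explicitly, for numerical stability.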

#### Digression: Moore-Penrose Pseudoinverse¶

• When we have a matrix $A$ that is non-invertible, or not even square, we may still want something that behaves like an inverse.
• For these situations we use $A^\dagger$, the Moore-Penrose Pseudoinverse of $A$
• In general, we can get $A^\dagger$ by SVD: if we write $A \in \R^{m \times n} = U_{m \times m} \Sigma_{m \times n} V_{n \times n}^T$ then $A^\dagger \in \R^{n \times m} = V \Sigma^\dagger U^T$, where $\Sigma^\dagger \in \R^{n \times m}$ is obtained by taking reciprocals of non-zero entries of $\Sigma^T$.
• In particular, when $A$ has linearly independent columns, $A^\dagger = (A^T A)^{-1} A^T$. When $A$ is invertible, $A^\dagger = A^{-1}$.
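The SVD construction can be sketched as follows. The helper `pinv_via_svd`, its tolerance, and the rank-deficient matrix are illustrative; NumPy's built-in `np.linalg.pinv` computes the same thing:

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    """Moore-Penrose pseudoinverse A^dagger = V Sigma^dagger U^T via the SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > tol, 1.0 / s, 0.0)   # reciprocals of non-zero singular values
    return Vt.T @ (s_inv[:, None] * U.T)

# Rank-deficient example: the second column is twice the first
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
print(np.allclose(pinv_via_svd(A), np.linalg.pinv(A)))   # True
```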

#### Back to Closed Form Solution¶

• From the previous derivation, we have $\hat{\vec{w}}=(\Phi^T \Phi)^{-1} \Phi^T \vec{t} \triangleq \Phi^\dagger \vec{t}$.
• What if $\Phi^T \Phi$ is non-invertible? This corresponds to the case where $\Phi$ does not have linearly independent columns: some feature column of $\Phi$ is a linear combination of the other feature columns (for example, when features are redundant, or when $M > N$).
• We can still resolve this using the pseudoinverse.
• To make $\nabla_\vec{w} E(\vec{w}) = \Phi^T\Phi \vec{w} - \Phi^T \vec{t} = 0$, we have $$\hat{\vec{w}} = (\Phi^T\Phi)^\dagger \Phi^T \vec{t} = \Phi^\dagger \vec{t}$$ of which $(\Phi^T\Phi)^\dagger\Phi^T = \Phi^\dagger$. This is left as an exercise. (Hint: use SVD)
• Now we could conclude the optimal $\vec{w}$ in the sense that minimizes sum of squared errors is $$\boxed{\hat{\vec{w}} = \Phi^\dagger \vec{t}}$$
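A small sketch, assuming a toy design matrix with linearly dependent columns, showing that the boxed pseudoinverse solution still fits consistent targets even though $\Phi^T \Phi$ is singular:

```python
import numpy as np

# Assumed toy design matrix with linearly dependent columns
# (the third column is the sum of the first two), so Phi^T Phi is singular
Phi = np.array([[1.0, 0.0, 1.0],
                [1.0, 1.0, 2.0],
                [1.0, 2.0, 3.0],
                [1.0, 3.0, 4.0]])
t = np.array([0.0, 1.0, 2.0, 3.0])

w_hat = np.linalg.pinv(Phi) @ t         # boxed solution: w = Phi^dagger t
print(np.allclose(Phi @ w_hat, t))      # True: it still fits the consistent targets
```

Among all weight vectors achieving the minimum error, the pseudoinverse returns the one with the smallest norm.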