EECS 545: Machine Learning

Lecture 04: Linear Regression I

  • Instructor: Jacob Abernethy
  • Date: January 20, 2015

Lecture Exposition Credit: Benjamin Bray

Outline for this Lecture

  • Introduction to Regression
  • Solving Least Squares
    • Gradient Descent Method
    • Closed Form Solution

Reading List

  • Required:
    • [PRML], §1.1: Polynomial Curve Fitting Example
    • [PRML], §3.1: Linear Basis Function Models
  • Optional:
    • [MLAPP], Chapter 7: Linear Regression

Supervised Learning

  • Goal
    • Given data $X$ in feature space and the labels $Y$
    • Learn to predict $Y$ from $X$
  • Labels could be discrete or continuous
    • Discrete-valued labels: Classification
    • Continuous-valued labels: Regression

Notation

  • In this lecture, we will use
    • Let vector $\vec{x}_n \in \R^D$ denote the $n\text{th}$ data point. $D$ denotes the number of attributes in the dataset.
    • Let vector $\phi(\vec{x}_n) \in \R^M$ denote the features for data point $\vec{x}_n$. $\phi_j(\vec{x}_n)$ denotes the $j\text{th}$ feature of $\vec{x}_n$.
    • The feature vector $\phi(\vec{x}_n)$ is constructed in a preprocessing step and is usually some combination of transformations of $\vec{x}_n$. For example, $\phi(\vec{x}_n)$ could be the vector $[\vec{x}_n^T, \cos(\vec{x}_n)^T, \exp(\vec{x}_n)^T]^T$ (with $\cos$ and $\exp$ applied elementwise). If we do nothing to $\vec{x}_n$, then $\phi(\vec{x}_n)=\vec{x}_n$.
    • Continuous-valued label vector $\vec{t} \in \R^N$ (target values). $t_n \in \R$ denotes the target value for the $n\text{th}$ data point.

Notation: Example

  • The table below is a dataset describing the acceleration of an aircraft along a runway. Based on our notation above, we have $D=7$. Ignoring the header row, the target value $t_n$ is the first column of the $n\text{th}$ row and $\vec{x}_n$ consists of the remaining attribute columns of that row.
  • We could manipulate the data to construct our own features; a code sketch follows after this list. For example,
    • If we only choose the first three attributes as features, i.e. $\phi(\vec{x}_n)=\vec{x}_n[1:3]$, then $M=3$
    • If we let $\phi(\vec{x}_n)=[\vec{x}_n^T, \cos(\vec{x}_n)^T, \exp(\vec{x}_n)^T]^T$, then $M=3 \times D=21$
    • We could also let $\phi(\vec{x}_n)=\vec{x}_n$, then $M=D=7$. This choice will occur frequently in later lectures.
      (Example taken from [here](http://www.flightdatacommunity.com/linear-regression-applied-to-take-off/))
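A minimal sketch of these feature constructions in NumPy, assuming the attributes are already loaded into an array `X` of shape $(N, 7)$; the array name and the toy values are ours, not from the dataset above:

```python
import numpy as np

# Toy stand-in for the runway dataset: N = 5 samples, D = 7 attributes.
X = np.random.randn(5, 7)

# Feature choice 1: keep only the first three attributes, so M = 3.
Phi_first3 = X[:, :3]

# Feature choice 2: concatenate x, cos(x), exp(x) elementwise, so M = 3 * D = 21.
Phi_mixed = np.hstack([X, np.cos(X), np.exp(X)])

# Feature choice 3: identity features phi(x) = x, so M = D = 7.
Phi_identity = X

print(Phi_first3.shape, Phi_mixed.shape, Phi_identity.shape)  # (5, 3) (5, 21) (5, 7)
```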

Linear Regression

Linear Regression (1D Inputs)

  • Consider the 1D case (i.e. $D=1$)
    • Given a set of observations $x_1, \dots, x_N \in \R$
    • and corresponding target values $t_1, \dots, t_N$
  • We want to learn a function $y(x_n, \vec{w}) \approx t_n$ to predict future values: $$ y(x_n, \vec{w}) = w_0 + w_1 x_n + w_2 x_n^2 + \dots + w_{M-1} x_n^{M-1} = \sum_{k=0}^{M-1} w_k x_n^k = \vec{w}^T\phi(x_n) $$ where the coefficient vector is $\vec{w}=[w_0, w_1, w_2, \dots ,w_{M-1}]^T$ and the feature vector is $\phi(x_n)=[1, x_n, x_n^2, \dots, x_n^{M-1}]^T$ (here we add a bias term $\phi_0(x_n)=1$ to the features). A code sketch of this polynomial feature map follows below.
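A minimal sketch of the polynomial feature map and the prediction $y(x_n, \vec{w}) = \vec{w}^T\phi(x_n)$, assuming NumPy; the inputs and weights are made up for illustration:

```python
import numpy as np

def poly_features(x, M):
    """Map scalar inputs x to polynomial features [1, x, x^2, ..., x^(M-1)]."""
    x = np.atleast_1d(x)
    return np.stack([x ** k for k in range(M)], axis=-1)  # shape (N, M)

M = 4                                 # number of features (degree 3 plus bias)
x = np.array([0.0, 0.5, 1.0, 1.5])    # toy 1D inputs
w = np.array([0.1, -0.8, 1.6, -0.1])  # toy weights [w_0, ..., w_{M-1}]

Phi = poly_features(x, M)             # one row phi(x_n)^T per sample
y = Phi @ w                           # predictions y(x_n, w) = w^T phi(x_n)
print(y)
```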

Regression: Noisy Data

In [2]:
regression_example_draw(degree1=0,degree2=1,degree3=3, ifprint=True)
The expression for the first polynomial is y=-0.129
The expression for the second polynomial is y=-0.231+0.598x^1
The expression for the third polynomial is y=0.085-0.781x^1+1.584x^2-0.097x^3

Basis Functions

  • In the example above, we used polynomial basis functions.
  • In fact, we have many choices for the basis functions $\phi_j(\vec{x})$.
  • Different basis functions produce different features, and thus may give different prediction performance. A sketch of a few common choices follows the plot below.
In [3]:
basis_function_plot()
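A minimal sketch of a few common basis functions (polynomial, Gaussian, and sigmoidal) in the spirit of [PRML] §3.1; the particular centers and scales are arbitrary choices for illustration:

```python
import numpy as np

def polynomial_basis(x, j):
    # phi_j(x) = x^j
    return x ** j

def gaussian_basis(x, mu, s=0.2):
    # phi_j(x) = exp(-(x - mu)^2 / (2 s^2)), a local "bump" centered at mu
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2))

def sigmoid_basis(x, mu, s=0.2):
    # phi_j(x) = 1 / (1 + exp(-(x - mu) / s)), a smooth step centered at mu
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

x = np.linspace(-1, 1, 5)
print(polynomial_basis(x, 2))
print(gaussian_basis(x, mu=0.0))
print(sigmoid_basis(x, mu=0.0))
```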

Linear Regression (General Case)

  • The function $y(\vec{x}_n, \vec{w})$ is linear in parameters $\vec{w}$.
    • Goal: Find the best value for the weights $\vec{w}$.
    • For simplicity, add a bias term $\phi_0(\vec{x}_n) = 1$. $$ \begin{align} y(\vec{x}_n, \vec{w}) &= w_0 \phi_0(\vec{x}_n)+w_1 \phi_1(\vec{x}_n)+ w_2 \phi_2(\vec{x}_n)+\dots +w_{M-1} \phi_{M-1}(\vec{x}_n) \\ &= \sum_{j=0}^{M-1} w_j \phi_j(\vec{x}_n) \\ &= \vec{w}^T \phi(\vec{x}_n) \end{align} $$ where $\phi(\vec{x}_n) = [\phi_0(\vec{x}_n),\phi_1(\vec{x}_n),\phi_2(\vec{x}_n), \dots, \phi_{M-1}(\vec{x}_n)]^T$

Least Squares

Least Squares: Objective Function

  • We will find the solution $\vec{w}$ to linear regression by minimizing a cost/objective function.
  • When the objective function is the sum of squared errors (the sum of squared differences between the targets $t_n$ and the predictions $y(\vec{x}_n, \vec{w})$ over the entire training set), this approach is also called least squares.
  • The objective function is $$ E(\vec{w}) = \frac12 \sum_{n=1}^N (y(\vec{x}_n, \vec{w}) - t_n)^2 = \frac12 \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j\phi_j(\vec{x}_n) - t_n \right)^2 = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 $$ A short code sketch of this objective follows below.
    (Minimizing the objective function is equivalent to minimizing the sum of squares of the green segments.)
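A minimal sketch of the objective in NumPy, assuming a design-matrix-style array `Phi` of shape $(N, M)$, targets `t`, and weights `w`; all values below are toy numbers chosen for illustration:

```python
import numpy as np

def sse_objective(w, Phi, t):
    """E(w) = 1/2 * sum_n (w^T phi(x_n) - t_n)^2."""
    residuals = Phi @ w - t
    return 0.5 * np.dot(residuals, residuals)

# Toy example: N = 4 samples, M = 3 features.
Phi = np.array([[1.0, 0.0, 0.0],
                [1.0, 0.5, 0.25],
                [1.0, 1.0, 1.0],
                [1.0, 1.5, 2.25]])
t = np.array([0.1, 0.4, 1.1, 2.2])
w = np.array([0.0, 0.5, 0.5])

print(sse_objective(w, Phi, t))
```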

How to Minimize the Objective Function?

  • We will solve the least squares problem in two ways:
    • Gradient Descent Method: approach the solution step by step. We will show two variants of the iterative update:
      • Batch Gradient Descent
      • Stochastic Gradient Descent
    • Closed Form Solution

Method I: Gradient Descent—Gradient Calculation

  • To minimize the objective function, take the derivative w.r.t. the coefficient vector $\vec{w}$: $$ \begin{align} \nabla_\vec{w} E(\vec{w}) &= \frac{\partial}{\partial \vec{w}} \left[ \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 \right] \\ &= \sum_{n=1}^N \frac{\partial}{\partial \vec{w}} \left[ \frac12 \left( \vec{w}^T\phi(\vec{x}_n) - t_n \right)^2 \right] \\ \text{Applying the chain rule:} \quad &= \sum_{n=1}^N \left[ \frac12 \cdot 2 \cdot \left( \vec{w}^T\phi(\vec{x}_n) - t_n \right) \cdot \frac{\partial}{\partial \vec{w}} \vec{w}^T\phi(\vec{x}_n) \right] \\ &= \sum_{n=1}^N \left( \vec{w}^T\phi(\vec{x}_n) - t_n \right)\phi(\vec{x}_n) \end{align} $$
  • Since we are taking derivative of a scalar $E(\vec{w})$ w.r.t a vector $\vec{w}$, the derivative $\nabla_\vec{w} E(\vec{w})$ will be a vector.
  • For details about matrix/vector derivative, please refer to appendix attached in the end of the slide.

Method I-1: Gradient Descent—Batch Gradient Descent

  • Input: Given dataset $\{(\vec{x}_n, t_n)\}_{n=1}^N$
  • Initialize: $\vec{w}_0$, learning rate $\eta$
  • Repeat until convergence:
    • $\nabla_\vec{w} E(\vec{w}_\text{old}) = \sum_{n=1}^N \left( \vec{w}_\text{old}^T\phi(\vec{x}_n) - t_n \right)\phi(\vec{x}_n)$
    • $\vec{w}_\text{new} = \vec{w}_\text{old}-\eta \nabla_\vec{w} E(\vec{w}_\text{old})$
  • End
  • Output: $\vec{w}_\text{final}$
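A minimal sketch of batch gradient descent for this objective, assuming NumPy; the stopping rule (a fixed iteration budget here), learning rate, and toy data are illustrative choices rather than prescriptions from the slides:

```python
import numpy as np

def batch_gradient_descent(Phi, t, eta=0.01, n_iters=1000):
    """Minimize E(w) = 1/2 * ||Phi w - t||^2 with full-batch gradient steps."""
    N, M = Phi.shape
    w = np.zeros(M)                      # initialize w_0
    for _ in range(n_iters):             # "repeat until convergence" (fixed budget here)
        grad = Phi.T @ (Phi @ w - t)     # sum_n (w^T phi(x_n) - t_n) phi(x_n)
        w = w - eta * grad               # w_new = w_old - eta * gradient
    return w

# Toy data: fit y = 1 + 2x with features phi(x) = [1, x].
x = np.linspace(0, 1, 50)
Phi = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + 0.05 * np.random.randn(50)

print(batch_gradient_descent(Phi, t))    # roughly [1, 2]
```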

Method I-2: Gradient Descent—Stochastic Gradient Descent

Main Idea: Instead of computing the batch gradient (over the entire training set), compute the gradient for an individual training sample and update immediately.

  • Input: Given dataset $\{(\vec{x}_n, t_n)\}_{n=1}^N$
  • Initialize: $\vec{w}_0$, learning rate $\eta$
  • Repeat until convergence:
    • Random shuffle $\{(\vec{x}_n, t_n)\}_{n=1}^N$
    • For $n=1,\dots,N$ do:
      • $\nabla_{\vec{w}}E(\vec{w}_\text{old} | \vec{x}_n) = \left( \vec{w}_\text{old}^T\phi(\vec{x}_n) - t_n \right)\phi(\vec{x}_n)$
      • $\vec{w}_\text{new} = \vec{w}_\text{old}-\eta \nabla_{\vec{w}}E(\vec{w}_\text{old} | \vec{x}_n)$
    • End
  • End
  • Output: $\vec{w}_\text{final}$
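A minimal sketch of stochastic gradient descent in the same setting, again assuming NumPy; the number of epochs, learning rate, and toy data are illustrative assumptions:

```python
import numpy as np

def stochastic_gradient_descent(Phi, t, eta=0.01, n_epochs=100, seed=0):
    """Minimize E(w) with per-sample gradient updates, shuffling each epoch."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)                               # initialize w_0
    for _ in range(n_epochs):                     # "repeat until convergence" (fixed budget here)
        for n in rng.permutation(N):              # random shuffle of the training samples
            grad_n = (Phi[n] @ w - t[n]) * Phi[n] # gradient from sample n only
            w = w - eta * grad_n
    return w

# Same toy data as before: fit y = 1 + 2x with phi(x) = [1, x].
x = np.linspace(0, 1, 50)
Phi = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + 0.05 * np.random.randn(50)

print(stochastic_gradient_descent(Phi, t))        # roughly [1, 2]
```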

Method II: Closed Form Solution

Main Idea: Compute the gradient, set it to zero, and solve for $\vec{w}$ in closed form.

  • Objective Function $E(\vec{w}) = \frac12 \sum_{n=1}^N \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 = \frac12 \sum_{n=1}^N \left( \phi(\vec{x}_n)^T \vec{w} - t_n \right)^2$
  • Let $\vec{e} = [\phi(\vec{x}_1)^T \vec{w} - t_1,\quad \phi(\vec{x}_2)^T \vec{w} - t_2, \quad \dots ,\quad \phi(\vec{x}_N)^T \vec{w} - t_N]^T$. Then $$E(\vec{w}) = \frac12 \vec{e}^T \vec{e} = \frac12 \left\| \vec{e} \right\|^2$$
  • Look at $\vec{e}$: $$ \vec{e} = \begin{bmatrix} \phi(\vec{x}_1)^T\\ \vdots\\ \phi(\vec{x}_N)^T \end{bmatrix} \vec{w}- \begin{bmatrix} t_1\\ \vdots\\ t_N \end{bmatrix} \triangleq \Phi \vec{w}-\vec{t} $$ Here $\Phi \in \R^{N \times M}$ is called the design matrix. Each row represents one sample, and each column represents one feature: $$\Phi = \begin{bmatrix} \phi(\vec{x}_1)^T\\ \phi(\vec{x}_2)^T\\ \vdots\\ \phi(\vec{x}_N)^T \end{bmatrix} = \begin{bmatrix} \phi_0(\vec{x}_1) & \phi_1(\vec{x}_1) & \cdots & \phi_{M-1}(\vec{x}_1) \\ \phi_0(\vec{x}_2) & \phi_1(\vec{x}_2) & \cdots & \phi_{M-1}(\vec{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\vec{x}_N) & \phi_1(\vec{x}_N) & \cdots & \phi_{M-1}(\vec{x}_N) \\ \end{bmatrix} $$
  • From $E(\vec{w}) = \frac12 \left \| \vec{e} \right \|^2 = \frac12 \vec{e}^T \vec{e}$ and $\vec{e}=\Phi \vec{w}-\vec{t}$, we have $$ \begin{align} E(\vec{w}) & = \frac12 \left \| \Phi \vec{w}-\vec{t} \right \|^2=\frac12 (\Phi \vec{w}-\vec{t})^T (\Phi \vec{w}-\vec{t}) \\ & = \frac12 \vec{w}^T \Phi^T \Phi \vec{w} - \vec{t}^T \Phi \vec{w} + \frac12 \vec{t}^T \vec{t} \end{align} $$
  • So the derivative is $$\nabla_\vec{w} E(\vec{w}) = \Phi^T\Phi\vec{w} - \Phi^T \vec{t}$$
  • To minimize $E(\vec{w})$, we need to let $\nabla_\vec{w} E(\vec{w}) = \Phi^T\Phi \vec{w} - \Phi^T \vec{t} = 0$, which is also $\Phi^T\Phi \vec{w} = \Phi^T \vec{t}$
  • When $\Phi^T \Phi$ is invertible ($\Phi$ has linearly independent columns), we simply have $$ \begin{align} \hat{\vec{w}} &=(\Phi^T \Phi)^{-1} \Phi^T \vec{t} \\ &\triangleq \Phi^\dagger \vec{t} \end{align} $$ where $\Phi^\dagger$ is called the Moore-Penrose pseudoinverse of $\Phi$. A code sketch of this closed-form solution follows below.
  • We will discuss the case where $\Phi^T \Phi$ is non-invertible later.
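A minimal sketch of the closed-form solution, assuming NumPy; `np.linalg.solve`, `np.linalg.pinv`, and `np.linalg.lstsq` are standard NumPy routines, and the toy data are ours:

```python
import numpy as np

# Same toy setup: fit y = 1 + 2x with phi(x) = [1, x].
x = np.linspace(0, 1, 50)
Phi = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + 0.05 * np.random.randn(50)

# Normal equations: w = (Phi^T Phi)^{-1} Phi^T t (valid when Phi^T Phi is invertible).
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Equivalent (and numerically preferable) routes via the pseudoinverse / least squares.
w_pinv = np.linalg.pinv(Phi) @ t
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

print(w_normal, w_pinv, w_lstsq)   # all roughly [1, 2]
```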

Digression: Moore-Penrose Pseudoinverse

  • When we have a matrix $A$ that is non-invertible, or not even square, we may still want something like an inverse
  • For these situations we use $A^\dagger$, the Moore-Penrose pseudoinverse of $A$
  • In general, we can get $A^\dagger$ from the SVD: if we write $A \in \R^{m \times n}$ as $A = U_{m \times m} \Sigma_{m \times n} V_{n \times n}^T$, then $A^\dagger \in \R^{n \times m}$ is given by $A^\dagger = V \Sigma^\dagger U^T$, where $\Sigma^\dagger \in \R^{n \times m}$ is obtained by taking reciprocals of the non-zero entries of $\Sigma^T$.
  • In particular, when $A$ has linearly independent columns, $A^\dagger = (A^T A)^{-1} A^T$. When $A$ is invertible, $A^\dagger = A^{-1}$. A code sketch via the SVD follows below.
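A minimal sketch of computing the pseudoinverse via the SVD and checking it against NumPy's built-in `np.linalg.pinv`; the example matrix is an arbitrary rank-deficient choice:

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    """Moore-Penrose pseudoinverse: A^+ = V Sigma^+ U^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > tol, 1.0 / s, 0.0)    # reciprocals of the non-zero singular values
    return Vt.T @ np.diag(s_inv) @ U.T

# A rank-deficient 4x3 matrix (third column = first column + second column).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0]])

print(np.allclose(pinv_via_svd(A), np.linalg.pinv(A)))   # True
```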

Back to Closed Form Solution

  • From the previous derivation, we have $\hat{\vec{w}}=(\Phi^T \Phi)^{-1} \Phi^T \vec{t} \triangleq \Phi^\dagger \vec{t}$.
  • What if $\Phi^T \Phi$ is non-invertible? This corresponds to the case where $\Phi$ does not have linearly independent columns, i.e. some feature column of $\Phi$ is a linear combination of the other feature columns (a redundant feature).
  • We can still resolve this using the pseudoinverse.
  • To make $\nabla_\vec{w} E(\vec{w}) = \Phi^T\Phi \vec{w} - \Phi^T \vec{t} = 0$, we take $$ \hat{\vec{w}} = (\Phi^T\Phi)^\dagger \Phi^T \vec{t} = \Phi^\dagger \vec{t}$$ where $(\Phi^T\Phi)^\dagger\Phi^T = \Phi^\dagger$. This identity is left as an exercise. (Hint: use the SVD.)
  • We can now conclude that the optimal $\vec{w}$, in the sense of minimizing the sum of squared errors, is $$\boxed{\hat{\vec{w}} = \Phi^\dagger \vec{t}}$$ A code sketch with a rank-deficient design matrix follows below.
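A minimal sketch of the rank-deficient case, assuming NumPy; the design matrix below has a duplicated feature column, so $\Phi^T\Phi$ is singular, yet $\hat{\vec{w}} = \Phi^\dagger \vec{t}$ still minimizes the squared error:

```python
import numpy as np

# Design matrix with linearly dependent columns: the third column equals the second.
x = np.linspace(0, 1, 20)
Phi = np.column_stack([np.ones_like(x), x, x])    # Phi^T Phi is singular here
t = 1.0 + 2.0 * x

w_hat = np.linalg.pinv(Phi) @ t                   # w_hat = Phi^dagger t
print(w_hat)                                      # one minimizer among infinitely many
print(np.allclose(Phi @ w_hat, t))                # predictions still fit the targets
```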