If you are using nbviewer, you can switch to slides mode by clicking on the slideshow icon.
import random, math
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
plt.rc('text', usetex=True); plt.rc('font', family='serif')
plt.rcParams['text.latex.preamble'] ='\\usepackage{libertine}\n\\usepackage[utf8]{inputenc}'
# numpy - pretty matrix
np.set_printoptions(precision=3, threshold=1000, edgeitems=5, linewidth=80, suppress=True)
import seaborn
seaborn.set(style='whitegrid'); seaborn.set_context('talk')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Fixed seeds to make the results replicable - remove in real life!
random.seed(42)
np.random.seed(42)  # seeding `random` alone does not seed numpy's generator
We assume that $m=1$, that the noise is Gaussian, and that $\Psi$ is finite.
We will assume that the target variable is described by $$y = \mathbf{F}(\vec{x},\vec{w}) + \epsilon$$ where $\mathbf{F}(\vec{x},\vec{w})$ is a (possibly undefined and/or unknown) function of $\vec{x}$ and $\vec{w}$, and $\epsilon$ is a Gaussian-distributed noise component.
The derivations provided below all assume Gaussian noise on the target data. Is this a good assumption? In many cases, yes. The argument hinges on the Central Limit Theorem, which says that the sum of many independent random variables behaves like a Gaussian-distributed random variable.
The noise term in this model, $\epsilon$, can be thought of as the sum of the features not included in the model function, $\mathbf{F}(\vec{x},\vec{w})$. If these unaccounted-for features are themselves independent random variables, and there are many of them, the Central Limit Theorem suggests that a Gaussian model is appropriate.
It is possible that there are only a few unaccounted-for features, or that there is genuinely non-Gaussian noise in our measurements, e.g. sensor shot noise, which often has a Poisson distribution. In such cases, the assumption is no longer valid.
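A small illustration of this argument (our own toy example, not part of the derivation): summing many independent uniform variables already produces a nearly Gaussian histogram.
n_terms, n_samples = 50, 10000
sums = np.random.rand(n_samples, n_terms).sum(axis=1)  # each sample: sum of 50 uniforms
counts, edges = np.histogram(sums, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
plt.bar(centers, counts, width=edges[1] - edges[0], alpha=0.4, label='sum of 50 uniforms')
xs = np.linspace(sums.min(), sums.max(), 200)
pdf = (np.exp(-(xs - sums.mean())**2 / (2 * sums.var()))
       / np.sqrt(2 * np.pi * sums.var()))
plt.plot(xs, pdf, label='Gaussian fit')
plt.legend(frameon=True);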
Training dataset
N = 5 # Size of the data set
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]]) # Inputs, shape: N x 1
y = np.array([10.5, 5.0, 3.0, 2.5, 1.0]) # Outputs, shape: N
print(X.T)
print(y)
[[ 0.  1.  2.  3.  4.]]
[ 10.5   5.    3.    2.5   1. ]
Test dataset
N_test = 100
X_test = np.linspace(0.0, 4.0, N_test).reshape((N_test, 1))
y_test = np.linspace(7.0, -5.0, N_test)+ 2*np.random.randn(N_test)
The training and test datasets
plt.plot(X[:, 0], y, 'bo', label='Training')
plt.plot(X_test[:, 0], y_test, "g.", label='Test')
plt.xlabel('X'); plt.ylabel('y'); plt.legend(frameon=True);
We model each observation as $$y^{(n)} = \boldsymbol{w}^\intercal\vec{x}^{(n)} + b + \epsilon^{(n)},$$ where $\boldsymbol{w}$ is a weight vector, $b$ a bias term, and $\epsilon^{(n)} \sim \mathcal{N}(0, \sigma^2)$.
A shorter version of this expression is the following equation: $$\vec{y} = \vec{X} \cdot \vec{w} + \vec{\epsilon}, \quad \vec{X} \in \mathbb{R}^{m \times n},\ \vec{y} \in \mathbb{R}^m,\ \vec{\epsilon} \in \mathbb{R}^m,\ \vec{w} \in \mathbb{R}^n,$$ where the $i$-th row of $\vec{X}$ represents $\vec{x}^{(i)}$ and the $j$-th entry of the vector $\vec{y}$ represents $y^{(j)}$ (the bias $b$ can be absorbed into $\vec{w}$ by appending a constant feature, as we do below).
where $\left\|\boldsymbol{A}\right\|_2$ here denotes the Frobenius norm (often written $\left\|\boldsymbol{A}\right\|_F$), the generalization of the Euclidean norm to matrices.
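A quick numerical check (our own): numpy's default matrix norm is precisely this Frobenius norm, the square root of the sum of squared entries.
A = np.array([[1., 2.], [3., 4.]])
print(np.linalg.norm(A), np.sqrt((A**2).sum()))  # both sqrt(30), about 5.477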
The frequentist (maximum likelihood) approach
Under the Gaussian noise assumption, the likelihood function of the training data and its logarithm are $$ \begin{array}{rl} p(\vec{y}|\vec{X},\vec{w},\sigma^2) & = \prod_{i=1}^N {\mathcal{N}\left(y^{(i)}\left|\vec{w}^\intercal\phi(\vec{x}^{(i)}),\sigma^2\right.\right)}, \\ \ln p(\vec{y}|\vec{X},\vec{w},\sigma^2) & =\frac{N}{2}\ln\left(\frac{1}{\sigma^2}\right) -\frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^N\left\{y^{(i)} -\vec{w}^\intercal\phi(\vec{x}^{(i)})\right\}^2, \end{array} $$ where $\vec{X}=\{\vec{x}^{(1)},\ldots,\vec{x}^{(N)}\}$ is the set of input values for the corresponding $N$ observed output values contained in the vector $\vec{y}$, and $\mathcal{N}(\mu,\sigma^2)$ is the normal (Gaussian) distribution.
Setting the derivative of the log-likelihood with respect to $\vec{w}$ equal to zero, one can obtain the maximum likelihood parameters given by the normal equations: $$\vec{w}_\text{ML} = \left(\vec{\Phi}^\intercal\vec{\Phi}\right)^{-1}\vec{\Phi}^\intercal\vec{y},$$ where $\vec{\Phi}$ is the $N \times M$ design matrix with elements $\Phi_{i,j}=\phi_j(\vec{x}^{(i)})$, and $\vec{y}$ is the $N \times K$ matrix of training set target values (for $K=1$, it is simply a column vector).
Note that $\vec{\Phi}^\intercal$ is a $M \times N$ matrix, so that $\vec{w}_{ML}=\left(\vec{\Phi}^\intercal\vec{\Phi}\right)^{-1}\vec{\Phi}^\intercal\vec{y}$ is $(M \times N)\times(N \times M)\times(M\times N)\times(N \times K) = M \times K$, where $M$ is the number of free parameters and $K$ is the number of predicted target values for a given input.
Note that the only term in the log-likelihood that depends on $\vec{w}$ is the last one. Thus, maximizing the likelihood with respect to $\vec{w}$ under the assumption of Gaussian noise is equivalent to minimizing a sum-of-squares error function.
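To illustrate this equivalence numerically (our own toy example; x_toy, y_toy and both helper functions are ours), we can scan a grid of slopes and confirm that the log-likelihood peaks exactly where the sum of squared errors bottoms out:
from scipy.stats import norm
x_toy = np.array([0., 1., 2., 3.])
y_toy = np.array([1., 3., 5., 7.])
def log_lik(w, sigma=1.0):
    'Gaussian log-likelihood of y_toy under the slope-only model w*x.'
    return norm.logpdf(y_toy, loc=w * x_toy, scale=sigma).sum()
def sse(w):
    'Sum of squared errors of the slope-only model w*x.'
    return 0.5 * np.sum((y_toy - w * x_toy)**2)
ws = np.linspace(1.0, 3.0, 201)
print(ws[np.argmax([log_lik(w) for w in ws])],
      ws[np.argmin([sse(w) for w in ws])])  # the same w wins both criteria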
There are two ways to adjust the weights of a linear model (which includes linear models with nonlinear features): solving the normal equations analytically, or minimizing the error iteratively with gradient descent.
In addition, we can add a constraint to the objective function to penalize large weights: Tikhonov regularization (forward pointer).
Solving the equation $\boldsymbol{X} \cdot \boldsymbol{w} = \boldsymbol{y}$ directly for $\boldsymbol{w}$ by inversion is frequently not possible.
With more observations than parameters, the system is overdetermined, so it is usually not possible to find an exact solution.
Hint: The expression $(\vec{X}^\intercal\vec{X})^{-1}\vec{X}^\intercal$ is equivalent to the Moore-Penrose pseudoinverse $\vec{X}^+$ of $\vec{X}$. It is a generalization of the inverse for non-square matrices. You can use it to implement the normal equations if your library provides the function. You could also use a least squares solver (e.g. numpy.linalg.lstsq or dgelsd/zgelsd from LAPACK).
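To see why the pseudoinverse matters (a toy example of our own): with a duplicated feature column, $\vec{X}^\intercal\vec{X}$ is singular, plain inversion fails, and the pseudoinverse still returns a (minimum-norm) solution.
X_sing = np.array([[1., 1.], [2., 2.], [3., 3.]])  # two identical features
y_sing = np.array([1., 2., 3.])
try:
    np.linalg.inv(X_sing.T.dot(X_sing))
except np.linalg.LinAlgError as e:
    print('inv failed:', e)
print('pinv solution:', np.linalg.pinv(X_sing).dot(y_sing))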
def normal_equations(X, y):
    'Solves for w in X w = y via the normal equations.'
    w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    # ... or we can use the Moore-Penrose pseudoinverse (usually more
    # numerically stable):
    #w = np.linalg.pinv(X).dot(y)
# ... or we use the solver
# w = np.linalg.lstsq(X, y)[0]
return w
# Add bias to each row/instance:
X_bias = np.hstack((X, np.ones((N, 1))))
X_test_bias = np.hstack((X_test, np.ones((N_test, 1))))
print('X:',X.T)
print('X_bias:',X_bias.T)
X: [[ 0.  1.  2.  3.  4.]]
X_bias: [[ 0.  1.  2.  3.  4.]
 [ 1.  1.  1.  1.  1.]]
w_norm = normal_equations(X_bias, y)
plt.plot(X_bias[:, 0], y, "o")
plt.plot(X_bias[:, 0], X_bias.dot(w_norm), "m-", linewidth=2)
plt.xlabel("x"); plt.ylabel("y"); plt.title("Model: $y = %.2f x + %.2f$" % tuple(w_norm));
We can now solve linear regression problems at will!
Test the other forms of normal equations.
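A possible starting point for the exercise (our own sketch): on this small problem, all three variants should agree up to numerical precision.
w_inv = np.linalg.inv(X_bias.T.dot(X_bias)).dot(X_bias.T).dot(y)
w_pinv = np.linalg.pinv(X_bias).dot(y)
w_lstsq = np.linalg.lstsq(X_bias, y)[0]
print(np.allclose(w_inv, w_pinv), np.allclose(w_inv, w_lstsq))  # True True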
Think of this problem:
n_pixels = 256 * 256
memory_size_xtx = n_pixels * n_pixels * 8
print("Required memory: %d GB" % (memory_size_xtx / 2**30))
Required memory: 32 GB
Inversion is an operation that requires $O(n^{2.373})$ operations with the best known algorithms ($O(n^3)$ for the straightforward approach).
In cases where the training data set is very large or data is received in a stream, a direct solution using the normal equations may not be possible. An alternative approach is the gradient descent algorithm (or its stochastic variant, which processes samples one at a time).
Iteratively update the weights by following the negative gradient of the error function.
Each component of the update $\Delta \boldsymbol{w}_t$ is minus the partial derivative of the error function $E(\boldsymbol{X}, \boldsymbol{y};\boldsymbol{w}_t)$ with respect to the corresponding weight $w_i$, scaled by the learning rate: $$\Delta w_i = - \alpha\frac{\partial E(\boldsymbol{X}, \boldsymbol{y};\boldsymbol{w})}{\partial w_i},$$
where $\alpha > 0$ is the learning rate.
Update $\boldsymbol{w}$ incrementally in every iteration $t$ as:
$$\boldsymbol{w}(t+1) = \boldsymbol{w}(t) + \Delta \boldsymbol{w}(t).$$
Note: gradient descent is actually a more general rule that can be applied to any minimization problem where the gradient can be computed, e.g., artificial neural networks, support vector machines, $k$-means, etc. (another forward pointer).
Using as error function the sum of squared errors (SSE),
$$E(\boldsymbol{X}, \boldsymbol{y};\boldsymbol{w}) = \frac{1}{2}\left\|\boldsymbol{X} \cdot \boldsymbol{w} - \boldsymbol{y}\right\|^2_2,$$
the gradient becomes
$$\nabla_{\boldsymbol{w}} E(\boldsymbol{X}, \boldsymbol{y};\boldsymbol{w}) = \boldsymbol{X}^\intercal \cdot (\boldsymbol{X} \cdot \boldsymbol{w} - \boldsymbol{y}).$$
def error(X, y, w):
return 0.5*np.linalg.norm(X.dot(w) - y)**2
def linear_regression_gradient(X, y, w):
return X.T.dot(X.dot(w)-y)
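As a quick sanity check (our own addition; w_check, eps and fd are illustrative names), we can compare the analytic gradient with a central finite-difference approximation:
w_check = np.random.random(2)
eps = 1e-6
fd = np.array([(error(X_bias, y, w_check + eps*e) -
                error(X_bias, y, w_check - eps*e)) / (2*eps)
               for e in np.eye(2)])
print(np.allclose(fd, linear_regression_gradient(X_bias, y, w_check), rtol=1e-4))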
Initialize weights to small random values
w = np.random.random(2)/10000
alpha = 0.05
for i in range(50):
# print('Iteration', i+1, 'weights', w, 'error', error(X_bias, y, w))
w = w - alpha*linear_regression_gradient(X_bias, y, w)
print('w_norm:', w_norm, 'w grad. desc.', w)
w_norm: [-2.15 8.7 ] w grad. desc. [-2.089 8.526]
def plot_contour(X_data, y_data, bounds, resolution=50, cmap=cm.viridis,
                 alpha=0.3, ax=None):
    'Plots the error surface over a grid of weight pairs.'
    (minx, miny), (maxx, maxy) = bounds
    x_range = np.linspace(minx, maxx, num=resolution)
    y_range = np.linspace(miny, maxy, num=resolution)
    X, Y = np.meshgrid(x_range, y_range)
    Z = np.zeros((len(y_range), len(x_range)))  # rows follow the y axis
    for i, w_i in enumerate(x_range):
        for j, w_j in enumerate(y_range):
            Z[j, i] = error(X_data, y_data, [w_i, w_j])
    if not ax:
        fig = plt.figure(figsize=(6, 6))
        ax = fig.gca()
    ax.set_aspect('equal')
    ax.autoscale(tight=True)
    # rstride/cstride/linewidth are surface-plot arguments, not valid for
    # contour plots, so they are dropped here.
    ax.contourf(X, Y, Z, 30, cmap=cmap, alpha=alpha)
    cset = ax.contour(X, Y, Z, 10, cmap=cmap)
    plt.clabel(cset, inline=1, fontsize=7)
    return Z
Implementing the gradient descent loop as a function
def gradient_descent(X, y, w_0, alpha, max_iters):
    'Returns the values of the weights as learning took place.'
    w = np.array(w_0, dtype=np.float64)
    w_hist = np.zeros(shape=(max_iters+1, w.shape[0]))
    w_hist[0] = w
    for i in range(max_iters):
        # use the X argument, not the global X_bias
        delta_weights = -alpha*linear_regression_gradient(X, y, w)
        w += delta_weights
        w_hist[i+1] = w
    return w_hist
w_0 = [-3,2] # we fix the initial weights to make it more illustrative
alpha = 0.05
max_iters = 25
w_hist = gradient_descent(X_bias, y, w_0, alpha, max_iters)
fig = plt.figure(figsize=(5,5)); ax = fig.gca()
bounds = (np.min(w_hist, axis=0)-2, np.max(w_hist, axis=0)+2)
plot_contour(X_bias, y, bounds, ax=ax)
ax.scatter(w_norm[0], w_norm[1], color='navy', marker='D', s=50, label='$w_{norm}$')
ax.plot(w_hist[:,0], w_hist[:,1], '.--', color='m')
ax.scatter(w_hist[0,0], w_hist[0,1], color='navy', marker='o', s=65, label='start')
ax.scatter(w_hist[-1,0], w_hist[-1,1], color='navy', marker='s', s=50, label='end')
plt.xlabel('$w_1$'); plt.ylabel('$w_2$'); plt.title('end='+str(w_hist[-1]))
plt.legend(scatterpoints=1, bbox_to_anchor=(1.37,1), frameon=True);
What is the impact of the learning rate ($\alpha$)?
def alphas_study(alphas):
fig = plt.figure(figsize=(11,7))
for i,alpha in enumerate(alphas):
ax = fig.add_subplot(2,3,i+1)
w_hist = gradient_descent(X_bias, y , w_0, alpha, max_iters)
combi=np.hstack((w_norm.reshape(2,1), w_hist.T))
bounds = (np.min(combi, axis=1)-2, np.max(combi, axis=1)+2)
plot_contour(X_bias, y, bounds, ax=ax)
        ax.scatter(w_norm[0], w_norm[1], c='navy', marker='D', s=50, label='$w_{norm}$')
ax.plot(w_hist[:,0], w_hist[:,1], '.--', color='m')
ax.scatter(w_hist[0,0], w_hist[0,1], c='navy', marker='o', s=65, label='start')
ax.scatter(w_hist[-1,0], w_hist[-1,1], c='navy', marker='s', s=50, label='end')
plt.xlabel('$w_1$'); plt.ylabel('$w_2$');
plt.title('$\\alpha='+str(alpha)+'$')
plt.legend(scatterpoints=1, ncol=3, bbox_to_anchor=(-0.2,-0.15), frameon=True);
plt.tight_layout()
alphas = np.linspace(0.02,0.07,6)
alphas
array([ 0.02, 0.03, 0.04, 0.05, 0.06, 0.07])
alphas_study(alphas)
Idea: adding an inertial momentum term to the update can speed up convergence along consistent directions and dampen oscillations.
Update $\vec{w}$ incrementally in every iteration $t$ as:
$$\vec{w}(t+1) = \vec{w}(t) + \Delta \vec{w}(t), \qquad \Delta \vec{w}(t) = -\alpha\,\nabla_{\vec{w}} E(\boldsymbol{X}, \boldsymbol{y};\boldsymbol{w}) + \beta\, \Delta \vec{w}(t-1),$$ where $\beta\in\mathbb{R}^+$ is known as the momentum rate; note that $\Delta \vec{w}(t)$ already contains the learning rate $\alpha$, matching the implementation below.
def gradient_descent_with_momentum(X, y, w_0, alpha, beta, max_iters):
w = np.array(w_0, dtype=np.float64)
w_hist = np.zeros(shape=(max_iters+1, w.shape[0]))
w_hist[0] = w
omega = np.zeros_like(w)
for i in range(max_iters):
delta_weights = -alpha*linear_regression_gradient(X, y, w) + beta*omega
omega = delta_weights
w += delta_weights
w_hist[i+1] = w
return w_hist
alpha = 0.05
beta = 0.5
max_iters = 25
w_hist = gradient_descent(X_bias, y, (-3,2), alpha, max_iters)
w_hist_mom = gradient_descent_with_momentum(X_bias,y, (-3,2), alpha, beta, max_iters)
def comparison_plot():
fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(121)
combi=np.hstack((w_norm.reshape(2,1), w_hist.T))
bounds = (np.min(combi, axis=1)-2, np.max(combi, axis=1)+2)
plot_contour(X_bias, y, bounds, ax=ax)
    ax.scatter(w_norm[0], w_norm[1], c='lightcoral', marker='D', s=50, label='$w_{norm}$')
ax.plot(w_hist[:,0], w_hist[:,1], '.-', c='navy')
ax.scatter(w_hist[0,0], w_hist[0,1], c='lightskyblue', marker='o', s=65, label='start')
ax.scatter(w_hist[-1,0], w_hist[-1,1], c='lightskyblue', marker='s', s=50, label='end')
plt.xlabel('$w_1$'); plt.ylabel('$w_2$');plt.title('Gradient descent')
ax = fig.add_subplot(122)
plot_contour(X_bias, y, bounds, ax=ax)
    ax.scatter(w_norm[0], w_norm[1], c='lightcoral', marker='D', s=50, label='$w_{norm}$')
ax.plot(w_hist_mom[:,0], w_hist_mom[:,1], '.-', c='m')
ax.scatter(w_hist_mom[0,0], w_hist_mom[0,1], c='lightskyblue', marker='o', s=65, label='start')
ax.scatter(w_hist_mom[-1,0], w_hist_mom[-1,1], c='lightskyblue', marker='s', s=50, label='end')
plt.xlabel('$w_1$'); plt.ylabel('$w_2$');plt.title('Gradient descent with momentum')
plt.legend(scatterpoints=1, bbox_to_anchor=(1.44,1), frameon=True)
plt.tight_layout()
comparison_plot()
def alphas_study_with_momentum(alphas, beta):
fig = plt.figure(figsize=(11,7))
for i,alpha in enumerate(alphas):
ax = fig.add_subplot(2,3,i+1)
w_hist = gradient_descent_with_momentum(X_bias, y , w_0, alpha, beta, max_iters)
combi=np.hstack((w_norm.reshape(2,1), w_hist.T))
bounds = (np.min(combi, axis=1)-2, np.max(combi, axis=1)+2)
        plot_contour(X_bias, y, bounds, ax=ax)
        ax.scatter(w_norm[0], w_norm[1], c='lightcoral', marker='D', s=50, label='$w_{norm}$')
ax.plot(w_hist[:,0], w_hist[:,1], '.-', c='m')
ax.scatter(w_hist[0,0], w_hist[0,1], c='lightskyblue', marker='o', s=65, label='start')
ax.scatter(w_hist[-1,0], w_hist[-1,1], c='lightskyblue', marker='s', s=50, label='end')
plt.xlabel('$w_1$'); plt.ylabel('$w_2$');
plt.title('$\\alpha='+str(alpha)+'$')
plt.legend(scatterpoints=1, ncol=3, bbox_to_anchor=(-0.2,-0.15), frameon=True);
plt.tight_layout()
alphas_study_with_momentum(alphas, 0.5)
alphas_study_with_momentum(alphas, 0.25)
alphas_study_with_momentum(alphas, 0.75)
... I know you are wondering what $\beta=1$ looks like.
alphas_study_with_momentum(alphas, 1)
plt.plot(X_bias[:, 0], y, "o")
plt.plot(X_bias[:, 0], X_bias.dot(w_norm), "D-", c='lightcoral',
linewidth=2, label='normal equation')
plt.plot(X_bias[:, 0], X_bias.dot(w_hist[-1]), "-", c='skyblue',
linewidth=2, label='gradient descent')
plt.plot(X_bias[:, 0], X_bias.dot(w_hist_mom[-1]), "-", c='forestgreen',
linewidth=2, alpha=0.5, label='gradient descent with momentum')
plt.xlabel("x"); plt.ylabel("y"); plt.legend(numpoints=1, bbox_to_anchor=(1.4,1), frameon=True);
N_test = 10000000
X_test = np.linspace(0.0, 4.0, N_test).reshape((N_test, 1))
y_test = np.linspace(7.0, -5.0, N_test)+ 2*np.random.randn(N_test)
X_test_bias = np.hstack((X_test, np.ones((N_test, 1))))
%time w_norm = normal_equations(X_test_bias, y_test)
CPU times: user 295 ms, sys: 113 ms, total: 409 ms Wall time: 436 ms
%time w_grad = gradient_descent_with_momentum(X_test_bias, y_test, np.random.random(2)/1000, 0.000001, 0.25, 25)
CPU times: user 4.86 s, sys: 1.39 s, total: 6.24 s Wall time: 4.29 s
w_norm
array([-3.001, 7.002])
w_grad[-1]
array([ -6.263e+43, -2.449e+43])
np.linalg.norm(w_norm - w_grad[-1])
6.7249641881311471e+43
Gradient descent has diverged: the batch gradient scales with the number of samples, so a step size suitable for five points explodes with ten million.
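A common remedy, sketched below with our own helper (not part of the original notebook), is to average the gradient over the samples so that the learning rate becomes independent of the dataset size:
def mean_gradient_descent(X, y, w_0, alpha, max_iters):
    'Gradient descent on the mean (per-sample) squared error.'
    w = np.array(w_0, dtype=np.float64)
    for _ in range(max_iters):
        w -= alpha * X.T.dot(X.dot(w) - y) / len(y)
    return w
w_mean = mean_gradient_descent(X_test_bias, y_test, np.random.random(2)/1000,
                               alpha=0.1, max_iters=200)
print(w_mean)  # should move toward w_norm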
To approximate nonlinear functions with a linear model, we have to generate nonlinear features.
In this example, we generate sinusoidal features. You could also try radial basis functions, polynomials, ...
We expand each feature $x_j$ to the nonlinear feature vector $$\left\langle\cos\left(\frac{0}{d} \pi x_j\right), \cos\left(\frac{1}{d} \pi x_j\right), \ldots, \cos\left(\frac{d}{d} \pi x_j\right)\right\rangle^\intercal,$$ where $d$ is the degree of the expansion (yielding $d+1$ basis functions).
Note that $\cos \left(\frac{0}{d} \pi x_j \right)=1$ is the bias we added manually in the previous example.
from itertools import chain
def sinusoidalize(X, n_degree):
X_sinusoidal = np.ndarray((len(X), n_degree+1))
for d in range(n_degree+1):
X_sinusoidal[:, d] = np.cos(X[:, 0] * np.pi * d / n_degree)
return X_sinusoidal
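A quick sanity check of our own: the $d=0$ feature should be the constant bias column.
print(sinusoidalize(X, n_degree=3)[:, 0])  # expect a column of ones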
# Utility function
def build_sinusoidal(n_degree, w):
    'Formats the learned cosine expansion as a LaTeX string for plot titles.'
    return (((r"%+.2f \cos(%d /" + str(n_degree) + r" \pi x)") * (n_degree+1))
            % tuple(chain(*zip(w, range(n_degree+1)))))
n_degree = 4
X_sinusoidal = sinusoidalize(X, n_degree=n_degree)
X_test_sinusoidal = sinusoidalize(X_test, n_degree=n_degree)
w_sin = normal_equations(X_sinusoidal, y)
plt.plot(X_bias[:, 0], X_bias.dot(w_norm), "m-", linewidth=2, label='linear')
plt.plot(X_test_bias[:, 0], X_test_sinusoidal.dot(w_sin), "-", linewidth=2, color="r", label='sinusoidal')
plt.plot(X_bias[:, 0], y, "bo", label='training')
plt.xlabel("x"); plt.ylabel("y"); plt.legend(frameon=True)
plt.title("Model: $y = " + build_sinusoidal(n_degree, w_sin) + "$");
You might be wondering, is this cheating?
Techniques used to account for overfitting or underfitting of a given model when using a maximum likelihood approach.
The error function to be minimized is augmented with a penalty term, $$E(\vec{w}) = E_D(\vec{w}) + \lambda E_W(\vec{w}),$$ where $\lambda$ is the regularization coefficient, $E_D$ is the data-dependent error, and $E_W$ is the regularization term.
A commonly used regularization term is the sum-of-squares of the model parameters, $$E_W(\vec{w}) = \frac{1}{2}\vec{w}^\intercal\vec{w},$$ known as the weight decay regularizer.
This regularization term leads, assuming a linear regression model with Gaussian noise on the training data, to the optimal solution $$\vec{w} = \left(\lambda \vec{I} + \vec{\Phi}^\intercal \vec{\Phi}\right)^{-1} \vec{\Phi}^\intercal\vec{y}.$$
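As an aside, a minimal sketch of this closed form (ridge_solution is our own helper, distinct from the worked example below; np.linalg.solve is usually preferable to forming an explicit inverse):
def ridge_solution(Phi, t, lam):
    'Weight-decay solution: solves (lam*I + Phi^T Phi) w = Phi^T t.'
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T.dot(Phi),
                           Phi.T.dot(t))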
def y(x, m, b, mu=0, sigma=1):
    'Noisy line m*x + b + Gaussian noise (note: shadows the vector y used above).'
    return m*x + b + np.random.normal(mu, sigma, 1)[0]
def el_pow(x, pow):
    'Element-wise power of the array x (equivalent to x**pow).'
    temp = x
    for i in range(pow-1):
        temp = temp * x
    return temp
def prediction(params, x):
    'Evaluates the polynomial sum_i params[i] * x**i at the scalar x.'
    pred = 0
    for i in range(len(params)):
        pred += params[i]*math.pow(x, i)
    return pred
#training data, with N data points
N = 101
M = 8
t = np.empty(N)
domain = np.empty(N)
domain_bound = 1.0/N
for i in range(N):
domain[i] = i*domain_bound
for i in range(N):
t[i] = y(x=domain[i],m=4.89,b=0.57)
#find the solution without using regularization
#design matrix, phi, N X M
phi = np.array([np.ones(N), domain, el_pow(domain,2), el_pow(domain,3),
el_pow(domain,4), el_pow(domain,5), el_pow(domain,6),el_pow(domain,7)]).T
temp1 = np.linalg.inv(np.dot(phi.T,phi)) #inverse of phi.T X phi
temp2 = np.dot(temp1, phi.T)
w1 = np.dot(temp2,t) #solution
predicted_t = [prediction(w1,x) for x in domain]
print('w1=',w1)
w1= [ 1.784 -31.545 297.95 -1047.916 1790.514 -1433.442 386.297 42.499]
#find the regularized solution
lam = 0.1
temp1 = np.linalg.inv(lam*np.eye(M)+np.dot(phi.T,phi))
temp2 = np.dot(temp1,phi.T)
w2 = np.dot(temp2,t)
print('w2=',w2)
predicted_t_reg = [prediction(w2,x) for x in domain]
w2= [ 0.777 3.509 1.983 0.232 -0.47 -0.538 -0.281 0.135]
plt.plot(domain,t, label='training')
plt.plot(domain,predicted_t, label='predicted')
plt.plot(domain,predicted_t_reg, label='regularized')
plt.legend(loc='lower right', frameon=True);
%load_ext version_information
%version_information scipy, numpy, matplotlib
Software | Version |
---|---|
Python | 3.6.1 64bit [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] |
IPython | 5.3.0 |
OS | Darwin 16.5.0 x86_64 i386 64bit |
scipy | 0.19.0 |
numpy | 1.12.1 |
matplotlib | 2.0.0 |
Sat Apr 08 17:05:34 2017 -03 |
# this code is here for cosmetic reasons
from IPython.core.display import HTML
from urllib.request import urlopen
HTML(urlopen('https://raw.githubusercontent.com/lmarti/jupyter_custom/master/custom.include').read().decode('utf-8'))