If you are using nbviewer, you can switch to slides mode by clicking on the slideshow icon.
import random, math
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
plt.rc('text', usetex=True); plt.rc('font', family='serif')
plt.rcParams['text.latex.preamble'] ='\\usepackage{libertine}\n\\usepackage[utf8]{inputenc}'
# numpy - pretty matrix
np.set_printoptions(precision=3, threshold=1000, edgeitems=5, linewidth=80, suppress=True)
import seaborn
seaborn.set(style='whitegrid'); seaborn.set_context('talk')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Fixed seeds to make the results replicable - remove in real life!
random.seed(42)
np.random.seed(42)  # seeding `random` alone does not seed numpy's generator
We assume that $m=1$, that the noise is Gaussian, and that $\Psi$ is finite.
We will assume that the target variable is described by $$y = \mathbf{F}(\vec{x},\vec{w}) + \epsilon$$ where $\mathbf{F}(\vec{x},\vec{w})$ is a (possibly undefined and/or unknown) function of $\vec{x}$ and $\vec{w}$, and $\epsilon$ is a Gaussian-distributed noise component.
The derivations provided below all assume Gaussian noise on the target data. Is this a good assumption? In many cases, yes. The argument hinges on the Central Limit Theorem, which says that the sum of many independent random variables behaves like a Gaussian-distributed random variable.
The noise term in this model, $\epsilon$, can be thought of as the sum of the features not included in the model function, $\mathbf{F}(\vec{x},\vec{w})$. If these unaccounted-for features are themselves independent random variables, and there are many of them, the Central Limit Theorem suggests that a Gaussian model is appropriate.
It is possible that there are only a few unaccounted-for features, or that there is genuinely non-Gaussian noise in our measurements, e.g. sensor shot noise, which often has a Poisson distribution. In such cases, the assumption is no longer valid.
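A small illustration of this argument (our own toy example, not part of the derivation): summing many independent uniform variables already produces a nearly Gaussian histogram.
n_terms, n_samples = 50, 10000
sums = np.random.rand(n_samples, n_terms).sum(axis=1)  # each sample: sum of 50 uniforms
counts, edges = np.histogram(sums, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
plt.bar(centers, counts, width=edges[1] - edges[0], alpha=0.4, label='sum of 50 uniforms')
xs = np.linspace(sums.min(), sums.max(), 200)
pdf = (np.exp(-(xs - sums.mean())**2 / (2 * sums.var()))
       / np.sqrt(2 * np.pi * sums.var()))
plt.plot(xs, pdf, label='Gaussian fit')
plt.legend(frameon=True);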
Training dataset
N = 5 # Size of the data set
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]]) # Inputs, shape: N x 1
y = np.array([10.5, 5.0, 3.0, 2.5, 1.0]) # Outputs, shape: N
print(X.T)
print(y)
[[ 0.  1.  2.  3.  4.]]
[ 10.5   5.    3.    2.5   1. ]
Test dataset
N_test = 100
X_test = np.linspace(0.0, 4.0, N_test).reshape((N_test, 1))
y_test = np.linspace(7.0, -5.0, N_test)+ 2*np.random.randn(N_test)
The training and test datasets
plt.plot(X[:, 0], y, 'bo', label='Training')
plt.plot(X_test[:, 0], y_test, "g.", label='Test')
plt.xlabel('X'); plt.ylabel('y'); plt.legend(frameon=True);
We model each observation as $$y^{(n)} = \boldsymbol{w}^\intercal\vec{x}^{(n)} + b + \epsilon^{(n)},$$ where $\boldsymbol{w}$ is a weight vector, $b$ a bias term, and $\epsilon^{(n)} \sim \mathcal{N}(0, \sigma^2)$.
A shorter version of this expression is the following equation: $$\vec{y} = \vec{X} \cdot \vec{w} + \vec{\epsilon}, \quad \vec{X} \in \mathbb{R}^{m \times n},\ \vec{y} \in \mathbb{R}^m,\ \vec{\epsilon} \in \mathbb{R}^m,\ \vec{w} \in \mathbb{R}^n,$$ where the $i$-th row of $\vec{X}$ represents $\vec{x}^{(i)}$ and the $j$-th entry of the vector $\vec{y}$ represents $y^{(j)}$ (the bias $b$ can be absorbed into $\vec{w}$ by appending a constant feature, as we do below).
where $\left\|\boldsymbol{A}\right\|_2$ here denotes the Frobenius norm (often written $\left\|\boldsymbol{A}\right\|_F$), the generalization of the Euclidean norm to matrices.
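A quick numerical check (our own): numpy's default matrix norm is precisely this Frobenius norm, the square root of the sum of squared entries.
A = np.array([[1., 2.], [3., 4.]])
print(np.linalg.norm(A), np.sqrt((A**2).sum()))  # both sqrt(30), about 5.477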
The frequentist (maximum likelihood) approach
Under the Gaussian noise assumption, the likelihood function of the training data and its logarithm are $$ \begin{array}{rl} p(\vec{y}|\vec{X},\vec{w},\sigma^2) & = \prod_{i=1}^N {\mathcal{N}\left(y^{(i)}\left|\vec{w}^\intercal\phi(\vec{x}^{(i)}),\sigma^2\right.\right)}, \\ \ln p(\vec{y}|\vec{X},\vec{w},\sigma^2) & =\frac{N}{2}\ln\left(\frac{1}{\sigma^2}\right) -\frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^N\left\{y^{(i)} -\vec{w}^\intercal\phi(\vec{x}^{(i)})\right\}^2, \end{array} $$ where $\vec{X}=\{\vec{x}^{(1)},\ldots,\vec{x}^{(N)}\}$ is the set of input values for the corresponding $N$ observed output values contained in the vector $\vec{y}$, and $\mathcal{N}(\mu,\sigma^2)$ is the normal (Gaussian) distribution.
Setting the derivative of the log-likelihood with respect to $\vec{w}$ equal to zero, one can obtain the maximum likelihood parameters given by the normal equations: $$\vec{w}_\text{ML} = \left(\vec{\Phi}^\intercal\vec{\Phi}\right)^{-1}\vec{\Phi}^\intercal\vec{y},$$ where $\vec{\Phi}$ is the $N \times M$ design matrix with elements $\Phi_{i,j}=\phi_j(\vec{x}^{(i)})$, and $\vec{y}$ is the $N \times K$ matrix of training set target values (for $K=1$, it is simply a column vector).
Note that $\vec{\Phi}^\intercal$ is a $M \times N$ matrix, so that $\vec{w}_{ML}=\left(\vec{\Phi}^\intercal\vec{\Phi}\right)^{-1}\vec{\Phi}^\intercal\vec{y}$ is $(M \times N)\times(N \times M)\times(M\times N)\times(N \times K) = M \times K$, where $M$ is the number of free parameters and $K$ is the number of predicted target values for a given input.
Note that the only term in the log-likelihood that depends on $\vec{w}$ is the last one. Thus, maximizing the likelihood with respect to $\vec{w}$ under the assumption of Gaussian noise is equivalent to minimizing a sum-of-squares error function.
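To illustrate this equivalence numerically (our own toy example; x_toy, y_toy and both helper functions are ours), we can scan a grid of slopes and confirm that the log-likelihood peaks exactly where the sum of squared errors bottoms out:
from scipy.stats import norm
x_toy = np.array([0., 1., 2., 3.])
y_toy = np.array([1., 3., 5., 7.])
def log_lik(w, sigma=1.0):
    'Gaussian log-likelihood of y_toy under the slope-only model w*x.'
    return norm.logpdf(y_toy, loc=w * x_toy, scale=sigma).sum()
def sse(w):
    'Sum of squared errors of the slope-only model w*x.'
    return 0.5 * np.sum((y_toy - w * x_toy)**2)
ws = np.linspace(1.0, 3.0, 201)
print(ws[np.argmax([log_lik(w) for w in ws])],
      ws[np.argmin([sse(w) for w in ws])])  # the same w wins both criteria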
There are two ways to adjust the weights of a linear model (which includes linear models with nonlinear features): solving the normal equations analytically, or minimizing the error iteratively with gradient descent.
In addition, we can add a constraint to the objective function to penalize large weights: Tikhonov regularization (forward pointer).
Solving the equation $\boldsymbol{X} \cdot \boldsymbol{w} = \boldsymbol{y}$ directly for $\boldsymbol{w}$ by inversion is frequently not possible.
With more observations than parameters, the system is overdetermined, so it is usually not possible to find an exact solution.
Hint: The expression $(\vec{X}^\intercal\vec{X})^{-1}\vec{X}^\intercal$ is equivalent to the Moore-Penrose pseudoinverse $\vec{X}^+$ of $\vec{X}$. It is a generalization of the inverse for non-square matrices. You can use it to implement the normal equations if your library provides the function. You could also use a least squares solver (e.g. numpy.linalg.lstsq or dgelsd/zgelsd from LAPACK).
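To see why the pseudoinverse matters (a toy example of our own): with a duplicated feature column, $\vec{X}^\intercal\vec{X}$ is singular, plain inversion fails, and the pseudoinverse still returns a (minimum-norm) solution.
X_sing = np.array([[1., 1.], [2., 2.], [3., 3.]])  # two identical features
y_sing = np.array([1., 2., 3.])
try:
    np.linalg.inv(X_sing.T.dot(X_sing))
except np.linalg.LinAlgError as e:
    print('inv failed:', e)
print('pinv solution:', np.linalg.pinv(X_sing).dot(y_sing))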
def normal_equations(X, y):
    'Solves for w in X w = y via the normal equations.'
    w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    # ... or we can use the Moore-Penrose pseudoinverse (usually more
    # numerically stable):
    #w = np.linalg.pinv(X).dot(y)
# ... or we use the solver
# w = np.linalg.lstsq(X, y)[0]
return w
# Add bias to each row/instance:
X_bias = np.hstack((X, np.ones((N, 1))))
X_test_bias = np.hstack((X_test, np.ones((N_test, 1))))
print('X:',X.T)
print('X_bias:',X_bias.T)
X: [[ 0.  1.  2.  3.  4.]]
X_bias: [[ 0.  1.  2.  3.  4.]
 [ 1.  1.  1.  1.  1.]]
w_norm = normal_equations(X_bias, y)
plt.plot(X_bias[:, 0], y, "o")
plt.plot(X_bias[:, 0], X_bias.dot(w_norm), "m-", linewidth=2)
plt.xlabel("x"); plt.ylabel("y"); plt.title("Model: $y = %.2f x + %.2f$" % tuple(w_norm));
We can now solve linear regression problems at will!
Test the other forms of normal equations.
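A possible starting point for the exercise (our own sketch): on this small problem, all three variants should agree up to numerical precision.
w_inv = np.linalg.inv(X_bias.T.dot(X_bias)).dot(X_bias.T).dot(y)
w_pinv = np.linalg.pinv(X_bias).dot(y)
w_lstsq = np.linalg.lstsq(X_bias, y)[0]
print(np.allclose(w_inv, w_pinv), np.allclose(w_inv, w_lstsq))  # True True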
Think of this problem:
n_pixels = 256 * 256
memory_size_xtx = n_pixels * n_pixels * 8
print("Required memory: %d GB" % (memory_size_xtx / 2**30))
Required memory: 32 GB
Inversion is an operation that requires $O(n^{2.373})$ operations with the best known algorithms ($O(n^3)$ for the straightforward approach).
In cases where the training data set is very large or data is received in a stream, a direct solution using the normal equations may not be possible. An alternative approach is the gradient descent algorithm (or its stochastic variant, which processes samples one at a time).
Iteratively update the weights by following the negative gradient of the error function.
Each component of the update $\Delta \boldsymbol{w}_t$ is minus the partial derivative of the error function $E(\boldsymbol{X}, \boldsymbol{y};\boldsymbol{w}_t)$ with respect to the corresponding weight $w_i$, scaled by the learning rate: $$\Delta w_i = - \alpha\frac{\partial E(\boldsymbol{X}, \boldsymbol{y};\boldsymbol{w})}{\partial w_i},$$
where $\alpha > 0$ is the learning rate.
Update $\boldsymbol{w}$ incrementally in every iteration $t$ as:
$$\boldsymbol{w}(t+1) = \boldsymbol{w}(t) + \Delta \boldsymbol{w}(t).$$
Note: gradient descent is actually a more general rule that can be applied to any minimization problem where the gradient can be computed, e.g., artificial neural networks, support vector machines, $k$-means, etc. (another forward pointer).
Using as error function the sum of squared errors (SSE),
$$E(\boldsymbol{X}, \boldsymbol{y};\boldsymbol{w}) = \frac{1}{2}\left\|\boldsymbol{X} \cdot \boldsymbol{w} - \boldsymbol{y}\right\|^2_2,$$
the gradient becomes
$$\nabla_{\boldsymbol{w}} E(\boldsymbol{X}, \boldsymbol{y};\boldsymbol{w}) = \boldsymbol{X}^\intercal \cdot (\boldsymbol{X} \cdot \boldsymbol{w} - \boldsymbol{y}).$$
def error(X, y, w):
return 0.5*np.linalg.norm(X.dot(w) - y)**2
def linear_regression_gradient(X, y, w):
return X.T.dot(X.dot(w)-y)
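As a quick sanity check (our own addition; w_check, eps and fd are illustrative names), we can compare the analytic gradient with a central finite-difference approximation:
w_check = np.random.random(2)
eps = 1e-6
fd = np.array([(error(X_bias, y, w_check + eps*e) -
                error(X_bias, y, w_check - eps*e)) / (2*eps)
               for e in np.eye(2)])
print(np.allclose(fd, linear_regression_gradient(X_bias, y, w_check), rtol=1e-4))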
Initialize weights to small random values
w = np.random.random(2)/10000
alpha = 0.05
for i in range(50):
# print('Iteration', i+1, 'weights', w, 'error', error(X_bias, y, w))
w = w - alpha*linear_regression_gradient(X_bias, y, w)
print('w_norm:', w_norm, 'w grad. desc.', w)
w_norm: [-2.15 8.7 ] w grad. desc. [-2.089 8.526]
def plot_contour(X_data, y_data, bounds, resolution=50, cmap=cm.viridis,
                 alpha=0.3, ax=None):
    'Plots the error surface over a grid of weight pairs.'
    (minx, miny), (maxx, maxy) = bounds
    x_range = np.linspace(minx, maxx, num=resolution)
    y_range = np.linspace(miny, maxy, num=resolution)
    X, Y = np.meshgrid(x_range, y_range)
    Z = np.zeros((len(y_range), len(x_range)))  # rows follow the y axis
    for i, w_i in enumerate(x_range):
        for j, w_j in enumerate(y_range):
            Z[j, i] = error(X_data, y_data, [w_i, w_j])
    if not ax:
        fig = plt.figure(figsize=(6, 6))
        ax = fig.gca()
    ax.set_aspect('equal')
    ax.autoscale(tight=True)
    # rstride/cstride/linewidth are surface-plot arguments, not valid for
    # contour plots, so they are dropped here.
    ax.contourf(X, Y, Z, 30, cmap=cmap, alpha=alpha)
    cset = ax.contour(X, Y, Z, 10, cmap=cmap)
    plt.clabel(cset, inline=1, fontsize=7)
    return Z
Implementing the gradient descent loop as a function
def gradient_descent(X, y, w_0, alpha, max_iters):
    'Returns the values of the weights as learning took place.'
    w = np.array(w_0, dtype=np.float64)
    w_hist = np.zeros(shape=(max_iters+1, w.shape[0]))
    w_hist[0] = w
    for i in range(max_iters):
        # use the X argument, not the global X_bias
        delta_weights = -alpha*linear_regression_gradient(X, y, w)
        w += delta_weights
        w_hist[i+1] = w
    return w_hist
w_0 = [-3,2] # we fix the initial weights to make it more illustrative
alpha = 0.05
max_iters = 25
w_hist = gradient_descent(X_bias, y, w_0, alpha, max_iters)
fig = plt.figure(figsize=(5,5)); ax = fig.gca()
bounds = (np.min(w_hist, axis=0)-2, np.max(w_hist, axis=0)+2)
plot_contour(X_bias, y, bounds, ax=ax)
ax.scatter(w_norm[0], w_norm[1], color='navy', marker='D', s=50, label='$w_{norm}$')
ax.plot(w_hist[:,0], w_hist[:,1], '.--', color='m')
ax.scatter(w_hist[0,0], w_hist[0,1], color='navy', marker='o', s=65, label='start')
ax.scatter(w_hist[-1,0], w_hist[-1,1], color='navy', marker='s', s=50, label='end')
plt.xlabel('$w_1$'); plt.ylabel('$w_2$'); plt.title('end='+str(w_hist[-1]))
plt.legend(scatterpoints=1, bbox_to_anchor=(1.37,1), frameon=True);
What is the impact of the learning rate ($\alpha$)?
def alphas_study(alphas):
fig = plt.figure(figsize=(11,7))
for i,alpha in enumerate(alphas):
ax = fig.add_subplot(2,3,i+1)
w_hist = gradient_descent(X_bias, y , w_0, alpha, max_iters)
combi=np.hstack((w_norm.reshape(2,1), w_hist.T))
bounds = (np.min(combi, axis=1)-2, np.max(combi, axis=1)+2)
plot_contour(X_bias, y, bounds, ax=ax)
        ax.scatter(w_norm[0], w_norm[1], c='navy', marker='D', s=50, label='$w_{norm}$')
ax.plot(w_hist[:,0], w_hist[:,1], '.--', color='m')
ax.scatter(w_hist[0,0], w_hist[0,1], c='navy', marker='o', s=65, label='start')
ax.scatter(w_hist[-1,0], w_hist[-1,1], c='navy', marker='s', s=50, label='end')
plt.xlabel('$w_1$'); plt.ylabel('$w_2$');
plt.title('$\\alpha='+str(alpha)+'$')
plt.legend(scatterpoints=1, ncol=3, bbox_to_anchor=(-0.2,-0.15), frameon=True);
plt.tight_layout()
alphas = np.linspace(0.02,0.07,6)
alphas
array([ 0.02, 0.03, 0.04, 0.05, 0.06, 0.07])
alphas_study(alphas)
Idea: adding an inertial momentum term to the update can speed up convergence along consistent directions and dampen oscillations.
Update $\vec{w}$ incrementally in every iteration $t$ as:
$$\vec{w}(t+1) = \vec{w}(t) + \Delta \vec{w}(t), \qquad \Delta \vec{w}(t) = -\alpha\,\nabla_{\vec{w}} E(\boldsymbol{X}, \boldsymbol{y};\boldsymbol{w}) + \beta\, \Delta \vec{w}(t-1),$$ where $\beta\in\mathbb{R}^+$ is known as the momentum rate; note that $\Delta \vec{w}(t)$ already contains the learning rate $\alpha$, matching the implementation below.
def gradient_descent_with_momentum(X, y, w_0, alpha, beta, max_iters):
w = np.array(w_0, dtype=np.float64)
w_hist = np.zeros(shape=(max_iters+1, w.shape[0]))
w_hist[0] = w
omega = np.zeros_like(w)
for i in range(max_iters):
delta_weights = -alpha*linear_regression_gradient(X, y, w) + beta*omega
omega = delta_weights
w += delta_weights
w_hist[i+1] = w
return w_hist
alpha = 0.05
beta = 0.5
max_iters = 25
w_hist = gradient_descent(X_bias, y, (-3,2), alpha, max_iters)
w_hist_mom = gradient_descent_with_momentum(X_bias,y, (-3,2), alpha, beta, max_iters)
def comparison_plot():
fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(121)
combi=np.hstack((w_norm.reshape(2,1), w_hist.T))
bounds = (np.min(combi, axis=1)-2, np.max(combi, axis=1)+2)
plot_contour(X_bias, y, bounds, ax=ax)
    ax.scatter(w_norm[0], w_norm[1], c='lightcoral', marker='D', s=50, label='$w_{norm}$')
ax.plot(w_hist[:,0], w_hist[:,1], '.-', c='navy')
ax.scatter(w_hist[0,0], w_hist[0,1], c='lightskyblue', marker='o', s=65, label='start')
ax.scatter(w_hist[-1,0], w_hist[-1,1], c='lightskyblue', marker='s', s=50, label='end')
plt.xlabel('$w_1$'); plt.ylabel('$w_2$');plt.title('Gradient descent')
ax = fig.add_subplot(122)
plot_contour(X_bias, y, bounds, ax=ax)
    ax.scatter(w_norm[0], w_norm[1], c='lightcoral', marker='D', s=50, label='$w_{norm}$')
ax.plot(w_hist_mom[:,0], w_hist_mom[:,1], '.-', c='m')
ax.scatter(w_hist_mom[0,0], w_hist_mom[0,1], c='lightskyblue', marker='o', s=65, label='start')
ax.scatter(w_hist_mom[-1,0], w_hist_mom[-1,1], c='lightskyblue', marker='s', s=50, label='end')
plt.xlabel('$w_1$'); plt.ylabel('$w_2$');plt.title('Gradient descent with momentum')
plt.legend(scatterpoints=1, bbox_to_anchor=(1.44,1), frameon=True)
plt.tight_layout()
comparison_plot()
def alphas_study_with_momentum(alphas, beta):
fig = plt.figure(figsize=(11,7))
for i,alpha in enumerate(alphas):
ax = fig.add_subplot(2,3,i+1)
w_hist = gradient_descent_with_momentum(X_bias, y , w_0, alpha, beta, max_iters)
combi=np.hstack((w_norm.reshape(2,1), w_hist.T))
bounds = (np.min(combi, axis=1)-2, np.max(combi, axis=1)+2)
        plot_contour(X_bias, y, bounds, ax=ax)
        ax.scatter(w_norm[0], w_norm[1], c='lightcoral', marker='D', s=50, label='$w_{norm}$')
ax.plot(w_hist[:,0], w_hist[:,1], '.-', c='m')
ax.scatter(w_hist[0,0], w_hist[0,1], c='lightskyblue', marker='o', s=65, label='start')
ax.scatter(w_hist[-1,0], w_hist[-1,1], c='lightskyblue', marker='s', s=50, label='end')
plt.xlabel('$w_1$'); plt.ylabel('$w_2$');
plt.title('$\\alpha='+str(alpha)+'$')
plt.legend(scatterpoints=1, ncol=3, bbox_to_anchor=(-0.2,-0.15), frameon=True);
plt.tight_layout()
alphas_study_with_momentum(alphas, 0.5)
alphas_study_with_momentum(alphas, 0.25)
alphas_study_with_momentum(alphas, 0.75)
... I know you are wondering what $\beta=1$ looks like.
alphas_study_with_momentum(alphas, 1)
plt.plot(X_bias[:, 0], y, "o")
plt.plot(X_bias[:, 0], X_bias.dot(w_norm), "D-", c='lightcoral',
linewidth=2, label='normal equation')
plt.plot(X_bias[:, 0], X_bias.dot(w_hist[-1]), "-", c='skyblue',
linewidth=2, label='gradient descent')
plt.plot(X_bias[:, 0], X_bias.dot(w_hist_mom[-1]), "-", c='forestgreen',
linewidth=2, alpha=0.5, label='gradient descent with momentum')
plt.xlabel("x"); plt.ylabel("y"); plt.legend(numpoints=1, bbox_to_anchor=(1.4,1), frameon=True);
N_test = 10000000
X_test = np.linspace(0.0, 4.0, N_test).reshape((N_test, 1))
y_test = np.linspace(7.0, -5.0, N_test)+ 2*np.random.randn(N_test)
X_test_bias = np.hstack((X_test, np.ones((N_test, 1))))
%time w_norm = normal_equations(X_test_bias, y_test)
CPU times: user 295 ms, sys: 113 ms, total: 409 ms Wall time: 436 ms
%time w_grad = gradient_descent_with_momentum(X_test_bias, y_test, np.random.random(2)/1000, 0.000001, 0.25, 25)
CPU times: user 4.86 s, sys: 1.39 s, total: 6.24 s Wall time: 4.29 s
w_norm
array([-3.001, 7.002])
w_grad[-1]
array([ -6.263e+43, -2.449e+43])
np.linalg.norm(w_norm - w_grad[-1])
6.7249641881311471e+43
Gradient descent has diverged: the batch gradient scales with the number of samples, so a step size suitable for five points explodes with ten million.
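A common remedy, sketched below with our own helper (not part of the original notebook), is to average the gradient over the samples so that the learning rate becomes independent of the dataset size:
def mean_gradient_descent(X, y, w_0, alpha, max_iters):
    'Gradient descent on the mean (per-sample) squared error.'
    w = np.array(w_0, dtype=np.float64)
    for _ in range(max_iters):
        w -= alpha * X.T.dot(X.dot(w) - y) / len(y)
    return w
w_mean = mean_gradient_descent(X_test_bias, y_test, np.random.random(2)/1000,
                               alpha=0.1, max_iters=200)
print(w_mean)  # should move toward w_norm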
To approximate nonlinear functions with a linear model, we have to generate nonlinear features.
In this example, we generate sinusoidal features. You could also try radial basis functions, polynomials, ...
We expand each feature $x_j$ to the nonlinear feature vector $$\left\langle\cos\left(\frac{0}{d} \pi x_j\right), \cos\left(\frac{1}{d} \pi x_j\right), \ldots, \cos\left(\frac{d}{d} \pi x_j\right)\right\rangle^\intercal,$$ where $d$ is the degree of the expansion (yielding $d+1$ basis functions).
Note that $\cos \left(\frac{0}{d} \pi x_j \right)=1$ is the bias we added manually in the previous example.
from itertools import chain
def sinusoidalize(X, n_degree):
X_sinusoidal = np.ndarray((len(X), n_degree+1))
for d in range(n_degree+1):
X_sinusoidal[:, d] = np.cos(X[:, 0] * np.pi * d / n_degree)
return X_sinusoidal
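A quick sanity check of our own: the $d=0$ feature should be the constant bias column.
print(sinusoidalize(X, n_degree=3)[:, 0])  # expect a column of ones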
# Utility function
def build_sinusoidal(n_degree, w):
    'Formats the learned cosine expansion as a LaTeX string for plot titles.'
    return (((r"%+.2f \cos(%d /" + str(n_degree) + r" \pi x)") * (n_degree+1))
            % tuple(chain(*zip(w, range(n_degree+1)))))
n_degree = 4
X_sinusoidal = sinusoidalize(X, n_degree=n_degree)
X_test_sinusoidal = sinusoidalize(X_test, n_degree=n_degree)
w_sin = normal_equations(X_sinusoidal, y)
plt.plot(X_bias[:, 0], X_bias.dot(w_norm), "m-", linewidth=2, label='linear')
plt.plot(X_test_bias[:, 0], X_test_sinusoidal.dot(w_sin), "-", linewidth=2, color="r", label='sinusoidal')
plt.plot(X_bias[:, 0], y, "bo", label='training')
plt.xlabel("x"); plt.ylabel("y"); plt.legend(frameon=True)
plt.title("Model: $y = " + build_sinusoidal(n_degree, w_sin) + "$");
You might be wondering, is this cheating?
Techniques used to account for overfitting or underfitting of a given model when using a maximum likelihood approach.
The error function to be minimized is augmented with a penalty term, $$E(\vec{w}) = E_D(\vec{w}) + \lambda E_W(\vec{w}),$$ where $\lambda$ is the regularization coefficient, $E_D$ is the data-dependent error, and $E_W$ is the regularization term.
A commonly used regularization term is the sum-of-squares of the model parameters, $$E_W(\vec{w}) = \frac{1}{2}\vec{w}^\intercal\vec{w},$$ known as the weight decay regularizer.
This regularization term leads, assuming a linear regression model with Gaussian noise on the training data, to the optimal solution $$\vec{w} = \left(\lambda \vec{I} + \vec{\Phi}^\intercal \vec{\Phi}\right)^{-1} \vec{\Phi}^\intercal\vec{y}.$$
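As an aside, a minimal sketch of this closed form (ridge_solution is our own helper, distinct from the worked example below; np.linalg.solve is usually preferable to forming an explicit inverse):
def ridge_solution(Phi, t, lam):
    'Weight-decay solution: solves (lam*I + Phi^T Phi) w = Phi^T t.'
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T.dot(Phi),
                           Phi.T.dot(t))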
def y(x, m, b, mu=0, sigma=1):
    'Noisy line m*x + b + Gaussian noise (note: shadows the vector y used above).'
    return m*x + b + np.random.normal(mu, sigma, 1)[0]
def el_pow(x, pow):
    'Element-wise power of the array x (equivalent to x**pow).'
    temp = x
    for i in range(pow-1):
        temp = temp * x
    return temp
def prediction(params, x):
    'Evaluates the polynomial sum_i params[i] * x**i at the scalar x.'
    pred = 0
    for i in range(len(params)):
        pred += params[i]*math.pow(x, i)
    return pred
#training data, with N data points
N = 101
M = 8
t = np.empty(N)
domain = np.empty(N)
domain_bound = 1.0/N
for i in range(N):
domain[i] = i*domain_bound
for i in range(N):
t[i] = y(x=domain[i],m=4.89,b=0.57)
#find the solution without using regularization
#design matrix, phi, N X M
phi = np.array([np.ones(N), domain, el_pow(domain,2), el_pow(domain,3),
el_pow(domain,4), el_pow(domain,5), el_pow(domain,6),el_pow(domain,7)]).T
temp1 = np.linalg.inv(np.dot(phi.T,phi)) #inverse of phi.T X phi
temp2 = np.dot(temp1, phi.T)
w1 = np.dot(temp2,t) #solution
predicted_t = [prediction(w1,x) for x in domain]
print('w1=',w1)
w1= [ 1.784 -31.545 297.95 -1047.916 1790.514 -1433.442 386.297 42.499]
#find the regularized solution
lam = 0.1
temp1 = np.linalg.inv(lam*np.eye(M)+np.dot(phi.T,phi))
temp2 = np.dot(temp1,phi.T)
w2 = np.dot(temp2,t)
print('w2=',w2)
predicted_t_reg = [prediction(w2,x) for x in domain]
w2= [ 0.777 3.509 1.983 0.232 -0.47 -0.538 -0.281 0.135]
plt.plot(domain,t, label='training')
plt.plot(domain,predicted_t, label='predicted')
plt.plot(domain,predicted_t_reg, label='regularized')
plt.legend(loc='lower right', frameon=True);
%load_ext version_information
%version_information scipy, numpy, matplotlib
Software | Version |
---|---|
Python | 3.6.1 64bit [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] |
IPython | 5.3.0 |
OS | Darwin 16.5.0 x86_64 i386 64bit |
scipy | 0.19.0 |
numpy | 1.12.1 |
matplotlib | 2.0.0 |
Sat Apr 08 17:05:34 2017 -03 |
# this code is here for cosmetic reasons
from IPython.core.display import HTML
from urllib.request import urlopen
HTML(urlopen('https://raw.githubusercontent.com/lmarti/jupyter_custom/master/custom.include').read().decode('utf-8'))