Abstract: In his philosophical essay on probabilities, Laplace set up the deterministic universe as a straw man when motivating prediction, suggesting that it is our ignorance of data and models that drives the need to turn to probability. Bayesian formalisms deal with uncertainty in the parameters of the model. In this lecture we review the Bayesian formalism in the context of linear models, first revisiting maximum likelihood and then introducing basis functions as a way of bringing non-linearity into the model.
$$ \newcommand{\Amatrix}{\mathbf{A}} \newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)} \newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}} \newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}} \newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}} \newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}} \newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}} \newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}} \newcommand{\Kuui}{\Kuu^{-1}} \newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}} \newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}} \newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}} \newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\aMatrix}{\mathbf{A}} \newcommand{\aScalar}{a} \newcommand{\aVector}{\mathbf{a}} \newcommand{\acceleration}{a} \newcommand{\bMatrix}{\mathbf{B}} \newcommand{\bScalar}{b} \newcommand{\bVector}{\mathbf{b}} \newcommand{\basisFunc}{\phi} \newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}} \newcommand{\basisFunction}{\phi} \newcommand{\basisLocation}{\mu} \newcommand{\basisMatrix}{\boldsymbol{ \Phi}} \newcommand{\basisScalar}{\basisFunction} \newcommand{\basisVector}{\boldsymbol{ \basisFunction}} \newcommand{\activationFunction}{\phi} \newcommand{\activationMatrix}{\boldsymbol{ \Phi}} \newcommand{\activationScalar}{\basisFunction} \newcommand{\activationVector}{\boldsymbol{ \basisFunction}} \newcommand{\bigO}{\mathcal{O}} \newcommand{\binomProb}{\pi} \newcommand{\cMatrix}{\mathbf{C}} \newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}} \newcommand{\cdataMatrix}{\hat{\dataMatrix}} \newcommand{\cdataScalar}{\hat{\dataScalar}} \newcommand{\cdataVector}{\hat{\dataVector}} \newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}} \newcommand{\centeredKernelScalar}{b} \newcommand{\centeredKernelVector}{\centeredKernelScalar} \newcommand{\centeringMatrix}{\mathbf{H}} \newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)} \newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}} \newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}} \newcommand{\coregionalizationMatrix}{\mathbf{B}} \newcommand{\coregionalizationScalar}{b} \newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}} \newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)} \newcommand{\covSamp}[1]{\text{cov}\left(#1\right)} \newcommand{\covarianceScalar}{c} \newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}} \newcommand{\covarianceMatrix}{\mathbf{C}} \newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}} \newcommand{\croupierScalar}{s} \newcommand{\croupierVector}{\mathbf{ \croupierScalar}} \newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}} \newcommand{\dataDim}{p} \newcommand{\dataIndex}{i} \newcommand{\dataIndexTwo}{j} \newcommand{\dataMatrix}{\mathbf{Y}} \newcommand{\dataScalar}{y} \newcommand{\dataSet}{\mathcal{D}} \newcommand{\dataStd}{\sigma} \newcommand{\dataVector}{\mathbf{ \dataScalar}} \newcommand{\decayRate}{d} \newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}} \newcommand{\degreeScalar}{d} \newcommand{\degreeVector}{\mathbf{ \degreeScalar}} % Already defined by latex %\newcommand{\det}[1]{\left|#1\right|} \newcommand{\diag}[1]{\text{diag}\left(#1\right)} \newcommand{\diagonalMatrix}{\mathbf{D}} \newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}} 
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}} \newcommand{\displacement}{x} \newcommand{\displacementVector}{\textbf{\displacement}} \newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}} \newcommand{\distanceScalar}{d} \newcommand{\distanceVector}{\mathbf{ \distanceScalar}} \newcommand{\eigenvaltwo}{\ell} \newcommand{\eigenvaltwoMatrix}{\mathbf{L}} \newcommand{\eigenvaltwoVector}{\mathbf{l}} \newcommand{\eigenvalue}{\lambda} \newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}} \newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}} \newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}} \newcommand{\eigenvectorMatrix}{\mathbf{U}} \newcommand{\eigenvectorScalar}{u} \newcommand{\eigenvectwo}{\mathbf{v}} \newcommand{\eigenvectwoMatrix}{\mathbf{V}} \newcommand{\eigenvectwoScalar}{v} \newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)} \newcommand{\errorFunction}{E} \newcommand{\expDist}[2]{\left<#1\right>_{#2}} \newcommand{\expSamp}[1]{\left<#1\right>} \newcommand{\expectation}[1]{\left\langle #1 \right\rangle } \newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}} \newcommand{\expectedDistanceMatrix}{\mathcal{D}} \newcommand{\eye}{\mathbf{I}} \newcommand{\fantasyDim}{r} \newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}} \newcommand{\fantasyScalar}{z} \newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}} \newcommand{\featureStd}{\varsigma} \newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)} \newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)} \newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)} \newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)} \newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)} \newcommand{\given}{|} \newcommand{\half}{\frac{1}{2}} \newcommand{\heaviside}{H} \newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}} \newcommand{\hiddenScalar}{h} \newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}} \newcommand{\identityMatrix}{\eye} \newcommand{\inducingInputScalar}{z} \newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}} \newcommand{\inducingInputMatrix}{\mathbf{Z}} \newcommand{\inducingScalar}{u} \newcommand{\inducingVector}{\mathbf{ \inducingScalar}} \newcommand{\inducingMatrix}{\mathbf{U}} \newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2} \newcommand{\inputDim}{q} \newcommand{\inputMatrix}{\mathbf{X}} \newcommand{\inputScalar}{x} \newcommand{\inputSpace}{\mathcal{X}} \newcommand{\inputVals}{\inputVector} \newcommand{\inputVector}{\mathbf{ \inputScalar}} \newcommand{\iterNum}{k} \newcommand{\kernel}{\kernelScalar} \newcommand{\kernelMatrix}{\mathbf{K}} \newcommand{\kernelScalar}{k} \newcommand{\kernelVector}{\mathbf{ \kernelScalar}} \newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}} \newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}} \newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}} \newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}} \newcommand{\lagrangeMultiplier}{\lambda} \newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}} \newcommand{\lagrangian}{L} \newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}} \newcommand{\laplacianFactorScalar}{m} \newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}} \newcommand{\laplacianMatrix}{\mathbf{L}} \newcommand{\laplacianScalar}{\ell} \newcommand{\laplacianVector}{\mathbf{ \ell}} \newcommand{\latentDim}{q} \newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}} 
\newcommand{\latentDistanceScalar}{\delta} \newcommand{\latentDistanceVector}{\boldsymbol{ \delta}} \newcommand{\latentForce}{f} \newcommand{\latentFunction}{u} \newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}} \newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}} \newcommand{\latentIndex}{j} \newcommand{\latentScalar}{z} \newcommand{\latentVector}{\mathbf{ \latentScalar}} \newcommand{\latentMatrix}{\mathbf{Z}} \newcommand{\learnRate}{\eta} \newcommand{\lengthScale}{\ell} \newcommand{\rbfWidth}{\ell} \newcommand{\likelihoodBound}{\mathcal{L}} \newcommand{\likelihoodFunction}{L} \newcommand{\locationScalar}{\mu} \newcommand{\locationVector}{\boldsymbol{ \locationScalar}} \newcommand{\locationMatrix}{\mathbf{M}} \newcommand{\variance}[1]{\text{var}\left( #1 \right)} \newcommand{\mappingFunction}{f} \newcommand{\mappingFunctionMatrix}{\mathbf{F}} \newcommand{\mappingFunctionTwo}{g} \newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}} \newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}} \newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}} \newcommand{\scaleScalar}{s} \newcommand{\mappingScalar}{w} \newcommand{\mappingVector}{\mathbf{ \mappingScalar}} \newcommand{\mappingMatrix}{\mathbf{W}} \newcommand{\mappingScalarTwo}{v} \newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}} \newcommand{\mappingMatrixTwo}{\mathbf{V}} \newcommand{\maxIters}{K} \newcommand{\meanMatrix}{\mathbf{M}} \newcommand{\meanScalar}{\mu} \newcommand{\meanTwoMatrix}{\mathbf{M}} \newcommand{\meanTwoScalar}{m} \newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}} \newcommand{\meanVector}{\boldsymbol{ \meanScalar}} \newcommand{\mrnaConcentration}{m} \newcommand{\naturalFrequency}{\omega} \newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)} \newcommand{\neilurl}{http://inverseprobability.com/} \newcommand{\noiseMatrix}{\boldsymbol{ E}} \newcommand{\noiseScalar}{\epsilon} \newcommand{\noiseVector}{\boldsymbol{ \epsilon}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}} \newcommand{\normalizedLaplacianScalar}{\hat{\ell}} \newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}} \newcommand{\numActive}{m} \newcommand{\numBasisFunc}{m} \newcommand{\numComponents}{m} \newcommand{\numComps}{K} \newcommand{\numData}{n} \newcommand{\numFeatures}{K} \newcommand{\numHidden}{h} \newcommand{\numInducing}{m} \newcommand{\numLayers}{\ell} \newcommand{\numNeighbors}{K} \newcommand{\numSequences}{s} \newcommand{\numSuccess}{s} \newcommand{\numTasks}{m} \newcommand{\numTime}{T} \newcommand{\numTrials}{S} \newcommand{\outputIndex}{j} \newcommand{\paramVector}{\boldsymbol{ \theta}} \newcommand{\parameterMatrix}{\boldsymbol{ \Theta}} \newcommand{\parameterScalar}{\theta} \newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}} \newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}} \newcommand{\precisionScalar}{j} \newcommand{\precisionVector}{\mathbf{ \precisionScalar}} \newcommand{\precisionMatrix}{\mathbf{J}} \newcommand{\pseudotargetScalar}{\widetilde{y}} \newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}} \newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}} \newcommand{\rank}[1]{\text{rank}\left(#1\right)} \newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)} \newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)} \newcommand{\responsibility}{r} \newcommand{\rotationScalar}{r} \newcommand{\rotationVector}{\mathbf{ \rotationScalar}} \newcommand{\rotationMatrix}{\mathbf{R}} 
\newcommand{\sampleCovScalar}{s} \newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}} \newcommand{\sampleCovMatrix}{\mathbf{s}} \newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle} \newcommand{\sign}[1]{\text{sign}\left(#1\right)} \newcommand{\sigmoid}[1]{\sigma\left(#1\right)} \newcommand{\singularvalue}{\ell} \newcommand{\singularvalueMatrix}{\mathbf{L}} \newcommand{\singularvalueVector}{\mathbf{l}} \newcommand{\sorth}{\mathbf{u}} \newcommand{\spar}{\lambda} \newcommand{\trace}[1]{\text{tr}\left(#1\right)} \newcommand{\BasalRate}{B} \newcommand{\DampingCoefficient}{C} \newcommand{\DecayRate}{D} \newcommand{\Displacement}{X} \newcommand{\LatentForce}{F} \newcommand{\Mass}{M} \newcommand{\Sensitivity}{S} \newcommand{\basalRate}{b} \newcommand{\dampingCoefficient}{c} \newcommand{\mass}{m} \newcommand{\sensitivity}{s} \newcommand{\springScalar}{\kappa} \newcommand{\springVector}{\boldsymbol{ \kappa}} \newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}} \newcommand{\tfConcentration}{p} \newcommand{\tfDecayRate}{\delta} \newcommand{\tfMrnaConcentration}{f} \newcommand{\tfVector}{\mathbf{ \tfConcentration}} \newcommand{\velocity}{v} \newcommand{\sufficientStatsScalar}{g} \newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}} \newcommand{\sufficientStatsMatrix}{\mathbf{G}} \newcommand{\switchScalar}{s} \newcommand{\switchVector}{\mathbf{ \switchScalar}} \newcommand{\switchMatrix}{\mathbf{S}} \newcommand{\tr}[1]{\text{tr}\left(#1\right)} \newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1} \newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2} \newcommand{\onenorm}[1]{\left\vert#1\right\vert_1} \newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert} \newcommand{\vScalar}{v} \newcommand{\vVector}{\mathbf{v}} \newcommand{\vMatrix}{\mathbf{V}} \newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)} % Already defined by latex %\newcommand{\vec}{#1:} \newcommand{\vecb}[1]{\left(#1\right):} \newcommand{\weightScalar}{w} \newcommand{\weightVector}{\mathbf{ \weightScalar}} \newcommand{\weightMatrix}{\mathbf{W}} \newcommand{\weightedAdjacencyMatrix}{\mathbf{A}} \newcommand{\weightedAdjacencyScalar}{a} \newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}} \newcommand{\onesVector}{\mathbf{1}} \newcommand{\zerosVector}{\mathbf{0}} $$. . .
$$ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}$$. . .
import numpy as np
import matplotlib.pyplot as plt
import pods
import teaching_plots as plot
import mlai
The first thing we will do is load a standard data set for regression modelling. The data consists of the pace of Olympic Gold Medal Marathon winners for the Olympics from 1896 to the present. First we load in the data and plot it.
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']
offset = y.mean()
scale = np.sqrt(y.var())
xlim = (1875,2030)
ylim = (2.5, 6.5)
yhat = (y-offset)/scale
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
ax.set_xlabel('year', fontsize=20)
ax.set_ylabel('pace min/km', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig, filename='../slides/diagrams/datasets/olympic-marathon.svg', transparent=True, frameon=True)
- Gold medal times for the Olympic Marathon since 1896.

![image](../slides/diagrams/Stephen_Kiprotich.jpg)

Image of Stephen Kiprotich from Wikimedia Commons.
$\dataScalar_i$ : winning pace.
$\inputScalar_i$ : year of Olympics.
$m$ : rate of improvement over time.
$c$ : winning time at year 0.
import teaching_plots as plot
plot.over_determined_system(diagrams='../slides/diagrams/ml')
from ipywidgets import IntSlider
import pods
pods.notebook.display_plots('over_determined_system{samp:0>3}.svg',
directory='../slides/diagrams/ml',
samp=IntSlider(1,1,8,1))
\startslides{over_determined_system}{1}{8}
. . .
point 1: $\inputScalar = 1$, $\dataScalar=3$ $$ 3 = m + c $$
. . .
point 2: $\inputScalar = 3$, $\dataScalar=1$ $$ 1 = 3m + c $$
. . .
point 3: $\inputScalar = 2$, $\dataScalar=2.5$
$$2.5 = 2m + c$$. . .
point 1: $\inputScalar = 1$, $\dataScalar=3$ $$ 3 = m + c + \noiseScalar_1 $$
. . .
point 2: $\inputScalar = 3$, $\dataScalar=1$ $$ 1 = 3m + c + \noiseScalar_2 $$
. . .
point 3: $\inputScalar = 2$, $\dataScalar=2.5$ $$ 2.5 = 2m + c + \noiseScalar_3 $$
. . .
Set the mean of the Gaussian to be a function. $$p \left(\dataScalar_i|\inputScalar_i\right)=\frac{1}{\sqrt{2\pi\dataStd^2}}\exp \left(-\frac{\left(\dataScalar_i-\mappingFunction\left(\inputScalar_i\right)\right)^{2}}{2\dataStd^2}\right). $$
. . .
This gives us a 'noisy function'.
. . .
This is known as a stochastic process.
. . .
$$\begin{align} p(\dataScalar| \meanScalar, \dataStd^2) & = \frac{1}{\sqrt{2\pi\dataStd^2}}\exp\left(-\frac{(\dataScalar - \meanScalar)^2}{2\dataStd^2}\right)\\& \buildrel\triangle\over = \gaussianDist{\dataScalar}{\meanScalar}{\dataStd^2} \end{align}$$

import teaching_plots as plot
plot.gaussian_of_height(diagrams='../slides/diagrams/ml')
. . .
. . .
[Sum of Gaussian variables is also Gaussian.]{align="left"}
$$\dataScalar_i \sim \gaussianSamp{\meanScalar_i}{\sigma_i^2}$$. . .
[And the sum is distributed as]{align="left"}
$$\sum_{i=1}^{\numData} \dataScalar_i \sim \gaussianSamp{\sum_{i=1}^\numData \meanScalar_i}{\sum_{i=1}^\numData \sigma_i^2}$$. . .
(Aside: as the number of terms in the sum increases, a sum of non-Gaussian variables with finite variance also tends to a Gaussian, by the central limit theorem.)
. . .
[Scaling a Gaussian leads to a Gaussian.]{align="left"}
. . .
$$\dataScalar \sim \gaussianSamp{\meanScalar}{\sigma^2}$$. . .
[And the scaled variable is distributed as]{align="left"}
$$\mappingScalar \dataScalar \sim \gaussianSamp{\mappingScalar\meanScalar}{\mappingScalar^2 \sigma^2}.$$

Set the mean of the Gaussian to be a function.
. . .
$$p\left(\dataScalar_i|\inputScalar_i\right)=\frac{1}{\sqrt{2\pi\dataStd^2}}\exp\left(-\frac{\left(\dataScalar_i-f\left(\inputScalar_i\right)\right)^{2}}{2\dataStd^2}\right).$$. . .
This gives us a 'noisy function'.
. . .
This is known as a stochastic process.
The standard Gaussian is parameterized by its mean and variance.
Make the mean a linear function of an input.
This leads to a regression model. $$ \begin{align*} \dataScalar_i=&\mappingFunction\left(\inputScalar_i\right)+\noiseScalar_i,\\ \noiseScalar_i \sim & \gaussianSamp{0}{\dataStd^2}. \end{align*} $$
Assume $\dataScalar_i$ is height and $\inputScalar_i$ is weight.
Likelihood of an individual data point $$ p\left(\dataScalar_i|\inputScalar_i,m,c\right)=\frac{1}{\sqrt{2\pi \dataStd^2}}\exp\left(-\frac{\left(\dataScalar_i-m\inputScalar_i-c\right)^{2}}{2\dataStd^2}\right). $$ Parameters are gradient, $m$, offset, $c$ of the function and noise variance $\dataStd^2$.
If the noise, $\noiseScalar_i$, is sampled independently for each data point, then each data point is independent (given $m$ and $c$). For independent variables: $$ p(\dataVector) = \prod_{i=1}^\numData p(\dataScalar_i) $$ $$ p(\dataVector|\inputVector, m, c) = \prod_{i=1}^\numData p(\dataScalar_i|\inputScalar_i, m, c) $$
i.i.d. assumption $$ p(\dataVector|\inputVector, m, c) = \prod_{i=1}^\numData \frac{1}{\sqrt{2\pi \dataStd^2}}\exp \left(-\frac{\left(\dataScalar_i- m\inputScalar_i-c\right)^{2}}{2\dataStd^2}\right). $$ $$ p(\dataVector|\inputVector, m, c) = \frac{1}{\left(2\pi \dataStd^2\right)^{\frac{\numData}{2}}}\exp\left(-\frac{\sum_{i=1}^\numData\left(\dataScalar_i-m\inputScalar_i-c\right)^{2}}{2\dataStd^2}\right). $$
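The log of this likelihood is what we work with in practice. As a quick check on the algebra, here is a minimal sketch of the computation (the function name log_likelihood and the argument sigma2 are ours, not part of the lecture code); it assumes x and y are numpy column vectors of the same length.

def log_likelihood(m, c, sigma2, x, y):
    """Log likelihood of the data under the linear model with Gaussian noise."""
    n = y.shape[0]
    resid = y - m*x - c
    return -0.5*n*np.log(2*np.pi*sigma2) - 0.5*(resid**2).sum()/sigma2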
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']
print(x)
print(y)
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(x, y, 'rx')
plt.xlabel('year')
plt.ylabel('pace in min/km')
m = -0.4
c = 80
# set c to the value that minimizes the error given m
c = (y - m*x).mean()
print(c)
[Worked example.]{align="left"} $$ \begin{aligned} c^{*}=&\frac{\sum _{i=1}^{\numData}\left(\dataScalar_i-m^{*}\inputScalar_i\right)}{\numData},\\ m^{*}=&\frac{\sum _{i=1}^{\numData}\inputScalar_i\left(\dataScalar_i-c^{*}\right)}{\sum _{i=1}^{\numData}\inputScalar_i^{2}},\\ \left.\dataStd^2\right.^{*}=&\frac{\sum _{i=1}^{\numData}\left(\dataScalar_i-m^{*}\inputScalar_i-c^{*}\right)^{2}}{\numData} \end{aligned} $$
m = ((y - c)*x).sum()/(x**2).sum()
print(m)
import numpy as np
x_test = np.linspace(1890, 2020, 130)[:, None]
f_test = m*x_test + c
plt.plot(x_test, f_test, 'b-')
plt.plot(x, y, 'rx')
for i in np.arange(10):
    m = ((y - c)*x).sum()/(x*x).sum()
    c = (y - m*x).sum()/y.shape[0]
print(m)
print(c)
f_test = m*x_test + c
plt.plot(x_test, f_test, 'b-')
plt.plot(x, y, 'rx')
$$\mappingFunction(\inputScalar_i) = m\inputScalar_i + c$$
Need an algorithm to fit it.
Least squares: minimize an error.
import numpy as np
import matplotlib.pyplot as plt
import mlai
x = np.random.normal(size=4)
We now need to decide on a true value for $m$ and a true value for $c$ to use for generating the data.
m_true = 1.4
c_true = -3.1
We can use these values to create our artificial data. The formula $$\dataScalar_i = m\inputScalar_i + c$$ is translated to code as follows:
y = m_true*x+c_true
We can now plot the artifical data we've created.
plt.plot(x, y, 'r.', markersize=10) # plot data as red dots
plt.xlim([-3, 3])
mlai.write_figure(filename="../slides/diagrams/ml/regression.svg", transparent=True)
These points lie exactly on a straight line, that's not very realistic, let's corrupt them with a bit of Gaussian 'noise'.
noise = np.random.normal(scale=0.5, size=4) # standard deviation of the noise is 0.5
y = m_true*x + c_true + noise
plt.plot(x, y, 'r.', markersize=10)
plt.xlim([-3, 3])
mlai.write_figure(filename="../slides/diagrams/ml/regression_noise.svg", transparent=True)
# create an array of linearly spaced values around m_true
m_vals = np.linspace(m_true-3, m_true+3, 100)
# create an array of linearly spaced values around c_true
c_vals = np.linspace(c_true-3, c_true+3, 100)
m_grid, c_grid = np.meshgrid(m_vals, c_vals)
E_grid = np.zeros((100, 100))
for i in range(100):
    for j in range(100):
        E_grid[i, j] = ((y - m_grid[i, j]*x - c_grid[i, j])**2).sum()
%load -s regression_contour teaching_plots.py
f, ax = plt.subplots(figsize=(5,5))
regression_contour(f, ax, m_vals, c_vals, E_grid)
mlai.write_figure(filename='../slides/diagrams/ml/regression_contour.svg')
m_star = 0.0
c_star = -5.0
c_grad = -2*(y-m_star*x - c_star).sum()
print("Gradient with respect to c is ", c_grad)
To see how the gradient was derived, first note that the $c$ appears in every term in the sum. So we are just differentiating $(\dataScalar_i - m\inputScalar_i - c)^2$ for each term in the sum. The gradient of this term with respect to $c$ is simply the gradient of the outer quadratic, multiplied by the gradient with respect to $c$ of the part inside the quadratic. The gradient of a quadratic is two times the argument of the quadratic, and the gradient of the inside linear term is just minus one. This is true for all terms in the sum, so we are left with the sum in the gradient.
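Written out, the error function and its gradient with respect to $c$ are
$$\errorFunction(m, c) = \sum_{i=1}^\numData \left(\dataScalar_i - m\inputScalar_i - c\right)^2, \qquad \frac{\text{d}\errorFunction(m, c)}{\text{d} c} = -2\sum_{i=1}^\numData \left(\dataScalar_i - m\inputScalar_i - c\right),$$
which is exactly what the code above computes.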
The gradient with respect to $m$ is similar, but now the gradient of the quadratic's argument is $-\inputScalar_i$, so the gradient with respect to $m$ is
$$\frac{\text{d}\errorFunction(m, c)}{\text{d} m} = -2\sum_{i=1}^\numData \inputScalar_i(\dataScalar_i - m\inputScalar_i - c),$$
which can be implemented in python (numpy) as
m_grad = -2*(x*(y-m_star*x - c_star)).sum()
print("Gradient with respect to m is ", m_grad)
import teaching_plots as plot
f, ax = plt.subplots(figsize=plot.big_figsize)
plot.regression_contour(f, ax, m_vals, c_vals, E_grid)
ax.plot(m_star, c_star, 'g*', markersize=20)
ax.arrow(m_star, c_star, -m_grad*0.1, -c_grad*0.1, head_width=0.2)
mlai.write_figure(filename='../slides/diagrams/ml/regression_contour_step001.svg', transparent=True)
The step size has already been introduced; it's again known as the learning rate and is denoted by $\learnRate$.
$$ c_\text{new}\leftarrow c_{\text{old}} - \learnRate \frac{\text{d}\errorFunction(m, c)}{\text{d}c} $$
gives us an update for our estimate of $c$ (which in the code we've been calling `c_star` to represent a common way of writing a parameter estimate, $c^*$) and
$$ m_\text{new} \leftarrow m_{\text{old}} - \learnRate \frac{\text{d}\errorFunction(m, c)}{\text{d}m} $$
gives us an update for $m$.
print("Original m was", m_star, "and original c was", c_star)
learn_rate = 0.01
c_star = c_star - learn_rate*c_grad
m_star = m_star - learn_rate*m_grad
print("New m is", m_star, "and new c is", c_star)
num_plots = plot.regression_contour_fit(x, y, diagrams='../slides/diagrams/ml')
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('regression_contour_fit{num:0>3}.svg', directory='../slides/diagrams/ml', num=IntSlider(0, 0, num_plots, 1))
\startslides{regression_contour_fit}{1}{28}
The gradient computation has $\numData$ terms in the sum. Substituting in the gradient, we can see that the full update is of the form
$$m_\text{new} \leftarrow m_\text{old} + 2\learnRate \left[\inputScalar_1 (\dataScalar_1 - m_\text{old}\inputScalar_1 - c_\text{old}) + \inputScalar_2 (\dataScalar_2 - m_\text{old}\inputScalar_2 - c_\text{old}) + \dots + \inputScalar_n (\dataScalar_n - m_\text{old}\inputScalar_n - c_\text{old})\right].$$
This could be split up into lots of individual updates
$$m_1 \leftarrow m_\text{old} + 2\learnRate \left[\inputScalar_1 (\dataScalar_1 - m_\text{old}\inputScalar_1 - c_\text{old})\right]$$ $$m_2 \leftarrow m_1 + 2\learnRate \left[\inputScalar_2 (\dataScalar_2 - m_\text{old}\inputScalar_2 - c_\text{old})\right]$$ $$m_3 \leftarrow m_2 + 2\learnRate \left[\dots\right]$$ $$m_n \leftarrow m_{n-1} + 2\learnRate \left[\inputScalar_n (\dataScalar_n - m_\text{old}\inputScalar_n - c_\text{old})\right]$$
which would lead to the same final update.
# choose a random point for the update
i = np.random.randint(x.shape[0]) # upper bound of randint is exclusive
# update m
m_star = m_star + 2*learn_rate*(x[i]*(y[i]-m_star*x[i] - c_star))
# update c
c_star = c_star + 2*learn_rate*(y[i]-m_star*x[i] - c_star)
Putting it all together in an algorithm, we can do stochastic gradient descent for our regression data.
num_plots = plot.regression_contour_sgd(x, y, diagrams='../slides/diagrams/ml')
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('regression_sgd_contour_fit{num:0>3}.svg',
directory='../slides/diagrams/ml', num=IntSlider(0, 0, num_plots, 1))
\startslides{regression_sgd_contour_fit}{0}{58}
Think about:
What effect does the learning rate have in the optimization? What's the effect of making it too small, what's the effect of making it too big? Do you get the same result for both stochastic and steepest gradient descent?
Stochastic gradient descent doesn't help very much for such a small data set. Its real advantage comes when there are many data points, as you'll see in the lab.
The likelihood of a single data point is
. . .
$$p\left(\dataScalar_i|\inputScalar_i\right)=\frac{1}{\sqrt{2\pi\dataStd^2}}\exp\left(-\frac{\left(\dataScalar_i-\mappingVector^{\top}\inputVector_i\right)^{2}}{2\dataStd^2}\right).$$. . .
Leading to a log likelihood for the data set of
. . .
$$L(\mappingVector,\dataStd^2)= -\frac{\numData}{2}\log \dataStd^2-\frac{\numData}{2}\log 2\pi -\frac{\sum_{i=1}^{\numData}\left(\dataScalar_i-\mappingVector^{\top}\inputVector_i\right)^{2}}{2\dataStd^2}.$$And a corresponding error function of $$\errorFunction(\mappingVector,\dataStd^2)=\frac{\numData}{2}\log\dataStd^2 + \frac{\sum_{i=1}^{\numData}\left(\dataScalar_i-\mappingVector^{\top}\inputVector_i\right)^{2}}{2\dataStd^2}.$$
# define the vector w: the first element multiplies the column of ones (offset c),
# the second multiplies x (gradient m)
w = np.zeros(shape=(2, 1))
w[0] = c
w[1] = m
X = np.hstack((np.ones_like(x), x))
print(X)
f = np.dot(X, w) # np.dot does matrix multiplication in python
resid = (y-f)
E = np.dot(resid.T, resid) # matrix multiplication on a single vector is equivalent to a dot product.
print("Error function is:", E)
[Differentiating with respect to the vector $\mappingVector$ we obtain]{align="left"} $$ \frac{\partial L\left(\mappingVector,\dataStd^2 \right)}{\partial \mappingVector}=\frac{1}{\dataStd^2} \sum _{i=1}^{\numData}\inputVector_i \dataScalar_i-\frac{1}{\dataStd^2} \left[\sum _{i=1}^{\numData}\inputVector_i\inputVector_i^{\top}\right]\mappingVector $$ Leading to $$ \mappingVector^{*}=\left[\sum _{i=1}^{\numData}\inputVector_i\inputVector_i^{\top}\right]^{-1}\sum _{i=1}^{\numData}\inputVector_i\dataScalar_i, $$
Rewrite in matrix notation: $$ \sum_{i=1}^{\numData}\inputVector_i\inputVector_i^\top = \inputMatrix^\top \inputMatrix $$ $$ \sum_{i=1}^{\numData}\inputVector_i\dataScalar_i = \inputMatrix^\top \dataVector $$
np.linalg.solve?
w = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
print(w)
m = w[1]; c=w[0]
f_test = m*x_test + c
print(m)
print(c)
plt.plot(x_test, f_test, 'b-')
plt.plot(x, y, 'rx')
data = pods.datasets.movie_body_count()
movies = data['Y']
print(', '.join(movies.columns))
select_features = ['Year', 'Body_Count', 'Length_Minutes']
X = movies[select_features]
X['Eins'] = 1 # add a column for the offset
y = movies[['IMDB_Rating']]
import pandas as pd
w = pd.DataFrame(data=np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y)), # solve linear regression here
index = X.columns, # columns of X become rows of w
columns=['regression_coefficient']) # the single column holds the regression coefficients
(y - np.dot(X, w)).hist()
w
import scipy as sp
Q, R = np.linalg.qr(X)
w = sp.linalg.solve_triangular(R, np.dot(Q.T, y))
w = pd.DataFrame(w, index=X.columns)
w
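Solving via the QR decomposition avoids forming $\inputMatrix^\top\inputMatrix$ explicitly, which is better conditioned numerically. As a quick sanity check (a sketch we add here, not part of the original notebook), the QR solution should agree with the normal-equations solution up to rounding error.

# compare the QR solution with the normal-equations solution
w_normal = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
print(np.allclose(w.values, w_normal))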
import numpy as np
def quadratic(x, **kwargs):
    """Take in a vector of input values and return the design matrix associated
    with the basis functions."""
    return np.hstack([np.ones((x.shape[0], 1)), x, x**2])
import matplotlib.pyplot as plt
import teaching_plots as plot
f, ax = plt.subplots(figsize=plot.big_wide_figsize)
loc =[[0, 1.4,],
[0, -0.7],
[0.75, -0.2]]
text =['$\phi(x) = 1$',
'$\phi(x) = x$',
'$\phi(x) = x^2$']
plot.basis(quadratic, x_min=-1.3, x_max=1.3,
fig=f, ax=ax, loc=loc, text=text,
diagrams='../slides/diagrams/ml')
\startslides{quadratic_basis}{0}{2}
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('quadratic_basis{num_basis:0>3}.svg',
directory='../slides/diagrams/ml',
num_basis=IntSlider(0,0,2,1))
# first let's generate some inputs
n = 100
x = np.zeros((n, 1)) # create a data set of zeros
x[:, 0] = np.linspace(-1, 1, n) # fill it with values between -1 and 1
Phi = quadratic(x)
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
ax.set_ylim([-1.2, 1.2]) # set y limits to ensure basis functions show.
ax.plot(x[:,0], Phi[:, 0], 'r-', label = '$\phi=1$', linewidth=3)
ax.plot(x[:,0], Phi[:, 1], 'g-', label = '$\phi=x$', linewidth=3)
ax.plot(x[:,0], Phi[:, 2], 'b-', label = '$\phi=x^2$', linewidth=3)
ax.legend(loc='lower right')
_ = ax.set_title('Quadratic Basis Functions')
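To see the design matrix in action, we can fit the quadratic basis to some synthetic data with ordinary least squares. This is a minimal sketch added for illustration (the synthetic target y_synth and its true coefficients are assumptions, not from the lecture); it reuses the x and Phi arrays defined above.

# generate a synthetic quadratic target and recover its coefficients
y_synth = 0.5 - x + 2*x**2 + np.random.normal(scale=0.1, size=(n, 1))
w_quad = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y_synth))
print(w_quad) # should be close to [0.5, -1, 2]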
\startslides{quadratic_function}{0}{2}
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('quadratic_function{num_function:0>3}.svg',
directory='../slides/diagrams/ml',
num_function=IntSlider(0,0,2,1))
import numpy as np
from matplotlib import pyplot as plt
import teaching_plots as plot
import mlai
import pods
basis = mlai.polynomial
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']
xlim = [1892, 2020]
basis=mlai.basis(mlai.polynomial, number=1, data_limits=xlim)
plot.rmse_fit(x, y, param_name='number', param_range=(1, 27),
model=mlai.LM,
basis=basis,
xlim=xlim, objective_ylim=[0, 0.8],
diagrams='../slides/diagrams/ml')
from ipywidgets import IntSlider
pods.notebook.display_plots('olympic_LM_polynomial_num_basis{num_basis:0>3}.svg',
directory='../slides/diagrams/ml',
num_basis=IntSlider(1,1,27,1))
\startslides{olympic_LM_polynomial_num_basis}{1}{26}
import teaching_plots as plot
plot.under_determined_system(diagrams='../slides/diagrams/ml')
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('under_determined_system{samp:0>3}.svg',
directory='../slides/diagrams/ml', samp=IntSlider(0, 0, 10, 1))
\startslides{under_determined_system}{1}{3}
Bayesian inference requires a prior on the parameters.
The prior represents your belief, before you see the data, about the likely values of the parameters.
For linear regression, consider a Gaussian prior on the intercept: $$c \sim \gaussianSamp{0}{\alpha_1}$$
import teaching_plots as plot
plot.bayes_update(diagrams='../slides/diagrams/ml')
from ipywidgets import IntSlider
import pods
pods.notebook.display_plots('dem_gaussian{stage:0>2}.svg',
directory='../slides/diagrams/ml',
stage=IntSlider(1, 1, 3, 1))
\startslides{dem_gaussian}{1}{3}
Completing the square of the quadratic form, we obtain $$\log p(c | \dataVector, \inputVector, m, \dataStd^2) = -\frac{1}{2\tau^2}(c - \mu)^2 +\text{const},$$ where $\tau^2 = \left(\numData\dataStd^{-2} +\alpha_1^{-1}\right)^{-1}$ and $\mu = \frac{\tau^2}{\dataStd^2} \sum_{i=1}^\numData(\dataScalar_i-m\inputScalar_i)$.
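As a sketch of how this posterior might be computed numerically (the values of alpha_1, sigma2 and the fixed gradient m below are assumptions chosen for illustration, not values from the lecture):

# posterior over the offset c for a fixed gradient m
alpha_1 = 1.0 # prior variance on c
sigma2 = 0.05 # noise variance
m = -0.4 # fixed gradient, assumed for illustration
tau2 = 1/(x.shape[0]/sigma2 + 1/alpha_1)
mu = (tau2/sigma2)*(y - m*x).sum()
print("Posterior mean", mu, "and posterior variance", tau2, "for c")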
import teaching_plots as plot
plot.height_weight(diagrams='../slides/diagrams/ml')
Gaussian distributions for height and weight.
import teaching_plots as plot
plot.independent_height_weight(num_samps=8,
diagrams='../slides/diagrams/ml')
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('independent_height_weight{fig:0>3}.png', '../slides/diagrams/ml', fig=IntSlider(0, 0, 8, 1))
import teaching_plots as plot
plot.correlated_height_weight(num_samps=8,
diagrams='../slides/diagrams/ml')
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('correlated_height_weight{fig:0>3}.png', '../slides/diagrams/ml', fig=IntSlider(0, 0, 8, 1))
\startslides{correlated_height_weight}{0}{7}
Form the correlated density from the original by rotating the data space using a matrix $\rotationMatrix$. Starting from the independent (diagonal covariance) form,
$$ p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\mathbf{D}^{-1}(\dataVector - \meanVector)\right), $$
rotate both the data and the mean,
$$ p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\rotationMatrix^\top\dataVector - \rotationMatrix^\top\meanVector)^\top\mathbf{D}^{-1}(\rotationMatrix^\top\dataVector - \rotationMatrix^\top\meanVector)\right), $$
and gather the rotation matrices inside the quadratic form,
$$ p(\dataVector) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\rotationMatrix\mathbf{D}^{-1}\rotationMatrix^\top(\dataVector - \meanVector)\right). $$
This gives an inverse covariance matrix $$ \covarianceMatrix^{-1} = \rotationMatrix \mathbf{D}^{-1} \rotationMatrix^\top, $$ so the density takes the general multivariate Gaussian form
$$ p(\dataVector) = \frac{1}{\det{2\pi\covarianceMatrix}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\dataVector - \meanVector)^\top\covarianceMatrix^{-1} (\dataVector - \meanVector)\right) $$
with covariance matrix $$ \covarianceMatrix = \rotationMatrix \mathbf{D} \rotationMatrix^\top. $$
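A small numerical illustration of this construction (this snippet is ours, not from the lecture code): build a correlated covariance matrix $\covarianceMatrix = \rotationMatrix \mathbf{D} \rotationMatrix^\top$ from a rotation and a diagonal matrix.

# rotate a diagonal covariance to produce a correlated covariance
theta = np.pi/4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
D = np.diag([2.0, 0.5])
C = np.dot(R, np.dot(D, R.T))
print(C)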
The posterior density over $\mappingVector$ is also Gaussian, with covariance $$ \covarianceMatrix_\mappingScalar = \left(\dataStd^{-2}\basisMatrix^\top \basisMatrix + \alpha^{-1}\eye\right)^{-1} $$ and mean $$ \meanVector_\mappingScalar = \covarianceMatrix_\mappingScalar \dataStd^{-2}\basisMatrix^\top \dataVector. $$
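A minimal sketch of this posterior computation (the values of alpha and sigma2 and the reuse of the quadratic basis defined earlier are assumptions made purely for illustration):

# posterior covariance and mean for the weights of the quadratic basis
alpha = 1.0
sigma2 = 0.04
Phi = quadratic(x) # design matrix from the basis functions
C_w = np.linalg.inv(np.dot(Phi.T, Phi)/sigma2 + np.eye(Phi.shape[1])/alpha) # posterior covariance
mu_w = np.dot(C_w, np.dot(Phi.T, y))/sigma2 # posterior mean
print(mu_w)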
import mlai
import pods
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']
num_data = x.shape[0]
data_limits = [1892, 2020]
basis = mlai.basis(mlai.polynomial, number=1, data_limits=data_limits)
max_basis = y.shape[0]
import teaching_plots as plot
plot.rmse_fit(x, y, param_name='number', param_range=(1, max_basis+1),
model=mlai.BLM,
basis=basis,
alpha=1,
sigma2=0.04,
data_limits=data_limits,
xlim=data_limits,
objective_ylim=[0.5,1.6],
diagrams='../slides/diagrams/ml')
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('olympic_BLM_polynomial_number{num_basis:0>3}.svg',
directory='../slides/diagrams/ml/',
num_basis=IntSlider(1, 1, 27, 1))
\startslides{olympic_BLM_polynomial_number}{1}{26}
plot.holdout_fit(x, y, param_name='number', param_range=(1, 27),
diagrams='../slides/diagrams/ml',
model=mlai.BLM,
basis=basis,
alpha=1,
sigma2=0.04,
xlim=data_limits,
objective_ylim=[0.1,0.6],
permute=False)
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('olympic_val_BLM_polynomial_number{num_basis:0>3}.svg',
directory='../slides/diagrams/ml',
num_basis=IntSlider(1, 1, 27, 1))
\startslides{olympic_val_BLM_polynomial_number}{1}{26}
num_parts=5
plot.cv_fit(x, y, param_name='number', param_range=(1, 27),
diagrams='../slides/diagrams/ml',
model=mlai.BLM,
basis=basis,
alpha=1,
sigma2=0.04,
xlim=data_limits,
objective_ylim=[0.2,0.6],
num_parts=num_parts)
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('olympic_5cv{part:0>2}_BLM_polynomial_number{num_basis:0>3}.svg',
directory='../slides/diagrams/ml',
part=(0, 5),
num_basis=IntSlider(1, 1, 27, 1))
\startslides{olympic_5cv05_BLM_polynomial_number}{1}{26}
The marginal likelihood can also be computed; it has the form $$ p(\dataVector|\inputMatrix, \dataStd^2, \alpha) = \frac{1}{(2\pi)^\frac{\numData}{2}\left|\kernelMatrix\right|^\frac{1}{2}} \exp\left(-\frac{1}{2} \dataVector^\top \kernelMatrix^{-1} \dataVector\right) $$ where $\kernelMatrix = \alpha \basisMatrix\basisMatrix^\top + \dataStd^2 \eye$.
So it is a zero mean $\numData$-dimensional Gaussian with covariance matrix $\kernelMatrix$.
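As a sketch of how the marginal likelihood might be evaluated numerically (the values of alpha and sigma2 and the reuse of the quadratic basis from earlier are assumptions for illustration only):

# log marginal likelihood of the data under the Bayesian linear model
alpha = 1.0
sigma2 = 0.04
Phi = quadratic(x) # design matrix
K = alpha*np.dot(Phi, Phi.T) + sigma2*np.eye(x.shape[0])
log_marginal = (-0.5*x.shape[0]*np.log(2*np.pi)
                - 0.5*np.linalg.slogdet(K)[1]
                - 0.5*np.dot(y.T, np.linalg.solve(K, y)))
print(log_marginal)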