Abstract: Neural network models are algorithmically simple, but mathematically complex. Gaussian process models are mathematically simple, but algorithmically complex. In this tutorial we will explore Deep Gaussian Process models. They bring advantages in their mathematical simplicity but are challenging in their algorithmic complexity. We will give an overview of Gaussian processes and highlight the algorithmic approximations that allow us to stack Gaussian process models: they are based on variational methods. In the last part of the tutorial will explore a use case exemplar: uncertainty quantification. We end with open questions.
$$ \newcommand{\Amatrix}{\mathbf{A}} \newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)} \newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}} \newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}} \newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}} \newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}} \newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}} \newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}} \newcommand{\Kuui}{\Kuu^{-1}} \newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}} \newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}} \newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}} \newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\aMatrix}{\mathbf{A}} \newcommand{\aScalar}{a} \newcommand{\aVector}{\mathbf{a}} \newcommand{\acceleration}{a} \newcommand{\bMatrix}{\mathbf{B}} \newcommand{\bScalar}{b} \newcommand{\bVector}{\mathbf{b}} \newcommand{\basisFunc}{\phi} \newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}} \newcommand{\basisFunction}{\phi} \newcommand{\basisLocation}{\mu} \newcommand{\basisMatrix}{\boldsymbol{ \Phi}} \newcommand{\basisScalar}{\basisFunction} \newcommand{\basisVector}{\boldsymbol{ \basisFunction}} \newcommand{\activationFunction}{\phi} \newcommand{\activationMatrix}{\boldsymbol{ \Phi}} \newcommand{\activationScalar}{\basisFunction} \newcommand{\activationVector}{\boldsymbol{ \basisFunction}} \newcommand{\bigO}{\mathcal{O}} \newcommand{\binomProb}{\pi} \newcommand{\cMatrix}{\mathbf{C}} \newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}} \newcommand{\cdataMatrix}{\hat{\dataMatrix}} \newcommand{\cdataScalar}{\hat{\dataScalar}} \newcommand{\cdataVector}{\hat{\dataVector}} \newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}} \newcommand{\centeredKernelScalar}{b} \newcommand{\centeredKernelVector}{\centeredKernelScalar} \newcommand{\centeringMatrix}{\mathbf{H}} \newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)} \newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}} \newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}} \newcommand{\coregionalizationMatrix}{\mathbf{B}} \newcommand{\coregionalizationScalar}{b} \newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}} \newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)} \newcommand{\covSamp}[1]{\text{cov}\left(#1\right)} \newcommand{\covarianceScalar}{c} \newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}} \newcommand{\covarianceMatrix}{\mathbf{C}} \newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}} \newcommand{\croupierScalar}{s} \newcommand{\croupierVector}{\mathbf{ \croupierScalar}} \newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}} \newcommand{\dataDim}{p} \newcommand{\dataIndex}{i} \newcommand{\dataIndexTwo}{j} \newcommand{\dataMatrix}{\mathbf{Y}} \newcommand{\dataScalar}{y} \newcommand{\dataSet}{\mathcal{D}} \newcommand{\dataStd}{\sigma} \newcommand{\dataVector}{\mathbf{ \dataScalar}} \newcommand{\decayRate}{d} \newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}} \newcommand{\degreeScalar}{d} \newcommand{\degreeVector}{\mathbf{ \degreeScalar}} % Already defined by latex %\newcommand{\det}[1]{\left|#1\right|} \newcommand{\diag}[1]{\text{diag}\left(#1\right)} \newcommand{\diagonalMatrix}{\mathbf{D}} \newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}} \newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}} \newcommand{\displacement}{x} \newcommand{\displacementVector}{\textbf{\displacement}} \newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}} \newcommand{\distanceScalar}{d} \newcommand{\distanceVector}{\mathbf{ \distanceScalar}} \newcommand{\eigenvaltwo}{\ell} \newcommand{\eigenvaltwoMatrix}{\mathbf{L}} \newcommand{\eigenvaltwoVector}{\mathbf{l}} \newcommand{\eigenvalue}{\lambda} \newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}} \newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}} \newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}} \newcommand{\eigenvectorMatrix}{\mathbf{U}} \newcommand{\eigenvectorScalar}{u} \newcommand{\eigenvectwo}{\mathbf{v}} \newcommand{\eigenvectwoMatrix}{\mathbf{V}} \newcommand{\eigenvectwoScalar}{v} \newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)} \newcommand{\errorFunction}{E} \newcommand{\expDist}[2]{\left<#1\right>_{#2}} \newcommand{\expSamp}[1]{\left<#1\right>} \newcommand{\expectation}[1]{\left\langle #1 \right\rangle } \newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}} \newcommand{\expectedDistanceMatrix}{\mathcal{D}} \newcommand{\eye}{\mathbf{I}} \newcommand{\fantasyDim}{r} \newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}} \newcommand{\fantasyScalar}{z} \newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}} \newcommand{\featureStd}{\varsigma} \newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)} \newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)} \newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)} \newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)} \newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)} \newcommand{\given}{|} \newcommand{\half}{\frac{1}{2}} \newcommand{\heaviside}{H} \newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}} \newcommand{\hiddenScalar}{h} \newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}} \newcommand{\identityMatrix}{\eye} \newcommand{\inducingInputScalar}{z} \newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}} \newcommand{\inducingInputMatrix}{\mathbf{Z}} \newcommand{\inducingScalar}{u} \newcommand{\inducingVector}{\mathbf{ \inducingScalar}} \newcommand{\inducingMatrix}{\mathbf{U}} \newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2} \newcommand{\inputDim}{q} \newcommand{\inputMatrix}{\mathbf{X}} \newcommand{\inputScalar}{x} \newcommand{\inputSpace}{\mathcal{X}} \newcommand{\inputVals}{\inputVector} \newcommand{\inputVector}{\mathbf{ \inputScalar}} \newcommand{\iterNum}{k} \newcommand{\kernel}{\kernelScalar} \newcommand{\kernelMatrix}{\mathbf{K}} \newcommand{\kernelScalar}{k} \newcommand{\kernelVector}{\mathbf{ \kernelScalar}} \newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}} \newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}} \newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}} \newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}} \newcommand{\lagrangeMultiplier}{\lambda} \newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}} \newcommand{\lagrangian}{L} \newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}} \newcommand{\laplacianFactorScalar}{m} \newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}} \newcommand{\laplacianMatrix}{\mathbf{L}} \newcommand{\laplacianScalar}{\ell} \newcommand{\laplacianVector}{\mathbf{ \ell}} \newcommand{\latentDim}{q} \newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}} \newcommand{\latentDistanceScalar}{\delta} \newcommand{\latentDistanceVector}{\boldsymbol{ \delta}} \newcommand{\latentForce}{f} \newcommand{\latentFunction}{u} \newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}} \newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}} \newcommand{\latentIndex}{j} \newcommand{\latentScalar}{z} \newcommand{\latentVector}{\mathbf{ \latentScalar}} \newcommand{\latentMatrix}{\mathbf{Z}} \newcommand{\learnRate}{\eta} \newcommand{\lengthScale}{\ell} \newcommand{\rbfWidth}{\ell} \newcommand{\likelihoodBound}{\mathcal{L}} \newcommand{\likelihoodFunction}{L} \newcommand{\locationScalar}{\mu} \newcommand{\locationVector}{\boldsymbol{ \locationScalar}} \newcommand{\locationMatrix}{\mathbf{M}} \newcommand{\variance}[1]{\text{var}\left( #1 \right)} \newcommand{\mappingFunction}{f} \newcommand{\mappingFunctionMatrix}{\mathbf{F}} \newcommand{\mappingFunctionTwo}{g} \newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}} \newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}} \newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}} \newcommand{\scaleScalar}{s} \newcommand{\mappingScalar}{w} \newcommand{\mappingVector}{\mathbf{ \mappingScalar}} \newcommand{\mappingMatrix}{\mathbf{W}} \newcommand{\mappingScalarTwo}{v} \newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}} \newcommand{\mappingMatrixTwo}{\mathbf{V}} \newcommand{\maxIters}{K} \newcommand{\meanMatrix}{\mathbf{M}} \newcommand{\meanScalar}{\mu} \newcommand{\meanTwoMatrix}{\mathbf{M}} \newcommand{\meanTwoScalar}{m} \newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}} \newcommand{\meanVector}{\boldsymbol{ \meanScalar}} \newcommand{\mrnaConcentration}{m} \newcommand{\naturalFrequency}{\omega} \newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)} \newcommand{\neilurl}{http://inverseprobability.com/} \newcommand{\noiseMatrix}{\boldsymbol{ E}} \newcommand{\noiseScalar}{\epsilon} \newcommand{\noiseVector}{\boldsymbol{ \epsilon}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}} \newcommand{\normalizedLaplacianScalar}{\hat{\ell}} \newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}} \newcommand{\numActive}{m} \newcommand{\numBasisFunc}{m} \newcommand{\numComponents}{m} \newcommand{\numComps}{K} \newcommand{\numData}{n} \newcommand{\numFeatures}{K} \newcommand{\numHidden}{h} \newcommand{\numInducing}{m} \newcommand{\numLayers}{\ell} \newcommand{\numNeighbors}{K} \newcommand{\numSequences}{s} \newcommand{\numSuccess}{s} \newcommand{\numTasks}{m} \newcommand{\numTime}{T} \newcommand{\numTrials}{S} \newcommand{\outputIndex}{j} \newcommand{\paramVector}{\boldsymbol{ \theta}} \newcommand{\parameterMatrix}{\boldsymbol{ \Theta}} \newcommand{\parameterScalar}{\theta} \newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}} \newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}} \newcommand{\precisionScalar}{j} \newcommand{\precisionVector}{\mathbf{ \precisionScalar}} \newcommand{\precisionMatrix}{\mathbf{J}} \newcommand{\pseudotargetScalar}{\widetilde{y}} \newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}} \newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}} \newcommand{\rank}[1]{\text{rank}\left(#1\right)} \newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)} \newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)} \newcommand{\responsibility}{r} \newcommand{\rotationScalar}{r} \newcommand{\rotationVector}{\mathbf{ \rotationScalar}} \newcommand{\rotationMatrix}{\mathbf{R}} \newcommand{\sampleCovScalar}{s} \newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}} \newcommand{\sampleCovMatrix}{\mathbf{s}} \newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle} \newcommand{\sign}[1]{\text{sign}\left(#1\right)} \newcommand{\sigmoid}[1]{\sigma\left(#1\right)} \newcommand{\singularvalue}{\ell} \newcommand{\singularvalueMatrix}{\mathbf{L}} \newcommand{\singularvalueVector}{\mathbf{l}} \newcommand{\sorth}{\mathbf{u}} \newcommand{\spar}{\lambda} \newcommand{\trace}[1]{\text{tr}\left(#1\right)} \newcommand{\BasalRate}{B} \newcommand{\DampingCoefficient}{C} \newcommand{\DecayRate}{D} \newcommand{\Displacement}{X} \newcommand{\LatentForce}{F} \newcommand{\Mass}{M} \newcommand{\Sensitivity}{S} \newcommand{\basalRate}{b} \newcommand{\dampingCoefficient}{c} \newcommand{\mass}{m} \newcommand{\sensitivity}{s} \newcommand{\springScalar}{\kappa} \newcommand{\springVector}{\boldsymbol{ \kappa}} \newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}} \newcommand{\tfConcentration}{p} \newcommand{\tfDecayRate}{\delta} \newcommand{\tfMrnaConcentration}{f} \newcommand{\tfVector}{\mathbf{ \tfConcentration}} \newcommand{\velocity}{v} \newcommand{\sufficientStatsScalar}{g} \newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}} \newcommand{\sufficientStatsMatrix}{\mathbf{G}} \newcommand{\switchScalar}{s} \newcommand{\switchVector}{\mathbf{ \switchScalar}} \newcommand{\switchMatrix}{\mathbf{S}} \newcommand{\tr}[1]{\text{tr}\left(#1\right)} \newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1} \newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2} \newcommand{\onenorm}[1]{\left\vert#1\right\vert_1} \newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert} \newcommand{\vScalar}{v} \newcommand{\vVector}{\mathbf{v}} \newcommand{\vMatrix}{\mathbf{V}} \newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)} % Already defined by latex %\newcommand{\vec}{#1:} \newcommand{\vecb}[1]{\left(#1\right):} \newcommand{\weightScalar}{w} \newcommand{\weightVector}{\mathbf{ \weightScalar}} \newcommand{\weightMatrix}{\mathbf{W}} \newcommand{\weightedAdjacencyMatrix}{\mathbf{A}} \newcommand{\weightedAdjacencyScalar}{a} \newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}} \newcommand{\onesVector}{\mathbf{1}} \newcommand{\zerosVector}{\mathbf{0}} $$. . .
$$ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}$$. . .
. . .
. . .
. . .
. . .
. . .
. . .
adaptive non-linear function models inspired by simple neuron models [@McCulloch:neuron43]
have become popular because of their ability to model data.
can be composed to form highly complex functions
start by focussing on one hidden layer
$\mappingFunction(\cdot)$ is a scalar function with vector inputs,
$\activationVector(\cdot)$ is a vector function with vector inputs.
dimensionality of the vector function is known as the number of hidden units, or the number of neurons.
elements of $\activationVector(\cdot)$ are the activation function of the neural network
elements of $\mappingMatrix_{1}$ are the parameters of the activation functions.
In statistics activation functions are known as basis functions.
would think of this as a linear model: not linear predictions, linear in the parameters
$\mappingVector_{1}$ are static parameters.
In machine learning we optimize $\mappingMatrix_{1}$ as well as $\mappingMatrix_{2}$ (which would normally be denoted in statistics by $\boldsymbol{\beta}$).
This tutorial: revisit that decision: follow the path of @Neal:bayesian94 and @MacKay:bayesian92.
Consider the probabilistic approach.
This Bayesian approach is designed to deal with uncertainty arising from fitting our prediction function to the data we have, a reduced data set.
The Bayesian approach can be derived from a broader understanding of what our objective is. If we accept that we can jointly represent all things that happen in the world with a probability distribution, then we can interogate that probability to make predictions. So, if we are interested in predictions, $\dataScalar_*$ at future points input locations of interest, $\inputVector_*$ given previously training data, $\dataVector$ and corresponding inputs, $\inputMatrix$, then we are really interogating the following probability density, $$ p(\dataScalar_*|\dataVector, \inputMatrix, \inputVector_*), $$ there is nothing controversial here, as long as you accept that you have a good joint model of the world around you that relates test data to training data, $p(\dataScalar_*, \dataVector, \inputMatrix, \inputVector_*)$ then this conditional distribution can be recovered through standard rules of probability ($\text{data} + \text{model} \rightarrow \text{prediction}$).
We can construct this joint density through the use of the following decomposition: $$ p(\dataScalar_*|\dataVector, \inputMatrix, \inputVector_*) = \int p(\dataScalar_*|\inputVector_*, \mappingMatrix) p(\mappingMatrix | \dataVector, \inputMatrix) \text{d} \mappingMatrix $$
where, for convenience, we are assuming all the parameters of the model are now represented by $\parameterVector$ (which contains $\mappingMatrix$ and $\mappingMatrixTwo$) and $p(\parameterVector | \dataVector, \inputMatrix)$ is recognised as the posterior density of the parameters given data and $p(\dataScalar_*|\inputVector_*, \parameterVector)$ is the likelihood of an individual test data point given the parameters.
The likelihood of the data is normally assumed to be independent across the parameters, $$ p(\dataVector|\inputMatrix, \mappingMatrix) \prod_{i=1}^\numData p(\dataScalar_i|\inputVector_i, \mappingMatrix),$$
and if that is so, it is easy to extend our predictions across all future, potential, locations, $$ p(\dataVector_*|\dataVector, \inputMatrix, \inputMatrix_*) = \int p(\dataVector_*|\inputMatrix_*, \parameterVector) p(\parameterVector | \dataVector, \inputMatrix) \text{d} \parameterVector. $$
The likelihood is also where the prediction function is incorporated. For example in the regression case, we consider an objective based around the Gaussian density, $$ p(\dataScalar_i | \mappingFunction(\inputVector_i)) = \frac{1}{\sqrt{2\pi \dataStd^2}} \exp\left(-\frac{\left(\dataScalar_i - \mappingFunction(\inputVector_i)\right)^2}{2\dataStd^2}\right) $$
In short, that is the classical approach to probabilistic inference, and all approaches to Bayesian neural networks fall within this path. For a deep probabilistic model, we can simply take this one stage further and place a probability distribution over the input locations, $$ p(\dataVector_*|\dataVector) = \int p(\dataVector_*|\inputMatrix_*, \parameterVector) p(\parameterVector | \dataVector, \inputMatrix) p(\inputMatrix) p(\inputMatrix_*) \text{d} \parameterVector \text{d} \inputMatrix \text{d}\inputMatrix_* $$ and we have unsupervised learning (from where we can get deep generative models).
Represent joint distribution through conditional dependencies.
E.g. Markov chain
import daft
from matplotlib import rc
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)
rc("text", usetex=True)
pgm = daft.PGM(shape=[3, 1],
origin=[0, 0],
grid_unit=5,
node_unit=1.9,
observed_style='shaded',
line_width=3)
pgm.add_node(daft.Node("y_1", r"$y_1$", 0.5, 0.5, fixed=False))
pgm.add_node(daft.Node("y_2", r"$y_2$", 1.5, 0.5, fixed=False))
pgm.add_node(daft.Node("y_3", r"$y_3$", 2.5, 0.5, fixed=False))
pgm.add_edge("y_1", "y_2")
pgm.add_edge("y_2", "y_3")
pgm.render().figure.savefig("../slides/diagrams/ml/markov.svg", transparent=True)
Predict Perioperative Risk of Clostridium Difficile Infection Following Colon Surgery [@Steele:predictive12]
Easy to write in probabilities
But underlying this is a wealth of computational challenges.
High dimensional integrals typically require approximation.
Statisticians realized these challenges early on, indeed, so early that they were actually physicists, both Laplace and Gauss worked on models such as this, in Gauss's case he made his career on prediction of the location of the lost planet (later reclassified as a asteroid, then dwarf planet), Ceres. Gauss and Laplace made use of maximum a posteriori estimates for simplifying their computations and Laplace developed Laplace's method (and invented the Gaussian density) to expand around that mode. But classical statistics needs better guarantees around model performance and interpretation, and as a result has focussed more on the linear model implied by $$ \mappingFunction(\inputVector) = \left.\mappingVector^{(2)}\right.^\top \activationVector(\mappingMatrix_1, \inputVector) $$
Hold $\mappingMatrix_1$ fixed for given analysis.
Gaussian prior for $\mappingMatrix$, $$ \mappingVector^{(2)} \sim \gaussianSamp{\zerosVector}{\covarianceMatrix}. $$ $$ \dataScalar_i = \mappingFunction(\inputVector_i) + \noiseScalar_i, $$ where $$ \noiseScalar_i \sim \gaussianSamp{0}{\dataStd^2} $$
. . .
. . .
$$\sum_{i=1}^{\numData} \dataScalar_i \sim \gaussianSamp{\sum_{i=1}^\numData \mu_i}{\sum_{i=1}^\numData\dataStd_i^2}$$. . .
Scaling a Gaussian leads to a Gaussian.
. . .
. . .
$$\mappingScalar\dataScalar\sim \gaussianSamp{\mappingScalar\mu}{\mappingScalar^2 \dataStd^2}$$[If]{align="left"}
$$\inputVector \sim \gaussianSamp{\boldsymbol{\mu}}{\boldsymbol{\Sigma}}$$. . .
[And]{align="left"} $$\dataVector= \mappingMatrix\inputVector$$
. . .
[Then]{align="left"} $$\dataVector \sim \gaussianSamp{\mappingMatrix\boldsymbol{\mu}}{\mappingMatrix\boldsymbol{\Sigma}\mappingMatrix^\top}$$
Let's first of all review the properties of the multivariate Gaussian distribution that make linear Gaussian models easier to deal with. We'll return to the, perhaps surprising, result on the parameters within the nonlinearity, $\parameterVector$, shortly.
To work with linear Gaussian models, to find the marginal likelihood all you need to know is the following rules. If $$ \dataVector = \mappingMatrix \inputVector + \noiseVector, $$ where $\dataVector$, $\inputVector$ and $\noiseVector$ are vectors and we assume that $\inputVector$ and $\noiseVector$ are drawn from multivariate Gaussians, $$\begin{align} \inputVector & \sim \gaussianSamp{\meanVector}{\covarianceMatrix}\\ \noiseVector & \sim \gaussianSamp{\zerosVector}{\covarianceMatrixTwo} \end{align}$$ then we know that $\dataVector$ is also drawn from a multivariate Gaussian with, $$ \dataVector \sim \gaussianSamp{\mappingMatrix\meanVector}{\mappingMatrix\covarianceMatrix\mappingMatrix^\top + \covarianceMatrixTwo}. $$ With apprioriately defined covariance, $\covarianceTwoMatrix$, this is actually the marginal likelihood for Factor Analysis, or Probabilistic Principal Component Analysis [@Tipping:probpca99], because we integrated out the inputs (or latent variables they would be called in that case).
However, we are focussing on what happens in models which are non-linear in the inputs, whereas the above would be linear in the inputs. To consider these, we introduce a matrix, called the design matrix. We set each activation function computed at each data point to be $$ \activationScalar_{i,j} = \activationScalar(\mappingVector^{(1)}_{j}, \inputVector_{i}) $$ and define the matrix of activations (known as the design matrix in statistics) to be, $$ \activationMatrix = \begin{bmatrix} \activationScalar_{1, 1} & \activationScalar_{1, 2} & \dots & \activationScalar_{1, \numHidden} \\ \activationScalar_{1, 2} & \activationScalar_{1, 2} & \dots & \activationScalar_{1, \numData} \\ \vdots & \vdots & \ddots & \vdots \\ \activationScalar_{\numData, 1} & \activationScalar_{\numData, 2} & \dots & \activationScalar_{\numData, \numHidden} \end{bmatrix}. $$ By convention this matrix always has $\numData$ rows and $\numHidden$ columns, now if we define the vector of all noise corruptions, $\noiseVector = \left[\noiseScalar_1, \dots \noiseScalar_\numData\right]^\top$.
. . .
$$\dataVector = \activationMatrix\mappingVector + \noiseVector$$. . .
$$\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2\eye}$${ If we define the prior distribution over the vector $\mappingVector$ to be Gaussian,} $$ \mappingVector \sim \gaussianSamp{\zerosVector}{\alpha\eye}, $$
{ then we can use rules of multivariate Gaussians to see that,} $$ \dataVector \sim \gaussianSamp{\zerosVector}{\alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye}. $$
In other words, our training data is distributed as a multivariate Gaussian, with zero mean and a covariance given by $$ \kernelMatrix = \alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye. $$
This is an $\numData \times \numData$ size matrix. Its elements are in the form of a function. The maths shows that any element, index by $i$ and $j$, is a function only of inputs associated with data points $i$ and $j$, $\dataVector_i$, $\dataVector_j$. $\kernel_{i,j} = \kernel\left(\inputVector_i, \inputVector_j\right)$
If we look at the portion of this function associated only with $\mappingFunction(\cdot)$, i.e. we remove the noise, then we can write down the covariance associated with our neural network, $$ \kernel_\mappingFunction\left(\inputVector_i, \inputVector_j\right) = \alpha \activationVector\left(\mappingMatrix_1, \inputVector_i\right)^\top \activationVector\left(\mappingMatrix_1, \inputVector_j\right) $$ so the elements of the covariance or kernel matrix are formed by inner products of the rows of the design matrix.
This is the essence of a Gaussian process. Instead of making assumptions about our density over each data point, $\dataScalar_i$ as i.i.d. we make a joint Gaussian assumption over our data. The covariance matrix is now a function of both the parameters of the activation function, $\mappingMatrixTwo$, and the input variables, $\inputMatrix$. This comes about through integrating out the parameters of the model, $\mappingVector$.
We can basically put anything inside the basis functions, and many people do. These can be deep kernels [@Cho:deep09] or we can learn the parameters of a convolutional neural network inside there.
Viewing a neural network in this way is also what allows us to beform sensible batch normalizations [@Ioffe:batch15].
This process is degenerate.
Covariance function is of rank at most $\numHidden$.
As $\numData \rightarrow \infty$, covariance matrix is not full rank.
Leading to $\det{\kernelMatrix} = 0$
Page 37 of Radford Neal's 1994 thesis
If $$ \begin{align*} \mappingVector^{(1)} & \sim p(\cdot)\\ \phi_i & = \activationScalar\left(\mappingVector^{(1)}, \inputVector_i\right), \end{align*} $$ has finite variance.
Then taking number of hidden units to infinity, is also a Gaussian process.
Chapter 2 of Neal's thesis [@Neal:bayesian94]
Rest of Neal's thesis. [@Neal:bayesian94]
David MacKay's PhD thesis [@MacKay:bayesian92]
import numpy as np
import teaching_plots as plot
%load -s compute_kernel mlai.py
%load -s eq_cov mlai.py
np.random.seed(10)
plot.rejection_samples(compute_kernel, kernel=eq_cov,
lengthscale=0.25, diagrams='../slides/diagrams/gp')
import pods
pods.notebook.display_plots('gp_rejection_samples{sample:0>3}.svg',
'../slides/diagrams/gp', sample=(1,5))
import numpy as np
np.random.seed(4949)
import teaching_plots as plot
import pods
Multi-variate Gaussians
We will consider a Gaussian with a particular structure of covariance matrix.
Generate a single sample from this 25 dimensional Gaussian distribution, $\mappingFunctionVector=\left[\mappingFunction_{1},\mappingFunction_{2}\dots \mappingFunction_{25}\right]$.
We will plot these points against their index.
%load -s compute_kernel mlai.py
%load -s polynomial_cov mlai.py
%load -s exponentiated_quadratic mlai.py
plot.two_point_sample(compute_kernel, kernel=exponentiated_quadratic,
lengthscale=0.5, diagrams='../slides/diagrams/gp')
pods.notebook.display_plots('two_point_sample{sample:0>3}.svg',
'../slides/diagrams/gp', sample=(0,13))
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
Prediction of $\mappingFunction_2$ from $\mappingFunction_1$ requires conditional density.
Conditional density is also Gaussian. $$ p(\mappingFunction_2|\mappingFunction_1) = \gaussianDist{\mappingFunction_2}{\frac{\kernelScalar_{1, 2}}{\kernelScalar_{1, 1}}\mappingFunction_1}{ \kernelScalar_{2, 2} - \frac{\kernelScalar_{1,2}^2}{\kernelScalar_{1,1}}} $$ where covariance of joint density is given by $$ \kernelMatrix = \begin{bmatrix} \kernelScalar_{1, 1} & \kernelScalar_{1, 2}\\ \kernelScalar_{2, 1} & \kernelScalar_{2, 2}\end{bmatrix} $$
pods.notebook.display_plots('two_point_sample{sample:0>3}.svg',
'../slides/diagrams/gp', sample=(13,17))
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
A 25 dimensional correlated random variable (values ploted against index)
Covariance function, $\kernelMatrix$
Determines properties of samples.
Function of $\inputMatrix$, $$\kernelScalar_{i,j} = \kernelScalar(\inputVector_i, \inputVector_j)$$
$$\mappingFunction_D(\inputVector_*) = \kernelVector(\inputVector_*, \inputMatrix) \kernelMatrix^{-1}
\mathbf{y}$$
$$\mappingFunction_D(\inputVector_*) = \kernelVector(\inputVector_*, \inputMatrix) \boldsymbol{\alpha}$$
%load -s eq_cov mlai.py
import teaching_plots as plot
import mlai
import numpy as np
K, anim=plot.animate_covariance_function(mlai.compute_kernel,
kernel=eq_cov, lengthscale=0.2)
from IPython.core.display import HTML
HTML(anim.to_jshtml())
plot.save_animation(anim,
diagrams='../slides/diagrams/kern',
filename='eq_covariance.html')
import numpy as np
import matplotlib.pyplot as plt
import pods
import teaching_plots as plot
import mlai
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']
offset = y.mean()
scale = np.sqrt(y.var())
xlim = (1875,2030)
ylim = (2.5, 6.5)
yhat = (y-offset)/scale
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
ax.set_xlabel('year', fontsize=20)
ax.set_ylabel('pace min/km', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig, filename='../slides/diagrams/datasets/olympic-marathon.svg', transparent=True, frameon=True)
- Gold medal times for Olympic Marathon since 1896.
|
![image](../slides/diagrams/Stephen_Kiprotich.jpg) Image from
Wikimedia Commons |
m_full = GPy.models.GPRegression(x,yhat)
_ = m_full.optimize() # Optimize parameters of covariance function
xt = np.linspace(1870,2030,200)[:,np.newaxis]
yt_mean, yt_var = m_full.predict(xt)
yt_sd=np.sqrt(yt_var)
import teaching_plots as plot
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='year', ylabel='pace min/km', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
filename='../slides/diagrams/gp/olympic-marathon-gp.svg',
transparent=True, frameon=True)
%load -s brownian_cov mlai.py
import teaching_plots as plot
import mlai
import numpy as np
t=np.linspace(0, 2, 200)[:, np.newaxis]
K, anim=plot.animate_covariance_function(mlai.compute_kernel,
t,
kernel=brownian_cov)
from IPython.core.display import HTML
HTML(anim.to_jshtml())
plot.save_animation(anim,
diagrams='../slides/diagrams/kern',
filename='brownian_covariance.html')
%load -s mlp_cov mlai.py
import teaching_plots as plot
import mlai
import numpy as np
K, anim=plot.animate_covariance_function(mlai.compute_kernel,
kernel=mlp_cov, lengthscale=0.2)
from IPython.core.display import HTML
HTML(anim.to_jshtml())
plot.save_animation(anim,
diagrams='../slides/diagrams/kern',
filename='mlp_covariance.html')
$=f\Bigg($ $\Bigg)$
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
import GPy
import mlai
import teaching_plots as plot
from gp_tutorial import gpplot
np.random.seed(101)
N = 50
noise_var = 0.01
X = np.zeros((50, 1))
X[:25, :] = np.linspace(0,3,25)[:,None] # First cluster of inputs/covariates
X[25:, :] = np.linspace(7,10,25)[:,None] # Second cluster of inputs/covariates
# Sample response variables from a Gaussian process with exponentiated quadratic covariance.
k = GPy.kern.RBF(1)
y = np.random.multivariate_normal(np.zeros(N),k.K(X)+np.eye(N)*np.sqrt(noise_var)).reshape(-1,1)
m_full = GPy.models.GPRegression(X,y)
_ = m_full.optimize(messages=True) # Optimize parameters of covariance function
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2)
xlim = ax.get_xlim()
ylim = ax.get_ylim()
mlai.write_figure(figure=fig,
filename='../slides/diagrams/gp/sparse-demo-full-gp.svg',
transparent=True, frameon=True)
kern = GPy.kern.RBF(1)
Z = np.hstack(
(np.linspace(2.5,4.,3),
np.linspace(7,8.5,3)))[:,None]
m = GPy.models.SparseGPRegression(X,y,kernel=kern,Z=Z)
m.noise_var = noise_var
m.inducing_inputs.constrain_fixed()
display(m)
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2, xlim=xlim, ylim=ylim)
mlai.write_figure(figure=fig,
filename='../slides/diagrams/gp/sparse-demo-constrained-inducing-6-unlearned-gp.svg',
transparent=True, frameon=True)
_ = m.optimize(messages=True)
display(m)
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2, xlim=xlim, ylim=ylim)
mlai.write_figure(figure=fig,
filename='../slides/diagrams/gp/sparse-demo-full-gp.svg',
transparent=True, frameon=True)
m.randomize()
m.inducing_inputs.unconstrain()
_ = m.optimize(messages=True)
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2,xlim=xlim, ylim=ylim)
mlai.write_figure(figure=fig,
filename='../slides/diagrams/gp/sparse-demo-unconstrained-inducing-6-gp.svg',
transparent=True, frameon=True)
m.num_inducing=8
m.randomize()
M = 8
m.set_Z(np.random.rand(M,1)*12)
_ = m.optimize(messages=True)
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2, xlim=xlim, ylim=ylim)
mlai.write_figure(figure=fig,
filename='../slides/diagrams/gp/sparse-demo-sparse-inducing-8-gp.svg',
transparent=True, frameon=True)
print(m.log_likelihood(), m_full.log_likelihood())
A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation @Thang:unifying17
Deep Gaussian Processes and Variational Propagation of Uncertainty @Damianou:thesis2015
import teaching_plots as plot
plot.deep_nn(diagrams='../slides/diagrams/deepgp/')
Potential problem: if number of nodes in two adjacent layers is big, corresponding $\mappingMatrix$ is also very big and there is the potential to overfit.
Proposed solution: “dropout”.
Alternative solution: parameterize $\mappingMatrix$ with its SVD. $$ \mappingMatrix = \eigenvectorMatrix\eigenvalueMatrix\eigenvectwoMatrix^\top $$ or $$ \mappingMatrix = \eigenvectorMatrix\eigenvectwoMatrix^\top $$ where if $\mappingMatrix \in \Re^{k_1\times k_2}$ then $\eigenvectorMatrix\in \Re^{k_1\times q}$ and $\eigenvectwoMatrix \in \Re^{k_2\times q}$, i.e. we have a low rank matrix factorization for the weights.
import teaching_plots as plot
plot.low_rank_approximation(diagrams='../slides/diagrams')
import teaching_plots as plot
plot.deep_nn_bottleneck(diagrams='../slides/diagrams/deepgp')
The network can now be written mathematically as $$ \begin{align} \latentVector_{1} &= \eigenvectwoMatrix^\top_1 \inputVector\\ \hiddenVector_{1} &= \basisFunction\left(\eigenvectorMatrix_1 \latentVector_{1}\right)\\ \latentVector_{2} &= \eigenvectwoMatrix^\top_2 \hiddenVector_{1}\\ \hiddenVector_{2} &= \basisFunction\left(\eigenvectorMatrix_2 \latentVector_{2}\right)\\ \latentVector_{3} &= \eigenvectwoMatrix^\top_3 \hiddenVector_{2}\\ \hiddenVector_{3} &= \basisFunction\left(\eigenvectorMatrix_3 \latentVector_{3}\right)\\ \dataVector &= \mappingVector_4^\top\hiddenVector_{3}. \end{align} $$
Replace each neural network with a Gaussian process $$ \begin{align} \latentVector_{1} &= \mappingFunctionVector_1\left(\inputVector\right)\\ \latentVector_{2} &= \mappingFunctionVector_2\left(\latentVector_{1}\right)\\ \latentVector_{3} &= \mappingFunctionVector_3\left(\latentVector_{2}\right)\\ \dataVector &= \mappingFunctionVector_4\left(\latentVector_{3}\right) \end{align} $$
Equivalent to prior over parameters, take width of each layer to infinity.
from matplotlib import rc
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica'],'size':30})
rc("text", usetex=True)
pgm = plot.horizontal_chain(depth=5)
pgm.render().figure.savefig("../slides/diagrams/deepgp/deep-markov.svg", transparent=True)
from matplotlib import rc
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica'], 'size':15})
rc("text", usetex=True)
pgm = plot.vertical_chain(depth=5)
pgm.render().figure.savefig("../slides/diagrams/deepgp/deep-markov-vertical.svg", transparent=True)
Gaussian processes give priors over functions.
Elegant properties:
e.g. Derivatives of process are also Gaussian distributed (if they exist).
For particular covariance functions they are ‘universal approximators’, i.e. all functions can have support under the prior.
Gaussian derivatives might ring alarm bells.
E.g. a priori they don’t believe in function ‘jumps’.
From a process perspective: process composition.
A (new?) way of constructing more complex processes based on simpler components.
pgm = plot.vertical_chain(depth=5, shape=[2, 7])
pgm.add_node(daft.Node('y_2', r'$\mathbf{y}_2$', 1.5, 3.5, observed=True))
pgm.add_edge('f_2', 'y_2')
pgm.render().figure.savefig("../slides/diagrams/deepgp/deep-markov-vertical-side.svg", transparent=True)
plot.non_linear_difficulty_plot_3(diagrams='../../slides/diagrams/dimred/')
Propagate a probability distribution through a non-linear mapping.
Normalisation of distribution becomes intractable.
plot.non_linear_difficulty_plot_2(diagrams='../../slides/diagrams/dimred/')
Propagate a probability distribution through a non-linear mapping.
Normalisation of distribution becomes intractable.
plot.non_linear_difficulty_plot_1(diagrams='../../slides/diagrams/dimred')
Propagate a probability distribution through a non-linear mapping.
Normalisation of distribution becomes intractable.
Deep architectures allow abstraction of features [@Bengio:deep09; @Hinton:fast06; @Salakhutdinov:quantitative08]
We use variational approach to stack GP models.
plot.stack_gp_sample(kernel=GPy.kern.Linear,
diagrams="../../slides/diagrams/deepgp")
pods.notebook.display_plots('stack-gp-sample-Linear-{sample:0>1}.svg',
directory='../../slides/diagrams/deepgp', sample=(0,4))
plot.stack_gp_sample(kernel=GPy.kern.RBF,
diagrams="../../slides/diagrams/deepgp")
pods.notebook.display_plots('stack-gp-sample-RBF-{sample:0>1}.svg',
directory='../../slides/diagrams/deepgp', sample=(0,4))
Avoiding pathologies in very deep networks @Duvenaud:pathologies14 show that the derivative distribution of the process becomes more heavy tailed as number of layers increase.
How Deep Are Deep Gaussian Processes? @Dunlop:deep2017 perform a theoretical analysis possible through conditional Gaussian Markov property.
from IPython.lib.display import YouTubeVideo
YouTubeVideo('XhIvygQYFFQ')
import numpy as np
import matplotlib.pyplot as plt
import pods
import teaching_plots as plot
import mlai
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']
offset = y.mean()
scale = np.sqrt(y.var())
xlim = (1875,2030)
ylim = (2.5, 6.5)
yhat = (y-offset)/scale
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
ax.set_xlabel('year', fontsize=20)
ax.set_ylabel('pace min/km', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig, filename='../slides/diagrams/datasets/olympic-marathon.svg', transparent=True, frameon=True)
- Gold medal times for Olympic Marathon since 1896.
|
![image](../slides/diagrams/Stephen_Kiprotich.jpg) Image from
Wikimedia Commons |
m_full = GPy.models.GPRegression(x,yhat)
_ = m_full.optimize() # Optimize parameters of covariance function
xt = np.linspace(1870,2030,200)[:,np.newaxis]
yt_mean, yt_var = m_full.predict(xt)
yt_sd=np.sqrt(yt_var)
import teaching_plots as plot
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='year', ylabel='pace min/km', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
filename='../slides/diagrams/gp/olympic-marathon-gp.svg',
transparent=True, frameon=True)
Can a Deep Gaussian process help?
Deep GP is one GP feeding into another.
hidden = 1
m = deepgp.DeepGP([y.shape[1],hidden,x.shape[1]],Y=yhat, X=x, inits=['PCA','PCA'],
kernels=[GPy.kern.RBF(hidden,ARD=True),
GPy.kern.RBF(x.shape[1],ARD=True)], # the kernels for each layer
num_inducing=50, back_constraint=False)
def initialize(self, noise_factor=0.01, linear_factor=1):
"""Helper function for deep model initialization."""
self.obslayer.likelihood.variance = self.Y.var()*noise_factor
for layer in self.layers:
if type(layer.X) is GPy.core.parameterization.variational.NormalPosterior:
if layer.kern.ARD:
var = layer.X.mean.var(0)
else:
var = layer.X.mean.var()
else:
if layer.kern.ARD:
var = layer.X.var(0)
else:
var = layer.X.var()
# Average 0.5 upcrossings in four standard deviations.
layer.kern.lengthscale = linear_factor*np.sqrt(layer.kern.input_dim)*2*4*np.sqrt(var)/(2*np.pi)
# Bind the new method to the Deep GP object.
deepgp.DeepGP.initialize=initialize
# Call the initalization
m.initialize()
def staged_optimize(self, iters=(1000,1000,10000), messages=(False, False, True)):
"""Optimize with parameters constrained and then with parameters released"""
for layer in self.layers:
# Fix the scale of each of the covariance functions.
layer.kern.variance.fix(warning=False)
layer.kern.lengthscale.fix(warning=False)
# Fix the variance of the noise in each layer.
layer.likelihood.variance.fix(warning=False)
self.optimize(messages=messages[0],max_iters=iters[0])
for layer in self.layers:
layer.kern.lengthscale.constrain_positive(warning=False)
self.obslayer.kern.variance.constrain_positive(warning=False)
self.optimize(messages=messages[1],max_iters=iters[1])
for layer in self.layers:
layer.kern.variance.constrain_positive(warning=False)
layer.likelihood.variance.constrain_positive(warning=False)
self.optimize(messages=messages[2],max_iters=iters[2])
# Bind the new method to the Deep GP object.
deepgp.DeepGP.staged_optimize=staged_optimize
m.staged_optimize(messages=(True,True,True))
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m, scale=scale, offset=offset, ax=ax, xlabel='year', ylabel='pace min/km',
fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/olympic-marathon-deep-gp.svg',
transparent=True, frameon=True)
def posterior_sample(self, X, **kwargs):
"""Give a sample from the posterior of the deep GP."""
Z = X
for i, layer in enumerate(reversed(self.layers)):
Z = layer.posterior_samples(Z, size=1, **kwargs)[:, :, 0]
return Z
deepgp.DeepGP.posterior_sample = posterior_sample
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_sample(m, scale=scale, offset=offset, samps=10, ax=ax,
xlabel='year', ylabel='pace min/km', portion = 0.225)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/olympic-marathon-deep-gp-samples.svg',
transparent=True, frameon=True)
def visualize(self, scale=1.0, offset=0.0, xlabel='input', ylabel='output',
xlim=None, ylim=None, fontsize=20, portion=0.2,dataset=None,
diagrams='../diagrams'):
"""Visualize the layers in a deep GP with one-d input and output."""
depth = len(self.layers)
if dataset is None:
fname = 'deep-gp-layer'
else:
fname = dataset + '-deep-gp-layer'
filename = os.path.join(diagrams, fname)
last_name = xlabel
last_x = self.X
for i, layer in enumerate(reversed(self.layers)):
if i>0:
plt.plot(last_x, layer.X.mean, 'r.',markersize=10)
last_x=layer.X.mean
ax=plt.gca()
name = 'layer ' + str(i)
plt.xlabel(last_name, fontsize=fontsize)
plt.ylabel(name, fontsize=fontsize)
last_name=name
mlai.write_figure(filename=filename + '-' + str(i-1) + '.svg',
transparent=True, frameon=True)
if i==0 and xlim is not None:
xt = plot.pred_range(np.array(xlim), portion=0.0)
elif i>0:
xt = plot.pred_range(np.array(next_lim), portion=0.0)
else:
xt = plot.pred_range(last_x, portion=portion)
yt_mean, yt_var = layer.predict(xt)
if layer==self.obslayer:
yt_mean = yt_mean*scale + offset
yt_var *= scale*scale
yt_sd = np.sqrt(yt_var)
gpplot(xt,yt_mean,yt_mean-2*yt_sd,yt_mean+2*yt_sd)
ax = plt.gca()
if i>0:
ax.set_xlim(next_lim)
elif xlim is not None:
ax.set_xlim(xlim)
next_lim = plt.gca().get_ylim()
plt.plot(last_x, self.Y*scale + offset, 'r.',markersize=10)
plt.xlabel(last_name, fontsize=fontsize)
plt.ylabel(ylabel, fontsize=fontsize)
mlai.write_figure(filename=filename + '-' + str(i) + '.svg',
transparent=True, frameon=True)
if ylim is not None:
ax=plt.gca()
ax.set_ylim(ylim)
# Bind the new method to the Deep GP object.
deepgp.DeepGP.visualize=visualize
m.visualize(scale=scale, offset=offset, xlabel='year',
ylabel='pace min/km',xlim=xlim, ylim=ylim,
dataset='olympic-marathon',
diagrams='../slides/diagrams/deepgp')
import pods
pods.notebook.display_plots('olympic-marathon-deep-gp-layer-{sample:0>1}.svg',
'../slides/diagrams/deepgp', sample=(0,1))
def scale_data(x, portion):
scale = (x.max()-x.min())/(1-2*portion)
offset = x.min() - portion*scale
return (x-offset)/scale, scale, offset
def visualize_pinball(self, ax=None, scale=1.0, offset=0.0, xlabel='input', ylabel='output',
xlim=None, ylim=None, fontsize=20, portion=0.2, points=50, vertical=True):
"""Visualize the layers in a deep GP with one-d input and output."""
if ax is None:
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
depth = len(self.layers)
last_name = xlabel
last_x = self.X
# Recover input and output scales from output plot
plot_model_output(self, scale=scale, offset=offset, ax=ax,
xlabel=xlabel, ylabel=ylabel,
fontsize=fontsize, portion=portion)
xlim=ax.get_xlim()
xticks=ax.get_xticks()
xtick_labels=ax.get_xticklabels().copy()
ylim=ax.get_ylim()
yticks=ax.get_yticks()
ytick_labels=ax.get_yticklabels().copy()
# Clear axes and start again
ax.cla()
if vertical:
ax.set_xlim((0, 1))
ax.invert_yaxis()
ax.set_ylim((depth, 0))
else:
ax.set_ylim((0, 1))
ax.set_xlim((0, depth))
ax.set_axis_off()#frame_on(False)
def pinball(x, y, y_std, color_scale=None,
layer=0, depth=1, ax=None,
alpha=1.0, portion=0.0, vertical=True):
scaledx, xscale, xoffset = scale_data(x, portion=portion)
scaledy, yscale, yoffset = scale_data(y, portion=portion)
y_std /= yscale
# Check whether data is anti-correlated on output
if np.dot((scaledx-0.5).T, (scaledy-0.5))<0:
scaledy=1-scaledy
flip=-1
else:
flip=1
if color_scale is not None:
color_scale, _, _=scale_data(color_scale, portion=0)
scaledy = (1-alpha)*scaledx + alpha*scaledy
def color_value(x, cmap=None, width=None, centers=None):
"""Return color as a function of position along x axis"""
if cmap is None:
cmap = np.asarray([[1, 0, 0], [1, 1, 0], [0, 1, 0]])
ncenters = cmap.shape[0]
if centers is None:
centers = np.linspace(0+0.5/ncenters, 1-0.5/ncenters, ncenters)
if width is None:
width = 0.25/ncenters
r = (x-centers)/width
weights = np.exp(-0.5*r*r).flatten()
weights /=weights.sum()
weights = weights[:, np.newaxis]
return np.dot(cmap.T, weights).flatten()
for i in range(x.shape[0]):
if color_scale is not None:
color = color_value(color_scale[i])
else:
color=(1, 0, 0)
x_plot = np.asarray((scaledx[i], scaledy[i])).flatten()
y_plot = np.asarray((layer, layer+alpha)).flatten()
if vertical:
ax.plot(x_plot, y_plot, color=color, alpha=0.5, linewidth=3)
ax.plot(x_plot, y_plot, color='k', alpha=0.5, linewidth=0.5)
else:
ax.plot(y_plot, x_plot, color=color, alpha=0.5, linewidth=3)
ax.plot(y_plot, x_plot, color='k', alpha=0.5, linewidth=0.5)
# Plot error bars that increase as sqrt of distance from start.
std_points = 50
stdy = np.linspace(0, alpha,std_points)
stdx = np.sqrt(stdy)*y_std[i]
stdy += layer
mean_vals = np.linspace(scaledx[i], scaledy[i], std_points)
upper = mean_vals+stdx
lower = mean_vals-stdx
fillcolor=color
x_errorbars=np.hstack((upper,lower[::-1]))
y_errorbars=np.hstack((stdy,stdy[::-1]))
if vertical:
ax.fill(x_errorbars,y_errorbars,
color=fillcolor, alpha=0.1)
ax.plot(scaledy[i], layer+alpha, '.',markersize=10, color=color, alpha=0.5)
else:
ax.fill(y_errorbars,x_errorbars,
color=fillcolor, alpha=0.1)
ax.plot(layer+alpha, scaledy[i], '.',markersize=10, color=color, alpha=0.5)
# Marker to show end point
return flip
# Whether final axis is flipped
flip = 1
first_x=last_x
for i, layer in enumerate(reversed(self.layers)):
if i==0:
xt = plot.pred_range(last_x, portion=portion, points=points)
color_scale=xt
yt_mean, yt_var = layer.predict(xt)
if layer==self.obslayer:
yt_mean = yt_mean*scale + offset
yt_var *= scale*scale
yt_sd = np.sqrt(yt_var)
flip = flip*pinball(xt,yt_mean,yt_sd,color_scale,portion=portion,
layer=i, ax=ax, depth=depth,vertical=vertical)#yt_mean-2*yt_sd,yt_mean+2*yt_sd)
xt = yt_mean
# Make room for axis labels
if vertical:
ax.set_ylim((2.1, -0.1))
ax.set_xlim((-0.02, 1.02))
ax.set_yticks(range(depth,0,-1))
else:
ax.set_xlim((-0.1, 2.1))
ax.set_ylim((-0.02, 1.02))
ax.set_xticks(range(0, depth))
def draw_axis(ax, scale=1.0, offset=0.0, level=0.0, flip=1,
label=None,up=False, nticks=10, ticklength=0.05,
vertical=True,
fontsize=20):
def clean_gap(gap):
nsf = np.log10(gap)
if nsf>0:
nsf = np.ceil(nsf)
else:
nsf = np.floor(nsf)
lower_gaps = np.asarray([0.005, 0.01, 0.02, 0.03, 0.04, 0.05,
0.1, 0.25, 0.5,
1, 1.5, 2, 2.5, 3, 4, 5, 10, 25, 50, 100])
upper_gaps = np.asarray([1, 2, 3, 4, 5, 10])
if nsf >2 or nsf<-2:
d = np.abs(gap-upper_gaps*10**nsf)
ind = np.argmin(d)
return upper_gaps[ind]*10**nsf
else:
d = np.abs(gap-lower_gaps)
ind = np.argmin(d)
return lower_gaps[ind]
tickgap = clean_gap(scale/(nticks-1))
nticks = int(scale/tickgap) + 1
tickstart = np.round(offset/tickgap)*tickgap
ticklabels = np.asarray(range(0, nticks))*tickgap + tickstart
ticks = (ticklabels-offset)/scale
axargs = {'color':'k', 'linewidth':1}
if not up:
ticklength=-ticklength
tickspot = np.linspace(0, 1, nticks)
if flip < 0:
ticks = 1-ticks
for tick, ticklabel in zip(ticks, ticklabels):
if vertical:
ax.plot([tick, tick], [level, level-ticklength], **axargs)
ax.text(tick, level-ticklength*2, ticklabel, horizontalalignment='center',
fontsize=fontsize/2)
ax.text(0.5, level-5*ticklength, label, horizontalalignment='center', fontsize=fontsize)
else:
ax.plot([level, level-ticklength], [tick, tick], **axargs)
ax.text(level-ticklength*2, tick, ticklabel, horizontalalignment='center',
fontsize=fontsize/2)
ax.text(level-5*ticklength, 0.5, label, horizontalalignment='center', fontsize=fontsize)
if vertical:
xlim = list(ax.get_xlim())
if ticks.min()<xlim[0]:
xlim[0] = ticks.min()
if ticks.max()>xlim[1]:
xlim[1] = ticks.max()
ax.set_xlim(xlim)
ax.plot([ticks.min(), ticks.max()], [level, level], **axargs)
else:
ylim = list(ax.get_ylim())
if ticks.min()<ylim[0]:
ylim[0] = ticks.min()
if ticks.max()>ylim[1]:
ylim[1] = ticks.max()
ax.set_ylim(ylim)
ax.plot([level, level], [ticks.min(), ticks.max()], **axargs)
_, xscale, xoffset = scale_data(first_x, portion=portion)
_, yscale, yoffset = scale_data(yt_mean, portion=portion)
draw_axis(ax=ax, scale=xscale, offset=xoffset, level=0.0, label=xlabel,
up=True, vertical=vertical)
draw_axis(ax=ax, scale=yscale, offset=yoffset,
flip=flip, level=depth, label=ylabel, up=False, vertical=vertical)
#for txt in xticklabels:
# txt.set
# Bind the new method to the Deep GP object.
deepgp.DeepGP.visualize_pinball=visualize_pinball
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
m.visualize_pinball(ax=ax, scale=scale, offset=offset, points=30, portion=0.1,
xlabel='year', ylabel='pace km/min', vertical=True)
mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/olympic-marathon-deep-gp-pinball.svg',
transparent=True, frameon=True)
num_low=25
num_high=25
gap = -.1
noise=0.0001
x = np.vstack((np.linspace(-1, -gap/2.0, num_low)[:, np.newaxis],
np.linspace(gap/2.0, 1, num_high)[:, np.newaxis]))
y = np.vstack((np.zeros((num_low, 1)), np.ones((num_high,1))))
scale = np.sqrt(y.var())
offset = y.mean()
yhat = (y-offset)/scale
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
_ = ax.set_xlabel('$x$', fontsize=20)
_ = ax.set_ylabel('$y$', fontsize=20)
xlim = (-2, 2)
ylim = (-0.6, 1.6)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/datasets/step-function.svg',
transparent=True, frameon=True)
m_full = GPy.models.GPRegression(x,yhat)
_ = m_full.optimize() # Optimize parameters of covariance function
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot_model_output(m_full, scale=scale, offset=offset, ax=ax, fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig,filename='../../slides/diagrams/gp/step-function-gp.svg',
transparent=True, frameon=True)
layers = [y.shape[1], 1, 1, 1,x.shape[1]]
inits = ['PCA']*(len(layers)-1)
kernels = []
for i in layers[1:]:
kernels += [GPy.kern.RBF(i)]
m = deepgp.DeepGP(layers,Y=yhat, X=x,
inits=inits,
kernels=kernels, # the kernels for each layer
num_inducing=20, back_constraint=False)
m.initialize()
m.staged_optimize()
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot_model_output(m, scale=scale, offset=offset, ax=ax, fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(filename='../../slides/diagrams/deepgp/step-function-deep-gp.svg',
transparent=True, frameon=True)
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot_model_sample(m, scale=scale, offset=offset, samps=10, ax=ax, portion = 0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/step-function-deep-gp-samples.svg',
transparent=True, frameon=True)
m.visualize(offset=offset, scale=scale, xlim=xlim, ylim=ylim,
dataset='step-function',
diagrams='../../slides/diagrams/deepgp')
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
m.visualize_pinball(offset=offset, ax=ax, scale=scale, xlim=xlim, ylim=ylim, portion=0.1, points=50)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/step-function-deep-gp-pinball.svg',
transparent=True, frameon=True, ax=ax)
import pods
data = pods.datasets.mcycle()
x = data['X']
y = data['Y']
scale=np.sqrt(y.var())
offset=y.mean()
yhat = (y - offset)/scale
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
_ = ax.set_xlabel('time', fontsize=20)
_ = ax.set_ylabel('acceleration', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(filename='../../slides/diagrams/datasets/motorcycle-helmet.svg',
transparent=True, frameon=True)
m_full = GPy.models.GPRegression(x,yhat)
_ = m_full.optimize() # Optimize parameters of covariance function
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot_model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='time', ylabel='acceleration/$g$', fontsize=20, portion=0.5)
xlim=(-20,80)
ylim=(-180,120)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig,filename='../../slides/diagrams/gp/motorcycle-helmet-gp.svg',
transparent=True, frameon=True)
layers = [y.shape[1], 1, x.shape[1]]
inits = ['PCA']*(len(layers)-1)
kernels = []
for i in layers[1:]:
kernels += [GPy.kern.RBF(i)]
m = deepgp.DeepGP(layers,Y=yhat, X=x,
inits=inits,
kernels=kernels, # the kernels for each layer
num_inducing=20, back_constraint=False)
m.initialize()
m.staged_optimize(iters=(1000,1000,10000), messages=(True, True, True))
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot_model_output(m, scale=scale, offset=offset, ax=ax, xlabel='time', ylabel='acceleration/$g$', fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(filename='../../slides/diagrams/deepgp/motorcycle-helmet-deep-gp.svg',
transparent=True, frameon=True)
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot_model_sample(m, scale=scale, offset=offset, samps=10, ax=ax, xlabel='time', ylabel='acceleration/$g$', portion = 0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/motorcycle-helmet-deep-gp-samples.svg',
transparent=True, frameon=True)
m.visualize(xlim=xlim, ylim=ylim, scale=scale,offset=offset,
xlabel="time", ylabel="acceleration/$g$", portion=0.5,
dataset='motorcycle-helmet',
diagrams='../../slides/diagrams/deepgp')
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
m.visualize_pinball(ax=ax, xlabel='time', ylabel='acceleration/g',
points=50, scale=scale, offset=offset, portion=0.1)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/motorcycle-helmet-deep-gp-pinball.svg',
transparent=True, frameon=True)
data=pods.datasets.robot_wireless()
x = np.linspace(0,1,215)[:, np.newaxis]
y = data['Y']
offset = y.mean()
scale = np.sqrt(y.var())
yhat = (y-offset)/scale
fig, ax = plt.subplots(figsize=plot.big_figsize)
plt.plot(data['X'][:, 1], data['X'][:, 2], 'r.', markersize=5)
ax.set_xlabel('x position', fontsize=20)
ax.set_ylabel('y position', fontsize=20)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/datasets/robot-wireless-ground-truth.svg', transparent=True, frameon=True)
output_dim=1
xlim = (-0.3, 1.3)
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x.flatten(), y[:, output_dim],
'r.', markersize=5)
ax.set_xlabel('time', fontsize=20)
ax.set_ylabel('signal strength', fontsize=20)
xlim = (-0.2, 1.2)
ylim = (-0.6, 2.0)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/datasets/robot-wireless-dim-' + str(output_dim) + '.svg',
transparent=True, frameon=True)
m_full = GPy.models.GPRegression(x,yhat)
_ = m_full.optimize() # Optimize parameters of covariance function
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot_model_output(m_full, output_dim=output_dim, scale=scale, offset=offset, ax=ax,
xlabel='time', ylabel='signal strength', fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(filename='../../slides/diagrams/gp/robot-wireless-gp-dim-' + str(output_dim)+ '.svg',
transparent=True, frameon=True)
layers = [y.shape[1], 10, 5, 2, 2, x.shape[1]]
inits = ['PCA']*(len(layers)-1)
kernels = []
for i in layers[1:]:
kernels += [GPy.kern.RBF(i, ARD=True)]
m = deepgp.DeepGP(layers,Y=y, X=x, inits=inits,
kernels=kernels,
num_inducing=50, back_constraint=False)
m.initialize()
m.staged_optimize(messages=(True,True,True))
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot_model_output(m, output_dim=output_dim, scale=scale, offset=offset, ax=ax,
xlabel='time', ylabel='signal strength', fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/robot-wireless-deep-gp-dim-' + str(output_dim)+ '.svg',
transparent=True, frameon=True)
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot_model_sample(m, output_dim=output_dim, scale=scale, offset=offset, samps=10, ax=ax,
xlabel='time', ylabel='signal strength', fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/robot-wireless-deep-gp-samples-dim-' + str(output_dim)+ '.svg',
transparent=True, frameon=True)
fig, ax = plt.subplots(figsize=plot.big_figsize)
ax.plot(m.layers[-2].latent_space.mean[:, 0],
m.layers[-2].latent_space.mean[:, 1],
'r.-', markersize=5)
ax.set_xlabel('latent dimension 1', fontsize=20)
ax.set_ylabel('latent dimension 2', fontsize=20)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/robot-wireless-latent-space.svg',
transparent=True, frameon=True)
‘High five’ data.
Model learns structure between two interacting subjects.
\credit{Zhenwen Dai and Neil D. Lawrence}
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rc
from IPython.display import display
import deepgp
import GPy
from gp_tutorial import ax_default, meanplot, gpplot
import mlai
import teaching_plots as plot
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
np.random.seed(0)
digits = [0,1,2,3,4]
N_per_digit = 100
Y = []
labels = []
for d in digits:
imgs = mnist['data'][mnist['target']==d]
Y.append(imgs[np.random.permutation(imgs.shape[0])][:N_per_digit])
labels.append(np.ones(N_per_digit)*d)
Y = np.vstack(Y).astype(np.float64)
labels = np.hstack(labels)
Y /= 255.
num_latent = 2
num_hidden_2 = 5
m = deepgp.DeepGP([Y.shape[1],num_hidden_2,num_latent],
Y,
kernels=[GPy.kern.RBF(num_hidden_2,ARD=True),
GPy.kern.RBF(num_latent,ARD=False)],
num_inducing=50, back_constraint=False,
encoder_dims=[[200],[200]])
m.obslayer.likelihood.variance[:] = Y.var()*0.01
for layer in m.layers:
layer.kern.variance.fix(warning=False)
layer.likelihood.variance.fix(warning=False)
m.optimize(messages=False,max_iters=100)
for layer in m.layers:
layer.kern.variance.constrain_positive(warning=False)
m.optimize(messages=False,max_iters=100)
for layer in m.layers:
layer.likelihood.variance.constrain_positive(warning=False)
m.optimize(messages=True,max_iters=10000)
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica'],'size':20})
fig, ax = plt.subplots(figsize=plot.big_figsize)
for d in digits:
ax.plot(m.layer_1.X.mean[labels==d,0],m.layer_1.X.mean[labels==d,1],'.',label=str(d))
_ = plt.legend()
mlai.write_figure(figure=fig, filename="../../slides/diagrams/deepgp/usps-digits-latent.svg", transparent=True)
m.obslayer.kern.lengthscale
fig, ax = plt.subplots(figsize=plot.big_figsize)
for i in range(5):
for j in range(i):
dims=[i, j]
ax.cla()
for d in digits:
ax.plot(m.obslayer.X.mean[labels==d,dims[0]],
m.obslayer.X.mean[labels==d,dims[1]],
'.', label=str(d))
plt.legend()
plt.xlabel('dimension ' + str(dims[0]))
plt.ylabel('dimension ' + str(dims[1]))
mlai.write_figure(figure=fig, filename="../../slides/diagrams/deepgp/usps-digits-hidden-" + str(dims[0]) + '-' + str(dims[1]) + '.svg', transparent=True)
rows = 10
cols = 20
t=np.linspace(-1, 1, rows*cols)[:, None]
kern = GPy.kern.RBF(1,lengthscale=0.05)
cov = kern.K(t, t)
x = np.random.multivariate_normal(np.zeros(rows*cols), cov, num_latent).T
yt = m.predict(x)
fig, axs = plt.subplots(rows,cols,figsize=(10,6))
for i in range(rows):
for j in range(cols):
#v = np.random.normal(loc=yt[0][i*cols+j, :], scale=np.sqrt(yt[1][i*cols+j, :]))
v = yt[0][i*cols+j, :]
axs[i,j].imshow(v.reshape(28,28),
cmap='gray', interpolation='none',
aspect='equal')
axs[i,j].set_axis_off()
mlai.write_figure(figure=fig, filename="../../slides/diagrams/deepgp/digit-samples-deep-gp.svg", transparent=True)
Deep nets are powerful approach to images, speech, language.
Proposal: Deep GPs may also be a great approach, but better to deploy according to natural strengths.
Probabilistic numerics, surrogate modelling, emulation, and UQ.
Not a fan of AI as a term.
But we are faced with increasing amounts of algorithmic decision making.
When trading off decisions: compute or acquire data?
There is a critical need for uncertainty.
Uncertainty quantification (UQ) is the science of quantitative characterization and reduction of uncertainties in both computational and real world applications. It tries to determine how likely certain outcomes are if some aspects of the system are not exactly known.
Designing an F1 Car requires CFD, Wind Tunnel, Track Testing etc.
How to combine them?
where $\textbf{u}_t$ is the action force, $\inputVector_t = (p_t, v_t)$ is the vehicle state
import gym
env = gym.make('MountainCarContinuous-v0')
import mountain_car as mc
import GPyOpt
obj_func = lambda x: mc.run_simulation(env, x)[0]
objective = GPyOpt.core.task.SingleObjective(obj_func)
## --- We define the input space of the emulator
space= [{'name':'postion_parameter', 'type':'continuous', 'domain':(-1.2, +1)},
{'name':'velocity_parameter', 'type':'continuous', 'domain':(-1/0.07, +1/0.07)},
{'name':'constant', 'type':'continuous', 'domain':(-1, +1)}]
design_space = GPyOpt.Design_space(space=space)
model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True)
aquisition_optimizer = GPyOpt.optimization.AcquisitionOptimizer(design_space)
acquisition = GPyOpt.acquisitions.AcquisitionEI(model, design_space, optimizer=aquisition_optimizer)
evaluator = GPyOpt.core.evaluators.Sequential(acquisition) # Collect points sequentially, no parallelization.
from GPyOpt.experiment_design.random_design import RandomDesign
n_initial_points = 25
random_design = RandomDesign(design_space)
initial_design = random_design.get_samples(n_initial_points)
import numpy as np
random_controller = initial_design[0,:]
_, _, _, frames = mc.run_simulation(env, np.atleast_2d(random_controller), render=True)
anim=mc.animate_frames(frames, 'Random linear controller')
from IPython.core.display import HTML
HTML(anim.to_jshtml())
mc.save_frames(frames,
diagrams='../slides/diagrams/uq',
filename='mountain_car_random.html')
max_iter = 50
bo = GPyOpt.methods.ModularBayesianOptimization(model, design_space, objective, acquisition, evaluator, initial_design)
bo.run_optimization(max_iter = max_iter )
_, _, _, frames = mc.run_simulation(env, np.atleast_2d(bo.x_opt), render=True)
anim=mc.animate_frames(frames, 'Best controller after 50 iterations of Bayesian optimization')
HTML(anim.to_jshtml())
mc.save_frames(frames,
diagrams='../slides/diagrams/uq',
filename='mountain_car_simulated.html')
For standard Bayesian Optimization ignored dynamics of the car.
For more data efficiency, first emulate the dynamics.
Then do Bayesian optimization of the emulator.
import gym
env = gym.make('MountainCarContinuous-v0')
import GPyOpt
space_dynamics = [{'name':'position', 'type':'continuous', 'domain':[-1.2, +0.6]},
{'name':'velocity', 'type':'continuous', 'domain':[-0.07, +0.07]},
{'name':'action', 'type':'continuous', 'domain':[-1, +1]}]
design_space_dynamics = GPyOpt.Design_space(space=space_dynamics)
Use a Gaussian process to model $$\Delta v_{t+1} = v_{t+1} - v_{t}$$ and $$\Delta x_{t+1} = p_{t+1} - p_{t}$$
Two processes, one with mean $v_{t}$ one with mean $p_{t}$
position_model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True)
velocity_model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True)
import numpy as np
from GPyOpt.experiment_design.random_design import RandomDesign
import mountain_car as mc
### --- Random locations of the inputs
n_initial_points = 500
random_design_dynamics = RandomDesign(design_space_dynamics)
initial_design_dynamics = random_design_dynamics.get_samples(n_initial_points)
### --- Simulation of the (normalized) outputs
y = np.zeros((initial_design_dynamics.shape[0], 2))
for i in range(initial_design_dynamics.shape[0]):
y[i, :] = mc.simulation(initial_design_dynamics[i, :])
# Normalize the data from the simulation
y_normalisation = np.std(y, axis=0)
y_normalised = y/y_normalisation
Used 500 randomly selected points to train emulators.
Can make proces smore efficient through experimental design.
position_model.updateModel(initial_design_dynamics, y[:, [0]], None, None)
velocity_model.updateModel(initial_design_dynamics, y[:, [1]], None, None)
from IPython.html.widgets import interact
control = mc.plot_control(velocity_model)
interact(control.plot_slices, control=(-1, 1, 0.05))
controller_gains = np.atleast_2d([0, .6, 1]) # change the valus of the linear controller to observe the trayectories.
mc.emu_sim_comparison(env, controller_gains, [position_model, velocity_model],
max_steps=500, diagrams='../slides/diagrams/uq')
### --- Optimize control parameters with emulator
car_initial_location = np.asarray([-0.58912799, 0])
### --- Reward objective function using the emulator
obj_func_emulator = lambda x: mc.run_emulation([position_model, velocity_model], x, car_initial_location)[0]
objective_emulator = GPyOpt.core.task.SingleObjective(obj_func_emulator)
### --- Elements of the optimization that will use the multi-fidelity emulator
model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True)
space= [{'name':'linear_1', 'type':'continuous', 'domain':(-1/1.2, +1)},
{'name':'linear_2', 'type':'continuous', 'domain':(-1/0.07, +1/0.07)},
{'name':'constant', 'type':'continuous', 'domain':(-1, +1)}]
design_space = GPyOpt.Design_space(space=space)
aquisition_optimizer = GPyOpt.optimization.AcquisitionOptimizer(design_space)
random_design = RandomDesign(design_space)
initial_design = random_design.get_samples(25)
acquisition = GPyOpt.acquisitions.AcquisitionEI(model, design_space, optimizer=aquisition_optimizer)
evaluator = GPyOpt.core.evaluators.Sequential(acquisition)
bo_emulator = GPyOpt.methods.ModularBayesianOptimization(model, design_space, objective_emulator, acquisition, evaluator, initial_design)
bo_emulator.run_optimization(max_iter=50)
_, _, _, frames = mc.run_simulation(env, np.atleast_2d(bo_emulator.x_opt), render=True)
anim=mc.animate_frames(frames, 'Best controller using the emulator of the dynamics')
from IPython.core.display import HTML
HTML(anim.to_jshtml())
mc.save_frames(frames,
diagrams='../slides/diagrams/uq',
filename='mountain_car_emulated.html')
Our emulator used only 500 calls to the simulator.
Optimizing the simulator directly required 37,500 calls to the simulator.
import gym
env = gym.make('MountainCarContinuous-v0')
import GPyOpt
### --- Collect points from low and high fidelity simulator --- ###
space = GPyOpt.Design_space([
{'name':'position', 'type':'continuous', 'domain':(-1.2, +1)},
{'name':'velocity', 'type':'continuous', 'domain':(-0.07, +0.07)},
{'name':'action', 'type':'continuous', 'domain':(-1, +1)}])
n_points = 250
random_design = GPyOpt.experiment_design.RandomDesign(space)
x_random = random_design.get_samples(n_points)
import numpy as np
import mountain_car as mc
d_position_hf = np.zeros((n_points, 1))
d_velocity_hf = np.zeros((n_points, 1))
d_position_lf = np.zeros((n_points, 1))
d_velocity_lf = np.zeros((n_points, 1))
# --- Collect high fidelity points
for i in range(0, n_points):
d_position_hf[i], d_velocity_hf[i] = mc.simulation(x_random[i, :])
# --- Collect low fidelity points
for i in range(0, n_points):
d_position_lf[i], d_velocity_lf[i] = mc.low_cost_simulation(x_random[i, :])
### --- Optimize controller parameters
obj_func = lambda x: mc.run_simulation(env, x)[0]
obj_func_emulator = lambda x: mc.run_emulation([position_model, velocity_model], x, car_initial_location)[0]
objective_multifidelity = GPyOpt.core.task.SingleObjective(obj_func)
from GPyOpt.experiment_design.random_design import RandomDesign
model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True)
space= [{'name':'linear_1', 'type':'continuous', 'domain':(-1/1.2, +1)},
{'name':'linear_2', 'type':'continuous', 'domain':(-1/0.07, +1/0.07)},
{'name':'constant', 'type':'continuous', 'domain':(-1, +1)}]
design_space = GPyOpt.Design_space(space=space)
aquisition_optimizer = GPyOpt.optimization.AcquisitionOptimizer(design_space)
n_initial_points = 25
random_design = RandomDesign(design_space)
initial_design = random_design.get_samples(n_initial_points)
acquisition = GPyOpt.acquisitions.AcquisitionEI(model, design_space, optimizer=aquisition_optimizer)
evaluator = GPyOpt.core.evaluators.Sequential(acquisition)
bo_multifidelity = GPyOpt.methods.ModularBayesianOptimization(model, design_space, objective_multifidelity, acquisition, evaluator, initial_design)
bo_multifidelity.run_optimization(max_iter=50)
_, _, _, frames = mc.run_simulation(env, np.atleast_2d(bo_multifidelity.x_opt), render=True)
anim=mc.animate_frames(frames, 'Best controller with multi-fidelity emulator')
from IPython.core.display import HTML
HTML(anim.to_jshtml())
mc.save_frames(frames,
diagrams='../slides/diagrams/uq',
filename='mountain_car_multi_fidelity.html')
Stefanos Eleftheriadis, John Bronskill, Hugh Salimbeni, Rich Turner, Zhenwen Dai, Javier Gonzalez, Andreas Damianou, Mark Pullin.
Powerful framework but
Software isn't there yet.
Our focus: Gaussian Processes driven by MXNet
Composition of GPs, Neural Networks, Other Models