Flexible Parametric Representations of Non Parametric Models

18th March 2015 Neil D. Lawrence

Inducing point description initially inspired by a notebook of James Hensman

First import some relevant libraries and set a random seed.

In [3]:
import GPy
import pods
from GPy.util.linalg import pdinv

from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np

from IPython.display import display 


Flexible Parametric Approximation to Gaussian Process

In this notebook we will try and demonstrate the principles underpinning a flexible parameteric approximation to a non-parameteric model. In particular, the argument goes along these lines. In general, we want to be non-parametric. What do we mean by non-parametric. Here we mean non-parametric in the sense that when we try and represent the relationship between training and test data, $p(\mathbf{y}^\ast|\mathbf{y})$, through a posterior density over a vector of parameters, $\mathbf{w}$, $$ p(\mathbf{y}^\ast|\mathbf{y}) = \int p(\mathbf{y}^\ast|\mathbf{w})p(\mathbf{w}|\mathbf{y}) \text{d}\mathbf{w}, $$ we find that the vector $\mathbf{w}$ cannot be fixed dimensional.

In a Gaussian process model, we normally relate the observations to a latent function through a likelihood that factorizes, $$ p(\mathbf{y}|\mathbf{f}) = \prod_{i=1}^n p(y_i|f_i), $$ Variational inducing variables involve augmenting the prior over functions with a vector of variables, $\mathbf{u}$, $$ p(\mathbf{f}) = \int p(\mathbf{f}|\mathbf{u}) p(\mathbf{u}) \text{d}\mathbf{u} $$ and then lower bounding the conditional distribution, $$ p(\mathbf{y}|\mathbf{u}) = \int p(\mathbf{y}|\mathbf{f}) p(\mathbf{f}|\mathbf{u}) \text{d}\mathbf{u} \geq \prod_{i=1}^n c_i \hat{p}(\mathbf{y}|\mathbf{u}) $$ The lower bound is then used in a model that looks parametric, $$ p(\mathbf{y}^*|\mathbf{y}) = \int p(\mathbf{y}^*|\mathbf{u}) \hat{p}(\mathbf{u}|\mathbf{y})\text{d}\mathbf{u}, $$ but with the important difference that the number of parameters can be changed at run time not just design time.


Toy Data Set

To show the model working in practice, we are first going to sample a function from a Gaussian process. We will use an exponentiated quadratic covariance function, Sample a data set with two different clusters on the inputs.

In [4]:
N = 50
noise_var = 0.01
X = np.zeros((50, 1))
X[:25, :] = np.linspace(0,3,25)[:,None] # First cluster of inputs/covariates
X[25:, :] = np.linspace(7,10,25)[:,None] # Second cluster of inputs/covariates

# Sample response variables from a Gaussian process with exponentiated quadratic covariance.
k = GPy.kern.RBF(1)
y = np.random.multivariate_normal(np.zeros(N),k.K(X)+np.eye(N)*np.sqrt(noise_var))[:, None]

Full Gaussian Process

Now we have the full data set we will construct a Gaussian process and optimize the parameters, showing the plot of the fit.

In [5]:
m_full = GPy.models.GPRegression(X,y)
m_full.optimize(messages=True) # Optimize parameters of covariance function
_ = m_full.plot() # plot the regression

Inducing Variable Approximation

In an inducing variable approximation, we introduce 'pseudo-observations' of the function, $\mathbf{u}$, which 'induce' the approximation. We will start by introducing four 'pseudo-observations' at locations 2.5, 4, 7 and 8.5. We will then display the untrained model.

In [6]:
kern = GPy.kern.RBF(1)
Z = np.hstack(
m = GPy.models.SparseGPRegression(X,y,kernel=kern,Z=Z)
m.Gaussian_noise.variance = noise_var
clang: warning: argument unused during compilation: '-fopenmp'
In file included from /Users/neil/.cache/scipy/python27_compiled/sc_1790bf65208b11355ffcfd4b65a5f10913.cpp:11:
In file included from /Users/neil/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/scipy/weave/blitz/blitz/array.h:26:
In file included from /Users/neil/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/scipy/weave/blitz/blitz/array-impl.h:37:
/Users/neil/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/scipy/weave/blitz/blitz/range.h:120:34: warning: '&&' within '||' [-Wlogical-op-parentheses]
        return ((first_ < last_) && (stride_ == 1) || (first_ == last_));
                ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~ ~~
/Users/neil/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/scipy/weave/blitz/blitz/range.h:120:34: note: place parentheses around the '&&' expression to silence this warning
        return ((first_ < last_) && (stride_ == 1) || (first_ == last_));
                (                                 )
In file included from /Users/neil/.cache/scipy/python27_compiled/sc_1790bf65208b11355ffcfd4b65a5f10913.cpp:23:
In file included from /Users/neil/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/numpy/core/include/numpy/arrayobject.h:4:
In file included from /Users/neil/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:17:
In file included from /Users/neil/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1761:
/Users/neil/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: "Using deprecated NumPy API, disable it by "          "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
#warning "Using deprecated NumPy API, disable it by " \
/Users/neil/.cache/scipy/python27_compiled/sc_1790bf65208b11355ffcfd4b65a5f10913.cpp:24:10: fatal error: 'omp.h' file not found
#include <omp.h>
2 warnings and 1 error generated.

 Weave compilation failed. Falling back to (slower) numpy implementation

WARNING: reconstraining parameters sparse_gp_mpi.rbf.variance
WARNING: reconstraining parameters sparse_gp_mpi.rbf.lengthscale
WARNING: reconstraining parameters sparse_gp_mpi.Gaussian_noise.variance

Model: sparse gp mpi
Log-likelihood: -1822.27351085
Number of Parameters: 7
Updates: True

sparse_gp_mpi. Value Constraint Prior Tied to
inducing inputs (4, 1)
rbf.variance 1.0 fixed
rbf.lengthscale 1.0 fixed
Gaussian_noise.variance 0.01 fixed

When the prior over $\mathbf{u}$ is integrated over the effective likelihood, $\hat{p}(\mathbf{y}|\mathbf{u})$, it's possible to show that the resulting marginal likelihood, $\hat{p}(\mathbf{y})$, which forms the core of the lower bound on the likelihood, is a Gaussian process with a low rank form for the covariance, $$ \hat{p}(\mathbf{y}) \sim \mathcal{N}(\mathbf{0}, \hat{\mathbf{K}}) $$ where $$ \hat{\mathbf{K}} = \mathbf{K}_{fu}\mathbf{K}_{uu}^{-1} \mathbf{K}_{uf} + \sigma^2 \mathbf{I}, $$ where $\mathbf{K}_{fu}$ gives the cross covariance between the variables of our function $f()$ and those of the inducing functions $u()$.

Let's have a look at the form of this approximation for the un-optimized inducing variables.

In [7]:
# Visualise full covariance and approximation.
Kff = m_full.kern.K(m.X,m.X)
Kfu = m.kern.K(m.X, m.Z)
Kuu = m.kern.K(m.Z, m.Z)
Kuf = Kfu.T
noise = m['.*noise'][0]
fig, ax = plt.subplots(1, 2, figsize=(12, 8))
ax[0].matshow(np.dot(np.dot(Kfu,pdinv(Kuu)[0]),Kuf)  + np.eye(m.X.shape[0])*m['.*noise'])
ax[0].set_title('Low Rank Approximation')
ax[1].matshow(Kff + np.eye(m.X.shape[0])*m['.*noise'])
ax[1].set_title('Full Covariance')
<matplotlib.text.Text at 0x110234250>

Variational Compression

The variational compression bound minmizes the additional information obtained about $\mathbf{f}$ by knowing $\mathbf{y}$ with respect to that which is known by observing $\mathbf{u}$ alone. Maximizing the lower bound (whilst keeping model parameters fixed) compresses the information from $\mathbf{y}$ into $q(\mathbf{u})$.

In [8]:

Model: sparse gp mpi
Log-likelihood: -832.753958521
Number of Parameters: 7
Updates: True

sparse_gp_mpi. Value Constraint Prior Tied to
inducing inputs (4, 1)
rbf.variance 1.0 fixed
rbf.lengthscale 1.0 fixed
Gaussian_noise.variance 0.01 fixed
In [9]:
m.optimize('bfgs', messages=True)
_ = m.plot()

Joint Optimization

In practice we actually optimize the model parameters alongside the variational parameters. This means we either find better solutions, or solutions where it's easier to compress the information from $\mathbf{y}$ into $\mathbf{u}$. The former case occurs because for certain choices of prior over $\mathbf{f}$ the values of $\mathbf{y}$ do not provide a lot of information.

In [10]:
M = 8
Z = np.random.rand(M,1)*12
m = GPy.models.SparseGPRegression(X,y,kernel=kern,Z=Z)
m.likelihood.variance = noise_var

fig, ax = plt.subplots(1, 2, figsize=(16, 8))
print M, "inducing variables"
print "Full model log likelihood: ", m_full.log_likelihood()
print "Lower bound from variational method: ", m.log_likelihood()
print "Information gain (in nats) associated with y ", m_full.log_likelihood() - m.log_likelihood()
8 inducing variables
Full model log likelihood:  -36.4462976038
Lower bound from variational method:  [[-39.99802542]]
Information gain (in nats) associated with y  [[ 3.55172782]]
In [11]:
Kff = m_full.kern.K(m.X,m.X)
Kfu = m.kern.K(m.X, m.Z)
Kuu = m.kern.K(m.Z, m.Z)
Kuf = Kfu.T
sigma2 = m.likelihood.variance
KfuKuuIKuf = np.dot(np.dot(Kfu,pdinv(Kuu)[0]),Kuf)
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
ax[0].matshow(KfuKuuIKuf  + np.eye(m.X.shape[0])*sigma2)
ax[0].set_title('Low Rank Approximation')
ax[1].matshow(Kff + np.eye(m.X.shape[0])*sigma2)
_ = ax[1].set_title('Full Covariance')

Ceres Data

Now we will look at some real data. This is data as published by Franz von Zach's volume, Monatliche Correspondenz by Giuseppe Piazzi for observations of the (dwarf) planet Ceres, as shown in the Google book below. }

In [12]:
pods.notebook.display_google_book('JBw4AAAAMAAJ', 'PA280')

We've made the data available in the pods library. The data is famous because Gauss fitted the orbit of Ceres, and made a made a prediction about the location of the planet a year after its discovery, allowing the planet to be recovered (it had been lost behind the sun). This established Gauss at the age of 23 as a leading European mathematician. Gauss later claimed that he used the Gaussian error model when making his predictions. The data is in the form of a pandas data frame, which can be loaded and summarized as follows.

In [13]:
import pods
data = pods.datasets.ceres()['data']
Mittlere Sonnenzeit Gerade Aufstig in Zeit Gerade Aufstiegung in Graden Nordlich Abweich Geocentrische Laenger Geocentrische Breite Ort der Sonne + 20" Aberration Logar. d. Distanz
count 21.000000 21.000000 21.000000 20.000000 19.000000 19.000000 19.000000 19.000000
mean 7.460553 3.474579 52.117897 17.052194 1.867660 1.802494 10.089094 9.993320
std 0.784663 0.055710 0.836052 0.985692 0.036103 0.815463 0.444599 0.000609
min 6.199500 3.424925 51.374056 15.628750 1.832791 0.600806 9.396909 9.992616
25% 6.807333 3.435597 51.533972 16.329201 1.839015 1.153542 9.780756 9.992807
50% 7.400750 3.448292 51.724389 17.015889 1.851418 1.707806 10.084920 9.993189
75% 8.038194 3.504792 52.571889 17.827847 1.889333 2.383375 10.431290 9.993735
max 8.721611 3.618483 54.277250 18.799667 1.952000 3.111694 10.813114 9.994582

Gaussian Process Fit to Ceres Data

Now let's try fitting 'Gerade Aufstig in Zeit' using a Gaussian process. First thing to do is remove the missing value.

In [14]:
X = np.delete(np.asarray(data.index.dayofyear, dtype='float')[:, None], 8, axis=0)
y = np.delete(np.asarray(data['Gerade Aufstig in Zeit'])[:, None], 8, axis=0)

Now we will use an exponentiated quadratic covariance. Because observations are in days, we initialise the lengthscale to 10 days. Also the noise on these observations turns out to be very low. To prevent numerical problems, we add a bound to prevent the noise going below 1e-6.

In [15]:
kern = GPy.kern.RBF(1)
kern.lengthscale = 10.

m = GPy.models.GPRegression(X, y, kern)
# noise is so low that we get numerical issues if we allow noise variance to be 'free'
m.Gaussian_noise.variance.constrain_bounded(1e-6, 10)
WARNING: reconstraining parameters GP_regression.Gaussian_noise.variance

Model: GP regression
Log-likelihood: 99.4357991358
Number of Parameters: 3
Updates: True

GP_regression. Value Constraint Prior Tied to
rbf.variance 24.1179178618 +ve
rbf.lengthscale 141.255644761 +ve
Gaussian_noise.variance 1e-061e-06,10.0

Note the log likelihood of the fit which is just over 99.43.

Low Rank Gaussian Process Fit

Now we form the same fit again but using a low rank Gaussian process. Buy changing the number of inducing variables we capture different aspects of the function. We'll start with one inducing variable. What aspect of the function do we expect to capture if we only store 1 thing about it? For each fit check the log likelihood of the result and compare it to the log likelihood of the full Gaussian process above (it should always lower bound this value). How many inducing variables do we need to capture this sequence?

In [19]:
M = 4
Z = np.random.rand(M,1)*45
kern2 = GPy.kern.RBF(1)
kern2.lengthscale = 10.
m2 = GPy.models.SparseGPRegression(X, y, kern2, Z=Z)
m2.Gaussian_noise.variance.constrain_bounded(1e-6, 10)
_ = m2.plot()
WARNING: reconstraining parameters sparse_gp_mpi.Gaussian_noise.variance

Some Things to Try

Do you always get the same result for different initialisations of the inputs to the inducing variables? Can you recover the full model likelihood for enough inducing inputs? If not, why is that?