In [3]:

```
import GPy
import pods
from GPy.util.linalg import pdinv
from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np
from IPython.display import display
np.random.seed(123)
```

In this notebook we will try and demonstrate the principles underpinning a flexible parameteric approximation to a non-parameteric model. In particular, the argument goes along these lines. In general, we want to be non-parametric. What do we mean by non-parametric. Here we mean non-parametric in the sense that when we try and represent the relationship between training and test data, $p(\mathbf{y}^\ast|\mathbf{y})$, through a posterior density over a vector of parameters, $\mathbf{w}$, $$ p(\mathbf{y}^\ast|\mathbf{y}) = \int p(\mathbf{y}^\ast|\mathbf{w})p(\mathbf{w}|\mathbf{y}) \text{d}\mathbf{w}, $$ we find that the vector $\mathbf{w}$ cannot be fixed dimensional.

In a Gaussian process model, we normally relate the observations to a latent function through a likelihood that factorizes,
$$
p(\mathbf{y}|\mathbf{f}) = \prod_{i=1}^n p(y_i|f_i),
$$
Variational inducing variables involve augmenting the prior over functions with a vector of variables, $\mathbf{u}$,
$$
p(\mathbf{f}) = \int p(\mathbf{f}|\mathbf{u}) p(\mathbf{u}) \text{d}\mathbf{u}
$$
and then lower bounding the conditional distribution,
$$
p(\mathbf{y}|\mathbf{u}) = \int p(\mathbf{y}|\mathbf{f}) p(\mathbf{f}|\mathbf{u}) \text{d}\mathbf{u} \geq \prod_{i=1}^n c_i \hat{p}(\mathbf{y}|\mathbf{u})
$$
The lower bound is then used in a model that *looks* parametric,
$$
p(\mathbf{y}^*|\mathbf{y}) = \int p(\mathbf{y}^*|\mathbf{u}) \hat{p}(\mathbf{u}|\mathbf{y})\text{d}\mathbf{u},
$$
but with the important difference that the number of parameters can be changed at *run* time not just *design* time.

To show the model working in practice, we are first going to sample a function from a Gaussian process. We will use an exponentiated quadratic covariance function, Sample a data set with two different clusters on the inputs.

In [4]:

```
N = 50
noise_var = 0.01
X = np.zeros((50, 1))
X[:25, :] = np.linspace(0,3,25)[:,None] # First cluster of inputs/covariates
X[25:, :] = np.linspace(7,10,25)[:,None] # Second cluster of inputs/covariates
# Sample response variables from a Gaussian process with exponentiated quadratic covariance.
k = GPy.kern.RBF(1)
y = np.random.multivariate_normal(np.zeros(N),k.K(X)+np.eye(N)*np.sqrt(noise_var))[:, None]
```

Now we have the full data set we will construct a Gaussian process and optimize the parameters, showing the plot of the fit.

In [5]:

```
m_full = GPy.models.GPRegression(X,y)
m_full.optimize(messages=True) # Optimize parameters of covariance function
_ = m_full.plot() # plot the regression
```

In an inducing variable approximation, we introduce 'pseudo-observations' of the function, $\mathbf{u}$, which 'induce' the approximation. We will start by introducing four 'pseudo-observations' at locations 2.5, 4, 7 and 8.5. We will then display the untrained model.

In [6]:

```
kern = GPy.kern.RBF(1)
Z = np.hstack(
(np.linspace(2.5,4.,2),
np.linspace(7,8.5,2)))[:,None]
m = GPy.models.SparseGPRegression(X,y,kernel=kern,Z=Z)
m.Gaussian_noise.variance = noise_var
m.rbf.variance.constrain_fixed()
m.rbf.lengthscale.constrain_fixed()
m.Gaussian_noise.variance.constrain_fixed()
m.plot()
display(m)
```

When the prior over $\mathbf{u}$ is integrated over the effective likelihood, $\hat{p}(\mathbf{y}|\mathbf{u})$, it's possible to show that the resulting marginal likelihood, $\hat{p}(\mathbf{y})$, which forms the core of the lower bound on the likelihood, is a Gaussian process with a low rank form for the covariance, $$ \hat{p}(\mathbf{y}) \sim \mathcal{N}(\mathbf{0}, \hat{\mathbf{K}}) $$ where $$ \hat{\mathbf{K}} = \mathbf{K}_{fu}\mathbf{K}_{uu}^{-1} \mathbf{K}_{uf} + \sigma^2 \mathbf{I}, $$ where $\mathbf{K}_{fu}$ gives the cross covariance between the variables of our function $f()$ and those of the inducing functions $u()$.

Let's have a look at the form of this approximation for the un-optimized inducing variables.

In [7]:

```
# Visualise full covariance and approximation.
Kff = m_full.kern.K(m.X,m.X)
Kfu = m.kern.K(m.X, m.Z)
Kuu = m.kern.K(m.Z, m.Z)
Kuf = Kfu.T
noise = m['.*noise'][0]
fig, ax = plt.subplots(1, 2, figsize=(12, 8))
ax[0].matshow(np.dot(np.dot(Kfu,pdinv(Kuu)[0]),Kuf) + np.eye(m.X.shape[0])*m['.*noise'])
ax[0].set_title('Low Rank Approximation')
ax[1].matshow(Kff + np.eye(m.X.shape[0])*m['.*noise'])
ax[1].set_title('Full Covariance')
```

Out[7]:

The variational compression bound minmizes the additional information obtained about $\mathbf{f}$ by knowing $\mathbf{y}$ with respect to that which is known by observing $\mathbf{u}$ alone. Maximizing the lower bound (whilst keeping model parameters fixed) compresses the information from $\mathbf{y}$ into $q(\mathbf{u})$.

In [8]:

```
m.optimize(messages=True)
m.plot()
display(m)
```

In [9]:

```
m.randomize()
m.Z.unconstrain()
m.optimize('bfgs', messages=True)
_ = m.plot()
```

In practice we actually optimize the model parameters alongside the variational parameters. This means we either find better solutions, or solutions where it's easier to compress the information from $\mathbf{y}$ into $\mathbf{u}$. The former case occurs because for certain choices of prior over $\mathbf{f}$ the values of $\mathbf{y}$ do not provide a lot of information.

In [10]:

```
M = 8
Z = np.random.rand(M,1)*12
m = GPy.models.SparseGPRegression(X,y,kernel=kern,Z=Z)
m.likelihood.variance = noise_var
fig, ax = plt.subplots(1, 2, figsize=(16, 8))
m.optimize(messages=True)
m.plot(ax=ax[0])
m_full.plot(ax=ax[1])
print M, "inducing variables"
print "Full model log likelihood: ", m_full.log_likelihood()
print "Lower bound from variational method: ", m.log_likelihood()
print "Information gain (in nats) associated with y ", m_full.log_likelihood() - m.log_likelihood()
```

In [11]:

```
Kff = m_full.kern.K(m.X,m.X)
Kfu = m.kern.K(m.X, m.Z)
Kuu = m.kern.K(m.Z, m.Z)
Kuf = Kfu.T
sigma2 = m.likelihood.variance
KfuKuuIKuf = np.dot(np.dot(Kfu,pdinv(Kuu)[0]),Kuf)
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
ax[0].matshow(KfuKuuIKuf + np.eye(m.X.shape[0])*sigma2)
ax[0].set_title('Low Rank Approximation')
ax[1].matshow(Kff + np.eye(m.X.shape[0])*sigma2)
_ = ax[1].set_title('Full Covariance')
```

Now we will look at some real data. This is data as published by Franz von Zach's volume, Monatliche Correspondenz by Giuseppe Piazzi for observations of the (dwarf) planet Ceres, as shown in the Google book below. }

In [12]:

```
pods.notebook.display_google_book('JBw4AAAAMAAJ', 'PA280')
```

We've made the data available in the `pods`

library. The data is famous because Gauss fitted the orbit of Ceres, and made a made a prediction about the location of the planet a year after its discovery, allowing the planet to be recovered (it had been lost behind the sun). This established Gauss at the age of 23 as a leading European mathematician. Gauss later claimed that he used the Gaussian error model when making his predictions. The data is in the form of a `pandas`

data frame, which can be loaded and summarized as follows.

In [13]:

```
import pods
data = pods.datasets.ceres()['data']
data.describe()
```

Out[13]:

Now let's try fitting 'Gerade Aufstig in Zeit' using a Gaussian process. First thing to do is remove the missing value.

In [14]:

```
X = np.delete(np.asarray(data.index.dayofyear, dtype='float')[:, None], 8, axis=0)
y = np.delete(np.asarray(data['Gerade Aufstig in Zeit'])[:, None], 8, axis=0)
```

Now we will use an exponentiated quadratic covariance. Because observations are in days, we initialise the lengthscale to 10 days. Also the noise on these observations turns out to be very low. To prevent numerical problems, we add a bound to prevent the noise going below 1e-6.

In [15]:

```
kern = GPy.kern.RBF(1)
kern.lengthscale = 10.
m = GPy.models.GPRegression(X, y, kern)
# noise is so low that we get numerical issues if we allow noise variance to be 'free'
m.Gaussian_noise.variance.constrain_bounded(1e-6, 10)
m.optimize(messages=True)
m.plot()
display(m)
```

Note the log likelihood of the fit which is just over 99.43.

Now we form the same fit again but using a low rank Gaussian process. Buy changing the number of inducing variables we capture different aspects of the function. We'll start with one inducing variable. What aspect of the function do we expect to capture if we only store 1 thing about it? For each fit check the log likelihood of the result and compare it to the log likelihood of the full Gaussian process above (it should always lower bound this value). How many inducing variables do we need to capture this sequence?

In [19]:

```
M = 4
Z = np.random.rand(M,1)*45
kern2 = GPy.kern.RBF(1)
kern2.lengthscale = 10.
m2 = GPy.models.SparseGPRegression(X, y, kern2, Z=Z)
m2.Gaussian_noise.variance.constrain_bounded(1e-6, 10)
m2.optimize(messages=True)
_ = m2.plot()
```

Do you always get the same result for different initialisations of the inputs to the inducing variables? Can you recover the full model likelihood for enough inducing inputs? If not, why is that?