Lab 1: Gaussian Process Regression

Gaussian Process Summer School 2020

This lab is designed to introduce Gaussian processes in a practical way, illustrating the concepts introduced in the first two lectures. The key aspects of Gaussian process regression are covered: the covariance function (aka kernel); sampling from a Gaussian process; and the regression model. The notebook will introduce the Python library GPy$^\dagger$, which handles the kernels, regression and optimisation of hyperparameters, allowing us to easily access the results we want.

This notebook is aimed at an introductory level, as the background of attendees is diverse, and so it covers a wide range of basic GP concepts. There are seven exercises to complete, varying in difficulty, but you should aim to complete them all during the lab session. The notebook will not be marked, and we will provide answers to the exercises after the lab session.

In addition, there is a second optional notebook with extra work. The content of this is more advanced, so completion is at your discretion.

$^\dagger$GPy: A Gaussian process framework in Python (since 2012). Available from http://github.com/SheffieldML/GPy

1. Getting started

[Binder] If you are using Binder, all libraries required are installed on the Binder image.

[Local] If you are running this notebook on a local machine, make sure that GPy is already installed on your machine. You should be using Python 3.5 or 3.6. A set of instructions for setting up your environment is available from the GPSS site.

First, we need to setup our notebook with the libraries we are going to use. We will use numpy for maths functionality, pyplot for plotting, and of course GPy for Gaussian processes.

In [1]:
# Support for maths
import numpy as np
# Plotting tools
from matplotlib import pyplot as plt
# we use the following for plotting figures in jupyter
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

# GPy: Gaussian processes library
import GPy

The documentation for GPy is available at gpy.readthedocs.io. We will be using GPy to define our kernels and regression models. Note that GPy also contains plotting utilities, but we will not use these in this lab.

Covariance functions, aka kernels

We will define a covariance function, from hereon referred to as a kernel, using GPy. The most commonly used kernel in machine learning is the Gaussian-form radial basis function (RBF) kernel. It is also commonly referred to as the exponentiated quadratic or squared exponential kernel – all are equivalent.

The definition of the (1-dimensional) RBF kernel has a Gaussian-form, defined as:

$$ \kappa_\mathrm{rbf}(x,x') = \sigma^2\exp\left(-\frac{(x-x')^2}{2\ell^2}\right) $$

It has two parameters: the variance, $\sigma^2$, and the lengthscale, $\ell$.

In GPy, we define our kernels using the input dimension as the first argument, in the simplest case input_dim=1 for 1-dimensional regression. We can also explicitly define the parameters, but for now we will use the default values:

In [2]:
# Create a 1-D RBF kernel with default parameters
k = GPy.kern.RBF(1)
# Preview the kernel's parameters
k
Out[2]:
rbf.           value  constraints  priors
variance         1.0     +ve
lengthscale      1.0     +ve

We can see from the above table that our kernel has two parameters, variance and lengthscale, both with value 1.0. There is also information on the constraints and priors on each parameter, but we will look at this later.
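As a quick sanity check, we can evaluate the RBF formula above directly with numpy and compare it against GPy's k.K (a minimal sketch, assuming the default parameters $\sigma^2 = 1$ and $\ell = 1$ shown in the table; the variable names here are just for illustration):

In [ ]:
# Evaluate the RBF formula directly and compare with GPy's kernel evaluation
x = np.linspace(-2., 2., 5)[:, None]
K_gpy = k.K(x, np.array([[0.]]))                # GPy's evaluation of k(x, 0)
K_manual = 1.0 * np.exp(-x**2 / (2. * 1.0**2))  # the formula above, with sigma^2 = 1 and lengthscale = 1
print(np.allclose(K_gpy, K_manual))             # expected: True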

Visualising the kernel

We can visualise our kernel in a few different ways. We can plot the shape of the kernel by plotting $k(x,0)$ over some sample space $x$ which, looking at the equation above, clearly has a Gaussian shape. This describes the covariance between each sample location and $0$.

Alternatively, we can construct a full covariance matrix, $\mathbf{K}_{xx} \triangleq k(x,x')$ with samples $x = x'$. The resulting GP prior is a multivariate normal distribution over the space of samples $x$: $\mathcal{N}(\mathbf{0}, \mathbf{K}_{xx})$. It should be evident then that the elements of the matrix represent the covariance between the respective points in $x$ and $x'$, and that the diagonal is exactly $\sigma^2\,[=1]$.

We can show this using pyplot to plot the vector $k(x,0)$ and the matrix $k(x,x')$ using k.K($\cdot$ , $\cdot$):

In [3]:
# Our sample space: 100 samples in the interval [-4,4]
X = np.linspace(-4.,4.,100)[:, None] # we need [:, None] to reshape X into a column vector for use in GPy

# Set up the plotting environment
plt.figure(figsize=(18,5))

# ==== k(x,0)

plt.subplot(121) # left plot

# First, sample kernel at x' = 0
K = k.K(X, np.array([[0.]])) # k(x,0)

# Plot covariance vector
plt.plot(X,K)

# Annotate plot
plt.xlabel("x"), plt.ylabel("$\kappa$")
plt.title("$\kappa_{rbf}(x,0)$")

# ==== k(x,x')

plt.subplot(122) # right plot

# The kernel takes two inputs, and outputs the covariance between each respective point in the two inputs
K = k.K(X,X)

# Plot the covariance of the sample space
plt.pcolor(X.T, X, K)

# Format and annotate plot
plt.gca().invert_yaxis(), plt.gca().axis("image")
plt.xlabel("x"), plt.ylabel("x'"), plt.colorbar()
plt.title("$\kappa_{rbf}(x,x')$");

Setting the kernel parameters

Looking at the above definition of the RBF kernel, we can see that the parameters, i.e. variance and lengthscale, control the shape of the covariance function and therefore the value of the covariance between points $x$ and $x'$.

We can access the value of the kernel parameters in GPy and manually set them by calling k.param_name, e.g. k.lengthscale or k.variance for the RBF kernel. The following example demonstrates how the value of the lengthscale affects the RBF kernel:

In [4]:
# Our sample space : 250 samples in the interval [-4,4] 
X = np.linspace(-4.,4.,250)[:, None] # we use more samples to get a smoother plot at low lengthscales

# Create a 1-D RBF kernel with default parameters
k = GPy.kern.RBF(1)

# Set up the plotting environment
plt.figure(figsize=(18, 7))

# Set up our list of different lengthscales
ls = [0.25, 0.5, 1., 2., 4.]

# Loop over the lengthscale values
for l in ls:
    # Set the lengthscale to be l
    k.lengthscale = l
    # Calculate the new covariance function at k(x,0)
    C = k.K(X, np.array([[0.]]))
    # Plot the resulting covariance vector
    plt.plot(X,C)

# Annotate plot
plt.xlabel("x"), plt.ylabel("$\kappa(x,0)$") 
plt.title("Effects of different lengthscales on the Gaussian RBF kernel")
plt.legend(labels=ls);

Exercise 1

(a) What is the effect of the lengthscale parameter on the covariance function?

It controls how quickly the covariance decays with distance in the input domain: as the lengthscale increases, the covariance between more distant locations becomes higher, so nearby samples from the corresponding Gaussian process are more likely to be similar and the sampled functions vary more slowly.

(b) Change the code used above to plot the covariance function showing the effects of the variance on the covariance function. Comment on the effect.

In [5]:
X = np.linspace(-4.,4.,100)[:, None]

k = GPy.kern.RBF(1)

# List of variances
vs = [0.1, 1., 10.]

plt.figure(figsize=(18, 7))

for v in vs:
    # Set the variance parameter of the kernel
    k.variance = v
    C = k.K(X, np.array([[0.]]))
    plt.plot(X,C)

# Annotate plot
plt.xlabel("x"), plt.ylabel("$\kappa$") 
plt.title("Effects of different variances on the Gaussian RBF kernel")
plt.legend(labels=vs);

Increasing the variance parameter scales the covariance between all pairs of points proportionally; it sets the overall magnitude (the marginal variance) of the function, allowing data with different scales of variation to be modelled.
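We can verify this proportional scaling directly (a small sketch, with kernels k1 and k2 introduced only for this check): doubling the kernel variance doubles every entry of the covariance matrix.

In [ ]:
# Check that the covariance scales linearly with the variance parameter
k1 = GPy.kern.RBF(1, variance=1.)
k2 = GPy.kern.RBF(1, variance=2.)
print(np.allclose(2. * k1.K(X, X), k2.K(X, X)))   # expected: True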


2. Types of covariance function

There are many different covariance functions already implemented in GPy. Aside from the RBF kernel, there are others such as the following:

  • Exponential
  • Matern32
  • Matern52
  • Brownian
  • Bias
  • Linear
  • StdPeriodic
  • Cosine
  • PeriodicMatern32

Note: when defining these, all are preceded by GPy.kern. The following are some examples of the Matérn 5/2 and Cosine kernels, compared with the RBF kernel:

In [6]:
# Our sample space : 250 samples in the interval [-4,4] 
X = np.linspace(-4.,4.,250)[:, None]

# RBF kernel
k_R = GPy.kern.RBF(1)
C_R = k_R.K(X, np.array([[0.]]))

# Matern 5/2
k_M = GPy.kern.Matern52(1)
C_M = k_M.K(X, np.array([[0.]]))

# Cosine 
k_C = GPy.kern.Cosine(1)
C_C = k_C.K(X, np.array([[0.]]))

plt.figure(figsize=(18,7))
plt.plot(X, C_R, X, C_M, X, C_C);
plt.xlabel("x"), plt.ylabel("$\kappa$") 
plt.legend(labels=["Gaussian RBF", "Matérn 5/2", "Cosine"]);

Not every kernel has the same set of parameters. Some kernels are not parameterised by a lengthscale; the Linear kernel, for example, only has a list of variances, one corresponding to each linear component:

In [7]:
GPy.kern.Linear(1)
Out[7]:
linear.        value  constraints  priors
variances        1.0     +ve

Likewise, not every kernel is stationary. The Gaussian RBF and Matérn kernels, for example, can be written as $\kappa(x,x') = f(x-x')$; however, this is not true of, e.g., the Brownian motion covariance function, which is defined as $\kappa(x,x') = \min(x,x')$:

In [8]:
# Our sample space : 100 samples in the interval [-2,8] 
X = np.linspace(-2., 8., 100)[:,None]

# Note that the Brownian kernel is defined:
#   k = min(abs(x),abs(x')) if sgn(x) = sgn(x')
#   k = 0 if sgn(x) ≠ sgn(x')

# We define our Brownian kernel
k_B = GPy.kern.Brownian(1)

plt.figure(figsize=(18,7))

x_s = [0., 2., 4., 6.] # values of x'
# Loop through values of x'
for x_ in x_s:
    # Evaluate kernel at k(x,x')
    K_B = k_B.K(X, np.array([[x_]])) 
    # Plot covariance vector
    plt.plot(X, K_B)
    
# Annotate plot
plt.xlabel("x"), plt.ylabel("$\kappa$")
plt.title("Effects of different inputs on a non-stationary, Brownian kernel")
plt.legend(labels=["$\kappa(x,0)$", "$\kappa(x,2)$", "$\kappa(x,4)$", "$\kappa(x,6)$"]);
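We can also confirm the definition numerically (a small sketch: X_pos and K_manual are helper variables introduced only for this check). For strictly positive inputs and the default unit variance, the Brownian kernel should agree with $\min(x,x')$ computed directly.

In [ ]:
# Compare GPy's Brownian kernel with min(x, x') on a grid of positive inputs
X_pos = np.linspace(0.5, 5., 10)[:, None]
K_gpy = k_B.K(X_pos, X_pos)
K_manual = np.minimum(X_pos, X_pos.T)   # element-wise min(x, x') over all pairs
print(np.allclose(K_gpy, K_manual))     # expected: True (with the default unit variance)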

3. Combining covariance functions

Exercise 2

(a) A matrix, $\mathbf{K}$, is positive semi-definite if the matrix inner product is greater than or equal to zero, $\mathbf{x}^\text{T}\mathbf{K}\mathbf{x} \geq 0$, regardless of the values in $\mathbf{x}$. Given this, it should be easy to see that the sum of two positive semi-definite matrices is also positive semi-definite. In the context of Gaussian processes, this is the sum of two covariance functions. What does this mean from a modelling perspective?

We can think about the sum of two kernels in terms of the sum of two normal distributions. If we consider two independent kernels, $k_1$ and $k_2$, and the corresponding covariance matrices over some (joint) sample space, $\mathbf{A} = k_1(\mathbf{x},\mathbf{x}')$ and $\mathbf{B} = k_2(\mathbf{x},\mathbf{x}')$, then the sum of the two corresponding zero-mean normal distributions satisfies $\mathcal{N}(\mathbf{0},\mathbf{A}) + \mathcal{N}(\mathbf{0},\mathbf{B}) = \mathcal{N}(\mathbf{0},\mathbf{A}+\mathbf{B})$. Since the right-hand side of the equality is a valid distribution, $\mathbf{A}+\mathbf{B}$ must be positive semi-definite, and so $(x,x') \mapsto k_1(x,x') + k_2(x,x')$ yields positive semi-definite matrices regardless of $x$ and $x'$.

The equality holds only when $k_1$ and $k_2$ (and therefore $\mathbf{A}$ and $\mathbf{B}$) are independent. From a modelling perspective, this means we can use additive covariance functions to describe different, independent features of the modelled system. For example, we might want to separate some seasonality from an overall increasing trend; the sum of two kernels allows us to do that. This is the motivation used in Section 7, with the Mauna Loa example.
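A quick numerical illustration of this argument (a sketch using two example kernels; X_check is a helper variable for this check only): the eigenvalues of the summed covariance matrix should all be non-negative, up to floating-point error.

In [ ]:
# Build two covariance matrices over the same sample space and check that their
# sum has no (significantly) negative eigenvalues
X_check = np.linspace(-3., 3., 50)[:, None]
A = GPy.kern.RBF(1).K(X_check, X_check)
B = GPy.kern.Matern32(1).K(X_check, X_check)
print(np.linalg.eigvalsh(A + B).min() > -1e-8)   # expected: True (non-negative up to numerical error)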

(b) What about the element-wise product of two covariance functions? If we define $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x},\mathbf{x}')k_2(\mathbf{x},\mathbf{x}')$, then is $k(\mathbf{x},\mathbf{x}')$ a valid covariance function?

The product of two kernels is a valid covariance function: by the Schur product theorem, the element-wise (Hadamard) product of two positive semi-definite matrices is itself positive semi-definite.

In terms of modelling, this is less interpretable than the additive case. In regression with multidimensional inputs, for example, the product of kernels defined over different input variables can be used to combine information. The product of kernels can be thought of as an AND operation, since the kernel product only takes a high value if both constituent kernels take high values. When modelling a stochastic process that is the product of two Gaussian processes, the resulting function will in general not be Gaussian, but a Gaussian process whose covariance is the product of the individual GP covariances does exist.
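Similarly, a quick check of the Schur product theorem (a sketch with an RBF and a standard periodic kernel chosen for illustration): the element-wise product of two covariance matrices should also have non-negative eigenvalues.

In [ ]:
# Element-wise (Hadamard) product of two covariance matrices: by the Schur product
# theorem this is also positive semi-definite
X_check = np.linspace(-3., 3., 50)[:, None]
A = GPy.kern.RBF(1).K(X_check, X_check)
B = GPy.kern.StdPeriodic(1).K(X_check, X_check)
print(np.linalg.eigvalsh(A * B).min() > -1e-8)   # expected: True (A * B is the element-wise product)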

Combining kernels in GPy

We can easily combine kernels in GPy using the + and * operators, denoting the sum and product of kernels respectively.

Summing kernels

An example of adding kernels is shown here. We create a new kernel that is the sum of an RBF and a Matern 5/2 kernel.

In [9]:
# Create the first kernel: a 1-D RBF with lengthscale 2.0
k_R = GPy.kern.RBF(1, lengthscale=2., name="RBF")
# Create the second kernel: a 1-D Matern52 with variance 2.0 and lengthscale 4.0
k_M = GPy.kern.Matern52(1, variance=2., lengthscale=4., name="Matern52")

# Add the kernels together
k_sum = k_R + k_M
# Preview the properties of the composite kernel
k_sum
Out[9]:
sum.                   value  constraints  priors
RBF.variance             1.0     +ve
RBF.lengthscale          2.0     +ve
Matern52.variance        2.0     +ve
Matern52.lengthscale     4.0     +ve

We can visualise our kernel sum to see the resulting effect. It should be fairly clear that the result is simply the sum of evaluations of the respective kernels for each sample point.

In [10]:
# Our sample space : 100 samples in the interval [-10,10] 
X = np.linspace(-10., 10., 100)[:,None]

# Set up the plotting environment
plt.figure(figsize=(18,7))

# Here we evaluate the constituent and composite kernels at k(x,0)
K_R   = k_R.K(X, np.array([[0.]]))    # RBF
K_M   = k_M.K(X, np.array([[0.]]))    # Matern 5/2
K_sum = k_sum.K(X, np.array([[0.]]))  # RBF + Matern

# Plot each of our covariance vectors
plt.plot(X, K_R, X, K_M, X, K_sum)
    
# Annotate plot
plt.xlabel("x"), plt.ylabel("$\kappa$")
plt.legend(labels=["Gaussian RBF", "Matérn 5/2", "RBF + Matérn"]);
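We can also confirm this numerically (a one-line sketch): the composite kernel's covariance matrix equals the element-wise sum of the constituent kernels' covariance matrices.

In [ ]:
# The composite kernel evaluates to the sum of its parts
print(np.allclose(k_sum.K(X, X), k_R.K(X, X) + k_M.K(X, X)))   # expected: True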

Multiplying two kernels

We also demonstrate here the effect of multiplying two kernels. Here, we multiply an RBF and Periodic kernel, effectively encapsulating the periodicity into an RBF window:

In [11]:
# Create the first kernel: a 1-D RBF with lengthscale 5.0
k_R = GPy.kern.RBF(1, lengthscale=5., name="RBF")
# Create the second kernel: a 1-D StdPeriodic with period 5.0
k_P = GPy.kern.StdPeriodic(1, period=5., name="Periodic")

# Multiply the kernels together
k_mul = k_R * k_P

# Preview the properties of the composite kernel
k_mul
Out[11]:
mul.                   value  constraints  priors
RBF.variance             1.0     +ve
RBF.lengthscale          5.0     +ve
Periodic.variance        1.0     +ve
Periodic.period          5.0     +ve
Periodic.lengthscale     1.0     +ve
In [12]:
# Our sample space : 100 samples in the interval [-10,10] 
X = np.linspace(-10., 10., 100)[:,None]

# Set up the plotting environment
plt.figure(figsize=(18,7))

# Here we evaluate the constituent and composite kernels at k(x,0)
K_R   = k_R.K(X, np.array([[0.]]))      # RBF
K_P   = k_P.K(X, np.array([[0.]]))      # StdPeriodic
K_mul = k_mul.K(X, np.array([[0.]]))    # RBF * StdPeriodic

# Plot each of our covariance vectors
plt.plot(X, K_R, X, K_P, X, K_mul)
    
# Annotate plot
plt.xlabel("x"), plt.ylabel("$\kappa$")
plt.legend(labels=["Gaussian RBF", "Periodic", "RBF x Periodic"]);
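As with the sum, we can confirm numerically (a one-line sketch) that the product kernel evaluates to the element-wise product of its constituents.

In [ ]:
# The composite kernel evaluates to the element-wise product of its parts
print(np.allclose(k_mul.K(X, X), k_R.K(X, X) * k_P.K(X, X)))   # expected: True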

4. Sampling from a Gaussian Process

A Gaussian process provides a prior over some infinite-dimensional function, defined by a mean function and a covariance function:

$$ f(x) \sim \mathcal{GP}(m(x), k(x,x'))$$

When we evaluate the covariance function, $k$, over some sample space to create a matrix, we are creating a matrix of values that describe the covariance between sample points. Since it is not possible to sample every single point of an infinite-dimensional function, we have to work with a finite subset of the input domain. If we let $\mathbf{X}$ denote some sample inputs and $\mathbf{K}$ the covariance matrix with elements $K_{ij} = k(\mathbf{X}_i,\mathbf{X}_j)$, then the prior over $f(\mathbf{X})$ is a (finite-dimensional) normal distribution with covariance $\mathbf{K}$. As such, we can easily draw samples of $f$ which, for a good choice of $\mathbf{X}$, are representative of the true function.

We can sample from the GP prior by building a covariance matrix over a sample space and drawing from a zero-mean multivariate normal distribution with covariance $\mathbf{K}$. Below are examples using different kernels with different parameters, including composite kernels.

In [13]:
ks = [ # List of example kernels
    GPy.kern.RBF(1, lengthscale=1.),
    GPy.kern.RBF(1, lengthscale=0.5),
    GPy.kern.RBF(1, lengthscale=0.25, variance=2.),
    GPy.kern.Exponential(1),
    GPy.kern.Matern32(1),
    GPy.kern.Matern52(1),
    GPy.kern.StdPeriodic(1, period=2.),
    GPy.kern.Cosine(1),
    GPy.kern.Brownian(1),
    GPy.kern.Linear(1),
    GPy.kern.Bias(1),
    GPy.kern.White(1),
    GPy.kern.StdPeriodic(1)*GPy.kern.RBF(1),
    GPy.kern.Linear(1) + GPy.kern.Exponential(1)
]
# The name of our kernels (for the legend)
kernel_name = ["RBF ls=1", "RBF ls=0.5", "RBF ls=0.25, var=2", "Exponential", "Matern 3/2", 
               "Matern 5/2", "Periodic", "Cosine", "Brownian", "Linear", "Bias", "White", "Periodic x RBF", "Linear + Exponential"]

# Our sample space
X = np.linspace(-5., 5., 250)[:, None]

print("The following plots demonstrate samples from a Gaussian process prior and the corresponding covariance matrix")

# Loop through our kernels
for i,k in enumerate(ks):
    # The mean function is set to 0
    mu = np.zeros((250)) # we have 250 sample inputs
    # Get the covariance matrix
    if i != 11: # every kernel except the White noise kernel (index 11)
        C  = k.K(X,X)
    else: # the White noise kernel must be evaluated with a single argument: k.K(X,X) returns the (zero) cross-covariance
        C = k.K(X)
        
    # Sample 5 times from a multivariate Gaussian distribution with mean 0 and covariance k(X,X)
    Z  = np.random.multivariate_normal(mu, C, 5)
    
    # Setup figure environment
    plt.figure(figsize=(18, 7))
    
    # Show samples on left hand side
    plt.subplot(121)
    for j in range(5 if i < 11 else 2): # Loop through samples
        plt.plot(X[:],Z[j,:])
        plt.title(kernel_name[i])
        
    # Visualise covariance matrix on right hand side
    plt.subplot(122)
    plt.pcolor(X.T, X, C)
    # Annotate plot
    plt.gca().invert_yaxis(), plt.gca().axis("image")
    plt.colorbar()
The following plots demonstrate samples from a Gaussian process prior and the corresponding covariance matrix
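Note that np.random.multivariate_normal can be numerically sensitive when the covariance matrix is nearly singular. A common workaround, sketched below under an assumed jitter value of 1e-6 (this is not GPy's own sampler, just an illustration), is to add a small value to the diagonal and sample via a Cholesky factor:

In [ ]:
# Minimal sketch of jitter-stabilised sampling from a GP prior: add a small value
# to the diagonal so the Cholesky factorisation succeeds, then transform standard
# normal draws by the Cholesky factor.
k = GPy.kern.RBF(1, lengthscale=1.)
X = np.linspace(-5., 5., 250)[:, None]
C = k.K(X, X) + 1e-6 * np.eye(X.shape[0])   # jittered covariance matrix
L = np.linalg.cholesky(C)                   # lower-triangular factor, C = L L^T
Z = L @ np.random.randn(X.shape[0], 5)      # 5 samples, one per column of Z
plt.figure(figsize=(18, 7))
plt.plot(X, Z)
plt.title("Samples drawn via Cholesky factorisation with jitter");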