This notebook shows how to build a GP classification model using variational inference. Here we consider binary (two-class, 0 vs. 1) classification only (there is a separate notebook on multiclass classification). We first look at a one-dimensional example, and then show how you can adapt this when the input space is two-dimensional.
import numpy as np
import gpflow
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (8, 4)
First of all, let's have a look at the data. X and Y denote the input and output values. Note that X and Y must be two-dimensional NumPy arrays: X has shape $N \times D$, where $D$ is the number of input dimensions/features, and Y has shape $N \times 1$, with one row per data point ($N$ in total):
# load the 1D data; each file contains a single column, so reshape to N x 1
X = np.genfromtxt('data/classif_1D_X.csv').reshape(-1, 1)
Y = np.genfromtxt('data/classif_1D_Y.csv').reshape(-1, 1)
plt.figure(figsize=(10, 6))
plt.plot(X, Y, 'C3x', ms=8, mew=2);
For a binary classification model using GPs, we can simply use a Bernoulli
likelihood. The details of the generative model are as follows:
1. Define the latent GP: we start from a Gaussian process $f \sim \mathcal{GP}(0, k(., .))$:
# build the kernel and covariance matrix
k = gpflow.kernels.Matern52(input_dim=1, variance=20.)
x_grid = np.linspace(0, 6, 200).reshape(-1, 1)
K = k.compute_K_symm(x_grid)
# sample from a multivariate normal
L = np.linalg.cholesky(K + 1e-6 * np.eye(200))  # small jitter on the diagonal for numerical stability
f_grid = np.dot(L, np.random.RandomState(6).randn(200, 5))
plt.plot(x_grid, f_grid, 'C0', linewidth=1)
plt.plot(x_grid, f_grid[:, 1], 'C0', linewidth=2);
2. Squash them to $[0, 1]$: the samples of the GP are mapped to $[0, 1]$ using the logistic inverse link function: $g(x) = \frac{\exp(f(x))}{1 + \exp(f(x))}$.
def logistic(f):
    return np.exp(f) / (1 + np.exp(f))
p_grid = logistic(f_grid)
plt.plot(x_grid, p_grid, 'C1', linewidth=1)
plt.plot(x_grid, p_grid[:, 1], 'C1', linewidth=2);
3. Sample from a Bernoulli: for each observation point $X_i$, the class label $Y_i \in \{0, 1\}$ is generated by sampling from a Bernoulli distribution $Y_i \sim \mathcal{B}(g(X_i))$.
# Select some input locations
ind = np.random.randint(0, 200, (30,))
X_gen = x_grid[ind]
# evaluate probability and get Bernoulli draws
p = p_grid[ind, 1:2]
Y_gen = np.random.binomial(1, p)
# plot
plt.plot(x_grid, p_grid[:, 1], 'C1', linewidth=2)
plt.plot(X_gen, p, 'C1o', ms=6)
plt.plot(X_gen, Y_gen, 'C3x', ms=8, mew=2);
For the model described above, the posterior $f(x)|Y$ (say $p$) is not Gaussian any more and does not have a closed-form expression. A common approach is then to look for the best approximation of this posterior by a tractable distribution (say $q$) such as a Gaussian distribution. In variational inference, the quality of an approximation is measured by the Kullback-Leibler divergence $\mathrm{KL}[q \| p]$. For more details on this model, see Nickisch and Rasmussen (2008).
The inference problem is thus turned into an optimisation problem: finding the best parameters for $q$. In our case, we introduce $U \sim \mathcal{N}(q_\mu, q_\Sigma)$, and we choose $q$ to have the same distribution as $f | f(X) = U$. The parameters $q_\mu$ and $q_\Sigma$ can be seen as parameters of $q$, which can be optimised in order to minimise $\mathrm{KL}[q \| p]$.
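In practice, the objective that is optimised (the evidence lower bound) contains a KL term between two Gaussian distributions, the variational distribution of $f(X)$ and the GP prior at $X$, and this term has a closed form. As a purely illustrative sketch (the helper below is our own, not part of GPflow), here is the standard formula for the KL divergence between two multivariate Gaussians:
# Illustration only: closed-form KL[N(mu_q, S_q) || N(mu_p, S_p)] between two
# multivariate Gaussians (this helper is not part of the GPflow API).
def gauss_kl(mu_q, S_q, mu_p, S_p):
    d = mu_q.shape[0]
    S_p_inv = np.linalg.inv(S_p)
    quad = (mu_p - mu_q) @ S_p_inv @ (mu_p - mu_q)   # Mahalanobis-type term
    _, logdet_p = np.linalg.slogdet(S_p)
    _, logdet_q = np.linalg.slogdet(S_q)
    return 0.5 * (np.trace(S_p_inv @ S_q) + quad - d + logdet_p - logdet_q)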
This variational inference model is called VGP
in GPflow:
m = gpflow.models.VGP(X, Y,
                      likelihood=gpflow.likelihoods.Bernoulli(),
                      kern=gpflow.kernels.Matern52(input_dim=1))
o = gpflow.train.ScipyOptimizer()
o.minimize(m);
INFO:tensorflow:Optimization terminated with: Message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH' Objective function value: 11.678611 Number of iterations: 150 Number of functions evaluations: 157
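The objective value reported above is the negative of the variational bound (ELBO). Assuming the GPflow 1.x API, where models expose a compute_log_likelihood() method, we can query the bound directly; for variational models this returns the ELBO rather than the exact marginal likelihood:
# Query the variational bound (ELBO); up to sign, this matches the objective value above.
print(m.compute_log_likelihood())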
We can now inspect the result of the optimisation with print(m) or m.as_pandas_table():
m.as_pandas_table()
| | class | prior | transform | trainable | shape | fixed_shape | value |
|---|---|---|---|---|---|---|---|
| VGP/kern/lengthscales | Parameter | None | +ve | True | () | True | 1.6355003192882445 |
| VGP/kern/variance | Parameter | None | +ve | True | () | True | 32.92700076548122 |
| VGP/q_mu | Parameter | None | (none) | True | (50, 1) | True | [[-1.1196742258326438], [0.2620987093978069], ... |
| VGP/q_sqrt | Parameter | None | LoTri->vec | True | (1, 50, 50) | True | [[[0.45544562836404157, 0.0, 0.0, 0.0, 0.0, 0.... |
In this table, the first two rows correspond to the kernel parameters, and the last two to the variational parameters. Note that, in practice, $q_\Sigma$ is parametrised by its lower-triangular square root, $q_\Sigma = q_\text{sqrt} q_\text{sqrt}^T$, in order to ensure its positive-definiteness.
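To make this concrete, here is a small sketch (assuming the GPflow 1.x Parameter.read_value() method) that reconstructs $q_\Sigma$ from the stored square root:
# Sketch: recover the variational covariance from its lower-triangular square root.
# m.q_sqrt has shape (1, N, N): one matrix per latent GP.
q_sqrt = m.q_sqrt.read_value()[0]   # (N, N) lower-triangular factor
q_Sigma = q_sqrt @ q_sqrt.T         # (N, N), positive semi-definite by construction
print(q_Sigma.shape)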
For more details on how to handle models in GPflow (getting and setting parameters, fixing some of them during optimisation, using priors, and so on), see Manipulating GPflow models.
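For instance (a quick sketch assuming the GPflow 1.x parameter interface; see that notebook for the details):
# A couple of quick examples (GPflow 1.x style; illustration only):
print(m.kern.lengthscales.value)        # read a parameter's current value
# m.kern.lengthscales.trainable = False # would exclude it from optimisation
# m.kern.variance.prior = gpflow.priors.Gamma(1.0, 1.0)  # would place a prior on it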
Finally, we will see how to use model predictions to plot the resulting model. We will replicate the figures of the generative model above, but using the approximate posterior distribution given by the model.
plt.figure(figsize=(12, 8))
# plot the predicted latent mean with a +/- 2 standard deviation band
mu, var = m.predict_f(x_grid)
plt.fill_between(x_grid.flatten(),
                 (mu + 2 * np.sqrt(var)).flatten(),
                 (mu - 2 * np.sqrt(var)).flatten(),
                 alpha=0.3, color='C0')
# plot samples
samples = m.predict_f_samples(x_grid, 10).squeeze().T
plt.plot(x_grid, samples, 'C0', lw=1)
# plot p-samples
p = logistic(samples) # exp(samples) / (1 + exp(samples))
plt.plot(x_grid, p, 'C1', lw=1)
# plot data
plt.plot(X, Y, 'C3x', ms=8, mew=2)
plt.ylim((-3, 3));
So far the input space was one-dimensional; we now move to a two-dimensional input space. In this section we will use the 'banana' dataset:
# load the two-dimensional banana dataset
X = np.loadtxt('data/banana_X_train', delimiter=',')
Y = np.loadtxt('data/banana_Y_train', delimiter=',').reshape(-1, 1)
mask = Y[:, 0] == 1
plt.figure(figsize=(6, 6))
plt.plot(X[mask, 0], X[mask, 1], 'oC0', mew=0, alpha=0.5)
plt.plot(X[np.logical_not(mask), 0], X[np.logical_not(mask), 1], 'oC1', mew=0, alpha=0.5);
The model definition is the same as above; the only important difference is that we now specify that the kernel operates over a two-dimensional input space:
m = gpflow.models.VGP(X, Y,
                      kern=gpflow.kernels.RBF(input_dim=2),
                      likelihood=gpflow.likelihoods.Bernoulli())
opt = gpflow.train.ScipyOptimizer()
opt.minimize(m, maxiter=25) # in practice, the optimisation needs around 250 iterations to converge
INFO:tensorflow:Optimization terminated with: Message: b'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT' Objective function value: 109.143688 Number of iterations: 25 Number of functions evaluations: 27
We can now plot the predicted decision boundary between the two classes. To do so, we can equivalently plot the contour lines $E[f(x)|Y]=0$, or $E[g(f(x))|Y]=0.5$. We will do the latter, because it allows us to introduce the predict_y
function, which returns the mean and variance at test points:
x_grid = np.linspace(-3, 3, 40)
xx, yy = np.meshgrid(x_grid, x_grid)
Xplot = np.vstack((xx.flatten(),yy.flatten())).T
p, _ = m.predict_y(Xplot) # here we only care about the mean
plt.figure(figsize=(7, 7))
plt.plot(X[mask, 0], X[mask, 1], 'oC0', mew=0, alpha=0.5)
plt.plot(X[np.logical_not(mask), 0], X[np.logical_not(mask), 1], 'oC1', mew=0, alpha=0.5);
plt.contour(xx, yy, p.reshape(*xx.shape), [0.5],  # plot the p=0.5 contour line only
            colors='k', linewidths=1.8, zorder=100);
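As a possible follow-up (not part of the original figure), the variance returned by predict_y can be used to visualise where the predicted class probability is most uncertain:
# Sketch: show the predictive variance of the class probability as filled contours.
p_mean, p_var = m.predict_y(Xplot)
plt.figure(figsize=(7, 7))
plt.contourf(xx, yy, p_var.reshape(*xx.shape), 20, cmap='viridis', alpha=0.8)
plt.colorbar()
plt.plot(X[mask, 0], X[mask, 1], 'oC0', mew=0, alpha=0.5)
plt.plot(X[np.logical_not(mask), 0], X[np.logical_not(mask), 1], 'oC1', mew=0, alpha=0.5);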
There are dedicated notebooks giving more details on how to manipulate models and kernels. This notebook covers only very basic classification models; you might also be interested in the notebook on multiclass classification.
Hannes Nickisch and Carl Edward Rasmussen. 'Approximations for binary Gaussian process classification'. Journal of Machine Learning Research 9(Oct):2035--2078, 2008.