This notebook shows how to construct a Gaussian process model where different noise is assumed for different data points. The model is:
$$f(\cdot) \sim \mathcal{GP}\big(0, k(\cdot, \cdot)\big)$$

$$y_i \,|\, f, x_i \sim \mathcal N\big(y_i;\, f(x_i), \sigma^2_i\big)$$

We'll demonstrate two methods. In the first demonstration, we'll assume that the noise variance is known for every data point. We'll incorporate the known noise variances $\sigma^2_i$ into the data matrix $\mathbf Y$, make a likelihood that can deal with this structure, and implement inference using Variational GPs with natural gradients.
In the second demonstration, we'll assume that the noise variance is not known, but we'd like to estimate it for different groups of data. We'll show how to construct an appropriate likelihood for this task and set up inference similarly to the first demonstration, with optimisation over the noise variances.
import numpy as np
import tensorflow as tf
import gpflow
from gpflow.test_util import notebook_range
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(1) # for reproducibility
def generate_data(N=80):
    X = np.random.rand(N)[:, None] * 10 - 5  # Inputs, shape N x 1
    F = 2.5 * np.sin(6 * X) + np.cos(3 * X)  # Mean function values
    NoiseVar = 2 * np.exp(-(X - 2)**2 / 4) + 0.3  # Noise variances
    Y = F + np.random.randn(N, 1) * np.sqrt(NoiseVar)  # Noisy data
    return X, Y, NoiseVar
X, Y, NoiseVar = generate_data()
Here's a plot of the data, with error bars representing two standard deviations:
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
_ = ax.errorbar(X, Y, yerr=2*(np.sqrt(NoiseVar)), marker='x', lw=0, elinewidth=1., color='C1')
We need to tell the GP model what the variance is for each data point. To do this, we'll concatenate the observations with the variances into a single data matrix:
Y_data = np.hstack([Y, NoiseVar])
To cope with this data structure, we'll build a new likelihood. Note how the code extracts the observations `Y` and the variances `NoiseVar` from the data. For more information on creating new likelihoods, see Likelihood design. Here, we're implementing the `logp` function (which computes the log-probability of the data given the latent function) and `variational_expectations`, which computes the expected log-probability under a Gaussian distribution on the function and is needed in the evaluation of the evidence lower bound (ELBO). Check out the docstring of the `Likelihood` object for more information on what these functions do.
class HetGaussian(gpflow.likelihoods.Likelihood):
    def logp(self, F, Y):
        # logp is used by the quadrature fallback of variational_expectations and predict_density.
        # Because variational_expectations is implemented analytically below, this is not actually
        # needed, but is included for pedagogical purposes.
        # Note that currently relying on the quadrature would fail due to
        # https://github.com/GPflow/GPflow/issues/966
        Y, NoiseVar = Y[:, 0:1], Y[:, 1:2]
        return gpflow.logdensities.gaussian(Y, F, NoiseVar)

    def conditional_mean(self, F):
        raise NotImplementedError

    def conditional_variance(self, F):
        raise NotImplementedError

    def variational_expectations(self, Fmu, Fvar, Y):
        Y, NoiseVar = Y[:, 0:1], Y[:, 1:2]
        return (-0.5 * np.log(2 * np.pi) - 0.5 * tf.log(NoiseVar)
                - 0.5 * (tf.square(Y - Fmu) + Fvar) / NoiseVar)
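The closed-form expression in `variational_expectations` can be checked numerically. As a sketch (independent of GPflow and TensorFlow), the following compares the analytic expectation against Gauss-Hermite quadrature in plain numpy; since the Gaussian log-density is quadratic in $f$, the quadrature is exact here:

```python
import numpy as np

def expected_log_density(Fmu, Fvar, y, noise_var):
    # Analytic E_{f ~ N(Fmu, Fvar)}[log N(y; f, noise_var)],
    # matching the formula implemented in variational_expectations above
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(noise_var)
            - 0.5 * ((y - Fmu)**2 + Fvar) / noise_var)

def expected_log_density_quadrature(Fmu, Fvar, y, noise_var, H=50):
    # Gauss-Hermite approximation of the same expectation:
    # E[g(f)] ~= 1/sqrt(pi) * sum_i w_i g(Fmu + sqrt(2 Fvar) x_i)
    x, w = np.polynomial.hermite.hermgauss(H)
    f = Fmu + np.sqrt(2 * Fvar) * x
    logp = -0.5 * np.log(2 * np.pi * noise_var) - 0.5 * (y - f)**2 / noise_var
    return np.sum(w * logp) / np.sqrt(np.pi)

analytic = expected_log_density(0.3, 0.8, 1.2, 0.5)
quad = expected_log_density_quadrature(0.3, 0.8, 1.2, 0.5)
```

The two values agree to numerical precision, which is a useful sanity check when implementing `variational_expectations` analytically for a new likelihood.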
Here we'll build a variational GP model with the previous likelihood on the dataset that we generated. We'll use the natural gradient optimiser (see Natural gradients for more information).
The variational GP object is capable of variational inference with any GPflow-derived likelihood. Usually, the inference is an inexact (but pretty good) approximation, but in the special case considered here, where the noise is Gaussian, it will achieve exact inference. Optimising over the variational parameters is easy using the natural gradients method, which provably converges in a single step.
# model construction (notice that num_latent is 1)
likelihood = HetGaussian()
kern = gpflow.kernels.Matern52(input_dim=1, lengthscales=0.5)
model = gpflow.models.VGP(X, Y_data, kern=kern, likelihood=likelihood, num_latent=1)
# build the natural gradients optimiser
natgrad_optimizer = gpflow.training.NatGradOptimizer(gamma=1.)
natgrad_tensor = natgrad_optimizer.make_optimize_tensor(model, var_list=[(model.q_mu, model.q_sqrt)])
# A single natural gradient step solves the problem:
session = model.enquire_session()
session.run(natgrad_tensor)
# update the cache of the variational parameters in the current session
model.anchor(session)
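Because the likelihood is Gaussian, the posterior that the single natural-gradient step recovers is also available in closed form: the usual GP regression equations with the per-point noise variances on the diagonal. A minimal numpy sketch of this conjugate posterior mean, using a squared-exponential kernel as a stand-in for the Matern52 above and a small dataset for brevity:

```python
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=0.5):
    # Squared-exponential kernel (illustrative stand-in for the Matern52 used above)
    return variance * np.exp(-0.5 * (A - B.T)**2 / lengthscale**2)

np.random.seed(1)
N = 20
X = np.random.rand(N)[:, None] * 10 - 5
F = 2.5 * np.sin(6 * X) + np.cos(3 * X)
NoiseVar = 2 * np.exp(-(X - 2)**2 / 4) + 0.3
Y = F + np.random.randn(N, 1) * np.sqrt(NoiseVar)

# Heteroskedastic noise enters as a non-constant diagonal:
Kxx = rbf(X, X) + np.diag(NoiseVar[:, 0])
xx = np.linspace(-5, 5, 5)[:, None]
Kxs = rbf(xx, X)
mu = Kxs @ np.linalg.solve(Kxx, Y)  # exact posterior mean at the test points
```

For a matching kernel and dataset, `model.predict_f` would return this mean exactly, which is what makes this a good test case for the natural-gradient machinery.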
# let's do some plotting!
xx = np.linspace(-5, 5, 200)[:, None]
mu, var = model.predict_f(xx)
plt.figure(figsize=(12, 6))
plt.plot(xx, mu, 'C0')
plt.plot(xx, mu + 2*np.sqrt(var), 'C0', lw=0.5)
plt.plot(xx, mu - 2*np.sqrt(var), 'C0', lw=0.5)
plt.errorbar(X, Y, yerr=2*(np.sqrt(NoiseVar)), marker='x', lw=0, elinewidth=1., color='C1')
_ = plt.xlim(-5, 5)
- What is the difference in meaning between the orange vertical bars and the blue regions in the prediction?
- Why did we not implement `conditional_mean` and `conditional_variance` in the `HetGaussian` likelihood? What could be done here?
- What are some better kernel settings for this dataset? How could they be estimated?
In this demo, we won't assume that the noise variances are known. Instead, we'll assume only that the data fall into two known groups, each with its own (unknown) noise variance. This example represents a case where we might know that an instrument has varying fidelity for different regions, but we do not know what those fidelities are.
Of course, it would be straightforward to add more groups, or even one group per data point; we'll stick with two for simplicity.
# This is a completely independent demo. Reset the TensorFlow default graph and session
# to make the TensorFlow graph computations run faster:
gpflow.reset_default_graph_and_session()
np.random.seed(1) # for reproducibility and to make it independent from demo 1
def generate_data(N=100):
    X = np.random.rand(N)[:, None] * 10 - 5  # Inputs, shape N x 1
    F = 2.5 * np.sin(6 * X) + np.cos(3 * X)  # Mean function values
    groups = np.where(X > 0, 0, 1)  # Group index for each data point
    NoiseVar = np.array([0.02, 0.5])[groups]  # Different variances for the two groups
    Y = F + np.random.randn(N, 1) * np.sqrt(NoiseVar)  # Noisy data
    return X, Y, groups
X, Y, groups = generate_data()
# here's a plot of the raw data.
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
_ = ax.plot(X, Y, 'kx')
In this case, we need to let the model know which group each data point belongs to. We'll use a similar trick to the above, stacking the group identifier with the data:
Y_data = np.hstack([Y, groups])
This time, we'll use a built-in likelihood, `SwitchedLikelihood`, which is a container for other likelihoods: it applies them to the first column of `Y_data` according to the index in the second column. We're able to access and optimise the parameters of those likelihoods. Here, we'll (incorrectly) initialise the variances of both likelihoods to 1, to demonstrate how we can recover reasonable values for these through maximum-likelihood estimation.
likelihood = gpflow.likelihoods.SwitchedLikelihood([gpflow.likelihoods.Gaussian(variance=1.0),
gpflow.likelihoods.Gaussian(variance=1.0)])
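To make the dispatching concrete, here is a simplified numpy sketch of what a switched Gaussian likelihood does internally: each row's log-density is computed under the Gaussian selected by its index column (the real `SwitchedLikelihood` works with TensorFlow tensors and arbitrary component likelihoods):

```python
import numpy as np

def switched_gaussian_logp(F, Y_data, variances):
    # Route each row to the Gaussian likelihood selected by the index column
    Y, idx = Y_data[:, 0:1], Y_data[:, 1].astype(int)
    var = np.array(variances)[idx][:, None]  # per-row noise variance
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (Y - F)**2 / var

F = np.zeros((4, 1))
Y_data = np.array([[0.1, 0], [0.2, 1], [-0.3, 0], [0.0, 1]])
logp = switched_gaussian_logp(F, Y_data, variances=[1.0, 1.0])
```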
# model construction (notice that num_latent is 1)
kern = gpflow.kernels.Matern52(input_dim=1, lengthscales=0.5)
model = gpflow.models.VGP(X, Y_data, kern=kern, likelihood=likelihood, num_latent=1)
# build the natural gradients optimiser
natgrad_optimizer = gpflow.training.NatGradOptimizer(gamma=1.)
natgrad_tensor = natgrad_optimizer.make_optimize_tensor(model, var_list=[(model.q_mu, model.q_sqrt)])
# A single natural gradient step solves the problem:
session = model.enquire_session()
session.run(natgrad_tensor)
# update the cache of the variational parameters in the current session
model.anchor(session)
We've now fitted the VGP model to the data, but without optimising over the hyperparameters. Plotting the data, we see that the fit is not terrible, but hasn't made use of our knowledge of the varying noise.
# let's do some plotting!
xx = np.linspace(-5, 5, 200)[:, None]
mu, var = model.predict_f(xx)
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.plot(xx, mu, 'C0')
ax.plot(xx, mu + 2*np.sqrt(var), 'C0', lw=0.5)
ax.plot(xx, mu - 2*np.sqrt(var), 'C0', lw=0.5)
ax.plot(X, Y, 'C1x', mew=2)
_ = ax.set_xlim(-5, 5)
Here we'll optimise over both the noise variance and the variational parameters, applying natural gradients interleaved with the Adam optimiser. See Natural gradients for more details and explanation.
# Stop Adam from optimising the variational parameters
model.q_mu.trainable = False
model.q_sqrt.trainable = False
# Create Adam tensor
adam_tensor = gpflow.train.AdamOptimizer(learning_rate=0.1).make_optimize_tensor(model)
for i in notebook_range(200):  # notebook_range() behaves like Python's built-in range here
                               # (only needed for continuous integration tests)
    session.run(natgrad_tensor)
    session.run(adam_tensor)
# update the cache of the parameters in the current session
model.anchor(session)
Now that the noise variances have been estimated, we can see the final model fit. The predictive variance is higher on the left side of the plot, where we know the noise variance is larger. We'll plot the known underlying function in green to see how effectively we've recovered the ground truth. We can also print the model to examine the estimated noise variances:
# let's do some plotting!
xx = np.linspace(-5, 5, 200)[:, None]
mu, var = model.predict_f(xx)
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.plot(xx, mu, 'C0')
ax.plot(xx, mu + 2*np.sqrt(var), 'C0', lw=0.5)
ax.plot(xx, mu - 2*np.sqrt(var), 'C0', lw=0.5)
ax.plot(X, Y, 'C1x', mew=2)
ax.set_xlim(-5, 5)
_ = ax.plot(xx, 2.5 * np.sin(6 * xx) + np.cos(3 * xx), 'C2--')
model
| | class | prior | transform | trainable | shape | fixed_shape | value |
|---|---|---|---|---|---|---|---|
| VGP/kern/lengthscales | Parameter | None | +ve | True | () | True | 0.3091906360603466 |
| VGP/kern/variance | Parameter | None | +ve | True | () | True | 5.11779936937855 |
| VGP/likelihood/likelihood_list/0/variance | Parameter | None | +ve | True | () | True | 0.01046763426205227 |
| VGP/likelihood/likelihood_list/1/variance | Parameter | None | +ve | True | () | True | 0.5016463405422175 |
| VGP/q_mu | Parameter | None | (none) | False | (100, 1) | True | [[1.141104975484783], [1.0567668719464667], [1... |
| VGP/q_sqrt | Parameter | None | LoTri->vec | False | (1, 100, 100) | True | [[[0.3053892368983257, 0.0, 0.0, 0.0, 0.0, 0.0... |
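The estimated variances (roughly 0.01 and 0.5) are close to the true values of 0.02 and 0.5. As an independent sanity check, we can regenerate the same data (the seed and generator settings match the ones above) and compute the per-group empirical residual variances directly, using the true mean function `F`, which the model never sees:

```python
import numpy as np

np.random.seed(1)  # same seed and generator calls as the demo above
N = 100
X = np.random.rand(N)[:, None] * 10 - 5
F = 2.5 * np.sin(6 * X) + np.cos(3 * X)
groups = np.where(X > 0, 0, 1)
NoiseVar = np.array([0.02, 0.5])[groups]
Y = F + np.random.randn(N, 1) * np.sqrt(NoiseVar)

residuals = Y - F                    # noise realisations (F is known here)
v0 = np.var(residuals[groups == 0])  # empirical variance, group 0 (true value 0.02)
v1 = np.var(residuals[groups == 1])  # empirical variance, group 1 (true value 0.5)
```

These empirical variances are what a well-specified maximum-likelihood fit should roughly recover, up to the smoothing effect of the GP prior on the residuals.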