#!/usr/bin/env python
# coding: utf-8

# # Stochastic Variational Inference with Gaussian Processes

# ### Written by Alan Saul, Dec 22 2014

# In 2013 stochastic variational inference was implemented within the setting of Gaussian Processes [Gaussian Processes for Big Data, Hensman, Fusi and Lawrence, UAI 2013]. This allows us to perform inference on much larger datasets than was previously possible.
#
# Here we will show a simple demonstration of this work implemented within the GPy framework.
#
# First let's import some of the essentials. Note that the model relies on an optimization library called [climin](http://climin.readthedocs.org/en/latest/), so this must first be [installed](http://climin.readthedocs.org/en/latest/installation.html).

# In[1]:

import numpy as np
import GPy
from matplotlib import pyplot as plt
import climin

get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'svg'")
#plt.rcParams['figure.figsize'] = 8,8

# First we generate some data. In this example we will learn a multiple-output model with the stochastic variational inference model, where each output is independent given the input and the noise is Gaussian. We use what is traditionally a large number of data points for a GP model (5000) and learn stochastically, looking at a batch of only 10 points at a time.

# In[2]:

# Two noisy sinusoidal outputs observed at the same 5000 random inputs
N = 5000
X = np.random.rand(N)[:, None]
Y1 = np.sin(6*X) + 0.1*np.random.randn(N, 1)
Y2 = np.sin(3*X) + 0.1*np.random.randn(N, 1)
Y = np.hstack((Y1, Y2))

# Let's look at the data. Since we have two outputs we will need to plot them separately. You might notice a lag since we are plotting all the data.

# In[3]:

X_variance = None

plt.figure()
plt.plot(X, Y1, 'bx', alpha=0.2)
plt.xlabel('X')
plt.ylabel('Y1')
_ = plt.title('Original data Y1 (output dimension 1)')

plt.figure()
plt.plot(X, Y2, 'bx', alpha=0.2)
plt.xlabel('X')
plt.ylabel('Y2')
_ = plt.title('Original data Y2 (output dimension 2)')

# Now we will create a simple model. We will use only 10 data points at a time during inference; this is done by setting the batchsize parameter. The model repeatedly chooses a new (size 10) subset of the full data and uses it to compute a stochastic estimate of the full gradient. We also add a white variance kernel, which can stabilise the covariance inversions that need to be done. Many people refer to this trick as fixed jitter; really it is just adding some additional variance to the diagonal of the covariance matrix, which helps keep it positive definite.

# In[4]:

Z = np.random.rand(20, 1)  # initial locations of the 20 inducing inputs
batchsize = 10
m = GPy.core.SVGP(X, Y, Z, GPy.kern.RBF(1) + GPy.kern.White(1),
                  GPy.likelihoods.Gaussian(), batchsize=batchsize)
m.kern.white.variance = 1e-5
m.kern.white.fix()

# Now we choose an optimizer. Here we use the climin project, which provides the Adadelta optimization routine. We have found that Adadelta performs well in a stochastic setting, but other optimisers can be used easily. Adadelta takes an initial set of parameters to start with (m.optimizer_array is an array which holds all of the parameters that can be optimised in the GPy model), and a function that, given some new parameter settings, sets the model parameters and computes the gradients using these parameters against the data (m.stochastic_grad in this case).
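# As a rough illustration of this calling convention (not part of the original demo), here is a minimal sketch that uses the same climin.Adadelta interface on a toy quadratic objective. The function toy_grad is a made-up stand-in for m.stochastic_grad: it receives the current flat parameter array and returns the gradient of the objective with respect to it.

# In[ ]:

# Hedged sketch: minimise f(w) = sum(w**2) with climin's Adadelta.
# `w` plays the role of m.optimizer_array and `toy_grad` the role of m.stochastic_grad.
w = np.random.randn(5)

def toy_grad(w):
    # Gradient of f(w) = sum(w**2) is 2*w
    return 2 * w

toy_opt = climin.Adadelta(w, toy_grad, step_rate=0.2, momentum=0.9)
for info in toy_opt:          # climin optimizers are iterated to take steps
    if info['n_iter'] > 200:  # stop after a fixed number of iterations
        break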
#
# m.stochastic_grad first chooses a new mini-batch from the whole dataset, sets the parameters of the SVI model to those provided by Adadelta, and then computes the gradients of the log-likelihood of the mini-batch of data under the model, with respect to the newly set parameters.
#
# The optimisation routine simply uses Adadelta to choose where to step using these gradients, sets the new parameters, and repeats. Some useful information from the optimisation is returned, which can be useful for diagnostics. Convergence criteria are not always easy to set for stochastic methods; in practice we run for some number of iterations, or stop when we see that the likelihood isn't changing dramatically.
#
# The important thing to note is that we only ever look at 10 data points at a time, and thus computing the gradients and updating the model is extremely quick, as we do not require an $O(5000^3)$ operation as is usual in vanilla GP models.

# In[5]:

opt = climin.Adadelta(m.optimizer_array, m.stochastic_grad, step_rate=0.2, momentum=0.9)

from ipywidgets import Text
from IPython.display import display
t = Text(align='right')
display(t)

import sys

def callback(i):
    # Show the current model log-likelihood in the text widget
    t.value = str(m.log_likelihood())
    # Stop after 5000 iterations
    if i['n_iter'] > 5000:
        return True
    return False

info = opt.minimize_until(callback)

# Now we can plot the two outputs of the model independently, where we also show (black crosses) the data from the last mini-batch used. Here we use which_data_ycols to choose which output to plot. We see a close match between the original data and the GP fit.

# In[6]:

fig1, axes = plt.subplots(1, 2, figsize=(10, 5))

ax = axes[0]
ax.plot(X, Y1, 'kx', alpha=0.1)
ax.set_xlabel('X')
ax.set_ylabel('Y1')
ax.set_title('SVI Y1 prediction with data')
_ = m.plot(which_data_ycols=[0], plot_limits=(X.min(), X.max()), ax=ax)
ax.set_xlim((X.min(), X.max()))

ax = axes[1]
ax.plot(X, Y2, 'kx', alpha=0.1)
ax.set_xlabel('X')
ax.set_ylabel('Y2')
ax.set_title('SVI Y2 prediction with data')
_ = m.plot(which_data_ycols=[1], plot_limits=(X.min(), X.max()), ax=ax)
ax.set_xlim((X.min(), X.max()))

fig1.tight_layout()

# In[ ]:
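# A possible follow-up, sketched here as an assumption rather than part of the original demo:
# query the fitted model on a dense grid of test inputs with m.predict, which (for this
# two-column Y) is assumed to return a predictive mean and variance with one column per output.
Xtest = np.linspace(0, 1, 200)[:, None]   # dense grid covering the input range
mean, var = m.predict(Xtest)

plt.figure()
plt.plot(Xtest, mean[:, 0], 'b-', label='Y1 predictive mean')
plt.plot(Xtest, mean[:, 1], 'g-', label='Y2 predictive mean')
plt.xlabel('X')
plt.legend()
_ = plt.title('Predictive means on a test grid (sketch)')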