In this notebook we explain how to implement a Mixture Density Network (MDN) [1] using GPflow. In theory, this is similar to this blog post from 2015, but instead of using TensorFlow directly we'll use GPflow. GPflow is typically used for building Gaussian Process-based models, but the framework contains many useful methods and classes that can be used to quickly prototype a wide variety of ML algorithms. Excellent for doing research!
We start by explaining why MDNs can be useful. We then examine a GPflow implementation of the model and use it for a couple of toy experiments.
Imagine we are interested in performing regression on the following dataset.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
CMAP = plt.get_cmap('Blues')
N = 200
NOISE_STD = 5.0e-2
def sinusoidal_data(N, noise):
    Y = np.linspace(-2, 2, N)[:, None]
    X = np.sin(4 * Y) * 2.0 + Y * 0.5
    X += np.random.randn(N, 1) * noise
    Y += np.random.randn(N, 1) * noise
    return X, Y
X, Y = sinusoidal_data(N, NOISE_STD)
plt.plot(X, Y, 'ro', alpha=0.3);
plt.xlabel("$x$");
plt.ylabel("$y$");
At first sight, this dataset doesn't seem overly complex. Both input and output are one-dimensional, and the data has a clear sinusoidal pattern. However, notice that a single input $x$ can correspond to multiple output values $y$: for example, $x=0$ can yield any of the values $\{-1.5, -3/4, 0, 0.8, 1.5\}$. Typical regression algorithms such as linear regression, Gaussian process regression and multilayer perceptrons (MLPs) struggle here, as they can only predict one output value for every input.
To model this dataset we can use a Conditional Density Estimation (CDE) model. CDE models infer $p(f(x)|x)$ instead of just calculating the expectation $E[f(x) | x]$. Modeling the complete distribution $p(f(x)|x)$ is typically harder but it reveals more interesting properties, such as the modes, outlier boundaries and samples. A real-world example might be modeling taxi drop-offs, conditioned on the pick-up location. We would expect a taxi drop-off location to be multi-modal as passengers need to go to different destinations (airport/city center/suburbs and so on) and the density depends on the starting location [2].
Mixture Density Networks (MDNs) are a parametric class of models that allow for conditional density estimation. They consist of two parts: a neural net and a Mixture of Gaussians (MoG). The neural net is responsible for producing the characteristics of the MoG. In practice, given that the MoG consists of $M$ Gaussians, the neural net will output a collection of $M$ means, variances and weights $\{\mu_m, \sigma_m^2, \pi_m\}_{m=1}^M$. These means, variances and weights are used to define the conditional probability distribution function: $$ p(Y = y\,|\,X = x) = \sum_{m=1}^{M} \pi_{m}(x)\,\mathcal{N}\big(y\,\big|\,\mu_{m}(x),\, \sigma_{m}^2(x)\big) $$
Each of the parameters $\pi_{m}(x), \mu_{m}(x), \sigma_{m}(x)$ of the distribution is determined by the neural net as a function of the input $x$.
We train the MDN's neural net by maximising the model's likelihood with respect to $\Theta$: $$ \mathcal{L}(\Theta) \triangleq \prod_{n=1}^N p(Y = y_n\,|\,X = x_n), $$ where $\Theta$ collects the neural net's weights and biases and $\{x_n, y_n\}_{n=1}^N$ represents our training dataset.
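To make the density and the objective concrete, here is a small standalone NumPy sketch (not part of the GPflow model below); the helper names mog_density and log_likelihood are purely illustrative.
import numpy as np

def mog_density(y, pis, mus, sigmas):
    # p(y | x) = sum_m pi_m * N(y | mu_m, sigma_m^2), for a single input x.
    normals = np.exp(-0.5 * (y - mus) ** 2 / sigmas ** 2) / np.sqrt(2 * np.pi * sigmas ** 2)
    return np.sum(pis * normals)

def log_likelihood(ys, pis, mus, sigmas):
    # Log-likelihood of the dataset: sum_n log p(y_n | x_n).
    # `pis`, `mus`, `sigmas` hold one row of mixture parameters per data point.
    return np.sum([np.log(mog_density(y, p, m, s))
                   for y, p, m, s in zip(ys, pis, mus, sigmas)])

# Example: a two-component mixture evaluated at y = 0 for one input.
print(mog_density(0.0, np.array([0.3, 0.7]), np.array([-1.0, 1.5]), np.array([0.2, 0.4])))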
GPflow doesn't reinvent the wheel; most of what follows is just plain Python/TensorFlow code. We choose to use GPflow, however, because it provides us with functionality to easily define a model. Once we have a GPflow model, we can specify its objective function, parameters and dataset. This extra layer of abstraction makes interacting with the model much easier, for example optimising or performing inference.
We begin by importing the required packages from TensorFlow and GPflow.
import tensorflow as tf
import gpflow
from gpflow.models.model import Model
from gpflow import DataHolder, Param, ParamList, params_as_tensors, autoflow, settings
from gpflow.training import AdamOptimizer, ScipyOptimizer
Next, we create an MDN class that inherits from GPflow's Model class. We need to do the following:
1. Store the data X and Y in DataHolder objects.
2. Define the neural net's weights and biases as Param and ParamList objects.
3. Specify the model's objective function (the likelihood of the data) in the _build_likelihood method. When we optimise the model, the negative of this function will be minimised.

class MDN(Model):
    def __init__(self, X, Y, inner_dims=[10, 10], activation=tf.nn.tanh, num_mixtures=5):
        Model.__init__(self)

        self.Dim = X.shape[1]
        # `self.dims` collects the neural net's input, hidden and output dimensions.
        # The number of output dims `self.dims[-1]` equals `num_mixtures` means +
        # `num_mixtures` variances + `num_mixtures` weights, a total of
        # 3 times `num_mixtures` variables.
        self.dims = [self.Dim, ] + list(inner_dims) + [3 * num_mixtures]
        self.activation = activation

        self.X = DataHolder(X)
        self.Y = DataHolder(Y)

        self._create_network()

    def _create_network(self):
        Ws, bs = [], []
        for dim_in, dim_out in zip(self.dims[:-1], self.dims[1:]):
            init_xavier_std = (2.0 / (dim_in + dim_out)) ** 0.5
            Ws.append(Param(np.random.randn(dim_in, dim_out) * init_xavier_std))
            bs.append(Param(np.zeros(dim_out)))
        self.Ws, self.bs = ParamList(Ws), ParamList(bs)

    @params_as_tensors
    def _eval_network(self, X):
        for i, (W, b) in enumerate(zip(self.Ws, self.bs)):
            X = tf.matmul(X, W) + b
            if i < len(self.bs) - 1:
                X = self.activation(X)
        pis, mus, sigmas = tf.split(X, 3, axis=1)
        pis = tf.nn.softmax(pis)  # make sure they normalize to 1
        sigmas = tf.exp(sigmas)  # make sure std. dev. are positive
        return pis, mus, sigmas

    @params_as_tensors
    def _build_likelihood(self):
        pis, mus, sigmas = self._eval_network(self.X)
        Z = (2 * np.pi) ** 0.5 * sigmas
        log_probs_mog = (-0.5 * (mus - self.Y) ** 2 / sigmas ** 2) - tf.log(Z) + tf.log(pis)
        log_probs = tf.reduce_logsumexp(log_probs_mog, axis=1)
        return tf.reduce_sum(log_probs)

    @autoflow((settings.float_type, [None, None]))
    def eval_network(self, X):
        pis, mus, sigmas = self._eval_network(X)
        return pis, mus, sigmas
Because the output parameterises a Mixture of Gaussians, the mixture weights $\pi_m$ must be non-negative and sum to one, and the standard deviations $\sigma_m$ must be strictly positive. We achieve this by applying the softmax operator to the $\pi$'s and by taking the exp of the $\sigma$'s.
We use the "Xavier" initialisation for the neural net's weights. (Glorot and Bengio, 2010).
Instead of calculating the pdf of the Gaussians directly, we work with the log of the pdf and use tf.reduce_logsumexp. This is mainly for numerical stability.
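As a quick illustration of the stability issue (a standalone sketch, not part of the notebook's model code), summing very small component densities directly underflows to zero, whereas the log-sum-exp formulation does not:
import numpy as np
from scipy.special import logsumexp  # NumPy/SciPy analogue of tf.reduce_logsumexp

log_probs = np.array([-1000.0, -1001.0, -1002.0])  # logs of three tiny component densities
naive = np.log(np.sum(np.exp(log_probs)))  # exp() underflows to 0, so this gives -inf
stable = logsumexp(log_probs)              # approx. -999.59, computed without underflow
print(naive, stable)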
We store vanilla NumPy arrays in Param and DataHolder objects. The @params_as_tensors decorator ensures that these variables are transformed into TensorFlow tensors once we are inside the decorated method.
The @autoflow decorator specifies the type and shape of a method's input variables. It ensures that the graph constructed inside the decorated method is executed within a TensorFlow session, and that the output is returned as a NumPy array when the method is called. This decorator lets us execute TensorFlow graphs without having to manage tf.Session objects or create placeholders and other TensorFlow objects ourselves.
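For example, once the model below has been built and trained, the autoflow-wrapped method can be called directly on a NumPy array; the test grid here is just an illustrative choice.
Xtest = np.linspace(-4, 4, 200)[:, None]      # test inputs, shape (200, 1)
pis, mus, sigmas = model.eval_network(Xtest)  # plain NumPy arrays, each of shape (200, num_mixtures)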
Let's see how our model works in practice with the sinusoidal dataset presented earlier. We do this by initialising a new instance of our MDN model, and then specifying the dataset $(X, Y)$, the number of hidden units of the MDN's neural net, and the number of mixture components $M$.
model = MDN(X, Y, inner_dims=[100, 100], num_mixtures=25)
MDN instances are aware of their objective function, so we can start optimising the model right away. GPflow ensures that only the variables stored in Param objects are optimised. For the MDN, the only parameters are the weights and the biases of the neural net, stored as ParamList objects in self.Ws and self.bs respectively.
We use the ScipyOptimizer, which is a wrapper around SciPy's L-BFGS-B optimisation algorithm. Note that GPflow also supports TensorFlow optimisers such as Adam, Adagrad and Adadelta.
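For instance, a hedged sketch of optimising the same model with Adam instead (using the same GPflow 1.x training API as below; the learning rate and iteration count are arbitrary choices):
# Alternative to L-BFGS-B: a stochastic-gradient-style optimiser.
AdamOptimizer(learning_rate=1e-3).minimize(model, maxiter=2000)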
from gpflow.test_util import notebook_niter
ScipyOptimizer().minimize(model, maxiter=notebook_niter(1500))
INFO:tensorflow:Optimization terminated with:
  Message: b'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'
  Objective function value: -38.581859
  Number of iterations: 1500
  Number of functions evaluations: 1706
To evaluate the validity of our model, we draw the posterior density. We also plot the means $\mu_m(x)$ produced by the optimised neural net. Remember that for every input $x$ the neural net outputs $M$ means $\mu_m(x)$, which determine the locations of the Gaussians. We plot all $M$ means and scale each dot by its corresponding mixture weight $\pi_m(x)$: larger dots carry more weight in the Gaussian ensemble.
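The plot helper is provided alongside the original notebook in mdn_plotting; roughly speaking, it evaluates the mixture density over a grid of $(x, y)$ points via eval_network. A sketch of that computation (the helper's actual implementation may differ) could look like this:
xx = np.linspace(-4, 4, 200)[:, None]      # grid of inputs x
yy = np.linspace(-3, 3, 200)               # grid of outputs y
pis, mus, sigmas = model.eval_network(xx)  # each of shape (200, num_mixtures)
# Weighted sum of Gaussian pdfs over the mixture components, for every (x, y) pair.
normals = np.exp(-0.5 * (yy[None, :, None] - mus[:, None, :]) ** 2 / sigmas[:, None, :] ** 2)
normals /= np.sqrt(2 * np.pi) * sigmas[:, None, :]
density = np.sum(pis[:, None, :] * normals, axis=-1)  # density[i, j] = p(y=yy[j] | x=xx[i])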
from mdn_plotting import plot

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
for a in axes:
    a.set_xlim(-4, 4)
    a.set_ylim(-3, 3)
plot(model, X, Y, axes, cmap=CMAP)
As a second experiment, we fit an MDN to the half moon dataset, which is available in the scikit-learn package.
from sklearn.datasets import make_moons

def moon_data(N, noise):
    data, _ = make_moons(n_samples=N, shuffle=True, noise=noise)
    X, Y = data[:, 0].reshape(-1, 1), data[:, 1].reshape(-1, 1)
    return X, Y
X, Y = moon_data(N, NOISE_STD)
plt.plot(X, Y, 'ro', alpha=0.3);
plt.xlabel("$x$");
plt.ylabel("$y$");
The only difference in the MDN's setup is that we lower the number of mixture components.
model = MDN(X, Y, inner_dims=[100, 100], num_mixtures=5)
ScipyOptimizer().minimize(model, maxiter=notebook_niter(10000))
INFO:tensorflow:Optimization terminated with:
  Message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
  Objective function value: -205.818872
  Number of iterations: 4208
  Number of functions evaluations: 4659
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
for a in axes:
    a.set_xlim(-2, 3)
    a.set_ylim(-1.5, 2)
plot(model, X, Y, axes, cmap=CMAP)
[1] Bishop, Christopher M. Mixture density networks. Technical Report NCRG/4288, Aston University, Birmingham, UK, 1994.
[2] Dutordoir, Vincent, et al. "Gaussian Process Conditional Density Estimation." Advances in Neural Information Processing Systems. 2018.