import torch
import numpy as np
import matplotlib.pyplot as plt
In the previous notebook we covered the basics of automatic differentiation (AD). However, we focused on differentiating deterministic programs that always produce the same output given the same input. In contrast, ABMs are typically stochastic in nature. In this notebook we will explore how one can employ AD to differentiate through randomness.
Consider the following stochastic model $f$ with input parameter $\theta$:
$$ \begin{align*} z &\sim \mathcal N(1, 2) \\ y &= \theta^2(z+4) \end{align*} $$Here, $\mathcal N(\mu, \sigma)$ denotes the Normal distribution with mean $\mu$ and standard deviation $\sigma$. We can implement this model via the following stochastic program:
def f(theta):
    z = np.random.normal(1, 2)
    return theta**2 * (z + 4)
We can now ask what it means to compute the derivative of $f$ with respect to $\theta$. Recall the standard definition of the derivative for scalar univariate functions:
$$ \frac{d f(\theta)}{d \theta} = \lim_{\epsilon \to 0} \frac{f(\theta+\epsilon) - f(\theta)}{\epsilon}. $$Note that the difference $f(\theta+\epsilon) - f(\theta)$ is now a random quantity, and as a result it is unclear how we should treat the limit on the right-hand side. Significant research effort has been dedicated to the development of sample derivatives which address this issue. Whilst much of what we will discuss here can be framed in terms of sample derivatives, we will take a simpler approach and circumvent this theory for the time being.
In practice, we are typically interested in the average behaviour of a model, rather than the output associated with a single run. In the case of our example, this means we want to compute the following gradient:
$$ \nabla_{\theta}\mathbb{E}_{z \sim \mathcal{N}(1,2)}[\theta^{2}(z+4)]$$This is a specific instance of the more general gradient estimation problem:
$$ \nabla_{\theta}\mathbb E_{p(z)} [f_\theta(z)] $$where $f_{\theta}$ is any function which is deterministic with respect to $z$ and differentiable with respect to $\theta$. Note that the noise distribution, as specified by the density $p$, does not depend on $\theta$. We will relax this assumption later on. Let's expand and rewrite this gradient slightly.
$$ \begin{align*} \nabla_{\theta}\mathbb E_{p(z)} [f_\theta(z)] &= \nabla_{\theta}\int p(z)f_\theta(z)dz \\ &= \int p(z)\nabla_{\theta}f_{\theta}(z)dz \\ &= \mathbb{E}_{z \sim p(z)}[\nabla_{\theta}f_{\theta}(z)]. \end{align*} $$Put simply, we find that the gradient of the expectation is equal to the expectation of the gradient. In the second equality, we assumed that $f_{\theta}(z)$ is sufficiently regular for Leibniz's rule to apply; that is, we are allowed to exchange the integral and derivative operators. Almost all functions used in machine learning satisfy this criterion. If you are curious about the technical details, I recommend checking out this paper, which investigates when the exchange of integration and differentiation is permitted.
We can now derive a closed-form expression for the gradient in our running example:
$$ \nabla_\theta \mathbb E_{z \sim \mathcal N(1, 2)}\;\left[\theta^2(z+4)\right] = \mathbb E_{z\sim \mathcal N(1,2)} \left[\nabla_\theta \theta^2(z+4)\right] = 10\, \theta $$We can check this is correct numerically by applying finite differences (FD) directly to the expectation. In this case, FD corresponds to the following approximation scheme:
$$ \nabla_{\theta}\mathbb E_{p(z)} [f_\theta(z)] \approx \frac{\mathbb{E}_{z \sim p(z)}[f_{\theta + \epsilon}(z)] - \mathbb{E}_{z \sim p(z)}[f_{\theta}(z)]}{\epsilon}$$Generally speaking, we can't compute the expectations within this quotient, as this requires integrating over the support of $z$. However, we can unbiasedly estimate them using a standard Monte Carlo approximation:
$$ \nabla_{\theta}\mathbb E_{p(z)} [f_\theta(z)] \approx \frac{\frac{1}{m}\sum^{m}_{i=1}f_{\theta + \epsilon}(z_{i, 1}) - \frac{1}{m}\sum_{i=1}^{m}f_{\theta}(z_{i, 2})}{\epsilon}, $$where $z_{i, 1} \sim p$ and $z_{i, 2} \sim p$. Let's plot a histogram of the gradients returned by the above FD estimator for different values of $\epsilon$!
theta = 2.0
epsilons = [1.0, 0.5, 0.1]
n_samples = 10_000
samples_per_epsilon = {epsilon: [] for epsilon in epsilons}
for epsilon in epsilons:
    for i in range(n_samples):
        f_deriv = (f(theta + epsilon) - f(theta)) / epsilon
        samples_per_epsilon[epsilon].append(f_deriv)
fig, ax = plt.subplots()
for i, (epsilon, samples) in enumerate(samples_per_epsilon.items()):
    ax.hist(
        samples,
        bins=100,
        alpha=0.6,
        label=rf"$\epsilon={epsilon}$",
        density=True,
        color=f"C{i}",
    )
ax.axvline(
    theta * 10,
    color="black",
    linestyle="dashed",
    linewidth=1,
    label="Expected derivative",
)
ax.legend()
For all $\epsilon$ tested, the mean of our estimator roughly corresponds to the true analytical gradient derived above, which is represented by the vertical dashed line. We also notice another striking phenomenon: as $\epsilon \to 0$, the variance of our estimator increases. Why is this?
Recall the Monte Carlo estimate we employed to approximate FD:
$$ \frac{1}{m}\sum^{m}_{i=1}f_{\theta + \epsilon}(z_{i, 1}) - \frac{1}{m}\sum_{i=1}^{m}f_{\theta}(z_{i, 2}) $$This is the only random quantity and hence the source of all the variance. When we divide through by $\epsilon$, this term is scaled by $1/\epsilon$, which multiplies its variance by $1/\epsilon^{2}$. The smaller the $\epsilon$, the worse this effect becomes. As we saw in the last notebook, we sometimes need to set $\epsilon$ to a small value to obtain a good approximation of the gradient, so we cannot rely on increasing $\epsilon$ as a strategy to reduce variance. Our only option is to reduce the variance of the Monte Carlo term above.
The simplest way to reduce variance is to increase the number of samples $m$ used in each Monte Carlo approximation. However, this requires more function evaluations, which may be computationally expensive, especially if $f$ is an ABM. Alternatively, one can reduce variance by coupling the noise terms used in the two Monte Carlo sums, that is, by setting $z_{i, 1} = z_{i, 2}$. This removes the noise introduced by evaluating $f$ at different sample points when taking the difference between the sums. However, synchronising the noise terms in practice can be difficult, and is not as simple as fixing a random seed!
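To make the coupling idea concrete, here is a minimal sketch of a coupled FD estimator for our running example. The helpers f_given_z and coupled_fd_gradient are our own constructions (they are not part of the model code above): evaluating the model at a fixed noise sample is what lets us reuse the same draws in both sums.
def f_given_z(theta, z):
    # Evaluate the model at a fixed noise sample z (hypothetical helper, not used elsewhere).
    return theta**2 * (z + 4)

def coupled_fd_gradient(theta, epsilon, m, rng):
    z = rng.normal(1, 2, size=m)  # one set of draws, shared by both evaluations
    return np.mean((f_given_z(theta + epsilon, z) - f_given_z(theta, z)) / epsilon)

rng = np.random.default_rng(0)
print(coupled_fd_gradient(2.0, 0.1, 10_000, rng))  # close to 10 * theta = 20, up to O(epsilon) bias
Because the same draws appear in both terms, most of the $z$-dependent noise cancels in the difference; in this example the variance of the coupled estimator stays bounded as $\epsilon \to 0$, unlike in the uncoupled case.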
Next, let's consider the following approximation of the derivative:
$$ \frac{1}{m}\sum^{m}_{i=1}\nabla_{\theta}f_{\theta}(z_{i}), $$where each gradient term is computed using AD instead of FD. Note that we no longer need to take a difference of terms, nor do we need to divide by a small value. As a result, we should expect the variance of this estimator to be far lower. Let's plot its corresponding histogram and check!
def f_torch(theta):
    z = torch.distributions.Normal(1.0, 2.0).sample()
    return theta**2 * (z + 4)
n_samples = 10_000
samples_autograd = []
for i in range(n_samples):
    theta = torch.tensor(2.0, requires_grad=True)
    f_torch(theta).backward()
    samples_autograd.append(theta.grad)
fig, ax = plt.subplots()
ax.hist(
    samples_autograd, bins=100, label="Autograd", density=True, color="C0"
)
ax.axvline(
    10 * theta.item(), color="black", linestyle="dashed", linewidth=1, label="Expected derivative"
)
ax.legend()
As we can see, using AD leads to much lower variance compared to using FD!
So far, we have assumed that the randomness of our model, described by $p(z)$, does not depend on the structural parameters $\theta$ that we want to differentiate with respect to. However, in most ABMs this assumption does not hold. For instance, consider an epidemiological model where agents become infected with some probability. This probability may depend on structural parameters representing factors that we are interested in calibrating, such as disease contagiousness or social distancing measures.
In such cases, we need to estimate the following gradient:
$$ \nabla_\theta\mathbb E_{p_\theta(z)} [f_\theta(z)], $$where $p$ is parameterized by $\theta$. Let's expand this gradient just like we did in the previous section.
$$ \begin{align*} \nabla_\theta\mathbb E_{p_\theta(z)} [f_\theta(z)] &= \nabla_\theta \left[ \int_z p_\theta(z) f_\theta(z) \mathrm{d}z\right]\\ &= \int_z \nabla_\theta\left[p_\theta(z) f_\theta(z)\right] \mathrm{d}z\\ &= \int_z f_\theta(z)\nabla_\theta p_\theta(z)\mathrm{d} z + \int_z p_\theta(z) \nabla_\theta f_\theta(z)\mathrm{d}z\\ &= \int_z f_\theta(z) \nabla_\theta p_\theta(z) \mathrm{d} z + \mathbb E_{p_\theta(z)} \left[\nabla_\theta f_\theta(z)\right] \end{align*} $$Notice that we now have an additional term, $\int_z f_\theta(z) \nabla_\theta p_\theta(z) \mathrm{d} z$, that prevents us from commuting the gradient and the expectation. Hence, in general, the gradient of the expectation is not the expectation of the gradient when our structural parameters parameterize the randomness. In the next subsection, we will introduce a trick for estimating the gradient when $p$ is continuous.
Roughly speaking, there are two basic ways to sample from a continuous distribution $p_{\theta}$. We may sample directly from $p_{\theta}$:
$$ x \sim p_\theta(x), $$or we can sample auxillary noise $\varepsilon$ from another distribution $q$, and apply a deterministic transform $g_{\theta}$ so that the resulting output is distributed according to $p_{\theta}$. This is broadly known as indirect sampling,
$$ x = g_{\theta}(\varepsilon), \quad \varepsilon \sim q(\varepsilon) $$When $q$ is the standard uniform density, this procedure corresponds to inverse transform sampling, and $g$ corresponds to the inverse CDF function associated with $p_{\theta}$. However, sometimes there are better choices for $q$ that make $g_{\theta}$ easier to compute. For instance, to sample from the normal distribution $\mathcal N(\mu, \sigma^2)$, we can use the standard normal distribution $\mathcal{N}(0, 1)$ to sample auxillary noise:
$$ \varepsilon \sim \mathcal N(0, 1), \quad x = \mu + \sigma \varepsilon $$So why do we care about indirect sampling? Since indirect and direct sampling produce the same distribution we clearly have:
$$ \mathbb{E}_{p_{\theta}(z)}[f_{\theta}(z)] = \mathbb{E}_{\varepsilon \sim q(\varepsilon)}[f_{\theta}(g_{\theta}(\varepsilon))]. $$Thus, taking gradients we have $$ \nabla_{\theta}\mathbb{E}_{p_{\theta}(z)}[f_{\theta}(z)] = \nabla_{\theta}\mathbb{E}_{\varepsilon \sim q(\varepsilon)}[f_{\theta}(g_{\theta}(\varepsilon))] $$
Note that the sampling distribution $q(\varepsilon)$ does not depend on $\theta$. Thus, if $g_{\theta}$ and $f_{\theta}$ are both differentiable and sufficiently regular, we can exchange the gradient and expectation operators on the right-hand side, just as we had done in the previous section!
$$ \nabla_{\theta}\mathbb{E}_{p_{\theta}(z)}[f_{\theta}(z)] = \mathbb{E}_{\varepsilon \sim q(\varepsilon)}[\nabla_{\theta}f_{\theta}(g_{\theta}(\varepsilon))] $$Proceeding as before, we can now construct an AD-based Monte Carlo estimator for the gradient of interest:
$$ \nabla_{\theta}\mathbb{E}_{p_{\theta}(z)}[f_{\theta}(z)] \approx \frac{1}{m}\sum^{m}_{i=1}\nabla_{\theta}f_{\theta}(g_{\theta}(\varepsilon_{i})), \quad \varepsilon_{i} \sim q $$This method of gradient computation is known as the reparameterization trick within the machine learning community, and was originally exploited to train VAEs. Indirect sampling effectively reparameterizes the noise variable $z$ so that it no longer depends on the structural parameters $\theta$.
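Before turning to an example, here is a quick numerical sanity check (our own addition, not part of the original notebook) that direct and indirect sampling produce the same distribution, using the normal reparameterization above and, as an illustrative inverse-transform case, the exponential distribution.
rng = np.random.default_rng(0)

# Normal distribution: direct sampling vs. the reparameterization mu + sigma * eps.
mu, sigma = 1.0, 2.0
direct = rng.normal(mu, sigma, size=100_000)
indirect = mu + sigma * rng.normal(0, 1, size=100_000)
print(direct.mean(), direct.std(), indirect.mean(), indirect.std())  # both approximately (1.0, 2.0)

# Inverse transform sampling (illustrative): the Exponential(rate) inverse CDF is
# -log(1 - u) / rate, so pushing uniform noise through it yields exponential samples.
rate = 1.5
u = rng.uniform(size=100_000)
exp_samples = -np.log(1 - u) / rate
print(exp_samples.mean(), 1 / rate)  # both approximately 0.67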
Let's study an example. Consider the following model:
$$ \begin{align*} z \sim \mathcal N(\theta, 2) \\ y = \theta^2 (4 + z) \end{align*} $$We can reparameterize the noise variable $z$ using the standard normal distribution:
$$ \begin{align*} \varepsilon \sim \mathcal{N}(0, 1) \\ y = \theta^2 (4 + \theta + 2\varepsilon) \end{align*} $$We can now compute the gradient:
$$ \mathbb E_{\epsilon \sim \mathcal N(0,1)}\left[ \nabla_\theta \theta^2 (4 + \theta + 2\epsilon )\right ] = \mathbb E_{\epsilon \sim \mathcal N(0,1)}[8\theta + 3\theta^2 + 4\theta\epsilon] = 8\theta + 3\theta^2 $$Let's now explore how the reparameterization trick is implemented in PyTorch. Below, we define two stochastic programs. The program f_torch_norep implements the previous example without performing any explicit reparameterization, sampling $z$ directly from torch.distributions.Normal(theta, 2.0). The program f_torch explicitly reparameterizes by first sampling from torch.distributions.Normal(0, 1) and then constructing z. Let's plot the densities of the gradient estimates produced by these two programs.
def f_torch_norep(theta):
    z = torch.distributions.Normal(theta, 2.0).sample()
    return theta**2 * (z + 4)

def f_torch(theta):
    epsilon = torch.distributions.Normal(0, 1).sample()
    z = theta + epsilon * 2
    return theta**2 * (4 + z)
rep_samples = []
norep_samples = []
n_samples = 2_000
theta_value = 2.0
analytical_result = 8 * theta_value + 3 * theta_value**2
for i in range(n_samples):
    theta = torch.tensor(theta_value, requires_grad=True)
    f_torch_norep(theta).backward()
    norep_samples.append(theta.grad.item())
    theta = torch.tensor(theta_value, requires_grad=True)
    f_torch(theta).backward()
    rep_samples.append(theta.grad.item())
fig, ax = plt.subplots()
ax.hist(
    norep_samples,
    bins=100,
    alpha=0.35,
    label="No Reparametrization",
    density=True,
    color="C0",
)
ax.hist(
    rep_samples,
    bins=100,
    alpha=0.35,
    label="Reparametrization",
    density=True,
    color="C1",
)
ax.axvline(
    analytical_result,
    color="black",
    linestyle="dashed",
    linewidth=1,
    label="Expected derivative",
)
ax.legend()
As we can see, f_torch is an unbiased estimator of the analytical gradient, as it explicitly employs the reparameterization trick. In contrast, f_torch_norep is biased. This is because .sample() does not implicitly perform the reparameterization trick; instead, gradient propagation stops at .sample(). Thankfully, PyTorch implements the shortcut method .rsample, which performs the reparameterization trick for us, so we do not need to implement it over and over again. Below we perform the reparameterization trick for our example implicitly via .rsample. Observe that an accurate estimate of the analytical gradient is returned.
def f_torch(theta):
    z = torch.distributions.Normal(theta, 2.0).rsample()
    return theta**2 * (z + 4)
rep = (
    sum([torch.func.grad(f_torch)(torch.tensor(2.0)) for i in range(n_samples)])
    / n_samples
)
print(f"{rep} ~ {analytical_result}")
27.927927017211914 ~ 28.0
The reparameterization trick assumes that a probability density can be reparameterized effectively with a smooth transform. This assumption holds for most of the continuous distributions we encounter in machine learning. Unfortunately, it does not hold for the Bernoulli and categorical distributions, which are frequently used in ABM implementations, since they are discrete. To differentiate through discrete distributions, we need to seek an alternative to the reparameterization trick.
Multiple methods have been proposed to address this gap. These include surrogate gradient methods, such as the straight-through gradient estimator (Bengio et al. 2013), which replace discrete sampling operations with a smooth approximation when applying the chain rule. More recent approaches include the Julia package StochasticAD.jl, which leverages the theory of sample derivatives to efficiently compute unbiased pathwise gradient estimators. Unfortunately, we do not have time to cover all of these methods in detail. Instead, we will focus on one prominent method: the Gumbel-Softmax (GS) trick (Jang et al. 2016).
One way to circumvent differentiability issues with a discrete distribution is to replace it with a continuous distribution that can be reparameterized. Of course, we should choose a continuous distribution that approximates the original distribution well so that gradients are informative. Consider a categorical distribution over $n$ categories with parameters $\pi =(\pi_{1}, \dots, \pi_{n})$. Here, $\pi_{i}$ indicates the probability that category $i$ is sampled. How might we construct an appropriate continuous distribution in this case?
It turns out that the categorical distribution is closely related to the Gumbel distribution. By exploiting the structure of the Gumbel distribution and basic properties of order statistics one can show that:
$$ \arg\max_{i}(g_{i} + \log\pi_{i}) \sim \text{Cat}(\pi), \quad g_{i} \sim \text{Gumbel}(0, 1)$$That is, we can reparameterize the categorical distribution using the Gumbel distribution. See this paper for the technical details. Unsurprisingly, this is known as the Gumbel-max trick. Unfortunately, this reparameterization is not differentiable, as the $\arg\max$ function is not differentiable. To rectify this, we can replace the $\arg\max$ with a $\text{softmax}$ operator, obtaining the Gumbel-Softmax distribution. $$ y_{i} = \frac{\exp((\log\pi_{i} + g_{i})/\tau)}{\sum_{j=1}^{n}\exp((\log\pi_{j} + g_{j})/\tau)}, \quad g_{i} \sim \text{Gumbel}(0, 1)$$
Note that, when we apply the $\text{softmax}$ operator, we also include a temperature parameter $\tau > 0$. As we decrease $\tau$, the GS distribution becomes a better approximation of $\text{Cat}(\pi)$; in fact, one can show that as $\tau \to 0$, the GS distribution converges to $\text{Cat}(\pi)$. Unfortunately, the variance of the gradient estimates increases rapidly as $\tau \to 0$. This is reminiscent of our experience with FD, where dividing by small values caused high variance. As a result, we face a bias-variance trade-off when setting $\tau$.
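To see the Gumbel-max trick and its softmax relaxation in action, here is a small numerical sketch of our own in numpy (the function gumbel_softmax_sample below is a hypothetical helper, not a library routine).
rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])  # illustrative category probabilities

# Gumbel-max: argmax_i(g_i + log pi_i) with standard Gumbel noise g_i samples Cat(pi).
g = -np.log(-np.log(rng.uniform(size=(100_000, 3))))
cats = np.argmax(g + np.log(pi), axis=1)
print(np.bincount(cats) / len(cats))  # approximately [0.2, 0.5, 0.3]

# Gumbel-Softmax: replace the argmax with a temperature-controlled softmax.
def gumbel_softmax_sample(pi, tau, rng):
    g = -np.log(-np.log(rng.uniform(size=pi.shape)))
    logits = (np.log(pi) + g) / tau
    logits -= logits.max()  # for numerical stability
    return np.exp(logits) / np.exp(logits).sum()

print(gumbel_softmax_sample(pi, tau=0.1, rng=rng))  # close to a one-hot vector
print(gumbel_softmax_sample(pi, tau=5.0, rng=rng))  # much smoother, closer to uniform weights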
PyTorch has a standard implementation of the GS estimator built in. In particular, PyTorch implements a hard version of the GS trick, which uses samples from $\text{Cat}(\pi)$ when performing the forward pass, but uses samples from the GS distribution during the backwards pass. By doing this, PyTorch guarantees that evaluations of the stochastic program are unaltered, whilst gradients can still be computed. In other words, GS samples are used as surrogates for categorical samples during the backwards pass in reverse-mode AD. This is particularly useful for ABMs which often use operations defined only on discrete values. With that being said, let's experiment with the GS trick in PyTorch!
Consider the following program:
$$ x = 2 \mathrm{Bern}(3\theta) + \mathrm{Bern}(\theta), $$where $\text{Bern}$ denotes the Bernoulli distribution. We can easily compute the gradient of the expectation in this case:
$$ \nabla_\theta \mathbb E [2 \mathrm{Bernoulli}(3\theta) + \mathrm{Bernoulli}(\theta)] = \nabla_\theta (6\theta + \theta) = 7 $$Let's implement this program using the GS trick and investigate how the distribution of the gradient changes as we change the temperature.
def sample_bernoulli(theta, gs_tau):
    logits = torch.cat([theta, 1 - theta]).log()
    return torch.nn.functional.gumbel_softmax(logits=logits, tau=gs_tau, hard=True)

def f(theta, gs_tau):
    return 2 * sample_bernoulli(3 * theta, gs_tau) + sample_bernoulli(theta, gs_tau)
n_samples = 1_000
taus = [0.1, 0.5, 0.9]
gradients_per_tau = {}
for tau in taus:
    gradients = []
    for i in range(n_samples):
        theta = torch.tensor([0.1], requires_grad=True)
        f(theta, tau)[0].backward()
        gradients.append(theta.grad.item())
    gradients_per_tau[tau] = gradients
fig, ax = plt.subplots()
for i, (tau, gradients) in enumerate(gradients_per_tau.items()):
    ax.boxplot(gradients, showmeans=True, positions=[i], labels=[f"$\\tau={tau}$"], showfliers=False)
ax.axhline(7, color="blue", linestyle="dashed", linewidth=1, label="Expected gradient")
ax.legend()
The mean and median of each gradient distribution are denoted by a green triangle and an orange line respectively. As we would expect, using a low temperature, such as $\tau=0.1$, leads to very low bias. However, we also see that the median is far from the mean, indicating that the estimator is very erratic. In contrast, using a higher temperature, such as $\tau = 0.9$, yields a low-variance estimator whose mean and median are closer together. Unfortunately, this estimator also has substantial bias. Using an intermediate temperature, such as $\tau = 0.5$, allows us to strike a balance between these two extremes.
The Gumbel-Softmax trick is an example of a surrogate gradient. The fundamental idea behind surrogate gradients is as follows. Consider an operation $f$ which is not differentiable. When applying the chain rule, we can replace $f$ with a differentiable operation $g$ and use its gradient instead. The original surrogate gradient was the straight-through gradient estimator, which simply replaces $f$ with the identity. That is, the gradient passes straight through and is left unaltered by the non-differentiable operation $f$. This works surprisingly well in many settings. However, one would intuitively expect the overall gradient to contain more useful information if we choose $g$ to be close to $f$.
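As an illustration, here is a minimal sketch of our own (not PyTorch's built-in machinery) of a straight-through Bernoulli sample applied to the program from the previous section: the forward pass uses the hard 0/1 draw, while the backward pass treats the sampling step as the identity in the probability.
def bernoulli_straight_through(p):
    hard = (torch.rand(()) < p).float()  # non-differentiable 0/1 sample
    # Forward value is `hard`; the backward pass sees d/dp = 1, as if sampling were the identity.
    return hard.detach() + p - p.detach()

theta = torch.tensor(0.1, requires_grad=True)
y = 2 * bernoulli_straight_through(3 * theta) + bernoulli_straight_through(theta)
y.backward()
print(theta.grad)  # 2 * 3 + 1 = 7 on every run, since the identity surrogate is linear in theta here
In this linear example the straight-through gradient happens to equal the true gradient of the expectation exactly; in general, the estimator is biased.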
In the final notebook, we will encounter another surrogate used to differentiate through the indicator function. More specifically, for the indicator $\mathbb{I}_{x \geq a}$ we use the following surrogate:
$$ g(x) = \frac{1}{1+e^{-k(x - a)}} $$This is essentially a sigmoid function centered at the boundary point $a$. Note that we include a free parameter $k$, which allows us to adapt the steepness of the sigmoid curve. Increasing $k$ ensures that $g$ is closer to the original indicator function, but also increases the risk of exploding and vanishing gradients. Thus, similar to the temperature in the GS trick, selecting $k$ involves making a trade-off.
def surrogate(x, a, k):
    return torch.sigmoid(k * (x - a))
n = 1000
xvals = torch.linspace(0., 10, n)
a = torch.tensor(5.)
k1 = surrogate(xvals, a, 0.5)
k2 = surrogate(xvals, a, 1.)
k3 = surrogate(xvals, a, 3.)
k4 = surrogate(xvals, a, 5.)
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(xvals, k1, label='k=0.5')
ax.plot(xvals, k2, label='k=1')
ax.plot(xvals, k3, label='k=3')
ax.plot(xvals, k4, label='k=5')
ax.axvline(
    a,
    color="black",
    linestyle="dashed",
    linewidth=1,
    label="a",
)
ax.legend()
To finish, we would like to highlight some relevant literature that is worth checking out if you want to learn more about applying AD in stochastic or discrete settings.
Be sure to check out the next notebook, where we will apply what we have learned so far to build a differentiable ABM!