import torch
import numpy as np
import matplotlib.pyplot as plt
In previous notebooks we learned how automatic differentiation (AD) can be used to obtain gradients from stochastic computer programs. Next, we will use what we have learned to implement several differentiable programs, including a differentiable agent-based model (ABM).
To begin, let's implement a one-dimensional random walk. Recall that a one-dimensional random walk is given by the following recursion:
$$ x_{t+1} = x_t + \begin{cases}1 &\mathrm{if} &\xi = 1 \\ -1 &\mathrm{if} &\xi = 0\end{cases}, \quad \xi \sim \mathrm{Bern}({\theta}) $$where $\mathrm{Bern}$ denotes the Bernoulli distribution, $x_{t}$ denotes the position observed after $t$ steps, and $\theta$ denotes the probability of moving in the positive direction. Let's use PyTorch to implement this model as a stochastic program called random_walk.
def random_walk(theta, n_timesteps):
    x = torch.tensor([0.])
    for i in range(n_timesteps - 1):
        xi = torch.distributions.Bernoulli(theta).sample()
        if xi == 1:
            next_x = x[-1] + 1
        else:
            next_x = x[-1] - 1
        x = torch.hstack((x, next_x))
    return x
Let's forward-simulate our stochastic program to make sure everything is working correctly!
theta = 0.2
n_timesteps = 100
x = random_walk(theta, n_timesteps)
plt.plot(x)
Now that we are confident the model is correctly implemented, let's check whether PyTorch can compute a gradient. Since random_walk returns a time series and has a single scalar parameter, the Jacobian, $\partial x / \partial \theta$, will be a vector with one entry per timestep. Let's plot the Jacobian at $\theta = 0.2$ using the torch.autograd module.
dx_dtheta = torch.autograd.functional.jacobian(
    lambda x: random_walk(theta=x, n_timesteps=n_timesteps), torch.tensor(theta)
)
plt.plot(dx_dtheta)
We immediately see that all entries of the Jacobian are zero! This is happening for two reasons. Firstly, as we saw in the previous notebook, the Bernoulli distribution is not differentiable by default and we need to use a technique like the Gumbel-Softmax trick to differentiate through it. We can implement the Gumbel-Softmax trick in random_walk by replacing
xi = torch.distributions.Bernoulli(theta).sample()
with
logits = torch.hstack((theta, 1 - theta)).log()
xi = torch.nn.functional.gumbel_softmax(logits, tau=tau, hard=True)[0]
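As a quick sanity check (a minimal sketch; theta_check is just an illustrative stand-in for theta), we can confirm that the straight-through sample is a hard 0/1 value in the forward pass while still carrying a gradient back to the parameter:
theta_check = torch.tensor(0.3, requires_grad=True)
logits = torch.hstack((theta_check, 1 - theta_check)).log()
xi = torch.nn.functional.gumbel_softmax(logits, tau=1.0, hard=True)[0]
print(xi)  # exactly 0. or 1. in the forward pass thanks to hard=True
xi.backward()  # the backward pass uses the soft relaxation (straight-through estimator)
print(theta_check.grad)  # a stochastic gradient estimate flows back to the parameter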
There is also an issue with control flow. PyTorch's automatic differentiation only records tensor operations in the computational graph; the outcome of a Python if statement is not itself a recorded operation, so any dependence of downstream values on which branch was taken is invisible to autograd. As a result, the gradient will not flow backwards correctly through if statements whose predicates depend on the parameters we want to differentiate with respect to. Analysing the code of random_walk, we find the following control flow statement:
xi = torch.distributions.Bernoulli(theta).sample()
if xi == 1:
    next_x = x[-1] + 1
else:
    next_x = x[-1] - 1
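Even if xi itself were differentiable with respect to theta, branching on its value would produce outputs that autograd cannot trace back to theta. Here is a minimal sketch of the problem (the values are illustrative):
theta_demo = torch.tensor(0.7, requires_grad=True)
logits = torch.hstack((theta_demo, 1 - theta_demo)).log()
xi = torch.nn.functional.gumbel_softmax(logits, tau=1.0, hard=True)[0]  # a differentiable sample
if xi == 1:
    next_x = torch.tensor(0.0) + 1
else:
    next_x = torch.tensor(0.0) - 1
print(next_x.grad_fn)  # None: the if statement leaves no path from next_x back to theta_demo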
Since the if statement's predicate depends on xi, which in turn depends on theta, the gradient will not flow backwards properly. To fix this issue, we can reimplement the if statement using masking.
xi = torch.distributions.Bernoulli(theta).sample()
next_x = xi * (x[-1] + 1) + (1 - xi) * (x[-1] - 1)
Note that if xi = 1 then next_x will be equal to x[-1] + 1, as (1 - xi) will be equal to zero. That is, the else branch of the original if statement is masked out. Likewise, if xi = 0 then next_x will be equal to x[-1] - 1, as the output (x[-1] + 1) is masked out by its multiplier xi. In other words, the code above is completely equivalent to the previous if statement. Moreover, this code plays nicely with the computational graph built by PyTorch, as it contains no control flow statements.
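As a quick check (a minimal sketch using an arbitrary current position), we can confirm that the masked expression reproduces both branches of the original if statement:
x_last = torch.tensor(3.0)  # an arbitrary current position
for xi in (torch.tensor(0.0), torch.tensor(1.0)):
    masked = xi * (x_last + 1) + (1 - xi) * (x_last - 1)
    branched = x_last + 1 if xi == 1 else x_last - 1
    print(xi.item(), masked.item(), branched.item())  # masked and branched agree for both values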
Let's rewrite random_walk to implement both masking and the Gumbel-Softmax trick!
def random_walk(theta, n_timesteps, tau=0.1):
    x = torch.tensor([0.0])
    for i in range(n_timesteps - 1):
        logits = torch.hstack((theta, 1 - theta)).log()
        xi = torch.nn.functional.gumbel_softmax(logits, tau=tau, hard=True)[0]
        next_x = x[-1] + 2 * xi - 1
        x = torch.hstack((x, next_x))
    return x
Let's run a simulation to check our model runs properly.
theta = torch.tensor(0.4)
n_timesteps = 100
x = random_walk(theta, n_timesteps)
plt.plot(x)
Both implementations of random_walk behave identically; any difference observed between simulation outputs is due only to the randomness of the Bernoulli distribution. This is because we set hard=True when employing the Gumbel-Softmax trick, ensuring that the Bernoulli distribution is only replaced by a continuous relaxation when applying the chain rule during the backward pass. If the previous sentence is confusing, you should review the previous notebook. Let's check the Jacobian again!
dx_dtheta = torch.autograd.functional.jacobian(
    lambda x: random_walk(theta=x, n_timesteps=n_timesteps, tau=1.0), theta
)
plt.plot(dx_dtheta)
PyTorch is now giving us non-zero derivatives thanks to our changes. Recall that the bias-variance trade-off of the Gumbel-Softmax trick is controlled by the temperature $\tau$. Let's investigate this trade-off in the context of the random walk. Thankfully, the exact analytical gradient is easy to derive for the random walk:
$$ \begin{align} \frac{\partial}{\partial \theta} \mathbb{E}[x_t] & = \frac{\partial}{\partial \theta} \mathbb{E}\left[ \sum_{j=1}^{t} (2 \xi_j - 1) \right], \quad \xi_j \sim \mathrm{Bern}(\theta) \\ & = \frac{\partial}{\partial \theta} (2 t \theta - t) \\ & = 2t \end{align} $$This means we can compare the gradients output by PyTorch against the analytical gradient for different temperature settings!
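Before running the comparison, we can sanity-check the intermediate step $\mathbb{E}[x_t] = 2t\theta - t$ with a quick Monte Carlo estimate (a minimal sketch; the sample size and variable names are arbitrary):
theta_mc, t, n_samples = 0.4, 50, 10_000
steps = 2 * torch.distributions.Bernoulli(torch.tensor(theta_mc)).sample((n_samples, t)) - 1
x_t = steps.sum(dim=1)  # final position of each simulated walk
print(x_t.mean().item(), 2 * t * theta_mc - t)  # the two values should be close
With the expectation confirmed, we can now run the comparison.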
taus = [0.1, 0.5, 1.0]
n_gradient_samples = 50
n_timesteps = 50
gradients_per_tau = {tau: [] for tau in taus}
for tau in taus:
    for i in range(n_gradient_samples):
        dx_dtheta = torch.autograd.functional.jacobian(
            lambda x: random_walk(theta=x, n_timesteps=n_timesteps, tau=tau),
            theta,
            vectorize=True,
        )
        gradients_per_tau[tau].append(dx_dtheta)
fig, ax = plt.subplots()
for i, tau in enumerate(gradients_per_tau):
    for grad in gradients_per_tau[tau]:
        ax.plot(grad, color=f"C{i}", alpha=0.2)
    ax.plot(
        sum(gradients_per_tau[tau]) / n_gradient_samples,
        color=f"C{i}",
        linestyle="--",
        label=rf"$\tau$ = {tau} [mean]",
        lw=3,
    )
    ax.plot([], [], color=f"C{i}", label=rf"$\tau$ = {tau}")
ax.plot(
    range(n_timesteps),
    2 * np.array(range(n_timesteps)),
    color="black",
    linestyle="--",
    label="true gradient",
)
ax.legend()
Unsurprisingly, as $\tau$ decreases, the mean of each gradient estimate gets closer to the analytical gradient, whilst the variance increases significantly. In this case, it may be more reasonable to use $\tau=0.5$ rather than $\tau=0.1$, since a small increase in bias is compensated by a large reduction in variance.
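We can also quantify this trade-off numerically by summarising the sampled gradients at the final timestep (a small sketch that re-uses gradients_per_tau from the experiment above; the analytical gradient for the final entry of the walk is 2 * (n_timesteps - 1)):
true_grad = 2 * (n_timesteps - 1)  # analytical gradient of the final position
for tau, grads in gradients_per_tau.items():
    final = torch.stack([g[-1] for g in grads])  # gradient estimates at the last timestep
    bias = final.mean().item() - true_grad
    print(f"tau={tau}: bias={bias:.1f}, std={final.std().item():.1f}")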
In this notebook, we used the tools we have learned so far to build a simple differentiable program. In the next notebook, we will see how the same tools can be applied to build a differentiable economic ABM. Moreover, we will finally highlight the advantages of differentiability when it comes to calibration!