One problem with learning MCMC samplers using our method is the difficulty of obtaining a good estimate of the transition probabilities. This is mainly because the transition probabilities involve intractable integrals. For example, for the Metropolis-Hastings algorithm, the transition probability in the reject case involves an integral over all possible states. In our current implementation, we approximate this using the current sample but that is far from a good estimate. What if we focus on sampling techniques that do not involve accept/reject operations? In that case, there will be no issues with estimating the transition probabilities, and this will probably lead to a much more efficient algorithm.
I decided to look at importance sampling first. Importance sampling is not really a sampling technique; it is a method for estimating expectations with respect to a target distribution using samples from a easy-to-sample from proposal distribution. However, since the optimal proposal distribution for importance sampling is the target distribution itself, if we can learn the optimal proposal, we will have learned a good sampler for the target distribution. The optimal proposal distribution is the one that leads to equal importance weights for each sample. Hence, the variance of importance weights is a good measure of the quality of a sampler. More formally, let $p(x)=\frac{p^*(x)}{Z}$ be our target distribution and $q_{\theta}$ be our proposal distribution. We would like to calculate some expectation with respect to $p$: $\mathbb{E}_p[f(x)]$. In importance sampling, we get $N$ samples $x_1, \dots, x_N$ from $q_{\theta}$ and use these to calculate the desired expectation: $$ \mathbb{E}_p[f(x)] = \int \frac{p^*(x)}{Z} f(x) dx = \frac{\int p^*(x) \frac{q(x)}{q(x)} f(x) dx}{\int p^*(x) \frac{q(x)}{q(x)} dx} $$ Defining $w(x) = \frac{p^*(x)}{q(x)}$, we can write $$ \mathbb{E}_p[f(x)] = \frac{\int q(x) w(x) f(x) dx}{\int q(x) w(x) dx} \approx \frac{\sum_{x \sim q} w(x) f(x)}{\sum_{x \sim q} w(x)} $$ The optimal $q$ is the target distribution $p$ itself, in which case all $w$ are equal. Therefore, for the optimal $q$, the variance of importance weights is 0, $\text{Var}(w) = 0$. Now, if we use the negative variance of importance weights as our reward function, we can measure the goodness of a proposal distribution with the following expectation $$ \mathbb{E}_{(x_1,\dots,x_N) \sim q_{\theta}} [-\text{Var}((w(x_1),\dots,w(x_N))] $$ Then we can learn a better proposal by maximizing this expectation with respect to $\theta$. We can get an estimate of the gradient of this expectation using a REINFORCE-style algorithm (see Manuscript for details). However, differently from the case in our Manuscript, now the reward function (i.e., variance of importance weights) depends on $\theta$ too (importance weights involve $q$ which depend on $\theta$). Taking this into account, we get the following expression for the gradient $$ \nabla_{\theta} \mathbb{E}_{(x_1,\dots,x_N) \sim q_{\theta}} [-\text{Var}((w(x_1),\dots,w(x_N))] \approx \sum_{(x_1,\dots,x_N) \sim q_{\theta}} -\nabla_{\theta} \log \left( \prod_{n=1}^{N} q_{\theta}(x_n) \right) \text{Var}((w(x_1),\dots,w(x_N)) \\ - \sum_{(x_1,\dots,x_N) \sim q_{\theta}} \nabla_{\theta} \text{Var}((w(x_1),\dots,w(x_N)) $$ where we get an extra second term involving the gradient of the reward function. In some cases (e.g., when we can differentiate the target distribution), we can reparameterize $q$ and get a more efficient estimate of the gradient. We will see an example of this in the below 1D Gaussian example.
Note that what we have talked about so far is not really a Markov chain technique because samples are drawn independently from $q$. However, we can modify the above to use a Markov chain as we will see below.
Let us look at a simple example. We'd like to sample from $p(x) = \mathcal{N}(0, 1)$ using a Gaussian distribution $q_{m,s}(x) = \mathcal{N}(m, s^2)$. Let $x_1, \dots, x_N$ be samples from $q_{m,s}$. Our objective, which measures the goodness of proposal $q$, can be written as $$ \mathbb{E}_{(x_1,\dots,x_N) \sim q_{m,s}} [-\text{Var}((w(x_1),\dots,w(x_N)))] = -\sum_{(x_1,\dots,x_N)} \left(\prod_{n=1}^{N} q_{m,s}(x_n) \right) \text{Var}((w(x_1),\dots,w(x_N))) $$ Reparameterize $q_{m,s}$ to get $$ -\sum_{(x_1,\dots,x_N)} \left(\prod_{n=1}^{N} q_{m,s}(x_n) \right) \text{Var}((w(x_1),\dots,w(x_N))) = -\sum_{(\epsilon_1,\dots,\epsilon_N)} \left(\prod_{n=1}^{N} \mathcal{N}(\epsilon_n;0, 1) \right) \text{Var}((w(m + s\epsilon_1),\dots,w(m + s\epsilon_N))) $$ where $(\epsilon_1,\dots,\epsilon_N) \sim \mathcal{N}(0,1)$. Then the gradient is $$ \nabla_{m,s} -\sum_{(\epsilon_1,\dots,\epsilon_N)} \left(\prod_{n=1}^{N} \mathcal{N}(\epsilon_n;0, 1) \right) \text{Var}((w(m + s\epsilon_1),\dots,w(m + s\epsilon_N))) \approx -\sum_{(\epsilon_1,\dots,\epsilon_N) \sim \mathcal{N}(0,1)} \nabla_{m,s} \text{Var}((w(m + s\epsilon_1),\dots,w(m + s\epsilon_N))) $$ The code below does gradient ascent using this gradient estimate to learn the optimal $m$ and $s$.
import autograd.numpy as np
import autograd as ag
def p(x):
"""
Unnormalized target distribution
"""
return np.exp(-0.5*(x)**2)
def q(x, params):
"""
Proposal probability q(x)
We don't use this.
"""
m, logs = params
s = np.exp(logs)
return (1./np.sqrt(2.*np.pi)) * (1./s) * np.exp(-0.5*(x-m)**2/(s**2))
def unit_normal(e):
"""
Unit normal pdf
"""
return (1./np.sqrt(2.*np.pi)) * np.exp(-0.5*e**2)
def reward(es, params):
"""
Calculate the reward for a given set of samples.
es are samples from a unit normal used to calculate samples from q.
"""
m, logs = params
s = np.exp(logs)
px = p(m + es*s)
qx = unit_normal(es)
# importance weights
wx = px / qx
# normalize weights
wx /= np.sum(wx)
return -np.var(wx)
# gradient of the reward function
grad_reward = ag.grad(reward, argnum=1)
m = np.random.randn()
logs = np.random.randn()
params = np.array([m, logs])
print "Initial params:", params
print "True params: [0.0, 0.0]"
samples_per_episode = 20
episode_count = 5000
report_interval = 500
lr = 1.0
for e in range(episode_count):
es = np.random.randn(samples_per_episode)
g = grad_reward(es, params)
params += lr*g
if (e+1) % report_interval == 0:
print (e+1) / report_interval, params
print "Learned params:", params
Initial params: [ 0.7975914 -1.63233556] True params: [0.0, 0.0] 1 [ 0.43329938 -1.32309203] 2 [ 0.01607027 -0.0349198 ] 3 [ 0.00139609 -0.00024249] 4 [ 1.29744388e-04 -5.29515308e-07] 5 [ 1.24870887e-05 -3.00159895e-07] 6 [ 1.18580291e-06 -2.94458797e-08] 7 [ 1.13064346e-07 2.02749769e-09] 8 [ 1.03415961e-08 3.67767226e-10] 9 [ 9.42626877e-10 5.39614571e-11] 10 [ 8.33903628e-11 -9.54070361e-13] Learned params: [ 8.33903628e-11 -9.54070361e-13]
In the above example, we collected independent samples from $q_{\theta}$ and optimized the variance of importance weights over these samples to learn a good $q_{\theta}$, which we ultimately use to obtain approximate samples from the target distribution $p$. However, such a sampler is limited by the form of $q_{\theta}$. Since we use a simple $q_{\theta}$ (i.e., a Gaussian distribution) in most cases, we won't be able to learn a good approximation to $p$. What if we use a Markov chain instead of obtaining independent samples from $q_{\theta}$? Imagine we learn the transition kernel $q_{\theta}(x_t|x_{t-1})$ for a Markov chain. Hence each sample is in fact a collection of samples $x_1,\dots,x_T$. In other words, our target distribution is $p(x_1, \dots, x_T)$ and we are doing sequential importance sampling with proposal distribution $q_{\theta}(x_t|x_{t-1})$. All $x_t$ have the same distribution $p(x)$ and $p(x_t|x_{t-1}) = p(x_t)$. Again, the importance weight for a sample $\mathbf{x} = x_1,\dots,x_T$ is $$ w(\mathbf{x}) = \frac{\prod_t p^*(x_t)}{q_1(x_1)\prod_t q_{\theta}(x_t|x_{t-1})} $$ where $q_1$ is the initial state distribution (which we assume is fixed and do not learn). Now we can apply the approach outlined above to learn a good $q_{\theta}$. Let $\mathbf{x}_1,\dots, \mathbf{x}_N$ be samples from $q_{\theta}(\mathbf{x})$. Note that each $\mathbf{x}_n$ is in fact a collection of samples $x_1,\dots,x_T$ from the Markov chain defined by $q_{\theta}(x_t|x_{t-1})$. We want to maximize the following expectation $$ \mathbb{E}_{(\mathbf{x}_1,\dots,\mathbf{x}_N) \sim q_{\theta}} [-\text{Var}((w(\mathbf{x}_1),\dots,w(\mathbf{x}_N)))] = -\sum_{(\mathbf{x}_1,\dots,\mathbf{x}_N)} \left(\prod_{n=1}^{N} q_{\theta}(\mathbf{x}_n) \right) \text{Var}((w(\mathbf{x}_1),\dots,w(\mathbf{x}_N))) $$
Let $q_{\theta=(w_m,b_m,w_s,b_s)}(x_t|x_{t-1}) = \mathcal{N}(w_m x_{t-1} + b_m, \exp(w_s x_{t-1} + b_s)^2)$. Reparameterize $q_{\theta}$ to get $$ -\sum_{(\mathbf{x}_1,\dots,\mathbf{x}_N)} \left(\prod_{n=1}^{N} q_{\theta}(\mathbf{x}_n) \right) \text{Var}((w(\mathbf{x}_1),\dots,w(\mathbf{x}_N))) = -\sum_{(\mathbf{\epsilon}_1,\dots,\mathbf{\epsilon}_N)} \left(\prod_{n=1}^{N} \mathcal{N}(\mathbf{\epsilon}_n;0, \mathbf{I}) \right) \text{Var}((w(\theta, \mathbf{\epsilon}_1),\dots,w(\theta, \mathbf{\epsilon}_N))) $$
Here $q_{\theta}(\mathbf{x}_n) = \mathcal{N}(\mathbf{\epsilon}_n;0, \mathbf{I}) = \prod_{t=1}^{T} \mathcal{N}(\epsilon_{nt};0, 1)$ and $w(\mathbf{x}_n) = w(\theta, \mathbf{\epsilon}_n) = \frac{\prod_t p^*(\theta, \epsilon_nt)}{q_1(\epsilon_{n1})\prod_t N(\epsilon_{nt};0, 1)}$. In the following, we assume $q_1(x) = \mathcal{N}(0,1)$. Again, we can form a Monte Carlo estimate of the gradient of this expression. $$ \nabla_{\theta} \mathbb{E}_{(\mathbf{x}_1,\dots,\mathbf{x}_N) \sim q_{\theta}} [-\text{Var}((w(\mathbf{x}_1),\dots,w(\mathbf{x}_N)))] \approx -\sum_{(\mathbf{\epsilon}_1,\dots,\mathbf{\epsilon}_N) \sim \mathcal{N}(0, \mathbf{I})} \nabla_{\theta} \text{Var}((w(\theta, \mathbf{\epsilon}_1),\dots,w(\theta, \mathbf{\epsilon}_N))) $$ The code below does gradient ascent using this gradient estimate to learn the optimal $q$.
import autograd.numpy as np
import autograd as ag
def p(x):
"""
Unnormalized target distribution
"""
return np.exp(-0.5*x**2)
def q(xp, x, params):
"""
Proposal probability q(xp|x)
We don't use this.
"""
w_m, b_s, w_logs, b_logs = params
m = w_m * x + b_m
logs = w_logs * x + b_logs
s = np.exp(logs)
return (1./np.sqrt(2.*np.pi)) * (1./s) * np.exp(-0.5*(x-m)**2/(s**2))
def get_proposal_distribution(x, params):
w_m, b_m, w_logs, b_logs = params
m = w_m * x + b_m
logs = w_logs * x + b_logs
s = np.exp(logs)
return m, s
def unit_normal(e):
"""
Unit normal pdf
"""
return (1./np.sqrt(2.*np.pi)) * np.exp(-0.5*e**2)
def reward(es, params):
"""
Calculate the reward for a given set of samples.
es are N samples from a T dimensional unit normal used to calculate samples from q.
"""
px = np.zeros(es.shape[0])
x = es[:, 0]
px = px + p(x)
for t in range(1, es.shape[1]):
m, s = get_proposal_distribution(x, params)
x = m + es[:, t]*s
px = px + p(x)
qx = np.sum(unit_normal(es), axis=1)
# importance weights
wx = px / qx
# normalize weights
wx /= np.sum(wx)
return -np.var(wx)
# gradient of the reward function
grad_reward = ag.grad(reward, argnum=1)
w_m, b_m = np.random.randn(2)
# w_m, b_m = [0.1, 0.2]
w_logs, b_logs = np.random.randn(2) * 0.1
# w_logs, b_logs = [0.15, -0.2]
init_params = np.array([w_m, b_m, w_logs, b_logs])
print "Initial params:", init_params
chain_length = 10
samples_per_episode = 20
episodes_per_epoch = 50
epoch_count = 10
def estimate_grad(params, i, reps=10):
notnan_count = 0
tot_g = 0.0
for _ in range(0, reps):
es = np.random.randn(samples_per_episode, chain_length)
g = grad_reward(es, params)
if not np.any(np.isnan(g)):
tot_g += g
notnan_count += 1
return -g / notnan_count
def report_progress(params, i, g):
if (i+1) % episodes_per_epoch == 0:
print "Iteration ", i+1, ", params: ", params
from autograd.optimizers import adam
new_params = adam(estimate_grad, init_params, callback=report_progress,
num_iters=epoch_count*episodes_per_epoch, step_size=0.01)
print "Learned params: ", new_params
Initial params: [ 0.47322725 -0.19712538 -0.15801755 -0.0404277 ] Iteration 50 , params: [ 0.16891344 0.02214555 0.01604627 -0.14457628] Iteration 100 , params: [ 0.03594182 0.00175411 -0.00525639 -0.0019805 ] Iteration 150 , params: [ 4.76444917e-04 1.07305347e-03 -1.95869994e-04 8.68722901e-05] Iteration 200 , params: [ -1.25483150e-04 1.08470168e-04 -9.88970331e-05 8.70638576e-05] Iteration 250 , params: [ 1.65369863e-05 8.83484519e-06 -1.03442468e-05 -4.27345602e-07] Iteration 300 , params: [ -1.47296818e-06 7.30363011e-07 -4.11824583e-07 -7.70735225e-07] Iteration 350 , params: [ 1.01098106e-07 -1.91269422e-08 3.06195797e-08 7.53500840e-09] Iteration 400 , params: [ -7.72668618e-09 -4.06799289e-09 3.85957588e-09 1.07956371e-09] Iteration 450 , params: [ 5.35235409e-10 2.70979333e-10 -3.22116741e-10 2.18078324e-10] Iteration 500 , params: [ -2.36859315e-11 3.79240848e-12 -1.64019694e-12 1.31753273e-11] Learned params: [ -2.07087027e-11 3.07096776e-12 -9.51726090e-14 1.25121962e-11]
As we see above, our method does converge to a solution where all parameters are 0. This is the optimal proposal $q_{\theta}(x_t|x_{t-1}) = \mathcal{N}(0, 1)$, which is the target distribution itself.
The above example is not very useful because the target distribution is simple and can be exactly learned by a single step Markov chain. Therefore, there is not much point in running a Markov chain to get samples. Now, I'd like to look at a target distribution that cannot be learned exactly by a single step Markov chain.
def p(x):
"""
Unnormalized target distribution, a mixture of 2 Gaussians
"""
return np.exp(-0.5*((x-3.0)**2 + (x+3.0)**2))
def get_proposal_distribution(x, params):
"""
MLP with a single hidden layer
"""
w_h, b_h, w_o, b_o = params
h = np.maximum(np.dot(x[:, np.newaxis], w_h) + b_h, 0.0)
o = np.dot(h, w_o) + b_o
m = o[:, 0]
s = np.exp(o[:, 1])
return m, s
hidden_unit_count = 5
w_h, b_h = np.random.randn(1, hidden_unit_count)*0.01, np.random.randn(hidden_unit_count)*0.01
w_o, b_o = np.random.randn(hidden_unit_count, 2)*0.01, np.random.randn(2)*0.01
init_params = [w_h, b_h, w_o, b_o]
print "Initial params:", init_params
chain_length = 2
samples_per_episode = 20
episodes_per_epoch = 100
epoch_count = 20
def estimate_reward(params, i, reps=10):
tot_r = 0
for _ in range(0, reps):
es = np.random.randn(samples_per_episode, chain_length)
r = reward(es, params)
if not np.isnan(r):
tot_r += r
return -tot_r / reps
estimate_grad = ag.grad(estimate_reward, argnum=0)
def report_progress(params, i, g):
if (i+1) % episodes_per_epoch == 0:
print "Iteration", i+1, ", reward:", estimate_reward(params, 0, reps=100)
from autograd.optimizers import adam
new_params = adam(estimate_grad, init_params, callback=report_progress,
num_iters=epoch_count*episodes_per_epoch, step_size=0.0005)
print "Learned params:", new_params
Initial params: [array([[-0.00164492, -0.00703599, -0.00222673, 0.00650332, 0.01235202]]), array([ 0.00266052, 0.00910101, -0.01043781, 0.01402108, 0.02410367]), array([[ 0.01080268, -0.0076234 ], [ 0.0014125 , 0.00636405], [ 0.02105997, 0.00282899], [ 0.00583331, 0.00413195], [-0.00981871, 0.00461101]]), array([-0.01002317, 0.01563595])] Iteration 100 , reward: 9.37926103229e-05 Iteration 200 , reward: 6.70920194135e-05 Iteration 300 , reward: 3.34437673073e-05 Iteration 400 , reward: 2.44834140503e-05 Iteration 500 , reward: 2.49889835873e-05 Iteration 600 , reward: 2.49626558336e-05 Iteration 700 , reward: 2.25919763367e-05 Iteration 800 , reward: 2.10479868791e-05 Iteration 900 , reward: 2.24062826549e-05 Iteration 1000 , reward: 1.8265696199e-05 Iteration 1100 , reward: 2.32710591097e-05 Iteration 1200 , reward: 2.18744431908e-05 Iteration 1300 , reward: 2.10094341649e-05 Iteration 1400 , reward: 2.35338710678e-05 Iteration 1500 , reward: 2.22089476915e-05 Iteration 1600 , reward: 1.96426690816e-05 Iteration 1700 , reward: 2.13385311428e-05 Iteration 1800 , reward: 2.29163528332e-05 Iteration 1900 , reward: 1.74914331388e-05 Iteration 2000 , reward: 2.06163819676e-05 Learned params: [array([[ -1.38996613e-01, 1.15532745e-04, -2.22673174e-03, 1.15924545e-01, 9.93648669e-03]]), array([ 0.2859183 , 0.33324286, -0.01043781, 0.339953 , 0.34620153]), array([[ 1.79043412e-03, -2.01520682e-01], [ 1.53090514e-03, -3.24090968e-01], [ 2.10599672e-02, 2.82899353e-03], [ 7.62887522e-03, -1.94989227e-01], [ 2.74785101e-04, -3.08037629e-01]]), array([-0.00215867, -0.17233097])]
N = 500
x1 = np.random.randn(N)
m, s = get_proposal_distribution(x1, new_params)
x2 = m + s*np.random.randn(N)
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x1, x2)
# plt.hist(x2)
<matplotlib.collections.PathCollection at 0x7f1c03c29210>
It doesn't seem to learn anything useful. I was hoping that the marginal of $x_2$ would be bimodal (like the target distribution)
Maybe, I should consider only the last state from a chain as a sample. That would require me to get the marginal probability of last state (which is not easy).