Material for a UC Irvine course offered by the Department of Physics and Astronomy.
Content is maintained on github and distributed under a BSD3 license.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
from sklearn import neighbors
Markov-chain Monte Carlo (MCMC) is an algorithm to generate random samples from an un-normalized probability density. In other words, you want to sample from $P(\vec{z})$ but can only evaluate $f(\vec{z})$, where $$ P(\vec{z}) = \frac{f(\vec{z})}{\int d\vec{z}\,f(\vec{z})} \; . $$ Note that $0 \le P(\vec{z}) \le 1$ requires that $f(\vec{z}) \ge 0$ everywhere and that the integral has a non-zero finite value.
We will start with some simple motivating examples before diving into the Bayesian applications and the theory of Markov chains.
The function $$ f(z) = \begin{cases} \sqrt{1 - z^4} & |z| < 1 \\ 0 & |z| \ge 1 \end{cases} $$ is never negative and has a finite integral:
def plotf(zlim=1.2):
    z = np.linspace(-zlim, +zlim, 250)
    plt.plot(z, np.sqrt(np.maximum(0, 1 - z ** 4)))
    plt.xlim(-zlim, +zlim)
plotf()
However, the normalization integral cannot be evaluated analytically (it is related to the complete elliptic integral of the first kind), so this is a good candidate for MCMC sampling using the MLS MCMC_sample function (which wraps emcee):
from mls import MCMC_sample
def logf(z):
    return 0.5 * np.log(1 - z ** 4) if np.abs(z) < 1 else -np.inf
gen = np.random.RandomState(seed=123)
samples = MCMC_sample(logf, z=[0], nsamples=20000, random_state=gen)
The notation z=[0] identifies z as the parameter we want to sample (starting at the value 0). The result is a Pandas DataFrame of generated samples:
samples[:5]
| | z |
|---|---|
| 0 | 0.062330 |
| 1 | 0.047944 |
| 2 | -0.229985 |
| 3 | 0.061017 |
| 4 | 0.181863 |
The generated samples are (approximately) drawn from the normalized $P(z)$ corresponding to the $f(z)$ provided:
plt.hist(samples['z'], range=(-1,1), bins=25);
What are MCMC samples good for? They allow us to estimate the expectation value of an arbitrary $g(z)$ using importance sampling: $$ \langle g(\vec{z})\rangle_P \equiv \int d\vec{z}\, g(\vec{z})\, P(\vec{z}) \simeq \frac{1}{N} \sum_{i=1}^N g(\vec{z}_i) \; , $$ where $\vec{z}_1, \vec{z}_2, \ldots$ are the MCMC samples.
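As a quick sanity check of this Monte Carlo average, here is a sketch using samples we can draw exactly rather than with MCMC: for a standard normal, the exact expectation of $g(z) = z^2$ is 1.

```python
import numpy as np

# Draw exact samples from N(0, 1) and average g(z) = z**2.
# The sample mean should approximate the exact expectation value, 1.
gen = np.random.RandomState(seed=1)
z = gen.normal(size=100000)
estimate = np.mean(z ** 2)
print(estimate)  # close to 1
```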
For example, to estimate the expectation value of $g(z) = z^2$ (aka the variance, since the mean vanishes here by symmetry) with the samples generated above:
np.mean(samples['z'] ** 2)
0.2729666505147651
Expectation values of more complex functions are equally easy, for example, $g(z) = \sin(\pi z)^2$,
np.mean(np.sin(np.pi * samples['z']) ** 2)
0.5414157152385632
Recall that we are using MCMC because we do not know the value of the normalization constant: $$ \int d\vec{z}\,f(\vec{z}) \; . $$ However, we can use the MCMC samples to estimate its value as follows:
For example, use KDE to estimate the density of our generated samples:
fit = neighbors.KernelDensity(kernel='gaussian', bandwidth=0.01).fit(samples)
Now take the ratio of the (normalized and noisy) KDE density estimate and the (un-normalized and smooth) $f(z)$ on a grid of $z$ values:
def plotfit(zlim=1.2, Pmin=0.1):
    z = np.linspace(-zlim, +zlim, 250)
    f = np.sqrt(np.maximum(0, 1 - z ** 4))
    P = np.exp(fit.score_samples(z.reshape(-1, 1)))
    plt.plot(z, f, label='$f(z)$')
    plt.fill_between(z, P, alpha=0.5, label='$P(z)$')
    # The ratio f(z) / P(z) estimates the unknown normalization constant.
    ratio = f / P
    sel = P > Pmin
    plt.plot(z[sel], ratio[sel], '.', label='$f(z)/P(z)$')
    mean_ratio = np.mean(ratio[sel])
    print('mean f(z)/P(z) = {:.3f}'.format(mean_ratio))
    plt.axhline(mean_ratio, ls='--', c='k')
    plt.xlim(-zlim, +zlim)
    plt.legend(loc='upper left', ncol=3)
plotfit()
mean f(z)/P(z) = 1.761
The estimated $P(z)$ does not look great, but the mean of $f(z) / P(z)$ estimates the normalization constant. In practice, we restrict this mean to $z$ values where $P(z)$ is above some minimum to avoid regions where the empirical density estimate is poorly determined.
In the example above, the true value of the integral rounds to 1.748 so our numerical accuracy is roughly 1%.
Note that we cannot simply use $g(z) = 1$ in the importance sampled integral above to estimate the normalization constant since it gives exactly one! The unknown constant is the integral of $f(z)$, not $P(z)$.
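As an independent cross-check (not part of the MCMC machinery), this 1D normalization integral can also be evaluated by ordinary numerical quadrature, e.g. with scipy.integrate.quad:

```python
import numpy as np
from scipy.integrate import quad

# Numerically integrate f(z) = sqrt(1 - z**4) over its support [-1, +1].
norm, err = quad(lambda z: np.sqrt(1 - z ** 4), -1, 1)
print(norm)  # rounds to 1.748, the true value quoted above
```

This kind of direct quadrature is only practical in low dimensions, which is exactly why the MCMC-based estimate is useful.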
Next, we try a multidimensional example: $$ f(\vec{z}; \vec{z}_0, r) = \begin{cases} \exp\left(-|\vec{z} - \vec{z}_0|^2/2\right) & |\vec{z}| < r \\ 0 & |\vec{z}| \ge r \end{cases} $$ This function describes an un-normalized Gaussian PDF centered at $\vec{z}_0$ and clipped to zero outside $|\vec{z}| < r$. The normalization integral has no analytic solution except in the limits $\vec{z}_0\rightarrow 0$ or $r\rightarrow\infty$.
To generate MCMC samples in 2D:
def logf(x, y, x0, y0, r):
    z = np.array([x, y])
    z0 = np.array([x0, y0])
    return -0.5 * np.sum((z - z0) ** 2) if np.sum(z ** 2) < r ** 2 else -np.inf
The variables to sample are assigned initial values in square brackets and all other arguments are treated as fixed hyperparameters:
samples = MCMC_sample(logf, x=[0], y=[0], x0=1, y0=-2, r=3, nsamples=10000)
The generated samples now have two columns:
samples[:5]
| | x | y |
|---|---|---|
| 0 | 0.286831 | -1.843221 |
| 1 | 0.014914 | -2.954994 |
| 2 | 0.014914 | -2.954994 |
| 3 | 0.014914 | -2.954994 |
| 4 | 0.014914 | -2.954994 |
(The repeated rows above are expected: an MCMC chain repeats its current sample whenever a proposed move is rejected.) A scatter plot shows a 2D Gaussian distribution clipped to a circle and offset from its center, as expected:
plt.scatter(samples['x'], samples['y'], s=10)
plt.scatter(1, -2, marker='+', s=500, lw=5, c='white')
plt.gca().add_artist(plt.Circle((0, 0), 3, lw=4, ec='red', fc='none'))
plt.gca().set_aspect(1)
With multidimensional samples, we can estimate expectation values of marginal PDFs just as easily as the full joint PDF. In our 2D example, the marginal PDFs are: $$ P_X(x) = \int dy\, P(x, y) \quad , \quad P_Y(y) = \int dx\, P(x, y) \; . $$ For example, the expectation value of $g(x)$ with respect to $P_X$ is: $$ \langle g\rangle \equiv \int dx\, g(x) P_X(x) = \int dx\, g(x) \int dy\, P(x, y) = \int dx dy\, g(x)\, P(x,y) \; . $$ In other words, the expectation value with respect to a marginal PDF is equal to the expectation with respect to the full joint PDF.
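This identity is easy to verify with samples we can draw exactly. A sketch using numpy's built-in multivariate normal generator (no MCMC involved): averaging $g(x) = x$ over the joint samples, ignoring $y$ entirely, recovers the marginal mean of $x$.

```python
import numpy as np

# Draw correlated 2D Gaussian samples with known marginal means.
gen = np.random.RandomState(seed=2)
mu = [1.0, -2.0]
C = [[1.0, 0.8], [0.8, 2.0]]
xy = gen.multivariate_normal(mu, C, size=100000)

# Averaging x over the *joint* samples estimates the *marginal* mean of x,
# even though x and y are correlated.
print(np.mean(xy[:, 0]))  # close to the marginal mean mu_x = 1
```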
For example, the expectation value of $g(x) = x$ (aka the mean) with respect to $P_X(x)$ is:
np.mean(samples['x'])
0.7418902711774612
We can also estimate the density of a marginal PDF by simply dropping the columns that are integrated out before plugging into a density estimator. For example:
fitX = neighbors.KernelDensity(kernel='gaussian', bandwidth=0.1).fit(samples.drop(columns='y'))
fitY = neighbors.KernelDensity(kernel='gaussian', bandwidth=0.1).fit(samples.drop(columns='x'))
def plotfitXY(r=3):
    xy = np.linspace(-r, +r, 250)
    Px = np.exp(fitX.score_samples(xy.reshape(-1, 1)))
    Py = np.exp(fitY.score_samples(xy.reshape(-1, 1)))
    plt.plot(xy, Px, label='$P_X(x)$')
    plt.plot(xy, Py, label='$P_Y(y)$')
    plt.legend()
plotfitXY()
We introduced MCMC above as a general-purpose algorithm for sampling any un-normalized PDF, without any reference to Bayesian (or frequentist) statistics. We also never specified whether $\vec{z}$ was something observed (data) or latent (parameters and hyperparameters), because it doesn't matter to MCMC.
However, MCMC is an excellent tool for performing numerical inferences using the generalized Bayes' rule we met earlier: $$ P(\Theta_M\mid D, M) = \frac{P(D\mid \Theta_M, M)\,P(\Theta_M\mid M)}{P(D\mid M)} $$ In particular, the normalizing denominator (aka the "evidence"): $$ P(D\mid M) = \int d\Theta_M' P(D\mid \Theta_M', M)\, P(\Theta_M'\mid M) $$ is often not practical to calculate, so we can only calculate the un-normalized numerator $$ P(D\mid \Theta_M, M)\,P(\Theta_M\mid M) \; , $$ which combines the likelihood of the data and the prior probability of the model.
If we treat the observed data $D$ and hyperparameters $M$ as fixed, then the appropriate function to plug into an MCMC is: $$ \log f(\Theta_M) = \log P(D\mid \Theta_M, M) + \log P(\Theta_M\mid M) \; . $$ The machinery described above then enables us to generate samples $\Theta_1, \Theta_2, \ldots$ drawn from the posterior distribution, and therefore make interesting statements about probabilities involving model parameters.
The likelihood function depends on the data and model, so could be anything, but we often assume Gaussian errors in the data, which leads to the multivariate Gaussian PDF we met earlier ($d$ is the number of data features): $$ P(\vec{x}\mid \Theta_M, M) = \left(2\pi\right)^{-d/2}\,\left| C\right|^{-1/2}\, \exp\left[ -\frac{1}{2} \left(\vec{x} - \vec{\mu}\right)^T C^{-1} \left(\vec{x} - \vec{\mu}\right) \right] $$ In the most general case, $\vec{\mu}$ and $C$ are functions of everything: the data $D$, the parameters $\Theta_M$ and the hyperparameters $M$.
When we have $N$ independent observations, $\vec{x}_1, \vec{x}_2, \ldots$, their combined likelihood is the product of each sample's likelihood: $$ P(\vec{x}_1, \vec{x}_2, \ldots\mid \Theta_M, M) = \prod_{i=1}^N\, P(\vec{x}_i\mid \Theta_M, M) $$
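A quick illustration of this product rule with scipy.stats for 1D Gaussian observations: summing per-sample log-likelihoods agrees with taking the log of the product of likelihoods (and is numerically safer for large $N$, since the product underflows quickly).

```python
import numpy as np
from scipy import stats

# Ten independent observations drawn from N(0.5, 1).
gen = np.random.RandomState(seed=3)
x = gen.normal(loc=0.5, scale=1.0, size=10)

# Sum of log-likelihoods vs. log of the product of likelihoods.
logL_sum = np.sum(stats.norm.logpdf(x, loc=0.5, scale=1.0))
logL_prod = np.log(np.prod(stats.norm.pdf(x, loc=0.5, scale=1.0)))
print(np.allclose(logL_sum, logL_prod))  # True
```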
As an example, consider fitting a straight line $y = m x + b$, with parameters $m$ and $b$, to data with two features $x$ and $y$. The relevant log-likelihood function is: $$ \log{\cal L}(m, b; D) = -\frac{N}{2}\log(2\pi\sigma_y^2) -\frac{1}{2\sigma_y^2} \sum_{i=1}^N\, (y_i - m x_i - b)^2 \; , $$ where the error in $y$, $\sigma_y$, is a fixed hyperparameter. Note that the first term is the Gaussian PDF normalization factor.
First generate some data on a straight line with measurement errors in $y$ (so our assumed model is correct):
gen = np.random.RandomState(seed=123)
N, m_true, b_true, sigy_true = 10, 0.5, -0.2, 0.1
x_data = gen.uniform(-1, +1, size=N)
y_data = m_true * x_data + b_true + gen.normal(scale=sigy_true, size=N)
plt.errorbar(x_data, y_data, sigy_true, fmt='o', markersize=5)
plt.plot([-1, +1], [-m_true+b_true,+m_true+b_true], 'r:')
plt.xlabel('x'); plt.ylabel('y');
Next, define the log-likelihood function:
def loglike(x, y, m, b, sigy):
    N = len(x)
    norm = 0.5 * N * np.log(2 * np.pi * sigy ** 2)
    return -0.5 * np.sum((y - m * x - b) ** 2) / sigy ** 2 - norm
Finally, generate some MCMC samples of the posterior $P(m, b\mid D, M)$ assuming uniform priors $P(b,m\mid \sigma_y) = 1$:
samples = MCMC_sample(loglike, m=[m_true], b=[b_true],
x=x_data, y=y_data, sigy=sigy_true, nsamples=10000, random_state=gen)
sns.jointplot('m', 'b', samples, xlim=(0.2,0.8), ylim=(-0.3,0.0), stat_func=None);
samples.describe(percentiles=[])
| | m | b |
|---|---|---|
| count | 10000.000000 | 10000.000000 |
| mean | 0.531518 | -0.163364 |
| std | 0.074150 | 0.032178 |
| min | 0.277887 | -0.271737 |
| 50% | 0.533564 | -0.162615 |
| max | 0.799155 | -0.039917 |
EXERCISE: We always require a starting point to generate MCMC samples. In this example, we used the true parameter values as starting points:
m=[m_true], b=[b_true]
What happens if you choose different starting points? Try changing the starting values by $\pm 0.1$ and see how this affects the resulting means and standard deviations for $m$ and $b$.
samples = MCMC_sample(loglike, m=[m_true+0.1], b=[b_true+0.1],
x=x_data, y=y_data, sigy=sigy_true, nsamples=10000, random_state=gen)
samples.describe(percentiles=[])
| | m | b |
|---|---|---|
| count | 10000.000000 | 10000.000000 |
| mean | 0.526889 | -0.164017 |
| std | 0.071437 | 0.033185 |
| min | 0.287060 | -0.263188 |
| 50% | 0.524377 | -0.164970 |
| max | 0.800168 | -0.051169 |
samples = MCMC_sample(loglike, m=[m_true-0.1], b=[b_true-0.1],
x=x_data, y=y_data, sigy=sigy_true, nsamples=10000, random_state=gen)
samples.describe(percentiles=[])
| | m | b |
|---|---|---|
| count | 10000.000000 | 10000.000000 |
| mean | 0.526197 | -0.163166 |
| std | 0.073906 | 0.032095 |
| min | 0.287884 | -0.272038 |
| 50% | 0.525020 | -0.163915 |
| max | 0.811648 | -0.042163 |
The changes are small compared with the offsets ($\pm 0.1$) and the standard deviations in each parameter.
# Add your solution here...
The MCMC_sample function can apply independent (i.e., factorized) priors on each parameter:
$$
P(\Theta\mid M) = \prod_j P(\theta_j\mid M)
$$
Define the two most commonly used independent priors:
def TopHat(lo, hi):
    """Return un-normalized log(prior) for x in [lo,hi]"""
    return lambda x: 0 if (lo <= x <= hi) else -np.inf

def Gauss(mu, sigma):
    """Return un-normalized log(prior) for x ~ N(mu,sigma)"""
    return lambda x: -0.5 * ((x - mu) / sigma) ** 2
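A quick check of the values these factories return (their definitions are repeated here so the snippet is self-contained): inside the allowed range the TopHat log-prior is 0 (prior = 1) and outside it is $-\infty$ (prior = 0), while the Gauss log-prior drops to $-0.5$ one sigma away from the mean.

```python
import numpy as np

def TopHat(lo, hi):
    """Return un-normalized log(prior) for x in [lo,hi]"""
    return lambda x: 0 if (lo <= x <= hi) else -np.inf

def Gauss(mu, sigma):
    """Return un-normalized log(prior) for x ~ N(mu,sigma)"""
    return lambda x: -0.5 * ((x - mu) / sigma) ** 2

print(TopHat(0.4, 0.7)(0.5))      # 0    (inside the range)
print(TopHat(0.4, 0.7)(0.9))      # -inf (outside the range)
print(Gauss(-0.20, 0.02)(-0.18))  # -0.5 (one sigma from the mean)
```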
To apply a prior, we replace z=[value] with z=[value,logprior]. For example, suppose we believe that $0.4 \le m \le 0.7$:
samples = MCMC_sample(loglike, m=[m_true,TopHat(0.4,0.7)], b=[b_true],
x=x_data, y=y_data, sigy=sigy_true, nsamples=10000, random_state=gen)
sns.jointplot('m', 'b', samples, xlim=(0.2,0.8), ylim=(-0.3,0.0), stat_func=None);
We can also add a prior on $b$. For example, suppose a previous measurement found $b = -0.20 \pm 0.02$ (in which case, the new data is not adding much information about $b$):
samples = MCMC_sample(loglike, m=[m_true,TopHat(0.4,0.7)], b=[b_true,Gauss(-0.20,0.02)],
x=x_data, y=y_data, sigy=sigy_true, nsamples=10000, random_state=gen)
sns.jointplot('m', 'b', samples, xlim=(0.2,0.8), ylim=(-0.3,0.0), stat_func=None);
EXERCISE: Suppose we know that all $y_i$ values have the same error $\sigma_y$ but we do not know its value. Generalize the fit above to sample all three parameters, using the starting values m=[m_true], b=[b_true], sigy=[sigy_true], and display the resulting samples with sns.pairplot.
gen = np.random.RandomState(seed=123)
samples = MCMC_sample(loglike, m=[m_true], b=[b_true], sigy=[sigy_true],
x=x_data, y=y_data, nsamples=10000, random_state=gen)
sns.pairplot(samples);
Repeating with a TopHat prior that restricts $\sigma_y$ to $[0.01, 1]$ excludes the unphysical region $\sigma_y \le 0$:
samples = MCMC_sample(loglike, m=[m_true], b=[b_true], sigy=[sigy_true, TopHat(0.01,1)],
x=x_data, y=y_data, nsamples=10000, random_state=gen)
sns.pairplot(samples);
# Add your solution here...
For a more in-depth case study of the many subtleties in fitting a straight line, read this 55-page article by Hogg, Bovy and Lang.