%pylab inline
rc('text', usetex=True)
rc('axes', labelsize=15, titlesize=15)
import seaborn as sns
sns.set_style("white")
np.random.seed(10)
In this example we're looking at a dataset that has been drawn from a 2D Gaussian distribution. We're going to assume that we don't have a proper likelihood, but that we know the covariance matrix $\Sigma$ of the distribution. Using the ABC PMC algorithm we will approximate the posterior of the distribution of the mean values.
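More formally, writing the unknown mean vector as $\theta$, the quantity we want to approximate is
$$p(\theta \mid x_{\mathrm{obs}}) \propto p(\theta)\,p(x_{\mathrm{obs}} \mid \theta), \qquad x_i \sim \mathcal{N}(\theta, \Sigma),$$
where the likelihood $p(x_{\mathrm{obs}} \mid \theta)$ is treated as unavailable and is only accessed through forward simulations of the model.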
First we generate a new dataset by drawing random variables from a multivariate Gaussian around mean = [1.1, 1.5]. This is going to be our observed dataset.
samples_size = 1000
sigma = np.eye(2) * 0.25
means = [1.1, 1.5]
data = np.random.multivariate_normal(means, sigma, samples_size)
matshow(sigma)
title("covariance matrix sigma")
colorbar()
Then we need to define our model/simulation. In this case this is simple: we again draw random variables from a multivariate Gaussian distribution, using the given mean and the sigma from above.
def create_new_sample(theta):
    return np.random.multivariate_normal(theta, sigma, samples_size)
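As a quick illustration (this snippet is not part of the original example), each call to this function produces a fresh dataset of the same size as the observed one, differing only by sampling noise:
sim = create_new_sample(means)  # hypothetical check: one forward simulation at the true means
print(sim.shape)         # (1000, 2), same shape as the observed data
print(sim.mean(axis=0))  # close to, but not exactly, [1.1, 1.5]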
Next, we need to define a distance measure. We will use the sum of the absolute differences of the means of the simulated and the observed data.
def dist_measure(x, y):
    return np.sum(np.abs(np.mean(x, axis=0) - np.mean(y, axis=0)))
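Written out, the distance between a simulated dataset $x$ and the observed dataset $y$ is
$$\rho(x, y) = \sum_{j=1}^{2} \left|\bar{x}_j - \bar{y}_j\right|,$$
where $\bar{x}_j$ and $\bar{y}_j$ are the sample means of the $j$-th component.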
To verify that everything works and to see the effect of the random sampling in the simulation, we compute the distance for 1000 simulations at the true mean values.
distances = [dist_measure(data, create_new_sample(means)) for _ in range(1000)]
sns.distplot(distances, axlabel="distances")
title("Variability of distance from simulations")
Now we're going to set up the ABC PMC sampling
import abcpmc
As a prior we're going to use a Gaussian, centered on our best guess about the distribution of the means.
prior = abcpmc.GaussianPrior(mu=[1.0, 1.0], sigma=np.eye(2) * 0.5)
As threshold $\epsilon$ we're going to use the $\alpha^{th}$ percentile of the sorted distances of the particles of the current iteration. The simplest way to do this is to define a constant $\epsilon$ and iteratively adapt the threshold. As a starting value we define a sufficiently high $\epsilon$ so that the acceptance ratio is reasonable, and we will sample for T iterations.
alpha = 75
T = 20
eps_start = 1.0
eps = abcpmc.ConstEps(T, eps_start)
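To illustrate the adaptive scheme (this snippet uses stand-in values and is not part of the sampling itself): after each iteration the threshold will be set to the $\alpha^{th}$ percentile of the accepted distances, so it shrinks automatically as the particles improve.
example_dists = np.abs(np.random.normal(0, 0.5, size=100))  # stand-in for pool.dists
print(np.percentile(example_dists, alpha))  # what eps.eps will be set to in the sampling loop below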
Finally, we create an instance of the sampler. We want to use 5000 particles and the functions we defined above. Additionally, we're going to make use of the built-in parallelization and use 7 cores for the sampling.
sampler = abcpmc.Sampler(N=5000, Y=data, postfn=create_new_sample, dist=dist_measure, threads=7)
Optionally, we can customize the proposal creation. Here we're going to use an "Optimal Local Covariance Matrix" kernel (OLCM), as proposed by Filippi et al. (2012). This has been shown to yield a high acceptance ratio together with a faster decrease of the threshold.
sampler.particle_proposal_cls = abcpmc.OLCMParticleProposal
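This step is optional: if the line above is omitted, the sampler simply falls back to its default proposal kernel. Assuming the default in abcpmc is the ParticleProposal class (a Gaussian kernel built from the weighted particle covariance), reverting explicitly would look like this:
# Assumption: abcpmc.ParticleProposal is the default Gaussian proposal kernel.
# sampler.particle_proposal_cls = abcpmc.ParticleProposal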
Now we're ready to sample. All we need to do is iterate over the values yielded by the sampler instance. The sample function returns a namedtuple per iteration that contains all the information we're interested in.
def launch():
    eps = abcpmc.ConstEps(T, eps_start)

    pools = []
    for pool in sampler.sample(prior, eps):
        print("T: {0}, eps: {1:>.4f}, ratio: {2:>.4f}".format(pool.t, eps(pool.eps), pool.ratio))

        for i, (mean, std) in enumerate(zip(*abcpmc.weighted_avg_and_std(pool.thetas, pool.ws, axis=0))):
            print(u"    theta[{0}]: {1:>.4f} \u00B1 {2:>.4f}".format(i, mean, std))

        eps.eps = np.percentile(pool.dists, alpha)  # reduce eps value
        pools.append(pool)
    sampler.close()
    return pools
import time
t0 = time.time()
pools = launch()
print "took", (time.time() - t0)
T: 0, eps: 1.0000, ratio: 0.3780
    theta[0]: 1.0626 ± 0.3877
    theta[1]: 1.3582 ± 0.3641
T: 1, eps: 0.8446, ratio: 0.4938
    theta[0]: 1.0789 ± 0.3098
    theta[1]: 1.3920 ± 0.3021
T: 2, eps: 0.6796, ratio: 0.4560
    theta[0]: 1.0735 ± 0.2489
    theta[1]: 1.4098 ± 0.2668
T: 3, eps: 0.5569, ratio: 0.4649
    theta[0]: 1.0690 ± 0.2053
    theta[1]: 1.4399 ± 0.2193
T: 4, eps: 0.4579, ratio: 0.4680
    theta[0]: 1.0848 ± 0.1721
    theta[1]: 1.4416 ± 0.1831
T: 5, eps: 0.3767, ratio: 0.4684
    theta[0]: 1.0771 ± 0.1425
    theta[1]: 1.4640 ± 0.1460
T: 6, eps: 0.3099, ratio: 0.4490
    theta[0]: 1.0817 ± 0.1218
    theta[1]: 1.4691 ± 0.1241
T: 7, eps: 0.2595, ratio: 0.4433
    theta[0]: 1.0861 ± 0.0992
    theta[1]: 1.4734 ± 0.1009
T: 8, eps: 0.2080, ratio: 0.4411
    theta[0]: 1.0898 ± 0.0799
    theta[1]: 1.4799 ± 0.0825
T: 9, eps: 0.1688, ratio: 0.4418
    theta[0]: 1.0921 ± 0.0718
    theta[1]: 1.4926 ± 0.0678
T: 10, eps: 0.1419, ratio: 0.4369
    theta[0]: 1.0880 ± 0.0562
    theta[1]: 1.4946 ± 0.0581
T: 11, eps: 0.1181, ratio: 0.4272
    theta[0]: 1.0927 ± 0.0503
    theta[1]: 1.4934 ± 0.0483
T: 12, eps: 0.0962, ratio: 0.3870
    theta[0]: 1.0927 ± 0.0408
    theta[1]: 1.4930 ± 0.0394
T: 13, eps: 0.0792, ratio: 0.3830
    theta[0]: 1.0903 ± 0.0338
    theta[1]: 1.4931 ± 0.0330
T: 14, eps: 0.0647, ratio: 0.3789
    theta[0]: 1.0918 ± 0.0276
    theta[1]: 1.4929 ± 0.0288
T: 15, eps: 0.0530, ratio: 0.3490
    theta[0]: 1.0930 ± 0.0249
    theta[1]: 1.4935 ± 0.0248
T: 16, eps: 0.0443, ratio: 0.3006
    theta[0]: 1.0928 ± 0.0224
    theta[1]: 1.4936 ± 0.0222
T: 17, eps: 0.0374, ratio: 0.2688
    theta[0]: 1.0926 ± 0.0188
    theta[1]: 1.4944 ± 0.0209
T: 18, eps: 0.0313, ratio: 0.2617
    theta[0]: 1.0912 ± 0.0174
    theta[1]: 1.4945 ± 0.0188
T: 19, eps: 0.0263, ratio: 0.2039
    theta[0]: 1.0924 ± 0.0170
    theta[1]: 1.4953 ± 0.0167
took 94.172825098
How did the sampled values evolve over the iterations? As the threshold is decreasing we expect the errors to shrink while the means converge to the true means.
for i in range(len(means)):
    moments = np.array([abcpmc.weighted_avg_and_std(pool.thetas[:,i], pool.ws, axis=0) for pool in pools])
    errorbar(range(T), moments[:, 0], moments[:, 1])

hlines(means, 0, T, linestyle="dotted", linewidth=0.7)
_ = xlim([-.5, T])
What does the distribution of the distances look like after we have approximated the posterior? If we're close to the true posterior, we expect a high bin count around the values we found in the earlier distribution plot.
distances = np.array([pool.dists for pool in pools]).flatten()
sns.distplot(distances, axlabel="distance")
How did our $\epsilon$ values behave over the iterations? Using the $\alpha^{th}$ percentile causes the threshold to decrease relatively fast in the beginning and to plateau later on
eps_values = np.array([pool.eps for pool in pools])
plot(eps_values, label=r"$\epsilon$ values")
xlabel("Iteration")
ylabel(r"$\epsilon$")
legend(loc="best")
What about the acceptance ratio? ABC PMC with the OLCM kernel gives us a relatively high acceptance ratio.
acc_ratios = np.array([pool.ratio for pool in pools])
plot(acc_ratios, label="Acceptance ratio")
ylim([0, 1])
xlabel("Iteration")
ylabel("Acceptance ratio")
legend(loc="best")
Finally, what does our posterior look like? For the visualization we're using triangle.py (https://github.com/dfm/triangle.py).
import triangle
samples = np.vstack([pool.thetas for pool in pools])
fig = triangle.corner(samples, truths=means)
Omitting the first couple of iterations and looking only at the final pool:
idx = -1
samples = pools[idx].thetas
fig = triangle.corner(samples, weights=pools[idx].ws, truths=means)
for mean, std in zip(*abcpmc.weighted_avg_and_std(samples, pools[idx].ws, axis=0)):
    print(u"mean: {0:>.4f} \u00B1 {1:>.4f}".format(mean, std))
mean: 1.0924 ± 0.0170
mean: 1.4953 ± 0.0167
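As a final cross-check (not part of the original abcpmc example): for a Gaussian model with known covariance and a broad prior, the posterior of the mean is itself approximately Gaussian, centered on the sample mean of the observed data with standard deviation $\sqrt{\Sigma_{ii}/n}$. The ABC PMC estimates above should therefore be close to the following values.
# Analytic benchmark under the assumptions stated above; purely a cross-check.
print("sample mean of the observed data:", data.mean(axis=0))
print("analytic posterior std:", np.sqrt(np.diag(sigma) / samples_size))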