%pylab inline
rc('text', usetex=True)
rc('axes', labelsize=15, titlesize=15)
import seaborn as sns
sns.set_style("white")
np.random.seed(10)
In this example we're looking at a dataset that has been drawn from a 2D Gaussian distribution. We're going to assume that we don't have a proper likelihood, but that we know the covariance matrix $\Sigma$ of the distribution. Using the ABC PMC algorithm we will approximate the posterior of the distribution of the mean values.
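More formally, writing the unknown mean vector as $\theta$, the quantity we want to approximate is
$$p(\theta \mid x_{\mathrm{obs}}) \propto p(\theta)\,p(x_{\mathrm{obs}} \mid \theta), \qquad x_i \sim \mathcal{N}(\theta, \Sigma),$$
where the likelihood $p(x_{\mathrm{obs}} \mid \theta)$ is treated as unavailable and is only accessed through forward simulations of the model.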
First we generate a new dataset by drawing random variables from a multivariate Gaussian around mean = [1.1, 1.5]. This is going to be our observed dataset.
samples_size = 1000
sigma = np.eye(2) * 0.25
means = [1.1, 1.5]
data = np.random.multivariate_normal(means, sigma, samples_size)
matshow(sigma)
title("covariance matrix sigma")
colorbar()
Then we need to define our model/simulation. In this case this is simple: we again draw random variables from a multivariate Gaussian distribution, using the given mean and the sigma from above.
def create_new_sample(theta):
    return np.random.multivariate_normal(theta, sigma, samples_size)
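As a quick illustration (this snippet is not part of the original example), each call to this function produces a fresh dataset of the same size as the observed one, differing only by sampling noise:
sim = create_new_sample(means)  # hypothetical check: one forward simulation at the true means
print(sim.shape)         # (1000, 2), same shape as the observed data
print(sim.mean(axis=0))  # close to, but not exactly, [1.1, 1.5]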
Next, we need to define a distance measure. We will use the sum of the absolute differences of the means of the simulated and the observed data.
def dist_measure(x, y):
    return np.sum(np.abs(np.mean(x, axis=0) - np.mean(y, axis=0)))
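Written out, the distance between a simulated dataset $x$ and the observed dataset $y$ is
$$\rho(x, y) = \sum_{j=1}^{2} \left|\bar{x}_j - \bar{y}_j\right|,$$
where $\bar{x}_j$ and $\bar{y}_j$ are the sample means of the $j$-th component.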
To verify that everything works and to see the effect of the random sampling in the simulation, we compute the distance for 1000 simulations at the true mean values.
distances = [dist_measure(data, create_new_sample(means)) for _ in range(1000)]
sns.distplot(distances, axlabel="distances")
title("Variability of distance from simulations")
Now we're going to set up the ABC PMC sampling
import abcpmc
As a prior we're going to use a Gaussian, centered on our best guess about the distribution of the means.
prior = abcpmc.GaussianPrior(mu=[1.0, 1.0], sigma=np.eye(2) * 0.5)
As threshold $\epsilon$ we're going to use the $\alpha^{th}$ percentile of the sorted distances of the particles of the current iteration. The simplest way to do this is to define a constant $\epsilon$ and iteratively adapt the threshold. As a starting value we define a sufficiently high $\epsilon$ so that the acceptance ratio is reasonable, and we will sample for T iterations.
alpha = 75
T = 20
eps_start = 1.0
eps = abcpmc.ConstEps(T, eps_start)
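To illustrate the adaptive scheme (this snippet uses stand-in values and is not part of the sampling itself): after each iteration the threshold will be set to the $\alpha^{th}$ percentile of the accepted distances, so it shrinks automatically as the particles improve.
example_dists = np.abs(np.random.normal(0, 0.5, size=100))  # stand-in for pool.dists
print(np.percentile(example_dists, alpha))  # what eps.eps will be set to in the sampling loop below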
Finally, we create an instance of the sampler. We want to use 5000 particles and the functions we defined above. Additionally, we're going to make use of the built-in parallelization and use 7 cores for the sampling.
sampler = abcpmc.Sampler(N=5000, Y=data, postfn=create_new_sample, dist=dist_measure, threads=7)
Optionally, we can customize the proposal creation. Here we're going to use an "Optimal Local Covariance Matrix" kernel (OLCM), as proposed by Filippi et al. (2012). This has been shown to yield a high acceptance ratio together with a faster decrease of the threshold.
sampler.particle_proposal_cls = abcpmc.OLCMParticleProposal
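This step is optional: if the line above is omitted, the sampler simply falls back to its default proposal kernel. Assuming the default in abcpmc is the ParticleProposal class (a Gaussian kernel built from the weighted particle covariance), reverting explicitly would look like this:
# Assumption: abcpmc.ParticleProposal is the default Gaussian proposal kernel.
# sampler.particle_proposal_cls = abcpmc.ParticleProposal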
Now we're ready to sample. All we need to do is iterate over the values yielded by the sampler instance. The sample function returns a namedtuple per iteration that contains all the information we're interested in.
def launch():
    eps = abcpmc.ConstEps(T, eps_start)

    pools = []
    for pool in sampler.sample(prior, eps):
        print("T: {0}, eps: {1:>.4f}, ratio: {2:>.4f}".format(pool.t, eps(pool.eps), pool.ratio))

        for i, (mean, std) in enumerate(zip(*abcpmc.weighted_avg_and_std(pool.thetas, pool.ws, axis=0))):
            print(u"    theta[{0}]: {1:>.4f} \u00B1 {2:>.4f}".format(i, mean, std))

        eps.eps = np.percentile(pool.dists, alpha)  # reduce eps value
        pools.append(pool)
    sampler.close()
    return pools
import time
t0 = time.time()
pools = launch()
print "took", (time.time() - t0)
T: 0, eps: 1.0000, ratio: 0.3780
    theta[0]: 1.0626 ± 0.3877
    theta[1]: 1.3582 ± 0.3641
T: 1, eps: 0.8446, ratio: 0.4938
    theta[0]: 1.0789 ± 0.3098
    theta[1]: 1.3920 ± 0.3021
T: 2, eps: 0.6796, ratio: 0.4560
    theta[0]: 1.0735 ± 0.2489
    theta[1]: 1.4098 ± 0.2668
T: 3, eps: 0.5569, ratio: 0.4649
    theta[0]: 1.0690 ± 0.2053
    theta[1]: 1.4399 ± 0.2193
T: 4, eps: 0.4579, ratio: 0.4680
    theta[0]: 1.0848 ± 0.1721
    theta[1]: 1.4416 ± 0.1831
T: 5, eps: 0.3767, ratio: 0.4684
    theta[0]: 1.0771 ± 0.1425
    theta[1]: 1.4640 ± 0.1460
T: 6, eps: 0.3099, ratio: 0.4490
    theta[0]: 1.0817 ± 0.1218
    theta[1]: 1.4691 ± 0.1241
T: 7, eps: 0.2595, ratio: 0.4433
    theta[0]: 1.0861 ± 0.0992
    theta[1]: 1.4734 ± 0.1009
T: 8, eps: 0.2080, ratio: 0.4411
    theta[0]: 1.0898 ± 0.0799
    theta[1]: 1.4799 ± 0.0825
T: 9, eps: 0.1688, ratio: 0.4418
    theta[0]: 1.0921 ± 0.0718
    theta[1]: 1.4926 ± 0.0678
T: 10, eps: 0.1419, ratio: 0.4369
    theta[0]: 1.0880 ± 0.0562
    theta[1]: 1.4946 ± 0.0581
T: 11, eps: 0.1181, ratio: 0.4272
    theta[0]: 1.0927 ± 0.0503
    theta[1]: 1.4934 ± 0.0483
T: 12, eps: 0.0962, ratio: 0.3870
    theta[0]: 1.0927 ± 0.0408
    theta[1]: 1.4930 ± 0.0394
T: 13, eps: 0.0792, ratio: 0.3830
    theta[0]: 1.0903 ± 0.0338
    theta[1]: 1.4931 ± 0.0330
T: 14, eps: 0.0647, ratio: 0.3789
    theta[0]: 1.0918 ± 0.0276
    theta[1]: 1.4929 ± 0.0288
T: 15, eps: 0.0530, ratio: 0.3490
    theta[0]: 1.0930 ± 0.0249
    theta[1]: 1.4935 ± 0.0248
T: 16, eps: 0.0443, ratio: 0.3006
    theta[0]: 1.0928 ± 0.0224
    theta[1]: 1.4936 ± 0.0222
T: 17, eps: 0.0374, ratio: 0.2688
    theta[0]: 1.0926 ± 0.0188
    theta[1]: 1.4944 ± 0.0209
T: 18, eps: 0.0313, ratio: 0.2617
    theta[0]: 1.0912 ± 0.0174
    theta[1]: 1.4945 ± 0.0188
T: 19, eps: 0.0263, ratio: 0.2039
    theta[0]: 1.0924 ± 0.0170
    theta[1]: 1.4953 ± 0.0167
took 94.172825098
How did the sampled values evolve over the iterations? As the threshold is decreasing we expect the errors to shrink while the means converge to the true means.
for i in range(len(means)):
    moments = np.array([abcpmc.weighted_avg_and_std(pool.thetas[:,i], pool.ws, axis=0) for pool in pools])
    errorbar(range(T), moments[:, 0], moments[:, 1])

hlines(means, 0, T, linestyle="dotted", linewidth=0.7)
_ = xlim([-.5, T])
What does the distribution of the distances look like after we have approximated the posterior? If we're close to the true posterior, we expect a high bin count around the values we found in the earlier distribution plot.
distances = np.array([pool.dists for pool in pools]).flatten()
sns.distplot(distances, axlabel="distance")
How did our $\epsilon$ values behave over the iterations? Using the $\alpha^{th}$ percentile causes the threshold to decrease relatively fast in the beginning and to plateau later on
eps_values = np.array([pool.eps for pool in pools])
plot(eps_values, label=r"$\epsilon$ values")
xlabel("Iteration")
ylabel(r"$\epsilon$")
legend(loc="best")
What about the acceptance ratio? ABC PMC with the OLCM kernel gives us a relatively high acceptance ratio.
acc_ratios = np.array([pool.ratio for pool in pools])
plot(acc_ratios, label="Acceptance ratio")
ylim([0, 1])
xlabel("Iteration")
ylabel("Acceptance ratio")
legend(loc="best")
Finally, what does our posterior look like? For the visualization we're using triangle.py (https://github.com/dfm/triangle.py).
import triangle
samples = np.vstack([pool.thetas for pool in pools])
fig = triangle.corner(samples, truths=means)
Omitting the first couple of iterations and looking only at the final pool:
idx = -1
samples = pools[idx].thetas
fig = triangle.corner(samples, weights=pools[idx].ws, truths=means)
for mean, std in zip(*abcpmc.weighted_avg_and_std(samples, pools[idx].ws, axis=0)):
    print(u"mean: {0:>.4f} \u00B1 {1:>.4f}".format(mean, std))
mean: 1.0924 ± 0.0170
mean: 1.4953 ± 0.0167
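As a final cross-check (not part of the original abcpmc example): for a Gaussian model with known covariance and a broad prior, the posterior of the mean is itself approximately Gaussian, centered on the sample mean of the observed data with standard deviation $\sqrt{\Sigma_{ii}/n}$. The ABC PMC estimates above should therefore be close to the following values.
# Analytic benchmark under the assumptions stated above; purely a cross-check.
print("sample mean of the observed data:", data.mean(axis=0))
print("analytic posterior std:", np.sqrt(np.diag(sigma) / samples_size))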