Running Parallel Chains

Author(s): Paul Miles | Date Created: August 31, 2018

This demonstration was made using version 1.9.0 of pymcmcstat.

Determining whether your chains have converged to the posterior density is a significant challenge when performing MCMC simulations. Many diagnostics are available for assessing chain convergence. One of the most robust is the Gelman-Rubin diagnostic, which requires several sets of chains for comparison. The Gelman-Rubin approach essentially analyzes the variance within each chain and the variance between chains. For more details regarding this approach, see Gelman and Rubin (1992), "Inference from Iterative Simulation Using Multiple Sequences."

While this diagnostic approach does not require computations to be run in parallel, it is certainly more convenient if you can generate a set of chains by running parallel MCMC simulations. In this tutorial we demonstrate how to use the ParallelMCMC class of the pymcmcstat package.
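The within/between-variance computation behind the diagnostic can be sketched with plain NumPy. This is a simplified illustration of the standard formulas, not pymcmcstat's implementation (which we use later); some variants apply an additional degrees-of-freedom correction.

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor for a list of equal-length 1-D chains."""
    m = len(chains)            # number of chains
    n = chains[0].size         # samples per chain
    means = np.array([c.mean() for c in chains])
    W = np.mean([c.var(ddof=1) for c in chains])  # within-chain variance
    B = n * means.var(ddof=1)                     # between-chain variance
    V = (n - 1) / n * W + B / n                   # pooled variance estimate
    return np.sqrt(V / W)

# Independent draws from the same distribution mimic well-mixed chains
rng = np.random.default_rng(0)
draws = [rng.normal(size=1000) for _ in range(3)]
print(psrf(draws))  # close to 1 for well-mixed chains
```

For chains that have converged, W and V agree and the ratio approaches 1; poorly mixed chains inflate B, pushing the factor above 1.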

In [10]:
# import required packages
import numpy as np
import os
from pymcmcstat.MCMC import MCMC
from pymcmcstat.ParallelMCMC import ParallelMCMC
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import pymcmcstat
print(pymcmcstat.__version__)
1.9.0

Define Model and Sum-of-Squares Function

In [3]:
# define test model function
def modelfun(xdata, theta):
    m = theta[0]
    b = theta[1]
    nrow, ncol = xdata.shape
    y = np.zeros([nrow,1])
    y[:, 0] = m*xdata.reshape(nrow,) + b
    return y

def ssfun(theta, data):
    xdata = data.xdata[0]
    ydata = data.ydata[0]
    # eval model
    ymodel = modelfun(xdata, theta)
    # calc sos
    res = ymodel[:, 0] - ydata[:, 0]
    return (res**2).sum(axis=0)

Define Data Set - Plot

In [5]:
nds = 100
m = 2.0
b = 3.0
x = np.linspace(2, 3, num=nds).reshape(nds, 1)
y = m*x + b + 0.1*np.random.standard_normal(x.shape)
res = y - modelfun(x, [m, b])

sns.set_context('talk')
plt.figure(figsize=(8,4))
plt.subplot(1,2,1)
plt.plot(x, y, '.b');
plt.plot(x, modelfun(x, [m, b]), '-r', linewidth=3)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.subplot(1,2,2)
plt.plot(x, res, '.g')
mr = res.mean()
plt.plot([x[0], x[-1]], [mr, mr], '-k', linewidth=3)
plt.xlabel('$x$')
plt.ylabel(str('Residual, ($\\mu$ = {:5.4e})'.format(mr)))
plt.tight_layout(rect=[0, 0.03, 1, 0.95])

Initialize MCMC Object and Setup Simulation

  • In the simulation options, we turn off the waitbar and set the output verbosity to 0. The results will be saved to a json file.
  • We define a directory name based on the date/time to ensure we do not overwrite any results.
  • We add the model parameters; their initial values will actually be determined later.
In [16]:
mcset = MCMC()
# Add data
mcset.data.add_data_set(x, y)
datestr = datetime.now().strftime('%Y%m%d_%H%M%S')
savedir = 'resources' + os.sep + str('{}_{}'.format(datestr, 'parallel_chains'))
mcset.simulation_options.define_simulation_options(
    nsimu=5.0e3, updatesigma=True, method='dram',
    savedir=savedir, savesize=1000, save_to_json=True,
    verbosity=0, waitbar=False, save_lightly=True, save_to_bin=True)
mcset.model_settings.define_model_settings(sos_function=ssfun)
mcset.parameters.add_model_parameter(name='m',
                                     theta0=2.,
                                     minimum=-10,
                                     maximum=200,
                                     sample=1)
mcset.parameters.add_model_parameter(name='b',
                                     theta0=2.75,
                                     minimum=-10,
                                     maximum=100,
                                     sample=1)

Setup Parallel Simulation and Define Initial Values

  • You can specify the number of chains to be generated (num_chain) and the number of cores to use (num_cores).
  • Note, the initial values are defined in a 3x2 array. The expected size is [num_chain x num_par], which is 3x2 in this case. If you don't specify initial values, then a random set will be generated within the parameter space.
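The random fallback can be pictured as drawing uniformly within each parameter's bounds. This is a hypothetical sketch of that behavior, not pymcmcstat's internal routine; the bounds match the parameters defined above.

```python
import numpy as np

# Illustrative sketch: uniform random initial values within parameter bounds
rng = np.random.default_rng(42)
low = np.array([-10.0, -10.0])   # minima for (m, b), as defined above
high = np.array([200.0, 100.0])  # maxima for (m, b)
num_chain, num_par = 3, 2
initial_values = low + (high - low) * rng.random((num_chain, num_par))
print(initial_values.shape)  # (3, 2), i.e., [num_chain x num_par]
```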
In [17]:
# setup parallel MCMC
parMC = ParallelMCMC()
initial_values = np.array([[2.5, 2.5], [1.8, 3.8], [2.05, 3.42]])
parMC.setup_parallel_simulation(mcset=mcset,
                                initial_values=initial_values,
                                num_chain=3,
                                num_cores=4)

Run Parallel Simulation and Display Chain Statistics

In [18]:
parMC.run_parallel_simulation()
parMC.display_individual_chain_statistics()
Processing: resources/20190719_131740_parallel_chains/chain_0
Processing: resources/20190719_131740_parallel_chains/chain_1
Processing: resources/20190719_131740_parallel_chains/chain_2
Parallel simulation run time: 1.7680649757385254 sec

****************************************
Displaying results for chain 0
Files: resources/20190719_131740_parallel_chains/chain_0


------------------------------
name      :       mean        std     MC_err        tau     geweke
m         :     2.0206     0.0372     0.0017    13.7387     0.9941
b         :     2.9497     0.0903     0.0042    14.3948     0.9910
------------------------------
==============================
Acceptance rate information
---------------
Results dictionary:
Stage 1: 17.16%
Stage 2: 52.70%
Net    : 69.86% -> 3493/5000
---------------
Chain provided:
Net    : 69.86% -> 3493/5000
---------------
Note, the net acceptance rate from the results dictionary
may be different if you only provided a subset of the chain,
e.g., removed the first part for burnin-in.
------------------------------

****************************************
Displaying results for chain 1
Files: resources/20190719_131740_parallel_chains/chain_1


------------------------------
name      :       mean        std     MC_err        tau     geweke
m         :     2.0180     0.0424     0.0032    22.3207     0.9806
b         :     2.9565     0.1074     0.0081    21.9919     0.9662
------------------------------
==============================
Acceptance rate information
---------------
Results dictionary:
Stage 1: 22.72%
Stage 2: 54.94%
Net    : 77.66% -> 3883/5000
---------------
Chain provided:
Net    : 77.66% -> 3883/5000
---------------
Note, the net acceptance rate from the results dictionary
may be different if you only provided a subset of the chain,
e.g., removed the first part for burnin-in.
------------------------------

****************************************
Displaying results for chain 2
Files: resources/20190719_131740_parallel_chains/chain_2


------------------------------
name      :       mean        std     MC_err        tau     geweke
m         :     2.0158     0.0512     0.0049    40.5945     0.9675
b         :     2.9616     0.1299     0.0127    42.6359     0.9448
------------------------------
==============================
Acceptance rate information
---------------
Results dictionary:
Stage 1: 15.16%
Stage 2: 53.60%
Net    : 68.76% -> 3438/5000
---------------
Chain provided:
Net    : 68.76% -> 3438/5000
---------------
Note, the net acceptance rate from the results dictionary
may be different if you only provided a subset of the chain,
e.g., removed the first part for burnin-in.
------------------------------

Processing Chains

For this example, assume the MCMC simulations were run and the user returns to process the saved results at a later time. We load the results from the simulation and demonstrate how to plot the parameter distributions and pairwise correlations using the mcmcplot package.

To load the results we must specify the name of the top-level directory where the parallel results are stored. We previously defined the savedir variable, so it still exists in memory. If it did not, we would need to define it as the string matching the directory name.
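If savedir were no longer in memory, it could be rebuilt from the naming convention used earlier. The timestamp below matches this particular run and would differ for yours.

```python
import os

# Reconstruct the save directory from its naming pattern, <datestr>_parallel_chains
datestr = '20190719_131740'  # timestamp of this run; read yours from the directory listing
savedir = os.path.join('resources', '{}_{}'.format(datestr, 'parallel_chains'))
print(savedir)
```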

In [19]:
savedir
Out[19]:
'resources/20190719_131740_parallel_chains'
In [31]:
from pymcmcstat import mcmcplot as mcp
from pymcmcstat.chain import ChainProcessing as CP

pres = CP.load_parallel_simulation_results(savedir)
combined_chains, index = CP.generate_combined_chain_with_index(pres)

settings = dict(
    pairgrid=dict(height=3.75, hue='index', despine=False),
    ld_type=sns.kdeplot,
    ld=dict(n_levels=5, shade=True, shade_lowest=False),
    md=dict(lw=3))
sns.set_context('talk')
fpg = mcp.plot_paired_density_matrix(
    chains=combined_chains,
    settings=settings,
    index=index)
tmp = fpg.add_legend()

Qualitatively, it appears as though the chains have converged to about the same distribution.

Gelman-Rubin Diagnostic

In the previous section we generated the list of chains required for the Gelman-Rubin diagnostic. Note that this implementation assumes all chains have the same dimensions. The output is a dictionary keyed by parameter name; each parameter maps to another dictionary containing

  • R, Potential Scale Reduction Factor (PSRF)
  • B, Between Sequence Variance
  • W, Within Sequence Variance
  • V, Mixture-of-Sequences Variance
  • neff, Effective Number of Samples
In [30]:
from pymcmcstat.chain import ChainStatistics as CS
chains = CP.generate_chain_list(pres)
psrf = CS.gelman_rubin(chains=chains, display=True)
Parameter: $p_{0}$
  R = 0.9999050250834056
  B = 0.0005446504072835234
  W = 0.0013828493830858575
  V = 0.0013825867235498412
  neff = 7500
Parameter: $p_{1}$
  R = 0.9998927399338449
  B = 0.0030520757695050565
  W = 0.008775468759398558
  V = 0.008773586345638533
  neff = 7500

In general, an R close to 1 indicates that your chains have converged.
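That rule of thumb can be automated with a small helper. This is a hypothetical convenience function, not part of pymcmcstat; it assumes the nested dictionary layout described above and uses the commonly cited threshold of 1.1.

```python
def chains_converged(psrf, threshold=1.1):
    """Return True if every parameter's PSRF falls below the threshold.

    psrf: dict mapping parameter name -> dict with key 'R', as returned
    by ChainStatistics.gelman_rubin.
    """
    return all(stats['R'] < threshold for stats in psrf.values())

# Values taken from the output above
example = {'$p_{0}$': {'R': 0.9999050250834056},
           '$p_{1}$': {'R': 0.9998927399338449}}
print(chains_converged(example))  # True
```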