Author(s): Paul Miles | Date Created: August 31, 2018
This demonstration was made using version 1.6.0 of pymcmcstat.
Determining whether or not your chains have converged to the posterior density is a significant challenge when performing MCMC simulations. There are many diagnostics available for assessing chain convergence. The most robust approach is to use the Gelman-Rubin diagnostic, which requires several sets of chains for comparison. The Gelman-Rubin approach essentially performs an analysis of the variances within each chain set and between each chain set. For more details regarding this approach see
While this diagnostic approach does not require computations to be run in parallel, it is certainly more convenient if you can generate a set of chains by running parallel MCMC simulations. In this tutorial we demonstrate how to use the ParallelMCMC
class of the pymcmcstat package.
# import required packages
import numpy as np
import os
from pymcmcstat.MCMC import MCMC
from pymcmcstat.ParallelMCMC import ParallelMCMC
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import pymcmcstat
print(pymcmcstat.__version__)
1.9.0
# define test model function
def modelfun(xdata, theta):
m = theta[0]
b = theta[1]
nrow, ncol = xdata.shape
y = np.zeros([nrow,1])
y[:, 0] = m*xdata.reshape(nrow,) + b
return y
def ssfun(theta, data):
xdata = data.xdata[0]
ydata = data.ydata[0]
# eval model
ymodel = modelfun(xdata, theta)
# calc sos
res = ymodel[:, 0] - ydata[:, 0]
return (res**2).sum(axis = 0)
nds = 100
m = 2.0
b = 3.0
x = np.linspace(2, 3, num=nds).reshape(nds, 1)
y = m*x + b + 0.1*np.random.standard_normal(x.shape)
res = y - modelfun(x, [m, b])
sns.set_context('talk')
plt.figure(figsize=(8,4))
plt.subplot(1,2,1)
plt.plot(x, y, '.b');
plt.plot(x, modelfun(x, [m, b]), '-r', linewidth=3)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.subplot(1,2,2)
plt.plot(x, res, '.g')
mr = res.mean()
plt.plot([x[0], x[-1]], [mr, mr], '-k', linewidth=3)
plt.xlabel('$x$')
plt.ylabel(str('Residual, ($\\mu$ = {:5.4e})'.format(mr)))
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
mcset = MCMC()
# Add data
mcset.data.add_data_set(x, y)
datestr = datetime.now().strftime('%Y%m%d_%H%M%S')
savedir = 'resources' + os.sep + str('{}_{}'.format(datestr, 'parallel_chains'))
mcset.simulation_options.define_simulation_options(
nsimu=5.0e3, updatesigma=True, method='dram',
savedir=savedir, savesize=1000, save_to_json=True,
verbosity=0, waitbar=False, save_lightly=True, save_to_bin=True)
mcset.model_settings.define_model_settings(sos_function=ssfun)
mcset.parameters.add_model_parameter(name='m',
theta0=2.,
minimum=-10,
maximum=200,
sample=1)
mcset.parameters.add_model_parameter(name='b',
theta0=2.75,
minimum=-10,
maximum=100,
sample=1)
num_chain
) and the number of cores to use (num_cores
).# setup parallel MCMC
parMC = ParallelMCMC()
initial_values = np.array([[2.5, 2.5], [1.8, 3.8], [2.05, 3.42]])
parMC.setup_parallel_simulation(mcset=mcset,
initial_values=initial_values,
num_chain=3,
num_cores=4)
parMC.run_parallel_simulation()
parMC.display_individual_chain_statistics()
Processing: resources/20190719_131740_parallel_chains/chain_0 Processing: resources/20190719_131740_parallel_chains/chain_1 Processing: resources/20190719_131740_parallel_chains/chain_2 Parallel simulation run time: 1.7680649757385254 sec **************************************** Displaying results for chain 0 Files: resources/20190719_131740_parallel_chains/chain_0 ------------------------------ name : mean std MC_err tau geweke m : 2.0206 0.0372 0.0017 13.7387 0.9941 b : 2.9497 0.0903 0.0042 14.3948 0.9910 ------------------------------ ============================== Acceptance rate information --------------- Results dictionary: Stage 1: 17.16% Stage 2: 52.70% Net : 69.86% -> 3493/5000 --------------- Chain provided: Net : 69.86% -> 3493/5000 --------------- Note, the net acceptance rate from the results dictionary may be different if you only provided a subset of the chain, e.g., removed the first part for burnin-in. ------------------------------ **************************************** Displaying results for chain 1 Files: resources/20190719_131740_parallel_chains/chain_1 ------------------------------ name : mean std MC_err tau geweke m : 2.0180 0.0424 0.0032 22.3207 0.9806 b : 2.9565 0.1074 0.0081 21.9919 0.9662 ------------------------------ ============================== Acceptance rate information --------------- Results dictionary: Stage 1: 22.72% Stage 2: 54.94% Net : 77.66% -> 3883/5000 --------------- Chain provided: Net : 77.66% -> 3883/5000 --------------- Note, the net acceptance rate from the results dictionary may be different if you only provided a subset of the chain, e.g., removed the first part for burnin-in. ------------------------------ **************************************** Displaying results for chain 2 Files: resources/20190719_131740_parallel_chains/chain_2 ------------------------------ name : mean std MC_err tau geweke m : 2.0158 0.0512 0.0049 40.5945 0.9675 b : 2.9616 0.1299 0.0127 42.6359 0.9448 ------------------------------ ============================== Acceptance rate information --------------- Results dictionary: Stage 1: 15.16% Stage 2: 53.60% Net : 68.76% -> 3438/5000 --------------- Chain provided: Net : 68.76% -> 3438/5000 --------------- Note, the net acceptance rate from the results dictionary may be different if you only provided a subset of the chain, e.g., removed the first part for burnin-in. ------------------------------
For this example I assume that the MCMC simulations were run, then the user went to process the saved results at a later time. We load the results from the simulation and demonstrate how to plot the parameter distributions and pairwise correlation using the mcmcplot package.
To load the results we must specify the name of the top level directory where the parallel results are stored. We previously defined the savedir
variable, so it still exists in memory. If it wasn't, then we would need to define it as the string matching the directory name.
savedir
'resources/20190719_131740_parallel_chains'
from pymcmcstat import mcmcplot as mcp
from pymcmcstat.chain import ChainProcessing as CP
pres = CP.load_parallel_simulation_results(savedir)
combined_chains, index = CP.generate_combined_chain_with_index(pres)
settings = dict(
pairgrid=dict(height=3.75, hue='index', despine=False),
ld_type=sns.kdeplot,
ld=dict(n_levels=5, shade=True, shade_lowest=False),
md=dict(lw=3))
sns.set_context('talk')
fpg = mcp.plot_paired_density_matrix(
chains=combined_chains,
settings=settings,
index=index)
tmp = fpg.add_legend()
Qualitatively, it appears as though the chains have converged to about the same distribution.
In the previous section we generated the list of chains which are required for the Gelman-Rubin diagnostic. Note, this implementation assumes that all chains have the same dimension. The output will be a dictionary where the keys are the parameter names. Each parameter references another dictionary which contains
R
, Potential Scale Reduction Factor (PSRF)B
, Between Sequence VarianceW
, Within Sequence VarianceV
, Mixture-of-Sequences Varianceneff
, Effective Number of Samplesfrom pymcmcstat.chain import ChainStatistics as CS
chains = CP.generate_chain_list(pres)
psrf = CS.gelman_rubin(chains=chains, display=True)
Parameter: $p_{0}$ R = 0.9999050250834056 B = 0.0005446504072835234 W = 0.0013828493830858575 V = 0.0013825867235498412 neff = 7500 Parameter: $p_{1}$ R = 0.9998927399338449 B = 0.0030520757695050565 W = 0.008775468759398558 V = 0.008773586345638533 neff = 7500
In general, R
closer to 1 indicates your chains have converged.