In [1]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

In [2]:
%matplotlib notebook
import paramonte as pm
import numpy as np
import scipy as sp
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None)))
sns.set()

paramonte 2.4.0
numpy 1.19.4
scipy 1.5.4
pandas 1.1.5
seaborn 0.11.0
matplotlib 3.3.3


Modeling the bivariate distribution of a data set subject to sample incompleteness

Suppose we have observed a dataset comprising $1000$ events with two attributes, variable1 and variable2, stored in this data.csv file. Plotting these points would yield a red-colored bivariate distribution similar to the following figure,

The pale black points represent the missing data from our observational dataset. These are points that we could NOT observe (or detect) because of some instrumental bias and sample incompleteness as represented by the black line.

Statement of problem

Now, our goal is to form a hypothesis about this dataset, that is, a hypothesis about the distribution of the red points in the above plot. To make a correct assessment, we will also have to carefully consider the effects of the detection threshold (the black line) in our inference.

The first thing we can do is to obtain a better visualization of this dataset. A good visualization of data is essential to forming a good hypothesis (model) for our problem. The bivariate data in the above plot looks highly skewed to the right. Therefore, it makes sense to take the logarithm of this dataset in the hope that it helps us better understand the possible underlying generating distribution of this data,

Just by looking at the observed (red) data points, we can form a relatively good hypothesis about the distribution of the data: had the detection threshold not existed, the complete dataset (including the black points) would likely have resembled a bivariate lognormal distribution very well (or, equivalently, a normal distribution on logarithmic axes).

However, this dataset is affected by the detection threshold, so we also need to take a model of the detection threshold into account. The logarithmic transformation makes it crystal-clear that the detection threshold is likely best modeled as a line.

Therefore, we form the following hypothesis,

Hypothesis: Our data comes from a bivariate lognormal distribution subject to censorship by a sharp cutoff threshold in the form of a line. The proposition that the data is jointly lognormally distributed is mathematically equivalent to the proposition that log(data) is normally distributed.

Goal: To estimate the unknown parameters of the bivariate Normal distribution that is hypothesized to fit the bivariate distribution of the log-transformed data well. These parameters include the mean vector of length two and the covariance matrix of the full bivariate Normal distribution ($\mu, \Sigma$), as well as the intercept and the slope ($c_0,c_1$) of the sharp linear detection threshold ($\eta$) on our dataset. A two-dimensional covariance matrix has 3 free parameters because of its symmetry. Therefore, our censored model has 7 unknown parameters overall, which must be constrained.

Methodology: We will use the maximum likelihood approach and the ParaDRAM (Delayed-Rejection Adaptive Metropolis-Hastings Markov Chain Monte Carlo) sampler of the paramonte library to estimate the 7 unknown parameters of our hypothesis.

Step 1: Read the data and log-transform it

First, read the data using the Pandas library, then log-transform it so that it resembles a bivariate Normal distribution.

In [3]:
url = "https://github.com/cdslaborg/paramontex/raw/main/Python/Jupyter/regression_censored_bivariate_lognormal_data_paradram/data.csv"
df = pd.read_csv(url)
X = np.double(df.iloc[:,0])
Y = np.double(df.iloc[:,1])
logX = np.log(X) # log-transform data to be fit by a Normal distribution
logY = np.log(Y)
df.head() # display the first few rows (shown in Out[3] below)

Out[3]:
   variable1  variable2
0    0.87479    1.46020
1    0.12309    0.19028
2    3.32940    6.21360
3    0.62906    1.66630
4    0.38734    0.52181
In [4]:
# plot the bivariate distribution of data

fig = plt.figure()
ax = fig.gca()
plt.scatter(X,Y,s=1,c="red")
plt.xlabel('Variable 1')
plt.ylabel('Variable 2')
plt.title('The Bivariate Distribution of Data')
ax.set_xlim([0,10])
plt.grid(True)
plt.show()

# plot the bivariate distribution of log(data)

fig = plt.figure()
ax = fig.gca()
ax.set_xscale("log")
ax.set_yscale("log")
plt.scatter(X,Y,s=1,c="red")
plt.xlabel("Variable 1")
plt.ylabel("Variable 2")
plt.title('The Bivariate Distribution of Log-transformed Data')
plt.grid(True)
plt.show()


Step 2: Compute the probability of observing each data point under the hypothesis

We will write a class that takes the log(data) as input and, most importantly, provides a method getLogLike(param), implemented below in both a slow and a fast variant. This method must take the set of parameters of our censored model as a vector. We want to explore the shape of this 7-dimensional likelihood function to find its mode and its width along each of the axes, each of which represents one parameter of our model.

Here is the mathematical equation of the MultiVariate Normal (MVN) Probability Density Function (PDF), $\large\pi(\cdot)$,

$$\large \pi ( \vec{x} ~|~ \mu, \Sigma, k = 2) = (2\pi)^{-\frac{k}{2}} ~ |\Sigma|^{-\frac{1}{2}} ~ \exp \bigg( -\frac{1}{2} (x-\mu)^T ~ \Sigma^{-1} ~ (x-\mu) \bigg)$$

where $k=2$ represents the number of dimensions of our data set, and in this particular problem,

$$\large \begin{eqnarray} \mu ~&=&~ \big[ \mu_{x}, \mu_{y} \big] ~, \\ \Sigma ~&=&~ \begin{bmatrix} \sigma_{x}^2 & \rho\sigma_{x}\sigma_{y} \\ \rho\sigma_{x}\sigma_{y} & \sigma_{y}^2 \end{bmatrix} ~, \end{eqnarray}$$

where $\sigma_{x}$ and $\sigma_{y}$ represent the standard deviations of the log-transformations of variable1 and variable2 in our dataset, and $\rho$ represents the amount of correlation between the two variables.
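As a quick sanity check of the above density, the following minimal sketch (with arbitrary, purely illustrative parameter values) evaluates the MVN PDF formula directly and verifies it against SciPy's built-in implementation,

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative (hypothetical) parameter values.
mu = np.array([0.0, 0.0])
sigx, sigy, rho = 1.0, 1.5, 0.7
Sigma = np.array( [ [ sigx**2       , rho*sigx*sigy ]
                  , [ rho*sigx*sigy , sigy**2       ] ] )
xvec = np.array([0.5, -0.2])

# Evaluate the MVN PDF formula above directly.
k = 2
diff = xvec - mu
pdfManual = (2*np.pi)**(-k/2) / np.sqrt(np.linalg.det(Sigma)) \
          * np.exp( -0.5 * diff @ np.linalg.inv(Sigma) @ diff )

# Cross-check against SciPy's implementation of the MVN PDF.
assert np.isclose( pdfManual, multivariate_normal(mean = mu, cov = Sigma).pdf(xvec) )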

However, our problem is more complicated than the above equation suggests. The above equation gives the probability of obtaining a specific observation with a pair of attributes ($x,y$) from a full MVN distribution, whereas the MVN distribution in our problem is heavily censored on its right. We must therefore take this censorship into account by renormalizing the equation such that the integral of the censored MVN PDF becomes unity again, just as it is for the full MVN PDF.

For a given censorship cutoff line ($\eta$) with intercept and slope ($c_0,c_1$), the integral of the censored MVN PDF does not have a closed-form analytical solution (except in trivial cases, such as when the detection line is vertical or horizontal). Therefore, we will have to use numerical integration methods to integrate the censored MVN PDF over the entire area above the censorship line.

We will take the following strategy to integrate our censored 2D MVN PDF model: the integral of a univariate normal distribution is analytically known as a function of the error function. For a given value $y$ of variable2 along the y-axis and an upper limit $\eta$ for the value of variable1 (dictated by the detection threshold), the differential probability of having an observation with value $y$ for variable2 is the probability of observing $y$ marginalized over all possible values of variable1 (subject to the detection threshold),

$$\large \begin{eqnarray} \pi (Y = y ~|~ \mu, \Sigma, c_0, c_1) ~&=&~ \mathcal{N}\big(Y = y ~|~ \mu_y, \sigma_y \big) ~ \int^{\eta}_{-\infty} \mathcal{N}\big(X=x ~|~ \mu_{x|y}, \sigma_{x|y}\big) ~ dx \\ ~&=&~ \mathcal{N} \big( Y = y ~|~ \mu_y, \sigma_y \big) ~ \frac{1}{2} \bigg[ 1 + \mathrm{erf} \bigg( \frac{\eta-\mu_{x|y}}{\sigma_{x|y}\sqrt{2}} \bigg) \bigg] ~, \end{eqnarray}$$

where ($\mu_{x|y},\sigma_{x|y}$) represent the mean and the standard deviation of the conditional distribution of variable1 given a particular value of variable2,

$$\large \begin{eqnarray} \mu_{x|y} &=& \mu_x + \rho\frac{\sigma_x}{\sigma_y}(y - \mu_y) ~, \\ \sigma_{x|y} &=& \sigma_x ~ \sqrt{ 1 - \rho^2 } ~. \end{eqnarray}$$
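To build confidence in this closed form before implementing it, here is a minimal sketch (again with hypothetical parameter values) that checks the analytical expression against a brute-force numerical evaluation of the inner integral,

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Hypothetical parameter values, for illustration only.
avgX, avgY, sigX, sigY, rho = 0.0, 0.0, 1.0, 1.5, 0.7
y, eta = 0.3, 1.2 # a fixed value of variable2 and the corresponding upper x-limit

avgXGivenY = avgX + rho * (sigX / sigY) * (y - avgY)
stdXGivenY = sigX * np.sqrt(1 - rho**2)

# The closed form: N(y | avgY, sigY) * Phi( (eta - avgXGivenY) / stdXGivenY ).
closedForm = norm.pdf(y, loc = avgY, scale = sigY) \
           * norm.cdf(eta, loc = avgXGivenY, scale = stdXGivenY)

# A brute-force numerical evaluation of the same quantity.
numerical = norm.pdf(y, loc = avgY, scale = sigY) \
          * quad(lambda x: norm.pdf(x, loc = avgXGivenY, scale = stdXGivenY), -np.inf, eta)[0]

assert np.isclose(closedForm, numerical)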

Therefore, the full integral of the censored MVN PDF for a given set of parameters ($\mu,\Sigma,c_0,c_1$) can be computed as,

$$\large I = \int^{+\infty}_{-\infty} ~ \pi (y ~|~ \mu, \Sigma, c_0, c_1) ~ dy ~,$$

implying that we will have to multiply all PDF values by,

$$\large A = \frac{1}{I} ~,$$

to properly normalize the censored MVN PDF. The above integral will have to be computed numerically, and we will do so via Romberg's method of integration. In practice, much of the integrand is concentrated close to the mean. Therefore, integrating the integrand within a few standard deviations of the mean of the integrand should be accurate enough for the purpose of this example.
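To see why restricting the integration range to a few standard deviations around the mean suffices, note that integrating a standard normal density over $\pm3\sigma$ already recovers about $99.73\%$ of the total unit probability mass, as the following small sketch with scipy.integrate.romberg confirms,

import numpy as np
from scipy.integrate import romberg
from scipy.stats import norm

# Integrating the standard normal PDF within +/- 3 standard deviations
# of the mean captures ~99.73% of the total unit probability mass.
mass = romberg(function = norm.pdf, a = -3, b = 3, vec_func = True)
print(mass) # ~0.9973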
All in all, the probability of observing each pair of ($x,y$) represented by the vector $\vec{x}$ in our data set, subject to a given detection threshold, can be computed via the following equation,

$$\large \pi ( \vec{x} ~|~ \mu, \Sigma, c_0, c_1, k = 2) = \begin{cases} A~(2\pi)^{-\frac{k}{2}} ~ |\Sigma|^{-\frac{1}{2}} ~ \exp \bigg( -\frac{1}{2} (x-\mu)^T ~ \Sigma^{-1} ~ (x-\mu) \bigg) ~,~ {\normalsize\text{when data is observable}} \\ 0 ~,~ {\normalsize\text{when data is non-observable (i.e., falls below the censorship line)}} \end{cases}$$

Step 3: Compute the likelihood of data for a given set of parameters

Now, with an equation at hand to compute the probabilities of individual data points for a given set of model parameters, we can go on to write the likelihood of observing the entire dataset of $N=1000$ observations for a given set of input parameters,

$$\large \begin{eqnarray} \mathcal{L} ( \text{data} ~;~ \mu, \Sigma, c_0, c_1, k = 2) ~&\equiv&~ \pi ( \text{data} ~|~ \mu, \Sigma, c_0, c_1, k = 2) \\ ~&=&~ \prod_{i=1}^{1000} ~ \pi \bigg( (x_i,y_i) ~\bigg|~ \mu, \Sigma, c_0, c_1, k = 2 \bigg) \end{eqnarray}$$

We can now implement the above likelihood function to find the most likely set of parameters for our hypothesis.

Since the mathematical objective functions (e.g., probability density functions and likelihood functions) can take extremely small or large values, we often work with their natural logarithms instead. This is the reason behind the naming convention used in the ParaMonte library for the user's objective functions: getLogFunc, implying that the user must provide a function that returns the natural logarithm of the target objective function.

See this Jupyter Notebook for an in-depth discussion of why we need to work with the logarithm of mathematical objective functions in optimization and sampling problems.
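As a brief illustration of the underflow problem that motivates this convention, the product of $1000$ small probabilities vanishes in double-precision arithmetic, whereas the sum of their logarithms remains perfectly representable,

import numpy as np

probs = np.full(1000, 1.e-5) # a thousand small probabilities

print(np.prod(probs))        # 0.0 : the product (1.e-5000) underflows in double precision
print(np.sum(np.log(probs))) # ~ -11512.9 : the log-likelihood remains well-behaved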

Tip

Before implementing the likelihood function, let's ask ourselves a few questions,

• What is the possible range of the mean of a Normal distribution? $[-\infty, +\infty]$

• What are the possible ranges of the intercept and slope of the censorship line in our problem? $[-\infty, +\infty]$, although, practically, a range of slope and intercept values for which the observed data falls below the censorship line cannot be valid, because data cannot exist below the detection threshold.

• What is the possible range of the standard deviation ($\sigma$) of a Normal distribution? By definition, $[0, +\infty]$

• What is the possible range of a correlation coefficient ($\rho$) between two attributes of a dataset? By definition, $[-1, +1]$

Now, compare the range of the standard deviation of a Normal distribution with the ranges of the other parameters in the above.

Did you notice the asymmetric range of the standard deviation compared to the others?

This may not seem like a big deal, but wouldn't it be much nicer if the theoretical range of std also looked symmetric like the other parameters? This is surely possible.

Symmetrizing the range of the standard deviation values

To make the range of a standard deviation ($\sigma$) parameter span from negative infinity to positive infinity, we can take the logarithm of $\sigma$ and instead try to find the best-fit $\log(\sigma)$ whose range is $[-\infty, +\infty]$.

Symmetrizing the range of the correlation coefficient values

To make the range of a correlation coefficient ($\rho$) parameter also span from negative infinity to positive infinity, we can use the Fisher transform of $\rho$ to find the best-fit $z\equiv\text{Fisher-transform}(\rho)$ whose range is $[-\infty, +\infty]$.
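The following minimal sketch summarizes both transformations along with their inverses (np.exp and np.tanh), which are exactly the inverse mappings applied to the sampled parameters inside the likelihood implementation below,

import numpy as np

# Symmetrize the standard deviation: (0, +inf) -> (-inf, +inf).
sigma = 1.7
logSigma = np.log(sigma)
assert np.isclose( np.exp(logSigma), sigma ) # inverse: exponentiation

# Symmetrize the correlation coefficient: (-1, +1) -> (-inf, +inf).
rho = 0.7
z = np.arctanh(rho) # the Fisher transformation
assert np.isclose( np.tanh(z), rho ) # inverse: hyperbolic tangent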

Here is an implementation of our censored Normal Hypothesis class,

In [5]:
import sys
from scipy.integrate import romberg
from scipy.stats import norm

class Normal():

    def __init__(self, logX, logY):
        if len(logX) != len(logY):
            raise ValueError("dataset lengths are not equivalent")

        self.data = np.transpose(np.double([logX,logY]))
        self.ndata = len(logX)
        self.NVAR = 2
        self.coef0 = self.NVAR * np.log( 1. / np.sqrt(2.*np.pi) )

        self._rho = None
        self._stdX = None
        self._stdY = None
        self._covMat = None
        self._meanVec = None
        self._stdRatio = None
        self._stdXGivenY = None
        self._cutoffIntercept = None
        self._cutoffSlopeInverse = None
        self._lowerLimY = None # for Romberg integration
        self._upperLimY = None # for Romberg integration

    def integrateXGivenY(self, yval):
        avgXGivenY = self._meanVec[0] + self._stdRatio * self._rho * (yval - self._meanVec[1])
        threshX = (yval - self._cutoffIntercept) * self._cutoffSlopeInverse
        return norm.pdf(x = yval, loc = self._meanVec[1], scale = self._stdY) \
             * norm.cdf(x = threshX, loc = avgXGivenY, scale = self._stdXGivenY)

    def getNormFac(self, vec_func = False):
        return romberg ( function = self.integrateXGivenY
                       , a = self._lowerLimY
                       , b = self._upperLimY
                       , vec_func = vec_func
                       , tol = 1.e-4
                       , divmax = 8
                       )

    def getLogLike_slow(self, param):
        """
        Return the natural logarithm of the likelihood of
        observing the input data given the input parameters.
        This method is a naive, slow implementation of the
        likelihood function which relies on looping over the data.
        As such, it is roughly an order of magnitude slower than the
        equivalent vectorized version of this method, getLogLike_fast(), below.
        It exists only for the purpose of illustrating the importance of
        avoiding loops in Python and vectorizing your calculations as much as possible.
        Order of variables in the input param vector:
            param[0] = avg(x)
            param[1] = avg(y)
            param[2] = logStd(x)
            param[3] = logStd(y)
            param[4] = FisherTrans(rho)
            param[5] = cutoffSlope
            param[6] = cutoffIntercept
        """
        self._rho = np.tanh(param[4]) # inverse Fisher transformation
        self._meanVec = np.double(param[0:2])
        self._covMat = np.zeros([self.NVAR,self.NVAR])
        self._stdX = np.exp(param[2])
        self._stdY = np.exp(param[3])
        self._covMat[0,0] = (self._stdX)**2
        self._covMat[1,1] = (self._stdY)**2
        self._covMat[1,0] = self._rho * self._stdX * self._stdY
        self._covMat[0,1] = self._covMat[1,0]
        self._stdXGivenY = np.sqrt( self._covMat[0,0] * (1-self._rho**2) )
        self._stdRatio = self._stdX / self._stdY
        self._cutoffSlopeInverse = 1 / param[5]
        self._cutoffIntercept = param[6]
        significance = 3 * self._stdY # integrate within +/- 3 standard deviations of the mean
        self._lowerLimY = self._meanVec[1] - significance
        self._upperLimY = self._meanVec[1] + significance

        if np.linalg.det(self._covMat)==0:
            sys.exit("The impossible happened. The covariance matrix is singular.")

        invCovMat = np.linalg.inv(self._covMat) # the inverse of the covariance matrix of the MVN distribution

        # The following is the log of the coefficient used in the definition of the MVN.

        coef = self.coef0 + np.log( np.sqrt(np.linalg.det(invCovMat)) )

        # the logarithm of the objective function: log(MVN)

        logLike = 0.
        for obs in self.data:
            if obs[1] < (param[5] * obs[0] + param[6]):
                # It is impossible to observe any data point that falls below the
                # censorship line. Therefore, if the input parameters result in a
                # censorship line that is above even a single observation, then
                # return negative infinity to indicate an impossible set of model parameters.
                return -1.e100 # -np.inf
            else:
                normedPoint = self._meanVec - obs
                logLike += coef - 0.5 * ( np.dot(normedPoint, np.matmul(invCovMat,normedPoint)) )
        return logLike - np.log(self.getNormFac())*self.ndata

    def getLogLike_fast(self, param):
        """
        Return the natural logarithm of the likelihood of
        observing the input data given the input parameters.
        Order of variables in the input param vector:
            param[0] = avg(x)
            param[1] = avg(y)
            param[2] = logStd(x)
            param[3] = logStd(y)
            param[4] = FisherTrans(rho)
            param[5] = cutoffSlope
            param[6] = cutoffIntercept
        """
        self._rho = np.tanh(param[4]) # inverse Fisher transformation
        self._meanVec = np.double(param[0:2])
        self._covMat = np.zeros([self.NVAR,self.NVAR])
        self._stdX = np.exp(param[2])
        self._stdY = np.exp(param[3])
        self._covMat[0,0] = (self._stdX)**2
        self._covMat[1,1] = (self._stdY)**2
        self._covMat[1,0] = self._rho * self._stdX * self._stdY
        self._covMat[0,1] = self._covMat[1,0]
        self._stdXGivenY = np.sqrt( self._covMat[0,0] * (1-self._rho**2) )
        self._stdRatio = self._stdX / self._stdY
        self._cutoffSlopeInverse = 1 / param[5]
        self._cutoffIntercept = param[6]
        significance = 3 * self._stdY # integrate within +/- 3 standard deviations of the mean
        self._lowerLimY = self._meanVec[1] - significance
        self._upperLimY = self._meanVec[1] + significance

        if np.linalg.det(self._covMat)==0:
            sys.exit("The impossible happened. The covariance matrix is singular.")

        invCovMat = np.linalg.inv(self._covMat) # the inverse of the covariance matrix of the MVN distribution

        # The following is the log of the coefficient used in the definition of the MVN.

        coef = self.coef0 + np.log( np.sqrt(np.linalg.det(invCovMat)) )

        # the logarithm of the objective function: log(MVN)

        if np.any( self.data[:,1] < (param[5] * self.data[:,0] + param[6]) ):
            # It is impossible to observe any data point that falls below the
            # censorship line. Therefore, if the input parameters result in a
            # censorship line that is above even a single observation, then
            # return negative infinity to indicate an impossible set of model parameters.
            return -1.e100 # -np.inf
        else:
            # normalize all observations with respect to the mean vector
            normedPoint = np.subtract(self._meanVec, self.data).reshape(self.ndata,self.NVAR,1)
            # create a stack of products of each normalized data point with the inverse covariance matrix
            invMatTimesNormedPoint = np.matmul(invCovMat,normedPoint)
            # Compute the log(likelihood) by adding all log(probabilities) minus
            # the logarithm of the normalization constant of each computed probability.
            logLike = self.ndata * ( coef - np.log(self.getNormFac(vec_func = True)) ) \
                    - 0.5 * np.sum( np.multiply(normedPoint,invMatTimesNormedPoint) )

        return logLike


Tip: The importance of avoiding loops and vectorizing functions and calculations.

Note the definitions of the two variants of the same class method in the above code: getLogLike_slow() and getLogLike_fast(). The naive implementation of the likelihood function (getLogLike_slow()), which uses scalar function calls and loops over the data, is on average one order of magnitude slower than the vectorized version of the same likelihood function (getLogLike_fast()). The latter, fast version avoids looping over data and vectorizes all computations, including calls to the method integrateXGivenY().

To see the speed difference, we will time both methods below. But first, we will instantiate an object of the Normal() class,

In [6]:
normal = Normal(logX,logY)

In [7]:
# Define the parameter names and our best-guess values for them.
# These will later be used as input to the ParaDRAM sampler below.
bestGuess = { "AverageLogX" : 0
            , "AverageLogY" : 0
            , "LogStandardDeviationX" : 0
            , "LogStandardDeviationY" : 0
            , "FisherTrans(rho)" : 0.7
            , "cutoffSlope" : 1.5
            , "cutoffIntercept" : 0
            }

In [8]:
%%timeit
"""
Time the slow implementation of the likelihood function.
"""
normal.getLogLike_slow(list(bestGuess.values()))

22.5 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]:
%%timeit
"""
Time the fast, vectorized implementation of the likelihood function.
"""
normal.getLogLike_fast(list(bestGuess.values()))

2.99 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


The above benchmark results highlight again the importance of vectorization and avoiding loops for scientific computing with Python.

Step 4: Sample the likelihood function to find the best-fit parameters of the hypothesized model

In [10]:
# sample the log-likelihood function via the ParaDRAM MCMC sampler

import paramonte as pm

pmpd = pm.ParaDRAM() # instantiate a ParaDRAM sampler object

pmpd.spec.randomSeed = 3571 # make the simulation fully reproducible
pmpd.spec.chainSize = 10000 # reduce the number of sampled points from the default 100,000
pmpd.spec.overwriteRequested = True # overwrite old existing simulation files with the same name
pmpd.spec.outputFileName = "./out/censored"
pmpd.spec.variableNameList = list(bestGuess.keys())
pmpd.spec.startPointVec = list(bestGuess.values())
pmpd.spec.targetAcceptanceRate = [0.1,0.3] # ensure the MCMC sampling efficiency does not become too large or too small
pmpd.spec.sampleSize = 2000 # the final refined output sample size (optional)

# call the MCMC sampler

pmpd.runSampler( ndim = 7
, getLogFunc = normal.getLogLike_fast
)

ParaDRAM - NOTE: Running the ParaDRAM sampler in serial mode...
ParaDRAM - NOTE: To run the ParaDRAM sampler in parallel mode visit:
ParaDRAM - NOTE: If you are using Jupyter notebook, check the Jupyter's
ParaDRAM - NOTE: terminal window for realtime simulation progress and report.




Step 5: Ensure no evidence for a lack of convergence exists

Done. The simulation output files are now stored in the current working directory of Python. You can also see the generated output files on this GitHub page.

We will now follow the guidelines provided by the sampler to read the resulting MCMC chain, in compact format, from the output *_chain.txt file.

In [11]:
chain = pmpd.readChain(renabled = True)[0] # enable return. Pick the contents of the first chain file.

ParaDRAM - WARNING: The delimiter is neither given as input to readChain()
ParaDRAM - WARNING: nor set as a simulation specification of the ParaDRAM object.
ParaDRAM - WARNING: This information is essential, otherwise how could the output files be parsed?
ParaDRAM - WARNING: For now, the ParaDRAM sampler will assume a comma-separated
ParaDRAM - WARNING: file format for the contents of the chain file(s) to be parsed.

ParaDRAM - NOTE: 1 files detected matching the pattern: "./out/censored*_chain.txt"

ParaDRAM - NOTE: ndim = 7, count = 10000
ParaDRAM - NOTE: parsing file contents...
ParaDRAM - NOTE: computing the sample correlation matrix...
ParaDRAM - NOTE: creating a heatmap plot object from scratch... done in 0.053402 seconds.
ParaDRAM - NOTE: computing the sample covariance matrix...
ParaDRAM - NOTE: creating a heatmap plot object from scratch... done in 0.069011 seconds.
ParaDRAM - NOTE: computing the sample autocorrelations...
ParaDRAM - NOTE: creating a line plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a scatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a lineScatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a line plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a scatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a lineScatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a line3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a scatter3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a lineScatter3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a jointplot plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a histplot plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a kdeplot1 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a kdeplot2 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a contour3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a contourf plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a contour plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a grid plot object from scratch... done in 0.015628 seconds.

ParaDRAM - NOTE: The processed chain files are now stored in the output variable as a
ParaDRAM - NOTE: Python list. For example, to access the contents of the first (or the only)
ParaDRAM - NOTE: chain file stored in an output variable named chainList, try:
ParaDRAM - NOTE: where you will have to replace pmpd with your ParaDRAM instance name.
ParaDRAM - NOTE: To access the plotting tools, try:
ParaDRAM - NOTE:     chainList[0].plot.<PRESS TAB TO SEE THE LIST OF PLOTS>
ParaDRAM - NOTE:     chainList[0].plot.line()         # to make 2D line plots.
ParaDRAM - NOTE:     chainList[0].plot.scatter()      # to make 2D scatter plots.
ParaDRAM - NOTE:     chainList[0].plot.lineScatter()  # to make 2D line-scatter plots.
ParaDRAM - NOTE:     chainList[0].plot.line3()        # to make 3D line plots.
ParaDRAM - NOTE:     chainList[0].plot.scatter3()     # to make 3D scatter plots.
ParaDRAM - NOTE:     chainList[0].plot.lineScatter3() # to make 3D line-scatter plots.
ParaDRAM - NOTE:     chainList[0].plot.contour()      # to make fast 2D kernel density plots.
ParaDRAM - NOTE:     chainList[0].plot.contourf()     # to make fast 2D kernel density filled contour plots.
ParaDRAM - NOTE:     chainList[0].plot.contour3()     # to make fast 3D kernel density contour plots.
ParaDRAM - NOTE:     chainList[0].plot.histplot()     # to make seaborn 1D distribution plots.
ParaDRAM - NOTE:     chainList[0].plot.kdeplot1()     # to make seaborn 1D kernel density plots.
ParaDRAM - NOTE:     chainList[0].plot.kdeplot2()     # to make seaborn 2D kernel density plots.
ParaDRAM - NOTE:     chainList[0].plot.grid()         # to make GridPlot
ParaDRAM - NOTE: To plot or inspect the variable autocorrelations or the correlation/covariance matrices, try:
ParaDRAM - NOTE:     chainList[0].stats.<PRESS TAB TO SEE THE LIST OF COMPONENTS>



Then, we will ensure that there is no evidence for a lack of convergence. We will do so by visualizing the AdaptationMeasure column in the output chain file, which represents the amount of adaptation of the proposal distribution of the MCMC sampler throughout the simulation.

In [12]:
chain.plot.scatter.scatter.kws.cmap = "winter"
chain.plot.scatter.ycolumns = "AdaptationMeasure"
chain.plot.scatter() # make the scatter plot of the AdaptationMeasure column
chain.plot.scatter.currentFig.axes.set_ylim([1.e-5,1])
chain.plot.scatter.currentFig.axes.set_yscale("log")

ParaDRAM - NOTE: making the scatter plot...
done in 0.315397 seconds.


The above graph looks quite reassuring: it clearly indicates that the amount of adaptation of the proposal distribution decays rapidly toward zero, like a power law. The initial sharp drop in the AdaptationMeasure represents the simulation episode during which the proposal distribution of the MCMC sampler adjusts its shape to the shape of the target likelihood function.

Step 6: Visualize the sampled parameters

We will now read the resulting sample from the output *_sample.txt file.

In [13]:
pmpd.readSample() # read the final refined sample from the MCMC simulation
sample = pmpd.sampleList[0]

ParaDRAM - WARNING: The delimiter is neither given as input to readSample()
ParaDRAM - WARNING: nor set as a simulation specification of the ParaDRAM object.
ParaDRAM - WARNING: This information is essential, otherwise how could the output files be parsed?
ParaDRAM - WARNING: For now, the ParaDRAM sampler will assume a comma-separated
ParaDRAM - WARNING: file format for the contents of the sample file(s) to be parsed.

ParaDRAM - NOTE: 1 files detected matching the pattern: "./out/censored*_sample.txt"

ParaDRAM - NOTE: ndim = 7, count = 2000
ParaDRAM - NOTE: parsing file contents...
ParaDRAM - NOTE: computing the sample correlation matrix...
ParaDRAM - NOTE: creating a heatmap plot object from scratch... done in 0.050904 seconds.
ParaDRAM - NOTE: computing the sample covariance matrix...
ParaDRAM - NOTE: creating a heatmap plot object from scratch... done in 0.040587 seconds.
ParaDRAM - NOTE: computing the sample autocorrelations...
ParaDRAM - NOTE: creating a line plot object from scratch... done in 0.002039 seconds.
ParaDRAM - NOTE: creating a scatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a lineScatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a line plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a scatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a lineScatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a line3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a scatter3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a lineScatter3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a jointplot plot object from scratch... done in 0.008083 seconds.
ParaDRAM - NOTE: creating a histplot plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a kdeplot1 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a kdeplot2 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a contour3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a contourf plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a contour plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a grid plot object from scratch... done in 0.0 seconds.

ParaDRAM - NOTE: The processed sample files are now stored in the newly-created
ParaDRAM - NOTE: component sampleList of the ParaDRAM object as a Python list.
ParaDRAM - NOTE: For example, to access the contents of the first (or the only) sample file, try:
ParaDRAM - NOTE: where you will have to replace pmpd with your ParaDRAM instance name.
ParaDRAM - NOTE: To access the plotting tools, try:
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.<PRESS TAB TO SEE THE LIST OF PLOTS>
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.line()         # to make 2D line plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.scatter()      # to make 2D scatter plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.lineScatter()  # to make 2D line-scatter plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.line3()        # to make 3D line plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.scatter3()     # to make 3D scatter plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.lineScatter3() # to make 3D line-scatter plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.contour()      # to make fast 2D kernel density plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.contourf()     # to make fast 2D kernel density filled contour plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.contour3()     # to make fast 3D kernel density contour plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.histplot()     # to make seaborn 1D distribution plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.kdeplot1()     # to make seaborn 1D kernel density plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.kdeplot2()     # to make seaborn 2D kernel density plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.grid()         # to make GridPlot
ParaDRAM - NOTE: To plot or inspect the variable autocorrelations or the correlation/covariance matrices, try:
ParaDRAM - NOTE:     pmpd.sampleList[0].stats.<PRESS TAB TO SEE THE LIST OF COMPONENTS>


In [14]:
# plot the traces of the sampled variables

for colname in sample.df.columns:
    sample.plot.line.ycolumns = colname
    sample.plot.line()
    sample.plot.line.currentFig.axes.set_xlabel("MCMC Count")
    sample.plot.line.currentFig.axes.set_ylabel(colname)
    sample.plot.line.savefig( fname = "./out/traceplot_" + colname )

# plot the histograms of the sampled parameters

for colname in sample.df.columns:
    sample.plot.histplot(xcolumns = colname)
    sample.plot.histplot.currentFig.axes.set_xlabel(colname)
    sample.plot.histplot.currentFig.axes.set_ylabel("MCMC Count")
    sample.plot.histplot.savefig( fname = "./out/histogram_" + colname )

# report the averages of the sampled parameters

sample.df.mean()

ParaDRAM - NOTE: making the line plot...
done in 0.323673 seconds.
ParaDRAM - NOTE: saving the plot to file: "./out/traceplot_SampleLogFunc"
ParaDRAM - NOTE: done in 0.293022 seconds.
ParaDRAM - NOTE: making the line plot... 
done in 0.276187 seconds.
ParaDRAM - NOTE: saving the plot to file: "./out/traceplot_AverageLogX"
ParaDRAM - NOTE: done in 0.300825 seconds.
ParaDRAM - NOTE: making the line plot...