In [1]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

In [2]:
%matplotlib notebook
import paramonte as pm
import numpy as np
import scipy as sp
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None)))
sns.set()

paramonte 2.3.0
numpy 1.19.2
scipy 1.5.2
pandas 1.1.3
seaborn 0.11.0
matplotlib 3.3.2


## Modeling the distribution of 1-dimensional data subject to sample incompleteness¶

Supposed we have observed a dataset comprised of $15027$ events with one attribute variable in this data.csv file. Plotting these points would yield a red-colored histogram like the following figure,

The pale black shaded area represents the missing points from our observational dataset. These are points that we could NOT observe (or detect) because of some instrumental bias and sample incompleteness as represented by the black line.

Statement of problem

Now, our goal is to form a hypothesis about this dataset, that is, a hypothesis about the distribution of the events in the above plot. To make a correct assessment, we will have to also carefully consider the effects of the detection threshold (the black line) in our inference.

The first thing we can do is to obtain a better visualization of this dataset. The data looks highly skewed to the right. Therefore, it makes sense to take the logarithm of this dataset in the hope that it can help us better understand the possible underlying generating distribution of this data,

Just by looking at the observed (red) distribution, we can form a relatively good hypothesis about the distribution of the data: If the detection threshold did not exist, the complete dataset (including the points in the black shaded area) would likely very well resemble a lognormal distribution (or a normal distribution on the logarithmic axes).

However, this dataset is affected by the detection threshold and we need to also take a model of the detection threshold into account. The logarithmic transformation makes it crystal-clear to us that the detection threshold is likely best modeled by as single value represented by the vertical line.

Therefore, we form the following hypothesis,

Hypothesis: Our data comes from a Lognormal distribution subject to censorship by sharp hard cutoff threshold. The proposition that data is lognormally distributed is mathematically equivalent to the proposition that log(data) is normally distributed.

Goal: To estimate the unknown parameters of the Normal distribution, the mean and the standard deviation: ($\mu, \sigma$), as well as the location of the sharp censorship threshold in our data set ($\eta$).

Methodology: We will use the maximum likelihood approach and the ParaDRAM (Delayed-Rejection Adaptive Metropolis-Hastings Markov Chain Monte Carlo) sampler of the paramonte library to estimate the 3 unknown parameters of our hypothesis.

## Step 1: Read the data and log-transform it¶

First read the data using Pandas library, then log-transform data to make it look like a Normal distribution.

In [3]:
url = "https://github.com/cdslaborg/paramontex/raw/main/Python/Jupyter/regression_censored_lognormal_data_paradram/data.csv"
df.describe()

Out[3]:
0
count 15027.000000
mean 0.824782
std 0.498326
min 0.019044
25% 0.413535
50% 0.725890
75% 1.177000
max 1.999700
In [4]:
# log-transform data so that it can be fit by a Normal distribution

logdata = np.double(np.log(df.iloc[:,0]))

# plot histogram of log(data)

sns.set()
fig = plt.figure()
ax = fig.gca()
(n, bins, patches) = ax.hist(logdata, histtype='stepfilled', bins = 30)
plt.xlabel('Log( Data )')
plt.ylabel('Count')
plt.title('Histogram of Data')
ax.set_xlim([-4,4])
plt.grid(True)
plt.show()


## Step 2: Compute the probability of observing each data point under the hypothesis¶

We will write a class that takes the logdata as input and has two methods, getLogProb(x,avg,std,cutoff) and getLogLike(param). The former computes the log-probability of observing the input dataset data given a set of parameters for the model (the average avg and the standard deviation std of the Normal distribution, and the sharp cut on the distribution cutoff). The latter method getLogLike(param) must take the set of the parameters of our censored model as a vector. We want to explore the shape of this 3-dimensional likelihood function to find its mode and its width along the three axes, each of which represents one parameter of our model.

Here is the mathematical equation of Normal Probability Density Function (PDF),

$$\large \pi (x ~|~ \mu, \sigma) = \frac{1}{ \sigma~\sqrt{2\pi} } ~ \exp \bigg( -\frac{(x-\mu)^2}{2\sigma^2} \bigg)$$

However, our problem is more complicated than the above equation. The above equation gives the probability of obtaining a specific observation $x$ from a full Normal distribution, whereas the Normal distribution in our problem is heavily censored on its right tail. We must therefore take this censorship into account by renormalizing the above equation such that the integral of the censored PDF becomes unity again just as is the case for the above equation.

For any given cutoff ($\eta$), the intergal of the above PDF in the range $[-\infty, \eta]$ can be expressed in terms of an error function ($\mathrm{erf}$),

$$\large \int^{\eta}_{-\infty} \mathcal{N}(x ~|~ \mu, \sigma) ~ dx = \frac{1}{2} \bigg[ 1 + \mathrm{erf} \bigg( \frac{\eta-\mu}{\sigma\sqrt{2}} \bigg) \bigg] ~,$$

Therefore, $$\large \pi (x ~|~ \mu, \sigma, \eta) = \begin{cases} \frac{1}{ \sigma~\sqrt{2\pi} } ~ \exp \bigg( -\frac{(x-\mu)^2}{2\sigma^2} \bigg) ~ \times ~ \frac{2}{1 + \mathrm{erf} \bigg( \frac{\eta-\mu}{\sigma\sqrt{2}} \bigg)} ~~~,~~~ \forall ~ x~<~\eta \\ 0 ~~~,~~~ \forall ~ x~>=~\eta \end{cases}$$

We will implement the logarithm of the above equation in getLogProb().

## Step 3: Compute the likelihood of data for a given set of parameters¶

Now, with an equation at hand to compute the probabilities of individual data points for a given set of model parameters, we can go on to write the likelihood function of observing the entire dataset of $N$ observations for a given set of input parameters,

$$\large \mathcal{L}\big( \{x_1, \ldots, ..., x_N \} ~;~ \mu, \sigma, \eta \big) = \prod^{N}_{i = 1} \pi (x ~|~ \mu, \sigma, \eta)$$

We can now implement the above likelihood function to find the most likely set of parameters for our hypothesis.

Since the mathematical objective functions (e.g., probability density functions and likelihood functions) can take extremely small or large values, we often work with their natural logarithms instead. This is the reason behind the naming convention used in the ParaMonte library for the user's objective functions: getLogFunc, implying that the user must provide a function that returns the natural logarithm of the target objective function.

See this Jupyter Notebook for an in-depth discussion of why we need to work with the logarithm of mathematical objective functions in optimization and sampling problems.

Tip

• What is the possible range of the mean (avg) of Normal distribution? $[-\infty, +\infty]$
• What is the possible range of the cutoff parameter? $[-\infty, +\infty]$, although practically, its lower limit cannot be less than the maximum value of the our dataset.
• What is the possible range of the standard deviation (std) of the Normal distribution? By definition, $[0, +\infty]$

Now, compare the range of avg and cutoff with std in the above. Did you notice the asymmetric range of std compared to the two others? This may not seem like a big deal, but as you will notice later, it would be much more elegant if the theoretical range of std also looked symmetric like the other two parameters. This is surely possible.

To make the range of the std parameter also span from negative infinity to positive infinity, we can take the logarithm of std and instead try to find the best-fit logStd whose range is $[-\infty, +\infty]$.

Here is an implementation of our censored Normal Hypothesis class,

In [9]:
class Normal():

def __init__(self, normallyDistributedData):
self.data = normallyDistributedData
self.ndata = len(self.data)
self.maxdata = np.max(self.data)

def getLogProb(self, data, avg, logStd, cutoff):
"""
Return the logarithm of the probability density of an input set of x values,
given the input parameters (avg,logStd,cutoff) for the Gaussian PDF.
"""
std = np.exp(logStd)
return -0.9189385332 - logStd - 0.5 * ( (data-avg) / std )**2 - np.log(sp.stats.norm.cdf(cutoff, avg, std))

def getLogLike(self,param):
if param[2] < self.maxdata:
# data cannot be larger than the threshold,
# therefore such input parameter set for the censored Normal distribution is impossible.
# In such occasion, we return negative infinity as the logLikelihood value.
return -1.e300
else:
return np.sum( self.getLogProb(self.data, param[0], param[1], param[2]) )


Now, we will instantiate an object of Normal() class,

In [10]:
normal = Normal(normallyDistributedData = logdata)


## Step 4: Sample the likelihood function to find the best-fit parameters of the hypothesized model¶

In [11]:
# sample the log-likelihood function via the ParaDRAM MCMC sampler

import paramonte as pm
pmpd.spec.chainSize = 20000 # change the number of sampled points from default 100,000 to a smaller value.
pmpd.spec.overwriteRequested = True # overwrite old existing simulation files with the same name
pmpd.spec.variableNameList = ["Average","LogStandardDeviation","Cutoff"]
pmpd.spec.startPointVec = [ 0 # mean
, 0 # log-standard-deviation
, normal.maxdata # initial value for cutoff must be larger than the maximum of data.
]
pmpd.spec.targetAcceptanceRate = [0.1,0.3] # ensure the MCMC sampling efficiency does not become too large or too small.

# to make the simulation fully reproducible and restartable:

pmpd.spec.randomSeed = 3571
pmpd.spec.outputFileName = "./out/censored"

# call the MCMC sampler

pmpd.runSampler( ndim = 3
, getLogFunc = normal.getLogLike
)

ParaDRAM - NOTE: Running the ParaDRAM sampler in serial mode...
ParaDRAM - NOTE: To run the ParaDRAM sampler in parallel mode visit:
ParaDRAM - NOTE: If you are using Jupyter notebook, check the Jupyter's
ParaDRAM - NOTE: terminal window for realtime simulation progress and report.

ParaDRAM - NOTE: where you should replace pmpd with your ParaDRAM sampler's object name.



## Step 5: Ensure no evidence for a lack-of-convergence exists¶

Done. The simulation output files are now stored in the current working directory of Python. You can also see the generated output files on this GitHub page.

We will now follow the guidelines provided by the sampler to read the resulting MCMC chain, in compact format, from the output *_chain.txt.

In [12]:
chain = pmpd.readChain(renabled = True)[0] # enable return. Pick the contents of the first chain file.

ParaDRAM - WARNING: The delimiter is neither given as input to readChain()
ParaDRAM - WARNING: nor set as a simulation specification of the ParaDRAM object.
ParaDRAM - WARNING: This information is essential, otherwise how could the output files be parsed?
ParaDRAM - WARNING: For now, the ParaDRAM sampler will assume a comma-separated
ParaDRAM - WARNING: file format for the contents of the chain file(s) to be parsed.

ParaDRAM - NOTE: 1 files detected matching the pattern: "./out/censored*_chain.txt"

ParaDRAM - NOTE: ndim = 3, count = 20000
ParaDRAM - NOTE: parsing file contents...
ParaDRAM - NOTE: computing the sample correlation matrix...
ParaDRAM - NOTE: creating a heatmap plot object from scratch... done in 0.013963 seconds.
ParaDRAM - NOTE: computing the sample covariance matrix...
ParaDRAM - NOTE: creating a heatmap plot object from scratch... done in 0.016966 seconds.
ParaDRAM - NOTE: computing the sample autocorrelations...
ParaDRAM - NOTE: creating a line plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a scatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a lineScatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a line plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a scatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a lineScatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a line3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a scatter3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a lineScatter3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a jointplot plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a histplot plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a kdeplot1 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a kdeplot2 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a contour3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a contourf plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a contour plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a grid plot object from scratch... done in 0.000986 seconds.

ParaDRAM - NOTE: The processed chain files are now stored in the output variable as a
ParaDRAM - NOTE: Python list. For example, to access the contents of the first (or the only)
ParaDRAM - NOTE: chain file stored in an output variable named chainList, try:
ParaDRAM - NOTE: where you will have to replace pmpd with your ParaDRAM instance name.
ParaDRAM - NOTE: To access the plotting tools, try:
ParaDRAM - NOTE:     chainList[0].plot.<PRESS TAB TO SEE THE LIST OF PLOTS>
ParaDRAM - NOTE:     chainList[0].plot.line()         # to make 2D line plots.
ParaDRAM - NOTE:     chainList[0].plot.scatter()      # to make 2D scatter plots.
ParaDRAM - NOTE:     chainList[0].plot.lineScatter()  # to make 2D line-scatter plots.
ParaDRAM - NOTE:     chainList[0].plot.line3()        # to make 3D line plots.
ParaDRAM - NOTE:     chainList[0].plot.scatter3()     # to make 3D scatter plots.
ParaDRAM - NOTE:     chainList[0].plot.lineScatter3() # to make 3D line-scatter plots.
ParaDRAM - NOTE:     chainList[0].plot.contour()      # to make fast 2D kernel density plots.
ParaDRAM - NOTE:     chainList[0].plot.contourf()     # to make fast 2D kernel density filled contour plots.
ParaDRAM - NOTE:     chainList[0].plot.contour3()     # to make fast 3D kernel density contour plots.
ParaDRAM - NOTE:     chainList[0].plot.histplot()     # to make seaborn 1D distribution plots.
ParaDRAM - NOTE:     chainList[0].plot.kdeplot1()     # to make seaborn 1D kernel density plots.
ParaDRAM - NOTE:     chainList[0].plot.kdeplot2()     # to make seaborn 2D kernel density plots.
ParaDRAM - NOTE:     chainList[0].plot.grid()         # to make GridPlot
ParaDRAM - NOTE: To plot or inspect the variable autocorrelations or the correlation/covariance matrices, try:
ParaDRAM - NOTE:     chainList[0].stats.<PRESS TAB TO SEE THE LIST OF COMPONENTS>



Then, we will ensure that there is no evidence for a lack-of-convergence. We will do so by visualizing the AdaptationMeasure column in th output chain file, which represents the amount of adaptation of the proposal distribution of the MCMC sampler throughout the simulation.

In [13]:
chain.plot.scatter.scatter.kws.cmap = "winter"
chain.plot.scatter.currentFig.axes.set_ylim([1.e-5,1])
chain.plot.scatter.currentFig.axes.set_yscale("log")

ParaDRAM - NOTE: making the scatter plot...
done in 0.174566 seconds.


The above graph looks quite reassuring.

## Step 6: Visualize the sampled parameters¶

We will now read the resulting sample from the output *_sample.txt file.

In [14]:
pmpd.readSample() # read the final refined sample from the MCMC simulation
sample = pmpd.sampleList[0]

ParaDRAM - WARNING: The delimiter is neither given as input to readSample()
ParaDRAM - WARNING: nor set as a simulation specification of the ParaDRAM object.
ParaDRAM - WARNING: This information is essential, otherwise how could the output files be parsed?
ParaDRAM - WARNING: For now, the ParaDRAM sampler will assume a comma-separated
ParaDRAM - WARNING: file format for the contents of the sample file(s) to be parsed.

ParaDRAM - NOTE: 1 files detected matching the pattern: "./out/censored*_sample.txt"

ParaDRAM - NOTE: ndim = 3, count = 1339
ParaDRAM - NOTE: parsing file contents...
ParaDRAM - NOTE: computing the sample correlation matrix...
ParaDRAM - NOTE: creating a heatmap plot object from scratch... done in 0.017954 seconds.
ParaDRAM - NOTE: computing the sample covariance matrix...
ParaDRAM - NOTE: creating a heatmap plot object from scratch... done in 0.021945 seconds.
ParaDRAM - NOTE: computing the sample autocorrelations...
ParaDRAM - NOTE: creating a line plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a scatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a lineScatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a line plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a scatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a lineScatter plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a line3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a scatter3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a lineScatter3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a jointplot plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a histplot plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a kdeplot1 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a kdeplot2 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a contour3 plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a contourf plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a contour plot object from scratch... done in 0.0 seconds.
ParaDRAM - NOTE: creating a grid plot object from scratch... done in 0.001996 seconds.

ParaDRAM - NOTE: The processed sample files are now stored in the newly-created
ParaDRAM - NOTE: component sampleList of the ParaDRAM object as a Python list.
ParaDRAM - NOTE: For example, to access the contents of the first (or the only) sample file, try:
ParaDRAM - NOTE: where you will have to replace pmpd with your ParaDRAM instance name.
ParaDRAM - NOTE: To access the plotting tools, try:
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.<PRESS TAB TO SEE THE LIST OF PLOTS>
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.line()         # to make 2D line plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.scatter()      # to make 2D scatter plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.lineScatter()  # to make 2D line-scatter plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.line3()        # to make 3D line plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.scatter3()     # to make 3D scatter plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.lineScatter3() # to make 3D line-scatter plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.contour()      # to make fast 2D kernel density plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.contourf()     # to make fast 2D kernel density filled contour plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.contour3()     # to make fast 3D kernel density contour plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.histplot()     # to make seaborn 1D distribution plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.kdeplot1()     # to make seaborn 1D kernel density plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.kdeplot2()     # to make seaborn 2D kernel density plots.
ParaDRAM - NOTE:     pmpd.sampleList[0].plot.grid()         # to make GridPlot
ParaDRAM - NOTE: To plot or inspect the variable autocorrelations or the correlation/covariance matrices, try:
ParaDRAM - NOTE:     pmpd.sampleList[0].stats.<PRESS TAB TO SEE THE LIST OF COMPONENTS>


In [15]:
# plot the sampled variables

for colname in pmpd.sampleList[0].df.columns:
sample.plot.line.ycolumns = colname
sample.plot.line()
sample.plot.line.currentFig.axes.set_xlabel("MCMC Count")
sample.plot.line.currentFig.axes.set_ylabel(colname)
sample.plot.line.savefig( fname = "./out/traceplot_" + colname )

# plot the histograms of the sampled parameters

for colname in pmpd.sampleList[0].df.columns:
sample.plot.histplot(xcolumns = colname)
sample.plot.histplot.currentFig.axes.set_xlabel(colname)
sample.plot.histplot.currentFig.axes.set_ylabel("MCMC Count")
sample.plot.histplot.savefig( fname = "./out/histogram_" + colname )

# report the average parameters

pmpd.sampleList[0].df.mean()

ParaDRAM - NOTE: making the line plot...
done in 0.174561 seconds.
ParaDRAM - NOTE: saving the plot to file: "./out/traceplot_SampleLogFunc"
ParaDRAM - NOTE: done in 0.176497 seconds.
ParaDRAM - NOTE: making the line plot...