In [1]:

```
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
```

In [2]:

```
%matplotlib notebook
import paramonte as pm
import numpy as np
import scipy as sp
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None)))
sns.set()
```

Supposed we have observed a dataset comprised of $15027$ events with one attribute variable in this data.csv file. Plotting these points would yield a red-colored histogram like the following figure,

The pale black shaded area represents the missing points from our observational dataset. These are points that we could NOT observe (or detect) because of some instrumental bias and sample incompleteness as represented by the black line.

**Statement of problem**

Now, our goal is to form a hypothesis about this dataset, that is, a hypothesis about the distribution of the events in the above plot. To make a correct assessment, we will have to also carefully consider the effects of the detection threshold (the black line) in our inference.

The first thing we can do is to obtain a better visualization of this dataset. The data looks highly skewed to the right. Therefore, it makes sense to take the logarithm of this dataset in the hope that it can help us better understand the possible underlying generating distribution of this data,

Just by looking at the observed (red) distribution, we can form a relatively good hypothesis about the distribution of the data: If the detection threshold did not exist, the complete dataset (including the points in the black shaded area) would likely very well resemble a lognormal distribution (or a normal distribution on the logarithmic axes).

However, this dataset is affected by the detection threshold and we need to also take a model of the detection threshold into account. The logarithmic transformation makes it crystal-clear to us that the detection threshold is likely best modeled by as single value represented by the vertical line.

Therefore, we form the following hypothesis,

**Hypothesis**: Our data comes from a Lognormal distribution subject to censorship by sharp hard cutoff threshold. The proposition that data is lognormally distributed is mathematically equivalent to the proposition that `log(data)`

is normally distributed.

**Goal**: To **estimate the unknown parameters** of the Normal distribution, the **mean** and the **standard deviation**: ($\mu, \sigma$), as well as the **location of the sharp censorship threshold** in our data set ($\eta$).

**Methodology**: We will use the maximum likelihood approach and the **ParaDRAM** (**Delayed-Rejection Adaptive Metropolis-Hastings Markov Chain Monte Carlo**) sampler of the `paramonte`

library to estimate the 3 unknown parameters of our hypothesis.

First read the data using Pandas library, then log-transform data to make it look like a Normal distribution.

In [3]:

```
url = "https://github.com/cdslaborg/paramontex/raw/main/Python/Jupyter/regression_censored_lognormal_data_paradram/data.csv"
df = pd.read_csv(url, header = None)
df.describe()
```

Out[3]:

In [4]:

```
# log-transform data so that it can be fit by a Normal distribution
logdata = np.double(np.log(df.iloc[:,0]))
# plot histogram of log(data)
sns.set()
fig = plt.figure()
ax = fig.gca()
(n, bins, patches) = ax.hist(logdata, histtype='stepfilled', bins = 30)
plt.xlabel('Log( Data )')
plt.ylabel('Count')
plt.title('Histogram of Data')
ax.set_xlim([-4,4])
plt.grid(True)
plt.show()
```

We will write a class that takes the `logdata`

as input and has two methods, `getLogProb(x,avg,std,cutoff)`

and `getLogLike(param)`

. The former computes the log-probability of observing the input dataset data given a set of parameters for the model (the average `avg`

and the standard deviation `std`

of the Normal distribution, and the sharp cut on the distribution `cutoff`

). The latter method `getLogLike(param)`

must take the set of the parameters of our censored model as a vector. **We want to explore the shape of this 3-dimensional likelihood function** to find its mode and its width along the three axes, each of which represents one parameter of our model.

Here is the mathematical equation of Normal Probability Density Function (PDF),

$$ \large \pi (x ~|~ \mu, \sigma) = \frac{1}{ \sigma~\sqrt{2\pi} } ~ \exp \bigg( -\frac{(x-\mu)^2}{2\sigma^2} \bigg) $$However, our problem is more complicated than the above equation. The above equation gives the probability of obtaining a specific observation $x$ from a *full Normal* distribution, whereas the Normal distribution in our problem is heavily censored on its right tail. We must therefore take this censorship into account by renormalizing the above equation such that the integral of the censored PDF becomes unity again just as is the case for the above equation.

For any given cutoff ($\eta$), the intergal of the above PDF in the range $[-\infty, \eta]$ can be expressed in terms of an **error function** ($\mathrm{erf}$),

Therefore, $$ \large \pi (x ~|~ \mu, \sigma, \eta) = \begin{cases} \frac{1}{ \sigma~\sqrt{2\pi} } ~ \exp \bigg( -\frac{(x-\mu)^2}{2\sigma^2} \bigg) ~ \times ~ \frac{2}{1 + \mathrm{erf} \bigg( \frac{\eta-\mu}{\sigma\sqrt{2}} \bigg)} ~~~,~~~ \forall ~ x~<~\eta \\ 0 ~~~,~~~ \forall ~ x~>=~\eta \end{cases} $$

We will implement the logarithm of the above equation in `getLogProb()`

.

Now, with an equation at hand to compute the probabilities of individual data points for a given set of model parameters, we can go on to write the likelihood function of observing the entire dataset of $N$ observations for a given set of input parameters,

$$ \large \mathcal{L}\big( \{x_1, \ldots, ..., x_N \} ~;~ \mu, \sigma, \eta \big) = \prod^{N}_{i = 1} \pi (x ~|~ \mu, \sigma, \eta) $$We can now implement the above likelihood function to find the most likely set of parameters for our hypothesis.

Since the mathematical objective functions (e.g., probability density functions and likelihood functions) can take extremely small or large values, we often work with their natural logarithms instead.

This is the reason behind the naming convention used in the ParaMonte library forthe user's objective functions:getLogFunc, implying thatthe user must provide a function that returns the natural logarithm of the target objective function.

See this Jupyter Notebook for an in-depth discussion of why we need to work with the logarithm of mathematical objective functions in optimization and sampling problems.

**Tip**

- What is the possible range of the mean (
`avg`

) of Normal distribution? $[-\infty, +\infty]$ - What is the possible range of the
`cutoff`

parameter? $[-\infty, +\infty]$, although practically, its lower limit cannot be less than the maximum value of the our dataset. - What is the possible range of the standard deviation (
`std`

) of the Normal distribution? By definition, $[0, +\infty]$

Now, compare the range of `avg`

and `cutoff`

with `std`

in the above. Did you notice the asymmetric range of `std`

compared to the two others? This may not seem like a big deal, but as you will notice later, it would be much more elegant if the *theoretical* range of `std`

also looked symmetric like the other two parameters. This is surely possible.

To make the range of the `std`

parameter also span from negative infinity to positive infinity, we can take the logarithm of `std`

and instead try to find the best-fit `logStd`

whose range is $[-\infty, +\infty]$.

Here is an implementation of our censored Normal Hypothesis class,

In [9]:

```
class Normal():
def __init__(self, normallyDistributedData):
self.data = normallyDistributedData
self.ndata = len(self.data)
self.maxdata = np.max(self.data)
def getLogProb(self, data, avg, logStd, cutoff):
"""
Return the logarithm of the probability density of an input set of x values,
given the input parameters (avg,logStd,cutoff) for the Gaussian PDF.
"""
std = np.exp(logStd)
return -0.9189385332 - logStd - 0.5 * ( (data-avg) / std )**2 - np.log(sp.stats.norm.cdf(cutoff, avg, std))
def getLogLike(self,param):
if param[2] < self.maxdata:
# data cannot be larger than the threshold,
# therefore such input parameter set for the censored Normal distribution is impossible.
# In such occasion, we return negative infinity as the logLikelihood value.
return -1.e300
else:
return np.sum( self.getLogProb(self.data, param[0], param[1], param[2]) )
```

Now, we will instantiate an object of `Normal()`

class,

In [10]:

```
normal = Normal(normallyDistributedData = logdata)
```

In [11]:

```
# sample the log-likelihood function via the ParaDRAM MCMC sampler
import paramonte as pm
pmpd = pm.ParaDRAM() # create a ParaDRAM sampler object
pmpd.spec.chainSize = 20000 # change the number of sampled points from default 100,000 to a smaller value.
pmpd.spec.overwriteRequested = True # overwrite old existing simulation files with the same name
pmpd.spec.variableNameList = ["Average","LogStandardDeviation","Cutoff"]
pmpd.spec.startPointVec = [ 0 # mean
, 0 # log-standard-deviation
, normal.maxdata # initial value for cutoff must be larger than the maximum of data.
]
pmpd.spec.targetAcceptanceRate = [0.1,0.3] # ensure the MCMC sampling efficiency does not become too large or too small.
# to make the simulation fully reproducible and restartable:
pmpd.spec.randomSeed = 3571
pmpd.spec.outputFileName = "./out/censored"
# call the MCMC sampler
pmpd.runSampler( ndim = 3
, getLogFunc = normal.getLogLike
)
```

**Done**. The simulation output files are now stored in the current working directory of Python. You can also see the generated output files on this GitHub page.

We will now follow the guidelines provided by the sampler to read the resulting MCMC chain, in compact format, from the output `*_chain.txt`

.

In [12]:

```
chain = pmpd.readChain(renabled = True)[0] # enable return. Pick the contents of the first chain file.
```

Then, we will ensure that there is no evidence for a lack-of-convergence. We will do so by visualizing the `AdaptationMeasure`

column in th output chain file, which represents the amount of adaptation of the proposal distribution of the MCMC sampler throughout the simulation.

In [13]:

```
chain.plot.scatter.scatter.kws.cmap = "winter"
chain.plot.scatter(ycolumns="AdaptationMeasure",ccolumns=[])
chain.plot.scatter.currentFig.axes.set_ylim([1.e-5,1])
chain.plot.scatter.currentFig.axes.set_yscale("log")
```

**The above graph looks quite reassuring.**

We will now read the resulting sample from the output `*_sample.txt`

file.

In [14]:

```
pmpd.readSample() # read the final refined sample from the MCMC simulation
sample = pmpd.sampleList[0]
```

In [15]:

```
# plot the sampled variables
for colname in pmpd.sampleList[0].df.columns:
sample.plot.line.ycolumns = colname
sample.plot.line()
sample.plot.line.currentFig.axes.set_xlabel("MCMC Count")
sample.plot.line.currentFig.axes.set_ylabel(colname)
sample.plot.line.savefig( fname = "./out/traceplot_" + colname )
# plot the histograms of the sampled parameters
for colname in pmpd.sampleList[0].df.columns:
sample.plot.histplot(xcolumns = colname)
sample.plot.histplot.currentFig.axes.set_xlabel(colname)
sample.plot.histplot.currentFig.axes.set_ylabel("MCMC Count")
sample.plot.histplot.savefig( fname = "./out/histogram_" + colname )
# report the average parameters
pmpd.sampleList[0].df.mean()
```