#!/usr/bin/env python
# coding: utf-8
# # Monte Carlo Methods
# The basic idea of Monte Carlo Methods is to use random numbers and statistics to solve very complicated problems that are hard to solve in any other way. Typically, this is achieved through simulations, such as Markov chains. However, the most straightforward application is in computing high-dimensional integrals by throwing points into a volume and estimating the result. There are numerous other applications, such as finding global minimums through simulated annealing, drawing inspiration from statistical physics, and so on.
#
# Here we will cover a very small fraction of `Classical Monte Carlo` algorithms. There is a huge set of quantum Monte Carlo algorithms that we will not discuss here.
#
# First we need to understand what are random numbers on a comuter.
# ## Random Numbers
# It is very hard to implement a good random number generator because a sequence of trully random numbers can not be generated by deterministic computers. Only pseudo-random number generators can be coded. There are several excellent pseudo random number generators available in various libraries, which give very satisfactory results in combination
# with Monte Carlo methods or multidimensional integrations. For every Monte Carlo application, it is crucial to use high quality random number generator. In practice, it is best to select a few random number generators with a good
# reputation, and make sure that results do not depend on the choice of a random number
# generator.
#
# Numpy default generators are now from `PCG` library, which has very good reputation, and is very new. Older random number generations with good reputation are `Mersenne Twister` generators, which usually go by the name `rng mt19937`. For example gnu scientific library is implementing them as `gsl_rng_mt19937` and intel `mkl` library is calling them `VSL_BRNG_MT19937` and `VSL_BRNG_SFMT19937`.
#
#
# **How do random number generators work?**
#
# The simplest and fastest random number generators, which are unfortunately not of
# high quality, are linear congruential generators:
# $$I_{j+1}=(a I_j + c)\; \textrm{mod}\; m$$
# for example, $a=16843009$, $c=3014898611$, and $m=2^{32}$ in `cc65` generator.
#
# Modern generators, like `PCG` are of course much better, but still one needs to understand that these are pseudo-random numbers. Before you spent million of CPU hours, you should test that results don't depend on random number generator within stastistical error.
# ## Multidimensional integration
# Multidimensional numeric integration in more than 4 dimensions is more
# appropriate for Monte-Carlo than one-dimensional quadratures. If the
# function is smooth enough, or we know how to transform integral to
# make it smooth, the integration can be performed with MC. The reason
# for MC success is that it's error, according to central limit theorem,
# is always proportional to $1/\sqrt{N}$ independent of dimension.
#
# The error of one dimensional quadratures can be estimated: If the number of
# points used in each dimension is $N_1$, the number of all points used
# in $d$ dimensions is $N=(N_1)^d$. The error for trapezoid rule was
# estimated to $1/(N_1)^2$ therefore the error in $d$ dimensions is
# $1/N^{2/d}$.
# It is therefore clear than for $d=4$ the Monte-Carlo error and the
# trapezoid-rule error are equal.
#
#
# **Straighforward MC**
#
# For more or less flat functions, the integration is straightforward
# \begin{eqnarray}
# \int f dV \approx V \langle f\rangle \pm V \sqrt{\frac{\langle
# f^2\rangle-\langle f\rangle^2}{N}}
# \end{eqnarray}
# here
# \begin{eqnarray}
# \langle f\rangle\equiv \frac{1}{N}\sum_{i=0}^{N-1} f(x_i)\\
# \langle f^2\rangle\equiv \frac{1}{N}\sum_{i=0}^{N-1} f^2(x_i)
# \end{eqnarray}
#
# If the function $f$ is rapidly varying, the variance is going to be
# large and precision of the integral vary bad. The scaling $1/\sqrt{N}$ is
# very bad! If one has a lot of computer time and patience, one might
# still try to use this method, because it is so straighforward to implement.
#
# If the region $V$ of the integral is complicated and is hard to
# generate distribution of points in volume $V$, one can just find a larger and simpler volume
# $W$ which contains volume $V$. Then one samples over $W$ and defines
# the function $f$ to be zero outside $V$. Of course the error will increase because
# number of "good" points is smaller.
# **Importance sampling**
#
# Usefulness of Monte Carlo becomes more appealing when importance
# sampling strategy is implemented. Of course one needs to know something
# about the function to implement the strategy, but one is rewarded with
# much higher accuracy.
#
# It is simplest to ilustrate the idea in 1D. If one knows that function
# $f(x)$ to be integrated is mostly proportional to another function
# $w(x)$ in the region where the integral contains most of the weight,
# we might want to rewrite the integral
# \begin{eqnarray}
# \int f(x)dx = \int \frac{f(x)}{w(x)} w(x) dx
# \end{eqnarray}
# If weight function $w(x)$ is a simple analytic function, which can be
# integrated analytically, and obeys the following constrains
# - $w(x)>0$ for every $x$
# - $\int w(x)dx = 1$
# we can define
# \begin{eqnarray}
# W(x) = \int^x w(t)dt \qquad \rightarrow\qquad dW(x) = w(x)dx
# \end{eqnarray}
# and rewrite
# \begin{eqnarray}
# \int f(x)dx = \int \frac{f(x)}{w(x)} dW(x) = \int
# \frac{f(x(W))}{w(x(W))} dW\rightarrow
# \left\langle \frac{f(x(W))}{w(x(W))}\right\rangle_{W\; uniform\;\in[0,1]}
# \nonumber
# \end{eqnarray}
#
# If the function $f/w$ on mesh $W$ is reasonably flat, it can be
# efficiently integrated by MC. The error is now proportional to
# $$\sqrt{\frac{\langle(f/w)^2\rangle-\langle f/w\rangle^2}{N}}$$ and is
# therefore greatly reduced.
#
# To implement the algorith, we generate uniform random numbers $r$
# in the interval $r\in[0,1]$ which correspond to variable $W$. We can
# solve the equaton for $x=W^{-1}(r)$ to get $x$ and use it to evaluate
# $f(x)/w(x)$. The random numbers are therefore uniformly distributed on
# mesh $W$ while they are non-uniformly distributed on mesh $x$.
#
# This can also be writtens as
# \begin{eqnarray}
# \left\langle \frac{f(x)}{w(x)}\right\rangle_{\frac{P(x)}{dx}=w(x)}
# \nonumber
# \end{eqnarray}
# because the distribution of points $x$ is
# $\frac{dP}{dx}=\frac{dP}{dW}\frac{dW}{dx}$ and since distribution
# $\frac{dP}{dW}$ is uniform, and $\frac{dW}{dx}=w(x)$, we have $\frac{dP}{dx}=w(x)$.
#
# The archaic example is the **exponential** weight function
# \begin{eqnarray}
# w(x) = \frac{1}{\lambda}e^{-x/\lambda}\qquad for\; x>0
# \end{eqnarray}
# This is equivalent to our exponentially distributed mesh points. Most
# of them are going to be close to $0$ and only few at large $x$.
#
# The integral is $W(x)=1-e^{-x/\lambda}$ which gives for the inverse
# $x=-\lambda\ln(1-W)$. The integral
# \begin{equation}
# \int_0^\infty f(x)dx=\int_0^1
# \frac{f(-\lambda\ln(1-W))}{w(-\lambda\ln(1-W))} dW =
# \int_0^1 f(-\lambda\ln(1-W))\frac{\lambda dW}{1-W}
# \end{equation}
# is easily evaluated with MC if $f(x)$ is exponentially falling function.
# \begin{eqnarray}
# \int_0^\infty f(x)dx\rightarrow
# \langle \lambda \frac{f(-\lambda\ln(1-W))}{1-W}\rangle_{W\;uniform\in[0,1]}
# \end{eqnarray}
#
# Since $\frac{dP}{dW}=1$ ($W$ is uniformly distributed), the probability
# for $x$ is $\frac{dP}{dx}=w(x)$, therefore $x$ is exponentially distributed random
# number.
#
# We could also write
# \begin{eqnarray}
# \int_0^\infty f(x)dx\rightarrow
# \left\langle\frac{f(x)}{\frac{e^{-x/\lambda}}{\lambda}}\right\rangle_{\frac{dP}{dx}=e^{-x/\lambda}/\lambda}
# \end{eqnarray}
#
#
# Another very usefull weight function is **Gaussian distribution**
# \begin{eqnarray}
# w(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{(x-x_0)^2}{2\sigma^2}}
# \end{eqnarray}
# How do we get random number $x$ to be distributed according to the
# above distribution? The integral gives $\mathrm{erf}$ function and its
# inverse is not simple to evaluate.
#
# The trick is to use two random numbers to get
# **Gaussian distrbution**. Consider the following algorithm
# \begin{eqnarray}
# x_1 = x_0 + \sqrt{-2\sigma^2\ln r_1}\cos(2\pi r_2)\\
# x_2 = x_0 + \sqrt{-2\sigma^2\ln r_1}\sin(2\pi r_2)
# \end{eqnarray}
# The distribution of $r_1$ and $r_2$ is uniform in the interval
# $[0,1]$. The distribution of $x_1$ and $x_2$ is
# \begin{eqnarray}
# \frac{d^2 P}{dx_1 dx_2} = \frac{d^2 P}{dr_1 dr_2}
# \left|\frac{\partial(r_1,r_2)}{\partial(x_1,x_2)}\right|=
# \frac{1}{\sqrt{2\pi}}e^{-(x_1-x_0)^2/2} \frac{1}{\sqrt{2\pi}}e^{-(x_2-x_0)^2/2}
# \end{eqnarray}
# therefore Gaussian. In this way, we can obtaine $x$ to be distributed
# gaussian, and we can evaluate
# \begin{eqnarray}
# \int f(x)dx = \left\langle \frac{f(x)}{\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-x_0)^2/(2\sigma^2)}}
# \right\rangle_{\frac{dP}{dx}=\frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{(x-x_0)^2}{2\sigma^2}}}
# \end{eqnarray}
#
#
# **What is the best choice for weight function $w$?**
#
# If function $f$ is positive, clearly best $w$ is just proportional to
# $f$. What if $f$ is not positive everywhere? It turns out that the
# best choice is absolute value of $f$, i.e.,
# \begin{eqnarray}
# w=\frac{|f|}{\int|f|dV}
# \end{eqnarray}
# The proof is simple. The MC importance sampling evaluates
# \begin{eqnarray}
# \int f dV = \int \frac{f}{w}w dV\approx
# \left\langle\frac{f}{w}\right\rangle\pm
# \sqrt{\frac{\langle(\frac{f}{w})^2\rangle-\langle\frac{f}{w}\rangle^2}{N}}
# \end{eqnarray}
# and the error is minimal when
# \begin{eqnarray}
# \delta\left(\langle(\frac{f}{w})^2\rangle-\langle\frac{f}{w}\rangle^2
# +\lambda(\int wdV -1)\right)=0\\
# \delta\left(
# \int \frac{f^2}{w^2} w dV -
# \left(
# \int \frac{f}{w}w dV
# \right)^2+\lambda(\int wdV -1)
# \right)=0
# \end{eqnarray}
# \begin{eqnarray}
# \int \left(\frac{f^2}{w^2}-\lambda\right)dV=0\qquad\rightarrow\qquad w\propto|f|
# \end{eqnarray}
#
# If we know a good approximation for function $f$, we can use this
# information to sample the same function $f$ to higher accuracy with
# importance sampling. The solution can thus be improved
# **iteratively**. This idea is implemented in **Vegas**
# algorithm, which is pedagogically implemented in 509 course.
#
#
# There is another set of algorithms to improve precision of Monte Carlo
# sampling. The idea is to divide the volume into smaller
# \DarkGreen{subregions} and check in each subregion how rapidly is the
# function $f$ varying in each subregion. The quantitative estimation
# can be the variance of the function is each subregion $\sqrt{\langle
# f^2\rangle-\langle f\rangle^2}$. The idea is to increase the number of
# points in those regions where variance is big. The algorithm is called
# \DarkGreen{Stratified Sampling} and is used in \DarkGreen{Miser}
# integration routine. The idea seems simple and powerful, but is not
# very usefull for high-dimensional integration because the number of
# subregions grows exponentially with the number of dimensions therefore
# it is usefull only if we have some idea how to constract small number
# of subregions where variance of $f$ is large. This last trick is also
# used in Vegas algorithm which is probabbly the best algorithm
# available at the moment.
#
# ## Monte Carlo Importance Sampling
#
#
# Let us introduce the concept of importance sampling method by
# application to classical many-particle system (like Ising model or
# classical gas).
#
# **The basic idea of Monte Carlo Simulation:**
# The simulation is performed by random walk through *very large* configuration
# space. The probability to make a move has to be such that the system
# gets to thermal equilibrium (and remains in thermal equilibrium) at
# certain temperature $T$ (is usually a parameter) after a lot of Monte
# Carlo steps.
#
# **The basic idea of simulated annealing:**
# Slowly decreasing the temperature of the system leads to the ground state
# of the system. When using this type of cooling down by Monte Carlo,
# one can find a global minimum of a general minimization problem. This
# is called simulated annealing.
#
#
# ### MC Importance Sampling by Markov chain
#
# If a configuration in phase space is denoted by $X$, the probability
# for configuration according to Boltzman is
# \begin{equation}
# \rho(X)\propto e^{-\beta E(X)}\qquad \beta=\frac{1}{T}
# \end{equation}
#
#
# How to sample over the whole phase space for a general problem? How to
# generate configurations?
#
# - **Brute force**: generate a truly random configuration X and accept it with probability $e^{-\beta E(X)}$ where all $E>0$. Successive $X$ are **statistically independent**.
#
# **VERY INEFFICIENT**
#
# - **Markov chain**: Successive configurations $X_i$, $X_{i+1}$ are **NOT statistically independent** but are distributed according to the choosen disribution (such as Boltzman distribution).
#
# **Can be made VERY EFFICIENT**
#
#
#
# What is the difference between Markov chain and uncorrelated sequence?
# \begin{itemize}
# - *Truly random* or *uncorrelated* sequence of configurations satisfies the identity
# $$P(X_1,X_2,\cdots,P_{X_N})=P_1(X_1)P_1(X_2)\cdots P_1(X_{N})$$
# - *Markov chain* satisfies the equation
# $$P(X_1,X_2,\cdots,P_{X_N})=P_1(X_1)T(X_1\rightarrow X_2) T(X_2\rightarrow X_3)\cdots T(X_{N-1}\rightarrow X_N)$$
# where the transition probabilities
# $T(X\rightarrow X')$ are normalized
# $$\sum_{X'} T(X \rightarrow X')=1$$
#
#
# We want to generate Markov chain where distribution of states is
# proportional to the choosen distribution, for example $$e^{-\beta E(X)},$$ and the
# distribution of states is independent of the position within the
# chain and independent of the initial configuration.
#
# - **1) Connectedness**
#
# The necessary conditions for generating such Markov chain is that every configuration in the phase space should be accesible from any other configuration within finite number of steps (*connectedness* or *irreducibility*) - (Be careful to check this condition when choosing Monte Carlo step!)
#
# - **2) Detail balance**
#
# We need to find transition probability $T(X\rightarrow X')$ which leads to a given **stationary** distribution $\rho(X)$ (in this case $\rho(X)\propto e^{-\beta E(X)}$).
#
# The probability for $X$ decreases, if system goes from $X$
# to any other $X'$: $\Delta \rho(X)=-\sum_{X'}\rho(X)T(X\rightarrow X')$ and increases
# if $X$ configuration is visited from any other state $X'$:
# $\Delta \rho(X) = \sum_{X'}\rho(X')T(X'\rightarrow X)$. The
# step (time) difference of the probability $X$ is therefore
# \begin{equation}
# \rho(X,t+1)-\rho(X,t) = -\sum_{X'}\rho(X)T(X\rightarrow X')+\sum_{X'}\rho(X')T(X'\rightarrow X)
# \end{equation}
# We look for stationary solution, i.e., $\rho(X,t+1)-\rho(X,t)=0$ and
# therefore
# \begin{equation}
# \sum_{X'}\rho(X)T(X\rightarrow X')=\sum_{X'}\rho(X')T(X'\rightarrow X)
# \end{equation}
# General solution of this equation is not accesible, but a particular
# solution is obvious
# \begin{equation}
# \rho(X)T(X\rightarrow X')=\rho(X')T(X'\rightarrow X)
# \end{equation}
# This solution is called **DETAIL BALANCE** solution.
#
#
# To construct algorithm, we divide transition prob.
# $T(X\rightarrow X')=\omega_{X\rightarrow X'}A_{X\rightarrow X'}$:
#
#
# - *trial step probability* $\omega_{X\rightarrow X'}$.
#
# Many times it is symmetric, i.e., $\omega_{X\rightarrow X'}=\omega_{X'\rightarrow X}$.
# (for example spin flip in ising: $\omega_{XX'}$ is $1/L^2$ if $X$ and $X'$ differ for a single spin flip and zero otherwise ).
#
# - *acceptance probability* $A_{X\rightarrow X'}$
#
# (for example accepting or rejecting new configuration with probability proportional to $min(1,\exp(-\beta(E(X')-E(X))))$).
#
#
# Detail balance condition becomes
# $$\frac{A_{X\rightarrow X'}}{A_{X'\rightarrow X}}=\frac{\omega_{X' \rightarrow X}\;\rho(X')}
# {\omega_{X\rightarrow X'}\;\rho(X)}$$
#
#
# Metropolis chooses
# \begin{eqnarray}
# \begin{array}{lc}
# A_{X\rightarrow X'}=1 & \;\mathrm{if}\;\; \omega_{X'\rightarrow X} \rho(X')> \omega_{X\rightarrow X'} \rho(X)\\
# A_{X\rightarrow X'}=
# \frac{\omega_{X'\rightarrow X}\,\rho(X')}{\omega_{X\rightarrow X'}\,\rho(X)} &
# \;\mathrm{if}\;\;\omega_{X'\rightarrow X}\rho(X')<\omega_{X\rightarrow X'}\;\rho(X).
# \end{array}
# \end{eqnarray}
# Obviously, this acceptance probability satisfies detail balance
# condition and therefore leads to desired Markov chain with stationary
# probability for any configuration $X\propto \rho(X)$ for long times.
#
#
# To summarize Metropolis algorithm
#
# - $T(X\rightarrow X')=\omega_{X\rightarrow X'}A_{X\rightarrow X'}$ (we have to consider trial step probability)
# - $\sum_{X'}\omega_{X\rightarrow X'}=1$ and $\omega_{X\rightarrow X'}>0$ for all $X, X'$ after finite number of steps (ergodicity)
# - $A_{X\rightarrow X'}=min(1,\frac{\rho(X')\omega_{X'\rightarrow
# X}}{\rho(X)\omega_{X\rightarrow X'}})$ (acceptance step probability for Metropolis)
#
#
# How to accept a step with probability $A_{XX'}<1$? One can generate
# a random number $r\in[0,1]$ and accept the step if $rn_0}^n A_i$$ where $n_0$ steps
# are used to "warm-up". This is because the configurations in the
# Markov's chain are distributed according to $\rho(X)$.
#
# We can also try compute the following quantity
# $$\sum_X \rho(X)(A(X)-\overline{A})^2 \rightarrow^?
# \frac{1}{n-n_0}\sum_{i>n_0}^n (A_i-\overline{A})^2.$$
# This is A WRONG ESTIMATION OF THE ERROR OF MC. The error is much
# bigger than this estimation, because configurations $X$ are
# correlated! Imagine the limit of large correlations when almost all
# values $A_i$ are the same (very slowly changing configurations). We
# would estimate that standard deviation is zero regardless of the
# actual error!
#
# To compute standard deviation, we need to group meassurements within
# the correlation time into bins and than estimate the standard
# deviation of the bins:
# \begin{eqnarray}
# B_l = \frac{1}{N_0}\sum_{i=N_l}^{i