The basic idea of Monte Carlo methods is to use random numbers and statistics to solve complicated problems that are hard to attack in any other way. Typically this is achieved through simulations, such as Markov chains. The most straightforward application, however, is the computation of high-dimensional integrals by throwing random points into a volume and estimating the result. There are numerous other applications, such as finding global minima through simulated annealing, an idea inspired by statistical physics, and so on.
Here we will cover only a small fraction of classical Monte Carlo algorithms. There is also a large family of quantum Monte Carlo algorithms that we will not discuss here.
First we need to understand what random numbers on a computer are.
It is very hard to implement a good random number generator, because a sequence of truly random numbers cannot be generated by a deterministic computer; only pseudo-random number generators can be coded. There are several excellent pseudo-random number generators available in various libraries, which give very satisfactory results in combination with Monte Carlo methods or multidimensional integration. For every Monte Carlo application it is crucial to use a high-quality random number generator. In practice, it is best to select a few generators with a good reputation and make sure that the results do not depend on the choice of generator.
NumPy's default generator is now PCG64 from the PCG family, which is relatively new and has a very good reputation. Older generators with a good reputation are the Mersenne Twister generators, which usually go by the name mt19937. For example, the GNU Scientific Library implements them as gsl_rng_mt19937, and the Intel MKL library calls them VSL_BRNG_MT19937 and VSL_BRNG_SFMT19937.
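As a short illustration (using NumPy's current random API; the seed value is arbitrary), both generator families mentioned above can be instantiated side by side:

```python
import numpy as np

# Default NumPy generator: PCG64 under the hood.
rng_pcg = np.random.default_rng(seed=42)

# The Mersenne Twister (MT19937) is still available as an explicit bit generator.
rng_mt = np.random.Generator(np.random.MT19937(seed=42))

# Uniform random numbers on [0, 1), as used throughout these notes.
print(rng_pcg.random(5))
print(rng_mt.random(5))
```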
How do random number generators work?
The simplest and fastest random number generators, which are unfortunately not of
high quality, are linear congruential generators:
$$I_{j+1}=(a I_j + c)\; \textrm{mod}\; m$$
for example, $a=16843009$, $c=3014898611$, and $m=2^{32}$ in the cc65 generator.
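As a minimal sketch, the recurrence above with the quoted cc65 constants can be coded in a few lines of Python (dividing by $m$ maps the integers to floats in $[0,1)$):

```python
def lcg(seed, a=16843009, c=3014898611, m=2**32):
    """Linear congruential generator I_{j+1} = (a*I_j + c) mod m (cc65 constants)."""
    I = seed
    while True:
        I = (a * I + c) % m
        yield I / m          # map to a float in [0, 1)

gen = lcg(seed=1)
print([next(gen) for _ in range(5)])
```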
Modern generators like PCG are of course much better, but one still needs to understand that these are pseudo-random numbers. Before you spend millions of CPU hours, you should test that the results do not depend on the random number generator within the statistical error.
Monte Carlo is more appropriate than products of one-dimensional quadratures for multidimensional numerical integration in more than about 4 dimensions. If the function is smooth enough, or we know how to transform the integral to make it smooth, the integration can be performed with MC. The reason for the success of MC is that its error, according to the central limit theorem, is always proportional to $1/\sqrt{N}$, independent of the dimension.
The error of one-dimensional quadratures can be estimated as follows. If the number of points used in each dimension is $N_1$, the number of all points used in $d$ dimensions is $N=(N_1)^d$. The error of the trapezoid rule was estimated to be $1/(N_1)^2$, therefore the error in $d$ dimensions is $1/N^{2/d}$. It is then clear that for $d=4$ the Monte Carlo error and the trapezoid-rule error scale in the same way with $N$.
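Setting the two error estimates equal makes the crossover dimension explicit:
$$\frac{1}{N^{2/d}}=\frac{1}{N^{1/2}} \qquad\Leftrightarrow\qquad \frac{2}{d}=\frac{1}{2} \qquad\Leftrightarrow\qquad d=4,$$
so for $d>4$ the Monte Carlo error $N^{-1/2}$ decreases faster with the total number of points $N$ than the trapezoid-rule error $N^{-2/d}$.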
Straightforward MC
For more or less flat functions, the integration is straightforward: \begin{eqnarray} \int f dV \approx V \langle f\rangle \pm V \sqrt{\frac{\langle f^2\rangle-\langle f\rangle^2}{N}} \end{eqnarray} where \begin{eqnarray} \langle f\rangle\equiv \frac{1}{N}\sum_{i=0}^{N-1} f(x_i)\\ \langle f^2\rangle\equiv \frac{1}{N}\sum_{i=0}^{N-1} f^2(x_i) \end{eqnarray}
If the function $f$ is rapidly varying, the variance is going to be large and the precision of the integral very bad. The $1/\sqrt{N}$ scaling is slow! If one has a lot of computer time and patience, one might still use this method, because it is so straightforward to implement.
If the region $V$ of the integral is complicated and it is hard to generate a distribution of points inside $V$, one can simply find a larger and simpler volume $W$ which contains $V$. One then samples over $W$ and defines the function $f$ to be zero outside $V$. Of course the error increases, because the number of "good" points is smaller.
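Below is a minimal sketch of the estimator above in Python/NumPy; the integrand (the characteristic function of the unit sphere, integrated over the enclosing cube $W=[-1,1]^3$) and the function name mc_integrate are chosen only to illustrate the bounding-volume trick.

```python
import numpy as np

def mc_integrate(f, lo, hi, N=100_000, rng=None):
    """Plain Monte Carlo estimate of the integral of f over the box [lo, hi],
    returning the estimate and its 1/sqrt(N) error bar."""
    rng = rng or np.random.default_rng()
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    V = np.prod(hi - lo)                        # volume of the sampling box
    x = rng.uniform(lo, hi, size=(N, len(lo)))  # uniform points in the box
    fx = f(x)
    mean, mean2 = fx.mean(), (fx**2).mean()
    return V * mean, V * np.sqrt((mean2 - mean**2) / N)

# Example: volume of the unit sphere, sampled over the enclosing cube W = [-1,1]^3
# with the integrand set to zero outside the sphere V.
f_sphere = lambda x: (np.sum(x**2, axis=1) <= 1.0).astype(float)
est, err = mc_integrate(f_sphere, lo=[-1, -1, -1], hi=[1, 1, 1])
print(est, "+/-", err, " exact:", 4 * np.pi / 3)
```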
Importance sampling
Monte Carlo becomes much more appealing when an importance sampling strategy is implemented. One of course needs to know something about the function to implement the strategy, but one is rewarded with much higher accuracy.
It is simplest to illustrate the idea in 1D. If we know that the function $f(x)$ to be integrated is roughly proportional to another function $w(x)$ in the region that carries most of the weight of the integral, we might want to rewrite the integral \begin{eqnarray} \int f(x)dx = \int \frac{f(x)}{w(x)} w(x) dx \end{eqnarray} If the weight function $w(x)$ is a simple analytic function, which can be integrated analytically, and obeys the constraints $w(x)\ge 0$ and $\int w(x)dx=1$,
we can define \begin{eqnarray} W(x) = \int^x w(t)dt \qquad \rightarrow\qquad dW(x) = w(x)dx \end{eqnarray} and rewrite \begin{eqnarray} \int f(x)dx = \int \frac{f(x)}{w(x)} dW(x) = \int \frac{f(x(W))}{w(x(W))} dW\rightarrow \left\langle \frac{f(x(W))}{w(x(W))}\right\rangle_{W\; uniform\;\in[0,1]} \nonumber \end{eqnarray}
If the function $f/w$ on mesh $W$ is reasonably flat, it can be efficiently integrated by MC. The error is now proportional to $$\sqrt{\frac{\langle(f/w)^2\rangle-\langle f/w\rangle^2}{N}}$$ and is therefore greatly reduced.
To implement the algorithm, we generate uniform random numbers $r$ in the interval $r\in[0,1]$, which correspond to the variable $W$. We then solve the equation $x=W^{-1}(r)$ to get $x$ and use it to evaluate $f(x)/w(x)$. The random numbers are therefore uniformly distributed on the mesh $W$, while they are non-uniformly distributed on the mesh $x$.
This can also be written as \begin{eqnarray} \left\langle \frac{f(x)}{w(x)}\right\rangle_{\frac{dP}{dx}=w(x)} \nonumber \end{eqnarray} because the distribution of points $x$ is $\frac{dP}{dx}=\frac{dP}{dW}\frac{dW}{dx}$, and since the distribution $\frac{dP}{dW}$ is uniform and $\frac{dW}{dx}=w(x)$, we have $\frac{dP}{dx}=w(x)$.
The classic example is the exponential weight function \begin{eqnarray} w(x) = \frac{1}{\lambda}e^{-x/\lambda}\qquad \textrm{for}\; x>0 \end{eqnarray} This is equivalent to our exponentially distributed mesh points: most of them are going to be close to $0$ and only a few at large $x$.
The integral is $W(x)=1-e^{-x/\lambda}$, which gives the inverse $x=-\lambda\ln(1-W)$. The integral \begin{equation} \int_0^\infty f(x)dx=\int_0^1 \frac{f(-\lambda\ln(1-W))}{w(-\lambda\ln(1-W))} dW = \int_0^1 f(-\lambda\ln(1-W))\frac{\lambda dW}{1-W} \end{equation} is easily evaluated with MC if $f(x)$ is an exponentially falling function. \begin{eqnarray} \int_0^\infty f(x)dx\rightarrow \langle \lambda \frac{f(-\lambda\ln(1-W))}{1-W}\rangle_{W\;uniform\in[0,1]} \end{eqnarray}
Since $\frac{dP}{dW}=1$ ($W$ is uniformly distributed), the probability for $x$ is $\frac{dP}{dx}=w(x)$, therefore $x$ is exponentially distributed random number.
We could also write \begin{eqnarray} \int_0^\infty f(x)dx\rightarrow \left\langle\frac{f(x)}{\frac{e^{-x/\lambda}}{\lambda}}\right\rangle_{\frac{dP}{dx}=e^{-x/\lambda}/\lambda} \end{eqnarray}
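A small sketch of this exponential-weight trick, where the integrand $f(x)=e^{-x}\cos x$ is chosen purely for illustration (its exact integral over $[0,\infty)$ is $1/2$):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.0
N = 100_000

# Draw W uniform in [0,1) and map it to exponentially distributed x = -lam*ln(1-W).
W = rng.random(N)
x = -lam * np.log(1.0 - W)

# Integrand f(x) = exp(-x)*cos(x); its exact integral over [0, inf) is 1/2.
f = np.exp(-x) * np.cos(x)
w = np.exp(-x / lam) / lam          # weight function evaluated at x
ratio = f / w                       # the quantity averaged in importance sampling

est = ratio.mean()
err = np.sqrt(((ratio**2).mean() - est**2) / N)
print(est, "+/-", err)              # should be close to 0.5
```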
Another very useful weight function is the Gaussian distribution \begin{eqnarray} w(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-x_0)^2}{2\sigma^2}} \end{eqnarray} How do we get a random number $x$ distributed according to this distribution? The integral $W(x)$ gives the $\mathrm{erf}$ function, and its inverse is not simple to evaluate.
The trick is to use two random numbers to get a Gaussian distribution. Consider the following algorithm \begin{eqnarray} x_1 = x_0 + \sqrt{-2\sigma^2\ln r_1}\cos(2\pi r_2)\\ x_2 = x_0 + \sqrt{-2\sigma^2\ln r_1}\sin(2\pi r_2) \end{eqnarray} The distribution of $r_1$ and $r_2$ is uniform in the interval $[0,1]$. The distribution of $x_1$ and $x_2$ is \begin{eqnarray} \frac{d^2 P}{dx_1 dx_2} = \frac{d^2 P}{dr_1 dr_2} \left|\frac{\partial(r_1,r_2)}{\partial(x_1,x_2)}\right|= \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x_1-x_0)^2/(2\sigma^2)} \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x_2-x_0)^2/(2\sigma^2)} \end{eqnarray} and is therefore Gaussian. In this way we obtain $x$ distributed according to the Gaussian, and we can evaluate \begin{eqnarray} \int f(x)dx = \left\langle \frac{f(x)}{\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-x_0)^2/(2\sigma^2)}} \right\rangle_{\frac{dP}{dx}=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-x_0)^2}{2\sigma^2}}} \end{eqnarray}
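A short sketch of this two-random-number (Box-Muller) construction; the function name and parameter values are illustrative only:

```python
import numpy as np

def gaussian_box_muller(N, x0=0.0, sigma=1.0, rng=None):
    """Generate 2N Gaussian random numbers using two uniform numbers per pair."""
    rng = rng or np.random.default_rng()
    r1, r2 = rng.random(N), rng.random(N)
    r1 = np.where(r1 == 0.0, np.finfo(float).tiny, r1)   # avoid log(0)
    radius = np.sqrt(-2.0 * sigma**2 * np.log(r1))
    x1 = x0 + radius * np.cos(2 * np.pi * r2)
    x2 = x0 + radius * np.sin(2 * np.pi * r2)
    return np.concatenate([x1, x2])

samples = gaussian_box_muller(50_000, x0=0.0, sigma=2.0)
print(samples.mean(), samples.std())   # should be close to 0 and 2
```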
What is the best choice for weight function $w$?
If the function $f$ is positive, the best choice of $w$ is clearly proportional to $f$ itself. What if $f$ is not positive everywhere? It turns out that the best choice is the absolute value of $f$, i.e., \begin{eqnarray} w=\frac{|f|}{\int|f|dV} \end{eqnarray} The proof is simple. MC importance sampling evaluates \begin{eqnarray} \int f dV = \int \frac{f}{w}w dV\approx \left\langle\frac{f}{w}\right\rangle\pm \sqrt{\frac{\langle(\frac{f}{w})^2\rangle-\langle\frac{f}{w}\rangle^2}{N}} \end{eqnarray} and the error is minimal when \begin{eqnarray} \delta\left(\langle(\frac{f}{w})^2\rangle-\langle\frac{f}{w}\rangle^2 +\lambda(\int wdV -1)\right)=0\\ \delta\left( \int \frac{f^2}{w^2} w dV - \left( \int \frac{f}{w}w dV \right)^2+\lambda(\int wdV -1) \right)=0 \end{eqnarray} where $\lambda$ is a Lagrange multiplier enforcing the normalization of $w$. The second term does not depend on $w$, so the variation with respect to $w$ gives \begin{eqnarray} \int \delta w\left(-\frac{f^2}{w^2}+\lambda\right)dV=0\qquad\rightarrow\qquad \frac{f^2}{w^2}=\lambda\qquad\rightarrow\qquad w\propto|f| \end{eqnarray}
If we know a good approximation for function $f$, we can use this information to sample the same function $f$ to higher accuracy with importance sampling. The solution can thus be improved iteratively. This idea is implemented in Vegas algorithm, which is pedagogically implemented in 509 course.
There is another set of algorithms to improve the precision of Monte Carlo sampling. The idea is to divide the volume into smaller subregions and check how rapidly the function $f$ varies in each subregion. A quantitative measure is the spread of the function in each subregion, $\sqrt{\langle f^2\rangle-\langle f\rangle^2}$. The idea is to increase the number of points in those regions where the variance is large. The algorithm is called Stratified Sampling and is used in the Miser integration routine (see the sketch below). The idea seems simple and powerful, but it is not very useful for high-dimensional integration, because the number of subregions grows exponentially with the number of dimensions; it is therefore useful only if we have some idea of how to construct a small number of subregions in which the variance of $f$ is large. This last trick is also used in the Vegas algorithm, which is probably the best algorithm available at the moment.
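Below is a toy one-dimensional sketch of the stratified-sampling idea (not the actual Miser routine): a short pilot run estimates how strongly $f$ varies in each subregion, and the remaining points are allocated accordingly. The integrand and all parameters are chosen only for illustration.

```python
import numpy as np

def stratified_mc(f, n_strata=10, n_pilot=200, n_total=20_000, rng=None):
    """Toy stratified sampling on [0,1]: strata where f fluctuates strongly
    receive proportionally more sample points."""
    rng = rng or np.random.default_rng()
    edges = np.linspace(0.0, 1.0, n_strata + 1)
    widths = np.diff(edges)

    # Pilot run: crude estimate of the standard deviation of f in each stratum.
    sig = np.array([f(rng.uniform(edges[j], edges[j + 1], n_pilot)).std()
                    for j in range(n_strata)])

    # Allocate the remaining points proportionally to the estimated spread.
    alloc = np.maximum((n_total * sig / sig.sum()).astype(int), 2)

    # Sum of per-stratum estimates: integral ~ sum_j V_j <f>_j.
    return sum(widths[j] * f(rng.uniform(edges[j], edges[j + 1], alloc[j])).mean()
               for j in range(n_strata))

# Example: a sharply peaked integrand near x = 0.9.
f = lambda x: 1.0 / ((x - 0.9)**2 + 0.01)
exact = (np.arctan((1 - 0.9) / 0.1) - np.arctan((0 - 0.9) / 0.1)) / 0.1
print(stratified_mc(f), "vs exact", exact)
```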
Let us introduce the concept of the importance sampling method by applying it to a classical many-particle system (like the Ising model or a classical gas).
The basic idea of a Monte Carlo simulation: the simulation is performed as a random walk through a very large configuration space. The probability to make a move has to be such that the system reaches thermal equilibrium (and remains in thermal equilibrium) at a given temperature $T$ (usually a parameter) after many Monte Carlo steps.
The basic idea of simulated annealing: slowly decreasing the temperature of the system leads to the ground state of the system. When this type of cooling is carried out with Monte Carlo, one can find the global minimum of a general minimization problem. This is called simulated annealing.
If a configuration in phase space is denoted by $X$, the probability of the configuration according to Boltzmann is \begin{equation} \rho(X)\propto e^{-\beta E(X)}\qquad \beta=\frac{1}{T} \end{equation}
How to sample over the whole phase space for a general problem? How to generate configurations?
Brute force: generate a truly random configuration $X$ and accept it with probability $e^{-\beta E(X)}$ (assuming all $E>0$, so that this is a valid probability). Successive $X$ are statistically independent.
VERY INEFFICIENT
Markov chain: successive configurations $X_i$, $X_{i+1}$ are NOT statistically independent, but are distributed according to the chosen distribution (such as the Boltzmann distribution).
Can be made VERY EFFICIENT
What is the difference between a Markov chain and an uncorrelated sequence? In a Markov chain, the next configuration $X'$ is generated from the current configuration $X$ with a transition probability $T(X\rightarrow X')$ that depends only on $X$ and not on the previous history; the transition probabilities are normalized, $$\sum_{X'} T(X \rightarrow X')=1$$
We want to generate a Markov chain in which the distribution of states is proportional to the chosen distribution, for example $$e^{-\beta E(X)},$$ and the distribution of states is independent of the position within the chain and independent of the initial configuration.
The necessary condition for generating such a Markov chain is that every configuration in phase space should be accessible from any other configuration within a finite number of steps (connectedness or irreducibility). Be careful to check this condition when choosing the Monte Carlo step!
We need to find a transition probability $T(X\rightarrow X')$ which leads to a given stationary distribution $\rho(X)$ (in this case $\rho(X)\propto e^{-\beta E(X)}$).
The probability of $X$ decreases if the system goes from $X$ to any other $X'$: $\Delta \rho(X)=-\sum_{X'}\rho(X)T(X\rightarrow X')$, and increases if the configuration $X$ is visited from any other state $X'$: $\Delta \rho(X) = \sum_{X'}\rho(X')T(X'\rightarrow X)$. The step (time) difference of the probability of $X$ is therefore \begin{equation} \rho(X,t+1)-\rho(X,t) = -\sum_{X'}\rho(X)T(X\rightarrow X')+\sum_{X'}\rho(X')T(X'\rightarrow X) \end{equation} We look for a stationary solution, i.e., $\rho(X,t+1)-\rho(X,t)=0$, and therefore \begin{equation} \sum_{X'}\rho(X)T(X\rightarrow X')=\sum_{X'}\rho(X')T(X'\rightarrow X) \end{equation} A general solution of this equation is not accessible, but a particular solution is obvious: \begin{equation} \rho(X)T(X\rightarrow X')=\rho(X')T(X'\rightarrow X) \end{equation} This is called the DETAILED BALANCE solution.
To construct the algorithm, we split the transition probability $T(X\rightarrow X')=\omega_{X\rightarrow X'}A_{X\rightarrow X'}$ into:
trial step probability $\omega_{X\rightarrow X'}$.
It is often symmetric, i.e., $\omega_{X\rightarrow X'}=\omega_{X'\rightarrow X}$ (for example, for a spin flip in the Ising model, $\omega_{XX'}$ is $1/L^2$ if $X$ and $X'$ differ by a single spin flip and zero otherwise).
acceptance probability $A_{X\rightarrow X'}$
(for example, accepting or rejecting the new configuration with probability $\min(1,\exp(-\beta(E(X')-E(X))))$).
The detailed balance condition becomes $$\frac{A_{X\rightarrow X'}}{A_{X'\rightarrow X}}=\frac{\omega_{X' \rightarrow X}\;\rho(X')} {\omega_{X\rightarrow X'}\;\rho(X)}$$
Metropolis chooses \begin{eqnarray} \begin{array}{lc} A_{X\rightarrow X'}=1 & \;\mathrm{if}\;\; \omega_{X'\rightarrow X}\, \rho(X')> \omega_{X\rightarrow X'}\, \rho(X)\\ A_{X\rightarrow X'}= \frac{\omega_{X'\rightarrow X}\,\rho(X')}{\omega_{X\rightarrow X'}\,\rho(X)} & \;\mathrm{if}\;\;\omega_{X'\rightarrow X}\,\rho(X')<\omega_{X\rightarrow X'}\,\rho(X). \end{array} \end{eqnarray} Obviously, this acceptance probability satisfies the detailed balance condition and therefore leads to the desired Markov chain, whose stationary distribution of configurations is $\propto \rho(X)$ at long times.
To summarize the Metropolis algorithm:
How to accept a step with probability $A_{XX'}<1$? One can generate a random number $r\in[0,1]$ and accept the step if $r<A_{XX'}$.
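As a concrete sketch, here is the Metropolis algorithm for the 2D Ising model with single-spin-flip trial moves ($J=1$, periodic boundary conditions); the lattice size, temperature, and number of sweeps below are illustrative only.

```python
import numpy as np

def metropolis_ising(L=16, T=2.0, n_sweeps=600, rng=None):
    """Metropolis sampling of the 2D Ising model on an L x L lattice with
    periodic boundaries; returns the magnetization per site after each sweep."""
    rng = rng or np.random.default_rng()
    beta = 1.0 / T
    spins = rng.choice([-1, 1], size=(L, L))
    magnetization = []
    for sweep in range(n_sweeps):
        for _ in range(L * L):                       # one sweep = L^2 trial flips
            i, j = rng.integers(0, L, size=2)
            # Energy change for flipping spin (i, j) with coupling J = 1.
            nn = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j] +
                  spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
            dE = 2.0 * spins[i, j] * nn
            # Metropolis acceptance: accept if r < min(1, exp(-beta*dE)).
            if dE <= 0 or rng.random() < np.exp(-beta * dE):
                spins[i, j] *= -1
        magnetization.append(spins.mean())
    return np.array(magnetization)

m = metropolis_ising()
print("average |m| after warm-up:", np.abs(m[100:]).mean())
```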
Keep in mind:
Most of the time we simulate configurations which correspond to the partition function, so that the probability to find a given state $X$ in the Markov chain is equal to $\rho(X) = \frac{e^{-\beta E(X)}}{\sum_X e^{-\beta E(X)}}=\frac{\Delta Z(X)}{Z} $.
The average of any quantity is then calculated as $$\overline{A}\equiv \sum_X \rho(X) A(X) \rightarrow \frac{1}{n-n_0}\sum_{i>n_0}^n A_i$$ where the first $n_0$ steps are used to "warm up". This works because the configurations in the Markov chain are distributed according to $\rho(X)$.
We might also try to compute the quantity $$\sum_X \rho(X)(A(X)-\overline{A})^2 \rightarrow^? \frac{1}{n-n_0}\sum_{i>n_0}^n (A_i-\overline{A})^2.$$ This is A WRONG ESTIMATE OF THE MC ERROR. The true error is much bigger than this estimate, because the configurations $X$ are correlated! Imagine the limit of large correlations, when almost all values $A_i$ are the same (very slowly changing configurations): we would estimate that the standard deviation is zero regardless of the actual error!
To compute the statistical error, we need to group the measurements into bins of the size of the correlation time and then estimate the variance of the bin averages: \begin{eqnarray} B_l = \frac{1}{N_0}\sum_{i=N_l}^{N_l+N_0-1} A_i\\ \sigma^2 = \frac{1}{M(M-1)}\sum_{j=0}^{M-1}(B_j-\overline{A})^2 \end{eqnarray} where $M$ is the number of bins and we took into account that $\overline{A}=\overline{B}$. The correlation time (here set by the bin size $N_0$) is not easy to estimate in advance. Perhaps the best strategy is to compute $\sigma^2$ for a few different $N_0$: as long as $\sigma^2$ is increasing with $N_0$, the correlation time is still larger than $N_0$. When $\sigma^2$ stops changing with increasing $N_0$, we have reached the correlation time and $\sigma$ is a good estimate of the statistical error of $\overline{A}$.
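A small sketch of this binning analysis; the AR(1) test data and the list of bin sizes are chosen only for illustration.

```python
import numpy as np

def binning_error(A, bin_sizes=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Estimate the error of mean(A) for correlated data by grouping the
    measurements into bins of size N0; the estimate grows with N0 and
    saturates once N0 exceeds the correlation time."""
    A = np.asarray(A)
    for N0 in bin_sizes:
        M = len(A) // N0                              # number of bins
        B = A[:M * N0].reshape(M, N0).mean(axis=1)    # bin averages B_l
        err = np.sqrt(((B - B.mean())**2).sum() / (M * (M - 1)))
        print(f"N0 = {N0:4d}   error estimate = {err:.5f}")

# Example with artificially correlated data: an AR(1) chain with a
# correlation time of roughly 20 steps.
rng = np.random.default_rng(0)
A = np.zeros(2**17)
for i in range(1, len(A)):
    A[i] = 0.95 * A[i - 1] + rng.normal()
binning_error(A)
```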