#!/usr/bin/env python
# coding: utf-8

# # The RR term in autocorrelations
# Author: Lehman Garrison (http://lgarrison.github.io)

# $\newcommand{\RR}{\mathrm{RR}}$
# $\newcommand{\DD}{\mathrm{DD}}$
# When computing a two-point correlation function estimator like
# $\xi(r) = \frac{\DD}{\RR} - 1$,
# the $\RR$ term can be computed analytically if the domain is a periodic box.
# In 2D, this is often done as
# $$\begin{align}
# \RR_i &= N A_i \bar\rho \\
# &= N A_i \frac{N}{L^2}
# \end{align}$$
# where $\RR_i$ is the expected number of random-random pairs in bin $i$, $N$ is the total number of points, $A_i$ is the area (or volume if 3D) of bin $i$, $L$ is the box size, and $\bar\rho$ is the average density in the box.
# 
# However, using $\bar\rho = \frac{N}{L^2}$ is not quite right.  When sitting on a particle, only $N-1$ particles are available to be in a bin some distance away.  The other particle is the particle you're sitting on, which is always at distance $0$.  Thus, the correct expression is
# $$\RR_i = N A_i \frac{N-1}{L^2}.$$
# 
# The following notebook empirically demonstrates that using $N-1$ is correct, and that using $N$ introduces bias of order $\frac{1}{N}$ into the estimator.  This is a tiny correction for large $N$ problems, but important for small $N$.
# 
# Cross-correlations of two different particle sets don't suffer from this problem; the particle you're sitting on is never part of the set of particles under consideration for pair-making.
# 
# This $N-1$ correction is implemented in the [Corrfunc](https://github.com/manodeep/Corrfunc) code.

# In[1]:


import numpy as np


# In[2]:


# Generates a set of N uniform random points in boxsize L
# and returns binned pair counts
# Pair counts are doubled, and self-pairs are counted
def rand_DD(N, L, bins):
    # Make a 2D random set of N particles in [0,L)
    x = np.random.random((2,N))*L
    
    # Form the periodic distance array
    dist = x[:,None] - x[...,None]
    dist -= np.round(dist/L)*L
    dist = (dist**2).sum(axis=0)**.5
    
    # Count pairs in 2 bins
    DD = np.histogram(dist, bins=bins)[0]
    
    return DD


# In[3]:


# DD parameters
N = 10
L = 1.
bins = np.linspace(0.0,0.49, 5)
n_iter = 100000  # number of times to repeat the experiment
seed = 42
np.random.seed(seed)

# First, calculate RR analytically
dA = (bins[1:]**2 - bins[:-1]**2)*np.pi  # The area of each 2D annular bin

RR = np.empty((2, len(dA)))
RR[:] = N/L**2*dA

# Calculate RR using N-1 and N
RR[0] *= N - 1
RR[1] *= N

# If the first bin includes 0, the random term must include the self-pairs
if bins[0] == 0.:
    RR[:, 0] += N

xi_mean = np.zeros((2, len(bins)-1))
for i in xrange(n_iter):
    DD = rand_DD(N, L, bins)
    xi_mean += DD/RR
    
xi_mean /= n_iter
print 'DD/RR using density (N-1) = {}'.format(xi_mean[0])
print 'DD/RR using density (N)   = {}'.format(xi_mean[1])


# As we expected, using $N$ in the density leads to a $\frac{1}{N} = 10\%$ bias in $\frac{\DD}{\RR}$, when averaging over many realizations of $\DD$.
# 
# The first bin has a smaller bias because we chose to include a bin that starts at $0$ separation, and thus includes self-pairs, which are indepdendent of the $N$ or $N-1$ density term.