Author: Lehman Garrison (http://lgarrison.github.io)
$\newcommand{\RR}{\mathrm{RR}}$ $\newcommand{\DD}{\mathrm{DD}}$ When computing a two-point correlation function estimator like $\xi(r) = \frac{\DD}{\RR} - 1$, the $\RR$ term can be computed analytically if the domain is a periodic box. In 2D, this is often done as $$\begin{align} \RR_i &= N A_i \bar\rho \\ &= N A_i \frac{N}{L^2} \end{align}$$ where $\RR_i$ is the expected number of random-random pairs in bin $i$, $N$ is the total number of points, $A_i$ is the area (or volume if 3D) of bin $i$, $L$ is the box size, and $\bar\rho$ is the average density in the box.
However, using $\bar\rho = \frac{N}{L^2}$ is not quite right. When sitting on a particle, only $N-1$ particles are available to be in a bin some distance away. The other particle is the particle you're sitting on, which is always at distance $0$. Thus, the correct expression is $$\RR_i = N A_i \frac{N-1}{L^2}.$$
The following notebook empirically demonstrates that using $N-1$ is correct, and that using $N$ introduces bias of order $\frac{1}{N}$ into the estimator. This is a tiny correction for large $N$ problems, but important for small $N$.
Cross-correlations of two different particle sets don't suffer from this problem; the particle you're sitting on is never part of the set of particles under consideration for pair-making.
This $N-1$ correction is implemented in the Corrfunc code.
import numpy as np
# Generates a set of N uniform random points in boxsize L
# and returns binned pair counts
# Pair counts are doubled, and self-pairs are counted
def rand_DD(N, L, bins):
# Make a 2D random set of N particles in [0,L)
x = np.random.random((2,N))*L
# Form the periodic distance array
dist = x[:,None] - x[...,None]
dist -= np.round(dist/L)*L
dist = (dist**2).sum(axis=0)**.5
# Count pairs in 2 bins
DD = np.histogram(dist, bins=bins)[0]
return DD
# DD parameters
N = 10
L = 1.
bins = np.linspace(0.0,0.49, 5)
n_iter = 100000 # number of times to repeat the experiment
seed = 42
np.random.seed(seed)
# First, calculate RR analytically
dA = (bins[1:]**2 - bins[:-1]**2)*np.pi # The area of each 2D annular bin
RR = np.empty((2, len(dA)))
RR[:] = N/L**2*dA
# Calculate RR using N-1 and N
RR[0] *= N - 1
RR[1] *= N
# If the first bin includes 0, the random term must include the self-pairs
if bins[0] == 0.:
RR[:, 0] += N
xi_mean = np.zeros((2, len(bins)-1))
for i in xrange(n_iter):
DD = rand_DD(N, L, bins)
xi_mean += DD/RR
xi_mean /= n_iter
print 'DD/RR using density (N-1) = {}'.format(xi_mean[0])
print 'DD/RR using density (N) = {}'.format(xi_mean[1])
DD/RR using density (N-1) = [ 1.00022627 0.99934544 0.99956887 1.00006867] DD/RR using density (N) = [ 0.96817988 0.8994109 0.89961199 0.9000618 ]
As we expected, using $N$ in the density leads to a $\frac{1}{N} = 10\%$ bias in $\frac{\DD}{\RR}$, when averaging over many realizations of $\DD$.
The first bin has a smaller bias because we chose to include a bin that starts at $0$ separation, and thus includes self-pairs, which are indepdendent of the $N$ or $N-1$ density term.