Author: Lehman Garrison (http://lgarrison.github.io)

$\newcommand{\RR}{\mathrm{RR}}$ $\newcommand{\DD}{\mathrm{DD}}$ When computing a two-point correlation function estimator like $\xi(r) = \frac{\DD}{\RR} - 1$, the $\RR$ term can be computed analytically if the domain is a periodic box. In 2D, this is often done as $$\begin{align} \RR_i &= N A_i \bar\rho \\ &= N A_i \frac{N}{L^2} \end{align}$$ where $\RR_i$ is the expected number of random-random pairs in bin $i$, $N$ is the total number of points, $A_i$ is the area (or volume if 3D) of bin $i$, $L$ is the box size, and $\bar\rho$ is the average density in the box.

However, using $\bar\rho = \frac{N}{L^2}$ is not quite right. When sitting on a particle, only $N-1$ particles are available to be in a bin some distance away. The other particle is the particle you're sitting on, which is always at distance $0$. Thus, the correct expression is $$\RR_i = N A_i \frac{N-1}{L^2}.$$

The following notebook empirically demonstrates that using $N-1$ is correct, and that using $N$ introduces bias of order $\frac{1}{N}$ into the estimator. This is a tiny correction for large $N$ problems, but important for small $N$.

Cross-correlations of two different particle sets don't suffer from this problem; the particle you're sitting on is never part of the set of particles under consideration for pair-making.

This $N-1$ correction is implemented in the Corrfunc code.

In [1]:

```
import numpy as np
```

In [2]:

```
# Generates a set of N uniform random points in boxsize L
# and returns binned pair counts
# Pair counts are doubled, and self-pairs are counted
def rand_DD(N, L, bins):
# Make a 2D random set of N particles in [0,L)
x = np.random.random((2,N))*L
# Form the periodic distance array
dist = x[:,None] - x[...,None]
dist -= np.round(dist/L)*L
dist = (dist**2).sum(axis=0)**.5
# Count pairs in 2 bins
DD = np.histogram(dist, bins=bins)[0]
return DD
```

In [3]:

```
# DD parameters
N = 10
L = 1.
bins = np.linspace(0.0,0.49, 5)
n_iter = 100000 # number of times to repeat the experiment
seed = 42
np.random.seed(seed)
# First, calculate RR analytically
dA = (bins[1:]**2 - bins[:-1]**2)*np.pi # The area of each 2D annular bin
RR = np.empty((2, len(dA)))
RR[:] = N/L**2*dA
# Calculate RR using N-1 and N
RR[0] *= N - 1
RR[1] *= N
# If the first bin includes 0, the random term must include the self-pairs
if bins[0] == 0.:
RR[:, 0] += N
xi_mean = np.zeros((2, len(bins)-1))
for i in xrange(n_iter):
DD = rand_DD(N, L, bins)
xi_mean += DD/RR
xi_mean /= n_iter
print 'DD/RR using density (N-1) = {}'.format(xi_mean[0])
print 'DD/RR using density (N) = {}'.format(xi_mean[1])
```

As we expected, using $N$ in the density leads to a $\frac{1}{N} = 10\%$ bias in $\frac{\DD}{\RR}$, when averaging over many realizations of $\DD$.

The first bin has a smaller bias because we chose to include a bin that starts at $0$ separation, and thus includes self-pairs, which are indepdendent of the $N$ or $N-1$ density term.