Author: Yannick Copin y.copin@ipnl.in2p3.fr $ \renewcommand{\d}{\mathrm{d}} \renewcommand{\v}[1]{\boldsymbol{#1}} % Vector \newcommand{\m}[1]{\mathsf{#1}} % Matrix \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\med}{med} $
Abstract: A restricted number of statistical tools are presented, with emphasis on details sometimes overlooked in the literature. This may be too simple or too technical in places: comments and suggestions are welcome.
%matplotlib inline
import numpy as N
import scipy.stats as SS
import scipy.linalg as SL
import matplotlib.pyplot as P
We define two datasets:
Case without intrinsic dispersion: $\v{x} = (x_1,\ldots, x_N)$ is a sample of $N$ measurements of the same (null) quantity, with (usually estimated) gaussian measurement uncertainties $\d x_{i}$ whose variances $\d x_{i}^2$ follow a log-normal distribution (here, $\log\d x^2$ follows a $\mathcal{N}(0, 0.3^2)$ distribution).
Case with intrinsic dispersion: $\v{y} = (y_1,\ldots, y_N)$ is an intrinsically dispersed $\mathcal{N}(\mu=0,\sigma^2)$ sample (here with $\sigma = 1$), measured with the same gaussian measurement uncertainties $\d x_{i}$ as above. Unfortunately, the intrinsic dispersion $\sigma$ is usually not known a priori.
n = 100
N.random.seed(12345)
dx = N.sqrt(N.random.lognormal(sigma=0.3, size=n)) # Log-normally distributed uncertainties
x = N.random.normal(loc=0, scale=dx) # Pure gaussian errors
y = x + N.random.randn(n) # Gaussian uncertainties + intrinsic dispersion
P.errorbar(N.arange(n), x, yerr=dx, fmt='bo', label='x: no intrinsic dispersion')
P.errorbar(N.arange(n), y, yerr=dx, fmt='rs', label='y: x + intrinsic dispersion')
#P.hist(x, orientation='horizontal', histtype='stepfilled', color='b', alpha=0.5)
#P.hist(y, orientation='horizontal', histtype='step', color='r')
P.legend(loc='best')
Arithmetic average: $$\begin{align} \hat{\mu} &= \frac{1}{N}\sum_i x_i \\ \Var(\hat{\mu}) &= \frac{\hat{\sigma}^{2}}{N} \end{align}$$
mu = N.mean(x) # Mean
dmu = N.std(x, ddof=1) / N.sqrt(n) # Error on mean
print "Arithmetic mean: µ = %.4f ± %.4f" % (mu, dmu)
Arithmetic mean: µ = -0.0391 ± 0.0970
Weighted ("optimal") average: $$\begin{align} \hat{\mu}_w &= \frac{\sum_i w_i x_i}{\sum_i w_i} \quad\text{with}\quad w_i = \frac{1}{\d x_{i}^{2} + \sigma^2} \\ \Var(\hat{\mu}_w) &= \frac{1}{\sum_i w_i} \end{align}$$
muw, sumw = N.average(x, weights=dx ** -2, returned=True) # Optimally weighted mean
dmuw = N.sqrt(1 / sumw) # Uncertainty on weighted mean
print "x weighted mean: µ = %.4f ± %.4f" % (muw, dmuw)
print "y weighted mean (wrong disp.): µ = %.4f ± %.4f" % (N.average(y, weights=dx ** -2), dmuw)
muyw, sumwy = N.average(y, weights=1 / (1 + dx ** 2), returned=True)
dmuyw = N.sqrt(1 / sumwy)
print "y weighted mean (true disp.): µ = %.4f ± %.4f" % (muyw, dmuyw)
x weighted mean: µ = -0.0258 ± 0.0981
y weighted mean (wrong disp.): µ = -0.2177 ± 0.0981
y weighted mean (true disp.): µ = -0.2168 ± 0.1418
Attention:
In the optimally weighted mean, $\sigma_i^2 = 1/w_{i}$ is supposed to be the total variance of normally-distributed $x_i$, therefore including both measurement error $\d x_i$ and intrinsic distribution dispersion $\sigma$ if any: $\sigma_i^2 = \d x_i^2 + \sigma^2$.
If the weighting is not the true inverse variance $1/\sigma_i^2$, the weighted average remains unbiased but is no longer optimal (i.e. of minimal variance), and the quoted uncertainty $\Var(\hat{\mu}_w) = 1/\sum_i w_i$ is misestimated.
The reduced $\chi^2$ around the estimated mean, defined as $\chi_\nu^2 = 1/(N-1) \times \sum_i (x_i - \mu_w)^2 / \sigma_i^2$, should not be significantly different from one if the $\sigma_i$ are compatible with the observed dispersion. If this is not the case, the $\sigma_i$ have probably been misestimated, either because they ignore the intrinsic dispersion $\sigma$ or because the measurement uncertainties $\d x_i$ are themselves misestimated, or both.
print "DoF k=%d" % (n - 1)
chi2 = N.sum(((x - muw) / dx) ** 2)
p = SS.chisqprob(chi2, n - 1)
print "x (no disp.): χ² = %.2f, p-value=%.3f" % (chi2, p)
chi2 = N.sum(((y - N.average(y, weights=dx ** -2))/dx) ** 2) # Wrong dispersion: sigma_i**2 = dx**2
p = SS.chisqprob(chi2, n - 1)
print "y (wrong disp.): χ² = %.2f, p-value=%.3f" % (chi2, p)
chi2 = N.sum(((y - muyw) / N.hypot(1, dx)) ** 2) # Correct dispersion: sigma_i**2 = sigma**2 + dx**2
p = SS.chisqprob(chi2, n - 1)
print "y (true disp.): χ² = %.2f, p-value=%.3f" % (chi2, p)
DoF k=99
x (no disp.): χ² = 93.38, p-value=0.640
y (wrong disp.): χ² = 185.51, p-value=0.000
y (true disp.): χ² = 87.61, p-value=0.787
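For reference, the corresponding reduced $\chi^2_\nu = \chi^2/(N-1)$ values can be computed directly (a short sketch reusing the quantities defined above); only the mis-weighted case for $\v{y}$ should depart significantly from one.
print "x (no disp.):    χ²/ν = %.3f" % (N.sum(((x - muw) / dx) ** 2) / (n - 1))
print "y (wrong disp.): χ²/ν = %.3f" % (N.sum(((y - N.average(y, weights=dx ** -2)) / dx) ** 2) / (n - 1))
print "y (true disp.):  χ²/ν = %.3f" % (N.sum(((y - muyw) / N.hypot(1, dx)) ** 2) / (n - 1))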
Unweighted: The supposedly unbiased sample variance with respect to the unknown mean $\mu$ (but see discussion in Oliphant, 2006) is: $$\begin{align} \hat{\sigma}^2 &= \frac{1}{N-1} \sum_i (x_i - \hat{\mu})^2 \\ \Var(\hat{\sigma}^2) &= \frac{1}{N}\left(m^4 - \frac{N-3}{N-1}\sigma^4\right) \\ &\simeq \frac{1}{N} \times 2\hat{\sigma}^4 \quad\text{in the normal approximation} \\ \text{therefore}\quad \Var(\hat{\sigma}) &\simeq \frac{1}{2N}\hat{\sigma}^2 \\ \text{Ahn \& Fessler (2003):}\quad \Var(\hat{\sigma}) &= \frac{1}{2(N-1)}\hat{\sigma}^2 \quad\text{for}\quad N > 10, \end{align}$$ where $m^4$ denotes the 4th central moment.
sig = N.std(x, ddof=1)
dsig = sig / N.sqrt(2 * (n-1)) # Normal approximation, n>10
print "Dispersion: σ = %.4f ± %.4f" % (sig, dsig)
Dispersion: σ = 0.9704 ± 0.0690
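The general (non-normal) expression for $\Var(\hat{\sigma}^2)$ quoted above can also be estimated from the sample 4th central moment $m^4$; a minimal sketch using scipy.stats.moment and the delta-method relation $\Var(\hat{\sigma}) \simeq \Var(\hat{\sigma}^2)/(4\hat{\sigma}^2)$:
m4 = SS.moment(x, moment=4)                           # Sample 4th central moment
vars2 = (m4 - (n - 3.) / (n - 1.) * sig ** 4) / n     # Var(σ²), general expression
dsig2 = N.sqrt(vars2) / (2 * sig)                     # δσ ≈ δ(σ²)/(2σ)
print "Dispersion (4th-moment estimate): σ = %.4f ± %.4f" % (sig, dsig2)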
Weighted: $$ \hat{\sigma}_w^2 = \frac{\sum w_i}{\left(\sum w_i\right)^2 - \sum w_i^2} \times \sum_i w_i (x_i - \hat{\mu}_w)^2. $$
w = dx ** -2
sumw = N.sum(w)
sigw = N.sqrt( N.sum(w * (x - muw) ** 2) / sumw )
print "x Weighted dispersion (biased): σ = %.4f" % (sigw)
sigw = N.sqrt( N.sum(w * (x - muw) ** 2) * sumw / (sumw ** 2 - N.dot(w, w)) )
print "x Weighted dispersion (unbiased): σ = %.4f" % (sigw)
sigyw = N.sqrt( N.sum(w * (y - N.average(y, weights=w)) ** 2) * sumw / (sumw ** 2 - N.dot(w, w)) )
print "y Weighted dispersion (wrong disp., unbiased): σ = %.4f" % (sigyw)
wy = 1 / (1 + dx ** 2)
sigyw = N.sqrt( N.sum(wy * (y - muyw) ** 2) * sumwy / (sumwy ** 2 - N.dot(wy, wy)) )
print "y Weighted dispersion (true disp., unbiased): σ = %.4f" % (sigyw)
x Weighted dispersion (biased): σ = 0.9481
x Weighted dispersion (unbiased): σ = 0.9534
y Weighted dispersion (wrong disp., unbiased): σ = 1.3438
y Weighted dispersion (true disp., unbiased): σ = 1.3338
Weighted RMS is usually defined as:
def wrms(x, dx):
w = 1/dx**2
muw,sumw = N.average(x, weights=w, returned=True)
return N.sqrt( N.sum(w*(x-muw)**2)/sumw )
wrmsx = wrms(x,dx)
print "WRMS = %.4f" % wrmsx
WRMS = 0.9481
This strictly corresponds to the biased weighted variance. If using inverse-variance weighting $w_i = 1/\sigma_i^2$, one has $\mathrm{WRMS}^2 = \chi^2 \times \sigma_{\mu_w}^2$. Under the assumption that the intrinsic scatter is null ($\sigma\equiv 0$), the WRMS can be used as a proxy for the rescaled uncertainty on the weighted mean when the errors $\d x_i$ were misestimated by a multiplicative factor: $$ \sigma_{\mu_w}^{*2} = \chi^2_\nu \times \sigma_{\mu_w}^2 = \mathrm{WRMS}^2/(N-1). $$
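As a quick numerical check of the identity $\mathrm{WRMS}^2 = \chi^2 \times \sigma_{\mu_w}^2$ (a sketch reusing the x dataset and its original uncertainties):
w0 = dx ** -2                                         # Inverse-variance weights
muw0, sumw0 = N.average(x, weights=w0, returned=True)
chi20 = N.sum(w0 * (x - muw0) ** 2)                   # χ² around the weighted mean
print "WRMS² = %.4f, χ² × Var(µw) = %.4f" % (wrms(x, dx) ** 2, chi20 / sumw0)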
muw, sumw = N.average(x, weights=(0.1*dx) ** -2, returned=True) # Errors are wrong by a factor 10
dmuw = N.sqrt(1 / sumw)
print "x Weighted mean (wrong errors): µ = %.4f ± %.4f" % (muw, dmuw) # Wrong weighted mean uncertainty
dmur = wrms(x, 0.1 * dx) / N.sqrt(n - 1)
print "x Weighted mean (rescaled errors): µ = %.4f ± %.4f" % (muw, dmur) # Rescaled weighted mean uncertainty
x Weighted mean (wrong errors): µ = -0.0258 ± 0.0098
x Weighted mean (rescaled errors): µ = -0.0258 ± 0.0953
Note: the weighted mean $\hat{\mu}_{w}$ is unaffected by a scaling error on the errors $\d x_{i}$.
Bayesian estimators of the sample mean, variance and standard deviation are studied in Oliphant, 2006, and implemented in scipy.stats.mvsdist (Bayesian estimates of the sample mean, variance and standard deviation PDFs) and scipy.stats.bayes_mvs (Bayesian confidence intervals).
(mu, (lmu, hmu)), (var, (lvar, hvar)), (sig, (lsig, hsig)) = SS.bayes_mvs(x, alpha=0.683)
print "Bayesian estimates: µ = %.4f +%.4f -%.4f" % (mu, hmu-mu, mu-lmu)
print "Bayesian estimates: σ = %.4f +%.4f -%.4f" % (sig, hsig-sig, sig-lsig)
Bayesian estimates: µ = -0.0391 +0.0976 -0.0976
Bayesian estimates: σ = 0.9779 +0.0696 -0.0697
Median-based estimators are robust (up to 50% of the points can be corrupted), but have a low efficiency. More pertinent estimators can be defined.
The variance of the median $m$ for large samples and normal distributions is: $$ \Var(m) = \frac{\pi}{2}\frac{s^{2}}{n}, $$ where $s$ is an (presumably robust) estimate of the scale.
Median Absolute Deviation: $$ \mathrm{MAD} = \med(| x - \med(x) |) $$
Inter-Quartile Range: this is the 25%-trimmed range, i.e. the difference between the 75th- and the 25th-percentiles of a sample.
The MAD has the optimal breakdown point of 0.5: up to half of the sample points can be made arbitrarily large before the estimate itself becomes arbitrarily large. The IQR has a breakdown point of 0.25. (By comparison, the sample standard deviation has a breakdown point of 0.)
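A quick numerical illustration of this breakdown behaviour (a sketch: a copy of the x sample is corrupted with a fraction of large outliers, and the standard deviation is compared to the normalized MAD introduced below):
xc = x.copy()
xc[:30] = 1e3                        # Corrupt 30% of the sample with large outliers
print "Corrupted sample: std = %.4g, 1.4826 × MAD = %.4g" % (
    N.std(xc, ddof=1), 1.4826 * N.median(N.abs(xc - N.median(xc))))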
For a normal distribution, these are related to the standard deviation $\sigma$ by:
$$\begin{align}
\hat{\sigma} &= \frac{1}{\Phi^{-1}(3/4)}\mathrm{MAD}
\simeq 1.4826\times\mathrm{MAD} \\
% 1/SS.norm.ppf(0.75) = 1.4826
&= \frac{1}{2\Phi^{-1}(3/4)} \mathrm{IQR} \simeq 0.7413\times\mathrm{IQR}
% 2*sqrt(2)*scipy.special.erfinv(0.5) = 2*SS.norm.ppf(0.75) = 1.349
\end{align}$$
where $\Phi$ is the standard normal cumulative distribution function (scipy.stats.norm.cdf), and its inverse $\Phi^{-1}$ is known as the normal quantile function, or probit function (scipy.stats.norm.ppf). Note that the MAD is biased for a low number of points (see Croux & Rousseeuw, 1992).
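These normalization constants can be checked directly from the normal quantile function (a one-line sketch):
print "MAD factor = %.4f, IQR factor = %.4f" % (1 / SS.norm.ppf(0.75), 0.5 / SS.norm.ppf(0.75))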
Unfortunately, $\mathrm{MAD}$ is only 37% as efficient as the sample standard deviation.
med = N.median(x)
nmad = N.median(N.abs(x - N.median(x))) * 1.4826 # Normalized Median Absolute Deviation (for normal distributions)
dmed = nmad / N.sqrt(2 * n / N.pi) # For large normal samples
print "Median: %.4f ± %.4f" % (med, dmed)
Median: -0.2036 ± 0.1304
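The corresponding IQR-based scale estimate (a sketch using numpy.percentile and the 0.7413 normalization valid for normal distributions):
q25 = N.percentile(x, 25)
q75 = N.percentile(x, 75)
sig_iqr = 0.7413 * (q75 - q25)       # σ estimate from the inter-quartile range
print "NMAD: σ = %.4f, IQR-based: σ = %.4f" % (nmad, sig_iqr)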
More efficient robust estimators of scale with maximal breakdown point are presented in Croux and Rousseeuw (1992).
Absolute Pairwise Differences: Rousseeuw and Croux (1993) define $$\begin{align} S_n &= 1.1926 \med_i \left( \med_j\left(\left| x_i - x_j \right|\right) \right), \\ Q_n &= c_n \times \text{first quartile of } \left\{ \left| x_i - x_j \right| : i < j \right\}, \end{align}$$ where $c_n \simeq 2.2219$ is the normalization ensuring consistency with $\sigma$ for normal distributions.
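A brute-force $O(N^2)$ sketch of these two estimators (without the small-sample correction factors discussed by Rousseeuw & Croux, 1993; the inner median of $S_n$ and the quartile of $Q_n$ are taken in their naive form):
def Sn(x):
    """Naive S_n estimator (no small-sample correction)."""
    diffs = N.abs(x[:, N.newaxis] - x[N.newaxis, :])   # |x_i - x_j| matrix
    return 1.1926 * N.median(N.median(diffs, axis=1))
def Qn(x):
    """Naive Q_n estimator (no small-sample correction), normalized for normal distributions."""
    i, j = N.triu_indices(len(x), k=1)                 # Pairs with i < j
    return 2.2219 * N.percentile(N.abs(x[i] - x[j]), 25)
print "S_n = %.4f, Q_n = %.4f" % (Sn(x), Qn(x))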
Consider a column-vector $\v{y}$ of $N$ measurements $y_i$, each of which is assumed to be Gaussian distributed with mean $F_i(\v{\theta})$ (hereafter the model $\v{F}(\v{\theta})$, depending on $M$ parameters $\theta_k$). The maximum-likelihood estimate $\hat{\v{\theta}}$ of the set of parameters $\v{\theta}$ is the one minimizing the $\chi^2$ defined by: $$ \chi^2(\v{\theta}) = (\v{y} - \v{F}(\v{\theta}))^T \cdot \m{V}^{-1} \cdot (\v{y} - \v{F}(\v{\theta})), $$ where $\m{V}$ is the (supposedly known) covariance matrix of the measurements: $V_{ij} = \Cov(y_i, y_j)$.
In that case, the minimum $\chi^2(\hat{\v{\theta}})$ follows a $\chi^2$-distribution with $k=N-M$ degrees of freedom. In particular, the covariance matrix of the fitted parameters can be estimated as $\Cov(\hat{\v{\theta}}) \simeq 2\,\m{H}^{-1}$, where $\m{H}$ is the Hessian matrix of the $\chi^2$ evaluated at $\hat{\v{\theta}}$:
\begin{align} H_{ij} &= \left.\frac{\partial^2\chi^2}{\partial\theta_i\partial\theta_j}\right|_{\v{\theta}=\hat{\v{\theta}}} \\ &= 2 \left.\frac{\partial\v{F}}{\partial\theta_i}\right|_{\hat{\v{\theta}}}^T \cdot \m{V}^{-1} \cdot \left.\frac{\partial\v{F}}{\partial\theta_j}\right|_{\hat{\v{\theta}}}. \end{align}
If the measurements are uncorrelated, the covariance matrix $\m{V}$ is purely diagonal, with $V_{ii} = \sigma_i^2$ the (supposedly known) variance of measurement $y_i$, and the previous equations can be rewritten:
\begin{align} \chi^2(\v{\theta}) &= \sum_{i=1}^N \frac{(y_i - F_i(\v{\theta}))^2}{\sigma_i^2}, \\ H_{ij} &= 2 \sum_{k=1}^N \frac{1}{\sigma_k^2} \frac{\partial F_k}{\partial\theta_i}\frac{\partial F_k}{\partial\theta_j}. \end{align}
We define two datasets:
def vec2corr(vec):
"""Define n×n correlation matrix from *vec* of length n, such that
corr[0,:] = corr[:,0] = vec."""
n = len(vec)
tmp = N.concatenate((vec[:0:-1],vec))
corr = N.array([ tmp[n-(i+1):2*n-(i+1)] for i in range(n) ])
return corr
n = 100
tau = 5. # Exponential correlation length
N.random.seed(12)
sig = N.sqrt(N.random.lognormal(sigma=0.3, size=n)) # Log-normally distributed uncertainties
cor = vec2corr(N.exp(-N.arange(n)/tau)) # Correlation matrix
cov = cor * N.outer(sig, sig) # Covariance matrix
x = N.random.normal(loc=0, scale=sig) # Uncorrelated gaussian measurements
y = N.dot(SL.sqrtm(cor), x) # Correlated gaussian measurements
P.errorbar(N.arange(n), x, yerr=sig, fmt='b.', label='x: uncorrelated')
P.errorbar(N.arange(n), y, yerr=sig, fmt='rs', label='y: correlated')
P.legend(loc='best')
The goodness-of-fit with respect to the constant value $\mu = 0$ is given by:
icov = SL.pinvh(cov)
def chi2(mu, x):
"""Estimate chi2 *without* accounting for covariance."""
return N.sum(((x - mu)/sig)**2)
def chi2Cov(mu, y):
"""Estimate chi2 accounting for covariance."""
return N.dot(y - mu, N.dot(icov, y - mu))
print "DoF: k=%d" % n
c = chi2(0, x) # χ²
p = SS.chisqprob(c, n) # p-value
z = SS.distributions.norm.isf(p) # z-value
print "x (uncorrelated): χ² = %.2f, p-value=%.3f (%.1f-sigma)" % (c, p, z)
c = chi2(0, y) # No account for covariance
p = SS.chisqprob(c, n) # p-value
z = SS.distributions.norm.isf(p) # z-value
print "y (correlated, w/o cov.): χ² = %.2f, p-value=%.3f (%.1f-sigma)" % (c, p, z)
c = chi2Cov(0, y) # Account of covariance
p = SS.chisqprob(c, n) # p-value
z = SS.distributions.norm.isf(p) # z-value
print "y (correlated, w/ cov.): χ² = %.2f, p-value=%.3f (%.1f-sigma)" % (c, p, z)
DoF: k=100
x (uncorrelated): χ² = 110.02, p-value=0.232 (0.7-sigma)
y (correlated, w/o cov.): χ² = 136.56, p-value=0.009 (2.4-sigma)
y (correlated, w/ cov.): χ² = 115.88, p-value=0.132 (1.1-sigma)
In this particular case, disregarding the covariance of the correlated dataset leads to a spurious 2.4-sigma tension with the true average $\mu = 0$.
Since the objective is to derive the maximum-likelihood estimate of the average value, the model is simply the 1-parameter function $\v{F} = (\mu, \ldots, \mu)$. In the presence of a covariance matrix $\m{V}$, one then has $\sigma_{\hat{\mu}}^2 = \left(\sum_{ij} (\m{V}^{-1})_{ij}\right)^{-1}$. For uncorrelated measurements, this reduces to $\sigma_{\hat{\mu}}^2 = \left(\sum_{i} 1/\sigma_i^2\right)^{-1}$ (the uncertainty on the variance-weighted mean, see above).
mus = N.linspace(-1, 1, 1000)
print "DoF: k=%d" % (n-1)
def p2z(p):
"""Express the input one-sided *p*-value as a sigma equivalent
significance from a normal distribution (the so-called *z*-value)."""
return N.where(p >= 0.5, N.nan, SS.distributions.norm.isf(p))
xchi2s = [ chi2(mu, x) for mu in mus ]
iminx = N.argmin(xchi2s) # Poor man's minimization!
mux = mus[iminx]
dmux = N.sqrt(1/N.sum(1/sig**2))
print "x ML mean: µ = %+.4f ± %.4f (%.1f-sigma)" % (mux, dmux, N.abs(mux)/dmux)
c = chi2(mux, x) # χ²
p = SS.chisqprob(c, n-1) # p-value
z = p2z(p) # z-value
print " χ² = %.2f, p-value=%.3f (%.1f-sigma)" % (c, p, z)
ychi2s = [ chi2(mu, y) for mu in mus ]
iminy = N.argmin(ychi2s) # Poor man's minimization!
muy = mus[iminy]
dmuy = N.sqrt(1/N.sum(1/sig**2))
print "y ML mean (w/o cov): µ = %+.4f ± %.4f (%.1f-sigma)" % (muy, dmuy, N.abs(muy)/dmuy)
c = chi2(muy, y) # No account for covariance
p = SS.chisqprob(c, n-1) # p-value
z = p2z(p) # z-value
print " χ² = %.2f, p-value=%.3f (%.1f-sigma)" % (c, p, z)
ychi2Covs = [ chi2Cov(mu, y) for mu in mus ]
iminyCov = N.argmin(ychi2Covs) # Poor man's minimization!
muyCov = mus[iminyCov]
dmuyCov = N.sqrt(1/N.sum(icov)) # Simple expression of dµ derived from model
print "y ML mean (w/ cov): µ = %+.4f ± %.4f (%.1f-sigma)" % (muyCov, dmuyCov, N.abs(muyCov)/dmuyCov)
c = chi2Cov(muyCov, y) # Account of covariance
p = SS.chisqprob(c, n-1) # p-value
z = p2z(p) # z-value
print " χ² = %.2f, p-value=%.3f (%.1f-sigma)" % (c, p, z)
DoF: k=99
x ML mean: µ = -0.1872 ± 0.0955 (2.0-sigma)
 χ² = 106.20, p-value=0.292 (0.5-sigma)
y ML mean (w/o cov): µ = -0.6777 ± 0.0955 (7.1-sigma)
 χ² = 86.06, p-value=0.820 (nan-sigma)
y ML mean (w/ cov): µ = -0.2993 ± 0.1979 (1.5-sigma)
 χ² = 113.60, p-value=0.150 (1.0-sigma)
The effect of neglecting the covariance of correlated data is even stronger on the ML estimate of the mean: the estimated value lies more than 7 sigma away from the true value, with an unrealistically good $\chi^2_\min$.
P.plot(mus, xchi2s, 'g:')
P.plot(mus, ychi2s, 'r--')
P.plot(mus, ychi2Covs, 'b-')
P.errorbar([mux],[xchi2s[iminx]], xerr=dmux, fmt='g^', label='UNcorrelated')
P.errorbar([muy],[ychi2s[iminy]], xerr=dmuy, fmt='rs', label='Correlated (w/o cov)')
P.errorbar([muyCov],[ychi2Covs[iminyCov]], xerr=dmuyCov, fmt='bo', label='Correlated (w/ cov)')
P.xlabel(u"µ")
P.ylabel(u"χ²")
P.legend(loc='best', numpoints=1)
P.grid()
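The same machinery applies to models with more than one parameter. As an illustration of $\Cov(\hat{\v{\theta}}) \simeq 2\,\m{H}^{-1} = (\m{J}^T\m{V}^{-1}\m{J})^{-1}$, here is a minimal sketch for a hypothetical straight-line model $F_i = a + b\,t_i$ with uncorrelated, known uncertainties (all quantities below — t, sigs, yobs — are made up for the example):
N.random.seed(1)
t = N.linspace(0, 1, 20)                               # Hypothetical abscissae
sigs = 0.1 * N.ones_like(t)                            # Known measurement uncertainties
yobs = 1 + 2 * t + N.random.normal(scale=sigs)         # Simulated data, true (a, b) = (1, 2)
J = N.vstack((N.ones_like(t), t)).T                    # Jacobian dF/d(a, b)
Vinv = N.diag(sigs ** -2)                              # Inverse covariance (diagonal here)
covpar = SL.inv(N.dot(J.T, N.dot(Vinv, J)))            # Cov(theta) = (J^T V^-1 J)^-1 = 2 H^-1
a, b = N.dot(covpar, N.dot(J.T, N.dot(Vinv, yobs)))    # Normal equations (linear model)
print "a = %.3f ± %.3f, b = %.3f ± %.3f" % (a, N.sqrt(covpar[0, 0]), b, N.sqrt(covpar[1, 1]))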
Consider a set of $N$ quantities (measurements, parameter estimates, etc.) $\v{y} = (y_1, \ldots y_N)$ associated with a covariance matrix $V_{ij} = \Cov(y_i, y_j)$ (estimated from measurement or adjustment procedures), and a set of $M$ functions $\v{f} = (f_1, \ldots, f_M)$. The covariance matrix among the propagated values $\v{f}(\v{y})$ can be estimated from a Taylor expansion around $\v{y}$. To first order, one has: $$ \Cov(\v{f}(\v{y})) \simeq \m{J}\cdot\m{V}\cdot{\m{J}}^T $$ where the matrix of derivatives (the Jacobian) is given by $J_{ij} = \left.\frac{\partial f_i}{\partial y_j}\right|_{\v{y}}$.
This 1st-order expression is exact if $\v{f}$ is linear: $\v{f}(\v{y}) = \m{J}\cdot\v{y}$. If this is not the case, one has to make sure the functions are reasonably linear over the domain $\v{y} \pm \v{\sigma}_{\v{y}}$, or expand the Taylor series to higher orders.
The 1st-order expression can be rewritten in a non-matrix form: $$ \Cov(f_i, f_j) \simeq \sum_{k,l = 1}^{N} \frac{\partial f_i}{\partial y_k}\frac{\partial f_j}{\partial y_l} V_{kl}. $$ In the case of uncorrelated input quantities, $V_{kl} = \sigma_k^2 \delta_{kl}$ and one recovers the traditional uncertainty propagation expression: $$ \sigma_{f_i}^2 = \sum_{k=1}^{N} \left(\frac{\partial f_i}{\partial y_k}\right)^2 \sigma_k^2. $$
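A minimal sketch of generic first-order propagation with a finite-difference Jacobian (the cart2pol transformation and the input covariance Vin below are made up for the example):
def propagate(f, y, V, eps=1e-6):
    """1st-order covariance propagation Cov(f(y)) ~ J V J^T,
    with the Jacobian J estimated by forward finite differences."""
    y = N.asarray(y, dtype=float)
    f0 = N.asarray(f(y), dtype=float)
    J = N.empty((len(f0), len(y)))
    for j in range(len(y)):
        dy = N.zeros_like(y)
        dy[j] = eps * max(abs(y[j]), 1.)
        J[:, j] = (N.asarray(f(y + dy), dtype=float) - f0) / dy[j]
    return N.dot(J, N.dot(V, J.T))
cart2pol = lambda v: [N.hypot(v[0], v[1]), N.arctan2(v[1], v[0])]  # (x, y) -> (r, phi)
Vin = N.array([[0.010, 0.005],
               [0.005, 0.020]])                # Made-up input covariance
print propagate(cart2pol, [1., 2.], Vin)       # Approximate covariance of (r, phi)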
Consider a spectrum made of $N$ flux measurements $\v{y} = (y_1, \ldots y_N)$ associated with a covariance matrix $V_{ij} = \Cov(y_i, y_j)$. One wants to compute the synthetic photometry (i.e. the integrated flux) in bands $A$ and $B$, \begin{align} A &= \sum_i \alpha_i y_i \\ B &= \sum_i \beta_i y_i \end{align} where the fixed coefficients $\alpha_i$ and $\beta_i$ define the filter bandpasses, as well as the resulting variances and covariance.
In this simple linear case, fluctuations $\delta A$ are directly related to flux fluctuations $\delta y_i$: $$ \delta A = \sum_i \alpha_i \delta y_i $$ and therefore \begin{align} \sigma_A^2 &= \langle \delta^2 A \rangle \\ &= \sum_{i,j} \alpha_i \alpha_j \langle \delta y_i \delta y_j \rangle \\ &= \v{\alpha} \cdot \m{V} \cdot \v{\alpha}^T \\ \Cov(A, B) &= \langle \delta A \delta B \rangle \\ &= \sum_{i,j} \alpha_i \beta_j \langle \delta y_i \delta y_j \rangle \\ &= \v{\alpha} \cdot \m{V} \cdot \v{\beta}^T \end{align}
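A minimal sketch of these expressions (the top-hat bandpass coefficients and the flux covariance matrix below are made up, reusing vec2corr defined above to build a correlated V):
nl = 50                                            # Hypothetical number of spectral bins
alpha = (N.arange(nl) < 30).astype(float)          # Made-up band A (bins 0-29)
beta = (N.arange(nl) >= 20).astype(float)          # Made-up band B (bins 20-49), overlapping A
V = 0.04 * vec2corr(N.exp(-N.arange(nl) / 3.))     # Made-up correlated flux covariance
varA = N.dot(alpha, N.dot(V, alpha))               # Var(A) = alpha . V . alpha
varB = N.dot(beta, N.dot(V, beta))                 # Var(B) = beta . V . beta
covAB = N.dot(alpha, N.dot(V, beta))               # Cov(A, B) = alpha . V . beta
print "σ_A = %.3f, σ_B = %.3f, ρ(A, B) = %.3f" % (
    N.sqrt(varA), N.sqrt(varB), covAB / N.sqrt(varA * varB))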
Correlation coefficients measure the relationship between two datasets $\{x_{i}\}$ and $\{y_{j}\}$.
Besides standard statistics handbooks,