
Continuous Data and the Gaussian Distribution

[1] (##) We are given an IID data set $D = \{x_1,x_2,\ldots,x_N\}$, where $x_n \in \mathbb{R}^M$. Let's assume that the data were drawn from a multivariate Gaussian (MVG),

$$\begin{align*} p(x_n|\theta) = \mathcal{N}(x_n|\,\mu,\Sigma) = \frac{1}{\sqrt{(2 \pi)^{M} |\Sigma|}} \exp\left\{-\frac{1}{2}(x_n-\mu)^T \Sigma^{-1} (x_n-\mu) \right\} \end{align*}$$

(a) Derive the log-likelihood of the parameters for these data.

(a) Let $\theta = \{\mu,\Sigma\}$. Then the log-likelihood can be worked out as

$$\begin{align*} \log p(D|\theta) &= \log \prod_n p(x_n|\theta) \\ &= \log \prod_n \mathcal{N}(x_n|\mu, \Sigma) \\ &= \log \prod_n (2\pi)^{-M/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2}(x_n-\mu)^T \Sigma^{-1}(x_n-\mu)\right\} \\ &= \sum_n \left( \log (2\pi)^{-M/2} + \log |\Sigma|^{-1/2} -\frac{1}{2}(x_n-\mu)^T \Sigma^{-1}(x_n-\mu)\right) \\ &= \frac{N}{2}\log |\Sigma|^{-1} - \frac{1}{2}\sum_n (x_n-\mu)^T \Sigma^{-1}(x_n-\mu) + \text{const.} \end{align*}$$

(b) Derive the maximum likelihood estimates for the mean $\mu$ and covariance $\Sigma$ by setting the derivative of the log-likelihood to zero.

(b) First we take the derivative with respect to the mean.

$$\begin{align*} \nabla_{\mu} \log p(D|\theta) &\propto - \sum_n \nabla_{\mu} \left(x_n-\mu \right)^T\Sigma^{-1}\left(x_n-\mu \right) \\ &= - \sum_n \nabla_{\mu} \left(-2 \mu^T\Sigma^{-1}x_n + \mu^T \Sigma^{-1}\mu \right) \\ &= - \sum_n \left(-2 \Sigma^{-1}x_n + 2\Sigma^{-1}\mu \right) \\ &= 2 \Sigma^{-1} \sum_n (x_n - \mu) \end{align*}$$

Setting the derivative to zero leads to $\hat{\mu} = \frac{1}{N}\sum_n x_n$.

The derivative with respect to the covariance is a bit more involved. It is actually easier to take the derivative with respect to the precision $\Sigma^{-1}$: $$\begin{align*} \nabla_{\Sigma^{-1}} \log p(D|\theta) &= \nabla_{\Sigma^{-1}} \left( \frac{N}{2} \log |\Sigma| ^{-1} -\frac{1}{2}\sum_n (x_n-\mu)^T \Sigma^{-1} (x_n-\mu)\right) \\ &= \nabla_{\Sigma^{-1}} \left( \frac{N}{2} \log |\Sigma| ^{-1} - \frac{1}{2}\sum_n \mathrm{Tr}\left[(x_n-\mu) (x_n-\mu)^T \Sigma^{-1} \right]\right) \\ &=\frac{N}{2}\Sigma - \frac{1}{2}\sum_n (x_n-\mu) (x_n-\mu)^T \end{align*}$$

Setting the derivative to zero leads to $\hat{\Sigma} = \frac{1}{N}\sum_n (x_n-\hat{\mu})(x_n-\hat{\mu})^T$.
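As a quick sanity check, the two estimators can be evaluated on synthetic data. The sketch below is only an illustration: the function name `mvg_ml_estimates`, the chosen "true" parameters and the random seed are assumptions made for this example, not part of the exercise.

```python
import numpy as np

def mvg_ml_estimates(X):
    """ML estimates for a multivariate Gaussian; X has shape (N, M)."""
    N = X.shape[0]
    mu_hat = X.mean(axis=0)                 # (1/N) sum_n x_n
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / N   # (1/N) sum_n (x_n - mu_hat)(x_n - mu_hat)^T
    return mu_hat, Sigma_hat

# Synthetic data from an assumed "true" Gaussian (illustrative values).
rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)

mu_hat, Sigma_hat = mvg_ml_estimates(X)
print(mu_hat)      # should be close to mu_true
print(Sigma_hat)   # should be close to Sigma_true
```

Note that the ML covariance estimate divides by $N$, so it matches `np.cov(X, rowvar=False, bias=True)` rather than NumPy's default unbiased estimate, which divides by $N-1$.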

[2] (#) Briefly explain why the Gaussian distribution is often preferred as a prior distribution over other distributions with the same support.

This answer follows directly from the lesson notebook. Aside from the computational advantages (operations on distributions tend to make them more Gaussian, and Gaussians tend to remain Gaussian under computational manipulations), the Gaussian distribution is also the maximum-entropy distribution among distributions over the real numbers with a given mean and variance. This means that there is no distribution with the same variance that assumes less information about its argument.
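The maximum-entropy claim can be illustrated by comparing closed-form differential entropies (in nats) of a few densities that share the same variance. This is a small sketch, not a proof; the comparison distributions (Laplace and uniform) and the variable names are choices made for this example. The uniform density is zero outside an interval but is still a valid density on the real line.

```python
import numpy as np

sigma = 1.0  # common standard deviation for all three densities

# Gaussian with variance sigma^2: h = 0.5 * log(2 * pi * e * sigma^2)
h_gauss = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)

# Laplace with scale b has variance 2 b^2, so b = sigma / sqrt(2); h = 1 + log(2 b)
b = sigma / np.sqrt(2.0)
h_laplace = 1.0 + np.log(2.0 * b)

# Uniform of width w has variance w^2 / 12, so w = sigma * sqrt(12); h = log(w)
w = sigma * np.sqrt(12.0)
h_uniform = np.log(w)

print(h_gauss, h_laplace, h_uniform)  # the Gaussian entropy is the largest
```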

[3] (###) We make $N$ IID observations $D=\{x_1 \dots x_N\}$ and assume the following model

$$\begin{aligned} x_k &= A + \epsilon_k \\ A &\sim \mathcal{N}(m_A,v_A) \\ \epsilon_k &\sim \mathcal{N}(0,\sigma^2) \,. \end{aligned}$$

We assume that $\sigma$ has a known value and are interested in deriving an estimator for $A$.

(a) Derive the Bayesian (posterior) estimate $p(A|D)$.

Since $p(D|A) = \prod_k \mathcal{N}(x_k|A,\sigma^2)$ is a Gaussian likelihood and $p(A)$ is a Gaussian prior, their product is proportional to a Gaussian. We will work this out with the canonical parameterization of the Gaussian since it is easier to multiply Gaussians in that domain. This means the posterior $p(A|D)$ is

$$\begin{align*} p(A|D) &\propto p(A) p(D|A) \\ &= \mathcal{N}(A|m_A,v_A) \prod_{k=1}^N \mathcal{N}(x_k|A,\sigma^2) \\ &= \mathcal{N}(A|m_A,v_A) \prod_{k=1}^N \mathcal{N}(A|x_k,\sigma^2) \\ &= \mathcal{N}_c\big(A \Bigm|\frac{m_A}{v_A},\frac{1}{v_A}\big)\prod_{k=1}^N \mathcal{N}_c\big(A\Bigm| \frac{x_k}{\sigma^2},\frac{1}{\sigma^2}\big) \\ &\propto \mathcal{N}_c\big(A \Bigm| \frac{m_A}{v_A} + \frac{1}{\sigma^2} \sum_k x_k , \frac{1}{v_A} + \frac{N}{\sigma^2} \big) \,, \end{align*}$$

where we have made use of the fact that precision-weighted means and precisions add when multiplying Gaussians. In principle this description of the posterior completes the answer.

(b) (##) Derive the Maximum Likelihood estimate for $A$.

The ML estimate can be found by

$$\begin{align*} \nabla \log p(D|A) &=0\\ \nabla \sum_k \log \mathcal{N}(x_k|A,\sigma^2) &= 0 \\ \nabla \frac{-1}{2}\sum_k \frac{(x_k-A)^2}{\sigma^2} &=0\\ \sum_k(x_k-A) &= 0 \\ \Rightarrow \hat{A}_{ML} = \frac{1}{N}\sum_{k=1}^N x_k \end{align*}$$

(c) Derive the MAP estimate for $A$.

The MAP estimate is simply the location where the posterior has its maximum value, which for a Gaussian posterior is its mean. In (a) we computed the precision-weighted mean, so we need to divide by the precision (or multiply by the variance) to get the mean:

$$\begin{align*} \hat{A}_{MAP} &= \left( \frac{m_A}{v_A} + \frac{1}{\sigma^2} \sum_k x_k\right)\cdot \left( \frac{1}{v_A} + \frac{N}{\sigma^2} \right)^{-1} \\ &= \frac{v_A \sum_k x_k + \sigma^2 m_A}{N v_A + \sigma^2} \end{align*}$$
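The posterior, ML and MAP results above can be compared on a tiny data set. The following is a minimal sketch; the function name `posterior_A` and the numerical values are assumptions picked so that the arithmetic is easy to follow by hand.

```python
import numpy as np

def posterior_A(x, m_A, v_A, sigma2):
    """Posterior mean and variance of A for known noise variance sigma2."""
    N = len(x)
    precision = 1.0 / v_A + N / sigma2                # precisions add
    weighted_mean = m_A / v_A + np.sum(x) / sigma2    # precision-weighted means add
    return weighted_mean / precision, 1.0 / precision

x = np.array([1.0, 2.0, 3.0])
m_A, v_A, sigma2 = 0.0, 1.0, 1.0

A_map, v_post = posterior_A(x, m_A, v_A, sigma2)
A_ml = x.mean()

# Closed-form MAP expression from the last line of the derivation.
A_map_closed = (v_A * np.sum(x) + sigma2 * m_A) / (len(x) * v_A + sigma2)

print(A_ml)          # 2.0
print(A_map)         # 1.5
print(v_post)        # 0.25
print(A_map_closed)  # 1.5
```

With these values the MAP estimate is pulled from the sample mean (2.0) toward the prior mean (0.0). As $v_A \to \infty$ the prior becomes uninformative and the MAP estimate approaches the ML estimate.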

(d) Now assume that we do not know the variance of the noise term. Describe the procedure for Bayesian estimation of both $A$ and $\sigma^2$ (no need to fully work out closed-form estimates).

A Bayesian treatment requires putting a prior on the unknown variance. The variance is constrained to be positive, hence the support of the prior distribution needs to be the positive reals. (In the multivariate case, positivity extends to symmetric positive definiteness.) Choosing a conjugate prior simplifies matters greatly. In this scenario the inverse Gamma distribution is the conjugate prior for the unknown variance (equivalently, a Gamma distribution for the precision $1/\sigma^2$). In the literature the corresponding joint prior over the mean and the precision is called a Normal-Gamma distribution. See https://www.seas.harvard.edu/courses/cs281/papers/murphy-2007.pdf for the analytical treatment.
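To give a feel for what such a conjugate update looks like, here is a hedged sketch of the standard Normal-Gamma update for a Gaussian with unknown mean and precision. Note the assumptions: this prior ties the prior variance of $A$ to the noise precision, $A \,|\, \lambda \sim \mathcal{N}(\mu_0, 1/(\kappa_0 \lambda))$ with $\lambda \sim \mathrm{Gamma}(\alpha_0, \beta_0)$ (rate parameterization), which differs from the independent prior $\mathcal{N}(m_A, v_A)$ used earlier in this exercise. The function and hyperparameter names are illustrative, not taken from the exercise.

```python
import numpy as np

def normal_gamma_update(x, mu0, kappa0, alpha0, beta0):
    """Conjugate Normal-Gamma posterior hyperparameters.

    Assumed prior (illustrative names):
        lam ~ Gamma(alpha0, rate=beta0),  A | lam ~ N(mu0, 1 / (kappa0 * lam)).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + 0.5 * n
    beta_n = (beta0
              + 0.5 * np.sum((x - xbar) ** 2)
              + 0.5 * kappa0 * n * (xbar - mu0) ** 2 / kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

# Synthetic observations with assumed true values A = 2 and sigma = 0.5.
rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=0.5, size=5000)

mu_n, kappa_n, alpha_n, beta_n = normal_gamma_update(x, 0.0, 1.0, 1.0, 1.0)
print(mu_n)               # posterior mean of A, close to 2
print(alpha_n / beta_n)   # posterior mean of the precision, close to 1 / 0.5**2 = 4
```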

[4] (##) Prove that a linear transformation $z=Ax+b$ of a Gaussian variable $\mathcal{N}(x|\mu,\Sigma)$ is Gaussian distributed as

$$ p(z) = \mathcal{N} \left(z \,|\, A\mu+b, A\Sigma A^T \right) $$

First, we show that a linear transformation of a Gaussian is a Gaussian. In general, for an invertible transformation $z=g(x)$, the transformed distribution is given by
$$ p_Z(z) = \frac{p_X(g^{-1}(z))}{\left|\mathrm{det}\left[\partial g/\partial x\right]\right|}\,.$$

Since the transformation is linear, the Jacobian is $\partial g/\partial x = A$ (assumed invertible here), so the denominator $|\mathrm{det}[A]|$ is independent of $z$. Moreover, $g^{-1}(z) = A^{-1}(z-b)$ is linear in $z$, so the exponent of $p_X(g^{-1}(z))$ remains quadratic in $z$. Consequently $p_Z(z)$ has the same functional form as $p_X(x)$, i.e. $p_Z(z)$ is also Gaussian. The mean and covariance can easily be determined by the calculation that we used in question 8 of the Probability Theory exercises. This results in

$$ p(z) = \mathcal{N}\left( z \,|\, A\mu+b, A\Sigma A^T \right) \,. $$
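The result can be checked empirically by pushing samples through the transformation and comparing sample moments with $A\mu+b$ and $A\Sigma A^T$. This is only a Monte Carlo sketch; all numerical values and the seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters of the input Gaussian and the linear map.
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
A = np.array([[2.0, 0.0], [1.0, -1.0]])
b = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)  # rows are samples
z = x @ A.T + b                                       # z_n = A x_n + b for each row

mean_pred = A @ mu + b       # predicted mean of z
cov_pred = A @ Sigma @ A.T   # predicted covariance of z

print(z.mean(axis=0), mean_pred)
print(np.cov(z, rowvar=False), cov_pred)
```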

[5] (#) Given independent variables

$x \sim \mathcal{N}(\mu_x,\sigma_x^2)$ and $y \sim \mathcal{N}(\mu_y,\sigma_y^2)$, what is the PDF for $z = A\cdot(x -y) + b$?

$z$ is also Gaussian with

$$ p_z(z) = \mathcal{N}(z \,|\, A(\mu_x-\mu_y)+b, \, A (\sigma_x^2 + \sigma_y^2) A^T) $$
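For scalar $A$ and $b$, the moments can be verified in two lines. This short worked step uses only linearity of expectation and the independence of $x$ and $y$, which makes their variances add:

$$\begin{align*} \mathrm{E}[z] &= A\left(\mathrm{E}[x]-\mathrm{E}[y]\right) + b = A(\mu_x-\mu_y)+b \\ \mathrm{Var}[z] &= A^2\,\mathrm{Var}[x-y] = A^2\left(\sigma_x^2+\sigma_y^2\right) \end{align*}$$

Gaussianity itself follows from the previous exercise: independent Gaussian $x$ and $y$ are jointly Gaussian, and $z$ is a linear transformation of the pair $(x,y)$.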

[6] (###) Compute

\begin{equation*} \int_{-\infty}^{\infty} \exp(-x^2)\mathrm{d}x \,. \end{equation*}

For a Gaussian with zero mean and variance equal to $1$ we have

$$ \int \frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}x^2) \mathrm{d}x = 1 $$

Substituting $x = \sqrt{2}y$, with $\mathrm{d}x=\sqrt{2}\,\mathrm{d}y$, leads directly to $\int_{-\infty}^{\infty} \exp(-y^2)\mathrm{d}y=\sqrt{\pi}$. If you don't want to use the result of the Gaussian integral, you can still do this integral directly (the standard route squares the integral and switches to polar coordinates), see youtube clip.
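Spelled out, the substitution turns the normalization condition into $\frac{\sqrt{2}}{\sqrt{2\pi}}\int \exp(-y^2)\,\mathrm{d}y = 1$, which gives $\sqrt{\pi}$. A quick numerical check is sketched below; the grid limits and spacing are arbitrary choices for this example, and the tails beyond $|x|=10$ are negligible.

```python
import numpy as np

# Riemann sum of exp(-x^2) on a fine grid over [-10, 10].
x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]
integral = np.sum(np.exp(-x**2)) * dx

print(integral, np.sqrt(np.pi))  # the two values should agree closely
```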