Consider a set of observations $D=\{x_1,…,x_N\}$ in the 2-dimensional plane (see Figure). All observations were generated by the same process. We now draw an extra observation $x_\bullet = (a,b)$ from the same data generating process. What is the probability that $x_\bullet$ lies within the shaded rectangle $S$?
using Pkg; Pkg.activate("../."); Pkg.instantiate();
using IJulia; try IJulia.clear_output(); catch _ end
using Distributions, Plots, LaTeXStrings
N = 100
generative_dist = MvNormal([0,1.], [0.8 0.5; 0.5 1.0])
D = rand(generative_dist, N) # Generate observations from generative_dist
scatter(D[1,:], D[2,:], marker=:x, markerstrokewidth=3, label=L"D")
x_dot = rand(generative_dist) # Generate x∙
scatter!([x_dot[1]], [x_dot[2]], label=L"x_\bullet")
plot!(range(0, 2), [1., 1., 1.], fillrange=2, alpha=0.4, color=:gray, label=L"S") # shade the rectangle S = [0,2] × [1,2]
where $|\Sigma| \triangleq \mathrm{det}(\Sigma)$ is the determinant of $\Sigma$.
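As a quick check of this definition, the following sketch evaluates the multivariate Gaussian pdf directly from the formula (using det and inv) and compares it with the Distributions.jl implementation; the test point and parameters are arbitrary choices for illustration.
using Distributions, LinearAlgebra
# Evaluate N(x|μ,Σ) directly from the formula, using |Σ| = det(Σ)
function gaussian_pdf(x, μ, Σ)
    d = length(μ)
    return exp(-0.5 * (x - μ)' * inv(Σ) * (x - μ)) / sqrt((2π)^d * det(Σ))
end
μ_test = [0.0, 1.0]; Σ_test = [0.8 0.5; 0.5 1.0] # same parameters as generative_dist above
x_test = [0.5, 0.5] # arbitrary test point
gaussian_pdf(x_test, μ_test, Σ_test) ≈ pdf(MvNormal(μ_test, Σ_test), x_test) # should evaluate to true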
Let's estimate a constant $\theta$ from one 'noisy' measurement $x$ about that constant.
We assume the following measurement equations (the tilde $\sim$ means: 'is distributed as'): $$\begin{align*} x &= \theta + \epsilon \\ \epsilon &\sim \mathcal{N}(0,\sigma^2) \end{align*}$$
which we recognize as a Gaussian distribution w.r.t. $\theta$.
where $$\begin{align*} \frac{1}{\sigma_1^2} &= \frac{\sigma_0^2 + \sigma^2}{\sigma^2 \sigma_0^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2} \\ \mu_1 &= \frac{\sigma_0^2 x + \sigma^2 \mu_0}{\sigma^2 + \sigma_0^2} = \sigma_1^2 \, \left( \frac{1}{\sigma_0^2} \mu_0 + \frac{1}{\sigma^2} x \right) \end{align*}$$
where $$\begin{align*} \Sigma_c^{-1} &= \Sigma_a^{-1} + \Sigma_b^{-1} \\ \Sigma_c^{-1} \mu_c &= \Sigma_a^{-1}\mu_a + \Sigma_b^{-1}\mu_b \end{align*}$$
using Plots, Distributions, LaTeXStrings
d1 = Normal(0, 1) # μ=0, σ^2=1
d2 = Normal(3, 2) # μ=3, σ^2=4
# Calculate the parameters of the product d1*d2
s2_prod = (d1.σ^-2 + d2.σ^-2)^-1
m_prod = s2_prod * ((d1.σ^-2)*d1.μ + (d2.σ^-2)*d2.μ)
d_prod = Normal(m_prod, sqrt(s2_prod)) # the product d1*d2 is proportional to this Gaussian; its scale factor (normalization constant Z) is neglected here
# Plot stuff
x = range(-4, stop=8, length=100)
plot(x, pdf.(d1,x), label=L"\mathcal{N}(0,1)", fill=(0, 0.1)) # Plot the first Gaussian
plot!(x, pdf.(d2,x), label=L"\mathcal{N}(3,4)", fill=(0, 0.1)) # Plot the second Gaussian
plot!(x, pdf.(d1,x) .* pdf.(d2,x), label=L"\mathcal{N}(0,1) \mathcal{N}(3,4)", fill=(0, 0.1)) # Plot the exact product
plot!(x, pdf.(d_prod,x), label=L"Z^{-1} \mathcal{N}(0,1) \mathcal{N}(3,4)", fill=(0, 0.1)) # Plot the normalized Gaussian product
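The same precision-weighted rule applies to multivariate Gaussians. Below is a minimal sketch for two 2-dimensional Gaussians; the parameters μ_a, Σ_a, μ_b, Σ_b are arbitrary illustrative choices.
using Distributions, LinearAlgebra
μ_a, Σ_a = [0.0, 1.0], [1.0 0.2; 0.2 0.5] # illustrative parameters
μ_b, Σ_b = [2.0, 0.0], [0.6 0.0; 0.0 1.5]
Σ_c = Hermitian(inv(inv(Σ_a) + inv(Σ_b))) # Σ_c⁻¹ = Σ_a⁻¹ + Σ_b⁻¹ (Hermitian wrapper guards against rounding asymmetry)
μ_c = Σ_c * (inv(Σ_a)*μ_a + inv(Σ_b)*μ_b) # Σ_c⁻¹ μ_c = Σ_a⁻¹ μ_a + Σ_b⁻¹ μ_b
c = MvNormal(μ_c, convert(Matrix, Σ_c)) # normalized Gaussian proportional to the product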
and the same prior for $\theta$: $$ \theta \sim \mathcal{N}(\mu_0,\sigma_0^2) $$
which is a product of $N+1$ Gaussians and is therefore also Gaussian-distributed.
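To make this concrete, here is a minimal sketch that folds the precision-weighted product over $N$ observations, one factor at a time; the noise variance, prior parameters and data are arbitrary illustrative choices. The result matches the closed-form posterior $1/\sigma_N^2 = 1/\sigma_0^2 + N/\sigma^2$ and $\mu_N = \sigma_N^2\left(\mu_0/\sigma_0^2 + \sum_n x_n/\sigma^2\right)$.
σ2, μ0, σ02 = 0.5, 0.0, 1.0 # noise variance and prior parameters (illustrative)
x_obs = [1.2, 0.8, 1.5, 0.9, 1.1] # observations (illustrative)
# Fold the prior and the N likelihood factors into one Gaussian, one factor at a time
function posterior_params(x_obs, μ0, σ02, σ2)
    precision = 1/σ02
    weighted_mean = μ0/σ02
    for x in x_obs
        precision += 1/σ2     # precisions add
        weighted_mean += x/σ2 # precision-weighted means add
    end
    return weighted_mean/precision, 1/precision # (μ_N, σ_N²)
end
μN, σN2 = posterior_params(x_obs, μ0, σ02, σ2)
# Closed-form check
σN2 ≈ 1/(1/σ02 + length(x_obs)/σ2) && μN ≈ σN2*(μ0/σ02 + sum(x_obs)/σ2) # should evaluate to true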
using Plots, LaTeXStrings, Distributions
# Define the joint distribution p(x,y)
μ = [1.0; 2.0]
Σ = [0.3 0.7;
0.7 2.0]
joint = MvNormal(μ,Σ)
# Define the marginal distribution p(x)
marginal_x = Normal(μ[1], sqrt(Σ[1,1]))
# Plot p(x,y)
x_range = y_range = range(-2,stop=5,length=1000)
joint_pdf = [ pdf(joint, [x_range[i];y_range[j]]) for j=1:length(y_range), i=1:length(x_range)]
plot_1 = heatmap(x_range, y_range, joint_pdf, title = L"p(x, y)")
# Plot p(x)
plot_2 = plot(x_range, pdf.(marginal_x, x_range), title = L"p(x)", label="", fill=(0, 0.1))
# Plot p(y|x = 0.1)
x = 0.1
conditional_y_m = μ[2]+Σ[2,1]*inv(Σ[1,1])*(x-μ[1])
conditional_y_s2 = Σ[2,2] - Σ[2,1]*inv(Σ[1,1])*Σ[1,2]
conditional_y = Normal(conditional_y_m, sqrt(conditional_y_s2))
plot_3 = plot(y_range, pdf.(conditional_y, y_range), title = L"p(y|x = %$x)", label="", fill=(0, 0.1))
plot(plot_1, plot_2, plot_3, layout=(1,3), size=(1200,300))
As is clear from the plots, the conditional distribution is a renormalized slice from the joint distribution.
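We can verify this claim numerically by reusing joint, marginal_x and conditional_y from the cell above: dividing the slice of the joint pdf at $x=0.1$ by the marginal value $p(x=0.1)$ should reproduce the conditional pdf (the test value for $y$ is an arbitrary choice).
y_test = 1.7 # arbitrary test value for y
pdf(joint, [x; y_test]) / pdf(marginal_x, x) ≈ pdf(conditional_y, y_test) # should evaluate to true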
with $$\begin{align*} K &= \frac{\sigma_0^2}{\sigma_0^2+\sigma^2} \qquad \text{($K$ is called: Kalman gain)}\\ \mu_1 &= \mu_0 + K \cdot (x-\mu_0)\\ \sigma_1^2 &= \left( 1-K \right) \sigma_0^2 \end{align*}$$
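As a sanity check, the Kalman-gain form above yields the same posterior parameters as the precision form derived earlier; the sketch below uses arbitrary illustrative values for $\mu_0$, $\sigma_0^2$, $\sigma^2$ and $x$.
μ0, σ02, σ2, x = 0.0, 1.0, 0.5, 1.2 # illustrative values
# Precision form
σ12 = 1/(1/σ02 + 1/σ2)
μ1 = σ12*(μ0/σ02 + x/σ2)
# Kalman-gain form
K = σ02/(σ02 + σ2)
μ1_K = μ0 + K*(x - μ0)
σ12_K = (1 - K)*σ02
(μ1 ≈ μ1_K) && (σ12 ≈ σ12_K) # should evaluate to true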
Consider the signal $x_t=\theta+\epsilon_t$ with $\epsilon_t \sim \mathcal{N}(0,\sigma^2)$, where $D_t= \left\{x_1,\ldots,x_t\right\}$ is observed sequentially (over time).
Problem: Derive a recursive algorithm for $p(\theta|D_t)$, i.e., an update rule for (posterior) $p(\theta|D_t)$ based on (prior) $p(\theta|D_{t-1})$ and (new observation) $x_t$.
Let's define the estimate after $t$ observations (i.e., our solution ) as $p(\theta|D_t) = \mathcal{N}(\theta\,|\,\mu_t,\sigma_t^2)$.
We define the joint distribution for $\theta$ and $x_t$, given background $D_{t-1}$, by
with $$\begin{align*} K_t &= \frac{\sigma_{t-1}^2}{\sigma_{t-1}^2+\sigma^2} \qquad \text{(Kalman gain)}\\ \mu_t &= \mu_{t-1} + K_t \cdot (x_t-\mu_{t-1})\\ \sigma_t^2 &= \left( 1-K_t \right) \sigma_{t-1}^2 \end{align*}$$
using Plots, Distributions
n = 100 # specify number of observations
θ = 2.0 # true value of the parameter we would like to estimate
noise_σ2 = 0.3 # variance of observation noise
observations = sqrt(noise_σ2) .* randn(n) .+ θ # draw noisy observations x_t = θ + ϵ_t with Var(ϵ_t) = noise_σ2
function perform_kalman_step(prior :: Normal, x :: Float64, noise_σ2 :: Float64)
    prior_σ2 = var(prior) # prior variance σ²_{t-1}
    K = prior_σ2 / (noise_σ2 + prior_σ2) # compute the Kalman gain
    posterior_μ = mean(prior) + K*(x - mean(prior)) # update the posterior mean
    posterior_σ2 = prior_σ2 * (1.0 - K) # update the posterior variance
    return Normal(posterior_μ, sqrt(posterior_σ2)) # return the posterior distribution
end
post_μ = fill!(Vector{Float64}(undef,n + 1), NaN) # means of p(θ|D) over time
post_σ2 = fill!(Vector{Float64}(undef,n + 1), NaN) # variances of p(θ|D) over time
prior = Normal(0, 1) # specify the prior distribution (you can play with the parameterization of this to get a feeling of how the Kalman filter converges)
post_μ[1] = prior.μ # save prior mean and variance to show these in plot
post_σ2[1] = var(prior)
for (i, x) in enumerate(observations) # note that this loop demonstrates Bayesian learning on streaming data; we update the prior distribution using observation(s), after which this posterior becomes the new prior for future observations
posterior = perform_kalman_step(prior, x, noise_σ2) # compute the posterior distribution given the observation
post_μ[i + 1] = posterior.μ # save the mean of the posterior distribution
post_σ2[i + 1] = var(posterior) # save the variance of the posterior distribution
prior = posterior # the posterior becomes the prior for future observations
end
obs_scale = collect(2:n+1) # observation x_t arrives at time step t (plotted from step 2)
scatter(obs_scale, observations, label=L"D") # scatter the observations
post_scale = collect(1:n+1) # posterior estimates start at step 1 with the prior
plot!(post_scale, post_μ, ribbon=sqrt.(post_σ2), linewidth=3, label=L"p(θ | D_t)") # lineplot our estimated means of intermediate posterior distributions
plot!(post_scale, θ*ones(n + 1), linewidth=2, label=L"θ") # plot the true value of θ
The shaded area represents one standard deviation of the posterior $p(\theta|D_t)$ on either side of the mean. The posterior variance is guaranteed to decrease monotonically for the standard Kalman filter, since $\sigma_t^2 = (1-K_t)\,\sigma_{t-1}^2$ with $0 < K_t < 1$.
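This can be checked directly on the posterior variances stored in the simulation above:
all(diff(post_σ2) .< 0) # the posterior variance shrinks at every step; should evaluate to true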
where $\mathrm{K}_n(z)$ is a modified Bessel function of the second kind.
using Plots, Distributions, SpecialFunctions, LaTeXStrings
X = Normal(0,1)
Y = Normal(0,1)
pdf_product_std_normals(z::Vector) = (besselk.(0, abs.(z))./π)
range1 = collect(range(-4,stop=4,length=100))
plot(range1, pdf.(X, range1), label=L"p(X)=p(Y)=\mathcal{N}(0,1)", fill=(0, 0.1))
plot!(range1, pdf.(X,range1).*pdf.(Y,range1), label=L"p(X)*p(Y)", fill=(0, 0.1))
plot!(range1, pdf_product_std_normals(range1), label=L"p(Z=X*Y)", fill=(0, 0.1))
We apply maximum likelihood estimation to fit a 2-dimensional Gaussian model ($m$) to data set $D$. Next, we evaluate $p(x_\bullet \in S | m)$ by (numerical) integration of the Gaussian pdf over $S$: $p(x_\bullet \in S | m) = \int_S p(x|m) \mathrm{d}x$.
using HCubature, LinearAlgebra, Plots, Distributions # HCubature provides numerical integration
# Maximum likelihood estimation of 2D Gaussian
N = size(D, 2) # number of observations
μ = 1/N * sum(D,dims=2)[:,1]
D_min_μ = D - repeat(μ, 1, N)
Σ = Hermitian(1/N * D_min_μ*D_min_μ')
m = MvNormal(μ, convert(Matrix, Σ));
contour(range(-3, 4, length=100), range(-3, 4, length=100), (x, y) -> pdf(m, [x, y]))
# Numerical integration of p(x|m) over S:
(val,err) = hcubature((x)->pdf(m,x), [0., 1.], [2., 2.])
println("p(x⋅∈S|m) ≈ $(val)")
scatter!(D[1,:], D[2,:], marker=:x, markerstrokewidth=3, label=L"D")
scatter!([x_dot[1]], [x_dot[2]], label=L"x_\bullet")
plot!(range(0, 2), [1., 1., 1.], fillrange=2, alpha=0.4, color=:gray, label=L"S")
p(x⋅∈S|m) ≈ 0.23491845066574196
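As a cross-check on the numerical integration, we can also estimate $p(x_\bullet \in S|m)$ by Monte Carlo: draw samples from the fitted model $m$ and count the fraction that lands in $S = [0,2] \times [1,2]$. This is a minimal sketch; the sample size is an arbitrary choice, so the estimate will fluctuate around the value computed above.
samples = rand(m, 10^6) # draw many samples from the fitted model
in_S = (0 .<= samples[1,:] .<= 2) .& (1 .<= samples[2,:] .<= 2) # indicator of membership in S
println("Monte Carlo estimate of p(x⋅∈S|m) ≈ $(mean(in_S))")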
where $$\begin{align*} \Sigma_c^{-1} &= \Sigma_a^{-1} + \Sigma_b^{-1} \\ \Sigma_c^{-1} \mu_c &= \Sigma_a^{-1}\mu_a + \Sigma_b^{-1}\mu_b \end{align*}$$
can be decomposed as $$\begin{align*} p(y|x) &= \mathcal{N}\left(y\,|\,\mu_y + \Sigma_{yx}\Sigma_x^{-1}(x-\mu_x),\, \Sigma_y - \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy} \right) \\ p(x) &= \mathcal{N}\left( x|\mu_x, \Sigma_x \right) \end{align*}$$
where $a>0$ and $b>0$ are known as the shape and rate parameters, respectively.
with $$\begin{align*} a_N &= a_0 + \frac{N}{2} \qquad &&\text{(B-2.150)} \\ b_N &= b_0 + \frac{1}{2}\sum_n \left( x_n-\mu\right)^2 \qquad &&\text{(B-2.151)} \end{align*}$$
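A minimal sketch of this update with Distributions.jl; the prior parameters, known mean $\mu$ and data are arbitrary illustrative choices. Note that Distributions.jl parameterizes Gamma by shape and scale, so a rate $b$ corresponds to scale $1/b$.
using Distributions
a0, b0 = 2.0, 1.0 # prior shape and rate (illustrative)
μ = 0.0 # known mean of the Gaussian (illustrative)
xs = [0.3, -0.5, 1.1, 0.2, -0.8] # observations (illustrative)
aN = a0 + length(xs)/2 # (B-2.150)
bN = b0 + 0.5*sum((xs .- μ).^2) # (B-2.151)
posterior = Gamma(aN, 1/bN) # Gamma(shape, scale); scale = 1/rate
mean(posterior) # posterior mean of the precision, aN/bN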