Consider a set of observations $D=\{x_1,…,x_N\}$ in the 2-dimensional plane (see Figure). All observations were generated by the same process. We now draw an extra observation $x_\bullet = (a,b)$ from the same data generating process. What is the probability that $x_\bullet$ lies within the shaded rectangle $S$?
using Pkg; Pkg.activate("../."); Pkg.instantiate();
using IJulia; try IJulia.clear_output(); catch _ end
using Distributions, Plots, LaTeXStrings
N = 100
generative_dist = MvNormal([0,1.], [0.8 0.5; 0.5 1.0])
D = rand(generative_dist, N) # Generate observations from generative_dist
scatter(D[1,:], D[2,:], marker=:x, markerstrokewidth=3, label=L"D")
x_dot = rand(generative_dist) # Generate x∙
scatter!([x_dot[1]], [x_dot[2]], label=L"x_\bullet")
plot!(range(0, 2), [1., 1., 1.], fillrange=2, alpha=0.4, color=:gray, label=L"S") # shade the rectangle S = [0,2] × [1,2]
where $|\Sigma| \triangleq \mathrm{det}(\Sigma)$ is the determinant of $\Sigma$.
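As a quick check of this definition, the following sketch evaluates the multivariate Gaussian pdf directly from the formula (using det and inv) and compares it with the Distributions.jl implementation; the test point and parameters are arbitrary choices for illustration.
using Distributions, LinearAlgebra
# Evaluate N(x|μ,Σ) directly from the formula, using |Σ| = det(Σ)
function gaussian_pdf(x, μ, Σ)
    d = length(μ)
    return exp(-0.5 * (x - μ)' * inv(Σ) * (x - μ)) / sqrt((2π)^d * det(Σ))
end
μ_test = [0.0, 1.0]; Σ_test = [0.8 0.5; 0.5 1.0] # same parameters as generative_dist above
x_test = [0.5, 0.5] # arbitrary test point
gaussian_pdf(x_test, μ_test, Σ_test) ≈ pdf(MvNormal(μ_test, Σ_test), x_test) # should evaluate to true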
Let's estimate a constant $\theta$ from one 'noisy' measurement $x$ about that constant.
We assume the following measurement equations (the tilde $\sim$ means: 'is distributed as'): $$\begin{align*} x &= \theta + \epsilon \\ \epsilon &\sim \mathcal{N}(0,\sigma^2) \end{align*}$$
which we recognize as a Gaussian distribution w.r.t. $\theta$.
where $$\begin{align*} \frac{1}{\sigma_1^2} &= \frac{\sigma_0^2 + \sigma^2}{\sigma^2 \sigma_0^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2} \\ \mu_1 &= \frac{\sigma_0^2 x + \sigma^2 \mu_0}{\sigma^2 + \sigma_0^2} = \sigma_1^2 \, \left( \frac{1}{\sigma_0^2} \mu_0 + \frac{1}{\sigma^2} x \right) \end{align*}$$
where $$\begin{align*} \Sigma_c^{-1} &= \Sigma_a^{-1} + \Sigma_b^{-1} \\ \Sigma_c^{-1} \mu_c &= \Sigma_a^{-1}\mu_a + \Sigma_b^{-1}\mu_b \end{align*}$$
using Plots, Distributions, LaTeXStrings
d1 = Normal(0, 1) # μ=0, σ^2=1
d2 = Normal(3, 2) # μ=3, σ^2=4
# Calculate the parameters of the product d1*d2
s2_prod = (d1.σ^-2 + d2.σ^-2)^-1
m_prod = s2_prod * ((d1.σ^-2)*d1.μ + (d2.σ^-2)*d2.μ)
d_prod = Normal(m_prod, sqrt(s2_prod)) # the product d1*d2 is proportional to this Gaussian; its scale factor (normalization constant Z) is neglected here
# Plot stuff
x = range(-4, stop=8, length=100)
plot(x, pdf.(d1,x), label=L"\mathcal{N}(0,1)", fill=(0, 0.1)) # Plot the first Gaussian
plot!(x, pdf.(d2,x), label=L"\mathcal{N}(3,4)", fill=(0, 0.1)) # Plot the second Gaussian
plot!(x, pdf.(d1,x) .* pdf.(d2,x), label=L"\mathcal{N}(0,1) \mathcal{N}(3,4)", fill=(0, 0.1)) # Plot the exact product
plot!(x, pdf.(d_prod,x), label=L"Z^{-1} \mathcal{N}(0,1) \mathcal{N}(3,4)", fill=(0, 0.1)) # Plot the normalized Gaussian product
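The same precision-weighted rule applies to multivariate Gaussians. Below is a minimal sketch for two 2-dimensional Gaussians; the parameters μ_a, Σ_a, μ_b, Σ_b are arbitrary illustrative choices.
using Distributions, LinearAlgebra
μ_a, Σ_a = [0.0, 1.0], [1.0 0.2; 0.2 0.5] # illustrative parameters
μ_b, Σ_b = [2.0, 0.0], [0.6 0.0; 0.0 1.5]
Σ_c = Hermitian(inv(inv(Σ_a) + inv(Σ_b))) # Σ_c⁻¹ = Σ_a⁻¹ + Σ_b⁻¹ (Hermitian wrapper guards against rounding asymmetry)
μ_c = Σ_c * (inv(Σ_a)*μ_a + inv(Σ_b)*μ_b) # Σ_c⁻¹ μ_c = Σ_a⁻¹ μ_a + Σ_b⁻¹ μ_b
c = MvNormal(μ_c, convert(Matrix, Σ_c)) # normalized Gaussian proportional to the product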
and the same prior for $\theta$: $$ \theta \sim \mathcal{N}(\mu_0,\sigma_0^2) $$
which is a product of $N+1$ Gaussians and is therefore also Gaussian-distributed.
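To make this concrete, here is a minimal sketch that folds the precision-weighted product over $N$ observations, one factor at a time; the noise variance, prior parameters and data are arbitrary illustrative choices. The result matches the closed-form posterior $1/\sigma_N^2 = 1/\sigma_0^2 + N/\sigma^2$ and $\mu_N = \sigma_N^2\left(\mu_0/\sigma_0^2 + \sum_n x_n/\sigma^2\right)$.
σ2, μ0, σ02 = 0.5, 0.0, 1.0 # noise variance and prior parameters (illustrative)
x_obs = [1.2, 0.8, 1.5, 0.9, 1.1] # observations (illustrative)
# Fold the prior and the N likelihood factors into one Gaussian, one factor at a time
function posterior_params(x_obs, μ0, σ02, σ2)
    precision = 1/σ02
    weighted_mean = μ0/σ02
    for x in x_obs
        precision += 1/σ2     # precisions add
        weighted_mean += x/σ2 # precision-weighted means add
    end
    return weighted_mean/precision, 1/precision # (μ_N, σ_N²)
end
μN, σN2 = posterior_params(x_obs, μ0, σ02, σ2)
# Closed-form check
σN2 ≈ 1/(1/σ02 + length(x_obs)/σ2) && μN ≈ σN2*(μ0/σ02 + sum(x_obs)/σ2) # should evaluate to true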
using Plots, LaTeXStrings, Distributions
# Define the joint distribution p(x,y)
μ = [1.0; 2.0]
Σ = [0.3 0.7;
0.7 2.0]
joint = MvNormal(μ,Σ)
# Define the marginal distribution p(x)
marginal_x = Normal(μ[1], sqrt(Σ[1,1]))
# Plot p(x,y)
x_range = y_range = range(-2,stop=5,length=1000)
joint_pdf = [ pdf(joint, [x_range[i];y_range[j]]) for j=1:length(y_range), i=1:length(x_range)]
plot_1 = heatmap(x_range, y_range, joint_pdf, title = L"p(x, y)")
# Plot p(x)
plot_2 = plot(x_range, pdf.(marginal_x, x_range), title = L"p(x)", label="", fill=(0, 0.1))
# Plot p(y|x = 0.1)
x = 0.1
conditional_y_m = μ[2]+Σ[2,1]*inv(Σ[1,1])*(x-μ[1])
conditional_y_s2 = Σ[2,2] - Σ[2,1]*inv(Σ[1,1])*Σ[1,2]
conditional_y = Normal(conditional_y_m, sqrt(conditional_y_s2))
plot_3 = plot(y_range, pdf.(conditional_y, y_range), title = L"p(y|x = %$x)", label="", fill=(0, 0.1))
plot(plot_1, plot_2, plot_3, layout=(1,3), size=(1200,300))
As is clear from the plots, the conditional distribution is a renormalized slice from the joint distribution.
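We can verify this claim numerically by reusing joint, marginal_x and conditional_y from the cell above: dividing the slice of the joint pdf at $x=0.1$ by the marginal value $p(x=0.1)$ should reproduce the conditional pdf (the test value for $y$ is an arbitrary choice).
y_test = 1.7 # arbitrary test value for y
pdf(joint, [x; y_test]) / pdf(marginal_x, x) ≈ pdf(conditional_y, y_test) # should evaluate to true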
with $$\begin{align*} K &= \frac{\sigma_0^2}{\sigma_0^2+\sigma^2} \qquad \text{($K$ is called: Kalman gain)}\\ \mu_1 &= \mu_0 + K \cdot (x-\mu_0)\\ \sigma_1^2 &= \left( 1-K \right) \sigma_0^2 \end{align*}$$
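As a sanity check, the Kalman-gain form above yields the same posterior parameters as the precision form derived earlier; the sketch below uses arbitrary illustrative values for $\mu_0$, $\sigma_0^2$, $\sigma^2$ and $x$.
μ0, σ02, σ2, x = 0.0, 1.0, 0.5, 1.2 # illustrative values
# Precision form
σ12 = 1/(1/σ02 + 1/σ2)
μ1 = σ12*(μ0/σ02 + x/σ2)
# Kalman-gain form
K = σ02/(σ02 + σ2)
μ1_K = μ0 + K*(x - μ0)
σ12_K = (1 - K)*σ02
(μ1 ≈ μ1_K) && (σ12 ≈ σ12_K) # should evaluate to true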
Consider the signal $x_t=\theta+\epsilon_t$ with $\epsilon_t \sim \mathcal{N}(0,\sigma^2)$, where $D_t= \left\{x_1,\ldots,x_t\right\}$ is observed sequentially (over time).
Problem: Derive a recursive algorithm for $p(\theta|D_t)$, i.e., an update rule for (posterior) $p(\theta|D_t)$ based on (prior) $p(\theta|D_{t-1})$ and (new observation) $x_t$.
Let's define the estimate after $t$ observations (i.e., our solution ) as $p(\theta|D_t) = \mathcal{N}(\theta\,|\,\mu_t,\sigma_t^2)$.
We define the joint distribution for $\theta$ and $x_t$, given background $D_{t-1}$, by
with $$\begin{align*} K_t &= \frac{\sigma_{t-1}^2}{\sigma_{t-1}^2+\sigma^2} \qquad \text{(Kalman gain)}\\ \mu_t &= \mu_{t-1} + K_t \cdot (x_t-\mu_{t-1})\\ \sigma_t^2 &= \left( 1-K_t \right) \sigma_{t-1}^2 \end{align*}$$
using Plots, Distributions
n = 100 # specify number of observations
θ = 2.0 # true value of the parameter we would like to estimate
noise_σ2 = 0.3 # variance of observation noise
observations = sqrt(noise_σ2) .* randn(n) .+ θ # draw noisy observations x_t = θ + ϵ_t with Var(ϵ_t) = noise_σ2
function perform_kalman_step(prior :: Normal, x :: Float64, noise_σ2 :: Float64)
    prior_σ2 = var(prior) # prior variance σ²_{t-1}
    K = prior_σ2 / (noise_σ2 + prior_σ2) # compute the Kalman gain
    posterior_μ = mean(prior) + K*(x - mean(prior)) # update the posterior mean
    posterior_σ2 = prior_σ2 * (1.0 - K) # update the posterior variance
    return Normal(posterior_μ, sqrt(posterior_σ2)) # return the posterior distribution
end
post_μ = fill!(Vector{Float64}(undef,n + 1), NaN) # means of p(θ|D) over time
post_σ2 = fill!(Vector{Float64}(undef,n + 1), NaN) # variances of p(θ|D) over time
prior = Normal(0, 1) # specify the prior distribution (you can play with the parameterization of this to get a feeling of how the Kalman filter converges)
post_μ[1] = prior.μ # save prior mean and variance to show these in plot
post_σ2[1] = var(prior)
for (i, x) in enumerate(observations) # note that this loop demonstrates Bayesian learning on streaming data; we update the prior distribution using observation(s), after which this posterior becomes the new prior for future observations
posterior = perform_kalman_step(prior, x, noise_σ2) # compute the posterior distribution given the observation
post_μ[i + 1] = posterior.μ # save the mean of the posterior distribution
post_σ2[i + 1] = var(posterior) # save the variance of the posterior distribution
prior = posterior # the posterior becomes the prior for future observations
end
obs_scale = collect(2:n+1) # observation x_t arrives at time step t (plotted from step 2)
scatter(obs_scale, observations, label=L"D") # scatter the observations
post_scale = collect(1:n+1) # posterior estimates start at step 1 with the prior
plot!(post_scale, post_μ, ribbon=sqrt.(post_σ2), linewidth=3, label=L"p(θ | D_t)") # lineplot our estimated means of intermediate posterior distributions
plot!(post_scale, θ*ones(n + 1), linewidth=2, label=L"θ") # plot the true value of θ
The shaded area represents one standard deviation of the posterior $p(\theta|D_t)$ on either side of the mean. The posterior variance is guaranteed to decrease monotonically for the standard Kalman filter, since $\sigma_t^2 = (1-K_t)\,\sigma_{t-1}^2$ with $0 < K_t < 1$.
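This can be checked directly on the posterior variances stored in the simulation above:
all(diff(post_σ2) .< 0) # the posterior variance shrinks at every step; should evaluate to true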
where $\mathrm{K}_n(z)$ is a modified Bessel function of the second kind.
using Plots, Distributions, SpecialFunctions, LaTeXStrings
X = Normal(0,1)
Y = Normal(0,1)
pdf_product_std_normals(z::Vector) = (besselk.(0, abs.(z))./π)
range1 = collect(range(-4,stop=4,length=100))
plot(range1, pdf.(X, range1), label=L"p(X)=p(Y)=\mathcal{N}(0,1)", fill=(0, 0.1))
plot!(range1, pdf.(X,range1).*pdf.(Y,range1), label=L"p(X)*p(Y)", fill=(0, 0.1))
plot!(range1, pdf_product_std_normals(range1), label=L"p(Z=X*Y)", fill=(0, 0.1))
We apply maximum likelihood estimation to fit a 2-dimensional Gaussian model ($m$) to data set $D$. Next, we evaluate $p(x_\bullet \in S | m)$ by (numerical) integration of the Gaussian pdf over $S$: $p(x_\bullet \in S | m) = \int_S p(x|m) \mathrm{d}x$.
using HCubature, LinearAlgebra, Plots, Distributions # HCubature provides numerical integration
# Maximum likelihood estimation of 2D Gaussian
N = size(D, 2) # number of observations
μ = 1/N * sum(D,dims=2)[:,1]
D_min_μ = D - repeat(μ, 1, N)
Σ = Hermitian(1/N * D_min_μ*D_min_μ')
m = MvNormal(μ, convert(Matrix, Σ));
contour(range(-3, 4, length=100), range(-3, 4, length=100), (x, y) -> pdf(m, [x, y]))
# Numerical integration of p(x|m) over S:
(val,err) = hcubature((x)->pdf(m,x), [0., 1.], [2., 2.])
println("p(x⋅∈S|m) ≈ $(val)")
scatter!(D[1,:], D[2,:], marker=:x, markerstrokewidth=3, label=L"D")
scatter!([x_dot[1]], [x_dot[2]], label=L"x_\bullet")
plot!(range(0, 2), [1., 1., 1.], fillrange=2, alpha=0.4, color=:gray, label=L"S")
p(x⋅∈S|m) ≈ 0.23491845066574196
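As a cross-check on the numerical integration, we can also estimate $p(x_\bullet \in S|m)$ by Monte Carlo: draw samples from the fitted model $m$ and count the fraction that lands in $S = [0,2] \times [1,2]$. This is a minimal sketch; the sample size is an arbitrary choice, so the estimate will fluctuate around the value computed above.
samples = rand(m, 10^6) # draw many samples from the fitted model
in_S = (0 .<= samples[1,:] .<= 2) .& (1 .<= samples[2,:] .<= 2) # indicator of membership in S
println("Monte Carlo estimate of p(x⋅∈S|m) ≈ $(mean(in_S))")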
where $$\begin{align*} \Sigma_c^{-1} &= \Sigma_a^{-1} + \Sigma_b^{-1} \\ \Sigma_c^{-1} \mu_c &= \Sigma_a^{-1}\mu_a + \Sigma_b^{-1}\mu_b \end{align*}$$
can be decomposed as $$\begin{align*} p(y|x) &= \mathcal{N}\left(y\,|\,\mu_y + \Sigma_{yx}\Sigma_x^{-1}(x-\mu_x),\, \Sigma_y - \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy} \right) \\ p(x) &= \mathcal{N}\left( x|\mu_x, \Sigma_x \right) \end{align*}$$
where $a>0$ and $b>0$ are known as the shape and rate parameters, respectively.
with $$\begin{align*} a_N &= a_0 + \frac{N}{2} \qquad &&\text{(B-2.150)} \\ b_N &= b_0 + \frac{1}{2}\sum_n \left( x_n-\mu\right)^2 \qquad &&\text{(B-2.151)} \end{align*}$$
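A minimal sketch of this update with Distributions.jl; the prior parameters, known mean $\mu$ and data are arbitrary illustrative choices. Note that Distributions.jl parameterizes Gamma by shape and scale, so a rate $b$ corresponds to scale $1/b$.
using Distributions
a0, b0 = 2.0, 1.0 # prior shape and rate (illustrative)
μ = 0.0 # known mean of the Gaussian (illustrative)
xs = [0.3, -0.5, 1.1, 0.2, -0.8] # observations (illustrative)
aN = a0 + length(xs)/2 # (B-2.150)
bN = b0 + 0.5*sum((xs .- μ).^2) # (B-2.151)
posterior = Gamma(aN, 1/bN) # Gamma(shape, scale); scale = 1/rate
mean(posterior) # posterior mean of the precision, aN/bN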