Note that none of the material below is new. The point of the Probabilistic Programming sessions is to solve practical problems so that the concepts from Bert's lectures become less abstract.
using Pkg
Pkg.activate("../../../lessons/")
Pkg.instantiate();
IJulia.clear_output();
using JLD
using Statistics
using LinearAlgebra
using Distributions
using RxInfer
using ColorSchemes
using LaTeXStrings
using Plots
default(label="", grid=false, linewidth=3, margin=10Plots.pt)
include("../scripts/clusters.jl");
Archeologists have asked for your help in analyzing data on stone tools. It is believed that early humans created tools by striking stones against other stones. During this process, the struck stone sheds flakes, which have been preserved. The archeologists have recovered these flakes from various locations and time periods and want to know whether this stone tool shaping process has improved over the centuries.
The data is available from the UCI Machine Learning Repository. Each instance represents summary information of the stone flakes for a particular site. We will be using the attributes flaking angle (FLA) and the proportion of the dorsal surface worked (PROZD) for now.
data = load("../datasets/stone_flakes.jld");
I've done some pre-processing on the data set, namely z-scoring and removing two outliers. This reduces the scale of the attributes, which helps numerical stability during optimization. Now let's visualize the data with a scatterplot.
scatter(data["observations"][:,1],
        data["observations"][:,2],
        label="",
        xlabel="Proportion of worked dorsal surface (PROZD)",
        ylabel="Flaking angle (FLA)",
        size=(800,300))
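For reference, the z-scoring mentioned above can be reproduced in a couple of lines. The sketch below is illustrative only (the stored data set is already standardized), and the function name zscore_columns is just a placeholder of my choosing:
using Statistics

# Standardize each column to zero mean and unit variance (z-scoring)
zscore_columns(X) = (X .- mean(X, dims=1)) ./ std(X, dims=1)

# e.g. standardized = zscore_columns(raw_observations)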
We will be clustering this data with a Gaussian mixture model, to see if we can identify clear types of stone tools. The generative model for a Gaussian mixture consists of:
$$ p(X, z, \pi, \mu, \Lambda) =\ \underbrace{p(X \mid z, \mu, \Lambda)}_{\text{likelihood}}\ \times \ \underbrace{p(z \mid \pi)}_{\text{prior latent variables}} \ \times \ \underbrace{p(\mu \mid \Lambda)\ p(\Lambda)\ p(\pi)}_{\text{prior parameters}}$$with the likelihood of observation $X_i$ being a Gaussian raised to the power of an indicator of the latent assignment variable $z_i$ (the Iverson bracket $[z_i = k]$ equals 1 if $z_i = k$ and 0 otherwise)
$$ p(X_i \mid z, \mu, \Lambda) = \prod_{k=1}^{K} \mathcal{N}(X_i \mid \mu_k, \Lambda_k^{-1})^{[z_i = k]}$$the prior for each latent variable $z_i$ being a Categorical distribution
$$ p(z_i \mid \pi) = \text{Categorical}(z_i \mid \pi) $$and priors for the parameters being
$$ \begin{aligned} p(\Lambda_k) =&\ \text{Wishart}(\Lambda_k \mid V_0, n_0) \qquad &\text{for all}\ k , \\ p(\mu_k \mid \Lambda_k) =&\ \mathcal{N}(\mu_k \mid m_0, \Lambda_k^{-1}) \qquad &\text{for all}\ k , \\ p(\pi) =&\ \text{Dirichlet}(\pi \mid a_0) \quad . \end{aligned}$$We can implement these equations nearly directly in ReactiveMP.
# Data dimensionality
num_features = size(data["observations"],2)
# Sample size
num_samples = size(data["observations"],1)
# Number of mixture components
num_components = 3;
Mixture models can be sensitive to initialization, so we are going to specify the prior parameters explicitly.
# Identity matrix (convenience variable)
Id = diagm(ones(num_features));
# Prior scale matrices
V0 = cat(Id, Id, Id, dims=3)
# Prior degrees of freedom
n0 = num_features
# Prior means
m0 = [ 1.0  0.0  -1.0;
      -1.0  0.0   1.0];
# Prior concentration parameters
a0 = ones(num_components);
# Pack into dictionary
prior_params = Dict{Symbol,Any}(:Λ => (V0,n0), :μ => m0, :π => a0)
Dict{Symbol, Any} with 3 entries:
  :μ => [1.0 0.0 -1.0; -1.0 0.0 1.0]
  :π => [1.0, 1.0, 1.0]
  :Λ => ([1.0 0.0; 0.0 1.0;;; 1.0 0.0; 0.0 1.0;;; 1.0 0.0; 0.0 1.0], 2)
Now to specify the full model.
@model function GMM(prior_params; K=1, N=1)

    # Allocate variables
    X = datavar(Vector{Float64}, N)
    z = randomvar(N)
    μ = randomvar(K)
    Λ = randomvar(K)

    # Unpack prior parameters
    V0 = prior_params[:Λ][1]
    n0 = prior_params[:Λ][2]
    m0 = prior_params[:μ]
    a0 = prior_params[:π]

    # Component parameters
    for k in 1:K
        Λ[k] ~ Wishart(n0, V0[:,:,k])
        μ[k] ~ MvNormalMeanPrecision(m0[:,k], Λ[k])
    end

    # Mixture weights
    π ~ Dirichlet(a0)

    cmeans = tuple(μ...)
    cprecs = tuple(Λ...)

    for i in 1:N

        # Latent assignment variable
        z[i] ~ Categorical(π)

        # Mixture distribution
        X[i] ~ NormalMixture(z[i], cmeans, cprecs) where { q = MeanField() }

    end
    return X,z,π,μ,Λ
end
Set up the inference procedure.
# Map data to list of vectors
observations = [data["observations"][i,:] for i in 1:num_samples]
# Set variational distribution factorization
constraints = @constraints begin
    q(z,π,μ,Λ) = q(z)q(π)q(μ)q(Λ)
end

# Initialize variational distributions
initmarginals = (
    π = Dirichlet(a0),
    μ = [MvNormalMeanCovariance(m0[:,k], Id) for k in 1:num_components],
    Λ = [Wishart(n0, V0[:,:,k]) for k in 1:num_components]
)
# Iterations of variational inference
num_iters = 100
# Variational inference
results = inference(
    model         = GMM(prior_params, K=num_components, N=num_samples),
    data          = (X = observations,),
    constraints   = constraints,
    initmarginals = initmarginals,
    iterations    = num_iters,
    free_energy   = true,
    showprogress  = true,
)
Progress: 100%|█████████████████████████████████████████| Time: 0:00:05
Inference results:
  Posteriors   | available for (μ, π, Λ, z)
  Free Energy: | Real[217.442, 202.111, 200.341, 199.685, 199.169, 198.64, 198.078, 197.534, 197.088, 196.743 … 193.448, 193.448, 193.448, 193.448, 193.448, 193.448, 193.448, 193.448, 193.448, 193.448]
Alright, we're done. Let's track the evolution of free energy over iterations.
plot(1:num_iters,
     results.free_energy,
     color="black",
     xscale=:log10,
     xlabel="Number of iterations",
     ylabel="Free Energy",
     size=(800,300))
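As a quick numerical sanity check to complement the plot, we can verify that the free energy never increases (up to a small numerical tolerance). This is a minimal sketch using the free_energy trace returned above:
# Free energy trace over iterations
fe = results.free_energy

# Check monotone decrease up to a small tolerance, and report the final value
println("Monotonically decreasing: ", all(diff(fe) .<= 1e-6))
println("Final free energy: ", last(fe))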
That looks like it is nicely decreasing. Let's now visualize the clusters on top of the observations.
# Extract mixture weights
π_hat = mean(results.posteriors[:π][num_iters])
# Extract component means
μ_hat = mean.(results.posteriors[:μ][num_iters])
# Extract covariance matrices
Σ_hat = inv.(mean.(results.posteriors[:Λ][num_iters]))
# Select dimensions to plot
xlims = [minimum(data["observations"][:,1])-1, maximum(data["observations"][:,1])+1]
ylims = [minimum(data["observations"][:,2])-1, maximum(data["observations"][:,2])+1]
# Plot data and overlay estimated posterior probabilities
plot_clusters(data["observations"][:, 1:2],
              μ=μ_hat,
              Σ=Σ_hat,
              xlims=xlims,
              ylims=ylims,
              xlabel="Proportion of worked dorsal surface (PROZD)",
              ylabel="Flaking angle (FLA)",
              colors=[:reds, :blues, :greens],
              figsize=(800,500))
That doesn't look bad. The three Gaussians nicely cover all samples.
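If you want hard cluster assignments rather than soft responsibilities, one option is to pick the most probable component under each assignment posterior. A minimal sketch, assuming the categorical posteriors expose their probability vectors through probvec (as in ReactiveMP):
# Posterior assignment probabilities per observation at the final iteration
responsibilities = probvec.(results.posteriors[:z][num_iters])

# Hard assignment: index of the most probable component for each observation
assignments = argmax.(responsibilities)
These assignments could, for example, be used to color the scatterplot by component.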
Play around with the number of components. Can you get an equally good coverage with just 2 components? What if you had 4?
We can also plot the evolution of the parameters over iterations of the variational inference procedure.
# Extract mean and variance of the mixture weights at each iteration
mean_π_iters = cat(mean.(results.posteriors[:π])..., dims=2)
vars_π_iters = cat(var.( results.posteriors[:π])..., dims=2)
plot(1:num_iters,
     mean_π_iters',
     ribbon=vars_π_iters',
     xscale=:log10,
     xlabel="Number of iterations",
     ylabel="Mixture weights",
     size=(800,300))
Plot the evolution of one of the component means over iterations of variational inference, including its variance.
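One possible approach, mirroring the mixture-weight plot above and tracking the first component (a sketch, not the only solution):
# Mean and variance of the first component's mean at each iteration
mean_μ1_iters = cat([mean(results.posteriors[:μ][i][1]) for i in 1:num_iters]..., dims=2)
vars_μ1_iters = cat([var( results.posteriors[:μ][i][1]) for i in 1:num_iters]..., dims=2)

plot(1:num_iters,
     mean_μ1_iters',
     ribbon=vars_μ1_iters',
     xscale=:log10,
     xlabel="Number of iterations",
     ylabel="Component 1 mean",
     size=(800,300))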
Suppose you are contacted to estimate how resistant a building is to shaking caused by minor earthquakes. You decide to model the building as a mass-spring-damper system, described by:
$$m \ddot{x} + c\dot{x} + kx = w \, ,$$where $m$ is the mass of the building, $c$ is the friction (damping) coefficient and $k$ is the stiffness of the building's main supports. You don't know the external force acting upon the building and decide to model it as white noise $w$. In essence, this means you think the building will, on average, be pushed to the left as strongly as it will be pushed to the right. A simple discretization scheme with the substituted variable $z = [x \ \ \dot{x}]$ and time-step $\Delta t$ yields:
$$z_{k} = \underbrace{\begin{bmatrix} 1 & \Delta t \\ \frac{-k}{m}\Delta t & \frac{-c}{m}\Delta t + 1 \end{bmatrix}}_{A} z_{k-1} + q_k \, ,$$where $q_k \sim \mathcal{N}(0, Q)$ with covariance matrix $Q$.
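One way to see where $A$ comes from is a forward-Euler step. Writing the second-order equation in first-order form,
$$\dot{z} = \begin{bmatrix} \dot{x} \\ \ddot{x} \end{bmatrix} = \underbrace{\begin{bmatrix} 0 & 1 \\ -\frac{k}{m} & -\frac{c}{m} \end{bmatrix}}_{F} z + \begin{bmatrix} 0 \\ \frac{w}{m} \end{bmatrix} \, ,$$and approximating the derivative by $\dot{z} \approx (z_k - z_{k-1}) / \Delta t$ gives $z_k \approx (I + \Delta t F) z_{k-1} + q_k$, where $I + \Delta t F$ is exactly the matrix $A$ above and the noise term has been absorbed into $q_k$.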
You place a series of sensors on the building that measure the displacement:
$$ y_k = \underbrace{\begin{bmatrix} 1 & 0 \end{bmatrix}}_{C} z_k + r_k \, $$where the measurement noise is white as well: $r_k \sim \mathcal{N}(0, \sigma^2)$. You have a good estimate of the variance $\sigma^2$ from previous sensor calibrations and decide to consider it a known variable.
# Load data from file
data = load("../datasets/shaking_buildings.jld")
# Data
states = data["states"]
observations = data["observations"]
# Parameters
mass = data["m"]
friction = data["c"]
stiffness = data["k"]
# Measurement noise standard deviation
σ = data["σ"]
# Time
Δt = data["Δt"]
T = length(observations)
time = range(1,step=Δt,length=T)
plot(time, states[1,:], color="red", label="states", xlabel="time (sec)", ylabel="building position")
scatter!(time, observations, color="black", label="observations", legend=:topleft, size=(800,300))
Following the steps from the lecture on Dynamic Systems, we can derive the following probabilistic state-space model:
$$\begin{aligned} p(z_k \mid z_{k-1}) =&\ \mathcal{N}(z_k \mid A z_{k-1}, Q)\\ p(y_k \mid z_k) =&\ \mathcal{N}(y_k \mid C z_k, \sigma^2) \, . \end{aligned}$$For now, we will use a simple structure for the process noise covariance matrix, e.g. $Q = I$. If we consider a Gaussian prior distribution for the initial state
$$ p(z_0) = \mathcal{N}(m_0, S_0) \, ,$$we obtain a complete generative model:
$$\begin{aligned} \underbrace{p(y_{1:T}, z_{0:T})}_{\text{generative model}} = \underbrace{p(z_0)}_{\text{prior}} \, \prod_{k=1}^T \, \underbrace{p(y_k \mid z_k)}_{\text{likelihood}} \, \underbrace{p(z_k \mid z_{k-1})}_{\text{state transition}} \end{aligned}$$To define this model in ReactiveMP, we must start with the process matrices:
# Transition matrix
A = [1 Δt; -stiffness/mass*Δt -friction/mass*Δt+1]
# Emission matrix
C = [1.0, 0.0]
# Set process noise covariance matrix
Q = diagm(ones(2))
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
Next, we define a linear Gaussian dynamical system with only the states as unknown variables:
@model function LGDS(prior_params, A,C,Q, σ; T=1)
    "State estimation in a linear Gaussian dynamical system"

    z = randomvar(T)
    y = datavar(Float64,T)

    # Prior state
    z_0 ~ MvNormalMeanCovariance(prior_params[:z0][1], prior_params[:z0][2])

    z_kmin1 = z_0
    for k in 1:T

        # State transition
        z[k] ~ MvNormalMeanCovariance(A * z_kmin1, Q)

        # Likelihood
        y[k] ~ NormalMeanVariance(dot(C, z[k]), σ^2)

        # Update recursive aux
        z_kmin1 = z[k]
    end
    return y, z
end
# Initial state prior
prior_params = Dict(:z0 => (zeros(2), diageye(2)))
(posteriors,_) = inference(
    model       = LGDS(prior_params, A,C,Q, σ, T=T),
    data        = (y = [observations[k] for k in 1:T],),
    free_energy = true,
)
Inference results:
  Posteriors   | available for (z_0, z)
  Free Energy: | Real[434.098]
Let's extract the inferred states and visualize them.
m_z = cat(mean.(posteriors[:z])...,dims=2)
v_z = cat(var.( posteriors[:z])...,dims=2)
plot(time, states[1,:], color="red", label="states", xlabel="time (sec)", ylabel="building position")
plot!(time, m_z[1,:], color="blue", ribbon=v_z[1,:], label="inferred")
scatter!(time, observations, color="black", alpha=0.2, label="observations", legend=:bottomright, size=(800,300))
Mmmh... the inferred states are not smooth at all. This is most likely due to our process noise covariance matrix not being calibrated. So can we improve?
Of course, as Bayesians, we can just treat $Q$ as an unknown random variable and infer its posterior distribution. Adjusting the model is straightforward. The probabilistic state-space model becomes:
$$\begin{aligned} p(z_k \mid z_{k-1},Q) =&\ \mathcal{N}(z_k \mid A z_{k-1}, Q)\\ p(y_k \mid z_k) =&\ \mathcal{N}(y_k \mid C z_k, \sigma^2) \, , \end{aligned}$$with priors
$$\begin{aligned} p(Q) =&\ \mathcal{W}^{-1}(\nu, \Lambda) \\ p(z_0) =&\ \mathcal{N}(m_0, S_0) \, . \end{aligned}$$The $\mathcal{W}^{-1}$ represents an inverse-Wishart distribution with degrees-of-freedom $\nu$ and scale matrix $\Lambda$. This will give the following generative model:
$$\begin{aligned} \underbrace{p(y_{1:T}, z_{0:T}, Q)}_{\text{generative model}} = \underbrace{p(z_0)p(Q)}_{\text{priors}} \, \prod_{k=1}^T \, \underbrace{p(y_k \mid z_k)}_{\text{likelihood}} \, \underbrace{p(z_k \mid z_{k-1},Q)}_{\text{state transition}} \end{aligned}$$Our model definition in ReactiveMP is only slightly larger:
@model function LGDS_Q(prior_params, A,C, σ; T=1)
    "State estimation in a linear Gaussian dynamical system with unknown process noise"

    z = randomvar(T)
    y = datavar(Float64,T)

    # Prior state
    z_0 ~ MvNormalMeanCovariance(prior_params[:z0][1], prior_params[:z0][2])

    # Process noise covariance matrix
    Q ~ InverseWishart(prior_params[:Q][1], prior_params[:Q][2])

    z_kmin1 = z_0
    for k in 1:T

        # State transition
        z[k] ~ MvNormalMeanCovariance(A * z_kmin1, Q)

        # Likelihood
        y[k] ~ NormalMeanVariance(dot(C, z[k]), σ^2)

        # Update recursive aux
        z_kmin1 = z[k]
    end
    return y, z, Q
end
Think of what might be appropriate prior parameters for the Inverse-Wishart distribution. Should its mean be high or low here?
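As a hint: for a $p$-dimensional inverse-Wishart distribution, the mean exists for $\nu > p + 1$ and is given by
$$\mathbb{E}[Q] = \frac{\Lambda}{\nu - p - 1} \, ,$$so for a fixed scale matrix $\Lambda$, increasing the degrees-of-freedom $\nu$ shrinks the prior mean of $Q$ and makes the prior more concentrated.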
I have chosen the following set of prior parameters:
# Define prior parameters
prior_params = Dict(:z0 => (zeros(2), diageye(2)),
                    :Q  => (10, diageye(2)))
Dict{Symbol, Tuple{Any, Matrix{Float64}}} with 2 entries:
  :Q  => (10, [1.0 0.0; 0.0 1.0])
  :z0 => ([0.0, 0.0], [1.0 0.0; 0.0 1.0])
The variational inference procedure for estimating the states and the process noise covariance matrix simultaneously requires a bit more setup, but is still straightforward:
# Iterations of variational inference
num_iters = 100
# Initialize variational marginal distributions and messages
inits = Dict(:z => MvNormalMeanCovariance(zeros(2), diageye(2)),
             :Q => InverseWishart(10, diageye(2)))

# Define variational distribution factorization
constraints = @constraints begin
    q(z_0, z,Q) = q(z_0, z)q(Q)
end

# Variational inference procedure
results = inference(
    model         = LGDS_Q(prior_params, A,C, σ, T=T),
    data          = (y = [observations[k] for k in 1:T],),
    constraints   = constraints,
    iterations    = num_iters,
    options       = (limit_stack_depth = 100,),
    initmarginals = inits,
    initmessages  = inits,
    free_energy   = true,
    showprogress  = true,
)
Progress: 100%|█████████████████████████████████████████| Time: 0:00:02
Inference results:
  Posteriors   | available for (z_0, Q, z)
  Free Energy: | Real[425.591, 368.392, 359.522, 356.68, 356.422, 356.522, 356.717, 356.87, 356.851, 356.897 … 357.077, 357.078, 357.078, 357.078, 357.079, 357.079, 357.079, 357.08, 357.08, 357.08]
Again, let's inspect the free energy to see if we have converged.
plot(1:num_iters,
     results.free_energy,
     color="black",
     xscale=:log10,
     xlabel="Number of iterations",
     ylabel="Free Energy",
     size=(800,300))
Alright. That looks good. Let's extract the inferred states and visualize.
m_z = cat(mean.(last(results.posteriors[:z]))...,dims=2)
v_z = cat(var.(last(results.posteriors[:z]))...,dims=2)
plot(time, states[1,:], color="red", label="states", xlabel="time (sec)", ylabel="building position")
plot!(time, m_z[1,:], color="blue", ribbon=v_z[1,:], label="inferred")
scatter!(time, observations, color="black", alpha=0.2, label="observations", legend=:topleft, size=(800,300))
That's much smoother. The free energy of this model ($\mathcal{F} = 357.08$) is smaller than that of the earlier model with $Q$ set to an identity matrix ($\mathcal{F} = 434.10$). That means that the added cost of inferring the matrix $Q$ is offset by the increase in performance it provides.
The error with respect to the true states seems smaller as well, but in practice we of course can't check this.
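Because this is a simulated data set with known ground-truth states, we can actually quantify the error here. A minimal sketch; note that the variable m_z_fixedQ is hypothetical and assumes you saved the posterior means of the fixed-$Q$ model before they were overwritten above:
# Root-mean-square error of the inferred positions w.r.t. the true states
rmse(estimate, truth) = sqrt(mean(abs2, estimate .- truth))

println("RMSE (Q inferred): ", rmse(m_z[1,:], states[1,:]))
# println("RMSE (Q = I):      ", rmse(m_z_fixedQ[1,:], states[1,:]))  # hypothetical saved copy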
Let's inspect the inferred process noise covariance matrix:
Q_MAP = mean(last(results.posteriors[:Q]))
2×2 Matrix{Float64}:
 0.0600678   0.00535827
 0.00535827  0.145712
We do not have enough data to recover the true process noise covariance matrix exactly, but the estimate is clearly closer to the truth than the identity matrix we assumed earlier.
# True process noise covariance matrix
Q_true = data["Q"]
2×2 Matrix{Float64}:
 0.00173611  0.0240885
 0.0240885   0.549072
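To put a number on "closer to the truth", we can compare Frobenius-norm distances to the true matrix; a minimal sketch:
# Frobenius-norm distance of each candidate to the true process noise covariance
println("‖Q_MAP - Q_true‖ = ", norm(Q_MAP .- Q_true))
println("‖I - Q_true‖     = ", norm(diagm(ones(2)) .- Q_true))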
Let's visualize that for a closer look:
# Colorbar limits
clims = (minimum([Q_MAP[:]; Q_true[:]]), maximum([Q_MAP[:]; Q_true[:]]))
# Plot covariance matrices as heatmaps
p401 = heatmap(Q_MAP, axis=([], false), yflip=true, title="Estimated", clims=clims)
p402 = heatmap(Q_true, axis=([], false), yflip=true, title="True", clims=clims)
plot(p401,p402, layout=(1,2), size=(900,300))