Note that none of the material below is new. The point of the Probabilistic Programming sessions is to solve practical problems so that the concepts from Bert's lectures become less abstract.
using Pkg
Pkg.activate("../../../lessons/")
Pkg.instantiate();
Activating project at `~/syndr/Wouter/Onderwijs/Vakken/tueindhoven/5SSD0 - Bayesian Machine Learning & Information Processing/2023-2024 Q2/BMLIP/lessons`
using JLD
using Statistics
using LinearAlgebra
using Distributions
using RxInfer
using ColorSchemes
using LaTeXStrings
using Plots
default(label="", grid=false, linewidth=3, margin=10Plots.pt)
include("../scripts/clusters.jl");
Archeologists have asked for your help in analyzing data on stone tools. It is believed that prehistoric humans created tools by striking one stone against another. During this process, the struck stone loses flakes, which have been preserved. The archeologists have recovered these flakes from various locations and time periods and want to know whether this stone tool shaping process improved over the centuries.
The data is available from the UCI Machine Learning Repository. Each instance represents summary information of the stone flakes for a particular site. We will be using the attributes flaking angle (FLA) and the proportion of the dorsal surface worked (PROZD) for now.
data = load("../datasets/stone_flakes.jld");
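As a quick check on what was loaded: the file stores the observations as a matrix under the key "observations", with one row per site and the columns holding PROZD and FLA.

# Number of sites and number of attributes
size(data["observations"])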
I've done some pre-processing on the data set, namely z-scoring the attributes and removing two outliers. This reduces the scale of the attributes, which improves numerical stability during optimization.
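The stored file already contains the result of these two steps, so the snippet below is only an illustrative sketch of what such preprocessing could look like, assuming a raw data matrix X_raw and a hypothetical z-score cutoff:

# Hypothetical sketch: z-score each attribute, then drop rows with extreme values
function preprocess(X_raw::Matrix{Float64}; cutoff=3.0)
    # Standardize each column to zero mean and unit variance
    Z = (X_raw .- mean(X_raw, dims=1)) ./ std(X_raw, dims=1)
    # Keep only the rows whose z-scores all lie within the cutoff
    keep = [all(abs.(Z[i, :]) .< cutoff) for i in 1:size(Z, 1)]
    return Z[keep, :]
end

Now let's visualize the data with a scatterplot.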
scatter(data["observations"][:,1],
data["observations"][:,2],
label="",
xlabel="Proportion of worked dorsal surface (PROZD)",
ylabel="Flaking angle (FLA)",
size=(800,300))
We will be clustering this data with a Gaussian mixture model, to see if we can identify clear types of stone tools. The generative model for a Gaussian mixture consists of:
$$ p(X, z, \pi, \mu, \Lambda) =\ \underbrace{p(X \mid z, \mu, \Lambda)}_{\text{likelihood}}\ \times \ \underbrace{p(z \mid \pi)}_{\text{prior latent variables}} \ \times \ \underbrace{p(\mu \mid \Lambda)\ p(\Lambda)\ p(\pi)}_{\text{prior parameters}} $$

with the likelihood of observation $X_i$ being a Gaussian component selected by the latent assignment variable $z_i$,

$$ p(X_i \mid z_i, \mu, \Lambda) = \prod_{k=1}^{K} \mathcal{N}(X_i \mid \mu_k, \Lambda_k^{-1})^{[z_i = k]} \, ,$$

the prior for each latent variable $z_i$ being a Categorical distribution,

$$ p(z_i \mid \pi) = \text{Categorical}(z_i \mid \pi) \, ,$$

and the priors for the parameters being

$$ \begin{aligned} p(\Lambda_k) =&\ \text{Wishart}(\Lambda_k \mid V_0, n_0) \qquad &\text{for all}\ k , \\ p(\mu_k \mid \Lambda_k) =&\ \mathcal{N}(\mu_k \mid m_0, \Lambda_k^{-1}) \qquad &\text{for all}\ k , \\ p(\pi) =&\ \text{Dirichlet}(\pi \mid a_0) \quad . \end{aligned}$$

We can implement these equations nearly directly in RxInfer.
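Before we do, it is useful to state what the variational procedure further below will compute. We approximate the exact posterior with a mean-field factorized distribution and minimize the variational free energy,

$$ q(z, \pi, \mu, \Lambda) = q(z)\, q(\pi)\, q(\mu)\, q(\Lambda) \, , \qquad F[q] = \mathbb{E}_{q}\left[ \log \frac{q(z, \pi, \mu, \Lambda)}{p(X, z, \pi, \mu, \Lambda)} \right] \, ,$$

which is exactly the factorization imposed in the constraints block later on, and the quantity reported as free energy in the inference results.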
# Data dimensionality
num_features = size(data["observations"],2)
# Sample size
num_samples = size(data["observations"],1)
# Number of mixture components
num_components = 3;
Mixture models can be sensitive to initialization, so we are going to specify the prior parameters explicitly.
# Identity matrix (convenience variable)
Id = diagm(ones(num_features));
# Prior scale matrices
V0 = cat(Id, Id, Id, dims=3)
# Prior degrees of freedom
n0 = num_features
# Prior means
m0 = [ 1.0 0.0 -1.0;
-1.0 0.0 1.0];
# Prior concentration parameters
a0 = ones(num_components);
# Pack into dictionary
prior_params = Dict{Symbol,Any}(:Λ => (V0,n0), :μ => m0, :π => a0)
Dict{Symbol, Any} with 3 entries:
  :μ => [1.0 0.0 -1.0; -1.0 0.0 1.0]
  :π => [1.0, 1.0, 1.0]
  :Λ => ([1.0 0.0; 0.0 1.0;;; 1.0 0.0; 0.0 1.0;;; 1.0 0.0; 0.0 1.0], 2)
Now to specify the full model.
@model function GMM(prior_params; K=1, N=1)

    # Allocate variables
    X = datavar(Vector{Float64}, N)
    z = randomvar(N)
    μ = randomvar(K)
    Λ = randomvar(K)

    # Unpack prior parameters
    V0 = prior_params[:Λ][1]
    n0 = prior_params[:Λ][2]
    m0 = prior_params[:μ]
    a0 = prior_params[:π]

    # Component parameters
    for k in 1:K
        Λ[k] ~ Wishart(n0, V0[:,:,k])
        μ[k] ~ MvNormalMeanPrecision(m0[:,k], Λ[k])
    end

    # Mixture weights
    π ~ Dirichlet(a0)

    cmeans = tuple(μ...)
    cprecs = tuple(Λ...)

    for i in 1:N

        # Latent assignment variable
        z[i] ~ Categorical(π)

        # Mixture distribution
        X[i] ~ NormalMixture(z[i], cmeans, cprecs) where { q = MeanField() }

    end

    return X, z, π, μ, Λ
end
Set up the inference procedure.
# Map data to list of vectors
observations = [data["observations"][i,:] for i in 1:num_samples]
# Set variational distribution factorization
constraints = @constraints begin
    q(z, π, μ, Λ) = q(z)q(π)q(μ)q(Λ)
end
# Initialize variational distributions
initmarginals = (
π = Dirichlet(a0),
μ = [MvNormalMeanCovariance(m0[:,k], Id) for k in 1:num_components],
Λ = [Wishart(n0, V0[:,:,k]) for k in 1:num_components]
)
# Iterations of variational inference
num_iters = 100
# Variational inference
results = inference(
model = GMM(prior_params, K=num_components, N=num_samples),
data = (X = observations,),
constraints = constraints,
initmarginals = initmarginals,
iterations = num_iters,
free_energy = true,
showprogress = true,
)
Progress: 100%|█████████████████████████████████████████| Time: 0:00:06
Inference results:
  Posteriors       | available for (μ, π, Λ, z)
  Free Energy:     | Real[217.442, 202.111, 200.341, 199.685, 199.169, 198.64, 198.078, 197.534, 197.088, 196.743  …  193.448, 193.448, 193.448, 193.448, 193.448, 193.448, 193.448, 193.448, 193.448, 193.448]
Alright, we're done. Let's track the evolution of free energy over iterations.
plot(1:num_iters,
results.free_energy,
color="black",
xscale=:log10,
xlabel="Number of iterations",
ylabel="Free Energy",
size=(800,300))
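Besides eyeballing the curve, we can check numerically that the free energy never increases across iterations, as we would expect from these coordinate-wise mean-field updates. A minimal check on the free_energy vector returned by inference:

# Differences between consecutive free energy values; these should be non-positive (up to numerical noise)
ΔF = diff(results.free_energy)
maximum(ΔF)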
That looks like it is nicely decreasing. Let's now visualize the estimated clusters on top of the observations.
# Extract mixture weights
π_hat = mean(results.posteriors[:π][num_iters])
# Extract component means
μ_hat = mean.(results.posteriors[:μ][num_iters])
# Extract covariance matrices
Σ_hat = inv.(mean.(results.posteriors[:Λ][num_iters]))
# Select dimensions to plot
xlims = [minimum(data["observations"][:,1])-1, maximum(data["observations"][:,1])+1]
ylims = [minimum(data["observations"][:,2])-1, maximum(data["observations"][:,2])+1]
# Plot data and overlay estimated posterior probabilities
plot_clusters(data["observations"][:, 1:2],
μ=μ_hat,
Σ=Σ_hat,
xlims=xlims,
ylims=ylims,
xlabel="Proportion of worked dorsal surface (PROZD)",
ylabel="Flaking angle (FLA)",
colors=[:reds, :blues, :greens],
figsize=(800,500))
That doesn't look bad. The three Gaussians nicely cover all samples.
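If you want a hard cluster assignment per site, one option is to pick the most probable component under the approximate posterior over the assignment variables. A small sketch, assuming probvec can be applied to these Categorical marginals:

# Most probable component for each observation, based on the final marginals over z
q_z = results.posteriors[:z][num_iters]
z_hat = [argmax(probvec(q_z[i])) for i in 1:num_samples];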
Play around with the number of components. Can you get an equally good coverage with just 2 components? What if you had 4?
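If you want a starting point for that exercise, here is a sketch for two components. The prior means, scale matrices, concentration parameters and initial marginals all have to be resized to match the new number of components; the particular prior means below are just a guess, not part of the original setup.

# Number of components for this experiment
K = 2

# Resized prior parameters (the prior means are a rough guess)
prior_params_2 = Dict{Symbol,Any}(
    :Λ => (cat(Id, Id, dims=3), num_features),
    :μ => [ 1.0 -1.0;
           -1.0  1.0],
    :π => ones(K))

# Rerun variational inference with the smaller mixture
results_2 = inference(
    model = GMM(prior_params_2, K=K, N=num_samples),
    data = (X = observations,),
    constraints = constraints,
    initmarginals = (
        π = Dirichlet(ones(K)),
        μ = [MvNormalMeanCovariance(prior_params_2[:μ][:,k], Id) for k in 1:K],
        Λ = [Wishart(num_features, Id) for k in 1:K]),
    iterations = num_iters,
    free_energy = true,
    showprogress = true,
)

Comparing the final free energy of this run with that of the three-component run above gives an indication of which number of components the data supports.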
We can also plot the evolution of the parameters over iterations of the variational inference procedure.
# Extract the mean and variance of the mixture weights at each iteration
mean_π_iters = cat(mean.(results.posteriors[:π])..., dims=2)
vars_π_iters = cat(var.( results.posteriors[:π])..., dims=2)
plot(1:num_iters,
mean_π_iters',
ribbon=vars_π_iters',
xscale=:log10,
xlabel="Number of iterations",
ylabel="Free Energy",
size=(800,300))
Plot the evolution of one of the component means over iterations of variational inference, including its variance.
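A possible sketch for this exercise, tracking the posterior mean of the first component in both dimensions and using the marginal variances (the diagonal of the posterior covariance) as ribbons:

# Posterior mean and marginal variances of component 1 at each iteration
mean_μ1_iters = cat([mean(results.posteriors[:μ][i][1]) for i in 1:num_iters]..., dims=2)
vars_μ1_iters = cat([diag(cov(results.posteriors[:μ][i][1])) for i in 1:num_iters]..., dims=2)

plot(1:num_iters,
     mean_μ1_iters',
     ribbon=vars_μ1_iters',
     xscale=:log10,
     xlabel="Number of iterations",
     ylabel="Mean of first component",
     size=(800,300))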