Consider again a set of observed data $D=\{x_1,\dotsc,x_N\}$.
We consider the same model as we did in the generative classification lesson: $$\begin{align*} p(x_n | z_{nk}=1) &= \mathcal{N}\left( x_n | \mu_k, \Sigma_k\right)\\ p(z_{nk}=1) &= \pi_k \end{align*}$$ which can be summarized with the selection variables $z_{nk}$ as $$\begin{align*} p(x_n,z_n) &= p(x_n | z_n) p(z_n) = \prod_{k=1}^K (\underbrace{\pi_k \cdot \mathcal{N}\left( x_n | \mu_k, \Sigma_k\right) }_{p(x_n,z_{nk}=1)})^{z_{nk}} \end{align*}$$
Again, this is the same model as we defined for generative classification: a Gaussian-Categorical model, but now with unobserved classes.
This model (with unobserved class labels) is known as a Gaussian Mixture Model (GMM).
The full proof is left as an exercise.
Eq. B-9.12 reveals the link to the name Gaussian mixture model. The priors $\pi_k$ are also called mixture coefficients.
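As a quick illustration of this mixture form, here is a minimal sketch (a 1-D example with made-up parameters, using the Distributions.jl package; not part of the lesson code) of the generative view and of the marginal $p(x_n) = \sum_k \pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k)$:

using Distributions
w = [0.3, 0.7]                                  # mixture coefficients π_k = p(z_nk=1)
comps = [Normal(-2.0, 1.0), Normal(3.0, 0.5)]   # class-conditional distributions p(x_n|z_nk=1)
gmm = MixtureModel(comps, w)
z = rand(Categorical(w))                        # generative view: first draw a class z_n,
x = rand(comps[z])                              # then draw x_n from the selected component
pdf(gmm, x) ≈ sum(w[k] * pdf(comps[k], x) for k in 1:2)  # marginal p(x_n) = Σ_k π_k N(x_n|μ_k,σ_k²)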
where we defined $q_{-j}^*(z_{-j}) \triangleq q_1^*(z_1)q_2^*(z_2)\cdots q_{j-1}^*(z_{j-1})q_{j+1}^*(z_{j+1})\cdots q_m^*(z_m)$.
In practice, we solve this chicken-and-egg problem by an iterative approach: we first initialize all $q_j(z_j)$ (for $j=1,\ldots,m$) to an appropriate initial distribution and then cycle through the factors in turn by solving eq. B-10.9 and updating $q_{-j}^*(z_{-j})$ with the latest estimates. (See Blei et al. (2017), Algorithm 1, p. 864.)
This algorithm for approximating Bayesian inference is known as Coordinate Ascent Variational Inference (CAVI).
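As a minimal sketch of CAVI (a toy example that is not part of the lesson: a correlated bivariate Gaussian target $p(z)=\mathcal{N}(z\,|\,\mu,\Lambda^{-1})$ approximated by a factorized $q(z_1)q(z_2)$, cf. Bishop §10.1.2), the coordinate updates reduce to simple mean updates that are cycled until convergence:

μ = [1.0, 2.0]                 # mean of the target distribution
Λ = [2.0 -1.2; -1.2 2.0]       # precision matrix of the target distribution
m1, m2 = 0.0, 0.0              # initialize the variational means
for i = 1:20                   # cycle through the factors (CAVI iterations)
    global m1 = μ[1] - (Λ[1,2]/Λ[1,1]) * (m2 - μ[2])  # update q(z1) = N(m1, 1/Λ[1,1])
    global m2 = μ[2] - (Λ[2,1]/Λ[2,2]) * (m1 - μ[1])  # update q(z2) = N(m2, 1/Λ[2,2])
end
println("q(z1) = N($m1, $(1/Λ[1,1])),  q(z2) = N($m2, $(1/Λ[2,2]))")  # means converge to μ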
using Pkg;Pkg.activate("probprog/workspace");Pkg.instantiate()
IJulia.clear_output();
using DataFrames, CSV, LinearAlgebra, PDMats, SpecialFunctions, Distributions
include("scripts/gmm_plot.jl") # Holds plotting function
old_faithful = CSV.read("datasets/old_faithful.csv",DataFrame);
X = convert(Matrix{Float64}, [old_faithful[!,1] old_faithful[!,2]]');#data matrix
N = size(X, 2) #number of observations
K = 6
function sufficientStatistics(X,r,k::Int) #function to compute sufficient statistics
    N_k = sum(r[k,:])
    hat_x_k = sum([r[k,n]*X[:,n] for n in 1:N]) ./ N_k
    S_k = sum([r[k,n]*(X[:,n]-hat_x_k)*(X[:,n]-hat_x_k)' for n in 1:N]) ./ N_k
    return N_k, hat_x_k, S_k
end
function updateMeanPrecisionPi(m_0,β_0,W_0,ν_0,α_0,r) #variational maximisation function
    m = Array{Float64}(undef,2,K) #means of the clusters
    β = Array{Float64}(undef,K) #precision scaling for Gaussian distribution
    W = Array{Float64}(undef,2,2,K) #scale matrices for Wishart distribution
    ν = Array{Float64}(undef,K) #degrees-of-freedom parameter for Wishart distribution
    α = Array{Float64}(undef,K) #Dirichlet distribution parameter
    for k=1:K
        sst = sufficientStatistics(X,r,k)
        α[k] = α_0[k] + sst[1]; β[k] = β_0[k] + sst[1]; ν[k] = ν_0[k] + sst[1]
        m[:,k] = (1/β[k])*(β_0[k].*m_0[:,k] + sst[1].*sst[2])
        W[:,:,k] = inv(inv(W_0[:,:,k])+sst[3]*sst[1] + ((β_0[k]*sst[1])/(β_0[k]+sst[1])).*(sst[2]-m_0[:,k])*(sst[2]-m_0[:,k])')
    end
    return m,β,W,ν,α
end
function updateR(Λ,m,α,ν,β) #variational expectation function
    r = Array{Float64}(undef,K,N) #responsibilities
    hat_π = Array{Float64}(undef,K) #π̃_k = exp( E[log π_k] )
    hat_Λ = Array{Float64}(undef,K) #(1/2)·E[log det Λ_k] for D=2
    for k=1:K
        hat_Λ[k] = 1/2*(2*log(2)+logdet(Λ[:,:,k])+digamma(ν[k]/2)+digamma((ν[k]-1)/2))
        hat_π[k] = exp(digamma(α[k])-digamma(sum(α)))
        for n=1:N
            #r_nk ∝ π̃_k·Λ̃_k^(1/2)·exp(-D/(2β_k) - (ν_k/2)(x_n-m_k)'W_k(x_n-m_k)), cf. Bishop §10.2.1
            r[k,n] = hat_π[k]*exp(hat_Λ[k]-1/β[k] - (ν[k]/2)*(X[:,n]-m[:,k])'*Λ[:,:,k]*(X[:,n]-m[:,k]))
        end
    end
    for n=1:N
        r[:,n] = r[:,n]./ sum(r[:,n]) #normalize to ensure r represents probabilities
    end
    return r
end
max_iter = 15
#store the inference results in these vectors
ν = fill!(Array{Float64}(undef,K,max_iter),3)
β = fill!(Array{Float64}(undef,K,max_iter),1.0)
α = fill!(Array{Float64}(undef,K,max_iter),0.01)
R = Array{Float64}(undef,K,N,max_iter)
M = Array{Float64}(undef,2,K,max_iter)
Λ = Array{Float64}(undef,2,2,K,max_iter)
clusters_vb = Array{Distribution}(undef,K,max_iter) #clusters to be plotted
#initialize prior distribution parameters
M[:,:,1] = [[3.0; 60.0];[4.0; 70.0];[2.0; 50.0];[2.0; 60.0];[3.0; 100.0];[4.0; 80.0]]
for k=1:K
    Λ[:,:,k,1] = [1.0 0;0 0.01]
    R[k,:,1] = 1/(K)*ones(N)
    clusters_vb[k,1] = MvNormal(M[:,k,1],PDMats.PDMat(convert(Matrix,Hermitian(inv(ν[1,1].*Λ[:,:,k,1])))))
end
#variational inference
for i=1:max_iter-1
    #variational expectation (E-step)
    R[:,:,i+1] = updateR(Λ[:,:,:,i],M[:,:,i],α[:,i],ν[:,i],β[:,i])
    #variational maximisation (M-step)
    M[:,:,i+1],β[:,i+1],Λ[:,:,:,i+1],ν[:,i+1],α[:,i+1] = updateMeanPrecisionPi(M[:,:,i],β[:,i],Λ[:,:,:,i],ν[:,i],α[:,i],R[:,:,i+1])
    for k=1:K
        clusters_vb[k,i+1] = MvNormal(M[:,k,i+1],PDMats.PDMat(convert(Matrix,Hermitian(inv(ν[k,i+1].*Λ[:,:,k,i+1])))))
    end
end
subplot(2,3,1); plotGMM(X, clusters_vb[:,1], R[:,:,1]); title("Initial situation")
for i=2:6
    subplot(2,3,i)
    plotGMM(X, clusters_vb[:,i*2], R[:,:,i*2]); title("After $(i*2) iterations")
end
PyPlot.tight_layout();
The generated figure looks much like Figure 10.6 in Bishop. The plots show the results of a variational Bayesian mixture of K = 6 Gaussians applied to the Old Faithful data set. The ellipses denote the one-standard-deviation density contours for each of the components, and the color coding of the data points reflects the "soft" class label assignments. Components whose expected mixing coefficients are numerically indistinguishable from zero are not plotted. Note that this procedure learns the number of classes (two), learns the class label for each observation, and learns the mean and variance for the Gaussian data distributions.
In the Bayesian Machine Learning lecture, we discussed the complexity-accuracy (CA) decomposition of Bayesian model evidence to support the interpretation of evidence as a model performance criterion. Here, we recognize that the free energy (FE) allows a similar CA decomposition: minimizing FE rewards data accuracy and penalizes model complexity.
The CA decomposition makes use of the prior $p(z)$ and likelihood $p(x|z)$, both of which are selected by the engineer, so the FE can be evaluated with this decomposition!
If we now assume some parametric form for $q(z)$, e.g. $q(z) = \mathcal{N}(z\mid \mu, \Sigma)$, then the FE functional degenerates to a regular function $F(\mu,\Sigma)$. In principle, this function can be evaluated by the CA decomposition and minimized by standard (function) minimization methods.
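As a minimal sketch (a made-up one-dimensional example, not part of the lesson code): take $p(z) = \mathcal{N}(z|0,1)$, $p(x|z)=\mathcal{N}(x|z,1)$, a single observation $x$, and $q(z)=\mathcal{N}(z|\mu,\sigma^2)$. The CA decomposition $F(\mu,\sigma) = \mathrm{KL}[q(z)\|p(z)] - \mathbb{E}_q[\log p(x|z)]$ is then available in closed form and can be minimized by any standard method (a crude grid search below); the minimizer recovers the exact posterior $\mathcal{N}(x/2, 1/2)$.

x_obs = 1.5                                                # one (made-up) observation
complexity(μ, σ) = 0.5*(μ^2 + σ^2 - 1 - log(σ^2))          # KL( N(μ,σ²) || N(0,1) )
accuracy(μ, σ) = -0.5*log(2π) - 0.5*((x_obs - μ)^2 + σ^2)  # E_q[ log N(x_obs | z, 1) ]
F(μ, σ) = complexity(μ, σ) - accuracy(μ, σ)                # CA decomposition of the free energy
grid_μ = -2:0.01:3; grid_σ = 0.05:0.005:2                  # crude grid over (μ, σ)
Fmin, idx = findmin([F(μ, σ) for μ in grid_μ, σ in grid_σ])
println("q(z) ≈ N($(grid_μ[idx[1]]), $(grid_σ[idx[2]]^2))  vs exact posterior N($(x_obs/2), 0.5)")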
The energy-entropy (EE) decomposition provides a link to the second law of thermodynamics: minimizing FE leads to entropy maximization.
Consider a model $$p(x,z,\theta)$$ with observations $x = \{x_n\}$, latent variables $z=\{z_n\}$ and parameters $\theta$.
We can write the following FE functional for this model: $$\begin{align*} F[q] = \sum_z \sum_\theta q(z,\theta) \log \frac{q(z,\theta)}{p(x,z,\theta)} \end{align*}$$
The EM algorithm makes the following simplifying assumptions:
1. The prior for the parameters is uninformative, $p(\theta) \propto \text{const}$, so that $p(x,z,\theta) \propto p(x,z|\theta)$.
2. The posterior factorizes as $q(z,\theta) = q(z)\, q(\theta)$ (a mean-field assumption).
3. The posterior for the parameters is a point estimate, $q(\theta) = \delta(\theta - \hat{\theta})$.
Basically, these three assumptions turn FE minimization into maximum likelihood estimation for the parameters $\theta$ and the FE simplifies to $$\begin{align*} F[q,\theta] = \sum_z q(z) \log \frac{q(z)}{p(x,z|\theta)} \end{align*}$$
The EM algorithm minimizes this FE by iterating (iteration counter: $i$) over
- the E-step: set $q^{(i)}(z) := p(z|x,\theta^{(i-1)})$, and
- the M-step: set $\theta^{(i)} := \arg\max_\theta \sum_z q^{(i)}(z) \log p(x,z|\theta)$.
These choices are optimal for the given FE functional. In order to see this, consider the two decompositions $$\begin{align} F[q,\theta] &= \underbrace{-\sum_z q(z) \log p(x,z|\theta)}_{\text{energy}} - \underbrace{\sum_z q(z) \log \frac{1}{q(z)}}_{\text{entropy}} \tag{EE}\\ &= \underbrace{\sum_z q(z) \log \frac{q(z)}{p(z|x,\theta)}}_{\text{divergence}} - \underbrace{\log p(x|\theta)}_{\text{log-likelihood}} \tag{DE} \end{align}$$
The DE decomposition shows that the FE is minimized for the choice $q(z) := p(z|x,\theta)$. Also, for this choice, the FE equals the (negative) log-evidence (which in this case simplifies to the log-likelihood).
The EE decomposition shows that, for fixed $q(z)$, the FE is minimized w.r.t. $\theta$ by minimizing the energy term (the entropy term does not depend on $\theta$). The expected energy is computed in the E-step and minimized over $\theta$ in the M-step.
In order to execute the EM algorithm, it is assumed that the E- and M-steps can be carried out analytically. For a large set of models (including models whose distributions belong to the exponential family), this is indeed the case, which explains the wide popularity of the EM algorithm.
The EM algorithm imposes rather severe assumptions on the FE (basically approximating Bayesian inference by maximum likelihood estimation). Over the past few years, the rise of probabilistic programming languages has dramatically increased the range of models for which the parameters can be estimated automatically by (approximate) Bayesian inference, so the popularity of EM is slowly waning. (More on this in the Probabilistic Programming lessons.)
Bishop (2006) works out EM for the GMM in section 9.2.
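In summary (cf. Bishop §9.2.2), the E-step computes the responsibilities $$\gamma_{nk} \triangleq q(z_{nk}=1) = \frac{\pi_k\, \mathcal{N}(x_n|\mu_k,\Sigma_k)}{\sum_{j=1}^K \pi_j\, \mathcal{N}(x_n|\mu_j,\Sigma_j)}$$ and the M-step re-estimates the parameters by responsibility-weighted maximum likelihood: $$\begin{align*} N_k &= \sum_{n=1}^N \gamma_{nk}\,, \qquad \pi_k = \frac{N_k}{N}\,,\\ \mu_k &= \frac{1}{N_k}\sum_{n=1}^N \gamma_{nk}\, x_n\,, \qquad \Sigma_k = \frac{1}{N_k}\sum_{n=1}^N \gamma_{nk}\,(x_n - \mu_k)(x_n - \mu_k)^T\,. \end{align*}$$ These are the updates implemented in the demo below.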
We'll perform clustering on the data set from the illustrative example by fitting a GMM consisting of two Gaussians using the EM algorithm.
using Pkg; Pkg.activate("probprog/workspace");Pkg.instantiate();
IJulia.clear_output();
using DataFrames, CSV, LinearAlgebra, Distributions
include("scripts/gmm_plot.jl") # Holds plotting function
old_faithful = CSV.read("datasets/old_faithful.csv", DataFrame);
X = Array(Matrix{Float64}(old_faithful)')
N = size(X, 2)
# Initialize the GMM. We assume 2 clusters.
clusters = [MvNormal([4.;60.], [.5 0;0 10^2]);
MvNormal([2.;80.], [.5 0;0 10^2])];
π_hat = [0.5; 0.5] # Mixing weights
γ = fill!(Matrix{Float64}(undef,2,N), NaN) # Responsibilities (row per cluster)
# Define functions for updating the parameters and responsibilities
function updateResponsibilities!(X, clusters, π_hat, γ)
    # Expectation step: update γ
    norm = [pdf(clusters[1], X) pdf(clusters[2], X)] * π_hat
    γ[1,:] = (π_hat[1] * pdf(clusters[1],X) ./ norm)'
    γ[2,:] = 1 .- γ[1,:]
end
function updateParameters!(X, clusters, π_hat, γ)
    # Maximization step: update π_hat and clusters using ML estimation
    m = sum(γ, dims=2)
    π_hat[:] = m / N # in-place update, so the caller sees the new mixing weights
    μ_hat = (X * γ') ./ m'
    for k=1:2
        Z = (X .- μ_hat[:,k])
        Σ_k = Symmetric(((Z .* (γ[k,:])') * Z') / m[k])
        clusters[k] = MvNormal(μ_hat[:,k], convert(Matrix, Σ_k))
    end
end
# Execute the algorithm: iteratively update parameters and responsibilities
subplot(2,3,1); plotGMM(X, clusters, γ); title("Initial situation")
updateResponsibilities!(X, clusters, π_hat, γ)
subplot(2,3,2); plotGMM(X, clusters, γ); title("After first E-step")
updateParameters!(X, clusters, π_hat, γ)
subplot(2,3,3); plotGMM(X, clusters, γ); title("After first M-step")
iter_counter = 1
for i=1:3
    for j=1:i+1
        updateResponsibilities!(X, clusters, π_hat, γ)
        updateParameters!(X, clusters, π_hat, γ)
        global iter_counter += 1 # declare global so the counter is also updated when run as a script
    end
    subplot(2,3,3+i);
    plotGMM(X, clusters, γ);
    title("After $(iter_counter) iterations")
end
PyPlot.tight_layout()
Note that you can step through this interactive demo yourself by running the script in Julia. You can run a script in Julia by entering
julia> include("path/to/script-name.jl")
Also, we assume a mean-field approximation for the posterior:
$$
q(x) = \prod_{i=1}^N q_i(x_i)
$$
and consequently a corresponding free energy functional
$$\begin{align*}
F[q] &= \sum_x q(x) \log \frac{q(x)}{p(x)} \\
&= \sum_{x} \left(\prod_{i=1}^N q_i(x_i)\right) \log \frac{\prod_{i=1}^N q_i(x_i)}{\prod_{a=1}^M p_a(x_a)}
\end{align*}$$
With these assumptions, it can be shown that the FE evaluates to (exercise) $$ F[q] = \sum_{a=1}^M \underbrace{\sum_{x_a} \left( \prod_{j\in N(a)} q_j(x_j)\cdot \left(-\log p_a(x_a)\right) \right) }_{\text{node energy }U[p_a]} - \sum_{i=1}^N \underbrace{\sum_{x_i} q_i(x_i) \log \frac{1}{q_i(x_i)}}_{\text{edge entropy }H[q_i]} $$
In words, the FE decomposes into a sum of (expected) energies for the nodes minus the entropies on the edges.
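As a minimal numerical check of this decomposition (a toy example with made-up factors, not part of the lesson code), consider two binary variables with factors $p_a(x_1)$ and $p_b(x_1,x_2)$ and an arbitrary mean-field $q$:

p_a = [0.3, 0.7]                     # factor over x1
p_b = [0.4 0.6; 0.9 0.1]             # factor over (x1, x2)
q1 = [0.2, 0.8]; q2 = [0.5, 0.5]     # mean-field beliefs q1(x1), q2(x2)
# direct evaluation: F[q] = Σ_x q(x) log( q(x) / Π_a p_a(x_a) )
F_direct = sum(q1[i]*q2[j] * log(q1[i]*q2[j] / (p_a[i]*p_b[i,j])) for i in 1:2, j in 1:2)
# sum of node energies minus sum of edge entropies
U_a = sum(q1[i]*(-log(p_a[i])) for i in 1:2)
U_b = sum(q1[i]*q2[j]*(-log(p_b[i,j])) for i in 1:2, j in 1:2)
H_1 = -sum(q1 .* log.(q1)); H_2 = -sum(q2 .* log.(q2))
F_direct ≈ (U_a + U_b) - (H_1 + H_2)   # true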
Let us now consider the local free energy that is associated with the edge corresponding to $x_j$.
Apparently (see the previous slide), there are three contributions to the free energy for $x_j$:
- the node energy $U[p_a]$ of node $a$,
- the node energy $U[p_b]$ of node $b$ (the two nodes that attach to edge $x_j$), and
- the (negative) edge entropy $H[q_j]$.
The local free energy for $x_j$ can be written as (exercise) $$ F[q_j] \propto \sum_{x_j} q_j(x_j) \log \frac{q_j(x_j)}{\nu_a(x_j)\cdot \nu_b(x_j)} $$ where $$\begin{align*} \nu_a(x_j) &\propto \exp\left( \mathbb{E}_{q_{k}}\left[ \log p_a(x_a)\right]\right) \\ \nu_b(x_j) &\propto \exp\left( \mathbb{E}_{q_{l}}\left[ \log p_b(x_b)\right]\right) \end{align*}$$ and $\mathbb{E}_{q_{k}}\left[\cdot\right]$ denotes an expectation w.r.t. all $q(x_k)$ with $k \in N(a)\setminus \{j\}$ (similarly, $\mathbb{E}_{q_{l}}\left[\cdot\right]$ is an expectation w.r.t. all $q(x_l)$ with $l \in N(b)\setminus \{j\}$).
$\nu_a(x_j)$ and $\nu_b(x_j)$ can be locally computed in nodes $a$ and $b$ respectively and can be interpreted as colliding messages over edge $x_j$.
Local free energy minimization is achieved by setting $$ q_j(x_j) \propto \nu_a(x_j) \cdot \nu_b(x_j) $$
Note that message $\nu_a(x_j)$ depends on posterior beliefs over incoming edges ($k$) for node $a$, and in turn, the message from node $a$ towards edge $x_k$ depends on the belief $q_j(x_j)$. I.o.w., direct mutual dependencies exist between posterior beliefs over edges that attach to the same node.
These considerations lead to the Variational Message Passing procedure, which is an iterative free energy minimization procedure that can be executed completely through locally computable messages.
Procedure VMP, see Dauwels (2007), section 3
1. Initialize all messages $q$ and $\nu$, e.g., $q(\cdot) \propto 1$ and $\nu(\cdot) \propto 1$.
2. Select an edge $z_k$ in the factor graph of $f(z_1,\ldots,z_m)$.
3. Compute the two messages $\overrightarrow{\nu}(z_k)$ and $\overleftarrow{\nu}(z_k)$ by applying the following generic rule: $$ \overrightarrow{\nu}(y) \propto \exp\left( \mathbb{E}_{q}\left[ \log g(x_1,\dots,x_n,y)\right] \right) $$
4. Compute the marginal $q(z_k)$ $$ q(z_k) \propto \overrightarrow{\nu}(z_k) \overleftarrow{\nu}(z_k) $$ and send it to the two nodes connected to the edge $z_k$.
5. Iterate 2–4 until convergence.
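As a minimal sketch of this procedure (a toy example with made-up discrete factors, not part of the lesson code), consider a chain with two binary variables $z_1, z_2$, leaf factors $f_a(z_1)$ and $f_c(z_2)$, and a pairwise factor $f_b(z_1,z_2)$. The VMP messages and marginals can then be iterated as follows:

f_a = [0.3, 0.7]; f_c = [0.8, 0.2]      # leaf factors over z1 and z2
f_b = [0.4 0.6; 0.9 0.1]                # pairwise factor over (z1, z2)
q1 = [0.5, 0.5]; q2 = [0.5, 0.5]        # initialize all beliefs to uniform
for it = 1:20
    # messages towards edge z1 and its marginal q(z1) ∝ ν_a(z1)·ν_b(z1)
    ν_a = f_a                                              # f_a has no other edges
    ν_b1 = exp.([sum(q2 .* log.(f_b[i,:])) for i in 1:2])  # exp( E_{q2}[ log f_b(z1,z2) ] )
    global q1 = (ν_a .* ν_b1) ./ sum(ν_a .* ν_b1)
    # messages towards edge z2 and its marginal q(z2) ∝ ν_b(z2)·ν_c(z2)
    ν_b2 = exp.([sum(q1 .* log.(f_b[:,j])) for j in 1:2])  # exp( E_{q1}[ log f_b(z1,z2) ] )
    ν_c = f_c
    global q2 = (ν_b2 .* ν_c) ./ sum(ν_b2 .* ν_c)
end
println("VMP beliefs: q(z1) = $q1, q(z2) = $q2")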
We showed that, under mean field assumptions, the FE can be decomposed into a sum of local FE contributions for the nodes ($a$) and edges ($i$): $$ F[q] = \sum_{a=1}^M \underbrace{\sum_{x_a} \left( \prod_{j\in N(a)} q_j(x_j)\cdot \left(-\log p_a(x_a)\right) \right) }_{\text{node energy }U[p_a]} - \sum_{i=1}^N \underbrace{\sum_{x_i} q_i(x_i) \log \frac{1}{q_i(x_i)}}_{\text{edge entropy }H[q_i]} $$
The mean field assumption is very strong and may lead to large inference costs ($\mathrm{KL}(q(x),p(x|\text{data}))$). A more relaxed assumption is to allow joint posterior beliefs over the variables that attach to a node. This idea is expressed by the Bethe Free Energy: $$ F_B[q] = \sum_{a=1}^M \left( \sum_{x_a} q_a(x_a) \log \frac{q_a(x_a)}{p_a(x_a)} \right) - \sum_{i=1}^N (d_i - 1) \sum_{x_i} q_i(x_i) \log {q_i(x_i)} $$ where $q_a(x_a)$ is the posterior joint belief over the variables $x_a$ (i.e., the set of variables that attach to node $a$), $q_i(x_i)$ is the posterior marginal belief over the variable $x_i$ and $d_i$ is the number of factor nodes that link to edge $i$. Moreover, $q_a(x_a)$ and $q_i(x_i)$ are constrained to obey the following equalities: $$ \sum_{x_a \backslash x_i} q_a(x_a) = q_i(x_i), ~~~ \forall i, \forall a \\ \sum_{x_i} q_i(x_i) = 1, ~~~ \forall i \\ \sum_{x_a} q_a(x_a) = 1, ~~~ \forall a \\ $$
We form the Lagrangian by augmenting the Bethe Free Energy functional with the constraints: $$ L[q] = F_B[q] + \sum_i\sum_{a \in N(i)} \sum_{x_i} \lambda_{ai}(x_i) \left(q_i(x_i) - \sum_{x_a\backslash x_i} q_a(x_a) \right) + \sum_{i} \gamma_i \left( \sum_{x_i}q_i(x_i) - 1\right) + \sum_{a}\gamma_a \left( \sum_{x_a}q_a(x_a) -1\right) $$
The stationary solutions for this Lagrangian are given by $$ q_a(x_a) = f_a(x_a) \exp\left(\gamma_a -1 + \sum_{i \in N(a)} \lambda_{ai}(x_i)\right) \\ q_i(x_i) = \exp\left(1- \gamma_i + \sum_{a \in N(i)} \lambda_{ai}(x_i)\right) ^{\frac{1}{d_i - 1}} $$ where $N(i)$ denotes the factor nodes that have $x_i$ in their arguments and $N(a)$ denotes the set of variables in the argument of $f_a$.
Stationary solutions are functions of Lagrange multipliers. This means that Lagrange multipliers need to be determined. Lagrange multipliers can be determined by plugging the stationary solutions back into the constraint specification and solving for the multipliers which ensure that the constraint is satisfied. The first constraint we consider is normalization, which yields the following identification: $$ \gamma_a = 1 - \log \Bigg(\sum_{x_a}f_a(x_a)\exp\left(\sum_{i \in N(a)}\lambda_{ai}(x_i)\right)\Bigg)\\ \gamma_i = 1 + (d_i-1) \log\Bigg(\sum_{x_i}\exp\left( \frac{1}{d_i-1}\sum_{a \in N(i)} \lambda_{ai}(x_i)\right)\Bigg). $$
With the Lagrange multipliers for the normalization constraints determined, it remains to find the multipliers that correspond to the marginalization constraint. To do so, we solve $$ \sum_{x_a \backslash x_i} f_a(x_a) \exp\left(\sum_{i \in N(a)} \lambda_{ai}(x_i)\right) = \exp\left(\sum_{a \in N(i)} \lambda_{ai}(x_i)\right) ^{\frac{1}{d_i - 1}} \\ \exp\left(\lambda_{ai}(x_i)\right)\sum_{x_a \backslash x_i} f_a(x_a) \exp\Bigg(\sum_{\substack{{j \in N(a)} \\ j \neq i}}\lambda_{aj}(x_j)\Bigg) = \exp\left(\sum_{a \in N(i)} \lambda_{ai}(x_i)\right) ^{\frac{1}{d_i - 1}}\\ \exp\left(\lambda_{ai}(x_i) + \lambda_{ia}(x_i)\right) = \exp\left(\sum_{a \in N(i)} \lambda_{ai}(x_i)\right) ^{\frac{1}{d_i - 1}}\,, $$ where we defined an auxiliary function $$ \exp(\lambda_{ia}(x_i)) \triangleq \sum_{x_a \backslash x_i} f_a(x_a) \exp\Bigg(\sum_{\substack{{j \in N(a)} \\ j \neq i}}\lambda_{aj}(x_j)\Bigg) \,. $$ This definition is valid since it can be inverted by the relation $$ \lambda_{ia}(x_i) = \frac{2-d_i}{d_i - 1}\lambda_{ai}(x_i) + \frac{1}{d_i -1}\sum_{\substack{c \in N(i)\\c \neq a}}\lambda_{ci}(x_i) $$
In general it is not possible to solve for the Lagrange multipliers analytically and we resort to iteratively obtaining the solutions. This leads to the Belief Propagation algorithm where the exponentiated Lagrange multipliers (messages) are updated iteratively via $$ \mu_{ia}^{(k+1)}(x_i) = \sum_{x_a \backslash x_i} f_a(x_a) \prod_{\substack{{j \in N(a)} \\ j \neq i}}\mu^{(k)}_{aj}(x_j) \\ \mu_{ai}^{(k)}(x_i) = \prod_{\substack{c \in N(i) \\ c \neq a}}\mu^{(k)}_{ic}(x_i)\,, $$ where $k$ denotes iteration number and the messages are defined as $$ \mu_{ia}(x_i) \triangleq \exp(\lambda_{ia}(x_i))\\ \mu_{ai}(x_i) \triangleq \exp(\lambda_{ai}(x_i))\,. $$
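As a minimal sketch of these updates (a toy tree-structured example with made-up factors, not part of the lesson code), consider two binary variables $x_1, x_2$ with leaf factors $f_a(x_1)$, $f_c(x_2)$ and a pairwise factor $f_b(x_1,x_2)$; on this tree a single pass suffices and BP recovers the exact marginals:

f_a = [0.3, 0.7]; f_c = [0.8, 0.2]; f_b = [0.4 0.6; 0.9 0.1]   # toy factors
# factor-to-variable messages (leaf factors need no incoming messages)
μ_1a = f_a                  # from f_a towards x1
μ_2c = f_c                  # from f_c towards x2
μ_1b = f_b  * μ_2c          # Σ_{x2} f_b(x1,x2)·μ(x2), from f_b towards x1
μ_2b = f_b' * μ_1a          # Σ_{x1} f_b(x1,x2)·μ(x1), from f_b towards x2
# beliefs: normalized products of the incoming messages on each edge
q1 = (μ_1a .* μ_1b) ./ sum(μ_1a .* μ_1b)
q2 = (μ_2c .* μ_2b) ./ sum(μ_2c .* μ_2b)
# sanity check against the exact marginals of the normalized joint f_a(x1)·f_b(x1,x2)·f_c(x2)
p = [f_a[i]*f_b[i,j]*f_c[j] for i in 1:2, j in 1:2]; p = p ./ sum(p)
q1 ≈ vec(sum(p, dims=2)) && q2 ≈ vec(sum(p, dims=1))   # true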
For a more complete overview of message passing as Bethe Free Energy minimization, see Zhang (2017).
open("../../styles/aipstyle.html") do f
display("text/html", read(f, String))
end