underlying causes (factors) that control the observed data.
Tobamovirus data set, see section 4.1 in Tipping and Bishop (1999) (originally from Ripley (1996)).
X = readDataSet("datasets/virus3.dat")
or equivalently $$\begin{align*} p(x|z) &= \mathcal{N}(x|\,W z + \mu,\Psi) \tag{likelihood}\\ p(z) &= \mathcal{N}(z|\,0,I) \tag{prior} \end{align*}$$
$p(z|x)$ distributions are also Gaussian.
since the mean evaluates to $$\begin{align*} \mathrm{E}[x] &= \mathrm{E}[W z + \mu+ \epsilon] \\ &= W \mathrm{E}[z] + \mu + \mathrm{E}[\epsilon] \\ &= \mu \end{align*}$$ and the covariance matrix is $$\begin{align*} \mathrm{cov}[x] &= \mathrm{E}[({x}-{\mu})({x}-{\mu})^T] \\ &= \mathrm{E}[(W z +\epsilon)(W z +\epsilon)^T] \\ &= W \mathrm{E}[z z^T] W^T + \mathrm{E}[\epsilon \epsilon^T] \\ &= W W^T + \Psi \end{align*}$$
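As a quick numerical sanity check of the two moments derived above, the following sketch draws samples from the generative model and compares the sample mean and sample covariance with $\mu$ and $W W^T + \Psi$. All dimensions, parameter values and variable names (`X_sim`, `ψ`, and so on) are assumptions made for this illustration only.

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(1)                      # fixed seed so the run is repeatable
D, M, N = 5, 2, 200_000              # observed dim, latent dim, sample count (hypothetical values)

W = randn(D, M)                      # factor loading matrix
μ = randn(D)                         # mean vector
ψ = 0.1 .+ rand(D)                   # diagonal of Ψ (positive noise variances)

Z_sim = randn(M, N)                  # z_n ~ N(0, I), one column per sample
E_sim = sqrt.(ψ) .* randn(D, N)      # ε_n ~ N(0, Ψ) with Ψ = Diagonal(ψ)
X_sim = W * Z_sim .+ μ .+ E_sim      # x_n = W z_n + μ + ε_n

μ_hat = vec(mean(X_sim, dims=2))     # sample mean, should be close to μ
Σ_hat = cov(X_sim, dims=2)           # sample covariance, should be close to W Wᵀ + Ψ
Σ_model = W * W' + Diagonal(ψ)       # covariance predicted by the derivation

println("max |μ_hat - μ|       = ", maximum(abs.(μ_hat - μ)))
println("max |Σ_hat - Σ_model| = ", maximum(abs.(Σ_hat - Σ_model)))
```

Both printed deviations should shrink as `N` grows, since they are sampling errors only.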
long skinny matrices plus a diagonal matrix.
i.e., the noise model is discarded altogether and the columns of $W$ are orthonormal.
$\Rightarrow$ FA, pPCA and PCA differ only by their model for the noise variance $\Psi$ (namely, diagonal, isotropic and 'zeros').
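Written out, the three noise models are as follows. The symbol $\psi_i$ for the FA noise variance of observed dimension $i$ is notation assumed here for illustration; $\sigma^2$ is the single pPCA noise variance.

$$\begin{align*}
\text{FA:} \quad & \Psi = \mathrm{diag}(\psi_i) \\
\text{pPCA:} \quad & \Psi = \sigma^2 I \\
\text{PCA:} \quad & \Psi = 0
\end{align*}$$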
$\Rightarrow$ PCA is very widely applied to image and signal processing tasks!
$\Rightarrow$ FA has a strong history in the 'social sciences'.
and observations ${D}=\{x_1,\dotsc,x_N\}$, find ML estimates for the parameters $\theta=\{W,\mu,\sigma^2\}$
Now subtract $\hat {\mu}$ from all data points (${x}_n:= {x}_n-\hat {\mu}$) and assume that we have zero-mean data.
and optimize w.r.t. $W$ and $\sigma^2$.
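For pPCA this optimization also has a closed-form solution, given by Tipping and Bishop (1999): $\sigma^2$ is the average of the discarded eigenvalues of the sample covariance matrix, and $W$ is built from the $M$ leading eigenvectors (up to an arbitrary rotation, taken as the identity here). The sketch below is one possible way to code this under those assumptions; the function name `ppca_ml_closed_form` and the synthetic data are hypothetical and are not part of the course code.

```julia
using LinearAlgebra, Random, Statistics

# Closed-form ML estimates for pPCA (Tipping and Bishop, 1999), with the
# arbitrary rotation set to the identity. X holds one data point per column.
# `ppca_ml_closed_form` is a hypothetical name used only for this sketch.
function ppca_ml_closed_form(X::AbstractMatrix, M::Integer)
    D, N = size(X)
    μ = vec(mean(X, dims=2))                 # ML estimate of the mean
    Xc = X .- μ                              # zero-mean data
    S = Symmetric(Xc * Xc' / N)              # ML (biased) sample covariance
    F = eigen(S)
    order = sortperm(F.values, rev=true)     # sort eigenvalues in descending order
    λ = F.values[order]
    U = F.vectors[:, order]
    σ2 = sum(λ[M+1:end]) / (D - M)           # average of the discarded eigenvalues
    W = U[:, 1:M] * Diagonal(sqrt.(max.(λ[1:M] .- σ2, 0.0)))
    return W, μ, σ2
end

# Synthetic check: data generated from a known pPCA model (hypothetical values)
Random.seed!(2)
D, M, N = 6, 2, 100_000
W_true = randn(D, M)
σ2_true = 0.25
X_syn = W_true * randn(M, N) .+ sqrt(σ2_true) .* randn(D, N)

W_ml, μ_ml, σ2_ml = ppca_ml_closed_form(X_syn, M)
Σ_err = maximum(abs.(W_ml * W_ml' + σ2_ml * I - (W_true * W_true' + σ2_true * I)))
println("σ² estimate: ", σ2_ml)
println("max covariance error: ", Σ_err)
```

Note that `W_ml` itself need not match `W_true`; only the implied covariance $W W^T + \sigma^2 I$ is comparable, which is why the check is done on that quantity.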
Let's perform pPCA on the example (Tobamovirus) data set using EM. We'll find the two principal components ($M=2$), and then visualize the data in a 2-D plot. The implementation is quite straightforward; have a look at the source file if you're interested in the details.
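For readers who want a feel for what such an EM routine involves, here is a minimal sketch of the standard EM updates for pPCA on complete, zero-mean data. It is not the code in the source file: the name `ppca_em_sketch`, the fixed iteration count, the random initialization and the synthetic data are all assumptions, and it does not handle missing values.

```julia
using LinearAlgebra, Random

# EM for pPCA on complete, zero-mean data X (one data point per column).
# `ppca_em_sketch` is a hypothetical name; the course's own pPCA function may differ.
function ppca_em_sketch(X::AbstractMatrix, M::Integer; iterations::Integer=200)
    D, N = size(X)
    W = randn(D, M)                              # random initialization
    σ2 = 1.0
    for _ in 1:iterations
        # E-step: posterior moments of the latent variables
        Minv = inv(Symmetric(W' * W + σ2 * I))   # (WᵀW + σ²I)⁻¹, an M × M matrix
        Ez = Minv * W' * X                       # column n is E[z_n | x_n]
        sumEzz = N * σ2 * Minv + Ez * Ez'        # Σ_n E[z_n z_nᵀ | x_n]
        # M-step: re-estimate W first, then σ² using the new W
        W = (X * Ez') / sumEzz
        σ2 = (sum(abs2, X) - 2 * sum(Ez .* (W' * X)) + tr(sumEzz * (W' * W))) / (N * D)
    end
    Minv = inv(Symmetric(W' * W + σ2 * I))
    return W, σ2, Minv * W' * X                  # loadings, noise variance, latent means
end

# Synthetic usage example (hypothetical values, true σ² = 0.25)
Random.seed!(3)
D, M, N = 6, 2, 5_000
X_syn = randn(D, M) * randn(M, N) .+ 0.5 .* randn(D, N)
X_syn = X_syn .- sum(X_syn, dims=2) ./ N         # center the data
W_em, σ2_em, Z_em = ppca_em_sketch(X_syn, M)
println("σ² estimate: ", σ2_em)
```

The third return value plays the role of the 2-D coordinates that get plotted: each column is the posterior mean of the latent variable for one data point.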
using LinearAlgebra
include("scripts/pca_demo_helpers.jl") # helper file named below; assumed to define pPCA
X = readDataSet("datasets/virus3.dat")
(θ, Z) = pPCA(convert(Matrix,X'), 2) # uses EM, implemented in scripts/pca_demo_helpers.jl. Feel free to try more/less dimensions.
using PyPlot
plot(Z[1,:], Z[2,:], "w")
for n=1:size(Z,2)
    PyPlot.text(Z[1,n], Z[2,n], string(n), fontsize=10) # put a label on the position of the data point
end
title("Projection of Tobamovirus data set on two dimensions (numbers correspond to data points)", fontsize=10);
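Note that the solution is not unique, since $W$ is determined only up to an orthogonal transformation (rotation or reflection) of the latent space, so the projection may appear rotated or mirrored. The clusters should nonetheless be more or less persistent.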
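The non-uniqueness can be made concrete with a small sketch: multiplying $W$ by any orthogonal matrix $R$ leaves the marginal covariance $W W^T + \sigma^2 I$ unchanged, so the data cannot distinguish $W$ from $W R$. The dimensions, the rotation angle and the variable names below are arbitrary choices for illustration.

```julia
using LinearAlgebra, Random

Random.seed!(4)
D, M = 5, 2                          # hypothetical dimensions
W = randn(D, M)
σ2 = 0.3

α = 0.7                              # arbitrary rotation angle
R = [cos(α) -sin(α); sin(α) cos(α)]  # orthogonal 2 × 2 rotation matrix
W_rot = W * R                        # rotated loading matrix

Σ = W * W' + σ2 * I                  # marginal covariance under W
Σ_rot = W_rot * W_rot' + σ2 * I      # marginal covariance under W R

println("same covariance: ", Σ ≈ Σ_rot)   # equal up to floating-point error
println("same loadings:   ", W ≈ W_rot)   # the loading matrices themselves differ
```

The latent coordinates rotate along with $W$, which is why the 2-D picture can turn while the grouping of the points stays the same.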
Now let's randomly remove 20% of the data:
X_corrupt = convert(Matrix{Float64}, X) # convert to floating point matrix so we can use NaN to indicate missing values
indices = findall(rand(Float64,size(X)) .< 0.2)
X_corrupt[indices] .= NaN
(θ, Z) = pPCA(convert(Matrix,X_corrupt'), 2) # Perform pPCA on the corrupted data set
plot(Z[1,:], Z[2,:], "w")
for n=1:size(Z,2)
    PyPlot.text(Z[1,n], Z[2,n], string(n), fontsize=10) # put a label on the position of the data point
end
title("Projection of CORRUPTED Tobamovirus data set on two dimensions", fontsize=10);
As you can see, pPCA is quite robust in the face of missing data.