underlying causes (factors) that control the observed data.
Tobamovirus data set, see section 4.1 in Tipping and Bishop (1999) (Originally from Ripley (1996)).
include("scripts/pca_demo_helpers.jl")
X = readDataSet("datasets/virus3.dat")
38×18 Array{Int64,2}: 17 13 14 16 4 9 14 1 13 0 11 13 5 7 1 4 11 5 12 11 9 12 6 5 12 1 9 1 7 12 5 6 0 4 8 2 18 16 16 16 8 6 14 1 14 0 9 12 4 8 0 2 11 3 18 16 15 19 8 6 11 1 15 1 7 13 5 8 0 2 9 3 17 13 13 22 8 4 18 1 10 3 8 11 7 6 1 2 10 2 16 13 16 21 9 3 17 1 10 4 7 12 7 5 1 2 11 3 22 19 10 16 10 4 18 1 12 2 8 11 6 8 0 1 8 2 20 10 24 10 6 9 21 0 7 0 7 18 4 9 1 4 8 2 20 21 12 15 9 7 11 1 9 3 8 14 6 7 0 1 10 3 20 21 12 15 9 7 11 1 9 3 9 14 5 7 0 1 10 3 18 11 24 10 9 6 19 0 12 0 7 14 4 11 0 4 9 1 20 12 23 10 8 5 20 0 13 0 6 13 4 11 0 4 10 1 18 19 18 16 8 4 12 0 12 0 10 15 8 6 1 1 12 1 ⋮ ⋮ ⋮ ⋮ 17 12 22 10 8 5 18 0 14 0 5 13 4 10 0 3 9 1 17 16 16 16 8 6 15 1 14 0 9 12 4 8 0 2 11 3 19 17 15 17 7 6 15 1 14 0 8 12 4 8 0 2 10 3 18 16 16 19 8 6 11 1 15 1 7 13 5 8 0 2 9 3 18 17 15 17 8 6 15 1 14 0 8 12 4 8 0 3 9 3 15 12 14 23 8 3 17 1 9 4 7 15 6 6 1 2 11 2 13 11 14 22 7 3 17 1 10 4 8 13 6 6 1 3 11 2 16 11 15 23 10 4 18 1 10 3 7 12 6 5 1 2 9 3 14 11 14 25 11 3 19 2 10 2 7 12 6 5 1 2 9 3 11 11 15 24 10 5 18 1 11 1 7 14 5 7 2 3 11 2 15 9 12 21 8 4 21 1 10 3 7 15 7 6 1 3 10 3 15 11 15 22 7 3 19 1 8 3 4 14 6 5 1 2 10 2
or equivalently $$\begin{align*} p(x|z) &= \mathcal{N}(x|\,W z + \mu,\Psi) \tag{likelihood}\\ p(z) &= \mathcal{N}(z|\,0,I) \tag{prior} \end{align*}$$
$p(z|x)$ distributions are also Gaussian.
since the mean evaluates to $$\begin{align*} \mathrm{E}[x] &= \mathrm{E}[W z + \mu+ \epsilon] \\ &= W \mathrm{E}[z] + \mu + \mathrm{E}[\epsilon] \\ &= \mu \end{align*}$$ and the covariance matrix is $$\begin{align*} \mathrm{cov}[x] &= \mathrm{E}[({x}-{\mu})({x}-{\mu})^T] \\ &= \mathrm{E}[(W z +\epsilon)(W z +\epsilon)^T] \\ &= W \mathrm{E}[z z^T] W^T + \mathrm{E}[\epsilon \epsilon^T] \\ &= W W^T + \Psi \end{align*}$$
long skinny matrices plus a diagonal matrix.
i.e., the noise model is discarded altogether and the columns of $W$ are orthonormal.
$\Rightarrow$ FA, pPCA and PCA differ only by their model for the noise variance $\Psi$ (namely, diagonal, isotropic and 'zeros').
$\Rightarrow$ PCA is very widely applied to image and signal processing tasks!
$\Rightarrow$ FA has strong history in 'social sciences'
and observations ${D}=\{x_1,\dotsc,x_N\}$, find ML estimates for the parameters $\theta=\{W,\mu,\sigma\}$
Now subtract $\hat {\mu}$ from all data points (${x}_n:= {x}_n-\hat {\mu}$) and assume that we have zero-mean data.
and optimize w.r.t. $W$ and $\sigma^2$.
Let's perform pPCA on the example (Tobamovirus) data set using EM. We'll find the two principal components ($M=2$), and then visualize the data in a 2-D plot. The implementation is quite straightforward, have a look at the source file if you're interested in the details.
using LinearAlgebra
include("scripts/pca_demo_helpers.jl")
X = readDataSet("datasets/virus3.dat")
(θ, Z) = pPCA(convert(Matrix,X'), 2)# uses EM, implemented in scripts/pca_demo_helpers.jl. Feel free to try more/less dimensions.
using PyPlot
plot(Z[1,:], Z[2,:], "w")
for n=1:size(Z,2)
PyPlot.text(Z[1,n], Z[2,n], string(n), fontsize=10) # put a label on the position of the data point
end
title("Projection of Tobamovirus data set on two dimensions (numbers correspond to data points)", fontsize=10);
Note that the solution is not unique, but the clusters should be more or less persistent.
Now let's randomly remove 20% of the data:
X_corrupt = convert(Matrix{Float64}, X)# convert to floating point matrix so we can use NaN to indicate missing values
indices = findall(rand(Float64,size(X)) .< 0.2)
X_corrupt[indices] .= NaN
println(X_corrupt)
[17.0 13.0 14.0 16.0 4.0 9.0 NaN 1.0 13.0 0.0 11.0 13.0 NaN 7.0 1.0 4.0 11.0 5.0; 12.0 11.0 9.0 12.0 NaN 5.0 12.0 1.0 9.0 1.0 7.0 12.0 5.0 6.0 0.0 NaN 8.0 2.0; 18.0 16.0 16.0 NaN 8.0 6.0 14.0 1.0 NaN 0.0 9.0 12.0 NaN 8.0 0.0 2.0 11.0 NaN; 18.0 16.0 15.0 19.0 8.0 6.0 11.0 1.0 15.0 NaN 7.0 13.0 5.0 8.0 0.0 2.0 9.0 3.0; 17.0 13.0 13.0 NaN 8.0 4.0 18.0 1.0 10.0 NaN 8.0 11.0 7.0 6.0 1.0 NaN 10.0 2.0; 16.0 13.0 16.0 21.0 9.0 3.0 17.0 NaN 10.0 NaN 7.0 12.0 7.0 5.0 1.0 2.0 11.0 NaN; 22.0 NaN 10.0 16.0 10.0 NaN 18.0 NaN NaN 2.0 8.0 11.0 6.0 8.0 0.0 1.0 8.0 NaN; 20.0 10.0 24.0 10.0 6.0 9.0 21.0 0.0 7.0 0.0 7.0 NaN 4.0 9.0 1.0 NaN NaN 2.0; 20.0 21.0 12.0 15.0 NaN 7.0 11.0 1.0 NaN 3.0 8.0 14.0 6.0 7.0 NaN 1.0 10.0 3.0; NaN 21.0 NaN 15.0 9.0 7.0 11.0 1.0 9.0 3.0 9.0 14.0 5.0 7.0 NaN 1.0 10.0 3.0; NaN 11.0 NaN 10.0 9.0 6.0 19.0 0.0 12.0 NaN 7.0 14.0 NaN NaN NaN 4.0 9.0 1.0; 20.0 12.0 23.0 10.0 8.0 5.0 20.0 0.0 13.0 0.0 NaN 13.0 4.0 NaN NaN 4.0 10.0 1.0; NaN NaN 18.0 16.0 NaN 4.0 12.0 0.0 12.0 0.0 10.0 15.0 8.0 6.0 1.0 1.0 12.0 1.0; 17.0 NaN NaN 15.0 8.0 6.0 14.0 1.0 14.0 0.0 9.0 12.0 NaN NaN 0.0 3.0 NaN 3.0; 19.0 17.0 NaN NaN 8.0 NaN 14.0 1.0 14.0 0.0 8.0 12.0 NaN NaN 0.0 2.0 12.0 NaN; NaN NaN 15.0 NaN 8.0 5.0 14.0 1.0 14.0 NaN 8.0 12.0 4.0 8.0 0.0 2.0 12.0 3.0; NaN 15.0 16.0 16.0 8.0 6.0 14.0 1.0 15.0 NaN 8.0 12.0 NaN 8.0 NaN 2.0 NaN 3.0; 17.0 17.0 16.0 19.0 NaN 6.0 NaN 1.0 NaN 1.0 7.0 13.0 5.0 8.0 0.0 2.0 9.0 NaN; 18.0 17.0 NaN NaN 8.0 6.0 11.0 1.0 NaN 1.0 7.0 13.0 5.0 8.0 0.0 NaN 9.0 3.0; 22.0 NaN 10.0 16.0 10.0 5.0 17.0 1.0 12.0 2.0 8.0 11.0 6.0 8.0 0.0 1.0 8.0 NaN; 18.0 NaN NaN 18.0 NaN 8.0 17.0 1.0 14.0 1.0 5.0 16.0 4.0 NaN 0.0 NaN NaN 2.0; NaN 16.0 16.0 15.0 NaN 6.0 13.0 1.0 14.0 1.0 8.0 12.0 4.0 8.0 1.0 2.0 12.0 3.0; 20.0 21.0 12.0 NaN 9.0 NaN 11.0 1.0 10.0 3.0 NaN 14.0 7.0 7.0 0.0 1.0 9.0 3.0; 20.0 21.0 12.0 15.0 9.0 7.0 11.0 NaN 10.0 3.0 9.0 14.0 5.0 7.0 0.0 1.0 10.0 3.0; 18.0 12.0 NaN 10.0 9.0 NaN NaN 0.0 14.0 0.0 7.0 12.0 4.0 NaN 0.0 NaN 10.0 1.0; 18.0 12.0 21.0 10.0 10.0 NaN 18.0 0.0 13.0 0.0 8.0 12.0 4.0 NaN 0.0 4.0 10.0 NaN; 17.0 12.0 22.0 10.0 8.0 5.0 18.0 0.0 NaN NaN 5.0 13.0 4.0 10.0 0.0 3.0 NaN 1.0; NaN 16.0 16.0 NaN 8.0 6.0 15.0 1.0 NaN NaN NaN 12.0 4.0 8.0 0.0 2.0 11.0 NaN; 19.0 NaN 15.0 17.0 7.0 NaN 15.0 1.0 14.0 0.0 8.0 12.0 4.0 8.0 0.0 2.0 NaN 3.0; NaN 16.0 16.0 19.0 NaN 6.0 11.0 1.0 15.0 1.0 NaN 13.0 5.0 8.0 0.0 2.0 9.0 3.0; 18.0 17.0 15.0 17.0 8.0 6.0 15.0 1.0 NaN 0.0 8.0 12.0 4.0 8.0 0.0 3.0 9.0 3.0; 15.0 12.0 14.0 23.0 8.0 3.0 17.0 1.0 9.0 NaN 7.0 15.0 6.0 6.0 NaN 2.0 11.0 2.0; 13.0 11.0 14.0 22.0 NaN 3.0 NaN 1.0 10.0 4.0 8.0 13.0 6.0 6.0 1.0 3.0 11.0 2.0; 16.0 11.0 15.0 23.0 10.0 4.0 18.0 1.0 10.0 3.0 7.0 12.0 6.0 NaN 1.0 NaN 9.0 3.0; 14.0 NaN 14.0 NaN 11.0 3.0 19.0 2.0 10.0 2.0 7.0 12.0 6.0 5.0 NaN 2.0 9.0 3.0; NaN NaN 15.0 24.0 10.0 NaN 18.0 1.0 11.0 1.0 7.0 NaN 5.0 NaN 2.0 3.0 11.0 2.0; 15.0 NaN 12.0 21.0 8.0 4.0 NaN 1.0 10.0 3.0 7.0 15.0 7.0 6.0 1.0 3.0 10.0 3.0; 15.0 11.0 NaN 22.0 NaN 3.0 19.0 1.0 8.0 3.0 4.0 14.0 6.0 5.0 1.0 2.0 10.0 NaN]
(θ, Z) = pPCA(convert(Matrix,X_corrupt'), 2) # Perform pPCA on the corrupted data set
plot(Z[1,:], Z[2,:], "w")
for n=1:size(Z,2)
PyPlot.text(Z[1,n], Z[2,n], string(n), fontsize=10) # put a label on the position of the data point
end
title("Projection of CORRUPTED Tobamovirus data set on two dimensions", fontsize=10);
As you can see, pPCA is quite robust in the face of missing data.
open("../../styles/aipstyle.html") do f
display("text/html", read(f, String))
end