Let's investigate a real-world example of PCA with the MNIST digits dataset, a very popular collection of scanned handwritten digits.

This is a fairly simple dataset. **How good is PCA at finding the components that best describe this dataset?**

In [1]:

```
import numpy as np
import pandas as pd
data = np.load("mnist.npz")
X = data["X"]
labels = data["y"]
```
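The cell above assumes a local `mnist.npz` with arrays stored under the keys `X` and `y`. A minimal sketch of that file layout, using a tiny synthetic stand-in (the file name and shapes here are illustrative, not the real dataset):

```python
import numpy as np

# Build a tiny stand-in dataset: 10 fake "images" of 784 pixels each.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(10, 784), dtype=np.uint8)
y_demo = rng.integers(0, 10, size=10, dtype=np.uint8)

# Save and reload in the same .npz layout that mnist.npz is assumed to use.
np.savez("mnist_demo.npz", X=X_demo, y=y_demo)
data = np.load("mnist_demo.npz")
print(data["X"].shape, data["y"].shape)   # (10, 784) (10,)
```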

In [2]:

```
X.shape
```

Out[2]:

(70000, 784)

In [3]:

```
# visualize one of the digits (reshape to 28x28)
#np.set_printoptions(linewidth=200)
X[1].reshape(28,28)
```

Out[3]:

```
array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  51, 159, 253, 159,  50,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  48, 238, 252, 252, 252, 237,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  54, 227, 253, 252, 239, 233, 252,  57,   6,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  10,  60, 224, 252, 253, 252, 202,  84, 252, 253, 122,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 163, 252, 252, 252, 253, 252, 252,  96, 189, 253, 167,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  51, 238, 253, 253, 190, 114, 253, 228,  47,  79, 255, 168,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,  48, 238, 252, 252, 179,  12,  75, 121,  21,   0,   0, 253, 243,  50,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,  38, 165, 253, 233, 208,  84,   0,   0,   0,   0,   0,   0, 253, 252, 165,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   7, 178, 252, 240,  71,  19,  28,   0,   0,   0,   0,   0,   0, 253, 252, 195,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,  57, 252, 252,  63,   0,   0,   0,   0,   0,   0,   0,   0,   0, 253, 252, 195,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0, 198, 253, 190,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 255, 253, 196,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  76, 246, 252, 112,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 253, 252, 148,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 230,  25,   0,   0,   0,   0,   0,   0,   0,   0,   7, 135, 253, 186,  12,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 223,   0,   0,   0,   0,   0,   0,   0,   0,   7, 131, 252, 225,  71,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 145,   0,   0,   0,   0,   0,   0,   0,  48, 165, 252, 173,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  86, 253, 225,   0,   0,   0,   0,   0,   0, 114, 238, 253, 162,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 249, 146,  48,  29,  85, 178, 225, 253, 223, 167,  56,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 252, 252, 229, 215, 252, 252, 252, 196, 130,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  28, 199, 252, 252, 253, 252, 252, 233, 145,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,  25, 128, 252, 253, 252, 141,  37,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0]], dtype=uint8)
```

In [4]:

```
labels.shape
```

Out[4]:

(70000,)

Let's visualize a random selection of images.

To do this, matplotlib will need to be imported (Pandas plotting is a wrapper around matplotlib):

In [5]:

```
import matplotlib.pyplot as plt
```

In [6]:

```
# create a rows x cols grid of subplots
rows, cols = 4, 6
fig, axs = plt.subplots(nrows=rows, ncols=cols, figsize=(8,6))
for i in range(rows):
    for j in range(cols):
        # pick an image at random
        k = np.random.randint(len(labels))
        # reshape the image
        img = X[k].reshape(28,28)
        label = labels[k]
        # plot the image
        axs[i,j].imshow(img, cmap='gray_r')
        axs[i,j].set_title(str(label))
        axs[i,j].axis('off')
```

Running this cell multiple times will show different images because no random seed is set.
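If reproducible panels are wanted, a seed can be fixed before the cell runs. A minimal sketch (the seed value is arbitrary):

```python
import numpy as np

np.random.seed(0)           # fix the global NumPy RNG
a = np.random.randint(100)
np.random.seed(0)           # re-seeding replays the exact same draw
b = np.random.randint(100)
print(a == b)               # True
```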

This data is complicated because the handwritten numbers...

- are in different places
- are of different widths
- have different shapes

This prompts a couple questions:

- How many dimensions are required to effectively code this relatively simple data?
- How good is that embedding?

Specifically, how many dimensions are needed to explain 95% of the variance? This will look at the standard deviation of the same pixel in different images. However, there's no control on image location: the handwriting can appear anywhere. **Why should the standard deviation of the same pixel in different images mean anything?**
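To see why per-pixel statistics still carry signal, here is a small synthetic sketch (the shapes and values are made up): pixels that are always background have zero standard deviation across images, so there is no variance there for PCA to explain.

```python
import numpy as np

# Stand-in for X: 100 "images" where only a central 10x10 block varies.
rng = np.random.default_rng(0)
X_demo = np.zeros((100, 784))
imgs = X_demo.reshape(100, 28, 28)            # view into X_demo
imgs[:, 9:19, 9:19] = rng.integers(0, 256, size=(100, 10, 10))

# Per-pixel standard deviation across images:
pixel_std = X_demo.std(axis=0).reshape(28, 28)
print(pixel_std[0, 0])        # 0.0 — corner pixel is always background
print(pixel_std[14, 14] > 0)  # True — central pixel varies image to image
```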

This can be found by computing all the principal components, then examining the `explained_variance_ratio_` attribute.

In [7]:

```
from sklearn.decomposition import PCA
embedding = PCA()
```

We *could* say "explain 95% of the variance", but let's look at *all* data PCA has to offer and make sure that 0.95 is a good value.

The Scikit-Learn PCA documentation explains that when `PCA()` is called with no arguments, it computes *all* the principal components.

In [8]:

```
embedding.fit(X)
```

Out[8]:

PCA()

In [9]:

```
import pandas as pd
dims = pd.Index(range(1,785), name="dimensions")
dim_explain = pd.Series(embedding.explained_variance_ratio_, index=dims)
dim_explain.head()
```

Out[9]:

```
dimensions
1    0.097461
2    0.071554
3    0.061495
4    0.054034
5    0.048889
dtype: float64
```

The cumulative fraction of variance explained can be plotted with the `cumsum` method:

In [10]:

```
dim_explain.cumsum().plot(grid=True)
```

Out[10]:

<AxesSubplot:xlabel='dimensions'>

Looks like around 150 features are needed in the MNIST dataset to explain 95% of the variance.
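The read-off from the plot can also be done numerically. A sketch with a toy ratio vector (the numbers below are made up so they sum exactly to 1):

```python
import numpy as np

# Toy explained-variance ratios (already sorted, summing to 1).
ratios = np.array([0.5, 0.25, 0.125, 0.125])
cum = np.cumsum(ratios)       # [0.5, 0.75, 0.875, 1.0]

# Smallest number of components whose cumulative ratio reaches 87.5%:
n_components = int(np.argmax(cum >= 0.875)) + 1
print(n_components)           # 3
```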

There's another way to find how many principal components explain a given fraction of the variance: if `PCA` is called with a single number between 0 and 1, it keeps just enough principal components to explain that fraction of the variance. See the Scikit-Learn documentation for more detail.

Let's use that to explain 90% of the variance for this dataset.

In [11]:

```
embedding = PCA(0.90)
#embedding = PCA(n_components=2)
```

In [12]:

```
# compressed version of the dataset (reduced dimensions)
X_low = embedding.fit_transform(X)
```

Exactly how many principal components were used to explain 90% of the variance?

In [13]:

```
embedding.n_components_, embedding.explained_variance_ratio_.sum()
```

Out[13]:

(87, 0.9005229738568302)

Great – 90% of the variance is explained by 87 dimensions.

How well does this perform? Let's visualize how well the embedding works. It takes points that are 28x28 images (784 pixel values) and approximates each one with `n_components` principal components, each of which is itself a 28x28 image (784 values).

A reconstruction of the original 28x28 image can be built from only those `n_components` 28x28 principal components. How good does that reconstruction look? Let's use the `inverse_transform` method to reconstruct the digits from the PCA approximation:

In [14]:

```
# Use inverse transform to construct approximation to original dataset
X_est = embedding.inverse_transform(X_low)
```

Let's visualize the original images alongside their reconstructions:

In [15]:

```
fig, axs = plt.subplots(nrows=4, ncols=2, figsize=(6,8))
for left, right in axs:
    # pick an image at random
    k = np.random.randint(len(labels))
    # reshape images
    img = X[k].reshape(28, 28)
    img_est = X_est[k].reshape(28, 28).clip(min=0, max=255)
    left.imshow(img, cmap='gray_r')
    left.set_title('original')
    left.axis('off')
    right.imshow(img_est, cmap='gray_r')
    right.set_title('reconstructed')
    right.axis('off')
```

The `inverse_transform` call maps `n_components` features back to 784 features. Given the quality of the reconstruction, that means the images could be stored with 87 dimensions instead of 784 dimensions (alongside the `n_components` principal components held by the `PCA` object).
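As a back-of-the-envelope check on that storage claim (counting raw values only; the mean vector that `PCA` also stores is included for completeness):

```python
n_samples, n_features, n_components = 70_000, 784, 87

original   = n_samples * n_features                # raw dataset
compressed = (n_samples * n_components             # compressed codes
              + n_components * n_features          # principal components
              + n_features)                        # mean vector
print(original, compressed, round(original / compressed, 1))
# 54880000 6158992 8.9  — roughly a 9x reduction
```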

Let's visualize the principal components of these images:

In [16]:

```
N = min(embedding.n_components_, 10)
principal_components = [embedding.components_[k].reshape(28,28) for k in range(N)]
```

In [17]:

```
fig, axs = plt.subplots(ncols=10, figsize=(20, 2))
for i, (img, ax) in enumerate(zip(principal_components, axs), start=1):
    ax.imshow(img, cmap='bwr')
    ax.set_title('PC ' + str(i))
    ax.axis('off')
```

In [18]:

```
# let's take a closer look: NOTE! these are normalized vectors!
plt.imshow(principal_components[0], cmap='bwr')
plt.colorbar();
```

Let's visualize the first two dimensions, the two directions marked as "most important" by PCA.

In [20]:

```
show = pd.DataFrame(X_low)
show.plot.scatter(x=0, y=1, style="o", c=labels, cmap="tab10", s=2)
```

Out[20]:

<AxesSubplot:xlabel='0', ylabel='1'>

Two dimensions should be enough to visualize this dataset, so why can't MNIST be embedded in two dimensions? The data is not that complex: the digits are pretty simple. A couple of features could plausibly explain most of the relevant parts of each image: the number of edges, the number of curves, the number of crossings.

How can these relevant features be obtained from the 784 features? It looks like PCA doesn't recover them: it needs at least 80 features before it performs well. This notebook would look much different if run with `n_components=3`.

How do the reconstructed images `X_est` look with two dimensions, i.e. with `n_components=2`?

In [ ]:

```
```
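A possible sketch of this exercise, using a random stand-in for `X` so it runs on its own (with the notebook's data, use `X` instead of `X_demo`; the names `X_demo`, `pca2`, `X_low2`, `X_est2` are invented here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 784))      # stand-in for the MNIST matrix

pca2 = PCA(n_components=2)
X_low2 = pca2.fit_transform(X_demo)       # 2-D codes
X_est2 = pca2.inverse_transform(X_low2)   # reconstructions in pixel space
print(X_low2.shape, X_est2.shape)         # (200, 2) (200, 784)
```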

How much variance is explained by PCA with `n_components=2`?

In [ ]:

```
```
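And a sketch of this exercise, again on synthetic data so it runs on its own (with random data the explained fraction is small; on MNIST it would be larger, but still far below 0.90):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50))       # stand-in for X

pca2 = PCA(n_components=2).fit(X_demo)
explained = pca2.explained_variance_ratio_.sum()
print(0.0 < explained < 1.0)              # True
```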

Code has been provided (below) that computes a 2-dimensional approximation to MNIST using PCA. Then, K-Means clustering is applied to the transformed dataset, using 10 clusters. A visualization is then created that shows how each point is classified using KMeans.

How are the labels separated?

In [ ]:

```
from sklearn.cluster import KMeans
X_low = PCA(n_components=2).fit_transform(X)
labels_hat = KMeans(n_clusters=10).fit_predict(X_low)
```

In [ ]:

```
show = pd.DataFrame(X_low)
show.plot.scatter(x=0, y=1, style="o", c=labels_hat, cmap="tab10", s=2)
```
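One hypothetical way to quantify how well `labels_hat` separates the true labels is the adjusted Rand index (1.0 means perfect agreement, around 0.0 means chance). Sketched here on synthetic blobs rather than MNIST, so the clusters are known to be separable:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Well-separated synthetic 2-D clusters as a stand-in for X_low.
X_demo, y_demo = make_blobs(n_samples=300, centers=4, random_state=0)
y_hat = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_demo)

# High score here; on a real PCA projection of MNIST it would be much lower.
print(adjusted_rand_score(y_demo, y_hat))
```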