Let's investigate a real-world example of PCA with the MNIST digits dataset, a very popular collection of scanned handwritten digits.

This is a fairly simple dataset. **How good is PCA at finding the components that best describe this dataset?**

In [1]:

```
import numpy as np
import pandas as pd
data = np.load("mnist.npz")
X = data["X"]
labels = data["y"]
```
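The cell above assumes a local `mnist.npz` with arrays stored under the keys `X` and `y`. A minimal sketch of that file layout, using a tiny synthetic stand-in (the file name and shapes here are illustrative, not the real dataset):

```python
import numpy as np

# Build a tiny stand-in dataset: 10 fake "images" of 784 pixels each.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(10, 784), dtype=np.uint8)
y_demo = rng.integers(0, 10, size=10, dtype=np.uint8)

# Save and reload in the same .npz layout that mnist.npz is assumed to use.
np.savez("mnist_demo.npz", X=X_demo, y=y_demo)
data = np.load("mnist_demo.npz")
print(data["X"].shape, data["y"].shape)   # (10, 784) (10,)
```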

In [2]:

```
X.shape
```

Out[2]:

(70000, 784)

In [3]:

```
# visualize one of the digits (reshape to 28x28)
#np.set_printoptions(linewidth=200)
X[1].reshape(28,28)
```

Out[3]:

```
array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  51, 159, 253, 159,  50,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  48, 238, 252, 252, 252, 237,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  54, 227, 253, 252, 239, 233, 252,  57,   6,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  10,  60, 224, 252, 253, 252, 202,  84, 252, 253, 122,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 163, 252, 252, 252, 253, 252, 252,  96, 189, 253, 167,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  51, 238, 253, 253, 190, 114, 253, 228,  47,  79, 255, 168,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,  48, 238, 252, 252, 179,  12,  75, 121,  21,   0,   0, 253, 243,  50,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,  38, 165, 253, 233, 208,  84,   0,   0,   0,   0,   0,   0, 253, 252, 165,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   7, 178, 252, 240,  71,  19,  28,   0,   0,   0,   0,   0,   0, 253, 252, 195,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,  57, 252, 252,  63,   0,   0,   0,   0,   0,   0,   0,   0,   0, 253, 252, 195,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0, 198, 253, 190,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 255, 253, 196,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  76, 246, 252, 112,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 253, 252, 148,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 230,  25,   0,   0,   0,   0,   0,   0,   0,   0,   7, 135, 253, 186,  12,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 223,   0,   0,   0,   0,   0,   0,   0,   0,   7, 131, 252, 225,  71,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 145,   0,   0,   0,   0,   0,   0,   0,  48, 165, 252, 173,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  86, 253, 225,   0,   0,   0,   0,   0,   0, 114, 238, 253, 162,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 249, 146,  48,  29,  85, 178, 225, 253, 223, 167,  56,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  85, 252, 252, 252, 229, 215, 252, 252, 252, 196, 130,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  28, 199, 252, 252, 253, 252, 252, 233, 145,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,  25, 128, 252, 253, 252, 141,  37,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0]], dtype=uint8)
```

In [4]:

```
labels.shape
```

Out[4]:

(70000,)

Let's visualize a random selection of images.

To do this, matplotlib will need to be imported (Pandas plotting is a wrapper around matplotlib):

In [5]:

```
import matplotlib.pyplot as plt
```

In [6]:

```
# create a rows x cols grid of subplots
rows, cols = 4, 6
fig, axs = plt.subplots(nrows=rows, ncols=cols, figsize=(8,6))
for i in range(rows):
    for j in range(cols):
        # pick an image at random
        k = np.random.randint(len(labels))
        # reshape the image
        img = X[k].reshape(28,28)
        label = labels[k]
        # plot the image
        axs[i,j].imshow(img, cmap='gray_r')
        axs[i,j].set_title(str(label))
        axs[i,j].axis('off')
```

Running this cell multiple times will show different images because no random seed is set.
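If reproducible panels are wanted, a seed can be fixed before the cell runs. A minimal sketch (the seed value is arbitrary):

```python
import numpy as np

np.random.seed(0)           # fix the global NumPy RNG
a = np.random.randint(100)
np.random.seed(0)           # re-seeding replays the exact same draw
b = np.random.randint(100)
print(a == b)               # True
```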

This data is complicated because the handwritten numbers...

- are in different places
- are of different widths
- have different shapes

This prompts a couple questions:

- How many dimensions are required to effectively code this relatively simple data?
- How good is that embedding?

Specifically, how many dimensions are needed to explain 95% of the variance? This will look at the standard deviation of the same pixel in different images. However, there's no control on image location: the handwriting can appear anywhere. **Why should the standard deviation of the same pixel in different images mean anything?**
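To see why per-pixel statistics still carry signal, here is a small synthetic sketch (the shapes and values are made up): pixels that are always background have zero standard deviation across images, so there is no variance there for PCA to explain.

```python
import numpy as np

# Stand-in for X: 100 "images" where only a central 10x10 block varies.
rng = np.random.default_rng(0)
X_demo = np.zeros((100, 784))
imgs = X_demo.reshape(100, 28, 28)            # view into X_demo
imgs[:, 9:19, 9:19] = rng.integers(0, 256, size=(100, 10, 10))

# Per-pixel standard deviation across images:
pixel_std = X_demo.std(axis=0).reshape(28, 28)
print(pixel_std[0, 0])        # 0.0 — corner pixel is always background
print(pixel_std[14, 14] > 0)  # True — central pixel varies image to image
```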

This can be found by computing all the principal components, then examining the `explained_variance_ratio_` attribute.

In [7]:

```
from sklearn.decomposition import PCA
embedding = PCA()
```

We *could* say "explain 95% of the variance", but let's look at *all* data PCA has to offer and make sure that 0.95 is a good value.

The Scikit-Learn PCA documentation explains that when `PCA()` is called with no arguments, it computes *all* the principal components.

In [8]:

```
embedding.fit(X)
```

Out[8]:

PCA()

In [9]:

```
import pandas as pd
dims = pd.Index(range(1,785), name="dimensions")
dim_explain = pd.Series(embedding.explained_variance_ratio_, index=dims)
dim_explain.head()
```

Out[9]:

```
dimensions
1    0.097461
2    0.071554
3    0.061495
4    0.054034
5    0.048889
dtype: float64
```

The cumulative fraction of variance explained can be plotted with the `cumsum` method:

In [10]:

```
dim_explain.cumsum().plot(grid=True)
```

Out[10]:

<AxesSubplot:xlabel='dimensions'>

Looks like around 150 features are needed in the MNIST dataset to explain 95% of the variance.
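The read-off from the plot can also be done numerically. A sketch with a toy ratio vector (the numbers below are made up so they sum exactly to 1):

```python
import numpy as np

# Toy explained-variance ratios (already sorted, summing to 1).
ratios = np.array([0.5, 0.25, 0.125, 0.125])
cum = np.cumsum(ratios)       # [0.5, 0.75, 0.875, 1.0]

# Smallest number of components whose cumulative ratio reaches 87.5%:
n_components = int(np.argmax(cum >= 0.875)) + 1
print(n_components)           # 3
```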

There's another way to find how many principal components explain a given fraction of the variance: if `PCA` is called with a single number between 0 and 1, it keeps just enough principal components to explain that fraction of the variance. See the Scikit-Learn documentation for more detail.

Let's use that to explain 90% of the variance for this dataset.

In [11]:

```
embedding = PCA(0.90)
#embedding = PCA(n_components=2)
```

In [12]:

```
# compressed version of the dataset (reduced dimensions)
X_low = embedding.fit_transform(X)
```

Exactly how many principal components were used to explain 90% of the variance?

In [13]:

```
embedding.n_components_, embedding.explained_variance_ratio_.sum()
```

Out[13]:

(87, 0.9005229738568302)

Great – 90% of the variance is explained by 87 dimensions.

How well does this perform? Let's visualize how well the embedding works. It takes points that are 28x28 images (784 pixel values) and approximates each one with `n_components` principal components, each of which is itself a 28x28 image (784 values).

A reconstruction of the original 28x28 image can be built from only those `n_components` 28x28 principal components. How good does that reconstruction look? Let's use the `inverse_transform` method to reconstruct the digits from the PCA approximation:

In [14]:

```
# Use inverse transform to construct approximation to original dataset
X_est = embedding.inverse_transform(X_low)
```

Let's visualize the original images alongside their reconstructions:

In [15]:

```
fig, axs = plt.subplots(nrows=4, ncols=2, figsize=(6,8))
for left, right in axs:
    # pick an image at random
    k = np.random.randint(len(labels))
    # reshape images
    img = X[k].reshape(28, 28)
    img_est = X_est[k].reshape(28, 28).clip(min=0, max=255)
    left.imshow(img, cmap='gray_r')
    left.set_title('original')
    left.axis('off')
    right.imshow(img_est, cmap='gray_r')
    right.set_title('reconstructed')
    right.axis('off')
```

The `inverse_transform` call maps `n_components` features back to 784 features. Given the quality of the reconstruction, that means the images could be stored with 87 dimensions instead of 784 dimensions (alongside the `n_components` principal components held by the `PCA` object).
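As a back-of-the-envelope check on that storage claim (counting raw values only; the mean vector that `PCA` also stores is included for completeness):

```python
n_samples, n_features, n_components = 70_000, 784, 87

original   = n_samples * n_features                # raw dataset
compressed = (n_samples * n_components             # compressed codes
              + n_components * n_features          # principal components
              + n_features)                        # mean vector
print(original, compressed, round(original / compressed, 1))
# 54880000 6158992 8.9  — roughly a 9x reduction
```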

Let's visualize the principal components of these images:

In [16]:

```
N = min(embedding.n_components_, 10)
principal_components = [embedding.components_[k].reshape(28,28) for k in range(N)]
```

In [17]:

```
fig, axs = plt.subplots(ncols=10, figsize=(20, 2))
for i, (img, ax) in enumerate(zip(principal_components, axs), start=1):
    ax.imshow(img, cmap='bwr')
    ax.set_title('PC ' + str(i))
    ax.axis('off')
```

In [18]:

```
# let's take a closer look: NOTE! these are normalized vectors!
plt.imshow(principal_components[0], cmap='bwr')
plt.colorbar();
```

Let's visualize the first two dimensions, the two directions marked as "most important" by PCA.

In [20]:

```
show = pd.DataFrame(X_low)
show.plot.scatter(x=0, y=1, style="o", c=labels, cmap="tab10", s=2)
```

Out[20]:

<AxesSubplot:xlabel='0', ylabel='1'>

Two dimensions should be enough to visualize this dataset, so why can't MNIST be embedded in two dimensions? The data is not that complex: the digits are pretty simple. A couple of features could plausibly explain most of the relevant parts of each image: the number of edges, the number of curves, the number of crossings.

How can these relevant features be obtained from the 784 features? It looks like PCA doesn't recover them: it needs at least 80 features before it performs well. This notebook would look much different if run with `n_components=3`.

How do the reconstructed images `X_est` look with two dimensions, i.e. with `n_components=2`?

In [ ]:

```
```
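A possible sketch of this exercise, using a random stand-in for `X` so it runs on its own (with the notebook's data, use `X` instead of `X_demo`; the names `X_demo`, `pca2`, `X_low2`, `X_est2` are invented here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 784))      # stand-in for the MNIST matrix

pca2 = PCA(n_components=2)
X_low2 = pca2.fit_transform(X_demo)       # 2-D codes
X_est2 = pca2.inverse_transform(X_low2)   # reconstructions in pixel space
print(X_low2.shape, X_est2.shape)         # (200, 2) (200, 784)
```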

How much variance is explained by PCA with `n_components=2`?

In [ ]:

```
```
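And a sketch of this exercise, again on synthetic data so it runs on its own (with random data the explained fraction is small; on MNIST it would be larger, but still far below 0.90):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50))       # stand-in for X

pca2 = PCA(n_components=2).fit(X_demo)
explained = pca2.explained_variance_ratio_.sum()
print(0.0 < explained < 1.0)              # True
```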

Code has been provided (below) that computes a 2-dimensional approximation to MNIST using PCA. Then, K-Means clustering is applied to the transformed dataset, using 10 clusters. A visualization is then created that shows how each point is classified using KMeans.

How are the labels separated?

In [ ]:

```
from sklearn.cluster import KMeans
X_low = PCA(n_components=2).fit_transform(X)
labels_hat = KMeans(n_clusters=10).fit_predict(X_low)
```

In [ ]:

```
show = pd.DataFrame(X_low)
show.plot.scatter(x=0, y=1, style="o", c=labels_hat, cmap="tab10", s=2)
```
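One hypothetical way to quantify how well `labels_hat` separates the true labels is the adjusted Rand index (1.0 means perfect agreement, around 0.0 means chance). Sketched here on synthetic blobs rather than MNIST, so the clusters are known to be separable:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Well-separated synthetic 2-D clusters as a stand-in for X_low.
X_demo, y_demo = make_blobs(n_samples=300, centers=4, random_state=0)
y_hat = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_demo)

# High score here; on a real PCA projection of MNIST it would be much lower.
print(adjusted_rand_score(y_demo, y_hat))
```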