#!/usr/bin/env python
# coding: utf-8

# # **PCA with MNIST**
#
# Let's investigate a real-world example of PCA with the [MNIST digits dataset][mnist]. This is a very popular dataset that looks like:
#
# ![](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png)
#
# [mnist]:https://en.wikipedia.org/wiki/MNIST_database
#
# This is a very simple dataset. **How good is PCA at finding the components that best describe this dataset?**

# # Data

# In[1]:

import numpy as np
import pandas as pd

data = np.load("mnist.npz")
X = data["X"]
labels = data["y"]

# This data is 70,000 images, each of which is 28x28, or 784 pixels. The rows of pixels in each image are concatenated into a single 784-element row, so the data is "tidy", which is what Scikit-Learn expects:

# In[2]:

X.shape

# Each row of `X` holds the 784 pixel values (28 x 28) of one image.

# In[3]:

# visualize one of the digits (reshape to 28x28)
#np.set_printoptions(linewidth=200)
X[1].reshape(28, 28)

# In[4]:

labels.shape

# Let's visualize a random selection of images.
#
# To do this, matplotlib will need to be imported (Pandas plotting is a wrapper around matplotlib):

# In[5]:

import matplotlib.pyplot as plt

# In[6]:

# create a rows x cols grid of subplots
rows, cols = 4, 6
fig, axs = plt.subplots(nrows=rows, ncols=cols, figsize=(8, 6))
for i in range(rows):
    for j in range(cols):
        # pick an image at random
        k = np.random.randint(len(labels))
        # reshape the image
        img = X[k].reshape(28, 28)
        label = labels[k]
        # plot the image
        axs[i, j].imshow(img, cmap='gray_r')
        axs[i, j].set_title(str(label))
        axs[i, j].axis('off')

# Running this cell multiple times will show different images because no random seed is set.

# This data is complicated because the handwritten numbers...
#
# * are in different places
# * are of different widths
# * have different shapes
#
# This prompts a couple of questions:
#
# 1. How many dimensions are required to effectively encode this relatively simple data?
# 2. How good is that embedding?

# # How many dimensions are needed?
#
# Specifically, how many dimensions are needed to explain 95% of the variance? This looks at the standard deviation of the same pixel across different images. However, there's no control on image location: the handwriting can appear anywhere. **Why should the standard deviation of the same pixel across different images mean anything?**
#
# The answer can be found by computing all the principal components, then using the `explained_variance_ratio_` attribute.

# In[7]:

from sklearn.decomposition import PCA

embedding = PCA()

# We *could* say "explain 95% of the variance", but let's look at *all* the information PCA has to offer and make sure that 0.95 is a good value.
#
# The [Scikit-Learn PCA documentation][pca] explains that when `PCA()` is called with no arguments, it computes *all* the principal components.
#
# [pca]:https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA

# In[8]:

embedding.fit(X)

# In[9]:

dims = pd.Index(range(1, 785), name="dimensions")
dim_explain = pd.Series(embedding.explained_variance_ratio_, index=dims)
dim_explain.head()

# This information says "the 4th principal component explains 5.4% of the variance", and so on.
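# As a quick sanity check (a minimal sketch, not needed for the rest of the notebook): each of these ratios is just the variance captured by one component divided by the total variance, so the first value can be recomputed by hand from the `explained_variance_` attribute:

# In[ ]:

# optional check: with all 784 components kept, the explained variances sum to the
# total variance, so the first ratio equals the first variance over that total
total_variance = embedding.explained_variance_.sum()
manual_ratio = embedding.explained_variance_[0] / total_variance
manual_ratio, embedding.explained_variance_ratio_[0]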
# To see the cumulative effect (variance explained by the first k components combined), where k = 1, 2, ..., we can use the Series `cumsum` method:

# In[10]:

dim_explain.cumsum().plot(grid=True)

# Judging from the curve, roughly 150 of the 784 dimensions are needed to explain 95% of the variance in the MNIST dataset.

# There's another way to find how many principal components explain a given amount of the variance:
#
# > If `PCA` is called with just a number between 0 and 1, it uses just enough principal components to explain this fraction of the variance. More information is in the documentation:
# >
# > https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
#
# Let's use that to explain 90% of the variance for this dataset.

# # Principal Component Analysis

# In[11]:

embedding = PCA(0.90)
#embedding = PCA(n_components=2)

# In[12]:

# compressed version of the dataset (reduced dimensions)
X_low = embedding.fit_transform(X)

# Exactly how many principal components were used to explain 90% of the variance?

# In[13]:

embedding.n_components_, embedding.explained_variance_ratio_.sum()

# Great – 90% of the variance is explained by 87 dimensions.

# How well does this embedding perform? Let's visualize it. The embedding takes 28x28 images (784 pixel values each) and approximates each one using `n_components` principal components, each of which is itself 28x28 (784 values).
#
# A reconstruction of the original 28x28 image can be built from only those `n_components` 28x28 principal components. How does that reconstruction look? Let's use the `inverse_transform` method to reconstruct the digits from the PCA approximation:

# In[14]:

# use inverse_transform to construct an approximation to the original dataset
X_est = embedding.inverse_transform(X_low)

# Let's visualize the original images next to the reconstructed images:

# In[15]:

fig, axs = plt.subplots(nrows=4, ncols=2, figsize=(6, 8))
for left, right in axs:
    # pick an image at random
    k = np.random.randint(len(labels))
    # reshape images
    img = X[k].reshape(28, 28)
    img_est = X_est[k].reshape(28, 28).clip(min=0, max=255)
    left.imshow(img, cmap='gray_r')
    left.set_title('original')
    left.axis('off')
    right.imshow(img_est, cmap='gray_r')
    right.set_title('reconstructed')
    right.axis('off')

# This is very useful as a compression technique. PCA stores the `n_components` principal components (and the mean image), which let it map from `n_components` features back to 784 features. Given the quality of the reconstruction, the images could be stored with 87 values each instead of 784 (alongside the principal components held by the `PCA` object).

# ## Visualization

# Let's visualize the principal components of these images:

# In[16]:

N = min(embedding.n_components_, 10)
principal_components = [embedding.components_[k].reshape(28, 28) for k in range(N)]

# In[17]:

fig, axs = plt.subplots(ncols=10, figsize=(20, 2))
i = 0
for img, ax in zip(principal_components, axs):
    ax.imshow(img, cmap='bwr')
    i = i + 1
    ax.set_title('PC ' + str(i))
    ax.axis('off')

# In[18]:

# let's take a closer look: NOTE! these are normalized (unit-length) vectors!
plt.imshow(principal_components[0], cmap='bwr')
plt.colorbar();

# The first image is the first principal component. It's the most important because PCA orders components by how much variance they explain. It says that the most basic PCA approximation to any digit is somewhat 0-shaped. For example, at this coarsest level an 8 looks a lot like a 0.
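# It's worth spelling out what `inverse_transform` did above (a minimal sketch, assuming the default `whiten=False`): each reconstruction is the mean image plus a weighted sum of the principal-component images, with the weights taken from the corresponding row of `X_low`:

# In[ ]:

# optional check (assumes whiten=False, the default): rebuild one image by hand
# from the mean image and the 87 principal components, then compare to X_est
idx = 0
manual_est = embedding.mean_ + X_low[idx] @ embedding.components_
np.allclose(manual_est, X_est[idx])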
# Let's visualize the first two dimensions, the two dimensions PCA marks as "most important".

# In[20]:

show = pd.DataFrame(X_low)
show.plot.scatter(x=0, y=1, style="o", c=labels, cmap="tab10", s=2)

# Two dimensions should be enough to visualize this dataset, yet the digit classes overlap heavily in this view. Why can't MNIST be embedded in two dimensions? The data is not that complex – the numbers are pretty simple. I can think of a couple of features that explain most of the relevant parts of an image: number of edges, number of curves, number of crossings. These features would explain most of the dataset.
#
# How can these relevant features be obtained from the 784 pixel features? It looks like PCA doesn't obtain them – it needs roughly 87 features before the reconstructions look reasonable. This notebook would look much different if run with `n_components=3`.

# ---
#
# # **QUIZ**
#
# ---

# ## Question 1
# > How do the reconstructed images `X_est` look with two dimensions, i.e. with `n_components=2`?

# In[ ]:




# ## Question 2
# > How much variance is explained by PCA with `n_components=2`?

# In[ ]:




# ## Question 3
# > Code has been provided (below) that computes a 2-dimensional approximation to MNIST using PCA. Then, K-Means clustering is applied to the transformed dataset, using 10 clusters. A visualization is then created that shows how each point is classified by K-Means.
# >
# > How are the labels separated?

# In[ ]:

from sklearn.cluster import KMeans

X_low = PCA(n_components=2).fit_transform(X)
labels_hat = KMeans(n_clusters=10).fit_predict(X_low)

# In[ ]:

show = pd.DataFrame(X_low)
show.plot.scatter(x=0, y=1, style="o", c=labels_hat, cmap="tab10", s=2)
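# Beyond eyeballing the scatter plot, one optional way to judge this clustering is the adjusted Rand index (a minimal sketch follows): it measures agreement between the K-Means cluster assignments and the true digit labels, sitting near 0 for random assignments and at 1 for a perfect match.

# In[ ]:

# optional: quantify how well the 2-D K-Means clusters line up with the true labels
from sklearn.metrics import adjusted_rand_score

adjusted_rand_score(labels, labels_hat)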