#!/usr/bin/env python
# coding: utf-8

# # **PCA with MNIST**
#
# Let's investigate a real-world example of PCA with the [MNIST digits dataset][mnist]. This is a very popular dataset that looks like:
#
# ![](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png)
#
# [mnist]:https://en.wikipedia.org/wiki/MNIST_database
#
# This is a very simple dataset. **How good is PCA at finding the components that best describe this dataset?**

# # Data

# In[1]:

import numpy as np
import pandas as pd

data = np.load("mnist.npz")
X = data["X"]
labels = data["y"]

# This data is 70,000 images, each of which is 28x28, or 784 pixels. The rows of pixels in each image are concatenated into a single 784-element row, so the data is "tidy", which is what Scikit-Learn expects:

# In[2]:

X.shape

# Each row of `X` holds the 784 pixel values (28 x 28) of one image.

# In[3]:

# visualize one of the digits (reshape to 28x28)
#np.set_printoptions(linewidth=200)
X[1].reshape(28, 28)

# In[4]:

labels.shape

# Let's visualize a random selection of images.
#
# To do this, matplotlib will need to be imported (Pandas plotting is a wrapper around matplotlib):

# In[5]:

import matplotlib.pyplot as plt

# In[6]:

# create a rows x cols grid of subplots
rows, cols = 4, 6
fig, axs = plt.subplots(nrows=rows, ncols=cols, figsize=(8, 6))
for i in range(rows):
    for j in range(cols):
        # pick an image at random
        k = np.random.randint(len(labels))
        # reshape the image
        img = X[k].reshape(28, 28)
        label = labels[k]
        # plot the image
        axs[i, j].imshow(img, cmap='gray_r')
        axs[i, j].set_title(str(label))
        axs[i, j].axis('off')

# Running this cell multiple times will show different images because no random seed is set.

# This data is complicated because the handwritten numbers...
#
# * are in different places
# * are of different widths
# * have different shapes
#
# This prompts a couple of questions:
#
# 1. How many dimensions are required to effectively encode this relatively simple data?
# 2. How good is that embedding?

# # How many dimensions are needed?
#
# Specifically, how many dimensions are needed to explain 95% of the variance? This looks at the standard deviation of the same pixel across different images. However, there's no control on image location: the handwriting can appear anywhere. **Why should the standard deviation of the same pixel across different images mean anything?**
#
# The answer can be found by computing all the principal components, then using the `explained_variance_ratio_` attribute.

# In[7]:

from sklearn.decomposition import PCA

embedding = PCA()

# We *could* say "explain 95% of the variance", but let's look at *all* the information PCA has to offer and make sure that 0.95 is a good value.
#
# The [Scikit-Learn PCA documentation][pca] explains that when `PCA()` is called with no arguments, it computes *all* the principal components.
#
# [pca]:https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA

# In[8]:

embedding.fit(X)

# In[9]:

dims = pd.Index(range(1, 785), name="dimensions")
dim_explain = pd.Series(embedding.explained_variance_ratio_, index=dims)
dim_explain.head()

# This information says "the 4th principal component explains 5.4% of the variance", and so on.
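# As a quick sanity check (a minimal sketch, not needed for the rest of the notebook): each of these ratios is just the variance captured by one component divided by the total variance, so the first value can be recomputed by hand from the `explained_variance_` attribute:

# In[ ]:

# optional check: with all 784 components kept, the explained variances sum to the
# total variance, so the first ratio equals the first variance over that total
total_variance = embedding.explained_variance_.sum()
manual_ratio = embedding.explained_variance_[0] / total_variance
manual_ratio, embedding.explained_variance_ratio_[0]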
# To see the cumulative effect (variance explained by the first k components combined), where k = 1, 2, ..., we can use the Series `cumsum` method:

# In[10]:

dim_explain.cumsum().plot(grid=True)

# Judging from the curve, roughly 150 of the 784 dimensions are needed to explain 95% of the variance in the MNIST dataset.

# There's another way to find how many principal components explain a given amount of the variance:
#
# > If `PCA` is called with just a number between 0 and 1, it uses just enough principal components to explain this fraction of the variance. More information is in the documentation:
# >
# > https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
#
# Let's use that to explain 90% of the variance for this dataset.

# # Principal Component Analysis

# In[11]:

embedding = PCA(0.90)
#embedding = PCA(n_components=2)

# In[12]:

# compressed version of the dataset (reduced dimensions)
X_low = embedding.fit_transform(X)

# Exactly how many principal components were used to explain 90% of the variance?

# In[13]:

embedding.n_components_, embedding.explained_variance_ratio_.sum()

# Great – 90% of the variance is explained by 87 dimensions.

# How well does this embedding perform? Let's visualize it. The embedding takes 28x28 images (784 pixel values each) and approximates each one using `n_components` principal components, each of which is itself 28x28 (784 values).
#
# A reconstruction of the original 28x28 image can be built from only those `n_components` 28x28 principal components. How does that reconstruction look? Let's use the `inverse_transform` method to reconstruct the digits from the PCA approximation:

# In[14]:

# use inverse_transform to construct an approximation to the original dataset
X_est = embedding.inverse_transform(X_low)

# Let's visualize the original images next to the reconstructed images:

# In[15]:

fig, axs = plt.subplots(nrows=4, ncols=2, figsize=(6, 8))
for left, right in axs:
    # pick an image at random
    k = np.random.randint(len(labels))
    # reshape images
    img = X[k].reshape(28, 28)
    img_est = X_est[k].reshape(28, 28).clip(min=0, max=255)
    left.imshow(img, cmap='gray_r')
    left.set_title('original')
    left.axis('off')
    right.imshow(img_est, cmap='gray_r')
    right.set_title('reconstructed')
    right.axis('off')

# This is very useful as a compression technique. PCA stores the `n_components` principal components (and the mean image), which let it map from `n_components` features back to 784 features. Given the quality of the reconstruction, the images could be stored with 87 values each instead of 784 (alongside the principal components held by the `PCA` object).

# ## Visualization

# Let's visualize the principal components of these images:

# In[16]:

N = min(embedding.n_components_, 10)
principal_components = [embedding.components_[k].reshape(28, 28) for k in range(N)]

# In[17]:

fig, axs = plt.subplots(ncols=10, figsize=(20, 2))
i = 0
for img, ax in zip(principal_components, axs):
    ax.imshow(img, cmap='bwr')
    i = i + 1
    ax.set_title('PC ' + str(i))
    ax.axis('off')

# In[18]:

# let's take a closer look: NOTE! these are normalized (unit-length) vectors!
plt.imshow(principal_components[0], cmap='bwr')
plt.colorbar();

# The first image is the first principal component. It's the most important because PCA orders components by how much variance they explain. It says that the most basic PCA approximation to any digit is somewhat 0-shaped. For example, at this coarsest level an 8 looks a lot like a 0.
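# It's worth spelling out what `inverse_transform` did above (a minimal sketch, assuming the default `whiten=False`): each reconstruction is the mean image plus a weighted sum of the principal-component images, with the weights taken from the corresponding row of `X_low`:

# In[ ]:

# optional check (assumes whiten=False, the default): rebuild one image by hand
# from the mean image and the 87 principal components, then compare to X_est
idx = 0
manual_est = embedding.mean_ + X_low[idx] @ embedding.components_
np.allclose(manual_est, X_est[idx])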
# Let's visualize the first two dimensions, the two dimensions PCA marks as "most important".

# In[20]:

show = pd.DataFrame(X_low)
show.plot.scatter(x=0, y=1, style="o", c=labels, cmap="tab10", s=2)

# Two dimensions should be enough to visualize this dataset, yet the digit classes overlap heavily in this view. Why can't MNIST be embedded in two dimensions? The data is not that complex – the numbers are pretty simple. I can think of a couple of features that explain most of the relevant parts of an image: number of edges, number of curves, number of crossings. These features would explain most of the dataset.
#
# How can these relevant features be obtained from the 784 pixel features? It looks like PCA doesn't obtain them – it needs roughly 87 features before the reconstructions look reasonable. This notebook would look much different if run with `n_components=3`.

# ---
#
# # **QUIZ**
#
# ---

# ## Question 1
# > How do the reconstructed images `X_est` look with two dimensions, i.e. with `n_components=2`?

# In[ ]:




# ## Question 2
# > How much variance is explained by PCA with `n_components=2`?

# In[ ]:




# ## Question 3
# > Code has been provided (below) that computes a 2-dimensional approximation to MNIST using PCA. Then, K-Means clustering is applied to the transformed dataset, using 10 clusters. A visualization is then created that shows how each point is classified by K-Means.
# >
# > How are the labels separated?

# In[ ]:

from sklearn.cluster import KMeans

X_low = PCA(n_components=2).fit_transform(X)
labels_hat = KMeans(n_clusters=10).fit_predict(X_low)

# In[ ]:

show = pd.DataFrame(X_low)
show.plot.scatter(x=0, y=1, style="o", c=labels_hat, cmap="tab10", s=2)
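# Beyond eyeballing the scatter plot, one optional way to judge this clustering is the adjusted Rand index (a minimal sketch follows): it measures agreement between the K-Means cluster assignments and the true digit labels, sitting near 0 for random assignments and at 1 for a perfect match.

# In[ ]:

# optional: quantify how well the 2-D K-Means clusters line up with the true labels
from sklearn.metrics import adjusted_rand_score

adjusted_rand_score(labels, labels_hat)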