Week 1, Day 4: Dimensionality Reduction
By Neuromatch Academy
Content creators: Alex Cayco Gajic, John Murray
Content reviewers: Roozbeh Farhoudi, Matt Krause, Spiros Chavlis, Richard Gao, Michael Waskom, Siddharth Suresh, Natalie Schaworonkow, Ella Batty
Estimated timing of tutorial: 35 minutes
In this notebook we'll explore how dimensionality reduction can be useful for visualizing and inferring structure in your data. To do this, we will compare PCA with t-SNE, a nonlinear dimensionality reduction method.
# @title Tutorial slides
# @markdown These are the slides for the videos in all tutorials today
from IPython.display import IFrame
IFrame(src=f"https://mfr.ca-1.osf.io/render?url=https://osf.io/kaq2x/?direct%26mode=render%26action=download%26mode=render", width=854, height=480)
# @title Video 1: PCA Applications
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame

  class BiliVideo(IFrame):
    def __init__(self, id, page=1, width=400, height=300, **kwargs):
      self.id = id
      src = 'https://player.bilibili.com/player.html?bvid={0}&page={1}'.format(id, page)
      super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id="BV1Jf4y1R7UZ", width=854, height=480, fs=1)
  print('Video available at https://www.bilibili.com/video/{0}'.format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id="2Zb93aOWioM", width=854, height=480, fs=1, rel=0)
  print('Video available at https://youtube.com/watch?v=' + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')
display(out)
# Imports
import numpy as np
import matplotlib.pyplot as plt
# @title Figure Settings
import ipywidgets as widgets # interactive display
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/course-content/main/nma.mplstyle")
# @title Plotting Functions
def visualize_components(component1, component2, labels, show=True):
  """
  Plots a 2D representation of the data for visualization with categories
  labelled as different colors.

  Args:
    component1 (numpy array of floats) : Vector of component 1 scores
    component2 (numpy array of floats) : Vector of component 2 scores
    labels (numpy array of floats)     : Vector corresponding to categories of
                                         samples

  Returns:
    Nothing.
  """
  plt.figure()
  cmap = plt.cm.get_cmap('tab10')
  plt.scatter(x=component1, y=component2, c=labels, cmap=cmap)
  plt.xlabel('Component 1')
  plt.ylabel('Component 2')
  plt.colorbar(ticks=range(10))
  plt.clim(-0.5, 9.5)
  if show:
    plt.show()
In this exercise, we'll visualize the first few components of the MNIST dataset to look for evidence of structure in the data. But in this tutorial, we will also be interested in the label of each image (i.e., which numeral it is from 0 to 9). Start by running the following cell to reload the MNIST dataset (this takes a few seconds).
from sklearn.datasets import fetch_openml
# Get images
mnist = fetch_openml(name='mnist_784', as_frame=False)
X_all = mnist.data
# Get labels
labels_all = np.array([int(k) for k in mnist.target])
Note: We saved the complete dataset as X_all and the labels as labels_all.
To perform PCA, we now will use the method implemented in sklearn. Run the following cell to set the parameters of PCA - we will only look at the top 2 components because we will be visualizing the data in 2D.
from sklearn.decomposition import PCA
# Initializes PCA
pca_model = PCA(n_components=2)
# Performs PCA
pca_model.fit(X_all)
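As an optional aside (not part of the exercise), a fitted PCA model exposes explained_variance_ratio_, which tells you how much of the total variance the top components capture. A minimal sketch on synthetic data, assuming sklearn is available as in this tutorial:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: most variance deliberately placed along the first two axes
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 10)) * np.array([5.0, 3.0] + [0.5] * 8)

pca = PCA(n_components=2)
pca.fit(data)

# Fraction of total variance captured by each of the top two components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the sum is high (here it is, by construction), a 2D visualization retains most of the structure; for real data like MNIST it is much lower, which is one motivation for nonlinear methods such as t-SNE.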
Fill in the code below to perform PCA and visualize the top two components. For better visualization, take only the first 2,000 samples of the data (this will also make t-SNE much faster in the next section of the tutorial, so don't skip this step!).
Suggestions:
- Use visualize_components to plot the labeled data.
- You can use help(visualize_components) and help(pca_model.transform) to check how these functions work.
#################################################
## TODO for students: take only 2,000 samples and perform PCA
# Comment out this line once you've completed the code
raise NotImplementedError("Student exercise: perform PCA")
#################################################
# Take only the first 2000 samples with the corresponding labels
X, labels = ...
# Perform PCA
scores = pca_model.transform(X)
# Plot the data and reconstruction
visualize_components(...)
# to_remove solution
# Take only the first 2000 samples with the corresponding labels
X, labels = X_all[:2000, :], labels_all[:2000]
# Perform PCA
scores = pca_model.transform(X)
# Plot the data and reconstruction
with plt.xkcd():
  visualize_components(scores[:, 0], scores[:, 1], labels)
# to_remove explanation
"""
1) Images corresponding to the some labels (numbers) are sort of clustered together
in some cases but there's a lot of overlap and definitely not a clear distinction between
all the number clusters.
2) The zeros and ones seem fairly non-overlapping.
""";
Estimated timing to here from start of tutorial: 15 min
# @title Video 2: Nonlinear Methods
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame

  class BiliVideo(IFrame):
    def __init__(self, id, page=1, width=400, height=300, **kwargs):
      self.id = id
      src = 'https://player.bilibili.com/player.html?bvid={0}&page={1}'.format(id, page)
      super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id="BV14Z4y1u7HG", width=854, height=480, fs=1)
  print('Video available at https://www.bilibili.com/video/{0}'.format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id="5Xpb0YaN5Ms", width=854, height=480, fs=1, rel=0)
  print('Video available at https://youtube.com/watch?v=' + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')
display(out)
Next we will analyze the same data using t-SNE, a nonlinear dimensionality reduction method that is useful for visualizing high dimensional data in 2D or 3D. Run the cell below to get started.
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, perplexity=30, random_state=2020)
First, we'll run t-SNE on the data to explore whether we can see more structure. The cell above defined the parameters that we will use to find our embedding (i.e., the low-dimensional representation of the data) and stored them in tsne_model. To run t-SNE on our data, use the method tsne_model.fit_transform.

Suggestions:
- Use tsne_model.fit_transform to embed the data.
- Use visualize_components to plot the labeled data.
- You can use help(tsne_model.fit_transform) to check how this method works.
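For reference, fit_transform returns one 2D coordinate per sample. A minimal sketch on synthetic data (not the MNIST samples used in the exercise):

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy dataset: 100 samples in 20 dimensions
rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 20))

# Note: perplexity must be smaller than the number of samples
model = TSNE(n_components=2, perplexity=30, random_state=2020)
toy_embed = model.fit_transform(toy)
print(toy_embed.shape)  # → (100, 2): one (x, y) position per sample
```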
#################################################
## TODO for students
# Comment out this line once you've completed the code
raise NotImplementedError("Student exercise: perform t-SNE")
#################################################
# Perform t-SNE
embed = ...
# Visualize the data
visualize_components(..., ..., labels)
# to_remove solution
# Perform t-SNE
embed = tsne_model.fit_transform(X)
# Visualize the data
with plt.xkcd():
  visualize_components(embed[:, 0], embed[:, 1], labels)
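One practical difference between the two methods, sketched here on synthetic data as an aside: a fitted PCA model can embed new, unseen samples with transform, whereas sklearn's TSNE only offers fit_transform and has no out-of-sample mapping.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 10))

# PCA learns a linear projection that applies to any new sample
pca = PCA(n_components=2).fit(data)
new_points = rng.normal(size=(5, 10))
print(pca.transform(new_points).shape)  # → (5, 2)

# t-SNE only embeds the data it was fit on; there is no transform method
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
embedding = tsne.fit_transform(data)
print(hasattr(tsne, "transform"))  # → False
```

This is one reason PCA remains useful as a preprocessing or deployment step even when t-SNE gives prettier visualizations.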
Unlike PCA, t-SNE has a free parameter (the perplexity) that roughly determines how global vs. local information is weighted. Here we'll take a look at how the perplexity affects our interpretation of the results.
Steps:
- Create a new t-SNE model (using TSNE as above, with random_state set to 2020) for each of the perplexities 50, 5, and 2.
- Embed the data with fit_transform and plot the results with visualize_components.

def explore_perplexity(values, X, labels):
  """
  Plots a 2D representation of the data for visualization with categories
  labeled as different colors using different perplexities.

  Args:
    values (list of floats) : list with perplexities to be visualized
    X (np.ndarray of floats) : matrix with the dataset
    labels (np.ndarray of int) : array with the labels

  Returns:
    Nothing.
  """
  for perp in values:
    #################################################
    ## TODO for students: redefine the t-SNE model with perplexity = perp
    ## (set random_state to 2020), perform t-SNE on the data, and plot the
    ## results for perplexity = 50, 5, and 2
    # Comment out this line once you've completed the code
    raise NotImplementedError("Student exercise: explore t-SNE with different perplexity")
    #################################################

    # Perform t-SNE
    tsne_model = ...
    embed = tsne_model.fit_transform(X)
    visualize_components(embed[:, 0], embed[:, 1], labels, show=False)
    plt.title(f"perplexity: {perp}")

# Visualize
values = [50, 5, 2]
explore_perplexity(values, X, labels)
# to_remove solution
def explore_perplexity(values, X, labels):
  """
  Plots a 2D representation of the data for visualization with categories
  labeled as different colors using different perplexities.

  Args:
    values (list of floats) : list with perplexities to be visualized
    X (np.ndarray of floats) : matrix with the dataset
    labels (np.ndarray of int) : array with the labels

  Returns:
    Nothing.
  """
  for perp in values:
    # Perform t-SNE
    tsne_model = TSNE(n_components=2, perplexity=perp, random_state=2020)
    embed = tsne_model.fit_transform(X)
    visualize_components(embed[:, 0], embed[:, 1], labels, show=False)
    plt.title(f"perplexity: {perp}")
    plt.show()

# Visualize
values = [50, 5, 2]
with plt.xkcd():
  explore_perplexity(values, X, labels)