%pylab inline
import pylab as pl
import numpy as np
Let's start by downloading the data with a scikit-learn utility function.
from sklearn.datasets import fetch_lfw_people
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
Let's inspect the image arrays to find their shapes (needed for plotting with matplotlib):
X = lfw_people.data
y = lfw_people.target
names = lfw_people.target_names
n_samples, n_features = X.shape
_, h, w = lfw_people.images.shape
n_classes = len(names)
print("n_samples: {}".format(n_samples))
print("n_features: {}".format(n_features))
print("n_classes: {}".format(n_classes))
n_samples: 1288
n_features: 1850
n_classes: 7
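The relationship between the flat data matrix and the image array can be checked on a small synthetic example: each row of the data matrix is one image of shape `(h, w)` unrolled into `h * w` features. The sketch below uses random pixel values, not the LFW data. `fake_images`, `fake_h`, and `fake_w` are hypothetical names, and the 50 x 37 shape is an assumption: it is one factorization consistent with the 1850 features reported above.

```python
import numpy as np

# Hypothetical stand-in for lfw_people.images: 5 random grayscale "images".
# The 50 x 37 shape is an assumption chosen so that fake_h * fake_w == 1850.
rng = np.random.RandomState(0)
fake_images = rng.rand(5, 50, 37)

n_fake, fake_h, fake_w = fake_images.shape

# Flatten each image into one row, mirroring the layout of the data matrix X.
fake_flat = fake_images.reshape(n_fake, fake_h * fake_w)
print(fake_flat.shape)

# Reshaping a row back to (fake_h, fake_w) recovers the original image exactly.
restored = fake_flat[0].reshape((fake_h, fake_w))
print(np.array_equal(restored, fake_images[0]))
```

This round trip is why the plotting helper needs `h` and `w`: without them a flat row of pixels cannot be displayed as a picture.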
def plot_gallery(images, titles, h, w, n_row=3, n_col=6):
    """Helper function to plot a gallery of portraits"""
    pl.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    pl.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        pl.subplot(n_row, n_col, i + 1)
        # Each image is stored as a flat row; reshape it before display.
        pl.imshow(images[i].reshape((h, w)), cmap=pl.cm.gray)
        pl.title(titles[i], size=12)
        pl.xticks(())
        pl.yticks(())
plot_gallery(X, names[y], h, w)
Let's have a look at how the samples are distributed across the target classes:
pl.figure(figsize=(14, 3))
y_unique = np.unique(y)
counts = [(y == i).sum() for i in y_unique]
pl.xticks(y_unique, names[y_unique])
locs, labels = pl.xticks()
pl.setp(labels, rotation=45, size=20)
_ = pl.bar(y_unique, counts)
Let's split the data into a development set and a final evaluation set.
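# train_test_split lives in sklearn.model_selection since scikit-learn 0.18
# (the old sklearn.cross_validation module was removed in 0.20).
from sklearn.model_selection import train_test_split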
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
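As a rough illustration of what such a split does, the following sketch shuffles the row indices and holds out a quarter of them. It is a simplified stand-in written with NumPy only, not scikit-learn's implementation. `simple_split`, `toy_X`, and `toy_y` are hypothetical names, and rounding the test-set size up is an assumption made for this sketch.

```python
import numpy as np

def simple_split(data, labels, test_size=0.25, seed=0):
    """Shuffle the rows, then hold out a fraction of them as a test set."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(data))
    # Assumption: the test-set size is rounded up to a whole number of rows.
    n_test = int(np.ceil(test_size * len(data)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return data[train_idx], data[test_idx], labels[train_idx], labels[test_idx]

# Toy data with the same number of rows as the dataset above (1288).
toy_X = np.arange(1288 * 2).reshape(1288, 2)
toy_y = np.arange(1288) % 7

toy_X_train, toy_X_test, toy_y_train, toy_y_test = simple_split(toy_X, toy_y)
print(toy_X_train.shape[0], toy_X_test.shape[0])
```

Because rows and labels are indexed with the same shuffled indices, every sample keeps its label, and no sample appears in both sets.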
To train a model, we will first project the original pictures onto a 150-dimensional PCA space. This is an unsupervised feature extraction step.
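# RandomizedPCA was removed in scikit-learn 0.20; PCA with
# svd_solver="randomized" is its replacement.
from sklearn.decomposition import PCA
n_components = 150
print("Extracting the top %d eigenfaces from %d faces" % (
    n_components, X_train.shape[0]))
pca = PCA(n_components=n_components, svd_solver="randomized", whiten=True)
%time pca.fit(X_train)
# Each principal axis has n_features entries, so it can be viewed as an image.
eigenfaces = pca.components_.reshape((n_components, h, w))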
Extracting the top 150 eigenfaces from 966 faces
CPU times: user 559 ms, sys: 69.4 ms, total: 629 ms
Wall time: 449 ms
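To make the projection and whitening steps more concrete, here is a minimal NumPy sketch of PCA computed with a full SVD on random data. It is not the randomized solver used above. `pca_whiten`, `toy_data`, `toy_components`, and `toy_whitened` are hypothetical names, and dividing by `sqrt(n_samples - 1)` is one common scaling convention assumed here.

```python
import numpy as np

def pca_whiten(data, n_components):
    """Project centered data onto its top principal axes and rescale
    each projected coordinate to unit variance (whitening)."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    projected = centered.dot(components.T)
    # Standard deviation of each projected coordinate (ddof=1 convention).
    scale = s[:n_components] / np.sqrt(len(data) - 1)
    return components, projected / scale

# Random toy data whose columns have different spreads.
rng = np.random.RandomState(0)
toy_data = rng.randn(200, 30) * np.linspace(1.0, 5.0, 30)

toy_components, toy_whitened = pca_whiten(toy_data, n_components=5)
print(toy_components.shape)
print(toy_whitened.std(axis=0, ddof=1))
```

Each row of `toy_components` has as many entries as the input has features. The same holds for the rows of `pca.components_`, which is why they can be reshaped to `(h, w)` and displayed as eigenfaces.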
Let's plot the gallery of the most significant eigenfaces:
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)