$\newcommand{\vec}[1]{\boldsymbol{#1}}$
import random, itertools
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from mpl_toolkits.mplot3d import Axes3D
plt.rc('text', usetex=True); plt.rc('font', family='serif')
plt.rcParams['text.latex.preamble'] ='\\usepackage{libertine}\n\\usepackage[utf8]{inputenc}'
import seaborn
seaborn.set(style='whitegrid'); seaborn.set_context('talk')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# fixing a seed for reproducibility, do not do this in real life.
random.seed(a=42)
We are going to present two approaches to unsupervised learning, dimensionality reduction with principal component analysis (PCA) and clustering with $k$-means, along with some sample applications. Common clustering algorithms include $k$-means, mean shift, spectral clustering, agglomerative clustering, affinity propagation, and DBSCAN; we will meet several of these below. Our standard set of initial imports is shown above.
In linear algebra, an orthogonal transformation is a linear transformation $T: V \rightarrow V$ on a real inner product space $V$, that preserves the inner product. That is, for each pair $u$, $v$ of elements of $V$, we have $$ \langle u,v \rangle = \langle Tu,Tv \rangle \,. $$
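For instance (a quick numerical illustration, not part of the original text), a 2-D rotation matrix is orthogonal, and we can check that it leaves the inner product unchanged:
theta = 0.3                                    # any angle will do
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
u = np.array([1.0, 2.0]); v = np.array([-0.5, 0.7])
print(np.dot(u, v), np.dot(T @ u, T @ v))      # both inner products agree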
Consider a data matrix, $\vec{X}$, of shape $n\times p$ such that each of the $n$ rows represents a different observation and each of the $p$ columns a different feature, and whose columns have zero empirical mean (the sample mean of each column has been shifted to zero).
The transformation is defined by a set of $p$-dimensional vectors of weights or loadings $$ \vec{w}_{(k)} = (w_1, \dots, w_p)_{(k)}$$ that map each row vector $\vec{x}_{(i)}$ of $\vec{X}$ to a new vector of principal component scores $\vec{t}_{(i)} = (t_1, \dots, t_k)_{(i)}$, given by $$ {t_{k}}_{(i)} = \vec{x}_{(i)} \cdot \vec{w}_{(k)} $$ such that individual variables of $\vec{t}$ considered over the data set successively inherit the maximum possible variance from $\vec{X}$, with each loading vector $\vec{w}$ constrained to be a unit vector.
The first loading vector $\vec{w}_{(1)}$ has to satisfy $$ \vec{w}_{(1)} = \underset{\Vert \vec{w} \Vert = 1}{\operatorname{arg\,max}}\,\left\{ \sum_i \left(t_1\right)^2_{(i)} \right\} = \underset{\Vert \vec{w} \Vert = 1}{\operatorname{arg\,max}}\,\left\{ \sum_i \left(\vec{x}_{(i)} \cdot \vec{w} \right)^2 \right\} $$
Rewriting it in matrix form gives $$ \vec{w}_{(1)} = \underset{\Vert \vec{w} \Vert = 1}{\operatorname{\arg\,max}}\, \{ \Vert \vec{Xw} \Vert^2 \} = \underset{\Vert \vec{w} \Vert = 1}{\operatorname{\arg\,max}}\, \left\{ \vec{w}^\intercal \vec{X}^\intercal \vec{X w} \right\} $$
The $k$th component can be found by subtracting the first $k - 1$ principal components from $\vec{X}$: $$ \hat{\vec{X}}_{k} = \vec{X} - \sum_{s = 1}^{k - 1} \vec{X} \vec{w}_{(s)} \vec{w}_{(s)}^{\intercal} $$ and then finding the loading vector which extracts the maximum variance from this new data matrix $$ \vec{w}_{(k)} = \underset{\Vert \vec{w} \Vert = 1}{\operatorname{arg\,max}} \left\{ \Vert \hat{\vec{X}}_{k} \vec{w} \Vert^2 \right\} = \operatorname{arg\,max}\, \left\{ \tfrac{\vec{w}^\intercal\hat{\vec{X}}_{k}^\intercal \hat{\vec{X}}_{k} \vec{w}}{\vec{w}^\intercal\vec{w}} \right\} $$
Interestingly, this produces the remaining eigenvectors of $\vec{X}^\intercal\vec{X}$, with the maximum values for the quantity in brackets given by their corresponding eigenvalues.
The full principal components decomposition of $\vec{X}$ can therefore be given as
$$\vec{T} = \vec{X} \vec{W}$$where $\vec{W}$ is a $p\times p$ matrix whose columns are the eigenvectors of $\vec{X}^\intercal\vec{X}$.
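To make the decomposition concrete, here is a minimal numerical sketch (not from the original notebook): for a small centered matrix $A$, we build $\vec{W}$ from the eigenvectors of $A^\intercal A$ and check that the score variances decrease:
rng = np.random.RandomState(0)
A = rng.normal(size=(100, 3))
Ac = A - A.mean(axis=0)                  # column-wise zero empirical mean
eigvals, W = np.linalg.eigh(Ac.T @ Ac)   # eigenpairs of A^T A (ascending order)
order = np.argsort(eigvals)[::-1]        # reorder by decreasing eigenvalue
W = W[:, order]
T = Ac @ W                               # principal component scores, T = A W
print(np.var(T, axis=0))                 # variances decrease component by component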
Principal Component Analysis is a very powerful unsupervised method for dimensionality reduction in data.
It's easiest to visualize by looking at a two-dimensional dataset:
np.random.seed(1)
X = np.dot(np.random.random(size=(2, 2)), np.random.normal(size=(2, 200))).T
plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.5); plt.axis('equal');
We can see that there is a definite trend in the data.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None, svd_solver='auto', tol=0.0, whiten=False)
Explained variance
pca.explained_variance_
array([ 0.75871884, 0.01838551])
Components
pca.components_
array([[-0.94446029, -0.32862557],
       [-0.32862557,  0.94446029]])
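A quick aside (not in the original): the rows of components_ form an orthonormal basis, as the theory above requires, which we can verify directly:
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(2)))  # True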
To see what these numbers mean, let's view them as vectors plotted on top of the data:
plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.5)
for length, vector in zip(pca.explained_variance_, pca.components_):
v = vector * 3 * np.sqrt(length)
plt.plot([0, v[0]], [0, v[1]], '-b', lw=3)
plt.axis('equal');
Notice that one vector is longer than the other. In a sense, this tells us that the corresponding direction in the data is somehow more "important" than the other direction. The explained variance quantifies this notion of directional "importance".
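We can also read this off as fractions of the total variance via the fitted model's explained_variance_ratio_ attribute (a quick aside):
print(pca.explained_variance_ratio_)  # roughly [0.98, 0.02]: the first direction dominates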
Another way to think of it is that the second principal component could be completely ignored without much loss of information! Let's see what our data look like if we only keep 95% of the variance:
clf = sklearn.decomposition.PCA(0.95) # keep 95% of variance
X_trans = clf.fit_transform(X)
print(X.shape)
print(X_trans.shape)
(200, 2)
(200, 1)
By specifying that we want to throw away 5% of the variance, the data are compressed to half their original size, from two dimensions down to one! Let's see what the data look like after this compression:
X_new = clf.inverse_transform(X_trans)
plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.2)
plt.plot(X_new[:, 0], X_new[:, 1], '.b', alpha=0.8)
plt.axis('equal');
The light points are the original data, while the dark points are the projected version.
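To connect this with the loading vectors, here is a quick check (not in the original) that the reconstruction is exactly the orthogonal projection of the centered data onto the first principal axis, using the fitted model's mean_ and components_ attributes:
w1 = clf.components_[0]                           # the single retained loading vector
manual = clf.mean_ + np.outer((X - clf.mean_) @ w1, w1)
print(np.allclose(manual, X_new))                 # True: matches inverse_transform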
The dimensionality reduction might seem a bit abstract in two dimensions, but the projection and dimensionality reduction can be extremely useful when visualizing high-dimensional data. Let's take a quick look at the application of PCA to the digits data we looked at before:
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target
print(digits['DESCR'])
Optical Recognition of Handwritten Digits Data Set
===================================================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where each class refers to a digit.

Preprocessing programs made available by NIST were used to extract normalized bitmaps of handwritten digits from a preprinted form. From a total of 43 people, 30 contributed to the training set and different 13 to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of 4x4 and the number of on pixels are counted in each block. This generates an input matrix of 8x8 where each element is an integer in the range 0..16. This reduces dimensionality and gives invariance to small distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G. T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C. L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469, 1994.

References
----------
- C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their Applications to Handwritten Digit Recognition, MSc Thesis, Institute of Graduate Studies in Science and Engineering, Bogazici University.
- E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
- Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin. Linear dimensionality reduction using relevance weighted LDA. School of Electrical and Electronic Engineering Nanyang Technological University. 2005.
- Claudio Gentile. A New Approximate Maximal Margin Classification Algorithm. NIPS. 2000.
pca = PCA(2) # project from 64 to 2 dimensions
Xproj = pca.fit_transform(X)
print(X.shape)
print(Xproj.shape)
(1797, 64)
(1797, 2)
plt.scatter(Xproj[:, 0], Xproj[:, 1], c=y, edgecolor='none', alpha=0.5,
cmap=plt.cm.get_cmap('nipy_spectral', 10))
plt.colorbar();
This gives us an idea of the relationship between the digits. Essentially, we have found the optimal stretch and rotation in 64-dimensional space that allows us to see the layout of the digits, without reference to the labels.
PCA is a useful dimensionality reduction algorithm, because it has a very intuitive interpretation via eigenvectors.
Each digit image is a point in a 64-dimensional space, one dimension per pixel, so we can write it as a vector $x = [x_1, x_2, \dots, x_{64}]$, but what this really means is
$$ image(x) = x_1 \cdot{\rm (pixel~1)} + x_2 \cdot{\rm (pixel~2)} + x_3 \cdot{\rm (pixel~3)} \cdots $$
If we reduce the dimensionality in the pixel space to (say) 6, we recover only a partial image:
from fig_code.figures import plot_image_components
plot_image_components(digits.data[2])
But the pixel-wise representation is not the only choice. We can also use other basis functions, and write something like
$$ image(x) = {\rm mean} + x_1 \cdot{\rm (basis~1)} + x_2 \cdot{\rm (basis~2)} + x_3 \cdot{\rm (basis~3)} \cdots $$
What PCA does is to choose optimal basis functions so that only a few are needed to get a reasonable approximation. The low-dimensional representation of our data is the coefficients of this series, and the approximate reconstruction is the result of the sum:
from fig_code.figures import plot_pca_interactive
plot_pca_interactive(digits.data)
Note: if you are visualizing this notebook in (online) static form you will see here just a plot. To interact with the plot you have to run the notebook.
Here we see that with only six PCA components, we recover a reasonable approximation of the input!
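As a quick check of this "mean plus weighted basis functions" picture (a sketch using the fitted model's mean_ and components_ attributes), we can rebuild the digit from before by hand from its six series coefficients:
pca = PCA(6).fit(X)                              # six basis functions
coeffs = pca.transform(X[2:3])                   # the six coefficients for one digit
approx = pca.mean_ + coeffs @ pca.components_    # mean + sum_k coefficient_k * basis_k
print(np.allclose(approx, pca.inverse_transform(coeffs)))  # True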
Thus we see that PCA can be viewed from two angles: as dimensionality reduction, or as a form of lossy data compression in which what is discarded is mostly noise. In this way, PCA can be used as a filtering process as well.
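Here is a minimal sketch of the filtering idea (the noise level and variance threshold are illustrative assumptions, not from the original): corrupt the digits with Gaussian noise, then reconstruct from the high-variance components only.
rng = np.random.RandomState(0)
noisy = X + rng.normal(0, 2, size=X.shape)      # add pixel noise to the digits
pca = PCA(0.50).fit(noisy)                      # keep components carrying 50% of the variance
filtered = pca.inverse_transform(pca.transform(noisy))  # what is discarded is mostly noise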
But how much information have we thrown away? We can figure this out by looking at the explained variance as a function of the components:
pca = PCA().fit(X)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
Here we see that our two-dimensional projection loses a lot of information (as measured by the explained variance) and that we'd need about 20 components to retain 90% of the variance. Looking at this plot for a high-dimensional dataset can help you understand the level of redundancy present in multiple observations.
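We can read the 90% threshold directly off the cumulative curve; a quick check (not in the original):
cum = np.cumsum(pca.explained_variance_ratio_)
print(np.argmax(cum >= 0.90) + 1)  # smallest number of components reaching 90%, about 20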
As we mentioned, PCA can be used for a sort of data compression. Using a small n_components
allows you to represent a high dimensional point as a sum of just a few principal vectors.
Here's what a single digit looks like as you change the number of components:
fig, axes = plt.subplots(8, 8, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)
for i, ax in enumerate(axes.flat):
pca = sklearn.decomposition.PCA(i + 1).fit(X)
im = pca.inverse_transform(pca.transform(X[20:21]))
ax.imshow(im.reshape((8, 8)), cmap='binary')
ax.text(0.95, 0.05, 'n = {0}'.format(i + 1), ha='right',
transform=ax.transAxes, color='darkorange')
ax.set_xticks([]); ax.set_yticks([])
Let's take another look at this by using IPython's interact
functionality to view the reconstruction of several images at once:
Note: if you are visualizing this notebook in (online) static form you will see here just a plot. To interact with the plot you have to run the notebook.
from ipywidgets import interact
def plot_digits(n_components):
fig = plt.figure(figsize=(8, 8))
plt.subplot(1, 1, 1, frameon=False, xticks=[], yticks=[])
nside = 10
pca = PCA(n_components).fit(X)
Xproj = pca.inverse_transform(pca.transform(X[:nside ** 2]))
Xproj = np.reshape(Xproj, (nside, nside, 8, 8))
total_var = pca.explained_variance_ratio_.sum()
im = np.vstack([np.hstack([Xproj[i, j] for j in range(nside)])
for i in range(nside)])
plt.imshow(im)
plt.grid(False)
plt.title("n = {0}, variance = {1:.2f}".format(n_components, total_var), size=18)
plt.clim(0, 16)
interact(plot_digits, n_components=(1, 64));  # plot_digits takes only n_components
Note that scikit-learn contains many other unsupervised dimensionality reduction routines which are useful to know about, and some of which you might wish to try.
Each of these has its own strengths & weaknesses, and areas of application. You can read about them on the scikit-learn website.
As mentioned, $k$-means is an algorithm for unsupervised clustering: that is, finding clusters in data based on the data attributes alone (not the labels).
It is a relatively easy-to-understand algorithm: given a set of observations $(\vec{x}_1, \ldots, \vec{x}_n)$, it aims to partition them into $k$ sets $S = \{S_1, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares $$ \underset{S}{\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{\vec{x} \in S_i} \left\Vert \vec{x} - \vec{\mu}_i \right\Vert^2 $$ where $\vec{\mu}_i$ is the mean of points in $S_i$.
Given an initial set of $k$ means $\vec{\mu}_1,\ldots,\vec{\mu}_k$, the algorithm proceeds by alternating between two steps:
1. Assignment step: assign each observation to the cluster whose mean is nearest in squared Euclidean distance, $$ S^{(t)}_i = \left\{ \vec{x}_p : \left\Vert \vec{x}_p - \vec{\mu}^{(t)}_i \right\Vert^2 \le \left\Vert \vec{x}_p - \vec{\mu}^{(t)}_j \right\Vert^2 \;\forall j,\, 1 \le j \le k \right\} $$ where each $\vec{x}_p$ is assigned to exactly one $S^{(t)}$, even if it could be assigned to two or more of them.
2. Update step: Calculate the new means to be the centroids of the observations in the new clusters: $$ \vec{\mu}^{(t+1)}_i = \frac{1}{|S^{(t)}_i|} \sum_{\vec{x}_j \in S^{(t)}_i} \vec{x}_j $$ Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster sum of squares (WCSS) objective.
Let's create a dataset with four clusters.
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], s=47, alpha=0.5);
scikit-learn implements an optimized version of this procedure, so that KMeans can be solved relatively quickly. As mentioned, $k$-means is an example of an algorithm which uses a two-step approach; a bare-bones implementation works as follows:
from scipy.spatial.distance import euclidean
from sklearn.utils import resample

def k_means(k, X, centroids=None, stop_threshold=0.00005, random_state=42):
    if centroids is None:
        # initialize the centroids with k distinct points sampled from the data
        centroids = resample(X, n_samples=k, replace=False,
                             random_state=random_state)
    centroid_paths = [[c.copy()] for c in centroids]
    errors = []
    while True:
        # assignment step: assign each point to its closest centroid
        clusters = [[] for _ in range(k)]  # k empty clusters
        for x in X:
            distances = [euclidean(x, centroid) for centroid in centroids]
            clusters[np.argmin(distances)].append(x)
        # update step: move each centroid to the mean of its cluster
        diffs = []
        for i, cluster in enumerate(clusters):
            if len(cluster) > 0:
                centroid = np.mean(cluster, axis=0)
                diffs.append(euclidean(centroid, centroids[i]))
                centroids[i] = centroid
                centroid_paths[i].append(centroid)
        # compute the error: mean distance from each point to its nearest centroid
        errors.append(np.mean([min([euclidean(centroid, point)
                                    for centroid in centroids]) for point in X]))
        # stop once the centroids have (almost) stopped moving
        if max(diffs) < stop_threshold:
            break
    return centroids, clusters, errors, centroid_paths
Running the algorithm for four clusters:
k=4
centroids, clusters, errors, centroid_paths = k_means(k, X)
colors = cm.Set1(np.linspace(0,1,len(clusters)))
for i, cluster in enumerate(clusters):
if len(cluster) > 0:
c = np.array(cluster)
plt.scatter(c[:,0], c[:,1], c=colors[i], alpha=0.3)
for i, path in enumerate(centroid_paths):
p = np.array(path)
plt.plot(p[:,0], p[:,1], 'o-', c=colors[i], label="Centroid %d" % i)
for i, centroid in enumerate(centroids):
plt.scatter(centroid[0], centroid[1], marker='o', s=75, c='k')
plt.xlabel('$x_1$'); plt.ylabel('$x_2$'); plt.title('$k$-means algorithm progress');
plt.legend(loc='best', frameon=True);
plt.plot(errors,'o-')
plt.xlabel('Iterations'); plt.ylabel('Error');
This is best visualized in an interactive way.
Note: if you are visualizing this notebook in (online) static form you will see here just a plot. To interact with the plot you have to run the notebook.
from fig_code import plot_kmeans_interactive
plot_kmeans_interactive();
Improvements to the basic algorithm include smarter seeding of the initial centroids (e.g., $k$-means++), running several random restarts and keeping the best solution, and mini-batch variants for large datasets; scikit-learn's KMeans bundles the first two, as sketched below.
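Here is a minimal sketch (parameter values are illustrative, not from the original) of those improvements via KMeans's init and n_init arguments:
from sklearn.cluster import KMeans
# init='k-means++' picks well-spread initial centroids; n_init runs several
# random restarts and keeps the run with the lowest within-cluster sum of squares.
est = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0).fit(X)
print(est.inertia_)  # WCSS of the best run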
Beware that there are many other clustering algorithms. Let's take a quick look at some of them.
import sklearn.cluster as clu
methods = [clu.SpectralClustering(k), clu.AgglomerativeClustering(4),
clu.MeanShift(), clu.AffinityPropagation(), clu.DBSCAN()]
plt.figure(figsize=(8,5)); plt.subplot(231)
plt.scatter(X[:,0], X[:,1], c=y, s=30,
linewidths=0, cmap=plt.cm.rainbow)
plt.xticks([]); plt.yticks([]); plt.title("True labels")
for i, est in enumerate(methods):
c = est.fit(X).labels_
plt.subplot(232 + i)
plt.scatter(X[:,0], X[:,1], c=c, s=30,
linewidths=0, cmap=plt.cm.rainbow)
plt.xticks([]); plt.yticks([]); plt.title(est.__class__.__name__)
from sklearn.cluster import KMeans
est = KMeans(n_clusters=10)
clusters = est.fit_predict(digits.data)
est.cluster_centers_.shape
(10, 64)
fig = plt.figure(figsize=(8, 3))
for i, centroid in enumerate(est.cluster_centers_):
ax = fig.add_subplot(2, 5, 1 + i, xticks=[], yticks=[])
ax.imshow(centroid.reshape((8, 8)), cmap=plt.cm.binary)
The cluster labels are permuted; let's fix that:
from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
mask = (clusters == i)
labels[mask] = mode(digits.target[mask])[0]
Now we have associated each cluster with its corresponding true label.
For good measure, let's use our PCA 2D projection and look at the true cluster labels and $k$-means cluster labels:
X = PCA(2).fit_transform(digits.data)
kwargs = dict(cmap = plt.cm.get_cmap('rainbow', 10), edgecolor='none', alpha=0.5)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.scatter(X[:, 0], X[:, 1], c=labels, **kwargs)
ax1.set_title('learned cluster labels')
ax2.scatter(X[:, 0], X[:, 1], c=digits.target, **kwargs)
ax2.set_title('true labels');
Let's see how accurate our $k$-means classifier is with no label information:
from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)
0.79688369504730105
Not bad at all!
The confusion matrix for this:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(digits.target, labels))
[[177   0   0   0   1   0   0   0   0   0]
 [  0  54  24   1   0   1   2   0 100   0]
 [  1   2 148  13   0   0   0   3   8   2]
 [  0   0   1 157   0   2   0   7   7   9]
 [  0   3   0   0 166   0   0   8   4   0]
 [  0   0   0   1   2 136   1   0   0  42]
 [  1   0   0   0   0   0 177   0   3   0]
 [  0   0   0   0   0   0   0 177   2   0]
 [  0   6   3   4   0   4   2   5 102  48]
 [  0  20   0   7   0   6   0   7   2 138]]
plt.imshow(confusion_matrix(digits.target, labels),
cmap='Blues', interpolation='nearest')
plt.colorbar(); plt.grid(False)
plt.ylabel('true'); plt.xlabel('predicted');
scikit-learn has a number of sample images that you can play with, accessed through the datasets module. For example:
from sklearn.datasets import load_sample_image
china = load_sample_image("china.jpg")
plt.imshow(china); plt.grid(False);
The image itself is stored in a three-dimensional array, of size (height, width, RGB)
:
china.shape
(427, 640, 3)
To use the image as scikit-learn input, we rescale the colors so that they lie between 0 and 1 and reshape the array into a list of pixels:
X = (china / 255.0).reshape(-1, 3)
We now have a three-column data set where each column represents one of the RGB color channels (red, green, or blue) and each row corresponds to a pixel in the image.
print(X[0:9])
[[ 0.68235294  0.78823529  0.90588235]
 [ 0.68235294  0.78823529  0.90588235]
 [ 0.68235294  0.78823529  0.90588235]
 [ 0.68235294  0.78823529  0.90588235]
 [ 0.68235294  0.78823529  0.90588235]
 [ 0.68235294  0.78823529  0.90588235]
 [ 0.68235294  0.78823529  0.90588235]
 [ 0.68235294  0.78823529  0.90588235]
 [ 0.67843137  0.79215686  0.90980392]]
We now have 273,280 points in three dimensions.
X.shape
(273280, 3)
Our task is to use $k$-means to compress the $256^3$ colors into a smaller number (say, 64 colors).
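As a quick aside (not in the original notebook), we can count how many distinct colors the photo actually contains:
print(np.unique(X, axis=0).shape[0])  # number of distinct colors actually present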
We'll use sklearn.cluster.MiniBatchKMeans
, a more sophisticated $k$-means implementation that performs better for larger datasets.
from sklearn.cluster import MiniBatchKMeans
# note: my sklearn version produces a lot of deprecation warnings
import warnings
warnings.filterwarnings(action='ignore')
n_colors = 64
model = MiniBatchKMeans(n_colors)
labels = model.fit_predict(X)
colors = model.cluster_centers_
new_image = colors[labels].reshape(china.shape)
new_image = (255 * new_image).astype(np.uint8)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
ax1.imshow(china); ax1.grid(False)
ax1.set_title('input: 16 million colors')
ax2.imshow(new_image); ax2.grid(False)
ax2.set_title('result: {0} colors'.format(n_colors));
plt.tight_layout()
The image has lost a little quality; we are using only 64 colors, after all. This is particularly visible in the sky area of the photo.
# this code is here for cosmetic reasons
from IPython.core.display import HTML
from urllib.request import urlopen
HTML(urlopen('https://raw.githubusercontent.com/lmarti/jupyter_custom/master/custom.include').read().decode('utf-8'))