Machine Learning

Unsupervised Learning and Clustering

In [1]:
import random, itertools
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib as mpl
import matplotlib.pyplot as plt
import as cm
from mpl_toolkits.mplot3d import Axes3D
In [2]:
import seaborn
seaborn.set(style='whitegrid'); seaborn.set_context('talk')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
In [3]:
# fixing a seed for reproducibility, do not do this in real life. 

About the notebook/slides

  • The slides are programmed as a Jupyter/IPython notebook.
  • Feel free to try them and experiment on your own by launching the notebooks.

  • You can run the notebook online: Binder

If you are using nbviewer you can change to slides mode by clicking on the icon:

Unsupervised learning

  • A type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.
  • The most common unsupervised learning method is cluster analysis,
    • used for exploratory data analysis to find hidden patterns or grouping in data.
  • Also useful for dimensionality reduction.
  • The clusters are modeled using a measure of similarity which is defined upon metrics such as Euclidean or probabilistic distance.

We are going to present two approaches to unsupervised learning and present some sample applications:

  • Dimensionality reduction with principal components analysis (PCA)
  • Clustering with the $k$-means algorithm.


  • Connectivity models: for example, hierarchical clustering builds models based on distance connectivity.
  • Centroid models: for example, the $k$-means algorithm represents each cluster by a single mean vector.
  • Distribution models: clusters are modeled using statistical distributions, such as multivariate normal distributions used by the Expectation-maximization algorithm.
  • Density models: for example, DBSCAN and OPTICS defines clusters as connected dense regions in the data space.
  • Graph-based models: a clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge can be considered as a prototypical form of cluster.
    • Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques, as in the HCS clustering algorithm.

Common clustering algorithms include:

  • Hierarchical clustering: builds a multilevel hierarchy of clusters by creating a cluster tree
  • $k$-Means clustering: partitions data into k distinct clusters based on distance to the centroid of a cluster
  • Gaussian mixture models: models clusters as a mixture of multivariate normal density components
  • Self-organizing maps: uses neural networks that learn the topology and distribution of the data
  • Hidden Markov models: uses observed data to recover the sequence of states

 Cluster analysis or clustering

  • The task of grouping a set of objects in such a way that
  • objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
  • It is a main task of exploratory data mining, and a common technique for statistical data analysis,

Dimensionality Reduction with Principal Component Analysis

  • Uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
  • Useful linear dimensionality reduction technique.

We'll start with our standard set of initial imports:


In linear algebra, an orthogonal transformation is a linear transformation $T: V \rightarrow V$ on a real inner product space $V$, that preserves the inner product. That is, for each pair $u$, $v$ of elements of $V$, we have $$ \langle u,v \rangle = \langle Tu,Tv \rangle \,. $$

  • Since the lengths of vectors and the angles between them are defined through the inner product, orthogonal transformations preserve lengths of vectors and angles between them.
  • Orthogonal transformations map orthonormal bases to orthonormal bases.
  • Orthogonal transformations in two- or three-dimensional Euclidean space are stiff rotations, reflections, or combinations of a rotation and a reflection (also known as improper rotations).
  • Reflections are transformations that exchange left and right, similar to mirror images.
  • The matrices corresponding to proper rotations (without reflection) have determinant +1.
  • Transformations with reflection are represented by matrices with determinant −1.
  • This allows the concept of rotation and reflection to be generalized to higher dimensions.

Consider a data matrix, $\vec{X}$, of shape $n\times p$ such that:

  • column-wise zero empirical mean (the sample mean of each column has been shifted to zero),
  • each of the $n$ rows represents a different repetition of the experiment,
  • and each of the $p$ columns represents a given input feature feature (say, the results from a particular sensor).

The transformation is defined by a set of $p$-dimensional vectors of weights or loadings $$ \vec{w}_{(k)} = (w_1, \dots, w_p)_{(k)}$$ that map each row vector $\vec{x}_{(i)}$ of $\vec{X}$ to a new vector of principal component scores $\vec{t}_{(i)} = (t_1, \dots, t_k)_{(i)}$, given by $$ {t_{k}}_{(i)} = \vec{x}_{(i)} \cdot \vec{w}_{(k)} $$ such that individual variables of $\vec{t}$ considered over the data set successively inherit the maximum possible variance from $\vec{X}$, with each loading vector $\vec{w}$ constrained to be a unit vector.

First component

The first loading vector $\vec{w}_{(1)}$ has to satisfy $$ \vec{w}_{(1)} = \underset{\Vert \vec{w} \Vert = 1}{\operatorname{\arg\,max}}\,\left\{ \sum_i \left(t_1\right)^2_{(i)} \right\} = \underset{\Vert \mathbf{w} \Vert = 1}{\operatorname{\arg\,max}}\,\left\{ \sum_i \left(\vec{x}_{(i)} \cdot \vec{w} \right)^2 \right\} $$

Rewriting it in matrix form gives $$ \vec{w}_{(1)} = \underset{\Vert \vec{w} \Vert = 1}{\operatorname{\arg\,max}}\, \{ \Vert \vec{Xw} \Vert^2 \} = \underset{\Vert \vec{w} \Vert = 1}{\operatorname{\arg\,max}}\, \left\{ \vec{w}^\intercal \vec{X}^\intercal \vec{X w} \right\} $$

  • The quantity to be maximised can be recognised as a Rayleigh quotient.
  • A standard result for a symmetric matrix such as $\vec{X}^\intercal\vec{X}$ is that the quotient's maximum possible value is the largest eigenvalue of the matrix.
  • This occurs when $\vec{w}$ is the corresponding eigenvector.

Remaining components

The $k$th component can be found by subtracting the first $k − 1$ principal components from $\vec{X}$: $$ \hat{\vec{X}}_{k} = \vec{X} - \sum_{s = 1}^{k - 1} \vec{X} \vec{w}_{(s)} \vec{w}_{(s)}^{\intercal} $$ and then finding the loading vector which extracts the maximum variance from this new data matrix $$ \vec{w}_{(k)} = \underset{\Vert \vec{w} \Vert = 1}{\operatorname{arg\,max}} \left\{ \Vert \vec{\hat{X}}_{k} \vec{w} \Vert^2 \right\} = {\operatorname{\arg\,max}}\, \left\{ \tfrac{\vec{w}^\intercal\vec{\hat{X}}_{k}^\intercal \mathbf{\hat{X}}_{k} \vec{w}}{\vec{w}^\intercal\vec{w}} \right\} $$

Interestingly, this produces the remaining eigenvectors of $\vec{X}^\intercal\vec{X}$, with the maximum values for the quantity in brackets given by their corresponding eigenvalues.

  • the loading vectors are eigenvectors of $\vec{X}^\intercal\vec{X}$.
  • The $k$th component of a data vector $\vec{x}_{(i)}$ can therefore be given as a score $t_{k(i)} = \vec{x}_{(i)} \cdot \vec{w}_{(k)}$ in the transformed co-ordinates,
  • or as the corresponding vector in the space of the original variables, $\left\{\vec{x}_{(i)}\cdot\vec{w}_{(k)}\right\}\cdot\vec{w}_{(k)}$, where $\vec{w}_{(k)}$ is the $k$th eigenvector of $\vec{X}^\intercal\vec{X}$.

The full principal components decomposition of X can therefore be given as

$$\vec{T} = \vec{X} \vec{W}$$

where $\vec{W}$ is a $p\times p$ matrix whose columns are the eigenvectors of $\vec{X}^\intercal\vec{X}$.

Demonstrating PCA

Principal Component Analysis is a very powerful unsupervised method for dimensionality reduction in data.

It's easiest to visualize by looking at a two-dimensional dataset:

In [4]:
X =, 2)), np.random.normal(size=(2, 200))).T
plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.5); plt.axis('equal');

We can see that there is a definite trend in the data.

  • What PCA seeks to do is to find the principal axes in the data, and
  • explain how important those axes are in describing the data distribution.
In [5]:
from sklearn.decomposition import PCA
In [6]:
pca = PCA(n_components=2)
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

Explained variance

In [7]:
array([ 0.7625315,  0.0184779])


In [8]:
array([[-0.94446029, -0.32862557],
       [-0.32862557,  0.94446029]])

To see what these numbers mean, let's view them as vectors plotted on top of the data:

In [9]:
plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.5)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    plt.plot([0, v[0]], [0, v[1]], '-b', lw=3)

Notice that one vector is longer than the other. In a sense, this tells us that that direction in the data is somehow more "important" than the other direction. The explained variance quantifies this measure of "importance" in direction.

Another way to think of it is that the second principal component could be completely ignored without much loss of information! Let's see what our data look like if we only keep 95% of the variance:

In [10]:
clf = sklearn.decomposition.PCA(0.95) # keep 95% of variance
X_trans = clf.fit_transform(X)
(200, 2)
(200, 1)

By specifying that we want to throw away 5% of the variance, the data is now compressed by a factor of 50%! Let's see what the data look like after this compression:

In [11]:
X_new = clf.inverse_transform(X_trans)
plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.2)
plt.plot(X_new[:, 0], X_new[:, 1], '.b', alpha=0.8)

The light points are the original data, while the dark points are the projected version.

  • We see that after truncating 5% of the variance of this dataset and then reprojecting it, the "most important" features of the data are maintained, and we've compressed the data by 50%!
  • This is the sense in which "dimensionality reduction" works:
    • if you can approximate a data set in a lower dimension, you may have an easier time visualizing it or fitting complicated models to the data.

Application of PCA to Digits

The dimensionality reduction might seem a bit abstract in two dimensions, but the projection and dimensionality reduction can be extremely useful when visualizing high-dimensional data. Let's take a quick look at the application of PCA to the digits data we looked at before:

In [12]:
from sklearn.datasets import load_digits
In [13]:
digits = load_digits()
X =
y =
In [14]:
Optical Recognition of Handwritten Digits Data Set

Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@'
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,

  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionalityreduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.

In [15]:
pca = PCA(2)  # project from 64 to 2 dimensions
Xproj = pca.fit_transform(X)
(1797, 64)
(1797, 2)
In [16]:
plt.scatter(Xproj[:, 0], Xproj[:, 1], c=y, edgecolor='none', alpha=0.5,
  'nipy_spectral', 10))

This gives us an idea of the relationship between the digits. Essentially, we have found the optimal stretch and rotation in 64-dimensional space that allows us to see the layout of the digits, without reference to the labels.

What do the components mean?

PCA is a useful dimensionality reduction algorithm, because it has a very intuitive interpretation via eigenvectors.

  • The input data is represented as a vector: in the case of the digits, our data is
$$ \vec{x} = [x_1, x_2, x_3 \cdots] $$

but what this really means is

$$ image(x) = x_1 \cdot{\rm (pixel~1)} + x_2 \cdot{\rm (pixel~2)} + x_3 \cdot{\rm (pixel~3)} \cdots $$

If we reduce the dimensionality in the pixel space to (say) 6, we recover only a partial image:

In [17]:
from fig_code.figures import plot_image_components

But the pixel-wise representation is not the only choice. We can also use other basis functions, and write something like

$$ image(x) = {\rm mean} + x_1 \cdot{\rm (basis~1)} + x_2 \cdot{\rm (basis~2)} + x_3 \cdot{\rm (basis~3)} \cdots $$

What PCA does is to choose optimal basis functions so that only a few are needed to get a reasonable approximation. The low-dimensional representation of our data is the coefficients of this series, and the approximate reconstruction is the result of the sum:

In [18]:
from fig_code.figures import plot_pca_interactive

Note: if you are visualizing this notebook in (online) static form you will see here just a plot. To interact with the plot you have to run the notebook.

Here we see that with only six PCA components, we recover a reasonable approximation of the input!

Thus we see that PCA can be viewed from two angles. It can be viewed as dimensionality reduction, or it can be viewed as a form of lossy data compression where the loss favors noise. In this way, PCA can be used as a filtering process as well.

Choosing the Number of Components

But how much information have we thrown away? We can figure this out by looking at the explained variance as a function of the components:

In [19]:
pca = PCA().fit(X)
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

Here we see that our two-dimensional projection loses a lot of information (as measured by the explained variance) and that we'd need about 20 components to retain 90% of the variance. Looking at this plot for a high-dimensional dataset can help you understand the level of redundancy present in multiple observations.

PCA as data compression

As we mentioned, PCA can be used for is a sort of data compression. Using a small n_components allows you to represent a high dimensional point as a sum of just a few principal vectors.

Here's what a single digit looks like as you change the number of components:

In [20]:
fig, axes = plt.subplots(8, 8, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)
for i, ax in enumerate(axes.flat):
    pca = sklearn.decomposition.PCA(i + 1).fit(X)
    im = pca.inverse_transform(pca.transform(X[20:21]))
    ax.imshow(im.reshape((8, 8)), cmap='binary')
    ax.text(0.95, 0.05, 'n = {0}'.format(i + 1), ha='right',
            transform=ax.transAxes, color='darkorange')
    ax.set_xticks([]); ax.set_yticks([])

Let's take another look at this by using IPython's interact functionality to view the reconstruction of several images at once:

Note: if you are visualizing this notebook in (online) static form you will see here just a plot. To interact with the plot you have to run the notebook.

In [21]:
from ipywidgets import interact

def plot_digits(n_components):
    fig = plt.figure(figsize=(8, 8))
    plt.subplot(1, 1, 1, frameon=False, xticks=[], yticks=[])
    nside = 10
    pca = PCA(n_components=n_components)
    Xproj = pca.inverse_transform(pca.transform(X[:nside ** 2]))
    Xproj = np.reshape(Xproj, (nside, nside, 8, 8))
    total_var = pca.explained_variance_ratio_.sum()
    im = np.vstack([np.hstack([Xproj[i, j] for j in range(nside)])
                    for i in range(nside)])
    plt.title("n = {0}, variance = {1:.2f}".format(n_components, total_var), size=18)
    plt.clim(0, 16)
interact(plot_digits, n_components=(1, 64), nside=(1, 8));

Other Dimensionality Reducting Routines

Note that scikit-learn contains many other unsupervised dimensionality reduction routines: some you might wish to try are Other dimensionality reduction techniques which are useful to know about:

Each of these has its own strengths & weaknesses, and areas of application. You can read about them on the scikit-learn website.

Clustering with $k$-means

Introducing $k$-means

  • As mentionend, $k$-means is an algorithm for unsupervised clustering: that is, finding clusters in data based on the data attributes alone (not the labels).

  • It is a relatively easy-to-understand algorithm:

    • It searches for cluster centers which are the mean of the points within them, such that every point is closest to the cluster center it is assigned to.

$k$-means clustering

  • Having a dataset of the form $\Psi=\left\{\vec{x}^{(1)},\ldots,x^{(D)}\right\}$, with $\vec{x}^{(d)}\in\mathbb{R}^n$.
  • $k$-means clustering aims to partition the $D$ observations into $k$ sets $(k ≤ D)$, $S = {S_1, S_2, …, S_k}\;$ so as to minimize the within-cluster sum of squares (WCSS):
$$S= \text{arg min} \; \sum_{i=1}^{k} \; \sum_{x_j \in S_i} \; \left\| x_j - \mu_i \right\|$$

where $\mu_i$ is the mean of points in $S_i$.

  • The most common algorithm uses an iterative refinement technique.
  • Due to its ubiquity it is often called the $k$-means algorithm;
  • it is also referred to as Lloyd's algorithm, particularly in the computer science community.

Given an initial set of $k$ means $\vec{\mu}_1,\ldots,\vec{\mu}_k$, the algorithm proceeds by alternating between two steps:

  1. Assignment step: Assign each observation to the cluster whose mean yields the least within-cluster sum of squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively the "nearest" mean. $$ S_i = \left\{ \vec{x}_d \in\Psi: \left\| \vec{x}_d - \vec{\mu}_i \right\|^2 \le \left\| \vec{x}_d - \vec{m}_j \right\|^2 \ j, 1 \le j \le k;\ i \neq j \right\}, $$ where each $x_p$ is assigned to exactly one $S^{(t)}$, even if it could be assigned to two or more of them.
  2. Update step: Calculate the new means to be the centroids of the observations in the new clusters. $$ \vec{\mu}_i = \frac{1}{|S^{(t)}_i|} \sum_{x_j \in S^{(t)}_i} x_j $$
    • Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster sum of squares (WCSS) objective.
  • The algorithm has converged when the assignments no longer change.

Example code of $k$-means

Lets create a four clusters dataset.

In [22]:
X_clus, y_clus = sklearn.datasets.samples_generator.make_blobs(n_samples=300, 
In [23]:
plt.scatter(X_clus[:, 0], X_clus[:, 1], s=47, alpha=0.5);
  • It is relatively easy to distinguish the four clusters.
  • If you were to perform an exhaustive search for the different segmentations of the data, however, the search space would be exponential in the number of points.
  • Fortunately, there is a well-known Expectation Maximization (EM) procedure which scikit-learn implements, so that KMeans can be solved relatively quickly.

The $k$-means algorithm

As mentioned, $k$-means is an example of an algorithm which uses a two-step approach which works as follows:

  1. Guess some cluster centers
  2. Repeat until converged
    • Assign points to the nearest cluster center
    • Set the cluster centers to the mean
In [24]:
from scipy.spatial.distance import euclidean
In [25]:
def k_means(k, X, centroids=None, stop_threshold=0.00005, random_state=42):
    if not centroids:
        centroids = sklearn.utils.resample(X, n_samples=k, replace=False, 
    centroid_paths = [[c.copy()] for c in centroids]
    errors = []
    while True:
        # assign points to clusters
        clusters = [[] for _ in range(k)] # k empty clusters
        for x in X:
            distances = [euclidean(x, centroid) for centroid in centroids]
        diffs = []
        for i, cluster in enumerate(clusters):
            if len(cluster) > 0:
                centroid = np.mean(cluster, axis=0)
                diffs.append(euclidean(centroid, centroids[i]))
                centroids[i] = centroid

        # compute the errors
        errors.append(np.mean([min([euclidean(centroid, point) 
                                    for centroid in centroids]) for point in X]))
        # are we seeing any progress?
        if max(diffs) < stop_threshold: break
    return centroids, clusters, errors, centroid_paths

Running the algorithm for four clusters:

In [26]:
In [27]:
centroids, clusters, errors, centroid_paths = k_means(k, X_clus)
In [28]:
colors = cm.Set1(np.linspace(0,1,len(clusters)))
for i, cluster in enumerate(clusters):
    if len(cluster) > 0: 
        c = np.array(cluster)
        plt.scatter(c[:,0], c[:,1], c=colors[i], alpha=0.3)
for i, path in enumerate(centroid_paths):
    p = np.array(path)
    plt.plot(p[:,0], p[:,1], 'o-', c=colors[i], label="Centroid %d" % i)
for i, centroid in enumerate(centroids):
    plt.scatter(centroid[0], centroid[1], marker='o', s=75, c='k')
plt.xlabel('$x_1$'); plt.ylabel('$x_2$'); plt.title('$k$-means algorithm progress');
plt.legend(loc='best', frameon=True);
In [29]:
plt.xlabel('Iterations'); plt.ylabel('Error');

This is best visualized in an interactive way.

Note: if you are visualizing this notebook in (online) static form you will see here just a plot. To interact with the plot you have to run the notebook.

In [30]:
from fig_code import plot_kmeans_interactive

$k$-means drawbacks

  • The convergence of this algorithm is not guaranteed;
  • for that reason, scikit-learn by default uses a large number of random initializations and finds the best results.
  • The number of clusters must be set beforehand which contradicts the fact that we know very little about the problem.
  • There are other clustering algorithms for which this requirement may be lifted.


  • Restarting,
  • centroids initialization,
  • dealing with "zombie" centroids that create empty clusters, and more.

Be ware that there are many other algorithms. Let's take a quick look at some of them.

In [31]:
import sklearn.cluster as clu
In [32]:
methods = [clu.SpectralClustering(k), clu.AgglomerativeClustering(4),
           clu.MeanShift(), clu.AffinityPropagation(), clu.DBSCAN()]
In [33]:
plt.figure(figsize=(8,5)); plt.subplot(231)
plt.scatter(X_clus[:,0], X_clus[:,1], c=y_clus, s=30,
plt.xticks([]); plt.yticks([]); plt.title("True labels")
for i, est in enumerate(methods):
    c =
    plt.subplot(232 + i)
    plt.scatter(X_clus[:,0], X_clus[:,1], c=c, s=30,
    plt.xticks([]); plt.yticks([]); plt.title(est.__class__.__name__)