Clustering is the task of gathering samples into groups of similar samples according to some predefined similarity or dissimilarity measure (such as the Euclidean distance). In this section we will explore a basic clustering task on the iris data.
By the end of this section you will have used K-means and a Gaussian Mixture Model to cluster the iris data, and compared the resulting cluster labels to the true species.
Let's re-use the results of the 2D PCA of the iris dataset in order to explore clustering. First we need to repeat some of the code from the previous notebook:
# make sure ipython inline mode is activated
%pylab inline
# all of this is copied from the previous notebook, '06_iris_dimensionality'
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import pylab as pl
from itertools import cycle
iris = load_iris()
X = iris.data
y = iris.target
pca = PCA(n_components=2, whiten=True).fit(X)
X_pca = pca.transform(X)
def plot_2D(data, target, target_names):
    colors = cycle('rgbcmykw')
    target_ids = range(len(target_names))
    pl.figure()
    for i, c, label in zip(target_ids, colors, target_names):
        pl.scatter(data[target == i, 0], data[target == i, 1],
                   c=c, label=label)
    pl.legend()
To remind ourselves what we're looking at, let's again plot the PCA components we defined in the last notebook:
plot_2D(X_pca, iris.target, iris.target_names)
Now we will use one of the simplest clustering algorithms, K-means. This is an iterative algorithm which searches for three cluster centers such that the distance from each point to its cluster center is minimized. First, let's step back for a second, look at the above plot, and think about what this will do. The algorithm will look for three cluster centers, and label the points according to which cluster center they're closest to.
Question: what would you expect the output to look like?
from sklearn.cluster import KMeans
from numpy.random import RandomState
rng = RandomState(42)
kmeans = KMeans(n_clusters=3, random_state=rng)
kmeans.fit(X_pca)
KMeans(copy_x=True, init='k-means++', k=None, max_iter=300, n_clusters=3, n_init=10, n_jobs=1, precompute_distances=True, random_state=<mtrand.RandomState object at 0x1064aa9a8>, tol=0.0001, verbose=0)
import numpy as np
np.round(kmeans.cluster_centers_, decimals=2)
array([[ 1.02, -0.71],
       [ 0.33,  0.89],
       [-1.29, -0.44]])
The labels_ attribute of the K-means estimator contains the ID of the cluster that each point is assigned to.
kmeans.labels_
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1], dtype=int32)
The K-means algorithm has been used to infer cluster labels for the
points. Let's call the plot_2D
function again, but color the points
based on the cluster labels rather than the iris species.
plot_2D(X_pca, kmeans.labels_, ["c0", "c1", "c2"])
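The cluster IDs assigned by K-means are arbitrary, so the colors in this plot need not match the species colors above. To quantify how well the clustering recovers the species regardless of the labeling permutation, we can use the adjusted Rand index from sklearn.metrics; a minimal sketch:
from sklearn.metrics import adjusted_rand_score
# The adjusted Rand index compares two labelings while ignoring the
# arbitrary numbering of the clusters: 1.0 means perfect agreement,
# values near 0.0 correspond to random labelings.
print("K-means vs. species ARI: %.3f" % adjusted_rand_score(iris.target, kmeans.labels_))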
Clustering comes with assumptions: a clustering algorithm finds clusters using a specific criterion that corresponds to given assumptions. For K-means clustering, the model is that all clusters have equal, spherical variance. In the case of the iris dataset this assumption does not match the geometry of the classes, and thus the clustering cannot recover the classes.
Gaussian Mixture Models: we can choose a different set of assumptions using a Gaussian Mixture Model (GMM). The GMM can be used to relax the assumptions of equal variance or of sphericity. However, the fewer assumptions we make, the more ill-posed and harder to learn the problem becomes. The covariance_type argument of the GMM controls these assumptions. For the iris dataset, we will use the 'tied' mode, which imposes the same covariance matrix on every class; this makes the covariance learning problem easier.
from sklearn.mixture import GMM
gmm = GMM(n_components=3, covariance_type='tied')
gmm.fit(X_pca)
plot_2D(X_pca, gmm.predict(X_pca), ["c0", "c1", "c2"])
pl.title('GMM labels')
plot_2D(X_pca, iris.target, iris.target_names)
pl.title('True labels')
We see that the labels are now much closer to the ground truth. In general, however, there is no guarantee that the structure found by a clustering algorithm has anything to do with the latent structure of the data.
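To make this comparison quantitative, we can again use the adjusted Rand index and, since ground-truth labels are normally unavailable in a clustering task, an internal measure such as the silhouette coefficient. A minimal sketch, assuming silhouette_score is available in sklearn.metrics:
from sklearn.metrics import adjusted_rand_score, silhouette_score
# Agreement with the (normally unknown) species labels:
print("K-means ARI: %.3f" % adjusted_rand_score(iris.target, kmeans.labels_))
print("GMM ARI:     %.3f" % adjusted_rand_score(iris.target, gmm.predict(X_pca)))
# An internal measure that needs no ground truth (higher is better, range [-1, 1]):
print("K-means silhouette: %.3f" % silhouette_score(X_pca, kmeans.labels_))
print("GMM silhouette:     %.3f" % silhouette_score(X_pca, gmm.predict(X_pca)))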
The following are three well-known clustering algorithms. Like most unsupervised learning models in scikit-learn, they expect the data to be clustered to have the shape (n_samples, n_features); a short DBSCAN sketch follows the list:
sklearn.cluster.KMeans
sklearn.cluster.MeanShift
sklearn.cluster.DBSCAN
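As an illustration of this interface, here is a hedged sketch that runs DBSCAN on the 2D PCA projection; the eps value is only an assumed guess for this whitened data, not a tuned setting:
from sklearn.cluster import DBSCAN
# DBSCAN groups points in dense regions and marks isolated points as
# noise (cluster label -1). eps and min_samples are assumed values here.
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X_pca)
n_found = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
print("DBSCAN found %d clusters" % n_found)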
Other clustering algorithms do not work with a data array of shape (n_samples, n_features) but directly with a precomputed affinity matrix of shape (n_samples, n_samples); a short precomputed-affinity sketch follows the list:
sklearn.cluster.AffinityPropagation
sklearn.cluster.SpectralClustering
sklearn.cluster.Ward
sklearn.cluster.DBSCAN
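For example, here is a hedged sketch of spectral clustering on an affinity matrix built with an RBF kernel; the gamma value is an arbitrary assumption, and affinity='precomputed' assumes a scikit-learn version that supports it:
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel
# Build an (n_samples, n_samples) affinity matrix from the 2D projection.
affinity = rbf_kernel(X_pca, gamma=1.0)
spectral = SpectralClustering(n_clusters=3, affinity='precomputed')
spectral.fit(affinity)
plot_2D(X_pca, spectral.labels_, ["c0", "c1", "c2"])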
Clustering algorithms have many common applications, such as building customer profiles for market analysis, grouping related web search results or news articles, and serving as a preprocessing step (for example, vector quantization) for other learning algorithms.
Exercise: Perform the K-means cluster search again, but this time learn the clusters using the full data matrix X, rather than the projected matrix X_pca (a starting-point sketch follows the last question below).
Does this change the results?
Plot the results (you can still use X_pca for visualization, but plot the labels derived from the full 4-D set). Do the 4D K-means labels look closer to the true labels?
Explore how this changes using GMMs with different covariance types.
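A possible starting point for the first part of this exercise (a sketch using the objects defined above, not a full solution):
kmeans_full = KMeans(n_clusters=3, random_state=rng)
kmeans_full.fit(X)  # learn the clusters from all four measurements
# Visualize in the 2D PCA space, but color by the 4D cluster labels.
plot_2D(X_pca, kmeans_full.labels_, ["c0", "c1", "c2"])
pl.title('K-means labels learned from the full 4D data')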