A significant part of this material is adapted from David Kirkby's notes for a UC Irvine course offered by the Department of Physics and Astronomy.
Content is maintained on github and distributed under a BSD3 license.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns  # needed for the pairplot examples later in this notebook
import numpy as np
import pandas as pd
a_data = pd.read_hdf('data/cluster_a_data.hf5')
b_data = pd.read_hdf('data/cluster_b_data.hf5')
c_data = pd.read_hdf('data/cluster_c_data.hf5')
d_data = pd.read_hdf('data/cluster_d_data.hf5')
cluster_3d = pd.read_hdf('data/cluster_3d_data.hf5')
cosmo_data = pd.read_hdf('data/cosmo_data.hf5')
This will be our first time using the scikit-learn package. We don't include it in our standard preamble since it contains many modules (sub-packages). Instead, we import each module as we need it. The ones we need now are:
from sklearn import preprocessing, cluster
The type of structure we can look for is "clusters" of "nearby" samples, but the definition of these terms requires some care.
In the simplest case, all features $x_{ji}$ (sample $j$, feature $i$) have the same (possibly dimensionless) units, and a natural (squared) distance between samples (rows) $j$ and $k$ is: $$ d(j, k) = \sum_{\text{features}\,i} (x_{ji} - x_{ki})^2 \; . $$ However, what if some columns have different units? For example, what is the distance between $$ \left( 1.2, 0.4~\text{cm}, 5.2~\text{kPa}\right) $$ and $$ \left( 0.7, 0.5~\text{cm}, 4.9~\text{kPa}\right) ? $$ ML algorithms are generally unit-agnostic, so they will happily combine features with different units, but that may not be what you really want.
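To see the issue concretely, here is a minimal sketch of the naive sum an algorithm would compute for the two example samples above:
# Naive squared "distance" between the two example samples above.
# The sum mixes a dimensionless term, a cm^2 term and a kPa^2 term,
# which is numerically possible but physically meaningless.
p = np.array([1.2, 0.4, 5.2])
q = np.array([0.7, 0.5, 4.9])
print(np.sum((p - q) ** 2))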
One reasonable solution is to normalize each feature with the sphering transformation: $$ x \rightarrow (x - \mu) / \sigma $$ where $\mu$, $\sigma$ are the mean and standard deviation of the original feature values.
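For a single feature, this transformation is just one line of pandas (a quick sketch; the sklearn.preprocessing call below does the same thing for every column at once):
# Sphere one feature by hand: subtract the mean, divide by the standard deviation.
# Note: preprocessing.scale divides by the population standard deviation (ddof=0),
# which is why the describe() output below reports std = 1.000010 (the sample std
# of the already-scaled values) rather than exactly 1.
h0 = cosmo_data['H0']
h0_sphered = (h0 - h0.mean()) / h0.std(ddof=0)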
The sklearn.preprocessing module automates this process with:
cosmo_data
| | omega_b | omega_cdm | ln10^{10}A_s | H0 |
|---|---|---|---|---|
| 0 | 0.021984 | 0.126410 | 2.754078 | 64.660790 |
| 1 | 0.021120 | 0.107570 | 2.880657 | 65.231402 |
| 2 | 0.021920 | 0.120965 | 3.039052 | 69.714897 |
| 3 | 0.021304 | 0.131144 | 2.772624 | 69.520171 |
| 4 | 0.021985 | 0.121561 | 2.849463 | 63.284940 |
| ... | ... | ... | ... | ... |
| 9995 | 0.021406 | 0.110055 | 3.069914 | 67.993275 |
| 9996 | 0.022045 | 0.107412 | 3.013025 | 63.083467 |
| 9997 | 0.021588 | 0.108479 | 3.315846 | 65.571640 |
| 9998 | 0.022582 | 0.124919 | 2.999254 | 65.667611 |
| 9999 | 0.020763 | 0.125550 | 3.171801 | 73.810086 |

50000 rows × 4 columns
cosmo_normed = cosmo_data.copy()
cosmo_normed[cosmo_data.columns] = preprocessing.scale(cosmo_data)
cosmo_normed.describe()
| | omega_b | omega_cdm | ln10^{10}A_s | H0 |
|---|---|---|---|---|
| count | 5.000000e+04 | 5.000000e+04 | 5.000000e+04 | 5.000000e+04 |
| mean | -1.101341e-17 | 4.192202e-17 | 1.326299e-15 | 1.858922e-15 |
| std | 1.000010e+00 | 1.000010e+00 | 1.000010e+00 | 1.000010e+00 |
| min | -1.735789e+00 | -1.726691e+00 | -1.734287e+00 | -1.725280e+00 |
| 25% | -8.584267e-01 | -8.697915e-01 | -8.654500e-01 | -8.737388e-01 |
| 50% | -1.958299e-03 | -2.131438e-03 | -3.995583e-03 | 5.644049e-03 |
| 75% | 8.617981e-01 | 8.679544e-01 | 8.704566e-01 | 8.736686e-01 |
| max | 1.731191e+00 | 1.732135e+00 | 1.726426e+00 | 1.717778e+00 |
However, this may discard useful information contained in the relative normalization between features. To normalize only certain columns use, for example:
cosmo_normed = cosmo_data.copy()
for colname in ('ln10^{10}A_s', 'H0'):
    cosmo_normed[colname] = preprocessing.scale(cosmo_data[colname])
In the simplest case (a), clusters are well separated by a line (in 2D, or hyperplane in more dimensions) and can be unambiguously identified by looking only at the distance between pairs of samples.
In practice, clusters might overlap leading to ambiguities (b), or the clusters we expect to find might require considering groups of more than two samples at a time (c), or might have a non-linear separation boundary (d).
The K-means algorithm is fast and robust, but assumes that your data consists of roughly round clusters of the same size (where the meanings of "round" and "size" depend on how your data is scaled).
Let's first look at the first of our example datasets:
# plot the data in a_data
plt.scatter(a_data["x1"], a_data["x2"])
<matplotlib.collections.PathCollection at 0x11af5b070>
Now let's use K-Means on this data and see if it works well.
Most sklearn algorithms use a similar calling pattern:
result = module.ClassName(..args..).fit(data)
For the KMeans algorithm:
a_fit = cluster.KMeans(n_clusters=2).fit(a_data)
The output of a clustering algorithm is one label (the ID of the assigned cluster) for each data point:
print(a_fit.labels_)
[1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 0 1 0 1 1 0 1 0 0 1 1 1 0 0 1 0 0 1 0 1 0 0 1 1 0 1 0 1 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 0 1 0 1 1 0 0 1 0 1 1 0 1 1 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 0 1 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 1 1 1 0 1 0 1 1 1 0 0 1 1 0 0 0 0 0 1 1 1 0 1 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 1 1 0 1 0 0 1 1 0 0 0 1 1 1 1 1 0 0 0 0 1 0 1 0 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 0 1 0 0 0 1 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 1 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 0 1 1 1 0 0 1 1 1 0 1 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 0 0 1 0 0 0 1 0 0 1 1 1 0 1 1 1 0 1 1 0 1 0 1 1 0 1 0 0 0 0]
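Besides the per-sample labels, the fit result also stores the cluster centers; a quick look:
# Each row of cluster_centers_ is the centroid of one cluster,
# with columns in the same order as the input features (x1, x2).
print(a_fit.cluster_centers_)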
def display(data, fit):
    labels = fit.labels_
    plt.scatter(
        data["x1"], data["x2"],
        s=5,                 # set marker size
        c=labels,            # the colour is given by the labels
        cmap=plt.cm.Accent,  # set the colour palette
    )
    # set plot boundaries
    plt.xlim(-9, 9)
    plt.ylim(-5, 5)
display(a_data, a_fit)
EXERCISE: Use KMeans to fit the three other (b,c,d) 2D datasets with n_clusters=2
and generate similar plots. Which fits give the expected results?
b_fit = cluster.KMeans(n_clusters=2).fit(b_data)
display(b_data, b_fit)
c_fit = cluster.KMeans(n_clusters=2).fit(c_data)
display(c_data, c_fit)
d_fit = cluster.KMeans(n_clusters=2).fit(d_data)
display(d_data, d_fit)
The fit results look reasonable for (b), although the sharp dividing line between the two clusters looks artificial.
The fit results for (c) and (d) do not match what we expect because KMeans only considers the distances between pairs of samples, so it cannot identify the larger-scale patterns that are obvious by eye.
# Add your solution here...
Algorithms have many parameters that influence their results for a given dataset, but these fall into two categories: parameters whose values the algorithm adjusts itself in response to the data, and parameters whose values must be set externally before the algorithm runs.
We refer to the second group as "hyperparameters" and set their values during the "model selection" process, which we will discuss later.
DISCUSS: Are all of the arguments of the KMeans constructor hyperparameters?
In principle, yes, but in practice some of these arguments will have no (or minimal) impact on the algorithm's results under normal conditions. The arguments that are most clearly hyperparameters are: n_clusters, algorithm, and tol.
The arguments that are most clearly not hyperparameters are: verbose, precompute_distances, and n_jobs.
The remaining arguments are in a gray area. In general, it is prudent to experiment with your actual data to identify which arguments affect your results significantly.
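You can list the full set of constructor arguments (and their current values) for your installed sklearn version directly from the estimator, for example:
# get_params() returns a dict of every constructor argument and its current value.
print(cluster.KMeans(n_clusters=2).get_params())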
EXERCISE: Fit dataset (b) with the n_clusters
hyperparameter set to 3 and display the results. Comparing with the 2-cluster fit above, by eye, what do you think is the "true" number of clusters? How might you decide between 2 and 3 more objectively?
b_fit_3 = cluster.KMeans(n_clusters=3).fit(b_data)
display(b_data, b_fit_3)
The plot above makes a convincing case (to me, at least) that there are three clusters. However, the "truth" in this case is two clusters.
This illustrates the dangers of superimposing a fit result on your data: it inevitably "draws your eye" and makes the fit more credible. Look out for examples of this when reading papers or listening to talks!
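One way to be more objective, though still heuristic, is to compare a quantitative score across different choices of n_clusters, for example the value of the KMeans goal function (inertia_) or the silhouette score from sklearn.metrics; a sketch:
from sklearn import metrics

# Compare fits with different numbers of clusters using two heuristics:
# - inertia_ is the KMeans goal function; it always decreases as n_clusters grows,
#   so look for an "elbow" rather than a minimum.
# - the silhouette score is higher for better-separated clusters.
for n in (2, 3, 4):
    fit = cluster.KMeans(n_clusters=n).fit(b_data)
    score = metrics.silhouette_score(b_data, fit.labels_)
    print(f'n_clusters={n}: inertia={fit.inertia_:.1f}, silhouette={score:.3f}')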
# Add your solution here...
An algorithm to find clusters in 2D data is just automating what you could already do by eye. However, most clustering algorithms also work well with higher dimensional data, where the clusters might not be visible in any single 2D projection.
fit_3d = cluster.KMeans(n_clusters=3).fit(cluster_3d)
cluster_3d['label'] = fit_3d.labels_
# FutureWarning from scipy.stats is harmless
sns.pairplot(cluster_3d, vars=('x0', 'x1', 'x2'), hue='label');
These clusters look quite arbitrary in each of the 2D scatter plots. However, they are actually very well separated, as we can see if we rotate the axes:
R = np.array(
[[ 0.5 , -0.5 , -0.707],
[ 0.707, 0.707, 0. ],
[ 0.5 , -0.5 , 0.707]])
rotated_3d = cluster_3d.copy()
rotated_3d[['x0', 'x1', 'x2']] = cluster_3d[['x0', 'x1', 'x2']].dot(R)
sns.pairplot(rotated_3d, vars=('x0', 'x1', 'x2'), hue='label');
This example is contrived, but the lesson is that clustering algorithms can discover higher-dimensional structure that you might miss with visualization.
Now that we have introduced our first ML algorithm, this is a good time for some general comments.
Most ML algorithms share some common features: a goal function that quantifies how well the algorithm is doing, and parameters that the algorithm adjusts to optimize that goal function.
When meeting a new algorithm, it is worth asking what its goal function is, and how the algorithm goes about optimizing it.
The goal function of the KMeans algorithm is: $$ \sum_{i=1}^n\, \sum_{c_j = i}\, \left| x_j - \mu_i\right|^2 $$ where $c_j = i$ if sample $j$ is assigned to cluster $i$, and $$ \mu_i = \frac{1}{N_i} \sum_{c_j = i}\, x_j $$ is the mean of the $N_i$ samples assigned to cluster $i$. The outer sum is over the number of clusters $n$ and $j$ indexes samples. If we consider sample $x_j$ to be a vector, then its elements are the feature values.
DISCUSS: What are the parameters of the KMeans goal function? How many parameters are there?
The parameters are the cluster assignments $c_j$ and there is one per sample (row). Note that the number of parameters is independent of the number of features (columns) in the data.
The number of clusters $n$ is a hyperparameter since it is externally set and not adjusted by the algorithm in response to the data.
The means $\mu_i$ are not independent parameters since their values are fixed by the $c_j$ (given the data).
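We can check this formula directly against the earlier fit to dataset (a); a quick sketch (the hand-computed sum should agree with the inertia_ attribute reported by sklearn, up to floating-point rounding):
# Recompute the KMeans goal function by hand for the fit to dataset (a).
total = 0.0
for i, mu in enumerate(a_fit.cluster_centers_):
    members = a_data[a_fit.labels_ == i]          # samples assigned to cluster i
    total += ((members - mu) ** 2).values.sum()   # add their squared distances to mu_i
print(total, a_fit.inertia_)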
ML algorithms come in two flavors, depending on whether they require some training data where you already know the answer ("supervised") or not ("unsupervised"). Clustering algorithms are unsupervised.
An advantage of unsupervised ML is that it works with any input data, and can discover patterns that you might not already know about (as in the 3D example above). Even when you have training data available, an unsupervised algorithm can still be useful.
The disadvantage of unsupervised learning is that we cannot formulate objective measures for how well an algorithm is performing, so the results are always somewhat subjective.
We have focused on KMeans as a prototypical clustering algorithm, but there are many others to choose from.
We will finish this section with some brief experimentation with two alternatives that use more global information than KMeans, so are better suited to examples (c) and (d) above:
Here is a visualization (taken from the sklearn documentation) of the many clustering algorithms available in sklearn.cluster
and their behaviour on some stereotyped datasets:
Image from sklearn documentation
EXERCISE: Use cluster.SpectralClustering
to fit c_data
and d_data
and display the results. Adjust the default hyperparameters, if necessary, to obtain the expected results.
c_fit = cluster.SpectralClustering(n_clusters=2).fit(c_data)
display(c_data, c_fit)
d_fit = cluster.SpectralClustering(n_clusters=2, gamma=2.0).fit(d_data)
display(d_data, d_fit)
# Add your solution here...
EXERCISE: Use cluster.DBSCAN
to fit c_data
and d_data
and display the results. Adjust the default hyperparameters, if necessary, to obtain the expected results.
c_fit = cluster.DBSCAN(eps=1.5).fit(c_data)
display(c_data, c_fit)
d_fit = cluster.DBSCAN().fit(d_data)
display(d_data, d_fit)
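Note that, unlike KMeans, DBSCAN does not take the number of clusters as an input, and it can also decide that some samples are noise, which it marks with the special label -1; a quick check:
# DBSCAN assigns the label -1 to any samples it considers noise.
print(np.unique(d_fit.labels_))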