A significant part of this material is adapted from David Kirkby's notes for a UC Irvine course offered by the Department of Physics and Astronomy.
Content is maintained on github and distributed under a BSD3 license.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns  # needed for the pairplot examples later in this notebook
import numpy as np
import pandas as pd
a_data = pd.read_hdf('data/cluster_a_data.hf5')
b_data = pd.read_hdf('data/cluster_b_data.hf5')
c_data = pd.read_hdf('data/cluster_c_data.hf5')
d_data = pd.read_hdf('data/cluster_d_data.hf5')
cluster_3d = pd.read_hdf('data/cluster_3d_data.hf5')
cosmo_data = pd.read_hdf('data/cosmo_data.hf5')
This will be our first time using the scikit-learn package. We don't include it in our standard preamble since it contains many modules (sub-packages). Instead, we import each module as we need it. The ones we need now are:
from sklearn import preprocessing, cluster
The type of structure we can look for is "clusters" of "nearby" samples, but the definition of these terms requires some care.
In the simplest case, all features $x_{ji}$ (sample $j$, feature $i$) have the same (possibly dimensionless) units, and a natural (squared) distance between samples (rows) $j$ and $k$ is: $$ d(j, k) = \sum_{\text{features}\,i} (x_{ji} - x_{ki})^2 \; . $$ However, what if some columns have different units? For example, what is the distance between $$ \left( 1.2, 0.4~\text{cm}, 5.2~\text{kPa}\right) $$ and $$ \left( 0.7, 0.5~\text{cm}, 4.9~\text{kPa}\right) ? $$ ML algorithms are generally unit-agnostic, so they will happily combine features with different units, but that may not be what you really want.
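To see the issue concretely, here is a minimal sketch of the naive sum an algorithm would compute for the two example samples above:
# Naive squared "distance" between the two example samples above.
# The sum mixes a dimensionless term, a cm^2 term and a kPa^2 term,
# which is numerically possible but physically meaningless.
p = np.array([1.2, 0.4, 5.2])
q = np.array([0.7, 0.5, 4.9])
print(np.sum((p - q) ** 2))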
One reasonable solution is to normalize each feature with the sphering transformation: $$ x \rightarrow (x - \mu) / \sigma $$ where $\mu$, $\sigma$ are the mean and standard deviation of the original feature values.
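For a single feature, this transformation is just one line of pandas (a quick sketch; the sklearn.preprocessing call below does the same thing for every column at once):
# Sphere one feature by hand: subtract the mean, divide by the standard deviation.
# Note: preprocessing.scale divides by the population standard deviation (ddof=0),
# which is why the describe() output below reports std = 1.000010 (the sample std
# of the already-scaled values) rather than exactly 1.
h0 = cosmo_data['H0']
h0_sphered = (h0 - h0.mean()) / h0.std(ddof=0)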
The sklearn.preprocessing module automates this process with:
cosmo_data
| | omega_b | omega_cdm | ln10^{10}A_s | H0 |
|---|---|---|---|---|
| 0 | 0.021984 | 0.126410 | 2.754078 | 64.660790 |
| 1 | 0.021120 | 0.107570 | 2.880657 | 65.231402 |
| 2 | 0.021920 | 0.120965 | 3.039052 | 69.714897 |
| 3 | 0.021304 | 0.131144 | 2.772624 | 69.520171 |
| 4 | 0.021985 | 0.121561 | 2.849463 | 63.284940 |
| ... | ... | ... | ... | ... |
| 9995 | 0.021406 | 0.110055 | 3.069914 | 67.993275 |
| 9996 | 0.022045 | 0.107412 | 3.013025 | 63.083467 |
| 9997 | 0.021588 | 0.108479 | 3.315846 | 65.571640 |
| 9998 | 0.022582 | 0.124919 | 2.999254 | 65.667611 |
| 9999 | 0.020763 | 0.125550 | 3.171801 | 73.810086 |

50000 rows × 4 columns
cosmo_normed = cosmo_data.copy()
cosmo_normed[cosmo_data.columns] = preprocessing.scale(cosmo_data)
cosmo_normed.describe()
| | omega_b | omega_cdm | ln10^{10}A_s | H0 |
|---|---|---|---|---|
| count | 5.000000e+04 | 5.000000e+04 | 5.000000e+04 | 5.000000e+04 |
| mean | -1.101341e-17 | 4.192202e-17 | 1.326299e-15 | 1.858922e-15 |
| std | 1.000010e+00 | 1.000010e+00 | 1.000010e+00 | 1.000010e+00 |
| min | -1.735789e+00 | -1.726691e+00 | -1.734287e+00 | -1.725280e+00 |
| 25% | -8.584267e-01 | -8.697915e-01 | -8.654500e-01 | -8.737388e-01 |
| 50% | -1.958299e-03 | -2.131438e-03 | -3.995583e-03 | 5.644049e-03 |
| 75% | 8.617981e-01 | 8.679544e-01 | 8.704566e-01 | 8.736686e-01 |
| max | 1.731191e+00 | 1.732135e+00 | 1.726426e+00 | 1.717778e+00 |
However, this may discard useful information contained in the relative normalization between features. To normalize only certain columns use, for example:
cosmo_normed = cosmo_data.copy()
for colname in ('ln10^{10}A_s', 'H0'):
    cosmo_normed[colname] = preprocessing.scale(cosmo_data[colname])
In the simplest case (a), clusters are well separated by a line (in 2D, or hyperplane in more dimensions) and can be unambiguously identified by looking only at the distance between pairs of samples.
In practice, clusters might overlap leading to ambiguities (b), or the clusters we expect to find might require considering groups of more than two samples at a time (c), or might have a non-linear separation boundary (d).
The K-means algorithm is fast and robust, but assumes that your data consists of roughly round clusters of the same size (where the meanings of "round" and "size" depend on how your data is scaled).
Let's first look at the first of our example datasets:
# plot the data in a_data
plt.scatter(a_data["x1"], a_data["x2"])
<matplotlib.collections.PathCollection at 0x11af5b070>
Now let's use K-Means on this data and see if it works well.
Most sklearn algorithms use a similar calling pattern:
result = module.ClassName(..args..).fit(data)
For the KMeans algorithm:
a_fit = cluster.KMeans(n_clusters=2).fit(a_data)
The output of a clustering algorithm is one label (the ID of the assigned cluster) for each data point:
print(a_fit.labels_)
[1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 0 1 0 1 1 0 1 0 0 1 1 1 0 0 1 0 0 1 0 1 0 0 1 1 0 1 0 1 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 0 1 0 1 1 0 0 1 0 1 1 0 1 1 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 0 1 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 1 1 1 0 1 0 1 1 1 0 0 1 1 0 0 0 0 0 1 1 1 0 1 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 1 1 0 1 0 0 1 1 0 0 0 1 1 1 1 1 0 0 0 0 1 0 1 0 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 0 1 0 0 0 1 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 1 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 0 1 1 1 0 0 1 1 1 0 1 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 0 0 1 0 0 0 1 0 0 1 1 1 0 1 1 1 0 1 1 0 1 0 1 1 0 1 0 0 0 0]
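Besides the per-sample labels, the fit result also stores the cluster centers; a quick look:
# Each row of cluster_centers_ is the centroid of one cluster,
# with columns in the same order as the input features (x1, x2).
print(a_fit.cluster_centers_)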
def display(data, fit):
    labels = fit.labels_
    plt.scatter(
        data["x1"], data["x2"],
        s=5,                 # set marker size
        c=labels,            # the colour is given by the labels
        cmap=plt.cm.Accent,  # set the colour palette
    )
    # set plot boundaries
    plt.xlim(-9, 9)
    plt.ylim(-5, 5)
display(a_data, a_fit)
EXERCISE: Use KMeans to fit the three other (b,c,d) 2D datasets with n_clusters=2
and generate similar plots. Which fits give the expected results?
b_fit = cluster.KMeans(n_clusters=2).fit(b_data)
display(b_data, b_fit)
c_fit = cluster.KMeans(n_clusters=2).fit(c_data)
display(c_data, c_fit)
d_fit = cluster.KMeans(n_clusters=2).fit(d_data)
display(d_data, d_fit)
The fit results look reasonable for (b), although the sharp dividing line between the two clusters looks artificial.
The fit results for (c) and (d) do not match what we expect because KMeans only considers the distances between pairs of samples, so it cannot identify the larger-scale patterns that are obvious by eye.
# Add your solution here...
Algorithms have many parameters that influence their results for a given dataset, but these fall into two categories: parameters whose values the algorithm adjusts itself in response to the data, and parameters whose values must be set externally before the algorithm runs.
We refer to the second group as "hyperparameters" and set their values during the "model selection" process, which we will discuss later.
DISCUSS: Are all of the arguments of the KMeans constructor hyperparameters?
In principle, yes, but in practice some of these arguments will have no (or minimal) impact on the algorithm's results under normal conditions. The arguments that are most clearly hyperparameters are: n_clusters, algorithm, and tol.
The arguments that are most clearly not hyperparameters are: verbose, precompute_distances, and n_jobs.
The remaining arguments are in a gray area. In general, it is prudent to experiment with your actual data to identify which arguments affect your results significantly.
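You can list the full set of constructor arguments (and their current values) for your installed sklearn version directly from the estimator, for example:
# get_params() returns a dict of every constructor argument and its current value.
print(cluster.KMeans(n_clusters=2).get_params())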
EXERCISE: Fit dataset (b) with the n_clusters
hyperparameter set to 3 and display the results. Comparing with the 2-cluster fit above, by eye, what do you think is the "true" number of clusters? How might you decide between 2 and 3 more objectively?
b_fit_3 = cluster.KMeans(n_clusters=3).fit(b_data)
display(b_data, b_fit_3)
The plot above makes a convincing case (to me, at least) that there are three clusters. However, the "truth" in this case is two clusters.
This illustrates the dangers of superimposing a fit result on your data: it inevitably "draws your eye" and makes the fit more credible. Look out for examples of this when reading papers or listening to talks!
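One way to be more objective, though still heuristic, is to compare a quantitative score across different choices of n_clusters, for example the value of the KMeans goal function (inertia_) or the silhouette score from sklearn.metrics; a sketch:
from sklearn import metrics

# Compare fits with different numbers of clusters using two heuristics:
# - inertia_ is the KMeans goal function; it always decreases as n_clusters grows,
#   so look for an "elbow" rather than a minimum.
# - the silhouette score is higher for better-separated clusters.
for n in (2, 3, 4):
    fit = cluster.KMeans(n_clusters=n).fit(b_data)
    score = metrics.silhouette_score(b_data, fit.labels_)
    print(f'n_clusters={n}: inertia={fit.inertia_:.1f}, silhouette={score:.3f}')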
# Add your solution here...
An algorithm to find clusters in 2D data is just automating what you could already do by eye. However, most clustering algorithms also work well with higher dimensional data, where the clusters might not be visible in any single 2D projection.
fit_3d = cluster.KMeans(n_clusters=3).fit(cluster_3d)
cluster_3d['label'] = fit_3d.labels_
# FutureWarning from scipy.stats is harmless
sns.pairplot(cluster_3d, vars=('x0', 'x1', 'x2'), hue='label');
These clusters look quite arbitrary in each of the 2D scatter plots. However, they are actually very well separated, as we can see if we rotate the axes:
R = np.array(
[[ 0.5 , -0.5 , -0.707],
[ 0.707, 0.707, 0. ],
[ 0.5 , -0.5 , 0.707]])
rotated_3d = cluster_3d.copy()
rotated_3d[['x0', 'x1', 'x2']] = cluster_3d[['x0', 'x1', 'x2']].dot(R)
sns.pairplot(rotated_3d, vars=('x0', 'x1', 'x2'), hue='label');
This example is contrived, but the lesson is that clustering algorithms can discover higher-dimensional structure that you might miss with visualization.
Now that we have introduced our first ML algorithm, this is a good time for some general comments.
Most ML algorithms share some common features: a goal function that quantifies how well the algorithm is doing, and parameters that the algorithm adjusts to optimize that goal function.
When meeting a new algorithm, it is worth asking what its goal function is, and how the algorithm goes about optimizing it.
The goal function of the KMeans algorithm is: $$ \sum_{i=1}^n\, \sum_{c_j = i}\, \left| x_j - \mu_i\right|^2 $$ where $c_j = i$ if sample $j$ is assigned to cluster $i$, and $$ \mu_i = \frac{1}{N_i} \sum_{c_j = i}\, x_j $$ is the mean of the $N_i$ samples assigned to cluster $i$. The outer sum is over the number of clusters $n$ and $j$ indexes samples. If we consider sample $x_j$ to be a vector, then its elements are the feature values.
DISCUSS: What are the parameters of the KMeans goal function? How many parameters are there?
The parameters are the cluster assignments $c_j$ and there is one per sample (row). Note that the number of parameters is independent of the number of features (columns) in the data.
The number of clusters $n$ is a hyperparameter since it is externally set and not adjusted by the algorithm in response to the data.
The means $\mu_i$ are not independent parameters since their values are fixed by the $c_j$ (given the data).
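We can check this formula directly against the earlier fit to dataset (a); a quick sketch (the hand-computed sum should agree with the inertia_ attribute reported by sklearn, up to floating-point rounding):
# Recompute the KMeans goal function by hand for the fit to dataset (a).
total = 0.0
for i, mu in enumerate(a_fit.cluster_centers_):
    members = a_data[a_fit.labels_ == i]          # samples assigned to cluster i
    total += ((members - mu) ** 2).values.sum()   # add their squared distances to mu_i
print(total, a_fit.inertia_)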
ML algorithms come in two flavors, depending on whether they require some training data where you already know the answer ("supervised") or not ("unsupervised"). Clustering algorithms are unsupervised.
An advantage of unsupervised ML is that it works with any input data, and can discover patterns that you might not already know about (as in the 3D example above). Even when you have training data available, an unsupervised algorithm can still be useful.
The disadvantage of unsupervised learning is that we cannot formulate objective measures for how well an algorithm is performing, so the results are always somewhat subjective.
We have focused on KMeans as a prototypical clustering algorithm, but there are many others to choose from.
We will finish this section with some brief experimentation with two alternatives that use more global information than KMeans, so are better suited to examples (c) and (d) above:
Here is a visualization (taken from the sklearn documentation) of the many clustering algorithms available in sklearn.cluster
and their behaviour on some stereotyped datasets:
Image from sklearn documentation
EXERCISE: Use cluster.SpectralClustering
to fit c_data
and d_data
and display the results. Adjust the default hyperparameters, if necessary, to obtain the expected results.
c_fit = cluster.SpectralClustering(n_clusters=2).fit(c_data)
display(c_data, c_fit)
d_fit = cluster.SpectralClustering(n_clusters=2, gamma=2.0).fit(d_data)
display(d_data, d_fit)
# Add your solution here...
EXERCISE: Use cluster.DBSCAN
to fit c_data
and d_data
and display the results. Adjust the default hyperparameters, if necessary, to obtain the expected results.
c_fit = cluster.DBSCAN(eps=1.5).fit(c_data)
display(c_data, c_fit)
d_fit = cluster.DBSCAN().fit(d_data)
display(d_data, d_fit)
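Note that, unlike KMeans, DBSCAN does not take the number of clusters as an input, and it can also decide that some samples are noise, which it marks with the special label -1; a quick check:
# DBSCAN assigns the label -1 to any samples it considers noise.
print(np.unique(d_fit.labels_))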