The following is part of the hands-on sessions of the AIPhy school. The notebook gives an introduction to machine learning methods in Python. The tutorials deal with different unsupervised and supervised algorithms: students will learn how to build them from scratch using basic Python classes, and then apply different techniques to test and evaluate them on toy and real-world datasets.
I recommend you use a Python virtual environment to set up your work area:
python -m venv .venv
At the time of writing, the Python version is 3.10.12: you can change this either with your distribution package manager, or by installing a Conda environment (e.g. conda create -y -n aiphy python"==3.10.12" && conda activate aiphy).
You can then activate it with:
source .venv/bin/activate
Before we begin, remember to install the requirements:
pip install -r requirements.txt
We shall use mainly the numpy
library for the algorithms, and the matplotlib
library for plots.
The K-means algorithm is an unsupervised learning algorithm used for exploration and clustering.
The intuitive idea is to partition a dataset $\mathcal{D} = \{ \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n \}$ ($\mathbf{x}_i \in \mathbb{R}^p~ \forall i = 1, \dots, n$) into $K$ clusters (or sets) $S_k$, such that $\bigcup\limits_{k = 1}^K S_k = \mathcal{D}$. Each cluster is represented by a centroid, $\{ \mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_K \}$ with $\mathbf{c}_k \in \mathbb{R}^p~ \forall k = 1, \dots, K$:
$$ \mathbf{c}_k = \frac{1}{\left| S_k \right|} \sum\limits_{j~|~\mathbf{x}_j \in S_k} \mathbf{x}_j. $$The algorithm is trained to find the best centroids for each cluster, so as to minimise the distance between samples in the same cluster:
$$ \argmin\limits_{S_1, \dots, S_K} \sum\limits_{k = 1}^K \sum\limits_{j~|~\mathbf{x}_j \in S_k} \left|\left| \mathbf{x}_j - \mathbf{c}_k \right|\right|^2. $$This objective function can be minimised using different algorithms, for instance the widely known and used Lloyd's algorithm:
1. initialise the $K$ centroids (e.g. by picking $K$ random data points);
2. assign each data point to the cluster of its nearest centroid;
3. recompute each centroid as the barycentre of the points assigned to it;
4. repeat steps 2 and 3 until the centroids stop moving (within a tolerance) or a maximum number of iterations is reached.
In the following, we build the code for a complete K-Means algorithm using Python classes. We first import the necessary libraries and build an abstract clustering class, capable of fitting data and predicting cluster assignments.
%load_ext autoreload
%autoreload 2
%matplotlib inline
import numpy as np
import jdc # this is a Jupyter extension to write classes in multiple cells
import matplotlib as mpl
from typing import Tuple
from numpy.typing import NDArray
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA # let us use the "real" implementation
from IPython.display import display, Image
We then select a style for the plots: you can freely play with this parameter, but I like this one.
plt.style.use('grayscale')
We first consider the structure of the desired clustering model. Ideally, it needs a way to initialise the centroids, to compute distances between data points and centroids, to assign each point to a cluster, to update the centroid positions, and to expose fit and predict methods.
Here is an abstract implementation to be inherited by the actual implementation of the K-Means class:
class Clustering:
"""An abstract clustering class,"""
def __init__(self, n_clusters: int = 2) -> None:
"""
Parameters
----------
n_clusters : int
The number of clusters to be created.
Raises
------
ValueError
If the number of clusters is less than 2.
"""
if n_clusters < 2:
raise ValueError(
'The number of clusters must be at least 2! However, we found %d < 2 clusters.'
% n_clusters)
self.n_clusters = n_clusters
self._fitted = False # save the fitted status of the clustering
def _initialize_centres(self, X: NDArray) -> NDArray:
"""
Initialise the positions of the centroids.
Parameters
----------
X : NDArray
The data to be clustered.
Returns
-------
NDArray
The initial positions of the centroids.
"""
raise NotImplementedError(
'All subclasses of Clustering must implement the _initialize_centres method!'
)
def _euclidean_distance(self, A: NDArray, B: NDArray) -> NDArray:
"""
Compute the Euclidean distance between two sets of points.
Parameters
----------
A : NDArray
The first set of points.
B : NDArray
The second set of points.
Returns
-------
NDArray
The Euclidean distance between the two sets of points.
"""
raise NotImplementedError(
'All subclasses of Clustering must implement the _euclidean_distance method!'
)
def _cluster_assignment(self, dist: NDArray) -> NDArray:
"""
Compute the assignment of each point to a cluster.
Parameters
----------
dist : NDArray
The distance matrix.
Returns
-------
NDArray
The cluster assignment of each point.
"""
raise NotImplementedError(
'All subclasses of Clustering must implement the _cluster_assignment method!'
)
def _centre_positions(self, X: NDArray, clust: NDArray) -> NDArray:
"""
Compute the positions of the new centroids.
Parameters
----------
X : NDArray
The data to be clustered.
clust : NDArray
The cluster assignment of each point.
Returns
-------
NDArray
The positions of the new centroids.
"""
raise NotImplementedError(
'All subclasses of Clustering must implement the _centre_positions method!'
)
def fit(self, X: NDArray) -> 'Clustering':
"""
Fit the clustering model.
Parameters
----------
X : NDArray
The data to be clustered.
Returns
-------
Clustering
The fitted clustering model.
"""
raise NotImplementedError(
'All subclasses of Clustering must implement the fit method!')
def predict(self, X: NDArray) -> Tuple[NDArray, NDArray]:
"""
Predict the cluster assignment of each point.
Parameters
----------
X : NDArray
The data to be clustered.
Returns
-------
Tuple[NDArray, NDArray]
The cluster assignment of each point and the positions of the new centroids.
"""
raise NotImplementedError(
'All subclasses of Clustering must implement the predict method!')
def fit_predict(self, X: NDArray) -> Tuple[NDArray, NDArray]:
"""
Fit and predict the labels of the data.
This is a wrapper for the `fit` and `predict` methods.
Parameters
----------
X : NDArray
The data to be clustered.
Returns
-------
Tuple[NDArray, NDArray]
The cluster assignment of each point and the positions of the new centroids.
"""
self.fit(X)
return self.predict(X)
We can now proceed to implement the K-Means class.
Let us start with the constructor. Clearly, a clustering method must specify the number of clusters we would like to produce (this is already taken into account in the abstract class). However, K-Means works iteratively: we should add a way to stop the iteration with a convergence criterion. Moreover, the initialisation is random: for reproducibility we should fix the seed of a random number generator.
class KMeans(Clustering):
"""K-Means Clustering using the Lloyd's algorithm."""
def __init__(self,
n_clusters: int = 2,
max_iter: int = 1000,
tol: float = 0.0001,
seed: int = 42) -> None:
"""
Parameters
----------
n_clusters : int
The number of clusters to be created.
max_iter : int
The maximum number of iterations allowed.
tol : float
The stopping criterion (tolerance) on the displacement of the centroids between iterations.
seed : int
The random seed.
Raises
------
ValueError
If the number of clusters is less than 2.
"""
# YOUR CODE HERE
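If you are unsure where to start, here is a minimal sketch of a possible constructor body (not the only valid solution): it delegates the validation of n_clusters to the parent class and simply stores the extra hyperparameters.
def __init__(self,
             n_clusters: int = 2,
             max_iter: int = 1000,
             tol: float = 0.0001,
             seed: int = 42) -> None:
    # Reuse the validation logic of the abstract Clustering class
    super().__init__(n_clusters=n_clusters)
    # Store the extra hyperparameters used by Lloyd's algorithm
    self.max_iter = max_iter
    self.tol = tol
    self.seed = seed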
We then start by initialising the centroids: we shall choose random data points from the dataset to be the initial centres of the clusters.
HINT: take a look at the numpy
library for something to "choose" from a set of data points.
%%add_to KMeans
def _initialize_centres(self, X: NDArray) -> NDArray:
"""
Initialise the positions of the centroids.
Parameters
----------
X : NDArray
The data to be clustered.
Returns
-------
NDArray
The initial positions of the centroids.
"""
# YOUR CODE HERE
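If you get stuck, here is a standalone sketch of the hinted trick (assuming the initial centroids are chosen as distinct rows of the data matrix); inside the class you would use self.n_clusters and the stored seed instead of the hard-coded values.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(6, 2))                     # 6 points in 2 dimensions
# Pick 2 distinct row indices, then use them to index the data
idx = rng.choice(X_demo.shape[0], size=2, replace=False)
initial_centres = X_demo[idx]
print(initial_centres.shape)                         # (2, 2): one row per centroid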
To proceed, we need a function capable of computing the distance between two sets of points, namely the data (of shape (n_samples, n_dimensions)) and the centroids (of shape (n_clusters, n_dimensions)). The output of the function should be a distance matrix of shape (n_samples, n_clusters), in order to have the distance of each piece of data with respect to each cluster.
HINT: this is actually not that obvious, and it might require knowing something about "broadcasting" numpy
arrays.
However, different implementations may have different ways of doing this: there is no "right" way to compute it, just do it (we shall not worry about computing time or optimisation here)!
%%add_to KMeans
def _euclidean_distance(self, A: NDArray, B: NDArray) -> NDArray:
"""
Compute the Euclidean distance between two sets of points.
Parameters
----------
A : NDArray
The first set of points.
B : NDArray
The second set of points.
Returns
-------
NDArray
The Euclidean distance between the two sets of points.
"""
# YOUR CODE HERE
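As a standalone illustration of the broadcasting trick (a sketch; any other correct way of computing the pairwise distances is equally fine):
A_demo = np.random.default_rng(0).normal(size=(5, 3))    # (n_samples, n_dimensions)
B_demo = np.random.default_rng(1).normal(size=(2, 3))    # (n_clusters, n_dimensions)
# Insert singleton axes so the difference broadcasts to (n_samples, n_clusters, n_dimensions)
diff = A_demo[:, None, :] - B_demo[None, :, :]
dist_demo = np.linalg.norm(diff, axis=-1)                # (n_samples, n_clusters)
print(dist_demo.shape)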
Cluster assignments should be easy once we have the distance matrix of shape (n_samples, n_clusters)
.
Just find the nearest cluster for each data point!
%%add_to KMeans
def _cluster_assignment(self, dist: NDArray) -> NDArray:
"""
Compute the assignment of each point to a cluster.
Parameters
----------
dist : NDArray
The distance matrix.
Returns
-------
NDArray
The cluster assignment of each point.
"""
# YOUR CODE HERE
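For reference, a minimal sketch of the assignment step on a toy distance matrix (the cluster index of each sample is the column with the smallest distance):
dist_demo = np.array([[0.3, 2.0],
                      [1.5, 0.1],
                      [0.2, 0.9]])            # shape (n_samples, n_clusters)
labels_demo = np.argmin(dist_demo, axis=-1)   # nearest centroid for each sample
print(labels_demo)                            # [0 1 0]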
The next step is to compute the positions of the new centroids: given the data points and their cluster assignments, we need to compute the barycentre of each cluster.
%%add_to KMeans
def _centre_positions(self, X: NDArray, clust: NDArray) -> NDArray:
"""
Compute the positions of the new centroids.
Parameters
----------
X : NDArray
The data to be clustered.
clust : NDArray
The cluster assignment of each point.
Returns
-------
NDArray
The positions of the centroids.
"""
# YOUR CODE HERE
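A minimal sketch of the barycentre computation on toy data (looping over the cluster indices is perfectly acceptable here):
X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [12.0, 8.0]])
clust_demo = np.array([0, 0, 1, 1])
# Barycentre of each cluster: the mean of the points assigned to it
centres_demo = np.array([X_demo[clust_demo == k].mean(axis=0) for k in range(2)])
print(centres_demo)                           # cluster means: (0.5, 0.5) and (11.0, 9.0)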
We can finally put it all together in a "training loop": given some "training data", we need to compute the positions of the centroids until convergence is reached (here modelled as both a maximum number of iterations and a tolerance).
HINT: we should take care of invalid inputs. Suppose the number of clusters is greater than the number of data points: does this make any sense?
BONUS: it might be interesting to keep track of the history of the centroids during the training loop. Can you think of a way to do that? Maybe a list of positions might do the job...
%%add_to KMeans
def fit(self, X: NDArray) -> 'KMeans':
"""
Fit the clustering model.
Parameters
----------
X : NDArray
The data to be clustered.
Returns
-------
KMeans
The fitted clustering model.
"""
# Check that the number of clusters is consistent with the data
if self.n_clusters > X.shape[0]:
# YOUR CODE HERE
# Initialise the centroids
# YOUR CODE HERE
# Keep track of the history of centroids
centres_history = [centres]
# Perform the clustering (implement stopping criterion based on iterations)
# YOUR CODE HERE
# Save the position of the centres
old_centres = centres
# Compute distance and assign the clusters
# YOUR CODE HERE
# Do not forget to update the clusters (and store the history)
# YOUR CODE HERE
# Apply the stopping criterion based on tolerance
# YOUR CODE HERE
# Save the results
self.cluster_centres_ = centres
self.cluster_centres_history_ = np.array(centres_history)
self.n_iter = n
# Update the fitted status
self._fitted = # YOUR CODE HERE
return self
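If you get stuck, here is a sketch of a possible training loop, written as a standalone function that uses the helper methods defined above (the attribute and variable names mirror the skeleton of the cell above; empty clusters are not handled in this sketch):
def lloyd_fit_sketch(model: 'KMeans', X: NDArray) -> 'KMeans':
    # More clusters than data points makes no sense
    if model.n_clusters > X.shape[0]:
        raise ValueError('The number of clusters exceeds the number of data points!')
    # Initialise the centroids and keep track of their history
    centres = model._initialize_centres(X)
    centres_history = [centres]
    # Iterate at most max_iter times
    for n in range(model.max_iter):
        old_centres = centres
        # Assign each point to its nearest centroid
        dist = model._euclidean_distance(X, centres)
        clust = model._cluster_assignment(dist)
        # Update the centroids and store the history
        centres = model._centre_positions(X, clust)
        centres_history.append(centres)
        # Stop when the centroids barely move between iterations
        if np.linalg.norm(centres - old_centres) < model.tol:
            break
    # Save the results and update the fitted status
    model.cluster_centres_ = centres
    model.cluster_centres_history_ = np.array(centres_history)
    model.n_iter = n
    model._fitted = True
    return model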
Finally, write the prediction loop: once we have the centroids, this amounts to assigning a cluster to each piece of data independently.
N.B.: if the fitting method has not been called, we should probably raise an error...
%%add_to KMeans
def predict(self, X: NDArray) -> Tuple[NDArray, NDArray]:
"""
Predict the closest cluster each sample in X belongs to.
Parameters
----------
X : NDArray
The data to be clustered.
Returns
-------
Tuple[NDArray, NDArray]
The predicted cluster indices for each sample in X and the cluster centres.
"""
# Check that the model has been fitted
if not self._fitted:
raise RuntimeError('KMeans must be fitted before calling the predict method!')
# Compute the cluster assignment
X = X.copy()
# YOUR CODE HERE
# Return cluster assignments (labels), and the position of the centres
return labels, self.cluster_centres_
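For comparison, a sketch of the prediction step written as a standalone function (it simply reuses the distance and assignment helpers on the stored centroids):
def kmeans_predict_sketch(model: 'KMeans', X: NDArray) -> Tuple[NDArray, NDArray]:
    if not model._fitted:
        raise RuntimeError('KMeans must be fitted before calling the predict method!')
    # Distance of every sample to every stored centroid, then nearest-cluster assignment
    dist = model._euclidean_distance(X, model.cluster_centres_)
    labels = model._cluster_assignment(dist)
    return labels, model.cluster_centres_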
As usual, before moving on, let us test the implementation of the functions:
# THIS IS A TEST CELL. DO NOT DELETE IT.
# Generate some test data
A = np.array([[2, 3], [4, 5], [5, 6]])
B = np.array([[0, 1], [5, 0]])
X = np.vstack([A, B])
# Create the model
kmeans = KMeans(n_clusters=2)
# Check the centre initialisation method
centres = kmeans._initialize_centres(X)
if centres.shape != (2, 2):
display(Image('img/allegri_giacca.gif', width=500))
raise ValueError(
'The shape of the centroid list is incorrect! It should be (2, 2), but we received %s.'
% centres.shape)
for c in centres:
if (not c in A) and (not c in B):
display(Image('img/allegri_giacca.gif', width=500))
raise ValueError(
'The centroid list is incorrect! The centroid %s is neither in the subset A nor in the subset B.'
% c)
# Check the euclidean distance method
dist_gt = np.linalg.norm(X.reshape(-1, 1, 2) - centres.reshape(1, -1, 2),
axis=-1)
dist = kmeans._euclidean_distance(X, centres)
if dist.shape != dist_gt.shape:
display(Image('img/allegri_dipoco.gif', width=500))
raise ValueError(
'The shape of the distance matrix is incorrect! It should be %s, but we received %s.'
% (dist_gt.shape, dist.shape))
if not np.allclose(dist, dist_gt):
display(Image('img/allegri_dipoco.gif', width=500))
raise ValueError(
'There is a problem with the euclidean distance method! The computed distances do not match.'
)
# Check the cluster assignment method
clust_gt = np.argmin(dist_gt, axis=-1)
clust = kmeans._cluster_assignment(dist)
if (clust_gt != clust).any():
display(Image('img/allegri_dipoco.gif', width=500))
raise ValueError(
'There is a problem with the cluster assignment method! The computed cluster assignments do not match.'
)
# Compute the new centre positions
centres_gt = [X[clust_gt == i].mean(axis=0) for i in range(2)]
centres = kmeans._centre_positions(X, clust)
if not np.allclose(centres, centres_gt):
display(Image('img/allegri_dipoco.gif', width=500))
raise ValueError(
'There is a problem with the new centre positions! The computed centre positions do not match.'
)
# Check the fit method
kmeans = kmeans.fit(X)
if (not hasattr(kmeans, 'cluster_centres_')) and (not hasattr(
kmeans, 'cluster_centres')):
display(Image('img/allegri_dipoco.gif', width=500))
raise ValueError('The fit method does not return the cluster centres!')
if not kmeans._fitted:
display(Image('img/allegri_giacca.gif', width=500))
raise RuntimeError(
'The K-Means class does not update the fitted state correctly!')
# Everything passed
print('All tests passed!')
display(Image('img/allegri_calma.gif', width=500))
We proceed to test the unsupervised algorithm on some synthetic data. We generate data from three different bivariate (2D) normal distributions, presenting both easy and difficult cluster separations:
$$ \mathcal{N}(\mathbf{x} ~\mid~ \mathbf{\mu}, \mathbf{\Sigma}) = \frac{1}{2 \pi \sqrt{\det{\mathbf{\Sigma}}}} \exp\left(-\frac{1}{2} \left(\mathbf{x} - \mathbf{\mu}\right)^T \mathbf{\Sigma}^{-1} \left(\mathbf{x} - \mathbf{\mu} \right)\right), $$where $\mathbf{\Sigma}$ is the population covariance matrix, and $\mathbf{\mu}$ is the population mean.
gen = np.random.default_rng()
A = gen.multivariate_normal(mean=[0, 0], cov=[[6, 3], [3, 4]], size=1000)
A_labels = np.array([0] * 1000)
B = gen.multivariate_normal(mean=[8, 9], cov=[[1, 0], [0, 1]], size=1000)
B_labels = np.array([1] * 1000)
C = gen.multivariate_normal(mean=[-5, 10], cov=[[3, 2], [2, 2]], size=1000)
C_labels = np.array([2] * 1000)
X = np.vstack([A, B, C])
y = np.hstack([A_labels, B_labels, C_labels])
We then fit the KMeans
clustering algorithm and predict the cluster assignments.
As this is simply a test, we shall use 3 clusters.
However, you are invited to experiment with different values for the number of clusters.
kmeans = # YOUR CODE HERE
y_pred, centres = # YOUR CODE HERE
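One possible way to fill in the two lines above (a sketch; the seed is arbitrary):
kmeans = KMeans(n_clusters=3, seed=42)
y_pred, centres = kmeans.fit_predict(X)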
We then plot the data and the predicted clusters in two separate axes for visualisation purposes. Does the cluster separation look good?
HINT: this is a scatter plot. You can plot separately different clusters (e.g. using a loop over the predicted labels) to better control the colours of the markers.
# Create a figure
fig, ax = plt.subplots(ncols=2, figsize=(12, 5), layout='constrained')
# Plot the original data
ax[0].scatter( # YOUR CODE HERE,
color='r', alpha=0.5, label='data (1)')
ax[0].scatter( # YOUR CODE HERE,
color='g', alpha=0.5, label='data (2)')
ax[0].scatter( # YOUR CODE HERE,
color='b', alpha=0.5, label='data (3)')
ax[0].legend()
ax[0].set_xlabel('x')
ax[0].set_ylabel('y')
ax[0].set_title('Original Data')
# Plot the cluster assignments (loop over the labels)
# YOUR CODE HERE
# Plot the cluster centres for the sake of completeness
ax[1].scatter(centres[..., 0], centres[..., 1], marker='x', label='centroids')
ax[1].legend()
ax[1].set_xlabel('x')
ax[1].set_ylabel('y')
ax[1].set_title('Cluster Assignments')
plt.show()
Before moving on, it has to be noted that clustering algorithms, as well as classification algorithms, do not simply assign a piece of data to a given class. The real effect of training such functions is to create a partition of the feature space: each data point is assigned the class of the region of the space it falls into. In other words, clustering algorithms and classification algorithms create piece-wise constant functions in feature space.
# Create a grid of points
xx = np.linspace(1.1 * X[..., 0].min(), 1.1 * X[..., 0].max(), 1000)
yy = np.linspace(1.1 * X[..., 1].min(), 1.1 * X[..., 1].max(), 1000)
# Create a meshgrid
xx, yy = np.meshgrid(xx, yy)
# Flatten the meshgrid
X_mesh = np.stack([xx.ravel(), yy.ravel()], axis=-1)
# Predict the cluster assignments
y_pred_mesh, _ = kmeans.predict(X_mesh)
# Build the contour plot
fig, ax = plt.subplots(figsize=(6, 5), layout='constrained')
cmap = ['r', 'g', 'b']
cmap_mpl = ListedColormap(cmap)
ax.contourf(xx, yy, y_pred_mesh.reshape(xx.shape), cmap=cmap_mpl, alpha=0.3)
for i in np.unique(y_pred):
ax.scatter(X[y_pred == i, 0],
X[y_pred == i, 1],
c=cmap[i],
alpha=0.75,
label=f'cluster ({i+1:d})')
ax.scatter(centres[..., 0], centres[..., 1], marker='x', label='centroids')
ax.legend()
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Decision Boundaries')
plt.show()
It starts to become interesting! We would like to evaluate the quality of the clustering, as we can see that not all data points have been correctly labelled. We will use an information-theoretical approach and compute the mutual information between the cluster assignments and the ground truth. In other words, we will measure the "amount of information" (in bits) that the cluster assignments share with the ground-truth labels:
$$ \mathrm{MI}(Y, \widehat{Y}) = \sum\limits_{k, h = 1}^K \mathrm{P}\left(Y = y_k \wedge \widehat{Y} = \widehat{y}_h\right) \log_2 \left( \frac{\mathrm{P}\left(Y = y_k \wedge \widehat{Y} = \widehat{y}_h \right)}{\mathrm{P}(Y = y_k)\, \mathrm{P}(\widehat{Y} = \widehat{y}_h)} \right) = \mathrm{H}(Y) - \mathrm{H}(Y \mid \widehat{Y}), $$where $\mathrm{H}(X)$ is the entropy of the variable $X$ and $\mathrm{H}(X \mid Y)$ is the conditional entropy of $X$ given $Y$:
$$ \mathrm{H}(Z) = - \sum\limits_{i = 1}^N \mathrm{P}(Z = z_i) \log_2 \mathrm{P}(Z = z_i), $$$$ \mathrm{H}(Z \mid Y) = \mathrm{H}(Z \wedge Y) - \mathrm{H}(Y). $$In our implementation, we will actually use a normalised version of the MI, the Normalised Mutual Information (NMI):
$$ \mathrm{NMI}(X, Y) = \frac{\mathrm{MI}(X, Y)}{\frac{\mathrm{H}(X) + \mathrm{H}(Y)}{2}}, $$which is more sensitive to extreme cases, such as unbalanced cluster assignments. Moreover, it is more directly comparable, as it is a normalised (dimensionless) quantity.
Before proceeding, let me suggest you implement the following functions:
N.B.: use the $\log_2$ function in your implementation.
def entropy(X: NDArray) -> float:
"""
Entropy of the random variable X: H(X)
Parameters
----------
X : array_like
The random variable
Returns
-------
H : float
The entropy
"""
# YOUR CODE HERE
def entropy_joint(X: NDArray, Y: NDArray) -> float:
"""
Joint entropy of the random variables X and Y: H(X, Y)
Parameters
----------
X : array_like
The first random variable
Y : array_like
The second random variable
Returns
-------
H : float
The joint entropy
"""
# YOUR CODE HERE
def entropy_conditional(X: NDArray, Y: NDArray) -> float:
"""
Conditional entropy of the random variable Y given X: H(Y | X)
Parameters
----------
X : array_like
The first random variable
Y : array_like
The second random variable
Returns
-------
H : float
The conditional entropy
"""
# YOUR CODE HERE
def mutual_information(X: NDArray, Y: NDArray) -> float:
"""
Mutual information of the random variables X and Y: MI(X, Y)
Parameters
----------
X : array_like
The first random variable
Y : array_like
The second random variable
Returns
-------
MI : float
The mutual information
"""
# YOUR CODE HERE
def normalised_mutual_information(X: NDArray, Y: NDArray) -> float:
"""
Normalised mutual information of the random variables X and Y: NMI(X, Y)
Parameters
----------
X : array_like
The first random variable
Y : array_like
The second random variable
Returns
-------
NMI : float
The normalised mutual information
"""
# YOUR CODE HERE
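If you are unsure how to estimate the probabilities, here is a compact sketch based on empirical frequencies (np.unique does most of the work); the helper names carry a _sketch suffix so they do not clash with the functions you are asked to implement:
def entropy_sketch(X: NDArray) -> float:
    # Empirical probability of each distinct value
    _, counts = np.unique(X, return_counts=True)
    p = counts / counts.sum()
    return -float(np.sum(p * np.log2(p)))

def entropy_joint_sketch(X: NDArray, Y: NDArray) -> float:
    # Treat each (x, y) pair as a single symbol of the joint variable
    pairs = np.stack([X, Y], axis=-1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -float(np.sum(p * np.log2(p)))

def normalised_mutual_information_sketch(X: NDArray, Y: NDArray) -> float:
    h_x, h_y = entropy_sketch(X), entropy_sketch(Y)
    # MI(X, Y) = H(X) + H(Y) - H(X, Y)
    mi = h_x + h_y - entropy_joint_sketch(X, Y)
    return mi / (0.5 * (h_x + h_y))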
# THIS IS A TEST CELL. DO NOT MODIFY IT!
# Generate some data
P = np.array([0, 1, 2, 3, 2, 3, 1, 1, 1, 0, 2, 3, 0, 2])
# Compute the entropy
_, counts = np.unique(P, return_counts=True)
prob = counts / sum(counts)
H_gt = -float(sum(prob * np.log2(prob)))
H = entropy(P)
if not np.isclose(H, H_gt):
display(Image('img/allegri_giacca.gif', width=500))
raise ValueError('Entropy mismatch: %r != %r' % (H, H_gt))
# All tests passed
print('All tests passed!')
display(Image('img/allegri_calma.gif', width=500))
We finally move on to the evaluation of the clustering results, by varying the number of clusters and computing the normalised mutual information.
N.B.: this is literally just a test case: we do not worry about finding a good training/test split, as it is only for illustration purposes.
nmi = []
# Loop over different numbers of clusters
for k in range(2, 10):
# Fit the k-means model
# YOUR CODE HERE
# Evaluate the k-means model
# YOUR CODE HERE
# Convert to array
nmi = np.array(nmi)
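A possible version of the loop above (a sketch, assuming the KMeans class and the normalised_mutual_information function defined earlier have been completed; the seed is arbitrary):
nmi = []
for k in range(2, 10):
    # Fit on the synthetic data and compare the labels with the ground truth
    labels_k, _ = KMeans(n_clusters=k, seed=42).fit_predict(X)
    nmi.append(normalised_mutual_information(y, labels_k))
nmi = np.array(nmi)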
We plot the values of the normalised mutual information against the number of clusters. What can we reasonably expect to find?
# Create the figure
fig, ax = plt.subplots(figsize=(6, 5), layout='constrained')
# Plot the results
# YOUR CODE HERE
ax.set_xlabel('no. of clusters')
ax.set_ylabel('normalised mutual information')
ax.set_title('Clustering Evaluation')
plt.show()
Does the result look reasonable? Does it shock you?
We proceed to a different case study with real data. In particular, we use the Wisconsin Breast Cancer dataset and try to use clustering for the binary classification of malignant ($Y = 1$) and benign ($Y = 0$) instances.
X, y = load_breast_cancer(return_X_y=True)
gen = np.random.default_rng(42)
# Divide the dataset into training and test sets
# YOUR CODE HERE
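One simple way to split the data (a sketch using a random permutation; the 80/20 ratio is an arbitrary choice):
perm = gen.permutation(X.shape[0])
n_train = int(0.8 * X.shape[0])
X_train, y_train = X[perm[:n_train]], y[perm[:n_train]]
X_test, y_test = X[perm[n_train:]], y[perm[n_train:]]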
We perform a PCA to reduce the dimensionality of the data and visualise them.
We shall use the scikit-learn
implementation of PCA: it is slightly different from what we coded, but not that much...
N.B.: the main differences between our implementation and scikit-learn's are that the transform method outputs a single value (the principal components), and that the loadings_ attribute is called components_.
# Perform the PCA
pca = PCA(n_components=2)
X_train_vis = # YOUR CODE HERE
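A possible way to fill in the projection (the scikit-learn PCA fits and projects in a single call):
X_train_vis = pca.fit_transform(X_train)   # shape (n_train_samples, 2)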
We then plot the data using the principal components:
fig, ax = plt.subplots(figsize=(6, 5), layout='constrained')
ax.scatter(# YOUR CODE HERE)
ax.set_xlabel(
f'1st principal component ({pca.explained_variance_ratio_[0]:.1%})')
ax.set_ylabel(
f'2nd principal component ({pca.explained_variance_ratio_[1]:.1%})')
ax.set_title('Breast Cancer Wisconsin Dataset')
plt.show()
Let us try to evaluate the quality of the clustering by performing the computation for several values of the number of clusters:
metric = []
# Loop over the number of clusters
for k in # YOUR CODE HERE:
# Fit the k-means model
# YOUR CODE HERE
# Compute some metric
# YOUR CODE HERE
metric = np.array(metric)
# Plot the metric
fig, ax = plt.subplots(figsize=(6, 5), layout='constrained')
ax.plot(# YOUR CODE HERE,
metric, 'k-o')
ax.set_xlabel('no. of clusters')
ax.set_ylabel('normalised mutual information')
ax.set_title('Clustering Evaluation')
plt.show()
It seems that a low number of clusters is better than a high number of clusters. Let us try to use this information to predict the labels of the data:
kmeans = # YOUR CODE HERE
y_train_pred, train_centres = # YOUR CODE HERE
For visualisation purposes, let us apply the same PCA transformation to the centroids:
train_centres_vis = # YOUR CODE HERE
Finally, let us visualise data (with their labels), and the cluster assignments. What can we say about the quality of the clustering?
# Create a figure
fig, ax = plt.subplots(ncols=2, figsize=(12, 5), layout='constrained')
fig.suptitle('Breast Cancer Wisconsin Dataset (training set)')
# Plot the original data using the 2D PCA
# YOUR CODE HERE
ax[0].set_xlabel(
f'1st principal component ({pca.explained_variance_ratio_[0]:.1%})')
ax[0].set_ylabel(
f'2nd principal component ({pca.explained_variance_ratio_[1]:.1%})')
ax[0].set_title('Original Data (PCA 2D)')
# Plot the cluster assignments using the 2D PCA
# YOUR CODE HERE
ax[1].scatter(train_centres_vis[..., 0], train_centres_vis[..., 1], marker='x')
ax[1].set_xlabel(
f'1st principal component ({pca.explained_variance_ratio_[0]:.1%})')
ax[1].set_ylabel(
f'2nd principal component ({pca.explained_variance_ratio_[1]:.1%})')
ax[1].set_title(f'Clusters (PCA 2D) | NMI = {nmi[0]:.2f}')
plt.show()
Let us compute the NMI on the test set, to see if the result is consistent:
y_test_pred, test_centres = # YOUR CODE HERE
nmi = # YOUR CODE HERE
print(f'NMI = {nmi:.2%}')
Finally, plot the test data and the cluster assignments:
# Compute the PCA
X_test_vis = # YOUR CODE HERE
test_centres_vis = # YOUR CODE HERE
# Build a figure
fig, ax = plt.subplots(ncols=2, figsize=(12, 5), layout='constrained')
fig.suptitle('Breast Cancer Wisconsin Dataset (test set)')
# Plot the test data using the 2D PCA
# YOUR CODE HERE
ax[0].set_xlabel(
f'1st principal component ({pca.explained_variance_ratio_[0]:.1%})')
ax[0].set_ylabel(
f'2nd principal component ({pca.explained_variance_ratio_[1]:.1%})')
ax[0].set_title('Original Data (PCA 2D)')
# Plot the cluster assignments using the 2D PCA
# YOUR CODE HERE
ax[1].scatter(test_centres_vis[..., 0], test_centres_vis[..., 1], marker='x')
ax[1].set_xlabel(
f'1st principal component ({pca.explained_variance_ratio_[0]:.1%})')
ax[1].set_ylabel(
f'2nd principal component ({pca.explained_variance_ratio_[1]:.1%})')
ax[1].set_title(f'Clusters (PCA 2D) | NMI = {nmi:.2f}')
plt.show()
Building on the K-Means clustering, we now try to classify the data using a Logistic Regression classifier. In particular, we shall use the cluster assignments as a feature-engineering procedure to help the classifier.
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
ConfusionMatrixDisplay, RocCurveDisplay)
Let us first compute a baseline score, by performing a classification over the untouched dataset.
As we are using scikit-learn
for the classifier, we can also access its predict_proba
method to compute the probability estimates for the classes, instead of simply the thresholded result.
QUESTION: as this is a binary classification problem, do we need to know the probability estimates for both negative and positive classes?
# Rescale the input data
prep = StandardScaler()
# YOUR CODE HERE
# Perform the regression and predict the labels
clf = LogisticRegression()
# YOUR CODE HERE
y_prob = clf.predict_proba(X_test)[:, 1] # positive class is enough
# Print the precision, recall and F1 score
acc = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_test,
y_pred,
average='binary')
print(f'Accuracy = {acc:.2%}')
print(f'Precision = {precision:.2%}')
print(f'Recall = {recall:.2%}')
print(f'F1 = {f1:.2%}')
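For reference, a sketch of the two steps marked YOUR CODE HERE above: fit the scaler on the training set only, overwrite the arrays so the later cells stay consistent, then fit the classifier and predict the labels.
X_train = prep.fit_transform(X_train)
X_test = prep.transform(X_test)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)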
Let us then visualise the confusion matrix and the Receiver Operating Characteristic (ROC):
ConfusionMatrixDisplay.from_predictions(y_test,
y_pred,
normalize='all',
values_format='.1%',
display_labels=['benign', 'malignant'])
RocCurveDisplay.from_predictions(y_test,
y_prob,
color='r',
name='plain classifier',
plot_chance_level=True)
In this last part of the tutorial, we perform some basic feature engineering on the dataset. In particular, we would like to add what we know about the structure of the data (i.e. the clusters) to the dataset, in order to improve the performance of the classifier.
We then compute different clustering labels to concatenate to the original features. Will they help to improve the performance of the classifier?
# Store the clustering labels in separate values
X_train_clust = []
X_test_clust = []
# Perform clustering for different values (play with the parameters!)
for n in range(2, 20):
# Define a clustering model
# YOUR CODE HERE
# Transform the values to a 2D array
X_train_clust = np.array(X_train_clust).T
X_test_clust = np.array(X_test_clust).T
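If you get stuck, here is a possible body for the loop above (a sketch: fit the clustering on the training set, then store the cluster indices of both sets as new features; the seed is arbitrary):
for n in range(2, 20):
    km = KMeans(n_clusters=n, seed=42)
    train_labels, _ = km.fit_predict(X_train)
    test_labels, _ = km.predict(X_test)
    X_train_clust.append(train_labels)
    X_test_clust.append(test_labels)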
Finally, try to refit a classification model with the new features.
# Concatenate the clustering features
X_train = np.hstack([X_train, X_train_clust])
X_test = np.hstack([X_test, X_test_clust])
# Define a preprocessing scaler
prep = StandardScaler()
# YOUR CODE HERE
# Define and train a classifier (compute the predictions and probabilities)
clf = LogisticRegression()
# YOUR CODE HERE
We can finally compute the different metrics:
# YOUR CODE HERE
print(f'Accuracy = {acc:.2%}')
print(f'Precision = {prec:.2%}')
print(f'Recall = {rec:.2%}')
print(f'F1 = {f1:.2%}')
Let us take some time to appreciate the importance of the result.
And finally, think about the newly introduced hyperparameter: the maximum number of clusters. For this tutorial, we are not interested in optimising it. However, how would we have to modify our pipeline in order to include an optimisation step of the hyperparameters?
Finally, let us visualise the confusion matrix and the ROC curve:
ConfusionMatrixDisplay.from_predictions(y_test,
y_pred,
normalize='all',
values_format='.1%',
display_labels=['benign', 'malignant'])
RocCurveDisplay.from_predictions(y_test,
y_prob,
color='r',
name='feat. eng.',
plot_chance_level=True)