This notebook covers different clustering methods. We'll look at both k-means and hierarchical clustering, and we'll also show how to incorporate the SVD into a clustering workflow.
Clustering is an unsupervised technique that doesn't require a particular outcome variable. The core idea of clustering is to group observations so that members of the same cluster are more similar to each other than they are to members of other clusters.
The data we'll use is from the student survey that everyone filled out at the beginning of the class. The survey asked each student to rank themselves on a scale of 1 to 10 in each of the following DS-related skill sets - Visualization, Computer Science, Math, Statistics, Machine Learning, Business, Communication. In the next section we load the data and plot some basic distributions of the results.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
cwd = os.getcwd()
datadir = '/'.join(cwd.split('/')[0:-1]) + '/data/'
d = pd.read_csv(datadir + 'survey_responses_2018.csv', header = 0, sep=',')
dpro = d[['profile_{}'.format(k + 1) for k in range(7)]]
dpro.columns = ['Viz', 'CS', 'Math', 'Stats', 'ML', 'Bus', 'Com']
fig = plt.figure(figsize = (10, 6))
for i in range(7):
    plt.subplot(3, 4, i + 1)
    plt.hist(dpro[dpro.columns.values[i]])
    plt.title(dpro.columns.values[i])
fig.tight_layout()
plt.show()
We can see that most categories span the full range of values. An important question is how correlated the values are with each other. We'll explore this question in two different ways.
First let's look at the correlation of the different categories.
from matplotlib import pyplot as plt
#Get correlation
c_mat = dpro.corr()
#Mask diagonal & duplicates (for plotting purposes)
plot_data = c_mat.values
mask = np.triu(np.ones_like(c_mat, dtype=bool))
plot_data = np.ma.masked_where(np.asarray(mask), plot_data)
fig, ax = plt.subplots()
ax.pcolormesh(plot_data, cmap=plt.cm.RdBu)
#Set the tick labels and center them
ax.set_xticks(np.arange(c_mat.shape[0])+0.5, minor=False)
ax.set_yticks(np.arange(c_mat.shape[1])+0.5, minor=False)
ax.set_xticklabels(c_mat.index.values, minor=False)
ax.set_yticklabels(c_mat.index.values, minor=False)
plt.title("Category Correlation")
plt.show()
c_mat
|       | Viz      | CS       | Math      | Stats    | ML       | Bus       | Com       |
|-------|----------|----------|-----------|----------|----------|-----------|-----------|
| Viz   | 0.000000 | 0.282906 | 0.015144  | 0.270820 | 0.328538 | 0.437584  | 0.368496  |
| CS    | 0.282906 | 0.000000 | 0.215316  | 0.101657 | 0.481906 | 0.107712  | 0.185016  |
| Math  | 0.015144 | 0.215316 | 0.000000  | 0.670016 | 0.272858 | -0.044132 | -0.037995 |
| Stats | 0.270820 | 0.101657 | 0.670016  | 0.000000 | 0.321812 | 0.201824  | 0.114225  |
| ML    | 0.328538 | 0.481906 | 0.272858  | 0.321812 | 0.000000 | 0.117302  | 0.083175  |
| Bus   | 0.437584 | 0.107712 | -0.044132 | 0.201824 | 0.117302 | 0.000000  | 0.589699  |
| Com   | 0.368496 | 0.185016 | -0.037995 | 0.114225 | 0.083175 | 0.589699  | 0.000000  |
We can see a range of correlations, along with two pairs of strongly correlated features: "business/communication" and "math/stats." It is perhaps not surprising that students who rank themselves highly on one skill in each of these pairs also rank themselves highly on the other. We can also see that these two groups of correlated features are largely uncorrelated with each other. From this we can certainly sense that distinct segments of the student population might exist.
With this range of correlation in the data, we might wonder whether certain latent features exist that can explain the above observations. A latent feature (or variable) is described by Wikipedia as: "...latent variables (or hidden variables, as opposed to observable variables), are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured)." As we see above, students rank themselves very similarly in "math" and "stats." Both "math" and "stats" are observed features. The latent feature might be some sort of intellectual capacity for abstraction and logic. This hidden feature is manifested, and thus observed, in the form of skill in two related academic disciplines.
One way to detect and define the latent features is through a decomposition of the observed features. Our student survey results are stored in a matrix $X$. Ideally, the latent features will all be independent, and each observed feature will be a linear combination of the latent features. One straightforward mechanism to mathematically arrive at the properties just described is the singular value decomposition (SVD). See the notebook titled "Lecture_PhotoSVD_3" to explore the SVD and an example of a potential use case.
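As a reminder, with $n$ students and $p$ skill categories, the (economy) SVD factors the $n \times p$ matrix as

$$X = U \Sigma V^T,$$

where $U$ is $n \times p$ with orthonormal columns (each row is a student expressed in the latent-feature basis), $\Sigma$ is a $p \times p$ diagonal matrix of non-negative singular values ordered from largest to smallest, and $V^T$ is $p \times p$ orthogonal, with each row expressing a latent feature as a combination of the original skill categories.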
For our exploratory analysis of the survey data, we'll use the SVD to define independent features (basically, latent features).
This starts with the basic decomposition. We'll also generate a scree plot to get a sense of how important the various latent features are to the overall distribution of the data.
import sys
import course_utils as bd   #course-specific helper module provided with the class materials

#Decompose the survey matrix; full_matrices=False returns the "economy" SVD
U, sig, Vt = np.linalg.svd(dpro, full_matrices=False)

#Scree plot of the singular values
bd.plotSVD(sig)
plt.show()
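If `course_utils` isn't available, a minimal scree plot can be drawn directly from the singular values. This is just a sketch: `plot_scree` is a hypothetical helper, and `bd.plotSVD` may display the information somewhat differently.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_scree(sig):
    #Share of the total sum-of-squares captured by each singular value
    var_explained = sig**2 / np.sum(sig**2)
    plt.figure(figsize=(8, 4))
    plt.bar(np.arange(1, len(sig) + 1), var_explained)
    plt.xlabel('Singular value index')
    plt.ylabel('Proportion of sum-of-squares')
    plt.title('Scree Plot')

plot_scree(sig)
plt.show()
```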
This is a fairly extreme outcome. Most of the data can be explained by the first singular vector and value. We can guess that this first (and very dominant) latent feature might be related to having self-reported business/communication skills or math/stats skills. Remember from the above correlations that this is almost an either/or scenario.
So while we have good evidence that skills based segments exist, we have no principled way to identify them right now. Fortunately, there are tools to solve this problem.
Clustering can be performed using the sklearn.cluster module. We'll show two examples below: one using sklearn.cluster.KMeans for k-means clustering, and one using scipy.cluster.hierarchy for hierarchical clustering. In the latter case we demo scipy rather than sklearn because scipy has better support for plotting cluster dendrograms.
One decision we have to make is what data to use. We have our original feature matrix $X$, but we have also computed the SVD of $X$, which gives us an orthonormal matrix of user latent features $U$. If we use $U$, the features are normalized and independent. The normalization is important because clustering methods use distance metrics that are sensitive to scale. The independence means each feature carries equal weight in the clustering. This may or may not be a good thing: for example, we know from the scree plot that the first singular vector is by far the most important, and we might want that feature to dominate the clustering process. The good news is that we can weight the columns of $U$ by the singular values, i.e., cluster on $U\Sigma$ instead of $U$.
A subtle corollary of this last point is that using $U$ or $U\Sigma$ gives us a great tool for overcoming the curse of dimensionality. If $X$ happened to be very high dimensional, but most of the sum-of-squares could be explained by the first $k$ singular vectors, then we might be better off clustering on $U_k$ or $U_k\Sigma_k$ (the rank-$k$ approximations), as sketched below.
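A minimal sketch of how to form these alternative feature matrices from the SVD computed above (the choice of `k` here is purely illustrative):

```python
#Weight each latent feature by its singular value: U*Sigma
US = U * sig            #broadcasting multiplies column j of U by sig[j]

#Rank-k versions: keep only the first k singular vectors/values
k = 2                   #illustrative choice; in practice pick k from the scree plot
U_k  = U[:, :k]
US_k = U[:, :k] * sig[:k]
```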
We start with a basic clustering. It is fairly easy to implement, and you will always get a result; the major question is always how you know whether the fit is any good. Ultimately, this is both a qualitative and a quantitative issue. Some criteria might be the inertia (the within-cluster sum of squares), the Silhouette Coefficient, and whether the resulting clusters are interpretable for the application at hand. We explore all of these below.
from sklearn import cluster
#Note - most of these input parms, except the first, help ensure stability of the fit
km = cluster.KMeans(n_clusters = 2, init = 'k-means++', n_init = 5)
km.fit(dpro)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=5, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)
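Once fit, the estimator exposes the quantities we'll rely on below: the cluster assignments, the centroids, and the within-cluster sum of squares.

```python
print(km.labels_[:10])        #cluster assignment for the first 10 students
print(km.cluster_centers_)    #one centroid per cluster, in the original feature space
print(km.inertia_)            #within-cluster sum of squared distances
```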
Let's loop through different values of k to get the inertia as a function of k. We'll also compute another metric that is used to evaluate clusters when the true cluster labels are not known (which is usually the case): the Silhouette Coefficient. More details can be found in the scikit-learn documentation for sklearn.metrics.silhouette_score.
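For a single observation, with $a$ the mean distance to the other points in its own cluster and $b$ the mean distance to the points in the nearest other cluster, the Silhouette Coefficient is

$$s = \frac{b - a}{\max(a, b)},$$

and the score reported by `metrics.silhouette_score` is the mean of $s$ over all observations. Values near 1 indicate well-separated clusters; values near 0 indicate overlapping clusters.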
from sklearn.metrics import pairwise_distances
from sklearn import metrics
inert_k = []
sil_k = []
for k in range(2, 20):
    km = cluster.KMeans(n_clusters = k, init = 'k-means++', n_init = 5)
    km.fit(dpro)
    inert_k.append(km.inertia_)
    sil_k.append(metrics.silhouette_score(dpro, km.labels_, metric = 'euclidean'))

fig = plt.figure(figsize = (15, 9))
ax1 = plt.subplot(211)
plt.bar(range(2, 20), inert_k, 0.35)
plt.title('Inertia by k')
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
ax2 = plt.subplot(212)
plt.bar(range(2, 20), sil_k, 0.35)
plt.title('Silhouette Coefficient by k')
plt.show()
We can see that increasing $k$ tends to continually decrease the inertia, but at the same time our Silhouette Coefficient does not increase. We should always expect inertia to decrease as $k$ increases; the other metric shows that this does not necessarily translate into better clusters.
One method for determining a good value of $k$ is called the elbow method. This amounts to looking at the first plot above and determining at what $k$ the within-cluster sum of squares starts to level off. This is somewhat subjective, but tends to be useful.
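One simple (and admittedly crude) way to make the elbow choice less eyeball-driven is to look at the marginal drop in inertia for each increment of $k$ and stop when the improvement falls below some threshold. A sketch, using the `inert_k` list computed above and a hypothetical 10% threshold:

```python
#Relative improvement in inertia for each additional cluster
rel_drop = [(inert_k[i - 1] - inert_k[i]) / inert_k[i - 1] for i in range(1, len(inert_k))]

#First k whose marginal improvement falls below the (hypothetical) 10% threshold
ks = list(range(2, 20))
elbow_k = next((ks[i] for i, r in enumerate(rel_drop) if r < 0.10), ks[-1])
print('Suggested elbow at k = {}'.format(elbow_k))
```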
Next we'll look at a qualitative evaluation. For a small range of $k$, we first look at the distribution of clusters.
fig = plt.figure(figsize = (12, 9))
for k in range(2, 8):
    km = cluster.KMeans(n_clusters = k, init = 'k-means++', n_init = 5)
    km.fit(dpro)
    ax = fig.add_subplot(3, 2, k - 1)
    plt.hist(km.labels_)
    plt.title('k={}'.format(k))
fig.tight_layout()
plt.show()
We can see that the clusters are actually pretty well balanced for all values of $k$. Each choice of $k$ certainly has a clear maximum, but the differences aren't too extreme.
The next thing we might try is to define the clusters based on their centroids. The centroid essentially describes the average student within each cluster, and we can use it to better understand and then label the clusters. For comparison's sake, we'll also show this for clusters derived using $U$ or $U\Sigma$. When doing this, we have to remember to report the cluster means in the original $X$ space so that we can interpret them. Since $X = U\Sigma V^T$, a centroid computed in the $U$ space is projected back into the $X$ space by right-multiplying it by $\Sigma V^T$ (or by $V^T$ alone if we clustered on $U\Sigma$).
Another thing we do here is subtract the mean of $X$ from each centroid, because we're more interested in how each cluster differs from the average student profile.
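Concretely, if $c_U$ is a $1 \times p$ centroid in the $U$ space, the corresponding profile in the original skill space and its mean-adjusted version are

$$c_X = c_U \, \Sigma V^T, \qquad \tilde{c}_X = c_X - \bar{x},$$

where $\bar{x}$ is the row vector of column means of $X$. This is exactly what the plotting code below computes.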
#Clustering on original X space
km = cluster.KMeans(n_clusters = 4, init = 'k-means++', n_init = 10)
km.fit(dpro)

cols = ['r', 'y', 'b', 'g']
w = 0.15
fig = plt.figure(figsize = (12, 5))
ax = fig.add_subplot(1, 1, 1)
for i, cc in enumerate(km.cluster_centers_):
    #Plot each cluster's mean-adjusted centroid as a grouped bar
    plt.bar(np.arange(len(cc)) + i*w, cc - dpro.mean(), w, color = cols[i], label='Cluster {}'.format(i))
ax.set_xticks(np.arange(len(cc)) + 2*w)
ax.set_xticklabels(dpro.columns.values)
plt.ylabel('Mean Adjusted Score')
plt.title('Avg Scores by Cluster')
plt.legend(loc = 3, ncol = 4, prop={'size':9})
#Now cluster on the SVD space
km_u = cluster.KMeans(n_clusters = 4, init = 'k-means++', n_init = 10)
#km_u.fit(pd.DataFrame(U.dot(np.diag(sig))))   #alternative: cluster on U*Sigma
km_u.fit(pd.DataFrame(U))

cols = ['r', 'y', 'b', 'g']
w = 0.15
fig = plt.figure(figsize = (12, 5))
ax = fig.add_subplot(1, 1, 1)
for i, cc in enumerate(km_u.cluster_centers_):
    #Project the centroid from U space back to X space (multiply by Sigma*V^T), then mean-adjust
    #plt.bar(np.arange(len(cc)) + i*w, cc.dot(Vt), w, color = cols[i], label='Cluster {}'.format(i))
    plt.bar(np.arange(len(cc)) + i*w, cc.dot(np.diag(sig).dot(Vt)) - dpro.mean(), w, color = cols[i], label='Cluster {}'.format(i))
ax.set_xticks(np.arange(len(cc)) + 2*w)
ax.set_xticklabels(dpro.columns.values)
plt.ylabel('Mean Adjusted Score')
plt.title('Avg Scores by Cluster Using SVD')
plt.legend(loc = 3, ncol = 4, prop={'size':9})
plt.show()
In the above we chose $k=4$. Let's assume we did this because we want to assign students to groups of 4 to work together in a study group. Our goal would be to maximize the skill diversity of each group, so we'd cluster into 4 clusters and assign one student per cluster into each group. If we look at the above plots, we can get a sense of how the average person within each group differs from the average student in the class. When we make our profiles, we'll use the top chart, as the clusters here tend to follow a more intuitive line of reasoning (note the subjectivity here)!
Student Profiles
In the above we showed how to compute and evaluate k-means, and we also came up with a use case for the clustering. That use case (i.e., putting students into study groups of size 4) essentially dictated the choice of $k$. In a more general setting, we might not have such an application-specific best $k$. One way we can be more general is to use hierarchical clustering. In this type of clustering, the individual clusters are embedded in a taxonomy. We can use this taxonomy to see if there are any natural values of $k$ that make the most sense. We can also use it to ensure that each final cluster we choose is well balanced in size. Additionally, we can use it to get a sense of any outlier clusters (those with very small counts).
Using scipy isn't as streamlined as using sklearn, but as noted above, scipy provides a good routine for displaying the dendrogram.
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
#This function gets pairwise distances between observations in n-dimensional space.
dists = pdist(dpro)
#This function performs hierarchical/agglomerative clustering on the condensed distance matrix y.
links = linkage(dists)
p = 46   #show only the last p merged clusters (used with truncate_mode='lastp')
#Now we want to plot the dendrogram
fig = plt.figure(figsize = (12, 6))
den = dendrogram(links, truncate_mode = 'lastp', p = p)
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.suptitle('Samples clustering', fontweight='bold', fontsize=14);
plt.show()
Let's try to understand the output of the linkage function. The first few records look like:
links[:5,:]
array([[  16.        ,  121.        ,    1.73205081,    2.        ],
       [  93.        ,  110.        ,    1.73205081,    2.        ],
       [  13.        ,  159.        ,    1.73205081,    3.        ],
       [  35.        ,  157.        ,    1.73205081,    2.        ],
       [  92.        ,  130.        ,    1.73205081,    2.        ]])
The way we read this is as follows: with $n$ original observations, each row of `links` describes one merge step. The first two entries are the indices of the two clusters being merged, where an index less than $n$ refers to an original observation and an index of $n$ or greater refers to a cluster created in an earlier step (the cluster formed at step $i$ is given index $n + i$). The third entry is the distance between the two clusters at the time of the merge, and the fourth is the number of original observations in the newly formed cluster. For example, the first row above says that observations 16 and 121 were merged at a distance of about 1.73 into a cluster of size 2, while the third row merges observation 13 with one of the two-member clusters formed in an earlier step, giving a cluster of size 3.
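If we want to cut this tree into a flat set of clusters directly from the linkage matrix, scipy provides `fcluster`. A short sketch using the `links` computed above:

```python
from scipy.cluster.hierarchy import fcluster

#Cut the tree so that at most 4 flat clusters remain
flat_labels = fcluster(links, t=4, criterion='maxclust')
print(np.bincount(flat_labels))   #number of students per cluster (labels start at 1)
```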
If we want to use a hierarchical clustering technique to get a specific number of clusters, we can instead use sklearn.cluster.AgglomerativeClustering (with Ward linkage, the default) for a concise process that returns exactly what we need. We'll continue with our student study group example and choose $k=4$.
ka = cluster.AgglomerativeClustering(n_clusters = 4)
ka.fit(dpro)
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto', connectivity=None, linkage='ward', memory=None, n_clusters=4, pooling_func=<function mean at 0x10c8ca950>)
Let's do another plot of mean-adjusted centroids to get a sense of what each cluster represents. AgglomerativeClustering does not return the centroids, so we'll have to compute them ourselves.
#Clustering on original X space using hierarchical clustering
cols = ['r', 'y', 'b', 'g']
w = 0.15
fig = plt.figure(figsize = (12, 5))
ax = fig.add_subplot(1, 1, 1)
for i in range(4):
    #Centroid = mean profile of the students assigned to cluster i
    cc = dpro[(ka.labels_ == i)].mean()
    plt.bar(np.arange(len(cc)) + i*w, cc - dpro.mean(), w, color = cols[i], label='Cluster {}'.format(i))
ax.set_xticks(np.arange(len(cc)) + 2*w)
ax.set_xticklabels(dpro.columns.values)
plt.ylabel('Mean Adjusted Score')
plt.title('Avg Scores by Cluster')
plt.legend(loc = 3, ncol = 4, prop={'size':9})
plt.show()
When we plot the 4 clusters from hierarchical clustering, we get similar conceptual groupings as we did with k-means. However, I find this latter plot easier to interpret (remember, this is a bit subjective).
Student Profiles using Hierarchical Clustering
<ul>
<li><u>Cluster 0:</u> This group has most of the math/stats experience.</li>
<li><u>Cluster 1:</u> These are the business and strategy minded students.</li>
<li><u>Cluster 2:</u> These are the math/stats folks in the group.</li>
<li><u>Cluster 3:</u> This group is a little below average in skill across all categories (or at least is the group that underrates their own skill levels).</li>
</ul>
So now we have to make a choice: which clustering method should we use for the student profiles? One last thing to compare is the distribution of students across clusters for each method.
fig = plt.figure(figsize = (12, 4))
plt.subplot(1,3,1)
plt.hist(ka.labels_)
plt.ylim([0,20])
plt.title('Hier: X')
plt.subplot(1,3,2)
plt.hist(km.labels_)
plt.ylim([0,20])
plt.title('K-means: X')
plt.subplot(1,3,3)
plt.hist(km_u.labels_)
plt.ylim([0,20])
plt.title('K-means: U*Sig')
fig.tight_layout()
plt.show()
Comparing the above methods, we see fairly even representation across the clusters. Given this result, it is reasonable to choose the clustering method that gives us the more interpretable result.
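If we wanted a quantitative check on how much these clusterings actually agree with one another (beyond just comparing cluster-size distributions), one option is the adjusted Rand index, which measures the similarity of two label assignments while ignoring the arbitrary cluster numbering. A sketch using the fitted models from above:

```python
from sklearn import metrics

#1.0 means identical partitions; values near 0 mean roughly chance-level agreement
print(metrics.adjusted_rand_score(ka.labels_, km.labels_))     #hierarchical vs. k-means on X
print(metrics.adjusted_rand_score(ka.labels_, km_u.labels_))   #hierarchical vs. k-means on U
```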