Clustering is the process of examining a collection of "points" and grouping the points into "clusters" according to some distance measure. The goal is that the points in the same cluster have a small distance from one another, while points in different clusters are at a larger distance from one another.
A dataset suitable for clustering is a collection of points, which are just objects belonging to some space. In its most general sense, a space is just a universal set of points, from which the points in the dataset are drawn. All spaces for which we can perform a clustering have a distance measure, giving a distance between any two points in the space.
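The most common choice is Euclidean space, where the distance between two points is the square root of the sum of the squares of the coordinate differences. A minimal sketch of such a distance measure (the function name euclidean_distance and the sample points are ours, for illustration only):

import math

def euclidean_distance(a, b):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

euclidean_distance([0, 0], [3, 4])  # 5.0 (a 3-4-5 right triangle)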
# Import pandas
import pandas as pd
pd.set_option('display.max_columns', None)
# Import seaborn
import seaborn as sns
%matplotlib inline
# Apply the default theme
sns.set_theme()
K-Means is the best-known algorithm in a family of clustering algorithms known as point-assignment algorithms. K-Means requires a Euclidean space and assumes the number of clusters, k, is known in advance. It is, however, possible to deduce k by trial and error.
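At its core, each K-Means iteration alternates two steps: assign every point to its nearest centroid, then move each centroid to the mean of the points assigned to it. A minimal numpy sketch of one such iteration, assuming Euclidean distance (the function and variable names are ours, and the sketch assumes no cluster ends up empty):

import numpy as np

def kmeans_iteration(X, centroids):
    # distances[i, j] = Euclidean distance from point i to centroid j
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # step 1: label each point with the index of its nearest centroid
    labels = distances.argmin(axis=1)
    # step 2: move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, new_centroids

Repeating this until the centroids stop moving is, in essence, the whole algorithm.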
To explain the K-Means algorithm, we will start by creating a low-dimensional "demo" dataset with a clear number of clusters (k = 2).
# create demo data
points = [
    [2.5, 1.1],
    [1.5, 2],
    [2, 3],
    [4, 6],
    [4, 4],
    [4.5, 5],
    [3.5, 5.1]
]
points
[[2.5, 1.1], [1.5, 2], [2, 3], [4, 6], [4, 4], [4.5, 5], [3.5, 5.1]]
# identify columns and index
columns = ["x", "y"]
index = [f"P-{i}" for i in range(len(points))]
# create data frame
demodf = pd.DataFrame(
    points,
    columns=columns,
    index=index
)
# add blank label column
demodf['cluster'] = 'N/A'
# display demo data
demodf
| | x | y | cluster |
|---|---|---|---|
| P-0 | 2.5 | 1.1 | N/A |
| P-1 | 1.5 | 2.0 | N/A |
| P-2 | 2.0 | 3.0 | N/A |
| P-3 | 4.0 | 6.0 | N/A |
| P-4 | 4.0 | 4.0 | N/A |
| P-5 | 4.5 | 5.0 | N/A |
| P-6 | 3.5 | 5.1 | N/A |
def kmeans_plot(df, cf=None):
    # work on copies so the caller's DataFrames are not modified
    df_tmp = df.copy()
    df_tmp['centroid'] = False
    if cf is not None:
        cf_tmp = cf.copy()
        cf_tmp['centroid'] = True
        vf = pd.concat([df_tmp, cf_tmp])
    else:
        vf = df_tmp
    # visualize demo + centroid data
    scatter = sns.relplot(
        data=vf,
        x="x",
        y="y",
        hue='cluster',
        style='centroid',
        kind="scatter",
        height=7,
        aspect=1.5,
        s=300
    ).set(
        title="Demo Data"
    )
    # label each point with its index
    for i, point in vf.iterrows():
        scatter.ax.text(point['x'] + .1, point['y'], i)
kmeans_plot(demodf)
k = 2
initial_centroids = [
    [2.5, 1.1, 'C-1'],  # P-0
    [3.5, 5.1, 'C-2'],  # P-6
]
centroids = pd.DataFrame(
    initial_centroids,
    columns=['x', 'y', 'cluster'],
    index=['C-1', 'C-2']
)
centroids
| | x | y | cluster |
|---|---|---|---|
| C-1 | 2.5 | 1.1 | C-1 |
| C-2 | 3.5 | 5.1 | C-2 |
kmeans_plot(demodf, centroids)
Let's use the $L_1$-norm, or *Manhattan distance*: for two points $a$ and $c$ in two dimensions, $d(a, c) = |a_x - c_x| + |a_y - c_y|$.
# 2D array of points
points = demodf[['x', 'y']].values
# point of C-1
c_1 = centroids.loc['C-1'].values[:2]
# point of C-2
c_2 = centroids.loc['C-2'].values[:2]
def manhattan_distance(a, c):
    # Manhattan (L1) distance from each point in `a` to the single point `c`
    dist = []
    for row in a:
        x_dist = abs(row[0] - c[0])
        y_dist = abs(row[1] - c[1])
        dist.append(x_dist + y_dist)
    return dist
demodf['C-1'] = manhattan_distance(points, c_1)
demodf['C-2'] = manhattan_distance(points, c_2)
demodf
| | x | y | cluster | C-1 | C-2 |
|---|---|---|---|---|---|
| P-0 | 2.5 | 1.1 | N/A | 0.0 | 5.0 |
| P-1 | 1.5 | 2.0 | N/A | 1.9 | 5.1 |
| P-2 | 2.0 | 3.0 | N/A | 2.4 | 3.6 |
| P-3 | 4.0 | 6.0 | N/A | 6.4 | 1.4 |
| P-4 | 4.0 | 4.0 | N/A | 4.4 | 1.6 |
| P-5 | 4.5 | 5.0 | N/A | 5.9 | 1.1 |
| P-6 | 3.5 | 5.1 | N/A | 5.0 | 0.0 |
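As an aside, the same distances can be computed without an explicit Python loop using numpy broadcasting. A sketch equivalent to manhattan_distance above (the float cast guards against the mixed-type centroid row):

import numpy as np

# |x - cx| + |y - cy| for every point at once
np.abs(points - np.asarray(c_1, dtype=float)).sum(axis=1)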
def label_cluster(row):
    # assign the point to whichever centroid is closer
    if row['C-1'] < row['C-2']:
        return 'C-1'
    return 'C-2'
demodf['cluster'] = demodf.apply(label_cluster, axis=1)
demodf
| | x | y | cluster | C-1 | C-2 |
|---|---|---|---|---|---|
| P-0 | 2.5 | 1.1 | C-1 | 0.0 | 5.0 |
| P-1 | 1.5 | 2.0 | C-1 | 1.9 | 5.1 |
| P-2 | 2.0 | 3.0 | C-1 | 2.4 | 3.6 |
| P-3 | 4.0 | 6.0 | C-2 | 6.4 | 1.4 |
| P-4 | 4.0 | 4.0 | C-2 | 4.4 | 1.6 |
| P-5 | 4.5 | 5.0 | C-2 | 5.9 | 1.1 |
| P-6 | 3.5 | 5.1 | C-2 | 5.0 | 0.0 |
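As another aside, pandas can produce the same labels without a row-wise function: DataFrame.idxmin returns, for each row, the name of the column holding the smallest value, which here is exactly the nearest centroid. A one-line equivalent of the apply above:

# the column with the smallest distance is the nearest centroid
demodf[['C-1', 'C-2']].idxmin(axis=1)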
centroids
| | x | y | cluster |
|---|---|---|---|
| C-1 | 2.5 | 1.1 | C-1 |
| C-2 | 3.5 | 5.1 | C-2 |
demodf[demodf['cluster'] == 'C-1'][['x', 'y']].mean()
x    2.000000
y    2.033333
dtype: float64
new_center_1 = demodf[demodf['cluster'] == 'C-1'][['x', 'y']].mean()
new_center_1
x    2.000000
y    2.033333
dtype: float64
centroids.loc['C-1', 'x'] = new_center_1['x']
centroids.loc['C-1', 'y'] = new_center_1['y']
new_center_2 = demodf[demodf['cluster'] == 'C-2'][['x', 'y']].mean()
centroids.loc['C-2', 'x'] = new_center_2['x']
centroids.loc['C-2', 'y'] = new_center_2['y']
centroids
| | x | y | cluster |
|---|---|---|---|
| C-1 | 2.0 | 2.033333 | C-1 |
| C-2 | 4.0 | 5.025000 | C-2 |
kmeans_plot(demodf, centroids)
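In a full run of K-Means, these two steps, re-labeling the points and re-computing the centroids, simply repeat until the centroids stop moving (or an iteration cap is reached). A sketch that continues the demo above (the iteration cap and tolerance are our choices):

for _ in range(100):  # iteration cap
    old = centroids[['x', 'y']].copy()
    # step 1: recompute distances and labels with the current centroids
    demodf['C-1'] = manhattan_distance(points, centroids.loc['C-1'].values[:2])
    demodf['C-2'] = manhattan_distance(points, centroids.loc['C-2'].values[:2])
    demodf['cluster'] = demodf.apply(label_cluster, axis=1)
    # step 2: move each centroid to the mean of its cluster
    for c in ['C-1', 'C-2']:
        centroids.loc[c, ['x', 'y']] = demodf[demodf['cluster'] == c][['x', 'y']].mean().values
    # stop once the centroids barely move
    if (centroids[['x', 'y']] - old).abs().values.max() < 1e-6:
        break

scikit-learn's KMeans, used below, runs this loop (with Euclidean distance) for us.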
from sklearn.datasets import make_blobs
# create dataset https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html
blobs, _ = make_blobs(
    n_samples=150,
    n_features=2,
    centers=3,
    cluster_std=0.5,
    shuffle=True,
    random_state=0
)
type(blobs)
numpy.ndarray
# print the first five points
blobs[:5]
array([[2.60509732, 1.22529553],
       [0.5323772 , 3.31338909],
       [0.802314  , 4.38196181],
       [0.5285368 , 4.49723858],
       [2.61858548, 0.35769791]])
This step is not strictly necessary (in fact, we'll use the 2D ndarray itself when fitting the model), but converting to a DataFrame lets us apply our pandas and seaborn knowledge.
blobs_df = pd.DataFrame(blobs, columns=['x', 'y'])
blobs_df.head()
x | y | |
---|---|---|
0 | 2.605097 | 1.225296 |
1 | 0.532377 | 3.313389 |
2 | 0.802314 | 4.381962 |
3 | 0.528537 | 4.497239 |
4 | 2.618585 | 0.357698 |
scatter = sns.relplot(
    data=blobs_df,
    x="x",
    y="y",
    kind="scatter",
    height=7,
    aspect=1.5,
    s=300
).set(
    title="Blob Data"
)
# https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
from sklearn.cluster import KMeans
km = KMeans(
    n_clusters=3,
)
km
KMeans(n_clusters=3)
result = km.fit(blobs)
result.cluster_centers_
array([[ 2.06521743,  0.96137409],
       [-1.5947298 ,  2.92236966],
       [ 0.9329651 ,  4.35420712]])
result.labels_
array([0, 2, 2, 2, 0, 2, 2, 0, 1, 2, 0, 1, 1, 2, 2, 1, 1, 0, 1, 0, 2, 0, 2, 2, 1, 0, 0, 2, 1, 0, 1, 1, 1, 1, 2, 0, 0, 0, 2, 2, 1, 1, 2, 0, 0, 0, 1, 2, 1, 2, 0, 2, 2, 0, 0, 1, 2, 0, 1, 2, 1, 1, 1, 1, 2, 1, 2, 0, 2, 2, 2, 0, 0, 2, 0, 2, 2, 1, 1, 2, 0, 0, 2, 2, 0, 0, 0, 1, 1, 0, 0, 2, 0, 2, 0, 2, 1, 1, 0, 0, 0, 0, 1, 0, 0, 2, 1, 2, 2, 2, 1, 2, 0, 1, 2, 1, 2, 2, 1, 1, 2, 0, 2, 2, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 2, 1, 0, 1, 2, 2, 0, 0, 1, 1, 1, 1, 0, 0], dtype=int32)
# assign the newly minted cluster labels to the DataFrame we created earlier
blobs_df["cluster"] = pd.Series(result.labels_).astype("str")
blobs_df.head()
| | x | y | cluster |
|---|---|---|---|
| 0 | 2.605097 | 1.225296 | 0 |
| 1 | 0.532377 | 3.313389 | 2 |
| 2 | 0.802314 | 4.381962 | 2 |
| 3 | 0.528537 | 4.497239 | 2 |
| 4 | 2.618585 | 0.357698 | 0 |
scatter = sns.relplot(
    data=blobs_df,
    x="x",
    y="y",
    hue="cluster",  # color by cluster
    kind="scatter",
    height=7,
    aspect=1.5,
    s=300
).set(
    title="Blob Data"
)
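Because the fitted model keeps its cluster_centers_, it can also label points it never saw during fitting: predict assigns each new point to the nearest centroid. A quick sketch (the sample coordinates here are made up):

import numpy as np

# hypothetical new observations
new_points = np.array([[2.0, 1.0], [-1.5, 3.0]])
km.predict(new_points)  # index of the nearest cluster center for each point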
We may not know the correct value of k to use in a k-means clustering. However, if we can measure the quality of the clustering for various values of k, we can usually guess what the right value of k is. If we take a measure of appropriateness for clusters, such as average radius or diameter, we expect that value will grow slowly as long as the number of clusters remains at or above the true number of clusters. But as soon as we try to form fewer clusters than there really are, the measure will rise precipitously.
The K-Means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares (formalized below). We want to choose a value of k that also minimizes inertia, but not at the expense of choosing an unnaturally large number of clusters. Inertia can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks:

- Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters, or manifolds with irregular shapes.
- Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called "curse of dimensionality").
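Formally, inertia is the within-cluster sum of squared distances from each point $x_i$ to its nearest cluster center $\mu_j$, the quantity scikit-learn exposes as inertia_:

$$\sum_{i=0}^{n} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)$$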
result.inertia_
72.47601670996697
dist = []
dfs = []
for i in range(1, 8):
    km = KMeans(n_clusters=i, init="random",
                n_init=10, max_iter=300,
                tol=1e-04, random_state=0)
    result = km.fit(blobs)
    dist.append(result.inertia_)
    blobs_df["cluster"] = pd.Series(result.labels_).astype("str")
    dfs.append(blobs_df.copy())
idf = pd.DataFrame({
    "k": range(1, 8),
    "distortion": dist
})
scatter = sns.relplot(
    data=idf,
    x="k",
    y="distortion",
    kind="line",
    height=7,
).set(
    title="Distortion"
)
%%javascript
// raise the notebook's auto-scroll threshold so all the plots below render without collapsing
IPython.OutputArea.auto_scroll_threshold = 9999;
for idx, df in enumerate(dfs, 1):
    scatter = sns.relplot(
        data=df,
        x="x",
        y="y",
        hue="cluster",  # color by cluster
        kind="scatter",
        height=7,
        aspect=1.5,
        s=300
    ).set(
        title=f"Blob Data ({idx})"
    )