The k-means algorithm searches for a predetermined number of clusters, k, in an unlabeled dataset.

Each cluster center is the arithmetic mean of all points belonging to that cluster.
Each point is closer to its own cluster center than to any other cluster center.
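
The two properties above suggest a simple iteration: assign every point to its nearest center, then move each center to the mean of its points, and repeat until nothing changes. Below is a minimal NumPy sketch of that loop. The helper name `lloyd_kmeans`, its parameters, and the synthetic `demo` data are assumptions made for illustration; this is not scikit-learn's implementation.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Illustrative k-means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the assignment is stable
        centers = new_centers
    return centers, labels

# Synthetic stand-in data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(42)
demo = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                  rng.normal(5, 0.5, size=(50, 2))])
centers, labels = lloyd_kmeans(demo, k=2)
print(centers)
```

On data like this, each returned center should sit at the mean of the points labeled with it, which is the first property stated above.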

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

In [2]:

df = pd.read_csv('my_machine-learning/customers.csv')
X = df.iloc[:, [3, 4]].values  # columns 3 and 4: Annual Income (k$), Spending Score (1-100)
df.head()

Out[2]:

	CustomerID	Genre	Age	Annual Income (k$)	Spending Score (1-100)
0	1	Male	19	15	39
1	2	Male	21	15	81
2	3	Female	20	16	6
3	4	Female	23	16	77
4	5	Female	31	17	40

In [3]:

plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.show()

Elbow Method

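The elbow method is used to choose k: fit k-means for a range of k values and plot the within-cluster sum of squares (WCSS) against k.
The chosen k is the point where the curve bends like an elbow, after which adding more clusters reduces WCSS only slightly.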
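
WCSS is the sum, over all points, of the squared distance from each point to the center of its assigned cluster. As a rough check of that definition, the sketch below computes it by hand on synthetic data and compares it with the `inertia_` attribute of a fitted `KMeans` model. The `demo` data and the `n_init` and `random_state` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in data: three blobs of 40 points each.
demo = np.vstack([rng.normal(loc, 0.4, size=(40, 2)) for loc in (0.0, 4.0, 8.0)])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(demo)

# Look up the center assigned to each point, then sum the squared distances.
assigned_centers = model.cluster_centers_[model.labels_]
wcss_manual = float(((demo - assigned_centers) ** 2).sum())

print(wcss_manual, model.inertia_)
```

The two printed values should agree up to floating-point rounding.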

In [4]:

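WCSS = []
k_values = range(1, 21)
for k in k_values:
    # Fit k-means with k clusters and record its within-cluster sum of squares.
    model = KMeans(n_clusters=k)
    model.fit(X)
    WCSS.append(model.inertia_)  # inertia_ is scikit-learn's name for WCSS

plt.plot(k_values, WCSS)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.grid()
plt.show()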

The elbow appears at k = 5.

In [5]:

kmean = KMeans(n_clusters=5)
y_kmeans = kmean.fit_predict(X)

In [6]:

fig = plt.figure(figsize=(12, 9))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmean.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=150, alpha=0.5)
plt.title('Clusters using KMeans')
plt.xlabel('Annual Income (k$)')  # X[:, 0] is plotted on the x-axis
plt.ylabel('Spending Score (1-100)')  # X[:, 1] is plotted on the y-axis
plt.show()

Limitations

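Strongly sensitive to outliers and noise, because every point pulls its cluster mean toward it
Doesn't work well when clusters are not roughly circular (spherical) in shape
The number of clusters must be specified beforehand, and the result depends on the initial seed used to place the starting centers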
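
The shape limitation can be seen on a small synthetic example. The sketch below runs k-means on scikit-learn's two-moons toy dataset, where the true groups are crescent-shaped, and scores the result against the known labels with the adjusted Rand index (1.0 means perfect agreement). The dataset, the noise level, and the `n_init` and `random_state` values are assumptions chosen for illustration; they are not part of the customer data above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: clearly separable by eye, but not circular.
moons, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

# random_state fixes the initial seed so the run is repeatable.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(moons)

ari = adjusted_rand_score(true_labels, pred)
print(f"Adjusted Rand index on two moons: {ari:.2f}")
```

A score well below 1.0 here would indicate that k-means cut across the crescents instead of following them.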

Advantage

Typically faster than many other clustering algorithms, so it scales well to large datasets