In this post, we will use the K-means algorithm to perform image classification. Clustering isn't limited to consumer or demographic data; it can be applied to imagery as well. Using scikit-learn and the MNIST dataset, we will investigate K-means clustering for computer vision.
import sys
import sklearn
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
print('Python: {}'.format(sys.version))
print('Scikit-learn: {}'.format(sklearn.__version__))
print('NumPy: {}'.format(np.__version__))
Python: 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)] Scikit-learn: 0.22.1 NumPy: 1.18.1
import tensorflow as tf
from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Print shape of dataset
print("Training: {}".format(X_train.shape))
print("Test: {}".format(X_test.shape))
Training: (60000, 28, 28) Test: (10000, 28, 28)
As you can see, the original dataset contains 28x28 pixel grayscale images. Let's print a few of them to see what they look like.
fig, axs = plt.subplots(3, 3, figsize = (12, 12))
plt.gray()
# loop through subplots and add mnist images
for i, ax in enumerate(axs.flat):
    ax.imshow(X_train[i])
    ax.axis('off')
    ax.set_title('Number {}'.format(y_train[i]))
# display the figure
plt.show()
Each image is stored as a 2-dimensional NumPy array. However, the K-means implementation provided by scikit-learn expects each sample to be a 1-dimensional feature vector; as a result, we will need to reshape each image (in other words, we need to flatten the data).
Clustering algorithms almost always work on one flat feature array per sample. For example, if you were clustering a set of X, Y coordinates, each point would be passed to the clustering algorithm as a 1-dimensional array with a length of two (for example, [2, 4] or [-1, 4]). If you were using 3-dimensional data, the array would have a length of 3 (for example, [2, 4, 1] or [-1, 4, 5]).
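As a small aside (a toy sketch, not part of the MNIST workflow), clustering a handful of such 2-D points would look like this:
from sklearn.cluster import KMeans
# Each row is one sample: a 1-dimensional feature array of length two
points = np.array([[2, 4], [-1, 4], [2, 5], [-1, 3]])
toy_kmeans = KMeans(n_clusters=2, n_init=10).fit(points)
print(toy_kmeans.labels_)  # cluster ID assigned to each point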
MNIST contains images that are 28 by 28 pixels; as a result, they will have a length of 784 once we reshape them into a 1-dimensional array.
# Convert each image to 1d array (28x28 -> 784x1)
X_train = X_train.reshape(len(X_train), -1)
print(X_train.shape)
(60000, 784)
Another step that helps training is normalization. To do this, we convert each pixel value into the 0 to 1 range. The maximum grayscale pixel value is 255, so we can normalize by dividing by 255. Of course, the overall shape of the data stays the same.
# Normalize the data to 0 - 1
X_train = X_train.astype(np.float32) / 255.
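As a quick sanity check (not part of the original notebook), the pixel values should now fall in the 0 to 1 range:
# After normalization, every pixel value should lie between 0 and 1
print(X_train.min(), X_train.max())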
Since the MNIST dataset is quite large, we will use the mini-batch implementation of K-means clustering (MiniBatchKMeans) provided by scikit-learn. This dramatically reduces the amount of time it takes to fit the algorithm to the data.
Here, we simply set the n_clusters argument to n_digits (the number of unique labels, in our case 10) and keep the default values for the other MiniBatchKMeans parameters.
As you know, K-means clustering is an unsupervised learning method, which means it doesn't require any labels for training.
from sklearn.cluster import MiniBatchKMeans
n_digits = len(np.unique(y_train))
print(n_digits)
10
kmeans = MiniBatchKMeans(n_clusters=n_digits)
kmeans.fit(X_train)
MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++', init_size=None, max_iter=100, max_no_improvement=10, n_clusters=10, n_init=3, random_state=None, reassignment_ratio=0.01, tol=0.0, verbose=0)
We can look at the cluster label that the K-means model assigns to each input.
kmeans.labels_
array([7, 8, 3, ..., 7, 9, 7])
But these are not the real labels of the images: the output of kmeans.labels_ is just a cluster ID. For example, an image assigned to cluster 6 shares similar features with the other images in cluster 6, and nothing more; the ID itself carries no further meaning.
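To make this concrete, here is a tiny sketch with made-up labels (not taken from the dataset): if the images assigned to one cluster have the true labels 3, 3, 8, 3, 3, the most common label is 3, so we would call that whole cluster a "3" cluster. np.bincount and np.argmax do exactly this counting and selection:
# Hypothetical true labels of the images that landed in one cluster
cluster_true_labels = np.array([3, 3, 8, 3, 3])
counts = np.bincount(cluster_true_labels)  # counts[k] = how many times label k appears
print(np.argmax(counts))  # 3, the most frequent label in this cluster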
To match each cluster with a real label, we can do the following:
1. Gather the actual labels of the points assigned to each cluster.
2. Count how often each label occurs in the cluster (np.bincount).
3. Pick the most frequent label (np.argmax), and set it as the cluster's label.
def infer_cluster_labels(kmeans, actual_labels):
"""
Associates most probable label with each cluster in KMeans model
returns: dictionary of clusters assigned to each label
"""
inferred_labels = {}
# Loop through the clusters
for i in range(kmeans.n_clusters):
# find index of points in cluster
labels = []
index = np.where(kmeans.labels_ == i)
# append actual labels for each point in cluster
labels.append(actual_labels[index])
# determine most common label
if len(labels[0]) == 1:
counts = np.bincount(labels[0])
else:
counts = np.bincount(np.squeeze(labels))
# assign the cluster to a value in the inferred_labels dictionary
if np.argmax(counts) in inferred_labels:
# append the new number to the existing array at this slot
inferred_labels[np.argmax(counts)].append(i)
else:
# create a new array in this slot
inferred_labels[np.argmax(counts)] = [i]
return inferred_labels
def infer_data_labels(X_labels, cluster_labels):
    """
    Determines label for each array, depending on the cluster it has been assigned to.
    returns: predicted labels for each array
    """
    # empty array of len(X)
    predicted_labels = np.zeros(len(X_labels)).astype(np.uint8)

    for i, cluster in enumerate(X_labels):
        for key, value in cluster_labels.items():
            if cluster in value:
                predicted_labels[i] = key

    return predicted_labels
cluster_labels = infer_cluster_labels(kmeans, y_train)
X_clusters = kmeans.predict(X_train)
predicted_labels = infer_data_labels(X_clusters, cluster_labels)
print(predicted_labels[:20])
print(y_train[:20])
[8 0 4 1 7 2 1 8 1 7 3 1 3 6 1 7 2 8 6 7] [5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9]
As a result, some predicted labels are mismatched, but in most cases the K-means model clusters each group correctly.
With the functions defined above, we can now determine the accuracy of our algorithms. Since we are using this clustering algorithm for classification, accuracy is ultimately the most important metric; however, there are other metrics out there that can be applied directly to the clusters themselves, regardless of the associated labels. Two of these metrics that we will use are inertia and homogeneity. (See the detailed description of homogeneity_score)
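To build some intuition for homogeneity, here is a tiny sketch with made-up labels (not from the MNIST data): a clustering is perfectly homogeneous when every cluster contains only one true class, regardless of what the cluster IDs are.
from sklearn.metrics import homogeneity_score
# Every cluster contains a single true class -> score of 1.0, even though the IDs are swapped
print(homogeneity_score([0, 0, 1, 1], [1, 1, 0, 0]))
# One cluster mixes two true classes -> score drops below 1.0
print(homogeneity_score([0, 0, 1, 1], [0, 0, 0, 1]))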
Furthermore, earlier we made the assumption that K = 10 was the appropriate number of clusters; however, this might not be the case. Let's fit the K-means clustering algorithm with several different values of K, then evaluate the performance using our metrics.
from sklearn.metrics import homogeneity_score
def calc_metrics(estimator, data, labels):
    print('Number of Clusters: {}'.format(estimator.n_clusters))

    # Inertia
    inertia = estimator.inertia_
    print("Inertia: {}".format(inertia))

    # Homogeneity Score
    homogeneity = homogeneity_score(labels, estimator.labels_)
    print("Homogeneity score: {}".format(homogeneity))

    return inertia, homogeneity
from sklearn.metrics import accuracy_score
clusters = [10, 16, 36, 64, 144, 256]
iner_list = []
homo_list = []
acc_list = []
for n_clusters in clusters:
    estimator = MiniBatchKMeans(n_clusters=n_clusters)
    estimator.fit(X_train)

    inertia, homo = calc_metrics(estimator, X_train, y_train)
    iner_list.append(inertia)
    homo_list.append(homo)

    # Determine predicted labels
    cluster_labels = infer_cluster_labels(estimator, y_train)
    prediction = infer_data_labels(estimator.labels_, cluster_labels)
    acc = accuracy_score(y_train, prediction)
    acc_list.append(acc)
    print('Accuracy: {}\n'.format(acc))
Number of Clusters: 10  Inertia: 2383375.0  Homogeneity score: 0.46576292303121536  Accuracy: 0.56465
Number of Clusters: 16  Inertia: 2208197.5  Homogeneity score: 0.5531322770474518  Accuracy: 0.65095
Number of Clusters: 36  Inertia: 1961340.875  Homogeneity score: 0.6783212163972349  Accuracy: 0.767
Number of Clusters: 64  Inertia: 1822361.625  Homogeneity score: 0.727585914263205  Accuracy: 0.7895166666666666
Number of Clusters: 144  Inertia: 1635514.25  Homogeneity score: 0.8048996371912126  Accuracy: 0.8673833333333333
Number of Clusters: 256  Inertia: 1519708.25  Homogeneity score: 0.8428113183818001  Accuracy: 0.9000333333333334
fig, ax = plt.subplots(1, 2, figsize=(16, 10))
ax[0].plot(clusters, iner_list, label='inertia', marker='o')
ax[1].plot(clusters, homo_list, label='homogeneity', marker='o')
ax[1].plot(clusters, acc_list, label='accuracy', marker='^')
ax[0].legend(loc='best')
ax[1].legend(loc='best')
ax[0].grid('on')
ax[1].grid('on')
ax[0].set_title('Inertia for each number of clusters')
ax[1].set_title('Homogeneity and accuracy for each number of clusters')
plt.show()
As a result, we find that as the value of K increases, the accuracy and homogeneity increase as well. We can also check the performance on the test dataset.
X_test = X_test.reshape(len(X_test), -1)
X_test = X_test.astype(np.float32) / 255.
kmeans = MiniBatchKMeans(n_clusters=256)
kmeans.fit(X_test)
cluster_labels = infer_cluster_labels(kmeans, y_test)
test_clusters = kmeans.predict(X_test)
prediction = infer_data_labels(test_clusters, cluster_labels)
print('Accuracy: {}'.format(accuracy_score(y_test, prediction)))
Accuracy: 0.8877
There we have a MiniBatchKMeans clustering model with almost 90% accuracy. One clear way to check the model's performance is to visualize the cluster centroids as actual images. For convenience, we decrease n_clusters to 36.
# Initialize and fit KMeans algorithm
kmeans = MiniBatchKMeans(n_clusters = 36)
kmeans.fit(X_test)
# record centroid values
centroids = kmeans.cluster_centers_
# reshape centroids into images
images = centroids.reshape(36, 28, 28)
images *= 255
images = images.astype(np.uint8)
# determine cluster labels
cluster_labels = infer_cluster_labels(kmeans, y_test)
prediction = infer_data_labels(kmeans.predict(X_test), cluster_labels)
# create figure with subplots using matplotlib.pyplot
fig, axs = plt.subplots(6, 6, figsize = (20, 20))
plt.gray()
# loop through subplots and add centroid images
for i, ax in enumerate(axs.flat):
    # determine inferred label using cluster_labels dictionary
    for key, value in cluster_labels.items():
        if i in value:
            ax.set_title('Inferred Label: {}'.format(key), color='blue')

    # add image to subplot
    ax.matshow(images[i])
    ax.axis('off')
# display the figure
plt.show()
Through this post, we built a K-means clustering model for MNIST digit classification. To do this, we applied preprocessing steps such as reshaping and normalization. The model's performance varies with n_clusters, and with enough clusters we can build an MNIST classifier with almost 90% accuracy.
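Finally, as a minimal usage sketch (not part of the original notebook), this is how the fitted model and the cluster-to-label mapping could be combined to classify a single image, assuming it has already been flattened and normalized like the rest of the data:
# Take one flattened, normalized test image and wrap it as a batch of size 1
sample = X_test[0].reshape(1, -1)
# Assign it to a cluster, then map the cluster ID to an inferred digit
cluster = kmeans.predict(sample)
digit = infer_data_labels(cluster, cluster_labels)[0]
print('Predicted digit: {}'.format(digit))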