Notebook

K-Nearest Neighbors (KNN)¶

by Chiyuan Zhang and Sören Sonnenburg¶

This notebook illustrates the K-Nearest Neighbors (KNN) algorithm on the USPS digit recognition dataset in Shogun. Further, the effect of Cover Trees on speed is illustrated by comparing KNN with and without it. Finally, a comparison with Multiclass Support Vector Machines is shown.

The basics¶

The training of a KNN model basically does nothing but memorizing all the training points and the associated labels, which is very cheap in computation but costly in storage. The prediction is implemented by finding the K nearest neighbors of the query point, and voting. Here K is a hyper-parameter for the algorithm. Smaller values for K give the model low bias but high variance; while larger values for K give low variance but high bias.

In SHOGUN, you can use CKNN to perform KNN learning. To construct a KNN machine, you must choose the hyper-parameter K and a distance function. Usually, we simply use the standard CEuclideanDistance, but in general, any subclass of CDistance could be used. For demonstration, in this tutorial we select a random subset of 1000 samples from the USPS digit recognition dataset, and run 2-fold cross validation of KNN with varying K.

First we load and init data split:

In [ ]:

import numpy as np
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')

from scipy.io import loadmat, savemat
from numpy    import random
from os       import path

mat  = loadmat(os.path.join(SHOGUN_DATA_DIR, 'multiclass/usps.mat'))
Xall = mat['data']
Yall = np.array(mat['label'].squeeze(), dtype=np.double)

# map from 1..10 to 0..9, since shogun
# requires multiclass labels to be
# 0, 1, ..., K-1
Yall = Yall - 1

random.seed(0)

subset = random.permutation(len(Yall))

Xtrain = Xall[:, subset[:5000]]
Ytrain = Yall[subset[:5000]]

Xtest = Xall[:, subset[5000:6000]]
Ytest = Yall[subset[5000:6000]]

Nsplit = 2
all_ks = range(1, 21)

print(Xall.shape)
print(Xtrain.shape)
print(Xtest.shape)

Let us plot the first five examples of the train data (first row) and test data (second row).

In [ ]:

%matplotlib inline
import pylab as P
def plot_example(dat, lab):
    for i in range(5):
        ax=P.subplot(1,5,i+1)
        P.title(int(lab[i]))
        ax.imshow(dat[:,i].reshape((16,16)), interpolation='nearest')
        ax.set_xticks([])
        ax.set_yticks([])
        
        
_=P.figure(figsize=(17,6))
P.gray()
plot_example(Xtrain, Ytrain)

_=P.figure(figsize=(17,6))
P.gray()
plot_example(Xtest, Ytest)

Then we import shogun components and convert the data to shogun objects:

In [ ]:

from shogun import MulticlassLabels, RealFeatures
from shogun import KNN, EuclideanDistance

labels = MulticlassLabels(Ytrain)
feats  = RealFeatures(Xtrain)
k=3
dist = EuclideanDistance()
knn = KNN(k, dist, labels)
labels_test = MulticlassLabels(Ytest)
feats_test  = RealFeatures(Xtest)
knn.train(feats)
pred = knn.apply_multiclass(feats_test)
print("Predictions", pred.get_int_labels()[:5])
print("Ground Truth", Ytest[:5])

from shogun import MulticlassAccuracy
evaluator = MulticlassAccuracy()
accuracy = evaluator.evaluate(pred, labels_test)

print("Accuracy = %2.2f%%" % (100*accuracy))

Let's plot a few missclassified examples - I guess we all agree that these are notably harder to detect.

In [ ]:

idx=np.where(pred != Ytest)[0]
Xbad=Xtest[:,idx]
Ybad=Ytest[idx]
_=P.figure(figsize=(17,6))
P.gray()
plot_example(Xbad, Ybad)

Now the question is - is 97.30% accuracy the best we can do? While one would usually re-train KNN with different values for k here and likely perform Cross-validation, we just use a small trick here that saves us lots of computation time: When we have to determine the $K\geq k$ nearest neighbors we will know the nearest neigbors for all $k=1...K$ and can thus get the predictions for multiple k's in one step:

In [ ]:

knn.set_k(13)
multiple_k=knn.classify_for_multiple_k()
print(multiple_k.shape)

We have the prediction for each of the 13 k's now and can quickly compute the accuracies:

In [ ]:

for k in range(13):
    print("Accuracy for k=%d is %2.2f%%" % (k+1, 100*np.mean(multiple_k[:,k]==Ytest)))

So k=3 seems to have been the optimal choice.

Accellerating KNN¶

Obviously applying KNN is very costly: for each prediction you have to compare the object against all training objects. While the implementation in SHOGUN will use all available CPU cores to parallelize this computation it might still be slow when you have big data sets. In SHOGUN, you can use Cover Trees to speed up the nearest neighbor searching process in KNN. Just call set_use_covertree on the KNN machine to enable or disable this feature. We also show the prediction time comparison with and without Cover Tree in this tutorial. So let's just have a comparison utilizing the data above:

In [ ]:

from shogun import Time, KNN_COVER_TREE, KNN_BRUTE
start = Time.get_curtime()
knn.set_k(3)
knn.set_knn_solver_type(KNN_BRUTE)
pred = knn.apply_multiclass(feats_test)
print("Standard KNN took %2.1fs" % (Time.get_curtime() - start))


start = Time.get_curtime()
knn.set_k(3)
knn.set_knn_solver_type(KNN_COVER_TREE)
pred = knn.apply_multiclass(feats_test)
print("Covertree KNN took %2.1fs" % (Time.get_curtime() - start))

So we can significantly speed it up. Let's do a more systematic comparison. For that a helper function is defined to run the evaluation for KNN:

In [ ]:

def evaluate(labels, feats, use_cover_tree=False):
    from shogun import MulticlassAccuracy, CrossValidationSplitting
    import time
    split = CrossValidationSplitting(labels, Nsplit)
    split.build_subsets()
    
    accuracy = np.zeros((Nsplit, len(all_ks)))
    acc_train = np.zeros(accuracy.shape)
    time_test = np.zeros(accuracy.shape)
    for i in range(Nsplit):
        idx_train = split.generate_subset_inverse(i)
        idx_test  = split.generate_subset_indices(i)

        for j, k in enumerate(all_ks):
            #print "Round %d for k=%d..." % (i, k)

            feats.add_subset(idx_train)
            labels.add_subset(idx_train)

            dist = EuclideanDistance(feats, feats)
            knn = KNN(k, dist, labels)
            knn.set_store_model_features(True)
            if use_cover_tree:
                knn.set_knn_solver_type(KNN_COVER_TREE)
            else:
                knn.set_knn_solver_type(KNN_BRUTE)
            knn.train()

            evaluator = MulticlassAccuracy()
            pred = knn.apply_multiclass()
            acc_train[i, j] = evaluator.evaluate(pred, labels)

            feats.remove_subset()
            labels.remove_subset()
            feats.add_subset(idx_test)
            labels.add_subset(idx_test)

            t_start = time.clock()
            pred = knn.apply_multiclass(feats)
            time_test[i, j] = (time.clock() - t_start) / labels.get_num_labels()

            accuracy[i, j] = evaluator.evaluate(pred, labels)

            feats.remove_subset()
            labels.remove_subset()
    return {'eout': accuracy, 'ein': acc_train, 'time': time_test}

Evaluate KNN with and without Cover Tree. This takes a few seconds:

In [ ]:

labels = MulticlassLabels(Ytest)
feats  = RealFeatures(Xtest)
print("Evaluating KNN...")
wo_ct = evaluate(labels, feats, use_cover_tree=False)
wi_ct = evaluate(labels, feats, use_cover_tree=True)
print("Done!")

Generate plots with the data collected in the evaluation:

In [ ]:

import matplotlib

fig = P.figure(figsize=(8,5))
P.plot(all_ks, wo_ct['eout'].mean(axis=0), 'r-*')
P.plot(all_ks, wo_ct['ein'].mean(axis=0), 'r--*')
P.legend(["Test Accuracy", "Training Accuracy"])
P.xlabel('K')
P.ylabel('Accuracy')
P.title('KNN Accuracy')
P.tight_layout()

fig = P.figure(figsize=(8,5))
P.plot(all_ks, wo_ct['time'].mean(axis=0), 'r-*')
P.plot(all_ks, wi_ct['time'].mean(axis=0), 'b-d')
P.xlabel("K")
P.ylabel("time")
P.title('KNN time')
P.legend(["Plain KNN", "CoverTree KNN"], loc='center right')
P.tight_layout()

Although simple and elegant, KNN is generally very resource costly. Because all the training samples are to be memorized literally, the memory cost of KNN learning becomes prohibitive when the dataset is huge. Even when the memory is big enough to hold all the data, the prediction will be slow, since the distances between the query point and all the training points need to be computed and ranked. The situation becomes worse if in addition the data samples are all very high-dimensional. Leaving aside computation time issues, k-NN is a very versatile and competitive algorithm. It can be applied to any kind of objects (not just numerical data) - as long as one can design a suitable distance function. In pratice k-NN used with bagging can create improved and more robust results.

Comparison to Multiclass Support Vector Machines¶

In contrast to KNN - multiclass Support Vector Machines (SVMs) attempt to model the decision function separating each class from one another. They compare examples utilizing similarity measures (so called Kernels) instead of distances like KNN does. When applied, they are in Big-O notation computationally as expensive as KNN but involve another (costly) training step. They do not scale very well to cases with a huge number of classes but usually lead to favorable results when applied to small number of classes cases. So for reference let us compare how a standard multiclass SVM performs wrt. KNN on the mnist data set from above.

Let us first train a multiclass svm using a Gaussian kernel (kind of the SVM equivalent to the euclidean distance).

In [ ]:

from shogun import GaussianKernel, GMNPSVM

width=80
C=1

gk=GaussianKernel()
gk.set_width(width)

svm=GMNPSVM(C, gk, labels)
_=svm.train(feats)

Let's apply the SVM to the same test data set to compare results:

In [ ]:

out=svm.apply(feats_test)
evaluator = MulticlassAccuracy()
accuracy = evaluator.evaluate(out, labels_test)

print("Accuracy = %2.2f%%" % (100*accuracy))

Since the SVM performs way better on this task - let's apply it to all data we did not use in training.

In [ ]:

Xrem=Xall[:,subset[6000:]]
Yrem=Yall[subset[6000:]]

feats_rem=RealFeatures(Xrem)
labels_rem=MulticlassLabels(Yrem)
out=svm.apply(feats_rem)

evaluator = MulticlassAccuracy()
accuracy = evaluator.evaluate(out, labels_rem)

print("Accuracy = %2.2f%%" % (100*accuracy))

idx=np.where(out.get_labels() != Yrem)[0]
Xbad=Xrem[:,idx]
Ybad=Yrem[idx]
_=P.figure(figsize=(17,6))
P.gray()
plot_example(Xbad, Ybad)

The misclassified examples are indeed much harder to label even for human beings.