Let's look at the Iris dataset. We have already seen this dataset in the "Clustering" case study. It contains four measurements (petal length, petal width, sepal length, sepal width) for flowers of several different species.
In the Clustering case study, we used K-Means to cluster the flowers into different groups without making use of the labels (unsupervised learning).
In this case study, we will use K-Nearest-Neighbors (KNN) to train a classifier using the labeled flowers. We will then use this information to predict labels for test flowers.
import pandas as pd

df = pd.read_csv("iris.csv")
df.sample(10)
Let's see which classes (species) of flowers are present in the dataset.

# useful methods for categorical data
df.species.unique()
df.species.value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

# the describe method does something different when the data is categorical!
df.species.describe()

count        150
unique         3
top       setosa
freq          50
Name: species, dtype: object
So the dataset contains 150 data points, with 50 belonging to each of the three flower species (classes).
Since it's difficult to visualize four dimensions at once, let's plot the features pairwise and see whether the data points are separable.
df.columns

Index(['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth', 'species'],
      dtype='object')
features = ['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth']
ax = df[df.species=='setosa'].plot.scatter(x='petalLength', y='petalWidth', c='blue')
df[df.species=='virginica'].plot.scatter(x='petalLength', y='petalWidth', c='red', ax=ax)
df[df.species=='versicolor'].plot.scatter(x='petalLength', y='petalWidth', c='purple', ax=ax)
ax.legend(['setosa', 'virginica', 'versicolor']);
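The cell above plots only one pair of features (petalLength vs. petalWidth). Here is a minimal sketch of how to repeat the same plot for every pair, assuming the df and features variables defined above (the color choices are just for illustration):

import itertools
import matplotlib.pyplot as plt

colors = {'setosa': 'blue', 'virginica': 'red', 'versicolor': 'purple'}

# 4 features give 6 distinct pairs
for f1, f2 in itertools.combinations(features, 2):
    ax = None
    for species, color in colors.items():
        ax = df[df.species == species].plot.scatter(x=f1, y=f2, c=color, label=species, ax=ax)
    plt.show()

pandas also ships a one-liner that gives a similar overview: pandas.plotting.scatter_matrix(df[features]).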
First, let's train and test on the whole dataset of 150 points.
# We select the training features and labels
Xtrain = df[features]
Ytrain = df.species
from sklearn.neighbors import KNeighborsClassifier

# Instantiate learning model (k = 5)
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the model
knn.fit(Xtrain, Ytrain)

# Predict labels for the training set
Ypred = knn.predict(Xtrain)

# compute (number of correct predictions) / (total number of predictions)
accuracy = sum(Ytrain == Ypred) / len(Ypred)
print("The model accuracy is: ", accuracy)
The model accuracy is: 0.9666666666666667
# A built-in way of measuring accuracy:
from sklearn.metrics import accuracy_score

accuracy_score(Ytrain, Ypred)
That means that with K = 5, we achieve 96.67% accuracy when testing on the same dataset we trained on.
Let's see what happens with K = 1, i.e. with just one nearest neighbor.
knn1 = KNeighborsClassifier(n_neighbors=1)
knn1.fit(Xtrain, Ytrain)
Ypred1 = knn1.predict(Xtrain)
print("Model accuracy:", accuracy_score(Ytrain, Ypred1))
Model accuracy: 1.0
We see 100% accuracy! WHY??? Because we are predicting on the training set itself: with K = 1, each training point's nearest neighbor is the point itself (at distance 0), so it is always assigned its own label.
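We can check this directly with the classifier's kneighbors method, which returns the distances and indices of the nearest training points. A quick sketch, using the knn1 model fitted above:

# nearest neighbor of the first training point: the point itself, at distance 0
dist, ind = knn1.kneighbors(Xtrain.iloc[[0]], n_neighbors=1)
print(dist, ind)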
This time, we will only use two of the features. What accuracy do we obtain if we use only the sepalLength and sepalWidth features for classification with KNN and K = 5? Here is some starter code:
# YOUR CODE HERE: (define Xtrain and Ytrain here)
Xtrain =
Ytrain =
# the following code generates the KNN classifier and computes the prediction accuracy
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xtrain, Ytrain)
Ypred = knn.predict(Xtrain)
print("The accuracy is: ", accuracy_score(Ytrain, Ypred))
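One possible way to fill in the blanks, sketched under the assumption that the two chosen features are the sepal measurements:

# possible solution sketch: keep only the two sepal columns as features
Xtrain = df[['sepalLength', 'sepalWidth']]
Ytrain = df.species

With only the sepal measurements, the classes overlap more than in the petal plots, so expect a lower accuracy than with all four features.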
Rather than using the same set for both training and testing, we will do the following: split the data in half, train on one half (the even-indexed rows), and test on the other half (the odd-indexed rows).
When we do this with KNN (K = 5), what accuracy do we obtain? Here is some code to get you started:
# all the indices (check to see what "everything" is!)
everything = list(df.index)

# train on the even indices, test on the odd indices
training = everything[0::2]
testing = everything[1::2]
# YOUR CODE HERE: (define Xtrain, Ytrain, Xtest, and Ytest)
Xtrain =
Ytrain =
Xtest =
Ytest =
# generate a KNN classifier with K=5 and compute the prediction accuracy
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xtrain, Ytrain)
Ypred = knn.predict(Xtest)
print("The accuracy is: ", accuracy_score(Ytest, Ypred))
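A possible way to fill in the blanks, assuming the training and testing index lists defined above and all four features:

# possible solution sketch: select rows by index label with .loc
Xtrain = df.loc[training, features]
Ytrain = df.loc[training, 'species']
Xtest = df.loc[testing, features]
Ytest = df.loc[testing, 'species']

In practice, sklearn.model_selection.train_test_split performs a randomized version of this kind of split for you.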