Let's look at the Iris dataset, which we have already seen in the "Clustering" case study. It records four measurements (petal length, petal width, sepal length, sepal width) for flowers of several different species.
In the Clustering case study, we used K-Means to cluster the flowers into different groups without making use of the labels (unsupervised learning).
In this case study, we will use K-Nearest-Neighbors (KNN) to train a classifier using the labeled flowers. We will then use this information to predict labels for test flowers.
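Before turning to scikit-learn, here is a minimal sketch of the KNN idea itself (not the library's implementation): to classify a point, find the k closest training points and take a majority vote of their labels. The tiny dataset below is made up purely for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(Xtrain, ytrain, x, k=3):
    """Classify one point x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(Xtrain - x, axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # majority vote among the corresponding labels
    votes = Counter(ytrain[i] for i in nearest)
    return votes.most_common(1)[0][0]

# made-up 2-feature dataset: two well-separated clusters
Xtrain = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
ytrain = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(Xtrain, ytrain, np.array([1.1, 1.0])))  # near the "a" cluster
print(knn_predict(Xtrain, ytrain, np.array([5.1, 5.0])))  # near the "b" cluster
```

Scikit-learn's `KNeighborsClassifier`, used below, does the same thing (with faster neighbor search) behind its `fit`/`predict` interface.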
import pandas as pd
df = pd.read_csv("iris.csv")
df.sample(10)
| | petalLength | petalWidth | sepalLength | sepalWidth | species |
|---|---|---|---|---|---|
| 21 | 1.5 | 0.4 | 5.1 | 3.7 | setosa |
| 33 | 1.4 | 0.2 | 5.5 | 4.2 | setosa |
| 101 | 5.1 | 1.9 | 5.8 | 2.7 | virginica |
| 141 | 5.1 | 2.3 | 6.9 | 3.1 | virginica |
| 97 | 4.3 | 1.3 | 6.2 | 2.9 | versicolor |
| 122 | 6.7 | 2.0 | 7.7 | 2.8 | virginica |
| 41 | 1.3 | 0.3 | 4.5 | 2.3 | setosa |
| 96 | 4.2 | 1.3 | 5.7 | 2.9 | versicolor |
| 115 | 5.3 | 2.3 | 6.4 | 3.2 | virginica |
| 60 | 3.5 | 1.0 | 5.0 | 2.0 | versicolor |
Let's see what different classes or species of flowers are present in the dataset.
# some useful methods for categorical data
df.species.unique()
df.species.value_counts()
setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64
# the describe method does something different when data is categorical!
df.species.describe()
count        150
unique         3
top       setosa
freq          50
Name: species, dtype: object
So there are three flower species (classes), with 50 data points belonging to each, for 150 in total.
Since it's difficult to visualize 4 dimensions, let's plot pairs of these features and see whether the classes are separable. Here is one such pair, petal length versus petal width:
df.columns
Index(['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth', 'species'], dtype='object')
features = ['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth']
ax = df[df.species=='setosa'].plot.scatter(x='petalLength', y='petalWidth', c='blue')
df[df.species=='virginica'].plot.scatter(x='petalLength', y='petalWidth', c='red', ax=ax)
df[df.species=='versicolor'].plot.scatter(x='petalLength', y='petalWidth', c='purple', ax=ax)
ax.legend(['setosa','virginica','versicolor']);
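To see every pairwise combination of features at once, rather than one pair at a time as above, pandas provides `scatter_matrix`. The sketch below loads the iris data from scikit-learn so it is self-contained (an assumption: its column names differ slightly from `iris.csv`); with the DataFrame `df` from above you would pass `df[features]` instead.

```python
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

# load iris from sklearn so the example runs without iris.csv
iris = load_iris(as_frame=True)
X = iris.data  # 150 rows, 4 feature columns

# one scatter plot per pair of features, colored by class;
# the diagonal shows a histogram of each feature
axes = scatter_matrix(X, figsize=(8, 8), c=iris.target, diagonal='hist')
print(axes.shape)  # a 4x4 grid of axes
```

The petal features alone separate the classes quite well, which is why the single petal-length/petal-width plot above is already informative.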
First, let's train and test on the whole dataset of 150 points.
# We select the training features and labels
Xtrain = df[features]
Ytrain = df.species
from sklearn.neighbors import KNeighborsClassifier
# Instantiate learning model (k = 5)
knn = KNeighborsClassifier(n_neighbors=5)
# Fitting the model
knn.fit(Xtrain,Ytrain)
# Predict labels for the training set (here we test on the same data we trained on)
Ypred = knn.predict(Xtrain)
# compute (number of correct predictions) / (total number of predictions)
accuracy = sum(Ytrain == Ypred) / len(Ypred)
print("The model accuracy is: ", accuracy)
The model accuracy is: 0.9666666666666667
# A built-in way of measuring accuracy:
from sklearn.metrics import accuracy_score
accuracy_score(Ytrain, Ypred)
0.9666666666666667
That means that with K=5, we achieved 96.67% accuracy when testing on the same dataset we trained on.
Let's see what happens with K = 1, i.e. with just 1 nearest neighbor.
knn1 = KNeighborsClassifier(n_neighbors=1)
knn1.fit(Xtrain, Ytrain)
Ypred1 = knn1.predict(Xtrain)
print("Model accuracy:", accuracy_score(Ytrain, Ypred1))
Model accuracy: 1.0
We see 100% accuracy! WHY???
(see the Canvas quiz!)
This time, we will only use two of the features. What accuracy do we obtain if we use only the sepalWidth and sepalLength features for classification with KNN and K=5? Here is some starter code:
# YOUR CODE HERE: (define Xtrain and Ytrain here)
Xtrain =
Ytrain =
# the following code generates the KNN classifier and computes the accuracy
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(Xtrain,Ytrain)
Ypred = knn.predict(Xtrain)
print( "The accuracy is: ", accuracy_score(Ytrain,Ypred) )
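If you want to check your answer, here is one possible way to fill in the blanks. This sketch substitutes scikit-learn's copy of the iris data for `iris.csv` (an assumption, so it runs standalone) and renames the columns to match the notebook's; with the `df` loaded above, the two key lines are simply `Xtrain = df[['sepalLength', 'sepalWidth']]` and `Ytrain = df.species`.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# rebuild a DataFrame with the same column names as iris.csv
# (assumption: sklearn's iris data matches the notebook's CSV)
iris = load_iris(as_frame=True)
df = iris.data.rename(columns={
    'sepal length (cm)': 'sepalLength', 'sepal width (cm)': 'sepalWidth',
    'petal length (cm)': 'petalLength', 'petal width (cm)': 'petalWidth'})
df['species'] = iris.target_names[iris.target]

# use only the two sepal features
Xtrain = df[['sepalLength', 'sepalWidth']]
Ytrain = df.species

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xtrain, Ytrain)
Ypred = knn.predict(Xtrain)
acc = accuracy_score(Ytrain, Ypred)
print("The accuracy is: ", acc)
```

The accuracy drops noticeably compared to using all four features, because the two sepal features overlap heavily for versicolor and virginica (as the pairwise plots suggest).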
Rather than using the same set for both training and testing, we will split the dataset in two: train on the even-indexed rows and test on the odd-indexed rows.
When we do this with KNN (K = 5), what accuracy do we obtain? Here is some code to get you started:
# all the indices (check to see what "everything" is!)
everything = list(df.index)
# train on the even indices, test on the odd indices
training = everything[0::2]
testing = everything[1::2]
# YOUR CODE HERE: (define Xtrain, Ytrain, Xtest, and Ytest)
Xtrain =
Ytrain =
Xtest =
Ytest =
# generate knn classifier with K=5 and compute the accuracy on the test set
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(Xtrain,Ytrain)
Ypred = knn.predict(Xtest)
print( "The accuracy is: ", accuracy_score(Ytest,Ypred) )
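A sketch of one way to complete this, again substituting scikit-learn's iris data for `iris.csv` so it runs standalone (an assumption); with the notebook's `df` and `features`, the selections are `df[features].iloc[training]`, `df.species.iloc[training]`, and likewise for the test rows.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# assumption: sklearn's iris data stands in for iris.csv
iris = load_iris(as_frame=True)
df = iris.data.copy()
df['species'] = iris.target_names[iris.target]
features = [c for c in df.columns if c != 'species']

everything = list(df.index)
training = everything[0::2]   # even indices -> training set
testing = everything[1::2]    # odd indices -> test set

Xtrain = df[features].iloc[training]
Ytrain = df.species.iloc[training]
Xtest = df[features].iloc[testing]
Ytest = df.species.iloc[testing]

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xtrain, Ytrain)
Ypred = knn.predict(Xtest)
acc = accuracy_score(Ytest, Ypred)
print("The accuracy is: ", acc)
```

Because the iris rows are ordered by species, the even/odd split conveniently keeps all three classes balanced in both halves; a random split (e.g. `sklearn.model_selection.train_test_split`) is the more common approach in practice.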