Let's look at the Iris dataset. We have already seen this dataset in the "Clustering" case study. It contains four measurements (petal length, petal width, sepal length, sepal width) for flowers of several different species.
In the Clustering case study, we used K-Means to cluster the flowers into different groups without making use of the labels (unsupervised learning).
In this case study, we will use K-Nearest-Neighbors (KNN) to train a classifier using the labeled flowers. We will then use this information to predict labels for test flowers.
import pandas as pd

df = pd.read_csv("iris.csv")
df.sample(10)
Let's see which classes (species) of flowers are present in the dataset.

# useful methods for categorical data
df.species.unique()
df.species.value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

# the describe method does something different when the data is categorical!
df.species.describe()

count        150
unique         3
top       setosa
freq          50
Name: species, dtype: object
So the dataset contains 150 data points, with 50 belonging to each of the three flower species (classes).
Since it's difficult to visualize four dimensions at once, let's plot the features pairwise and see whether the data points are separable.
df.columns

Index(['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth', 'species'],
      dtype='object')
features = ['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth']
ax = df[df.species=='setosa'].plot.scatter(x='petalLength', y='petalWidth', c='blue')
df[df.species=='virginica'].plot.scatter(x='petalLength', y='petalWidth', c='red', ax=ax)
df[df.species=='versicolor'].plot.scatter(x='petalLength', y='petalWidth', c='purple', ax=ax)
ax.legend(['setosa', 'virginica', 'versicolor']);
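The cell above plots only one pair of features (petalLength vs. petalWidth). Here is a minimal sketch of how to repeat the same plot for every pair, assuming the df and features variables defined above (the color choices are just for illustration):

import itertools
import matplotlib.pyplot as plt

colors = {'setosa': 'blue', 'virginica': 'red', 'versicolor': 'purple'}

# 4 features give 6 distinct pairs
for f1, f2 in itertools.combinations(features, 2):
    ax = None
    for species, color in colors.items():
        ax = df[df.species == species].plot.scatter(x=f1, y=f2, c=color, label=species, ax=ax)
    plt.show()

pandas also ships a one-liner that gives a similar overview: pandas.plotting.scatter_matrix(df[features]).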
First, let's train and test on the whole dataset of 150 points.
# We select the training features and labels
Xtrain = df[features]
Ytrain = df.species
from sklearn.neighbors import KNeighborsClassifier

# Instantiate learning model (k = 5)
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the model
knn.fit(Xtrain, Ytrain)

# Predict labels for the training set
Ypred = knn.predict(Xtrain)

# compute (number of correct predictions) / (total number of predictions)
accuracy = sum(Ytrain == Ypred) / len(Ypred)
print("The model accuracy is: ", accuracy)
The model accuracy is: 0.9666666666666667
# A built-in way of measuring accuracy:
from sklearn.metrics import accuracy_score

accuracy_score(Ytrain, Ypred)
That means that with K = 5, we achieve 96.67% accuracy when testing on the same dataset we trained on.
Let's see what happens with K = 1, i.e. with just one nearest neighbor.
knn1 = KNeighborsClassifier(n_neighbors=1)
knn1.fit(Xtrain, Ytrain)
Ypred1 = knn1.predict(Xtrain)
print("Model accuracy:", accuracy_score(Ytrain, Ypred1))
Model accuracy: 1.0
We see 100% accuracy! WHY??? Because we are predicting on the training set itself: with K = 1, each training point's nearest neighbor is the point itself (at distance 0), so it is always assigned its own label.
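We can check this directly with the classifier's kneighbors method, which returns the distances and indices of the nearest training points. A quick sketch, using the knn1 model fitted above:

# nearest neighbor of the first training point: the point itself, at distance 0
dist, ind = knn1.kneighbors(Xtrain.iloc[[0]], n_neighbors=1)
print(dist, ind)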
This time, we will only use two of the features. What accuracy do we obtain if we use only the sepalLength and sepalWidth features for classification with KNN and K = 5? Here is some starter code:
# YOUR CODE HERE: (define Xtrain and Ytrain here)
Xtrain =
Ytrain =
# the following code generates the KNN classifier and computes the prediction accuracy
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xtrain, Ytrain)
Ypred = knn.predict(Xtrain)
print("The accuracy is: ", accuracy_score(Ytrain, Ypred))
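One possible way to fill in the blanks, sketched under the assumption that the two chosen features are the sepal measurements:

# possible solution sketch: keep only the two sepal columns as features
Xtrain = df[['sepalLength', 'sepalWidth']]
Ytrain = df.species

With only the sepal measurements, the classes overlap more than in the petal plots, so expect a lower accuracy than with all four features.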
Rather than using the same set for both training and testing, we will do the following: split the data in half, train on one half (the even-indexed rows), and test on the other half (the odd-indexed rows).
When we do this with KNN (K = 5), what accuracy do we obtain? Here is some code to get you started:
# all the indices (check to see what "everything" is!)
everything = list(df.index)

# train on the even indices, test on the odd indices
training = everything[0::2]
testing = everything[1::2]
# YOUR CODE HERE: (define Xtrain, Ytrain, Xtest, and Ytest)
Xtrain =
Ytrain =
Xtest =
Ytest =
# generate a KNN classifier with K=5 and compute the prediction accuracy
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xtrain, Ytrain)
Ypred = knn.predict(Xtest)
print("The accuracy is: ", accuracy_score(Ytest, Ypred))
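A possible way to fill in the blanks, assuming the training and testing index lists defined above and all four features:

# possible solution sketch: select rows by index label with .loc
Xtrain = df.loc[training, features]
Ytrain = df.loc[training, 'species']
Xtest = df.loc[testing, features]
Ytest = df.loc[testing, 'species']

In practice, sklearn.model_selection.train_test_split performs a randomized version of this kind of split for you.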