A common task in computational research is to classify an object based on a set of features. In supervised machine learning, we can give an algorithm a dataset of training examples that say "here are specific features, and this is the target class it belongs to". With enough training examples, a model can be built that recognizes important features in determining an object's class. This model can then be used to predict the class of an object given its known features.
First let's import the packages that we need for this notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, recall_score, precision_score, f1_score
Let's say that we are studying penguins in Antarctica. We have body measurements for a set of penguins from three different species: Adelie, Chinstrap, and Gentoo. We are interested in being able to differentiate between these three species based on the measurements. First, let's take a look at our data set.
Now, let's load in our preprocessed penguins data set.
X_train = pd.read_csv('../data/penguins_X_train.csv')
X_test = pd.read_csv('../data/penguins_X_test.csv')
y_train = pd.read_csv('../data/penguins_y_train.csv')
y_test = pd.read_csv('../data/penguins_y_test.csv')
Let's start with just two penguin species: Adelie and Gentoo.
X_train = X_train[y_train['species'].isin(['Adelie','Gentoo'])].reset_index()
X_test = X_test[y_test['species'].isin(['Adelie','Gentoo'])].reset_index()
y_train = y_train[y_train['species'].isin(['Adelie','Gentoo'])].reset_index()
y_test = y_test[y_test['species'].isin(['Adelie','Gentoo'])].reset_index()
Let's say that we wanted to assign a species to each unknown measured penguin. One simple way to do this is to assign all observations to the majority class. The code below shows the proportion of each species in the training data.
Question: If we want to maximize accuracy, which species label would we assign to all observations?
y_train.value_counts('species')/sum(y_train.value_counts('species'))
This proportion is our baseline accuracy, and it is the number that we will try to improve on with classification.
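To have that number handy for later comparison, here is a quick sketch (a small addition) that pulls out the majority-class proportion directly:
# Baseline: always predict the most common species in the training data
baseline_accuracy = y_train['species'].value_counts(normalize=True).max()
print("Baseline accuracy:", round(baseline_accuracy, 3))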
Let's get to know our dataset by conducting some exploratory data analysis. We'll use some rudimentary analysis to see whether the independent variables differ across species.
Let's say that we decide that body mass might be a good way to differentiate between Adelie and Gentoo penguins. We can look at a plot of the histogram to see how the distribution of this variable changes between species.
Question: Where would you place a line to minimize the overlap in the distribution?
sb.histplot(data=X_train.loc[y_train['species'].isin(['Adelie','Gentoo'])],
x = 'body_mass_g',
hue = y_train['species'],kde=True,bins=20)
#plt.axvline(.28,color= 'red')
Now let's apply this same decision boundary to the test data.
Question: Is this still the best boundary?
sb.histplot(data=X_test.loc[y_test['species'].isin(['Gentoo','Adelie'])],
x = 'body_mass_g',
hue = y_test['species'],kde=True,bins=20)
#plt.axvline(.28,color= 'red')
This is the basic goal of classification. Based on your boundary criterion, you would classify each of the penguins, but there would be some error involved. We can be more confident in our classification at the far ends of the distribution, and less confident where the distributions overlap.
Now let's figure out how to separate out these groups mathematically. For this, we will start by using an algorithm called Logistic Regression.
Logistic regression is a supervised classification algorithm used to predict a binary outcome. Similar to linear regression, this model uses coefficients, or betas, to make its predictions. However, unlike linear regression, its predictions range from 0 to 1, where values near 0 and 1 mean 'confidently class A' and 'confidently class B' respectively. Predictions toward the middle of that range indicate less confidence.
The function for the logistic regression is: $$ p(x) = \frac{1}{1 + e^{-(\beta_0+\beta_1x_1 + \dots)}}$$
where $\beta$ are the learned parameters and $x$ are the input features.
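To build some intuition for this function, here is a quick sketch (not part of the original analysis) plotting the logistic curve itself: strongly negative inputs map toward 0, strongly positive inputs map toward 1, and inputs near 0 give probabilities near 0.5.
# A minimal sketch of the logistic (sigmoid) curve
z = np.linspace(-6, 6, 200)
p = 1 / (1 + np.exp(-z))
plt.plot(z, p)
plt.xlabel('linear combination of features')
plt.ylabel('predicted probability p(x)')
plt.show()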
Let's train a logistic regression model on the variable: body_mass_g
Logistic regression uses the same general steps as many other sklearn algorithms:
#1) Initialize Model
lr = LogisticRegression(max_iter=170)
#2) Fit model
lr.fit(X_train['body_mass_g'].values.reshape(-1, 1), y_train['species'])
#3) Evaluate
train_score = lr.score(X_train['body_mass_g'].values.reshape(-1, 1), y_train['species'])
test_score = lr.score(X_test['body_mass_g'].values.reshape(-1, 1), y_test['species'])
print("Training score:", train_score.round(3), "Testing score:", test_score.round(3))
Question: How well did the model do compared to baseline?
The logistic regression did a pretty good job at classifying the penguins. However, we have more to base our species decision on than just body mass. For example, let's look at the combination of culmen depth and body mass in our data by using a scatterplot.
In the two-dimensional space, the intuition is that we want to draw a line that separates the classes.
Question: Is it possible to draw a line that separates the groups? If it is, this is a linearly separable problem.
sb.scatterplot(data=X_train.loc[y_train['species'].isin(['Adelie','Gentoo'])],
x = 'culmen_depth_mm',
y = 'body_mass_g',
hue = y_train['species'])
Let's retrain the logistic model with two variables.
lr = LogisticRegression(max_iter=170)
lr.fit(X_train[['body_mass_g','culmen_depth_mm']], y_train['species'])
train_score = lr.score(X_train[['body_mass_g','culmen_depth_mm']], y_train['species'])
test_score = lr.score(X_test[['body_mass_g','culmen_depth_mm']], y_test['species'])
print("Training score = {}, testing score = {}".format(train_score.round(3), test_score.round(3)))
While this doesn't happen often in real life, we got a perfect score! We could add more features to the model, but there isn't a need, since our model already classifies the test set perfectly. Now let's take a look at the coefficients of the model. We reference the lr.coef_ attribute to see the coefficients.
coef = pd.Series(index=['body_mass_g','culmen_depth_mm'], data=lr.coef_[0])
coef.sort_values()
Question: What do you think the magnitude and sign of the coefficients means about how these variables are related to each category? Hint: Refer back to the scatter plot!
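To connect these coefficients back to the logistic function above, here is a hedged sketch that reconstructs the predicted probability for a few test penguins by hand and checks it against lr.predict_proba (the second column is the probability of Gentoo, the alphabetically later class):
# Rebuild p(x) = 1 / (1 + exp(-(beta_0 + beta_1*x_1 + beta_2*x_2))) manually
X_small = X_test[['body_mass_g','culmen_depth_mm']].head()
z = lr.intercept_[0] + X_small.values @ lr.coef_[0]
manual_probs = 1 / (1 + np.exp(-z))
print(manual_probs.round(3))
print(lr.predict_proba(X_small)[:, 1].round(3))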
We've covered accuracy already, but there is a whole litany of other ways to evaluate the performance of a classification model.
In a binary classification task, there are four major types of predictions:
- True Positive: the model predicts the positive class and the actual label is positive
- False Positive: the model predicts the positive class but the actual label is negative
- True Negative: the model predicts the negative class and the actual label is negative
- False Negative: the model predicts the negative class but the actual label is positive
Accuracy, which is the most common metric used with classification, can be characterized as:
$$ \text{Accuracy} = \frac{\sum{\text{True Positives}}+\sum{\text{True Negatives}}}{\sum{\text{Total Population}}}$$
We can combine the prediction measures above to create three helpful metrics for evaluating classification: precision, recall, and specificity.
1. Precision (how many of the predicted positives are actually positive): $$\frac{\sum{\text{True Positives}}}{\sum{\text{Predicted Positives}}}$$
2. Recall (how many of the actual positives are recovered): $$\frac{\sum{\text{True Positives}}}{\sum{\text{Condition Positives}}}$$
3. Specificity (like recall for negative examples): $$\frac{\sum{\text{True Negatives}}}{\sum{\text{Condition Negatives}}}$$
Let's make a confusion matrix and derive the recall and precision scores.
First, let's go back to the original (not perfect) model so we can see what these rates look like.
We will retrain the model and make predictions on the test set.
lr.fit(X_train[['body_mass_g']], y_train['species'])
preds = lr.predict(X_test[['body_mass_g']])
# Pass y_test and preds into confusion_matrix
confusion_matrix(y_test['species'], preds)
1). What are the TP, FP, TN, FN in these model results?
2). What is the precision and recall for this model?
3). Which is more important, precision or recall?
Depending on your task, other metrics than accuracy might be more beneficial to understanding your model's performance. At the very least, examining the confusion matrix is a great way to get a better sense of how your model is performing across classes.
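As a quick sketch, classification_report (already imported above) bundles precision, recall, and F1 for each class into one summary, which is a convenient way to check your answers:
# Precision, recall, and F1 per species for the body-mass-only model
print(classification_report(y_test['species'], preds))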
Let's now include all three species of penguin that we want to differentiate between. We can turn to other models that can handle two or more classes for classification. One such example is the Decision Tree Classifier. In terms of logic, this is like a flow chart.
In this flow chart, the data point is a lamp that doesn't work, the features are pieces of information about how it doesn't work, and the class is the action taken at the end.
While the ultimate goal of classification remains the same, machine learning algorithms vary widely in how they go about the task. The neat thing about sklearn is that many algorithms use the same syntax, which makes comparing their performance fairly straightforward. However, each model has different underlying parameters and methods for identifying the optimal split. When you are using a new model, it is helpful to read up on how it works.
The documentation is a great way to do that. Read the documentation for the Decision Tree and let's try to answer the following questions:
1). What are two advantages and two disadvantages of the Decision Tree?
2). What measure do Decision Trees use to determine the optimal split?
3). How do you import the Decision Tree from sklearn?
Decision Trees are a classification/regression supervised learning algorithm that uses a series of splits to make its predictions.
Decision Trees learn from the data by picking the feature and threshold that maximize the information gain on the target variable. In other words, the model chooses the splitting point that produces the purest (most imbalanced) class proportions in the target variable. The goal is to keep splitting until all the data in a terminal node, or leaf, are exclusively one class.
The model iterates through a set of candidate values for each feature, calculates the impurity of each resulting split, and designates as the split the one that produces the lowest impurity (equivalently, the highest information gain).
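As a small illustration (assumed values, not from the original notebook), here is how the weighted Gini impurity of one candidate split on the standardized body mass could be computed by hand; the tree prefers whichever threshold makes this number smallest.
# Gini impurity of a node: 1 - sum of squared class proportions
def gini(labels):
    proportions = labels.value_counts(normalize=True)
    return 1 - (proportions ** 2).sum()

threshold = 0  # an assumed, illustrative split point on the standardized scale
left = y_train['species'][X_train['body_mass_g'] <= threshold]
right = y_train['species'][X_train['body_mass_g'] > threshold]
weighted_gini = (len(left) * gini(left) + len(right) * gini(right)) / len(y_train)
print("Weighted Gini impurity for this split:", round(weighted_gini, 3))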
Parameters
There are many parameters for the Decision Tree Classifier. A few relevant to this notebook are described here:
criterion: The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
splitter: The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split: The minimum number of samples required to split an internal node
min_samples_leaf: The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
max_features: The number of features to consider when looking for the best split
Now let's train a decision tree model on the penguins data set. We are going to start with a default DT model, meaning we're not going to pass in any parameters of our own. Like we did before, we are going to fit a model and then evaluate it on the training and testing datasets. Let's start with a single x-feature.
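One housekeeping note: earlier we filtered the training and test sets down to Adelie and Gentoo, so before fitting the tree we reload the full preprocessed files (assuming the same paths as before) to bring all three species back in.
# Reload the unfiltered data so all three species are included
X_train = pd.read_csv('../data/penguins_X_train.csv')
X_test = pd.read_csv('../data/penguins_X_test.csv')
y_train = pd.read_csv('../data/penguins_y_train.csv')
y_test = pd.read_csv('../data/penguins_y_test.csv')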
# Initialize model
dt = DecisionTreeClassifier()
# Fit model on the dataset
dt.fit(X_train[['body_mass_g']], y_train['species'])
# Derive the training accuracy score
dt.score(X_train[['body_mass_g']], y_train['species'])
# Test score
dt.score(X_test[['body_mass_g']], y_test['species'])
Question: Our testing score is considerably lower. When the testing score is lower than the training score, what does that mean?
We can take advantage of some of the parameters of the decision tree to help prevent overfitting. Let's try a model in which we impose some constraints on the tree.
Question: From the documentation, what is one parameter that might help?
# Initialize
dt = DecisionTreeClassifier(max_depth=2)
# Fit
dt.fit(X_train[['body_mass_g']], y_train['species'])
# Evaluate
train_score = dt.score(X_train[['body_mass_g']], y_train['species'])
test_score = dt.score(X_test[['body_mass_g']], y_test['species'])
print("Our training score is {} and our testing score is {}".format(train_score.round(3), test_score.round(3)))
The gap between the two scores is considerably smaller. Arguably we no longer have an overfit model. However, we could likely improve the accuracy of this model by including more features.
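One way to get a more robust read on generalization than a single train/test split is cross-validation; cross_val_score and KFold were imported at the top, so here is a quick sketch of how they could be applied to the constrained tree:
# 5-fold cross-validation of the depth-2 tree on the training data
cv_scores = cross_val_score(DecisionTreeClassifier(max_depth=2),
                            X_train[['body_mass_g']], y_train['species'],
                            cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("Mean cross-validation accuracy:", cv_scores.mean().round(3))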
One big advantage of the Decision Tree is that it can be visualized no matter how many features are involved.
Let's retrain it with a small max_depth
dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X_train[['body_mass_g']], y_train['species'])
Question: What is the first criterion used to split the decision tree?
plt.figure(figsize=(28, 20))
plot_tree(dt, feature_names=['body_mass_g'], class_names=["Adelie", "Chinstrap","Gentoo"],
filled = True, proportion=True, fontsize=18
);
Using the tree, how would we make predictions about the following penguins?
- Penguin A: Body Mass of .5
- Penguin B: Body Mass of 0
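We can check our reading of the tree against the fitted model itself; a quick sketch, taking the two body-mass values above as already-standardized numbers:
# Ask the fitted tree to classify the two hypothetical penguins
new_penguins = pd.DataFrame({'body_mass_g': [0.5, 0]})
print(dt.predict(new_penguins))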
Now let's try another model. The Support Vector Machine is another class of machine learning algorithm used for classification.
Choose two features of the data set to train your model on. Then, using the documentation for the support vector machine, follow the steps in the code cell below to initialize, train, and evaluate the model. Afterwards, consider the following questions:
Is your model underfit? Is it overfit?
How does SVM fit in with the linearly separable problem identified in the scatter plots above?
## YOUR CODE HERE
from sklearn.svm import SVC
X_train_subset = X_train[['feature1','feature2']]
X_test_subset = X_test[['feature1','feature2']]
y_train_subset = y_train['species']
y_test_subset = y_test['species']
##1) Initialize SVM
##2) Train SVM on Training data
##3) Evaluate SVM on Training and Test Data
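One possible solution sketch, assuming body_mass_g and culmen_depth_mm as the two chosen features and a linear kernel (any sensible pair of features and kernel would do):
# Select the two chosen features (an illustrative choice)
X_train_subset = X_train[['body_mass_g','culmen_depth_mm']]
X_test_subset = X_test[['body_mass_g','culmen_depth_mm']]
y_train_subset = y_train['species']
y_test_subset = y_test['species']
# 1) Initialize SVM
svm = SVC(kernel='linear')
# 2) Train SVM on the training data
svm.fit(X_train_subset, y_train_subset)
# 3) Evaluate SVM on the training and test data
print("Training score:", round(svm.score(X_train_subset, y_train_subset), 3))
print("Testing score:", round(svm.score(X_test_subset, y_test_subset), 3))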