Author: Leonardo Uieda
This notebook is meant as a very brief hands-on introduction to machine learning. It will cover some of the common nomenclature, principles, and applications. It's designed to be taught as a 1-2 hour session with live-coding.
Some features of machine learning (from my personal point of view):
Oversimplified summary: Fit a mathematical model to data and use it to make predictions.
Disclaimer: I'm not an ML researcher. Don't quote me on this.
In Python, the main tool used for machine learning is scikit-learn. We'll use it and some of the other scientific Python stack to play with some data as we work through the core principles of machine learning.
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import (
    preprocessing,
    neighbors,
    decomposition,
    cluster,
    metrics,
    model_selection,
    pipeline,
)
For this tutorial, we'll use one of the sample datasets from the seaborn library. It contains measurements of anatomical features of 3 different species of penguin collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
The data are distributed under a CC-0 license from the GitHub repository allisonhorst/palmerpenguins. The data were originally published in Gorman et al. (2014).
data = sns.load_dataset("penguins")
data
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
The seaborn function returns the data in a pandas.DataFrame, which is the standard way to load data stored in CSV files or Excel spreadsheets. Each row contains information about an individual and each column is a type of observation.
Example of good practice: Notice how each numerical column name includes the unit! This is a great example and your future self will be very grateful if you replicate this for your own data. As a bonus, also use all lowercase characters and underscores (_) instead of spaces or dashes (-).
Given the data that we have, let's set ourselves the following goal: predict the species from the anatomical measurements.
Since the species is a category and not a continuous variable, we're dealing with a classification problem and not a regression problem. This is important to know since it will guide the models and methods that we use and research.
Data cleaning and inspection is probably the most crucial part of a machine learning workflow. Without first having sufficient, good, standardized, and cleaned data there is no algorithm that will solve your problems.
The first thing to note is that our data has missing values (NaN or "Not a Number"). Some of the rows don't have all of the information for all features. This can cause a lot of problems in our processing and machine learning pipelines. So let's first see how many there are and which values are missing.
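To see how many values are missing in each column, we can sum the boolean mask returned by isna (a quick sketch on a made-up miniature table; in the notebook you would call this on data directly):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the penguins table, just to illustrate the counting.
mini = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3],
    "sex": ["Male", np.nan, np.nan],
})

# isna() returns True where a value is missing; summing counts them per column.
missing_per_column = mini.isna().sum()
print(missing_per_column)
```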
data[np.any(data.isna(), axis=1)]
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
8 | Adelie | Torgersen | 34.1 | 18.1 | 193.0 | 3475.0 | NaN |
9 | Adelie | Torgersen | 42.0 | 20.2 | 190.0 | 4250.0 | NaN |
10 | Adelie | Torgersen | 37.8 | 17.1 | 186.0 | 3300.0 | NaN |
11 | Adelie | Torgersen | 37.8 | 17.3 | 180.0 | 3700.0 | NaN |
47 | Adelie | Dream | 37.5 | 18.9 | 179.0 | 2975.0 | NaN |
246 | Gentoo | Biscoe | 44.5 | 14.3 | 216.0 | 4100.0 | NaN |
286 | Gentoo | Biscoe | 46.2 | 14.4 | 214.0 | 4650.0 | NaN |
324 | Gentoo | Biscoe | 47.3 | 13.8 | 216.0 | 4725.0 | NaN |
336 | Gentoo | Biscoe | 44.5 | 15.7 | 217.0 | 4875.0 | NaN |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
It's not that many and most are in the sex column, which we won't use here. So let's first drop that entire column from our dataset.
data.drop("sex", axis=1, inplace=True)
data
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 |
... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 |
344 rows × 6 columns
And then we drop every row that has at least 1 missing value.
data.dropna(axis=0, inplace=True)
data
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 |
... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 |
342 rows × 6 columns
Now the problem is that the index doesn't match the number of rows in our table (notice that the last row is 343 when we only have 342 rows). This happens because pandas deletes rows but keeps the original index numbers intact (which is a sensible default).
Let's re-align the index with the number of rows.
data.reset_index(drop=True, inplace=True)
data
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |
---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 |
3 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 |
4 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 |
... | ... | ... | ... | ... | ... | ... |
337 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 |
338 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 |
339 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 |
340 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 |
341 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 |
342 rows × 6 columns
Now we have data that has no missing values and we're ready to start.
Now that we have clean and ordered data, it's important to visualize the data so that we know what we're dealing with. For this dataset, we have few enough features (columns) and samples (rows) to visualize it straight away. Otherwise, we should apply some sort of dimensionality reduction first.
A standard way to plot the different features and the pair-wise relationships between them is with a "pair plot". Thankfully, we don't have to create these ourselves with matplotlib since seaborn has a function just for this.
sns.pairplot(data, hue="species")
In the figure above, each graph is a cross-plot of 2 different numerical columns in our dataset. The diagonal in the middle is the estimated distribution of that numerical column. Each plot is colored by the respective species.
What we can see from this is that some features are highly correlated for some species. For example, body mass and flipper length, or flipper length and bill depth (for Chinstrap and Adelie). Notice also how the distributions overlap heavily but the overlap isn't always the same. For example, the body mass of Adelie and Chinstrap penguins overlaps but not their bill length.
If all features were correlated and overlapping, we'd have very little chance of differentiating between the different classes (species) using them and our machine learning pipeline would halt here. It would indicate that we probably need to gather data from a different feature.
Since that's not the case, let's move on and see what we can do.
Scikit-learn and a lot of other Python libraries for machine learning (which largely try to emulate scikit-learn), require the data to be formatted in a certain way before it can be used.
Features: The observations that we will use as predictors of the species. These need to be stored in a 2D array (basically a matrix). Each row of this matrix is a sample (in our case the measurements of an individual) and each column corresponds to a different feature. This matrix is often called X in the scikit-learn documentation.
We can make the feature matrix X from our pandas.DataFrame by selecting the columns and getting the .values attribute (which returns the data in a DataFrame as a 2D array).
feature_columns = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
X = data[feature_columns].values
X
array([[ 39.1, 18.7, 181. , 3750. ], [ 39.5, 17.4, 186. , 3800. ], [ 40.3, 18. , 195. , 3250. ], ..., [ 50.4, 15.7, 222. , 5750. ], [ 45.2, 14.8, 212. , 5200. ], [ 49.9, 16.1, 213. , 5400. ]])
This is another example of why getting the data sorted into a standard format early is important. Making the feature matrix would be difficult to do if we hadn't done that previously.
Labels/classes: The thing that we are trying to predict from our features. In our case, we have labels for each sample in our dataset (the species column). The labels must be stored in a 1D array (it can be 2D depending on the application). In the scikit-learn documentation this is often called y.
y = data.species.values
y
array(['Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 
'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo'], dtype=object)
Now that we have our X and y variables, we can start trying to learn from the data with scikit-learn.
Sometimes we may have a bunch of features but no labels already present in the data. Without them, we can't train our models to predict the labels from the features. But that doesn't mean that there is nothing we can do.
Unsupervised learning methods are all about finding patterns in the data without first training a model with given labels. An example of this is clustering, which tries to categorize the data based on how much they group around each other.
From our pair-plot above, we can see that the 3 species tend to be clustered around certain ranges of values depending on the features. There is more or less overlap between species depending on the features used so we can expect that clustering methods may struggle a bit to separate them.
Let's see what the simplest form of clustering (K-Means) can do. We need to tell it the number of clusters to look for. Thankfully, we already know that there are only 3 possible species.
kmeans = cluster.KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)
cluster_labels = kmeans.predict(X)
cluster_labels
array([0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 2, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 2, 0, 2, 0, 0, 0, 2, 0, 2, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 0, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1], dtype=int32)
The prediction output is a value of 0, 1, or 2 indicating to which cluster each data sample belongs. We don't know which one corresponds to which species (if any) but we can do a visual inspection to check the performance.
To do so, we can assign the k-means labels as a column in our data variable and use it to create a new pair-plot.
data = data.assign(cluster=cluster_labels)
sns.pairplot(data, hue="cluster")
Pro tip: Open a "New view for notebook" in the JupyterLab "File" menu to place the original pair-plot and our k-means prediction side by side for comparison.
The clustering doesn't seem to have performed very well in the prediction of the labels. For one, it seems to have divided the body mass 3 ways into low, mid, and high values.
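A contingency table makes this mismatch concrete by counting how many samples of each species fall into each cluster (a sketch with made-up labels; in the notebook you would pass data.species and cluster_labels to pandas.crosstab):

```python
import pandas as pd

# Hypothetical species labels and cluster assignments for five samples.
species = pd.Series(["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"], name="species")
clusters = pd.Series([0, 0, 1, 1, 0], name="cluster")

# Each cell counts how many samples of a species landed in a given cluster.
table = pd.crosstab(species, clusters)
print(table)
```

A good clustering shows one dominant cluster per species row; counts spread across a row mean the clusters don't line up with the species.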
I guess this means that k-means is not able to predict the species properly so we should move on to more complex and sophisticated models. 🤷
🛑 Stop! 🛑
Before moving on, we should always try to understand why the model failed and check if there is a problem with our data. If our data are faulty or need some more cleaning, there is no amount of cleverness in model choice that will overcome that. As the saying goes:
Garbage in, garbage out. 🗑️
Going back to our data, let's have a look at the statistics for each feature.
data[feature_columns].mean(axis=0)
bill_length_mm 43.921930 bill_depth_mm 17.151170 flipper_length_mm 200.915205 body_mass_g 4201.754386 dtype: float64
data[feature_columns].std(axis=0)
bill_length_mm 5.459584 bill_depth_mm 1.974793 flipper_length_mm 14.061714 body_mass_g 801.954536 dtype: float64
Notice that the body mass is 2 orders of magnitude larger in both mean and standard deviation than the bill measurements. That should raise some red flags since our clustering seems to have divided the data based on body mass alone.
K-means calculates the clusters based on the distance between points and their "center", as defined by the mean of all their dimensions (features). And what we know about means is that large values will skew the mean towards them. So it's no wonder that the clusters are being picked based on the body mass alone.
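We can check this directly by computing the contribution of each feature to the Euclidean distance between two samples (a minimal sketch using the values of two penguins from the table above):

```python
import numpy as np

# Features: [bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g]
a = np.array([39.1, 18.7, 181.0, 3750.0])
b = np.array([46.8, 14.3, 215.0, 4850.0])

# The squared difference of each feature is its contribution to the squared distance.
contributions = (a - b) ** 2
fraction_from_mass = contributions[-1] / contributions.sum()
print(f"body mass accounts for {fraction_from_mass:.1%} of the squared distance")
```

The body mass term dwarfs all the others, so the distance (and hence the clustering) is effectively decided by body mass alone.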
What can we do to mitigate this? The standard practice, and an underlying assumption for many machine learning methods, is that all features have zero mean and unit standard deviation (in other words, that they are all on comparable scales).
The simplest possible method to resolve this is to subtract the mean and divide by the standard deviation of each feature. In scikit-learn, we can do this with the StandardScaler.
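Under the hood, StandardScaler is just the subtract-mean, divide-by-std recipe described above. A NumPy sketch of the same operation on a toy matrix (both use the population standard deviation, ddof=0):

```python
import numpy as np

# Toy feature matrix: two columns with very different scales.
X_toy = np.array([[39.1, 3750.0],
                  [46.8, 4850.0],
                  [50.4, 5750.0]])

# Standardize each column (feature): zero mean, unit standard deviation.
X_toy_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
print(X_toy_scaled.mean(axis=0))  # approximately 0 for each column
print(X_toy_scaled.std(axis=0))   # 1 for each column
```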
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X_scaled
array([[-0.88449874, 0.78544923, -1.41834665, -0.56414208], [-0.81112573, 0.1261879 , -1.06225022, -0.50170305], [-0.66437972, 0.43046236, -0.42127665, -1.18853234], ..., [ 1.18828874, -0.73592307, 1.50164406, 1.93341896], [ 0.23443963, -1.19233476, 0.7894512 , 1.24658968], [ 1.09657248, -0.53307343, 0.86067049, 1.49634578]])
X_scaled.mean(axis=0)
array([ 1.66208827e-16, -1.41277503e-15, -8.31044135e-16, 4.15522068e-17])
X_scaled.std(axis=0)
array([1., 1., 1., 1.])
Now let's try the clustering again but with our scaled features.
kmeans.fit(X_scaled)
cluster_labels_scaled = kmeans.predict(X_scaled)
Instead of trying to visually match the cluster labels with the species, let's see if we can match them with code.
cluster_labels_scaled
array([1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
Since our data was originally sorted by species, we can tell from the prediction that cluster 1 corresponds mostly to Adelie, cluster 2 to Chinstrap, and cluster 0 to Gentoo.
Let's make an array that replaces the cluster labels with the species name and add it to our DataFrame.
cluster_species_scaled = np.empty_like(data.species)
cluster_species_scaled[cluster_labels_scaled == 0] = "Gentoo"
cluster_species_scaled[cluster_labels_scaled == 1] = "Adelie"
cluster_species_scaled[cluster_labels_scaled == 2] = "Chinstrap"
data = data.assign(species_cluster=cluster_species_scaled)
sns.pairplot(data, hue="species_cluster", vars=feature_columns)
Comparing these predicted cluster labels with our actual known species shows that k-means is a decent method to predict species. Of course, the clustering is not perfect and some samples are mislabeled. Let's see if we can do better with a supervised learning approach.
I can't stress this enough:
Always start with the simplest model and check your data before moving to more complex models!
Since we have labels for our data (the species), we can train models to predict these labels based on the input features. Machine learning techniques that follow this procedure fall under the supervised learning category. Examples of supervised learning include linear regression, support vector machines, neural networks (including deep learning), and more. Once we have trained our model, we can then use it to predict the labels of new data that comes in without labels (new penguin measurements by someone who doesn't know the species).
The simplest possible supervised learning model for our classification problem is a k-nearest-neighbors classifier (kNN). The general idea is to assign labels based on the labels of a sample's k nearest neighbors (the label with the majority of votes is used). This method is fast and scales well to larger datasets, so it's a good choice for a first pass.
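To demystify the prediction rule, here is a bare-bones sketch of kNN for a single new sample (toy data made up for illustration; scikit-learn's implementation is faster and handles many edge cases):

```python
from collections import Counter

import numpy as np

def knn_predict_one(X_train, y_train, x_new, k=3):
    """Label x_new by majority vote among its k nearest training samples."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Tiny hypothetical training set: two well-separated clumps of points.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_toy = np.array(["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"])

print(knn_predict_one(X_toy, y_toy, np.array([0.05, 0.05])))  # prints: Adelie
```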
knn = neighbors.KNeighborsClassifier()
knn.fit(X, y)
species_prediction = knn.predict(X)
species_prediction
array(['Adelie', 'Adelie', 'Chinstrap', 'Adelie', 'Chinstrap', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Chinstrap', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Chinstrap', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Chinstrap', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Chinstrap', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Chinstrap', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Chinstrap', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Adelie', 'Adelie', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Adelie', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Adelie', 'Chinstrap', 'Adelie', 'Adelie', 'Adelie', 'Chinstrap', 'Adelie', 'Chinstrap', 'Adelie', 'Adelie', 'Chinstrap', 'Chinstrap', 'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Chinstrap', 'Adelie', 'Chinstrap', 'Gentoo', 'Adelie', 'Adelie', 
'Chinstrap', 'Chinstrap', 'Adelie', 'Adelie', 'Chinstrap', 'Adelie', 'Chinstrap', 'Adelie', 'Chinstrap', 'Chinstrap', 'Adelie', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Adelie', 'Adelie', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Adelie', 'Chinstrap', 'Chinstrap', 'Adelie', 'Chinstrap', 'Adelie', 'Gentoo', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Chinstrap', 'Gentoo', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Adelie', 'Gentoo', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo'], dtype=object)
An advantage of the supervised model is that it can give us the predicted species directly, without us having to guess which label corresponds to which.
Instead of looking at the pair-plot for this again, let's try a more quantitative approach. We can calculate something called the accuracy score (read the reference for the maths behind it). Basically, it tells us how close two non-numerical arrays are by counting the number of entries that are equal and dividing by the number of samples. If all entries are the same, the score will be 1. If none of them are, the score will be 0.
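The accuracy score is simple enough to compute by hand: compare the arrays element-wise and take the mean of the resulting boolean array (a sketch with made-up labels):

```python
import numpy as np

# Hypothetical true and predicted labels: 3 out of 4 match.
y_true = np.array(["Adelie", "Adelie", "Gentoo", "Chinstrap"])
y_pred = np.array(["Adelie", "Gentoo", "Gentoo", "Chinstrap"])

# True counts as 1 and False as 0, so the mean is the fraction of matches.
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # prints: 0.75
```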
metrics.accuracy_score(y, species_prediction)
0.8391812865497076
That's not a bad score! Particularly when we consider that we did nothing to tune the model or configure it in any way.
One thing that may have sparked something in your mind is that kNN is calculating distances between samples, just like our K-means clustering was. And so, it's very likely suffering from the same scaling issue we faced there. Let's see if using the scaled version of the features improves our accuracy score.
knn = neighbors.KNeighborsClassifier()
knn.fit(X_scaled, y)
species_prediction_scaled = knn.predict(X_scaled)
metrics.accuracy_score(y, species_prediction_scaled)
0.9912280701754386
That's a big improvement! The prediction is almost perfect now.
For good measure, let's create our pair plot again with the kNN predictions this time.
data = data.assign(species_knn=species_prediction_scaled)
sns.pairplot(data, hue="species_knn", vars=feature_columns)
Since scaling is such an important thing, we can make it a permanent part of our machine learning workflow for this problem by using a scikit-learn Pipeline.
classifier = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neighbors.KNeighborsClassifier(),
)
classifier.fit(X, y)
species_prediction_pipeline = classifier.predict(X)
metrics.accuracy_score(y, species_prediction_pipeline)
0.9912280701754386
What is this high score actually telling us? It's calculated by comparing the known labels that we used to train our model against the labels predicted by the model. So it indicates how well our model fits the data. That's a good thing to know. But our data has noise (for example, misidentified species) so we probably don't want to fit the data perfectly, otherwise we'd be replicating noise in the predictions. This is often called over-fitting in machine learning and is a major pitfall to avoid!
What we actually want to know is not how well we can fit the training data, but how well will our model perform when given new data. In other words, what is the predictive power of our model? The accuracy scores above don't tell us anything about that.
If we want to answer the question "how well does our model perform on new data?", we have to keep back some data from the training process. We can then use this data to test our model's performance on unseen data. This is the simplest form of cross-validation, where we validate the model using the data itself.
With scikit-learn, the easiest way to do this is with the train_test_split function. By default, it splits the input data randomly into 75% training data and 25% testing data.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
X_train.shape, X_test.shape
((256, 4), (86, 4))
The procedure now is to fit our model pipeline (scaling + kNN) on the training data.
classifier.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()), ('kneighborsclassifier', KNeighborsClassifier())])
And then score the trained model prediction against the testing data.
metrics.accuracy_score(y_test, classifier.predict(X_test))
1.0
Doing that, we get a score that is more representative of the accuracy we should expect when applying our model in the real world. Typically it will be a bit lower than the score on the training data (although this particular random split happened to give a perfect score).
A limitation of this method is that the score is dependent on how the data were split. If you run the entire code above (split, train, test) multiple times, you will get different scores depending on the roll of the dice (give it a try!).
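We can see this variability by repeating the split-train-score loop with different random seeds (a sketch on synthetic two-class data so that it runs on its own; in the notebook you would pass X and y instead):

```python
import numpy as np
from sklearn import metrics, model_selection, neighbors

# Synthetic two-class data standing in for the penguin features.
rng = np.random.default_rng(0)
X_syn = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y_syn = np.array([0] * 50 + [1] * 50)

scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
        X_syn, y_syn, random_state=seed
    )
    model = neighbors.KNeighborsClassifier().fit(X_tr, y_tr)
    scores.append(metrics.accuracy_score(y_te, model.predict(X_te)))
print(scores)  # the score typically changes with each split
```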
We can do better by using a more exhaustive form of cross-validation known as K-Fold. This involves splitting the data multiple times systematically and alternating which part is used for training and testing. This way, all of our data will have been used for both purposes by the end.
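Conceptually, K-Fold just slices the samples into K chunks and rotates which chunk is held out for testing. Scikit-learn's KFold splitter shows the index bookkeeping (a sketch on a toy array):

```python
import numpy as np
from sklearn import model_selection

X_toy = np.arange(10).reshape(10, 1)  # 10 samples, 1 feature

kfold = model_selection.KFold(n_splits=5)
for train_idx, test_idx in kfold.split(X_toy):
    # Across the 5 folds, every sample lands in the test set exactly once.
    print("train:", train_idx, "test:", test_idx)
```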
Scikit-learn provides a useful function for performing cross-validation that does the splitting, training, testing loop for us and returns the score from each iteration.
scores = model_selection.cross_val_score(classifier, X, y)
scores
array([0.98550725, 0.98550725, 0.98529412, 0.98529412, 1. ])
Now we can calculate the mean score, which will be more stable across multiple runs and a better representation of our model accuracy.
scores.mean()
0.9883205456095482
Again, that's a very good score!
Let's compare this against the accuracy of our unsupervised K-means clustering.
metrics.accuracy_score(y, data.species_cluster)
0.9152046783625731
That's a good improvement in the prediction accuracy!
A few things to keep in mind:
The main take home messages of this tutorial are:
- Always start with the simplest models (KMeans for clustering, LinearModel for regression, or KNeighborsClassifier for classification) before reaching for more complex models (looking at you, deep learning 👺). Your goal with other models is always to be at least better than these simple models.
- Never judge a model by its fit to the training data alone. Hold back a test set (train_test_split), use cross-validation (cross_val_score), or combine the two (train_test_split + cross_val_score).
What to look into for the future.
The original material for this tutorial can be found at leouieda/ml-intro. Comments, corrections, and additions are welcome.
All Python source code is made available under the BSD 3-clause license. You can freely use and modify the code, without warranty, so long as you provide attribution to the authors.
Unless otherwise specified, all figures and Jupyter notebooks are available under the Creative Commons Attribution 4.0 License (CC-BY).
The full text of these licenses is provided in the LICENSE.txt file.