In [1]:

```
import pandas as pd
df = pd.read_csv("iris.csv").iloc[:, :4]
df.head()
```

Out[1]:

|   | petalLength | petalWidth | sepalLength | sepalWidth |
|---|---|---|---|---|
| 0 | 1.4 | 0.2 | 5.1 | 3.5 |
| 1 | 1.4 | 0.2 | 4.9 | 3.0 |
| 2 | 1.3 | 0.2 | 4.7 | 3.2 |
| 3 | 1.5 | 0.2 | 4.6 | 3.1 |
| 4 | 1.4 | 0.2 | 5.0 | 3.6 |

Let's do the most basic level of investigation: looking at the data!

There are only 4 features for each flower measurement, so let's visualize pairs of features with scatter plots, e.g., plotting `petalLength` against `sepalLength`. If there's a clear relation between a pair of variables, a scatter plot will make the relationship apparent. Let's start with `sepalWidth` against `petalWidth`:

In [2]:

```
df.plot.scatter(x="sepalWidth", y="petalWidth")
```

Out[2]:

<AxesSubplot:xlabel='sepalWidth', ylabel='petalWidth'>

That's a pretty clear separation.

However, the above plot visualizes one *pair* of variables. **What if 3 or 4 variables are important in determining the species?**

Let's consider the embedding `[petalLength, petalWidth, sepalLength, sepalWidth]`, and use K-Means to cluster the points (flowers) into different groups.

In [3]:

```
from sklearn.cluster import KMeans
```

In [4]:

```
km = KMeans(n_clusters=2)
km.fit(df)
```

Out[4]:

KMeans(n_clusters=2)
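After fitting, the learned cluster centers are stored in the `cluster_centers_` attribute, one row per cluster, with columns in the same order as the features in `df`:

```
# Two centers (one per cluster), each with 4 coordinates
print(km.cluster_centers_.shape)
print(km.cluster_centers_)
```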

Now compute the assignment of each datapoint to its associated cluster:

In [5]:

```
y_hat = km.predict(df)
```

A shortcut: you can do the fitting and predicting in one shot using the `fit_predict` method:

In [6]:

```
y_hat = km.fit_predict(df)
```

Let's visualize the results:

In [7]:

```
df.plot.scatter(x="sepalWidth", y="petalWidth", c=y_hat, cmap="viridis")
```

Out[7]:

<AxesSubplot:xlabel='sepalWidth', ylabel='petalWidth'>

This looks good in most cases, but there are a few points that look incorrect.

Let's try changing the number of clusters:

In [8]:

```
km = KMeans(n_clusters=3)
y_hat = km.fit_predict(df)
```

In [9]:

```
df.plot.scatter(x="sepalWidth", y="petalWidth", c=y_hat, cmap="viridis")
```

Out[9]:

<AxesSubplot:xlabel='sepalWidth', ylabel='petalWidth'>

The plot above only shows one pair of features, `sepalWidth` and `petalWidth`. Let's try visualizing different pairs of variables, like `petalWidth` and `petalLength`, and see how they look:

In [10]:

```
columns = ['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth']
for i in range(4):
    for j in range(i + 1, 4):
        df.plot.scatter(
            x=columns[i], y=columns[j], c=y_hat,
            cmap="viridis", colorbar=False, figsize=(3, 3),
        )
```
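As an alternative to the explicit loop, pandas ships a `scatter_matrix` helper that draws every pairwise scatter plot in one grid (a sketch; passing the cluster labels through the `c` keyword relies on `scatter_matrix` forwarding extra keyword arguments to the underlying scatter calls):

```
from pandas.plotting import scatter_matrix

# One grid with a scatter plot for every pair of the 4 features,
# colored by the predicted cluster labels
scatter_matrix(df, c=y_hat, cmap="viridis", figsize=(8, 8))
```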

**Should n_clusters be 2 or 3?** I can't tell from these plots: two of the clusters are always mashed together.
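One common heuristic is the "elbow method": fit `KMeans` for a range of `n_clusters` values and look at the `inertia_` attribute (the within-cluster sum of squared distances). A minimal sketch, assuming `df` still holds just the 4 numeric feature columns:

```
# Fit KMeans for k = 1..6 and record the inertia after each fit;
# a sharp bend ("elbow") in this curve hints at a good cluster count
inertias = {k: KMeans(n_clusters=k).fit(df).inertia_ for k in range(1, 7)}
pd.Series(inertias).plot(marker="o")
```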

I'd most likely say `n_clusters=2` if I hadn't already seen the underlying dataset. Either way, there are *at least* two groups. Here's the underlying dataset:

In [11]:

```
df = pd.read_csv("iris.csv")
print(df.species.unique())
df.head()
```

['setosa' 'versicolor' 'virginica']

Out[11]:

|   | petalLength | petalWidth | sepalLength | sepalWidth | species |
|---|---|---|---|---|---|
| 0 | 1.4 | 0.2 | 5.1 | 3.5 | setosa |
| 1 | 1.4 | 0.2 | 4.9 | 3.0 | setosa |
| 2 | 1.3 | 0.2 | 4.7 | 3.2 | setosa |
| 3 | 1.5 | 0.2 | 4.6 | 3.1 | setosa |
| 4 | 1.4 | 0.2 | 5.0 | 3.6 | setosa |

`KMeans` performed the clustering -- does it group flowers of the same species together?

Here's the process to check this:

- Re-run our predictions with 3 clusters
- Match the predicted *numerical* labels with the `species` labels
- See how well the predicted labels match the actual labels

In [12]:

```
km = KMeans(n_clusters=3, random_state=42)
features = ['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth']
y_hat = km.fit_predict(df[features])
```

The `random_state` keyword in `KMeans` removes some of the randomness in `KMeans` clustering. Specifying `random_state` as an integer is an easy way to get the same result each time.
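As a quick sanity check, two runs with the same `random_state` produce identical labels (a minimal sketch):

```
# Identical seeds give identical cluster assignments
labels_a = KMeans(n_clusters=3, random_state=42).fit_predict(df[features])
labels_b = KMeans(n_clusters=3, random_state=42).fit_predict(df[features])
print((labels_a == labels_b).all())
```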

Predicted labels are *numeric*:

In [13]:

```
y_hat
```

Out[13]:

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])
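A quick way to see how many flowers landed in each cluster is `numpy.unique` with `return_counts=True`:

```
import numpy as np

# Count the points assigned to each numeric cluster label
labels, counts = np.unique(y_hat, return_counts=True)
print(dict(zip(labels, counts)))
```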

However, the actual labels are text:

In [14]:

```
df["species"].head()
```

Out[14]:

```
0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object
```

It'd be easiest if there were a dictionary that maps between `1` and `setosa`, something like `{1: "setosa", ...}`.

To do that, let's look at the most common label for each numeric label:

In [15]:

```
# First, assign a column in the dataframe
df["numerical_prediction"] = y_hat
# now look at the numerical predictions for each label:
df.numerical_prediction[ df.species == 'virginica' ]
```

Out[15]:

```
100    2
101    0
102    2
103    2
104    2
105    2
106    0
107    2
108    2
109    2
110    2
111    2
112    2
113    0
114    0
115    2
116    2
117    2
118    2
119    0
120    2
121    0
122    2
123    0
124    2
125    2
126    0
127    0
128    2
129    2
130    2
131    2
132    2
133    0
134    2
135    2
136    2
137    2
138    0
139    2
140    2
141    2
142    0
143    2
144    2
145    2
146    0
147    2
148    2
149    0
Name: numerical_prediction, dtype: int32
```
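Rather than eyeballing those 50 values, `value_counts` summarizes them:

```
# Most common numeric label among the true 'virginica' flowers
df.numerical_prediction[df.species == 'virginica'].value_counts()
```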

It looks like 1 means "setosa", 2 means "virginica", and 0 means "versicolor".

Next week, we will learn about an easier method to get these numbers (using `groupby` or `pivot_table`); a sketch of the `groupby` version is below.
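As a preview, the `groupby` version looks something like this (a sketch; `s.mode()[0]` picks the most common numeric label for each species):

```
# For each true species, find the most common predicted numeric label
df.groupby("species")["numerical_prediction"].agg(lambda s: s.mode()[0])
```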

In [16]:

```
mapping = {1: "setosa", 2: "virginica", 0: "versicolor"}
```

In [17]:

```
def get_label(numeric):
    return mapping[numeric]

df["predicted_species"] = df.numerical_prediction.apply(get_label)
print(len(df))
df.head()
```

150

Out[17]:

petalLength | petalWidth | sepalLength | sepalWidth | species | numerical_prediction | predicted_species | |
---|---|---|---|---|---|---|---|

0 | 1.4 | 0.2 | 5.1 | 3.5 | setosa | 1 | setosa |

1 | 1.4 | 0.2 | 4.9 | 3.0 | setosa | 1 | setosa |

2 | 1.3 | 0.2 | 4.7 | 3.2 | setosa | 1 | setosa |

3 | 1.5 | 0.2 | 4.6 | 3.1 | setosa | 1 | setosa |

4 | 1.4 | 0.2 | 5.0 | 3.6 | setosa | 1 | setosa |

In [18]:

```
def accuracy(actual, pred):
    return (actual == pred).sum() / len(actual)

accuracy(df.species, df.predicted_species)
```

Out[18]:

0.8933333333333333

Looks like KMeans finds the groups with 89.33% accuracy!
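scikit-learn also has metrics that compare two clusterings without the manual label matching above; for example, the adjusted Rand index scores the raw numeric labels directly (1.0 means a perfect match up to relabeling, and 0.0 is chance level):

```
from sklearn.metrics import adjusted_rand_score

# Works directly on the string species labels and the numeric predictions
adjusted_rand_score(df.species, y_hat)
```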

**What happens when KMeans gets an incorrect number of classes?** To investigate that, let's create a synthetic dataset in two dimensions.

In [ ]:

```
from sklearn.datasets import make_blobs
import numpy as np
import pandas as pd
X, _ = make_blobs(
    n_samples=1500,
    random_state=170,
)
df = pd.DataFrame(X, columns=["x", "y"])
df.head(n=2)
```

In [ ]:

```
## Your code here -- plot the data. How many clusters are there?
```

`KMeans` will handle this fine -- each blob is pretty well defined and nicely shaped.

But let's try to see how `KMeans` handles a simple error: mis-specifying the number of clusters, `n_clusters`.

In [ ]:

```
## Your code here -- specify 2 clusters in KMeans, and visualize the results
# (hint: add a column to the dataframe and use df.plot.scatter)
#
# What two clusters are mis-clustered as the same class?
```

`KMeans` clusters these two together because they are closer to each other than to the third cluster.

`KMeans` certainly depends on the *data position.* How does `KMeans` depend on the *data shape*?

In [ ]:

```
X, y = make_blobs(n_samples=1500, random_state=170)
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X = np.dot(X, transformation)
df = pd.DataFrame(X, columns=["x", "y"])
df.head(n=2)
```

In [ ]:

```
# Your code here -- plot the data. What does the data look like?
```

In [ ]:

```
## Your code here -- provide some clustering in the `y_pred` variable
# with 3 clusters. What does the clustering do?
# define y_pred, which should be the cluster labels
y_pred = ...
```

In [ ]:

```
df = pd.DataFrame(X, columns=["x", "y"])
df["predicted"] = y_pred
df.plot.scatter(x="x", y="y", c="predicted",
cmap="viridis", colorbar=False)
```

What is `KMeans` trying to do? It finds some cluster centers so that all the points are close to the closest cluster center. That means that it cares more about one effective dimension than the other.
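To make that concrete, here is a minimal NumPy sketch of the Lloyd iteration that K-Means runs under the hood (an illustration, not scikit-learn's actual implementation; it uses random initialization, a fixed iteration count, and assumes no cluster ever ends up empty):

```
import numpy as np

def lloyd_kmeans(X, n_clusters, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers at randomly chosen data points
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its points
        centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(n_clusters)])
    return labels, centers

labels, centers = lloyd_kmeans(X, n_clusters=3)
```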
