In [1]:

```
import pandas as pd
df = pd.read_csv("iris.csv").iloc[:, :4]
df.head()
```

Out[1]:

|   | petalLength | petalWidth | sepalLength | sepalWidth |
|---|---|---|---|---|
| 0 | 1.4 | 0.2 | 5.1 | 3.5 |
| 1 | 1.4 | 0.2 | 4.9 | 3.0 |
| 2 | 1.3 | 0.2 | 4.7 | 3.2 |
| 3 | 1.5 | 0.2 | 4.6 | 3.1 |
| 4 | 1.4 | 0.2 | 5.0 | 3.6 |

Let's do the most basic level of investigation: looking at the data!

There are only 4 features for each flower measurement, so let's visualize pairs of features with scatter plots, e.g., plotting `petalLength` against `sepalLength`. If there's a clear relation between a pair of variables, a scatter plot will make the relationship apparent. Let's start with `sepalWidth` against `petalWidth`:

In [2]:

```
df.plot.scatter(x="sepalWidth", y="petalWidth")
```

Out[2]:

<AxesSubplot:xlabel='sepalWidth', ylabel='petalWidth'>

That's a pretty clear separation.

However, the above plot visualizes one *pair* of variables. **What if 3 or 4 variables are important in determining the species?**

Let's consider the embedding `[petalLength, petalWidth, sepalLength, sepalWidth]`, and use K-Means to cluster the points (flowers) into different groups.

In [3]:

```
from sklearn.cluster import KMeans
```

In [4]:

```
km = KMeans(n_clusters=2)
km.fit(df)
```

Out[4]:

KMeans(n_clusters=2)
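After fitting, the learned cluster centers are stored in the `cluster_centers_` attribute, one row per cluster, with columns in the same order as the features in `df`:

```
# Two centers (one per cluster), each with 4 coordinates
print(km.cluster_centers_.shape)
print(km.cluster_centers_)
```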

Now compute the assignment of each datapoint to its associated cluster:

In [5]:

```
y_hat = km.predict(df)
```

A shortcut: you can do the fitting and predicting in one shot using the `fit_predict` method:

In [6]:

```
y_hat = km.fit_predict(df)
```

Let's visualize the results:

In [7]:

```
df.plot.scatter(x="sepalWidth", y="petalWidth", c=y_hat, cmap="viridis")
```

Out[7]:

<AxesSubplot:xlabel='sepalWidth', ylabel='petalWidth'>

This looks good in most cases, but there are a few points that look incorrect.

Let's try changing the number of clusters:

In [8]:

```
km = KMeans(n_clusters=3)
y_hat = km.fit_predict(df)
```

In [9]:

```
df.plot.scatter(x="sepalWidth", y="petalWidth", c=y_hat, cmap="viridis")
```

Out[9]:

<AxesSubplot:xlabel='sepalWidth', ylabel='petalWidth'>

The plot above only shows one pair of features, `sepalWidth` and `petalWidth`. Let's try visualizing different pairs of variables, like `petalWidth` and `petalLength`, and see how they look:

In [10]:

```
columns = ['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth']
for i in range(4):
    for j in range(i + 1, 4):
        df.plot.scatter(
            x=columns[i], y=columns[j], c=y_hat,
            cmap="viridis", colorbar=False, figsize=(3, 3),
        )
```
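As an alternative to the explicit loop, pandas ships a `scatter_matrix` helper that draws every pairwise scatter plot in one grid (a sketch; passing the cluster labels through the `c` keyword relies on `scatter_matrix` forwarding extra keyword arguments to the underlying scatter calls):

```
from pandas.plotting import scatter_matrix

# One grid with a scatter plot for every pair of the 4 features,
# colored by the predicted cluster labels
scatter_matrix(df, c=y_hat, cmap="viridis", figsize=(8, 8))
```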

**Should n_clusters be 2 or 3?** I can't tell from these plots: two of the clusters are always mashed together.
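One common heuristic is the "elbow method": fit `KMeans` for a range of `n_clusters` values and look at the `inertia_` attribute (the within-cluster sum of squared distances). A minimal sketch, assuming `df` still holds just the 4 numeric feature columns:

```
# Fit KMeans for k = 1..6 and record the inertia after each fit;
# a sharp bend ("elbow") in this curve hints at a good cluster count
inertias = {k: KMeans(n_clusters=k).fit(df).inertia_ for k in range(1, 7)}
pd.Series(inertias).plot(marker="o")
```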

I'd most likely say `n_clusters=2` if I hadn't already seen the underlying dataset. Either way, there are *at least* two groups. Here's the underlying dataset:

In [11]:

```
df = pd.read_csv("iris.csv")
print(df.species.unique())
df.head()
```

['setosa' 'versicolor' 'virginica']

Out[11]:

|   | petalLength | petalWidth | sepalLength | sepalWidth | species |
|---|---|---|---|---|---|
| 0 | 1.4 | 0.2 | 5.1 | 3.5 | setosa |
| 1 | 1.4 | 0.2 | 4.9 | 3.0 | setosa |
| 2 | 1.3 | 0.2 | 4.7 | 3.2 | setosa |
| 3 | 1.5 | 0.2 | 4.6 | 3.1 | setosa |
| 4 | 1.4 | 0.2 | 5.0 | 3.6 | setosa |

`KMeans` performed the clustering -- does it group flowers of the same species together?

Here's the process to check this:

- Re-run our predictions with 3 clusters
- Match the predicted *numerical* labels with the `species` labels
- See how well the predicted labels match the actual labels

In [12]:

```
km = KMeans(n_clusters=3, random_state=42)
features = ['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth']
y_hat = km.fit_predict(df[features])
```

The `random_state` keyword in `KMeans` removes some of the randomness in `KMeans` clustering. Specifying `random_state` as an integer is an easy way to get the same result each time.
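As a quick sanity check, two runs with the same `random_state` produce identical labels (a minimal sketch):

```
# Identical seeds give identical cluster assignments
labels_a = KMeans(n_clusters=3, random_state=42).fit_predict(df[features])
labels_b = KMeans(n_clusters=3, random_state=42).fit_predict(df[features])
print((labels_a == labels_b).all())
```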

Predicted labels are *numeric*:

In [13]:

```
y_hat
```

Out[13]:

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])
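A quick way to see how many flowers landed in each cluster is `numpy.unique` with `return_counts=True`:

```
import numpy as np

# Count the points assigned to each numeric cluster label
labels, counts = np.unique(y_hat, return_counts=True)
print(dict(zip(labels, counts)))
```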

However, the actual labels are text:

In [14]:

```
df["species"].head()
```

Out[14]:

```
0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object
```

It'd be easiest if there were a dictionary that maps between `1` and `setosa`, something like `{1: "setosa", ...}`.

To do that, let's look at the most common label for each numeric label:

In [15]:

```
# First, assign a column in the dataframe
df["numerical_prediction"] = y_hat
# now look at the numerical predictions for each label:
df.numerical_prediction[ df.species == 'virginica' ]
```

Out[15]:

```
100    2
101    0
102    2
103    2
104    2
105    2
106    0
107    2
108    2
109    2
110    2
111    2
112    2
113    0
114    0
115    2
116    2
117    2
118    2
119    0
120    2
121    0
122    2
123    0
124    2
125    2
126    0
127    0
128    2
129    2
130    2
131    2
132    2
133    0
134    2
135    2
136    2
137    2
138    0
139    2
140    2
141    2
142    0
143    2
144    2
145    2
146    0
147    2
148    2
149    0
Name: numerical_prediction, dtype: int32
```
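Rather than eyeballing those 50 values, `value_counts` summarizes them:

```
# Most common numeric label among the true 'virginica' flowers
df.numerical_prediction[df.species == 'virginica'].value_counts()
```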

It looks like 1 means "setosa", 2 means "virginica", and 0 means "versicolor".

Next week, we will learn about an easier method to get these numbers (using `groupby` or `pivot_table`); a sketch of the `groupby` version is below.
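As a preview, the `groupby` version looks something like this (a sketch; `s.mode()[0]` picks the most common numeric label for each species):

```
# For each true species, find the most common predicted numeric label
df.groupby("species")["numerical_prediction"].agg(lambda s: s.mode()[0])
```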

In [16]:

```
mapping = {1: "setosa", 2: "virginica", 0: "versicolor"}
```

In [17]:

```
def get_label(numeric):
    return mapping[numeric]

df["predicted_species"] = df.numerical_prediction.apply(get_label)
print(len(df))
df.head()
```

150

Out[17]:

petalLength | petalWidth | sepalLength | sepalWidth | species | numerical_prediction | predicted_species | |
---|---|---|---|---|---|---|---|

0 | 1.4 | 0.2 | 5.1 | 3.5 | setosa | 1 | setosa |

1 | 1.4 | 0.2 | 4.9 | 3.0 | setosa | 1 | setosa |

2 | 1.3 | 0.2 | 4.7 | 3.2 | setosa | 1 | setosa |

3 | 1.5 | 0.2 | 4.6 | 3.1 | setosa | 1 | setosa |

4 | 1.4 | 0.2 | 5.0 | 3.6 | setosa | 1 | setosa |

In [18]:

```
def accuracy(actual, pred):
    return (actual == pred).sum() / len(actual)

accuracy(df.species, df.predicted_species)
```

Out[18]:

0.8933333333333333

Looks like KMeans finds the groups with 89.33% accuracy!
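scikit-learn also has metrics that compare two clusterings without the manual label matching above; for example, the adjusted Rand index scores the raw numeric labels directly (1.0 means a perfect match up to relabeling, and 0.0 is chance level):

```
from sklearn.metrics import adjusted_rand_score

# Works directly on the string species labels and the numeric predictions
adjusted_rand_score(df.species, y_hat)
```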

**What happens when KMeans gets an incorrect number of classes?** To investigate that, let's create a synthetic dataset in two dimensions.

In [ ]:

```
from sklearn.datasets import make_blobs
import numpy as np
import pandas as pd
X, _ = make_blobs(
    n_samples=1500,
    random_state=170,
)
df = pd.DataFrame(X, columns=["x", "y"])
df.head(n=2)
```

In [ ]:

```
## Your code here -- plot the data. How many clusters are there?
```

`KMeans` will handle this fine -- each blob is pretty well defined and nicely shaped.

But let's try to see how `KMeans` handles a simple error: mis-specifying the number of clusters, `n_clusters`.

In [ ]:

```
## Your code here -- specify 2 clusters in KMeans, and visualize the results
# (hint: add a column to the dataframe and use df.plot.scatter)
#
# What two clusters are mis-clustered as the same class?
```

`KMeans` clusters these two together because they are closer to each other than to the third cluster.

`KMeans` certainly depends on the *data position.* How does `KMeans` depend on the *data shape*?

In [ ]:

```
X, y = make_blobs(n_samples=1500, random_state=170)
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X = np.dot(X, transformation)
df = pd.DataFrame(X, columns=["x", "y"])
df.head(n=2)
```

In [ ]:

```
# Your code here -- plot the data. What does the data look like?
```

In [ ]:

```
## Your code here -- provide some clustering in the `y_pred` variable
# with 3 clusters. What does the clustering do?
# define y_pred, which should be the cluster labels
y_pred = ...
```

In [ ]:

```
df = pd.DataFrame(X, columns=["x", "y"])
df["predicted"] = y_pred
df.plot.scatter(x="x", y="y", c="predicted",
cmap="viridis", colorbar=False)
```

What is `KMeans` trying to do? It finds some cluster centers so that all the points are close to the closest cluster center. That means that it cares more about one effective dimension than the other.
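To make that concrete, here is a minimal NumPy sketch of the Lloyd iteration that K-Means runs under the hood (an illustration, not scikit-learn's actual implementation; it uses random initialization, a fixed iteration count, and assumes no cluster ever ends up empty):

```
import numpy as np

def lloyd_kmeans(X, n_clusters, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers at randomly chosen data points
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its points
        centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(n_clusters)])
    return labels, centers

labels, centers = lloyd_kmeans(X, n_clusters=3)
```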
