In [1]:

```
import pandas as pd
df = pd.read_csv("iris.csv").iloc[:, :4]
df.head()
```

Out[1]:

|   | petalLength | petalWidth | sepalLength | sepalWidth |
|---|---|---|---|---|
| 0 | 1.4 | 0.2 | 5.1 | 3.5 |
| 1 | 1.4 | 0.2 | 4.9 | 3.0 |
| 2 | 1.3 | 0.2 | 4.7 | 3.2 |
| 3 | 1.5 | 0.2 | 4.6 | 3.1 |
| 4 | 1.4 | 0.2 | 5.0 | 3.6 |

Let's do the most basic level of investigation: looking at the data!

There are only 4 features for each flower measurement, so let's visualize pairs of features with scatter plots, e.g., plotting `petalLength` against `sepalLength`. If there's a clear relation between a pair of variables, a scatter plot will make the relationship apparent. Let's start with `sepalWidth` against `petalWidth`:

In [2]:

```
df.plot.scatter(x="sepalWidth", y="petalWidth")
```

Out[2]:

<AxesSubplot:xlabel='sepalWidth', ylabel='petalWidth'>

That's a pretty clear separation.

However, the above plot visualizes one *pair* of variables. **What if 3 or 4 variables are important in determining the species?**

Let's consider the embedding `[petalLength, petalWidth, sepalLength, sepalWidth]`, and use K-Means to cluster the points (flowers) into different groups.

In [3]:

```
from sklearn.cluster import KMeans
```

In [4]:

```
km = KMeans(n_clusters=2)
km.fit(df)
```

Out[4]:

KMeans(n_clusters=2)
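After fitting, the learned cluster centers are stored in the `cluster_centers_` attribute, one row per cluster, with columns in the same order as the features in `df`:

```
# Two centers (one per cluster), each with 4 coordinates
print(km.cluster_centers_.shape)
print(km.cluster_centers_)
```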

Now compute the assignment of each datapoint to its associated cluster:

In [5]:

```
y_hat = km.predict(df)
```

A shortcut: you can do the fitting and predicting in one shot using the `fit_predict` method:

In [6]:

```
y_hat = km.fit_predict(df)
```

Let's visualize the results:

In [7]:

```
df.plot.scatter(x="sepalWidth", y="petalWidth", c=y_hat, cmap="viridis")
```

Out[7]:

<AxesSubplot:xlabel='sepalWidth', ylabel='petalWidth'>

This looks good in most cases, but there are a few points that look incorrect.

Let's try changing the number of clusters:

In [8]:

```
km = KMeans(n_clusters=3)
y_hat = km.fit_predict(df)
```

In [9]:

```
df.plot.scatter(x="sepalWidth", y="petalWidth", c=y_hat, cmap="viridis")
```

Out[9]:

<AxesSubplot:xlabel='sepalWidth', ylabel='petalWidth'>

The plot above only shows one pair of features, `sepalWidth` and `petalWidth`. Let's try visualizing different pairs of variables, like `petalWidth` and `petalLength`, and see how they look:

In [10]:

```
columns = ['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth']
for i in range(4):
    for j in range(i + 1, 4):
        df.plot.scatter(
            x=columns[i], y=columns[j], c=y_hat,
            cmap="viridis", colorbar=False, figsize=(3, 3),
        )
```
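As an alternative to the explicit loop, pandas ships a `scatter_matrix` helper that draws every pairwise scatter plot in one grid (a sketch; passing the cluster labels through the `c` keyword relies on `scatter_matrix` forwarding extra keyword arguments to the underlying scatter calls):

```
from pandas.plotting import scatter_matrix

# One grid with a scatter plot for every pair of the 4 features,
# colored by the predicted cluster labels
scatter_matrix(df, c=y_hat, cmap="viridis", figsize=(8, 8))
```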

**Should n_clusters be 2 or 3?** I can't tell from these plots: two of the clusters are always mashed together.
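One common heuristic is the "elbow method": fit `KMeans` for a range of `n_clusters` values and look at the `inertia_` attribute (the within-cluster sum of squared distances). A minimal sketch, assuming `df` still holds just the 4 numeric feature columns:

```
# Fit KMeans for k = 1..6 and record the inertia after each fit;
# a sharp bend ("elbow") in this curve hints at a good cluster count
inertias = {k: KMeans(n_clusters=k).fit(df).inertia_ for k in range(1, 7)}
pd.Series(inertias).plot(marker="o")
```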

I'd most likely say `n_clusters=2` if I hadn't already seen the underlying dataset. Either way, there are *at least* two groups. Here's the underlying dataset:

In [11]:

```
df = pd.read_csv("iris.csv")
print(df.species.unique())
df.head()
```

['setosa' 'versicolor' 'virginica']

Out[11]:

|   | petalLength | petalWidth | sepalLength | sepalWidth | species |
|---|---|---|---|---|---|
| 0 | 1.4 | 0.2 | 5.1 | 3.5 | setosa |
| 1 | 1.4 | 0.2 | 4.9 | 3.0 | setosa |
| 2 | 1.3 | 0.2 | 4.7 | 3.2 | setosa |
| 3 | 1.5 | 0.2 | 4.6 | 3.1 | setosa |
| 4 | 1.4 | 0.2 | 5.0 | 3.6 | setosa |

`KMeans` performed the clustering -- does it group flowers of the same species together?

Here's the process to check this:

- Re-run our predictions with 3 clusters
- Match the predicted *numerical* labels with the `species` labels
- See how well the predicted labels match the actual labels

In [12]:

```
km = KMeans(n_clusters=3, random_state=42)
features = ['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth']
y_hat = km.fit_predict(df[features])
```

The `random_state` keyword in `KMeans` removes some of the randomness in `KMeans` clustering. Specifying `random_state` as an integer is an easy way to get the same result each time.
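As a quick sanity check, two runs with the same `random_state` produce identical labels (a minimal sketch):

```
# Identical seeds give identical cluster assignments
labels_a = KMeans(n_clusters=3, random_state=42).fit_predict(df[features])
labels_b = KMeans(n_clusters=3, random_state=42).fit_predict(df[features])
print((labels_a == labels_b).all())
```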

Predicted labels are *numeric*:

In [13]:

```
y_hat
```

Out[13]:

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])
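A quick way to see how many flowers landed in each cluster is `numpy.unique` with `return_counts=True`:

```
import numpy as np

# Count the points assigned to each numeric cluster label
labels, counts = np.unique(y_hat, return_counts=True)
print(dict(zip(labels, counts)))
```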

However, the actual labels are text:

In [14]:

```
df["species"].head()
```

Out[14]:

```
0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object
```

It'd be easiest if there were a dictionary that maps between `1` and `setosa`, something like `{1: "setosa", ...}`.

To do that, let's look at the most common label for each numeric label:

In [15]:

```
# First, assign a column in the dataframe
df["numerical_prediction"] = y_hat
# now look at the numerical predictions for each label:
df.numerical_prediction[ df.species == 'virginica' ]
```

Out[15]:

```
100    2
101    0
102    2
103    2
104    2
105    2
106    0
107    2
108    2
109    2
110    2
111    2
112    2
113    0
114    0
115    2
116    2
117    2
118    2
119    0
120    2
121    0
122    2
123    0
124    2
125    2
126    0
127    0
128    2
129    2
130    2
131    2
132    2
133    0
134    2
135    2
136    2
137    2
138    0
139    2
140    2
141    2
142    0
143    2
144    2
145    2
146    0
147    2
148    2
149    0
Name: numerical_prediction, dtype: int32
```
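Rather than eyeballing those 50 values, `value_counts` summarizes them:

```
# Most common numeric label among the true 'virginica' flowers
df.numerical_prediction[df.species == 'virginica'].value_counts()
```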

It looks like 1 means "setosa", 2 means "virginica", and 0 means "versicolor".

Next week, we will learn about an easier method to get these numbers (using `groupby` or `pivot_table`); a sketch of the `groupby` version is below.
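As a preview, the `groupby` version looks something like this (a sketch; `s.mode()[0]` picks the most common numeric label for each species):

```
# For each true species, find the most common predicted numeric label
df.groupby("species")["numerical_prediction"].agg(lambda s: s.mode()[0])
```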

In [16]:

```
mapping = {1: "setosa", 2: "virginica", 0: "versicolor"}
```

In [17]:

```
def get_label(numeric):
    return mapping[numeric]

df["predicted_species"] = df.numerical_prediction.apply(get_label)
print(len(df))
df.head()
```

150

Out[17]:

petalLength | petalWidth | sepalLength | sepalWidth | species | numerical_prediction | predicted_species | |
---|---|---|---|---|---|---|---|

0 | 1.4 | 0.2 | 5.1 | 3.5 | setosa | 1 | setosa |

1 | 1.4 | 0.2 | 4.9 | 3.0 | setosa | 1 | setosa |

2 | 1.3 | 0.2 | 4.7 | 3.2 | setosa | 1 | setosa |

3 | 1.5 | 0.2 | 4.6 | 3.1 | setosa | 1 | setosa |

4 | 1.4 | 0.2 | 5.0 | 3.6 | setosa | 1 | setosa |

In [18]:

```
def accuracy(actual, pred):
    return (actual == pred).sum() / len(actual)

accuracy(df.species, df.predicted_species)
```

Out[18]:

0.8933333333333333

Looks like KMeans finds the groups with 89.33% accuracy!
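scikit-learn also has metrics that compare two clusterings without the manual label matching above; for example, the adjusted Rand index scores the raw numeric labels directly (1.0 means a perfect match up to relabeling, and 0.0 is chance level):

```
from sklearn.metrics import adjusted_rand_score

# Works directly on the string species labels and the numeric predictions
adjusted_rand_score(df.species, y_hat)
```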

**What happens when KMeans gets an incorrect number of classes?** To investigate that, let's create a synthetic dataset in two dimensions.

In [ ]:

```
from sklearn.datasets import make_blobs
import numpy as np
import pandas as pd
X, _ = make_blobs(
    n_samples=1500,
    random_state=170,
)
df = pd.DataFrame(X, columns=["x", "y"])
df.head(n=2)
```

In [ ]:

```
## Your code here -- plot the data. How many clusters are there?
```

`KMeans` will handle this fine -- each blob is pretty well defined and nicely shaped.

But let's try to see how `KMeans` handles a simple error: mis-specifying the number of clusters, `n_clusters`.

In [ ]:

```
## Your code here -- specify 2 clusters in KMeans, and visualize the results
# (hint: add a column to the dataframe and use df.plot.scatter)
#
# What two clusters are mis-clustered as the same class?
```

`KMeans` clusters these two together because they are closer to each other than to the third cluster.

`KMeans` certainly depends on the *data position.* How does `KMeans` depend on the *data shape*?

In [ ]:

```
X, y = make_blobs(n_samples=1500, random_state=170)
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X = np.dot(X, transformation)
df = pd.DataFrame(X, columns=["x", "y"])
df.head(n=2)
```

In [ ]:

```
# Your code here -- plot the data. What does the data look like?
```

In [ ]:

```
## Your code here -- provide some clustering in the `y_pred` variable
# with 3 clusters. What does the clustering do?
# define y_pred, which should be the cluster labels
y_pred = ...
```

In [ ]:

```
df = pd.DataFrame(X, columns=["x", "y"])
df["predicted"] = y_pred
df.plot.scatter(x="x", y="y", c="predicted",
cmap="viridis", colorbar=False)
```

What is `KMeans` trying to do? It finds some cluster centers so that all the points are close to the closest cluster center. That means that it cares more about one effective dimension than the other.
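To make that concrete, here is a minimal NumPy sketch of the Lloyd iteration that K-Means runs under the hood (an illustration, not scikit-learn's actual implementation; it uses random initialization, a fixed iteration count, and assumes no cluster ever ends up empty):

```
import numpy as np

def lloyd_kmeans(X, n_clusters, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers at randomly chosen data points
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its points
        centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(n_clusters)])
    return labels, centers

labels, centers = lloyd_kmeans(X, n_clusters=3)
```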
