#!/usr/bin/env python
# coding: utf-8

# # **How many different species of flowers were measured?**

# ## Read data

# In[1]:


import pandas as pd

df = pd.read_csv("iris.csv").iloc[:, :4]
df.head()


# This dataset contains measurements for different types of flowers: the length and width of each flower's petals and sepals.

# ## Visualize data

# Let's do the most basic level of investigation: looking at the data!
#
# There are only 4 features for each flower measurement. Let's visualize all pairs of features, e.g., plotting `petalLength` against `sepalLength` in a scatter plot. If there's a clear relationship between a pair of variables, this will make it visible.

# It looks like there are at least two species of flowers in this dataset -- for example, look at `sepalWidth` against `petalWidth`:

# In[2]:


df.plot.scatter(x="sepalWidth", y="petalWidth")


# That's a pretty clear separation.
#
# However, the above plot visualizes one *pair* of variables. **What if 3 or 4 variables are important in determining the species?**

# ## Clustering

# Let's consider the embedding `[petalLength, petalWidth, sepalLength, sepalWidth]`, and let's use K-Means to cluster the points (flowers) into different groups.

# In[3]:


from sklearn.cluster import KMeans


# Scikit-Learn implements many machine learning algorithms. More details on KMeans specifically can be found in the user guide, which walks through some examples: https://scikit-learn.org/stable/modules/clustering.html#k-means

# In[4]:


km = KMeans(n_clusters=2)
km.fit(df)


# Now compute the assignment of each datapoint to its associated cluster:

# In[5]:


y_hat = km.predict(df)


# A shortcut: you can do the fitting and predicting in one shot with the `fit_predict` method:

# In[6]:


y_hat = km.fit_predict(df)


# Let's visualize the results:

# In[7]:


df.plot.scatter(x="sepalWidth", y="petalWidth", c=y_hat, cmap="viridis")


# This looks good in most cases, but there are a few points that look incorrect.
#
# Let's try changing the number of clusters:

# In[8]:


km = KMeans(n_clusters=3)
y_hat = km.fit_predict(df)


# In[9]:


df.plot.scatter(x="sepalWidth", y="petalWidth", c=y_hat, cmap="viridis")


# I can't tell if that's better from this one pair of variables, `sepalWidth` and `petalWidth`. Let's try visualizing the different pairs of variables and see how they look:

# In[10]:


columns = ['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth']

for i in range(4):
    for j in range(i + 1, 4):
        df.plot.scatter(
            x=columns[i], y=columns[j],
            c=y_hat, cmap="viridis", colorbar=False, figsize=(3, 3),
        )
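# The loop above draws each pair on its own figure. As a side note, pandas ships a helper, `pandas.plotting.scatter_matrix`, that draws every pairwise scatter plot in a single grid. Here's a minimal sketch of using it, assuming the same `df`, `columns`, and `y_hat` as in the cells above:

# In[ ]:


from pandas.plotting import scatter_matrix

# One grid containing all pairwise scatter plots, with points colored by
# cluster assignment and a histogram of each feature on the diagonal.
scatter_matrix(df[columns], c=y_hat, cmap="viridis", diagonal="hist", figsize=(8, 8))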
# Two of the classes are always mashed together. **Should `n_clusters` be 2 or 3?** I can't tell from these plots.
#
# I'd most likely say `n_clusters=2` if I hadn't already seen the underlying dataset. Either way, there are *at least* two groups. Here's the underlying dataset:

# In[11]:


df = pd.read_csv("iris.csv")
print(df.species.unique())
df.head()


# This example of an unknown number of clusters will be revisited later in this lecture. First, let's see how accurately `KMeans` performed the clustering -- does it group flowers of the same species together?

# ## How accurate is the clustering?

# Here's the process to check this:
#
# 1. Re-run our predictions with 3 clusters
# 2. Match the predicted *numerical* labels to the `species` labels
# 3. See how many predicted labels match the actual labels

# ### Re-running predictions

# In[12]:


km = KMeans(n_clusters=3, random_state=42)
features = ['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth']
y_hat = km.fit_predict(df[features])


# Only the features are used, not the true species labels -- it's the job of the clustering algorithm to recover those groups.

# The `random_state` keyword removes some of the randomness in `KMeans` clustering. Specifying `random_state` as an integer is an easy way to get the same result each time.

# ### Matching numerical predictions with species labels
#
# Predicted labels are *numeric*:

# In[13]:


y_hat


# However, the actual labels are text:

# In[14]:


df["species"].head()


# It'd be easiest if we had a dictionary that maps between the numeric label and the species name, something like `{1: "setosa", ...}`.
#
# To build it, let's look at the most common numeric label for each species:

# In[15]:


# First, assign a column in the dataframe
df["numerical_prediction"] = y_hat

# Now look at the numerical predictions for one species:
df.numerical_prediction[df.species == 'virginica']


# It looks like 1 means "setosa", 2 means "virginica" and 0 means "versicolor".
#
# Next week, we will learn about an easier method to get these numbers (using `groupby` or `pivot_table`).

# In[16]:


mapping = {1: "setosa", 2: "virginica", 0: "versicolor"}


# In[17]:


def get_label(numeric):
    return mapping[numeric]

df["predicted_species"] = df.numerical_prediction.apply(get_label)
print(len(df))
df.head()


# ### Difference between predicted and actual labels

# In[18]:


def accuracy(actual, pred):
    return (actual == pred).sum() / len(actual)

accuracy(df.species, df.predicted_species)


# Looks like `KMeans` finds the groups with 89.33% accuracy!
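# As a side note, scikit-learn also provides clustering metrics that handle the label matching for you. A common one is the adjusted Rand index, which scores how well two labelings agree regardless of how the clusters are numbered. Here's a minimal sketch, assuming the `df` and `y_hat` from the cells above:

# In[ ]:


from sklearn.metrics import adjusted_rand_score

# 1.0 means the clusters reproduce the species exactly;
# values near 0.0 mean the grouping is no better than chance.
adjusted_rand_score(df["species"], y_hat)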
# ---
#
# # **PROBLEMS**

# ## Question 1: Number of clusters

# **What happens when `KMeans` gets an incorrect number of clusters?** To investigate that, let's create a synthetic dataset in two dimensions.

# In[ ]:


from sklearn.datasets import make_blobs
import numpy as np
import pandas as pd

X, _ = make_blobs(
    n_samples=1500,
    random_state=170,
)
df = pd.DataFrame(X, columns=["x", "y"])
df.head(n=2)


# In[ ]:


## Your code here -- plot the data. How many clusters are there?


# `KMeans` will handle this fine -- each blob is pretty well defined and nicely shaped.
#
# But let's see how `KMeans` handles a simple error: mis-specifying the number of clusters.

# ## Question 2: Mis-specification of `n_clusters`

# In[ ]:


## Your code here -- specify 2 clusters in KMeans, and visualize the results
## (hint: add a column to the dataframe and use df.plot.scatter)


# Which two blobs are mis-clustered as the same class?
#
# This makes sense: the two blobs whose centers are closest together get merged, because `KMeans` sees their points as closer to each other than to the third blob.

# ## Question 3: Data shape

# `KMeans` certainly depends on the *data position.*
#
# How does `KMeans` depend on the *data shape*?

# In[ ]:


X, y = make_blobs(n_samples=1500, random_state=170)
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X = np.dot(X, transformation)

df = pd.DataFrame(X, columns=["x", "y"])
df.head(n=2)


# In[ ]:


## Your code here -- plot the data. What does the data look like?


# The groups are now unevenly shaped. Let's try `KMeans` again with the correct number of clusters and see what happens.

# In[ ]:


## Your code here -- provide a clustering in the `y_pred` variable
## with 3 clusters. What does the clustering do?

# define y_pred, which should be the cluster labels
y_pred = ...


# In[ ]:


df = pd.DataFrame(X, columns=["x", "y"])
df["predicted"] = y_pred
df.plot.scatter(x="x", y="y", c="predicted", cmap="viridis", colorbar=False)


# What is `KMeans` trying to do? It finds cluster centers so that every point is close to its nearest center, where "close" means plain (squared) Euclidean distance. Because that distance treats every direction equally, `KMeans` implicitly expects compact, roughly round clusters -- so when the groups are stretched like these, the elongated direction dominates the distance and the cluster boundaries cut across the true groups.
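# To make "close to the closest cluster center" concrete: the quantity `KMeans` minimizes is the sum of squared distances from each point to its assigned center, which scikit-learn exposes as `inertia_` on a fitted model. Here's a minimal sketch of checking that by hand, assuming the transformed `X` from the cells above:

# In[ ]:


import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, random_state=42).fit(X)

# Center assigned to each point, then the squared distance to that center.
assigned_centers = km.cluster_centers_[km.labels_]
sq_dist = ((X - assigned_centers) ** 2).sum(axis=1)

# The hand-computed objective should match km.inertia_.
print(sq_dist.sum(), km.inertia_)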