In this notebook, we'll perform clustering on the Penguins dataset using K-means. We'll train on a subset of the data and see how our model generalizes to new, unseen penguins.
The setup below (imports, mounting Google Drive, loading the data, and dropping missing or problematic rows) has been done for you.
from google.colab import drive
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")  # keep the notebook output free of deprecation noise
drive.mount('/content/gdrive')
df = pd.read_csv('/content/gdrive/My Drive/datasets/penguins.csv')
df = df.dropna()       # remove rows with missing values
df = df.drop([9, 14])  # remove two known problematic rows (by index)
# Inspect the results
df.head()
Before we begin processing, let's split our data into training and test sets (80/20 split) using sklearn's train_test_split.
from sklearn.model_selection import train_test_split
# Create train/test split (20% test)
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
print(f"Training set size: {len(df_train)}")
print(f"Test set size: {len(df_test)}")
We think the penguin's sex might be a useful feature for clustering. Let's one-hot encode it for both the train and test sets.
# One-hot encode training data; drop the dummy column from invalid "." sex entries, if present
df_train = pd.get_dummies(df_train).drop("sex_.", axis=1, errors="ignore")
# One-hot encode test data
df_test = pd.get_dummies(df_test).drop("sex_.", axis=1, errors="ignore")
Inspect both dataframes after you've done this.
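Because pd.get_dummies was run separately on the two sets, it's also worth confirming they ended up with identical columns. Here's a minimal sketch of that check (the align call is an extra safeguard, not something the rest of the notebook depends on):
print(df_train.head())
print(df_test.head())
# Make sure both sets have the same dummy columns (fill anything missing with 0)
df_train, df_test = df_train.align(df_test, join="left", axis=1, fill_value=0)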
We'll scale the features to standardize them. Remember to fit the scaler on training data only!
Note: Because this is an unsupervised algorithm, all features in our dataframe become our feature matrix X!
scaler = StandardScaler()
# Fit scaler on training data and transform
X_train = pd.DataFrame(
    scaler.fit_transform(df_train),
    columns=df_train.columns,
    index=df_train.index
)
# Transform test data using the fitted scaler
X_test = pd.DataFrame(
    scaler.transform(df_test),
    columns=df_test.columns,
    index=df_test.index
)
Inspect both X_train and X_test after scaling.
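For example, a quick sanity check (just a sketch): after fitting, the training features should have means near 0 and standard deviations near 1, while the test statistics will be close but not exact.
print(X_train.describe().loc[["mean", "std"]].round(2))
print(X_test.describe().loc[["mean", "std"]].round(2))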
In addition to scaling, we'll reduce the number of features using Principal Component Analysis (PCA). Again, fit only on training data!
# First, determine optimal number of components using training data
pca = PCA(n_components=None)
pca_temp = pca.fit(X_train)
n_components = sum(pca_temp.explained_variance_ratio_ > 0.1)
print(f"Number of components with variance > 0.1: {n_components}")
# Now fit PCA with optimal components
pca = PCA(n_components=n_components)
X_train = pd.DataFrame(
    pca.fit_transform(X_train),
    index=X_train.index
)
# Transform test data
X_test = pd.DataFrame(
    pca.transform(X_test),
    index=X_test.index
)
Inspect both X_train and X_test, as well as n_components.
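For example (a minimal sketch), you can check the resulting shapes and how much of the variance the retained components explain:
print(f"n_components = {n_components}")
print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
print(f"Variance explained by retained components: {pca.explained_variance_ratio_.sum():.2%}")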
You can either guess the number of clusters or use the elbow method to find the optimal k. We'll use the training data for this.
inertia = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(X_train)
    inertia.append(kmeans.inertia_)

plt.plot(range(1, 10), inertia, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()
Look for the "elbow" in the plot where adding more clusters shows diminishing returns.
Pick a number that feels right and assign it to n_clusters:
n_clusters = ...
Apply K-means clustering with your chosen number of clusters.
# Fit K-means on training data
kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X_train)
# Get cluster assignments for training data
train_clusters = kmeans.labels_
# Visualize the training clusters on the first two principal components
plt.scatter(X_train.iloc[:, 0], X_train.iloc[:, 1], c=train_clusters, cmap="viridis")
plt.xlabel("First Principal Component")
plt.ylabel("Second Principal Component")
plt.title(f"K-means Clustering on Training Data (K={n_clusters})")
plt.show()
Now let's apply our clustering model to the test set. While we can't evaluate accuracy (no ground truth labels in unsupervised learning!), we can see how the model assigns clusters to new data.
# Predict clusters for test data
y_pred = kmeans.predict(X_test)
# Show cluster distribution in test set
print("Test set cluster distribution:")
print(pd.Series(y_pred).value_counts().sort_index())
# Visualize test set predictions
plt.figure(figsize=(12, 5))
# Plot training data
plt.subplot(1, 2, 1)
plt.scatter(X_train.iloc[:, 0], X_train.iloc[:, 1], c=train_clusters, cmap="viridis", alpha=0.6)
plt.xlabel("First Principal Component")
plt.ylabel("Second Principal Component")
plt.title(f"Training Data Clusters (K={n_clusters})")
# Plot test data
plt.subplot(1, 2, 2)
plt.scatter(X_test.iloc[:, 0], X_test.iloc[:, 1], c=y_pred, cmap="viridis", alpha=0.6)
plt.xlabel("First Principal Component")
plt.ylabel("Second Principal Component")
plt.title(f"Test Data Predictions")
plt.tight_layout()
plt.show()
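Even without ground-truth labels, internal metrics can give a rough sense of cluster quality. Here's a sketch using sklearn's silhouette score (an extra check I'm assuming here; it isn't used elsewhere in this notebook):
from sklearn.metrics import silhouette_score

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
# Requires at least 2 clusters.
print(f"Train silhouette: {silhouette_score(X_train, train_clusters):.3f}")
print(f"Test silhouette:  {silhouette_score(X_test, y_pred):.3f}")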
# This clustering could be useful for:
# - Identifying different penguin subgroups for conservation efforts
# - Understanding natural groupings in the population
# - Feature engineering for supervised learning tasks
print("\nClustering complete! These groups could represent different penguin subpopulations or behavioral patterns.")