Here's a (hopefully) fun application of K-means clustering. I've provided a CSV of Italian restaurants with their names, latitudes, longitudes, and ratings.
You can read more about this dataset and its fields on Kaggle: https://www.kaggle.com/datasets/jcraggy/nyc-italian-restaurants-plus
Imagine you're a data scientist for a startup that's trying to provide recommendations to restaurant goers. When your customer can't get a table at their desired restaurant, you'd like to recommend an alternative they might find similar.
You're going to do this by k-means clustering the restaurants from this dataset and then visualize by scatter-plotting them on a map.
Load the dataset.
from google.colab import drive
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
drive.mount('/content/gdrive')
df = pd.read_csv("/content/gdrive/My Drive/datasets/nyc_italian.csv")
# Save a copy of the original df before doing any transformations to it; we'll want this later for plotting.
original_df = df.copy()
# Inspect it with df.head()
print("First rows:")
display(df.head())
Do some k-means clustering. Feel free to do a simple k-means, or introduce feature scaling and/or PCA like we did in the Penguins Clustering Exercise (just like great art, great data science is often theft of prior work).
You pick the features to cluster on; you may not feel all of them are relevant to your clustering.
Don't bother with a train/test split this time, just cluster all of the data.
Make sure you name the fitted k-means result kmeans, like we did in the prior exercise. If you call it something else, you'll have to adjust the code below. A minimal sketch of one possible approach follows.
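If you want a starting point, here's a minimal sketch of one possible approach, using the StandardScaler, PCA, and KMeans imported above. It assumes the rating columns Price, Food, Decor, and Service as features and an arbitrary choice of 4 clusters; you should pick your own features and value of k.
# A sketch, not the only answer: scale the rating features, optionally reduce with PCA, then fit k-means.
features = ["Price", "Food", "Decor", "Service"]  # assumed feature columns; choose your own
X = df[features]
# Standardize so no single feature dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)
# Optional: project onto 2 principal components before clustering
X_pca = PCA(n_components=2).fit_transform(X_scaled)
# n_clusters=4 is an arbitrary choice; experiment with other values
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(X_pca)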
Use the new k-means clusters to group and average your original dataframe and print it out. Each row in this new grouped table represents one of your clusters. How would you describe each cluster?
original_df["kmeans"] = kmeans.labels_
clusters_df = original_df.drop(["Case", "Restaurant", "latitude", "longitude"], axis=1).groupby("kmeans").mean()
print("Clusters:")
display(clusters_df)
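It can also help to see how many restaurants land in each cluster. A quick way, using the kmeans column added above:
# How many restaurants fall into each cluster?
display(original_df["kmeans"].value_counts())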
Once you've built your clusters, scatter plot the results at their original latitudes/longitudes over a map of Manhattan using folium.
import seaborn as sns
import matplotlib.pyplot as plt
import folium
fmap = folium.Map(location=[40.7128, -74.0060], zoom_start=12)
colors = ['beige', 'lightblue', 'gray', 'blue', 'darkred', 'lightgreen', 'purple', 'red', 'green', 'lightred', 'white', 'darkblue', 'darkpurple', 'cadetblue', 'orange', 'pink', 'lightgray', 'darkgreen']
# Plot each entry in original_df by its latitude and longitude on the folium map
for index, row in original_df.iterrows():
    # Color the marker by the restaurant's k-means cluster
    color = colors[kmeans.labels_[index]]
    description = f"{row['Restaurant']} price={row['Price']} food={row['Food']} decor={row['Decor']} service={row['Service']}"
    folium.Marker([row["latitude"], row["longitude"]], popup=description, icon=folium.Icon(color=color)).add_to(fmap)
# Display the map
display(fmap)
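If the inline display doesn't render in your environment, you can also write the map to a standalone HTML file and open it in a browser (the filename here is just an example):
# Optional: save the map to an HTML file you can open outside the notebook
fmap.save("nyc_italian_clusters.html")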