Within each neighborhood, there are several versions of each map. In this analysis, we'll cluster the versions of each map, based on the variations among the maps and the size of the region formed by the four ground control point intersections.
For instance, here are two versions of the East Village map.
*(Figure: two versions of the East Village map, side by side.)*
Part of the process of digitizing these maps is to annotate four well-known intersections on each map; these intersection positions are known as ground control points. These annotations are used to transform pixel positions to geographic coordinates, but they can also be used to determine the scale of a map, by treating the four positions as a rectangle. Here is the intersection polygon for the East Village maps.
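As a minimal sketch of how the scale falls out of that rectangle, here are four made-up control points; the real pixel and longitude/latitude values come from the gcp dataset loaded below.

# Sketch: estimating map scale from four ground control points.
# All pixel and longitude/latitude values here are hypothetical.
gcps = {
    'A': ((150, 200), (-117.160, 32.715)),  # (pixel (x, y), (lon, lat))
    'B': ((950, 210), (-117.150, 32.715)),
    'C': ((940, 900), (-117.150, 32.707)),
    'D': ((160, 890), (-117.160, 32.707)),
}

pixel_width = gcps['B'][0][0] - gcps['A'][0][0]  # width of the rectangle in pixels
geo_width = gcps['B'][1][0] - gcps['A'][1][0]    # width in degrees longitude

print(geo_width / pixel_width, 'degrees longitude per pixel')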
We can use the (x, y) coordinates of these rectangles to run a k-means clustering on the maps, and later use the map grouping and scale information as inputs for extracting features that can help identify similar maps in the future.
import seaborn as sns
import metapack as mp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
from kneed import KneeLocator
from shapely import wkt
import dtcv
from dtcv.images import get_image, upper_left
from dtcv.plotting import *
from sklearn.cluster import KMeans
import statsmodels.api as sm
import cv2
base_cmap = plt.get_cmap('gist_rainbow')
%matplotlib inline
sns.set_context('notebook')
mp.jupyter.init()
First, load in the Downtown Homelessness Computer Vision datasets.
pkg = mp.jupyter.open_package()
#pkg = mp.jupyter.open_source_package()
pkg
The `intersection_regions` dataset has URLs for each of the map images, as well as Shapely (WKT format) polygons composed of the intersection points. The `gcp` dataset has the individual, labeled intersections, but for our purposes, the polygons are easier to use.
r = pkg.resource('intersection_regions')
df = r.dataframe()
display(r)
df.head().T
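For reference, the width and height used below can be recovered directly from the WKT polygons with shapely. Here is a sketch, assuming the polygon column is named `region`; the dataset may use a different name, so check `df.columns`.

# Sketch: width and height of an intersection region from its WKT polygon.
# 'region' is an assumed column name for the polygon.
poly = wkt.loads(df.iloc[0].region)
minx, miny, maxx, maxy = poly.bounds
print(maxx - minx, maxy - miny)  # width and height of the region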
For the East Village neighborhood, there are two kinds of maps that are different enough (East Village South) that the neighborhood has two sets of intersections. The `intersections_id` variable identifies this case.
First, let's get a view of what the space of intersection shapes looks like. Here is a plot of all of the map intersection shapes, by width and height, colored by the intersection_id, showing some clear patterns. In particular, it looks like all of the regions for a particular neighborhood lie on a line, which means they are just scaled versions of each other, with a constant aspect ratio. There also appear to be clusters of points.
That gives us two interesting ways to categorize the maps: K-Means clustering, to find the clusters, and fitting them to a line.
Fitting them to a line should be easy, because the lines all pass through the origin, so the slope is just the ratio of y to x.
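For a fit through the origin, least squares gives the slope in closed form, slope = sum(x*y) / sum(x**2), which for points lying exactly on a ray from the origin is just the y-to-x ratio. A quick sketch with made-up points:

# Sketch: the zero-intercept least-squares slope is sum(x*y) / sum(x**2).
# For points with a constant aspect ratio, that is exactly the ratio.
x = np.array([100., 200., 400.])  # hypothetical region widths
y = 0.75 * x                      # heights, with a 0.75 aspect ratio
slope = (x * y).sum() / (x ** 2).sum()
print(slope)  # 0.75, matching sm.OLS(y, x).fit().params[0]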
dims = pd.DataFrame({
    'map_no': df.map_name.astype('category').cat.codes,
    'map_name': df.map_name,
    'x': df.source_shape_x,
    'y': df.source_shape_y,
    'image_url': df.image_url
}).set_index('image_url')
The line fits, shown below, look interesting, but they have high, and very variable, y intercepts. I'd expect these all to be zero, because as the size of the region gets smaller, both the width and the height should approach zero. After this first fit, we'll look at the data again with a fit that does not use a constant, forcing the y intercept to be zero.
groups = dims.groupby(['map_name', 'map_no'])
cmap = resample_reorder_cmap(base_cmap, len(groups))

ax = dims.plot.scatter(x='x', y='y', c='map_no', colormap=cmap)

# Fit a line, with a constant term, to each neighborhood's (width, height) points.
for (name, num), g in groups:
    Y = g.y.to_numpy()
    X = g.x.to_numpy()

    Xc = sm.add_constant(X)
    model = sm.OLS(Y, Xc)
    results = model.fit()
    print(name, results.params)
    ax.plot(X, results.predict(Xc), label=name, color=cmap(num))

plt.legend(loc='center left', bbox_to_anchor=(1.3, 0.5));
Without the constant, it is clearer that the regression lines run through the middle of groups within each map, and that the points are linear within each group. For instance, the red line for Cortez, which runs through Cortez's sea-foam green points, seems to run through two separate groups, one above the regression line and one below. Other neighborhoods, like Columbia, show the same kind of sub-grouping.
groups.size()
groups = dims.groupby(['map_name', 'map_no'])

ax = dims.plot.scatter(x='x', y='y', c='map_no', colormap=cmap)

# Refit each neighborhood without a constant term, forcing the line
# through the origin.
for (name, num), g in groups:
    Y = g.y.to_numpy()
    X = g.x.to_numpy()

    model = sm.OLS(Y, X)
    results = model.fit()
    print(name, results.params[0])
    ax.plot(X, results.predict(X), label=name, color=cmap(num))

plt.legend(loc='center left', bbox_to_anchor=(1.3, 0.5));
The other idea for clustering is k-means. The k-means procedure unfortunately requires knowing how many clusters you want, and we don't know that, but we do know it is a small number. So, we can run the process multiple times and look for the number of clusters where the matching score is sufficiently good. The scores, plotted against the number of clusters, should form a curve with a knee, which you can find easily with the kneed package.
t = dims[dims.map_name == 'gaslamp_16']
X = t[['x', 'y']].values

# Score k-means over a range of cluster counts, then find the knee.
scores = []
for i in range(2, 10):
    kmeans = KMeans(n_clusters=i, random_state=0).fit(X)
    scores.append(kmeans.score(X))

knee = KneeLocator(x=range(2, 10), y=scores)
knee.plot_knee();
Here is the whole procedure to first run multiple k-means to find the optimal clusters, then run it one more time to find the clusters.
##
## Use KMeans to find clusters of region point shapes, to identify different kinds of maps.
##

dims['cluster'] = None

# Force some of the maps to a specific number of clusters.
force_clusters = {
    'east_village_c': 2  # Not clear why it finds 4, not 2
}

for name, g in dims.groupby('map_name'):

    X = g[['x', 'y']].values

    # If there isn't a lot of variation in the shapes, then skip
    # the clustering.
    m = X.mean(axis=0)  # Centroid
    # Distance from the centroid to each point.
    s = [np.linalg.norm(x - m) for x in X]

    if pd.Series(s).std() > 100:

        #
        # Run the KMeans multiple times and look for the knee in the
        # score vs #clusters plot.

        if name not in force_clusters:
            scores = []
            n_clusters = []
            for i in range(2, 10):
                kmeans = KMeans(n_clusters=i, random_state=0).fit(X)
                scores.append(kmeans.score(X))
                n_clusters.append(i)

            knee = KneeLocator(x=range(len(scores)), y=scores).knee
            clusters = n_clusters[knee]
        else:
            clusters = force_clusters[name]

        # Now re-run with that number of clusters.
        kmeans = KMeans(n_clusters=clusters, random_state=0).fit(X)
        pred = kmeans.predict(X)

    else:
        print('Skipping ', name)
        pred = 0

    dims.loc[dims.map_name == name, 'cluster'] = pred
dims['map_cluster'] = dims.apply(lambda r: f"{r.map_name}-{r.cluster}", axis=1)
dims['map_cluster_no'] = dims.map_cluster.astype('category').cat.codes
dims.plot.scatter(x='x',y='y',c='map_cluster_no', colormap='tab20c');
That seemed to work, so let's look at the regression again, this time on the clusters, and ignoring the y-intercept.
groups = dims.groupby(['map_cluster', 'map_cluster_no'])
cmap = resample_reorder_cmap(base_cmap, len(groups))

ax = dims.plot.scatter(x='x', y='y', c='map_cluster_no', colormap=cmap)

# Fit a zero-intercept line to each cluster.
for (name, num), g in groups:
    Y = g.y.to_numpy()
    X = g.x.to_numpy()

    model = sm.OLS(Y, X)
    results = model.fit()
    ax.plot(X, results.predict(X), label=name, color=cmap(num))

plt.legend(loc='center left', bbox_to_anchor=(1.3, 0.5));
These clusters look very good, mostly organized around the regression lines, so we can merge the cluster values back into the main dataset and use the clusters for further analysis.
df = df.merge(dims.reset_index()[['image_url','map_cluster']])
df.head().T
Here are the image and intersection polygon for one map from each of the clusters.
plot_image_and_poly(df.groupby('map_cluster').first(), max_x=3, figsize=(20, 60),
                    titlef=lambda r: r.map_name)