# Silence library warnings to keep the notebook output clean
import warnings

warnings.filterwarnings("ignore")
The world's hardest questions are complex and multi-faceted. Effective methods to learn from data recognize this. Many questions and challenges are inherently multidimensional; they are affected, shaped, and defined by many different components all acting simultaneously. In statistical terms, these processes are called multivariate processes, as opposed to univariate processes, where only a single variable acts at once. Clustering is a fundamental method of geographical analysis that draws insights from large, complex multivariate processes. It works by finding similarities among the many dimensions in a multivariate process, condensing them down into a simpler representation. Thus, through clustering, a complex and difficult-to-understand process is recast into a simpler one that even non-technical audiences can use.
Clustering (as we discuss it in this chapter) borrows heavily from unsupervised statistical learning {cite}`friedman2001elements`.
Often, clustering involves sorting observations into groups without any prior idea about what the groups are (or, in machine learning jargon, without any labels, hence the unsupervised name).
These groups are delineated so that members of a group should be more
similar to one another than they are to members of a different group.
Each group is referred to as a cluster, while the process of assigning
objects to groups is known as clustering. If done well, these clusters can be
characterized by their profile, a simple summary of what members of a group are like in terms of the original multivariate phenomenon.
Since a good cluster is more similar internally than it is to any other cluster, these cluster-level profiles provide a convenient shorthand to describe the original complex multivariate phenomenon we are interested in. Observations in one group may have consistently high scores on some traits but low scores on others. The analyst only needs to look at the profile of a cluster in order to get a good sense of what all the observations in that cluster are like, instead of having to consider all of the complexities of the original multivariate process at once. Throughout data science, and particularly in geographic data science, clustering is widely used to provide insights on the (geographic) structure of complex multivariate (spatial) data.
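To make the notion of a profile tangible, here is a minimal, purely hypothetical sketch: a tiny table with made-up attributes and pre-assigned cluster labels (the `cluster` column is invented for illustration; we will generate real labels later in the chapter). The profile of each cluster is simply the mean of every attribute within it.

import pandas

# Toy data: five observations, two attributes, and made-up cluster labels
toy = pandas.DataFrame(
    {
        "income": [20, 22, 80, 85, 90],
        "pct_rented": [70, 65, 20, 15, 10],
        "cluster": [0, 0, 1, 1, 1],
    }
)
# A cluster profile: the average of each attribute within each cluster
profiles = toy.groupby("cluster").mean()
profiles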
In the context of explicitly spatial questions, a related concept, the region, is also instrumental. A region is similar to a cluster, in the sense that all members of a region have been grouped together, and the region should provide a shorthand for the original data within the region. For a region to be analytically useful, its members should also display stronger similarity to each other than they do to the members of other regions. However, regions are more complex than clusters because they combine this similarity in profile with additional information about the location of their members: they should also describe a clear geographic area. In short, regions are like clusters (since they have a consistent profile) whose members also form a geographically coherent area.
The process of creating regions is called regionalization {cite}`duque2007supervised`.
A regionalization is a special kind of clustering where the objective is
to group observations which are similar in their statistical attributes,
but also in their spatial location. In this sense, regionalization embeds the same
logic as standard clustering techniques, but also applies a series of geographical constraints. Often, these
constraints relate to connectivity: two observations can only be grouped together in the
same region if there exists a path from one to the other
that never leaves the region. These paths often model the spatial relationships
in the data, such as contiguity or proximity. However, connectivity does not
always need to hold for all regions, and in certain contexts it makes
sense to relax connectivity or to impose different types of geographic constraints.
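To make the idea of a connectivity constraint concrete, the sketch below shows one common way to encode it: handing a spatial contiguity graph to an agglomerative clustering routine, so that two observations can only be merged if a chain of neighbors links them. This is only an illustration of the general approach (using scikit-learn's `AgglomerativeClustering` and the San Diego tracts data introduced later in the chapter), not the only way to build regions.

from libpysal.weights import Queen
from sklearn.cluster import AgglomerativeClustering
import geopandas

# San Diego Census tracts (the same file used throughout this chapter)
gdf = geopandas.read_file("../data/sandiego/sandiego_tracts.gpkg")
# Two illustrative attributes; in practice you would standardize them first
attrs = gdf[["median_house_value", "pct_rented"]].values

# Queen contiguity records which tracts share a border or corner
w = Queen.from_dataframe(gdf)

# The sparse adjacency matrix acts as a connectivity constraint: tracts can
# only end up in the same region if they are linked through neighboring tracts
model = AgglomerativeClustering(
    n_clusters=5, connectivity=w.sparse, linkage="ward"
)
regions = model.fit_predict(attrs)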
In this chapter we consider clustering techniques and regionalization methods. In the process, we will explore the socioeconomic characteristics of neighborhoods in San Diego. We will extract common patterns from the cloud of multidimensional data that the Census Bureau produces about small areas through the American Community Survey. We begin with an exploration of the multivariate nature of our dataset by suggesting some ways to examine its statistical and spatial distribution before carrying out any clustering. Focusing on the individual variables, as well as their pairwise associations, can help guide the subsequent application of clustering or regionalization.

We then consider geodemographic approaches to clustering, that is, the application of multivariate clustering to spatially referenced demographic data. Two popular clustering algorithms are employed: k-means and Ward's hierarchical method. As we will see, mapping the spatial distribution of the resulting clusters reveals interesting insights on the socioeconomic structure of the San Diego metropolitan area. We will also see that, in many cases, clusters are spatially fragmented; that is, a cluster may consist of several areas that are not spatially connected, and some clusters will have their members strewn all over the map. This illustrates why connectivity can be important when building insight about spatial data, since such fragmented clusters do not provide intelligible regions. With this insight in mind, we move on to regionalization, exploring different approaches that incorporate geographical constraints into the exploration of the social structure of San Diego. Applying a regionalization approach is not always required, but it can provide additional insights into the spatial structure of the multivariate statistical relationships that traditional clustering is unable to articulate.
from esda.moran import Moran
from libpysal.weights import Queen, KNN
import seaborn
import pandas
import geopandas
import numpy
import matplotlib.pyplot as plt
We return to the San Diego tracts dataset we have used earlier in the book. In this case, we will not only rely on its polygon geometries, but also on its attribute information. The data come from the 2017 American Community Survey (ACS). Let us begin by reading in the data.
# Read file
db = geopandas.read_file("../data/sandiego/sandiego_tracts.gpkg")
To make things easier later on, let us collect the variables we will use to characterize Census tracts. These variables capture different aspects of the socioeconomic reality of each area and, taken together, provide a comprehensive characterization of San Diego as a whole. We thus create a list with the names of the columns we will use later on:
cluster_variables = [
    "median_house_value",  # Median house value
    "pct_white",  # % tract population that is white
    "pct_rented",  # % households that are rented
    "pct_hh_female",  # % female-led households
    "pct_bachelor",  # % tract population with a Bachelors degree
    "median_no_rooms",  # Median n. of rooms in the tract's households
    "income_gini",  # Gini index measuring tract wealth inequality
    "median_age",  # Median age of tract population
    "tt_work",  # Travel time to work
]
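Before doing anything else with these columns, it can be useful to glance at their summary statistics. This optional check confirms that every name in `cluster_variables` matches a column in `db` and gives a first sense of the very different scales on which the variables are measured.

# Summary statistics (count, mean, spread, quartiles) for the selected columns
db[cluster_variables].describe().round(2).T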
Let's start building up our understanding of this dataset through both visual and statistical summaries. The first stop is considering the spatial distribution of each variable alone. This will help us draw a picture of the multi-faceted view of the tracts we want to capture with our clustering. Let's use (quantile) choropleth maps for each attribute and compare them side-by-side:
f, axs = plt.subplots(nrows=3, ncols=3, figsize=(12, 12))
# Make the axes accessible with single indexing
axs = axs.flatten()
# Start a loop over all the variables of interest
for i, col in enumerate(cluster_variables):
    # Select the axis where the map will go
    ax = axs[i]
    # Plot the map
    db.plot(
        column=col,
        ax=ax,
        scheme="Quantiles",
        linewidth=0,
        cmap="RdPu",
    )
    # Remove axis clutter
    ax.set_axis_off()
    # Set the axis title to the name of the variable being plotted
    ax.set_title(col)
# Display the figure
plt.show()
Several visual patterns jump out from the maps, revealing both commonalities and differences across the spatial distributions of the individual variables. Several variables tend to increase in value from the east to the west (`pct_rented`, `median_house_value`, `median_no_rooms`, and `tt_work`) while others have a spatial trend in the opposite direction (`pct_white`, `pct_hh_female`, `pct_bachelor`, `median_age`). This will help show the strengths of clustering; when variables have different spatial distributions, each variable contributes distinct information to the profiles of each cluster. However, if all variables display very similar spatial patterns, the amount of useful information across the maps is actually smaller than it appears, so cluster profiles may be much less useful as well.
It is also important to consider whether the variables display any
spatial autocorrelation, as this will affect the spatial structure of the
resulting clusters.
Recall from Chapter 6 that Moran's I is a commonly used measure of global spatial autocorrelation. We can use it to formalize some of the intuitions built from the maps. Recall from earlier in the book that we will need to represent the spatial configuration of the data points through a spatial weights matrix.
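As a refresher, for a variable $x$ measured over $n$ areas, with deviations from the mean $z_i = x_i - \bar{x}$ and spatial weights $w_{ij}$, the statistic takes the form

$$
I = \frac{n}{\sum_i \sum_j w_{ij}} \; \frac{\sum_i \sum_j w_{ij} z_i z_j}{\sum_i z_i^2}
$$

with positive values indicating that similar values tend to be located near one another. We will start with queen contiguity: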
w = Queen.from_dataframe(db)
Now let's calculate Moran's I for the variables being used. This will measure the extent to which each variable contains spatial structure:
# Set seed for reproducibility
numpy.random.seed(123456)
# Calculate Moran's I for each variable
mi_results = [
    Moran(db[variable], w) for variable in cluster_variables
]
# Structure results as a list of tuples
mi_results = [
    (variable, res.I, res.p_sim)
    for variable, res in zip(cluster_variables, mi_results)
]
# Display on table
table = pandas.DataFrame(
    mi_results, columns=["Variable", "Moran's I", "P-value"]
).set_index("Variable")
table
| Variable | Moran's I | P-value |
|---|---|---|
| median_house_value | 0.646618 | 0.001 |
| pct_white | 0.602079 | 0.001 |
| pct_rented | 0.451372 | 0.001 |
| pct_hh_female | 0.282239 | 0.001 |
| pct_bachelor | 0.433082 | 0.001 |
| median_no_rooms | 0.538996 | 0.001 |
| income_gini | 0.295064 | 0.001 |
| median_age | 0.381440 | 0.001 |
| tt_work | 0.102748 | 0.001 |
Each of the variables displays significant positive spatial autocorrelation, suggesting clear spatial structure in the socioeconomic geography of San Diego. This means it is likely that the clusters we find will have a non-random spatial distribution.
Spatial autocorrelation only describes relationships between observations for a single attribute at a time. So, the fact that all of the clustering variables are positively autocorrelated does not say much about how attributes co-vary over space. To explore cross-attribute relationships, we need to consider the spatial correlation between variables. We will take a first step in this direction by exploring the bivariate correlation in the maps of the covariates themselves. This would mean comparing each pair of choropleths to look for associations and differences. Given there are nine attributes, there are 36 distinct pairs of maps that must be compared.
This would be too many maps to process visually. Instead, we focus directly on the bivariate relationships between each pair of attributes, devoid for now of geography, and use a scatterplot matrix.
_ = seaborn.pairplot(
    db[cluster_variables], kind="reg", diag_kind="kde"
)