Authors: Sergio Rey, Xin Feng
The maxp
problem involves the clustering of a set of geographic areas into the maximum number of homogeneous regions such that the value of a spatially extensive regional attribute is above a predefined threshold value.
The spatially extensive attribute can be specified to ensure that each region contains sufficient population size, or a minimum number of enumeration units. The number of regions p is endogenous to the problem and is useful for regionalization problems where the analyst does not require a fixed number of regions a-priori.
Originally formulated as a mixed-integer problem in Duque, Anselin, Rey (2012), maxp
is an NP-hard problem and exact solutions are only feasible for small problem sizes. As such, a number of heuristic solution approaches have been suggested. PySAL implements the heuristic approach described in
Wei, Rey, and Knaap (2020).
%load_ext watermark
%watermark
The watermark extension is already loaded. To reload it, use: %reload_ext watermark Last updated: 2021-01-19T22:54:03.185491+00:00 Python implementation: CPython Python version : 3.9.1 IPython version : 7.18.1 Compiler : GCC 9.3.0 OS : Linux Release : 4.15.0-132-generic Machine : x86_64 Processor : x86_64 CPU cores : 4 Architecture: 64bit
import warnings
warnings.filterwarnings('ignore')
import geopandas as gpd
import libpysal
import numpy
import sys
sys.path.append("../")
from spopt import MaxPHeuristic as MaxP
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 8]
To illustrate maxp
we utilize data on regional incomes for Mexican states over the period 1940-2000, originally used in Rey and Sastré-Gutiérrez (2010).
We can first explore the data by plotting the per capital gross regional domestic product (in constant USD 2000 dollars) for each year in the sample, using a quintile classification:
pth = libpysal.examples.get_path('mexicojoin.shp')
mexico = gpd.read_file(pth)
for year in range(1940, 2010, 10):
ax = mexico.plot(column=f'PCGDP{year}', scheme='Quantiles', cmap='GnBu', edgecolor='grey', legend=True)
_ = ax.axis('off')
plt.title(str(year))
In general terms, the north-south divide in incomes is present in each of the 7 decades. There is some variation in states moving across quintiles however, and this is true at both the bottom and top of the state income distribution.
To develop a holistic view of the Mexican space economy over this timespan, we can try to form a set of spatially connected regions that maximizes the internal socieconomic levels of the states belonging to each region.
To develop our holistic view, we can treat the six cross-sections as a multidimensional array and seek to cluster 32 Mexican states into the maximum number of regions such that each region as at least 6 = 32 // 5 states and homogeneity in per capita gross regional product over 1940-2000 is maximized.
We first define the variables in the dataframe that will be used to measure regional homogeneity:
attrs_name = [f'PCGDP{year}' for year in range(1940,2010, 10)]
attrs_name
['PCGDP1940', 'PCGDP1950', 'PCGDP1960', 'PCGDP1970', 'PCGDP1980', 'PCGDP1990', 'PCGDP2000']
Next, we specify a number of parameters that will serve as input to the maxp
model.
A spatial weights object expresses the spatial connectivity of the states:
w = libpysal.weights.Queen.from_dataframe(mexico)
The remaining arguments are the minimum number of states each region must have (threshold
):
threshold = 6
and the number of the top candidate regions to consider when assigning enclaves (top_n
):
top_n = 2
We create the attribute count
which will serve as the threshold attribute which we add to the dataframe:
mexico['count'] = 1
threshold_name = 'count'
The model can then be instantiated and solved:
model = MaxP(mexico, w, attrs_name, threshold_name, threshold, top_n)
model.solve()
mexico['maxp_new'] = model.labels_
mexico[['maxp_new','AREA']].groupby(by='maxp_new').count()
AREA | |
---|---|
maxp_new | |
1 | 6 |
2 | 7 |
3 | 6 |
4 | 6 |
5 | 7 |
model.p
5
mexico.plot(column='maxp_new', categorical=True, edgecolor='w')
<AxesSubplot:>
The model solution results in five regions, three of which have six states, and two with seven states each.
Each region is a spatially connected component, as required by the maxp
problem.
numpy.random.seed(123456)
threshold=3
model = MaxP(mexico, w, attrs_name, threshold_name, threshold, top_n)
model.solve()
mexico['maxp_3'] = model.labels_
mexico[['maxp_3','AREA']].groupby(by='maxp_3').count()
AREA | |
---|---|
maxp_3 | |
1 | 3 |
2 | 3 |
3 | 4 |
4 | 3 |
5 | 3 |
6 | 3 |
7 | 4 |
8 | 3 |
9 | 3 |
10 | 3 |
model.p
10
ax = mexico.plot(column='maxp_3', categorical=True, edgecolor='w')
_ = ax.axis('off')
for year in attrs_name:
model = MaxP(mexico, w, year, threshold_name, 5, top_n)
model.solve()
lab = year+'labels_'
mexico[lab] = model.labels_
ax = mexico.plot(column=lab, categorical=True, edgecolor='w')
plt.title(year)
_ = ax.axis('off')