In territory management, a territory is a customer group or geographic area over which either an individual salesperson or a sales team has responsibility. These territories are usually defined based on geography, sales potential, number of clients or a combination of these factors.
The main complexity in territory management is creating areas that are balanced with respect to more than one factor, and those factors usually behave very differently. There is no one-size-fits-all solution; if the balance is off, sales management is likely to leave someone in the organization unhappy or leave money on the table. This is why it is very important to identify and understand all the components and requirements of your use case so you can apply the most appropriate technique.
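For instance, balance on a single factor can be quantified as the ratio between the largest and smallest territory totals; a minimal sketch (the helper name and the client counts below are made up for illustration):

```python
# Hypothetical helper: measure how balanced a set of territories is
# on one factor (e.g., number of clients per territory).

def balance_ratio(totals):
    """Return max/min ratio of territory totals (1.0 = perfectly balanced)."""
    return max(totals) / min(totals)

client_counts = [120, 95, 110, 130, 105]   # illustrative clients per territory
print(balance_ratio(client_counts))        # ~1.37: reasonably balanced
```

In practice you would compute one such metric per balancing factor and trade them off against each other.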
We can differentiate between two main use cases: when the location of sales reps is important (usually because they have to travel to visit their clients) and when it is not (travel rarely occurs). The former case is clearly more complex than the latter.
In this notebook we will use two different techniques to solve territory management problems when only the location of clients needs to be considered, i.e., we will have a single layer of data consisting of client locations. We will demonstrate the added value of Spatial Data Science techniques compared to traditional ones.
A pharma lab is interested in balancing their sales territories in the state of Texas based on the number of current and potential clients per territory.
Their clients are mainly offices and clinics of medical doctors.
They are interested in creating 5 balanced territories.
We will use the following two datasets from CARTO's Data Observatory: Who's on First (for the Texas state boundary) and Pitney Bowes POI-Consumer (for the client locations).
Note the POI dataset is premium and a subscription is needed to access this data.
We'll start by importing all packages we'll use.
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from cartoframes.auth import set_default_credentials
from cartoframes.data.observatory import *
from cartoframes.viz import *
from h3 import h3
from libpysal.weights import Rook
from shapely import wkt
from shapely.geometry import mapping, Polygon
from sklearn.cluster import KMeans
from spopt.region.maxp import MaxPHeuristic
pd.set_option('display.max_columns', None)
plt.rc('axes', titlesize='large')
plt.rc('xtick', labelsize='large')
plt.rc('ytick', labelsize='large')
sns.set_style('whitegrid')
In order to use the Data Observatory via CARTOframes, you first need to set your CARTO account credentials.
Please visit the Authentication guide for further detail.
set_default_credentials('creds.json')
The following function creates an H3 polyfill of a polygon at the indicated resolution.
def create_h3_grid(polygon, resolution=8):
    # Fill the polygon with H3 hexagons at the given resolution
    hex_id_list = list(h3.polyfill(geojson=mapping(polygon), res=resolution, geo_json_conformant=True))
    hexagon_list = [Polygon(h3.h3_to_geo_boundary(h=hex_id, geo_json=True)) for hex_id in hex_id_list]
    grid = pd.DataFrame(data={'hex_id': hex_id_list, 'geometry': hexagon_list})
    grid = gpd.GeoDataFrame(grid, crs='epsg:4326')
    return grid
The function below is used throughout the analysis to check if clusters are balanced based on different metrics.
The function arguments are:
- `clusters`: the ordered cluster names, so that we can provide descriptive names to clusters
- `areas_df`: the GeoDataFrame with the data per cell
- `groupby`: the column with the cluster to which each cell belongs
- `**kaggregations`: the aggregations for the different metrics we'd like to calculate

def plot_clinic_balance(clusters, areas_df, groupby, **kaggregations):
    areas_df_g = areas_df.groupby(groupby).agg(kaggregations).reset_index()
    n_plots = len(kaggregations)
    fig, axs = plt.subplots(1, n_plots, figsize=(9 + 3*n_plots, 4))
    if n_plots == 1:
        axs = [axs]
    for i in range(n_plots):
        sns.barplot(y=groupby, x=list(kaggregations.keys())[i], data=areas_df_g, order=clusters,
                    palette=['#7F3C8D', '#11A579', '#3969AC', '#F2B701', '#E73F74'], ax=axs[i])
        axs[i].set_xlabel(list(kaggregations.keys())[i], fontsize=13)
        axs[i].set_ylabel('Sales rep locations', fontsize=13)
    fig.tight_layout()
    return axs
Next, we will download the data described in the use case using CARTOframes.
Note that this notebook assumes some prior knowledge of how to explore and download data from the Data Observatory. If this is your first time exploring and downloading data from the Data Observatory, take a look at the CARTOframes Guides and the Data Observatory examples and discover how easy it is to get started!
We are interested in the geometry of the state of Texas. We'll download it from Who's on First GeoJSON - Global dataset.
wof_grographies = Geography.get('wof_geojson_4e78587c')
wof_grographies.to_dict()
{'slug': 'wof_geojson_4e78587c', 'name': 'GeoJSON - Global', 'description': "The main table in Who's On First. Holds all the relevant information for a place in the 'body' JSON field.", 'country_id': 'glo', 'provider_id': 'whos_on_first', 'geom_type': 'MULTIPLE', 'update_frequency': None, 'is_public_data': True, 'lang': 'eng', 'version': '20190520', 'provider_name': "Who's On First", 'id': 'carto-do-public-data.whos_on_first.geography_glo_geojson_20190520'}
state_name = 'Texas'
country_code = 'US'
placetype = 'region'
sql_query = f"""SELECT *
FROM $geography$
WHERE name = '{state_name}' AND
country = '{country_code}' AND
placetype='{placetype}'"""
tx_boundary = wof_grographies.to_dataframe(sql_query=sql_query)
tx_boundary['geom'] = list(map(wkt.loads, tx_boundary['geom']))
tx_boundary = gpd.GeoDataFrame(tx_boundary, geometry='geom', crs='epsg:4326')
tx_boundary
|   | geoid | id | body | name | country | parent_id | is_current | placetype | geometry_type | bbox | geom | lastmodified | lastmodified_timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 85688753 | 85688753 | {"id": 85688753, "type": "Feature", "propertie... | Texas | US | 85633793 | 1 | region | Polygon | POLYGON((-93.508039 25.837164, -93.508039 36.5... | POLYGON ((-103.06466 32.95910, -103.06460 32.9... | 1555446728 | 2019-04-16 20:32:08+00:00 |
We'll download all POIs in Texas classified as "OFFICES AND CLINICS OF MEDICAL DOCTORS" from Pitney Bowes POI-Consumer dataset.
Note this is a premium dataset and a subscription is required.
poi_dataset = Dataset.get('pb_consumer_po_62cddc04')
sql_query = """
SELECT * except(do_label) FROM $dataset$
WHERE SUB_CLASS = 'OFFICES AND CLINICS OF MEDICAL DOCTORS'
AND STABB = 'TX'
AND CAST(do_date AS date) >= (SELECT MAX(CAST(do_date AS date)) from $dataset$)
"""
pois = poi_dataset.to_dataframe(sql_query=sql_query)
pois.columns = list(map(str.lower, pois.columns))
pois['geom'] = list(map(wkt.loads, pois['geom']))
pois = gpd.GeoDataFrame(pois, geometry='geom', crs='epsg:4326')
pois.head()
(Output truncated: the first five rows of `pois`, a GeoDataFrame with 74 columns covering the POI name, address and city, longitude/latitude, SIC codes and descriptions, the `SUB_CLASS` 'OFFICES AND CLINICS OF MEDICAL DOCTORS', employee counts, sales volumes, corporate-hierarchy fields, and the point geometry `geom`.)
pois.shape
(59554, 74)
Map([Layer(tx_boundary,
style=basic_style(opacity=0, stroke_color='#11A579', stroke_width=5),
legends=basic_legend('Texas Boundary')),
Layer(pois,
style=basic_style(color='#F2B701', size=2, opacity=0.9, stroke_width=0),
popup_hover=[popup_element('name', 'Client'),
popup_element('employee_here', '# Employees')],
legends=basic_legend('Client Locations'))],
basemap=basemaps.darkmatter)
A fundamental step in territory management is to discretize space. Territory management algorithms are computationally complex and hence it is crucial to leverage the spatial component to reduce complexity. We can do this by working at an aggregated level instead of considering each client location independently.
We first need to identify the smallest spatial aggregation that makes sense for the business: our geographic support. This can be census block groups, zip codes, or counties; alternatively, you might prefer a standard grid, in which case it would ideally be a hierarchical spatial index such as the Quadkey or H3 grid.
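Hierarchical spatial indexes make it cheap to move between resolutions because parent/child relationships are encoded in the cell ids themselves. For quadkeys this is just string prefixing, as this sketch shows (the helper names are ours; H3 offers analogous `h3_to_parent`/`h3_to_children` functions):

```python
# Quadkeys append one digit (0-3) per zoom level, so a cell's
# ancestors are simply prefixes of its id.

def quadkey_parent(quadkey):
    """Parent cell one zoom level up."""
    return quadkey[:-1]

def quadkey_children(quadkey):
    """The four child cells one zoom level down."""
    return [quadkey + digit for digit in '0123']

cell = '0231'
print(quadkey_parent(cell))    # '023'
print(quadkey_children(cell))  # ['02310', '02311', '02312', '02313']
```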
In this notebook we will use an H3 grid of resolution 4. We can easily discretize space by performing a polyfill of the Texas boundary polygon.
Note that a buffer has been applied because H3 fills the polygon with all hexagons of resolution 4 whose centroid lies within it, and we want to make sure the whole territory is covered.
# Buffer
buffer = 2.5e4 # in meters
tx_boundary['geometry_buffer'] = tx_boundary.to_crs('epsg:26914').buffer(buffer).to_crs('epsg:4326')
grid = create_h3_grid(tx_boundary['geometry_buffer'].iloc[0], 4)
grid.head()
|   | hex_id | geometry |
|---|---|---|
| 0 | 84489edffffffff | POLYGON ((-97.03425 29.34525, -96.79420 29.479... |
| 1 | 8426d31ffffffff | POLYGON ((-102.58398 34.37842, -102.33428 34.5... |
| 2 | 8448f29ffffffff | POLYGON ((-104.42876 30.05621, -104.18949 30.2... |
| 3 | 84446ddffffffff | POLYGON ((-95.82009 29.75644, -95.57955 29.887... |
| 4 | 84489c5ffffffff | POLYGON ((-98.02512 29.30940, -97.78501 29.445... |
Map([Layer(grid,
style=basic_style(opacity=0.75),
legends=basic_legend('H3 grid')),
Layer(tx_boundary,
style=basic_style(opacity=0, stroke_color='#E73F74', stroke_width=5),
legends=basic_legend('Texas Boundary'))])
We will aggregate clients by H3 cell, computing the number of clients per cell (`poi_count`) and the average number of employees per client (`employee_avg`).
pois_g = gpd.sjoin(pois, grid, how='right').groupby('hex_id').agg({'geoid':'count', 'employee_here':'mean'}).\
reset_index().rename(columns={'geoid':'poi_count', 'employee_here':'employee_avg'})
pois_g[['poi_count', 'employee_avg']] = pois_g[['poi_count', 'employee_avg']].fillna(0)
areas = grid.merge(pois_g, on='hex_id')
areas = gpd.GeoDataFrame(areas, crs='epsg:4326')
areas.head()
|   | hex_id | geometry | poi_count | employee_avg |
|---|---|---|---|---|
| 0 | 84489edffffffff | POLYGON ((-97.03425 29.34525, -96.79420 29.479... | 15 | 4.250000 |
| 1 | 8426d31ffffffff | POLYGON ((-102.58398 34.37842, -102.33428 34.5... | 1 | 2.000000 |
| 2 | 8448f29ffffffff | POLYGON ((-104.42876 30.05621, -104.18949 30.2... | 0 | 0.000000 |
| 3 | 84446ddffffffff | POLYGON ((-95.82009 29.75644, -95.57955 29.887... | 1108 | 4.801538 |
| 4 | 84489c5ffffffff | POLYGON ((-98.02512 29.30940, -97.78501 29.445... | 496 | 4.854093 |
areas['poi_count'].describe(percentiles=[0.25, 0.5, 0.75, 0.95])
count     402.000000
mean      148.144279
std       709.392374
min         0.000000
25%         0.000000
50%         4.000000
75%        40.500000
95%       496.950000
max      8268.000000
Name: poi_count, dtype: float64
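The class breaks used in the map below roughly follow the distribution's quartiles and 95th percentile; a sketch of deriving breaks from quantiles (toy counts, numpy only):

```python
import numpy as np

# Skewed count variable, as is typical for POI counts per cell
counts = np.array([0, 0, 1, 2, 4, 8, 40, 120, 500, 8268])

# Class breaks at the 25th, 50th, 75th and 95th percentiles
breaks = np.quantile(counts, [0.25, 0.5, 0.75, 0.95])
print(breaks)  # monotonically increasing, one break per percentile
```

Quantile-based breaks adapt to heavily skewed data much better than equal-width intervals would.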
breaks=[1, 4, 40, 500]
Map(Layer(areas,
style=color_bins_style('poi_count', breaks=breaks),
legends=color_bins_legend('Number of Clients')))
Once we have our data aggregated, it's time to start working on building balanced territories.
We will explore two different techniques: KMeans clustering and the Max-p regionalization algorithm.
KMeans is a very well-known and broadly used clustering technique. However, it doesn't allow you to incorporate balancing criteria, and it usually generates low-quality results for territory management.
no_territories = 5
kmeans = KMeans(no_territories, random_state=111)
kmeans.fit_predict(list(map(lambda point:[point.x, point.y], areas.to_crs('epsg:26914').centroid)),
sample_weight=areas['poi_count'])
print('Done!')
Done!
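Under the hood, `sample_weight` turns each centroid update into a weighted mean, pulling centroids toward cells with many clients, while the assignment step is unchanged. A minimal sketch of one such Lloyd iteration on toy 1-D data (numpy only; not sklearn's actual implementation):

```python
import numpy as np

points = np.array([[0.0], [1.0], [10.0], [11.0]])
weights = np.array([1.0, 1.0, 5.0, 1.0])      # third cell has 5x the clients
centroids = np.array([[0.0], [10.0]])

# Assignment step: each point goes to its nearest centroid (weights play no role)
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Update step: each centroid becomes the weighted mean of its assigned points
new_centroids = np.array([
    np.average(points[labels == k], axis=0, weights=weights[labels == k])
    for k in range(len(centroids))
])
print(new_centroids.ravel())  # second centroid is pulled toward the heavy cell
```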
We assign each cell to the cluster it belongs to.
In order to make results comparable across techniques, we identify 5 representative cells, each far enough from the others to make sure they fall in different clusters. This way, we can compare how each cluster changes between techniques.
areas['kmeans_cluster'] = kmeans.labels_
areas['kmeans_cluster'] += 1
# This dictionary contains the representative cells with the cluster they represent.
trans_dict={'8448c69ffffffff':1,
'8426d47ffffffff':2,
'8426cdbffffffff':3,
'84446edffffffff':4,
'844880dffffffff':5}
areas['kmeans_cluster_aux'] = -1
for hex_id in trans_dict:
areas.loc[areas['kmeans_cluster'] == areas.loc[areas['hex_id'] == hex_id, 'kmeans_cluster'].iloc[0], 'kmeans_cluster_aux'] = trans_dict[hex_id]
areas['kmeans_cluster'] = areas['kmeans_cluster_aux']
areas.drop(columns='kmeans_cluster_aux', inplace=True)
We create a string variable with a more descriptive name for our visualizations.
areas['kmeans_cluster_cat'] = list(map(lambda v:f'Cluster_{v}', areas['kmeans_cluster']))
KMeans produces nice compact clusters (see map). However, the clusters are very unbalanced, with Cluster_3 having 10 times more clients than Cluster_1, as can be seen in the chart below.
Map(Layer(areas,
style=color_category_style('kmeans_cluster_cat', cat=sorted(areas['kmeans_cluster_cat'].unique())),
legends=color_category_legend('KMeans Clustering')))
plot_clinic_balance(sorted(areas['kmeans_cluster_cat'].unique()), areas, 'kmeans_cluster_cat', poi_count='sum')
Let's now try to balance the number of clients per cluster while maintaining connected clusters as compact as possible.
We will use Pysal's implementation of the Max-p algorithm. Max-p is a spatial clustering algorithm that calculates spatially connected clusters with similar properties while balancing one criterion or a mix of criteria.
The first thing we need to do is calculate the adjacency matrix, which tells the algorithm which cells are contiguous.
We will use Rook weights, which consider two polygons to be contiguous if they share an edge.
wgt = Rook.from_dataframe(areas, geom_col='geometry')
wgt.histogram
[(2, 6), (3, 27), (4, 34), (5, 33), (6, 302)]
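The histogram above reads as (number of neighbors, number of cells): for instance, 302 cells have the full 6 neighbors a hexagon can have. On a square grid, rook contiguity reduces to sharing an edge (up/down/left/right), which this toy sketch illustrates (the helper name is ours, not libpysal's):

```python
# Rook contiguity on a regular grid: neighbors are the cells sharing an edge,
# unlike queen contiguity, which also counts corner-touching cells.

def rook_neighbors(cell, n_rows, n_cols):
    """Rook-contiguous neighbors of (row, col) on an n_rows x n_cols grid."""
    r, c = cell
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < n_rows and 0 <= j < n_cols]

print(rook_neighbors((0, 0), 3, 3))  # corner cell: 2 neighbors
print(rook_neighbors((1, 1), 3, 3))  # interior cell: 4 neighbors
```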
We would like to balance clusters based on the total number of clients. Normally we are not looking for a perfect balance, especially when dealing with more than one criterion, so a balance tolerance is introduced. In our case, we will consider a tolerance of 20%, which means that we allow clusters to be as much as 20% below the perfect balance.
Note we are only using the number of clients to balance, but this dataset also has the number of employees per client and you might even have other data you might be interested in using. The good news is Max-p allows you to do that.
# Trick to help the algorithm find compact areas: add a pseudo-count of 10
# to every cell so that empty cells also carry some weight
areas['poi_count'] += 10
balance_tolerance = 0.2 # 20%
perfect_balance = areas['poi_count'].sum()/5
threshold = int(np.floor(perfect_balance * (1-balance_tolerance)))
print('Minimum number of clients per cluster', threshold)
Minimum number of clients per cluster 10171
Max-p also allows you to set similarity criteria. These are variables that you want to have a similar behavior within clusters. For example, you might be interested in having clusters with similar demographic or socioeconomic characteristics.
In this case, we don't have any specific criteria, so we will use the grid cell centroid coordinates as similarity criteria in order to get clusters as compact as possible. You can try removing these, or adding only one of the coordinates, to clearly see the effect of these similarity criteria.
# Grid cell centroid coordinates
centroids = areas.centroid
areas['lat'] = centroids.y
areas['lon'] = centroids.x
areas['lat_norm'] = (areas['lat'] - areas['lat'].min())/(areas['lat'].max() - areas['lat'].min())
areas['lon_norm'] = (areas['lon'] - areas['lon'].min())/(areas['lon'].max() - areas['lon'].min())
areas.head()
|   | hex_id | geometry | poi_count | employee_avg | kmeans_cluster | kmeans_cluster_cat | lat | lon | lat_norm | lon_norm | maxp_cluster | maxp_cluster_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84489edffffffff | POLYGON ((-97.03425 29.34525, -96.79420 29.479... | 25 | 4.250000 | 5 | Cluster_5 | 29.597247 | -97.048882 | 0.352024 | 0.726035 | 5 | Cluster_5 |
| 1 | 8426d31ffffffff | POLYGON ((-102.58398 34.37842, -102.33428 34.5... | 11 | 2.000000 | 2 | Cluster_2 | 34.626792 | -102.612015 | 0.816273 | 0.308192 | 2 | Cluster_2 |
| 2 | 8448f29ffffffff | POLYGON ((-104.42876 30.05621, -104.18949 30.2... | 10 | 0.000000 | 1 | Cluster_1 | 30.309779 | -104.459796 | 0.417793 | 0.169406 | 1 | Cluster_1 |
| 3 | 84446ddffffffff | POLYGON ((-95.82009 29.75644, -95.57955 29.887... | 1118 | 4.801538 | 4 | Cluster_4 | 30.007381 | -95.832104 | 0.389881 | 0.817426 | 4 | Cluster_4 |
| 4 | 84489c5ffffffff | POLYGON ((-98.02512 29.30940, -97.78501 29.445... | 506 | 4.854093 | 5 | Cluster_5 | 29.561949 | -98.041929 | 0.348765 | 0.651448 | 5 | Cluster_5 |
maxp_heur = MaxPHeuristic(areas, wgt, ['lat_norm', 'lon_norm'], 'poi_count', threshold,
5, max_iterations_construction=2, max_iterations_sa=5, verbose=True)
maxp_heur.solve()
max_p: 5
number of good partitions: 2
0
totalWithinRegionDistance after SA: 6900.531405712055
totalWithinRegionDistance after SA: 7225.077640946365
totalWithinRegionDistance after SA: 7395.269464681606
totalWithinRegionDistance after SA: 6905.536769392361
totalWithinRegionDistance after SA: 7370.850274070599
1
totalWithinRegionDistance after SA: 5642.22448183758
totalWithinRegionDistance after SA: 5613.498459949133
totalWithinRegionDistance after SA: 5618.573555920197
totalWithinRegionDistance after SA: 5749.532705832657
totalWithinRegionDistance after SA: 5677.7927687007
best objective value: 5613.498459949133
np.max(maxp_heur.labels_)
5
# Undo change
areas['poi_count'] -= 10
We assign each cell to the cluster it belongs to and rename clusters based on the representative cells we mentioned in the KMeans section.
areas['maxp_cluster'] = maxp_heur.labels_
areas['maxp_cluster_aux'] = -1
for hex_id in trans_dict:
areas.loc[areas['maxp_cluster'] == areas.loc[areas['hex_id'] == hex_id, 'maxp_cluster'].iloc[0], 'maxp_cluster_aux'] = trans_dict[hex_id]
areas['maxp_cluster'] = areas['maxp_cluster_aux']
areas.drop(columns='maxp_cluster_aux', inplace=True)
areas['maxp_cluster_cat'] = list(map(lambda v:f'Cluster_{v}', areas['maxp_cluster']))
We can see that the clusters are now less compact than they were with KMeans, but they are balanced, with all of them satisfying the minimum-clients threshold computed above.
Layout([Map(Layer(areas,
style=color_category_style('kmeans_cluster_cat', cat=sorted(areas['kmeans_cluster_cat'].unique())),
legends=color_category_legend('KMeans Clustering'),
popup_hover=[popup_element('hex_id'), popup_element('maxp_cluster_cat')])),
Map(Layer(areas,
style=color_category_style('maxp_cluster_cat', cat=sorted(areas['maxp_cluster_cat'].unique())),
legends=color_category_legend('Max-p Clustering'),
popup_hover=[popup_element('hex_id'), popup_element('maxp_cluster_cat')]))
], map_height=400)
plot_clinic_balance(sorted(areas['maxp_cluster_cat'].unique()), areas, 'maxp_cluster_cat', poi_count='sum')