In territory management, a territory is a customer group or geographic area over which either an individual salesperson or a sales team has responsibility. These territories are usually defined based on geography, sales potential, number of clients or a combination of these factors.
The main complexity in territory management is creating areas that are balanced with respect to more than one factor, and those factors usually behave very differently. There is no one-size-fits-all solution; if the balance is off, sales management is likely to leave someone in the organization unhappy or leave money on the table. This is why it is very important to identify and understand all the components and requirements of your use case so you can apply the most appropriate technique.
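For instance, balance on a single factor can be quantified as the ratio between the largest and smallest territory totals; a minimal sketch (the helper name and the client counts below are made up for illustration):

```python
# Hypothetical helper: measure how balanced a set of territories is
# on one factor (e.g., number of clients per territory).

def balance_ratio(totals):
    """Return max/min ratio of territory totals (1.0 = perfectly balanced)."""
    return max(totals) / min(totals)

client_counts = [120, 95, 110, 130, 105]   # illustrative clients per territory
print(balance_ratio(client_counts))        # ~1.37: reasonably balanced
```

In practice you would compute one such metric per balancing factor and trade them off against each other.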
We can differentiate between two main use cases: when the location of sales reps is important (usually because they have to travel to visit their clients) and when it is not (travel rarely occurs). The former case is clearly more complex than the latter.
In this notebook we will use two different techniques to solve territory management problems when only the location of clients needs to be considered, i.e., we will have a single layer of data consisting of client locations. We will demonstrate the added value of Spatial Data Science techniques compared to traditional ones.
A pharma lab is interested in balancing their sales territories in the state of Texas based on the number of current and potential clients per territory.
Their clients are mainly offices and clinics of medical doctors.
They are interested in creating 5 balanced territories.
We will use the following two datasets from CARTO's Data Observatory: Who's on First (for the Texas state boundary) and Pitney Bowes POI-Consumer (for the client locations).
Note the POI dataset is premium and a subscription is needed to access this data.
We'll start by importing all packages we'll use.
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from cartoframes.auth import set_default_credentials
from cartoframes.data.observatory import *
from cartoframes.viz import *
from h3 import h3
from libpysal.weights import Rook
from shapely import wkt
from shapely.geometry import mapping, Polygon
from sklearn.cluster import KMeans
from spopt.region.maxp import MaxPHeuristic
pd.set_option('display.max_columns', None)
plt.rc('axes', titlesize='large')
plt.rc('xtick', labelsize='large')
plt.rc('ytick', labelsize='large')
sns.set_style('whitegrid')
In order to use the Data Observatory via CARTOframes, you first need to set your CARTO account credentials.
Please visit the Authentication guide for further detail.
set_default_credentials('creds.json')
The following function creates an H3 polyfill of a polygon at the indicated resolution.
def create_h3_grid(polygon, resolution=8):
    # Fill the polygon with H3 hexagons at the given resolution
    hex_id_list = list(h3.polyfill(geojson=mapping(polygon), res=resolution, geo_json_conformant=True))
    hexagon_list = [Polygon(h3.h3_to_geo_boundary(h=hex_id, geo_json=True)) for hex_id in hex_id_list]
    grid = pd.DataFrame(data={'hex_id': hex_id_list, 'geometry': hexagon_list})
    grid = gpd.GeoDataFrame(grid, crs='epsg:4326')
    return grid
The function below is used throughout the analysis to check if clusters are balanced based on different metrics.
The function arguments are:
- `clusters`: the ordered cluster names, so that we can provide descriptive names to clusters
- `areas_df`: the GeoDataFrame with the data per cell
- `groupby`: the column with the cluster to which each cell belongs
- `**kaggregations`: the aggregations for the different metrics we'd like to calculate

def plot_clinic_balance(clusters, areas_df, groupby, **kaggregations):
    areas_df_g = areas_df.groupby(groupby).agg(kaggregations).reset_index()
    n_plots = len(kaggregations)
    fig, axs = plt.subplots(1, n_plots, figsize=(9 + 3*n_plots, 4))
    if n_plots == 1:
        axs = [axs]
    for i in range(n_plots):
        sns.barplot(y=groupby, x=list(kaggregations.keys())[i], data=areas_df_g, order=clusters,
                    palette=['#7F3C8D', '#11A579', '#3969AC', '#F2B701', '#E73F74'], ax=axs[i])
        axs[i].set_xlabel(list(kaggregations.keys())[i], fontsize=13)
        axs[i].set_ylabel('Sales rep locations', fontsize=13)
    fig.tight_layout()
    return axs
Next, we will download the data described in the use case using CARTOframes.
Note that this notebook assumes some prior knowledge of how to explore and download data from the Data Observatory. If this is your first time exploring and downloading data from the Data Observatory, take a look at the CARTOframes Guides and the Data Observatory examples and discover how easy it is to get started!
We are interested in the geometry of the state of Texas. We'll download it from Who's on First GeoJSON - Global dataset.
wof_grographies = Geography.get('wof_geojson_4e78587c')
wof_grographies.to_dict()
{'slug': 'wof_geojson_4e78587c', 'name': 'GeoJSON - Global', 'description': "The main table in Who's On First. Holds all the relevant information for a place in the 'body' JSON field.", 'country_id': 'glo', 'provider_id': 'whos_on_first', 'geom_type': 'MULTIPLE', 'update_frequency': None, 'is_public_data': True, 'lang': 'eng', 'version': '20190520', 'provider_name': "Who's On First", 'id': 'carto-do-public-data.whos_on_first.geography_glo_geojson_20190520'}
state_name = 'Texas'
country_code = 'US'
placetype = 'region'
sql_query = f"""SELECT *
FROM $geography$
WHERE name = '{state_name}' AND
country = '{country_code}' AND
placetype='{placetype}'"""
tx_boundary = wof_grographies.to_dataframe(sql_query=sql_query)
tx_boundary['geom'] = list(map(wkt.loads, tx_boundary['geom']))
tx_boundary = gpd.GeoDataFrame(tx_boundary, geometry='geom', crs='epsg:4326')
tx_boundary
|   | geoid | id | body | name | country | parent_id | is_current | placetype | geometry_type | bbox | geom | lastmodified | lastmodified_timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 85688753 | 85688753 | {"id": 85688753, "type": "Feature", "propertie... | Texas | US | 85633793 | 1 | region | Polygon | POLYGON((-93.508039 25.837164, -93.508039 36.5... | POLYGON ((-103.06466 32.95910, -103.06460 32.9... | 1555446728 | 2019-04-16 20:32:08+00:00 |
We'll download all POIs in Texas classified as "OFFICES AND CLINICS OF MEDICAL DOCTORS" from Pitney Bowes POI-Consumer dataset.
Note this is a premium dataset and a subscription is required.
poi_dataset = Dataset.get('pb_consumer_po_62cddc04')
sql_query = """
SELECT * except(do_label) FROM $dataset$
WHERE SUB_CLASS = 'OFFICES AND CLINICS OF MEDICAL DOCTORS'
AND STABB = 'TX'
AND CAST(do_date AS date) >= (SELECT MAX(CAST(do_date AS date)) from $dataset$)
"""
pois = poi_dataset.to_dataframe(sql_query=sql_query)
pois.columns = list(map(str.lower, pois.columns))
pois['geom'] = list(map(wkt.loads, pois['geom']))
pois = gpd.GeoDataFrame(pois, geometry='geom', crs='epsg:4326')
pois.head()
(Output truncated: the first five rows of `pois`, a GeoDataFrame with 74 columns covering the POI name, address and city, longitude/latitude, SIC codes and descriptions, the `SUB_CLASS` 'OFFICES AND CLINICS OF MEDICAL DOCTORS', employee counts, sales volumes, corporate-hierarchy fields, and the point geometry `geom`.)
pois.shape
(59554, 74)
Map([Layer(tx_boundary,
style=basic_style(opacity=0, stroke_color='#11A579', stroke_width=5),
legends=basic_legend('Texas Boundary')),
Layer(pois,
style=basic_style(color='#F2B701', size=2, opacity=0.9, stroke_width=0),
popup_hover=[popup_element('name', 'Client'),
popup_element('employee_here', '# Employees')],
legends=basic_legend('Client Locations'))],
basemap=basemaps.darkmatter)
A fundamental step in territory management is to discretize space. Territory management algorithms are computationally complex and hence it is crucial to leverage the spatial component to reduce complexity. We can do this by working at an aggregated level instead of considering each client location independently.
We first need to identify the smallest spatial aggregation that makes sense for the business: our geographic support. This can be census block groups, zip codes, or counties; alternatively, you might prefer a standard grid, in which case it would ideally be a hierarchical spatial index such as the Quadkey or H3 grid.
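Hierarchical spatial indexes make it cheap to move between resolutions because parent/child relationships are encoded in the cell ids themselves. For quadkeys this is just string prefixing, as this sketch shows (the helper names are ours; H3 offers analogous `h3_to_parent`/`h3_to_children` functions):

```python
# Quadkeys append one digit (0-3) per zoom level, so a cell's
# ancestors are simply prefixes of its id.

def quadkey_parent(quadkey):
    """Parent cell one zoom level up."""
    return quadkey[:-1]

def quadkey_children(quadkey):
    """The four child cells one zoom level down."""
    return [quadkey + digit for digit in '0123']

cell = '0231'
print(quadkey_parent(cell))    # '023'
print(quadkey_children(cell))  # ['02310', '02311', '02312', '02313']
```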
In this notebook we will use an H3 grid of resolution 4. We can easily discretize space by performing a polyfill of the Texas boundary polygon.
Note that a buffer has been applied because H3 fills the polygon with all hexagons of resolution 4 whose centroid lies within it, and we want to make sure the whole territory is covered.
# Buffer
buffer = 2.5e4 # in meters
tx_boundary['geometry_buffer'] = tx_boundary.to_crs('epsg:26914').buffer(buffer).to_crs('epsg:4326')
grid = create_h3_grid(tx_boundary['geometry_buffer'].iloc[0], 4)
grid.head()
|   | hex_id | geometry |
|---|---|---|
| 0 | 84489edffffffff | POLYGON ((-97.03425 29.34525, -96.79420 29.479... |
| 1 | 8426d31ffffffff | POLYGON ((-102.58398 34.37842, -102.33428 34.5... |
| 2 | 8448f29ffffffff | POLYGON ((-104.42876 30.05621, -104.18949 30.2... |
| 3 | 84446ddffffffff | POLYGON ((-95.82009 29.75644, -95.57955 29.887... |
| 4 | 84489c5ffffffff | POLYGON ((-98.02512 29.30940, -97.78501 29.445... |
Map([Layer(grid,
style=basic_style(opacity=0.75),
legends=basic_legend('H3 grid')),
Layer(tx_boundary,
style=basic_style(opacity=0, stroke_color='#E73F74', stroke_width=5),
legends=basic_legend('Texas Boundary'))])
We will aggregate clients by H3 cell, computing the number of clients per cell (`poi_count`) and the average number of employees per client (`employee_avg`).
pois_g = gpd.sjoin(pois, grid, how='right').groupby('hex_id').agg({'geoid':'count', 'employee_here':'mean'}).\
reset_index().rename(columns={'geoid':'poi_count', 'employee_here':'employee_avg'})
pois_g[['poi_count', 'employee_avg']] = pois_g[['poi_count', 'employee_avg']].fillna(0)
areas = grid.merge(pois_g, on='hex_id')
areas = gpd.GeoDataFrame(areas, crs='epsg:4326')
areas.head()
|   | hex_id | geometry | poi_count | employee_avg |
|---|---|---|---|---|
| 0 | 84489edffffffff | POLYGON ((-97.03425 29.34525, -96.79420 29.479... | 15 | 4.250000 |
| 1 | 8426d31ffffffff | POLYGON ((-102.58398 34.37842, -102.33428 34.5... | 1 | 2.000000 |
| 2 | 8448f29ffffffff | POLYGON ((-104.42876 30.05621, -104.18949 30.2... | 0 | 0.000000 |
| 3 | 84446ddffffffff | POLYGON ((-95.82009 29.75644, -95.57955 29.887... | 1108 | 4.801538 |
| 4 | 84489c5ffffffff | POLYGON ((-98.02512 29.30940, -97.78501 29.445... | 496 | 4.854093 |
areas['poi_count'].describe(percentiles=[0.25, 0.5, 0.75, 0.95])
count     402.000000
mean      148.144279
std       709.392374
min         0.000000
25%         0.000000
50%         4.000000
75%        40.500000
95%       496.950000
max      8268.000000
Name: poi_count, dtype: float64
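The class breaks used in the map below roughly follow the distribution's quartiles and 95th percentile; a sketch of deriving breaks from quantiles (toy counts, numpy only):

```python
import numpy as np

# Skewed count variable, as is typical for POI counts per cell
counts = np.array([0, 0, 1, 2, 4, 8, 40, 120, 500, 8268])

# Class breaks at the 25th, 50th, 75th and 95th percentiles
breaks = np.quantile(counts, [0.25, 0.5, 0.75, 0.95])
print(breaks)  # monotonically increasing, one break per percentile
```

Quantile-based breaks adapt to heavily skewed data much better than equal-width intervals would.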
breaks=[1, 4, 40, 500]
Map(Layer(areas,
style=color_bins_style('poi_count', breaks=breaks),
legends=color_bins_legend('Number of Clients')))
Once we have our data aggregated, it's time to start working on building balanced territories.
We will explore two different techniques: KMeans clustering and the Max-p regionalization algorithm.
KMeans is a very well-known and broadly used clustering technique. However, it doesn't allow you to incorporate balancing criteria, and it usually generates low-quality results for territory management.
no_territories = 5
kmeans = KMeans(no_territories, random_state=111)
kmeans.fit_predict(list(map(lambda point:[point.x, point.y], areas.to_crs('epsg:26914').centroid)),
sample_weight=areas['poi_count'])
print('Done!')
Done!
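Under the hood, `sample_weight` turns each centroid update into a weighted mean, pulling centroids toward cells with many clients, while the assignment step is unchanged. A minimal sketch of one such Lloyd iteration on toy 1-D data (numpy only; not sklearn's actual implementation):

```python
import numpy as np

points = np.array([[0.0], [1.0], [10.0], [11.0]])
weights = np.array([1.0, 1.0, 5.0, 1.0])      # third cell has 5x the clients
centroids = np.array([[0.0], [10.0]])

# Assignment step: each point goes to its nearest centroid (weights play no role)
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Update step: each centroid becomes the weighted mean of its assigned points
new_centroids = np.array([
    np.average(points[labels == k], axis=0, weights=weights[labels == k])
    for k in range(len(centroids))
])
print(new_centroids.ravel())  # second centroid is pulled toward the heavy cell
```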
We assign each cell to the cluster it belongs to.
In order to make results comparable across techniques, we identify 5 representative cells, each far enough from the others to make sure they fall in different clusters. This way, we can compare how each cluster changes between techniques.
areas['kmeans_cluster'] = kmeans.labels_
areas['kmeans_cluster'] += 1
# This dictionary contains the representative cells with the cluster they represent.
trans_dict={'8448c69ffffffff':1,
'8426d47ffffffff':2,
'8426cdbffffffff':3,
'84446edffffffff':4,
'844880dffffffff':5}
areas['kmeans_cluster_aux'] = -1
for hex_id in trans_dict:
areas.loc[areas['kmeans_cluster'] == areas.loc[areas['hex_id'] == hex_id, 'kmeans_cluster'].iloc[0], 'kmeans_cluster_aux'] = trans_dict[hex_id]
areas['kmeans_cluster'] = areas['kmeans_cluster_aux']
areas.drop(columns='kmeans_cluster_aux', inplace=True)
We create a string variable with a more descriptive name for our visualizations.
areas['kmeans_cluster_cat'] = list(map(lambda v:f'Cluster_{v}', areas['kmeans_cluster']))
KMeans produces nice compact clusters (see map). However, the clusters are very unbalanced, with Cluster_3 having 10 times more clients than Cluster_1, as can be seen in the chart below.
Map(Layer(areas,
style=color_category_style('kmeans_cluster_cat', cat=sorted(areas['kmeans_cluster_cat'].unique())),
legends=color_category_legend('KMeans Clustering')))
plot_clinic_balance(sorted(areas['kmeans_cluster_cat'].unique()), areas, 'kmeans_cluster_cat', poi_count='sum')
Let's now try to balance the number of clients per cluster while maintaining connected clusters as compact as possible.
We will use Pysal's implementation of the Max-p algorithm. Max-p is a spatial clustering algorithm that calculates spatially connected clusters with similar properties while balancing one criterion or a mix of criteria.
The first thing we need to do is calculate the adjacency matrix, which tells the algorithm which cells are contiguous.
We will use Rook weights, which consider two polygons to be contiguous if they share an edge.
wgt = Rook.from_dataframe(areas, geom_col='geometry')
wgt.histogram
[(2, 6), (3, 27), (4, 34), (5, 33), (6, 302)]
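The histogram above reads as (number of neighbors, number of cells): for instance, 302 cells have the full 6 neighbors a hexagon can have. On a square grid, rook contiguity reduces to sharing an edge (up/down/left/right), which this toy sketch illustrates (the helper name is ours, not libpysal's):

```python
# Rook contiguity on a regular grid: neighbors are the cells sharing an edge,
# unlike queen contiguity, which also counts corner-touching cells.

def rook_neighbors(cell, n_rows, n_cols):
    """Rook-contiguous neighbors of (row, col) on an n_rows x n_cols grid."""
    r, c = cell
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < n_rows and 0 <= j < n_cols]

print(rook_neighbors((0, 0), 3, 3))  # corner cell: 2 neighbors
print(rook_neighbors((1, 1), 3, 3))  # interior cell: 4 neighbors
```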
We would like to balance clusters based on the total number of clients. Normally we are not looking for a perfect balance, especially when dealing with more than one criterion, so a balance tolerance is introduced. In our case, we will consider a tolerance of 20%, which means that we allow clusters to be as much as 20% below the perfect balance.
Note we are only using the number of clients to balance, but this dataset also has the number of employees per client and you might even have other data you might be interested in using. The good news is Max-p allows you to do that.
# Trick to help the algorithm find compact areas: add a pseudo-count of 10
# to every cell so that empty cells also carry some weight
areas['poi_count'] += 10
balance_tolerance = 0.2 # 20%
perfect_balance = areas['poi_count'].sum()/5
threshold = int(np.floor(perfect_balance * (1-balance_tolerance)))
print('Minimum number of clients per cluster', threshold)
Minimum number of clients per cluster 10171
Max-p also allows you to set similarity criteria. These are variables that you want to have a similar behavior within clusters. For example, you might be interested in having clusters with similar demographic or socioeconomic characteristics.
In this case, we don't have any specific criteria, so we will use the grid cell centroid coordinates as similarity criteria in order to get clusters as compact as possible. You can try removing these, or adding only one of the coordinates, to clearly see the effect of these similarity criteria.
# Grid cell centroid coordinates
centroids = areas.centroid
areas['lat'] = centroids.y
areas['lon'] = centroids.x
areas['lat_norm'] = (areas['lat'] - areas['lat'].min())/(areas['lat'].max() - areas['lat'].min())
areas['lon_norm'] = (areas['lon'] - areas['lon'].min())/(areas['lon'].max() - areas['lon'].min())
areas.head()
|   | hex_id | geometry | poi_count | employee_avg | kmeans_cluster | kmeans_cluster_cat | lat | lon | lat_norm | lon_norm | maxp_cluster | maxp_cluster_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84489edffffffff | POLYGON ((-97.03425 29.34525, -96.79420 29.479... | 25 | 4.250000 | 5 | Cluster_5 | 29.597247 | -97.048882 | 0.352024 | 0.726035 | 5 | Cluster_5 |
| 1 | 8426d31ffffffff | POLYGON ((-102.58398 34.37842, -102.33428 34.5... | 11 | 2.000000 | 2 | Cluster_2 | 34.626792 | -102.612015 | 0.816273 | 0.308192 | 2 | Cluster_2 |
| 2 | 8448f29ffffffff | POLYGON ((-104.42876 30.05621, -104.18949 30.2... | 10 | 0.000000 | 1 | Cluster_1 | 30.309779 | -104.459796 | 0.417793 | 0.169406 | 1 | Cluster_1 |
| 3 | 84446ddffffffff | POLYGON ((-95.82009 29.75644, -95.57955 29.887... | 1118 | 4.801538 | 4 | Cluster_4 | 30.007381 | -95.832104 | 0.389881 | 0.817426 | 4 | Cluster_4 |
| 4 | 84489c5ffffffff | POLYGON ((-98.02512 29.30940, -97.78501 29.445... | 506 | 4.854093 | 5 | Cluster_5 | 29.561949 | -98.041929 | 0.348765 | 0.651448 | 5 | Cluster_5 |
maxp_heur = MaxPHeuristic(areas, wgt, ['lat_norm', 'lon_norm'], 'poi_count', threshold,
5, max_iterations_construction=2, max_iterations_sa=5, verbose=True)
maxp_heur.solve()
max_p: 5
number of good partitions: 2
0
totalWithinRegionDistance after SA: 6900.531405712055
totalWithinRegionDistance after SA: 7225.077640946365
totalWithinRegionDistance after SA: 7395.269464681606
totalWithinRegionDistance after SA: 6905.536769392361
totalWithinRegionDistance after SA: 7370.850274070599
1
totalWithinRegionDistance after SA: 5642.22448183758
totalWithinRegionDistance after SA: 5613.498459949133
totalWithinRegionDistance after SA: 5618.573555920197
totalWithinRegionDistance after SA: 5749.532705832657
totalWithinRegionDistance after SA: 5677.7927687007
best objective value: 5613.498459949133
np.max(maxp_heur.labels_)
5
# Undo change
areas['poi_count'] -= 10
We assign each cell to the cluster it belongs to and rename clusters based on the representative cells we mentioned in the KMeans section.
areas['maxp_cluster'] = maxp_heur.labels_
areas['maxp_cluster_aux'] = -1
for hex_id in trans_dict:
areas.loc[areas['maxp_cluster'] == areas.loc[areas['hex_id'] == hex_id, 'maxp_cluster'].iloc[0], 'maxp_cluster_aux'] = trans_dict[hex_id]
areas['maxp_cluster'] = areas['maxp_cluster_aux']
areas.drop(columns='maxp_cluster_aux', inplace=True)
areas['maxp_cluster_cat'] = list(map(lambda v:f'Cluster_{v}', areas['maxp_cluster']))
We can see that the clusters are now less compact than they were with KMeans, but they are balanced, with all of them satisfying the minimum-clients threshold computed above.
Layout([Map(Layer(areas,
style=color_category_style('kmeans_cluster_cat', cat=sorted(areas['kmeans_cluster_cat'].unique())),
legends=color_category_legend('KMeans Clustering'),
popup_hover=[popup_element('hex_id'), popup_element('maxp_cluster_cat')])),
Map(Layer(areas,
style=color_category_style('maxp_cluster_cat', cat=sorted(areas['maxp_cluster_cat'].unique())),
legends=color_category_legend('Max-p Clustering'),
popup_hover=[popup_element('hex_id'), popup_element('maxp_cluster_cat')]))
], map_height=400)
plot_clinic_balance(sorted(areas['maxp_cluster_cat'].unique()), areas, 'maxp_cluster_cat', poi_count='sum')