Geosocial data is location-based social media data that can be interpreted and analyzed as part of any location-oriented business decision.
In this notebook, Spatial.ai geosocial data is combined with Mastercard financial data to quantify how much of retail performance can be explained by geosocial behavior. We compare the cities of Chicago and Los Angeles to show how retail performance is driven by different social segments depending on the city.
Note this use case leverages premium datasets from CARTO Data Observatory.
The notebook is organized in the following sections:
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from cartoframes.auth import set_default_credentials
from cartoframes.data.observatory import *
from cartoframes.viz import *
from shapely import wkt
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')
%matplotlib inline
In order to be able to use the Data Observatory via CARTOframes, you need to set your CARTO account credentials first.
Please, visit the Authentication guide for further detail.
from cartoframes.auth import set_default_credentials
set_default_credentials('creds.json')
Note about credentials
For security reasons, we recommend storing your credentials in an external file to prevent publishing them by accident when sharing your notebooks. You can get more information in the section Setting your credentials of the Authentication guide.
In this section, we'll download the two datasets we're interested in and combine them into a single dataframe.
For more information on how to access Data Observatory datasets using CARTOframes visit the Guides or take a look at the Access Premium Data template.
The following function loads the census block group ids for the city of interest. This is used afterwards to filter the data downloaded from the Data Observatory.
LA_CBG_PATH = 'https://libs.cartocdn.com/cartoframes/samples/la_cbg.csv'
LA_BB = '-118.673619,33.553967,-117.997960,34.360425'
CHICACO_CBG_PATH = 'https://libs.cartocdn.com/cartoframes/samples/chicago_cbg.csv'
CHICAGO_BB = '-88.638285,41.434892,-87.487468,42.502873'
def read_cbg(city):
cbg_list = pd.DataFrame()
bbox = ''
if city == 'la':
cbg_list = pd.read_csv(LA_CBG_PATH, dtype={'geoid':str})['geoid'].tolist()
bbox = LA_BB
elif city == 'chicago':
cbg_list = pd.read_csv(CHICACO_CBG_PATH, dtype={'geoid':str})['geoid'].tolist()
bbox = CHICAGO_BB
return cbg_list, bbox
The following function downloads the latest Spatial.ai Geosocial Segments data for the specified bounding box.
def download_social(bbox, cbg_list):
dataset = Dataset.get('spa_geosocial_s_d5dc42ae')
sql_query = f"SELECT * FROM $dataset$ WHERE CAST(do_date AS date) >= (SELECT MAX(CAST(do_date AS date)) FROM $dataset$) AND ST_IntersectsBox(geom, {bbox})"
social = dataset.to_dataframe(sql_query=sql_query)
social = social[social['geoid'].isin(cbg_list)]
social.drop(columns=['do_label', 'do_area', 'do_perimeter', 'do_num_vertices'], inplace=True)
return social
The following function downloads the latest Mastercard Geographic Insights data for the specified bounding box.
Note the data is also filtered by geo_type. There are three types of geo_types depending if the indices shown represent a comparison with regards to country (c), province (p) or metropolitan area (m).
def download_mastercard(bbox, cbg_list):
dataset = Dataset.get('mc_geographic__7980c5c3')
sql_query = f"SELECT * FROM $dataset$ WHERE geo_type = 'm' AND CAST(do_date AS date) >= (SELECT MAX(CAST(do_date AS date)) FROM $dataset$) AND ST_IntersectsBox(geom, {bbox})"
mrli = dataset.to_dataframe(sql_query=sql_query)
mrli = mrli[mrli['geoid'].isin(cbg_list)]
mrli.drop(columns=['do_label', 'do_area', 'do_perimeter', 'do_num_vertices'], inplace=True)
mrli = mrli.sort_values(['geoid', 'industry', 'segment', 'geo_type', 'do_date']).reset_index(drop=True)
return mrli
The following function loads the three datasets and merges them into a single one.
Note we can merge the dataframes because both datasets are defined at the census block group level, so the geoid's are common to both of them.
def download_data(city):
cbg_list, bbox = read_cbg(city)
social = download_social(bbox, cbg_list)
social.drop(columns='geom', inplace=True)
mrli = download_mastercard(bbox, cbg_list)
mrli = mrli[(mrli['segment'] == 'o')]
mrli = mrli.merge(social, on='geoid')
mrli = gpd.GeoDataFrame(mrli, crs='epsg:4326')
return mrli
Here we'll load the data for Chicago and LA.
chicago = download_data('chicago')
chicago.head(3)
geoid | do_date_x | region_id | industry | segment | geo_type | txn_amt | txn_cnt | acct_cnt | avg_ticket | avg_freq | avg_spend_amt | geom | comparison_level | comparison_geography_name | comparison_geography_ID | comparison_geography_count | EA01_bookish | EA02_engine_enthusiasts | EA03_green_thumb | EA04_natural_beauty | EA05_wanderlust | EA06_handcrafted | EA07_animal_advocates | EA08_dog_lovers | EA09_smoke_culture | EB01_daily_grind | EB02_nerd_culture | EB03_lgbtq_culture | EB04_wealth_signaling | EB05_hipster | EB06_student_life | EB07_farm_culture | EC01_love_and_romance | EC02_happily_ever_after | EC03_dating_life | EC04_connected_motherhood | EC05_family_time | EC06_my_crew | EC07_girl_squad | EC08_networking | ED01_sweet_treats | ED02_coffee_connoisseur | ED03_trendy_eats | ED04_whiskey_business | ED05_asian_food_and_culture | ED06_ingredient_attentive | ED07_fueling_for_fitness | ED08_wine_lovers | ED09_hops_and_brews | EE01_film_lovers | EE02_competitive_nature | EE03_live_experiences | EE04_late_night_leisure | EE05_party_life | EE06_live_and_local_music | EE07_lighthearted_fun | EF01_deep_emotions | EF02_heartfelt_sharing | EF03_awestruck | EF04_happy_place | EF05_gratitude | EF06_memory_lane | EG01_dance_devotion | EG02_artistic_appreciation | EG03_pieces_of_history | EG04_sites_to_see | EG05_hip_hop_culture | EH01_mindfulness_and_spirituality | EH02_activism | EH03_politically_engaged | EH04_praise_and_worship | EH05_conservation | EH06_civic_attentiveness | EI01_trend_trackers | EI02_all_about_hair | EI03_mens_style | EI04_fitness_fashion | EI05_body_art | EI06_smart_chic | EJ01_home_and_leisure | EJ02_past_reflections | EJ03_humanitarian | EJ04_deal_seekers | EK01_organized_sports | EK02_fitness_obsession | EK03_outdoor_adventures | EK04_yoga_advocates | EK05_functional_fitness | index_high_end_affinity | index_mid_range_consumer_affinity | index_discount_affinity | index_foodie_affinity | index_breakfast_and_brunch_affinity | index_fast_casual_affinity | index_coffee_affinity | index_fast_food_affinity | index_organic_and_local_affinity | index_fashion_affinity | index_health_conscious_affinity | index_entertainment_affinity | index_late_night_affinity | index_personal_care_affinity | index_younger_affinity | index_older_affinity | index_family_affinity | index_politically_liberal_affinity | index_politically_conservative_affinity | VOLUME_INDEX | do_date_y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 170310101002 | 2020-11-01 | 17 | app | o | m | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((-87.67710 42.02255, -87.67355 42.019... | National | USA | USA | 217739.0 | 83.62 | 43.28 | 68.41 | 14.97 | 6.35 | 25.00 | 32.48 | 28.97 | 29.19 | 52.57 | 67.76 | 93.59 | 16.93 | 56.80 | 48.33 | 33.18 | 14.03 | 8.83 | 27.18 | 5.68 | 30.27 | 9.68 | 21.23 | 77.17 | 71.39 | 42.15 | 71.69 | 13.01 | 61.73 | 56.89 | 51.81 | 38.39 | 68.34 | 78.92 | 29.93 | 67.00 | 69.64 | 64.63 | 46.62 | 23.23 | 44.27 | 49.23 | 18.97 | 49.69 | 67.29 | 20.29 | 79.46 | 79.36 | 98.01 | 11.81 | 53.29 | 53.82 | 86.23 | 45.19 | 23.09 | 40.48 | 81.35 | 28.62 | 34.72 | 6.44 | 26.99 | 69.13 | 12.31 | 62.56 | 40.60 | 45.24 | 23.93 | 52.53 | 53.74 | 12.33 | 56.89 | 50.25 | 54.33 | 41.08 | 60.49 | 42.64 | 59.48 | 32.71 | 67.48 | 36.41 | 54.37 | 21.82 | 53.04 | 59.01 | 54.09 | 42.52 | 68.76 | 22.14 | 20.38 | 64.57 | 24.70 | 2.55 | 2020-10-01 |
1 | 170310101002 | 2020-11-01 | 17 | app | o | m | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((-87.67710 42.02255, -87.67355 42.019... | DMA | Chicago (IL) | 602 | 6735.0 | 79.19 | 64.75 | 71.43 | 29.65 | 21.09 | 42.97 | 50.86 | 45.40 | 35.06 | 52.34 | 64.25 | 87.77 | 24.20 | 57.88 | 52.57 | 60.02 | 20.35 | 20.87 | 29.03 | 14.91 | 34.34 | 19.88 | 19.74 | 66.99 | 67.17 | 42.67 | 68.96 | 22.51 | 57.91 | 64.65 | 57.94 | 51.36 | 70.87 | 75.45 | 37.59 | 60.06 | 67.76 | 59.11 | 50.06 | 44.04 | 49.97 | 42.51 | 33.33 | 52.34 | 69.80 | 26.48 | 74.55 | 76.61 | 95.61 | 24.87 | 42.35 | 59.96 | 86.19 | 51.25 | 34.39 | 52.47 | 77.81 | 32.89 | 36.26 | 6.89 | 31.87 | 71.56 | 20.08 | 65.11 | 50.15 | 45.91 | 30.44 | 58.59 | 63.08 | 32.02 | 64.68 | 47.21 | 61.84 | 51.14 | 58.85 | 49.54 | 59.94 | 38.50 | 67.08 | 49.82 | 60.42 | 25.60 | 58.30 | 58.63 | 53.51 | 46.46 | 65.93 | 39.58 | 32.71 | 62.90 | 36.44 | 2.55 | 2020-10-01 |
2 | 170310101002 | 2020-11-01 | 17 | app | o | m | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((-87.67710 42.02255, -87.67355 42.019... | County | Cook County | 17031 | 3993.0 | 74.23 | 68.14 | 67.17 | 29.38 | 20.38 | 44.77 | 52.08 | 47.39 | 36.71 | 48.09 | 66.71 | 81.19 | 19.84 | 49.97 | 52.40 | 68.95 | 18.57 | 20.65 | 27.63 | 13.77 | 32.43 | 17.92 | 18.87 | 57.57 | 62.78 | 39.22 | 62.83 | 22.21 | 51.99 | 63.94 | 55.20 | 50.46 | 71.48 | 67.82 | 36.47 | 56.31 | 65.61 | 46.31 | 47.25 | 46.74 | 44.12 | 34.31 | 34.58 | 51.43 | 67.74 | 22.86 | 69.68 | 69.57 | 94.15 | 25.47 | 32.51 | 54.07 | 79.43 | 45.28 | 33.05 | 48.57 | 70.86 | 32.13 | 31.70 | 4.82 | 30.24 | 69.16 | 18.17 | 62.61 | 44.31 | 44.29 | 29.19 | 62.61 | 67.76 | 37.60 | 62.83 | 46.79 | 59.19 | 51.37 | 52.82 | 48.17 | 55.57 | 38.74 | 61.86 | 50.94 | 56.24 | 23.41 | 59.04 | 56.91 | 49.90 | 43.05 | 61.87 | 42.37 | 33.22 | 55.65 | 38.14 | 2.55 | 2020-10-01 |
la = download_data('la')
la.head(3)
geoid | do_date_x | region_id | industry | segment | geo_type | txn_amt | txn_cnt | acct_cnt | avg_ticket | avg_freq | avg_spend_amt | geom | comparison_level | comparison_geography_name | comparison_geography_ID | comparison_geography_count | EA01_bookish | EA02_engine_enthusiasts | EA03_green_thumb | EA04_natural_beauty | EA05_wanderlust | EA06_handcrafted | EA07_animal_advocates | EA08_dog_lovers | EA09_smoke_culture | EB01_daily_grind | EB02_nerd_culture | EB03_lgbtq_culture | EB04_wealth_signaling | EB05_hipster | EB06_student_life | EB07_farm_culture | EC01_love_and_romance | EC02_happily_ever_after | EC03_dating_life | EC04_connected_motherhood | EC05_family_time | EC06_my_crew | EC07_girl_squad | EC08_networking | ED01_sweet_treats | ED02_coffee_connoisseur | ED03_trendy_eats | ED04_whiskey_business | ED05_asian_food_and_culture | ED06_ingredient_attentive | ED07_fueling_for_fitness | ED08_wine_lovers | ED09_hops_and_brews | EE01_film_lovers | EE02_competitive_nature | EE03_live_experiences | EE04_late_night_leisure | EE05_party_life | EE06_live_and_local_music | EE07_lighthearted_fun | EF01_deep_emotions | EF02_heartfelt_sharing | EF03_awestruck | EF04_happy_place | EF05_gratitude | EF06_memory_lane | EG01_dance_devotion | EG02_artistic_appreciation | EG03_pieces_of_history | EG04_sites_to_see | EG05_hip_hop_culture | EH01_mindfulness_and_spirituality | EH02_activism | EH03_politically_engaged | EH04_praise_and_worship | EH05_conservation | EH06_civic_attentiveness | EI01_trend_trackers | EI02_all_about_hair | EI03_mens_style | EI04_fitness_fashion | EI05_body_art | EI06_smart_chic | EJ01_home_and_leisure | EJ02_past_reflections | EJ03_humanitarian | EJ04_deal_seekers | EK01_organized_sports | EK02_fitness_obsession | EK03_outdoor_adventures | EK04_yoga_advocates | EK05_functional_fitness | index_high_end_affinity | index_mid_range_consumer_affinity | index_discount_affinity | index_foodie_affinity | index_breakfast_and_brunch_affinity | index_fast_casual_affinity | index_coffee_affinity | index_fast_food_affinity | index_organic_and_local_affinity | index_fashion_affinity | index_health_conscious_affinity | index_entertainment_affinity | index_late_night_affinity | index_personal_care_affinity | index_younger_affinity | index_older_affinity | index_family_affinity | index_politically_liberal_affinity | index_politically_conservative_affinity | VOLUME_INDEX | do_date_y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 60371011102 | 2020-11-01 | 6 | aut | o | m | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((-118.29945 34.25598, -118.29104 34.2... | National | USA | USA | 217739.0 | 99.11 | 93.39 | 98.63 | 81.01 | 86.05 | 81.89 | 92.44 | 79.65 | 65.64 | 80.14 | 96.65 | 95.14 | 57.91 | 82.17 | 96.79 | 71.58 | 97.04 | 70.15 | 78.75 | 92.92 | 98.63 | 98.44 | 63.37 | 65.62 | 90.24 | 94.63 | 92.97 | 84.73 | 91.36 | 56.71 | 73.29 | 89.60 | 92.81 | 99.06 | 61.66 | 98.07 | 96.98 | 86.23 | 97.13 | 98.71 | 98.55 | 96.21 | 98.83 | 80.65 | 90.82 | 92.39 | 99.49 | 96.52 | 89.71 | 95.96 | 95.53 | 96.29 | 88.59 | 78.29 | 98.69 | 77.74 | 83.34 | 52.02 | 86.27 | 58.59 | 68.67 | 95.61 | 67.51 | 92.95 | 98.59 | 99.44 | 77.21 | 86.57 | 75.94 | 44.97 | 88.27 | 64.15 | 89.79 | 85.64 | 86.32 | 76.91 | 93.58 | 79.28 | 91.15 | 81.09 | 89.03 | 66.61 | 77.64 | 94.26 | 87.30 | 78.58 | 95.65 | 76.62 | 92.05 | 93.62 | 75.81 | 0.73 | 2020-10-01 |
1 | 60371011102 | 2020-11-01 | 6 | aut | o | m | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((-118.29945 34.25598, -118.29104 34.2... | DMA | Los Angeles (CA) | 803 | 10816.0 | 99.52 | 88.67 | 98.91 | 87.12 | 88.63 | 83.45 | 91.40 | 83.68 | 56.52 | 81.82 | 97.66 | 88.44 | 56.34 | 75.89 | 97.49 | 91.51 | 97.05 | 80.63 | 80.34 | 94.37 | 98.85 | 98.60 | 75.50 | 70.75 | 88.76 | 92.90 | 89.00 | 90.99 | 89.55 | 65.53 | 73.97 | 90.48 | 96.05 | 94.90 | 63.92 | 96.09 | 97.33 | 84.31 | 96.26 | 99.22 | 96.09 | 92.04 | 98.99 | 84.75 | 81.49 | 91.32 | 98.72 | 92.87 | 90.35 | 96.50 | 93.83 | 91.29 | 82.88 | 88.66 | 98.60 | 80.79 | 78.17 | 73.71 | 84.72 | 54.19 | 70.00 | 92.82 | 64.82 | 92.87 | 98.71 | 99.51 | 67.67 | 89.37 | 77.37 | 51.53 | 86.51 | 61.55 | 88.08 | 86.62 | 82.07 | 80.79 | 91.70 | 84.66 | 88.42 | 84.90 | 86.00 | 69.49 | 77.76 | 95.14 | 89.89 | 76.08 | 95.54 | 83.43 | 96.39 | 89.17 | 83.32 | 0.73 | 2020-10-01 |
2 | 60371011102 | 2020-11-01 | 6 | aut | o | m | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((-118.29945 34.25598, -118.29104 34.2... | County | Los Angeles County | 6037 | 6425.0 | 99.42 | 90.01 | 98.62 | 88.70 | 89.88 | 84.85 | 91.16 | 83.57 | 56.72 | 79.97 | 97.75 | 83.02 | 52.28 | 70.08 | 97.65 | 94.57 | 97.00 | 83.31 | 78.01 | 95.43 | 99.12 | 98.85 | 76.10 | 66.60 | 87.42 | 92.62 | 87.48 | 91.13 | 88.70 | 62.53 | 73.89 | 89.23 | 96.30 | 92.49 | 63.11 | 94.68 | 97.15 | 78.82 | 95.61 | 99.48 | 94.50 | 88.63 | 99.08 | 83.88 | 79.34 | 90.58 | 98.20 | 89.67 | 87.81 | 96.71 | 91.23 | 88.71 | 75.03 | 85.88 | 98.93 | 77.62 | 73.29 | 74.88 | 84.98 | 50.35 | 69.03 | 93.42 | 62.72 | 91.13 | 98.15 | 99.46 | 67.39 | 91.23 | 79.47 | 55.72 | 84.38 | 61.75 | 85.79 | 86.98 | 79.04 | 80.04 | 90.48 | 85.09 | 85.97 | 85.32 | 83.54 | 68.39 | 78.14 | 94.47 | 88.80 | 74.42 | 94.37 | 84.98 | 97.38 | 85.11 | 84.68 | 0.73 | 2020-10-01 |
Once we have the data ready for analysis, we will start by selecting a target variable out of the Mastercard financial variables. This is the variable we are interested in explaining. We'll select the variable txn_amt
which is the total transaction amount by census block group. In addition, we can select an industry. Mastercard provides data for 5 different industries: eating places, groceries, apparel, automotive fuel, and accomodation. They also provide the industry retail which is the sum of all the latter. For this use case, we'll select eating places, eap
.
target_var = 'txn_amt' # Total transaction amount
industry = 'eap' # Eating places
ch_ret = chicago[(chicago['industry'] == industry) & (chicago['comparison_level'] == 'DMA')]
la_ret = la[(la['industry'] == industry) & (la['comparison_level'] == 'DMA')]
Here we identify the top 5% performers and visualize them.
An interesting insight here is how in Chicago top performers tend to concentrate in the downtown area, while in LA they are more spread throughout the city.
def classify_finance(value, thres_95):
if value >= thres_95:
return 'Top 5%'
else:
return 'Rest'
ch_ret.loc[:, f'{target_var}_class'] = list(map(classify_finance, ch_ret[target_var], [ch_ret[target_var].quantile(0.95)]*ch_ret.shape[0]))
la_ret.loc[:, f'{target_var}_class'] = list(map(classify_finance, la_ret[target_var], [la_ret[target_var].quantile(0.95)]*la_ret.shape[0]))
Layout([Map(Layer(ch_ret[ch_ret[f'{target_var}_class'] != 'Rest'],
geom_col='geom',
style=color_category_style(f'{target_var}_class',
cat=['Top 5%'],
palette=['#009B9E']),
legends=color_category_legend('Performers'))),
Map(Layer(la_ret[la_ret[f'{target_var}_class'] != 'Rest'],
geom_col='geom',
style=color_category_style(f'{target_var}_class',
cat=['Top 5%'],
palette=['#009B9E']),
legends=color_category_legend('Performers')))],
map_height=420
)
In this section, we'll identify what social segments characterize top performers in Chicago and LA and compare the two cities.
Main insights:
social_segment_columns = ch_ret.columns[17:-22]
top_ch_ret = ch_ret[ch_ret[f'{target_var}_class'] == 'Top 5%']
top_ch_ret = top_ch_ret[social_segment_columns].describe().transpose()
top_ch_ret.sort_values(['50%'], ascending=False, inplace=True)
top_la_ret = la_ret[la_ret[f'{target_var}_class'] == 'Top 5%']
top_la_ret = top_la_ret[social_segment_columns].describe().transpose()
top_la_ret.sort_values(['50%'], ascending=False, inplace=True)
We'll use the following function to make sure social segments have same color in both cities for an easier comparison.
Note we'll use CARTO's bold palette colors. Here you can explore all our palettes.
def palette(segments, color_dict):
return [color_dict[segment] for segment in segments]
unique_segments = np.unique(top_ch_ret.head(7).index.tolist() + top_la_ret.head(7).index.tolist())
palette_c=['#7F3C8D','#11A579','#3969AC','#F2B701','#E73F74','#80BA5A','#E68310','#008695','#CF1C90','#f97b72', '#4b4b8f', '#A5AA99']
color_dict = dict(zip(unique_segments, palette_c))
def plot_top_segments(df, ax, color_dict):
sns.barplot(x='50%', y='social_segment', data=df.head(7).reset_index().rename(columns={'index':'social_segment'}),
alpha=0.96, ax=ax, palette=palette(df.head(7).index.tolist(), color_dict))
ax.set_title('Chicago', fontsize=18, fontweight='bold')
ax.set_xlabel('')
ax.set_ylabel('')
ax.tick_params(labelsize=15)
fig, axs = plt.subplots(1, 2, figsize=(18, 6))
plot_top_segments(top_ch_ret, axs[0], color_dict)
plot_top_segments(top_la_ret, axs[1], color_dict)
fig.tight_layout()
Main insights:
social_indices_columns = ch_ret.columns[-22:-3]
top_ch_ret_ix = ch_ret[ch_ret[f'{target_var}_class'] == 'Top 5%']
top_ch_ret_ix = top_ch_ret_ix[social_indices_columns].describe().transpose()
top_ch_ret_ix.sort_values(['50%'], ascending=False, inplace=True)
top_la_ret_ix = la_ret[la_ret[f'{target_var}_class'] == 'Top 5%']
top_la_ret_ix = top_la_ret_ix[social_indices_columns].describe().transpose()
top_la_ret_ix.sort_values(['50%'], ascending=False, inplace=True)
unique_segments = np.unique(top_ch_ret_ix.head(7).index.tolist() + top_la_ret_ix.head(7).index.tolist())
palette_c=['#7F3C8D','#11A579','#3969AC','#F2B701','#E73F74','#80BA5A','#E68310','#008695','#CF1C90','#f97b72']
color_dict = dict(zip(unique_segments, palette_c))
def plot_top_indices(df, ax, color_dict):
sns.barplot(x='50%', y='index', data=df.head(7).reset_index(),
alpha=0.96, ax=ax, palette=palette(df.head(7).index.tolist(), color_dict))
ax.set_title('Chicago', fontsize=18, fontweight='bold')
ax.set_xlabel('')
ax.set_ylabel('')
ax.tick_params(labelsize=11)
fig, axs = plt.subplots(1, 2, figsize=(18, 6))
plot_top_indices(top_ch_ret_ix, axs[0], color_dict)
plot_top_indices(top_la_ret_ix, axs[1], color_dict)
fig.tight_layout()
In this section we'll carry out a deeper analysis on how geosocial data can help explain retail performance. We'll compare the most important features for Chicago and LA.
Here we calculate the correlation coefficient between every social segment and the total transaction amount for restaurants. This allows us to identify which segments have a strongest impact and the differences between both cities.
This analysis provides very interesting insights. We can see how the social segment "sites to see" impacts positively in Chicago, while it barely impacts in LA.
def calculate_corrcoef(target_var, df, columns):
corr_coefs = []
for ssegment in columns:
corr_aux = df[~df[ssegment].isnull()]
corr_coefs.append(np.corrcoef(corr_aux[target_var], corr_aux[ssegment])[0][1])
corr_df_ch = pd.DataFrame(data={'social_segment':columns, 'corr_coef':corr_coefs})
corr_df_ch['corr_coef_abs'] = np.abs(corr_df_ch['corr_coef'])
corr_df_ch.sort_values('corr_coef_abs', ascending=False, inplace=True)
return corr_df_ch
corr_df_ch = calculate_corrcoef(target_var, ch_ret, social_segment_columns)
corr_df_la = calculate_corrcoef(target_var, la_ret, social_segment_columns)
corr_df_ch['city'] = 'Chicago'
corr_df_la['city'] = 'LA'
corr_df = pd.concat([corr_df_ch, corr_df_la], ignore_index=True)
plt.figure(figsize=(18, 6))
sns.barplot(x='social_segment', y='corr_coef', hue='city', data=corr_df, alpha=0.99)
plt.xticks(rotation=90)
plt.xlabel('')
plt.title(f"Correlation strength of social segments with target var '{target_var}' - industry '{industry}'",
fontsize=15, fontweight='light', pad=15)
plt.tight_layout()
Now, we'll calculate the correlation coefficient between every social index and the metric total transaction amount for restaurants. This allows us to identify which indices have a strongest impact and the differences between both cities.
It is interesting to see the different correlation coefficients for older affinity and discount affinity in Chicago and LA.
corr_df_ch_ix = calculate_corrcoef(target_var, ch_ret, social_indices_columns)
corr_df_la_ix = calculate_corrcoef(target_var, la_ret, social_indices_columns)
corr_df_ch_ix['city'] = 'Chicago'
corr_df_la_ix['city'] = 'LA'
corr_df_ix = pd.concat([corr_df_ch_ix, corr_df_la_ix], ignore_index=True)
plt.figure(figsize=(10, 6))
sns.barplot(y='social_segment', x='corr_coef', hue='city', data=corr_df_ix, alpha=0.95)
plt.xticks(rotation=90)
plt.ylabel('')
plt.yticks(fontsize=12)
plt.title(f"Correlation strength of indices with target var '{target_var}' - industry '{industry}'",
fontsize=14, fontweight='light', pad=15)
plt.tight_layout()