Hawkeye Liquor Mart (a fictitious company) is considering opening a new liquor store in Iowa. As a first step, this analysis identifies geographic areas that merit more intensive market research and scouting. This executive summary provides a high-level synopsis of the approach, findings, limitations, and recommended next steps.
This analysis works primarily with a dataset obtained from the Iowa Alcoholic Beverages Division, which contains transaction records of liquor sales to retailers holding Class E Liquor licenses.
The primary analytical approach was to assess Iowa counties to identify where there is evidence of opportunity across three factors:
Using standard analysis techniques and incorporating basic population and economic data, this analysis derives an "Opportunity Score" metric, which assigns each county a numerical score based on the three factors above.
Fifteen counties (out of 99) score at or above the 75th percentile as measured by the Opportunity Score, and they are recommended for further analysis.
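For illustration, the sketch below shows how such a percentile-based screen can be implemented in pandas. The factor columns (factor_1, factor_2, factor_3) and the sample counties are placeholders, and averaging percentile ranks is one plausible construction of the score, not necessarily the exact formula used in this analysis.

import pandas as pd

# Hypothetical county-level factor table; the real factors are derived in the analysis below.
county_factors = pd.DataFrame({
    'factor_1': [10.0, 25.0, 5.0, 40.0],
    'factor_2': [0.2, 0.9, 0.1, 0.7],
    'factor_3': [100.0, 300.0, 50.0, 250.0],
}, index=['Adair', 'Boone', 'Cass', 'Dallas'])

# Score each county by its average percentile rank across the factors.
county_factors['Opportunity_Score'] = county_factors.rank(pct=True).mean(axis=1)

# Shortlist counties at or above the 75th percentile of the score.
cutoff = county_factors.Opportunity_Score.quantile(0.75)
shortlist = county_factors[county_factors.Opportunity_Score >= cutoff]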
Further, we can use basic linear regression to estimate the revenue a new store might achieve in a given county (where we have sufficient data). For example, among the top counties on the list, the baseline projected revenue is:
It is important to note that the dataset does not include retail pricing (the cost to consumers), only the per-unit cost to liquor retailers. Therefore, figures labeled "sales" or "revenue" should be read as the retailers' cost of goods sold (COGS). While this does not allow us to calculate actual revenue or profit, it is a reasonable starting point for understanding the market.
The data and the resulting county list are intended to provide a starting point for deeper research into each county. I recommend evaluating the list, then prioritizing and/or paring it down as needed. In particular, I recommend researching actual retail pricing and other margin-impacting measures (overhead, inventory turnover, etc.).
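To illustrate the regression idea mentioned above: a minimal sketch, assuming a county-level table with a hypothetical Population column (which would come from census data) alongside total sales; the feature set actually used in the analysis may differ, and the numbers here are illustrative only.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical inputs: county population and observed total sales (really COGS, per the caveat above).
county_data = pd.DataFrame({
    'Population': [5000, 20000, 75000, 430000],
    'Total_Sales': [120000.0, 510000.0, 1900000.0, 11000000.0],
})

# Fit a simple one-variable linear model: total sales as a function of population.
model = LinearRegression()
model.fit(county_data[['Population']], county_data.Total_Sales)

# Baseline projected sales for a candidate county of 35,000 people.
projected = model.predict(pd.DataFrame({'Population': [35000]}))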
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# Note: this file is large (>377 MB), which is why I have not made it available on Github. If you are interested in
# reproducing this, you can access the same data from the Iowa Alcoholic Beverages Division's website (https://abd.iowa.gov/)
# I may consider hosting this specific file if enough people express interest.
rawdata = pd.read_csv("../../datasets/Project2/Iowa_Liquor_Sales_reduced__FULL_SET.csv", header=0, low_memory=False)
rawdata.head()
|   | Date | Store Number | City | Zip Code | County Number | County | Category | Category Name | Vendor Number | Item Number | Item Description | Bottle Volume (ml) | State Bottle Cost | State Bottle Retail | Bottles Sold | Sale (Dollars) | Volume Sold (Liters) | Volume Sold (Gallons) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 03/31/2016 | 5029 | DAVENPORT | 52806 | 82.000 | Scott | 1022100.000 | TEQUILA | 370 | 87152 | Avion Silver | 375 | $9.99 | $14.99 | 12 | $179.88 | 4.500 | 1.190 |
1 | 03/31/2016 | 5029 | DAVENPORT | 52806 | 82.000 | Scott | 1022100.000 | TEQUILA | 395 | 89197 | Jose Cuervo Especial Reposado Tequila | 1000 | $12.50 | $18.75 | 2 | $37.50 | 2.000 | 0.530 |
2 | 03/31/2016 | 4959 | CEDAR FALLS | 50613 | 7.000 | Black Hawk | 1071100.000 | AMERICAN COCKTAILS | 380 | 63959 | Uv Blue Raspberry Lemonade Pet | 1750 | $5.97 | $8.96 | 6 | $53.76 | 10.500 | 2.770 |
3 | 03/31/2016 | 2190 | DES MOINES | 50314 | 77.000 | Polk | 1031200.000 | VODKA FLAVORED | 205 | 40597 | New Amsterdam Red Berry | 200 | $2.24 | $3.36 | 48 | $161.28 | 9.600 | 2.540 |
4 | 03/31/2016 | 5240 | WEST BRANCH | 52358 | nan | NaN | 1081200.000 | CREAM LIQUEURS | 305 | 73055 | Rumchata | 750 | $12.50 | $18.75 | 6 | $112.50 | 4.500 | 1.190 |
# Replacing whitespace and stripping parens so that I don't have to use brackets & quotes every time I need to reference a column.
rawdata.columns = [c.replace(' ', '_').replace('(', '').replace(')', '') for c in rawdata.columns]
rawdata.columns
Index([u'Date', u'Store_Number', u'City', u'Zip_Code', u'County_Number', u'County', u'Category', u'Category_Name', u'Vendor_Number', u'Item_Number', u'Item_Description', u'Bottle_Volume_ml', u'State_Bottle_Cost', u'State_Bottle_Retail', u'Bottles_Sold', u'Sale_Dollars', u'Volume_Sold_Liters', u'Volume_Sold_Gallons'], dtype='object')
rawdata.shape
(2709552, 18)
rawdata.isnull().sum()
Date                       0
Store_Number               0
City                       0
Zip_Code                   0
County_Number          10913
County                 10913
Category                 779
Category_Name           6109
Vendor_Number              0
Item_Number                0
Item_Description           0
Bottle_Volume_ml           0
State_Bottle_Cost          0
State_Bottle_Retail        0
Bottles_Sold               0
Sale_Dollars               0
Volume_Sold_Liters         0
Volume_Sold_Gallons        0
dtype: int64
There are missing values in County_Number, County, Category, Category_Name. I really only care about County and Category_Name.
rawdata.dtypes
Date                    object
Store_Number             int64
City                    object
Zip_Code                object
County_Number          float64
County                  object
Category               float64
Category_Name           object
Vendor_Number            int64
Item_Number              int64
Item_Description        object
Bottle_Volume_ml         int64
State_Bottle_Cost       object
State_Bottle_Retail     object
Bottles_Sold             int64
Sale_Dollars            object
Volume_Sold_Liters     float64
Volume_Sold_Gallons    float64
dtype: object
Some of these clearly need to be changed--Date should be a datetime, and all cost-oriented variables should be floats.
# Strip out the non-numeric characters ($ signs, commas, parens) in the cost-related variables
rawdata.State_Bottle_Cost.replace(r'[\$,)]', '', regex=True, inplace=True)
rawdata.State_Bottle_Retail.replace(r'[\$,)]', '', regex=True, inplace=True)
rawdata.Sale_Dollars.replace(r'[\$,)]', '', regex=True, inplace=True)
# Now convert them to floats
rawdata.State_Bottle_Cost = rawdata.State_Bottle_Cost.astype(float, copy=True)
rawdata.State_Bottle_Retail = rawdata.State_Bottle_Retail.astype(float, copy=True)
rawdata.Sale_Dollars = rawdata.Sale_Dollars.astype(float, copy=True)
# Converting Date to datetime
rawdata.Date = pd.to_datetime(rawdata.Date)
# Attempting to convert zip codes to integers for now (even though zip codes are categorical,
# a failed conversion can surface data issues)
rawdata.Zip_Code = rawdata.Zip_Code.astype(int, copy=True, raise_on_error=False)
# The integer conversion fails silently because of a malformed zip code value, '712-2'.
# What county is that value associated with?
rawdata[rawdata.Zip_Code == '712-2'].County.value_counts()
Harrison    1725
Name: County, dtype: int64
# Assigning 'Harrison' to all rows where zip code = '712-2', which fills the County NaNs there
rawdata.loc[rawdata.Zip_Code == '712-2', 'County'] = 'Harrison'
What cities are associated with Harrison and 712-2?
rawdata[(rawdata.County == 'Harrison') & (rawdata.Zip_Code == '712-2')].City.value_counts()
DUNLAP    2000
Name: City, dtype: int64
All of them are Dunlap, IA. What zip code is Dunlap?
rawdata[rawdata.City == 'DUNLAP'].Zip_Code.value_counts()
712-2    2000
Name: Zip_Code, dtype: int64
All are 712-2, the error value, so an external source is needed to fix this. I found a listing of all Iowa zip codes with their corresponding counties and cities at zipcodestogo.com and copied the data into a CSV file.
# Manually scraped from website and saved to CSV. Source: http://www.zipcodestogo.com/Iowa/
iowa_zips = pd.read_csv("../../datasets/Project2/Iowa_zips_list.csv", header=0)
# Renaming columns to avoid duplication with column names in the rawdata, for merging purposes.
iowa_zips.rename(columns={'Zip Code':'zip_code_ref', 'County':'county_ref'}, inplace=True)
iowa_zips.dtypes
zip_code_ref     int64
City            object
county_ref      object
dtype: object
# Looking up the zip code for Dunlap. I don't know the text casing in the reference file, so I check both all caps and title case.
print "Dunlap zip code is: " + str(iowa_zips[(iowa_zips.City == 'DUNLAP') | (iowa_zips.City == "Dunlap")].zip_code_ref )
Dunlap zip code is: 651    51529
Name: zip_code_ref, dtype: int64
# Assigning correct zip code to Dunlap in the rawdata table
rawdata.loc[rawdata.City == 'DUNLAP', 'Zip_Code'] = '51529'
# Confirming change
rawdata[rawdata.City == 'DUNLAP'].Zip_Code.value_counts()
51529    2000
Name: Zip_Code, dtype: int64
# Converting all zips to integers
rawdata.Zip_Code = rawdata.Zip_Code.astype(int)
Now to fix the missing county values. The iowa_zips df can also help with that.
# Merging rawdata and iowa_zips on their respective zip code columns.
# Note: this is an inner join, so rows whose zip codes are absent from the reference list are dropped.
dataset = pd.merge(rawdata, iowa_zips[['zip_code_ref', 'county_ref']], left_on='Zip_Code', right_on='zip_code_ref')
dataset.head()
|   | Date | Store_Number | City | Zip_Code | County_Number | County | Category | Category_Name | Vendor_Number | Item_Number | Item_Description | Bottle_Volume_ml | State_Bottle_Cost | State_Bottle_Retail | Bottles_Sold | Sale_Dollars | Volume_Sold_Liters | Volume_Sold_Gallons | zip_code_ref | county_ref |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-31 | 5029 | DAVENPORT | 52806 | 82.000 | Scott | 1022100.000 | TEQUILA | 370 | 87152 | Avion Silver | 375 | 9.990 | 14.990 | 12 | 179.880 | 4.500 | 1.190 | 52806 | Scott |
1 | 2016-03-31 | 5029 | DAVENPORT | 52806 | 82.000 | Scott | 1022100.000 | TEQUILA | 395 | 89197 | Jose Cuervo Especial Reposado Tequila | 1000 | 12.500 | 18.750 | 2 | 37.500 | 2.000 | 0.530 | 52806 | Scott |
2 | 2016-03-31 | 5029 | DAVENPORT | 52806 | 82.000 | Scott | 1022100.000 | TEQUILA | 410 | 88294 | Patron Silver Tequila | 375 | 14.000 | 21.000 | 12 | 252.000 | 4.500 | 1.190 | 52806 | Scott |
3 | 2016-03-31 | 5029 | DAVENPORT | 52806 | 82.000 | Scott | 1052010.000 | IMPORTED GRAPE BRANDIES | 389 | 49185 | Remy Martin Vsop (flask) | 375 | 10.660 | 15.990 | 12 | 191.880 | 4.500 | 1.190 | 52806 | Scott |
4 | 2016-03-31 | 5029 | DAVENPORT | 52806 | 82.000 | Scott | 1022100.000 | TEQUILA | 410 | 88296 | Patron Tequila Silver | 750 | 27.000 | 40.500 | 12 | 486.000 | 9.000 | 2.380 | 52806 | Scott |
# county_ref replaces County.
dataset.drop('County', axis=1, inplace=True)
dataset.rename(columns={'county_ref': 'County'}, inplace=True)
I want to add a few more date-oriented columns to the master dataset, in case they prove useful.
dataset['Quarter'] = dataset.Date.dt.quarter
dataset['Year'] = dataset.Date.dt.year
dataset['Month'] = dataset.Date.dt.month
dataset['Year_Quarter'] = dataset.Year.astype(str) + '-Q' + dataset.Quarter.astype(str)
dataset.describe()
|   | Store_Number | Zip_Code | County_Number | Category | Vendor_Number | Item_Number | Bottle_Volume_ml | State_Bottle_Cost | State_Bottle_Retail | Bottles_Sold | Sale_Dollars | Volume_Sold_Liters | Volume_Sold_Gallons | zip_code_ref | Quarter | Year | Month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2695123.000 | 2695123.000 | 2684210.000 | 2694349.000 | 2695123.000 | 2695123.000 | 2695123.000 | 2695123.000 | 2695123.000 | 2695123.000 | 2695123.000 | 2695123.000 | 2695123.000 | 2695123.000 | 2695123.000 | 2695123.000 | 2695123.000 |
mean | 3584.177 | 51258.332 | 57.253 | 1043845.577 | 256.410 | 46023.501 | 925.280 | 9.823 | 14.750 | 9.841 | 129.300 | 8.937 | 2.361 | 51258.332 | 2.255 | 2015.194 | 5.819 |
std | 947.805 | 987.564 | 27.351 | 50110.180 | 140.919 | 53201.525 | 546.595 | 14.938 | 22.407 | 23.546 | 400.420 | 28.309 | 7.478 | 987.564 | 1.172 | 0.395 | 3.620 |
min | 2106.000 | 50002.000 | 1.000 | 1011100.000 | 10.000 | 146.000 | 0.000 | 0.890 | 1.340 | 1.000 | 1.340 | 0.000 | 0.000 | 50002.000 | 1.000 | 2015.000 | 1.000 |
25% | 2603.000 | 50315.000 | 31.000 | 1012200.000 | 115.000 | 26827.000 | 750.000 | 5.510 | 8.270 | 2.000 | 30.540 | 1.500 | 0.400 | 50315.000 | 1.000 | 2015.000 | 3.000 |
50% | 3715.000 | 51054.000 | 62.000 | 1031200.000 | 260.000 | 38176.000 | 750.000 | 8.090 | 12.300 | 6.000 | 70.560 | 5.250 | 1.390 | 51054.000 | 2.000 | 2015.000 | 5.000 |
75% | 4374.000 | 52310.000 | 77.000 | 1062310.000 | 380.000 | 64601.000 | 1000.000 | 11.960 | 17.940 | 12.000 | 135.000 | 10.500 | 2.770 | 52310.000 | 3.000 | 2015.000 | 9.000 |
max | 9023.000 | 52807.000 | 99.000 | 1701100.000 | 978.000 | 999275.000 | 225000.000 | 6468.000 | 9702.000 | 3960.000 | 106326.000 | 3960.000 | 1046.120 | 52807.000 | 4.000 | 2016.000 | 12.000 |
sns.pairplot(dataset, vars=['State_Bottle_Retail', 'Bottles_Sold', 'Sale_Dollars', 'Volume_Sold_Gallons', 'Quarter'])
There are a lot of fairly obvious linear relationships here, such as gallons of alcohol sold vs. sale dollars (or bottles sold). Interestingly, the plots by quarter suggest there might be some seasonality in Q2 and Q4.
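To put a rough number on that seasonality hunch, a quick aggregation (a sanity check rather than a formal seasonality test) would be:

# Total sale dollars by calendar quarter across the full dataset
dataset.groupby('Quarter').Sale_Dollars.sum()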
sns.distplot(dataset.Bottles_Sold)
# VERY skewed
Because the master dataset is at the transaction level, I need to build some summary tables before I can start analyzing which counties might be the right places to consider.
# Creating a rollup view of store performance on key measures.
store_summary = dataset.pivot_table(index='Store_Number', values=['Bottles_Sold', 'Sale_Dollars','Volume_Sold_Gallons'], aggfunc=np.sum)
sns.distplot(store_summary.Sale_Dollars)
The distribution of Sale_Dollars (i.e., what the liquor stores paid to the state) is extremely right skewed. That said, the bulk of the data does not have any abnormal or jagged shape that might be difficult to interpret.
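Given that degree of right skew, a log-scale view can make the bulk of the distribution easier to inspect; this is a quick supplementary look, assuming all store totals are positive (the transaction-level minimum above is positive, so the store sums are too):

# Same store-level sales distribution on a log10 scale
sns.distplot(np.log10(store_summary.Sale_Dollars))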
# pivot_table already returns a DataFrame; keeping a separate reference for merging
store_summary_df = pd.DataFrame(store_summary)
# Adding the store ID (currently the index) as a column, for the pairplot below
store_summary['Store_ID'] = store_summary.index
sns.pairplot(store_summary)
This is not at all surprising. The relevant pairs (Store_ID should be ignored, since it is an arbitrary identifier) are extremely well correlated.
While it's obvious that the volume of alcohol sold would be linearly related to the dollar amount, it's still important to verify. In particular, this leads me to believe that there is not a noticeable segment of specialty or high-end liquor stores that sell fewer gallons but at a higher price per gallon.
# Stores list--deduping master dataset by store_number to grab county, city, zip code by store.
stores = dataset.copy()
stores.drop_duplicates("Store_Number", inplace=True)
# I only want unique stores for this.
stores.drop(['Item_Number', 'Item_Description', 'Bottle_Volume_ml', 'State_Bottle_Cost', 'State_Bottle_Retail', 'Bottles_Sold', 'Sale_Dollars', 'Volume_Sold_Liters', 'Volume_Sold_Gallons', 'Category_Name', 'Quarter', 'Year', 'Month', 'Year_Quarter', 'Date', 'Category'], axis=1, inplace=True)
# Merging the store-oriented dataframes into one store-performance dataframe.
# store_summary_df is indexed by Store_Number, so the merge joins on that index.
stores_performance = pd.merge(stores, store_summary_df, left_on='Store_Number', right_index=True)
sns.regplot(x='Volume_Sold_Gallons', y='Sale_Dollars', data=stores_performance)
This tells me that there's no differentiation in revenue based on selling expensive vs. cheap/regular-price alcohol. If there were, we would see at least some notable outliers where Sale_Dollars was high relative to the liquid volume of alcohol sold.
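As a complementary check, the distribution of sale dollars per gallon by store can be examined directly; a pronounced right tail here would hint at a high-end segment that the scatter plot might obscure. A quick sketch:

# Price intensity per store: sale dollars per gallon of alcohol sold
dollars_per_gallon = stores_performance.Sale_Dollars / stores_performance.Volume_Sold_Gallons
# Guard against any stores with zero recorded volume
dollars_per_gallon = dollars_per_gallon.replace([np.inf, -np.inf], np.nan).dropna()
sns.distplot(dollars_per_gallon)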
# Adding calculated columns related to store performance
stores_performance['Avg_Sale_per_Bottle'] = stores_performance.Sale_Dollars / stores_performance.Bottles_Sold
stores_performance['Avg_Bottle_Size_Gal'] = stores_performance.Volume_Sold_Gallons / stores_performance.Bottles_Sold
# Summing up performance by county
county_sums = dataset.pivot_table(index='County', values=['Bottles_Sold', 'Sale_Dollars','Volume_Sold_Gallons'], aggfunc=np.sum)
county_sums.rename(columns={'Bottles_Sold': 'Total_Bottles_Sold', 'Sale_Dollars': 'Total_Sales', 'Volume_Sold_Gallons': 'Total_Gallons_Sold'},inplace=True)
# Counting unique number of stores per county
stores_per_county = dataset.pivot_table(index='County', values='Store_Number', aggfunc=pd.Series.nunique)
stores_per_county = pd.DataFrame(stores_per_county)
stores_per_county.rename(columns={'Store_Number': 'Count_of_Stores'}, inplace=True)
# Calculating median values at the county level
county_medians = stores_performance.pivot_table(index='County', values=['Sale_Dollars', 'Bottles_Sold', 'Volume_Sold_Gallons'], aggfunc=np.median)
county_medians.rename(columns={'Sale_Dollars' : 'Median_Sales', 'Bottles_Sold': 'Median_Bottles_Sold', 'Volume_Sold_Gallons': 'Median_Gallons_Sold'},inplace=True)
# Calculating mean values at the county level
county_means = stores_performance.pivot_table(index='County', values=['Sale_Dollars', 'Bottles_Sold', 'Volume_Sold_Gallons'], aggfunc=np.mean)
county_means.rename(columns={'Sale_Dollars' : 'Mean_Sales', 'Bottles_Sold': 'Mean_Bottles_Sold', 'Volume_Sold_Gallons': 'Mean_Gallons_Sold'},inplace=True)
# Merging the pivot tables into a master dataframe of county-level stats
county_master = pd.merge(stores_per_county, county_sums, left_index=True, right_index=True)
county_master = pd.merge(county_master, county_means, left_index=True, right_index=True)
county_master = pd.merge(county_master, county_medians, left_index=True, right_index=True)
# Adding calculated columns at the county level
county_master['Avg_Sales_per_Store'] = county_master.Total_Sales / county_master.Count_of_Stores
county_master["Avg_Price_per_Bottle"] = county_master.Total_Sales / county_master.Total_Bottles_Sold
sns.distplot(county_master.Avg_Sales_per_Store)
This plot is very interesting--it shows the distribution of average sales per store by county (a distribution of means), and it looks very close to normal. This is likely the central limit theorem at work.
Now, it's helpful to explore the variables to identify any correlations that might be of interest for modeling purposes.
feature_correlation_matrix = np.corrcoef(county_master.T)
feature_correlation_matrix
array([[ 1.        ,  0.98999848,  0.98635155,  0.98996605,  0.54600617,  0.53240087,  0.44839641,  0.08582828, -0.01728394, -0.08454499,  0.53240087,  0.09594789],
       [ 0.98999848,  1.        ,  0.99855936,  0.9989693 ,  0.56241395,  0.55026813,  0.46909394,  0.12850948,  0.03074696, -0.03409194,  0.55026813,  0.09697773],
       [ 0.98635155,  0.99855936,  1.        ,  0.99927445,  0.55324953,  0.54572563,  0.46505384,  0.11969717,  0.02837408, -0.03494363,  0.54572563,  0.11261634],
       [ 0.98996605,  0.9989693 ,  0.99927445,  1.        ,  0.56627649,  0.55759603,  0.47840223,  0.12552183,  0.03268836, -0.03191289,  0.55759603,  0.11499437],
       [ 0.54600617,  0.56241395,  0.55324953,  0.56627649,  1.        ,  0.99163517,  0.97652045,  0.63384614,  0.54019713,  0.47256971,  0.99163517,  0.36764752],
       [ 0.53240087,  0.55026813,  0.54572563,  0.55759603,  0.99163517,  1.        ,  0.98689059,  0.62332686,  0.54978658,  0.48353802,  1.        ,  0.46539987],
       [ 0.44839641,  0.46909394,  0.46505384,  0.47840223,  0.97652045,  0.98689059,  1.        ,  0.66066315,  0.60831987,  0.55713816,  0.98689059,  0.48440569],
       [ 0.08582828,  0.12850948,  0.11969717,  0.12552183,  0.63384614,  0.62332686,  0.66066315,  1.        ,  0.94453872,  0.9213529 ,  0.62332686,  0.21173297],
       [-0.01728394,  0.03074696,  0.02837408,  0.03268836,  0.54019713,  0.54978658,  0.60831987,  0.94453872,  1.        ,  0.98066557,  0.54978658,  0.30403622],
       [-0.08454499, -0.03409194, -0.03494363, -0.03191289,  0.47256971,  0.48353802,  0.55713816,  0.9213529 ,  0.98066557,  1.        ,  0.48353802,  0.29387599],
       [ 0.53240087,  0.55026813,  0.54572563,  0.55759603,  0.99163517,  1.        ,  0.98689059,  0.62332686,  0.54978658,  0.48353802,  1.        ,  0.46539987],
       [ 0.09594789,  0.09697773,  0.11261634,  0.11499437,  0.36764752,  0.46539987,  0.48440569,  0.21173297,  0.30403622,  0.29387599,  0.46539987,  1.        ]])
# Labeling the ticks with the column names so the heatmap is readable
sns.heatmap(feature_correlation_matrix, annot=True, xticklabels=county_master.columns, yticklabels=county_master.columns)
sns.pairplot(county_master[['Count_of_Stores', 'Total_Sales', 'Total_Bottles_Sold', 'Total_Gallons_Sold']])