Project goals

To determine the affect of weather on MLB attendance and whether the weather and related variables can reasonably predict attendance per game.

https://www.wsj.com/articles/major-league-baseball-sees-a-sharp-drop-in-attendance-1529063246?mod=hp_lead_pos8

Steps:

  1. Find reliable datasets
  2. Clean and merge all relevant data
  3. Exploratory data analysis for outliers, missing values, etc.
  4. Feature selection
  5. Model building and interpretation
  6. Statistical Testing and Inference
  7. Results and conclusion
In [1]:
## Import libraries

import pandas as pd # Data storage and manipulation
import numpy as np # Mathematical operations
import matplotlib.pyplot as plt # Plotting and visualization
import seaborn as sns # Plotting and visualization
import dc_stat_think as dcst #Statistical significance testing

pd.options.display.max_columns = None

sns.set()

pd.options.mode.chained_assignment = None

from sklearn.model_selection import train_test_split # Data splitting for modeling
from sklearn.metrics import mean_absolute_error as mae # Error metric for models

Game data:

  1. All data was taken from Retrosheet.org. Standard disclaimer from their website: The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".
  2. Downloads were done for game logs from 1990-2017, which includes ~160+ total variables.
  3. Each year's data was saved as a .txt file and read in collectively using the code below.
In [2]:
## Read in game datasets

df_all = pd.DataFrame()

for i in range (1990, 2018):
    df = pd.read_csv(r'C:\Users\jbean\Dropbox\Other\Python\baseballdatabank-2017.1\Project\Game Data\GL%s.txt' % i,header=None)
    
    df_all = pd.concat([df_all, df])
In [3]:
## Isolate relevant columns and assign names; "Relevant columns" include numerical data and other categorical variables
## that could feature in a final analysis or model.

df_all = df_all.iloc[:,np.r_[0,2:11, 12, 16:19, 21:27, 30, 32, 38, 49:55, 58, 60, 66, 160]]
cols = ['Date','Day_of_week','Visitor','V_league','V_game_num','Home','H_league','H_game_num','V_score','H_score','Time_of_day','Park','Attendance','Game_time','V_atbats','V_hits','V_doubles','V_triples','V_hr','V_rbi','V_bb','V_strikeouts','V_pitchersused','H_atbats','H_hits','H_doubles','H_triples','H_hr','H_rbi','H_bb','H_strikeouts','H_pitchersused','Data_quality']
df_all.columns = cols

df_all.head()
Out[3]:
Date Day_of_week Visitor V_league V_game_num Home H_league H_game_num V_score H_score Time_of_day Park Attendance Game_time V_atbats V_hits V_doubles V_triples V_hr V_rbi V_bb V_strikeouts V_pitchersused H_atbats H_hits H_doubles H_triples H_hr H_rbi H_bb H_strikeouts H_pitchersused Data_quality
0 19900409 Mon DET AL 1 BOS AL 1 2 5 D BOS07 35199.0 167.0 33 5 1 0 0 2 4 6 3 33 8 4 0 0 4 4 3 3 Y
1 19900409 Mon SEA AL 1 CAL AL 1 7 4 N ANA01 38406.0 176.0 40 15 2 0 1 7 1 2 3 31 5 0 0 1 3 5 6 5 Y
2 19900409 Mon MIL AL 1 CHA AL 1 1 2 D CHI10 40008.0 158.0 30 4 1 0 0 1 1 8 3 29 6 1 1 0 1 2 1 5 Y
3 19900409 Mon BAL AL 1 KCA AL 1 7 6 D KAN06 37671.0 211.0 45 14 1 0 2 6 2 8 5 44 10 1 0 2 6 4 3 4 Y
4 19900409 Mon MIN AL 1 OAK AL 1 3 8 N OAK01 48219.0 170.0 37 10 1 0 1 3 2 3 3 32 11 2 0 2 7 6 4 3 Y

Weather data

Notes:

  1. All data was collected from w2.weather.gov > Climate > City, with monthly averages for every year between 1990 and 2017.
  2. The two Chicago teams were given the same weather data
  3. For other inter-city teams (Yankees/Mets, Angels/Dodgers, Oakland/SF) distinct temperatures were used to capture weather in different neighborhoods that were closer to the team's ballpark. For example, the Central Park weather data was used for the Yankees while Queens data was used for the Mets.
  4. For TOR, data for Buffalo, NY was used for consistency of data source and proximity; likewise, the Plattsburgh, NY data was used for Montreal, which was a bit more of a stretch geographically, but MON has an indoor stadium, partially mitigating the assumption.
  5. The "Stadium_type" columns is 1 for indoor ballparks and 0 for outdoor. Stadiums with a retractable roof (e.g. Seattle) were considered outdoor. Only fully indoor stadiums, such as Montreal's stadium, were classified as indoor. For teams that switched from an indoor to outdoor stadium over the course of the dataset, that was incorporated into the variable (for example, Minnesota switched from an indoor to an outdoor stadium starting in 2010.
In [4]:
## Read in the aggregated weather data

df_weather = pd.read_csv(r'C:\Users\jbean\Dropbox\Other\Python\baseballdatabank-2017.1\Project\city_weather_data.csv')
df_weather['Date'] = pd.to_datetime(df_weather['Date'])

df_weather.set_index('Date',inplace=True)
df_weather.index = df_weather.index.to_period('M').to_timestamp('M') # Match format of Date to monthly grouped game data
df_weather['month'] = df_weather.index.month
df_weather.head()
Out[4]:
Temperature City Stadium_Type month
Date
1990-04-30 47.6 BOS 0 4
1990-05-31 54.9 BOS 0 5
1990-06-30 66.6 BOS 0 6
1990-07-31 73.1 BOS 0 7
1990-08-31 73.3 BOS 0 8
In [5]:
## Convert date to datetime and set as index

df_all['Date'] = pd.to_datetime(df_all['Date'],format='%Y%m%d')

df_all.set_index('Date',inplace=True)
In [6]:
## Set up a column that will sum games when grouped; will be used to calculate per/game metrics after groupby operations
## are performed in later analyses.

df_all['game'] = 1
In [7]:
## FLO = MIA, ANA = CAL; reflect name changes but they are the same team (http://www.retrosheet.org/CurrentNames.csv)

df_all.replace('FLO','MIA',inplace=True)
df_all.replace('CAL','ANA',inplace=True)

Exploratory Data Analysis

In [8]:
## Group the data by year and plot the annual attendance overall (axis in millions)

grouped_year = pd.DataFrame(df_all.groupby(df_all.index.year).sum())

grouped_year['Attendance'] = grouped_year['Attendance'].astype(int)

plt.plot(grouped_year['Attendance'])
plt.xlabel('Year')
plt.ylabel('Annual Attendance')
plt.title('Annual attendance by year for MLB games')
plt.ylim([40000000, 90000000])

plt.show()

Overall attendance is a good macro factor, but attendance per game is a more accurate metric in this case, so we will create that variable.

In [9]:
## Create attendance per game variable and see how that changes per year

grouped_year['attendance_per_game'] = grouped_year['Attendance'] / grouped_year['game']

plt.plot(grouped_year['attendance_per_game'])

plt.xlabel('Year',fontweight='bold')
plt.ylabel('Attendance per game, all',fontweight='bold')
plt.title('Trended attendance per game, MLB',fontweight='bold')
plt.ylim([21000,35000])

plt.show()

What's going on in the mid 90's?

Total attendance shows a pretty steep dropoff in 1994 and 1995, and on a per-game basis, the 1995 data shows a drop of over 6,000 fans per game, on average, for the season. What's the deal?

This is where EDA helps us contextualize the data that we have. A bit of digging shows that MLB had a players strike that started near the end of the 1994 season and continued through April 1995. This not only explains the lower overall attendance, but we can also make some inferences about 1995. For example, the drop in attendance could very well be from a drop in fan trust as a result of the strike, and the way they show distrust is via not going to games.

We also see on a per-game basis, MLB didn't recover to their pre-strike peak until around 2006, after which they gained through 2008, followed by another drop. This time, the more likely cause for the drastic drop in attendance is a more macroeconomic factor, the Great Recession.

In [10]:
## Isolate average attendance per month to see the impact of missing August and September for a season

grouped_month = df_all.groupby(by=df_all.index.month).sum()
grouped_month = grouped_month[grouped_month.index != 3] # Exclude March because it's a smaller sample size
attendance_per_month = grouped_month['Attendance'] / grouped_month['game']

attendance_per_month.plot(kind='bar')
plt.xlabel('Month')
plt.ylabel('Attendance per game')
plt.title('Attendance per game per month')

plt.show()
In [11]:
## Plot the change in time per game to see if game time has increased between 1990 and 2017

grouped_year['time_per_game'] = grouped_year['Game_time'] / grouped_year['game']

plt.plot(grouped_year['time_per_game'])
plt.xlabel('Year', fontweight='bold')
plt.ylabel('Average time per game',fontweight='bold')
plt.title('Trend in time per game, 1990-2017',fontweight='bold')
plt.ylim([155, 200])

plt.show()

In tandem with the change in attendance, the average length of an MLB game is a worthwhile statistic to understand other factors at play. Historically seen as a slower paced game compared to basketball or football, this graphic isn't a good trending sign for MLB. Since 2005, the average time per game has increased by more than 10%, with a largely steady upward trend. The importance of this trend is that part of the drop in attendance may be a result of longer game times. This is something we'll test later.

In order to combat this trend, MLB has implemented a number of smaller rules that should collectively work to speed up games. These include limiting the length of time between innings, number of mound visits for each team, and keeping batters in the box, among others.

In [12]:
## Create aggregated HR and run columns to provide proxy for offensive output; look for other potentially
## correlated variables to attendance per game.

grouped_year['total_hr_game'] = (grouped_year['V_hr'] + grouped_year['H_hr']) / grouped_year['game']
grouped_year['total_run_game'] = (grouped_year['V_score'] + grouped_year['H_score']) / grouped_year['game']
In [13]:
## Look for correlations between attendance and some common variables

corr_columns = ['time_per_game','total_hr_game','total_run_game','attendance_per_game']

sns.pairplot(grouped_year[corr_columns], diag_kind = 'kde',size=3)

plt.show()
In [14]:
## Create a dataframe that groups by month and year for more data points

df_m_y = pd.DataFrame(df_all.groupby(by=[df_all.index.year, df_all.index.month]).sum())
In [15]:
## Add the same variables to the monthly data for more overall data points

df_m_y['total_hr_game'] = (df_m_y['V_hr'] + df_m_y['H_hr']) / df_m_y['game']
df_m_y['total_run_game'] = (df_m_y['V_score'] + df_m_y['H_score']) / df_m_y['game']
df_m_y['time_per_game'] = df_m_y['Game_time'] / df_m_y['game']
df_m_y['attendance_per_game'] = df_m_y['Attendance'] / df_m_y['game']
In [16]:
## Look for correlations between attendance and some common variables

sns.pairplot(df_m_y[corr_columns], diag_kind = 'kde',size=3)

plt.show()

Findings

In looking for factors that could correlate with attendance per game outside of Temperature, I started with a few initial hypotheses. More offensive output (runs and/or homeruns) would be more exciting for fans, thereby generating higher attendance. Likewise, intuititively we would think that shorter time for games may increase attendance, but this also doesn't look to be the case from the plot.

In [17]:
## Add HR and Run variables to the full dataframe

df_all['total_hr_game'] = (df_all['V_hr'] + df_all['H_hr'])
df_all['total_run_game'] = (df_all['V_score'] + df_all['H_score'])
In [18]:
## Resample the dataframe by Month and Year; remove additional months that had all 0's that were not 
## included in DF (December, January, etc.)

df_monthly = df_all.resample('M').sum()
df_monthly.drop(df_monthly[df_monthly.game < 1].index, inplace=True) # Remove offseason months
In [19]:
## Add variables to new monthly dataframe

df_monthly['hr_game'] = df_monthly['total_hr_game'] / df_monthly['game']

df_monthly['total_run_game'] = df_monthly['total_run_game'] / df_monthly['game']

df_monthly['attendance_per_game'] = df_monthly['Attendance'] / df_monthly['game']
In [20]:
## Create a new DF that is regular season only to mitigate potential bias of increased attendance at postseason games

df_monthly_rs = df_monthly[(df_monthly.index.month >= 4) & (df_monthly.index.month <= 9)]
In [21]:
## Group the weather data by month, filter out November, use monthly resampled dataset filtered for April onward
## and add variables; end goal is to plot temperature against attendance per game.

weather_month = df_weather.groupby(df_weather.index).mean()
weather_month = weather_month[weather_month['month'] != 11.0]

attendance_month = df_monthly[df_monthly > '04/01'] # Date filtering
attendance_month['attendance_per_game'] = attendance_month['Attendance'] / attendance_month['game']
In [22]:
## Scatter plot for attendance against average fans per game; aw = attendance + weather

aw_month = weather_month[['Temperature']]
aw_month['attendance_per_game'] = attendance_month['attendance_per_game'].values

sns.regplot(x='Temperature',y='attendance_per_game',data=aw_month)
plt.title('Monthly average temperature against attendance per game')

print('Correlation between Temperature and attendance per game: %.2f' % aw_month.corr().iloc[0,1])
Correlation between Temperature and attendance per game: 0.15

At first glance, there's surprisingly little relation between Temperature and attendance per game. The correlation coefficient is slightly positive, but if we were to remove some of the values in the upper third of the plot, the relationship may drop to essentially 0.

This means that there are likely other factors that are more influential to attendace per game than temperature. These could include the team's record (this season or last), the team in town (for example, playing against the Yankees might draw a bigger crowd than Toronto), or playoff positioning (competing for a playoff spot in September vs. mid-July game in the Texas heat with the team sub-.500).

In [23]:
## Join team and weather data to group each team by month and see how the data distributes

df_weather.sort_index(inplace=True)

df_all['month'] = df_all.index.month

df_all_month = df_all.groupby(by=['month','Home']).sum()
df_all_month = df_all_month.loc[4:11]

df_weather_mt = df_weather.groupby(by=[df_weather.index.month,'City']).mean()
df_weather_mt = df_weather_mt.loc[4:10]

df_total_yr = pd.concat([df_all_month, df_weather_mt[['Temperature','Stadium_Type']]],axis=1)

df_total_yr.head(10)
Out[23]:
V_game_num H_game_num V_score H_score Attendance Game_time V_atbats V_hits V_doubles V_triples V_hr V_rbi V_bb V_strikeouts V_pitchersused H_atbats H_hits H_doubles H_triples H_hr H_rbi H_bb H_strikeouts H_pitchersused game total_hr_game total_run_game Temperature Stadium_Type
month Home
4 ANA 4017 4036 1521 1500 11002699.0 59062.0 11682 3002 606 36 355 1442 1179 2271 1154 11157 2927 541 53 312 1415 966 1962 1231 331 667 3021 64.103571 0.0
ARI 3509 3577 1380 1407 7791983.0 47498.0 9496 2528 577 59 327 1312 863 1964 1064 9034 2440 525 75 339 1350 935 1784 1056 269 666 2787 72.867857 1.0
ATL 3621 3677 1169 1322 9160879.0 51438.0 10307 2477 480 48 230 1098 1018 2196 1110 9967 2575 477 52 297 1261 1046 1993 1135 302 527 2491 62.978571 0.0
BAL 3992 3931 1444 1514 9974960.0 57824.0 11212 2782 501 34 363 1372 1183 2121 1158 10740 2803 539 37 339 1424 1133 1926 1191 324 702 2958 54.939286 0.0
BOS 4213 4198 1415 1676 10325894.0 59548.0 11068 2769 649 44 281 1340 1125 2403 1181 10773 3044 719 59 328 1580 1254 1995 1202 319 609 3091 48.857143 0.0
CHA 4232 4161 1373 1497 6667771.0 55232.0 10836 2648 518 49 286 1301 1131 2165 1125 10355 2680 500 66 363 1424 1162 1862 1138 314 649 2870 49.125000 0.0
CHN 3821 3721 1337 1382 9424494.0 53000.0 10497 2556 510 58 304 1263 1100 2390 1147 10178 2641 512 54 288 1311 1066 2068 1194 305 592 2719 49.125000 0.0
CIN 3568 3596 1561 1456 7722879.0 55876.0 11025 2780 595 62 377 1480 1142 2407 1235 10471 2671 516 53 356 1365 1179 2191 1253 320 733 3017 54.717857 0.0
CLE 3906 3820 1453 1365 6801755.0 51679.0 10184 2680 554 38 302 1371 993 1976 1029 9570 2496 509 42 308 1307 1098 1866 1113 287 610 2818 49.689286 0.0
COL 3796 3773 1782 1788 10273776.0 52452.0 10409 3059 642 76 358 1703 1099 1861 1123 9988 3018 610 115 359 1706 1022 1821 1211 286 717 3570 47.739286 0.0
In [24]:
## Define function that allows us to calculate a per/game statistic to cancel out some of the data noise

def stat_per_game(df, col1, col2='game'):
    
    """
    
    Returns the normalized value of the test statistic on a per/game basis.
    
    """
    
    stat_per_game = df.loc[:,col1] / df.loc[:,col2]
    
    return stat_per_game
In [25]:
## Create another scatter plot to measure Temperature against attendance using grouped dataframe that takes into account
## changes in city/team rather than aggregated monthly values, then look at the linear relationship and correlation.

df_total_yr['attendance_per_game'] = stat_per_game(df_total_yr, 'Attendance')

sns.regplot(x='Temperature', y='attendance_per_game', data=df_total_yr)
plt.title('Attendance per game per month by city Temperature')

corr = df_total_yr[['Temperature','attendance_per_game']].corr()

print('Correlation between attendance per game and Temperature: %.2f' %corr.iloc[0,1])
Correlation between attendance per game and Temperature: 0.14
In [26]:
## Create a dataframe grouped by team and see how average attendance per game changes per team

df_teams = pd.DataFrame(df_all[['Home','Attendance','game']].groupby(by='Home').sum())
df_teams['attendance_per_game'] = stat_per_game(df_teams, 'Attendance','game')

fig = plt.figure(figsize=(14, 8))

ax= sns.barplot(x=df_teams.index, y='attendance_per_game', data=df_teams)

plt.xticks(rotation=45)
plt.title('Attendance per game by team')

plt.show()
In [27]:
## Create a new dataframe that has all of the team and weather information.
## The new dataframe groups and sums the statistics for each month of each year, then joins the weather data for that
## period to the dataframe. This is an uncoupled version of the dataframe made above that now allows us to manipulate,
## plot, and model the data with included temperature data.

new_df_all = pd.DataFrame()

for team in list(set(df_all['Home'])):
    
    teams = df_all[df_all['Home'] == team] # Filter for each specific team
    teams = teams.resample('M').sum() # Resample to monthly data with inputs summed
    teams = teams[teams['game'] > 0] # Eliminate offseason months
    teams = teams[teams.index.month.isin([4,5,6,7,8,9])] # Include only regular season

    weather = df_weather[df_weather['City'] == team] # Filter weather data for team
    weather = weather.resample('M').mean() # Resample by month
    weather = weather[(weather['Temperature'] > 0) & (weather['month'] < 10)] # Eliminate offseason months
    
    weather.drop(weather[weather.index.isin(teams.index) == False].index,inplace=True) #Drops rows where the date is 
    #not in the team index (such as when the strike occurred)
    
    new_teams = pd.concat([teams, weather],axis=1) # Merge datasets
    new_teams['team'] = team # Add team variable for identification
    new_teams = pd.DataFrame(new_teams)

    new_df_all = new_df_all.append(new_teams)
    
new_df_all.head()
Out[27]:
V_game_num H_game_num V_score H_score Attendance Game_time V_atbats V_hits V_doubles V_triples V_hr V_rbi V_bb V_strikeouts V_pitchersused H_atbats H_hits H_doubles H_triples H_hr H_rbi H_bb H_strikeouts H_pitchersused game total_hr_game total_run_game month Temperature Stadium_Type month team
Date
1990-04-30 48 54 33 23 108296.0 949.0 208 62 17 1 3 30 23 30 15 194 51 9 0 3 23 21 28 21 6 6 56 24 53.3 0.0 4.0 PHI
1990-05-31 455 451 81 75 338900.0 2451.0 487 136 28 2 18 81 56 61 43 469 138 23 5 10 73 54 76 47 14 28 156 70 61.3 0.0 5.0 PHI
1990-06-30 899 876 47 57 392520.0 2499.0 494 112 23 3 12 44 59 79 41 495 124 17 1 12 53 54 68 38 15 24 104 90 72.2 0.0 6.0 PHI
1990-07-31 993 990 56 48 390412.0 2001.0 413 102 25 4 13 55 47 71 31 382 86 13 2 9 46 45 82 40 12 22 104 84 78.0 0.0 7.0 PHI
1990-08-31 1593 1573 77 52 386692.0 2515.0 513 143 36 5 10 70 58 83 43 477 123 25 1 7 49 53 67 53 14 17 129 112 75.8 0.0 8.0 PHI
In [28]:
## Add attendance per game variable to new Dataframe

new_df_all['attendance_per_game'] = stat_per_game(new_df_all, 'Attendance','game')
In [29]:
## Scatter plot to do a further dive into the temperature and attendance per game relationship, this time incorporating
## a data point for each month of each year for every team.

fig, ax = plt.subplots(figsize=(13, 10))

ax = sns.regplot(x='Temperature',y='attendance_per_game', data=new_df_all)
plt.xlim([35, 105])
plt.ylim([0, 70000])
plt.title('Attendance per game and Temperature, All Months', fontweight='bold')

print('Correlation coefficient between attendance per game and Temperature: %.2f' % new_df_all[['Temperature', 'attendance_per_game']].corr().iloc[0,1])
Correlation coefficient between attendance per game and Temperature: 0.09
In [30]:
## Create a dictionary with the median temperatures for each month over the timespan

city_median_month_temps = {}

for team in list(set(new_df_all['team'])):
    
    city = new_df_all[new_df_all['team'] == team]
    city_grpd = city.groupby(by=city.index.month).median()

    city_median_month_temps[team] = list(zip(city_grpd.index, round(city_grpd['Temperature'],1)))
        
city_median_month_temps
Out[30]:
{'ANA': [(4, 64.1), (5, 66.4), (6, 70.0), (7, 73.6), (8, 75.0), (9, 73.7)],
 'ARI': [(4, 73.6), (5, 82.9), (6, 92.4), (7, 95.4), (8, 94.3), (9, 89.7)],
 'ATL': [(4, 62.4), (5, 71.0), (6, 77.7), (7, 80.4), (8, 79.5), (9, 73.8)],
 'BAL': [(4, 54.9), (5, 63.7), (6, 73.4), (7, 78.2), (8, 75.9), (9, 69.0)],
 'BOS': [(4, 49.0), (5, 58.5), (6, 68.2), (7, 74.3), (8, 72.8), (9, 65.3)],
 'CHA': [(4, 49.4), (5, 59.7), (6, 69.6), (7, 74.2), (8, 73.2), (9, 64.7)],
 'CHN': [(4, 49.4), (5, 59.7), (6, 69.6), (7, 74.2), (8, 73.2), (9, 64.7)],
 'CIN': [(4, 54.9), (5, 63.4), (6, 72.7), (7, 75.0), (8, 74.7), (9, 67.7)],
 'CLE': [(4, 49.4), (5, 60.2), (6, 69.4), (7, 73.4), (8, 72.2), (9, 65.0)],
 'COL': [(4, 48.0), (5, 57.5), (6, 68.2), (7, 74.5), (8, 71.7), (9, 64.2)],
 'DET': [(4, 49.8), (5, 60.7), (6, 70.6), (7, 73.8), (8, 72.9), (9, 64.4)],
 'HOU': [(4, 69.4), (5, 77.0), (6, 82.2), (7, 84.2), (8, 84.7), (9, 79.8)],
 'KCA': [(4, 54.8), (5, 63.9), (6, 73.9), (7, 77.8), (8, 77.0), (9, 68.8)],
 'LAN': [(4, 64.1), (5, 66.4), (6, 70.0), (7, 73.6), (8, 75.0), (9, 73.7)],
 'MIA': [(4, 76.8), (5, 80.5), (6, 82.7), (7, 84.3), (8, 84.2), (9, 83.2)],
 'MIL': [(4, 45.8), (5, 56.8), (6, 66.5), (7, 72.2), (8, 71.6), (9, 63.4)],
 'MIN': [(4, 47.4), (5, 58.9), (6, 69.5), (7, 73.6), (8, 71.6), (9, 62.9)],
 'MON': [(4, 42.9), (5, 55.2), (6, 64.0), (7, 69.6), (8, 67.6), (9, 59.4)],
 'NYA': [(4, 53.8), (5, 62.9), (6, 71.9), (7, 77.6), (8, 75.6), (9, 68.8)],
 'NYN': [(4, 53.2), (5, 63.4), (6, 72.8), (7, 78.6), (8, 77.2), (9, 70.2)],
 'OAK': [(4, 58.0), (5, 60.6), (6, 63.4), (7, 63.6), (8, 65.4), (9, 66.3)],
 'PHI': [(4, 54.8), (5, 64.4), (6, 74.1), (7, 79.1), (8, 77.7), (9, 70.3)],
 'PIT': [(4, 52.2), (5, 61.1), (6, 69.6), (7, 73.3), (8, 71.6), (9, 64.3)],
 'SDN': [(4, 61.8), (5, 64.2), (6, 67.2), (7, 69.8), (8, 72.1), (9, 70.6)],
 'SEA': [(4, 50.6), (5, 56.4), (6, 60.6), (7, 66.9), (8, 66.4), (9, 61.8)],
 'SFN': [(4, 57.0), (5, 58.0), (6, 60.2), (7, 60.6), (8, 62.1), (9, 63.4)],
 'SLN': [(4, 58.4), (5, 66.8), (6, 76.6), (7, 80.2), (8, 79.6), (9, 70.2)],
 'TBA': [(4, 73.3), (5, 79.0), (6, 82.4), (7, 83.2), (8, 83.6), (9, 82.2)],
 'TEX': [(4, 67.2), (5, 74.4), (6, 82.8), (7, 87.1), (8, 86.6), (9, 80.5)],
 'TOR': [(4, 46.1), (5, 57.6), (6, 67.0), (7, 71.4), (8, 70.4), (9, 62.8)],
 'WAS': [(4, 58.8), (5, 66.7), (6, 76.5), (7, 81.2), (8, 79.8), (9, 72.9)]}

Variable creation

Goal: Create a dummy variable to determine when a row observation is below the median temperature for that city, which will allow us to determine whether lower temperature games show meaningfully lower attendance.

  1. Match the team name to the name in the dictionary
  2. Find the month of the observation
  3. Match the observation month to the dictionary month
  4. Find the Temperature column for that row
  5. Compare the Temperature in the row for the median temperature for that row
  6. Create a (1) variable for anywhere that the temperature is below the Median temperature or a (0) otherwise
In [31]:
copied_df = new_df_all.copy()
copied_df.reset_index(inplace=True)

copied_df['month'] = new_df_all.index.month
copied_df['month'] = copied_df['month'].replace({4:0, 5:1, 6:2, 7:3, 8:4, 9:5}) # Aligns month values with dictionary index values

copied_df = copied_df[['attendance_per_game', 'team','month','Temperature']]

all_nums = []

for val in range(len(copied_df)):
         
    team = copied_df.iloc[val,-4]
    month = copied_df.iloc[val, -3]
    temp = copied_df.iloc[val, -1]
    median_temp = city_median_month_temps.get(team)[month][1]
    
    num = [1 if temp < median_temp else 0]
    all_nums.append(num)

new_df_all['below_median'] = all_nums
new_df_all['below_median'] = [x[0] for x in new_df_all['below_median']] # Extracts value from list of lists

print(new_df_all[['team','Temperature','below_median']].head(10))
           team  Temperature  below_median
Date                                      
1990-04-30  PHI         53.3             1
1990-05-31  PHI         61.3             1
1990-06-30  PHI         72.2             1
1990-07-31  PHI         78.0             1
1990-08-31  PHI         75.8             1
1990-09-30  PHI         68.0             1
1991-04-30  PHI         55.5             0
1991-05-31  PHI         70.8             0
1991-06-30  PHI         75.7             0
1991-07-31  PHI         79.0             1
In [32]:
## Plot attendance per game against temperature with a hue set to whether the temperature was above or 
## below the median for each team; 1 = Below median, 0 = Above median

sns.lmplot(x="Temperature", y="attendance_per_game", data=new_df_all, fit_reg=True, hue='below_median', legend=True, size=10)
plt.title('Attendance per game against temperature with below-median hue',fontweight='bold')

plt.show()

Observations

It doesn't appear that there is a meaningful difference in attendance for games that are above or below the median temperature for a given team or city overall. The next step will be to isolate the months that will be most affected by temperature swings (April, May) since a temperature below the median in summer months is usually still pleasant to watch a game.

In [33]:
new_df_spring = new_df_all[new_df_all.index.month.isin([4,5])] # Date index is in April or May

sns.lmplot(x="Temperature", y="attendance_per_game", data=new_df_spring, fit_reg=True, hue='below_median', legend=True, size=13)
plt.title('Attendance per game against Temperature, spring', fontweight='bold')

plt.show()