%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 10)
df = pd.read_csv('data/Consumo_cerveja.csv',
                 decimal=',',
                 thousands='.',
                 header=0,
                 names=['date', 'median_temp', 'min_temp', 'max_temp', 'precip', 'weekend', 'consumption'],
                 parse_dates=['date'],
                 nrows=365)
holidays = pd.read_html('https://en.wikipedia.org/wiki/Public_holidays_in_Brazil', header=0)[0]
df = df.set_index('date')
Now that we've gone through a number of techniques for working with data in Pandas, let's try applying some of them to answer questions about our data. Remember, we are interested in learning about the factors that affect beer consumption in Sao Paulo!
So what kind of questions can we ask of our data?
We would hypothesize that when it's warmer outside, people go out for a drink more often. However, as we know from here in Denmark, Danes often grab a beer inside a nice, warm bar when it's cold and dark!
From collective personal experience, we would assume that consumption is higher during the weekend.
Are people drinking more beer in autumn vs winter?
Does the weekend start on Friday? Or maybe people need a beer on Monday to help them start the week?
Does more rain lead to less beer drinking? Or do people stay inside the bars longer?
It feels like the weather is worse when we have time off, no?
We can always start with a simple correlation matrix - this is easy to get in Pandas!
df.corr()
It's a bit hard to parse just by looking - we could use seaborn to draw a heatmap, but we are sticking as much as we can to Pandas. Let's use the new .style API!
# Note: the gradient is normalized per column (or per row), not over the whole table
df.corr().style.background_gradient()
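For comparison, the seaborn heatmap mentioned above would look roughly like this - a minimal sketch using the sns import from the setup cell:
# Sketch: the seaborn alternative we decided to skip
sns.heatmap(df.corr(), annot=True, cmap='coolwarm');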
Let's pick out the consumption correlations to better see what we are interested in
df.corr()[['consumption']].style.bar(align='mid', color=['red', 'green'])
We see some evidence to back up our hypotheses - temperature and weekend appear correlated with consumption, while precipitation is slightly negatively correlated.
Before we dive in a bit more, I just want to mention one technique that's very useful - I've been using it throughout, but now we'll give it a name: method chaining.
One neat feature of the pandas API is that many operations return a modified DataFrame/Series. This means that we can continue to apply operations in a chain, making it simple and readable to transform data. My favorite example of the difference this makes is stolen from Tom Augspurger (who stole it from Jeff Allen). Imagine we are telling the story of Jack and Jill - traditionally it would look like this:
tumble_after(
    broke(
        fell_down(
            fetch(went_up(jack_jill, "hill"), "water"),
            "jack"),
        "crown"),
    "jill"
)
What's the story happening here? You might be a clever software engineer and rewrite it to something like this:
on_hill = went_up(jack_jill, 'hill')
with_water = fetch(on_hill, 'water')
fallen = fell_down(with_water, 'jack')
broken = broke(fallen, 'crown')
after = tumble_after(broken, 'jill')
That looks better, but now I had to come up with 5 sensible variable names that are essentially thrown out - and it's not even that clear what the story is!
Pandas lets us do something like this:
(jack_jill.went_up('hill')
          .fetch('water')
          .fell_down('jack')
          .broke('crown')
          .tumble_after('jill')
)
That's the best of both worlds! I can clearly read the story while avoiding the need for "throwaway" variable names.
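As a concrete pandas flavour of the same idea, here is a minimal sketch on the beer data we already loaded (the derived temp_range column is purely illustrative):
# Sketch: a small chain on the actual data - derive a column, filter, aggregate
(df.assign(temp_range=lambda x: x.max_temp - x.min_temp)  # hypothetical derived column
   .query('weekend == 1')                                 # keep only weekend days
   .consumption
   .mean())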
Now back to the data!
In the EDA phase we just want to plot everything - we are not looking for production-grade plots, we want quick-and-dirty plots - and pandas has you covered! Let's start by plotting each column over time.
df.plot(subplots=True, title='Daily trend');
That's a bit hard to parse - there are definitely some seasonal trends in the precipitation, but it's hard to get a clear picture of what's happening. The weekend column is also not very useful in this context.
Let's use some of the techniques we looked at to capture a higher-level trend
numeric = df.drop('weekend', axis='columns')
rolling = numeric.rolling(7)
axes = rolling.mean().plot(subplots=True, title='Rolling 7 Days Trend');
The correlation between temperature and consumption is now more obvious - we also see that the big increase in precipitation around March does not significantly impact consumption, but in mid-September, consumption is very low. Let's try to smooth it out even more by looking at monthly data.
resampler = numeric.resample('M')
resampler.mean().plot(subplots=True, title='Mean Monthly Trend');
There is now a clear trend between temperature and consumption, but we lost some detail. For example, the inverse relationship with precipitation in mid-September is not as clear as in the 7-day rolling graph. Always look at different aggregation levels!
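One handy way to do that is to overlay two aggregation levels of the same column - a minimal sketch using the numeric frame from above:
# Sketch: overlay two aggregation levels to see what each one hides
ax = numeric.consumption.rolling(7).mean().plot(label='7-day rolling mean')
numeric.consumption.resample('W').mean().plot(ax=ax, style='--', label='weekly mean')
ax.legend();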
Let's start trying to answer our questions about the data:
Looking at our trend-over-time graphs above, we would conclude that yes, temperature clearly affects consumption (it would be difficult to argue that consumption affects temperature!)
Let's examine the relationship a bit more using a scatterplot. Scatterplots are a great way to visualize numerical relationships - we lose the time dimension, but are able to see the correlation better.
df.plot.scatter(x='median_temp', y='consumption', title='Median Temp vs Consumption');
df.plot.scatter(x='min_temp', y='consumption', title='Min Temp vs Consumption');
These graphs are a bit hard to compare as they have different x-axes. Just for illustration, let's plot min_temp and max_temp on the same graph.
ax = df.plot(kind='scatter', x='min_temp', y='consumption', marker='^', label='min_temp');
df.plot(kind='scatter', x='max_temp', y='consumption', c='red', marker='+', ax=ax, label='max_temp');
Min_temp and max_temp are clearly different - surprising! :-) It would be clearer if we plotted each in its own graph but kept the same y-axis scale. We could adjust the axes manually, but it's better to have that done for us! That does require some matplotlib code though - knowing a bit of matplotlib is very helpful for getting plots exactly the way you want them!
fig, axes = plt.subplots(nrows=3, sharey=True)
for label, ax in zip(['min_temp', 'max_temp', 'median_temp'], axes):
    df.plot.scatter(x=label, y='consumption', ax=ax, label=label, title=f"Consumption vs {label}")
fig.tight_layout()
Looks a lot like our line graphs before! Just looking at the graphs, max_temp seems a bit more clustered around an imaginary trend line while min_temp is fairly spread out. Let's note that max_temp does seem to be a more linear predictor than the other two.
We can confirm this by looking at the mean consumption per 5-degree bin of max_temp.
df.groupby(pd.cut(df.max_temp, np.arange(10, 45, 5))).consumption.mean().plot.bar(title='Mean Consumption per 5 degrees Max Temp');
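Before drawing conclusions, it's worth checking how many days fall into each bin - thin bins make for noisy means. A quick sketch:
# Sketch: number of days per 5-degree max_temp bin
df.groupby(pd.cut(df.max_temp, np.arange(10, 45, 5))).consumption.count()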
I think we can conclusively say temperature matters!
Does the weekend matter? This is a pretty easy question to answer - we simply compare the mean consumption on weekends vs non-weekends!
df.groupby('weekend').consumption.mean().plot(kind='bar');
That seems pretty conclusive! We can also examine the distribution of values using a boxplot
df.boxplot(column='consumption', by='weekend')
We must be careful with averages! While there is a clear difference, there are a fair number of weekdays where consumption is at weekend levels.
weekend_mean = df.query('weekend == 1').consumption.mean()
num_above_mean_weekdays = len(df[(df.weekend == 0) & (df.consumption > weekend_mean)])
print(f"There are {num_above_mean_weekdays} weekdays above the weekend mean")
print(f"That's {num_above_mean_weekdays / len(df.weekend == 0):.1%}!")
A fair share of our weekdays would look like a weekend if we only trusted the average! We can conclude that the weekend is definitely related to consumption - but always examine the distributions!
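One simple way to examine those distributions is to overlay histograms for weekdays and weekends - a minimal sketch using the plotting setup from above:
# Sketch: overlay the two consumption distributions instead of comparing only means
df[df.weekend == 0].consumption.plot.hist(alpha=0.5, label='weekday')
df[df.weekend == 1].consumption.plot.hist(alpha=0.5, label='weekend')
plt.legend();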
Seasons can have an impact in many different ways - holidays, number of tourists and of course the weather! Unfortunately we only have one year of data, so it will be hard to estimate these effects in a robust manner.
First let's use our mapper from previously to add in seasons. The easy mistake here is to forget that Brazil is in the Southern Hemisphere, so seasons are inverted!
season = {
    "summer": [12, 1, 2],
    "autumn": [3, 4, 5],
    "winter": [6, 7, 8],
    "spring": [9, 10, 11]
}
season_map = {i: k
              for k, v in season.items()
              for i in v}
df['season'] = df.index.month.map(season_map)
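A quick sanity check on the mapping - each season should cover roughly a quarter of the year (a minimal sketch):
# Sketch: days per season - roughly 90 each if the mapping is right
df.season.value_counts()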
df.groupby('season').consumption.mean().sort_values().plot.bar(title='Mean Consumption per Season')
Spring and summer do seem to have a higher beer consumption. Let's check the boxplot.
df.boxplot(column=['consumption'], by='season')
Spring is fairly spread out here, while summer seems to be the best differentiator - note the overlaps, though!
We get a similar picture by looking at the month, but notice how the summer season "wraps around" the turn of the year.
df['month'] = df.index.month
df.boxplot(column='consumption', by='month')
There does seem to be a significant jump going into August - could be worth noting!
Do holidays matter? Let's add a holiday flag using the Wikipedia table we loaded earlier.
df['holidays'] = np.where(df.index.isin(holidays.Date), 1, 0)
df.groupby('holidays').consumption.mean().plot.bar()
df.groupby(['season', 'holidays']).consumption.mean().unstack().plot.bar()
That's how easy it is to incorporate some extra data in!
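Since we matched the Wikipedia Date values against our DatetimeIndex with isin(), it's worth a quick sanity check that the flag actually picked up some days (a sketch):
# Sketch: how many days ended up flagged as holidays?
df.holidays.value_counts()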
Does the party start on Friday? Are Thursday bars popular?
data = df.groupby(df.index.weekday).consumption.mean()
data.plot(kind='bar');
data.sort_values().to_frame().style.format('{:,.0f}')
The weekend days are clearly the most popular, with no particular standouts during the week.
df['weekday'] = df.index.weekday
df.boxplot(column='consumption', by='weekday')
Maybe there is a difference after all - let's break it down by month as well.
pivoted = df.pivot_table(values='consumption', index='weekday', columns='month')
pivoted
That's a lot of numbers! Let's use our style trick again to see a bit better
pivoted.style.background_gradient(cmap='Blues')
There are some days that stick out, such as Thursdays in September, or Tuesdays in January - so the combination can matter!
Let's have a look at precipitation to round out our analysis - our hypothesis would be that rainy weather and beer consumption do not match!
df.plot(kind='scatter', x='precip', y='consumption', title='Precipitation vs Consumption');
We don't see a nice linear trend, but this is also affected by many days with close to zero rain - lucky Brazilians!
We can try to get a bit fancy - let's log-transform precip and see if that makes a difference. Here we are using method chaining to get a nice story out of our transforms. If we wrap the whole chain in parentheses, we can align each operation nicely on a new line!
(df.query('precip > 0.1')
   .assign(precip=lambda x: np.log1p(x.precip))
   .plot(kind='scatter', x='precip', y='consumption', title='Precipitation vs Consumption - Rainy Days'))
There does seem to be a slight downward trend if we squint a bit, but it's definitely not as clear-cut as with temperature.
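To put a number on the squint, here is a quick sketch computing the correlation on rainy days only, after the same log transform:
# Sketch: precip vs consumption correlation on rainy days only
(df.query('precip > 0.1')
   .assign(precip=lambda x: np.log1p(x.precip))
   [['precip', 'consumption']]
   .corr())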
Is it true that the weather is always bad when we have time off? It can certainly feel that way!
Let's see how true that is (at least for Brazil!) in a one-liner!
(df.groupby('weekend')
   [['median_temp', 'min_temp', 'max_temp', 'precip']]
   .mean()
   .pct_change()
   .loc[1, :]
   .plot
   .barh());
After all that exploration - let's build a regression model to see how good it is!
Let's start from a clean read and incorporate the various transformations we've looked at.
df = pd.read_csv('data/Consumo_cerveja.csv',
                 decimal=',',
                 thousands='.',
                 header=0,
                 names=['date', 'median_temp', 'min_temp', 'max_temp', 'precip', 'weekend', 'consumption'],
                 parse_dates=['date'],
                 nrows=365)
def add_lags(df, col, lags=5):
    # Create lagged copies of `col` (t-1 ... t-lags) and join them onto the frame
    lagged = pd.concat([df[col].shift(i).rename(f"{col} t_{-i}") for i in range(1, lags + 1)],
                       axis=1)
    return df.join(lagged, how='inner')
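To see what add_lags actually produces, here is a quick sketch on a tiny toy frame (the toy data is purely illustrative):
# Sketch: add_lags on a toy frame - adds "x t_-1" and "x t_-2" columns with NaNs at the start
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
add_lags(toy, 'x', lags=2)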
data = (df.set_index('date')
          .assign(mean_temp=lambda x: (x.max_temp + x.min_temp) / 2)
          .assign(month=lambda x: x.index.month)
          .assign(day=lambda x: x.index.day)
          .assign(weekday=lambda x: x.index.dayofweek)
          .assign(precip=lambda x: np.log1p(x.precip))
          .assign(seasons=lambda x: x.index.month.map(season_map))
          .assign(holidays=lambda x: np.where(x.index.isin(holidays.Date), 1, 0))
          .pipe(add_lags, 'consumption', lags=10)
          .pipe(add_lags, 'max_temp', lags=10)
          .dropna()
          .reset_index(drop=True)
       )
y = data['consumption']
X = pd.get_dummies(data.drop('consumption', axis=1))
data.head()
data.tail()
Let's start with a standard Linear Regression
import statsmodels.api as sm
regressor_X = sm.add_constant(X)
model = sm.OLS(y, regressor_X, hasconst=True)
result = model.fit()
result.summary()
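If you just want a quick look at which features the linear model leans on most, one option is to sort the fitted coefficients by absolute size - a minimal sketch (the key argument needs pandas >= 1.1):
# Sketch: the ten coefficients with the largest absolute size
result.params.sort_values(key=abs, ascending=False).head(10)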
And of course we have to have a Random Forest - this is PyData after all!
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=42)
clf = RandomForestRegressor(n_estimators=1000)
clf.fit(train_x, train_y)
clf.score(test_x, test_y)
pd.Series(index=test_x.columns, data=clf.feature_importances_).sort_values().plot.barh()
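To get a feel for how good that score actually is, here is a quick sketch comparing the test-set predictions against the actual values:
# Sketch: predicted vs actual consumption on the held-out test set
(pd.DataFrame({'actual': test_y.values, 'predicted': clf.predict(test_x)})
   .plot.scatter(x='actual', y='predicted', title='Random Forest: Predicted vs Actual'));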