import pandas as pd
import numpy as np
df = pd.read_csv('data/Consumo_cerveja.csv',
                 decimal=',',
                 thousands='.',
                 header=0,
                 names=['date', 'median_temp', 'min_temp', 'max_temp', 'precip', 'weekend', 'consumption'],
                 parse_dates=['date'],
                 nrows=365)
Let's finish up by talking about a number of useful functions for analysis before we put it all together!
Often we have some data that we want to replace with a better representation. For example, in our beer drinking data, I would assume that the seasons are important, but since winter actually overlaps 2 years (winter is December, January, February) there's no easy numeric way to divide the year up.
Enter mapping - .map is a great convenience method that lets you apply a function or dictionary to every element of a Series. To create our season mapping, let's set up a dictionary mapping each season label to its month numbers:
season = {
"winter": [12, 1, 2],
"spring": [3, 4, 5],
"summer": [6, 7, 8],
"autumn": [9, 10,11]
}
We then invert the dictionary, so we have a nice representation mapping month number to season
season_map = {i: k
for k, v in season.items()
for i in v
}
season_map
Something we haven't mentioned yet: converting date to a datetime type gives us access to the special .dt namespace, which has datetime-specific functionality. In this example, we use the .month property to get the month of a given date.
df.date.dt.month
Now we are ready to map our season dictionary onto our months
df['season'] = df.date.dt.month.map(season_map)
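As an aside, because .map also accepts a callable, the same result could be produced with a function - a quick sketch, equivalent to the dictionary version above:
# Equivalent: map with a function instead of a dictionary
df.date.dt.month.map(lambda m: season_map[m])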
Let's get some random samples to check that it worked as expected. Use .sample instead of .head - you'll catch data errors from the middle of your dataset!
df.sample(10)
We can also use the .value_counts method to check that everything is as expected.
df.season.value_counts()
Another very common operation is assigning data to bins. For example, we might want to turn a regression problem of predicting consumption into a classification problem of low vs. high consumption.
Let's arbitrarily choose 25,000 as our cutoff for high consumption. Since pandas is based on Numpy, we can often use numpy functions when it suits us.
There is nothing in pandas that does quite what np.where does, so I use it all the time for this type of operation
df['high_consumption'] = np.where(df.consumption < 25000, 0, 1)
df.sample(10)
df.high_consumption.value_counts()
Often our use case is a bit more complicated than a simple higher or lower. pd.cut gives us a lot more flexibility in setting our cutoff points. Let's let pandas decide where to split the data by specifying that we want 2 bins - this divides the range of the data into two equal-width bins.
pd.cut(df.consumption, bins=2).value_counts()
While descriptive, that's not very pretty nor easy to select - pd.cut also supports passing a list of labels, so let's do that.
pd.cut(df.consumption, bins=2, labels=['low', 'high']).value_counts()
Same result, but with nice labels for ease of indexing
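For example, the labels make boolean filtering straightforward - a small sketch, using a hypothetical consumption_bins variable to hold the labelled bins:
# Keep only the rows that fall into the 'high' bin
consumption_bins = pd.cut(df.consumption, bins=2, labels=['low', 'high'])
df[consumption_bins == 'high']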
Of course we can also pass our own bins if we have irregular intervals
pd.cut(df.consumption, bins=[0, 25000, 99999], labels=['low', 'high'])
Sometimes you want your bins to be based on quantiles instead of arbitrary intervals - pd.qcut makes that easy.
df['consumption_group'] = pd.qcut(df.consumption, q=3, labels=['low', 'medium', 'high'])
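Unlike the equal-width bins from pd.cut, quantile bins should each contain roughly the same number of rows - a quick sanity check:
# Each quantile bin should hold roughly a third of the rows
df.consumption_group.value_counts()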
We can of course specify our quantiles explicitly
quartiles = pd.qcut(df.consumption, q=[0, 0.25, 0.75, 1], labels=['low', 'medium', 'high'])
quartiles
If you looked closely, you might have noticed that the dtype of the Series we got back from pd.cut (and pd.qcut) is a Categorical. A Categorical is a datatype introduced by pandas itself, so there is no NumPy equivalent.
Categoricals are a neat way of expressing data that represents a category with multiple levels. In this example, we have 3: low, medium and high. They are also ordered, from low to high.
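Because the categories are ordered, pandas even lets us compare the values against a category label directly - a small sketch using the quartiles Series from above:
# Ordered categories support element-wise comparisons against a label
(quartiles > 'low').head()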
The primary benefit of categoricals is saving space - strings are expensive to store, as we have to keep a Python object for every element of the array, which is very inefficient. Categoricals store codes instead - integers that map to a string label. Just like with dates, where we can access special methods through the .dt namespace, we can use .cat to access categorical-specific methods.
quartiles.cat.categories
quartiles.cat.codes
Now we are storing 365 int8s and 3 strings instead of 365 strings - much nicer!
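We can check the space savings directly - a rough sketch comparing the memory footprint of the categorical against a plain string version:
# Memory usage in bytes: categorical vs. plain strings
quartiles.memory_usage(deep=True)
quartiles.astype(str).memory_usage(deep=True)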
We can also use categories to indicate data that should be there, but is not - this is great for machine learning production, where you have to encode a row of data into the correct number of categories
example = quartiles.cat.add_categories('example')
example.value_counts()
One common task in data science is to one-hot encode categorical columns. As this is also known as "creating dummy variables", pandas has a built-in solution for it - pd.get_dummies. It takes your DataFrame and one-hot encodes all the categorical columns it finds (usually string columns, but also Categoricals). In combination with the Categorical dtype from before, it's easy to encode all possible categories, even though they are not all present in the data.
# There are no values labelled 'example' in the data
pd.get_dummies(example)
You can also pass your full DataFrame to pd.get_dummies - it will generally do the right thing!
pd.get_dummies(df)
You can also specify which columns to encode, as well as passing the drop_first parameter if you're trying to avoid multicollinearity. Note that the high_consumption column is an int dtype - all columns passed to the columns parameter will be one-hot encoded, regardless of datatype.
pd.get_dummies(df, drop_first=True, columns=['season', 'high_consumption'])
pd.get_dummies is a great convenience function that is very handy for modelling work!
Sometimes, you want to compare running differences - what's the change between days or months? Pandas provides utility methods to do that in various forms
# Subtract the previous value
df.consumption.diff()
# Subtract the value from 30 days before
df.consumption.diff(periods=30).dropna()
# Get the percentage change compared to 7 days ago
df.consumption.pct_change(periods=7)
These are convenience methods built around .shift - .shift lets you easily compare a row with another row.
# Shift all columns one step
df.shift(1).head()
Notice how 25,461 was at index 0 previously and is now at index 1
df.consumption.head()
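To make the connection explicit, .diff and .pct_change are just arithmetic against a shifted copy - a small sketch of the equivalence (for data without missing values, in pct_change's case):
# .diff() is subtraction against a copy shifted one row down
(df.consumption - df.consumption.shift(1)).head()
# .pct_change(periods=7) is division by a copy shifted seven rows down
(df.consumption / df.consumption.shift(7) - 1).dropna().head()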
This can be a great way to create a lagged feature set for time series modelling. If I want to predict the next day's consumption, I can simply create a DataFrame of shifted periods.
pd.concat([df.consumption.shift(i).rename(f't_{-i}') for i in range(5)], axis=1).head(10)
A nice special case comes when using a DatetimeIndex like we did before. Then we can use .shift's freq argument and get some nice benefits.
df = df.set_index('date')
We don't get any missing values, as we are simply incrementing the index by one period
df.shift(1).head()
We can set different frequencies to shift by - for example, using 'M' "rounds up" to the nearest month end
df.shift(1, freq='M').head()
Stacking and unstacking are nice tools for manipulating our data into the form that we want. We often use them when we end up with a MultiIndex - an index with multiple levels. The most frequent way to end up with a MultiIndex is to do a groupby on multiple columns.
# Does max temperature differ between seasons and consumption groups?
weekend_season = df.groupby(['season', 'consumption_group']).max_temp.mean()
weekend_season
This Series has two levels in its index - season and consumption_group. How can I access these values?
# This works - 'autumn' is a label in the outermost level
weekend_season.loc['autumn']
# This doesn't - 'low' only lives in the inner level
weekend_season.loc['low']
How would I get only the low consumption group?
I can use tuples to dig into a level and even do slices
# Get low consumption in autumn
weekend_season.loc[('autumn', 'low')]
# Get all values between autumn, low to spring, medium
weekend_season.loc[('autumn', 'low'):('spring', 'medium')]
What if I want all seasons, but only the low consumption group?
I could swap the levels in the index - this is a neat trick and can be useful sometimes
weekend_season.swaplevel()#.loc['low']
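If you go the swaplevel route, it's worth sorting the index afterwards so label-based selection stays efficient - a small sketch:
# Swap the levels, sort the index, then select the low consumption group across all seasons
weekend_season.swaplevel().sort_index().loc['low']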
The best way is to use the pd.IndexSlice convenience slicer - then we can just use our familiar : syntax to indicate "everything from this level", and it works on as many levels as you might have!
# The recommended method!
weekend_season.loc[pd.IndexSlice[:, 'low']]
df.groupby(['season', 'consumption_group', 'weekend']).max_temp.mean()#.loc[pd.IndexSlice[:, 'low', :]]
Another way to manipulate a MultiIndex is with .stack and .unstack. Using these, you can "pivot" levels from the index into the columns and vice versa - it's a common idiom to do an .unstack right after a groupby!
# consumption_group is now our columns
weekend_season.unstack()
# Unstack the outermost level (season) into the columns instead
weekend_season.unstack(level=0)
Stacking is less common, since ending up with a MultiIndex on the columns is rarer, but it does what you would expect - it pivots from the columns into the index.
agg_groupby = df.groupby('season').agg({
    "consumption": ['mean', 'sum', 'std'],
    "max_temp": ['mean', 'sum', 'std'],
})
agg_groupby
# The aggregations are now in the index
agg_groupby.stack()
# Pass the level parameter to stack the other level
agg_groupby.stack(level=0)