import pandas as pd
import numpy as np
df = pd.read_csv('data/Consumo_cerveja.csv',
                 decimal=',',
                 thousands='.',
                 header=0,
                 names=['date', 'median_temp', 'min_temp', 'max_temp', 'precip', 'weekend', 'consumption'],
                 parse_dates=['date'],
                 nrows=365)
Let's finish up by talking about a number of useful functions for analysis before we put it all together!
Often we have some data that we want to replace with a better representation. For example, in our beer drinking data, I would assume that the seasons are important, but since winter actually overlaps 2 years (winter is December, January, February) there's no easy numeric way to divide the year up.
Enter mapping - .map is a great convenience method that lets you apply a function or dictionary to every element of a Series. To create our season mapping, let's set up a dictionary mapping each season label to its month numbers:
season = {
"winter": [12, 1, 2],
"spring": [3, 4, 5],
"summer": [6, 7, 8],
"autumn": [9, 10,11]
}
We then invert the dictionary, so we have a nice representation mapping month number to season
season_map = {i: k
for k, v in season.items()
for i in v
}
season_map
Something we haven't mentioned yet: converting date to a datetime type gives us access to the special .dt namespace, which has datetime-specific functionality. In this example, we use the .month property to get the month of a given date.
df.date.dt.month
Now we are ready to map our season dictionary onto our months
df['season'] = df.date.dt.month.map(season_map)
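As an aside, because .map also accepts a callable, the same result could be produced with a function - a quick sketch, equivalent to the dictionary version above:
# Equivalent: map with a function instead of a dictionary
df.date.dt.month.map(lambda m: season_map[m])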
Let's get some random samples to check that it worked as expected. Use .sample instead of .head - you'll catch data errors from the middle of your dataset!
df.sample(10)
We can also use the .value_counts method to check that everything is as expected.
df.season.value_counts()
Another very common operation is assigning data to bins. For example, we might want to turn a regression problem of predicting consumption into a classification problem of low vs. high consumption.
Let's arbitrarily choose 25,000 as our cutoff for high consumption. Since pandas is based on Numpy, we can often use numpy functions when it suits us.
There is nothing in pandas that does quite what np.where does, so I use it all the time for this type of operation
df['high_consumption'] = np.where(df.consumption < 25000, 0, 1)
df.sample(10)
df.high_consumption.value_counts()
Often our use case is a bit more complicated than a simple higher or lower. pd.cut gives us a lot more flexibility in setting our cutoff points. Let's let pandas decide where to split the data by specifying that we want 2 bins - this divides the range of the data into two equal-width bins.
pd.cut(df.consumption, bins=2).value_counts()
While descriptive, that's not very pretty nor easy to select - pd.cut also supports passing a list of labels, so let's do that.
pd.cut(df.consumption, bins=2, labels=['low', 'high']).value_counts()
Same result, but with nice labels for ease of indexing
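For example, the labels make boolean filtering straightforward - a small sketch, using a hypothetical consumption_bins variable to hold the labelled bins:
# Keep only the rows that fall into the 'high' bin
consumption_bins = pd.cut(df.consumption, bins=2, labels=['low', 'high'])
df[consumption_bins == 'high']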
Of course we can also pass our own bins if we have irregular intervals
pd.cut(df.consumption, bins=[0, 25000, 99999], labels=['low', 'high'])
Sometimes you want your bins to be based on quantiles instead of arbitrary intervals - pd.qcut makes that easy.
df['consumption_group'] = pd.qcut(df.consumption, q=3, labels=['low', 'medium', 'high'])
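Unlike the equal-width bins from pd.cut, quantile bins should each contain roughly the same number of rows - a quick sanity check:
# Each quantile bin should hold roughly a third of the rows
df.consumption_group.value_counts()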
We can of course specify our quantiles explicitly
quartiles = pd.qcut(df.consumption, q=[0, 0.25, 0.75, 1], labels=['low', 'medium', 'high'])
quartiles
If you looked closely, you might have noticed that the dtype of the Series we got back from pd.cut (and pd.qcut) is a Categorical. A Categorical is a datatype introduced by pandas itself, so there is no NumPy equivalent.
Categoricals are a neat way of expressing data that represents a category with multiple levels. In this example, we have 3: low, medium and high. They are also ordered, from low to high.
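Because the categories are ordered, pandas even lets us compare the values against a category label directly - a small sketch using the quartiles Series from above:
# Ordered categories support element-wise comparisons against a label
(quartiles > 'low').head()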
The primary benefit of categoricals is saving space - strings are expensive to store, as we have to keep a Python object for every element of the array, which is very inefficient. Categoricals store codes instead - integers that map to a string label. Just like with dates, where we can access special methods through the .dt namespace, we can use .cat to access categorical-specific methods.
quartiles.cat.categories
quartiles.cat.codes
Now we are storing 365 int8s and 3 strings instead of 365 strings - much nicer!
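We can check the space savings directly - a rough sketch comparing the memory footprint of the categorical against a plain string version:
# Memory usage in bytes: categorical vs. plain strings
quartiles.memory_usage(deep=True)
quartiles.astype(str).memory_usage(deep=True)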
We can also use categories to indicate data that should be there, but is not - this is great for machine learning production, where you have to encode a row of data into the correct number of categories
example = quartiles.cat.add_categories('example')
example.value_counts()
One common task in data science is to one-hot encode categorical columns. As this is also known as "creating dummy variables", pandas has a built-in solution for it - pd.get_dummies. It takes your DataFrame and one-hot encodes all the categorical columns it finds (usually string columns, but also Categoricals). In combination with the Categorical dtype from before, it's easy to encode all possible categories, even though they are not all present in the data.
# There are no values labelled 'example' in the data
pd.get_dummies(example)
You can also pass your full DataFrame to pd.get_dummies - it will generally do the right thing!
pd.get_dummies(df)
You can also specify which columns to encode, as well as passing the drop_first parameter if you're trying to avoid multicollinearity. Note that the high_consumption column is an int dtype - all columns passed to the columns parameter will be one-hot encoded, regardless of datatype.
pd.get_dummies(df, drop_first=True, columns=['season', 'high_consumption'])
pd.get_dummies is a great convenience function that is very handy for modelling work!
Sometimes, you want to compare running differences - what's the change between days or months? Pandas provides utility methods to do that in various forms
# Subtract the previous value
df.consumption.diff()
# Subtract the value from 30 days before
df.consumption.diff(periods=30).dropna()
# Get the percentage change compared to 7 days ago
df.consumption.pct_change(periods=7)
These are convenience methods built around .shift - .shift lets you easily compare a row with another row.
# Shift all columns one step
df.shift(1).head()
Notice how 25,461 was at index 0 previously and is now at index 1
df.consumption.head()
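To make the connection explicit, .diff and .pct_change are just arithmetic against a shifted copy - a small sketch of the equivalence (for data without missing values, in pct_change's case):
# .diff() is subtraction against a copy shifted one row down
(df.consumption - df.consumption.shift(1)).head()
# .pct_change(periods=7) is division by a copy shifted seven rows down
(df.consumption / df.consumption.shift(7) - 1).dropna().head()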
This can be a great way to create a lagged feature set for time series modelling. If I want to predict the next day's consumption, I can simply create a DataFrame of shifted periods.
pd.concat([df.consumption.shift(i).rename(f't_{-i}') for i in range(5)], axis=1).head(10)
A nice special case comes when using a DatetimeIndex like we did before. Then we can use .shift's freq argument and get some nice benefits.
df = df.set_index('date')
We don't get any missing values, as we are simply incrementing the index by one period
df.shift(1).head()
We can set different frequencies to shift by - for example, using 'M' "rounds up" to the nearest month end
df.shift(1, freq='M').head()
Stacking and unstacking are nice tools for manipulating our data into the form that we want. We often use them when we end up with a MultiIndex - an index with multiple levels. The most frequent way to end up with a MultiIndex is to do a groupby on multiple columns.
# Does max temperature differ between seasons and consumption groups?
weekend_season = df.groupby(['season', 'consumption_group']).max_temp.mean()
weekend_season
This Series has two levels in its index - season and consumption_group. How can I access these values?
# This works - 'autumn' is a label in the outermost level
weekend_season.loc['autumn']
# This doesn't - 'low' only lives in the inner level
weekend_season.loc['low']
How would I get only the low consumption group?
I can use tuples to dig into a level and even do slices
# Get low consumption in autumn
weekend_season.loc[('autumn', 'low')]
# Get all values between autumn, low to spring, medium
weekend_season.loc[('autumn', 'low'):('spring', 'medium')]
What if I want all seasons, but only the low consumption group?
I could swap the levels in the index - this is a neat trick and can be useful sometimes
weekend_season.swaplevel()#.loc['low']
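If you go the swaplevel route, it's worth sorting the index afterwards so label-based selection stays efficient - a small sketch:
# Swap the levels, sort the index, then select the low consumption group across all seasons
weekend_season.swaplevel().sort_index().loc['low']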
The best way is to use the pd.IndexSlice convenience slicer - then we can just use our familiar : syntax to indicate "everything from this level", and it works on as many levels as you might have!
# The recommended method!
weekend_season.loc[pd.IndexSlice[:, 'low']]
df.groupby(['season', 'consumption_group', 'weekend']).max_temp.mean()#.loc[pd.IndexSlice[:, 'low', :]]
Another way to manipulate a MultiIndex is with .stack and .unstack. Using these, you can "pivot" levels from the index into the columns and vice versa - it's a common idiom to do an .unstack right after a groupby!
# consumption_group is now our columns
weekend_season.unstack()
# Unstack the outermost level (season) into the columns instead
weekend_season.unstack(level=0)
Stacking is less common, since ending up with a MultiIndex on the columns is rarer, but it does what you would expect - it pivots from the columns into the index.
agg_groupby = df.groupby('season').agg({
    "consumption": ['mean', 'sum', 'std'],
    "max_temp": ['mean', 'sum', 'std'],
})
agg_groupby
# The aggregations are now in the index
agg_groupby.stack()
# Pass the level parameter to stack the other level
agg_groupby.stack(level=0)