This notebook is part of my Python data science curriculum. It demonstrates some Pandas functions which I thought were not adequately explained in the Jake VanderPlas book.
import numpy as np
import pandas as pd
import scipy.stats as stats
from plotnine.data import diamonds
# This is a standard Python demo dataset. You can also load it from your Python packages
# dir with pd.read_csv if you don't want to import seaborn.
import seaborn as sns
tips = sns.load_dataset('tips')
For no apparent reason, the VanderPlas book doesn't document `pd.read_csv`! This is definitely functionality that you need. If you have the Wes McKinney book available, he has a description that you can read in Chapter 6; otherwise just read the online docs.
pd.read_csv?
pd.read_excel?
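Here's a minimal self-contained sketch of the options I reach for most often (the inline CSV is made up just for illustration):
import io
csv_text = "date,city,temp\n2018-01-01,NYC,?\n2018-01-02,NYC,31\n"
# parse_dates converts a column to datetimes, na_values adds extra
# strings to treat as missing, index_col picks the row index:
pd.read_csv(io.StringIO(csv_text), parse_dates=['date'],
            na_values=['?'], index_col='date')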
Be aware that describe() ignores all non-numeric columns by default, which might not be what you wanted:
tips.describe()
| | total_bill | tip | size |
|---|---|---|---|
count | 244.000000 | 244.000000 | 244.000000 |
mean | 19.785943 | 2.998279 | 2.569672 |
std | 8.902412 | 1.383638 | 0.951100 |
min | 3.070000 | 1.000000 | 1.000000 |
25% | 13.347500 | 2.000000 | 2.000000 |
50% | 17.795000 | 2.900000 | 2.000000 |
75% | 24.127500 | 3.562500 | 3.000000 |
max | 50.810000 | 10.000000 | 6.000000 |
You can force it to include them with include='all', but numeric and categorical columns support different statistics, so the result is ugly:
tips.describe(include='all')
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
count | 244.000000 | 244.000000 | 244 | 244 | 244 | 244 | 244.000000 |
unique | NaN | NaN | 2 | 2 | 4 | 2 | NaN |
top | NaN | NaN | Male | No | Sat | Dinner | NaN |
freq | NaN | NaN | 157 | 151 | 87 | 176 | NaN |
mean | 19.785943 | 2.998279 | NaN | NaN | NaN | NaN | 2.569672 |
std | 8.902412 | 1.383638 | NaN | NaN | NaN | NaN | 0.951100 |
min | 3.070000 | 1.000000 | NaN | NaN | NaN | NaN | 1.000000 |
25% | 13.347500 | 2.000000 | NaN | NaN | NaN | NaN | 2.000000 |
50% | 17.795000 | 2.900000 | NaN | NaN | NaN | NaN | 2.000000 |
75% | 24.127500 | 3.562500 | NaN | NaN | NaN | NaN | 3.000000 |
max | 50.810000 | 10.000000 | NaN | NaN | NaN | NaN | 6.000000 |
You probably want to do the numeric and string types separately. Note that values which look like strings might be `np.object` or `pd.Categorical`.
tips.describe(include=pd.Categorical)
| | sex | smoker | day | time |
|---|---|---|---|---|
count | 244 | 244 | 244 | 244 |
unique | 2 | 2 | 4 | 2 |
top | Male | No | Sat | Dinner |
freq | 157 | 151 | 87 | 176 |
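For the numeric half, you can be explicit with np.number (equivalent to the default behavior here):
# Restrict describe() to the numeric columns only:
tips.describe(include=[np.number])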
`.value_counts()` and `.unique()` work on any type of column, but only one column at a time, not an entire DataFrame.
tips['sex'].value_counts()
Male      157
Female     87
Name: sex, dtype: int64
tips['sex'].unique()
[Female, Male]
Categories (2, object): [Female, Male]
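If you do want counts for several columns at once, one workaround (a sketch, not the only way) is to apply value_counts column by column; the results align on the union of all labels, with NaN where a label doesn't occur in that column:
# Count the values of several categorical columns in one shot:
tips[['sex', 'smoker', 'day', 'time']].apply(pd.Series.value_counts)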
These specialized functions are quite a bit faster than the general approach with `groupby`:
%timeit -n100 diamonds.color.unique()
%timeit -n100 diamonds.color.value_counts()
%timeit -n100 diamonds.groupby('color').count()
905 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.4 ms ± 7.71 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
11.7 ms ± 573 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To understand stacking and unstacking, let's start by creating a MultiIndex on the rows.
tss = tips.groupby(['sex','smoker']).aggregate('mean')
tss
sex | smoker | total_bill | tip | size
---|---|---|---|---
Male | Yes | 22.284500 | 3.051167 | 2.500000
Male | No | 19.791237 | 3.113402 | 2.711340
Female | Yes | 17.977879 | 2.931515 | 2.242424
Female | No | 18.105185 | 2.773519 | 2.592593
`unstack` will then move the INNER level of the index from the ROWS to the COLUMNS:
tss.unstack()
sex | (total_bill, Yes) | (total_bill, No) | (tip, Yes) | (tip, No) | (size, Yes) | (size, No)
---|---|---|---|---|---|---
Male | 22.284500 | 19.791237 | 3.051167 | 3.113402 | 2.500000 | 2.711340
Female | 17.977879 | 18.105185 | 2.931515 | 2.773519 | 2.242424 | 2.592593
`stack` moves the innermost level from the columns to the rows:
tss.unstack().stack()
sex | smoker | total_bill | tip | size
---|---|---|---|---
Male | Yes | 22.284500 | 3.051167 | 2.500000
Male | No | 19.791237 | 3.113402 | 2.711340
Female | Yes | 17.977879 | 2.931515 | 2.242424
Female | No | 18.105185 | 2.773519 | 2.592593
If you remove the last (only remaining) level from either the rows or the columns, you then get a one-dimensional Series instead of a DataFrame:
tss.unstack().unstack()
            smoker  sex
total_bill  Yes     Male      22.284500
                    Female    17.977879
            No      Male      19.791237
                    Female    18.105185
tip         Yes     Male       3.051167
                    Female     2.931515
            No      Male       3.113402
                    Female     2.773519
size        Yes     Male       2.500000
                    Female     2.242424
            No      Male       2.711340
                    Female     2.592593
dtype: float64
Once you have a Series, it can only be `unstack`ed, not `stack`ed. You can use `level=` to control which part of the index gets turned back into columns.
tss.unstack().unstack().unstack()
| | smoker | Male | Female |
|---|---|---|---|
| total_bill | Yes | 22.284500 | 17.977879 |
| total_bill | No | 19.791237 | 18.105185 |
| tip | Yes | 3.051167 | 2.931515 |
| tip | No | 3.113402 | 2.773519 |
| size | Yes | 2.500000 | 2.242424 |
| size | No | 2.711340 | 2.592593 |
tss.unstack().unstack().unstack(level=1)
| | sex | Yes | No |
|---|---|---|---|
| total_bill | Male | 22.284500 | 19.791237 |
| total_bill | Female | 17.977879 | 18.105185 |
| tip | Male | 3.051167 | 3.113402 |
| tip | Female | 2.931515 | 2.773519 |
| size | Male | 2.500000 | 2.711340 |
| size | Female | 2.242424 | 2.592593 |
tss.unstack().unstack().unstack(level=0)
smoker | sex | total_bill | tip | size
---|---|---|---|---
Yes | Male | 22.284500 | 3.051167 | 2.500000
Yes | Female | 17.977879 | 2.931515 | 2.242424
No | Male | 19.791237 | 3.113402 | 2.711340
No | Female | 18.105185 | 2.773519 | 2.592593
`melt` is very similar to `stack`, except that it applies to all columns, not just the innermost level, and it converts them into a normal column instead of an index level.
tips.head(2)
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
tips.head(2).stack()
0  total_bill     16.99
   tip             1.01
   sex           Female
   smoker            No
   day              Sun
   time          Dinner
   size               2
1  total_bill     10.34
   tip             1.66
   sex             Male
   smoker            No
   day              Sun
   time          Dinner
   size               3
dtype: object
# Note that I have to move the row index into a column, which is called "index", to
# preserve the association of the data in the original rows through the melt.
tips_melted=tips.reset_index().head(2).melt(id_vars='index')
tips_melted
| | index | variable | value |
|---|---|---|---|
0 | 0 | total_bill | 16.99 |
1 | 1 | total_bill | 10.34 |
2 | 0 | tip | 1.01 |
3 | 1 | tip | 1.66 |
4 | 0 | sex | Female |
5 | 1 | sex | Male |
6 | 0 | smoker | No |
7 | 1 | smoker | No |
8 | 0 | day | Sun |
9 | 1 | day | Sun |
10 | 0 | time | Dinner |
11 | 1 | time | Dinner |
12 | 0 | size | 2 |
13 | 1 | size | 3 |
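melt can also restrict which columns get melted and rename the two generated columns; a quick sketch, with made-up names measure and amount:
# Melt only the numeric measures, keeping 'index' as the row id:
tips.reset_index().head(2).melt(id_vars='index',
                                value_vars=['total_bill', 'tip'],
                                var_name='measure', value_name='amount')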
There is no corresponding `cast` like R has. Instead, use `.pivot`:
tips_melted.pivot(index='index',columns='variable')
index | (value, day) | (value, sex) | (value, size) | (value, smoker) | (value, time) | (value, tip) | (value, total_bill)
---|---|---|---|---|---|---|---
0 | Sun | Female | 2 | No | Dinner | 1.01 | 16.99
1 | Sun | Male | 3 | No | Dinner | 1.66 | 10.34
The distinction between `.pivot` and `.pivot_table` is that the latter does aggregation:
tips.pivot_table(values='tip',index='sex',columns='time',aggfunc='max')
sex | Dinner | Lunch
---|---|---
Female | 6.5 | 5.17
Male | 10.0 | 6.70
Note the gotcha that the arguments are not in the same order if you specify them positionally:
tips.pivot_table(values=None, index=None, columns=None, ...
tips.pivot(index=None, columns=None, values=None)
The full syntax of arguments to `aggregate()` is fairly complex. For example, you can pass a dict mapping each input column to a list of (name, function) tuples:
tips.groupby(['sex','smoker']).aggregate(
{'tip':[('mean',np.mean),('50pct',np.median)],
'time':[('pct_dinner', lambda x: 100*np.mean(x=='Dinner'))]
})
sex | smoker | (tip, mean) | (tip, 50pct) | (time, pct_dinner)
---|---|---|---|---
Male | Yes | 3.051167 | 3.00 | 78.333333
Male | No | 3.113402 | 2.74 | 79.381443
Female | Yes | 2.931515 | 2.88 | 69.696970
Female | No | 2.773519 | 2.68 | 53.703704
But note that if you specify a tuple for one function, you had better specify it for all, or you get bad column names for the ones you didn't specify:
tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,('50pct',np.median)]})
sex | smoker | (tip, <function mean at 0x7fdb78089ae8>) | (tip, 50pct)
---|---|---|---
Male | Yes | 3.051167 | 3.00
Male | No | 3.113402 | 2.74
Female | Yes | 2.931515 | 2.88
Female | No | 2.773519 | 2.68
Whereas if you don't specify any names, you get sane defaults. I dunno.
tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,np.median]})
sex | smoker | (tip, mean) | (tip, median)
---|---|---|---
Male | Yes | 3.051167 | 3.00
Male | No | 3.113402 | 2.74
Female | Yes | 2.931515 | 2.88
Female | No | 2.773519 | 2.68
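As an aside, newer pandas versions (0.25+, if you have one available) add "named aggregation", which sidesteps the naming awkwardness entirely; the keyword names become the output columns:
# Each keyword is an output column; each value is a (column, function) pair:
tips.groupby(['sex', 'smoker']).agg(mean_tip=('tip', 'mean'),
                                    median_tip=('tip', 'median'))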
When we aggregate multiple columns with multiple functions, we get hierarchical column names:
tm = tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,np.median],
'total_bill':[np.mean,np.median]})
# See that we have a MultiIndex:
tm.columns
MultiIndex(levels=[['tip', 'total_bill'], ['mean', 'median']], labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
If we want to flip which is the first and which is the second level of the index, we can do it with `.swaplevel`:
tm
sex | smoker | (tip, mean) | (tip, median) | (total_bill, mean) | (total_bill, median)
---|---|---|---|---|---
Male | Yes | 3.051167 | 3.00 | 22.284500 | 20.39
Male | No | 3.113402 | 2.74 | 19.791237 | 18.24
Female | Yes | 2.931515 | 2.88 | 17.977879 | 16.27
Female | No | 2.773519 | 2.68 | 18.105185 | 16.69
tm.swaplevel(axis=1)
sex | smoker | (mean, tip) | (median, tip) | (mean, total_bill) | (median, total_bill)
---|---|---|---|---|---
Male | Yes | 3.051167 | 3.00 | 22.284500 | 20.39
Male | No | 3.113402 | 2.74 | 19.791237 | 18.24
Female | Yes | 2.931515 | 2.88 | 17.977879 | 16.27
Female | No | 2.773519 | 2.68 | 18.105185 | 16.69
# If we then want the columns sorted by the new index, we can do that explicitly:
tm.swaplevel(axis=1).sort_index(axis=1)
sex | smoker | (mean, tip) | (mean, total_bill) | (median, tip) | (median, total_bill)
---|---|---|---|---|---
Male | Yes | 3.051167 | 22.284500 | 3.00 | 20.39
Male | No | 3.113402 | 19.791237 | 2.74 | 18.24
Female | Yes | 2.931515 | 17.977879 | 2.88 | 16.27
Female | No | 2.773519 | 18.105185 | 2.68 | 16.69
# The same thing works on the rows:
tm.swaplevel(axis=0).sort_index(axis=0)
smoker | sex | (tip, mean) | (tip, median) | (total_bill, mean) | (total_bill, median)
---|---|---|---|---|---
Yes | Male | 3.051167 | 3.00 | 22.284500 | 20.39
Yes | Female | 2.931515 | 2.88 | 17.977879 | 16.27
No | Male | 3.113402 | 2.74 | 19.791237 | 18.24
No | Female | 2.773519 | 2.68 | 18.105185 | 16.69
If we had more than two levels, we could specify which two we wanted to swap with additional arguments.
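A quick sketch with a made-up three-level column index, just to show the call (the levels to swap can be given by position or by name):
cols = pd.MultiIndex.from_product([['tip', 'total_bill'],
                                   ['mean', 'median'],
                                   ['a', 'b']])
demo = pd.DataFrame(np.zeros((2, 8)), columns=cols)
# Swap the outermost and innermost column levels:
demo.swaplevel(0, 2, axis=1).columns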
Some tools (including Altair) can't use data with hierarchical column names at all, so they have to be flattened. There's no built-in function for doing this, but the following idiom seems standard:
tm.columns = [c[0] + "." + c[1] for c in tm.columns]
tm
sex | smoker | tip.mean | tip.median | total_bill.mean | total_bill.median
---|---|---|---|---|---
Male | Yes | 3.051167 | 3.00 | 22.284500 | 20.39
Male | No | 3.113402 | 2.74 | 19.791237 | 18.24
Female | Yes | 2.931515 | 2.88 | 17.977879 | 16.27
Female | No | 2.773519 | 2.68 | 18.105185 | 16.69
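A slightly more general version of the idiom joins each tuple, so it works for any number of levels and for non-string labels (newer pandas also has a .to_flat_index() method on indexes, which gets you the tuples but still leaves the joining to you):
# Standalone sketch of the general flattening idiom:
mi = pd.MultiIndex.from_tuples([('tip', 'mean'), ('tip', 'median'),
                                ('total_bill', 'mean')])
['.'.join(map(str, c)) for c in mi]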
There is an `.isin` function for quickly checking set membership.
tips['weekend'] = tips.day.isin(['Sat','Sun'])
tips.tail()
| | total_bill | tip | sex | smoker | day | time | size | weekend |
|---|---|---|---|---|---|---|---|---|
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | True |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | True |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | True |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | True |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | False |
Performance of `.isin` is good compared to the alternatives:
%timeit -n100 diamonds.color.isin(['D','E','F'])
%timeit -n100 diamonds.eval('color in ["D","E","F"]')
1.32 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.92 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In order to make it easier to create chains of manipulation functions, there is an `assign()` which creates new columns. It and array-index filtering both take lambdas, which let you refer to an intermediate result that doesn't have a name.
tips.assign(tip_pct=lambda x: 100*x.tip/x.total_bill) \
[lambda x: x.tip_pct > 70]
| | total_bill | tip | sex | smoker | day | time | size | weekend | tip_pct |
|---|---|---|---|---|---|---|---|---|---|
172 | 7.25 | 5.15 | Male | Yes | Sun | Dinner | 2 | True | 71.034483 |
There is also a `rename` for changing column names without needing to assign to the `.columns` or `.index` property of a named variable.
tips.assign(tip_pct=lambda x: 100*x.tip/x.total_bill) \
[lambda x: x.tip_pct > 70].rename({'day':'dayofweek'},axis='columns')
| | total_bill | tip | sex | smoker | dayofweek | time | size | weekend | tip_pct |
|---|---|---|---|---|---|---|---|---|---|
172 | 7.25 | 5.15 | Male | Yes | Sun | Dinner | 2 | True | 71.034483 |
There is a `sort_values()` which goes with `sort_index()`:
tips.set_index('total_bill').sort_index().head()
total_bill | tip | sex | smoker | day | time | size | weekend
---|---|---|---|---|---|---|---
3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 | True |
5.75 | 1.00 | Female | Yes | Fri | Dinner | 2 | False |
7.25 | 1.00 | Female | No | Sat | Dinner | 1 | True |
7.25 | 5.15 | Male | Yes | Sun | Dinner | 2 | True |
7.51 | 2.00 | Male | No | Thur | Lunch | 2 | False |
tips.set_index('total_bill').sort_values('tip').head()
total_bill | tip | sex | smoker | day | time | size | weekend
---|---|---|---|---|---|---|---
3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 | True |
12.60 | 1.00 | Male | Yes | Sat | Dinner | 2 | True |
5.75 | 1.00 | Female | Yes | Fri | Dinner | 2 | False |
7.25 | 1.00 | Female | No | Sat | Dinner | 1 | True |
16.99 | 1.01 | Female | No | Sun | Dinner | 2 | True |
Ranking is in the `.rank()` member function. The usual options for method (min, max, dense, etc.) are available as an argument to `rank()`.
Note that `pct=True` actually gives numbers between 0 and 1, not 0 and 100. Pandas is generally very sloppy about the meaning of "percent".
tips.assign(tip_rank=tips.tip.rank(), tip_pct = tips.tip.rank(pct=True)).head()
| | total_bill | tip | sex | smoker | day | time | size | weekend | tip_rank | tip_pct |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | True | 5.0 | 0.020492 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | True | 33.0 | 0.135246 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | True | 177.0 | 0.725410 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | True | 165.0 | 0.676230 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | True | 185.0 | 0.758197 |
# Note that column b gets promoted from integer to float because NaN cannot be stored in an integer type in Numpy
d = pd.DataFrame([{'a':1, 'b':2}, {'a':3, 'b':np.NaN}, {'a':5, 'b': 6}])
d
| | a | b |
|---|---|---|
0 | 1 | 2.0 |
1 | 3 | NaN |
2 | 5 | 6.0 |
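If the promotion bothers you, recent pandas versions (0.24+, I believe) have a nullable integer dtype spelled 'Int64' with a capital I, which can hold missing values without falling back to float:
# A nullable-integer array: stays integer, missing value displays as <NA>:
pd.array([2, None, 6], dtype='Int64')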
You can replace individual values with `map`, which takes a dict or a lambda, and operates on only one column at a time. Watch out: with a dict, any value not found in the dict becomes NaN (see what happens to the 6.0 below); use `replace` instead if you want unmatched values passed through.
d.assign(b=d.b.map({2:99}))
| | a | b |
|---|---|---|
0 | 1 | 99.0 |
1 | 3 | NaN |
2 | 5 | NaN |
d.assign(b=d.b.map(lambda x: 99 if x==2 else x))
| | a | b |
|---|---|---|
0 | 1 | 99.0 |
1 | 3 | NaN |
2 | 5 | 6.0 |
You can run a map on all columns with `applymap`:
d.applymap(lambda x: 99 if x==2 else x)
| | a | b |
|---|---|---|
0 | 1 | 99.0 |
1 | 3 | NaN |
2 | 5 | 6.0 |
You can fill NAs with `fillna`, which optionally takes column-specific defaults:
d.fillna({'b':-99})
| | a | b |
|---|---|---|
0 | 1 | 2.0 |
1 | 3 | -99.0 |
2 | 5 | 6.0 |
That makes it especially convenient to do something like this:
d.fillna(d.mean())
| | a | b |
|---|---|---|
0 | 1 | 2.0 |
1 | 3 | 4.0 |
2 | 5 | 6.0 |
`.combine_first` is like a version of coalesce which works at a full column or DataFrame level.
# This fills in the value of column a into column b where there is a missing value:
d.assign(b=d.b.combine_first(d.a))
| | a | b |
|---|---|---|
0 | 1 | 2.0 |
1 | 3 | 3.0 |
2 | 5 | 6.0 |
e=pd.DataFrame([{'a':-99, 'b':-98}]*3)
e
| | a | b |
|---|---|---|
0 | -99 | -98 |
1 | -99 | -98 |
2 | -99 | -98 |
# This does the same thing at the full dataframe level instead of a single column:
d.combine_first(e)
| | a | b |
|---|---|---|
0 | 1 | 2.0 |
1 | 3 | -98.0 |
2 | 5 | 6.0 |
Just like `.str` exposes special functions for strings, `.cat` exposes special functions for categorical variables.
Let's make some categorical variables with `cut` (there is a variant, `qcut`, which bins by equal quantiles instead of equal widths; see the sketch after the output below).
dc = diamonds.groupby(pd.cut(diamonds['carat'],np.arange(0,5,.5))).\
aggregate({'price':'mean'})
dc
carat | price
---|---
(0.0, 0.5] | 839.718149 |
(0.5, 1.0] | 2811.342683 |
(1.0, 1.5] | 6513.526534 |
(1.5, 2.0] | 11321.774838 |
(2.0, 2.5] | 14918.141237 |
(2.5, 3.0] | 15472.904255 |
(3.0, 3.5] | 14822.000000 |
(3.5, 4.0] | 15636.500000 |
(4.0, 4.5] | 16576.500000 |
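And here is the sketch promised above: the same aggregation with qcut, where the bin edges come from quantiles of the data, so each bin holds roughly the same number of diamonds rather than spanning the same carat width:
# Four quantile-based carat bins instead of fixed-width ones:
diamonds.groupby(pd.qcut(diamonds['carat'], 4)).aggregate({'price': 'mean'})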
# We can move the categorical index back into a column and see that it has type Categorical
dc=dc.reset_index()
dc.carat.dtype
CategoricalDtype(categories=[(0.0, 0.5], (0.5, 1.0], (1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5]], ordered=True)
# Try tab-completing on dc.carat.cat.
# This gives us the integer values
dc.carat.cat.codes
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
dtype: int8
# This gives us the labels
dc.carat.cat.categories
IntervalIndex([(0.0, 0.5], (0.5, 1.0], (1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5]], closed='right', dtype='interval[float64]')
# This is a metadata flag indicating whether the category order is semantically meaningful
dc.carat.cat.ordered
True
There is a function to reorder categories, but it appears to have some bugs. Look at these examples:
# You can reorder categories:
dc.carat.cat.reorder_categories(dc.carat.cat.categories[[0,2,1,3,4,6,5,7,8]])
0    (0.0, 0.5]
1    (1.0, 1.5]
2    (0.5, 1.0]
3    (0.5, 1.0]
4    (1.5, 2.0]
5    (2.0, 2.5]
6    (3.0, 3.5]
7    (2.5, 3.0]
8    (2.5, 3.0]
Name: carat, dtype: category
Categories (9, interval[float64]): [(0.0, 0.5] < (1.0, 1.5] < (0.5, 1.0] < (1.5, 2.0] ... (3.0, 3.5] < (2.5, 3.0] < (3.5, 4.0] < (4.0, 4.5]]
dc.carat.cat.reorder_categories(dc.carat.cat.categories[::-1])
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
Name: carat, dtype: category
Categories (9, interval[float64]): [(4.0, 4.5] < (3.5, 4.0] < (3.0, 3.5] < (2.5, 3.0] ... (1.5, 2.0] < (1.0, 1.5] < (0.5, 1.0] < (0.0, 0.5]]
I submitted a bug: https://github.com/pandas-dev/pandas/issues/23452
Python libraries are disappointingly sloppy about using the word "percent" correctly. "Cent" is 100 and percents are supposed to be on a scale of 100, but often the word is used very shoddily on the scale of 1 instead. Here's an example:
pd.DataFrame({'x':[1,1,2,1]}).pct_change()
| | x |
|---|---|
0 | NaN |
1 | 0.0 |
2 | 1.0 |
3 | -0.5 |
This should be a 100% increase from 1 to 2, and then a 50% decrease from 2 back to 1, but it's actually 1.0 and -0.5. It's not a percent.
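If you want actual percentages out of pct_change, you have to scale by 100 yourself:
# Multiply the fractional change by 100 to get a true percent:
100 * pd.DataFrame({'x': [1, 1, 2, 1]}).pct_change()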
The same thing is true in `scipy.stats`, for example with the "percent point function", which actually goes from 0 to 1, not 0 to 100:
stats.norm.ppf([0.025,0.975]).round(2)
array([-1.96, 1.96])
Please do not follow these bad examples. The word "percent" does have a meaning.
These are some things I intend to write about but haven't gotten to yet:
- .corr, .cov, .corrwith
- .duplicated, .drop_duplicates
- .sample(replace=), .take
- .get_dummies