This notebook is part of my Python data science curriculum. It demonstrates some Pandas functions which I thought were not adequately explained in the Jake VanderPlas book.
import numpy as np
import pandas as pd
import scipy.stats as stats
from plotnine.data import diamonds
# This is a standard Python demo dataset. You can also load it from your Python packages
# dir with pd.read_csv if you don't want to import seaborn.
import seaborn as sns
tips = sns.load_dataset('tips')
For no apparent reason, the VanderPlas book doesn't document `pd.read_csv`! This is definitely functionality that you need. If you have the Wes McKinney book available, he has a description that you can read in Chapter 6; otherwise just read the online docs.
pd.read_csv?
pd.read_excel?
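Here's a minimal self-contained sketch of the options I reach for most often (the inline CSV is made up just for illustration):
import io
csv_text = "date,city,temp\n2018-01-01,NYC,?\n2018-01-02,NYC,31\n"
# parse_dates converts a column to datetimes, na_values adds extra
# strings to treat as missing, index_col picks the row index:
pd.read_csv(io.StringIO(csv_text), parse_dates=['date'],
            na_values=['?'], index_col='date')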
Be aware that describe() ignores all non-numeric columns by default, which might not be what you wanted:
tips.describe()
| | total_bill | tip | size |
|---|---|---|---|
count | 244.000000 | 244.000000 | 244.000000 |
mean | 19.785943 | 2.998279 | 2.569672 |
std | 8.902412 | 1.383638 | 0.951100 |
min | 3.070000 | 1.000000 | 1.000000 |
25% | 13.347500 | 2.000000 | 2.000000 |
50% | 17.795000 | 2.900000 | 2.000000 |
75% | 24.127500 | 3.562500 | 3.000000 |
max | 50.810000 | 10.000000 | 6.000000 |
You can force it to include them with include='all', but numeric and categorical columns support different statistics, so the result is ugly:
tips.describe(include='all')
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
count | 244.000000 | 244.000000 | 244 | 244 | 244 | 244 | 244.000000 |
unique | NaN | NaN | 2 | 2 | 4 | 2 | NaN |
top | NaN | NaN | Male | No | Sat | Dinner | NaN |
freq | NaN | NaN | 157 | 151 | 87 | 176 | NaN |
mean | 19.785943 | 2.998279 | NaN | NaN | NaN | NaN | 2.569672 |
std | 8.902412 | 1.383638 | NaN | NaN | NaN | NaN | 0.951100 |
min | 3.070000 | 1.000000 | NaN | NaN | NaN | NaN | 1.000000 |
25% | 13.347500 | 2.000000 | NaN | NaN | NaN | NaN | 2.000000 |
50% | 17.795000 | 2.900000 | NaN | NaN | NaN | NaN | 2.000000 |
75% | 24.127500 | 3.562500 | NaN | NaN | NaN | NaN | 3.000000 |
max | 50.810000 | 10.000000 | NaN | NaN | NaN | NaN | 6.000000 |
You probably want to do the numeric and string types separately. Note that values which look like strings might be `np.object` or `pd.Categorical`.
tips.describe(include=pd.Categorical)
| | sex | smoker | day | time |
|---|---|---|---|---|
count | 244 | 244 | 244 | 244 |
unique | 2 | 2 | 4 | 2 |
top | Male | No | Sat | Dinner |
freq | 157 | 151 | 87 | 176 |
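For the numeric half, you can be explicit with np.number (equivalent to the default behavior here):
# Restrict describe() to the numeric columns only:
tips.describe(include=[np.number])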
`.value_counts()` and `.unique()` work on any type of column, but only one column at a time, not an entire DataFrame.
tips['sex'].value_counts()
Male      157
Female     87
Name: sex, dtype: int64
tips['sex'].unique()
[Female, Male]
Categories (2, object): [Female, Male]
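If you do want counts for several columns at once, one workaround (a sketch, not the only way) is to apply value_counts column by column; the results align on the union of all labels, with NaN where a label doesn't occur in that column:
# Count the values of several categorical columns in one shot:
tips[['sex', 'smoker', 'day', 'time']].apply(pd.Series.value_counts)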
These specialized functions are quite a bit faster than the general approach with `groupby`:
%timeit -n100 diamonds.color.unique()
%timeit -n100 diamonds.color.value_counts()
%timeit -n100 diamonds.groupby('color').count()
905 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.4 ms ± 7.71 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
11.7 ms ± 573 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To understand stacking and unstacking, let's start by creating a MultiIndex on the rows.
tss = tips.groupby(['sex','smoker']).aggregate('mean')
tss
sex | smoker | total_bill | tip | size
---|---|---|---|---
Male | Yes | 22.284500 | 3.051167 | 2.500000
Male | No | 19.791237 | 3.113402 | 2.711340
Female | Yes | 17.977879 | 2.931515 | 2.242424
Female | No | 18.105185 | 2.773519 | 2.592593
`unstack` will then move the INNER level of the index from the ROWS to the COLUMNS:
tss.unstack()
sex | (total_bill, Yes) | (total_bill, No) | (tip, Yes) | (tip, No) | (size, Yes) | (size, No)
---|---|---|---|---|---|---
Male | 22.284500 | 19.791237 | 3.051167 | 3.113402 | 2.500000 | 2.711340
Female | 17.977879 | 18.105185 | 2.931515 | 2.773519 | 2.242424 | 2.592593
`stack` moves the innermost level from the columns to the rows:
tss.unstack().stack()
sex | smoker | total_bill | tip | size
---|---|---|---|---
Male | Yes | 22.284500 | 3.051167 | 2.500000
Male | No | 19.791237 | 3.113402 | 2.711340
Female | Yes | 17.977879 | 2.931515 | 2.242424
Female | No | 18.105185 | 2.773519 | 2.592593
If you remove the last (only remaining) level from either the rows or the columns, you then get a one-dimensional Series instead of a DataFrame:
tss.unstack().unstack()
            smoker  sex
total_bill  Yes     Male      22.284500
                    Female    17.977879
            No      Male      19.791237
                    Female    18.105185
tip         Yes     Male       3.051167
                    Female     2.931515
            No      Male       3.113402
                    Female     2.773519
size        Yes     Male       2.500000
                    Female     2.242424
            No      Male       2.711340
                    Female     2.592593
dtype: float64
Once you have a Series, it can only be `unstack`ed, not `stack`ed. You can use `level=` to control which part of the index gets turned back into columns.
tss.unstack().unstack().unstack()
| | smoker | Male | Female |
|---|---|---|---|
| total_bill | Yes | 22.284500 | 17.977879 |
| total_bill | No | 19.791237 | 18.105185 |
| tip | Yes | 3.051167 | 2.931515 |
| tip | No | 3.113402 | 2.773519 |
| size | Yes | 2.500000 | 2.242424 |
| size | No | 2.711340 | 2.592593 |
tss.unstack().unstack().unstack(level=1)
| | sex | Yes | No |
|---|---|---|---|
| total_bill | Male | 22.284500 | 19.791237 |
| total_bill | Female | 17.977879 | 18.105185 |
| tip | Male | 3.051167 | 3.113402 |
| tip | Female | 2.931515 | 2.773519 |
| size | Male | 2.500000 | 2.711340 |
| size | Female | 2.242424 | 2.592593 |
tss.unstack().unstack().unstack(level=0)
smoker | sex | total_bill | tip | size
---|---|---|---|---
Yes | Male | 22.284500 | 3.051167 | 2.500000
Yes | Female | 17.977879 | 2.931515 | 2.242424
No | Male | 19.791237 | 3.113402 | 2.711340
No | Female | 18.105185 | 2.773519 | 2.592593
`melt` is very similar to `stack`, except that it applies to all columns, not just the innermost level, and it converts them into a normal column instead of an index level.
tips.head(2)
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
tips.head(2).stack()
0  total_bill     16.99
   tip             1.01
   sex           Female
   smoker            No
   day              Sun
   time          Dinner
   size               2
1  total_bill     10.34
   tip             1.66
   sex             Male
   smoker            No
   day              Sun
   time          Dinner
   size               3
dtype: object
# Note that I have to move the row index into a column, which is called "index", to
# preserve the association of the data in the original rows through the melt.
tips_melted=tips.reset_index().head(2).melt(id_vars='index')
tips_melted
| | index | variable | value |
|---|---|---|---|
0 | 0 | total_bill | 16.99 |
1 | 1 | total_bill | 10.34 |
2 | 0 | tip | 1.01 |
3 | 1 | tip | 1.66 |
4 | 0 | sex | Female |
5 | 1 | sex | Male |
6 | 0 | smoker | No |
7 | 1 | smoker | No |
8 | 0 | day | Sun |
9 | 1 | day | Sun |
10 | 0 | time | Dinner |
11 | 1 | time | Dinner |
12 | 0 | size | 2 |
13 | 1 | size | 3 |
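melt can also restrict which columns get melted and rename the two generated columns; a quick sketch, with made-up names measure and amount:
# Melt only the numeric measures, keeping 'index' as the row id:
tips.reset_index().head(2).melt(id_vars='index',
                                value_vars=['total_bill', 'tip'],
                                var_name='measure', value_name='amount')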
There is no corresponding `cast` like R has. Instead, use `.pivot`:
tips_melted.pivot(index='index',columns='variable')
index | (value, day) | (value, sex) | (value, size) | (value, smoker) | (value, time) | (value, tip) | (value, total_bill)
---|---|---|---|---|---|---|---
0 | Sun | Female | 2 | No | Dinner | 1.01 | 16.99
1 | Sun | Male | 3 | No | Dinner | 1.66 | 10.34
The distinction between `.pivot` and `.pivot_table` is that the latter does aggregation:
tips.pivot_table(values='tip',index='sex',columns='time',aggfunc='max')
sex | Dinner | Lunch
---|---|---
Female | 6.5 | 5.17
Male | 10.0 | 6.70
Note the gotcha that the arguments are not in the same order if you specify them positionally:
tips.pivot_table(values=None, index=None, columns=None, ...
tips.pivot(index=None, columns=None, values=None)
The full syntax of arguments to `aggregate()` is fairly complex. For example, you can pass a dict mapping each input column to a list of (name, function) tuples:
tips.groupby(['sex','smoker']).aggregate(
{'tip':[('mean',np.mean),('50pct',np.median)],
'time':[('pct_dinner', lambda x: 100*np.mean(x=='Dinner'))]
})
sex | smoker | (tip, mean) | (tip, 50pct) | (time, pct_dinner)
---|---|---|---|---
Male | Yes | 3.051167 | 3.00 | 78.333333
Male | No | 3.113402 | 2.74 | 79.381443
Female | Yes | 2.931515 | 2.88 | 69.696970
Female | No | 2.773519 | 2.68 | 53.703704
But note that if you specify a tuple for one function, you had better specify it for all, or you get bad column names for the ones you didn't specify:
tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,('50pct',np.median)]})
sex | smoker | (tip, <function mean at 0x7fdb78089ae8>) | (tip, 50pct)
---|---|---|---
Male | Yes | 3.051167 | 3.00
Male | No | 3.113402 | 2.74
Female | Yes | 2.931515 | 2.88
Female | No | 2.773519 | 2.68
Whereas if you don't specify any names, you get sane defaults. I dunno.
tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,np.median]})
sex | smoker | (tip, mean) | (tip, median)
---|---|---|---
Male | Yes | 3.051167 | 3.00
Male | No | 3.113402 | 2.74
Female | Yes | 2.931515 | 2.88
Female | No | 2.773519 | 2.68
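As an aside, newer pandas versions (0.25+, if you have one available) add "named aggregation", which sidesteps the naming awkwardness entirely; the keyword names become the output columns:
# Each keyword is an output column; each value is a (column, function) pair:
tips.groupby(['sex', 'smoker']).agg(mean_tip=('tip', 'mean'),
                                    median_tip=('tip', 'median'))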
When we aggregate multiple columns with multiple functions, we get hierarchical column names:
tm = tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,np.median],
'total_bill':[np.mean,np.median]})
# See that we have a MultiIndex:
tm.columns
MultiIndex(levels=[['tip', 'total_bill'], ['mean', 'median']], labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
If we want to flip which is the first and which is the second level of the index, we can do it with `.swaplevel`:
tm
sex | smoker | (tip, mean) | (tip, median) | (total_bill, mean) | (total_bill, median)
---|---|---|---|---|---
Male | Yes | 3.051167 | 3.00 | 22.284500 | 20.39
Male | No | 3.113402 | 2.74 | 19.791237 | 18.24
Female | Yes | 2.931515 | 2.88 | 17.977879 | 16.27
Female | No | 2.773519 | 2.68 | 18.105185 | 16.69
tm.swaplevel(axis=1)
sex | smoker | (mean, tip) | (median, tip) | (mean, total_bill) | (median, total_bill)
---|---|---|---|---|---
Male | Yes | 3.051167 | 3.00 | 22.284500 | 20.39
Male | No | 3.113402 | 2.74 | 19.791237 | 18.24
Female | Yes | 2.931515 | 2.88 | 17.977879 | 16.27
Female | No | 2.773519 | 2.68 | 18.105185 | 16.69
# If we then want the columns sorted by the new index, we can do that explicitly:
tm.swaplevel(axis=1).sort_index(axis=1)
sex | smoker | (mean, tip) | (mean, total_bill) | (median, tip) | (median, total_bill)
---|---|---|---|---|---
Male | Yes | 3.051167 | 22.284500 | 3.00 | 20.39
Male | No | 3.113402 | 19.791237 | 2.74 | 18.24
Female | Yes | 2.931515 | 17.977879 | 2.88 | 16.27
Female | No | 2.773519 | 18.105185 | 2.68 | 16.69
# The same thing works on the rows:
tm.swaplevel(axis=0).sort_index(axis=0)
smoker | sex | (tip, mean) | (tip, median) | (total_bill, mean) | (total_bill, median)
---|---|---|---|---|---
Yes | Male | 3.051167 | 3.00 | 22.284500 | 20.39
Yes | Female | 2.931515 | 2.88 | 17.977879 | 16.27
No | Male | 3.113402 | 2.74 | 19.791237 | 18.24
No | Female | 2.773519 | 2.68 | 18.105185 | 16.69
If we had more than two levels, we could specify which two we wanted to swap with additional arguments.
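A quick sketch with a made-up three-level column index, just to show the call (the levels to swap can be given by position or by name):
cols = pd.MultiIndex.from_product([['tip', 'total_bill'],
                                   ['mean', 'median'],
                                   ['a', 'b']])
demo = pd.DataFrame(np.zeros((2, 8)), columns=cols)
# Swap the outermost and innermost column levels:
demo.swaplevel(0, 2, axis=1).columns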
Some tools (including Altair) can't use data with hierarchical column names at all, so they have to be flattened. There's no built-in function for doing this, but the following idiom seems standard:
tm.columns = [c[0] + "." + c[1] for c in tm.columns]
tm
sex | smoker | tip.mean | tip.median | total_bill.mean | total_bill.median
---|---|---|---|---|---
Male | Yes | 3.051167 | 3.00 | 22.284500 | 20.39
Male | No | 3.113402 | 2.74 | 19.791237 | 18.24
Female | Yes | 2.931515 | 2.88 | 17.977879 | 16.27
Female | No | 2.773519 | 2.68 | 18.105185 | 16.69
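A slightly more general version of the idiom joins each tuple, so it works for any number of levels and for non-string labels (newer pandas also has a .to_flat_index() method on indexes, which gets you the tuples but still leaves the joining to you):
# Standalone sketch of the general flattening idiom:
mi = pd.MultiIndex.from_tuples([('tip', 'mean'), ('tip', 'median'),
                                ('total_bill', 'mean')])
['.'.join(map(str, c)) for c in mi]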
There is an `.isin` function for quickly checking set membership.
tips['weekend'] = tips.day.isin(['Sat','Sun'])
tips.tail()
| | total_bill | tip | sex | smoker | day | time | size | weekend |
|---|---|---|---|---|---|---|---|---|
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | True |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | True |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | True |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | True |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | False |
Performance of `.isin` is good compared to the alternatives:
%timeit -n100 diamonds.color.isin(['D','E','F'])
%timeit -n100 diamonds.eval('color in ["D","E","F"]')
1.32 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.92 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In order to make it easier to create chains of manipulation functions, there is an `assign()` which creates new columns. It and array-index filtering both take lambdas, which let you refer to an intermediate result that doesn't have a name.
tips.assign(tip_pct=lambda x: 100*x.tip/x.total_bill) \
[lambda x: x.tip_pct > 70]
| | total_bill | tip | sex | smoker | day | time | size | weekend | tip_pct |
|---|---|---|---|---|---|---|---|---|---|
172 | 7.25 | 5.15 | Male | Yes | Sun | Dinner | 2 | True | 71.034483 |
There is also a `rename` for changing column names without needing to assign to the `.columns` or `.index` property of a named variable.
tips.assign(tip_pct=lambda x: 100*x.tip/x.total_bill) \
[lambda x: x.tip_pct > 70].rename({'day':'dayofweek'},axis='columns')
| | total_bill | tip | sex | smoker | dayofweek | time | size | weekend | tip_pct |
|---|---|---|---|---|---|---|---|---|---|
172 | 7.25 | 5.15 | Male | Yes | Sun | Dinner | 2 | True | 71.034483 |
There is a `sort_values()` which goes with `sort_index()`:
tips.set_index('total_bill').sort_index().head()
total_bill | tip | sex | smoker | day | time | size | weekend
---|---|---|---|---|---|---|---
3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 | True |
5.75 | 1.00 | Female | Yes | Fri | Dinner | 2 | False |
7.25 | 1.00 | Female | No | Sat | Dinner | 1 | True |
7.25 | 5.15 | Male | Yes | Sun | Dinner | 2 | True |
7.51 | 2.00 | Male | No | Thur | Lunch | 2 | False |
tips.set_index('total_bill').sort_values('tip').head()
total_bill | tip | sex | smoker | day | time | size | weekend
---|---|---|---|---|---|---|---
3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 | True |
12.60 | 1.00 | Male | Yes | Sat | Dinner | 2 | True |
5.75 | 1.00 | Female | Yes | Fri | Dinner | 2 | False |
7.25 | 1.00 | Female | No | Sat | Dinner | 1 | True |
16.99 | 1.01 | Female | No | Sun | Dinner | 2 | True |
Ranking is in the `.rank()` member function. The usual options for method (min, max, dense, etc.) are available as an argument to `rank()`.
Note that `pct=True` actually gives numbers between 0 and 1, not 0 and 100. Pandas is generally very sloppy about the meaning of "percent".
tips.assign(tip_rank=tips.tip.rank(), tip_pct = tips.tip.rank(pct=True)).head()
| | total_bill | tip | sex | smoker | day | time | size | weekend | tip_rank | tip_pct |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | True | 5.0 | 0.020492 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | True | 33.0 | 0.135246 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | True | 177.0 | 0.725410 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | True | 165.0 | 0.676230 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | True | 185.0 | 0.758197 |
# Note that column b gets promoted from integer to float because NaN cannot be stored in an integer type in Numpy
d = pd.DataFrame([{'a':1, 'b':2}, {'a':3, 'b':np.NaN}, {'a':5, 'b': 6}])
d
| | a | b |
|---|---|---|
0 | 1 | 2.0 |
1 | 3 | NaN |
2 | 5 | 6.0 |
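If the promotion bothers you, recent pandas versions (0.24+, I believe) have a nullable integer dtype spelled 'Int64' with a capital I, which can hold missing values without falling back to float:
# A nullable-integer array: stays integer, missing value displays as <NA>:
pd.array([2, None, 6], dtype='Int64')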
You can replace individual values with `map`, which takes a dict or a lambda, and operates on only one column at a time. Watch out: with a dict, any value not found in the dict becomes NaN (see what happens to the 6.0 below); use `replace` instead if you want unmatched values passed through.
d.assign(b=d.b.map({2:99}))
| | a | b |
|---|---|---|
0 | 1 | 99.0 |
1 | 3 | NaN |
2 | 5 | NaN |
d.assign(b=d.b.map(lambda x: 99 if x==2 else x))
| | a | b |
|---|---|---|
0 | 1 | 99.0 |
1 | 3 | NaN |
2 | 5 | 6.0 |
You can run a map on all columns with `applymap`:
d.applymap(lambda x: 99 if x==2 else x)
| | a | b |
|---|---|---|
0 | 1 | 99.0 |
1 | 3 | NaN |
2 | 5 | 6.0 |
You can fill NAs with `fillna`, which optionally takes column-specific defaults:
d.fillna({'b':-99})
| | a | b |
|---|---|---|
0 | 1 | 2.0 |
1 | 3 | -99.0 |
2 | 5 | 6.0 |
That makes it especially convenient to do something like this:
d.fillna(d.mean())
| | a | b |
|---|---|---|
0 | 1 | 2.0 |
1 | 3 | 4.0 |
2 | 5 | 6.0 |
`.combine_first` is like a version of coalesce which works at a full column or DataFrame level.
# This fills in the value of column a into column b where there is a missing value:
d.assign(b=d.b.combine_first(d.a))
| | a | b |
|---|---|---|
0 | 1 | 2.0 |
1 | 3 | 3.0 |
2 | 5 | 6.0 |
e=pd.DataFrame([{'a':-99, 'b':-98}]*3)
e
| | a | b |
|---|---|---|
0 | -99 | -98 |
1 | -99 | -98 |
2 | -99 | -98 |
# This does the same thing at the full dataframe level instead of a single column:
d.combine_first(e)
| | a | b |
|---|---|---|
0 | 1 | 2.0 |
1 | 3 | -98.0 |
2 | 5 | 6.0 |
Just like `.str` exposes special functions for strings, `.cat` exposes special functions for categorical variables.
Let's make some categorical variables with `cut` (there is a variant, `qcut`, which bins by equal quantiles instead of equal widths; see the sketch after the output below).
dc = diamonds.groupby(pd.cut(diamonds['carat'],np.arange(0,5,.5))).\
aggregate({'price':'mean'})
dc
carat | price
---|---
(0.0, 0.5] | 839.718149 |
(0.5, 1.0] | 2811.342683 |
(1.0, 1.5] | 6513.526534 |
(1.5, 2.0] | 11321.774838 |
(2.0, 2.5] | 14918.141237 |
(2.5, 3.0] | 15472.904255 |
(3.0, 3.5] | 14822.000000 |
(3.5, 4.0] | 15636.500000 |
(4.0, 4.5] | 16576.500000 |
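And here is the sketch promised above: the same aggregation with qcut, where the bin edges come from quantiles of the data, so each bin holds roughly the same number of diamonds rather than spanning the same carat width:
# Four quantile-based carat bins instead of fixed-width ones:
diamonds.groupby(pd.qcut(diamonds['carat'], 4)).aggregate({'price': 'mean'})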
# We can move the categorical index back into a column and see that it has type Categorical
dc=dc.reset_index()
dc.carat.dtype
CategoricalDtype(categories=[(0.0, 0.5], (0.5, 1.0], (1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5]], ordered=True)
# Try tab-completing on dc.carat.cat.
# This gives us the integer values
dc.carat.cat.codes
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
dtype: int8
# This gives us the labels
dc.carat.cat.categories
IntervalIndex([(0.0, 0.5], (0.5, 1.0], (1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5]], closed='right', dtype='interval[float64]')
# This is a metadata flag indicating whether the category order is semantically meaningful
dc.carat.cat.ordered
True
There is a function to reorder categories, but it appears to have some bugs. Look at these examples:
# You can reorder categories:
dc.carat.cat.reorder_categories(dc.carat.cat.categories[[0,2,1,3,4,6,5,7,8]])
0    (0.0, 0.5]
1    (1.0, 1.5]
2    (0.5, 1.0]
3    (0.5, 1.0]
4    (1.5, 2.0]
5    (2.0, 2.5]
6    (3.0, 3.5]
7    (2.5, 3.0]
8    (2.5, 3.0]
Name: carat, dtype: category
Categories (9, interval[float64]): [(0.0, 0.5] < (1.0, 1.5] < (0.5, 1.0] < (1.5, 2.0] ... (3.0, 3.5] < (2.5, 3.0] < (3.5, 4.0] < (4.0, 4.5]]
dc.carat.cat.reorder_categories(dc.carat.cat.categories[::-1])
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
Name: carat, dtype: category
Categories (9, interval[float64]): [(4.0, 4.5] < (3.5, 4.0] < (3.0, 3.5] < (2.5, 3.0] ... (1.5, 2.0] < (1.0, 1.5] < (0.5, 1.0] < (0.0, 0.5]]
I submitted a bug: https://github.com/pandas-dev/pandas/issues/23452
Python libraries are disappointingly sloppy about using the word "percent" correctly. "Cent" is 100 and percents are supposed to be on a scale of 100, but often the word is used very shoddily on the scale of 1 instead. Here's an example:
pd.DataFrame({'x':[1,1,2,1]}).pct_change()
| | x |
|---|---|
0 | NaN |
1 | 0.0 |
2 | 1.0 |
3 | -0.5 |
This should be a 100% increase from 1 to 2, and then a 50% decrease from 2 back to 1, but it's actually 1.0 and -0.5. It's not a percent.
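If you want actual percentages out of pct_change, you have to scale by 100 yourself:
# Multiply the fractional change by 100 to get a true percent:
100 * pd.DataFrame({'x': [1, 1, 2, 1]}).pct_change()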
The same thing is true in `scipy.stats`, for example with the "percent point function", which actually goes from 0 to 1, not 0 to 100:
stats.norm.ppf([0.025,0.975]).round(2)
array([-1.96, 1.96])
Please do not follow these bad examples. The word "percent" does have a meaning.
These are some things I intend to write about but haven't gotten to yet:
- .corr, .cov, .corrwith
- .duplicated, .drop_duplicates
- .sample(replace=), .take
- .get_dummies