import addutils.toc ; addutils.toc.js(ipy_notebook=True)
In this tutorial we are going to see advanced data management with pandas
data structures.
import numpy as np
import pandas as pd
from IPython.display import (display, HTML)
from addutils import side_by_side2
from addutils import css_notebook
css_notebook()
Categorizing a data set and applying a function to each group, is often a critical component of a data analysis workflow. After loading, merging, and preparing a data set, a familiar task is to compute group statistics or possibly pivot tables for reporting or visualization purposes. pandas provides a flexible and high-performance groupby facility.
By 'group by' we refer to a process involving one or more of the following steps:

  * splitting the data into groups based on some criteria
  * applying a function to each group independently
  * combining the results into a data structure

Suppose we are managing a website and we have a log file with the number of views and likes coming from different cities:
d1 = pd.read_csv('temp/p07_d1.txt', index_col=0)
d1 = d1.reindex(columns=['State','City','Views','Likes'])
display(d1)
State | City | Views | Likes | |
---|---|---|---|---|
0 | NE | Page | 10 | 4 |
1 | KY | Stone | 9 | 3 |
2 | CO | Rye | 3 | 0 |
3 | CO | Rye | 7 | 2 |
4 | KY | Dema | 4 | 1 |
5 | KY | Keavy | 2 | 1 |
6 | CO | Rye | 1 | 0 |
7 | NE | Cairo | 8 | 3 |
8 | CO | Dumont | 12 | 7 |
groupby
groups a DataFrame or Series by one or more keys along a given axis:
g1 = d1.groupby('State')
print (g1.groups)
{'CO': Int64Index([2, 3, 6, 8], dtype='int64'), 'KY': Int64Index([1, 4, 5], dtype='int64'), 'NE': Int64Index([0, 7], dtype='int64')}
The groups
attribute of a GroupBy
object is a dictionary mapping each group name to the index labels of its members.
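Besides iterating over all groups, a single group can be extracted with get_group. A minimal sketch, rebuilding the small log table above inline so it runs on its own:

```python
import pandas as pd

# Same values as the d1 log table above
d1 = pd.DataFrame({
    'State': ['NE', 'KY', 'CO', 'CO', 'KY', 'KY', 'CO', 'NE', 'CO'],
    'City':  ['Page', 'Stone', 'Rye', 'Rye', 'Dema', 'Keavy', 'Rye', 'Cairo', 'Dumont'],
    'Views': [10, 9, 3, 7, 4, 2, 1, 8, 12],
    'Likes': [4, 3, 0, 2, 1, 1, 0, 3, 7],
})
g1 = d1.groupby('State')

# get_group returns the sub-DataFrame for a single key,
# keeping the original row labels
co = g1.get_group('CO')
```

This is handy when only one group is needed and a full iteration would be wasteful.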
for name,group in g1:
print (name)
print (group)
print ('Total Views: %d - Total Likes: %d\n\n' %(group['Views'].sum(),
group['Likes'].sum()))
CO
  State    City  Views  Likes
2    CO     Rye      3      0
3    CO     Rye      7      2
6    CO     Rye      1      0
8    CO  Dumont     12      7
Total Views: 23 - Total Likes: 9

KY
  State   City  Views  Likes
1    KY  Stone      9      3
4    KY   Dema      4      1
5    KY  Keavy      2      1
Total Views: 15 - Total Likes: 5

NE
  State   City  Views  Likes
0    NE   Page     10      4
7    NE  Cairo      8      3
Total Views: 18 - Total Likes: 7
It is also possible to apply a groupby
to a DataFrame with a hierarchical index:
d2 = d1.set_index(['State','City'])
display(d2)
Views | Likes | ||
---|---|---|---|
State | City | ||
NE | Page | 10 | 4 |
KY | Stone | 9 | 3 |
CO | Rye | 3 | 0 |
Rye | 7 | 2 | |
KY | Dema | 4 | 1 |
Keavy | 2 | 1 | |
CO | Rye | 1 | 0 |
NE | Cairo | 8 | 3 |
CO | Dumont | 12 | 7 |
Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data. Here we use aggregate
. The result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed with the as_index option:
g2 = d2.groupby(level=[0])
print (g2.groups)
g2.aggregate(np.sum)
{'CO': MultiIndex(levels=[['CO', 'KY', 'NE'], ['Cairo', 'Dema', 'Dumont', 'Keavy', 'Page', 'Rye', 'Stone']], labels=[[0, 0, 0, 0], [5, 5, 5, 2]], names=['State', 'City']), 'KY': MultiIndex(levels=[['CO', 'KY', 'NE'], ['Cairo', 'Dema', 'Dumont', 'Keavy', 'Page', 'Rye', 'Stone']], labels=[[1, 1, 1], [6, 1, 3]], names=['State', 'City']), 'NE': MultiIndex(levels=[['CO', 'KY', 'NE'], ['Cairo', 'Dema', 'Dumont', 'Keavy', 'Page', 'Rye', 'Stone']], labels=[[2, 2], [4, 0]], names=['State', 'City'])}
Views | Likes | |
---|---|---|
State | ||
CO | 23 | 9 |
KY | 15 | 5 |
NE | 18 | 7 |
g3 = d2.groupby(level=[0,1])
g4 = d2.groupby(level=[0,1], as_index=False)
HTML(side_by_side2(g3.aggregate(np.sum), g4.aggregate(np.sum)))
Views | Likes | ||
---|---|---|---|
State | City | ||
CO | Dumont | 12 | 7 |
Rye | 11 | 2 | |
KY | Dema | 4 | 1 |
Keavy | 2 | 1 | |
Stone | 9 | 3 | |
NE | Cairo | 8 | 3 |
Page | 10 | 4 |
Views | Likes | |
---|---|---|
0 | 12 | 7 |
1 | 11 | 2 |
2 | 4 | 1 |
3 | 2 | 1 |
4 | 9 | 3 |
5 | 8 | 3 |
6 | 10 | 4 |
aggregate
accepts any function that reduces a vector to a scalar value, and can also handle lists of functions:
d1[['State', 'Views']].groupby('State').aggregate([np.sum, np.mean, np.std])
Views | |||
---|---|---|---|
sum | mean | std | |
State | |||
CO | 23 | 5.75 | 4.856267 |
KY | 15 | 5.00 | 3.605551 |
NE | 18 | 9.00 | 1.414214 |
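aggregate can also take a dictionary mapping column names to functions, so that each column gets its own statistic. A minimal sketch, rebuilding the log table inline so it runs on its own:

```python
import pandas as pd

# Same values as the d1 log table above
d1 = pd.DataFrame({
    'State': ['NE', 'KY', 'CO', 'CO', 'KY', 'KY', 'CO', 'NE', 'CO'],
    'City':  ['Page', 'Stone', 'Rye', 'Rye', 'Dema', 'Keavy', 'Rye', 'Cairo', 'Dumont'],
    'Views': [10, 9, 3, 7, 4, 2, 1, 8, 12],
    'Likes': [4, 3, 0, 2, 1, 1, 0, 3, 7],
})

# Different aggregation per column: total views, average likes
res = d1.groupby('State').aggregate({'Views': 'sum', 'Likes': 'mean'})
```

The result has one row per state, with each column reduced by its own function.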
apply
extends the previous concepts to any Python function:
pd.set_option('display.float_format', lambda x: '{:.1f}'.format(x))
def add_field(group):
group['Tot.Views'] = group['Views'].sum()
group['Likes[%]'] = 100.0*group['Likes']/group['Likes'].sum()
return group
HTML(side_by_side2(d1, d1.groupby('State').apply(add_field)))
State | City | Views | Likes | |
---|---|---|---|---|
0 | NE | Page | 10 | 4 |
1 | KY | Stone | 9 | 3 |
2 | CO | Rye | 3 | 0 |
3 | CO | Rye | 7 | 2 |
4 | KY | Dema | 4 | 1 |
5 | KY | Keavy | 2 | 1 |
6 | CO | Rye | 1 | 0 |
7 | NE | Cairo | 8 | 3 |
8 | CO | Dumont | 12 | 7 |
State | City | Views | Likes | Tot.Views | Likes[%] | |
---|---|---|---|---|---|---|
0 | NE | Page | 10 | 4 | 18 | 57.1 |
1 | KY | Stone | 9 | 3 | 15 | 60.0 |
2 | CO | Rye | 3 | 0 | 23 | 0.0 |
3 | CO | Rye | 7 | 2 | 23 | 22.2 |
4 | KY | Dema | 4 | 1 | 15 | 20.0 |
5 | KY | Keavy | 2 | 1 | 15 | 20.0 |
6 | CO | Rye | 1 | 0 | 23 | 0.0 |
7 | NE | Cairo | 8 | 3 | 18 | 42.9 |
8 | CO | Dumont | 12 | 7 | 23 | 77.8 |
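When the goal is only to broadcast a group-level statistic back to the original rows, transform is a lighter alternative to apply: it returns a result aligned to the original index. A sketch of the same computation as add_field above, with the log table rebuilt inline:

```python
import pandas as pd

# Same values as the d1 log table above
d1 = pd.DataFrame({
    'State': ['NE', 'KY', 'CO', 'CO', 'KY', 'KY', 'CO', 'NE', 'CO'],
    'City':  ['Page', 'Stone', 'Rye', 'Rye', 'Dema', 'Keavy', 'Rye', 'Cairo', 'Dumont'],
    'Views': [10, 9, 3, 7, 4, 2, 1, 8, 12],
    'Likes': [4, 3, 0, 2, 1, 1, 0, 3, 7],
})

# transform broadcasts each group's aggregate back to every row of the group
d1['Tot.Views'] = d1.groupby('State')['Views'].transform('sum')
d1['Likes[%]'] = 100.0 * d1['Likes'] / d1.groupby('State')['Likes'].transform('sum')
```

Unlike apply, transform never changes the shape of the data, which makes it a good fit for adding derived columns.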
groupby also works well with time series. Here we build roughly six years of business-day data and smooth it with a rolling mean:
idx = pd.date_range('1999/5/28', periods=1500, freq='1B')
s1 = pd.Series(np.random.normal(5.5, 2, 1500), idx)
s1 = s1.rolling(10, 10).mean().dropna()    # 10-day rolling mean, min 10 observations
Here we define a grouping key for months and one for years:
def my_groupby_key_year(timestamp):
return timestamp.year
def my_groupby_key_month(timestamp):
return timestamp.month
def my_normalization(group):
return (group-group.mean())/group.std()
Here we normalize the data on a monthly basis and check the mean and standard deviation on a yearly basis:
t1 = s1.groupby(my_groupby_key_month).apply(my_normalization)
HTML(side_by_side2(s1.head(8),
t1.head(8),
t1.groupby(my_groupby_key_year).aggregate([np.mean, np.std])))
0 | |
---|---|
1999-06-10 | 5.0 |
1999-06-11 | 4.8 |
1999-06-14 | 5.0 |
1999-06-15 | 5.0 |
1999-06-16 | 5.4 |
1999-06-17 | 5.5 |
1999-06-18 | 5.9 |
1999-06-21 | 6.4 |
0 | |
---|---|
1999-06-10 | -0.6 |
1999-06-11 | -1.0 |
1999-06-14 | -0.6 |
1999-06-15 | -0.6 |
1999-06-16 | -0.1 |
1999-06-17 | 0.1 |
1999-06-18 | 0.7 |
1999-06-21 | 1.5 |
mean | std | |
---|---|---|
1999 | 0.5 | 0.8 |
2000 | 0.1 | 1.2 |
2001 | -0.3 | 0.9 |
2002 | 0.0 | 1.0 |
2003 | 0.0 | 1.1 |
2004 | -0.1 | 0.8 |
2005 | -0.1 | 0.7 |
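With a DatetimeIndex, the same grouping can be expressed without helper functions by passing index attributes directly as keys. A minimal sketch with freshly generated data (seeded so it is reproducible; the values differ from the series above):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('1999-05-28', periods=1500, freq='B')
rng = np.random.default_rng(0)
s1 = pd.Series(rng.normal(5.5, 2, 1500), idx)

# Group directly on calendar attributes of the DatetimeIndex
monthly_mean = s1.groupby(s1.index.month).mean()
yearly_std = s1.groupby(s1.index.year).std()
```

This is equivalent to the custom key functions above, just more concise.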
d3 = pd.read_csv('example_data/company.csv', index_col=0)
display(d3.head())
Value | expenses | employees | Dimension | |
---|---|---|---|---|
Company | ||||
Dapibus Company | 96 008 | 7124 | 78 | Big |
Pede Blandit Congue Company | 61 562 | 8454 | 60 | Big |
Pede Suspendisse Associates | 54 728 | 6641 | 4 | Small |
Dictum Associates | 16 802 | 6498 | 89 | Big |
Dui Cras Pellentesque Ltd | 93 954 | 5040 | 97 | Big |
Since the column "Value" contains strings with a space as thousands separator, we need a simple intermediate step to convert the values from strings to floats:
d3['Value'] = d3['Value'].apply(lambda x: float(x.replace(' ', '')))
d3.head()
Value | expenses | employees | Dimension | |
---|---|---|---|---|
Company | ||||
Dapibus Company | 96008.0 | 7124 | 78 | Big |
Pede Blandit Congue Company | 61562.0 | 8454 | 60 | Big |
Pede Suspendisse Associates | 54728.0 | 6641 | 4 | Small |
Dictum Associates | 16802.0 | 6498 | 89 | Big |
Dui Cras Pellentesque Ltd | 93954.0 | 5040 | 97 | Big |
d3.groupby('Dimension').mean()
Value | expenses | employees | |
---|---|---|---|
Dimension | |||
Big | 49445.4 | 5474.9 | 61.1 |
Small | 60947.8 | 4521.4 | 31.8 |
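A GroupBy object also supports filter, which keeps only the rows belonging to groups that satisfy a predicate. Since the company CSV is not reproducible here, a sketch reusing the small log table from the beginning of the tutorial:

```python
import pandas as pd

# Same values as the d1 log table above
d1 = pd.DataFrame({
    'State': ['NE', 'KY', 'CO', 'CO', 'KY', 'KY', 'CO', 'NE', 'CO'],
    'City':  ['Page', 'Stone', 'Rye', 'Rye', 'Dema', 'Keavy', 'Rye', 'Cairo', 'Dumont'],
    'Views': [10, 9, 3, 7, 4, 2, 1, 8, 12],
    'Likes': [4, 3, 0, 2, 1, 1, 0, 3, 7],
})

# Keep only the rows of states whose total Views exceed 16
busy = d1.groupby('State').filter(lambda g: g['Views'].sum() > 16)
```

Here CO (23 total views) and NE (18) pass the predicate, while KY (15) is dropped entirely.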
Visit www.add-for.com for more tutorials and updates.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.