This notebook serves to show how to use the groupby_agg method from pyjanitor's general functions submodule.
The groupby_agg method allows us to add the result of an aggregation from a grouping, as a new column, back to the dataframe.
Currently in pandas, to append a column back to a dataframe, you do it in three steps:
The groupby_agg allows you to replicate the same process in one step and with sensible arguments. The example below illustrates this function
# load modules
import pandas as pd
import numpy as np
from janitor import groupby_agg
data = {
'item': ['shoe', 'shoe', 'bag', 'shoe', 'bag'],
'MRP': [220, 450, 320, 200, 305],
'number_sold': [100, 40, 56, 38, 25]
}
df = pd.DataFrame(data)
df
item | MRP | number_sold | |
---|---|---|---|
0 | shoe | 220 | 100 |
1 | shoe | 450 | 40 |
2 | bag | 320 | 56 |
3 | shoe | 200 | 38 |
4 | bag | 305 | 25 |
df = df.groupby_agg(
by='item',
agg='mean',
agg_column_name='MRP',
new_column_name='Avg_MRP'
)
df
item | MRP | number_sold | Avg_MRP | |
---|---|---|---|---|
0 | shoe | 220 | 100 | 290.0 |
1 | shoe | 450 | 40 | 290.0 |
2 | bag | 320 | 56 | 312.5 |
3 | shoe | 200 | 38 | 290.0 |
4 | bag | 305 | 25 | 312.5 |
Null cells are retained as well :
df = pd.DataFrame(
{
"name": ("black", "black", "black", "red", "red"),
"type": ("chair", "chair", "sofa", "sofa", "plate"),
"num": (4, 5, 12, 4, 3),
"nulls": (1, 1, np.nan, np.nan, 3),
}
)
df
name | type | num | nulls | |
---|---|---|---|---|
0 | black | chair | 4 | 1.0 |
1 | black | chair | 5 | 1.0 |
2 | black | sofa | 12 | NaN |
3 | red | sofa | 4 | NaN |
4 | red | plate | 3 | 3.0 |
filtered_df = df.groupby_agg(by=['nulls'],
agg='size',
agg_column_name='type',
new_column_name='counter'
)
filtered_df
name | type | num | nulls | counter | |
---|---|---|---|---|---|
0 | black | chair | 4 | 1.0 | 2 |
1 | black | chair | 5 | 1.0 | 2 |
2 | black | sofa | 12 | NaN | 2 |
3 | red | sofa | 4 | NaN | 2 |
4 | red | plate | 3 | 3.0 | 1 |
The groupby_agg method can be extended for different purposes. One of these is groupwise filtering, where only groups that meet a condition are retained. Let's explore this with an example:
filtered_df = (df.groupby_agg(by=['name', 'type'],
agg='size',
agg_column_name='type',
new_column_name='counter'
)
.query('counter > 1'))
filtered_df
name | type | num | nulls | counter | |
---|---|---|---|---|---|
0 | black | chair | 4 | 1.0 | 2 |
1 | black | chair | 5 | 1.0 | 2 |