In spirit, most pandas operations boil down to one of two kinds of functions: element-wise operations (f_elwise) and aggregations (f_agg).
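In plain, ungrouped pandas terms the distinction looks roughly like this (a small illustration, not siuba code):

from siuba.data import mtcars

mtcars.mpg + mtcars.hp    # element-wise (f_elwise): one value per row
mtcars.mpg.mean()         # aggregation (f_agg): a single value (or one per group)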
Assuming that SeriesGroupBy were built as a subtype of Series, in the Liskov substitution sense, this would mean that a signature like...
f_elwise(SeriesGroupBy, SeriesGroupBy) -> Series
could easily support versions that are...
f_elwise(Series, Series) -> Series
f_elwise(SeriesGroupBy, SeriesGroupBy) -> SeriesGroupBy
This would be extremely convenient, since it means that defining a function like f_add = f_elwise(...) would support all of the operations in the Python code below...
from siuba.data import mtcars
g_cyl = mtcars.groupby('cyl')
# assume this creates the function f_add
f_add = f_elwise('add')
mpg2 = f_add(mtcars.mpg, mtcars.mpg) # -> Series
g_cyl_mpg2 = f_add(g_cyl.mpg, g_cyl.mpg) # -> SeriesGroupBy
The reality is that pandas SeriesGroupBy objects are not subtypes of Series. More than that, they do not support addition.
%%capture
import pandas as pd
pd.set_option("display.max_rows", 5)
from siuba import _
from siuba.data import mtcars
g_cyl = mtcars.groupby("cyl")
## Both snippets below raise an error.... :/
g_cyl.mpg + g_cyl.mpg
g_cyl.add(g_cyl.mpg)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-b2b9d78e502f> in <module>
      9
     10 ## Both snippets below raise an error.... :/
---> 11 g_cyl.mpg + g_cyl.mpg
     12 g_cyl.add(g_cyl.mpg)

TypeError: unsupported operand type(s) for +: 'SeriesGroupBy' and 'SeriesGroupBy'
However, a SeriesGroupBy keeps the underlying Series on its .obj attribute, so we can still perform these operations ourselves. This is shown below (note that all results are Series).
# two ways to do the f_elwise operation (addition)
ser_mpg2 = mtcars.mpg + mtcars.mpg
ser_mpg2 = g_cyl.mpg.obj + g_cyl.mpg.obj
# doing grouped aggregate
g_cyl.mpg.mean()
cyl
4    26.663636
6    19.742857
8    15.100000
Name: mpg, dtype: float64
degroup = lambda ser: getattr(ser, "obj", ser)
f_add = lambda x, y: degroup(x) + degroup(y)
f_add(g_cyl.mpg, f_add(g_cyl.mpg, 1))
0     43.0
1     43.0
      ...
30    31.0
31    43.8
Name: mpg, Length: 32, dtype: float64
Also, as noted in the first section, we are returning a Series here, but functions returning a SeriesGroupBy should also be compatible (so long as we enforce Liskov substitution).
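As a rough sketch of what that could look like (hypothetical helper, not siuba's implementation), we could re-attach whichever grouper the inputs carried:

# a sketch only: an f_add variant that returns a SeriesGroupBy by re-grouping the result
def f_add_grouped(x, y):
    grouper = getattr(x, "grouper", getattr(y, "grouper", None))
    res = degroup(x) + degroup(y)
    # re-attach the grouper so the result is a SeriesGroupBy
    return res.groupby(grouper) if grouper is not None else res

f_add_grouped(g_cyl.mpg, g_cyl.mpg)   # -> SeriesGroupBy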
f_elwise(f_agg(a), f_agg(b)) -> same length result

Suppose we wanted to add the mean mpg of each group to each row of mpg in the original data.
In the hypothetical system laid out above, this would look like...
f_mean = f_agg('mean')
f_add = f_elwise('add')
res = f_add(g_cyl.mpg, f_mean(g_cyl.mpg))
Remember that for f_add, we laid out in the first section that it should allow substituting in functions that take a SeriesGroupBy (or a parent type) and return a Series (or a subtype).
from pandas.core import algorithms

def broadcast_agg_result(grouper, result, obj):
    # Simplified broadcasting from g_cyl.mpg.transform('mean')
    ids, _, ngroup = grouper.group_info
    out = algorithms.take_1d(result._values, ids)
    return pd.Series(out, index=obj.index, name=obj.name)

f_mean = lambda x: broadcast_agg_result(x.grouper, x.mean(), degroup(x))
f_add(f_mean(g_cyl.mpg), f_mean(g_cyl.hp))
0     142.028571
1     142.028571
         ...
30    224.314286
31    109.300000
Length: 32, dtype: float64
Notice that we can keep going with this, since each result can be passed straight back in as an input...
f_add(g_cyl.mpg, f_add(f_mean(g_cyl.mpg), f_mean(g_cyl.hp)))
0     163.028571
1     163.028571
         ...
30    239.314286
31    130.700000
Length: 32, dtype: float64
However, there are two problems here.
The main issue is that a Series is implicitly a single group. To get around this, f_elwise should decide when to broadcast, and all operations should return a SeriesGroupBy. The other problem, the length of the aggregate result, is taken up in the next section.
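To make the first problem concrete: the plain Series coming out of the chain above carries no record of the cyl groups, so any further grouped computation has to re-attach them by hand.

res = f_add(g_cyl.mpg, f_add(f_mean(g_cyl.mpg), f_mean(g_cyl.hp)))

res.mean()                       # a single overall number -- the group structure is gone
res.groupby(mtcars.cyl).mean()   # per-group results require manually re-grouping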
f_elwise(f_agg(a), f_agg(b)) -> agg length result

Above, we had the aggregate return a result the same length as the original data. But this goes against our initial description that f_agg returns a result whose length is the number of groupings.
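The length difference is easy to see in plain pandas: a true aggregation has one row per group, while the broadcast version used above has one row per original row.

g_cyl.mpg.mean()               # length 3: one value per cyl group
g_cyl.mpg.transform('mean')    # length 32: the group means broadcast to every row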
In this case, we need to think more about f_agg's type signature.
To do this, let's consider a new type, AggGroupBy, where...

AggGroupBy is a subtype of SeriesGroupBy
AggGroupBy has 1 row per grouping
f_agg(SeriesGroupBy) -> AggGroupBy
Finally, let's make one drastically simplifying requirement: every operation takes grouped data (a SeriesGroupBy or a literal) and returns a grouped Series (a SeriesGroupBy).
This means that if our operations return grouped Series, then we don't need to worry about the Series case any more. For example, under this system these operations are allowed...
f_agg(g_cyl.mpg)
f_elwise(g_cyl.mpg, 1)
f_elwise(f_agg(g_cyl.mpg), g_cyl.mpg)
from pandas.core.groupby import SeriesGroupBy
from pandas.core import algorithms

# Define Agg Result ----
def create_agg_result(ser, orig_object, orig_grouper):
    # since pandas groupby method is hard-coded to create a SeriesGroupBy, mock
    # AggResult below by making it a SeriesGroupBy whose grouper has 2 extra attributes
    obj = ser.groupby(ser.index)
    obj.grouper.orig_grouper = orig_grouper
    obj.grouper.orig_object = orig_object
    return obj

def is_agg_result(x):
    return hasattr(x, "grouper") and hasattr(x.grouper, "orig_grouper")

# Handling Grouped Operations ----
def regroup(ser, grouper):
    return ser.groupby(grouper)

def degroup(ser):
    # returns tuple of (Series or literal, Grouper or None)
    # because we can't rely on type checking, use hasattr instead
    return getattr(ser, "obj", ser), getattr(ser, "grouper", None)

def f_mean(x):
    # SeriesGroupBy -> AggResult
    return create_agg_result(x.mean(), x.obj, x.grouper)

def broadcast_agg_result(g_ser, compare=None):
    """Returns a tuple of (Series, final op grouper)"""
    if not isinstance(g_ser, SeriesGroupBy):
        return g_ser, compare.grouper

    # NOTE: now only applying for agg_result
    if not is_agg_result(g_ser):
        return degroup(g_ser)

    if g_ser.grouper.orig_grouper is compare.grouper:
        orig = g_ser.grouper.orig_object
        grouper = g_ser.grouper.orig_grouper

        # Simplified broadcasting from g_cyl.mpg.transform('mean') implementation
        ids, _, ngroup = grouper.group_info
        out = algorithms.take_1d(g_ser.obj._values, ids)
        return pd.Series(out, index=orig.index, name=orig.name), grouper

    return degroup(g_ser)
# Define operations ----
def f_add(x, y):
    # SeriesGroupBy, SeriesGroupBy -> SeriesGroupBy
    broad_x, grouper = broadcast_agg_result(x, y)
    broad_y, __ = broadcast_agg_result(y, x)

    res = broad_x + broad_y
    return regroup(res, grouper)
grouped_agg = f_add(f_mean(g_cyl.mpg), f_mean(g_cyl.hp))
# Notice, only 1 result per group
grouped_agg.obj
cyl
4    109.300000
6    142.028571
8    224.314286
dtype: float64
grouped_mutate = f_add(g_cyl.mpg, grouped_agg)
grouped_mutate.obj
0     163.028571
1     163.028571
         ...
30    239.314286
31    130.700000
Name: mpg, Length: 32, dtype: float64
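And because grouped_mutate is itself a SeriesGroupBy, it can flow straight into further grouped operations, which was the whole point of returning grouped results everywhere. For example, continuing with the mock functions above:

f_add(grouped_mutate, f_mean(g_cyl.hp)).obj   # still one value per original row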
Functions should essentially follow the signatures described above.
We can use a final method at the end to validate the result, depending on whether it's a mutate, summarize, or filter.
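As a sketch of what that validation could look like (hypothetical names, not siuba's actual code):

def validate_result(result, verb, orig_object, orig_grouper):
    # mutate: one value per original row; summarize: one value per group;
    # filter: a boolean value per original row
    n = len(result.obj)
    if verb == "mutate" and n != len(orig_object):
        raise ValueError("mutate must return one value per row")
    elif verb == "summarize" and n != orig_grouper.ngroups:
        raise ValueError("summarize must return one value per group")
    elif verb == "filter" and result.obj.dtype != bool:
        raise ValueError("filter must return a boolean value per row")
    return result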
Additionally, the same approach should extend to other grouped methods, like __getitem__, etc.