In spirit, most pandas operations boil down to one of two kinds of functions: element-wise operations (f_elwise) and aggregations (f_agg).
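In plain, ungrouped pandas terms the distinction looks roughly like this (a small illustration, not siuba code):

from siuba.data import mtcars

mtcars.mpg + mtcars.hp    # element-wise (f_elwise): one value per row
mtcars.mpg.mean()         # aggregation (f_agg): a single value (or one per group)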
Assuming that SeriesGroupBy were built as a subtype of Series, in the Liskov substitution sense, this would mean that a signature like...
f_elwise(SeriesGroupBy, SeriesGroupBy) -> Series
could easily support versions that are...
f_elwise(Series, Series) -> Series
f_elwise(SeriesGroupBy, SeriesGroupBy) -> SeriesGroupBy
This would be extremely convenient, since it means that defining a function like f_add = f_elwise(...) would support all of the operations in the Python code below...
from siuba.data import mtcars
g_cyl = mtcars.groupby('cyl')
# assume this creates the function f_add
f_add = f_elwise('add')
mpg2 = f_add(mtcars.mpg, mtcars.mpg) # -> Series
g_cyl_mpg2 = f_add(g_cyl.mpg, g_cyl.mpg) # -> SeriesGroupBy
The reality is that pandas SeriesGroupBy objects are not subtypes of Series. More than that, they do not support addition.
%%capture
import pandas as pd
pd.set_option("display.max_rows", 5)
from siuba import _
from siuba.data import mtcars
g_cyl = mtcars.groupby("cyl")
## Both snippets below raise an error.... :/
g_cyl.mpg + g_cyl.mpg
g_cyl.add(g_cyl.mpg)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-b2b9d78e502f> in <module>
      9
     10 ## Both snippets below raise an error.... :/
---> 11 g_cyl.mpg + g_cyl.mpg
     12 g_cyl.add(g_cyl.mpg)

TypeError: unsupported operand type(s) for +: 'SeriesGroupBy' and 'SeriesGroupBy'
However, a SeriesGroupBy keeps the underlying Series on its .obj attribute, so we can still perform these operations ourselves. This is shown below (note that all results are Series).
# two ways to do the f_elwise operation (addition)
ser_mpg2 = mtcars.mpg + mtcars.mpg
ser_mpg2 = g_cyl.mpg.obj + g_cyl.mpg.obj
# doing grouped aggregate
g_cyl.mpg.mean()
cyl
4    26.663636
6    19.742857
8    15.100000
Name: mpg, dtype: float64
degroup = lambda ser: getattr(ser, "obj", ser)
f_add = lambda x, y: degroup(x) + degroup(y)
f_add(g_cyl.mpg, f_add(g_cyl.mpg, 1))
0     43.0
1     43.0
      ...
30    31.0
31    43.8
Name: mpg, Length: 32, dtype: float64
Also, as noted in the first section, we are returning a Series here, but functions returning a SeriesGroupBy should also be compatible (so long as we enforce Liskov substitution).
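As a rough sketch of what that could look like (hypothetical helper, not siuba's implementation), we could re-attach whichever grouper the inputs carried:

# a sketch only: an f_add variant that returns a SeriesGroupBy by re-grouping the result
def f_add_grouped(x, y):
    grouper = getattr(x, "grouper", getattr(y, "grouper", None))
    res = degroup(x) + degroup(y)
    # re-attach the grouper so the result is a SeriesGroupBy
    return res.groupby(grouper) if grouper is not None else res

f_add_grouped(g_cyl.mpg, g_cyl.mpg)   # -> SeriesGroupBy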
f_elwise(f_agg(a), f_agg(b)) -> same length result

Suppose we wanted to add the mean mpg of each group to each row of mpg in the original data.
In the hypothetical system laid out above, this would look like...
f_mean = f_agg('mean')
f_add = f_elwise('add')
res = f_add(g_cyl.mpg, f_mean(g_cyl.mpg))
Remember that for f_add, we laid out in the first section that it should allow substituting in functions that take a SeriesGroupBy (or a parent type) and return a Series (or a subtype).
from pandas.core import algorithms

def broadcast_agg_result(grouper, result, obj):
    # Simplified broadcasting from g_cyl.mpg.transform('mean')
    ids, _, ngroup = grouper.group_info
    out = algorithms.take_1d(result._values, ids)
    return pd.Series(out, index=obj.index, name=obj.name)

f_mean = lambda x: broadcast_agg_result(x.grouper, x.mean(), degroup(x))
f_add(f_mean(g_cyl.mpg), f_mean(g_cyl.hp))
0     142.028571
1     142.028571
         ...
30    224.314286
31    109.300000
Length: 32, dtype: float64
Notice that we can keep going with this, since each result can be passed straight back in as an input...
f_add(g_cyl.mpg, f_add(f_mean(g_cyl.mpg), f_mean(g_cyl.hp)))
0     163.028571
1     163.028571
         ...
30    239.314286
31    130.700000
Length: 32, dtype: float64
However, there are two problems here.
The main issue is that a Series is implicitly a single group. To get around this, f_elwise should decide when to broadcast, and all operations should return a SeriesGroupBy. The other problem, the length of the aggregate result, is taken up in the next section.
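To make the first problem concrete: the plain Series coming out of the chain above carries no record of the cyl groups, so any further grouped computation has to re-attach them by hand.

res = f_add(g_cyl.mpg, f_add(f_mean(g_cyl.mpg), f_mean(g_cyl.hp)))

res.mean()                       # a single overall number -- the group structure is gone
res.groupby(mtcars.cyl).mean()   # per-group results require manually re-grouping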
f_elwise(f_agg(a), f_agg(b)) -> agg length result

Above, we had the aggregate return a result the same length as the original data. But this goes against our initial description that f_agg returns a result whose length is the number of groupings.
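The length difference is easy to see in plain pandas: a true aggregation has one row per group, while the broadcast version used above has one row per original row.

g_cyl.mpg.mean()               # length 3: one value per cyl group
g_cyl.mpg.transform('mean')    # length 32: the group means broadcast to every row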
In this case, we need to think more about f_agg's type signature.
To do this, let's consider a new type, AggGroupBy, where...

AggGroupBy is a subtype of SeriesGroupBy
AggGroupBy has 1 row per grouping
f_agg(SeriesGroupBy) -> AggGroupBy
Finally, let's make one drastically simplifying requirement: every operation takes grouped data (a SeriesGroupBy or a literal) and returns a grouped Series (a SeriesGroupBy).
This means that if our operations return grouped Series, then we don't need to worry about the Series case any more. For example, under this system these operations are allowed...
f_agg(g_cyl.mpg)
f_elwise(g_cyl.mpg, 1)
f_elwise(f_agg(g_cyl.mpg), g_cyl.mpg)
from pandas.core.groupby import SeriesGroupBy
from pandas.core import algorithms

# Define Agg Result ----
def create_agg_result(ser, orig_object, orig_grouper):
    # since pandas groupby method is hard-coded to create a SeriesGroupBy, mock
    # AggResult below by making it a SeriesGroupBy whose grouper has 2 extra attributes
    obj = ser.groupby(ser.index)
    obj.grouper.orig_grouper = orig_grouper
    obj.grouper.orig_object = orig_object
    return obj

def is_agg_result(x):
    return hasattr(x, "grouper") and hasattr(x.grouper, "orig_grouper")

# Handling Grouped Operations ----
def regroup(ser, grouper):
    return ser.groupby(grouper)

def degroup(ser):
    # returns tuple of (Series or literal, Grouper or None)
    # because we can't rely on type checking, use hasattr instead
    return getattr(ser, "obj", ser), getattr(ser, "grouper", None)

def f_mean(x):
    # SeriesGroupBy -> AggResult
    return create_agg_result(x.mean(), x.obj, x.grouper)

def broadcast_agg_result(g_ser, compare=None):
    """Returns a tuple of (Series, final op grouper)"""
    if not isinstance(g_ser, SeriesGroupBy):
        return g_ser, compare.grouper

    # NOTE: now only applying for agg_result
    if not is_agg_result(g_ser):
        return degroup(g_ser)

    if g_ser.grouper.orig_grouper is compare.grouper:
        orig = g_ser.grouper.orig_object
        grouper = g_ser.grouper.orig_grouper

        # Simplified broadcasting from g_cyl.mpg.transform('mean') implementation
        ids, _, ngroup = grouper.group_info
        out = algorithms.take_1d(g_ser.obj._values, ids)
        return pd.Series(out, index=orig.index, name=orig.name), grouper

    return degroup(g_ser)
# Define operations ----
def f_add(x, y):
    # SeriesGroupBy, SeriesGroupBy -> SeriesGroupBy
    broad_x, grouper = broadcast_agg_result(x, y)
    broad_y, __ = broadcast_agg_result(y, x)

    res = broad_x + broad_y
    return regroup(res, grouper)
grouped_agg = f_add(f_mean(g_cyl.mpg), f_mean(g_cyl.hp))
# Notice, only 1 result per group
grouped_agg.obj
cyl
4    109.300000
6    142.028571
8    224.314286
dtype: float64
grouped_mutate = f_add(g_cyl.mpg, grouped_agg)
grouped_mutate.obj
0     163.028571
1     163.028571
         ...
30    239.314286
31    130.700000
Name: mpg, Length: 32, dtype: float64
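And because grouped_mutate is itself a SeriesGroupBy, it can flow straight into further grouped operations, which was the whole point of returning grouped results everywhere. For example, continuing with the mock functions above:

f_add(grouped_mutate, f_mean(g_cyl.hp)).obj   # still one value per original row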
Functions should essentially follow the signatures described above.
We can use a final method at the end to validate the result, depending on whether it's a mutate, summarize, or filter.
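As a sketch of what that validation could look like (hypothetical names, not siuba's actual code):

def validate_result(result, verb, orig_object, orig_grouper):
    # mutate: one value per original row; summarize: one value per group;
    # filter: a boolean value per original row
    n = len(result.obj)
    if verb == "mutate" and n != len(orig_object):
        raise ValueError("mutate must return one value per row")
    elif verb == "summarize" and n != orig_grouper.ngroups:
        raise ValueError("summarize must return one value per group")
    elif verb == "filter" and result.obj.dtype != bool:
        raise ValueError("filter must return a boolean value per row")
    return result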
Additionally, the same approach should extend to other grouped methods, like __getitem__, etc.