In [16]:

import pandas as pd

pd.set_option("display.max_rows", 5)

Backends¶

Quick examples¶

pandas (fast grouped)¶

In [23]:

# pandas fast grouped implementation ----
from siuba.data import cars
from siuba import _
from siuba.experimental.pd_groups import fast_mutate, fast_filter, fast_summarize

fast_mutate(
    cars.groupby('cyl'),
    avg_mpg    = _.mpg.mean(),          # aggregation
    hp_per_mpg = _.hp / _.mpg,          # elementwise    
    demeaned   = _.hp - _.hp.mean(),    # elementwise + agg
)

Out[23]:

(grouped data frame)

	cyl	mpg	hp	avg_mpg	hp_per_mpg	demeaned
0	6	21.0	110	19.742857	5.238095	-12.285714
1	6	21.0	110	19.742857	5.238095	-12.285714
...	...	...	...	...	...	...
30	8	15.0	335	15.100000	22.333333	125.785714
31	4	21.4	109	26.663636	5.093458	26.363636

32 rows × 6 columns

SQL¶

In [33]:

from siuba import _, mutate, group_by, summarize, show_query
from siuba.sql import LazyTbl
from sqlalchemy import create_engine

# create sqlite db, add pandas DataFrame to it
engine = create_engine("sqlite:///:memory:")
cars.to_sql("cars", engine, if_exists="replace")

# define query
q = (LazyTbl(engine, "cars")
    >> group_by(_.cyl)
    >> summarize(avg_mpg=_.mpg.mean())
)

q

Out[33]:

# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:

	cyl	avg_mpg
0	4	26.663636
1	6	19.742857
2	8	15.100000

# .. may have more rows

In [35]:

res = show_query(q)

SELECT cars.cyl, avg(cars.mpg) AS avg_mpg 
FROM cars GROUP BY cars.cyl

Supported methods¶

The table below shows the pandas methods supported by each backend. The regular, ungrouped backend supports all methods. The fast grouped implementation supports most methods that can be used without calling the (slow) DataFrame.apply method.

🚧This table is displayed a bit funky, but will be cleaned up!

pandas (ungrouped)¶

In general, ungrouped pandas DataFrames do not require any translation. On this kind of data, verbs like mutate are just alternative implementations of methods like DataFrame.assign.

In [6]:

from siuba import _, mutate

df = pd.DataFrame({
    'g': ['a', 'a', 'b'],    
    'x': [1,2,3],
    })

df.assign(y = lambda _: _.x + 1)

mutate(df, y = _.x + 1)

Out[6]:

	g	x	y
0	a	1	2
1	a	2	3
2	b	3	4

Siuba verbs also work on grouped DataFrames, but are not always fast. On grouped data, they serve as the reference implementation, which is potentially slow.

In [36]:

mutate(
    df.groupby('g'),
    y = _.x + 1,
    z = _.x - _.x.mean()
)

Out[36]:

(grouped data frame)

	g	x	y	z
0	a	1	2	-0.5
1	a	2	3	0.5
2	b	3	4	0.0
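To see why a reference implementation on grouped data can be slow, it may help to compare two plain-pandas ways of computing the z column. This is only a sketch of the general idea and not siuba's actual code: the first calls a Python function once per group, while the second uses a single vectorized transform, the kind of operation a fast grouped implementation can target.

```python
import pandas as pd

df = pd.DataFrame({
    'g': ['a', 'a', 'b'],
    'x': [1, 2, 3],
})

# Per-group Python function: flexible, but called once for every group.
# group_keys=False keeps the original row index on the result.
z_slow = df.groupby('g', group_keys=False)['x'].apply(lambda x: x - x.mean())

# Vectorized: one grouped mean, broadcast back to the original rows.
z_fast = df['x'] - df.groupby('g')['x'].transform('mean')

print(z_slow.tolist())
print(z_fast.tolist())
```

Both approaches give the same values here; the difference only shows up in run time as the number of groups grows.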

pandas (fast grouped)¶

Note that you can enable these fast methods by default by aliasing them at import.

from siuba.experimental.pd_groups import fast_mutate as mutate

Architecture (1)¶

Currently, the fast grouped implementation puts all the logic in the verbs. That is, fast_mutate registers a function for DataFrameGroupBy that handles all the necessary translation of lazy expressions.
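The idea of a verb that dispatches on DataFrameGroupBy can be sketched with functools.singledispatch. Everything below is hypothetical and for illustration only: toy_fast_mutate is not part of siuba, and the "translation" step is faked by passing (column, aggregation) tuples instead of lazy expressions.

```python
from functools import singledispatch

import pandas as pd
from pandas.core.groupby import DataFrameGroupBy


@singledispatch
def toy_fast_mutate(data, **kwargs):
    # Fallback for types with no registered implementation.
    raise NotImplementedError(f"unsupported type: {type(data)}")


@toy_fast_mutate.register(DataFrameGroupBy)
def _toy_fast_mutate_grouped(gdf, **kwargs):
    # All of the logic lives inside the verb. Each keyword is
    # new_column=(source column, aggregation name); the aggregation is
    # computed per group and broadcast back to the rows with transform.
    # Unlike the real verb, this toy returns an ungrouped DataFrame.
    out = gdf.obj.copy()
    for new_col, (col, agg) in kwargs.items():
        out[new_col] = gdf[col].transform(agg)
    return out


df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3]})
res = toy_fast_mutate(df.groupby('g'), avg_x=('x', 'mean'))
print(res)
```

Calling toy_fast_mutate on an ungrouped DataFrame raises NotImplementedError, since only the DataFrameGroupBy implementation is registered in this sketch.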

See TODO link this ADR for more details.

SQL¶

Architecture (2)¶

[Figure: SQL Architecture (class diagram). Note: sqla = sqlalchemy]

Flow labels: Call, shape_call, Call, track_call_windows, (Call, List[sqla.OverClause])
Verb labels: group_by(...), arrange(...), mutate/filter/summarize(...)
SqlTbl call labels: copy(group_by = Tuple[str]), copy(order_by = Tuple[sqla.Clause]), append_op(...)

SqlTbl: source: sqla.Engine, funcs: Dict, tbl: sql.Table, ops: List[sqla.Select], group_by: Tuple, order_by: Tuple, /last_op: Select; passed to CallTreeLocal: rm_attr, call_sub_attr, dispatch_cls, result_cls; methods: append_op(Select, ...): SqlTbl, copy(**kwargs): SqlTbl, shape_call(Call, ...): Call, track_call_windows(Call), get_ordered_col_names()
WindowReplacer: columns: Mapping[sqla.Column], group_by, order_by, window_cte, windows: List; methods: method(type): type
CallTreeLocal: local: dict, call_sub_attr: set, chain_sub_attr: Bool, dispatch_cls: Type, result_cls: Type; methods: create_local_call(str, ...): Call
CallListener: generic_visit: type; methods: enter(Call): Call, exit(Call): Call, generic_enter(Call): Call, generic_exit(Call): Call, enter_if_call(T): T; enter and exit methods are named enter_<Call.func>(Call), e.g. enter___get_item__(Call)

The SQL implementation consists largely of the following:

LazyTbl - a class that holds a sqlalchemy connection, table name, and list of select statements.
Verbs that dispatch on LazyTbl - e.g. mutate takes a LazyTbl and returns a LazyTbl with a new select statement corresponding to that mutate.
CallListeners for (1) translating lazy expressions to SQL specific functions, and (2) adding grouping information to OVER clauses.
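The first two pieces can be illustrated with a toy version built on the standard library's sqlite3 module. This is a rough sketch under loud assumptions: ToyLazyTbl and toy_mutate are hypothetical names, queries are stored as plain SQL strings rather than sqlalchemy select objects, and the verb takes raw SQL expressions rather than translating lazy expressions.

```python
import sqlite3


class ToyLazyTbl:
    """Hypothetical stand-in: a connection, a table name, and a list of selects."""

    def __init__(self, conn, tbl_name, ops=None):
        self.conn = conn
        self.tbl_name = tbl_name
        # The first op simply selects everything from the source table.
        self.ops = ops if ops is not None else [f"SELECT * FROM {tbl_name}"]

    @property
    def last_op(self):
        return self.ops[-1]

    def append_op(self, op):
        # Return a new object instead of mutating, so earlier queries stay usable.
        return ToyLazyTbl(self.conn, self.tbl_name, self.ops + [op])

    def collect(self):
        # Only the last select is executed; it nests all earlier ones.
        return self.conn.execute(self.last_op).fetchall()


def toy_mutate(tbl, **kwargs):
    # Each keyword is new_column="<raw SQL expression>". The new select
    # wraps the previous one as a subquery and adds the new columns.
    new_cols = ", ".join(f"{expr} AS {name}" for name, expr in kwargs.items())
    return tbl.append_op(f"SELECT *, {new_cols} FROM ({tbl.last_op})")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (cyl INTEGER, mpg REAL, hp INTEGER)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?)",
    [(6, 21.0, 110), (4, 21.4, 109)],
)

q = toy_mutate(ToyLazyTbl(conn, "cars"), hp_per_mpg="hp / mpg")
print(q.last_op)
rows = q.collect()
```

Nothing is sent to the database until collect is called, which mirrors the lazy behavior described above: each verb only records one more select statement.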

See TODO link this ADR for more details.