import pandas as pd
pd.set_option("display.max_rows", 5)
# pandas fast grouped implementation ----
from siuba.data import cars
from siuba import _
from siuba.experimental.pd_groups import fast_mutate, fast_filter, fast_summarize
fast_mutate(
cars.groupby('cyl'),
avg_mpg = _.mpg.mean(), # aggregation
hp_per_mpg = _.hp / _.mpg, # elementwise
demeaned = _.hp - _.hp.mean(), # elementwise + agg
)
(grouped data frame)
cyl | mpg | hp | avg_mpg | hp_per_mpg | demeaned | |
---|---|---|---|---|---|---|
0 | 6 | 21.0 | 110 | 19.742857 | 5.238095 | -12.285714 |
1 | 6 | 21.0 | 110 | 19.742857 | 5.238095 | -12.285714 |
... | ... | ... | ... | ... | ... | ... |
30 | 8 | 15.0 | 335 | 15.100000 | 22.333333 | 125.785714 |
31 | 4 | 21.4 | 109 | 26.663636 | 5.093458 | 26.363636 |
32 rows × 6 columns
from siuba import _, mutate, group_by, summarize, show_query
from siuba.sql import LazyTbl
from sqlalchemy import create_engine
# create sqlite db, add pandas DataFrame to it
engine = create_engine("sqlite:///:memory:")
cars.to_sql("cars", engine, if_exists="replace")
# define query
q = (LazyTbl(engine, "cars")
>> group_by(_.cyl)
>> summarize(avg_mpg=_.mpg.mean())
)
q
# Source: lazy query # DB Conn: Engine(sqlite:///:memory:) # Preview:
cyl | avg_mpg | |
---|---|---|
0 | 4 | 26.663636 |
1 | 6 | 19.742857 |
2 | 8 | 15.100000 |
# .. may have more rows
res = show_query(q)
SELECT cars.cyl, avg(cars.mpg) AS avg_mpg FROM cars GROUP BY cars.cyl
The table below shows the pandas methods supported by different backends. Note that the regular, ungrouped backend supports all methods, and the fast grouped implementation supports most methods a person could use without having to call the (slow) DataFrame.apply
method.
🚧This table is displayed a bit funky, but will be cleaned up!
In general, ungrouped pandas DataFrames do not require any translation.
On this kind of data, verbs like mutate
are just alternative implementations of methods like DataFrame.assign
.
from siuba import _, mutate
df = pd.DataFrame({
'g': ['a', 'a', 'b'],
'x': [1,2,3],
})
df.assign(y = lambda _: _.x + 1)
mutate(df, y = _.x + 1)
g | x | y | |
---|---|---|---|
0 | a | 1 | 2 |
1 | a | 2 | 3 |
2 | b | 3 | 4 |
Siuba verbs also work on grouped DataFrames, but are not always fast. They are the potentially slow, reference implementation.
mutate(
df.groupby('g'),
y = _.x + 1,
z = _.x - _.x.mean()
)
(grouped data frame)
g | x | y | z | |
---|---|---|---|---|
0 | a | 1 | 2 | -0.5 |
1 | a | 2 | 3 | 0.5 |
2 | b | 3 | 4 | 0.0 |
Note that you could easily enable these fast methods by default, by aliasing them at import.
from siuba.experimental.pd_groups import fast_mutate as mutate
Currently, the fast grouped implementation puts all the logic in the verbs. That is, fast_mutate
dispatches for DataFrameGroupBy a function that handles all the necessary translation of lazy expressions.
See TODO link this ADR for more details.
The SQL implementation consists largely of the following:
See TODO link this ADR for more details.