<div class="output_area rendered_html docutils container"> <table><thead><tr><th> </th><th>siuba </th><th>dplython </th><th>pandas </th></tr></thead><tbody><tr><td>Column operations are pandas Series methods </td><td>✅ </td><td>✅ </td><td>✅ </td></tr><tr><td>Table verbs supports user defined functions </td><td>✅ </td><td>✅ </td><td>✅ </td></tr><tr><td>pipe syntax (`>>`) </td><td>✅ </td><td>✅ </td><td>❌ </td></tr><tr><td>concise, <b>lazy expressions</b> (`_.a + _.b`) </td><td>✅ </td><td>✅ </td><td>❌ </td></tr><tr><td>No more reset_index </td><td>✅ </td><td>✅ </td><td>❌ </td></tr><tr><td><b>unified API</b> over (un)grouped data </td><td>✅ </td><td>✅ </td><td>❌ </td></tr><tr><td>generate fast grouped operations </td><td>✅ </td><td>❌ </td><td>✅ </td></tr><tr><td><br>generate SQL queries</b> </td><td>✅ </td><td>❌ </td><td>❌ </td></tr><tr><td>Abstract syntax trees for <br>transforming operations<b></td><td>✅ </td><td>❌ </td><td>❌ </td></tr><tr><td>handles nested data </td><td>✅ </td><td>❌ </td><td>⚠️ </td></tr></tbody></table> </div>
pandas is everywhere in Python data analysis. The siuba library builds on this incredible work by using pandas Series methods as its reference implementation. This means you can use the pandas methods you've already learned!
import pandas as pd
from siuba import _, mutate
my_data = pd.DataFrame({
'g': ['a', 'a', 'b'],
'x': [1,2,3],
})
# pandas
my_data.assign(avg = lambda d: d.x.mean())
# siuba
mutate(my_data, avg = _.x.mean())
|   | g | x | avg |
|---|---|---|-----|
| 0 | a | 1 | 2.0 |
| 1 | a | 2 | 2.0 |
| 2 | b | 3 | 2.0 |
Note how you can debug both pieces of code by running and inspecting `my_data.x.mean()`.
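For example, you can run the column operation on its own:

# evaluate the expression directly, outside of mutate
my_data.x.mean()   # 2.0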
While pandas is an incredibly powerful API, its syntax can get quite cumbersome.
(my_data
.assign(avg = lambda d: d.x.mean()) # create new column
.loc[lambda d: d.x != 3] # filter out some rows
)
|   | g | x | avg |
|---|---|---|-----|
| 0 | a | 1 | 2.0 |
| 1 | a | 2 | 2.0 |
Notice how much of this code is spent writing the word `lambda`.
Like other ports of the popular R library dplyr, such as dplython, siuba offers a simple, flexible way to work on tables of data.
The pipe syntax allows you to import table functions (verbs), rather than having 300+ methods on your DataFrame.
# actions can be imported individually
from siuba import mutate, arrange
# they can be combined using a pipe
my_data >> mutate(y = _.x + 1) >> arrange(_.g, -_.x)
|   | g | x | y |
|---|---|---|---|
| 1 | a | 2 | 3 |
| 0 | a | 1 | 2 |
| 2 | b | 3 | 4 |
Using lazy expressions saves you from repeating the name of your DataFrame over and over.
# rather than repeat the name of your data, you can use lazy expressions ---
my_data_frame = pd.DataFrame({'a': [1,2,3]})
# bad
my_data_frame["b"] = my_data_frame["a"] + 1
my_data_frame["c"] = my_data_frame["b"] + 2
# good
my_data_frame >> mutate(b = _.a + 1, c = _.b + 2)
|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 4 |
| 1 | 2 | 3 | 5 |
| 2 | 3 | 4 | 6 |
Notice how siuba's `mutate` takes a DataFrame and returns a DataFrame. Moreover, it doesn't stick columns onto the index, so you don't need to call `reset_index` all the time.
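A quick check (reusing `my_data` from above) confirms this:

# the result is an ordinary pandas DataFrame, with its original index intact
out = my_data >> mutate(avg = _.x.mean())
type(out)    # pandas.core.frame.DataFrame
out.index    # RangeIndex(start=0, stop=3, step=1)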
A common place where `reset_index` gets called is after a pandas grouped aggregation.
from siuba.data import mtcars
from siuba import summarize
g_cyl = mtcars.groupby("cyl")
agg_res = g_cyl[["hp", "mpg"]].agg("mean")
agg_res
| cyl (index) | hp | mpg |
|---|---|---|
| 4 | 82.636364 | 26.663636 |
| 6 | 122.285714 | 19.742857 |
| 8 | 209.214286 | 15.100000 |
# bad
agg_res.reset_index()
|   | cyl | hp | mpg |
|---|---|---|---|
| 0 | 4 | 82.636364 | 26.663636 |
| 1 | 6 | 122.285714 | 19.742857 |
| 2 | 8 | 209.214286 | 15.100000 |
# good
summarize(g_cyl, hp = _.hp.mean(), mpg = _.mpg.mean())
|   | cyl | hp | mpg |
|---|---|---|---|
| 0 | 4 | 82.636364 | 26.663636 |
| 1 | 6 | 122.285714 | 19.742857 |
| 2 | 8 | 209.214286 | 15.100000 |
In siuba it doesn't matter whether your data is grouped or not.
| data | siuba | pandas |
|---|---|---|
| ungrouped | `mutate(mtcars, res = _.hp - _.hp.mean())` | `mtcars.assign(res = lambda d: d.hp - d.hp.mean())` |
| grouped | `mutate(g_cyl, res = _.hp - _.hp.mean())` | `mtcars.assign(res = mtcars.hp - g_cyl.hp.transform("mean"))` |
g_cyl = mtcars.groupby("cyl")
mtcars >> mutate(demeaned = _.hp - _.hp.mean()) # uses ungrouped mean
g_cyl >> mutate(demeaned = _.hp - _.hp.mean()) # uses grouped mean
(grouped data frame)
|   | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | demeaned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 | -12.285714 |
| 1 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 | -12.285714 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30 | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 | 125.785714 |
| 31 | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 | 26.363636 |
32 rows × 12 columns
In pandas you have to change your code for grouped data.
g_cyl = mtcars.groupby("cyl")
# ungrouped vs grouped mean
mtcars.assign(demeaned = lambda d: d.hp - d.hp.mean())
mtcars.assign(demeaned = g_cyl.obj.hp - g_cyl.hp.transform("mean"))
|   | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | demeaned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 | -12.285714 |
| 1 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 | -12.285714 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30 | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 | 125.785714 |
| 31 | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 | 26.363636 |
32 rows × 12 columns
Note that `g_cyl` does not have an `assign` method, and requires passing the operation you want (`"mean"`) as a string to `.transform()`.
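You can see this for yourself, since the attribute lookup fails before any computation happens:

# grouped data has no assign method
try:
    g_cyl.assign(demeaned = 0)
except AttributeError as e:
    print(e)   # 'DataFrameGroupBy' object has no attribute 'assign'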
Consider some data (`students`) where 2,000 students have each completed 10 courses and received a score on each course.
import numpy as np
import pandas as pd
np.random.seed(123)
students = pd.DataFrame({
'student_id': np.repeat(np.arange(2000), 10),
'course_id': np.random.randint(1, 20, 20000),
'score': np.random.randint(1, 100, 20000)
})
g_students = students.groupby('student_id')
g_students
(grouped data frame)
|   | student_id | course_id | score |
|---|---|---|---|
| 0 | 0 | 14 | 38 |
| 1 | 0 | 3 | 40 |
| ... | ... | ... | ... |
| 19998 | 1999 | 11 | 32 |
| 19999 | 1999 | 17 | 10 |
20000 rows × 3 columns
Suppose we want to get the courses each student scored lowest on.
In pandas we can use some complex but fast code.
%%time
# pandas
is_student_min = g_students.obj.score == g_students.score.transform('min')
low_scores = students[is_student_min]
CPU times: user 6.21 ms, sys: 1.9 ms, total: 8.11 ms Wall time: 8.67 ms
In siuba it is simpler, and comparable in speed.
from siuba.experimental.pd_groups import fast_filter
%%time
# siuba
low_scores = fast_filter(g_students, _.score == _.score.min())
CPU times: user 9.96 ms, sys: 1.5 ms, total: 11.5 ms Wall time: 18.5 ms
This is because siuba's lazy expressions let it optimize grouped operations.
However, dplython is over 100x slower in this case, because it uses the slower pandas `DataFrame.apply()` method under the hood.
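As a rough illustration of that approach (a sketch, not dplython's actual internals), an apply-based filter in plain pandas calls a Python function once per group:

# sketch: per-group apply runs a Python lambda for each of the 2,000 students
low_scores_apply = g_students.apply(lambda d: d[d.score == d.score.min()])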
# set up code for timing
from dplython import X, DplyFrame, sift, group_by as dply_group_by
g_students2 = DplyFrame(students) >> dply_group_by(X.student_id)
%%time
g_students2 >> sift(X.score == X.score.min())
CPU times: user 1.67 s, sys: 26 ms, total: 1.7 s Wall time: 1.82 s
|   | student_id | course_id | score |
|---|---|---|---|
| 2 | 0 | 3 | 17 |
| 10 | 1 | 10 | 1 |
| ... | ... | ... | ... |
| 19987 | 1998 | 17 | 31 |
| 19997 | 1999 | 3 | 1 |
2117 rows × 3 columns
# generate SQL queries
from siuba.data import cars_sql
from siuba import group_by, mutate, show_query
q = (cars_sql
>> group_by("cyl")
>> mutate(demeaned = _.hp - _.hp.mean())
>> show_query()
)
SELECT cars.cyl, cars.mpg, cars.hp,
       cars.hp - avg(cars.hp) OVER (PARTITION BY cars.cyl) AS demeaned
FROM cars
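To run the same pipeline against a live database, you can wrap a table with siuba's `LazyTbl` and use `collect()` to execute the query. Below is a minimal sketch; it assumes a SQLAlchemy engine whose database already contains a `cars` table.

from sqlalchemy import create_engine
from siuba.sql import LazyTbl
from siuba import _, group_by, mutate, collect

# assumption: cars.db already holds a table named "cars"
engine = create_engine("sqlite:///cars.db")
tbl_cars = LazyTbl(engine, "cars")

res = (tbl_cars
    >> group_by("cyl")
    >> mutate(demeaned = _.hp - _.hp.mean())
    >> collect()   # executes the generated SQL, returning a pandas DataFrame
)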
This is made possible because siuba represents lazy expressions with abstract syntax trees. Fast grouped operations and SQL queries are just the beginning: the same trees allow people to produce a whole range of interesting tools!
Siuba's lazy expressions are built from two classes: `Symbolic` and `Call`.
`Symbolic` is used to quickly create lazy expressions.
# ASTs for transforming
from siuba.siu import Symbolic, Call, strip_symbolic
_ = Symbolic()
sym = _.a.mean() + _["b"]
sym
█─+
├─█─'__call__'
│ └─█─.
│   ├─█─.
│   │ ├─_
│   │ └─'a'
│   └─'mean'
└─█─[
  ├─_
  └─'b'
Each black box in the printout above is a `Call`: a piece representing an underlying operation. Calls have methods to inspect and transform them.
call = strip_symbolic(sym)
# get column names used in the lazy expression
call.op_vars(attr_calls = False)
{'a', 'b'}
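As a small sketch of what this enables, `op_vars` can check that a DataFrame has every column an expression needs before you evaluate it (the validation logic here is illustrative, not part of siuba):

# sketch: validate an expression's columns against a DataFrame
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
missing = call.op_vars(attr_calls = False) - set(df.columns)
assert not missing   # columns 'a' and 'b' are both present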
Finally, siuba's verbs handle nested data, like the list column created by splitting the strings below.
from siuba import _, mutate, unnest
tagged = pd.DataFrame({
'id': [1,2,3],
'tags': ['a,b,c', 'd,e', 'f']
})
(tagged
>> mutate(split_tags = _.tags.str.split(','))
>> unnest("split_tags")
)
|   | id | tags | split_tags |
|---|---|---|---|
| 0 | 1 | a,b,c | a |
| 1 | 1 | a,b,c | b |
| ... | ... | ... | ... |
| 4 | 2 | d,e | e |
| 5 | 3 | f | f |
6 rows × 3 columns