(drafted around 21 March 2020)
Siuba aims to support all pandas Series methods across a range of backends. This means users should be able to use the same method call they would on a pandas Series, and get back roughly the same result.
Backends include pandas itself and SQL databases (e.g. postgresql). Note that pandas (e.g. with the `.apply` method) serves as the reference implementation. There are three big benefits to specifying the reference as data (e.g. in yaml): testing conformance, documenting, and tracking changes over releases.
In this document, I'll first review the value of a spec, go through siuba's current spec, and then the script for transitioning to yaml.
Testing conformance is important because we need to trust that the different backends can be swapped in. This means that for every series method they support, there should be at least one test that they...
Documenting is important because there are over 300 Series methods. A compact representation of support across backends will let people work quickly.
Tracking changes over releases is important because as time goes on, we'll likely need to react to methods being deprecated in pandas.
Because there are 300+ Series methods, I wanted to prioritize a wide format, with few enclosing characters (`()`/`{}`/`[]`). I was concerned that a long document would require a lot of scanning, and would be hard to jump into.
I was also doing a lot of research / exploration on spreadsheet (actually, on airtable!).
Now that things are much further along, I'm ready to pay down the technical debt, while preserving two valuable modes of interacting with the spec:
Siuba supports multiple pandas versions, so the spec will contain methods that exist in one version but not another. Similar considerations apply for deprecated methods.
Backends may...
To this end, the spec allows the `backends` field to override settings configured in `action`.
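To make that override rule concrete, here is a minimal sketch. The entry's shape mirrors the converted `rpow` spec entry, but `backend_status` is a hypothetical helper (not a siuba function) illustrating the fallback logic:

```python
# Spec entry shaped like the converted "rpow" entry: the reference action
# is supported, but the postgresql backend overrides its status.
spec = {
    "example": "_.rpow(_)",
    "category": "binary",
    "action": {"status": "supported", "kind": "elwise"},
    "backends": {"postgresql": {"status": "todo"}},
}

def backend_status(spec, backend):
    # Hypothetical helper: backend settings win; otherwise fall back to action.
    overrides = spec.get("backends", {}).get(backend, {})
    return overrides.get("status", spec["action"]["status"])

print(backend_status(spec, "postgresql"))  # -> todo (the backend override wins)
print(backend_status(spec, "sqlite"))      # -> supported (falls back to action)
```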
The spec should be used to do the following, without pulling in other data sources...
Because all activity is now summarized through the spec, we should be able...
Adding contributions will likely follow these steps...
* `siuba.sql.transform.py`
* `siuba.sql.dialect.postgresql.py`
* `siuba.spec.series.yml` (to change "todo" to "supported")

Below I read the existing spec (written using siu expressions) and wrangle it into the new yaml format. This is needed because flagging exceptions for actions on different backends was tacked on haphazardly as I went.
For example, `_.nunique()` can't be used in a mutate. Rather than override postgres's behavior in a case like this, I'd prefer to declare it. By declaring it, we can always change how we handle it later.
Below I read in the siuba spec and convert it to yaml. It's messy, but it gets the job done.
import pandas as pd
from siuba.spec import series
# NOTE: this is very messy--but also a one way trip to the new YAML format hopefully forever
PLANS = {"Todo", "Maydo", "Wontdo"}
POSTGRESQL_STATUS = {"xfail": "todo", "not_impl": "wontdo", None: None}
def get_postgresql_status(result):
    status = POSTGRESQL_STATUS[result.get("postgresql")]
    if status is None:  # and result["type"] not in PLANS:
        if "sql_type" in result:
            return {"postgresql": {"result_type": "float"}}
        if "no_mutate" in result:
            return {"postgresql": {"flags": ["no_mutate"]}}
        return {}
    return {"postgresql": {"status": status}}
def get_pandas_status(result):
    return "supported" if result["type"] not in PLANS else result["type"].lower()
def get_type_info2(call, method, category):
    if call.func != "__rshift__":
        raise ValueError("Expected top-level expression to be >>")

    expr, result = call.args

    #accessors = ['str', 'dt', 'cat', 'sparse']
    #accessor = ([ameth for ameth in accessors if ameth in expr.op_vars()] + [None])[0]

    result_dict = result.to_dict()

    # format action ----
    action = {
        "status": get_pandas_status(result_dict),
        **result_dict
    }

    if action["type"] not in PLANS:
        action["kind"] = action["type"].lower()

    if "postgresql" in action:
        del action["postgresql"]
    if "no_mutate" in action:
        del action["no_mutate"]
    if "sql_type" in action:
        del action["sql_type"]

    del action["type"]

    if "op" in action:
        action["input_type"] = "bool"
        del action["op"]

    # backends ---
    backends = get_postgresql_status(result_dict)

    return dict(
        example = str(expr),
        category = category,
        #expr_frame = replace_meta_args(expr, _.x, _.y, _.z),
        #accessor = accessor[0],
        backends = backends,
        action = action
    )
out = {}
for category, d in series.funcs_stripped.items():
    for name, call in d.items():
        out[name] = get_type_info2(call, name, category)

pd.json_normalize([{'method': k, **v} for k, v in out.items()])
| | method | example | category | action.status | action.kind | action.input_type | backends.postgresql.status | backends.postgresql.flags | backends.postgresql.result_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | __invert__ | ~_ | _special_methods | supported | elwise | bool | NaN | NaN | NaN |
| 1 | __and__ | _ & _ | _special_methods | supported | elwise | bool | NaN | NaN | NaN |
| 2 | __or__ | _ \| _ | _special_methods | supported | elwise | bool | NaN | NaN | NaN |
| 3 | __xor__ | _ ^ _ | _special_methods | supported | elwise | bool | todo | NaN | NaN |
| 4 | __neg__ | -_ | _special_methods | supported | elwise | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 365 | to_dense | _.to_dense() | io | wontdo | NaN | NaN | NaN | NaN | NaN |
| 366 | to_string | _.to_string() | io | todo | NaN | NaN | NaN | NaN | NaN |
| 367 | to_markdown | _.to_markdown() | io | todo | NaN | NaN | NaN | NaN | NaN |
| 368 | to_clipboard | _.to_clipboard() | io | wontdo | NaN | NaN | NaN | NaN | NaN |
| 369 | to_latex | _.to_latex() | io | wontdo | NaN | NaN | NaN | NaN | NaN |

370 rows × 9 columns
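As a quick sanity check, the two status helpers defined above behave as follows on toy result dicts (they're re-stated here so the snippet runs standalone):

```python
PLANS = {"Todo", "Maydo", "Wontdo"}
POSTGRESQL_STATUS = {"xfail": "todo", "not_impl": "wontdo", None: None}

def get_postgresql_status(result):
    status = POSTGRESQL_STATUS[result.get("postgresql")]
    if status is None:
        if "sql_type" in result:
            return {"postgresql": {"result_type": "float"}}
        if "no_mutate" in result:
            return {"postgresql": {"flags": ["no_mutate"]}}
        return {}
    return {"postgresql": {"status": status}}

def get_pandas_status(result):
    return "supported" if result["type"] not in PLANS else result["type"].lower()

# a supported aggregation that can't run in a postgres mutate (like nunique)
print(get_postgresql_status({"type": "Agg", "no_mutate": True}))
# -> {'postgresql': {'flags': ['no_mutate']}}

# a method marked xfail on postgres becomes a "todo" backend status
print(get_postgresql_status({"type": "Elwise", "postgresql": "xfail"}))
# -> {'postgresql': {'status': 'todo'}}

print(get_pandas_status({"type": "Wontdo"}))  # -> wontdo
```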
Key questions:
import yaml
print(yaml.dump(out)[:344])
T:
  action:
    status: wontdo
  backends: {}
  category: attributes
  example: _.T
__add__:
  action:
    kind: elwise
    status: supported
  backends: {}
  category: _special_methods
  example: _ + _
__and__:
  action:
    input_type: bool
    kind: elwise
    status: supported
  backends: {}
  category: _special_methods
  example: _ & _
# uncomment to dump
#yaml.dump(out, open("../../siuba/spec/series.yml", "w"))
import pkg_resources
# will go here (on my filesystem)
pkg_resources.resource_filename("siuba.spec", "series.yml")
'/Users/machow/Dropbox/Repo/siuba/siuba/spec/series.yml'
%%time
spec = yaml.load(open("../../siuba/spec/series.yml"), Loader = yaml.SafeLoader)
CPU times: user 460 ms, sys: 12.6 ms, total: 472 ms
Wall time: 600 ms
As a reminder, an entry from the yaml spec so far is shown below...
raw_spec = out
raw_spec["nunique"]
{'example': '_.nunique()', 'category': 'computations', 'backends': {'postgresql': {'flags': ['no_mutate']}}, 'action': {'status': 'supported', 'kind': 'agg'}}
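One sketch of how an entry like this could drive conformance testing: turn its backend flags into skip decisions. Note that `should_skip` is a hypothetical helper, not siuba's actual test code:

```python
def should_skip(entry, backend, verb):
    # Hypothetical: skip a backend test when the spec flags the verb,
    # e.g. a "no_mutate" flag skips mutate-based tests.
    flags = entry.get("backends", {}).get(backend, {}).get("flags", [])
    return f"no_{verb}" in flags

# the nunique entry shown above
nunique = {
    "example": "_.nunique()",
    "category": "computations",
    "backends": {"postgresql": {"flags": ["no_mutate"]}},
    "action": {"status": "supported", "kind": "agg"},
}

print(should_skip(nunique, "postgresql", "mutate"))     # -> True
print(should_skip(nunique, "postgresql", "summarize"))  # -> False
```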
This is useful, but I had tracked other information on the airtable, like...
For now, I'll pull out priority, and will likely just keep the other info in the airtable. I would rather start with less rather than more (and wrap up faster in the process).
import pandas as pd
# read yaml spec into a dataframe, so we can join w/ airtable
data = pd.DataFrame([{'method': k, 'data': v} for k, v in raw_spec.items()])
data.head()
| | method | data |
|---|---|---|
| 0 | __invert__ | {'example': '~_', 'category': '_special_method... |
| 1 | __and__ | {'example': '_ & _', 'category': '_special_met... |
| 2 | __or__ | {'example': '_ \| _', 'category': '_special_met... |
| 3 | __xor__ | {'example': '_ ^ _', 'category': '_special_met... |
| 4 | __neg__ | {'example': '-_', 'category': '_special_method... |
from airtable import Airtable
import os
# note, airtable API key is in my environment
airtable = Airtable('appErTNqCFXn6stSH', 'methods')
air_entries = airtable.get_all()
air_df = pd.json_normalize(air_entries)
air_df.columns = air_df.columns.str.replace("fields.", "")
air_df.head()
| | id | createdTime | category | method_name | support_category | op_type | min_data_arity | Name | version_added | result_length | note | version_deprecated | max_data_arity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rec09hioM01yPbBQB | 2020-01-20T18:31:22.000Z | Computations / descriptive stats | min | done | aggregation | 1.0 | min | NaN | NaN | NaN | NaN | NaN |
| 1 | rec0eOvXaPwn0KqKs | 2020-01-20T18:31:22.000Z | Timedelta methods | cat.as_ordered | priority-low | NaN | NaN | cat.as_ordered | NaN | NaN | NaN | NaN | NaN |
| 2 | rec0myu0wfZxDHVCK | 2020-01-20T18:31:22.000Z | Timedelta methods | str.findall | done | elementwise | NaN | str.findall | NaN | NaN | NaN | NaN | NaN |
| 3 | rec0qdwOY8z1A9AsO | 2020-01-20T18:31:22.000Z | Reshaping, sorting | argmin | deprecated | NaN | NaN | argmin | v0.21.0 | NaN | NaN | NaN | NaN |
| 4 | rec0vjO0Mni6FXWa7 | 2020-01-20T18:31:22.000Z | Timedelta methods | str.isdecimal | done | elementwise | NaN | str.isdecimal | NaN | NaN | NaN | NaN | NaN |
# Pull out priority info
from siuba import *
prioritized = (
    data
    >> full_join(_, air_df, {"method": "method_name"})
    >> filter(~_.data.isna())
    >> select(-_.method_name, -_.createdTime, -_.id, -_.expr_frame, -_.expr_series)
)
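For readers less familiar with siuba's verbs, the full_join and filter above correspond roughly to this plain-pandas sketch (toy frames, since the airtable data isn't reproduced here; the select step is abbreviated to dropping the join key):

```python
import pandas as pd

# toy stand-ins for the yaml spec and the airtable export
data = pd.DataFrame({
    "method": ["nunique", "combine"],
    "data": [{"category": "computations"}, {"category": "binary"}],
})
air_df = pd.DataFrame({
    "method_name": ["nunique", "to_latex"],
    "support_category": ["done", "wontdo"],
})

prioritized = (
    data
    .merge(air_df, how="outer", left_on="method", right_on="method_name")
    .loc[lambda d: d["data"].notna()]   # keep only methods present in the spec
    .drop(columns=["method_name"])      # drop the duplicated join key
)
print(prioritized)
```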
new_yaml = (prioritized
    >> mutate(
        priority = _.support_category.map({
            'priority-high': 3, 'priority-medium': 2, 'priority-low': 1, 'priority-zero': 0
        }),
        data = _.apply(
            lambda d: {
                **d["data"],
                **({'priority': int(d["priority"])} if not pd.isna(d["priority"]) else {})
            },
            axis = 1
        )
    )
    >> pipe(_.set_index("method").data.to_dict())
)
list(new_yaml.items())[109:112]
[('rpow', {'example': '_.rpow(_)', 'category': 'binary', 'backends': {'postgresql': {'status': 'todo'}}, 'action': {'status': 'supported', 'kind': 'elwise'}}), ('combine', {'example': "_.combine(_,'max')", 'category': 'binary', 'backends': {}, 'action': {'status': 'todo'}, 'priority': 1}), ('combine_first', {'example': "_.combine_first(_,'max')", 'category': 'binary', 'backends': {}, 'action': {'status': 'todo'}, 'priority': 1})]
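The mutate above boils down to a small dict merge; here is a plain-Python sketch of the same logic (`with_priority` is a hypothetical name, not part of the actual conversion code):

```python
PRIORITY = {"priority-high": 3, "priority-medium": 2, "priority-low": 1, "priority-zero": 0}

def with_priority(spec_entry, support_category):
    # Merge a numeric priority into the entry when the airtable
    # support_category maps to one; otherwise leave the entry unchanged.
    priority = PRIORITY.get(support_category)
    if priority is None:
        return dict(spec_entry)
    return {**spec_entry, "priority": priority}

entry = {"example": "_.combine(_,'max')", "action": {"status": "todo"}}
print(with_priority(entry, "priority-low"))   # adds 'priority': 1
print(with_priority(entry, "done"))           # no priority key added
```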
# uncomment to save yaml spec
#yaml.dump(new_yaml, open("../../siuba/spec/series.yml", "w"))