In siuba, much of what users do involves expressions using _.
Depending on the backend they're using, these expressions are then transformed and executed.
However, sometimes no translation exists for a method.
This is not so different from pandas or SQLAlchemy, where a limited number of methods are available to users.
For example, in pandas...
some_data.cumsum()   # exists
some_data.cumany()   # does not exist
Moreover, you can use .cummean() on an ungrouped, but not a grouped DataFrame. And as a final cruel twist, some methods are fast when grouped, while others (e.g. expanding().sum()) use the slow apply route.
In pandas, it's not totally clear how you would define something like .cumany(), and let it run on grouped or ungrouped data, without submitting a PR to pandas itself.
(Maybe by registering an accessor, but this doesn't apply to grouped DataFrames.)
This is the tyranny of methods. The object defining the method owns the method. To add or modify a method, you need to modify the class behind the object.
Now, this isn't totally true--the class could provide a way for you to register your method (like accessors). But wouldn't it be nice if the actions we wanted to perform on data didn't have to check in with the data class itself? Why does the data class get to decide what we do with it, and why does it get privileged methods?
Rather than registering functions onto your class (i.e. methods), singledispatch lets you register classes with your functions.
In singledispatch, this works by having the class of your first argument decide which version of a function to call.
from functools import singledispatch

# by default dispatches on object, which everything inherits from
@singledispatch
def cool_func(x):
    print("Default dispatch over:", type(x))

@cool_func.register(int)
def _cool_func_int(x):
    print("Special dispatch for an integer!")

cool_func('x')
cool_func(1)

Default dispatch over: <class 'str'>
Special dispatch for an integer!
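To make the "register classes with your functions" idea concrete, here is a minimal sketch that adds behavior for a class we don't own (Fraction, from the standard library) without touching the class itself. The describe function and its messages are made up for illustration.

```python
from fractions import Fraction
from functools import singledispatch


@singledispatch
def describe(x):
    # fallback, used when no closer class is registered
    return "something else"


# Fraction is defined in the standard library; we extend describe, not Fraction
@describe.register(Fraction)
def _describe_fraction(x):
    return f"a fraction: {x.numerator} over {x.denominator}"


print(describe(Fraction(1, 2)))        # a fraction: 1 over 2
print(describe("a string"))            # something else
print(hasattr(Fraction, "describe"))   # False: the class was never modified
```

The function owns the behavior here, so anyone can add a new case without a PR to the package that defines the class.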
This concept is incredibly powerful for two reasons...
siuba uses singledispatch in two places: verbs such as mutate, whose actions depend on the backend they're operating on (e.g. SQL vs pandas), and symbolic calls.
It's worth looking at symbolic calls in detail.
from siuba.siu import symbolic_dispatch, _
import pandas as pd
@symbolic_dispatch(cls = pd.Series)
def add2(x):
    return x + 2

add2(pd.Series([1,2,3]))

0    3
1    4
2    5
dtype: int64
One special property of symbolic_dispatch is that if we pass it a symbol, then it returns a symbol.
sym = add2(_.astype(int))
sym
█─'__call__'
├─█─'__custom_func__'
│ └─<function add2 at 0x11026a598>
└─█─'__call__'
  ├─█─.
  │ ├─_
  │ └─'astype'
  └─<class 'int'>
sym(pd.Series(['1', '2']))
0    3
1    4
dtype: int64
Note that in this case these two bits of code work the same...
ser = pd.Series(['1', '2'])
sym = add2(_.astype(int))
sym(ser)
func = lambda _: add2(_.astype(int))
func(ser)
siuba knows that if the function's first argument is a symbolic expression, then the function needs to return a symbolic expression.
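The "symbol in, symbol out" behavior can be imitated with plain singledispatch. The sketch below is not siuba's implementation: Deferred, placeholder, and bouncing_dispatch are hypothetical names standing in for symbolic expressions, _, and symbolic_dispatch.

```python
from functools import singledispatch


class Deferred:
    """Hypothetical stand-in for a symbolic expression: wraps a function to run later."""

    def __init__(self, func):
        self.func = func

    def __call__(self, data):
        return self.func(data)


# a toy version of `_`: calling it with data just hands the data back
placeholder = Deferred(lambda data: data)


def bouncing_dispatch(f):
    dispatcher = singledispatch(f)

    # when handed a Deferred, don't compute anything; return a new Deferred that
    # evaluates the inner expression first and then dispatches on its result
    @dispatcher.register(Deferred)
    def _bounce(expr, *args, **kwargs):
        return Deferred(lambda data: dispatcher(expr(data), *args, **kwargs))

    return dispatcher


@bouncing_dispatch
def add2(x):
    return x + 2


print(add2(1))               # 3, computed immediately
later = add2(placeholder)    # nothing computed yet
print(type(later).__name__)  # Deferred
print(later(40))             # 42
```

Registering the "bounce" as just another dispatch case means the user's function never has to know that symbolic expressions exist.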
In essence, siuba needs to allow dispatching over the forms of data it can operate on, including..
I'm glad you asked! There is one very big risk with singledispatch, and it's this:
singledispatch will dispatch on the "closest" matching parent class it has registered.
This means that if it has object registered, then at the very least, it will dispatch on that. This is a big problem since e.g. sqlalchemy column mappings and everything else is an object.
In order to mitigate this risk, there are two compelling options...
The downsides are that (1) requires a custom dispatch implementation, and (2) requires that people know about type annotations.
That said, I'm curious to explore option (2), as this has an appealing logic: an appropriate function will be a subtype of the one we typically use.
In order to fully contextualize the process, consider the stage where something may need to be pulled from the dispatcher: call shaping via CallTreeLocal.
from siuba.siu import CallTreeLocal, strip_symbolic
def as_string(x):
    return x.astype(str)
ctl = CallTreeLocal(local = {'as_string': as_string})
call = ctl.enter(strip_symbolic(_.as_string()))
# Call object holding function as first argument
call.__dict__
{'func': '__call__', 'args': (<function __main__.as_string(x)>, _), 'kwargs': {}}
# proof it's just the function
type(call.args[0])
function
Now this setup is good and well--but how is a user going to put their function on CallTreeLocal?
Register it? Nah. What they need is a clear interface.
We're already "bouncing" symbolic dispatch functions when they get a symbolic expression. We can use this mechanic to make CallTreeLocal more "democratic".
Notice that when we "bounce" add2, it reports the function as a "custom_func".
@symbolic_dispatch(cls = pd.Series)
def add2(x):
    return x + 2

add2(_)

█─'__call__'
├─█─'__custom_func__'
│ └─<function add2 at 0x1102bb488>
└─_
This is because it's a special call, called a FuncArg (name subject to change). We can modify CallTreeLocal to perform custom behavior when it enters / exits __custom_func__.
class SpecialClass: pass

@add2.register(SpecialClass)
def _add2_special(x):
    print("Wooweee!")

class CallTree2(CallTreeLocal):
    # note: self.dispatch_cls already used in init for this very purpose
    def enter___custom_func__(self, node):
        # the function itself is the first arg
        dispatcher = node.args[0]
        # hardcoding for now...
        return dispatcher.dispatch(self.dispatch_cls)
ctl2 = CallTree2({}, dispatch_cls = SpecialClass)
func = ctl2.enter(strip_symbolic(add2(_)))
func
<function _add2_special at 0x1102bb8c8>(_)
type(func)
siuba.siu.Call
However, there's one major problem--CallTree2 may still dispatch the default function!
@symbolic_dispatch
def add3(x):
    print("Calling add3 default")
call3 = ctl2.enter(strip_symbolic(add3(_)))
call3(1)
Calling add3 default
THIS MEANS THAT EVERY SINGLEDISPATCH FUNCTION WILL AT LEAST USE ITS DEFAULT
Imagine that someone defined the default, but then it gets fired for SQL, and for pandas, etc.
What a headache.
We can check the result annotation of the function we'd dispatch to find out ahead of time whether it won't work. In this case, we assume it won't work if the result is not a subclass of the one our SQL tools expect: ClauseElement. This lets us shut down the process early if we know the function won't return what we need.
This works because a function is a subtype of another function if its input is contravariant (e.g. a parent) and its output is covariant (e.g. a subclass).
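The subtyping rule can be sketched on its own, before wiring it into a call tree. The helper is_function_subtype and the Animal / Dog classes are invented for illustration, and the check assumes annotations are real classes rather than strings or typing constructs.

```python
import inspect


class Animal: pass
class Dog(Animal): pass


def is_function_subtype(f, arg_cls, result_cls):
    """Hypothetical check: can f stand in for a function arg_cls -> result_cls?"""
    sig = inspect.signature(f)
    first_param = next(iter(sig.parameters.values()))
    arg_ann, ret_ann = first_param.annotation, sig.return_annotation
    # unannotated functions give us nothing to verify, so treat them as failing
    if arg_ann is inspect.Parameter.empty or ret_ann is inspect.Signature.empty:
        return False
    # input is contravariant: f must accept arg_cls or one of its parents
    accepts = issubclass(arg_cls, arg_ann)
    # output is covariant: f must return result_cls or one of its subclasses
    returns = issubclass(ret_ann, result_cls)
    return accepts and returns


def takes_animal_returns_dog(x: Animal) -> Dog: ...
def takes_dog_returns_animal(x: Dog) -> Animal: ...


# suppose we need a function Dog -> Animal
print(is_function_subtype(takes_animal_returns_dog, Dog, Animal))   # True
print(is_function_subtype(takes_dog_returns_animal, Dog, Animal))   # True (exact match)
# suppose we need a function Animal -> Dog
print(is_function_subtype(takes_dog_returns_animal, Animal, Dog))   # False
```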
# used to get type info
import inspect
# the most basic of SQL classes
from sqlalchemy.sql.elements import ClauseElement
RESULT_CLS = ClauseElement
class CallTree3(CallTreeLocal):
    # note: self.dispatch_cls already used in init for this very purpose
    def enter___custom_func__(self, node):
        # the function itself is the first arg
        dispatcher = node.args[0]
        # hardcoding for now...
        f = dispatcher.dispatch(self.dispatch_cls)
        sig = inspect.signature(f)
        ret_type = sig.return_annotation
        if issubclass(ret_type, RESULT_CLS):
            return f
        raise TypeError("Return type, %s, not subclass of %s" %(ret_type, RESULT_CLS))
from sqlalchemy import sql
sel = sql.select([sql.column('id'), sql.column('x'), sql.column('y')])
# this is what siuba sql expressions operate on
col_class = sel.columns.__class__
clt3 = CallTree3({}, dispatch_cls = col_class)
@symbolic_dispatch
def f_bad(x):
    return x + 1

@symbolic_dispatch
def f_good(x: ClauseElement) -> ClauseElement:
    return x.contains('woah')

# here is the error for the first, without that pesky stack trace
try:
    clt3.enter(strip_symbolic(f_bad(_)))
except TypeError as err:
    print(err)
Return type, <class 'inspect._empty'>, not subclass of <class 'sqlalchemy.sql.elements.ClauseElement'>
# here is the good one going through
clt3.enter(strip_symbolic(f_good(_)))
<function f_good at 0x1109009d8>(_)
Well, runtime evaluation of result types isn't the most fleshed out process in python. And there are some edge cases.
For example, what should we do if the return type is a Union? Any?
There is also a bug with the Union implementation before 3.7, where if it receives 3 classes, and 1 is the parent of the others, it just returns the parent...
from typing import Union
class A: pass
class B(A): pass
class C(B): pass
Union[A,B,C]
__main__.A
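If we did want to look inside a Union, one possible policy is "every member must be a subclass of the expected result class, and Any or a missing annotation is rejected". The sketch below assumes Python 3.8+ (for typing.get_origin and typing.get_args); returns_only and the example classes are hypothetical.

```python
import inspect
from typing import Any, Union, get_args, get_origin


class Base: pass
class Child(Base): pass
class Other: pass


def returns_only(f, result_cls):
    """Hypothetical check: is every annotated return type a subclass of result_cls?"""
    ann = inspect.signature(f).return_annotation
    if ann is inspect.Signature.empty or ann is Any:
        # nothing to verify, so be strict and reject
        return False
    # a Union passes only if all of its members pass
    members = get_args(ann) if get_origin(ann) is Union else (ann,)
    return all(isinstance(m, type) and issubclass(m, result_cls) for m in members)


def f_union_ok() -> Union[Child, Base]: ...
def f_union_bad() -> Union[Child, Other]: ...
def f_any() -> Any: ...
def f_plain() -> Child: ...


print(returns_only(f_union_ok, Base))    # True
print(returns_only(f_union_bad, Base))   # False
print(returns_only(f_any, Base))         # False
print(returns_only(f_plain, Base))       # True
```

Whether Any should be rejected or waved through is a judgment call; the strict choice here is only one option.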
To be honest--I think we can be optimistic for now that anyone using a Union as their return type knows what they're doing with siuba. I think the main behaviors we want to support are...
And even a crude result type check will ensure that. In some ways the existence of a result type is almost all the proof we need.
I think so. It would take a bit of work. Mostly PRs to the typing package to...
__rshift__
😅