%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
There are two ways to use the transform_column function: by passing in a function that operates elementwise, or by passing in a function that operates columnwise.
We will show you both in this notebook.
import janitor  # registers clean_names and transform_column on DataFrame
import numpy as np
import pandas as pd
data = np.random.normal(size=(1_000_000, 4))
df = pd.DataFrame(data).clean_names()
Using the elementwise application:
%%timeit
# We are using a lambda function that operates on each element,
# to highlight the point about elementwise operations.
df.transform_column("0", lambda x: np.abs(x), "abs_0")
1.86 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And now using columnwise application:
%%timeit
df.transform_column("0", lambda s: np.abs(s), elementwise=False)
15.7 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Because np.abs is vectorizable over the entire series, it runs roughly 120 times faster in the timings above (15.7 ms versus 1.86 s).
If you know your function is vectorizable, take advantage of that and pass it to transform_column with elementwise=False.
After all, all that transform_column does is provide a method-chainable way of applying the function.
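To make the two modes concrete, here is a minimal sketch of the idea in plain pandas. It is not pyjanitor's implementation: transform_column_sketch is a hypothetical helper written for this illustration, and it assumes that elementwise application amounts to Series.apply while columnwise application amounts to calling the function once on the whole Series.

```python
import numpy as np
import pandas as pd


def transform_column_sketch(
    df, column_name, function, dest_column_name=None, elementwise=True
):
    """Hypothetical plain-pandas stand-in for transform_column (not pyjanitor's code)."""
    # Overwrite the source column unless a destination name is given.
    dest = column_name if dest_column_name is None else dest_column_name
    if elementwise:
        # Hand the function to Series.apply, which calls a plain Python
        # function such as a lambda once per value.
        result = df[column_name].apply(function)
    else:
        # Call the function a single time on the whole Series.
        result = function(df[column_name])
    # assign returns a new DataFrame, so the call can sit in a method chain.
    return df.assign(**{dest: result})


small = pd.DataFrame({"a": [-1.0, 2.0, -3.0]})
by_element = transform_column_sketch(small, "a", lambda x: abs(x), "abs_a")
by_column = transform_column_sketch(
    small, "a", np.abs, "abs_a", elementwise=False
)
```

Both calls should produce the same abs_a column; the only difference is how many times the function is invoked, which is where the timing gap comes from.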
Let's see it in action with string-type data.
from random import choice
def make_strings(length: int):
    return "".join(choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ") for _ in range(length))
strings = (make_strings(30) for _ in range(1_000_000))
stringdf = pd.DataFrame({"data": list(strings)})
Firstly, by raw function application:
def first_five(s):
    return s.str[0:5]
%%timeit
stringdf.assign(data=first_five(stringdf["data"]))
408 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
first_five(stringdf["data"])
293 ms ± 4.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
stringdf["data"].str[0:5]
295 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
stringdf["data"].apply(lambda x: x[0:5])
301 ms ± 7.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Assigning the result to a column appears to add roughly 100 ms of overhead: about 408 ms with assign, versus about 295 ms for computing the values alone.
Now, by using transform_column with default settings:
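One plausible source of that overhead is that assign builds and returns a new DataFrame instead of modifying the original in place. The sketch below uses a small, made-up two-row frame (df_small is a name chosen only for this illustration) to show the difference between computing the values and assigning them.

```python
import pandas as pd

df_small = pd.DataFrame({"data": ["ABCDEFG", "HIJKLMN"]})

# Compute the new values only; no DataFrame is built here.
first_five_values = df_small["data"].str[0:5]

# assign returns a new DataFrame holding the result alongside the
# existing columns; df_small itself is left untouched.
assigned = df_small.assign(data=first_five_values)
```

How much this extra step costs will depend on the size of the frame and the pandas version in use.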
%%timeit
stringdf.transform_column("data", lambda x: x[0:5])
409 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now, by using transform_column while also leveraging string methods:
%%timeit
stringdf.transform_column("data", first_five, elementwise=False)
403 ms ± 7.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
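In these timings the two modes land within a few milliseconds of each other (409 ms versus 403 ms), which matches the earlier raw comparison of .apply against .str. As a quick sanity check, the sketch below confirms on a small made-up sample that slicing each string individually and slicing through the .str accessor give the same values; sample is a name introduced only for this example.

```python
import pandas as pd

sample = pd.Series(["ABCDEFGHIJ", "KLMNOPQRST", "UVW"])

# Elementwise: the lambda receives one Python string at a time.
elementwise_result = sample.apply(lambda x: x[0:5])

# Columnwise: the .str accessor slices every string in the Series.
columnwise_result = sample.str[0:5]

# Compare the values as plain lists to sidestep dtype differences.
same = elementwise_result.tolist() == columnwise_result.tolist()
```

Strings shorter than five characters are returned whole by both approaches, as ordinary Python slicing would do.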