In [1]:

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Transforming columns

Introduction

There are two ways to use the transform_column function: by passing in a function that operates elementwise (the default), or by passing in a function that operates columnwise (with elementwise=False).

We will show you both in this notebook.
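As a quick illustration of the distinction, here is a minimal sketch in plain pandas and NumPy, without transform_column. The variable names are made up for this example. An elementwise function receives one scalar per call, while a columnwise function receives the whole Series in a single call.

```python
import numpy as np
import pandas as pd

s = pd.Series([-1.5, 2.0, -3.25])

# Elementwise: the function is called once per value and sees a scalar.
elementwise_result = s.apply(lambda x: abs(x))

# Columnwise: the function is called once and sees the entire Series.
columnwise_result = np.abs(s)

# Both routes produce the same values; only the calling convention differs.
assert elementwise_result.tolist() == columnwise_result.tolist()
```

The results match, so the choice between the two styles is about speed and convenience rather than correctness.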

In [3]:

import numpy as np
import pandas as pd
# Importing janitor registers clean_names and transform_column on DataFrames.
import janitor

Numeric Data

In [4]:

data = np.random.normal(size=(1_000_000, 4))

In [5]:

df = pd.DataFrame(data).clean_names()

Using the elementwise application:

In [6]:

%%timeit
# We are using a lambda function that operates on each element,
# to highlight the point about elementwise operations.
df.transform_column("0", lambda x: np.abs(x), "abs_0")

1.86 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And now using columnwise application:

In [7]:

%%timeit
df.transform_column("0", lambda s: np.abs(s), elementwise=False)

15.7 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Because np.abs is vectorizable over the entire series, it runs more than 100 times faster here (15.7 ms versus 1.86 s). If you know your function is vectorizable, take advantage of that by passing elementwise=False to transform_column. After all, all that transform_column does is provide a method-chainable way of applying the function.
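To make the "method-chainable way of applying the function" idea concrete, here is a rough sketch of what such a wrapper could look like. The helper transform_column_sketch is hypothetical and written only for illustration; it is not pyjanitor's implementation and may differ from it in details such as whether the original DataFrame is modified.

```python
import numpy as np
import pandas as pd


def transform_column_sketch(
    df, column_name, function, dest_column_name=None, elementwise=True
):
    """Hypothetical stand-in for transform_column (not pyjanitor's code)."""
    if dest_column_name is None:
        # Assumption: overwrite the source column when no destination is given.
        dest_column_name = column_name
    if elementwise:
        # One Python-level call per value.
        result = df[column_name].apply(function)
    else:
        # A single call on the whole Series.
        result = function(df[column_name])
    # Return a new DataFrame so that calls can be chained.
    return df.assign(**{dest_column_name: result})


small = pd.DataFrame({"a": [-1.0, 2.0, -3.0]})
out = (
    small
    .pipe(transform_column_sketch, "a", np.abs, "abs_a", elementwise=False)
    .pipe(transform_column_sketch, "abs_a", lambda x: x * 2, "doubled")
)
```

Here DataFrame.pipe plays the role that pyjanitor's method registration plays in the real library: it lets each step return a DataFrame that feeds the next one.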

String Data

Let's see it in action with string-type data.

In [8]:

from random import choice

def make_strings(length: int):
    return "".join(choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ") for _ in range(length))

strings = (make_strings(30) for _ in range(1_000_000))

stringdf = pd.DataFrame({"data": list(strings)})

Firstly, by raw function application:

In [9]:

def first_five(s):
    return s.str[0:5]

In [10]:

%%timeit
stringdf.assign(data=first_five(stringdf["data"]))

408 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]:

%%timeit
first_five(stringdf["data"])

293 ms ± 4.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]:

%%timeit
stringdf["data"].str[0:5]

295 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]:

%%timeit
stringdf["data"].apply(lambda x: x[0:5])

301 ms ± 7.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It appears that assigning the result to a column adds roughly 100 ms of overhead (408 ms versus about 300 ms for computing the Series alone).
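One plausible contributor to that overhead (an assumption, not something measured in this notebook) is that DataFrame.assign builds and returns a new DataFrame instead of modifying the existing one. The small sketch below, with made-up variable names, shows that behavior.

```python
import pandas as pd

frame = pd.DataFrame({"data": ["ABCDEFG", "HIJKLMN"]})

# Computing the transformed Series on its own.
trimmed = frame["data"].str[0:5]

# assign wraps the Series in a new DataFrame and leaves `frame` untouched.
assigned = frame.assign(data=trimmed)

assert assigned is not frame
assert frame["data"].tolist() == ["ABCDEFG", "HIJKLMN"]
assert assigned["data"].tolist() == ["ABCDE", "HIJKL"]
```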

Now, by using transform_column with default settings:

In [14]:

%%timeit
stringdf.transform_column("data", lambda x: x[0:5])

409 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now by using transform_column while also leveraging string methods:

In [15]:

%%timeit
stringdf.transform_column("data", first_five, elementwise=False)

403 ms ± 7.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
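The two transform_column timings above are nearly identical (409 ms versus 403 ms), unlike the numeric case. A likely explanation, stated here as an assumption, is that slicing through the .str accessor still processes Python string objects one at a time, so it gains little over an explicit per-element function. The sketch below only confirms that both routes return the same values on a tiny sample; the names are illustrative.

```python
import pandas as pd

sample = pd.Series(["ABCDEFGHIJ", "KLMNOPQRST", "UVW"])

# Elementwise: slice each Python string individually.
via_apply = sample.apply(lambda x: x[0:5])

# Columnwise: use the pandas string accessor on the whole Series.
via_str = sample.str[0:5]

# Strings shorter than five characters are returned unchanged by both.
assert via_apply.tolist() == via_str.tolist()
```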