Disclaimer: I personally hate pandas because of its grotesque abuse of strings as variables. Strings are useless for static code inspection, and yet pandas encourages users to plaster code with strings of the form `df['some random name'] = np.float64(df['other name'])` and hope that the rest of the workflow doesn't crash. It gets worse when operators are chained mindlessly:
peak_date = list(df.groupby('date').agg({ord_id_col_name: 'nunique'}).reset_index() \
.sort_values(ord_id_col_name, ascending=False)['date'])[0]
The conglomerate of `list`, an index, three chained operations, and a `\` as a desperate attempt to make it fit the IDE's enforced 120-character limit is the pinnacle of poor readability.
So let's do something about it. Let's make a validated dataframe that gets rid of the strings...
First the imports, and then make a dataframe:
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame.from_dict(data)
df
| | col_1 | col_2 |
|---|---|---|
| 0 | 3 | a |
| 1 | 2 | b |
| 2 | 1 | c |
| 3 | 0 | d |
A pleasant endgame for a validated dataframe would look something like this:
import pandas as pd


class DF(object):
    def __init__(self, df):
        if not isinstance(df, pd.DataFrame):
            raise TypeError
        self.df = df
        self._columns = set(df.columns)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name in self._columns:
            return self.df[name]
        elif hasattr(self.df, name):
            return getattr(self.df, name)
        else:
            # Default behaviour
            return object.__getattribute__(self, name)

    @property
    def col_1(self):
        return self.df["col_1"]

    @property
    def col_2(self):
        return self.df["col_2"]
On to writing the code generator. First I create two templates: one for the class with the `__getattr__` override, and one for the properties, which are intended to inform the static code inspector.
_class_template = """import pandas as pd
class DF(object):
def __init__(self, df):
if not isinstance(df, pd.DataFrame):
raise TypeError
self.df = df
self._columns = set(df.columns)
def __getattr__(self, name):
if name in self._columns:
return getattr(self,name)
elif hasattr(self.df, name):
return getattr(self.df, name)
else:
# Default behaviour
return object.__getattribute__(self, name)
{properties}
"""
_property_template = """
@property
def {name}(self):
return self.df["{name}"]
"""
Next, I need a function to populate the templates:
# This can now be used with hints:
def make_validated_df(df, path):
    if not isinstance(df, pd.DataFrame):
        raise TypeError
    for name in df.columns:
        if name == "df":
            raise ValueError("df is a reserved keyword.")
        if not name.isidentifier():
            raise ValueError(f"{name!r} is not a valid attribute name.")
    properties = [_property_template.format(name=name) for name in df.columns]
    s = _class_template.format(properties="".join(properties))
    with path.open("w") as fo:
        fo.write(s)
Note that this function writes the importable module `vdf.py`, so now I can use it like this:
from pathlib import Path
make_validated_df(df, path=Path("vdf.py"))
from vdf import DF as myDF
df2 = myDF(df)
df2.groupby(df2.col_1).sum() # no strings attached!
| col_1 | col_2 |
|---|---|
| 0 | d |
| 1 | c |
| 2 | b |
| 3 | a |
And it produces the same output as the regular dataframe:
print(df.groupby("col_1").sum())
      col_2
col_1
0         d
1         c
2         b
3         a
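The equivalence is easy to verify with plain pandas, since `df2.col_1` simply resolves to the underlying column Series — a standalone sketch:

```python
import pandas as pd

df = pd.DataFrame({"col_1": [3, 2, 1, 0], "col_2": ["a", "b", "c", "d"]})

# Grouping by the column Series -- which is what df2.col_1 resolves to --
# yields the same result as grouping by the column name:
assert df.groupby(df["col_1"]).sum().equals(df.groupby("col_1").sum())
```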
The best thing, however, is that IDE support is now enabled: