Disclaimer: I personally hate pandas because of its grotesque abuse of strings as variables. Strings are useless for static code inspection, and yet pandas encourages users to plaster code with strings of the form `df['some random name'] = np.float64(df['other name'])` and hope that the rest of the workflow doesn't crash. It gets worse when operators are chained mindlessly:
peak_date = list(df.groupby('date').agg({ord_id_col_name: 'nunique'}).reset_index() \
.sort_values(ord_id_col_name, ascending=False)['date'])[0]
The conglomerate of `list`, an index, three chained operations, and a `\` as a desperate attempt to make it fit the IDE's enforced 120-character limit is the pinnacle of poor readability.
So let's do something about it. Let's make a validated dataframe that gets rid of the strings...
First the imports, and then make a dataframe:
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame.from_dict(data)
df
| | col_1 | col_2 |
|---|---|---|
| 0 | 3 | a |
| 1 | 2 | b |
| 2 | 1 | c |
| 3 | 0 | d |
A pleasant endgame for a validated dataframe would look something like this:
import pandas as pd


class DF(object):
    def __init__(self, df):
        if not isinstance(df, pd.DataFrame):
            raise TypeError
        self.df = df
        self._columns = set(df.columns)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name in self._columns:
            return self.df[name]
        elif hasattr(self.df, name):
            return getattr(self.df, name)
        else:
            # Default behaviour
            return object.__getattribute__(self, name)

    @property
    def col_1(self):
        return self.df["col_1"]

    @property
    def col_2(self):
        return self.df["col_2"]
On to writing the code generator. First I create two templates: one for the class with the `__getattr__` override, and one for the properties, which are intended to inform the static code inspector.
_class_template = """import pandas as pd
class DF(object):
def __init__(self, df):
if not isinstance(df, pd.DataFrame):
raise TypeError
self.df = df
self._columns = set(df.columns)
def __getattr__(self, name):
if name in self._columns:
return getattr(self,name)
elif hasattr(self.df, name):
return getattr(self.df, name)
else:
# Default behaviour
return object.__getattribute__(self, name)
{properties}
"""
_property_template = """
@property
def {name}(self):
return self.df["{name}"]
"""
Next, I need a function to populate the templates:
# This can now be used with hints:
def make_validated_df(df, path):
    if not isinstance(df, pd.DataFrame):
        raise TypeError
    for name in df.columns:
        if name == "df":
            raise ValueError("df is a reserved keyword.")
        if not name.isidentifier():
            raise ValueError(f"{name!r} is not a valid attribute name.")
    properties = [_property_template.format(name=name) for name in df.columns]
    s = _class_template.format(properties="".join(properties))
    with path.open("w") as fo:
        fo.write(s)
Note that this function writes the importable module `vdf.py`, so now I can use it like this:
from pathlib import Path
make_validated_df(df, path=Path("vdf.py"))
from vdf import DF as myDF
df2 = myDF(df)
df2.groupby(df2.col_1).sum() # no strings attached!
| col_1 | col_2 |
|---|---|
| 0 | d |
| 1 | c |
| 2 | b |
| 3 | a |
And it produces the same output as the regular dataframe:
print(df.groupby("col_1").sum())
      col_2
col_1
0         d
1         c
2         b
3         a
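The equivalence is easy to verify with plain pandas, since `df2.col_1` simply resolves to the underlying column Series — a standalone sketch:

```python
import pandas as pd

df = pd.DataFrame({"col_1": [3, 2, 1, 0], "col_2": ["a", "b", "c", "d"]})

# Grouping by the column Series -- which is what df2.col_1 resolves to --
# yields the same result as grouping by the column name:
assert df.groupby(df["col_1"]).sum().equals(df.groupby("col_1").sum())
```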
The best thing, however, is that IDE support is now enabled: