from fastai.gen_doc.nbdoc import *
from fastai.tabular import *
This package contains the basic class to define a transformation for preprocessing dataframes of tabular data, as well as basic `TabularProc`s. Preprocessing includes things like

- replacing non-numerical variables by categories, then their ids,
- filling missing values,
- normalizing continuous variables.

In all those steps we have to be careful to use the correspondences we decide on our training set (which id we give to each category, what value we put in for missing data, or the mean/std we use to normalize) on our validation or test set. To deal with this, we use a special class called `TabularProc`.
The data used in this document page is a subset of the adult dataset. It gives a certain amount of data on individuals to train a model to predict whether their salary is greater than $50k or not.
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
train_df, valid_df = df.iloc[:800].copy(), df.iloc[800:1000].copy()
train_df.head()
 | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
We see it contains numerical variables (like `age` or `education-num`) as well as categorical ones (like `workclass` or `relationship`). The original dataset is clean, but we removed a few values to give examples of dealing with missing variables.
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
cont_names = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
show_doc(TabularProc)
class TabularProc[source][test]

TabularProc(cat_names:StrList, cont_names:StrList)

No tests found for TabularProc. To contribute a test please refer to this guide and this discussion.

A processor for tabular dataframes.
Base class for creating transforms for dataframes with categorical variables `cat_names` and continuous variables `cont_names`. Note that any column not in one of those lists won't be touched.
show_doc(TabularProc.__call__)
__call__[source][test]

__call__(df:DataFrame, test:bool=False)

No tests found for __call__. To contribute a test please refer to this guide and this discussion.

Apply the correct function to `df` depending on `test`.
show_doc(TabularProc.apply_train)
apply_train[source][test]

apply_train(df:DataFrame)

Tests found for apply_train:

Some other tests where apply_train is used:

- pytest -sv tests/test_tabular_transform.py::test_categorify [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

To run tests please refer to this guide.

Function applied to `df` if it's the train set.
show_doc(TabularProc.apply_test)
apply_test[source][test]

apply_test(df:DataFrame)

Tests found for apply_test:

Some other tests where apply_test is used:

- pytest -sv tests/test_tabular_transform.py::test_categorify [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

To run tests please refer to this guide.

Function applied to `df` if it's the test set.
jekyll_important("Those two functions must be implemented in a subclass. `apply_test` defaults to `apply_train`.")
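As an illustration of that contract, here is a minimal sketch of the pattern. This is not the fastai source, and `ClipOutliers` is a hypothetical proc invented for the example:

```python
import pandas as pd

class TabularProc:
    "Minimal sketch of the processor pattern for tabular dataframes."
    def __init__(self, cat_names, cont_names):
        self.cat_names, self.cont_names = cat_names, cont_names

    def __call__(self, df, test=False):
        # Dispatch to `apply_test` or `apply_train` depending on `test`.
        (self.apply_test if test else self.apply_train)(df)

    def apply_train(self, df):
        raise NotImplementedError

    def apply_test(self, df):
        # By default, the test set is treated like the train set.
        self.apply_train(df)

class ClipOutliers(TabularProc):
    "Hypothetical proc: clip continuous columns to the train set's 1st-99th percentiles."
    def apply_train(self, df):
        # The clipping bounds are computed on the training set only...
        self.bounds = {c: (df[c].quantile(0.01), df[c].quantile(0.99))
                       for c in self.cont_names}
        for c in self.cont_names:
            df[c] = df[c].clip(*self.bounds[c])

    def apply_test(self, df):
        # ...and reused as-is on the validation/test set.
        for c in self.cont_names:
            df[c] = df[c].clip(*self.bounds[c])
```

Calling `proc(train_df)` and then `proc(valid_df, test=True)` guarantees the validation set is processed with the statistics recorded on the training set, which is exactly the train/test correspondence discussed above.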
The following `TabularProc`s are implemented in the fastai library. Note that the replacement from categories to codes as well as the normalization of continuous variables are automatically done in a `TabularDataBunch`.
show_doc(Categorify)
class Categorify[source][test]

Categorify(cat_names:StrList, cont_names:StrList) :: TabularProc

Transform the categorical variables to that type.

Variables in `cont_names` aren't affected.
show_doc(Categorify.apply_train)
apply_train[source][test]

apply_train(df:DataFrame)

Tests found for apply_train:

Some other tests where apply_train is used:

- pytest -sv tests/test_tabular_transform.py::test_categorify [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

To run tests please refer to this guide.

Transform `self.cat_names` columns to categorical.
show_doc(Categorify.apply_test)
apply_test[source][test]

apply_test(df:DataFrame)

Tests found for apply_test:

Some other tests where apply_test is used:

- pytest -sv tests/test_tabular_transform.py::test_categorify [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

To run tests please refer to this guide.

Transform `self.cat_names` columns to categorical using the codes decided in `apply_train`.
tfm = Categorify(cat_names, cont_names)
tfm(train_df)
tfm(valid_df, test=True)
Since we haven't replaced the categories with their codes, nothing visible has changed in the dataframe yet, but we can check that the variables are now categorical and view their categories.
train_df['workclass'].cat.categories
Index([' ?', ' Federal-gov', ' Local-gov', ' Private', ' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay'], dtype='object')
The test set will be given the same category codes as the training set.
valid_df['workclass'].cat.categories
Index([' ?', ' Federal-gov', ' Local-gov', ' Private', ' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay'], dtype='object')
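The integer codes behind those categories can be read off the pandas categorical accessor. A small self-contained sketch in plain pandas, with made-up values mirroring the `workclass` column above:

```python
import pandas as pd

df = pd.DataFrame({'workclass': [' Private', ' State-gov', ' Private', None]})
df['workclass'] = df['workclass'].astype('category')
# Each code is the position of the value in `cat.categories`; pandas uses -1 for NaN.
print(df['workclass'].cat.codes.tolist())  # → [0, 1, 0, -1]
```

This replacement of categories by their codes is essentially what `TabularDataBunch` performs automatically.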
show_doc(FillMissing)
class FillMissing[source][test]

FillMissing(cat_names:StrList, cont_names:StrList, fill_strategy:FillStrategy=<FillStrategy.MEDIAN: 1>, add_col:bool=True, fill_val:float=0.0) :: TabularProc

Tests found for FillMissing:

- pytest -sv tests/test_tabular_transform.py::test_default_fill_strategy_is_median [source]

Some other tests where FillMissing is used:

- pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

To run tests please refer to this guide.

Fill the missing values in continuous columns.
`cat_names` variables are left untouched (their missing values will be replaced by the code 0 in the `TabularDataBunch`). `fill_strategy` is used to replace the nans, and if `add_col` is `True`, whenever a column `c` has missing values, a column named `c_nan` is added that flags the rows where the value was missing.
show_doc(FillMissing.apply_train)
apply_train[source][test]

apply_train(df:DataFrame)

Tests found for apply_train:

- pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

Some other tests where apply_train is used:

- pytest -sv tests/test_tabular_transform.py::test_categorify [source]

To run tests please refer to this guide.

Fill missing values in `self.cont_names` according to `self.fill_strategy`.
show_doc(FillMissing.apply_test)
apply_test[source][test]

apply_test(df:DataFrame)

Tests found for apply_test:

- pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

Some other tests where apply_test is used:

- pytest -sv tests/test_tabular_transform.py::test_categorify [source]

To run tests please refer to this guide.

Fill missing values in `self.cont_names` like in `apply_train`.

Fills the missing values in the `cont_names` columns with the values picked during training.
train_df[cont_names].head()
 | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week
---|---|---|---|---|---|---|
0 | 49 | 101320 | 12.0 | 0 | 1902 | 40 |
1 | 44 | 236746 | 14.0 | 10520 | 0 | 45 |
2 | 38 | 96185 | NaN | 0 | 0 | 32 |
3 | 38 | 112847 | 15.0 | 0 | 0 | 40 |
4 | 42 | 82297 | NaN | 0 | 0 | 50 |
tfm = FillMissing(cat_names, cont_names)
tfm(train_df)
tfm(valid_df, test=True)
train_df[cont_names].head()
 | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week
---|---|---|---|---|---|---|
0 | 49 | 101320 | 12.0 | 0 | 1902 | 40 |
1 | 44 | 236746 | 14.0 | 10520 | 0 | 45 |
2 | 38 | 96185 | 10.0 | 0 | 0 | 32 |
3 | 38 | 112847 | 15.0 | 0 | 0 | 40 |
4 | 42 | 82297 | 10.0 | 0 | 0 | 50 |
Values missing in the `education-num` column are replaced by 10, which is the median of the column in `train_df`. Categorical variables are not changed, since `nan` is simply used as another category.
valid_df[cont_names].head()
 | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week
---|---|---|---|---|---|---|
800 | 45 | 96975 | 10.0 | 0 | 0 | 40 |
801 | 46 | 192779 | 10.0 | 15024 | 0 | 60 |
802 | 36 | 376455 | 10.0 | 0 | 0 | 38 |
803 | 25 | 50053 | 10.0 | 0 | 0 | 45 |
804 | 37 | 164526 | 10.0 | 0 | 0 | 40 |
show_doc(FillStrategy, alt_doc_string='Enum flag that determines how `FillMissing` should handle missing/nan values', arg_comments={
'MEDIAN':'nans are replaced by the median value of the column',
'COMMON': 'nans are replaced by the most common value of the column',
'CONSTANT': 'nans are replaced by `fill_val`'
})
Enum = [MEDIAN, COMMON, CONSTANT]

Enum flag that determines how `FillMissing` should handle missing/nan values:

- MEDIAN: nans are replaced by the median value of the column
- COMMON: nans are replaced by the most common value of the column
- CONSTANT: nans are replaced by `fill_val`
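The three strategies map onto standard pandas fills. A rough pandas-only illustration (not the fastai implementation itself):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 3.0])
median_fill   = s.fillna(s.median())        # MEDIAN: median of the non-missing values
common_fill   = s.fillna(s.mode().iloc[0])  # COMMON: most frequent value
constant_fill = s.fillna(0.0)               # CONSTANT: the `fill_val` argument
print(median_fill.tolist())  # → [1.0, 3.0, 3.0, 3.0, 3.0]
```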
show_doc(Normalize)
class Normalize[source][test]

Normalize(cat_names:StrList, cont_names:StrList) :: TabularProc

Normalize the continuous variables.
norm = Normalize(cat_names, cont_names)
show_doc(Normalize.apply_train)
apply_train[source][test]

apply_train(df:DataFrame)

Tests found for apply_train:

Some other tests where apply_train is used:

- pytest -sv tests/test_tabular_transform.py::test_categorify [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

To run tests please refer to this guide.

Compute the means and stds of `self.cont_names` columns to normalize them.
norm.apply_train(train_df)
train_df[cont_names].head()
 | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week
---|---|---|---|---|---|---|
0 | 0.829039 | -0.812589 | 0.981643 | -0.136271 | 4.416656 | -0.050230 |
1 | 0.443977 | 0.355532 | 2.078450 | 1.153121 | -0.228760 | 0.361492 |
2 | -0.018098 | -0.856881 | -0.115165 | -0.136271 | -0.228760 | -0.708985 |
3 | -0.018098 | -0.713162 | 2.626854 | -0.136271 | -0.228760 | -0.050230 |
4 | 0.289952 | -0.976672 | -0.115165 | -0.136271 | -0.228760 | 0.773213 |
show_doc(Normalize.apply_test)
apply_test[source][test]

apply_test(df:DataFrame)

Tests found for apply_test:

Some other tests where apply_test is used:

- pytest -sv tests/test_tabular_transform.py::test_categorify [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
- pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

To run tests please refer to this guide.

Normalize `self.cont_names` with the same statistics as in `apply_train`.
norm.apply_test(valid_df)
valid_df[cont_names].head()
 | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week
---|---|---|---|---|---|---|
800 | 0.520989 | -0.850066 | -0.115165 | -0.136271 | -0.22876 | -0.050230 |
801 | 0.598002 | -0.023706 | -0.115165 | 1.705157 | -0.22876 | 1.596657 |
802 | -0.172123 | 1.560596 | -0.115165 | -0.136271 | -0.22876 | -0.214919 |
803 | -1.019260 | -1.254793 | -0.115165 | -0.136271 | -0.22876 | 0.361492 |
804 | -0.095110 | -0.267403 | -0.115165 | -0.136271 | -0.22876 | -0.050230 |
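The key point, shown here with plain pandas on made-up numbers, is that the test-set normalization reuses the training statistics rather than recomputing them (a sketch of the idea; the fastai implementation may differ in details such as epsilon handling):

```python
import pandas as pd

train = pd.DataFrame({'age': [20.0, 30.0, 40.0]})
valid = pd.DataFrame({'age': [30.0, 50.0]})
# Mean and std come from the training set only...
mean, std = train['age'].mean(), train['age'].std()
# ...and are applied unchanged to the validation set.
train['age'] = (train['age'] - mean) / std
valid['age'] = (valid['age'] - mean) / std
print(valid['age'].tolist())  # → [0.0, 2.0]
```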
show_doc(add_datepart)
Helper function that adds columns relevant to a date in the column `field_name` of `df`. Will `drop` the column in `df` if the flag is `True`. The `time` flag decides if we go down to the time parts or stick to the date parts.
df = pd.DataFrame({'col1': ['02/03/2017', '02/04/2017', '02/05/2017'], 'col2': ['a', 'b', 'a']})
add_datepart(df, 'col1') # inplace
df.head()
 | col2 | col1Year | col1Month | col1Week | col1Day | col1Dayofweek | col1Dayofyear | col1Is_month_end | col1Is_month_start | col1Is_quarter_end | col1Is_quarter_start | col1Is_year_end | col1Is_year_start | col1Elapsed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | a | 2017 | 2 | 5 | 3 | 4 | 34 | False | False | False | False | False | False | 1486080000 |
1 | b | 2017 | 2 | 5 | 4 | 5 | 35 | False | False | False | False | False | False | 1486166400 |
2 | a | 2017 | 2 | 5 | 5 | 6 | 36 | False | False | False | False | False | False | 1486252800 |
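The `col1Elapsed` values above are POSIX timestamps (seconds since 1970-01-01 UTC), which we can check directly with the standard library:

```python
from datetime import datetime, timezone

# '02/03/2017' is parsed month-first, i.e. February 3, 2017.
ts = datetime(2017, 2, 3, tzinfo=timezone.utc).timestamp()
print(int(ts))  # → 1486080000
```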
show_doc(add_cyclic_datepart)
add_cyclic_datepart[source][test]

add_cyclic_datepart(df:DataFrame, field_name:str, prefix:str=None, drop:bool=True, time:bool=False, add_linear:bool=False)

No tests found for add_cyclic_datepart. To contribute a test please refer to this guide and this discussion.

Helper function that adds trigonometric date/time features to a date in the column `field_name` of `df`.
df = pd.DataFrame({'col1': ['02/03/2017', '02/04/2017', '02/05/2017'], 'col2': ['a', 'b', 'a']})
df = add_cyclic_datepart(df, 'col1') # returns a dataframe
df.head()
 | col2 | col1weekday_cos | col1weekday_sin | col1day_month_cos | col1day_month_sin | col1month_year_cos | col1month_year_sin | col1day_year_cos | col1day_year_sin
---|---|---|---|---|---|---|---|---|---|
0 | a | -0.900969 | -0.433884 | 0.900969 | 0.433884 | 0.866025 | 0.5 | 0.842942 | 0.538005 |
1 | b | -0.222521 | -0.974928 | 0.781831 | 0.623490 | 0.866025 | 0.5 | 0.833556 | 0.552435 |
2 | a | 0.623490 | -0.781831 | 0.623490 | 0.781831 | 0.866025 | 0.5 | 0.823923 | 0.566702 |
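Each cyclic feature is a (cos, sin) pair of the fraction of the cycle that has elapsed. For example, February 3, 2017 falls on weekday 4 of a 7-day cycle (assuming Monday = 0), which reproduces the first `col1weekday_*` row above:

```python
import math

weekday, cycle = 4, 7
cos_f = math.cos(2 * math.pi * weekday / cycle)
sin_f = math.sin(2 * math.pi * weekday / cycle)
print(round(cos_f, 6), round(sin_f, 6))  # → -0.900969 -0.433884
```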
show_doc(cont_cat_split)
Helper function that returns the column names of the continuous and categorical variables in `df`. Integer columns with at most `max_card` unique values are treated as categorical, and the dependent variable `dep_var` is excluded from both lists.
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'a'], 'col3': [0.5, 1.2, 7.5], 'col4': ['ab', 'o', 'o']})
df
 | col1 | col2 | col3 | col4
---|---|---|---|---|
0 | 1 | a | 0.5 | ab |
1 | 2 | b | 1.2 | o |
2 | 3 | a | 7.5 | o |
cont_list, cat_list = cont_cat_split(df=df, max_card=20, dep_var='col4')
cont_list, cat_list
(['col3'], ['col1', 'col2'])
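A rough pandas-only sketch of that split logic, consistent with the example above (an illustration, not the fastai source; `simple_cont_cat_split` is a made-up name):

```python
import pandas as pd

def simple_cont_cat_split(df, max_card, dep_var):
    "Floats (or high-cardinality ints) are continuous, everything else categorical."
    cont, cat = [], []
    for col in df.columns:
        if col == dep_var:
            continue  # the dependent variable goes in neither list
        if pd.api.types.is_float_dtype(df[col]) or (
                pd.api.types.is_integer_dtype(df[col]) and df[col].nunique() > max_card):
            cont.append(col)
        else:
            cat.append(col)
    return cont, cat

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'a'],
                   'col3': [0.5, 1.2, 7.5], 'col4': ['ab', 'o', 'o']})
print(simple_cont_cat_split(df, 20, 'col4'))  # → (['col3'], ['col1', 'col2'])
```

Here `col1` lands in the categorical list because it has only 3 distinct integer values, well under `max_card=20`.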