Tabular data handling¶

This module defines the main class to handle tabular data in the fastai library: TabularDataset. As always, there is also a helper to quickly get your data: TabularDataBunch.from_df.

To allow you to easily create a Learner for your data, it provides get_tabular_learner.

In [ ]:

from fastai.gen_doc.nbdoc import *
from fastai.tabular import * 
from fastai import *

In [ ]:

show_doc(TabularDataBunch, doc_string=False)

`class` `TabularDataBunch`[source]

TabularDataBunch(train_dl:DataLoader, valid_dl:DataLoader, test_dl:Optional[DataLoader]=None, device:device=None, tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate') :: DataBunch

The best way to quickly get your data in a DataBunch suitable for tabular data is to organize it in two (or three) dataframes: one for training, one for validation and, if you have it, one for testing. Here we use a subsample of the adult dataset.

In [ ]:

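# Download (if needed) and extract the sample of the adult dataset
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
# First 800 rows for training, the remaining rows for validation
train_df, valid_df = df[:800].copy(), df[800:].copy()
train_df.head()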

Out[ ]:

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	>=50k
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	1
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	1
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	0
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	1
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	0

In [ ]:

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
dep_var = '>=50k'

In [ ]:

show_doc(TabularDataBunch.from_df, doc_string=False)

`from_df`[source]

from_df(path, train_df:DataFrame, valid_df:DataFrame, dep_var:str, test_df:OptDataFrame=None, tfms:Optional[Collection[TabularTransform]]=None, cat_names:OptStrList=None, cont_names:OptStrList=None, stats:OptStats=None, log_output:bool=False, **kwargs) → DataBunch

Creates a DataBunch in path from train_df, valid_df and optionally test_df. The dependent variable is dep_var, while the categorical and continuous variables are in the cat_names columns and cont_names columns respectively. If cont_names is None, all variables that are neither the dependent variable nor categorical are assumed to be continuous. The TabularTransform classes in tfms are applied to the dataframes as preprocessing, then the categories are replaced by their codes+1 (leaving 0 for nan) and the continuous variables are normalized. You can pass the stats to use for that normalization step. If log_output is True, the dependent variable is replaced by its log.

Note that the transforms should be passed as callables (the classes themselves, not instances): the actual initialization with cat_names and cont_names is done inside.
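To make the "initialization is done inside" point concrete, here is a minimal sketch in plain pandas. It is not the library's code: FillMedianSketch and apply_tfms_sketch are hypothetical names, and the test flag is an assumption about how a transform could reuse what it learned on the training set. It also shows how cont_names can be inferred when it is None.

```python
import pandas as pd

class FillMedianSketch:
    """Hypothetical transform: fill missing continuous values with the training median."""
    def __init__(self, cat_names, cont_names):
        self.cat_names, self.cont_names = cat_names, cont_names
        self.medians = {}

    def __call__(self, df, test=False):
        for name in self.cont_names:
            if not test:
                # Learn the fill value on the training dataframe only
                self.medians[name] = df[name].median()
            df[name] = df[name].fillna(self.medians[name])

def apply_tfms_sketch(tfm_classes, train_df, valid_df, dep_var, cat_names, cont_names=None):
    # If cont_names is None, every column that is neither dependent nor categorical is continuous
    if cont_names is None:
        cont_names = [c for c in train_df.columns if c not in cat_names and c != dep_var]
    for tfm_class in tfm_classes:
        # The caller passed the class; it is instantiated here with the column names
        tfm = tfm_class(cat_names, cont_names)
        tfm(train_df)
        tfm(valid_df, test=True)
    return cont_names

# Made-up toy data, only for illustration
train = pd.DataFrame({'age': [30.0, None, 50.0], 'sex': ['F', 'M', 'F'], 'y': [0, 1, 1]})
valid = pd.DataFrame({'age': [None, 20.0], 'sex': ['M', 'F'], 'y': [1, 0]})
cont_names = apply_tfms_sketch([FillMedianSketch], train, valid, dep_var='y', cat_names=['sex'])
```

Because the caller only hands over the class, the same list of transforms can be reused with any set of column names.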

In [ ]:

tfms = [FillMissing, Categorify]
data = TabularDataBunch.from_df(path, train_df, valid_df, dep_var=dep_var, tfms=tfms, cat_names=cat_names)

You can then easily create a Learner for this data with get_tabular_learner.

In [ ]:

show_doc(get_tabular_learner)

`get_tabular_learner`[source]

get_tabular_learner(data:DataBunch, layers:Collection[int], emb_szs:Dict[str, int]=None, metrics=None, ps:Collection[float]=None, emb_drop:float=0.0, y_range:OptRange=None, use_bn:bool=True, **kwargs)

Get a Learner using data, with metrics, including a TabularModel created using the remaining params.

emb_szs is a dict mapping categorical column names to embedding sizes; you only need to pass sizes for columns where you want to override the default behaviour of the model.
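The override behaviour of emb_szs can be sketched in a few lines of plain Python. The helper name resolve_emb_szs, the default rule min(50, n_cat // 2 + 1) and the cardinalities below are all assumptions made for illustration; check the library source for the rule it actually applies.

```python
def resolve_emb_szs(cardinalities, emb_szs=None):
    """Hypothetical helper: pick an embedding size for every categorical column."""
    emb_szs = emb_szs or {}
    resolved = {}
    for name, n_cat in cardinalities.items():
        # Assumed rule of thumb for the default size, not necessarily the library's
        default = min(50, n_cat // 2 + 1)
        # A user-supplied size wins over the default for that column only
        resolved[name] = (n_cat, emb_szs.get(name, default))
    return resolved

# Illustrative cardinalities, not measured on the adult dataset
sizes = resolve_emb_szs({'workclass': 10, 'native-country': 43},
                        emb_szs={'native-country': 10})
```

Here only native-country is overridden; workclass falls back to the assumed default.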

In [ ]:

show_doc(TabularDataset, doc_string=False)

`class` `TabularDataset`[source]

TabularDataset(df:DataFrame, dep_var:str, cat_names:OptStrList=None, cont_names:OptStrList=None, stats:OptStats=None, log_output:bool=False) :: DatasetBase

A dataset from DataFrame df with the dependent variable being the dep_var column, while the categorical and continuous variables are in the cat_names columns and cont_names columns respectively. Categories are replaced by their codes+1 (leaving 0 for nan) and the continuous variables are normalized. You can pass the stats to use for normalization; if stats is None, they are calculated from your data. If the flag log_output is True, the dependent variable is replaced by its log.
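The three processing steps described above can be sketched with pandas and numpy. This is an illustration under assumptions, not the class's implementation: the toy data is made up, and the small eps added to the standard deviation is an assumed guard against division by zero.

```python
import numpy as np
import pandas as pd

# Made-up toy data, only for illustration
df = pd.DataFrame({
    'workclass': pd.Categorical(['Private', None, 'State-gov', 'Private']),
    'age': [20.0, 30.0, 40.0, 50.0],
    'income': [1.0, 10.0, 100.0, 1000.0],
})

# Categories -> codes+1: pandas encodes a missing value as -1, so it becomes 0
codes = df['workclass'].cat.codes.values + 1

# Continuous variables -> normalized with stats computed from the data
eps = 1e-7  # assumed guard against a zero standard deviation
age = df['age'].values
mean, std = age.mean(), age.std()
age_norm = (age - mean) / (std + eps)

# Passing stats: reuse the training mean and std on other data
valid_age_norm = (np.array([35.0]) - mean) / (std + eps)

# log_output=True corresponds to taking the log of the dependent variable
log_income = np.log(df['income'].values)
```

Reusing the training stats on the validation data keeps both sets on the same scale, which is why stats can be passed in rather than recomputed.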

In [ ]:

show_doc(TabularDataset.from_dataframe, doc_string=False)

`from_dataframe`[source]

from_dataframe(df:DataFrame, dep_var:str, tfms:Optional[Collection[TabularTransform]]=None, cat_names:OptStrList=None, cont_names:OptStrList=None, stats:OptStats=None, log_output:bool=False) → TabularDataset

Factory method to create a TabularDataset from df. The only difference from the constructor is that it takes a list tfms of TabularTransform, which are applied before the dataframe is passed to the class initialization.

Undocumented Methods - Methods moved below this line will intentionally be hidden¶

New Methods - Please document or move to the undocumented section¶

In [ ]:

show_doc(TabularDataset.get_emb_szs)

`get_emb_szs`[source]

get_emb_szs(sz_dict)
