This module defines the main class to handle tabular data in the fastai library: TabularDataset. As always, there is also a helper function to quickly get your data. To allow you to easily create a Learner for your data, it provides get_tabular_learner.
from fastai.gen_doc.nbdoc import *
from fastai.tabular import *
from fastai import *
show_doc(TabularDataBunch, doc_string=False)
class TabularDataBunch(train_dl:DataLoader, valid_dl:DataLoader, test_dl:Optional[DataLoader]=None, device:device=None, tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate') :: DataBunch
The best way to quickly get your data into a DataBunch suitable for tabular data is to organize it in two (or three) dataframes: one for training, one for validation and, if you have it, one for testing. Here we use a subsample of the adult dataset.
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
train_df, valid_df = df[:800].copy(),df[800:].copy()
train_df.head()
| age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | >=50k |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | 1 |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | 1 |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | 0 |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | 1 |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | 0 |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
dep_var = '>=50k'
show_doc(TabularDataBunch.from_df, doc_string=False)
from_df(path, train_df:DataFrame, valid_df:DataFrame, dep_var:str, test_df:OptDataFrame=None, tfms:Optional[Collection[TabularTransform]]=None, cat_names:OptStrList=None, cont_names:OptStrList=None, stats:OptStats=None, log_output:bool=False, **kwargs) → DataBunch
Creates a DataBunch in path from train_df, valid_df and optionally test_df. The dependent variable is dep_var, while the categorical and continuous variables are in the cat_names and cont_names columns respectively. If cont_names is None, all variables that are neither dependent nor categorical are assumed to be continuous. The TabularTransform in tfms are applied to the dataframes as preprocessing, then the categories are replaced by their codes+1 (leaving 0 for nan) and the continuous variables are normalized. You can pass the stats to use for that step. If log_output is True, the dependent variable is replaced by its log.
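To make the "codes+1 (leaving 0 for nan)" step concrete, here is a minimal sketch using plain pandas. It is not the library's own code: the toy dataframe and its values are made up, and it assumes the encoding behaves like pandas categorical codes, where a missing value gets the code -1.

```python
import numpy as np
import pandas as pd

# Toy frame: the column name mirrors the adult sample, the values are made up.
df = pd.DataFrame({"workclass": ["Private", np.nan, "Self-emp-inc", "Private"]})

# Convert to a pandas categorical; missing values receive the code -1.
cat = df["workclass"].astype("category")

# Shift every code by one so that 0 is reserved for missing values.
codes = cat.cat.codes.to_numpy() + 1

# Mapping from shifted code to category label, then the encoded column.
print(dict(enumerate(cat.cat.categories, start=1)))
print(codes)
```

Reserving 0 for missing values means an embedding layer can dedicate one row to "unknown" without a separate indicator column.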
Note that the transforms should be passed as Callable: the actual initialization with cat_names and cont_names is done inside.
tfms = [FillMissing, Categorify]
data = TabularDataBunch.from_df(path, train_df, valid_df, dep_var=dep_var, tfms=tfms, cat_names=cat_names)
You can then easily create a Learner for this data with get_tabular_learner.
show_doc(get_tabular_learner)
get_tabular_learner(data:DataBunch, layers:Collection[int], emb_szs:Dict[str,int]=None, metrics=None, ps:Collection[float]=None, emb_drop:float=0.0, y_range:OptRange=None, use_bn:bool=True, **kwargs)
Get a Learner using data, with metrics, including a TabularModel created using the remaining params.
emb_szs is a dict mapping categorical column names to embedding sizes; you only need to pass sizes for columns where you want to override the default behaviour of the model.
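The override behaviour of emb_szs can be pictured with the sketch below. Everything in it is an assumption made for illustration: default_emb_size and resolve_emb_sizes are hypothetical helpers, the default rule is a made-up rule of thumb rather than the one the library applies, and the cardinalities are invented.

```python
from typing import Dict, Optional

def default_emb_size(n_categories: int) -> int:
    # Hypothetical rule of thumb; NOT necessarily the rule fastai uses.
    return min(50, n_categories // 2 + 1)

def resolve_emb_sizes(cardinalities: Dict[str, int],
                      emb_szs: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Use the caller's size where one is given, the default rule otherwise."""
    emb_szs = emb_szs or {}
    return {name: emb_szs.get(name, default_emb_size(n))
            for name, n in cardinalities.items()}

# Made-up numbers of distinct values per categorical column.
cardinalities = {"workclass": 10, "native-country": 42}

# Only 'native-country' is overridden; 'workclass' falls back to the default.
sizes = resolve_emb_sizes(cardinalities, {"native-country": 10})
print(sizes)
```

With this pattern, a call such as passing emb_szs={'native-country': 10} only touches the named column and leaves every other column at its default size.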
show_doc(TabularDataset, doc_string=False)
class TabularDataset(df:DataFrame, dep_var:str, cat_names:OptStrList=None, cont_names:OptStrList=None, stats:OptStats=None, log_output:bool=False) :: DatasetBase
A dataset from DataFrame df with the dependent variable being the dep_var column, while the categorical and continuous variables are in the cat_names and cont_names columns respectively. Categories are replaced by their codes+1 (leaving 0 for nan) and the continuous variables are normalized. You can pass the stats to use for normalization; if None, they are calculated from your data. If the flag log_output is True, the dependent variable is replaced by its log.
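The normalization and log_output behaviour can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: normalize is a hypothetical helper, stats is assumed to be a (means, stds) pair, the standard deviation is pandas' default sample estimate, and the data are made up.

```python
import numpy as np
import pandas as pd

def normalize(df, cont_names, stats=None):
    """Standardize continuous columns; compute (means, stds) from df if stats is None."""
    if stats is None:
        # Assumption: stats is a (means, stds) pair indexed by column name.
        stats = (df[cont_names].mean(), df[cont_names].std())
    means, stds = stats
    return (df[cont_names] - means) / stds, stats

# Made-up data: 'age' is a continuous input, 'price' a positive target.
train = pd.DataFrame({"age": [20.0, 40.0, 60.0], "price": [1.0, 10.0, 100.0]})
valid = pd.DataFrame({"age": [40.0, 80.0], "price": [10.0, 1000.0]})

# Statistics are computed on the first frame, then passed in for the second.
train_conts, stats = normalize(train, ["age"])
valid_conts, _ = normalize(valid, ["age"], stats=stats)

# With a log_output-style flag, the target is replaced by its natural log.
log_target = np.log(train["price"])
```

Passing precomputed stats is one way to make sure a validation or test frame is scaled with the same means and standard deviations as the training frame.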
show_doc(TabularDataset.from_dataframe, doc_string=False)
from_dataframe(df:DataFrame, dep_var:str, tfms:Optional[Collection[TabularTransform]]=None, cat_names:OptStrList=None, cont_names:OptStrList=None, stats:OptStats=None, log_output:bool=False) → TabularDataset
Factory method to create a TabularDataset from df. The only difference from the constructor is that it takes a list tfms of TabularTransform that are applied before the dataframe is passed to the class initialization.
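The pattern of passing transforms uninitialized and applying them before the dataset is built might look like the sketch below. FillWithMedian and apply_tfms are hypothetical names invented for this illustration; they are not fastai classes or functions, and the real transforms may take different constructor arguments.

```python
import pandas as pd

class FillWithMedian:
    """Hypothetical transform, for illustration only (not a fastai class)."""
    def __init__(self, cat_names, cont_names):
        self.cat_names, self.cont_names = cat_names, cont_names

    def __call__(self, df):
        # Replace missing continuous values with the column median.
        for name in self.cont_names:
            df[name] = df[name].fillna(df[name].median())
        return df

def apply_tfms(df, tfms, cat_names, cont_names):
    """Instantiate each transform class with the column names, then apply it in order."""
    df = df.copy()  # leave the caller's dataframe untouched
    for tfm_cls in tfms or []:
        df = tfm_cls(cat_names, cont_names)(df)
    return df

# Made-up data with one missing continuous value.
raw = pd.DataFrame({"sex": ["Male", "Female", "Male"], "age": [30.0, None, 50.0]})

# The transform is passed as a class; initialization happens inside apply_tfms.
clean = apply_tfms(raw, [FillWithMedian], cat_names=["sex"], cont_names=["age"])
```

Deferring initialization this way lets the caller list transforms once, while the column names are supplied by whatever builds the dataset.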
show_doc(TabularDataset.get_emb_szs)
get_emb_szs(sz_dict)