Tabular data¶

In [ ]:

from fastai.gen_doc.nbdoc import *
from fastai.tabular.models import *

tabular contains all the necessary classes to deal with tabular data, across two modules:

tabular.transform: defines the TabularTransform class to help with preprocessing;
tabular.data: defines the TabularDataset that handles that data, as well as the methods to quickly get a TabularDataBunch.

To create a model, you'll need to use tabular.models. See below for an end-to-end example using all these modules.

Preprocessing tabular data¶

First, let's import everything we need for the tabular application.

In [ ]:

from fastai.tabular import * 

Tabular data usually comes in the form of a delimited file (such as .csv) containing variables of different kinds: text/category, numbers, and perhaps some missing values. The example we'll work with in this section is a sample of the adult dataset which has some census information on individuals. We'll use it to train a model to predict whether salary is greater than $50k or not.

In [ ]:

path = untar_data(URLs.ADULT_SAMPLE)
path

Out[ ]:

PosixPath('/home/ubuntu/.fastai/data/adult_sample')

In [ ]:

df = pd.read_csv(path/'adult.csv')
df.head()

Out[ ]:

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k

Here, all the information that will form our input is in the first 14 columns, and the dependent variable is the last column. We will split our input between two types of variables: categorical and continuous.

Categorical variables will be replaced by a category - a unique id that identifies them - before they are passed through an embedding layer.
Continuous variables will be normalized and then directly fed to the model.
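
To make these two treatments concrete, here is a minimal pure-Python sketch. It is not fastai's implementation: the helper names `categorify` and `normalize` are hypothetical, and reserving id 0 for missing values is an assumption made for illustration. The sample values come from the first five rows shown above.

```python
from statistics import mean, pstdev

def categorify(values):
    """Map each distinct category to an integer id (0 is reserved for missing: an assumption)."""
    vocab = sorted({v for v in values if v is not None})
    to_id = {v: i + 1 for i, v in enumerate(vocab)}
    return [to_id.get(v, 0) for v in values], vocab

def normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# 'workclass' and 'age' for the five rows displayed by df.head()
workclass = ["Private", "Private", "Private", "Self-emp-inc", "Self-emp-not-inc"]
age = [49, 44, 38, 38, 42]

ids, vocab = categorify(workclass)
scaled = normalize(age)

print(ids)                            # [1, 1, 1, 2, 3]
print(vocab)                          # ids index into this vocabulary (shifted by one)
print([round(v, 3) for v in scaled])  # centered around 0
```

The integer ids are what an embedding layer looks up, while the rescaled numbers can go to the model directly.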

We also need to handle missing values: our model can't work with NaNs, so we should replace them in a sensible way. All of this preprocessing is done by TabularTransform objects and TabularDataset.

We can define a list of transforms that will be applied to our variables. Here, Categorify turns all categorical variables into categories, FillMissing replaces missing values in continuous variables with the column median, and Normalize normalizes the continuous variables.

In [ ]:

procs = [FillMissing, Categorify, Normalize]
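
As a rough illustration of median filling, the sketch below uses only the standard library. The helper `fill_missing` is hypothetical, not the library's code, and returning a was-missing flag alongside the filled values is an illustrative choice. The sample values are the `education-num` entries of the five rows shown earlier, with `None` standing in for NaN.

```python
from statistics import median

def fill_missing(values):
    """Replace None with the median of the observed values; also flag which rows were missing."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

education_num = [12.0, 14.0, None, 15.0, None]
filled, was_missing = fill_missing(education_num)

print(filled)       # [12.0, 14.0, 14.0, 15.0, 14.0]
print(was_missing)  # [False, False, True, False, True]
```

In practice the fill value should be computed on the training rows only and then reused for the validation rows, so that no information leaks from the validation set.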

To split our data into training and validation sets, we specify the indices of the validation rows. Here we hold out the last 2,000 rows:

In [ ]:

valid_idx = range(len(df)-2000, len(df))
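
The effect of such an index range can be checked on a toy example. In this sketch `n_rows` and `n_valid` are small hypothetical stand-ins for `len(df)` and 2000; every row whose position is not in `valid_idx` ends up in the training set.

```python
n_rows = 10   # hypothetical stand-in for len(df)
n_valid = 3   # hypothetical stand-in for 2000

valid_idx = range(n_rows - n_valid, n_rows)
valid_set = set(valid_idx)
train_idx = [i for i in range(n_rows) if i not in valid_set]

print(list(valid_idx))  # [7, 8, 9]
print(train_idx)        # [0, 1, 2, 3, 4, 5, 6]
```

Holding out the last rows gives a fair validation set only if the file is not sorted in a meaningful way; otherwise a random sample of indices would be a safer choice.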

Then let's manually split our variables into categorical and continuous variables (we can ignore the dependent variable at this stage). fastai will assume all variables that aren't dependent or categorical are continuous, unless we explicitly pass a list to the cont_names parameter when constructing our DataBunch.

In [ ]:

dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

Now we're ready to pass this information to TabularDataBunch.from_df to create the DataBunch that we'll use for training.

In [ ]:

data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)
print(data.train_ds.cont_names)  # `cont_names` defaults to: set(df)-set(cat_names)-{dep_var}

['capital-gain', 'fnlwgt', 'hours-per-week', 'capital-loss', 'education-num', 'age']
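
The set arithmetic mentioned in the comment above can be reproduced without fastai. This sketch copies the column names from the dataframe header; it sorts the result because Python sets have no defined order, which is also why the printed order above looks arbitrary.

```python
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'salary']
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation',
             'relationship', 'race', 'sex', 'native-country']

# Everything that is neither categorical nor the dependent variable is continuous.
cont_names = sorted(set(columns) - set(cat_names) - {dep_var})
print(cont_names)
# ['age', 'capital-gain', 'capital-loss', 'education-num', 'fnlwgt', 'hours-per-week']
```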

We can grab a mini-batch of data and take a look (note that to_np here converts a PyTorch tensor to a NumPy array):

In [ ]:

(cat_x,cont_x),y = next(iter(data.train_dl))
for o in (cat_x, cont_x, y): print(to_np(o[:5]))

[[ 5  8  3 13  1  5  2 40  1]
 [ 5 12  3  5  1  3  2 40  1]
 [ 1  2  3  1  1  5  2 40  1]
 [ 5 12  3  4  1  5  2 40  1]
 [ 7 12  3  5  1  5  2 40  1]]
[[ 0.886658 -1.382477 -0.035789 -0.216787  0.753904  0.393667]
 [-0.145922 -0.367974  1.176085  4.469497 -0.421569  0.833029]
 [-0.145922 -1.454197 -2.621122 -0.216787 -1.205218  2.07789 ]
 [-0.145922  0.135373 -0.035789 -0.216787 -0.421569  0.173985]
 [-0.145922 -0.224013 -0.035789 -0.216787 -0.421569  2.151118]]
[1 1 0 0 1]

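After being processed in TabularDataset, the categorical variables are replaced by ids and the continuous variables are normalized. The codes corresponding to categorical variables are all put together, as are all the continuous variables. Note that the categorical tensor has nine columns although cat_names lists eight: this is most likely the extra indicator column that FillMissing adds for a continuous variable with missing values (here, education-num).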

Defining a model¶

Once our data is ready in a DataBunch, we just need to create a model, wrap it in a Learner, and start training. The fastai library provides a flexible and powerful TabularModel in tabular.models, and tabular_learner builds one for us from the DataBunch. Embedding sizes are chosen by a default rule for each categorical variable; the emb_szs dictionary lets us override that choice for specific columns, as we do here for native-country.

In [ ]:

learn = tabular_learner(data, layers=[200,100], emb_szs={'native-country': 10}, metrics=accuracy)
learn.fit_one_cycle(1, 1e-2)

epoch	train_loss	valid_loss	accuracy	time
0	0.321540	0.319863	0.844000	00:04
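
The `emb_szs` argument above sets the embedding width for `native-country` only. The sketch below shows one way per-column overrides could be combined with a default rule of thumb. Everything in it is an assumption for illustration: the helper names are hypothetical, the cardinalities are made up, and the constants in `default_emb_size` should be checked against the library source rather than taken as fastai's exact rule.

```python
def default_emb_size(n_categories):
    """Assumed rule of thumb: width grows sublinearly with cardinality, capped at 600."""
    return min(600, round(1.6 * n_categories ** 0.56))

def resolve_emb_sizes(cardinalities, overrides=None):
    """Return {column: (n_categories, emb_width)}, preferring explicit overrides."""
    overrides = overrides or {}
    return {name: (n, overrides.get(name, default_emb_size(n)))
            for name, n in cardinalities.items()}

# Hypothetical cardinalities, not measured on the adult dataset
cardinalities = {'workclass': 10, 'sex': 3, 'native-country': 42}
sizes = resolve_emb_sizes(cardinalities, overrides={'native-country': 10})

for name, (n, width) in sizes.items():
    print(f"{name}: {n} categories -> embedding width {width}")
```

Overriding a single column is useful when the default would give a high-cardinality variable a wider embedding than the amount of data justifies.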

As usual, we can use the Learner.predict method to get predictions. In this case, we need to pass a row of a dataframe whose categorical and continuous column names match those of our training or validation dataframe.

In [ ]:

learn.predict(df.iloc[0])

Out[ ]:

(Category >=50k, tensor(1), tensor([0.1864, 0.8136]))
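
The three elements of this result are the predicted class, its index, and the probability of each class. The sketch below recovers the first two from the third in plain Python. The class order in `classes` is an assumption, inferred from index 1 corresponding to `>=50k` in the output above.

```python
classes = ['<50k', '>=50k']   # assumed order: index 1 is '>=50k' per the output above
probs = [0.1864, 0.8136]      # probabilities copied from the prediction

# The predicted class is the one with the highest probability.
pred_idx = max(range(len(probs)), key=probs.__getitem__)
pred_class = classes[pred_idx]

print(pred_class, pred_idx, probs[pred_idx])  # >=50k 1 0.8136
```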