How to use the tabular application in fastai
To illustrate the tabular application, we will use the example of the Adult dataset where we have to predict if a person is earning more or less than $50k per year using some general data.
from fastai.tabular.all import *
We can download a sample of this dataset with the usual command:
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
(#3) [Path('/home/sgugger/.fastai/data/adult_sample/models'),Path('/home/sgugger/.fastai/data/adult_sample/adult.csv'),Path('/home/sgugger/.fastai/data/adult_sample/export.pkl')]
Then we can have a look at how the data is structured:
df = pd.read_csv(path/'adult.csv')
df.head()
|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
Some of the columns are continuous (like age) and we will treat them as floating-point numbers that we can feed directly to our model. Others are categorical (like workclass or education) and we will convert them to a unique index that we will feed to embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the `TabularDataLoaders` factory methods:
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
cont_names = ['age', 'fnlwgt', 'education-num'],
procs = [Categorify, FillMissing, Normalize])
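If you would rather not type out the column lists by hand, fastai also provides a `cont_cat_split` helper that guesses which columns are continuous and which are categorical from their cardinality. A minimal sketch, assuming the default cardinality threshold; the split it suggests may differ slightly from the manual lists above:
# cont_cat_split returns the continuous column names first, then the categorical ones
cont_names, cat_names = cont_cat_split(df, max_card=20, dep_var='salary')
cont_names, cat_names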
The last part is the list of pre-processors we apply to our data:

- `Categorify` is going to take every categorical variable and build a mapping from its unique categories to integers, then replace the values by the corresponding index.
- `FillMissing` will fill the missing values in the continuous variables with the median of the existing values (you can choose a specific fill value if you prefer).
- `Normalize` will normalize the continuous variables (subtract the mean and divide by the std).
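These processors are applied through `TabularPandas`, the lower-level API that `TabularDataLoaders` builds on. A minimal sketch of the equivalent manual route, assuming the same column lists as above and a random 80/20 split (`bs=64` is just the default batch size made explicit):
splits = RandomSplitter(valid_pct=0.2)(range_of(df))   # random train/valid split
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names=['workclass', 'education', 'marital-status',
                              'occupation', 'relationship', 'race'],
                   cont_names=['age', 'fnlwgt', 'education-num'],
                   y_names='salary', splits=splits)
dls = to.dataloaders(bs=64)                            # same kind of DataLoaders as above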
The `show_batch` method works like for every other application:
dls.show_batch()
|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ? | Some-college | Never-married | ? | Own-child | White | False | 22.000000 | 32731.996436 | 10.0 | <50k |
| 1 | Private | 7th-8th | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 44.000000 | 99202.998578 | 4.0 | <50k |
| 2 | Private | HS-grad | Divorced | Farming-fishing | Not-in-family | White | False | 63.000001 | 117680.996997 | 9.0 | <50k |
| 3 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 33.000000 | 194141.000170 | 9.0 | <50k |
| 4 | Private | Assoc-voc | Divorced | Transport-moving | Not-in-family | White | False | 35.000000 | 172570.999732 | 11.0 | <50k |
| 5 | Local-gov | HS-grad | Divorced | Exec-managerial | Unmarried | Amer-Indian-Eskimo | False | 43.000000 | 196308.000036 | 9.0 | <50k |
| 6 | Private | HS-grad | Never-married | Exec-managerial | Not-in-family | White | False | 43.000000 | 336642.996235 | 9.0 | <50k |
| 7 | Private | HS-grad | Never-married | Other-service | Not-in-family | White | False | 27.000000 | 158156.001081 | 9.0 | <50k |
| 8 | ? | Bachelors | Never-married | ? | Unmarried | White | False | 26.000000 | 130832.001756 | 13.0 | <50k |
| 9 | Private | Assoc-voc | Married-civ-spouse | Tech-support | Husband | White | False | 27.000000 | 62737.003461 | 11.0 | <50k |
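If you want to see the tensors a batch actually contains, `one_batch` returns the integer codes for the categorical columns, the normalized floats for the continuous columns, and the encoded target; the shapes in the comment assume the default batch size of 64:
cat_x, cont_x, y = dls.one_batch()
cat_x.shape, cont_x.shape, y.shape   # roughly (64, 7), (64, 3), (64, 1): 6 categoricals plus education-num_na, 3 continuous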
We can define a model using the `tabular_learner` method. When we define our model, fastai will try to infer the loss function based on our `y_names` earlier.

Note: Sometimes with tabular data, your `y`'s may be encoded (such as 0 and 1). In such a case you should explicitly pass `y_block = CategoryBlock` in your constructor so fastai won't presume you are doing regression.
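For instance, if the salary column had already been encoded as 0/1 integers, a hypothetical call could look like this (identical to the earlier one except for the extra `y_block` argument):
# Hypothetical: only needed when the dependent variable is already numerically encoded
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize],
    y_block = CategoryBlock())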
learn = tabular_learner(dls, metrics=accuracy)
And we can train that model with the `fit_one_cycle` method (the `fine_tune` method won't be useful here since we don't have a pretrained model).
learn.fit_one_cycle(1)
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.366727 | 0.351524 | 0.835842 | 00:05 |
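As an optional aside (not part of the run above), you could use fastai's learning-rate finder before training for longer and pass the chosen rate to `fit_one_cycle`; the values below are purely illustrative:
learn.lr_find()                       # plots loss against learning rate
learn.fit_one_cycle(3, lr_max=1e-3)   # hypothetical: more epochs at a rate read off the plot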
We can then have a look at some predictions:
learn.show_results()
|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | salary_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 12.0 | 3.0 | 15.0 | 1.0 | 5.0 | 1.0 | -0.333356 | -0.900977 | -0.419934 | 1.0 | 0.0 |
| 1 | 7.0 | 12.0 | 5.0 | 6.0 | 5.0 | 5.0 | 1.0 | 0.916167 | -1.457755 | -0.419934 | 0.0 | 0.0 |
| 2 | 5.0 | 10.0 | 3.0 | 2.0 | 1.0 | 5.0 | 1.0 | -0.774364 | -0.030944 | 1.150726 | 0.0 | 0.0 |
| 3 | 5.0 | 13.0 | 3.0 | 5.0 | 1.0 | 5.0 | 1.0 | -0.259855 | -0.668491 | 1.543390 | 0.0 | 1.0 |
| 4 | 5.0 | 13.0 | 1.0 | 13.0 | 2.0 | 5.0 | 1.0 | 0.622161 | 0.409060 | 1.543390 | 1.0 | 0.0 |
| 5 | 3.0 | 16.0 | 3.0 | 4.0 | 1.0 | 5.0 | 1.0 | 0.254654 | -0.870132 | -0.027269 | 1.0 | 0.0 |
| 6 | 5.0 | 12.0 | 5.0 | 13.0 | 2.0 | 5.0 | 1.0 | -0.259855 | -0.464552 | -0.419934 | 0.0 | 0.0 |
| 7 | 5.0 | 9.0 | 3.0 | 4.0 | 1.0 | 5.0 | 1.0 | 0.989668 | -0.430562 | 0.365396 | 1.0 | 1.0 |
| 8 | 6.0 | 16.0 | 3.0 | 4.0 | 1.0 | 3.0 | 1.0 | -0.627362 | -0.110140 | -0.027269 | 0.0 | 0.0 |
Or use the `predict` method on a row:
learn.predict(df.iloc[0])
(   workclass  education  marital-status  occupation  relationship  race  \
 0        5.0        8.0             3.0         0.0           6.0   5.0

    education-num_na       age    fnlwgt  education-num  salary
 0               1.0  0.769164 -0.835926       0.758061     0.0 ,
 tensor(0),
 tensor([0.5200, 0.4800]))
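The returned tuple holds the processed input row, the predicted class index, and the per-class probabilities, so it can be unpacked directly:
row, pred_idx, probs = learn.predict(df.iloc[0])
pred_idx, probs[pred_idx]   # predicted class index and its probability (tensor(0) and 0.52 in the run above)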
To get predictions on a new dataframe, you can use the `test_dl` method of the `DataLoaders`. That dataframe does not need to have the dependent variable in its columns.
test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)
Then `Learner.get_preds` will give you the predictions:
learn.get_preds(dl=dl)
(tensor([[0.5200, 0.4800],
         [0.5536, 0.4464],
         [0.9767, 0.0233],
         ...,
         [0.6025, 0.3975],
         [0.7228, 0.2772],
         [0.5157, 0.4843]]), None)
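`get_preds` returns the probabilities along with `None` for the targets, since the test dataframe has no labels. To turn the probabilities into class indices, you can take the argmax over the class dimension:
preds, _ = learn.get_preds(dl=dl)
pred_idxs = preds.argmax(dim=1)   # one 0/1 class index per row of test_df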