How to use the tabular application in fastai
To illustrate the tabular application, we will use the example of the Adult dataset where we have to predict if a person is earning more or less than $50k per year using some general data.
from fastai.tabular.all import *
We can download a sample of this dataset with the usual command:
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
(#3) [Path('/home/sgugger/.fastai/data/adult_sample/models'),Path('/home/sgugger/.fastai/data/adult_sample/adult.csv'),Path('/home/sgugger/.fastai/data/adult_sample/export.pkl')]
Then we can have a look at how the data is structured:
df = pd.read_csv(path/'adult.csv')
df.head()
|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
Some of the columns are continuous (like age) and we will treat them as floating-point numbers that we can feed directly to our model. Others are categorical (like workclass or education) and we will convert them to a unique index that we will feed to embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the `TabularDataLoaders` factory methods:
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
cont_names = ['age', 'fnlwgt', 'education-num'],
procs = [Categorify, FillMissing, Normalize])
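If you would rather not type out the column lists by hand, fastai also provides a `cont_cat_split` helper that guesses which columns are continuous and which are categorical from their cardinality. A minimal sketch, assuming the default cardinality threshold; the split it suggests may differ slightly from the manual lists above:
# cont_cat_split returns the continuous column names first, then the categorical ones
cont_names, cat_names = cont_cat_split(df, max_card=20, dep_var='salary')
cont_names, cat_names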
The last part is the list of pre-processors we apply to our data:

- `Categorify` is going to take every categorical variable and build a mapping from its unique categories to integers, then replace the values by the corresponding index.
- `FillMissing` will fill the missing values in the continuous variables with the median of the existing values (you can choose a specific fill value if you prefer).
- `Normalize` will normalize the continuous variables (subtract the mean and divide by the std).
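These processors are applied through `TabularPandas`, the lower-level API that `TabularDataLoaders` builds on. A minimal sketch of the equivalent manual route, assuming the same column lists as above and a random 80/20 split (`bs=64` is just the default batch size made explicit):
splits = RandomSplitter(valid_pct=0.2)(range_of(df))   # random train/valid split
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names=['workclass', 'education', 'marital-status',
                              'occupation', 'relationship', 'race'],
                   cont_names=['age', 'fnlwgt', 'education-num'],
                   y_names='salary', splits=splits)
dls = to.dataloaders(bs=64)                            # same kind of DataLoaders as above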
The `show_batch` method works like for every other application:
dls.show_batch()
|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ? | Some-college | Never-married | ? | Own-child | White | False | 22.000000 | 32731.996436 | 10.0 | <50k |
| 1 | Private | 7th-8th | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 44.000000 | 99202.998578 | 4.0 | <50k |
| 2 | Private | HS-grad | Divorced | Farming-fishing | Not-in-family | White | False | 63.000001 | 117680.996997 | 9.0 | <50k |
| 3 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 33.000000 | 194141.000170 | 9.0 | <50k |
| 4 | Private | Assoc-voc | Divorced | Transport-moving | Not-in-family | White | False | 35.000000 | 172570.999732 | 11.0 | <50k |
| 5 | Local-gov | HS-grad | Divorced | Exec-managerial | Unmarried | Amer-Indian-Eskimo | False | 43.000000 | 196308.000036 | 9.0 | <50k |
| 6 | Private | HS-grad | Never-married | Exec-managerial | Not-in-family | White | False | 43.000000 | 336642.996235 | 9.0 | <50k |
| 7 | Private | HS-grad | Never-married | Other-service | Not-in-family | White | False | 27.000000 | 158156.001081 | 9.0 | <50k |
| 8 | ? | Bachelors | Never-married | ? | Unmarried | White | False | 26.000000 | 130832.001756 | 13.0 | <50k |
| 9 | Private | Assoc-voc | Married-civ-spouse | Tech-support | Husband | White | False | 27.000000 | 62737.003461 | 11.0 | <50k |
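If you want to see the tensors a batch actually contains, `one_batch` returns the integer codes for the categorical columns, the normalized floats for the continuous columns, and the encoded target; the shapes in the comment assume the default batch size of 64:
cat_x, cont_x, y = dls.one_batch()
cat_x.shape, cont_x.shape, y.shape   # roughly (64, 7), (64, 3), (64, 1): 6 categoricals plus education-num_na, 3 continuous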
We can define a model using the `tabular_learner` method. When we define our model, fastai will try to infer the loss function based on our `y_names` earlier.

Note: Sometimes with tabular data, your `y`'s may be encoded (such as 0 and 1). In such a case you should explicitly pass `y_block = CategoryBlock` in your constructor so fastai won't presume you are doing regression.
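For instance, if the salary column had already been encoded as 0/1 integers, a hypothetical call could look like this (identical to the earlier one except for the extra `y_block` argument):
# Hypothetical: only needed when the dependent variable is already numerically encoded
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize],
    y_block = CategoryBlock())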
learn = tabular_learner(dls, metrics=accuracy)
And we can train that model with the `fit_one_cycle` method (the `fine_tune` method won't be useful here since we don't have a pretrained model).
learn.fit_one_cycle(1)
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.366727 | 0.351524 | 0.835842 | 00:05 |
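As an optional aside (not part of the run above), you could use fastai's learning-rate finder before training for longer and pass the chosen rate to `fit_one_cycle`; the values below are purely illustrative:
learn.lr_find()                       # plots loss against learning rate
learn.fit_one_cycle(3, lr_max=1e-3)   # hypothetical: more epochs at a rate read off the plot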
We can then have a look at some predictions:
learn.show_results()
|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | salary_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 12.0 | 3.0 | 15.0 | 1.0 | 5.0 | 1.0 | -0.333356 | -0.900977 | -0.419934 | 1.0 | 0.0 |
| 1 | 7.0 | 12.0 | 5.0 | 6.0 | 5.0 | 5.0 | 1.0 | 0.916167 | -1.457755 | -0.419934 | 0.0 | 0.0 |
| 2 | 5.0 | 10.0 | 3.0 | 2.0 | 1.0 | 5.0 | 1.0 | -0.774364 | -0.030944 | 1.150726 | 0.0 | 0.0 |
| 3 | 5.0 | 13.0 | 3.0 | 5.0 | 1.0 | 5.0 | 1.0 | -0.259855 | -0.668491 | 1.543390 | 0.0 | 1.0 |
| 4 | 5.0 | 13.0 | 1.0 | 13.0 | 2.0 | 5.0 | 1.0 | 0.622161 | 0.409060 | 1.543390 | 1.0 | 0.0 |
| 5 | 3.0 | 16.0 | 3.0 | 4.0 | 1.0 | 5.0 | 1.0 | 0.254654 | -0.870132 | -0.027269 | 1.0 | 0.0 |
| 6 | 5.0 | 12.0 | 5.0 | 13.0 | 2.0 | 5.0 | 1.0 | -0.259855 | -0.464552 | -0.419934 | 0.0 | 0.0 |
| 7 | 5.0 | 9.0 | 3.0 | 4.0 | 1.0 | 5.0 | 1.0 | 0.989668 | -0.430562 | 0.365396 | 1.0 | 1.0 |
| 8 | 6.0 | 16.0 | 3.0 | 4.0 | 1.0 | 3.0 | 1.0 | -0.627362 | -0.110140 | -0.027269 | 0.0 | 0.0 |
Or use the `predict` method on a row:
learn.predict(df.iloc[0])
(   workclass  education  marital-status  occupation  relationship  race  \
 0        5.0        8.0             3.0         0.0           6.0   5.0

    education-num_na       age    fnlwgt  education-num  salary
 0               1.0  0.769164 -0.835926       0.758061     0.0 ,
 tensor(0),
 tensor([0.5200, 0.4800]))
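The returned tuple holds the processed input row, the predicted class index, and the per-class probabilities, so it can be unpacked directly:
row, pred_idx, probs = learn.predict(df.iloc[0])
pred_idx, probs[pred_idx]   # predicted class index and its probability (tensor(0) and 0.52 in the run above)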
To get predictions on a new dataframe, you can use the `test_dl` method of the `DataLoaders`. That dataframe does not need to have the dependent variable in its columns.
test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)
Then `Learner.get_preds` will give you the predictions:
learn.get_preds(dl=dl)
(tensor([[0.5200, 0.4800],
         [0.5536, 0.4464],
         [0.9767, 0.0233],
         ...,
         [0.6025, 0.3975],
         [0.7228, 0.2772],
         [0.5157, 0.4843]]), None)
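`get_preds` returns the probabilities along with `None` for the targets, since the test dataframe has no labels. To turn the probabilities into class indices, you can take the argmax over the class dimension:
preds, _ = learn.get_preds(dl=dl)
pred_idxs = preds.argmax(dim=1)   # one 0/1 class index per row of test_df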