%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import urllib.request
import pandas as pd
import numpy as np
import ktrain
from ktrain import tabular
In this notebook, we use Census data to predict which individuals earn more than $50K per year. This is the same dataset used in the AutoGluon tabular prediction example.
The original dataset is available from the UCI Machine Learning Repository, but we will download it from the AutoGluon website.
# training set
urllib.request.urlretrieve('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv',
'/tmp/train.csv')
('/tmp/train.csv', <http.client.HTTPMessage at 0x7fa0a8a16eb8>)
trn, val, preproc = tabular.tabular_from_csv('/tmp/train.csv', label_columns='class', random_state=42)
processing train: 35179 rows x 15 columns
The following integer column(s) are being treated as categorical variables: ['education-num']
To treat any of these column(s) as numerical, cast the column to float in DataFrame or CSV and re-run tabular_from* function.
processing test: 3894 rows x 15 columns
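As the message above notes, integer columns such as education-num are treated as categorical unless cast to float. A minimal sketch of that cast, using a tiny made-up DataFrame rather than the actual Census file:

```python
import pandas as pd

# Toy stand-in for the Census data; only the dtype matters here.
df = pd.DataFrame({'education-num': [7, 8, 11], 'age': [31, 17, 47]})

# Cast the integer column to float so the tabular_from* functions
# would treat it as numerical rather than categorical.
df['education-num'] = df['education-num'].astype('float64')
```

After the cast, re-running the tabular_from* function on the modified DataFrame (or a CSV written from it) would pick the column up as numerical.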
Learner

model = tabular.tabular_classifier('mlp', trn)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)
Is Multi-Label? False
done.
learner.lr_find(show_plot=True)
simulating training for different learning rates... this may take a few moments...
Train for 274 steps
Epoch 1/1024
274/274 [==============================] - 8s 28ms/step - loss: 0.7151 - accuracy: 0.3431
Epoch 2/1024
274/274 [==============================] - 7s 25ms/step - loss: 0.6359 - accuracy: 0.6889
Epoch 3/1024
274/274 [==============================] - 7s 25ms/step - loss: 0.4145 - accuracy: 0.8113
Epoch 4/1024
274/274 [==============================] - 7s 25ms/step - loss: 0.3268 - accuracy: 0.8486
Epoch 5/1024
274/274 [==============================] - 7s 25ms/step - loss: 0.6269 - accuracy: 0.7968
Epoch 6/1024
274/274 [==============================] - 7s 25ms/step - loss: 0.5543 - accuracy: 0.7589
Epoch 7/1024
50/274 [====>.........................] - ETA: 7s - loss: 47426.1304 - accuracy: 0.7517
done.
Visually inspect loss plot and select learning rate associated with falling loss
learner.autofit(1e-3)
early_stopping automatically enabled at patience=5
reduce_on_plateau automatically enabled at patience=2
begin training using triangular learning rate policy with max lr of 0.001...
Train for 275 steps, validate for 122 steps
Epoch 1/1024
275/275 [==============================] - 10s 38ms/step - loss: 0.3674 - accuracy: 0.8285 - val_loss: 0.2957 - val_accuracy: 0.8624
Epoch 2/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.3157 - accuracy: 0.8549 - val_loss: 0.2962 - val_accuracy: 0.8652
Epoch 3/1024
269/275 [============================>.] - ETA: 0s - loss: 0.3128 - accuracy: 0.8558
Epoch 00003: Reducing Max LR on Plateau: new max lr will be 0.0005 (if not early_stopping).
275/275 [==============================] - 10s 35ms/step - loss: 0.3127 - accuracy: 0.8559 - val_loss: 0.2994 - val_accuracy: 0.8621
Epoch 4/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.3086 - accuracy: 0.8574 - val_loss: 0.2951 - val_accuracy: 0.8652
Epoch 5/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.3078 - accuracy: 0.8586 - val_loss: 0.2953 - val_accuracy: 0.8654
Epoch 6/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.3057 - accuracy: 0.8595 - val_loss: 0.2933 - val_accuracy: 0.8659
Epoch 7/1024
275/275 [==============================] - 10s 35ms/step - loss: 0.3045 - accuracy: 0.8595 - val_loss: 0.2928 - val_accuracy: 0.8634
Epoch 8/1024
275/275 [==============================] - 10s 35ms/step - loss: 0.3033 - accuracy: 0.8605 - val_loss: 0.2927 - val_accuracy: 0.8649
Epoch 9/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.3037 - accuracy: 0.8605 - val_loss: 0.2931 - val_accuracy: 0.8624
Epoch 10/1024
269/275 [============================>.] - ETA: 0s - loss: 0.3016 - accuracy: 0.8612
Epoch 00010: Reducing Max LR on Plateau: new max lr will be 0.00025 (if not early_stopping).
275/275 [==============================] - 10s 35ms/step - loss: 0.3015 - accuracy: 0.8611 - val_loss: 0.2931 - val_accuracy: 0.8659
Epoch 11/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.2993 - accuracy: 0.8624 - val_loss: 0.2924 - val_accuracy: 0.8641
Epoch 12/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.2986 - accuracy: 0.8625 - val_loss: 0.2925 - val_accuracy: 0.8636
Epoch 13/1024
269/275 [============================>.] - ETA: 0s - loss: 0.2982 - accuracy: 0.8636
Epoch 00013: Reducing Max LR on Plateau: new max lr will be 0.000125 (if not early_stopping).
275/275 [==============================] - 10s 35ms/step - loss: 0.2982 - accuracy: 0.8636 - val_loss: 0.2926 - val_accuracy: 0.8634
Epoch 14/1024
275/275 [==============================] - 10s 35ms/step - loss: 0.2958 - accuracy: 0.8641 - val_loss: 0.2923 - val_accuracy: 0.8636
Epoch 15/1024
275/275 [==============================] - 10s 35ms/step - loss: 0.2950 - accuracy: 0.8642 - val_loss: 0.2920 - val_accuracy: 0.8654
Epoch 16/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.2944 - accuracy: 0.8645 - val_loss: 0.2938 - val_accuracy: 0.8608
Epoch 17/1024
272/275 [============================>.] - ETA: 0s - loss: 0.2940 - accuracy: 0.8640
Epoch 00017: Reducing Max LR on Plateau: new max lr will be 6.25e-05 (if not early_stopping).
275/275 [==============================] - 9s 34ms/step - loss: 0.2943 - accuracy: 0.8638 - val_loss: 0.2924 - val_accuracy: 0.8641
Epoch 18/1024
275/275 [==============================] - 10s 35ms/step - loss: 0.2929 - accuracy: 0.8651 - val_loss: 0.2919 - val_accuracy: 0.8647
Epoch 19/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.2929 - accuracy: 0.8644 - val_loss: 0.2926 - val_accuracy: 0.8649
Epoch 20/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.2924 - accuracy: 0.8651 - val_loss: 0.2917 - val_accuracy: 0.8647
Epoch 21/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.2921 - accuracy: 0.8658 - val_loss: 0.2918 - val_accuracy: 0.8652
Epoch 22/1024
274/275 [============================>.] - ETA: 0s - loss: 0.2908 - accuracy: 0.8667
Epoch 00022: Reducing Max LR on Plateau: new max lr will be 3.125e-05 (if not early_stopping).
275/275 [==============================] - 10s 35ms/step - loss: 0.2912 - accuracy: 0.8665 - val_loss: 0.2919 - val_accuracy: 0.8652
Epoch 23/1024
275/275 [==============================] - 10s 35ms/step - loss: 0.2909 - accuracy: 0.8663 - val_loss: 0.2920 - val_accuracy: 0.8652
Epoch 24/1024
272/275 [============================>.] - ETA: 0s - loss: 0.2905 - accuracy: 0.8667
Epoch 00024: Reducing Max LR on Plateau: new max lr will be 1.5625e-05 (if not early_stopping).
275/275 [==============================] - 9s 34ms/step - loss: 0.2904 - accuracy: 0.8670 - val_loss: 0.2921 - val_accuracy: 0.8649
Epoch 25/1024
269/275 [============================>.] - ETA: 0s - loss: 0.2899 - accuracy: 0.8668
Restoring model weights from the end of the best epoch.
275/275 [==============================] - 9s 34ms/step - loss: 0.2900 - accuracy: 0.8666 - val_loss: 0.2921 - val_accuracy: 0.8649
Epoch 00025: early stopping
Weights from best epoch have been loaded into model.
<tensorflow.python.keras.callbacks.History at 0x7fa0a7c361d0>
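As the log indicates, autofit trains with a triangular learning rate policy: within each cycle the learning rate ramps linearly up to the specified maximum (here 1e-3) and back down. A rough, self-contained sketch of one such cycle — illustrative only; ktrain's internal schedule (and its reduce-on-plateau halving of the max lr) differs in detail:

```python
def triangular_lr(step, max_lr, cycle_len):
    """Learning rate for one triangular cycle: 0 -> max_lr at the
    midpoint, then back down to 0 at the end of the cycle."""
    half = cycle_len / 2.0
    pos = step % cycle_len            # position within the current cycle
    if pos <= half:
        return max_lr * pos / half    # rising edge
    return max_lr * (cycle_len - pos) / half  # falling edge

# Sample the schedule over one 100-step cycle with max lr 1e-3.
lrs = [triangular_lr(s, 1e-3, 100) for s in range(100)]
```

The peak sits at the cycle midpoint, so the model spends most steps at learning rates well below the maximum you pass in.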
learner.validate(class_names=preproc.get_classes())
              precision    recall  f1-score   support

       <=50K       0.89      0.94      0.91      3013
        >50K       0.74      0.62      0.67       881

    accuracy                           0.86      3894
   macro avg       0.82      0.78      0.79      3894
weighted avg       0.86      0.86      0.86      3894
array([[2824,  189],
       [ 338,  543]])
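The headline metrics can be recovered directly from the confusion matrix above (rows are true classes, columns are predicted classes, both in the order <=50K, >50K):

```python
import numpy as np

# Confusion matrix from learner.validate() above.
cm = np.array([[2824,  189],
               [ 338,  543]])

accuracy = np.trace(cm) / cm.sum()            # correct / total
precision_gt50k = cm[1, 1] / cm[:, 1].sum()   # 543 / (189 + 543)
recall_gt50k = cm[1, 1] / cm[1, :].sum()      # 543 / 881
```

Rounded to two decimals, these reproduce the 0.86 accuracy and the 0.74 precision / 0.62 recall reported for the >50K class.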
# download test dataset
urllib.request.urlretrieve('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv',
'/tmp/test.csv')
test_df = pd.read_csv('/tmp/test.csv')
test_df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | Private | 169085 | 11th | 7 | Married-civ-spouse | Sales | Wife | White | Female | 0 | 0 | 20 | United-States | <=50K |
| 1 | 17 | Self-emp-not-inc | 226203 | 12th | 8 | Never-married | Sales | Own-child | White | Male | 0 | 0 | 45 | United-States | <=50K |
| 2 | 47 | Private | 54260 | Assoc-voc | 11 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 1887 | 60 | United-States | >50K |
| 3 | 21 | Private | 176262 | Some-college | 10 | Never-married | Exec-managerial | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
| 4 | 17 | Private | 241185 | 12th | 8 | Never-married | Prof-specialty | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
The learner.evaluate method is just an alias for learner.validate. By default, both evaluate learner.val_data, but either can accept a test set supplied as a TabularDataset. We use learner.evaluate here to compute test set metrics.
learner.evaluate(preproc.preprocess_test(test_df), class_names=preproc.get_classes())
processing test: 9769 rows x 15 columns
              precision    recall  f1-score   support

       <=50K       0.88      0.94      0.91      7451
        >50K       0.76      0.61      0.67      2318

    accuracy                           0.86      9769
   macro avg       0.82      0.77      0.79      9769
weighted avg       0.85      0.86      0.85      9769
array([[6996,  455],
       [ 914, 1404]])
Let's generate a DataFrame showing the test set predictions for each instance:
preproc.get_classes()
['<=50K', '>50K']
predictor = ktrain.get_predictor(learner.model, preproc)
preds = predictor.predict(test_df)
df = test_df.copy()
df['predicted_class'] = preds
df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class | predicted_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | Private | 169085 | 11th | 7 | Married-civ-spouse | Sales | Wife | White | Female | 0 | 0 | 20 | United-States | <=50K | <=50K |
| 1 | 17 | Self-emp-not-inc | 226203 | 12th | 8 | Never-married | Sales | Own-child | White | Male | 0 | 0 | 45 | United-States | <=50K | <=50K |
| 2 | 47 | Private | 54260 | Assoc-voc | 11 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 1887 | 60 | United-States | >50K | >50K |
| 3 | 21 | Private | 176262 | Some-college | 10 | Never-married | Exec-managerial | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K | <=50K |
| 4 | 17 | Private | 241185 | 12th | 8 | Never-married | Prof-specialty | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K | <=50K |
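Under the hood, each predicted_class label is the class with the highest predicted probability, with classes ordered as in preproc.get_classes(). A hypothetical sketch of that mapping (the probability values here are made up, not actual model outputs):

```python
import numpy as np

# Class ordering, matching preproc.get_classes() above.
classes = ['<=50K', '>50K']

# Made-up probabilities for two rows, one column per class.
probs = np.array([[0.9, 0.1],
                  [0.3, 0.7]])

# argmax over the class axis picks the predicted label for each row.
labels = [classes[i] for i in probs.argmax(axis=1)]
```

If you want the probabilities themselves rather than labels, the predictor exposes them as well (e.g. via a return_proba-style option on predict), which is useful for thresholding at something other than 0.5.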