%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import urllib.request
import pandas as pd
import numpy as np
import ktrain
from ktrain import tabular
In this notebook, we use Census data to predict which individuals earn more than $50K per year. This is the same dataset used in the AutoGluon tabular prediction example.
The original dataset is available from the UCI Machine Learning Repository, but we will download it from the AutoGluon website.
# training set
urllib.request.urlretrieve('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv',
'/tmp/train.csv')
('/tmp/train.csv', <http.client.HTTPMessage at 0x7fa0a8a16eb8>)
trn, val, preproc = tabular.tabular_from_csv('/tmp/train.csv', label_columns='class', random_state=42)
processing train: 35179 rows x 15 columns
The following integer column(s) are being treated as categorical variables: ['education-num']
To treat any of these column(s) as numerical, cast the column to float in DataFrame or CSV and re-run tabular_from* function.
processing test: 3894 rows x 15 columns
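As the message above notes, integer columns such as education-num are treated as categorical unless cast to float. A minimal sketch of that cast, using a tiny made-up DataFrame rather than the actual Census file:

```python
import pandas as pd

# Toy stand-in for the Census data; only the dtype matters here.
df = pd.DataFrame({'education-num': [7, 8, 11], 'age': [31, 17, 47]})

# Cast the integer column to float so the tabular_from* functions
# would treat it as numerical rather than categorical.
df['education-num'] = df['education-num'].astype('float64')
```

After the cast, re-running the tabular_from* function on the modified DataFrame (or a CSV written from it) would pick the column up as numerical.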
Learner

model = tabular.tabular_classifier('mlp', trn)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)
Is Multi-Label? False
done.
learner.lr_find(show_plot=True)
simulating training for different learning rates... this may take a few moments...
Train for 274 steps
Epoch 1/1024
274/274 [==============================] - 8s 28ms/step - loss: 0.7151 - accuracy: 0.3431
Epoch 2/1024
274/274 [==============================] - 7s 25ms/step - loss: 0.6359 - accuracy: 0.6889
Epoch 3/1024
274/274 [==============================] - 7s 25ms/step - loss: 0.4145 - accuracy: 0.8113
Epoch 4/1024
274/274 [==============================] - 7s 25ms/step - loss: 0.3268 - accuracy: 0.8486
Epoch 5/1024
274/274 [==============================] - 7s 25ms/step - loss: 0.6269 - accuracy: 0.7968
Epoch 6/1024
274/274 [==============================] - 7s 25ms/step - loss: 0.5543 - accuracy: 0.7589
Epoch 7/1024
50/274 [====>.........................] - ETA: 7s - loss: 47426.1304 - accuracy: 0.7517
done.
Visually inspect loss plot and select learning rate associated with falling loss
learner.autofit(1e-3)
early_stopping automatically enabled at patience=5
reduce_on_plateau automatically enabled at patience=2
begin training using triangular learning rate policy with max lr of 0.001...
Train for 275 steps, validate for 122 steps
Epoch 1/1024
275/275 [==============================] - 10s 38ms/step - loss: 0.3674 - accuracy: 0.8285 - val_loss: 0.2957 - val_accuracy: 0.8624
Epoch 2/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.3157 - accuracy: 0.8549 - val_loss: 0.2962 - val_accuracy: 0.8652
Epoch 3/1024
269/275 [============================>.] - ETA: 0s - loss: 0.3128 - accuracy: 0.8558
Epoch 00003: Reducing Max LR on Plateau: new max lr will be 0.0005 (if not early_stopping).
275/275 [==============================] - 10s 35ms/step - loss: 0.3127 - accuracy: 0.8559 - val_loss: 0.2994 - val_accuracy: 0.8621
Epoch 4/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.3086 - accuracy: 0.8574 - val_loss: 0.2951 - val_accuracy: 0.8652
Epoch 5/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.3078 - accuracy: 0.8586 - val_loss: 0.2953 - val_accuracy: 0.8654
Epoch 6/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.3057 - accuracy: 0.8595 - val_loss: 0.2933 - val_accuracy: 0.8659
Epoch 7/1024
275/275 [==============================] - 10s 35ms/step - loss: 0.3045 - accuracy: 0.8595 - val_loss: 0.2928 - val_accuracy: 0.8634
Epoch 8/1024
275/275 [==============================] - 10s 35ms/step - loss: 0.3033 - accuracy: 0.8605 - val_loss: 0.2927 - val_accuracy: 0.8649
Epoch 9/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.3037 - accuracy: 0.8605 - val_loss: 0.2931 - val_accuracy: 0.8624
Epoch 10/1024
269/275 [============================>.] - ETA: 0s - loss: 0.3016 - accuracy: 0.8612
Epoch 00010: Reducing Max LR on Plateau: new max lr will be 0.00025 (if not early_stopping).
275/275 [==============================] - 10s 35ms/step - loss: 0.3015 - accuracy: 0.8611 - val_loss: 0.2931 - val_accuracy: 0.8659
Epoch 11/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.2993 - accuracy: 0.8624 - val_loss: 0.2924 - val_accuracy: 0.8641
Epoch 12/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.2986 - accuracy: 0.8625 - val_loss: 0.2925 - val_accuracy: 0.8636
Epoch 13/1024
269/275 [============================>.] - ETA: 0s - loss: 0.2982 - accuracy: 0.8636
Epoch 00013: Reducing Max LR on Plateau: new max lr will be 0.000125 (if not early_stopping).
275/275 [==============================] - 10s 35ms/step - loss: 0.2982 - accuracy: 0.8636 - val_loss: 0.2926 - val_accuracy: 0.8634
Epoch 14/1024
275/275 [==============================] - 10s 35ms/step - loss: 0.2958 - accuracy: 0.8641 - val_loss: 0.2923 - val_accuracy: 0.8636
Epoch 15/1024
275/275 [==============================] - 10s 35ms/step - loss: 0.2950 - accuracy: 0.8642 - val_loss: 0.2920 - val_accuracy: 0.8654
Epoch 16/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.2944 - accuracy: 0.8645 - val_loss: 0.2938 - val_accuracy: 0.8608
Epoch 17/1024
272/275 [============================>.] - ETA: 0s - loss: 0.2940 - accuracy: 0.8640
Epoch 00017: Reducing Max LR on Plateau: new max lr will be 6.25e-05 (if not early_stopping).
275/275 [==============================] - 9s 34ms/step - loss: 0.2943 - accuracy: 0.8638 - val_loss: 0.2924 - val_accuracy: 0.8641
Epoch 18/1024
275/275 [==============================] - 10s 35ms/step - loss: 0.2929 - accuracy: 0.8651 - val_loss: 0.2919 - val_accuracy: 0.8647
Epoch 19/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.2929 - accuracy: 0.8644 - val_loss: 0.2926 - val_accuracy: 0.8649
Epoch 20/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.2924 - accuracy: 0.8651 - val_loss: 0.2917 - val_accuracy: 0.8647
Epoch 21/1024
275/275 [==============================] - 9s 34ms/step - loss: 0.2921 - accuracy: 0.8658 - val_loss: 0.2918 - val_accuracy: 0.8652
Epoch 22/1024
274/275 [============================>.] - ETA: 0s - loss: 0.2908 - accuracy: 0.8667
Epoch 00022: Reducing Max LR on Plateau: new max lr will be 3.125e-05 (if not early_stopping).
275/275 [==============================] - 10s 35ms/step - loss: 0.2912 - accuracy: 0.8665 - val_loss: 0.2919 - val_accuracy: 0.8652
Epoch 23/1024
275/275 [==============================] - 10s 35ms/step - loss: 0.2909 - accuracy: 0.8663 - val_loss: 0.2920 - val_accuracy: 0.8652
Epoch 24/1024
272/275 [============================>.] - ETA: 0s - loss: 0.2905 - accuracy: 0.8667
Epoch 00024: Reducing Max LR on Plateau: new max lr will be 1.5625e-05 (if not early_stopping).
275/275 [==============================] - 9s 34ms/step - loss: 0.2904 - accuracy: 0.8670 - val_loss: 0.2921 - val_accuracy: 0.8649
Epoch 25/1024
269/275 [============================>.] - ETA: 0s - loss: 0.2899 - accuracy: 0.8668
Restoring model weights from the end of the best epoch.
275/275 [==============================] - 9s 34ms/step - loss: 0.2900 - accuracy: 0.8666 - val_loss: 0.2921 - val_accuracy: 0.8649
Epoch 00025: early stopping
Weights from best epoch have been loaded into model.
<tensorflow.python.keras.callbacks.History at 0x7fa0a7c361d0>
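As the log indicates, autofit trains with a triangular learning rate policy: within each cycle the learning rate ramps linearly up to the specified maximum (here 1e-3) and back down. A rough, self-contained sketch of one such cycle — illustrative only; ktrain's internal schedule (and its reduce-on-plateau halving of the max lr) differs in detail:

```python
def triangular_lr(step, max_lr, cycle_len):
    """Learning rate for one triangular cycle: 0 -> max_lr at the
    midpoint, then back down to 0 at the end of the cycle."""
    half = cycle_len / 2.0
    pos = step % cycle_len            # position within the current cycle
    if pos <= half:
        return max_lr * pos / half    # rising edge
    return max_lr * (cycle_len - pos) / half  # falling edge

# Sample the schedule over one 100-step cycle with max lr 1e-3.
lrs = [triangular_lr(s, 1e-3, 100) for s in range(100)]
```

The peak sits at the cycle midpoint, so the model spends most steps at learning rates well below the maximum you pass in.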
learner.validate(class_names=preproc.get_classes())
              precision    recall  f1-score   support

       <=50K       0.89      0.94      0.91      3013
        >50K       0.74      0.62      0.67       881

    accuracy                           0.86      3894
   macro avg       0.82      0.78      0.79      3894
weighted avg       0.86      0.86      0.86      3894
array([[2824,  189],
       [ 338,  543]])
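The headline metrics can be recovered directly from the confusion matrix above (rows are true classes, columns are predicted classes, both in the order <=50K, >50K):

```python
import numpy as np

# Confusion matrix from learner.validate() above.
cm = np.array([[2824,  189],
               [ 338,  543]])

accuracy = np.trace(cm) / cm.sum()            # correct / total
precision_gt50k = cm[1, 1] / cm[:, 1].sum()   # 543 / (189 + 543)
recall_gt50k = cm[1, 1] / cm[1, :].sum()      # 543 / 881
```

Rounded to two decimals, these reproduce the 0.86 accuracy and the 0.74 precision / 0.62 recall reported for the >50K class.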
# download test dataset
urllib.request.urlretrieve('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv',
'/tmp/test.csv')
test_df = pd.read_csv('/tmp/test.csv')
test_df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | Private | 169085 | 11th | 7 | Married-civ-spouse | Sales | Wife | White | Female | 0 | 0 | 20 | United-States | <=50K |
| 1 | 17 | Self-emp-not-inc | 226203 | 12th | 8 | Never-married | Sales | Own-child | White | Male | 0 | 0 | 45 | United-States | <=50K |
| 2 | 47 | Private | 54260 | Assoc-voc | 11 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 1887 | 60 | United-States | >50K |
| 3 | 21 | Private | 176262 | Some-college | 10 | Never-married | Exec-managerial | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
| 4 | 17 | Private | 241185 | 12th | 8 | Never-married | Prof-specialty | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
The learner.evaluate method is just an alias for learner.validate. By default, both evaluate learner.val_data, but either can accept a test set supplied as a TabularDataset. We use learner.evaluate here to compute test set metrics.
learner.evaluate(preproc.preprocess_test(test_df), class_names=preproc.get_classes())
processing test: 9769 rows x 15 columns
              precision    recall  f1-score   support

       <=50K       0.88      0.94      0.91      7451
        >50K       0.76      0.61      0.67      2318

    accuracy                           0.86      9769
   macro avg       0.82      0.77      0.79      9769
weighted avg       0.85      0.86      0.85      9769
array([[6996,  455],
       [ 914, 1404]])
Let's generate a DataFrame showing the test set predictions for each instance:
preproc.get_classes()
['<=50K', '>50K']
predictor = ktrain.get_predictor(learner.model, preproc)
preds = predictor.predict(test_df)
df = test_df.copy()
df['predicted_class'] = preds
df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class | predicted_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | Private | 169085 | 11th | 7 | Married-civ-spouse | Sales | Wife | White | Female | 0 | 0 | 20 | United-States | <=50K | <=50K |
| 1 | 17 | Self-emp-not-inc | 226203 | 12th | 8 | Never-married | Sales | Own-child | White | Male | 0 | 0 | 45 | United-States | <=50K | <=50K |
| 2 | 47 | Private | 54260 | Assoc-voc | 11 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 1887 | 60 | United-States | >50K | >50K |
| 3 | 21 | Private | 176262 | Some-college | 10 | Never-married | Exec-managerial | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K | <=50K |
| 4 | 17 | Private | 241185 | 12th | 8 | Never-married | Prof-specialty | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K | <=50K |
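Under the hood, each predicted_class label is the class with the highest predicted probability, with classes ordered as in preproc.get_classes(). A hypothetical sketch of that mapping (the probability values here are made up, not actual model outputs):

```python
import numpy as np

# Class ordering, matching preproc.get_classes() above.
classes = ['<=50K', '>50K']

# Made-up probabilities for two rows, one column per class.
probs = np.array([[0.9, 0.1],
                  [0.3, 0.7]])

# argmax over the class axis picks the predicted label for each row.
labels = [classes[i] for i in probs.argmax(axis=1)]
```

If you want the probabilities themselves rather than labels, the predictor exposes them as well (e.g. via a return_proba-style option on predict), which is useful for thresholding at something other than 0.5.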