%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"]="0";
import ktrain
from ktrain import text
Using TensorFlow backend.
Here, we will classify Wikipedia comments into one or more categories of so-called toxic comments. Categories of toxic online behavior include toxic, severe_toxic, obscene, threat, insult, and identity_hate. The dataset can be downloaded from the Kaggle Toxic Comment Classification Challenge as a CSV file (i.e., download the file train.csv). We will load the data using the texts_from_csv method, which assumes the label_columns are already one-hot-encoded in the spreadsheet. Since val_filepath is None, 10% of the data will automatically be used as a validation set.
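For reference, each row of train.csv contains an id, the raw comment_text, and a 0/1 column for each of the six labels. A quick way to confirm that layout (using pandas, which is not otherwise required by this notebook) is:

import pandas as pd

# Peek at the column layout of the Kaggle file; the label columns are
# already one-hot-encoded (0/1), which is what texts_from_csv expects.
print(pd.read_csv('data/toxic-comments/train.csv', nrows=5).columns.tolist())
# ['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']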
DATA_PATH = 'data/toxic-comments/train.csv'
NUM_WORDS = 50000
MAXLEN = 150
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_csv(
    DATA_PATH,
    'comment_text',
    label_columns=["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"],
    val_filepath=None,  # if None, 10% of data will be used for validation
    max_features=NUM_WORDS,
    maxlen=MAXLEN,
    ngram_range=1)
Word Counts: 196995
Nrows: 143613
143613 train sequences
Average train sequence length: 66
15958 test sequences
Average test sequence length: 66
Pad sequences (samples x time)
x_train shape: (143613,150)
x_test shape: (15958,150)
y_train shape: (143613,6)
y_test shape: (15958,6)
model = text.text_classifier('fasttext', (x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))
Is Multi-Label? True
compiling word ID features...
max_features is 49350
done.
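Because each comment can belong to several categories at once, ktrain reports Is Multi-Label? True and configures the classifier accordingly. As a sanity check you can inspect the compiled model with plain Keras attribute lookups (this is not a ktrain API, and the exact layer layout may vary by ktrain version):

# A multi-label classifier should use binary crossentropy with a sigmoid output.
print(model.loss)                            # expected: 'binary_crossentropy'
print(model.layers[-1].activation.__name__)  # expected: 'sigmoid'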
learner.lr_find()
learner.lr_plot()
simulating training for different learning rates... this may take a few moments...
Epoch 1/5
 47840/143613 [========>.....................] - ETA: 31s - loss: 0.4965 - acc: 0.7510
done. Please invoke the Learner.lr_plot() method to visually inspect the loss plot to help identify the maximal learning rate associated with falling loss.
learner.autofit(0.001)
early_stopping automatically enabled at patience=5
reduce_on_plateau automatically enabled at patience=2
begin training using triangular learning rate policy with max lr of 0.001...
Train on 143613 samples, validate on 15958 samples
Epoch 1/1024
143613/143613 [==============================] - 51s 356us/step - loss: 0.1358 - acc: 0.9530 - val_loss: 0.0536 - val_acc: 0.9817
Epoch 2/1024
143613/143613 [==============================] - 51s 355us/step - loss: 0.0643 - acc: 0.9784 - val_loss: 0.0504 - val_acc: 0.9826
Epoch 3/1024
143613/143613 [==============================] - 51s 356us/step - loss: 0.0577 - acc: 0.9797 - val_loss: 0.0483 - val_acc: 0.9831
Epoch 4/1024
143613/143613 [==============================] - 51s 352us/step - loss: 0.0540 - acc: 0.9806 - val_loss: 0.0475 - val_acc: 0.9830
Epoch 5/1024
143613/143613 [==============================] - 51s 355us/step - loss: 0.0520 - acc: 0.9811 - val_loss: 0.0471 - val_acc: 0.9832
Epoch 6/1024
143613/143613 [==============================] - 51s 355us/step - loss: 0.0500 - acc: 0.9818 - val_loss: 0.0469 - val_acc: 0.9833
Epoch 7/1024
143613/143613 [==============================] - 51s 353us/step - loss: 0.0484 - acc: 0.9820 - val_loss: 0.0466 - val_acc: 0.9832
Epoch 8/1024
143613/143613 [==============================] - 51s 358us/step - loss: 0.0475 - acc: 0.9823 - val_loss: 0.0470 - val_acc: 0.9830
Epoch 9/1024
143613/143613 [==============================] - 52s 360us/step - loss: 0.0465 - acc: 0.9826 - val_loss: 0.0470 - val_acc: 0.9831
Epoch 00009: Reducing Max LR on Plateau: new max lr will be 0.0005 (if not early_stopping).
Epoch 10/1024
143613/143613 [==============================] - 52s 359us/step - loss: 0.0441 - acc: 0.9832 - val_loss: 0.0473 - val_acc: 0.9830
Epoch 11/1024
143613/143613 [==============================] - 52s 359us/step - loss: 0.0432 - acc: 0.9835 - val_loss: 0.0474 - val_acc: 0.9831
Epoch 00011: Reducing Max LR on Plateau: new max lr will be 0.00025 (if not early_stopping).
Epoch 12/1024
143613/143613 [==============================] - 51s 357us/step - loss: 0.0420 - acc: 0.9838 - val_loss: 0.0477 - val_acc: 0.9830
Restoring model weights from the end of the best epoch
Epoch 00012: early stopping
Weights from best epoch have been loaded into model.
<keras.callbacks.History at 0x7fb5e7094978>
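With training finished, the model can be wrapped in a predictor to classify raw comment text. The snippet below is a minimal sketch using ktrain's get_predictor; the example comment string and the save path are made up for illustration.

# Wrap the trained model together with the preprocessor so raw strings can be classified.
predictor = ktrain.get_predictor(learner.model, preproc)

# Illustrative input; for a multi-label problem, predict() returns the
# predicted label(s) for each input text.
predictor.predict(['You are a wonderful and thoughtful person.'])

# Optionally persist the predictor for later reuse (path is arbitrary).
predictor.save('/tmp/toxic_predictor')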