%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
We will begin by importing some required modules for performing text classification in ktrain.
import ktrain
from ktrain import text
Next, we will load and preprocess the text data for training and validation. ktrain can load texts and associated labels from a variety of sources:

- texts_from_folder: labels are represented as subfolders containing text files
- texts_from_csv: texts and associated labels are stored in columns in a CSV file
- texts_from_df: texts and associated labels are stored in columns in a pandas DataFrame
- texts_from_array: texts and labels are loaded and preprocessed from an array

For texts_from_csv and texts_from_df, labels can either be multi- or one-hot-encoded with one column per class, or can be a single column storing integers or strings like this:
# my_training_data.csv
TEXT,LABEL
I like this movie,positive
I hate this movie,negative
For texts_from_array
, the labels are arrays in one of the following forms:
# string labels
y_train = ['negative', 'positive']
# integer labels
y_train = [0, 1] # indices must start from 0
# multi or one-hot encoded labels (used for multi-label problems)
y_train = [[1,0], [0,1]]
In the latter two cases, you must supply a class_names argument to texts_from_array, which tells ktrain how indices map to class names. In this case, class_names=['negative', 'positive'] because 0=negative and 1=positive.
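As a plain-Python illustration (not ktrain code), here is how string or integer labels map onto the one-hot rows shown above, given a class_names list:

```python
# Illustration only: how string or integer labels map to one-hot rows,
# given a class_names list such as ['negative', 'positive'].
class_names = ['negative', 'positive']

def to_one_hot(labels, class_names):
    """Convert string or integer labels to one-hot-encoded rows."""
    rows = []
    for label in labels:
        # strings are looked up in class_names; integers are used directly
        idx = label if isinstance(label, int) else class_names.index(label)
        row = [0] * len(class_names)
        row[idx] = 1
        rows.append(row)
    return rows

print(to_one_hot(['negative', 'positive'], class_names))  # [[1, 0], [0, 1]]
print(to_one_hot([0, 1], class_names))                    # [[1, 0], [0, 1]]
```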
Sample arrays for texts_from_array
might look like this:
x_train = ['I hate this movie.', 'I like this movie.']
y_train = ['negative', 'positive']
x_test = ['I despise this movie.', 'I love this movie.']
y_test = ['negative', 'positive']
All of the above methods transform the texts into a sequence of word IDs in one way or another, as expected by neural network models.
In this first example problem, we use the texts_from_folder
function to load documents as fixed-length sequences of word IDs from a folder of raw documents. This function assumes a directory structure like the following:
├── datadir
│ ├── train
│ │ ├── class0 # folder containing documents of class 0
│ │ ├── class1 # folder containing documents of class 1
│ │ ├── class2 # folder containing documents of class 2
│ │ └── classN # folder containing documents of class N
│ └── test
│ ├── class0 # folder containing documents of class 0
│ ├── class1 # folder containing documents of class 1
│ ├── class2 # folder containing documents of class 2
│ └── classN # folder containing documents of class N
Each subfolder will contain documents in plain text format (e.g., .txt
files) pertaining to the class represented by the subfolder.
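The layout above can be sketched with the standard library. This toy snippet (paths and class names are illustrative, not the actual IMDb dataset) creates the expected structure and recovers the class names from the subfolder names, as texts_from_folder does:

```python
# Sketch: create the folder layout texts_from_folder expects, then
# discover the class names from the subfolders of train/.
import os
import tempfile

datadir = tempfile.mkdtemp()
for split in ('train', 'test'):
    for cls in ('pos', 'neg'):
        os.makedirs(os.path.join(datadir, split, cls))

# write one tiny plain-text document into one class folder
with open(os.path.join(datadir, 'train', 'pos', 'doc1.txt'), 'w') as f:
    f.write('I liked this movie.')

classes = sorted(os.listdir(os.path.join(datadir, 'train')))
print(classes)  # ['neg', 'pos']
```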
For our text classification example, we will again classify IMDb movie reviews as either positive or negative. However, instead of using the pre-processed version of the dataset pre-packaged with Keras, we will use the original (or raw) aclImdb dataset. The dataset can be downloaded from here. Set the DATADIR
variable to the location of the extracted aclImdb folder.
In the cell below, note that we supplied preprocess_mode='standard'
to the data-loading function (which is the default). For pretrained models like BERT and DistilBERT, the dataset must be preprocessed in a specific way. If you are planning to use BERT for text classification, you should replace this argument with preprocess_mode='bert'
. Since we will not be using BERT in this example, we leave it as preprocess_mode='standard'
. See this notebook for an example of how to use BERT for text classification in ktrain. There is also a DistilBERT example notebook.
NOTE: If using preprocess_mode='bert'
or preprocess_mode='distilbert'
, an English pretrained model is used for English, a Chinese pretrained model is used for Chinese, and a multilingual pretrained model is used for all other languages. For more flexibility in choosing the model used, you can use the alternative Transformer API for text classification in ktrain.
# load training and validation data from a folder
DATADIR = 'data/aclImdb'
trn, val, preproc = text.texts_from_folder(DATADIR,
max_features=80000, maxlen=2000,
ngram_range=3,
preprocess_mode='standard',
classes=['pos', 'neg'])
detected encoding: utf-8
language: en
Word Counts: 88582
Nrows: 25000
25000 train sequences
train sequence lengths:
    mean : 237
    95percentile : 608
    99percentile : 923
Adding 3-gram features
max_features changed to 5151281 with addition of ngrams
Average train sequence length with ngrams: 709
train (w/ngrams) sequence lengths:
    mean : 709
    95percentile : 1821
    99percentile : 2766
x_train shape: (25000,2000)
y_train shape: (25000, 2)
Is Multi-Label? False
25000 test sequences
test sequence lengths:
    mean : 230
    95percentile : 584
    99percentile : 900
Average test sequence length with ngrams: 523
test (w/ngrams) sequence lengths:
    mean : 524
    95percentile : 1295
    99percentile : 1971
x_test shape: (25000,2000)
y_test shape: (25000, 2)
Having loaded the data, we will now create a text classification model. The print_text_classifiers function prints the available models. The model selected should be consistent with the preprocess_mode selected above.
(As mentioned above, one can also use the alternative Transformer
API for text classification in ktrain to access an even larger library of Hugging Face Transformer models like RoBERTa and XLNet. See this tutorial for more information on this.)
In this example, the text_classifier
function will return a neural implementation of NBSVM, which is a strong baseline that can outperform more complex neural architectures. It may take a few moments to return as it builds a document-term matrix from the input data we provide it. The text_classifier
function expects trn
to be a preprocessed training set returned from the texts_from*
function above. In this case where we have used preprocess_mode='standard'
, trn
is a numpy array with each document represented as fixed-size sequence of word IDs.
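To give a feel for what the 'nbsvm' model is built on, here is a toy, pure-Python sketch of the log-count ratios at the heart of NBSVM (this is not ktrain's actual implementation; the vocabulary and documents are made up):

```python
# Toy sketch of NBSVM-style log-count ratios: for each vocabulary term,
# compare its smoothed frequency in positive vs. negative documents.
import math

vocab = ['good', 'bad', 'movie']
pos_docs = [[1, 0, 1], [1, 0, 1]]   # binarized term presence per document
neg_docs = [[0, 1, 1], [0, 1, 1]]
alpha = 1.0                          # additive smoothing

p = [alpha + sum(d[i] for d in pos_docs) for i in range(len(vocab))]
q = [alpha + sum(d[i] for d in neg_docs) for i in range(len(vocab))]
p_norm, q_norm = sum(p), sum(q)
r = [math.log((p[i] / p_norm) / (q[i] / q_norm)) for i in range(len(vocab))]

# 'good' gets a positive ratio, 'bad' a negative one, 'movie' is near zero
print({w: round(x, 3) for w, x in zip(vocab, r)})
```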
text.print_text_classifiers()
fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]
logreg: logistic regression using a trainable Embedding layer
nbsvm: NBSVM model [http://www.aclweb.org/anthology/P12-2018]
bigru: Bidirectional GRU with pretrained fasttext word vectors [https://fasttext.cc/docs/en/crawl-vectors.html]
standard_gru: simple 2-layer GRU with randomly initialized embeddings
bert: Bidirectional Encoder Representations from Transformers (BERT) [https://arxiv.org/abs/1810.04805]
distilbert: distilled, smaller, and faster BERT from Hugging Face [https://arxiv.org/abs/1910.01108]
# load an NBSVM model
model = text.text_classifier('nbsvm', trn, preproc=preproc)
Is Multi-Label? False
compiling word ID features...
building document-term matrix... this may take a few moments...
rows: 1-10000
rows: 10001-20000
rows: 20001-25000
computing log-count ratios...
done.
Next, we instantiate a Learner object and call the lr_find
and lr_plot
methods to help identify a good learning rate.
learner = ktrain.get_learner(model, train_data=trn, val_data=val)
learner.lr_find()
simulating training for different learning rates... this may take a few moments...
Epoch 1/5
25000/25000 [==============================] - 6s 226us/step - loss: 0.6906 - acc: 0.5797
Epoch 2/5
25000/25000 [==============================] - 5s 206us/step - loss: 0.6071 - acc: 0.9114
Epoch 3/5
25000/25000 [==============================] - 5s 205us/step - loss: 0.2151 - acc: 0.9711
Epoch 4/5
16032/25000 [==================>...........] - ETA: 1s - loss: 0.0252 - acc: 0.9943
done.
Please invoke the Learner.lr_plot() method to visually inspect the loss plot to help identify the maximal learning rate associated with falling loss.
learner.lr_plot()
Finally, we will fit our model using an SGDR learning rate schedule by invoking the fit method with the cycle_len parameter (along with the cycle_mult parameter).
learner.fit(0.001, 3, cycle_len=1, cycle_mult=2)
Train on 25000 samples, validate on 25000 samples
Epoch 1/7
25000/25000 [==============================] - 7s 263us/step - loss: 0.2105 - acc: 0.9461 - val_loss: 0.2481 - val_acc: 0.9187
Epoch 2/7
25000/25000 [==============================] - 7s 261us/step - loss: 0.0458 - acc: 0.9936 - val_loss: 0.2266 - val_acc: 0.9218
Epoch 3/7
25000/25000 [==============================] - 6s 257us/step - loss: 0.0082 - acc: 0.9999 - val_loss: 0.2236 - val_acc: 0.9228
Epoch 4/7
25000/25000 [==============================] - 6s 256us/step - loss: 0.0069 - acc: 0.9999 - val_loss: 0.2169 - val_acc: 0.9227
Epoch 5/7
25000/25000 [==============================] - 6s 259us/step - loss: 0.0029 - acc: 1.0000 - val_loss: 0.2148 - val_acc: 0.9227
Epoch 6/7
25000/25000 [==============================] - 7s 261us/step - loss: 0.0020 - acc: 1.0000 - val_loss: 0.2142 - val_acc: 0.9228
Epoch 7/7
25000/25000 [==============================] - 6s 255us/step - loss: 0.0017 - acc: 1.0000 - val_loss: 0.2141 - val_acc: 0.9227
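The fit call above runs for 7 epochs (note Epoch 1/7 in the log) because cycle_mult=2 doubles the length of each successive SGDR cycle. A quick check of the arithmetic:

```python
# With 3 cycles, cycle_len=1, and cycle_mult=2, the cycles last
# 1, 2, and 4 epochs, for 7 epochs in total.
def total_epochs(n_cycles, cycle_len, cycle_mult):
    return sum(cycle_len * cycle_mult ** i for i in range(n_cycles))

print(total_epochs(3, 1, 2))  # 7
```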
<keras.callbacks.History at 0x7f382b6819e8>
Let's predict the sentiment of new movie reviews (or comments in this case) using our trained model.
The preproc
object (returned by texts_from_folder
) is important here, as it is used to preprocess data in a way our model expects.
predictor = ktrain.get_predictor(learner.model, preproc)
data = [ 'This movie was horrible! The plot was boring. Acting was okay, though.',
'The film really sucked. I want my money back.',
'What a beautiful romantic comedy. 10/10 would see again!']
predictor.predict(data)
['neg', 'neg', 'pos']
As can be seen, our model returns predictions that appear to be correct. The predictor instance can also be used to return "probabilities" of our predictions with respect to each class. Let us first print the classes and their order. The class pos stands for positive sentiment and neg stands for negative sentiment. Then, we will re-run predictor.predict
with return_proba=True to see the probabilities.
predictor.get_classes()
['neg', 'pos']
predictor.predict(data, return_proba=True)
array([[0.81179327, 0.18820675],
       [0.7463994 , 0.25360066],
       [0.26558533, 0.7344147 ]], dtype=float32)
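As a sanity check on the output above, each row of probabilities sums to (approximately) 1, and the argmax of each row recovers the predicted class labels:

```python
# Sanity check on the probabilities printed above: rows sum to ~1,
# and the argmax of each row matches the predicted class label.
probs = [[0.81179327, 0.18820675],
         [0.7463994, 0.25360066],
         [0.26558533, 0.7344147]]
classes = ['neg', 'pos']

assert all(abs(sum(row) - 1.0) < 1e-4 for row in probs)
labels = [classes[row.index(max(row))] for row in probs]
print(labels)  # ['neg', 'neg', 'pos']
```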
For text classifiers, there is also predictor.predict_proba, which simply calls predict with return_proba=True.
Our movie review sentiment predictor can be saved to disk and reloaded/re-used later as part of an application. This is illustrated below:
predictor.save('/tmp/my_moviereview_predictor')
predictor = ktrain.load_predictor('/tmp/my_moviereview_predictor')
predictor.predict(['Groundhog Day is my favorite movie of all time!'])
['pos']
In the previous example, the classes (or categories) were mutually exclusive. By contrast, in multi-label text classification, a document or text snippet can belong to multiple classes. Here, we will classify Wikipedia comments into one or more categories of so-called toxic comments. Categories of toxic online behavior include toxic, severe_toxic, obscene, threat, insult, and identity_hate. The dataset can be downloaded from the Kaggle Toxic Comment Classification Challenge as a CSV file (i.e., download the file train.csv
). We will load the data using the texts_from_csv
function. This function expects one column to contain the texts of documents and one or more other columns to store the labels. Labels can be in any of the following formats:
1. one-hot-encoded arrays representing classes, which have a single 1 in each row:
Binary Classification (two classes):
text|positive|negative
I like this movie.|1|0
I hated this movie.|0|1
Multiclass Classification (more than two classes):
text|negative|neutral|positive
I hated this movie.|1|0|0 # negative
I loved this movie.|0|0|1 # positive
I saw the movie.|0|1|0 # neutral
2. multi-hot-encoded arrays representing classes, used for multi-label classification, which have one or more 1s in each row:
text|politics|television|sports
I will vote in 2020.|1|0|0 # politics
I watched the debate on CNN.|1|1|0 # politics and television
Did you watch the game on ESPN?|0|1|1 # sports and television
I play basketball.|0|0|1 # sports
3. a single column of string or integer values representing class labels
Example with label_columns=['label'] and text_column='text':
text|label
I like this movie.|positive
I hated this movie.|negative
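The multi-hot format in (2) can be sketched in plain Python. This toy snippet (the tags are illustrative) builds the rows shown in the multi-label table above:

```python
# Sketch: building multi-hot rows from tag sets, mirroring the
# politics/television/sports table above.
label_columns = ['politics', 'television', 'sports']
tags_per_text = [{'politics'},
                 {'politics', 'television'},
                 {'sports', 'television'},
                 {'sports'}]

rows = [[1 if col in tags else 0 for col in label_columns]
        for tags in tags_per_text]
print(rows)  # [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
```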
Since the Toxic Comment Classification Challenge is a multi-label problem, we must use the second format, where labels are already multi-hot-encoded. Luckily, the train.csv
file for this problem is already multi-hot-encoded, so no extra processing is required.
Since val_filepath is None
, 10% of the data will automatically be used as a validation set.
DATA_PATH = 'data/toxic-comments/train.csv'
NUM_WORDS = 50000
MAXLEN = 150
trn, val, preproc = text.texts_from_csv(DATA_PATH,
'comment_text',
label_columns = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"],
val_filepath=None, # if None, 10% of data will be used for validation
max_features=NUM_WORDS, maxlen=MAXLEN,
ngram_range=1)
Word Counts: 197340
Nrows: 143613
143613 train sequences
Average train sequence length: 66
15958 test sequences
Average test sequence length: 66
Pad sequences (samples x time)
x_train shape: (143613,150)
x_test shape: (15958,150)
y_train shape: (143613,6)
y_test shape: (15958,6)
Next, as before, we load a text classification model and wrap the model and data in a Learner object. Instead of using the NBSVM model, we will explicitly request a different model called fasttext using the name parameter of text_classifier. The fastText architecture was created by Facebook in 2016. (You can call print_text_classifiers to show the available text classification models.)
text.print_text_classifiers()
nbsvm: NBSVM model (http://www.aclweb.org/anthology/P12-2018)
fasttext: a fastText-like model (http://arxiv.org/pdf/1607.01759.pdf)
logreg: logistic regression
model = text.text_classifier('fasttext', trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val)
Is Multi-Label? True compiling word ID features... done.
As before, we use our learning rate finder to find a good learning rate. In this case, a learning rate of 0.0007 appears to be good.
learner.lr_find()
simulating training for different learning rates... this may take a few moments...
Epoch 1/5
143613/143613 [==============================] - 47s 325us/step - loss: 0.7361 - acc: 0.5322
Epoch 2/5
143613/143613 [==============================] - 46s 323us/step - loss: 0.4683 - acc: 0.7714
Epoch 3/5
143613/143613 [==============================] - 46s 323us/step - loss: 0.0879 - acc: 0.9729
Epoch 4/5
143613/143613 [==============================] - 46s 323us/step - loss: 0.1106 - acc: 0.9686
Epoch 5/5
143613/143613 [==============================] - 46s 323us/step - loss: 0.1636 - acc: 0.9629
done.
Please invoke the Learner.lr_plot() method to visually inspect the loss plot to help identify the maximal learning rate associated with falling loss.
learner.lr_plot()
Finally, we will train our model for 8 epochs using autofit
with a learning rate of 0.0007. Having explicitly specified the number of epochs, autofit
will automatically employ a triangular learning rate policy. Our final ROC-AUC score is 0.98.
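For intuition, a triangular learning rate policy ramps the learning rate linearly up to the maximum and back down. This toy sketch (not ktrain's internal implementation) illustrates the shape of the schedule:

```python
# Toy sketch of a triangular learning rate schedule: rise linearly to
# max_lr at the midpoint, then fall linearly back to zero.
def triangular_lr(step, total_steps, max_lr):
    half = total_steps / 2
    if step <= half:
        return max_lr * step / half
    return max_lr * (total_steps - step) / half

schedule = [round(triangular_lr(s, 8, 0.0007), 5) for s in range(9)]
print(schedule)  # peaks at 0.0007 in the middle, 0.0 at both ends
```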
As shown in this example notebook on our GitHub project, even better results can be obtained using a Bidirectional GRU with pretrained word vectors (called 'bigru' in ktrain).
learner.autofit(0.0007, 8)
begin training using triangular learning rate policy with max lr of 0.0007...
Train on 143613 samples, validate on 15958 samples
Epoch 1/8
143613/143613 [==============================] - 48s 333us/step - loss: 0.1140 - acc: 0.9630 - val_loss: 0.0530 - val_acc: 0.9812
Epoch 2/8
143613/143613 [==============================] - 47s 330us/step - loss: 0.0625 - acc: 0.9790 - val_loss: 0.0501 - val_acc: 0.9819
Epoch 3/8
143613/143613 [==============================] - 48s 331us/step - loss: 0.0572 - acc: 0.9801 - val_loss: 0.0491 - val_acc: 0.9821
Epoch 4/8
143613/143613 [==============================] - 47s 331us/step - loss: 0.0538 - acc: 0.9806 - val_loss: 0.0481 - val_acc: 0.9823
Epoch 5/8
143613/143613 [==============================] - 47s 329us/step - loss: 0.0517 - acc: 0.9813 - val_loss: 0.0476 - val_acc: 0.9823
Epoch 6/8
143613/143613 [==============================] - 47s 329us/step - loss: 0.0501 - acc: 0.9815 - val_loss: 0.0470 - val_acc: 0.9825
Epoch 7/8
143613/143613 [==============================] - 47s 331us/step - loss: 0.0486 - acc: 0.9820 - val_loss: 0.0468 - val_acc: 0.9824
Epoch 8/8
143613/143613 [==============================] - 47s 330us/step - loss: 0.0471 - acc: 0.9824 - val_loss: 0.0470 - val_acc: 0.9826
<keras.callbacks.History at 0x7f8b80067630>
from sklearn.metrics import roc_auc_score
x_test, y_test = val  # val is an (x, y) tuple when preprocess_mode='standard'
y_pred = learner.model.predict(x_test, verbose=0)
score = roc_auc_score(y_test, y_pred)
print("\n ROC-AUC score: %.6f \n" % (score))
ROC-AUC score: 0.980092
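For intuition about this metric: ROC-AUC for a binary label equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A pure-Python sketch on made-up data:

```python
# ROC-AUC as a rank statistic: the fraction of (positive, negative) pairs
# where the positive example receives the higher score (ties count half).
def auc(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```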
As before, let's make some predictions about toxic comments using our model by wrapping it in a Predictor instance.
predictor = ktrain.get_predictor(learner.model, preproc)
# correctly predict a toxic comment that includes a threat
predictor.predict(["If you don't stop immediately, I will kill you."])
[[('toxic', 0.5491581), ('severe_toxic', 0.02454061), ('obscene', 0.084347874), ('threat', 0.4110818), ('insult', 0.17229997), ('identity_hate', 0.08519211)]]
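To turn the (label, probability) pairs above into a set of predicted tags, a common choice (an assumption here, not something predictor.predict does for you) is a 0.5 decision threshold; only 'toxic' clears it for this comment:

```python
# Applying a 0.5 threshold to the multi-label scores printed above.
scores = [('toxic', 0.5491581), ('severe_toxic', 0.02454061),
          ('obscene', 0.084347874), ('threat', 0.4110818),
          ('insult', 0.17229997), ('identity_hate', 0.08519211)]

predicted = [label for label, p in scores if p > 0.5]
print(predicted)  # ['toxic']
```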
# non-toxic comment
predictor.predict(["Okay - I'm calling it a night. See you tomorrow."])
[[('toxic', 0.021799222), ('severe_toxic', 7.991817e-07), ('obscene', 0.000504758), ('threat', 5.477591e-05), ('insult', 0.001496369), ('identity_hate', 9.472556e-05)]]
predictor.save('/tmp/toxic_detector')
predictor = ktrain.load_predictor('/tmp/toxic_detector')
# model works correctly and as expected after reloading from disk
predictor.predict(["You have a really ugly face."])
[[('toxic', 0.86799675), ('severe_toxic', 0.008107864), ('obscene', 0.26740596), ('threat', 0.006626291), ('insult', 0.39607796), ('identity_hate', 0.023489485)]]
Transformers API in ktrain

If using transformer models like BERT, DistilBERT, or RoBERTa, ktrain includes an alternative API for text classification, which allows the use of any Hugging Face transformers model. This API can be used as follows:
import ktrain
from ktrain import text
MODEL_NAME = 'bert-base-uncased'
t = text.Transformer(MODEL_NAME, maxlen=500,
classes=label_list)
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(3e-5, 1)
Note that x_train
and x_test
are the raw texts here:
x_train = ['I hate this movie.', 'I like this movie.']
Similar to texts_from_array
, the labels are arrays in one of the following forms:
# string labels
y_train = ['negative', 'positive']
# integer labels
y_train = [0, 1]
# multi or one-hot encoded labels
y_train = [[1,0], [0,1]]
In the latter two cases, you must supply a class_names
argument to the Transformer
constructor, which tells ktrain how indices map to class names. In this case, class_names=['negative', 'positive']
because 0=negative and 1=positive.
For an example, see this notebook, which builds an Arabic sentiment analysis model using AraBERT.
For more information, see our tutorial on text classification with Hugging Face Transformers.
You may also be interested in some of our blog posts on text classification: