%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
ktrain uses TensorFlow 2. To support sequence-tagging, ktrain also currently uses the CRF module from
keras_contrib, which is not yet fully compatible with TensorFlow 2.
To use the BiLSTM-CRF model (which currently requires
keras_contrib) for sequence-tagging in ktrain, you must disable V2 behavior in TensorFlow 2
by adding the following line to the top of your notebook or script before importing ktrain:
import os
os.environ['DISABLE_V2_BEHAVIOR'] = '1'
Since we are employing a CRF layer in this notebook, we will set this value here:
os.environ['DISABLE_V2_BEHAVIOR'] = '1'
import ktrain
from ktrain import text
Using DISABLE_V2_BEHAVIOR with TensorFlow
Sequence tagging (or sequence labeling) involves classifying words or sequences of words as representing some category or concept of interest. One example of sequence tagging is Named Entity Recognition (NER), where we classify words or sequences of words that identify some entity such as a person, organization, or location. In this tutorial, we will show how to use ktrain to perform sequence tagging in three simple steps.
The entities_from_txt function can be used to load tagged sentences from a text file. The text file can be in one of two formats: 1) the CoNLL2003 format or 2) the Groningen Meaning Bank (GMB) format. In both formats, each line holds one word and its associated tag (delimited by a space, tab, or comma), and words are ordered as they appear in the sentence. In the CoNLL2003 format, a blank line separates sentences. In the GMB format, a third column, Sentence ID, assigns a number to each row indicating the sentence to which the word belongs. If you are building a sequence tagger for your own use case with the
entities_from_txt function, the training data should be formatted into one of these two formats. Alternatively, one can use the
entities_from_array function, which simply expects arrays of the following form:
x_train = [['Hello', 'world', '!'], ['Hello', 'Barack', 'Obama'], ['I', 'love', 'Chicago']]
y_train = [['O', 'O', 'O'], ['O', 'B-PER', 'I-PER'], ['O', 'O', 'B-LOC']]
Note that the tags in this example follow the IOB2 format.
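As a rough illustration of what the IOB2 scheme requires, the sketch below checks that every I- tag continues an entity of the same type that was opened by a B- tag. The helper name is_valid_iob2 is hypothetical and is not part of ktrain.

```python
def is_valid_iob2(tags):
    """Return True if a tag sequence follows the IOB2 scheme."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            # An inside tag must continue an entity of the same type.
            if prev == "O" or prev[2:] != tag[2:]:
                return False
        elif tag != "O" and not tag.startswith("B-"):
            # Anything other than O, B-xxx, or I-xxx is not a valid tag.
            return False
        prev = tag
    return True


y_train = [['O', 'O', 'O'], ['O', 'B-PER', 'I-PER'], ['O', 'O', 'B-LOC']]
valid_flags = [is_valid_iob2(tags) for tags in y_train]
print(valid_flags)                     # [True, True, True]
print(is_valid_iob2(['O', 'I-PER']))   # False: I-PER without a preceding B-PER
```

A check like this can be useful before training on your own data, since a tagging error in the source file is otherwise easy to miss.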
In this notebook, we will use
entities_from_txt to build a sequence tagger from the Groningen Meaning Bank NER dataset available on Kaggle. The format essentially looks like this (with fields delimited by commas):
SentenceID   Word     Tag
1            Paul     B-PER
1            Newman   I-PER
1            is       O
1            a        O
1            great    O
1            actor    O
1            .        O
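To make the relationship between this row-per-word layout and the array form shown earlier concrete, here is a minimal sketch that groups rows by sentence ID using only the standard library. It is not how ktrain parses the file. The sample data, the helper name group_gmb_rows, and the assumption that every row carries its sentence ID are all illustrative.

```python
import csv
import io

# Hypothetical sample in the comma-delimited layout described above.
SAMPLE = """SentenceID,Word,Tag
1,Paul,B-PER
1,Newman,I-PER
1,is,O
1,a,O
1,great,O
1,actor,O
1,.,O
2,Hello,O
2,Chicago,B-LOC
"""


def group_gmb_rows(handle, sentence_column="SentenceID",
                   word_column="Word", tag_column="Tag"):
    """Group one-token-per-row records into parallel word and tag lists."""
    words_by_sentence = {}
    tags_by_sentence = {}
    for row in csv.DictReader(handle):
        sid = row[sentence_column]
        words_by_sentence.setdefault(sid, []).append(row[word_column])
        tags_by_sentence.setdefault(sid, []).append(row[tag_column])
    # Dicts preserve insertion order, so sentences stay in file order.
    return list(words_by_sentence.values()), list(tags_by_sentence.values())


x, y = group_gmb_rows(io.StringIO(SAMPLE))
print(x[0])
print(y[0])
```

Words that themselves contain a comma would need to be quoted in the file for the csv module to read them correctly.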
We will be using the file
ner_dataset.csv (which conforms to the format above) and will load and preprocess it using the
entities_from_txt function. The output is similar to that of the data-loading functions used in previous tutorials and includes the processed training set, the processed validation set, and a preprocessor instance (returned as preproc in the call below).
In the Kaggle dataset
ner_dataset.csv, the three columns of interest (mentioned above) are labeled 'Sentence #', 'Word', and 'Tag'. Thus, we specify these in the call to the function.
DATAFILE = '/home/amaiya/data/groningen_meaning_bank/ner_dataset.csv'
(trn, val, preproc) = text.entities_from_txt(DATAFILE,
                                             sentence_column='Sentence #',
                                             word_column='Word',
                                             tag_column='Tag',
                                             data_format='gmb',
                                             use_char=True)
detected encoding: WINDOWS-1250 (if wrong, set manually)
Number of sentences:  47959
Number of words in the dataset:  35178
Tags: ['B-art', 'I-art', 'I-eve', 'B-geo', 'B-gpe', 'I-per', 'O', 'B-tim', 'I-gpe', 'B-nat', 'B-eve', 'B-org', 'I-nat', 'B-per', 'I-org', 'I-tim', 'I-geo']
Number of Labels:  17
Longest sentence: 104 words
When loading the dataset above, we specify
use_char=True to instruct ktrain to extract the character vocabulary to be used in a character embedding layer of a model.
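For intuition about what a character vocabulary is, the following sketch builds one from tokenized sentences and encodes a word as a list of character IDs. This is an illustrative assumption about the general technique, not ktrain's internal implementation. The names build_char_vocab and encode_word and the reserved pad/unknown entries are hypothetical.

```python
def build_char_vocab(sentences):
    """Map every character seen in the tokenized sentences to an integer ID."""
    chars = sorted({ch for sent in sentences for word in sent for ch in word})
    # Reserve 0 for padding and 1 for characters unseen at training time.
    vocab = {"<pad>": 0, "<unk>": 1}
    for ch in chars:
        vocab[ch] = len(vocab)
    return vocab


def encode_word(word, vocab):
    """Convert a word into the list of IDs a character embedding layer consumes."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in word]


sentences = [['Hello', 'world', '!'], ['I', 'love', 'Chicago']]
char_vocab = build_char_vocab(sentences)
encoded = encode_word('Hello', char_vocab)
print(len(char_vocab), encoded)
```

Character-level features of this kind let a model pick up on capitalization and word shape, which often matter for recognizing names.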
The print_sequence_taggers function shows that, as of this writing, ktrain supports both Bidirectional LSTM-CRF and Bidirectional LSTM as base models for sequence tagging. These base models can be used with different embedding schemes.
For instance, the
bilstm-bert model employs BERT word embeddings as features for a Bidirectional LSTM. See this notebook for an example of
bilstm-bert. In this tutorial, we will use a Bidirectional LSTM model with a CRF layer.
bilstm: Bidirectional LSTM (https://arxiv.org/abs/1603.01360)
bilstm-bert: Bidirectional LSTM w/ BERT embeddings
bilstm-crf: Bidirectional LSTM-CRF (https://arxiv.org/abs/1603.01360)
bilstm-elmo: Bidirectional LSTM w/ Elmo embeddings [English only]
bilstm-crf-elmo: Bidirectional LSTM-CRF w/ Elmo embeddings [English only]
WV_URL = 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz'
model = text.sequence_tagger('bilstm-crf', preproc, wv_path_or_url=WV_URL)
Embedding schemes employed (combined with concatenation):
	word embeddings initialized with fasttext word vectors (cc.en.300.vec.gz)
	character embeddings

pretrained word embeddings will be loaded from:
	https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
loading pretrained word vectors...this may take a few moments...
In the cell above, notice that we supplied the
wv_path_or_url argument. This directs ktrain to initialize word embeddings with one of the pretrained fasttext (word2vec) word vector sets from Facebook's fasttext site. When supplied with a valid URL to a
.vec.gz file, the word vectors will be automatically downloaded, extracted, and loaded in STEP 2 (download location is
<home_directory>/ktrain_data). To disable pretrained word embeddings, set
wv_path_or_url=None and randomly initialized word embeddings will be employed. Use of pretrained embeddings will typically boost final accuracy. When used in combination with a model that uses an embedding scheme like BERT (e.g.,
bilstm-bert), the different word embeddings are stacked together using concatenation.
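The difference between pretrained and randomly initialized embeddings can be sketched as follows. The example reads a tiny in-memory stand-in for a .vec file (a header line with the word count and dimension, then one word and its vector per line) and falls back to small random values for words without a pretrained vector. The sample vectors, the function names, and the random range are assumptions for illustration, not ktrain's loading code.

```python
import io
import random

# Tiny stand-in for a .vec file: header "count dim", then one word per line.
SAMPLE_VEC = """3 4
the 0.1 0.2 0.3 0.4
Chicago 0.5 0.6 0.7 0.8
actor 0.9 1.0 1.1 1.2
"""


def load_vec(handle):
    """Parse .vec-style text into a {word: vector} dict plus the dimension."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors, dim


def init_embeddings(vocab, vectors, dim, seed=0):
    """Use the pretrained vector when available, else a small random one."""
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        if word in vectors:
            matrix.append(vectors[word])
        else:
            matrix.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
    return matrix


vectors, dim = load_vec(io.StringIO(SAMPLE_VEC))
matrix = init_embeddings(['the', 'Newman'], vectors, dim)
print(matrix[0])       # pretrained row for 'the'
print(len(matrix[1]))  # random row for the out-of-vocabulary word 'Newman'
```

With no pretrained vectors at all, every row would take the random branch, which corresponds to the wv_path_or_url=None case described above.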
Finally, we will wrap our selected model and datasets in a
Learner object to facilitate training.
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)
Here, we will train for a single epoch using an initial learning rate of 0.01 with gradual decay using cosine annealing (via the
cycle_len=1 parameter) and see how well we do. The learning rate of
0.01 was determined with the learning-rate finder, whose output follows.
simulating training for different learning rates... this may take a few moments...
Train for 337 steps
Epoch 1/1024
337/337 [==============================] - 144s 426ms/step - loss: 1.2752
Epoch 2/1024
337/337 [==============================] - 138s 408ms/step - loss: 0.6956
Epoch 3/1024
337/337 [==============================] - 137s 407ms/step - loss: 0.2069
Epoch 4/1024
337/337 [==============================] - 136s 405ms/step - loss: 0.0684
Epoch 5/1024
160/337 [=============>................] - ETA: 1:12 - loss: 0.1804

done.
Please invoke the Learner.lr_plot() method to visually inspect the loss plot to help identify the maximal learning rate associated with falling loss.
learner.fit(1e-2, 1, cycle_len=1)
preparing train data ...done.
preparing valid data ...done.
338/338 [==============================] - 123s 365ms/step - loss: 4.6233 - val_loss: 4.5265
<tensorflow.python.keras.callbacks.History at 0x7f09b55ace48>
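The cosine-annealed decay requested with cycle_len=1 can be pictured with the standard cosine schedule below, which starts at the maximum learning rate and falls smoothly toward a minimum over one cycle. This is a generic sketch of the formula. The function name, the min_lr default, and the per-step granularity are assumptions, and ktrain's exact schedule may differ.

```python
import math


def cosine_annealed_lr(step, total_steps, max_lr, min_lr=0.0):
    """Learning rate at a given step of a single cosine-annealing cycle."""
    progress = step / max(1, total_steps - 1)  # 0.0 at the start, 1.0 at the end
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_steps = 338  # number of batches in the single epoch shown above
schedule = [cosine_annealed_lr(s, total_steps, max_lr=1e-2)
            for s in range(total_steps)]
print(schedule[0], schedule[total_steps // 2], schedule[-1])
```

The rate stays near its maximum early on, drops fastest around the middle of the cycle, and flattens out as it approaches the minimum.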
F1: 84.19
           precision    recall  f1-score   support

      tim       0.90      0.86      0.88      2078
      geo       0.84      0.90      0.87      3728
      org       0.75      0.69      0.72      1981
      per       0.81      0.78      0.79      1717
      gpe       0.97      0.93      0.95      1540
      eve       0.60      0.21      0.31        29
      art       0.00      0.00      0.00        47
      nat       0.57      0.19      0.29        21

micro avg       0.85      0.84      0.84     11141
macro avg       0.84      0.84      0.84     11141
Our F1-score is 84.19 after a single pass through the dataset. Not bad for a single epoch of training.
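For readers unfamiliar with how an F1-score is computed for NER, the sketch below assumes entity-level scoring, where a predicted entity counts as correct only if both its type and its exact boundaries match the ground truth. The helper names and the toy data are hypothetical, and this is not the code that produced the report above.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one IOB2 tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if etype is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and etype is None):
            start, etype = i, tag[2:]
    return spans


def entity_f1(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over exact entity matches."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        true_spans = extract_entities(true_tags)
        pred_spans = extract_entities(pred_tags)
        tp += len(true_spans & pred_spans)
        n_true += len(true_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


y_true = [['B-per', 'I-per', 'O', 'B-geo'], ['O', 'B-org', 'O']]
y_pred = [['B-per', 'I-per', 'O', 'B-geo'], ['O', 'O', 'B-org']]
precision, recall, f1 = entity_f1(y_true, y_pred)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

In the toy data, two of the three predicted entities match exactly, so precision, recall, and F1 all come out to two thirds.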
Let's invoke view_top_losses to see the sentence we got the most wrong. This single sentence about James Brown contains 10 words that are misclassified. We can see here that our model has trouble with titles of songs. In addition, some of the ground truth labels for this example are sketchy and incomplete, which also makes things difficult.
total incorrect: 10
Word            True : (Pred)
==============================
Mr.            :B-per (B-per)
Brown          :I-per (I-per)
is             :O     (O)
known          :O     (O)
by             :O     (O)
millions       :O     (O)
of             :O     (O)
fans           :O     (O)
as             :O     (O)
"              :O     (O)
The            :O     (O)
Godfather      :B-per (B-org)
of             :O     (O)
Soul           :B-per (B-per)
"              :O     (O)
thanks         :O     (O)
to             :O     (O)
such           :O     (O)
classic        :O     (O)
songs          :O     (O)
as             :O     (O)
"              :O     (O)
Please         :B-art (O)
,              :O     (O)
Please         :O     (B-geo)
,              :O     (O)
Please         :O     (O)
,              :O     (O)
"              :O     (O)
"              :O     (O)
It             :O     (O)
's             :O     (O)
a              :O     (O)
Man            :O     (O)
's             :O     (O)
World          :O     (O)
,              :O     (O)
"              :O     (O)
and            :O     (O)
"              :O     (O)
Papa           :B-art (B-org)
's             :I-art (O)
Got            :I-art (O)
a              :I-art (O)
Brand          :I-art (B-org)
New            :I-art (I-org)
Bag            :I-art (I-org)
.              :O     (O)
"              :O     (O)
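One simple way to find the sentences a tagger gets most wrong is to count token-level mismatches per sentence and sort by that count. The sketch below does this on a short excerpt of the example above. Ranking by mismatch count is an assumption made for illustration, and the helper names are hypothetical rather than ktrain functions.

```python
def count_mismatches(true_tags, pred_tags):
    """Number of positions where the predicted tag differs from the true tag."""
    return sum(t != p for t, p in zip(true_tags, pred_tags))


def rank_by_errors(y_true, y_pred):
    """Sentence indices ordered from most to fewest misclassified tokens."""
    counts = [count_mismatches(t, p) for t, p in zip(y_true, y_pred)]
    order = sorted(range(len(counts)), key=lambda i: -counts[i])
    return [(i, counts[i]) for i in order]


# Excerpts: "Mr. Brown is" and "Papa 's Got a Brand New Bag"
y_true = [['B-per', 'I-per', 'O'],
          ['B-art', 'I-art', 'I-art', 'I-art', 'I-art', 'I-art', 'I-art']]
y_pred = [['B-per', 'I-per', 'O'],
          ['B-org', 'O', 'O', 'O', 'B-org', 'I-org', 'I-org']]
ranking = rank_by_errors(y_true, y_pred)
print(ranking)  # [(1, 7), (0, 0)]
```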
Let's use our model to extract entities from new sentences. We begin by instantiating a Predictor instance.
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.predict('As of 2019, Donald Trump is still the President of the United States.')
[('As', 'O'), ('of', 'O'), ('2019', 'B-tim'), (',', 'O'), ('Donald', 'B-per'), ('Trump', 'I-per'), ('is', 'O'), ('still', 'O'), ('the', 'O'), ('President', 'B-per'), ('of', 'O'), ('the', 'O'), ('United', 'B-geo'), ('States', 'I-geo'), ('.', 'O')]
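The per-token output above is often easier to consume once adjacent B- and I- tags are merged into whole entities. The following is a minimal sketch of one way to do that. The helper merge_entities is hypothetical and not part of ktrain.

```python
def merge_entities(tagged_tokens):
    """Collapse (word, IOB2 tag) pairs into (entity text, entity type) pairs."""
    entities, words, etype = [], [], None
    for word, tag in list(tagged_tokens) + [("", "O")]:  # sentinel flushes the last entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if words and not continues:
            entities.append((" ".join(words), etype))
            words, etype = [], None
        if tag.startswith("B-") or (tag.startswith("I-") and not words):
            words, etype = [word], tag[2:]
        elif continues:
            words.append(word)
    return entities


prediction = [('As', 'O'), ('of', 'O'), ('2019', 'B-tim'), (',', 'O'),
              ('Donald', 'B-per'), ('Trump', 'I-per'), ('is', 'O'),
              ('still', 'O'), ('the', 'O'), ('President', 'B-per'),
              ('of', 'O'), ('the', 'O'), ('United', 'B-geo'),
              ('States', 'I-geo'), ('.', 'O')]
entities = merge_entities(prediction)
print(entities)
# [('2019', 'tim'), ('Donald Trump', 'per'), ('President', 'per'), ('United States', 'geo')]
```

An I- tag that appears without a preceding B- tag is treated leniently here as the start of a new entity, which is a design choice rather than a requirement of the scheme.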
We can save the predictor for later deployment.
predictor.save('/tmp/mypred')
reloaded_predictor = ktrain.load_predictor('/tmp/mypred')
reloaded_predictor.predict('Paul Newman is my favorite American actor.')
[('Paul', 'B-per'), ('Newman', 'I-per'), ('is', 'O'), ('my', 'O'), ('favorite', 'O'), ('American', 'B-gpe'), ('actor', 'O'), ('.', 'O')]
The predict method can also accept a list of sentences, and larger batch sizes can potentially speed up predictions when
predictor.predict is supplied with a list of examples. Both the load_predictor and
get_predictor functions accept an optional
batch_size argument used for predictions, which is set to 32 by default. The
batch_size can also be set manually on the
Predictor instance. That is, the
batch_size used for inference and predictions can be increased with either of the following:
# you can set the batch_size as an argument to load_predictor (or get_predictor)
predictor = ktrain.load_predictor('/tmp/mypred', batch_size=128)
# you can also set the batch_size used for predictions this way
predictor.batch_size = 128
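To see why a larger batch_size can help, consider how a list of sentences is split into chunks before being passed to the model: fewer, larger chunks mean fewer forward passes. The sketch below only illustrates that chunking. The batched helper is hypothetical and does not reflect how ktrain batches internally.

```python
def batched(items, batch_size=32):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sentences = [f"Sentence number {i}." for i in range(100)]
sizes_default = [len(batch) for batch in batched(sentences)]
sizes_large = [len(batch) for batch in batched(sentences, batch_size=128)]
print(sizes_default)  # [32, 32, 32, 4]
print(sizes_large)    # [100]
```

Whether the larger batch is actually faster depends on available memory and hardware, so it is worth measuring on your own setup.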
The predict method typically operates on individual sentences rather than entire paragraphs or documents; the model, after all, was trained on individual sentences. In production, you can use the
sent_tokenize function to split text into individual sentences.
from ktrain import text
text.textutils.sent_tokenize('This is the first sentence about Dr. Smith. This is the second sentence.')
The above will output:
['This is the first sentence about Dr . Smith .', 'This is the second sentence .']