%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"]="0";
ktrain uses TensorFlow 2. To support sequence-tagging, ktrain also currently uses the CRF module from keras_contrib
, which is not yet fully compatible with TensorFlow 2.
To use the BiLSTM-CRF model (which currently requires keras_contrib
) for sequence-tagging in ktrain, you must disable V2 behavior in TensorFlow 2
by adding the following line to the top of your notebook or script before importing ktrain:
import os
os.environ['DISABLE_V2_BEHAVIOR'] = '1'
Since we are employing a CRF layer in this notebook, we will set this value here:
os.environ['DISABLE_V2_BEHAVIOR'] = '1'
import ktrain
from ktrain import text
Using DISABLE_V2_BEHAVIOR with TensorFlow
Sequence tagging (or sequence labeling) involves classifying words or sequences of words as representing some category or concept of interest. One example of sequence tagging is Named Entity Recognition (NER), where we classify words or sequences of words that identify some entity such as a person, organization, or location. In this tutorial, we will show how to use ktrain to perform sequence tagging in three simple steps.
The entities_from_txt
function can be used to load tagged sentences from a text file. The text file can be in one of two different formats: 1) the CoNLL2003 format or 2) the Groningen Meaning Bank (GMB) format. In both formats, there is one word and its associated tag on each line (where the word and tag are delimited by a space, tab or comma). Words are ordered as they appear in the sentence. In the CoNLL2003 format, there is a blank line that delineates sentences. In the GMB format, there is a third column for Sentence ID that assignes a number to each row indicating the sentence to which the word belongs. If you are building a sequence tagger for your own use case with the entities_from_txt
function, the training data should be formatted into one of these two formats. Alternatively, one can use the entities_from_aray
function which simply expects arrays of the following form:
x_train = [['Hello', 'world', '!'], ['Hello', 'Barack', 'Obama'], ['I', 'love', 'Chicago']]
y_train = [['O', 'O', 'O'], ['O', 'B-PER', 'I-PER'], ['O', 'O', 'B-LOC']]
Note that the tags in this example follow the IOB2 format.
In this notebook, we will be using entities_from_txt
and build a sequence tagger using the Groningen Meaning Bank NER dataset available on Kaggle here. The format essentially looks like this (with fields being delimited by comma):
SentenceID Word Tag
1 Paul B-PER
1 Newman I-PER
1 is O
1 a O
1 great O
1 actor O
1 . O
We will be using the file ner_dataset.csv
(which conforms to the format above) and will load and preprocess it using the entities_from_txt
function. The output is simlar to data-loading functions used in previous tutorials and includes the processed training set, processed validaton set, and an instance of NERPreprocessor
.
The Kaggle dataset ner_dataset.csv
the three columns of interest (mentioned above) are labeled 'Sentence #', 'Word', and 'Tag'. Thus, we specify these in the call to the function.
DATAFILE = '/home/amaiya/data/groningen_meaning_bank/ner_dataset.csv'
(trn, val, preproc) = text.entities_from_txt(DATAFILE,
sentence_column='Sentence #',
word_column='Word',
tag_column='Tag',
data_format='gmb',
use_char=True)
detected encoding: WINDOWS-1250 (if wrong, set manually) Number of sentences: 47959 Number of words in the dataset: 35178 Tags: ['B-art', 'I-art', 'I-eve', 'B-geo', 'B-gpe', 'I-per', 'O', 'B-tim', 'I-gpe', 'B-nat', 'B-eve', 'B-org', 'I-nat', 'B-per', 'I-org', 'I-tim', 'I-geo'] Number of Labels: 17 Longest sentence: 104 words
When loading the dataset above, we specify use_char=True
to instruct ktrain to extract the character vocabulary to be used in a character embedding layer of a model.
The print_sequence_taggers
function shows that, as of this writing, ktrain currently supports both Bidirectional LSTM-CRM and Bidirectional LSTM as base models for sequence tagging. Theses base models can be used with different embedding schemes.
For instance, the bilstm-bert
model employs BERT word embeddings as features for a Bidirectional LSTM. See this notebook for an example of bilstm-bert
. In this tutorial, we will use a Bidirectional LSTM model with a CRF layer.
text.print_sequence_taggers()
bilstm: Bidirectional LSTM (https://arxiv.org/abs/1603.01360) bilstm-bert: Bidirectional LSTM w/ BERT embeddings bilstm-crf: Bidirectional LSTM-CRF (https://arxiv.org/abs/1603.01360) bilstm-elmo: Bidirectional LSTM w/ Elmo embeddings [English only] bilstm-crf-elmo: Bidirectional LSTM-CRF w/ Elmo embeddings [English only]
WV_URL = 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz'
model = text.sequence_tagger('bilstm-crf', preproc, wv_path_or_url=WV_URL)
Embedding schemes employed (combined with concatenation): word embeddings initialized with fasttext word vectors (cc.en.300.vec.gz) character embeddings pretrained word embeddings will be loaded from: https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz loading pretrained word vectors...this may take a few moments...
In the cell above, notice that we suppied the wv_path_or_url
argument. This directs ktrain to initialized word embeddings with one of the pretrained fasttext (word2vec) word vector sets from Facebook's fastttext site. When supplied with a valid URL to a .vec.gz
, the word vectors will be automatically downloaded, extracted, and loaded in STEP 2 (download location is <home_directory>/ktrain_data
). To disable pretrained word embeddings, set wv_path_or_url=None
and randomly initialized word embeddings will be employed. Use of pretrained embeddings will typically boost final accuracy. When used in combination with a model that uses an embedding scheme like BERT (e.g., bilstm-bert
), the different word embeddings are stacked together using concatenation.
Finally, we will wrap our selected model and datasets in a Learner
object to facilitate training.
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)
Here, we will train for a single epoch using an initial learning rate of 0.01 with gradual decay using cosine annealing (via the cycle_len=1
) parameter and see how well we do. The learning rate of 0.01
is determined with the learning-rate-finder (i.e., lr_find
).
learner.lr_find()
simulating training for different learning rates... this may take a few moments... Train for 337 steps Epoch 1/1024 337/337 [==============================] - 144s 426ms/step - loss: 1.2752 Epoch 2/1024 337/337 [==============================] - 138s 408ms/step - loss: 0.6956 Epoch 3/1024 337/337 [==============================] - 137s 407ms/step - loss: 0.2069 Epoch 4/1024 337/337 [==============================] - 136s 405ms/step - loss: 0.0684 Epoch 5/1024 160/337 [=============>................] - ETA: 1:12 - loss: 0.1804 done. Please invoke the Learner.lr_plot() method to visually inspect the loss plot to help identify the maximal learning rate associated with falling loss.
learner.lr_plot()
learner.fit(1e-2, 1, cycle_len=1)
preparing train data ...done. preparing valid data ...done. 338/338 [==============================] - 123s 365ms/step - loss: 4.6233 - val_loss: 4.5265
<tensorflow.python.keras.callbacks.History at 0x7f09b55ace48>
learner.validate()
F1: 84.19 precision recall f1-score support tim 0.90 0.86 0.88 2078 geo 0.84 0.90 0.87 3728 org 0.75 0.69 0.72 1981 per 0.81 0.78 0.79 1717 gpe 0.97 0.93 0.95 1540 eve 0.60 0.21 0.31 29 art 0.00 0.00 0.00 47 nat 0.57 0.19 0.29 21 micro avg 0.85 0.84 0.84 11141 macro avg 0.84 0.84 0.84 11141
0.8418623591692684
Our F1-score is 84.19 after a single pass through the dataset. Not bad for a single epoch of training.
Let's invoke view_top_losses
to see the sentence we got the most wrong. This single sentence about James Brown contains 10 words that are misclassified. We can see here that our model has trouble with titles of songs. In addition, some of the ground truth labels for this example are sketchy and incomplete, which also makes things difficult.
learner.view_top_losses(n=1)
total incorrect: 10 Word True : (Pred) ============================== Mr. :B-per (B-per) Brown :I-per (I-per) is :O (O) known :O (O) by :O (O) millions :O (O) of :O (O) fans :O (O) as :O (O) " :O (O) The :O (O) Godfather :B-per (B-org) of :O (O) Soul :B-per (B-per) " :O (O) thanks :O (O) to :O (O) such :O (O) classic :O (O) songs :O (O) as :O (O) " :O (O) Please :B-art (O) , :O (O) Please :O (B-geo) , :O (O) Please :O (O) , :O (O) " :O (O) " :O (O) It :O (O) 's :O (O) a :O (O) Man :O (O) 's :O (O) World :O (O) , :O (O) " :O (O) and :O (O) " :O (O) Papa :B-art (B-org) 's :I-art (O) Got :I-art (O) a :I-art (O) Brand :I-art (B-org) New :I-art (I-org) Bag :I-art (I-org) . :O (O) " :O (O)
Let's use our model to extract entities from new sentences. We begin by instantating a Predictor
object.
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.predict('As of 2019, Donald Trump is still the President of the United States.')
[('As', 'O'), ('of', 'O'), ('2019', 'B-tim'), (',', 'O'), ('Donald', 'B-per'), ('Trump', 'I-per'), ('is', 'O'), ('still', 'O'), ('the', 'O'), ('President', 'B-per'), ('of', 'O'), ('the', 'O'), ('United', 'B-geo'), ('States', 'I-geo'), ('.', 'O')]
We can save the predictor for later deployment.
predictor.save('/tmp/mypred')
reloaded_predictor = ktrain.load_predictor('/tmp/mypred')
reloaded_predictor.predict('Paul Newman is my favorite American actor.')
[('Paul', 'B-per'), ('Newman', 'I-per'), ('is', 'O'), ('my', 'O'), ('favorite', 'O'), ('American', 'B-gpe'), ('actor', 'O'), ('.', 'O')]
The predict
method typically operates on individual sentences instead of entire paragraphs or documents. The model after all was trained on individual sentences. In production, you can use the sent_tokenize
function to tokenize text into individual sentences.
from ktrain import text
text.textutils.sent_tokenize('This is the first sentence about Dr. Smith. This is the second sentence.')
The above will output:
['This is the first sentence about Dr . Smith .',
'This is the second sentence .']