%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"]="0";
import ktrain
from ktrain import text
TDATA = 'data/conll2003/train.txt'
VDATA = 'data/conll2003/valid.txt'
(trn, val, preproc) = text.entities_from_conll2003(TDATA, val_filepath=VDATA)
detected encoding: utf-8 (if wrong, set manually)
Number of sentences: 14041
Number of words in the dataset: 23623
Tags: ['B-PER', 'O', 'I-MISC', 'B-ORG', 'I-LOC', 'I-ORG', 'I-PER', 'B-MISC', 'B-LOC']
Number of Labels: 9
Longest sentence: 113 words
In this example notebook, we will build a Bidirectional LSTM model that employs pretrained BERT word embeddings. By default, sequence_tagger will use a pretrained multilingual model (i.e., bert-base-multilingual-cased) that supports 104 different languages. However, since we are training an English-language model on an English-only dataset, it is better to select the English pretrained BERT model: bert-base-cased. Notice that we selected the cased model, as case is important for English NER: entities are often capitalized. A full list of available pretrained models is available here. ktrain currently supports any bert-* model in addition to any distilbert-* model. One can also employ BERT-based community-uploaded models that focus on specific domains such as the biomedical or scientific domains (e.g., BioBERT, SciBERT). To use SciBERT, for example, set bert_model to allenai/scibert_scivocab_uncased.
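For instance, a domain-specific tagger could be configured as in the sketch below (not run here; the rest of this notebook uses bert-base-cased):

# hypothetical variant: swap in SciBERT for scientific/biomedical text
model = text.sequence_tagger('bilstm-bert', preproc, bert_model='allenai/scibert_scivocab_uncased')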
text.print_sequence_taggers()
bilstm: Bidirectional LSTM (https://arxiv.org/abs/1603.01360)
bilstm-bert: Bidirectional LSTM w/ BERT embeddings
bilstm-crf: Bidirectional LSTM-CRF (https://arxiv.org/abs/1603.01360)
bilstm-elmo: Bidirectional LSTM w/ Elmo embeddings [English only]
bilstm-crf-elmo: Bidirectional LSTM-CRF w/ Elmo embeddings [English only]
model = text.sequence_tagger('bilstm-bert', preproc, bert_model='bert-base-cased')
Embedding schemes employed (combined with concatenation):
	word embeddings initialized randomly
	BERT embeddings with bert-base-cased
From the output above, we see that the model is configured to use both BERT pretrained word embeddings and randomly-initialized word embeddings. Instead of randomly-initialized word vectors, one can also select pretrained fasttext word vectors from Facebook's fasttext site and supply the URL via the wv_path_or_url parameter:
wv_path_or_url='https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz')
We have not used fasttext word embeddings in this example, only BERT word embeddings.
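For reference, a call that combines BERT embeddings with the pretrained English fasttext vectors might look like the sketch below (not run in this notebook):

# hypothetical: add pretrained fasttext vectors in place of the randomly-initialized word embeddings
model = text.sequence_tagger('bilstm-bert', preproc, bert_model='bert-base-cased',
                             wv_path_or_url='https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz')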
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)
learner.fit(0.01, 2, cycle_len=5)
preparing train data ...done.
preparing valid data ...done.
Train for 110 steps, validate for 26 steps
Epoch 1/10
110/110 [==============================] - 61s 551ms/step - loss: 0.1072 - val_loss: 0.0301
Epoch 2/10
110/110 [==============================] - 54s 489ms/step - loss: 0.0280 - val_loss: 0.0214
Epoch 3/10
110/110 [==============================] - 53s 486ms/step - loss: 0.0159 - val_loss: 0.0179
Epoch 4/10
110/110 [==============================] - 54s 487ms/step - loss: 0.0104 - val_loss: 0.0165
Epoch 5/10
110/110 [==============================] - 53s 484ms/step - loss: 0.0087 - val_loss: 0.0161
Epoch 6/10
110/110 [==============================] - 53s 481ms/step - loss: 0.0129 - val_loss: 0.0176
Epoch 7/10
110/110 [==============================] - 53s 480ms/step - loss: 0.0094 - val_loss: 0.0167
Epoch 8/10
110/110 [==============================] - 53s 481ms/step - loss: 0.0060 - val_loss: 0.0164
Epoch 9/10
110/110 [==============================] - 53s 485ms/step - loss: 0.0041 - val_loss: 0.0155
Epoch 10/10
110/110 [==============================] - 53s 486ms/step - loss: 0.0033 - val_loss: 0.0157
<tensorflow.python.keras.callbacks.History at 0x7fe9c18f8470>
learner.validate()
   F1: 92.62
           precision    recall  f1-score   support

     MISC       0.84      0.84      0.84       922
      PER       0.96      0.96      0.96      1842
      LOC       0.95      0.96      0.96      1837
      ORG       0.88      0.92      0.90      1341

micro avg       0.92      0.93      0.93      5942
macro avg       0.92      0.93      0.93      5942
0.9262014208106979
We can use the view_top_losses method to inspect the sentences we're getting the most wrong. Here, we can see that our model has trouble with titles, which is understandable since they are mixed into a catch-all miscellaneous (MISC) category.
learner.view_top_losses(n=1)
total incorrect: 11

Word            True : (Pred)
==============================
The            :O      (O)
titles         :O      (O)
of             :O      (O)
his            :O      (O)
other          :O      (O)
novels         :O      (O)
translate      :O      (O)
as             :O      (O)
"              :O      (O)
In             :B-MISC (O)
the            :I-MISC (O)
Year           :I-MISC (O)
of             :I-MISC (I-MISC)
January        :I-MISC (O)
"              :O      (O)
(              :O      (O)
1963           :O      (O)
)              :O      (O)
,              :O      (O)
"              :O      (O)
The            :B-MISC (O)
Collapse       :I-MISC (O)
"              :O      (O)
(              :O      (O)
1964           :O      (O)
)              :O      (O)
,              :O      (O)
"              :O      (O)
Sleeping       :B-MISC (B-MISC)
Bread          :I-MISC (I-MISC)
"              :O      (O)
(              :O      (O)
1975           :O      (O)
)              :O      (O)
,              :O      (O)
"              :O      (O)
The            :B-MISC (O)
Decaying       :I-MISC (B-MISC)
Mansion        :I-MISC (I-MISC)
"              :O      (O)
(              :O      (O)
1977           :O      (O)
)              :O      (O)
and            :O      (O)
"              :O      (O)
A              :B-MISC (B-MISC)
World          :I-MISC (I-MISC)
of             :I-MISC (I-MISC)
Things         :I-MISC (O)
"              :O      (O)
(              :O      (O)
1982           :O      (O)
)              :O      (O)
,              :O      (O)
followed       :O      (O)
by             :O      (O)
"              :O      (O)
The            :B-MISC (O)
Knot           :I-MISC (O)
,              :O      (O)
"              :O      (O)
"              :O      (O)
Soul           :B-MISC (B-MISC)
Alone          :I-MISC (I-MISC)
"              :O      (O)
and            :O      (O)
,              :O      (O)
most           :O      (O)
recently       :O      (O)
,              :O      (O)
"              :O      (O)
A              :B-MISC (B-MISC)
Woman          :I-MISC (I-MISC)
.              :O      (O)
"              :O      (O)
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.predict('As of 2019, Donald Trump is still the President of the United States.')
[('As', 'O'), ('of', 'O'), ('2019', 'O'), (',', 'O'), ('Donald', 'B-PER'), ('Trump', 'I-PER'), ('is', 'O'), ('still', 'O'), ('the', 'O'), ('President', 'O'), ('of', 'O'), ('the', 'O'), ('United', 'B-LOC'), ('States', 'I-LOC'), ('.', 'O')]
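Depending on the ktrain version installed, the NER predictor's predict method may also accept a merge_tokens argument that merges consecutive tokens of the same entity into a single span. This is an assumption about more recent releases, so check your installed version before relying on it:

# assumes a ktrain release whose NERPredictor.predict supports merge_tokens; older versions may not
predictor.predict('As of 2019, Donald Trump is still the President of the United States.', merge_tokens=True)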
predictor.save('/tmp/mypred')
reloaded_predictor = ktrain.load_predictor('/tmp/mypred')
reloaded_predictor.predict('Paul Newman is my favorite actor.')
[('Paul', 'B-PER'), ('Newman', 'I-PER'), ('is', 'O'), ('my', 'O'), ('favorite', 'O'), ('actor', 'O'), ('.', 'O')]