In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"]="0"; 
In [2]:
import ktrain
from ktrain import text

STEP 1: Load and Preprocess Data

The CoNLL2003 NER dataset can be downloaded from here.
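
In the standard CoNLL-2003 format, each non-blank line holds one token as whitespace-separated columns (the word first and the IOB entity tag last, e.g. "EU NNP B-NP B-ORG"), with blank lines separating sentences. To sanity-check the file before loading it, a quick peek might look like the following sketch (assuming the same path as TDATA below):

# print the first few raw lines of the training file (sketch only)
with open('data/conll2003/train.txt') as f:
    for line in f.readlines()[:10]:
        print(line.rstrip())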

In [3]:
TDATA = 'data/conll2003/train.txt'
VDATA = 'data/conll2003/valid.txt'
(trn, val, preproc) = text.entities_from_conll2003(TDATA, val_filepath=VDATA)
detected encoding: utf-8 (if wrong, set manually)
Number of sentences:  14041
Number of words in the dataset:  23623
Tags: ['B-PER', 'O', 'I-MISC', 'B-ORG', 'I-LOC', 'I-ORG', 'I-PER', 'B-MISC', 'B-LOC']
Number of Labels:  9
Longest sentence: 113 words

STEP 2: Define a Model

In this example notebook, we will build a Bidirectional LSTM model that uses pretrained BERT word embeddings. By default, sequence_tagger will use a pretrained multilingual model (i.e., bert-base-multilingual-cased) that supports 157 different languages. However, since we are training an English-language model on an English-only dataset, it is better to select the English pretrained BERT model: bert-base-cased. Note that we selected the cased model, since case matters for English NER: entities are often capitalized. A full list of available pretrained models is here. ktrain currently supports any bert-* model in addition to any distilbert-* model. One can also employ BERT-based, community-uploaded models that focus on specific domains such as the biomedical or scientific domains (e.g., BioBERT, SciBERT). To use SciBERT, for example, set bert_model to allenai/scibert_scivocab_uncased.
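
To make the last point concrete, a SciBERT variant of the model definition below would look like the following sketch (we keep bert-base-cased in this notebook):

# sketch only: swap in SciBERT embeddings for scientific/biomedical text
model = text.sequence_tagger('bilstm-bert', preproc,
                             bert_model='allenai/scibert_scivocab_uncased')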

In [4]:
text.print_sequence_taggers()
bilstm: Bidirectional LSTM (https://arxiv.org/abs/1603.01360)
bilstm-bert: Bidirectional LSTM w/ BERT embeddings
bilstm-crf: Bidirectional LSTM-CRF  (https://arxiv.org/abs/1603.01360)
bilstm-elmo: Bidirectional LSTM w/ Elmo embeddings [English only]
bilstm-crf-elmo: Bidirectional LSTM-CRF w/ Elmo embeddings [English only]
In [5]:
model = text.sequence_tagger('bilstm-bert', preproc, bert_model='bert-base-cased')
Embedding schemes employed (combined with concatenation):
	word embeddings initialized randomly
	BERT embeddings with bert-base-cased

From the output above, we see that the model is configured to use both pretrained BERT word embeddings and randomly initialized word embeddings. Instead of randomly initialized word vectors, one can also select pretrained fasttext word vectors from Facebook's fasttext site and supply the URL via the wv_path_or_url parameter:

wv_path_or_url='https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz'

We have not used fasttext word embeddings in this example, only BERT word embeddings.
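
Putting the pieces together, a variant of the model definition that adds pretrained fasttext vectors on top of the BERT embeddings might look like the following sketch (not run in this notebook):

# sketch only: combine BERT embeddings with pretrained fasttext word vectors
model = text.sequence_tagger('bilstm-bert', preproc,
                             bert_model='bert-base-cased',
                             wv_path_or_url='https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz')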

In [6]:
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)

STEP 3: Train and Evaluate Model

In [7]:
learner.fit(0.01, 2, cycle_len=5)
preparing train data ...done.
preparing valid data ...done.
Train for 110 steps, validate for 26 steps
Epoch 1/10
110/110 [==============================] - 61s 551ms/step - loss: 0.1072 - val_loss: 0.0301
Epoch 2/10
110/110 [==============================] - 54s 489ms/step - loss: 0.0280 - val_loss: 0.0214
Epoch 3/10
110/110 [==============================] - 53s 486ms/step - loss: 0.0159 - val_loss: 0.0179
Epoch 4/10
110/110 [==============================] - 54s 487ms/step - loss: 0.0104 - val_loss: 0.0165
Epoch 5/10
110/110 [==============================] - 53s 484ms/step - loss: 0.0087 - val_loss: 0.0161
Epoch 6/10
110/110 [==============================] - 53s 481ms/step - loss: 0.0129 - val_loss: 0.0176
Epoch 7/10
110/110 [==============================] - 53s 480ms/step - loss: 0.0094 - val_loss: 0.0167
Epoch 8/10
110/110 [==============================] - 53s 481ms/step - loss: 0.0060 - val_loss: 0.0164
Epoch 9/10
110/110 [==============================] - 53s 485ms/step - loss: 0.0041 - val_loss: 0.0155
Epoch 10/10
110/110 [==============================] - 53s 486ms/step - loss: 0.0033 - val_loss: 0.0157
Out[7]:
<tensorflow.python.keras.callbacks.History at 0x7fe9c18f8470>
In [8]:
learner.validate()
   F1: 92.62
           precision    recall  f1-score   support

     MISC       0.84      0.84      0.84       922
      PER       0.96      0.96      0.96      1842
      LOC       0.95      0.96      0.96      1837
      ORG       0.88      0.92      0.90      1341

micro avg       0.92      0.93      0.93      5942
macro avg       0.92      0.93      0.93      5942

Out[8]:
0.9262014208106979

We can use the view_top_losses method to inspect the sentences the model gets most wrong. Here, we can see that our model has trouble with titles of works, which is understandable since they are mixed into the catch-all miscellaneous (MISC) category.

In [9]:
learner.view_top_losses(n=1)
total incorrect: 11
Word            True : (Pred)
==============================
The            :O     (O)
titles         :O     (O)
of             :O     (O)
his            :O     (O)
other          :O     (O)
novels         :O     (O)
translate      :O     (O)
as             :O     (O)
"              :O     (O)
In             :B-MISC (O)
the            :I-MISC (O)
Year           :I-MISC (O)
of             :I-MISC (I-MISC)
January        :I-MISC (O)
"              :O     (O)
(              :O     (O)
1963           :O     (O)
)              :O     (O)
,              :O     (O)
"              :O     (O)
The            :B-MISC (O)
Collapse       :I-MISC (O)
"              :O     (O)
(              :O     (O)
1964           :O     (O)
)              :O     (O)
,              :O     (O)
"              :O     (O)
Sleeping       :B-MISC (B-MISC)
Bread          :I-MISC (I-MISC)
"              :O     (O)
(              :O     (O)
1975           :O     (O)
)              :O     (O)
,              :O     (O)
"              :O     (O)
The            :B-MISC (O)
Decaying       :I-MISC (B-MISC)
Mansion        :I-MISC (I-MISC)
"              :O     (O)
(              :O     (O)
1977           :O     (O)
)              :O     (O)
and            :O     (O)
"              :O     (O)
A              :B-MISC (B-MISC)
World          :I-MISC (I-MISC)
of             :I-MISC (I-MISC)
Things         :I-MISC (O)
"              :O     (O)
(              :O     (O)
1982           :O     (O)
)              :O     (O)
,              :O     (O)
followed       :O     (O)
by             :O     (O)
"              :O     (O)
The            :B-MISC (O)
Knot           :I-MISC (O)
,              :O     (O)
"              :O     (O)
"              :O     (O)
Soul           :B-MISC (B-MISC)
Alone          :I-MISC (I-MISC)
"              :O     (O)
and            :O     (O)
,              :O     (O)
most           :O     (O)
recently       :O     (O)
,              :O     (O)
"              :O     (O)
A              :B-MISC (B-MISC)
Woman          :I-MISC (I-MISC)
.              :O     (O)
"              :O     (O)


Make Predictions on New Data

In [10]:
predictor = ktrain.get_predictor(learner.model, preproc)
In [11]:
predictor.predict('As of 2019, Donald Trump is still the President of the United States.')
Out[11]:
[('As', 'O'),
 ('of', 'O'),
 ('2019', 'O'),
 (',', 'O'),
 ('Donald', 'B-PER'),
 ('Trump', 'I-PER'),
 ('is', 'O'),
 ('still', 'O'),
 ('the', 'O'),
 ('President', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('United', 'B-LOC'),
 ('States', 'I-LOC'),
 ('.', 'O')]
In [12]:
predictor.save('/tmp/mypred')
In [13]:
reloaded_predictor = ktrain.load_predictor('/tmp/mypred')
In [14]:
reloaded_predictor.predict('Paul Newman is my favorite actor.')
Out[14]:
[('Paul', 'B-PER'),
 ('Newman', 'I-PER'),
 ('is', 'O'),
 ('my', 'O'),
 ('favorite', 'O'),
 ('actor', 'O'),
 ('.', 'O')]
In [ ]: