Use Text Extensions for Pandas to integrate BERT tokenization with model training for named entity recognition in Pandas.
This notebook shows how to use the open source library Text Extensions for Pandas to seamlessly integrate BERT tokenization and embeddings with model training for named entity recognition using Pandas DataFrames.
This example builds on the analysis of the CoNLL-2003 corpus done in Analyze_Model_Outputs to train a new model for named entity recognition (NER) using state-of-the-art natural language understanding with BERT tokenization and embeddings. While the model used is rather simple and will only achieve modest scores, the purpose is to demonstrate how Text Extensions for Pandas integrates BERT from Huggingface Transformers with the TensorArray extension for model training and scoring, all within Pandas DataFrames. See Text_Extension_for_Pandas_Overview for the TensorArray specification and more example usage.
The notebook is divided into the following steps:
This notebook requires a Python 3.7 or later environment with NumPy, Pandas, scikit-learn, PyTorch, and Huggingface transformers.
The notebook also requires the text_extensions_for_pandas library. You can satisfy this dependency in two ways:
- Run pip install text_extensions_for_pandas before running this notebook. This command adds the library to your Python environment.
- Run this notebook from within a checkout of the project source tree; the fallback logic in the first code cell below will use the version of the library in the local source tree.
import gc
import os
import sys
from typing import *
import numpy as np
import pandas as pd
import sklearn.pipeline
import sklearn.linear_model
import torch
import transformers
# And of course we need the text_extensions_for_pandas library itself.
try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    # If we're running from within the project source tree and the parent Python
    # environment doesn't have the text_extensions_for_pandas package, use the
    # version in the local source tree.
    if not os.getcwd().endswith("notebooks"):
        raise e
    if ".." not in sys.path:
        sys.path.insert(0, "..")
    import text_extensions_for_pandas as tp
CoNLL, the SIGNLL Conference on Computational Natural Language Learning, is an annual academic conference for natural language processing researchers. Each year's conference features a competition involving a challenging NLP task. The task for the 2003 competition involved identifying mentions of named entities in English and German news articles from the late 1990s. The corpus for this 2003 competition is one of the most widely-used benchmarks for the performance of named entity recognition models. Current state-of-the-art results on this corpus produce an F1 score (harmonic mean of precision and recall) of 0.93. The best F1 score in the original competition was 0.89.
For more information about this data set, we recommend reading the conference paper about the competition results, "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition."
Note that the data set is licensed for research use only. Be sure to adhere to the terms of the license when using this data set!
The developers of the CoNLL-2003 corpus defined a file format for the corpus, based on the file format used in the earlier Message Understanding Conference competition. This format is generally known as "CoNLL format" or "CoNLL-2003 format".
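To make the format concrete, here is a short illustrative excerpt in CoNLL-2003 style (reconstructed for illustration, not copied from the corpus). Each line holds one token followed by its part-of-speech tag, syntactic chunk tag, and entity tag; blank lines separate sentences, and -DOCSTART- lines separate documents:

```
-DOCSTART- -X- -X- O

U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
```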
In the following cell, we use the facilities of Text Extensions for Pandas to download a copy of the CoNLL-2003 data set. Then we read the CoNLL-2003-format file containing the test fold of the corpus and translate the data into a collection of Pandas DataFrame objects, one DataFrame per document. Finally, we display the DataFrame for an example document from the test fold of the corpus.
# Download and cache the data set.
# NOTE: This data set is licensed for research use only. Be sure to adhere
# to the terms of the license when using this data set!
data_set_info = tp.io.conll.maybe_download_conll_data("outputs")
data_set_info
{'train': 'outputs/eng.train', 'dev': 'outputs/eng.testa', 'test': 'outputs/eng.testb'}
The BERT model is originally from the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. The model is pre-trained with masked language modeling and next sentence prediction objectives, which make it effective for masked token prediction and natural language understanding tasks.
With the CoNLL-2003 corpus loaded, we need to retokenize it with a "BERT-compatible" tokenizer. Then we can map the token and entity labels from the original corpus onto the new tokenization.
We will start by showing the retokenizing process for a single document before doing the same on the entire corpus.
# Read in the corpus in its original tokenization.
corpus_raw = {}
for fold_name, file_name in data_set_info.items():
    df_list = tp.io.conll.conll_2003_to_dataframes(file_name,
                                                   ["pos", "phrase", "ent"],
                                                   [False, True, True])
    corpus_raw[fold_name] = [
        df.drop(columns=["pos", "phrase_iob", "phrase_type"])
        for df in df_list
    ]
test_raw = corpus_raw["test"]
# Pick out the dataframe for a single example document.
example_df = test_raw[5]
example_df
span | ent_iob | ent_type | sentence | line_num | |
---|---|---|---|---|---|
0 | [0, 10): '-DOCSTART-' | O | None | [0, 10): '-DOCSTART-' | 1469 |
1 | [11, 18): 'CRICKET' | O | None | [11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE... | 1471 |
2 | [18, 19): '-' | O | None | [11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE... | 1472 |
3 | [20, 28): 'PAKISTAN' | B | LOC | [11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE... | 1473 |
4 | [29, 30): 'V' | O | None | [11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE... | 1474 |
... | ... | ... | ... | ... | ... |
350 | [1620, 1621): '8' | O | None | [1590, 1634): 'Third one-day match: December 8... | 1865 |
351 | [1621, 1622): ',' | O | None | [1590, 1634): 'Third one-day match: December 8... | 1866 |
352 | [1623, 1625): 'in' | O | None | [1590, 1634): 'Third one-day match: December 8... | 1867 |
353 | [1626, 1633): 'Karachi' | B | LOC | [1590, 1634): 'Third one-day match: December 8... | 1868 |
354 | [1633, 1634): '.' | O | None | [1590, 1634): 'Third one-day match: December 8... | 1869 |
355 rows × 5 columns
The example_df DataFrame contains columns span and sentence of dtypes SpanDtype and TokenSpanDtype, respectively. These represent spans over the target text: here they contain the tokens of the text and the sentence containing each token. See the notebook Text_Extension_for_Pandas_Overview for more on SpanArray and TokenSpanArray.
example_df.dtypes
span SpanDtype ent_iob object ent_type object sentence TokenSpanDtype line_num int64 dtype: object
The data we've looked at so far has been in IOB2 format.
Each row of our DataFrame represents a token, and each token is tagged with an entity type (ent_type) and an IOB tag (ent_iob). The first token of each named entity mention is tagged B, while subsequent tokens are tagged I. Tokens that aren't part of any named entity are tagged O.
IOB2 format is a convenient way to represent a corpus, but it is a less useful representation for analyzing the result quality of named entity recognition models. Most tokens in a typical NER corpus are tagged O, so any measure of error rate in terms of tokens will under-emphasize the tokens that are part of entities. Token-level error rate also implicitly assigns higher weight to named entity mentions that consist of multiple tokens, further unbalancing error metrics. And most crucially, a naive comparison of IOB tags can mark an incorrect answer as correct. Consider a case where the correct sequence of labels is B, B, I but the model has output B, I, I; the last two tokens of the model output are both incorrect (the model has assigned them to the same entity as the first token), but a naive token-level comparison will consider the last token to be correct.
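To make this pitfall concrete, here is a toy sketch in plain Python (hypothetical labels, not the corpus or the library's scoring code):

```python
# Correct labels contain two adjacent entities; the model merges them into one.
gold = ["B", "B", "I"]
pred = ["B", "I", "I"]

# A naive token-level comparison credits the final token as correct...
token_matches = [g == p for g, p in zip(gold, pred)]
token_accuracy = sum(token_matches) / len(gold)  # 2/3

# ...but decoding both tag sequences to entity spans shows the model
# recovered neither gold entity.
def iob_to_entity_spans(tags):
    """Decode a list of IOB2 tags into (begin, end) token offsets."""
    spans, begin = [], None
    for i, tag in enumerate(tags):
        if tag in ("B", "O") and begin is not None:
            spans.append((begin, i))
            begin = None
        if tag == "B":
            begin = i
    if begin is not None:
        spans.append((begin, len(tags)))
    return spans

gold_spans = iob_to_entity_spans(gold)  # [(0, 1), (1, 3)]
pred_spans = iob_to_entity_spans(pred)  # [(0, 3)]
correct_entities = set(gold_spans) & set(pred_spans)  # empty set
```

At the entity level the model scores zero here, even though two of the three token labels match.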
The CoNLL-2003 competition used the number of errors in extracting entire entity mentions to measure the result quality of the entries. We will use the same metric in this notebook. To compute entity-level errors, we convert the IOB-tagged tokens into pairs of <entity span, entity type>. Text Extensions for Pandas includes a function iob_to_spans() that handles this conversion for you.
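For intuition, entity-level scoring over such pairs can be sketched in a few lines of plain Python (hypothetical spans and types, not the library's API):

```python
# An extracted entity counts as correct only if both its character span
# and its entity type match a gold annotation exactly.
gold = {((20, 28), "LOC"), ((31, 42), "LOC"), ((80, 83), "MISC")}
pred = {((20, 28), "LOC"), ((31, 42), "ORG")}  # second entity mistyped

true_positives = len(gold & pred)        # 1
precision = true_positives / len(pred)   # 0.5
recall = true_positives / len(gold)      # 1/3
f1 = 2 * precision * recall / (precision + recall)  # 0.4
```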
# Convert the corpus IOB2 tagged DataFrame to one with entity span and type columns.
spans_df = tp.io.conll.iob_to_spans(example_df)
spans_df
span | ent_type | |
---|---|---|
0 | [20, 28): 'PAKISTAN' | LOC |
1 | [31, 42): 'NEW ZEALAND' | LOC |
2 | [80, 83): 'GMT' | MISC |
3 | [85, 92): 'SIALKOT' | LOC |
4 | [94, 102): 'Pakistan' | LOC |
... | ... | ... |
69 | [1488, 1501): 'Shahid Afridi' | PER |
70 | [1512, 1523): 'Salim Malik' | PER |
71 | [1535, 1545): 'Ijaz Ahmad' | PER |
72 | [1565, 1573): 'Pakistan' | LOC |
73 | [1626, 1633): 'Karachi' | LOC |
74 rows × 2 columns
Here we configure and initialize the Huggingface transformers BERT tokenizer and model. Text Extensions for Pandas provides a make_bert_tokens() function that uses the tokenizer to create BERT tokens as a span column in a DataFrame, suitable for computing BERT embeddings.
# Huggingface transformers BERT Configuration.
bert_model_name = "dslim/bert-base-NER"
tokenizer = transformers.BertTokenizerFast.from_pretrained(bert_model_name,
                                                           add_special_tokens=True)
# Disable the warning about long sequences. We know what we're doing.
# Different versions of transformers disable this warning differently,
# so we need to do this twice.
tokenizer.deprecation_warnings[
    "sequence-length-is-longer-than-the-specified-maximum"] = True
tokenizer.model_max_length = 16384
# Retokenize the document's text with the BERT tokenizer as a DataFrame
# with a span column.
bert_toks_df = tp.io.bert.make_bert_tokens(example_df["span"].values[0].target_text, tokenizer)
bert_toks_df
token_id | span | input_id | token_type_id | attention_mask | special_tokens_mask | |
---|---|---|---|---|---|---|
0 | 0 | [0, 0): '' | 101 | 0 | 1 | True |
1 | 1 | [0, 1): '-' | 118 | 0 | 1 | False |
2 | 2 | [1, 2): 'D' | 141 | 0 | 1 | False |
3 | 3 | [2, 4): 'OC' | 9244 | 0 | 1 | False |
4 | 4 | [4, 6): 'ST' | 9272 | 0 | 1 | False |
... | ... | ... | ... | ... | ... | ... |
684 | 684 | [1621, 1622): ',' | 117 | 0 | 1 | False |
685 | 685 | [1623, 1625): 'in' | 1107 | 0 | 1 | False |
686 | 686 | [1626, 1633): 'Karachi' | 16237 | 0 | 1 | False |
687 | 687 | [1633, 1634): '.' | 119 | 0 | 1 | False |
688 | 688 | [0, 0): '' | 102 | 0 | 1 | True |
689 rows × 6 columns
# BERT tokenization includes special zero-length tokens.
bert_toks_df[bert_toks_df["special_tokens_mask"]]
token_id | span | input_id | token_type_id | attention_mask | special_tokens_mask | |
---|---|---|---|---|---|---|
0 | 0 | [0, 0): '' | 101 | 0 | 1 | True |
688 | 688 | [0, 0): '' | 102 | 0 | 1 | True |
# Align the BERT tokens with the original tokenization.
bert_spans = tp.TokenSpanArray.align_to_tokens(bert_toks_df["span"],
                                               spans_df["span"])
pd.DataFrame({
    "original_span": spans_df["span"],
    "bert_spans": bert_spans,
    "ent_type": spans_df["ent_type"]
})
original_span | bert_spans | ent_type | |
---|---|---|---|
0 | [20, 28): 'PAKISTAN' | [20, 28): 'PAKISTAN' | LOC |
1 | [31, 42): 'NEW ZEALAND' | [31, 42): 'NEW ZEALAND' | LOC |
2 | [80, 83): 'GMT' | [80, 83): 'GMT' | MISC |
3 | [85, 92): 'SIALKOT' | [85, 92): 'SIALKOT' | LOC |
4 | [94, 102): 'Pakistan' | [94, 102): 'Pakistan' | LOC |
... | ... | ... | ... |
69 | [1488, 1501): 'Shahid Afridi' | [1488, 1501): 'Shahid Afridi' | PER |
70 | [1512, 1523): 'Salim Malik' | [1512, 1523): 'Salim Malik' | PER |
71 | [1535, 1545): 'Ijaz Ahmad' | [1535, 1545): 'Ijaz Ahmad' | PER |
72 | [1565, 1573): 'Pakistan' | [1565, 1573): 'Pakistan' | LOC |
73 | [1626, 1633): 'Karachi' | [1626, 1633): 'Karachi' | LOC |
74 rows × 3 columns
# Generate IOB2 tags and entity labels that align with the BERT tokens.
# See https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
bert_toks_df[["ent_iob", "ent_type"]] = tp.io.conll.spans_to_iob(bert_spans,
                                                                 spans_df["ent_type"])
bert_toks_df[10:20]
token_id | span | input_id | token_type_id | attention_mask | special_tokens_mask | ent_iob | ent_type | |
---|---|---|---|---|---|---|---|---|
10 | 10 | [15, 17): 'KE' | 22441 | 0 | 1 | False | O | <NA> |
11 | 11 | [17, 18): 'T' | 1942 | 0 | 1 | False | O | <NA> |
12 | 12 | [18, 19): '-' | 118 | 0 | 1 | False | O | <NA> |
13 | 13 | [20, 22): 'PA' | 8544 | 0 | 1 | False | B | LOC |
14 | 14 | [22, 23): 'K' | 2428 | 0 | 1 | False | I | LOC |
15 | 15 | [23, 25): 'IS' | 6258 | 0 | 1 | False | I | LOC |
16 | 16 | [25, 27): 'TA' | 9159 | 0 | 1 | False | I | LOC |
17 | 17 | [27, 28): 'N' | 2249 | 0 | 1 | False | I | LOC |
18 | 18 | [29, 30): 'V' | 159 | 0 | 1 | False | O | <NA> |
19 | 19 | [31, 33): 'NE' | 26546 | 0 | 1 | False | B | LOC |
# Create a Pandas categorical type for consistent encoding of categories
# across all documents.
ENTITY_TYPES = ["LOC", "MISC", "ORG", "PER"]
token_class_dtype, int_to_label, label_to_int = tp.io.conll.make_iob_tag_categories(ENTITY_TYPES)
token_class_dtype
CategoricalDtype(categories=['O', 'B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER'], ordered=False)
# The traditional way to transform NER to token classification is to
# treat each combination of {I,O,B} X {entity type} as a different
# class. Generate class labels in that format.
classes_df = tp.io.conll.add_token_classes(bert_toks_df, token_class_dtype)
classes_df
token_id | span | input_id | token_type_id | attention_mask | special_tokens_mask | ent_iob | ent_type | token_class | token_class_id | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | [0, 0): '' | 101 | 0 | 1 | True | O | <NA> | O | 0 |
1 | 1 | [0, 1): '-' | 118 | 0 | 1 | False | O | <NA> | O | 0 |
2 | 2 | [1, 2): 'D' | 141 | 0 | 1 | False | O | <NA> | O | 0 |
3 | 3 | [2, 4): 'OC' | 9244 | 0 | 1 | False | O | <NA> | O | 0 |
4 | 4 | [4, 6): 'ST' | 9272 | 0 | 1 | False | O | <NA> | O | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
684 | 684 | [1621, 1622): ',' | 117 | 0 | 1 | False | O | <NA> | O | 0 |
685 | 685 | [1623, 1625): 'in' | 1107 | 0 | 1 | False | O | <NA> | O | 0 |
686 | 686 | [1626, 1633): 'Karachi' | 16237 | 0 | 1 | False | B | LOC | B-LOC | 1 |
687 | 687 | [1633, 1634): '.' | 119 | 0 | 1 | False | O | <NA> | O | 0 |
688 | 688 | [0, 0): '' | 102 | 0 | 1 | True | O | <NA> | O | 0 |
689 rows × 10 columns
We are going to use the BERT embeddings as the feature vectors to train our model. First, we will show how they are computed.
# Initialize the BERT model that will be used to generate embeddings.
bert = transformers.BertModel.from_pretrained(bert_model_name)
# Force garbage collection in case this notebook is running on a low-RAM environment.
gc.collect()
# Compute BERT embeddings with the BERT model and add result to our example DataFrame.
embeddings_df = tp.io.bert.add_embeddings(classes_df, bert)
embeddings_df[["token_id", "span", "input_id", "ent_iob", "ent_type", "token_class", "embedding"]].iloc[10:20]
Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertModel: ['classifier.weight', 'classifier.bias'] - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
token_id | span | input_id | ent_iob | ent_type | token_class | embedding | |
---|---|---|---|---|---|---|---|
10 | 10 | [15, 17): 'KE' | 22441 | O | <NA> | O | [ -0.19854169, -0.46898514, 0.7755601... |
11 | 11 | [17, 18): 'T' | 1942 | O | <NA> | O | [ -0.24190396, -0.42399377, 0.9554063... |
12 | 12 | [18, 19): '-' | 118 | O | <NA> | O | [ -0.20076752, -0.7481933, 1.302213... |
13 | 13 | [20, 22): 'PA' | 8544 | B | LOC | B-LOC | [ 0.20202553, -0.26199815, 0.3297633... |
14 | 14 | [22, 23): 'K' | 2428 | I | LOC | I-LOC | [ -0.5462168, -0.90924424, -0.0583674... |
15 | 15 | [23, 25): 'IS' | 6258 | I | LOC | I-LOC | [ -0.37400252, -0.6890734, -0.1446257... |
16 | 16 | [25, 27): 'TA' | 9159 | I | LOC | I-LOC | [ -0.46548516, -0.8717417, 0.3557479... |
17 | 17 | [27, 28): 'N' | 2249 | I | LOC | I-LOC | [ -0.18682763, -0.90081865, 0.3601499... |
18 | 18 | [29, 30): 'V' | 159 | O | <NA> | O | [ -0.16640103, -0.8363804, 0.8740610... |
19 | 19 | [31, 33): 'NE' | 26546 | B | LOC | B-LOC | [ -0.30241105, -0.83826715, 1.105809... |
embeddings_df[["span", "ent_iob", "ent_type", "embedding"]].iloc[70:75]
span | ent_iob | ent_type | embedding | |
---|---|---|---|---|
70 | [155, 168): 'international' | O | <NA> | [ 0.23404993, -0.5534872, 0.9083986, ... |
71 | [169, 176): 'between' | O | <NA> | [ 0.27793035, -0.68538034, 1.1050361, ... |
72 | [177, 185): 'Pakistan' | B | LOC | [ 0.1971882, -0.4634109, 0.5182331, ... |
73 | [186, 189): 'and' | O | <NA> | [ 0.20423535, -0.63758826, 0.82874435, ... |
74 | [190, 193): 'New' | B | LOC | [ 0.2874066, -0.47174183, 0.7771955, ... |
# The `embedding` column is an extension type `TensorDtype` that holds a
# `TensorArray` provided by Text Extensions for Pandas.
embeddings_df["embedding"].dtype
<text_extensions_for_pandas.array.tensor.TensorDtype at 0x7fea4a097040>
A TensorArray can be constructed from a NumPy array of arbitrary dimensions and added to a DataFrame, then used with standard Pandas functionality. See the notebook Text_Extension_for_Pandas_Overview for more on TensorArray.
# Zero-copy conversion to NumPy can be done by first unwrapping the
# `TensorArray` with `.array` and calling `to_numpy()`.
embeddings_arr = embeddings_df["embedding"].array.to_numpy()
embeddings_arr.dtype, embeddings_arr.shape
(dtype('float32'), (689, 768))
Text Extensions for Pandas provides a convenience function that combines the steps above, creating BERT tokens and embeddings in one call. We will use it to add embeddings to the entire corpus.
# Example usage of the convenience function to create BERT tokens and embeddings.
tp.io.bert.conll_to_bert(example_df, tokenizer, bert, token_class_dtype)
token_id | span | input_id | token_type_id | attention_mask | special_tokens_mask | ent_iob | ent_type | token_class | token_class_id | embedding | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | [0, 0): '' | 101 | 0 | 1 | True | O | <NA> | O | 0 | [ -0.08307081, -0.35959032, 1.015068... |
1 | 1 | [0, 1): '-' | 118 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.22862603, -0.49313632, 1.28423... |
2 | 2 | [1, 2): 'D' | 141 | 0 | 1 | False | O | <NA> | O | 0 | [ 0.028480662, -0.17874284, 1.54320... |
3 | 3 | [2, 4): 'OC' | 9244 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.4651753, -0.29836023, 1.073767... |
4 | 4 | [4, 6): 'ST' | 9272 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.10730811, -0.33720982, 1.226979... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
684 | 684 | [1621, 1622): ',' | 117 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.1280663, -0.0023243837, 0.678132... |
685 | 685 | [1623, 1625): 'in' | 1107 | 0 | 1 | False | O | <NA> | O | 0 | [ 0.3053407, -0.52625775, 0.8281702... |
686 | 686 | [1626, 1633): 'Karachi' | 16237 | 0 | 1 | False | B | LOC | B-LOC | 1 | [ -0.048738778, -0.33797324, -0.0583509... |
687 | 687 | [1633, 1634): '.' | 119 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.005289644, -0.29743072, 0.716173... |
688 | 688 | [0, 0): '' | 102 | 0 | 1 | True | O | <NA> | O | 0 | [ -0.50302404, 0.36253828, 0.7314933... |
689 rows × 11 columns
When this notebook runs in a resource-constrained environment like Binder, there may not be enough RAM available to hold all the embeddings in memory. In that case, Gaussian random projection can reduce the size of the embeddings. The projection shrinks the embeddings by a factor of 3 at the expense of a small decrease in model accuracy. Change the constant SHRINK_EMBEDDINGS in the following cell to True if you want to enable this behavior.
SHRINK_EMBEDDINGS = False
PROJECTION_DIMS = 256
RANDOM_SEED = 42
import sklearn.random_projection
projection = sklearn.random_projection.GaussianRandomProjection(
    n_components=PROJECTION_DIMS, random_state=RANDOM_SEED)
def maybe_shrink_embeddings(df):
    if SHRINK_EMBEDDINGS:
        df["embedding"] = tp.TensorArray(projection.fit_transform(df["embedding"]))
    return df
# Run the entire corpus through our processing pipeline.
bert_toks_by_fold = {}
for fold_name in corpus_raw.keys():
    print(f"Processing fold '{fold_name}'...")
    raw = corpus_raw[fold_name]
    with torch.inference_mode():  # This line cuts CPU usage by ~50%
        bert_toks_by_fold[fold_name] = tp.jupyter.run_with_progress_bar(
            len(raw), lambda i: maybe_shrink_embeddings(tp.io.bert.conll_to_bert(
                raw[i], tokenizer, bert, token_class_dtype)))
bert_toks_by_fold["dev"][20]
Processing fold 'train'...
IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=946, style=ProgressStyle(desc…
Processing fold 'dev'...
IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=216, style=ProgressStyle(desc…
Processing fold 'test'...
IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=231, style=ProgressStyle(desc…
token_id | span | input_id | token_type_id | attention_mask | special_tokens_mask | ent_iob | ent_type | token_class | token_class_id | embedding | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | [0, 0): '' | 101 | 0 | 1 | True | O | <NA> | O | 0 | [ -0.17669655, -0.3989963, 0.908887... |
1 | 1 | [0, 1): '-' | 118 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.3855382, -0.50232756, 1.173232... |
2 | 2 | [1, 2): 'D' | 141 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.11718995, -0.12701154, 1.38969... |
3 | 3 | [2, 4): 'OC' | 9244 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.39025685, -0.25043246, 1.074507... |
4 | 4 | [4, 6): 'ST' | 9272 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.27732754, -0.26160136, 1.078761... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2154 | 2154 | [5704, 5705): ')' | 114 | 0 | 1 | False | O | <NA> | O | 0 | [ 0.015393024, -0.040650737, 1.001185... |
2155 | 2155 | [5706, 5708): '39' | 3614 | 0 | 1 | False | O | <NA> | O | 0 | [ 0.075038865, 0.014400693, 1.043231... |
2156 | 2156 | [5708, 5709): '.' | 119 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.085796565, 0.05905571, 1.114640... |
2157 | 2157 | [5709, 5711): '93' | 5429 | 0 | 1 | False | O | <NA> | O | 0 | [ 0.0113782445, -0.26387203, 0.881803... |
2158 | 2158 | [0, 0): '' | 102 | 0 | 1 | True | O | <NA> | O | 0 | [ 0.48513305, 1.5709875, 0.592935... |
2159 rows × 11 columns
# Create a single DataFrame with the entire corpus's embeddings.
corpus_df = tp.io.conll.combine_folds(bert_toks_by_fold)
corpus_df
fold | doc_num | token_id | span | input_id | token_type_id | attention_mask | special_tokens_mask | ent_iob | ent_type | token_class | token_class_id | embedding | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | train | 0 | 0 | [0, 0): '' | 101 | 0 | 1 | True | O | <NA> | O | 0 | [ -0.098505504, -0.4050192, 0.742888... |
1 | train | 0 | 1 | [0, 1): '-' | 118 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.057021566, -0.48112106, 0.989868... |
2 | train | 0 | 2 | [1, 2): 'D' | 141 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.04824192, -0.2532998, 1.16719... |
3 | train | 0 | 3 | [2, 4): 'OC' | 9244 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.26682985, -0.31008705, 1.00747... |
4 | train | 0 | 4 | [4, 6): 'ST' | 9272 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.22296886, -0.21308525, 0.933102... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
416536 | test | 230 | 314 | [1386, 1393): 'brother' | 1711 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.02817309, -0.08062352, 0.9804888... |
416537 | test | 230 | 315 | [1393, 1394): ',' | 117 | 0 | 1 | False | O | <NA> | O | 0 | [ 0.118173525, -0.07008511, 0.865484... |
416538 | test | 230 | 316 | [1395, 1400): 'Bobby' | 5545 | 0 | 1 | False | B | PER | B-PER | 4 | [ -0.35689434, 0.31400475, 1.573854... |
416539 | test | 230 | 317 | [1400, 1401): '.' | 119 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.18957116, -0.2458116, 0.66257... |
416540 | test | 230 | 318 | [0, 0): '' | 102 | 0 | 1 | True | O | <NA> | O | 0 | [ -0.4468915, -0.31665248, 0.779688... |
416541 rows × 13 columns
With the TensorArray from Text Extensions for Pandas, the computed embeddings can be persisted as a tensor along with the rest of the DataFrame using standard Pandas input/output methods. Since computing the embeddings is costly and the results are deterministic, checkpointing the data to disk here can save a lot of time: it lets us continue with model training without recomputing the BERT embeddings.
# Write the tokenized corpus with embeddings to a Feather file.
# We can't currently serialize span columns that cover multiple documents (see issue #73 https://github.com/CODAIT/text-extensions-for-pandas/issues/73),
# so drop span columns from the contents we write to the Feather file.
cols_to_drop = [c for c in corpus_df.columns if "span" in c]
corpus_df.drop(columns=cols_to_drop).to_feather("outputs/corpus.feather")
# Read the serialized embeddings back in so that you can rerun the model
# training parts of this notebook (the cells from here onward) without
# regenerating the embeddings.
corpus_df = pd.read_feather("outputs/corpus.feather")
corpus_df
fold | doc_num | token_id | input_id | token_type_id | attention_mask | special_tokens_mask | ent_iob | ent_type | token_class | token_class_id | embedding | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | train | 0 | 0 | 101 | 0 | 1 | True | O | <NA> | O | 0 | [ -0.098505504, -0.4050192, 0.742888... |
1 | train | 0 | 1 | 118 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.057021566, -0.48112106, 0.989868... |
2 | train | 0 | 2 | 141 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.04824192, -0.2532998, 1.16719... |
3 | train | 0 | 3 | 9244 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.26682985, -0.31008705, 1.00747... |
4 | train | 0 | 4 | 9272 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.22296886, -0.21308525, 0.933102... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
416536 | test | 230 | 314 | 1711 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.02817309, -0.08062352, 0.9804888... |
416537 | test | 230 | 315 | 117 | 0 | 1 | False | O | <NA> | O | 0 | [ 0.118173525, -0.07008511, 0.865484... |
416538 | test | 230 | 316 | 5545 | 0 | 1 | False | B | PER | B-PER | 4 | [ -0.35689434, 0.31400475, 1.573854... |
416539 | test | 230 | 317 | 119 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.18957116, -0.2458116, 0.66257... |
416540 | test | 230 | 318 | 102 | 0 | 1 | True | O | <NA> | O | 0 | [ -0.4468915, -0.31665248, 0.779688... |
416541 rows × 12 columns
Now we will use the loaded BERT embeddings to train a multinomial logistic regression model that predicts each token's class from its embedding tensor.
# Extract the training set DataFrame.
train_df = corpus_df[corpus_df["fold"] == "train"]
train_df
fold | doc_num | token_id | input_id | token_type_id | attention_mask | special_tokens_mask | ent_iob | ent_type | token_class | token_class_id | embedding | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | train | 0 | 0 | 101 | 0 | 1 | True | O | <NA> | O | 0 | [ -0.098505504, -0.4050192, 0.742888... |
1 | train | 0 | 1 | 118 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.057021566, -0.48112106, 0.989868... |
2 | train | 0 | 2 | 141 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.04824192, -0.2532998, 1.16719... |
3 | train | 0 | 3 | 9244 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.26682985, -0.31008705, 1.00747... |
4 | train | 0 | 4 | 9272 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.22296886, -0.21308525, 0.933102... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
281104 | train | 945 | 53 | 17057 | 0 | 1 | False | B | ORG | B-ORG | 3 | [ 0.7556371, -0.91891253, -0.1403036... |
281105 | train | 945 | 54 | 122 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.11528473, -0.44492027, 0.4715562... |
281106 | train | 945 | 55 | 4617 | 0 | 1 | False | B | ORG | B-ORG | 3 | [ 0.45602208, -0.8970848, 0.0678616... |
281107 | train | 945 | 56 | 123 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.19713743, -0.5427194, 0.294020... |
281108 | train | 945 | 57 | 102 | 0 | 1 | True | O | <NA> | O | 0 | [ -0.57650733, -0.42160645, 0.994703... |
281109 rows × 12 columns
%%time
# Train a multinomial logistic regression model on the training set.
MULTI_CLASS = "multinomial"
# How many iterations to run the L-BFGS optimizer when fitting logistic
# regression models. 100 ==> fast; 10000 ==> full convergence
LBGFS_ITERATIONS = 10000
_REGULARIZATION_COEFF = 1e-1  # Smaller values ==> more regularization
base_pipeline = sklearn.pipeline.Pipeline([
    # Standard scaler. This only makes a difference for certain classes
    # of embeddings.
    # ("scaler", sklearn.preprocessing.StandardScaler()),
    ("mlogreg", sklearn.linear_model.LogisticRegression(
        multi_class=MULTI_CLASS,
        verbose=1,
        max_iter=LBGFS_ITERATIONS,
        C=_REGULARIZATION_COEFF
    ))
])
X_train = train_df["embedding"].values
Y_train = train_df["token_class_id"]
base_model = base_pipeline.fit(X_train, Y_train)
base_model
base_model
RUNNING THE L-BFGS-B CODE * * * Machine precision = 2.220D-16 N = 6921 M = 10 At X0 0 variables are exactly at the bounds At iterate 0 f= 6.17660D+05 |proj g|= 4.23293D+05
This problem is unconstrained.
At iterate 50 f= 1.22005D+04 |proj g|= 2.48275D+02 At iterate 100 f= 8.87639D+03 |proj g|= 1.72205D+02 At iterate 150 f= 8.07946D+03 |proj g|= 1.28633D+02 At iterate 200 f= 7.87840D+03 |proj g|= 6.20068D+01 At iterate 250 f= 7.81730D+03 |proj g|= 9.11741D+00 At iterate 300 f= 7.80144D+03 |proj g|= 6.86435D+00 At iterate 350 f= 7.79623D+03 |proj g|= 7.21843D+00 At iterate 400 f= 7.79451D+03 |proj g|= 5.64213D+00 At iterate 450 f= 7.79356D+03 |proj g|= 2.47884D+00 At iterate 500 f= 7.79273D+03 |proj g|= 2.32130D+00 At iterate 550 f= 7.79141D+03 |proj g|= 1.03513D+01 At iterate 600 f= 7.78944D+03 |proj g|= 4.39763D+00 At iterate 650 f= 7.78798D+03 |proj g|= 2.72198D+00 At iterate 700 f= 7.78721D+03 |proj g|= 2.49312D+00 At iterate 750 f= 7.78691D+03 |proj g|= 2.09049D+00 At iterate 800 f= 7.78678D+03 |proj g|= 1.56225D+00 At iterate 850 f= 7.78669D+03 |proj g|= 9.61272D-01 At iterate 900 f= 7.78660D+03 |proj g|= 1.88970D+00 At iterate 950 f= 7.78644D+03 |proj g|= 1.39468D+00 At iterate 1000 f= 7.78615D+03 |proj g|= 1.56165D+00 At iterate 1050 f= 7.78593D+03 |proj g|= 1.81700D+00 At iterate 1100 f= 7.78581D+03 |proj g|= 1.11273D+00 At iterate 1150 f= 7.78577D+03 |proj g|= 4.10524D-01 At iterate 1200 f= 7.78575D+03 |proj g|= 3.49336D-01 At iterate 1250 f= 7.78574D+03 |proj g|= 8.20185D-01 At iterate 1300 f= 7.78571D+03 |proj g|= 9.94495D-01 At iterate 1350 f= 7.78567D+03 |proj g|= 7.14421D-01 At iterate 1400 f= 7.78563D+03 |proj g|= 3.46513D-01 At iterate 1450 f= 7.78561D+03 |proj g|= 1.15784D+00 At iterate 1500 f= 7.78559D+03 |proj g|= 5.66811D-01 At iterate 1550 f= 7.78559D+03 |proj g|= 1.43156D-01 At iterate 1600 f= 7.78558D+03 |proj g|= 1.60595D-01 * * * Tit = total number of iterations Tnf = total number of function evaluations Tnint = total number of segments explored during Cauchy searches Skip = number of BFGS updates skipped Nact = number of active bounds at final generalized Cauchy point Projg = norm of the final projected gradient F = final function value * * 
* N Tit Tnf Tnint Skip Nact Projg F 6921 1604 1694 1 0 0 4.829D-01 7.786D+03 F = 7785.5829997825367 CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH CPU times: user 1h 34min 15s, sys: 6min 41s, total: 1h 40min 56s Wall time: 12min 44s
Pipeline(steps=[('mlogreg', LogisticRegression(C=0.1, max_iter=10000, multi_class='multinomial', verbose=1))])
LogisticRegression(C=0.1, max_iter=10000, multi_class='multinomial', verbose=1)
Using our trained model, we can now predict the token class of each token in the test set from its BERT embedding.
# Define a function that will let us make predictions on a fold of the corpus.
def predict_on_df(df: pd.DataFrame, id_to_class: Dict[int, str], predictor):
"""
Run a trained model on a DataFrame of tokens with embeddings.
:param df: DataFrame of tokens for a document, containing a TokenSpan column
called "embedding" for each token.
:param id_to_class: Mapping from class ID to class name, as returned by
:func:`text_extensions_for_pandas.make_iob_tag_categories`
:param predictor: Python object with a `predict_proba` method that accepts
a numpy array of embeddings.
:returns: A copy of `df`, with the following additional columns:
`predicted_id`, `predicted_class`, `predicted_iob`, `predicted_type`
and `predicted_class_pr`.
"""
result_df = df.copy()
embeddings = result_df["embedding"].to_numpy()
class_pr = tp.TensorArray(predictor.predict_proba(embeddings))
result_df["predicted_id"] = np.argmax(class_pr, axis=1)
result_df["predicted_class"] = [id_to_class[i]
for i in result_df["predicted_id"].values]
iobs, types = tp.io.conll.decode_class_labels(result_df["predicted_class"].values)
result_df["predicted_iob"] = iobs
result_df["predicted_type"] = types
result_df["predicted_class_pr"] = class_pr
return result_df
# Make predictions on the test set.
test_results_df = predict_on_df(corpus_df[corpus_df["fold"] == "test"], int_to_label, base_model)
test_results_df.head()
fold | doc_num | token_id | input_id | token_type_id | attention_mask | special_tokens_mask | ent_iob | ent_type | token_class | token_class_id | embedding | predicted_id | predicted_class | predicted_iob | predicted_type | predicted_class_pr | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
351001 | test | 0 | 0 | 101 | 0 | 1 | True | O | <NA> | O | 0 | [ -0.19626583, -0.450937, 0.6775361... | 0 | O | O | None | [ 0.9994774788863705, 1.9985127298723906e-0... |
351002 | test | 0 | 1 | 118 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.3187211, -0.5074784, 1.046454... | 0 | O | O | None | [ 0.9992964240340214, 3.7581023374440964e-0... |
351003 | test | 0 | 2 | 141 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.080538824, -0.2477481, 1.356255... | 0 | O | O | None | [ 0.998973288221842, 0.0004299715907382311... |
351004 | test | 0 | 3 | 9244 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.6878579, -0.30290246, 0.8842714... | 0 | O | O | None | [ 0.9983217119367633, 4.888114850946988e-0... |
351005 | test | 0 | 4 | 9272 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.2963228, -0.23313177, 0.93988... | 0 | O | O | None | [ 0.9999185106741023, 8.938753477308423e-0... |
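The post-processing inside `predict_on_df` boils down to an argmax over the class probabilities, a lookup in the label map, and a split of each label into its IOB tag and entity type. A minimal sketch with toy data (the three-class probability matrix and label map below are made up for illustration, and `split_label` is a hypothetical helper mirroring what `tp.io.conll.decode_class_labels` does):

```python
import numpy as np

# Hypothetical 3-class setup: 0 -> "O", 1 -> "B-LOC", 2 -> "I-LOC"
id_to_class = {0: "O", 1: "B-LOC", 2: "I-LOC"}
class_pr = np.array([
    [0.9, 0.05, 0.05],   # clearly "O"
    [0.1, 0.8,  0.1],    # clearly "B-LOC"
    [0.2, 0.1,  0.7],    # clearly "I-LOC"
])

predicted_id = np.argmax(class_pr, axis=1)
predicted_class = [id_to_class[i] for i in predicted_id]

# Split "B-LOC"-style labels into an IOB tag and an entity type, with "O"
# mapping to ("O", None).
def split_label(label):
    if label == "O":
        return "O", None
    iob, _, ent_type = label.partition("-")
    return iob, ent_type

iobs, types = zip(*(split_label(c) for c in predicted_class))
print(predicted_class)  # ['O', 'B-LOC', 'I-LOC']
```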
# Take a slice to show a region with more entities.
test_results_df.iloc[40:50]
fold | doc_num | token_id | input_id | token_type_id | attention_mask | special_tokens_mask | ent_iob | ent_type | token_class | token_class_id | embedding | predicted_id | predicted_class | predicted_iob | predicted_type | predicted_class_pr | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
351041 | test | 0 | 40 | 3309 | 0 | 1 | False | I | PER | I-PER | 8 | [ -0.21029201, -0.8535674, 0.0002756594... | 6 | I-MISC | I | MISC | [ 0.0010111308810159478, 1.6209660863726316e-0... |
351042 | test | 0 | 41 | 1306 | 0 | 1 | False | I | PER | I-PER | 8 | [ -0.23205486, -0.9290767, 0.3889118... | 6 | I-MISC | I | MISC | [ 0.012755027203264928, 0.00554094580945546... |
351043 | test | 0 | 42 | 2001 | 0 | 1 | False | I | PER | I-PER | 8 | [ 0.36844134, -0.68091154, -0.1059106... | 5 | I-LOC | I | LOC | [ 0.008349822538261149, 0.180904633782168... |
351044 | test | 0 | 43 | 1181 | 0 | 1 | False | I | PER | I-PER | 8 | [ -0.30131084, -0.6546019, -0.1726912... | 8 | I-PER | I | PER | [ 0.013398092974719904, 0.000889872066127380... |
351045 | test | 0 | 44 | 2293 | 0 | 1 | False | I | PER | I-PER | 8 | [ -0.1611614, -0.69891113, 0.2342468... | 5 | I-LOC | I | LOC | [ 0.014927046511081343, 0.0209250472885050... |
351046 | test | 0 | 45 | 18589 | 0 | 1 | False | B | LOC | B-LOC | 1 | [ -0.058567554, -0.79558676, 0.3360603... | 1 | B-LOC | B | LOC | [ 0.027281135850703336, 0.532249166723370... |
351047 | test | 0 | 46 | 118 | 0 | 1 | False | I | LOC | I-LOC | 5 | [ 0.2037595, -0.73730904, -0.0888521... | 5 | I-LOC | I | LOC | [ 0.22512840995098554, 0.00379439656874946... |
351048 | test | 0 | 47 | 19016 | 0 | 1 | False | I | LOC | I-LOC | 5 | [ -0.10341229, -0.33681834, 0.1738456... | 5 | I-LOC | I | LOC | [ 0.04472568023866835, 0.436126151622446... |
351049 | test | 0 | 48 | 2249 | 0 | 1 | False | I | LOC | I-LOC | 5 | [ -0.4054268, -0.6516522, 0.2469... | 5 | I-LOC | I | LOC | [ 0.0009405393288526446, 0.00244544190700176... |
351050 | test | 0 | 49 | 117 | 0 | 1 | False | O | <NA> | O | 0 | [ -0.16829254, -0.6475861, 0.8149025... | 0 | O | O | None | [ 0.9999736550716568, 5.7005018158771435e-0... |
With our model's predictions on the test set, we can now compute precision and recall. To do this, we will:
* Split the model outputs back into one DataFrame per document and join in the token information.
* Convert the IOB2 tags to entity spans with `tp.io.conll.iob_to_spans()`.
* Compute per-document statistics and aggregate them into collection-wide precision and recall.
# Split model outputs for an entire fold back into documents and add
# token information.
# Get unique documents per fold.
fold_and_doc = test_results_df[["fold", "doc_num"]] \
.drop_duplicates() \
.to_records(index=False)
# Index by fold, doc and token id, then make sure sorted.
indexed_df = test_results_df \
.set_index(["fold", "doc_num", "token_id"], verify_integrity=True) \
.sort_index()
# Join predictions with token information, for each document.
test_results_by_doc = {}
for collection, doc_num in fold_and_doc:
doc_slice = indexed_df.loc[collection, doc_num].reset_index()
doc_toks = bert_toks_by_fold[collection][doc_num][
["token_id", "span", "ent_iob", "ent_type"]
].rename(columns={"id": "token_id"})
joined_df = doc_toks.merge(
doc_slice[["token_id", "predicted_iob", "predicted_type"]])
test_results_by_doc[(collection, doc_num)] = joined_df
# Test results are now in one DataFrame per document.
test_results_by_doc[("test", 0)].iloc[40:60]
token_id | span | ent_iob | ent_type | predicted_iob | predicted_type | |
---|---|---|---|---|---|---|
40 | 40 | [68, 70): 'di' | I | PER | I | MISC |
41 | 41 | [70, 71): 'm' | I | PER | I | MISC |
42 | 42 | [72, 74): 'La' | I | PER | I | LOC |
43 | 43 | [74, 75): 'd' | I | PER | I | PER |
44 | 44 | [75, 77): 'ki' | I | PER | I | LOC |
45 | 45 | [78, 80): 'AL' | B | LOC | B | LOC |
46 | 46 | [80, 81): '-' | I | LOC | I | LOC |
47 | 47 | [81, 83): 'AI' | I | LOC | I | LOC |
48 | 48 | [83, 84): 'N' | I | LOC | I | LOC |
49 | 49 | [84, 85): ',' | O | <NA> | O | None |
50 | 50 | [86, 92): 'United' | B | LOC | B | LOC |
51 | 51 | [93, 97): 'Arab' | I | LOC | I | LOC |
52 | 52 | [98, 106): 'Emirates' | I | LOC | I | LOC |
53 | 53 | [107, 111): '1996' | O | <NA> | O | None |
54 | 54 | [111, 112): '-' | O | <NA> | O | None |
55 | 55 | [112, 114): '12' | O | <NA> | O | None |
56 | 56 | [114, 115): '-' | O | <NA> | O | None |
57 | 57 | [115, 117): '06' | O | <NA> | O | None |
58 | 58 | [118, 123): 'Japan' | B | LOC | B | LOC |
59 | 59 | [124, 129): 'began' | O | <NA> | O | None |
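The conversion performed in the next cell can be sketched in plain Python: walk the tag sequence, open a span at each `B` tag, extend it over following `I` tags of the same type, and close it otherwise. This is a simplified stand-in for `tp.io.conll.iob_to_spans`, working on token indices instead of character spans:

```python
def iob2_to_spans(iobs, types):
    """Return (begin_token, end_token, entity_type) triples from IOB2 tags."""
    spans, start, cur_type = [], None, None
    for i, (iob, etype) in enumerate(zip(iobs, types)):
        # Close the open span if the current tag doesn't continue it.
        if start is not None and (iob != "I" or etype != cur_type):
            spans.append((start, i, cur_type))
            start = None
        if iob == "B":
            start, cur_type = i, etype
    if start is not None:
        spans.append((start, len(iobs), cur_type))
    return spans

iobs  = ["O", "B", "I", "I", "O", "B"]
types = [None, "LOC", "LOC", "LOC", None, "PER"]
print(iob2_to_spans(iobs, types))  # [(1, 4, 'LOC'), (5, 6, 'PER')]
```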
# Convert IOB2-format tags to (span, entity type) pairs with
# `tp.io.conll.iob_to_spans()`.
test_actual_spans = {k: tp.io.conll.iob_to_spans(v) for k, v in test_results_by_doc.items()}
test_model_spans = {k:
tp.io.conll.iob_to_spans(v, iob_col_name = "predicted_iob",
entity_type_col_name = "predicted_type")
.rename(columns={"predicted_type": "ent_type"})
for k, v in test_results_by_doc.items()}
test_model_spans[("test", 0)].head()
span | ent_type | |
---|---|---|
0 | [19, 24): 'JAPAN' | PER |
1 | [29, 34): 'LUCKY' | LOC |
2 | [40, 45): 'CHINA' | ORG |
3 | [66, 77): 'Nadim Ladki' | LOC |
4 | [78, 84): 'AL-AIN' | LOC |
# Compute per-document statistics into a single DataFrame.
test_stats_by_doc = tp.io.conll.compute_accuracy_by_document(test_actual_spans, test_model_spans)
test_stats_by_doc
fold | doc_num | num_true_positives | num_extracted | num_entities | precision | recall | F1 | |
---|---|---|---|---|---|---|---|---|
0 | test | 0 | 41 | 47 | 45 | 0.872340 | 0.911111 | 0.891304 |
1 | test | 1 | 41 | 42 | 44 | 0.976190 | 0.931818 | 0.953488 |
2 | test | 2 | 52 | 54 | 54 | 0.962963 | 0.962963 | 0.962963 |
3 | test | 3 | 42 | 44 | 44 | 0.954545 | 0.954545 | 0.954545 |
4 | test | 4 | 18 | 19 | 19 | 0.947368 | 0.947368 | 0.947368 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
226 | test | 226 | 6 | 7 | 7 | 0.857143 | 0.857143 | 0.857143 |
227 | test | 227 | 18 | 19 | 21 | 0.947368 | 0.857143 | 0.900000 |
228 | test | 228 | 24 | 28 | 27 | 0.857143 | 0.888889 | 0.872727 |
229 | test | 229 | 25 | 27 | 27 | 0.925926 | 0.925926 | 0.925926 |
230 | test | 230 | 25 | 27 | 28 | 0.925926 | 0.892857 | 0.909091 |
231 rows × 8 columns
# Collection-wide precision and recall can be computed by aggregating
# our DataFrame.
tp.io.conll.compute_global_accuracy(test_stats_by_doc)
{'num_true_positives': 4881, 'num_entities': 5648, 'num_extracted': 5620, 'precision': 0.8685053380782918, 'recall': 0.8641997167138811, 'F1': 0.8663471778487754}
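The aggregate numbers above are micro-averaged: precision is true positives over extracted spans, recall is true positives over gold entities, and F1 is their harmonic mean. Recomputing them from the raw counts in the dict:

```python
num_true_positives, num_extracted, num_entities = 4881, 5620, 5648

precision = num_true_positives / num_extracted
recall = num_true_positives / num_entities
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
# ~0.8685, ~0.8642, ~0.8663 -- matching the dict above
```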
The above results aren't bad for a first attempt, but a look at some of the predictions shows that single tokens are sometimes split into multiple entities. This happens because the BERT tokenizer uses WordPiece to produce subword tokens; see https://huggingface.co/transformers/tokenizer_summary.html and https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf for more information.
This causes a problem when computing precision and recall: since we compare exact spans, a split entity is counted as a false negative and possibly one or more false positives. Luckily, we can fix this up with Text Extensions for Pandas.
Let's drill down into an example of the issue and how to correct it.
# Every once in a while, the BERT model will split a token in the original data
# set into multiple entities. For example, look at document 202 of the test set:
test_model_spans[("test", 202)].head(10)
span | ent_type | |
---|---|---|
0 | [11, 22): 'RUGBY UNION' | ORG |
1 | [24, 31): 'BRITISH' | MISC |
2 | [41, 47): 'LONDON' | LOC |
3 | [70, 77): 'British' | MISC |
4 | [111, 125): 'Pilkington Cup' | MISC |
5 | [139, 146): 'Reading' | ORG |
6 | [150, 151): 'W' | ORG |
7 | [151, 156): 'idnes' | ORG |
8 | [159, 166): 'English' | MISC |
9 | [180, 184): 'Bath' | ORG |
Notice `[150, 151): 'W'` and `[151, 156): 'idnes'`. These outputs are parts of the same original token, but the model has split them into separate entities.
# We can use spanner algebra in `tp.spanner.overlap_join()`
# to fix up these outputs.
spans_df = test_model_spans[("test", 202)]
toks_df = test_raw[202]
# First, find which tokens the spans overlap with:
overlaps_df = (
tp.spanner.overlap_join(spans_df["span"], toks_df["span"],
"span", "corpus_token")
.merge(spans_df)
)
overlaps_df.head(10)
span | corpus_token | ent_type | |
---|---|---|---|
0 | [11, 22): 'RUGBY UNION' | [11, 16): 'RUGBY' | ORG |
1 | [11, 22): 'RUGBY UNION' | [17, 22): 'UNION' | ORG |
2 | [24, 31): 'BRITISH' | [24, 31): 'BRITISH' | MISC |
3 | [41, 47): 'LONDON' | [41, 47): 'LONDON' | LOC |
4 | [70, 77): 'British' | [70, 77): 'British' | MISC |
5 | [111, 125): 'Pilkington Cup' | [111, 121): 'Pilkington' | MISC |
6 | [111, 125): 'Pilkington Cup' | [122, 125): 'Cup' | MISC |
7 | [139, 146): 'Reading' | [139, 146): 'Reading' | ORG |
8 | [150, 151): 'W' | [150, 156): 'Widnes' | ORG |
9 | [151, 156): 'idnes' | [150, 156): 'Widnes' | ORG |
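Under the hood, an overlap join pairs every entity span with every corpus token whose character interval intersects it; two half-open intervals `[b1, e1)` and `[b2, e2)` overlap exactly when `b1 < e2` and `b2 < e1`. A nested-loop sketch of the idea on bare `(begin, end)` tuples (the real `tp.spanner.overlap_join` works on span columns and is far more efficient):

```python
def overlap_join(spans, tokens):
    """Pair each (begin, end) span with every token interval it overlaps."""
    return [(s, t) for s in spans for t in tokens
            if s[0] < t[1] and t[0] < s[1]]

# The two model spans that split 'Widnes', plus one unrelated token.
spans  = [(150, 151), (151, 156)]
tokens = [(139, 146), (150, 156)]
print(overlap_join(spans, tokens))
# [((150, 151), (150, 156)), ((151, 156), (150, 156))]
```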
# Next, compute the minimum span that covers all the corpus tokens
# that overlap with each entity span.
agg_df = (
overlaps_df
.groupby("span")
.aggregate({"corpus_token": "sum", "ent_type": "first"})
.reset_index()
)
agg_df.head(10)
span | corpus_token | ent_type | |
---|---|---|---|
0 | [11, 22): 'RUGBY UNION' | [11, 22): 'RUGBY UNION' | ORG |
1 | [24, 31): 'BRITISH' | [24, 31): 'BRITISH' | MISC |
2 | [41, 47): 'LONDON' | [41, 47): 'LONDON' | LOC |
3 | [70, 77): 'British' | [70, 77): 'British' | MISC |
4 | [111, 125): 'Pilkington Cup' | [111, 125): 'Pilkington Cup' | MISC |
5 | [139, 146): 'Reading' | [139, 146): 'Reading' | ORG |
6 | [150, 151): 'W' | [150, 156): 'Widnes' | ORG |
7 | [151, 156): 'idnes' | [150, 156): 'Widnes' | ORG |
8 | [159, 166): 'English' | [159, 166): 'English' | MISC |
9 | [180, 184): 'Bath' | [180, 184): 'Bath' | ORG |
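The groupby above computes, for each entity span, the minimum interval that covers every corpus token it overlaps (the `"sum"` aggregate on a span column concatenates spans in Text Extensions for Pandas). With plain `(begin, end)` tuples, the same step is just a min over begins and a max over ends:

```python
def covering_span(token_intervals):
    """Minimum half-open interval covering all (begin, end) intervals given."""
    begins, ends = zip(*token_intervals)
    return (min(begins), max(ends))

# 'W' and 'idnes' both overlap the corpus token [150, 156): 'Widnes',
# so both aggregate to the same covering interval.
print(covering_span([(150, 156)]))               # (150, 156)
# 'Pilkington' [111, 121) plus 'Cup' [122, 125) cover 'Pilkington Cup':
print(covering_span([(111, 121), (122, 125)]))   # (111, 125)
```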
# Finally, take unique values and convert character-based spans to token
# spans in the corpus tokenization (since the new offsets might not match a
# BERT tokenizer token boundary).
cons_df = (
tp.spanner.consolidate(agg_df, "corpus_token")[["corpus_token", "ent_type"]]
.rename(columns={"corpus_token": "span"})
)
cons_df["span"] = tp.TokenSpanArray.align_to_tokens(toks_df["span"],
cons_df["span"])
cons_df.head(10)
span | ent_type | |
---|---|---|
0 | [11, 22): 'RUGBY UNION' | ORG |
1 | [24, 31): 'BRITISH' | MISC |
2 | [41, 47): 'LONDON' | LOC |
3 | [70, 77): 'British' | MISC |
4 | [111, 125): 'Pilkington Cup' | MISC |
5 | [139, 146): 'Reading' | ORG |
6 | [150, 156): 'Widnes' | ORG |
8 | [159, 166): 'English' | MISC |
9 | [180, 184): 'Bath' | ORG |
10 | [188, 198): 'Harlequins' | ORG |
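After the aggregation, 'W' and 'idnes' both map to the interval [150, 156), so the final step keeps only one span per covered region. A toy consolidation that keeps the first span and drops any later span it already contains (a simplification of `tp.spanner.consolidate`, which resolves general overlaps):

```python
def consolidate(spans_with_types):
    """Keep the first (begin, end, type) span per region; drop contained duplicates."""
    kept = []
    for begin, end, etype in spans_with_types:
        if not any(kb <= begin and end <= ke for kb, ke, _ in kept):
            kept.append((begin, end, etype))
    return kept

spans = [(150, 156, "ORG"), (150, 156, "ORG"), (159, 166, "MISC")]
print(consolidate(spans))  # [(150, 156, 'ORG'), (159, 166, 'MISC')]
```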
# Text Extensions for Pandas provides a single function that performs the
# work of the previous three cells.
tp.io.bert.align_bert_tokens_to_corpus_tokens(test_model_spans[("test", 202)], test_raw[202]).head(10)
span | ent_type | |
---|---|---|
0 | [11, 22): 'RUGBY UNION' | ORG |
1 | [24, 31): 'BRITISH' | MISC |
2 | [41, 47): 'LONDON' | LOC |
3 | [70, 77): 'British' | MISC |
4 | [111, 125): 'Pilkington Cup' | MISC |
5 | [139, 146): 'Reading' | ORG |
6 | [150, 156): 'Widnes' | ORG |
8 | [159, 166): 'English' | MISC |
9 | [180, 184): 'Bath' | ORG |
10 | [188, 198): 'Harlequins' | ORG |
# Run all of our DataFrames through `align_bert_tokens_to_corpus_tokens()`.
keys = list(test_model_spans.keys())
new_values = tp.jupyter.run_with_progress_bar(
len(keys),
lambda i: tp.io.bert.align_bert_tokens_to_corpus_tokens(test_model_spans[keys[i]], test_raw[keys[i][1]]))
test_model_spans = {k: v for k, v in zip(keys, new_values)}
test_model_spans[("test", 202)].head(10)
span | ent_type | |
---|---|---|
0 | [11, 22): 'RUGBY UNION' | ORG |
1 | [24, 31): 'BRITISH' | MISC |
2 | [41, 47): 'LONDON' | LOC |
3 | [70, 77): 'British' | MISC |
4 | [111, 125): 'Pilkington Cup' | MISC |
5 | [139, 146): 'Reading' | ORG |
6 | [150, 156): 'Widnes' | ORG |
8 | [159, 166): 'English' | MISC |
9 | [180, 184): 'Bath' | ORG |
10 | [188, 198): 'Harlequins' | ORG |
# Compute per-document statistics into a single DataFrame.
test_stats_by_doc = tp.io.conll.compute_accuracy_by_document(test_actual_spans, test_model_spans)
test_stats_by_doc
fold | doc_num | num_true_positives | num_extracted | num_entities | precision | recall | F1 | |
---|---|---|---|---|---|---|---|---|
0 | test | 0 | 42 | 47 | 45 | 0.893617 | 0.933333 | 0.913043 |
1 | test | 1 | 41 | 42 | 44 | 0.976190 | 0.931818 | 0.953488 |
2 | test | 2 | 52 | 54 | 54 | 0.962963 | 0.962963 | 0.962963 |
3 | test | 3 | 42 | 44 | 44 | 0.954545 | 0.954545 | 0.954545 |
4 | test | 4 | 18 | 19 | 19 | 0.947368 | 0.947368 | 0.947368 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
226 | test | 226 | 7 | 7 | 7 | 1.000000 | 1.000000 | 1.000000 |
227 | test | 227 | 18 | 19 | 21 | 0.947368 | 0.857143 | 0.900000 |
228 | test | 228 | 24 | 27 | 27 | 0.888889 | 0.888889 | 0.888889 |
229 | test | 229 | 26 | 27 | 27 | 0.962963 | 0.962963 | 0.962963 |
230 | test | 230 | 26 | 27 | 28 | 0.962963 | 0.928571 | 0.945455 |
231 rows × 8 columns
# Collection-wide precision and recall can be computed by aggregating
# our DataFrame.
tp.io.conll.compute_global_accuracy(test_stats_by_doc)
{'num_true_positives': 4971, 'num_entities': 5648, 'num_extracted': 5587, 'precision': 0.889744048684446, 'recall': 0.8801345609065155, 'F1': 0.8849132176234981}
These results are a bit better than before. While the F1 score is not high by today's standards, it is decent for such a simple model. More importantly, we showed that it is fairly easy to build a model for named entity recognition and analyze its output by leveraging Pandas DataFrames together with the Text Extensions for Pandas `SpanArray` and `TensorArray` extension types and the integration with BERT from Huggingface Transformers.