This Jupyter notebook shows how to use the Text Extensions for Pandas library to analyze the outputs of an NLP model on a target corpus. We use the CoNLL-2003 corpus as our target corpus, and we use the output of the "bender" team in the original CoNLL-2003 competition as our example model output.
This notebook requires a Python 3.7 or later environment with numpy and pandas.

The notebook also requires the text_extensions_for_pandas library. You can satisfy this dependency in two ways: either run
pip install text_extensions_for_pandas
before running this notebook to add the library to your Python environment, or run the notebook from the notebooks directory of the project source tree, in which case the import code below falls back to the local copy of the library.
import os
import sys
import numpy as np
import pandas as pd
# And of course we need the text_extensions_for_pandas library itself.
try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    # If we're running from within the project source tree and the parent Python
    # environment doesn't have the text_extensions_for_pandas package, use the
    # version in the local source tree.
    if not os.getcwd().endswith("notebooks"):
        raise e
    if ".." not in sys.path:
        sys.path.insert(0, "..")
    import text_extensions_for_pandas as tp
CoNLL, the SIGNLL Conference on Computational Natural Language Learning, is an annual academic conference for natural language processing researchers. Each year's conference features a competition involving a challenging NLP task. The task for the 2003 competition involved identifying mentions of named entities in English and German news articles from the late 1990s. The corpus for this 2003 competition is one of the most widely-used benchmarks for the performance of named entity recognition models. Current state-of-the-art results on this corpus produce an F1 score (harmonic mean of precision and recall) of 0.93. The best F1 score in the original competition was 0.89.
For more information about this data set, we recommend reading the conference paper about the competition results, "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition".
Note that the data set is licensed for research use only. Be sure to adhere to the terms of the license when using this data set!
The developers of the CoNLL-2003 corpus defined a file format for the corpus, based on the file format used in the earlier Message Understanding Conference competition. This format is generally known as "CoNLL format" or "CoNLL-2003 format".
In the following cell, we use the facilities of Text Extensions for Pandas to download a copy of the CoNLL-2003 data set. Then we read the CoNLL-2003-format file containing the test fold of the corpus and translate the data into a collection of Pandas DataFrame objects, one DataFrame per document. Finally, we display the DataFrame for the first document of the test fold of the corpus.
# Download and cache the data set.
# NOTE: This data set is licensed for research use only. Be sure to adhere
# to the terms of the license when using this data set!
data_set_info = tp.io.conll.maybe_download_conll_data("outputs")
data_set_info
# Read gold standard data for the "test" fold of the corpus.
corpus_test_fold = tp.io.conll.conll_2003_to_dataframes(
    data_set_info["test"], column_names=["pos", "phrase", "ent"],
    iob_columns=[False, True, True])
# Pick some documents to use as examples
SHORT_DOC_NUM = 6
LONG_DOC_NUM = 0
# We use document 6 here because it's short.
corpus_test_fold[SHORT_DOC_NUM].head(11)
span | pos | phrase_iob | phrase_type | ent_iob | ent_type | sentence | line_num | |
---|---|---|---|---|---|---|---|---|
0 | [0, 10): '-DOCSTART-' | -X- | O | None | O | None | [0, 10): '-DOCSTART-' | 1871 |
1 | [11, 17): 'SOCCER' | NN | B | NP | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1873 |
2 | [17, 18): '-' | : | O | None | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1874 |
3 | [19, 26): 'ENGLISH' | NNP | B | NP | B | MISC | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1875 |
4 | [27, 31): 'F.A.' | NNP | I | NP | I | MISC | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1876 |
5 | [32, 35): 'CUP' | NNP | I | NP | I | MISC | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1877 |
6 | [36, 42): 'SECOND' | NNP | I | NP | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1878 |
7 | [43, 48): 'ROUND' | NNP | I | NP | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1879 |
8 | [49, 55): 'RESULT' | NNP | I | NP | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1880 |
9 | [55, 56): '.' | . | O | None | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1881 |
10 | [57, 63): 'LONDON' | NNP | B | NP | B | LOC | [57, 74): 'LONDON 1996-12-06' | 1883 |
The output of the previous cell corresponds to the first 10 lines of one document in the test
fold of the corpus in its original format. Here's what the original data looks like:
-DOCSTART- -X- -X- O
SOCCER NN I-NP O
- : O O
ENGLISH NNP I-NP I-MISC
F.A. NNP I-NP I-MISC
CUP NNP I-NP I-MISC
SECOND NNP I-NP O
ROUND NNP I-NP O
RESULT NNP I-NP O
. . O O
Each line represents a single token of the file. The first token of each document is a special token, `-DOCSTART-`. Each token is labeled with multiple attributes.
The function `tp.io.conll.conll_2003_to_dataframes()` returns a list of DataFrames, one DataFrame per document. The DataFrame above contains the following columns:

* `span`: The span of the token within a reconstruction of the original document text, with begin and end offsets measured in characters. This column is stored using Text Extensions for Pandas' `SpanArray` extension type.
* `pos`: Part of speech information for the token, drawn from the second field of each line in the original file. Note that CoNLL-2003 format does not specify the names of metadata fields; this column has the name `pos` because we specified that name in the `column_names` argument to `conll_2003_to_dataframes()`.
* `phrase_iob` and `phrase_type`: Noun/verb phrase information for the token, drawn from the third field of the original file, in Inside-Outside-Beginning-2 (IOB2) format. The names for these columns come from the `column_names` argument we passed to `conll_2003_to_dataframes()`.
* `ent_iob` and `ent_type`: Information about named entity mentions at this token offset, in IOB2 format. The names for these columns also come from the `column_names` argument.
* `sentence`: The span of the sentence containing this token in the reconstructed document text.
* `line_num`: The line of the original input file that contains this token.

Note that CoNLL-2003 format uses IOB1 tags for the metadata fields that we call "ent" and "phrase". The function `conll_2003_to_dataframes()` converts these IOB1 tags to the IOB2 format for ease of consumption. In IOB2 format, every entity starts with a "begin" tag, so your code can determine whether a token is the first token in an entity without needing to inspect the previous token. See the Wikipedia entry for IOB tagging for more information.
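The IOB1-to-IOB2 conversion can be illustrated with a small standalone sketch (this is illustrative code, not the library's implementation):

```python
def iob1_to_iob2(tags):
    """Convert a list of IOB1 tags (e.g. "I-MISC") to IOB2.

    In IOB1, "B-" appears only when an entity directly follows another
    entity of the same type; in IOB2, every entity starts with "B-".
    """
    result = []
    prev = "O"
    for tag in tags:
        # Promote "I-" to "B-" when the token starts a new entity.
        if tag.startswith("I-") and (prev == "O" or prev[2:] != tag[2:]):
            tag = "B-" + tag[2:]
        result.append(tag)
        prev = tag
    return result

# The entity tags from the first 10 lines of the raw file shown above:
print(iob1_to_iob2(["O", "O", "O", "I-MISC", "I-MISC", "I-MISC", "O", "O", "O", "O"]))
# ['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'O', 'O', 'O']
```

The raw tags `I-MISC, I-MISC, I-MISC` for "ENGLISH F.A. CUP" become `B-MISC, I-MISC, I-MISC`, matching the `ent_iob` column in the DataFrame above.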
We don't need the fields `pos`, `phrase_iob`, and `phrase_type` for the remainder of this notebook, so let's drop them:
# Drop unneeded metadata columns
if "pos" in corpus_test_fold[0].columns:
    corpus_test_fold = [
        df.drop(columns=["pos", "phrase_iob", "phrase_type"])
        for df in corpus_test_fold
    ]
corpus_test_fold[SHORT_DOC_NUM].head(9)
span | ent_iob | ent_type | sentence | line_num | |
---|---|---|---|---|---|
0 | [0, 10): '-DOCSTART-' | O | None | [0, 10): '-DOCSTART-' | 1871 |
1 | [11, 17): 'SOCCER' | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1873 |
2 | [17, 18): '-' | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1874 |
3 | [19, 26): 'ENGLISH' | B | MISC | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1875 |
4 | [27, 31): 'F.A.' | I | MISC | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1876 |
5 | [32, 35): 'CUP' | I | MISC | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1877 |
6 | [36, 42): 'SECOND' | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1878 |
7 | [43, 48): 'ROUND' | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1879 |
8 | [49, 55): 'RESULT' | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... | 1880 |
In this example, we will use the outputs of the "bender" team in the original competition as our model outputs. A copy of these outputs is available in this repository under `resources/conll_03/ner/results/bender`.

These outputs are also in CoNLL-2003 format, but they do not contain any information about the tokens. For example, here's what the first 10 lines of the example document we've been using in the previous cells look like in the model outputs:
O
O
O
I-MISC
I-MISC
I-MISC
O
O
O
O
Text Extensions for Pandas includes a function `conll_2003_output_to_dataframes()` that will read this format of model output and merge the tags with the full token information in the original corpus, provided that you have read the original corpus in with `conll_2003_to_dataframes()`. The cell that follows uses this function to read the output of the "bender" team, using the `corpus_test_fold` list of DataFrames that we constructed a few cells back.
# Read the outputs of the "bender" team in the original competition.
bender_output = tp.io.conll.conll_2003_output_to_dataframes(
    corpus_test_fold, "../resources/conll_03/ner/results/bender/eng.testb")
bender_output[SHORT_DOC_NUM].head(10)
span | ent_iob | ent_type | sentence | |
---|---|---|---|---|
0 | [0, 10): '-DOCSTART-' | O | None | [0, 10): '-DOCSTART-' |
1 | [11, 17): 'SOCCER' | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... |
2 | [17, 18): '-' | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... |
3 | [19, 26): 'ENGLISH' | B | MISC | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... |
4 | [27, 31): 'F.A.' | I | MISC | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... |
5 | [32, 35): 'CUP' | I | MISC | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... |
6 | [36, 42): 'SECOND' | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... |
7 | [43, 48): 'ROUND' | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... |
8 | [49, 55): 'RESULT' | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... |
9 | [55, 56): '.' | O | None | [11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU... |
The data we've looked at so far has been in IOB2 format. Each row of our DataFrame represents a token, and each token is tagged with an entity type (`ent_type`) and an IOB tag (`ent_iob`). The first token of each named entity mention is tagged `B`, while subsequent tokens are tagged `I`. Tokens that aren't part of any named entity are tagged `O`.
IOB2 format is a convenient way to represent a corpus, but it is a less useful representation for analyzing the result quality of named entity recognition models. Most tokens in a typical NER corpus will be tagged `O`, so any measure of error rate in terms of tokens will under-emphasize the tokens that are part of entities. Token-level error rate also implicitly assigns higher weight to named entity mentions that consist of multiple tokens, further unbalancing error metrics. And most crucially, a naive comparison of IOB tags can mark an incorrect answer as correct. Consider a case where the correct sequence of labels is `B, B, I` but the model has output `B, I, I`; in this case, the last two tokens of the model output are both incorrect (the model has assigned them to the same entity as the first token), but a naive token-level comparison will consider the last token to be correct.
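This failure mode can be made concrete with a short standalone sketch (illustrative code, not part of the library) that compares the two tag sequences at the token level and at the entity level:

```python
gold  = ["B", "B", "I"]   # two entities: token 0, and tokens 1-2
model = ["B", "I", "I"]   # one entity spanning tokens 0-2

# A naive token-level comparison marks the last token as "correct"...
print([g == m for g, m in zip(gold, model)])  # [True, False, True]

def tags_to_entities(tags):
    """Group IOB2 tags into entities, as (begin, end) token ranges."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        if tag != "I" and start is not None:
            entities.append((start, i))
            start = None
        if tag == "B":
            start = i
    if start is not None:
        entities.append((start, len(tags)))
    return entities

# ...but at the entity level, the model found neither gold entity.
print(tags_to_entities(gold))   # [(0, 1), (1, 3)]
print(tags_to_entities(model))  # [(0, 3)]
```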
The CoNLL 2003 competition used the number of errors in extracting entire entity mentions to measure the result quality of the entries. We will use the same metric in this notebook. To compute entity-level errors, we convert the IOB-tagged tokens into pairs of <entity span, entity type>.

Text Extensions for Pandas includes a function `iob_to_spans()` that handles this conversion for you. In the next cell, we use `iob_to_spans()` to convert both the corpus and our example model's output to DataFrames of entity span and type information. Then we display the <entity span, entity type> pairs for the "bender" team's output on our example document.
# Convert from IOB2-tagged tokens to <span, entity type> pairs.
# Again, one DataFrame per document.
corpus_spans = [tp.io.conll.iob_to_spans(df) for df in corpus_test_fold]
bender_spans = [tp.io.conll.iob_to_spans(df) for df in bender_output]
bender_spans[SHORT_DOC_NUM].head(10)
span | ent_type | |
---|---|---|
0 | [19, 35): 'ENGLISH F.A. CUP' | MISC |
1 | [57, 63): 'LONDON' | LOC |
2 | [88, 110): 'English F.A. Challenge' | MISC |
3 | [111, 114): 'Cup' | MISC |
4 | [145, 153): 'Plymouth' | ORG |
5 | [156, 162): 'Exeter' | ORG |
Each DataFrame in the list `bender_spans` contains two columns, `span` and `ent_type`.

The `span` column in the DataFrame has the data type (or "dtype", as Pandas and NumPy call them) `TokenSpanDtype`. `TokenSpanDtype` is one of the extension types from Text Extensions for Pandas. The string representation of a Pandas Series shows the dtype of the series ("TokenSpanDtype" in this case) on its last line:
bender_spans[SHORT_DOC_NUM]["span"]
0 [19, 35): 'ENGLISH F.A. CUP' 1 [57, 63): 'LONDON' 2 [88, 110): 'English F.A. Challenge' 3 [111, 114): 'Cup' 4 [145, 153): 'Plymouth' 5 [156, 162): 'Exeter' Name: span, dtype: TokenSpanDtype
Columns with a dtype of `TokenSpanDtype` are stored internally using the class `TokenSpanArray`, which is also part of Text Extensions for Pandas. `TokenSpanArray` is a subclass of `ExtensionArray`, the base class for custom 1-D array types in Pandas. Pandas stores extension arrays inside the associated Pandas `Series` object for the column. To obtain a reference to the extension array that backs a column, use the `array` property of `pandas.Series`:
print(bender_spans[SHORT_DOC_NUM]["span"].array)
<TokenSpanArray> [ [19, 35): 'ENGLISH F.A. CUP', [57, 63): 'LONDON', [88, 110): 'English F.A. Challenge', [111, 114): 'Cup', [145, 153): 'Plymouth', [156, 162): 'Exeter'] Length: 6, dtype: TokenSpanDtype
Note how the previous cell passed the `TokenSpanArray` to `print()`. The `TokenSpanArray` class can also render itself using Jupyter Notebook callbacks. To see the HTML representation of the `TokenSpanArray`, pass the array object to Jupyter's `display()` function; or make that object the last line of the cell, as in the following example:
bender_spans[SHORT_DOC_NUM]["span"].array
-DOCSTART-
SOCCER-
ENGLISH F.A. CUP
SECOND ROUND RESULT.
LONDON
1996-12-06
Result of an
English F.A. Challenge
Cup
second round match on Friday:
Plymouth
4
Exeter
1
The text on the right side of the HTML shows the spans in the context of the reconstructed document text.
The table on the left shows detailed information about the spans.
Internally, instances of `TokenSpanArray` consist of arrays of begin and end token offsets, plus a reference to the tokens. So only the `begin_token` and `end_token` values in the above table are actually stored inside the array. The other attributes in the table, `begin`, `end`, and `covered_text`, are computed on demand.
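As an illustration of this design (the names and layout here are ours, not the actual internals of Text Extensions for Pandas), character-level attributes can be derived lazily from token offsets plus a token table:

```python
# Hypothetical token table: (begin, end) character offsets for each token
# of the reconstructed document text.
text = "-DOCSTART- SOCCER-"
tokens = [(0, 10), (11, 17), (17, 18)]

def span_attributes(begin_token, end_token):
    """Compute begin, end, and covered_text on demand from token offsets."""
    begin = tokens[begin_token][0]
    end = tokens[end_token - 1][1]
    return begin, end, text[begin:end]

print(span_attributes(1, 3))  # (11, 18, 'SOCCER-')
```

Storing only token offsets keeps the array compact; the begin/end character offsets and covered text are recomputed whenever they are needed.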
You can obtain a reference to the underlying tokens for a `TokenSpanArray` via the properties `document_tokens` and `tokens`. `document_tokens` returns the single underlying set of tokens for arrays of spans that are all from the same document, while `tokens` returns the (potentially different) backing tokens for each span of an array that may cover spans from multiple documents.

In this case, each of our DataFrames spans a single document, so we can use the `document_tokens` property to fetch the tokens of the corresponding document. For example, here are the spans of the first 5 tokens:
bender_spans[SHORT_DOC_NUM]["span"].array.document_tokens[0:5]
-DOCSTART-
SOCCER
-
ENGLISH
F.A.
CUP SECOND ROUND RESULT.
LONDON 1996-12-06
Result of an English F.A. Challenge
Cup second round match on Friday:
Plymouth 4 Exeter 1
Note that the `SpanArray` object we displayed in the previous cell is the same object that backs the "span" column in the token information DataFrame `bender_output[SHORT_DOC_NUM]` from a few cells back:
bender_output[SHORT_DOC_NUM]["span"].array[0:5]
-DOCSTART-
SOCCER
-
ENGLISH
F.A.
CUP SECOND ROUND RESULT.
LONDON 1996-12-06
Result of an English F.A. Challenge
Cup second round match on Friday:
Plymouth 4 Exeter 1
Now that we have converted our corpus and model outputs to DataFrames of <span, entity type> pairs, we can use Pandas to compare the model's output with the corpus labels.

There are several ways to compare two sets of <span, entity type> pairs. You may want to require an exact match of two spans, or you may want to consider partial matches. You might want to require exact matches of the entity types, or you may want to give partial credit to a model output that correctly identifies an entity's span but assigns the wrong entity type to that span.
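These notions of span equivalence can be sketched as simple predicates over (begin, end) character offsets (illustrative code, not the library's API):

```python
def exact_match(a, b):
    """Spans have identical begin and end offsets."""
    return a == b

def contains(a, b):
    """Span a equals or contains span b."""
    return a[0] <= b[0] and b[1] <= a[1]

def overlaps(a, b):
    """Spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

gold, model = (482, 495), (487, 495)   # 'Igor Shkvyrin' vs. 'Shkvyrin'
print(exact_match(gold, model), contains(gold, model), overlaps(gold, model))
# False True True
```

Each predicate is strictly looser than the one before it: exact matches are contained matches, and contained matches overlap.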
Which of these types of comparison is the "right" way to compare depends on your application. In the cells that follow, we'll show how to use Text Extensions for Pandas to do three different types of span comparison. To simplify the code that follows, we'll start by filtering down to just the Person (`PER`) entities in both output sets. That way we can compare just spans, ignoring the `ent_type` column for now.
# Let's look at just PER annotations
corpus_person = [df[df["ent_type"] == "PER"] for df in corpus_spans]
bender_person = [df[df["ent_type"] == "PER"] for df in bender_spans]
# Also, switch to a longer example document to make the comparisons that
# follow more interesting.
corpus_person[LONG_DOC_NUM]
span | ent_type | |
---|---|---|
1 | [40, 45): 'CHINA' | PER |
2 | [66, 77): 'Nadim Ladki' | PER |
12 | [482, 495): 'Igor Shkvyrin' | PER |
14 | [618, 632): 'Oleg Shatskiku' | PER |
21 | [1079, 1092): 'Takuya Takagi' | PER |
22 | [1148, 1168): 'Hiroshige Yanagimoto' | PER |
24 | [1216, 1227): 'Salem Bitar' | PER |
26 | [1360, 1372): 'Hassan Abbas' | PER |
27 | [1489, 1494): 'Bitar' | PER |
28 | [1503, 1517): 'Nader Jokhadar' | PER |
33 | [1702, 1707): 'Bitar' | PER |
35 | [1761, 1769): 'Shu Kamo' | PER |
Our extension types for spans consider two spans to be equal if they have the same target text and the same begin and end offsets. So you can do exact match comparison between two DataFrames of span data using the standard Pandas `merge()` function.
corpus_person[LONG_DOC_NUM].merge(bender_person[LONG_DOC_NUM])
span | ent_type | |
---|---|---|
0 | [66, 77): 'Nadim Ladki' | PER |
1 | [618, 632): 'Oleg Shatskiku' | PER |
2 | [1079, 1092): 'Takuya Takagi' | PER |
3 | [1148, 1168): 'Hiroshige Yanagimoto' | PER |
4 | [1216, 1227): 'Salem Bitar' | PER |
5 | [1360, 1372): 'Hassan Abbas' | PER |
6 | [1503, 1517): 'Nader Jokhadar' | PER |
7 | [1761, 1769): 'Shu Kamo' | PER |
Text Extensions for Pandas also includes a function `contain_join()` for finding pairs of spans where one span in the pair either equals or contains the other. We can use `contain_join()` to compare two DataFrame columns of spans and find all pairs that satisfy this looser notion of span equivalence:
# ...give credit for partial matches contained entirely within a true match:
tp.spanner.contain_join(corpus_person[LONG_DOC_NUM]["span"],
                        bender_person[LONG_DOC_NUM]["span"],
                        "corpus", "bender")
corpus | bender | |
---|---|---|
0 | [66, 77): 'Nadim Ladki' | [66, 77): 'Nadim Ladki' |
1 | [482, 495): 'Igor Shkvyrin' | [487, 495): 'Shkvyrin' |
2 | [618, 632): 'Oleg Shatskiku' | [618, 632): 'Oleg Shatskiku' |
3 | [1079, 1092): 'Takuya Takagi' | [1079, 1092): 'Takuya Takagi' |
4 | [1148, 1168): 'Hiroshige Yanagimoto' | [1148, 1168): 'Hiroshige Yanagimoto' |
5 | [1216, 1227): 'Salem Bitar' | [1216, 1227): 'Salem Bitar' |
6 | [1360, 1372): 'Hassan Abbas' | [1360, 1372): 'Hassan Abbas' |
7 | [1503, 1517): 'Nader Jokhadar' | [1503, 1517): 'Nader Jokhadar' |
8 | [1761, 1769): 'Shu Kamo' | [1761, 1769): 'Shu Kamo' |
Compared with the output of the previous cell, we now have 9 matches instead of 8. The "bender" team's model identified the span [487, 495): 'Shkvyrin' as a Person entity, while the corpus includes the longer span [482, 495): 'Igor Shkvyrin'.
Text Extensions for Pandas also has a second span comparison function, `overlap_join()`, that finds pairs of spans that overlap. You can use this function to look for equivalent spans between two sets of spans, using this even looser notion of span equivalence:
# ...give credit for matches that overlap at all with a true match:
tp.spanner.overlap_join(corpus_person[LONG_DOC_NUM]["span"],
                        bender_person[LONG_DOC_NUM]["span"],
                        "gold", "extracted")
gold | extracted | |
---|---|---|
0 | [66, 77): 'Nadim Ladki' | [66, 77): 'Nadim Ladki' |
1 | [482, 495): 'Igor Shkvyrin' | [487, 495): 'Shkvyrin' |
2 | [618, 632): 'Oleg Shatskiku' | [618, 632): 'Oleg Shatskiku' |
3 | [1079, 1092): 'Takuya Takagi' | [1079, 1092): 'Takuya Takagi' |
4 | [1148, 1168): 'Hiroshige Yanagimoto' | [1148, 1168): 'Hiroshige Yanagimoto' |
5 | [1216, 1227): 'Salem Bitar' | [1216, 1227): 'Salem Bitar' |
6 | [1360, 1372): 'Hassan Abbas' | [1360, 1372): 'Hassan Abbas' |
7 | [1503, 1517): 'Nader Jokhadar' | [1503, 1517): 'Nader Jokhadar' |
8 | [1761, 1769): 'Shu Kamo' | [1761, 1769): 'Shu Kamo' |
In this example document, the "overlap" type of span equivalence produces the same result as "containment" span equivalence.
Most benchmark results on the CoNLL-2003 dataset use the accuracy metric from the original competition: F1 score (the harmonic mean of precision and recall) over entity mentions, where an entity mention is considered "correct" if it corresponds exactly to an entity mention in the corpus labels. We can compute this metric using Text Extensions for Pandas' extension types.
We start by using Pandas' `merge()` function to find the number of matches between pairs of DataFrames of <span, label> pairs. The number of <span, label> pairs that exactly match between each pair of DataFrames gives us the number of true positives in each document:
num_true_positives = [len(corpus_person[i].merge(bender_person[i]).index)
                      for i in range(len(corpus_person))]
num_true_positives[0:5]
[8, 31, 32, 20, 5]
The remaining inputs we need are the number of entity mentions the model extracted in each document and the number of mentions that the corpus contains in each document. We can obtain these figures directly from the lengths of our per-document DataFrames:
num_extracted = [len(df.index) for df in bender_person]
num_entities = [len(df.index) for df in corpus_person]
num_extracted[0:5], num_entities[0:5]
([9, 31, 33, 20, 5], [12, 31, 40, 20, 5])
Then we combine these three lists of counts into a single DataFrame:
stats_by_doc = pd.DataFrame({
    "doc_num": np.arange(len(corpus_person)),
    "num_true_positives": num_true_positives,
    "num_extracted": num_extracted,
    "num_entities": num_entities
})
stats_by_doc
doc_num | num_true_positives | num_extracted | num_entities | |
---|---|---|---|---|
0 | 0 | 8 | 9 | 12 |
1 | 1 | 31 | 31 | 31 |
2 | 2 | 32 | 33 | 40 |
3 | 3 | 20 | 20 | 20 |
4 | 4 | 5 | 5 | 5 |
... | ... | ... | ... | ... |
226 | 226 | 2 | 2 | 2 |
227 | 227 | 4 | 4 | 6 |
228 | 228 | 4 | 5 | 4 |
229 | 229 | 0 | 0 | 0 |
230 | 230 | 5 | 5 | 9 |
231 rows × 4 columns
The standard CoNLL-2003 accuracy metric is F1 score over the entire "test" fold. We can compute this statistic by aggregating the DataFrame:
total_true_positives = stats_by_doc["num_true_positives"].sum()
total_entities = stats_by_doc["num_entities"].sum()
total_extracted = stats_by_doc["num_extracted"].sum()
precision = total_true_positives / total_extracted
recall = total_true_positives / total_entities
F1 = 2.0 * (precision * recall) / (precision + recall)
print(
f"""Number of correct answers: {total_true_positives}
Number of entities identified: {total_extracted}
Actual number of entities: {total_entities}
Precision: {precision:1.4f}
Recall: {recall:1.4f}
F1: {F1:1.4f}""")
Number of correct answers: 1421 Number of entities identified: 1583 Actual number of entities: 1617 Precision: 0.8977 Recall: 0.8788 F1: 0.8881
The above numbers match up with the official results (last line below):
!head -14 ../resources/conll_03/ner/results/bender/conlleval.out
eng.testa processed 51578 tokens with 5942 phrases; found: 5846 phrases; correct: 5280. accuracy: 98.07%; precision: 90.32%; recall: 88.86%; FB1: 89.58 LOC: precision: 93.27%; recall: 93.58%; FB1: 93.42 MISC: precision: 88.51%; recall: 81.02%; FB1: 84.60 ORG: precision: 84.67%; recall: 83.59%; FB1: 84.13 PER: precision: 92.26%; recall: 91.91%; FB1: 92.09 eng.testb processed 46666 tokens with 5648 phrases; found: 5548 phrases; correct: 4698. accuracy: 96.80%; precision: 84.68%; recall: 83.18%; FB1: 83.92 LOC: precision: 86.44%; recall: 89.81%; FB1: 88.09 MISC: precision: 78.35%; recall: 73.22%; FB1: 75.70 ORG: precision: 80.27%; recall: 76.16%; FB1: 78.16 PER: precision: 89.77%; recall: 87.88%; FB1: 88.81
In addition to the standard corpus-level accuracy statistics, we can also compute precision, recall, and F1 score for each document by adding some additional columns to our DataFrame `stats_by_doc`:
stats_by_doc["precision"] = stats_by_doc["num_true_positives"] / stats_by_doc["num_extracted"]
stats_by_doc["recall"] = stats_by_doc["num_true_positives"] / stats_by_doc["num_entities"]
stats_by_doc["F1"] = 2.0 * (stats_by_doc["precision"] * stats_by_doc["recall"]) / (stats_by_doc["precision"] + stats_by_doc["recall"])
stats_by_doc
doc_num | num_true_positives | num_extracted | num_entities | precision | recall | F1 | |
---|---|---|---|---|---|---|---|
0 | 0 | 8 | 9 | 12 | 0.888889 | 0.666667 | 0.761905 |
1 | 1 | 31 | 31 | 31 | 1.000000 | 1.000000 | 1.000000 |
2 | 2 | 32 | 33 | 40 | 0.969697 | 0.800000 | 0.876712 |
3 | 3 | 20 | 20 | 20 | 1.000000 | 1.000000 | 1.000000 |
4 | 4 | 5 | 5 | 5 | 1.000000 | 1.000000 | 1.000000 |
... | ... | ... | ... | ... | ... | ... | ... |
226 | 226 | 2 | 2 | 2 | 1.000000 | 1.000000 | 1.000000 |
227 | 227 | 4 | 4 | 6 | 1.000000 | 0.666667 | 0.800000 |
228 | 228 | 4 | 5 | 4 | 0.800000 | 1.000000 | 0.888889 |
229 | 229 | 0 | 0 | 0 | NaN | NaN | NaN |
230 | 230 | 5 | 5 | 9 | 1.000000 | 0.555556 | 0.714286 |
231 rows × 7 columns
We can use these per-document statistics to find documents where this model performed poorly. Here, we use the Pandas `sort_values()` function to identify the ten most problematic documents by F1 score:
stats_by_doc.sort_values("F1").head(10)
doc_num | num_true_positives | num_extracted | num_entities | precision | recall | F1 | |
---|---|---|---|---|---|---|---|
75 | 75 | 2 | 21 | 2 | 0.095238 | 1.000000 | 0.173913 |
7 | 7 | 1 | 2 | 4 | 0.500000 | 0.250000 | 0.333333 |
8 | 8 | 1 | 1 | 5 | 1.000000 | 0.200000 | 0.333333 |
138 | 138 | 2 | 2 | 10 | 1.000000 | 0.200000 | 0.333333 |
161 | 161 | 1 | 3 | 2 | 0.333333 | 0.500000 | 0.400000 |
104 | 104 | 2 | 3 | 6 | 0.666667 | 0.333333 | 0.444444 |
185 | 185 | 1 | 2 | 2 | 0.500000 | 0.500000 | 0.500000 |
131 | 131 | 1 | 2 | 2 | 0.500000 | 0.500000 | 0.500000 |
85 | 85 | 1 | 1 | 3 | 1.000000 | 0.333333 | 0.500000 |
43 | 43 | 7 | 8 | 16 | 0.875000 | 0.437500 | 0.583333 |
What's going on with document 75?
from IPython import display
display.display(display.HTML("<h3>PER entities in corpus for document 75:</h3>"))
display.display(corpus_person[75])
display.display(display.HTML("<p><h3>PER entities in model outputs for document 75:</h3>"))
display.display(bender_person[75])
span | ent_type | |
---|---|---|
2 | [53, 70): 'Brendan Intindola' | PER |
54 | [2242, 2252): 'Marc Cohen' | PER |
span | ent_type | |
---|---|---|
2 | [53, 70): 'Brendan Intindola' | PER |
6 | [177, 185): 'Santa Fe' | PER |
8 | [207, 215): 'Santa Fe' | PER |
10 | [264, 272): 'Santa Fe' | PER |
14 | [455, 463): 'Santa Fe' | PER |
16 | [694, 702): 'Santa Fe' | PER |
18 | [828, 836): 'Santa Fe' | PER |
25 | [1348, 1356): 'Santa Fe' | PER |
28 | [1471, 1475): 'Dome' | PER |
30 | [1578, 1586): 'Santa Fe' | PER |
34 | [1750, 1759): 'Homestake' | PER |
40 | [1944, 1952): 'Santa Fe' | PER |
43 | [2080, 2088): 'Santa Fe' | PER |
54 | [2242, 2252): 'Marc Cohen' | PER |
57 | [2355, 2363): 'Santa Fe' | PER |
59 | [2654, 2662): 'Santa Fe' | PER |
61 | [2890, 2898): 'Santa Fe' | PER |
62 | [2962, 2970): 'Santa Fe' | PER |
64 | [3020, 3028): 'Newmonth' | PER |
65 | [3057, 3065): 'Santa Fe' | PER |
66 | [3143, 3151): 'Santa Fe' | PER |
It looks like this model had trouble with "Santa Fe". Let's look at all instances of that string in this document. We can use Text Extensions for Pandas' regular expression support to create spans for every mention of "Santa Fe" in document 75:
import regex
doc_75_tokens = corpus_test_fold[75]["span"]
# Find all matches of "Santa Fe" that start and end on a token boundary
santa_fe_mentions = tp.spanner.extract_regex_tok(
    doc_75_tokens, regex.compile(r'[Ss]anta\s+[Ff]e'),
    min_len=2, max_len=2)
santa_fe_mentions
match | |
---|---|
0 | [36, 44): 'Santa Fe' |
1 | [177, 185): 'Santa Fe' |
2 | [207, 215): 'Santa Fe' |
3 | [264, 272): 'Santa Fe' |
4 | [455, 463): 'Santa Fe' |
5 | [694, 702): 'Santa Fe' |
6 | [828, 836): 'Santa Fe' |
7 | [980, 988): 'Santa Fe' |
8 | [1348, 1356): 'Santa Fe' |
9 | [1578, 1586): 'Santa Fe' |
10 | [1944, 1952): 'Santa Fe' |
11 | [2080, 2088): 'Santa Fe' |
12 | [2355, 2363): 'Santa Fe' |
13 | [2654, 2662): 'Santa Fe' |
14 | [2890, 2898): 'Santa Fe' |
15 | [2962, 2970): 'Santa Fe' |
16 | [3057, 3065): 'Santa Fe' |
17 | [3143, 3151): 'Santa Fe' |
18 | [3434, 3442): 'Santa Fe' |
Now let's line up those regex matches with the model outputs and corpus labels:
santa_fe_bender = pd.merge(santa_fe_mentions, bender_person[75],
                           left_on="match", right_on="span")
santa_fe_bender
match | span | ent_type | |
---|---|---|---|
0 | [177, 185): 'Santa Fe' | [177, 185): 'Santa Fe' | PER |
1 | [207, 215): 'Santa Fe' | [207, 215): 'Santa Fe' | PER |
2 | [264, 272): 'Santa Fe' | [264, 272): 'Santa Fe' | PER |
3 | [455, 463): 'Santa Fe' | [455, 463): 'Santa Fe' | PER |
4 | [694, 702): 'Santa Fe' | [694, 702): 'Santa Fe' | PER |
5 | [828, 836): 'Santa Fe' | [828, 836): 'Santa Fe' | PER |
6 | [1348, 1356): 'Santa Fe' | [1348, 1356): 'Santa Fe' | PER |
7 | [1578, 1586): 'Santa Fe' | [1578, 1586): 'Santa Fe' | PER |
8 | [1944, 1952): 'Santa Fe' | [1944, 1952): 'Santa Fe' | PER |
9 | [2080, 2088): 'Santa Fe' | [2080, 2088): 'Santa Fe' | PER |
10 | [2355, 2363): 'Santa Fe' | [2355, 2363): 'Santa Fe' | PER |
11 | [2654, 2662): 'Santa Fe' | [2654, 2662): 'Santa Fe' | PER |
12 | [2890, 2898): 'Santa Fe' | [2890, 2898): 'Santa Fe' | PER |
13 | [2962, 2970): 'Santa Fe' | [2962, 2970): 'Santa Fe' | PER |
14 | [3057, 3065): 'Santa Fe' | [3057, 3065): 'Santa Fe' | PER |
15 | [3143, 3151): 'Santa Fe' | [3143, 3151): 'Santa Fe' | PER |
santa_fe_corpus = pd.merge(santa_fe_mentions, corpus_spans[75],
                           left_on="match", right_on="span")
santa_fe_corpus
match | span | ent_type | |
---|---|---|---|
0 | [36, 44): 'Santa Fe' | [36, 44): 'Santa Fe' | LOC |
1 | [207, 215): 'Santa Fe' | [207, 215): 'Santa Fe' | ORG |
2 | [264, 272): 'Santa Fe' | [264, 272): 'Santa Fe' | ORG |
3 | [455, 463): 'Santa Fe' | [455, 463): 'Santa Fe' | ORG |
4 | [694, 702): 'Santa Fe' | [694, 702): 'Santa Fe' | ORG |
5 | [828, 836): 'Santa Fe' | [828, 836): 'Santa Fe' | ORG |
6 | [980, 988): 'Santa Fe' | [980, 988): 'Santa Fe' | ORG |
7 | [1348, 1356): 'Santa Fe' | [1348, 1356): 'Santa Fe' | ORG |
8 | [1578, 1586): 'Santa Fe' | [1578, 1586): 'Santa Fe' | ORG |
9 | [1944, 1952): 'Santa Fe' | [1944, 1952): 'Santa Fe' | ORG |
10 | [2080, 2088): 'Santa Fe' | [2080, 2088): 'Santa Fe' | ORG |
11 | [2355, 2363): 'Santa Fe' | [2355, 2363): 'Santa Fe' | ORG |
12 | [2654, 2662): 'Santa Fe' | [2654, 2662): 'Santa Fe' | ORG |
13 | [2890, 2898): 'Santa Fe' | [2890, 2898): 'Santa Fe' | LOC |
14 | [2962, 2970): 'Santa Fe' | [2962, 2970): 'Santa Fe' | ORG |
15 | [3057, 3065): 'Santa Fe' | [3057, 3065): 'Santa Fe' | ORG |
16 | [3143, 3151): 'Santa Fe' | [3143, 3151): 'Santa Fe' | ORG |
17 | [3434, 3442): 'Santa Fe' | [3434, 3442): 'Santa Fe' | ORG |
Next, we'll compare the above two sets of spans to each other to create a picture of how the corpus and the "bender" team's output treated each of our regular expression matches.
cols = ["span", "ent_type"]
matching_spans = pd.merge(santa_fe_bender[cols], santa_fe_corpus[cols],
                          on="span", how="left", suffixes=["_bender", "_corpus"])
matching_spans
span | ent_type_bender | ent_type_corpus | |
---|---|---|---|
0 | [177, 185): 'Santa Fe' | PER | NaN |
1 | [207, 215): 'Santa Fe' | PER | ORG |
2 | [264, 272): 'Santa Fe' | PER | ORG |
3 | [455, 463): 'Santa Fe' | PER | ORG |
4 | [694, 702): 'Santa Fe' | PER | ORG |
5 | [828, 836): 'Santa Fe' | PER | ORG |
6 | [1348, 1356): 'Santa Fe' | PER | ORG |
7 | [1578, 1586): 'Santa Fe' | PER | ORG |
8 | [1944, 1952): 'Santa Fe' | PER | ORG |
9 | [2080, 2088): 'Santa Fe' | PER | ORG |
10 | [2355, 2363): 'Santa Fe' | PER | ORG |
11 | [2654, 2662): 'Santa Fe' | PER | ORG |
12 | [2890, 2898): 'Santa Fe' | PER | LOC |
13 | [2962, 2970): 'Santa Fe' | PER | ORG |
14 | [3057, 3065): 'Santa Fe' | PER | ORG |
15 | [3143, 3151): 'Santa Fe' | PER | ORG |
Out of the 16 instances of "Santa Fe" that the "bender" model tagged as `PER`, 15 are tagged in the corpus as `ORG` or `LOC`. What's going on with the remaining one? Let's look at the context of that span in the reconstructed document:
matching_spans.iloc[0]["span"].context()
'... the most likely white knight buyer for [Santa Fe] Pacific Gold Corp if Santa Fe rejects u...'
It looks like that span is part of a larger entity, "Santa Fe Pacific Gold Corp". We can verify this by using `tp.spanner.overlap_join()` to find the span in the corpus labels that overlaps with the regular expression match:
tp.spanner.overlap_join(
    corpus_spans[75]["span"], matching_spans.iloc[[0]]["span"],
    "corpus", "regex_match")
corpus | regex_match | |
---|---|---|
0 | [177, 203): 'Santa Fe Pacific Gold Corp' | [177, 185): 'Santa Fe' |