Text_Extensions_for_Pandas_Overview.ipynb:

Overview of the basic functionality and usage of Text Extensions for Pandas.

Text Extensions for Pandas

Text Extensions for Pandas is a library that provides natural language processing support for Pandas DataFrames. It includes Pandas extension arrays that help with natural language processing, and it integrates with other popular NLP libraries to provide a workflow centered on the easy-to-use and powerful Pandas DataFrame.

This notebook gives an overview of the basic functionality of Text Extensions for Pandas and serves as a jumping-off point to more in-depth examples of specific functionality, including notebooks that use Text Extensions for Pandas for data analysis, NLP, and model training.

The API reference can be found at https://text-extensions-for-pandas.readthedocs.io/en/latest/

Environment Setup

This notebook requires a Python 3.6 or later environment with NumPy and Pandas.

The notebook also requires the text_extensions_for_pandas library. You can satisfy this dependency in two ways:

  • Run pip install text_extensions_for_pandas before running this notebook. This command adds the library to your Python environment.
  • Run this notebook out of your local copy of the Text Extensions for Pandas project's source tree. In this case, the notebook will use the version of Text Extensions for Pandas in your local source tree if the package is not installed in your Python environment.
In [1]:
import os
import regex
import sys
import numpy as np
import pandas as pd

# And of course we need the text_extensions_for_pandas library itself.
try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    # If we're running from within the project source tree and the parent Python
    # environment doesn't have the text_extensions_for_pandas package, use the
    # version in the local source tree.
    if not os.getcwd().endswith("notebooks"):
        raise e
    if ".." not in sys.path:
        sys.path.insert(0, "..")
    import text_extensions_for_pandas as tp

Pandas Extension Arrays

Text Extensions for Pandas provides several Pandas extension arrays on which much of its functionality is built. This section introduces these extension arrays and shows their basic usage.

SpanArray

A SpanArray represents a column of character-based spans over a single target text. It is backed by two child arrays of integers holding the begin and end offsets of each span within the target text. Spans can use any offsets within the target text and can overlap with each other. A SpanArray can represent the tokenized result of a text efficiently because the tokens are never copied; only their offsets are stored. Equality of spans is determined by the target text and the offset values, so two spans with the same covered text but different offsets remain distinct.

The SpanArray is a Pandas extension type, so it can be wrapped in a pandas.Series and included in a pandas.DataFrame to make use of standard Pandas functionality. The values of a SpanArray are also designed to render nicely as HTML, for easy display of the span offsets, covered text, and highlighted target text.

We will show some basic operations of the SpanArray by tokenizing a small example piece of text.

In [2]:
# Sample text input.
text = """\
In AD 932, King Arthur and his squire, Patsy, travel throughout Britain \
searching for men to join the Knights of the Round Table. Along the way, \
he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad \
the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir \
Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.\
"""
In [3]:
# Define a crude tokenizer that splits on spaces, for example use only.
def tokenize_with_offsets(text):
    """Return begin and end offsets of tokens from the given `text`"""
    splits = text.split(" ")
    # Each token begins one character after the raw end of the previous one.
    begins = np.cumsum([0] + [len(s) + 1 for s in splits[:-1]])
    # Exclude commas and periods adjoining a token from its span.
    ends = begins + [len(s.strip(",.")) for s in splits]
    return begins, ends
In [4]:
# Tokenize the text to get begin, end offsets and construct a `SpanArray`.
begins, ends = tokenize_with_offsets(text)
tokens = tp.SpanArray(text, begins, ends)

# The array nicely renders in HTML to show offsets, text of the span,
# and highlighted target text.
tokens
Out[4]:
begin end context
0 0 2 In
1 3 5 AD
2 6 9 932
3 11 15 King
4 16 22 Arthur
5 23 26 and
6 27 30 his
7 31 37 squire
8 39 44 Patsy
9 46 52 travel
10 53 63 throughout
11 64 71 Britain
12 72 81 searching
13 82 85 for
14 86 89 men
15 90 92 to
16 93 97 join
17 98 101 the
18 102 109 Knights
19 110 112 of
20 113 116 the
21 117 122 Round
22 123 128 Table
23 130 135 Along
24 136 139 the
25 140 143 way
26 145 147 he
27 148 156 recruits
28 157 160 Sir
29 161 169 Bedevere
30 170 173 the
31 174 178 Wise
32 180 183 Sir
33 184 192 Lancelot
34 193 196 the
35 197 202 Brave
36 204 207 Sir
37 208 215 Galahad
38 216 219 the
39 220 224 Pure
40 226 229 Sir
41 230 235 Robin
42 236 239 the
43 240 274 Not-Quite-So-Brave-as-Sir-Lancelot
44 276 279 and
45 280 283 Sir
46 284 310 Not-Appearing-in-this-Film
47 312 317 along
48 318 322 with
49 323 328 their
50 329 336 squires
51 337 340 and
52 341 348 Robin's
53 349 360 troubadours

In AD 932 , King Arthur and his squire , Patsy , travel throughout Britain searching for men to join the Knights of the Round Table . Along the way , he recruits Sir Bedevere the Wise , Sir Lancelot the Brave , Sir Galahad the Pure , Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot , and Sir Not-Appearing-in-this-Film , along with their squires and Robin's troubadours .

In [5]:
# Indexing the array with an integer will produce a `Span`, which is a single
# element in the array.
tok = tokens[43]
tok
Out[5]:
[240, 274): 'Not-Quite-So-Brave-as-Sir-Lancelot'
In [6]:
# It can also be indexed with a slice, producing another `SpanArray`.
toks = tokens[40:44]
toks
Out[6]:
begin end context
0 226 229 Sir
1 230 235 Robin
2 236 239 the
3 240 274 Not-Quite-So-Brave-as-Sir-Lancelot

In AD 932, King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot , and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.

In [7]:
# Iterate over the array to get each `Span`.
toks = [span for span in tokens[40:44]]
toks
Out[7]:
[[226, 229): 'Sir',
 [230, 235): 'Robin',
 [236, 239): 'the',
 [240, 274): 'Not-Quite-So-Brave-as-Sir-Lancelot']
In [8]:
# Addition of `Span`s or `SpanArray`s is supported.
# The result is the minimum `Span` that covers both `Span`s.
result = toks[0] + toks[-1]
result
Out[8]:
[226, 274): 'Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot'
In [9]:
# You can check if one `Span` contains another.
result.contains(toks[1])
Out[9]:
True
In [10]:
# You can also check whether two `Span`s overlap.
a = toks[0] + toks[2]
b = toks[2] + toks[3]
a.overlaps(b)
Out[10]:
True
In [11]:
# Get two `Span`s to test equality.
sir = tokens[36]
other_sir = tokens[40]
sir, other_sir
Out[11]:
([204, 207): 'Sir', [226, 229): 'Sir')
In [12]:
# Equality is determined by text and offset values, not just text.
sir == other_sir, \
sir.covered_text == other_sir.covered_text
Out[12]:
(False, True)
In [13]:
# Only a `Span` from the same target text with matching offsets is equal.
sir == tp.Span(text, 204, 207)
Out[13]:
True

TokenSpanArray

A TokenSpanArray builds on SpanArray with the ability to define spans as indices into a SpanArray of tokens, rather than as character-based offsets. This makes it convenient for analysis at the token level. As with SpanArray, a single item of a TokenSpanArray is a TokenSpan. As an example, let's define a single TokenSpan using the target text from above.

In [14]:
# A single `TokenSpan` to cover "King Arthur" - the begin and end values are
# token indices, with the end exclusive, so this span covers tokens 3 and 4.
tp.TokenSpan(tokens, 3, 5)
Out[14]:
[11, 22): 'King Arthur'
In [15]:
# We can also make a `TokenSpanArray` from lists of begin and end offsets
# measured in tokens. Here we make spans for the names within the target text.
begin_tokens = [3, 8, 28, 32, 36, 40, 45, 52]
end_tokens =   [5, 9, 32, 36, 40, 44, 47, 53]
token_spans = tp.TokenSpanArray(tokens, begin_tokens, end_tokens)
token_spans
Out[15]:
begin end context
0 11 22 King Arthur
1 39 44 Patsy
2 157 178 Sir Bedevere the Wise
3 180 202 Sir Lancelot the Brave
4 204 224 Sir Galahad the Pure
5 226 274 Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot
6 280 310 Sir Not-Appearing-in-this-Film
7 341 348 Robin's

In AD 932, King Arthur and his squire, Patsy , travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise , Sir Lancelot the Brave , Sir Galahad the Pure , Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot , and Sir Not-Appearing-in-this-Film , along with their squires and Robin's troubadours.

In [16]:
# When all the spans in a `TokenSpanArray` come from the same document, you can access
# the tokens of that document via the `document_tokens` property:
token_spans.document_tokens[:5]
Out[16]:
begin end context
0 0 2 In
1 3 5 AD
2 6 9 932
3 11 15 King
4 16 22 Arthur

In AD 932 , King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.

In [17]:
# Both SpanArrays and TokenSpanArrays can contain spans from multiple documents.
tokens_2 = tp.SpanArray("Second document", [0, 7], [6, 15])
token_spans_2 = tp.TokenSpanArray(tokens_2, [0], [2])

two_doc_series = pd.concat([pd.Series(token_spans[0:1]), pd.Series(token_spans_2)])
two_doc_series.array
Out[17]:
begin end context
0 11 22 King Arthur

In AD 932, King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.

begin end context
0 0 15 Second document

Second document


Note that the HTML representation now contains the annotated text of two documents. We can use the tokens property to view the two sets of tokens backing the two spans in this array:

In [18]:
two_doc_series.array.tokens
Out[18]:
array([<SpanArray>
       [                                    [0, 2): 'In',
                                            [3, 5): 'AD',
                                           [6, 9): '932',
                                        [11, 15): 'King',
                                      [16, 22): 'Arthur',
                                         [23, 26): 'and',
                                         [27, 30): 'his',
                                      [31, 37): 'squire',
                                       [39, 44): 'Patsy',
                                      [46, 52): 'travel',
                                  [53, 63): 'throughout',
                                     [64, 71): 'Britain',
                                   [72, 81): 'searching',
                                         [82, 85): 'for',
                                         [86, 89): 'men',
                                          [90, 92): 'to',
                                        [93, 97): 'join',
                                        [98, 101): 'the',
                                   [102, 109): 'Knights',
                                        [110, 112): 'of',
                                       [113, 116): 'the',
                                     [117, 122): 'Round',
                                     [123, 128): 'Table',
                                     [130, 135): 'Along',
                                       [136, 139): 'the',
                                       [140, 143): 'way',
                                        [145, 147): 'he',
                                  [148, 156): 'recruits',
                                       [157, 160): 'Sir',
                                  [161, 169): 'Bedevere',
                                       [170, 173): 'the',
                                      [174, 178): 'Wise',
                                       [180, 183): 'Sir',
                                  [184, 192): 'Lancelot',
                                       [193, 196): 'the',
                                     [197, 202): 'Brave',
                                       [204, 207): 'Sir',
                                   [208, 215): 'Galahad',
                                       [216, 219): 'the',
                                      [220, 224): 'Pure',
                                       [226, 229): 'Sir',
                                     [230, 235): 'Robin',
                                       [236, 239): 'the',
        [240, 274): 'Not-Quite-So-Brave-as-Sir-Lancelot',
                                       [276, 279): 'and',
                                       [280, 283): 'Sir',
                [284, 310): 'Not-Appearing-in-this-Film',
                                     [312, 317): 'along',
                                      [318, 322): 'with',
                                     [323, 328): 'their',
                                   [329, 336): 'squires',
                                       [337, 340): 'and',
                                   [341, 348): 'Robin's',
                               [349, 360): 'troubadours']
       Length: 54, dtype: SpanDtype                      ,
       <SpanArray>
       [[0, 6): 'Second', [7, 15): 'document']
       Length: 2, dtype: SpanDtype            ], dtype=object)

Spanner

The spanner module of Text Extensions for Pandas provides span-specific operations for Pandas DataFrames, based on the Document Spanners formalism, also known as spanner algebra.

Spanner algebra is an extension of relational algebra with additional operations to cover NLP applications. See the paper "Document Spanners: A Formal Approach to Information Extraction" by Fagin et al. for more information.

The operations available in the spanner module include:

  • consolidate() to eliminate overlap within a column of spans.
  • extract_dict() for dictionary matching and extract_regex_tok() for token-based regular expression matching.
  • adjacent_join(), contain_join(), and overlap_join() for joining series of spans.
  • lemmatize() for projection on spans.

Here we will show how to extract tokens matching regular expressions and then join the results into a single DataFrame.

In [19]:
# Extract tokens using a regular expression; here we find all the knights.
knights = tp.spanner.extract_regex_tok(tokens, regex.compile(r"Sir.\S+"), max_len=2)
knights
Out[19]:
match
0 [157, 169): 'Sir Bedevere'
1 [180, 192): 'Sir Lancelot'
2 [204, 215): 'Sir Galahad'
3 [226, 235): 'Sir Robin'
4 [280, 310): 'Sir Not-Appearing-in-this-Film'
In [20]:
# Try to find each knight's virtue. This is not as easy, and we end up
# with some extra spans.
virtues = tp.spanner.extract_regex_tok(tokens, regex.compile(r"the.\S+"), max_len=2)
virtues
Out[20]:
match
0 [323, 328): 'their'
0 [98, 109): 'the Knights'
1 [113, 122): 'the Round'
2 [136, 143): 'the way'
3 [170, 178): 'the Wise'
4 [193, 202): 'the Brave'
5 [216, 224): 'the Pure'
6 [236, 274): 'the Not-Quite-So-Brave-as-Sir-Lan...
In [21]:
# Calling `tp.spanner.adjacent_join()` joins two span columns, where a pair
# of spans matches if they are adjacent in the text.

# Now, easily join the two results and match each knight to his virtue.
tp.spanner.adjacent_join(knights["match"], virtues["match"], first_name="knight", second_name="virtue")
Out[21]:
knight virtue
0 [157, 169): 'Sir Bedevere' [170, 178): 'the Wise'
1 [180, 192): 'Sir Lancelot' [193, 202): 'the Brave'
2 [204, 215): 'Sir Galahad' [216, 224): 'the Pure'
3 [226, 235): 'Sir Robin' [236, 274): 'the Not-Quite-So-Brave-as-Sir-Lan...
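
The other join operations follow the same pattern. Below is a hedged sketch (not an executed cell) of contain_join(), assuming it shares adjacent_join()'s signature; it pairs each name span from the TokenSpanArray example above with the tokens that span contains:

# Assumed signature, mirroring `adjacent_join()` above: a pair of spans
# matches when the first span completely contains the second.
tp.spanner.contain_join(pd.Series(token_spans), pd.Series(tokens),
                        first_name="name", second_name="token")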

TensorArray

A TensorArray represents an array of tensors, where each element is an N-dimensional tensor of the same shape. If there are M tensor elements in the array, then the entire TensorArray has a shape of M x N, where the outer dimension is the number of elements. Backing the TensorArray is a numpy.ndarray with shape M x N. Tensors, as numpy.ndarrays, are often used as feature vectors for machine learning model training and as inference results. In Text Extensions for Pandas, they are used to store the BERT embeddings produced by io.bert.add_embeddings(), which can then be used to train an NLU model.

TensorArrays can be constructed with zero copy from a single numpy.ndarray, or from a sequence of elements of the same shape. Conversion of a TensorArray back to a numpy.ndarray can also be done with zero copy, by calling TensorArray.to_numpy() or by using the provided NumPy array interface, e.g. numpy.asarray(TensorArray(...)).

The TensorArray is a Pandas extension type with dtype TensorDtype; it can be wrapped in a pandas.Series or used as a column in a pandas.DataFrame and participates in standard Pandas operations. A null or missing value in a TensorArray is represented as an N-dimensional numpy.ndarray whose items are all numpy.nan. Standard arithmetic and comparison operations are supported and delegated to the backing numpy.ndarray. Slicing or selecting multiple items produces another TensorArray, while selecting a single element produces a TensorElement that wraps a view of the backing numpy.ndarray, with similar operator support.

In [22]:
# Construct from a numpy.ndarray.
arr = tp.TensorArray(np.arange(10).reshape(5, 2))
arr, arr.dtype
Out[22]:
(array([[0, 1],
        [2, 3],
        [4, 5],
        [6, 7],
        [8, 9]]),
 <text_extensions_for_pandas.array.tensor.TensorDtype at 0x7f9ae00cadc0>)
In [23]:
# Wrap in a Pandas Series.
s = pd.Series(arr)
s
Out[23]:
0    [0, 1]
1    [2, 3]
2    [4, 5]
3    [6, 7]
4    [8, 9]
dtype: TensorDtype
In [24]:
# Convert back to numpy using the provided array interface.
np_arr = np.asarray(s)
np_arr, np_arr.dtype
Out[24]:
(array([[0, 1],
        [2, 3],
        [4, 5],
        [6, 7],
        [8, 9]]),
 dtype('int64'))
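
Given the zero-copy claims above, both conversions should be views of the same backing array. A quick sketch (not an executed cell) that checks this with NumPy's shares_memory:

# If both conversions are zero-copy, the two ndarrays view the same memory.
np.shares_memory(arr.to_numpy(), np_arr)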
In [25]:
# Apply operations on the Series, result is another Series of type TensorDtype.
thresh = s > 4
thresh
Out[25]:
0    [ False,  False]
1    [ False,  False]
2    [ False,   True]
3    [  True,   True]
4    [  True,   True]
dtype: TensorDtype
In [26]:
# Create a boolean selection mask. Use `.array` to get the Series as
# a `TensorArray`, which can be used directly in numpy operations and
# returns another `TensorArray`.
mask = np.all(thresh.array, axis=1)
mask, type(mask)
Out[26]:
(array([False, False, False,  True,  True]),
 text_extensions_for_pandas.array.tensor.TensorArray)
In [27]:
# Apply Pandas selection on the Series of TensorDtype by converting
# the mask to a numpy boolean array.
s[mask.to_numpy()]
Out[27]:
3    [6, 7]
4    [8, 9]
dtype: TensorDtype
In [28]:
# TensorArray can also be added to a Pandas DataFrame.
df = pd.DataFrame({"time": pd.date_range('2018-01-01', periods=5, freq='H'), "features": arr})
df
Out[28]:
time features
0 2018-01-01 00:00:00 [0, 1]
1 2018-01-01 01:00:00 [2, 3]
2 2018-01-01 02:00:00 [4, 5]
3 2018-01-01 03:00:00 [6, 7]
4 2018-01-01 04:00:00 [8, 9]
In [29]:
# TensorArray supports many of the standard DataFrame operations.
df.sort_values(by="time", ascending=False)
Out[29]:
time features
4 2018-01-01 04:00:00 [8, 9]
3 2018-01-01 03:00:00 [6, 7]
2 2018-01-01 02:00:00 [4, 5]
1 2018-01-01 01:00:00 [2, 3]
0 2018-01-01 00:00:00 [0, 1]

Saving Pandas Extension Arrays to Disk

Pandas supports several built-in I/O formats, but currently the only supported format for saving DataFrames containing Text Extensions for Pandas arrays to disk is Feather. Text Extensions for Pandas arrays can also be converted to Apache Arrow format; see https://arrow.apache.org/docs/python/pandas.html#dataframes for more information.
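
Since Feather is built on Arrow, an in-memory round trip through a pyarrow.Table should work as well. A minimal sketch (not an executed cell), assuming pyarrow is installed and that these extension arrays support Arrow conversion, as the Feather support implies:

import pyarrow as pa

# Convert a DataFrame, including its extension-array columns, to Arrow...
table = pa.Table.from_pandas(df)

# ...and back to Pandas.
df_roundtrip = table.to_pandas()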

In [30]:
# Dummy function to create some features.
def hasher(span, num_features=4):
    arr = np.zeros(num_features, dtype="int8")
    arr[hash(span.covered_text) % num_features] = 1
    return arr
In [31]:
# Create our feature vector.
features = tp.TensorArray([hasher(span) for span in tokens])
features.to_numpy().shape
Out[31]:
(54, 4)
In [32]:
# Add tokens and features to a DataFrame.
df = pd.DataFrame({"span": tokens, "features": features})
df.head()
Out[32]:
span features
0 [0, 2): 'In' [0, 0, 0, 1]
1 [3, 5): 'AD' [0, 0, 0, 1]
2 [6, 9): '932' [0, 1, 0, 0]
3 [11, 15): 'King' [1, 0, 0, 0]
4 [16, 22): 'Arthur' [0, 1, 0, 0]
In [33]:
# Save DataFrame to a feather file.
# Feather is a lightweight, fast binary columnar format, with basic
# compression and support built into Pandas.
df.to_feather("outputs/tp_overview.feather")
In [34]:
# Read the file back into a new DataFrame.

df_load = pd.read_feather("outputs/tp_overview.feather")
df_load.head()
Out[34]:
span features
0 [0, 2): 'In' [0, 0, 0, 1]
1 [3, 5): 'AD' [0, 0, 0, 1]
2 [6, 9): '932' [0, 1, 0, 0]
3 [11, 15): 'King' [1, 0, 0, 0]
4 [16, 22): 'Arthur' [0, 1, 0, 0]

NLP Library Input/Output Integration

Text Extensions for Pandas also provides integration with other NLP libraries and datasets. It handles processing of inputs and outputs using the Pandas DataFrame as a standard data structure, automatically producing the extension arrays described above where applicable. Below is an overview of what each module provides, along with pointers to notebooks with example usage.

Watson

The io.watson subpackage provides functions to process and help analyze responses from the IBM Watson cloud service APIs.

In the module io.watson.nlu, you can use Watson Natural Language Understanding to analyze text and then process the response into Pandas DataFrames containing SpanArrays for tokens, sentences, and relations. See getting started on Watson NLU for setting up the Watson NLU cloud service, and the notebook Analyze_Text for in-depth examples of using the io.watson.nlu module.
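
A typical flow, shown as a hedged sketch (not an executed cell): call the service through the ibm-watson SDK, then hand the raw JSON response to the module's response parser, assumed here to be parse_response(). The API key and service URL are placeholders:

from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
import ibm_watson.natural_language_understanding_v1 as nlu_types

# Placeholder credentials; substitute your own API key and service URL.
service = NaturalLanguageUnderstandingV1(
    version="2021-01-01", authenticator=IAMAuthenticator("<your_api_key>"))
service.set_service_url("<your_service_url>")

# Request syntax analysis of the example text from earlier in this notebook.
response = service.analyze(
    text=text,
    features=nlu_types.Features(syntax=nlu_types.SyntaxOptions(
        sentences=True,
        tokens=nlu_types.SyntaxOptionsTokens(lemma=True, part_of_speech=True)))
).get_result()

# Convert the JSON response into a dictionary of DataFrames.
dfs = tp.io.watson.nlu.parse_response(response)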

In the module io.watson.table, you can use Watson Discovery to extract and analyze tables within documents and web pages, then process the response into Pandas DataFrames that make it easy to reconstruct and work with the extracted tables. See Watson Discovery Installation and IBM Cloud Pak for Data for getting started with Watson Discovery, and the notebook Understand_Tables for an in-depth example of using the io.watson.table module.

SpaCy

The io.spacy module contains functions that integrate with the popular NLP library spaCy. For example, you can run a spaCy tokenizer on text and get the tokens back as a SpanArray in a Pandas DataFrame with io.spacy.make_tokens(), or with additional token features via io.spacy.make_tokens_and_features(). See the notebook Integrate_NLP_Libraries for more examples with the io.spacy module.
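
A minimal sketch (not an executed cell), assuming the en_core_web_sm spaCy model is installed and that make_tokens_and_features() takes the target text and a loaded language model:

import spacy

# Load a small English language model.
nlp = spacy.load("en_core_web_sm")

# One row per token: the token's span plus features such as
# part of speech and lemma.
token_features = tp.io.spacy.make_tokens_and_features(text, nlp)
token_features.head()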

BERT

The BERT model was introduced in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. The model is pre-trained with masked language modeling and next sentence prediction objectives, which make it effective for masked token prediction and NLU.

Text Extensions for Pandas integrates with the Huggingface Transformers library to process the result of BERT tokenization into a Pandas DataFrame, with tokens as a SpanArray column, and to compute BERT embeddings that can also be added to a DataFrame as a TensorArray. The embeddings can be used for model training in your NLP application. See the notebook Model_Training_with_BERT for an example of tokenizing text with BERT and computing embeddings for model training and scoring.
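
A minimal sketch of that flow (not an executed cell), assuming the transformers package is installed; the tokenization helper is assumed to be io.bert.make_bert_tokens(), and add_embeddings() is assumed to take the token DataFrame and the model:

from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Tokenize into a DataFrame with a SpanArray column of BERT tokens...
bert_toks = tp.io.bert.make_bert_tokens(text, tokenizer)

# ...then add a TensorArray column of embeddings, one vector per token.
bert_toks = tp.io.bert.add_embeddings(bert_toks, bert)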

CoNLL

CoNLL, the SIGNLL Conference on Computational Natural Language Learning, is an annual academic conference for natural language processing researchers. Each year's conference features a competition involving a challenging NLP task. The task for the 2003 competition involved identifying mentions of named entities in English and German news articles from the late 1990s. The corpus for this 2003 competition is one of the most widely used benchmarks for the performance of named entity recognition models.

Text Extensions for Pandas contains the module io.conll, which helps work with and analyze the CoNLL-2003 corpus. The provided functions help convert between the IOB2 format used in the corpus and SpanArrays with entity types, for easier analysis. See the notebooks Analyze_Model_Outputs for an in-depth analysis of the corpus and the 2003 competition results, and Model_Training_with_BERT for using the corpus to train a named entity recognition (NER) model.
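
An illustrative sketch (not an executed cell); the function names, argument layout, and file path here are assumptions, so check the io.conll API reference before relying on them:

# Read one CoNLL-2003 file into a list of DataFrames, one per document.
# "pos", "phrase", and "ent" name the file's three tag columns, and the
# booleans mark which of those columns carry IOB-style tags.
docs = tp.io.conll.conll_2003_to_dataframes(
    "eng.testa", ["pos", "phrase", "ent"], [False, True, True])

# Collapse the first document's token-level IOB2 tags into one span
# per entity mention, tagged with its entity type.
entity_spans = tp.io.conll.iob_to_spans(docs[0])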