We're going to replicate the benchmark in "A Named Entity Based Approach to Model Recipes" by Diwan, Batra and Bagler using StanfordNLP, and check it using seqeval.
Evaluating NER is surprisingly tricky, as David Batista explains, and I want to check that the results in the paper are the same as what seqeval gives, so I can compare it to other models.
The authors share their data in an associated git repository and train a model using Stanford NER, which is open source, so we have a chance of replicating the results.
We're going to install Stanford NLP, which is a Java library. To make things easier we will use stanza, which includes tools for installing and invoking Stanford NLP.
!pip install stanza
Collecting stanza
Downloading stanza-1.3.0-py3-none-any.whl (432 kB)
|████████████████████████████████| 432 kB 292 kB/s
Requirement already satisfied: requests in /opt/conda/lib/python3.7/site-packages (from stanza) (2.26.0)
Requirement already satisfied: protobuf in /opt/conda/lib/python3.7/site-packages (from stanza) (3.19.4)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.7/site-packages (from stanza) (4.62.3)
Requirement already satisfied: torch>=1.3.0 in /opt/conda/lib/python3.7/site-packages (from stanza) (1.9.1+cpu)
Requirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from stanza) (1.20.3)
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from stanza) (1.16.0)
Requirement already satisfied: emoji in /opt/conda/lib/python3.7/site-packages (from stanza) (1.7.0)
Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.7/site-packages (from torch>=1.3.0->stanza) (4.1.1)
Requirement already satisfied: charset-normalizer~=2.0.0 in /opt/conda/lib/python3.7/site-packages (from requests->stanza) (2.0.9)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests->stanza) (2021.10.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests->stanza) (1.26.7)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests->stanza) (3.1)
Installing collected packages: stanza
Successfully installed stanza-1.3.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
We can specify where to install Core NLP, but we will use the default, which is either "$CORENLP_HOME" or "$HOME/stanza_corenlp". (Ideally we'd use stanza to get this, but I couldn't easily work out how.)
import stanza
stanza.install_corenlp()
Downloading https://huggingface.co/stanfordnlp/CoreNLP/resolve/main/stanford-corenlp-latest.zip: 0%| …
We'll need to invoke the Stanford Core NLP JAR that we just installed, so let's find it.
import os
import re
from pathlib import Path
# Reimplement the logic to find the path where stanza_corenlp is installed.
core_nlp_path = os.getenv('CORENLP_HOME', str(Path.home() / 'stanza_corenlp'))
# A heuristic to find the right jar file
classpath = [str(p) for p in Path(core_nlp_path).iterdir() if re.match(r"stanford-corenlp-[0-9.]+\.jar", p.name)][0]
classpath
'/root/stanza_corenlp/stanford-corenlp-4.4.0.jar'
Let's test the basic usage.
There are currently models for 8 languages, and for some fairly complex tasks like coreference resolution.
from stanza.server import CoreNLPClient
text = "David Batista wrote a blog post on NER evaluation. " \
"Hiroki Nakayama wrote seqeval to evaluate sequential labelling tasks, such as NER. " \
"We will test his library against Stanford Core NLP. "
with CoreNLPClient(
annotators=['tokenize','ssplit','pos','lemma','ner', 'parse', 'depparse','coref'],
timeout=30000,
memory='6G') as client:
ann = client.annotate(text)
[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - Server default properties:
            (Note: unspecified annotator properties are English defaults)
            annotators = tokenize,ssplit,pos,lemma,ner,parse,depparse,coref
            inputFormat = text
            outputFormat = serialized
            prettyPrint = false
            threads = 5
[main] INFO CoreNLP - Threads: 5
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words-distsim.tagger ... done [1.1 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [2.0 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.6 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [1.0 sec].
[main] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
[main] INFO edu.stanford.nlp.time.TimeExpressionExtractorImpl - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 580705 unique entries out of 581864 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 4867 unique entries out of 4867 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 585572 unique entries from 2 files
[main] INFO edu.stanford.nlp.pipeline.NERCombinerAnnotator - numeric classifiers: true; SUTime: true [no docDate]; fine grained: true
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.8 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model: edu/stanford/nlp/models/parser/nndep/english_UD.gz ... Time elapsed: 2.1 sec
[main] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 20000 vectors, elapsed Time: 2.204 sec
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [4.3 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator coref
[main] INFO edu.stanford.nlp.coref.statistical.SimpleLinearClassifier - Loading coref model edu/stanford/nlp/models/coref/statistical/ranking_model.ser.gz ... done [0.9 sec].
[main] INFO edu.stanford.nlp.pipeline.CorefMentionAnnotator - Using mention detector type: dependency
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0.0.0.0:9000
[pool-1-thread-3] INFO CoreNLP - [/127.0.0.1:36852] API call w/annotators tokenize,ssplit,pos,lemma,ner,parse,depparse,coref
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator coref
David Batista wrote a blog post on NER evaluation. Hiroki Nakayama wrote seqeval to evaluate sequential labelling tasks, such as NER. We will test his library against Stanford Core NLP.
[Thread-0] INFO CoreNLP - CoreNLP Server is shutting down.
We get 3 sentences out.
for sentence in ann.sentence:
    print(" ".join([token.word for token in sentence.token]))
David Batista wrote a blog post on NER evaluation .
Hiroki Nakayama wrote seqeval to evaluate sequential labelling tasks , such as NER .
We will test his library against Stanford Core NLP .
It can even do clever things like coreference resolution, resolving that "his library" refers to "Hiroki Nakayama's library".
for chain in ann.corefChain:
    print([ann.mentionsForCoref[mention.mentionID].headString for mention in chain.mention])
['nakayama', 'his']
We can extract things such as lemmas, parts of speech and standard NER tags.
But we want to train our own NER model to detect ingredients. First we will need to collect the data.
import pandas as pd

tokens = ann.sentence[1].token
pd.DataFrame({'word': [s.word for s in tokens],
              'lemma': [s.lemma for s in tokens],
              'pos': [s.pos for s in tokens],
              'ner': [s.ner for s in tokens]}).T
|       | 0      | 1        | 2     | 3       | 4  | 5        | 6          | 7         | 8     | 9 | 10   | 11 | 12  | 13 |
|-------|--------|----------|-------|---------|----|----------|------------|-----------|-------|---|------|----|-----|----|
| word  | Hiroki | Nakayama | wrote | seqeval | to | evaluate | sequential | labelling | tasks | , | such | as | NER | .  |
| lemma | Hiroki | Nakayama | write | seqeval | to | evaluate | sequential | labelling | task  | , | such | as | ner | .  |
| pos   | NNP    | NNP      | VBD   | NN      | TO | VB       | JJ         | NN        | NNS   | , | JJ   | IN | NN  | .  |
| ner   | PERSON | PERSON   | O     | O       | O  | O        | O          | O         | O     | O | O    | O  | O   | O  |
Helpfully the authors provide the annotated ingredients data in the format for Stanford NER, which we can download from GitHub.
There are two sources of ingredients: ar is AllRecipes and gk is FOOD.com (formerly GeniusKitchen.com).
from urllib.request import urlretrieve

data_sources = ['ar', 'gk']
data_splits = ['train', 'test']
base_url = 'https://raw.githubusercontent.com/cosylabiiit/recipe-knowledge-mining/master/'

def data_filename(source, split):
    return f'{source}_{split}.tsv'

for source in data_sources:
    for split in data_splits:
        name = data_filename(source, split)
        urlretrieve(base_url + name, name)
Each line of the file is either a single tab (separating different texts), or a token followed by a tab and then the entity type.
So for example the first ingredient is "4 cloves garlic", which is a quantity (4) followed by a unit (cloves) and a name (garlic).
!head {data_filename('ar', 'train')} | cat -t
^I
4^IQUANTITY
cloves^IUNIT
garlic^INAME
^I
2^IQUANTITY
tablespoons^IUNIT
vegetable^INAME
oil^INAME
,^IO
We can read this in to Python, converting it to a list of annotated sentences, which is just a sequence of token, label pairs.
from typing import List, Tuple, Generator

Annotation = Tuple[str, str]
AnnotatedSentence = List[Annotation]

def segment_texts(data: str) -> Generator[AnnotatedSentence, None, None]:
    """Split tab-separated token/tag lines into annotated sentences."""
    output = []
    for line in data.split('\n'):
        if line.strip():
            token, label = line.split('\t')
            output.append((token.strip(), label.strip()))
        elif output:
            yield output
            output = []

def segment_file(filename: str) -> List[AnnotatedSentence]:
    with open(filename, 'rt') as f:
        return list(segment_texts(f.read()))
ar_train = segment_file(data_filename('ar', 'train'))
ar_train[:2]
[[('4', 'QUANTITY'), ('cloves', 'UNIT'), ('garlic', 'NAME')], [('2', 'QUANTITY'), ('tablespoons', 'UNIT'), ('vegetable', 'NAME'), ('oil', 'NAME'), (',', 'O'), ('divided', 'STATE')]]
We can then calculate the number of sentences in the training set for a source.
len(ar_train)
1470
We can use this to check the types of entities annotated, as in the paper (DF is Dried/Fresh).
from collections import Counter
tag_counts = Counter([annotation[1] for sentence in ar_train for annotation in sentence])
tag_counts
Counter({'QUANTITY': 1583, 'UNIT': 1338, 'NAME': 2501, 'O': 1662, 'STATE': 879, 'DF': 154, 'SIZE': 64, 'TEMP': 31})
Now we want to train a Stanford NER model on the new annotations.
First we have to configure it, but there's no information in the paper on how it was configured. I've copied this template configuration out of the FAQ. For more information on the parameters you can check the NERFeatureFactory documentation or the source.
def ner_prop_str(train_files: List[str], test_files: List[str], output: str) -> str:
    """Returns configuration string to train NER model"""
    train_file_str = ','.join(train_files)
    test_file_str = ','.join(test_files)
    return f"""
trainFileList = {train_file_str}
testFiles = {test_file_str}
serializeTo = {output}
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
"""
This is expected to be a file, so let's write a helper that writes it to a file. (An alternative would be to pass these as arguments to the trainer).
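As a rough sketch of that alternative (not run here): I believe CRFClassifier will also read individual properties from command-line flags such as -trainFile, -serializeTo and -map, although the feature settings from the template above would still have to come from somewhere, so this isn't equivalent to using the props file. The output name ar_cli.model.ser.gz is just a placeholder.

# Sketch only: pass the data/output properties as CLI flags instead of a props file.
# ar_cli.model.ser.gz is a hypothetical output filename.
cli_args = ['java', '-Xmx2g', '-cp', classpath,
            'edu.stanford.nlp.ie.crf.CRFClassifier',
            '-trainFile', data_filename('ar', 'train'),
            '-serializeTo', 'ar_cli.model.ser.gz',
            '-map', 'word=0,answer=1']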
def write_ner_prop_file(ner_prop_file: str, train_files: List[str], test_files: List[str], output_file: str) -> None:
    with open(ner_prop_file, 'wt') as f:
        props = ner_prop_str(train_files, test_files, output_file)
        f.write(props)
Stanza doesn't give an interface to train a CRF NER model using Stanford NLP, but we can invoke edu.stanford.nlp.ie.crf.CRFClassifier directly.
Let's write a properties file and invoke Java to run the classifier. It prints a lot of training information, and importantly a summary report at the end which we want to see.
import subprocess
from typing import List

def train_model(model_name, train_files: List[str], test_files: List[str], print_report=True, classpath=classpath) -> str:
    """Trains CRF NER Model using StanfordNLP"""
    model_file = f'{model_name}.model.ser.gz'
    ner_prop_filename = f'{model_name}.model.props'
    write_ner_prop_file(ner_prop_filename, train_files, test_files, model_file)
    result = subprocess.run(
        ['java',
         '-Xmx2g',
         '-cp', classpath,
         'edu.stanford.nlp.ie.crf.CRFClassifier',
         '-prop', ner_prop_filename],
        capture_output=True)
    # If there's an error with invocation better log the stacktrace
    if result.returncode != 0:
        print(result.stderr.decode('utf-8'))
        result.check_returncode()
    if print_report:
        print(*result.stderr.decode('utf-8').split('\n')[-11:], sep='\n')
    return model_file
We can train models on each dataset separately, and all together. For evaluation we'll use the corresponding test set.
This only takes a few minutes.
%%time
models = {}
for source in ['ar', 'gk', 'ar_gk']:
    print(source)
    train_files = [data_filename(s, 'train') for s in source.split('_')]
    test_files = [data_filename(s, 'test') for s in source.split('_')]
    models[source] = train_model(source, train_files, test_files)
    print()
ar
CRFClassifier tagged 2788 words in 483 documents at 7185.57 words per second.
         Entity    P         R         F1        TP    FP    FN
             DF    1.0000    0.9608    0.9800    49    0     2
           NAME    0.9297    0.9279    0.9288    463   35    36
       QUANTITY    1.0000    0.9962    0.9981    522   0     2
           SIZE    1.0000    1.0000    1.0000    20    0     0
          STATE    0.9601    0.9633    0.9617    289   12    11
           TEMP    0.8750    0.7000    0.7778    7     1     3
           UNIT    0.9819    0.9841    0.9830    434   8     7
         Totals    0.9696    0.9669    0.9682    1784  56    61

gk
CRFClassifier tagged 9886 words in 1705 documents at 11727.16 words per second.
         Entity    P         R         F1        TP    FP    FN
             DF    0.9718    0.9517    0.9617    138   4     7
           NAME    0.9132    0.9021    0.9076    1621  154   176
       QUANTITY    0.9882    0.9870    0.9876    1598  19    21
           SIZE    0.9750    0.9398    0.9571    78    2     5
          STATE    0.9255    0.9503    0.9377    708   57    37
           TEMP    0.8125    0.8125    0.8125    26    6     6
           UNIT    0.9810    0.9721    0.9766    1291  25    37
         Totals    0.9534    0.9497    0.9516    5460  267   289

ar_gk
CRFClassifier tagged 12674 words in 2188 documents at 11648.90 words per second.
         Entity    P         R         F1        TP    FP    FN
             DF    0.9738    0.9490    0.9612    186   5     10
           NAME    0.9136    0.9077    0.9106    2084  197   212
       QUANTITY    0.9911    0.9897    0.9904    2121  19    22
           SIZE    0.9798    0.9417    0.9604    97    2     6
          STATE    0.9386    0.9512    0.9449    994   65    51
           TEMP    0.8140    0.8333    0.8235    35    8     7
           UNIT    0.9801    0.9763    0.9782    1727  35    42
         Totals    0.9563    0.9539    0.9551    7244  331   350

CPU times: user 276 ms, sys: 134 ms, total: 410 ms
Wall time: 2min 17s
The summary report shows, for each model and entity type, the precision (P), recall (R) and F1, along with the counts of true positives (TP), false positives (FP) and false negatives (FN).
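As a quick sanity check on how those columns relate, we can recompute the ar Totals row from its counts:

# Recompute precision, recall and F1 from the TP/FP/FN counts of the ar Totals row.
tp, fp, fn = 1784, 56, 61
precision = tp / (tp + fp)                           # 0.9696
recall = tp / (tp + fn)                              # 0.9669
f1 = 2 * precision * recall / (precision + recall)   # 0.9682
print(f'{precision:.4f} {recall:.4f} {f1:.4f}')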
We can compare the F1 Totals (0.9682 for ar, 0.9516 for gk and 0.9551 for ar_gk) to the diagonal of Table IV in the paper (0.9682, 0.9519 and 0.9611).
These are super close. The furthest off is ar_gk; the repository also has a separate ar_gk_train.tsv, and it would be interesting to check whether using it directly gives a closer result and why there is a difference.
We can now use these trained models in Python by invoking Stanford NLP with Stanza.
First we'll load in the test data.
test_data = {}
for source in data_sources:
    test_data[source] = segment_file(data_filename(source, 'test'))
    print(source, len(test_data[source]))
ar 483
gk 1705
We can call StanfordNLP with our custom model by passing the property ner.model.
Our test data is already tokenized, in a different way to StanfordNLP's default, so we'll set the tokenizer to use whitespace tokenization, which is easy to invert.
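(A trivial illustration of that invertibility, assuming no token itself contains a space:)

# Round-trip: join the tokens with spaces, whitespace-tokenize, and get them back.
tokens = ['4', 'cloves', 'garlic']
assert ' '.join(tokens).split(' ') == tokens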
It takes a while to start up the server so we want to annotate a large number of texts at once.
from tqdm.notebook import tqdm
from stanza.server import CoreNLPClient

def annotate_ner(ner_model_file: str, texts: List[str], tokenize_whitespace: bool = True):
    properties = {"ner.model": ner_model_file,
                  "tokenize.whitespace": tokenize_whitespace,
                  "ner.applyNumericClassifiers": False}
    annotated = []
    with CoreNLPClient(
            annotators=['tokenize', 'ssplit', 'ner'],
            properties=properties,
            timeout=30000,
            be_quiet=True,
            memory='6G') as client:
        for text in tqdm(texts):
            annotated.append(client.annotate(text))
    return annotated
We can then get the annotations.
annotations = annotate_ner(models['ar'],
                           ['1 cup of frozen peas',
                            'A dash of salt . Or to taste',
                            '12 slices pancetta -LRB- Italian unsmoked cured bacon -RRB-',
                            'pumpkin sliced into 3 cm moons'])
0%| | 0/4 [00:00<?, ?it/s]
Note here that the word "Italian" has ner "NATIONALITY", which comes from another model (it wasn't in the training set!). We want to use the coarseNER field instead.
annotations[2].sentence[0].token[4]
word: "Italian" pos: "JJ" value: "Italian" originalText: "Italian" ner: "NATIONALITY" lemma: "italian" beginChar: 25 endChar: 32 tokenBeginIndex: 4 tokenEndIndex: 5 hasXmlContext: false isNewline: false coarseNER: "O" fineGrainedNER: "NATIONALITY" entityMentionIndex: 3 nerLabelProbs: "O=0.870902471545891"
When I didn't set "ner.applyNumericClassifiers": False, this would come up as a NUMBER.
annotations[3].sentence[0].token[3]
word: "3" pos: "CD" value: "3" originalText: "3" ner: "O" lemma: "3" beginChar: 20 endChar: 21 tokenBeginIndex: 3 tokenEndIndex: 4 hasXmlContext: false isNewline: false coarseNER: "O" fineGrainedNER: "O" nerLabelProbs: "O=0.8599887537555505"
We can then flatten the sentences and extract the NER tokens.
from dataclasses import dataclass, asdict

@dataclass
class NERData:
    ner: List[str]
    tokens: List[str]

    # Let's use Pandas to make it pretty in a notebook
    def _repr_html_(self):
        return pd.DataFrame(asdict(self)).T._repr_html_()

def extract_ner_data(annotation) -> NERData:
    tokens = [token for sentence in annotation.sentence for token in sentence.token]
    return NERData(tokens=[t.word for t in tokens], ner=[t.coarseNER for t in tokens])
A relatively simple ingredient works well.
extract_ner_data(annotations[0])
|        | 0        | 1    | 2  | 3      | 4    |
|--------|----------|------|----|--------|------|
| ner    | QUANTITY | UNIT | O  | TEMP   | NAME |
| tokens | 1        | cup  | of | frozen | peas |
A more complex sentence does quite badly, perhaps because nothing like it appeared in the training data.
extract_ner_data(annotations[1])
|        | 0        | 1    | 2    | 3    | 4    | 5    | 6  | 7     |
|--------|----------|------|------|------|------|------|----|-------|
| ner    | QUANTITY | UNIT | NAME | NAME | NAME | NAME | O  | O     |
| tokens | A        | dash | of   | salt | .    | Or   | to | taste |
extract_ner_data(annotations[2])
|        | 0        | 1      | 2        | 3     | 4       | 5        | 6     | 7     | 8     |
|--------|----------|--------|----------|-------|---------|----------|-------|-------|-------|
| ner    | QUANTITY | UNIT   | NAME     | O     | O       | O        | O     | O     | O     |
| tokens | 12       | slices | pancetta | -LRB- | Italian | unsmoked | cured | bacon | -RRB- |
We can chain these functions together to get from text to NER.
def ner_extract(ner_model_file: str, texts: List[str], tokenize_whitespace: bool = True) -> List[NERData]:
    annotations = annotate_ner(ner_model_file, texts, tokenize_whitespace)
    return [extract_ner_data(ann) for ann in annotations]
And then for each model, and test data we can calculate the predictions.
preds = {}
for model, modelfile in models.items():
    preds[model] = {}
    for test_source, token_data in test_data.items():
        texts = [' '.join([x[0] for x in text]) for text in token_data]
        preds[model][test_source] = ner_extract(modelfile, texts)
0%| | 0/483 [00:00<?, ?it/s]
0%| | 0/1705 [00:00<?, ?it/s]
0%| | 0/483 [00:00<?, ?it/s]
0%| | 0/1705 [00:00<?, ?it/s]
0%| | 0/483 [00:00<?, ?it/s]
0%| | 0/1705 [00:00<?, ?it/s]
Let's check that the same tokens come out of the model as were put in.
for test_source, token_data in test_data.items():
    tokens = [[x[0] for x in tokens] for tokens in token_data]
    for model in models:
        model_preds = preds[model][test_source]
        model_tokens = [p.tokens for p in model_preds]
        if tokens != model_tokens:
            raise ValueError("Tokenization issue in %s with model %s" % (test_source, model))
!pip install seqeval
Collecting seqeval
Downloading seqeval-1.2.2.tar.gz (43 kB)
|████████████████████████████████| 43 kB 102 kB/s
Preparing metadata (setup.py) ... - done
Requirement already satisfied: numpy>=1.14.0 in /opt/conda/lib/python3.7/site-packages (from seqeval) (1.20.3)
Requirement already satisfied: scikit-learn>=0.21.3 in /opt/conda/lib/python3.7/site-packages (from seqeval) (1.0.1)
Requirement already satisfied: joblib>=0.11 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval) (3.0.0)
Requirement already satisfied: scipy>=1.1.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval) (1.7.3)
Building wheels for collected packages: seqeval
Building wheel for seqeval (setup.py) ... - \ | done
Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16181 sha256=117220ab957b2dfbf6fad8b7cf7fb429b409f1fb1b62fef7ea14d20e38b36203
Stored in directory: /root/.cache/pip/wheels/05/96/ee/7cac4e74f3b19e3158dce26a20a1c86b3533c43ec72a549fd7
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Seqeval expects the data to be in one of the following tagging schemes: IOB1, IOB2, IOE1, IOE2, IOBES or BILOU.
These all become important when trying to distinguish distinct entities that are adjacent, which are quite rare in practice. See Wikipedia for a detailed explanation of IOB (inside-outside-beginning) tagging.
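Here's a small made-up illustration of why the scheme matters for adjacent entities: with B- prefixes two adjacent one-token NAME entities stay separate, while the all-I- encoding we use below merges them into a single entity, so seqeval gives no credit.

from seqeval.metrics import f1_score

two_entities = [['B-NAME', 'B-NAME']]   # e.g. two distinct one-token names next to each other
one_entity = [['I-NAME', 'I-NAME']]     # the same tokens read as a single two-token entity
print(f1_score(two_entities, one_entity))  # 0.0: no entity spans match exactly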
In this case it's assumed there's only one entity of each type (which can be wrong when multiple names are listed in a single ingredient). We can easily convert it to IOB1 using this assumption by prefixing every tag other than 'O' with an 'I-'.
def convert_to_iob1(tokens):
    return ['I-' + label if label != 'O' else 'O' for label in tokens]

assert convert_to_iob1(['QUANTITY', 'SIZE', 'NAME', 'NAME', 'O', 'STATE']) == \
    ['I-QUANTITY', 'I-SIZE', 'I-NAME', 'I-NAME', 'O', 'I-STATE']
Let's check the classification report for a single example and compare it to the report from Stanford NER.
The classification report doesn't show TP, FP and FN, but instead the support - the number of true entities in the data, which is just TP + FN - so it carries equivalent information.
The results are the same.
from seqeval.metrics import classification_report
test_source = 'ar'
model = 'ar'
actual_ner = [convert_to_iob1([x[1] for x in ann]) for ann in test_data[test_source]]
pred_ner = [convert_to_iob1(p.ner) for p in preds[model][test_source]]
print(classification_report(actual_ner, pred_ner, digits=4))
              precision    recall  f1-score   support

          DF     1.0000    0.9608    0.9800        51
        NAME     0.9297    0.9279    0.9288       499
    QUANTITY     1.0000    0.9962    0.9981       524
        SIZE     1.0000    1.0000    1.0000        20
       STATE     0.9601    0.9633    0.9617       300
        TEMP     0.8750    0.7000    0.7778        10
        UNIT     0.9819    0.9841    0.9830       441

   micro avg     0.9696    0.9669    0.9682      1845
   macro avg     0.9638    0.9332    0.9471      1845
weighted avg     0.9695    0.9669    0.9682      1845
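The support column does line up with TP + FN from the Stanford report above, for example:

# support = TP + FN, checked against a few rows of the Stanford ar report.
assert 463 + 36 == 499   # NAME
assert 522 + 2 == 524    # QUANTITY
assert 434 + 7 == 441    # UNIT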
We can get the micro f1-score directly.
from seqeval.metrics import f1_score
'%0.4f' % f1_score(actual_ner, pred_ner)
'0.9682'
We can then try to reproduce Table IV by computing the f1-score for each model and data.
scores = {model: {} for model in models}
for test_source, data in test_data.items():
    actual_ner = [convert_to_iob1([x[1] for x in ann]) for ann in data]
    for model in models:
        pred_ner = [convert_to_iob1(p.ner) for p in preds[model][test_source]]
        scores[model][test_source] = f1_score(actual_ner, pred_ner)
We also need to calculate the scores on the combined test set, by concatenating them.
actual_ner = [convert_to_iob1([x[1] for x in ann]) for data in test_data.values() for ann in data]
for model in models:
    pred_ner = [convert_to_iob1(p.ner) for test_source in test_data for p in preds[model][test_source]]
    scores[model]['combined'] = f1_score(actual_ner, pred_ner)

pd.DataFrame(scores).style.format('{:0.4f}')
|          | ar     | gk     | ar_gk  |
|----------|--------|--------|--------|
| ar       | 0.9682 | 0.9331 | 0.9704 |
| gk       | 0.8666 | 0.9511 | 0.9499 |
| combined | 0.8911 | 0.9469 | 0.9549 |
The results are slightly different to those in the paper, but every cell agrees within 0.01.
So we've successfully reproduced the results in the paper, and shown the evaluation from the Stanford NER toolkit is very close to that of seqeval (if you work around hallucinated entities).
reported_scores = pd.DataFrame([[0.9682, 0.9317, 0.9709],
                                [0.8672, 0.9519, 0.9498],
                                [0.8972, 0.9472, 0.9611]],
                               columns=['AllRecipes', 'FOOD.com', 'BOTH'],
                               index=['AllRecipes', 'FOOD.com', 'BOTH'])
reported_scores
|            | AllRecipes | FOOD.com | BOTH   |
|------------|------------|----------|--------|
| AllRecipes | 0.9682     | 0.9317   | 0.9709 |
| FOOD.com   | 0.8672     | 0.9519   | 0.9498 |
| BOTH       | 0.8972     | 0.9472   | 0.9611 |
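As a final check of the "within 0.01" claim, here's a small sketch that assumes our scores table lines up with reported_scores by position (ar with AllRecipes, gk with FOOD.com, and combined/ar_gk with BOTH):

import numpy as np

# Largest absolute gap between our f1-scores and the paper's Table IV,
# comparing the two tables cell by cell (alignment assumption noted above).
print(np.abs(pd.DataFrame(scores).values - reported_scores.values).max())  # about 0.006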