We're going to replicate the benchmark in "A Named Entity Based Approach to Model Recipes" by Diwan, Batra and Bagler using StanfordNLP, and check it using seqeval.
Evaluating NER is surprisingly tricky, as David Batista explains, and I want to check that the results in the paper are the same as what seqeval gives, so I can compare it to other models.
The authors share their data in an associated git repository and train a model using Stanford NER, which is open source, so we have a chance of replicating the results.
We're going to install Stanford NLP, which is a Java library. To make things easier we will use stanza, which includes tools for installing and invoking Stanford NLP.
!pip install stanza
Collecting stanza
Downloading stanza-1.3.0-py3-none-any.whl (432 kB)
|████████████████████████████████| 432 kB 292 kB/s
Requirement already satisfied: requests in /opt/conda/lib/python3.7/site-packages (from stanza) (2.26.0)
Requirement already satisfied: protobuf in /opt/conda/lib/python3.7/site-packages (from stanza) (3.19.4)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.7/site-packages (from stanza) (4.62.3)
Requirement already satisfied: torch>=1.3.0 in /opt/conda/lib/python3.7/site-packages (from stanza) (1.9.1+cpu)
Requirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from stanza) (1.20.3)
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from stanza) (1.16.0)
Requirement already satisfied: emoji in /opt/conda/lib/python3.7/site-packages (from stanza) (1.7.0)
Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.7/site-packages (from torch>=1.3.0->stanza) (4.1.1)
Requirement already satisfied: charset-normalizer~=2.0.0 in /opt/conda/lib/python3.7/site-packages (from requests->stanza) (2.0.9)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests->stanza) (2021.10.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests->stanza) (1.26.7)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests->stanza) (3.1)
Installing collected packages: stanza
Successfully installed stanza-1.3.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
We can specify where to install Core NLP, but we will use the default, which is either "$CORENLP_HOME" or "$HOME/stanza_corenlp". (Ideally we'd use stanza to get this, but I couldn't easily work out how.)
import stanza
stanza.install_corenlp()
Downloading https://huggingface.co/stanfordnlp/CoreNLP/resolve/main/stanford-corenlp-latest.zip: 0%| …
We'll need to invoke the Stanford Core NLP JAR that we just installed, so let's find it.
import os
import re
from pathlib import Path
# Reimplement the logic to find the path where stanza_corenlp is installed.
core_nlp_path = os.getenv('CORENLP_HOME', str(Path.home() / 'stanza_corenlp'))
# A heuristic to find the right jar file
classpath = [str(p) for p in Path(core_nlp_path).iterdir() if re.match(r"stanford-corenlp-[0-9.]+\.jar", p.name)][0]
classpath
'/root/stanza_corenlp/stanford-corenlp-4.4.0.jar'
Let's test the basic usage.
There are currently models for 8 languages, and for some fairly complex tasks like coreference resolution.
from stanza.server import CoreNLPClient
text = "David Batista wrote a blog post on NER evaluation. " \
"Hiroki Nakayama wrote seqeval to evaluate sequential labelling tasks, such as NER. " \
"We will test his library against Stanford Core NLP. "
with CoreNLPClient(
annotators=['tokenize','ssplit','pos','lemma','ner', 'parse', 'depparse','coref'],
timeout=30000,
memory='6G') as client:
ann = client.annotate(text)
[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - Server default properties:
            (Note: unspecified annotator properties are English defaults)
            annotators = tokenize,ssplit,pos,lemma,ner,parse,depparse,coref
            inputFormat = text
            outputFormat = serialized
            prettyPrint = false
            threads = 5
[main] INFO CoreNLP - Threads: 5
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words-distsim.tagger ... done [1.1 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [2.0 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.6 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [1.0 sec].
[main] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
[main] INFO edu.stanford.nlp.time.TimeExpressionExtractorImpl - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 580705 unique entries out of 581864 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 4867 unique entries out of 4867 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 585572 unique entries from 2 files
[main] INFO edu.stanford.nlp.pipeline.NERCombinerAnnotator - numeric classifiers: true; SUTime: true [no docDate]; fine grained: true
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.8 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model: edu/stanford/nlp/models/parser/nndep/english_UD.gz ... Time elapsed: 2.1 sec
[main] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 20000 vectors, elapsed Time: 2.204 sec
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [4.3 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator coref
[main] INFO edu.stanford.nlp.coref.statistical.SimpleLinearClassifier - Loading coref model edu/stanford/nlp/models/coref/statistical/ranking_model.ser.gz ... done [0.9 sec].
[main] INFO edu.stanford.nlp.pipeline.CorefMentionAnnotator - Using mention detector type: dependency
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0.0.0.0:9000
[pool-1-thread-3] INFO CoreNLP - [/127.0.0.1:36852] API call w/annotators tokenize,ssplit,pos,lemma,ner,parse,depparse,coref
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator coref
David Batista wrote a blog post on NER evaluation. Hiroki Nakayama wrote seqeval to evaluate sequential labelling tasks, such as NER. We will test his library against Stanford Core NLP.
[Thread-0] INFO CoreNLP - CoreNLP Server is shutting down.
We get 3 sentences out.
for sentence in ann.sentence:
    print(" ".join([token.word for token in sentence.token]))
David Batista wrote a blog post on NER evaluation .
Hiroki Nakayama wrote seqeval to evaluate sequential labelling tasks , such as NER .
We will test his library against Stanford Core NLP .
It can even do clever things like coreference resolution, resolving that "his library" refers to "Hiroki Nakayama's library".
for chain in ann.corefChain:
    print([ann.mentionsForCoref[mention.mentionID].headString for mention in chain.mention])
['nakayama', 'his']
We can extract things such as lemmas, parts of speech and standard NER tags.
But we want to train our own NER model to detect ingredients. First we will need to collect the data.
import pandas as pd

tokens = ann.sentence[1].token
pd.DataFrame({'word': [s.word for s in tokens],
              'lemma': [s.lemma for s in tokens],
              'pos': [s.pos for s in tokens],
              'ner': [s.ner for s in tokens]}).T
|       | 0      | 1        | 2     | 3       | 4  | 5        | 6          | 7         | 8     | 9 | 10   | 11 | 12  | 13 |
|-------|--------|----------|-------|---------|----|----------|------------|-----------|-------|---|------|----|-----|----|
| word  | Hiroki | Nakayama | wrote | seqeval | to | evaluate | sequential | labelling | tasks | , | such | as | NER | .  |
| lemma | Hiroki | Nakayama | write | seqeval | to | evaluate | sequential | labelling | task  | , | such | as | ner | .  |
| pos   | NNP    | NNP      | VBD   | NN      | TO | VB       | JJ         | NN        | NNS   | , | JJ   | IN | NN  | .  |
| ner   | PERSON | PERSON   | O     | O       | O  | O        | O          | O         | O     | O | O    | O  | O   | O  |
Helpfully the authors provide the annotated ingredients data in the format for Stanford NER, which we can download from GitHub.
There are two sources of ingredients: ar is AllRecipes and gk is FOOD.com (formerly GeniusKitchen.com).
from urllib.request import urlretrieve

data_sources = ['ar', 'gk']
data_splits = ['train', 'test']
base_url = 'https://raw.githubusercontent.com/cosylabiiit/recipe-knowledge-mining/master/'

def data_filename(source, split):
    return f'{source}_{split}.tsv'

for source in data_sources:
    for split in data_splits:
        name = data_filename(source, split)
        urlretrieve(base_url + name, name)
Each line of the file is either a single tab (separating different texts), or a token followed by a tab and then the entity type.
So for example the first ingredient is "4 cloves garlic", which is a quantity (4) followed by a unit (cloves) and a name (garlic).
!head {data_filename('ar', 'train')} | cat -t
^I
4^IQUANTITY
cloves^IUNIT
garlic^INAME
^I
2^IQUANTITY
tablespoons^IUNIT
vegetable^INAME
oil^INAME
,^IO
We can read this in to Python, converting it to a list of annotated sentences, which is just a sequence of token, label pairs.
from typing import List, Tuple, Generator

Annotation = Tuple[str, str]
AnnotatedSentence = List[Annotation]

def segment_texts(data: str) -> Generator[AnnotatedSentence, None, None]:
    """Split tab-separated token/tag lines into annotated sentences."""
    output = []
    for line in data.split('\n'):
        if line.strip():
            token, label = line.split('\t')
            output.append((token.strip(), label.strip()))
        elif output:
            yield output
            output = []

def segment_file(filename: str) -> List[AnnotatedSentence]:
    with open(filename, 'rt') as f:
        return list(segment_texts(f.read()))
ar_train = segment_file(data_filename('ar', 'train'))
ar_train[:2]
[[('4', 'QUANTITY'), ('cloves', 'UNIT'), ('garlic', 'NAME')], [('2', 'QUANTITY'), ('tablespoons', 'UNIT'), ('vegetable', 'NAME'), ('oil', 'NAME'), (',', 'O'), ('divided', 'STATE')]]
We can then calculate the number of sentences in the training set for a source.
len(ar_train)
1470
We can use this to check the types of entities annotated, as in the paper (DF is Dried/Fresh).
from collections import Counter
tag_counts = Counter([annotation[1] for sentence in ar_train for annotation in sentence])
tag_counts
Counter({'QUANTITY': 1583, 'UNIT': 1338, 'NAME': 2501, 'O': 1662, 'STATE': 879, 'DF': 154, 'SIZE': 64, 'TEMP': 31})
Now we want to train a Stanford NER model on the new annotations.
First we have to configure it, but there's no information in the paper on how it was configured. I've copied this template configuration out of the FAQ. For more information on the parameters you can check the NERFeatureFactory documentation or the source.
def ner_prop_str(train_files: List[str], test_files: List[str], output: str) -> str:
    """Returns configuration string to train NER model"""
    train_file_str = ','.join(train_files)
    test_file_str = ','.join(test_files)
    return f"""
trainFileList = {train_file_str}
testFiles = {test_file_str}
serializeTo = {output}
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
"""
This is expected to be a file, so let's write a helper that writes it to a file. (An alternative would be to pass these as arguments to the trainer).
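As a rough sketch of that alternative (not run here): I believe CRFClassifier will also read individual properties from command-line flags such as -trainFile, -serializeTo and -map, although the feature settings from the template above would still have to come from somewhere, so this isn't equivalent to using the props file. The output name ar_cli.model.ser.gz is just a placeholder.

# Sketch only: pass the data/output properties as CLI flags instead of a props file.
# ar_cli.model.ser.gz is a hypothetical output filename.
cli_args = ['java', '-Xmx2g', '-cp', classpath,
            'edu.stanford.nlp.ie.crf.CRFClassifier',
            '-trainFile', data_filename('ar', 'train'),
            '-serializeTo', 'ar_cli.model.ser.gz',
            '-map', 'word=0,answer=1']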
def write_ner_prop_file(ner_prop_file: str, train_files: List[str], test_files: List[str], output_file: str) -> None:
    with open(ner_prop_file, 'wt') as f:
        props = ner_prop_str(train_files, test_files, output_file)
        f.write(props)
Stanza doesn't give an interface to train a CRF NER model using Stanford NLP, but we can invoke edu.stanford.nlp.ie.crf.CRFClassifier directly.
Let's write a properties file and invoke Java to run the classifier. It prints a lot of training information, and importantly a summary report at the end which we want to see.
import subprocess
from typing import List

def train_model(model_name, train_files: List[str], test_files: List[str], print_report=True, classpath=classpath) -> str:
    """Trains CRF NER Model using StanfordNLP"""
    model_file = f'{model_name}.model.ser.gz'
    ner_prop_filename = f'{model_name}.model.props'
    write_ner_prop_file(ner_prop_filename, train_files, test_files, model_file)
    result = subprocess.run(
        ['java',
         '-Xmx2g',
         '-cp', classpath,
         'edu.stanford.nlp.ie.crf.CRFClassifier',
         '-prop', ner_prop_filename],
        capture_output=True)
    # If there's an error with invocation better log the stacktrace
    if result.returncode != 0:
        print(result.stderr.decode('utf-8'))
        result.check_returncode()
    if print_report:
        print(*result.stderr.decode('utf-8').split('\n')[-11:], sep='\n')
    return model_file
We can train models on each dataset separately, and all together. For evaluation we'll use the corresponding test set.
This only takes a few minutes.
%%time
models = {}
for source in ['ar', 'gk', 'ar_gk']:
    print(source)
    train_files = [data_filename(s, 'train') for s in source.split('_')]
    test_files = [data_filename(s, 'test') for s in source.split('_')]
    models[source] = train_model(source, train_files, test_files)
    print()
ar
CRFClassifier tagged 2788 words in 483 documents at 7185.57 words per second.
         Entity    P         R         F1        TP    FP    FN
             DF    1.0000    0.9608    0.9800    49    0     2
           NAME    0.9297    0.9279    0.9288    463   35    36
       QUANTITY    1.0000    0.9962    0.9981    522   0     2
           SIZE    1.0000    1.0000    1.0000    20    0     0
          STATE    0.9601    0.9633    0.9617    289   12    11
           TEMP    0.8750    0.7000    0.7778    7     1     3
           UNIT    0.9819    0.9841    0.9830    434   8     7
         Totals    0.9696    0.9669    0.9682    1784  56    61

gk
CRFClassifier tagged 9886 words in 1705 documents at 11727.16 words per second.
         Entity    P         R         F1        TP    FP    FN
             DF    0.9718    0.9517    0.9617    138   4     7
           NAME    0.9132    0.9021    0.9076    1621  154   176
       QUANTITY    0.9882    0.9870    0.9876    1598  19    21
           SIZE    0.9750    0.9398    0.9571    78    2     5
          STATE    0.9255    0.9503    0.9377    708   57    37
           TEMP    0.8125    0.8125    0.8125    26    6     6
           UNIT    0.9810    0.9721    0.9766    1291  25    37
         Totals    0.9534    0.9497    0.9516    5460  267   289

ar_gk
CRFClassifier tagged 12674 words in 2188 documents at 11648.90 words per second.
         Entity    P         R         F1        TP    FP    FN
             DF    0.9738    0.9490    0.9612    186   5     10
           NAME    0.9136    0.9077    0.9106    2084  197   212
       QUANTITY    0.9911    0.9897    0.9904    2121  19    22
           SIZE    0.9798    0.9417    0.9604    97    2     6
          STATE    0.9386    0.9512    0.9449    994   65    51
           TEMP    0.8140    0.8333    0.8235    35    8     7
           UNIT    0.9801    0.9763    0.9782    1727  35    42
         Totals    0.9563    0.9539    0.9551    7244  331   350

CPU times: user 276 ms, sys: 134 ms, total: 410 ms
Wall time: 2min 17s
The summary report shows, for each model and entity type, the precision (P), recall (R) and F1, along with the counts of true positives (TP), false positives (FP) and false negatives (FN).
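As a quick sanity check on how those columns relate, we can recompute the ar Totals row from its counts:

# Recompute precision, recall and F1 from the TP/FP/FN counts of the ar Totals row.
tp, fp, fn = 1784, 56, 61
precision = tp / (tp + fp)                           # 0.9696
recall = tp / (tp + fn)                              # 0.9669
f1 = 2 * precision * recall / (precision + recall)   # 0.9682
print(f'{precision:.4f} {recall:.4f} {f1:.4f}')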
We can compare the F1 Totals (0.9682 for ar, 0.9516 for gk and 0.9551 for ar_gk) to the diagonal of Table IV in the paper (0.9682, 0.9519 and 0.9611).
These are super close. The furthest off is ar_gk; the repository also has a separate ar_gk_train.tsv, and it would be interesting to check whether using it directly gives a closer result and why there is a difference.
We can now use these trained models in Python by invoking Stanford NLP with Stanza.
First we'll load in the test data.
test_data = {}
for source in data_sources:
    test_data[source] = segment_file(data_filename(source, 'test'))
    print(source, len(test_data[source]))
ar 483
gk 1705
We can call StanfordNLP with our custom model by passing the property ner.model.
Our test data is already tokenized, in a different way to StanfordNLP's default, so we'll set the tokenizer to use whitespace tokenization, which is easy to invert.
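(A trivial illustration of that invertibility, assuming no token itself contains a space:)

# Round-trip: join the tokens with spaces, whitespace-tokenize, and get them back.
tokens = ['4', 'cloves', 'garlic']
assert ' '.join(tokens).split(' ') == tokens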
It takes a while to start up the server so we want to annotate a large number of texts at once.
from tqdm.notebook import tqdm
from stanza.server import CoreNLPClient

def annotate_ner(ner_model_file: str, texts: List[str], tokenize_whitespace: bool = True):
    properties = {"ner.model": ner_model_file,
                  "tokenize.whitespace": tokenize_whitespace,
                  "ner.applyNumericClassifiers": False}
    annotated = []
    with CoreNLPClient(
            annotators=['tokenize', 'ssplit', 'ner'],
            properties=properties,
            timeout=30000,
            be_quiet=True,
            memory='6G') as client:
        for text in tqdm(texts):
            annotated.append(client.annotate(text))
    return annotated
We can then get the annotations.
annotations = annotate_ner(models['ar'],
                           ['1 cup of frozen peas',
                            'A dash of salt . Or to taste',
                            '12 slices pancetta -LRB- Italian unsmoked cured bacon -RRB-',
                            'pumpkin sliced into 3 cm moons'])
0%| | 0/4 [00:00<?, ?it/s]
Note here that the word "Italian" has ner "NATIONALITY", which comes from another model (it wasn't in the training set!). We want to use the coarseNER field instead.
annotations[2].sentence[0].token[4]
word: "Italian" pos: "JJ" value: "Italian" originalText: "Italian" ner: "NATIONALITY" lemma: "italian" beginChar: 25 endChar: 32 tokenBeginIndex: 4 tokenEndIndex: 5 hasXmlContext: false isNewline: false coarseNER: "O" fineGrainedNER: "NATIONALITY" entityMentionIndex: 3 nerLabelProbs: "O=0.870902471545891"
When I didn't set "ner.applyNumericClassifiers": False, this would come up as a NUMBER.
annotations[3].sentence[0].token[3]
word: "3" pos: "CD" value: "3" originalText: "3" ner: "O" lemma: "3" beginChar: 20 endChar: 21 tokenBeginIndex: 3 tokenEndIndex: 4 hasXmlContext: false isNewline: false coarseNER: "O" fineGrainedNER: "O" nerLabelProbs: "O=0.8599887537555505"
We can then flatten the sentences and extract the NER tokens.
from dataclasses import dataclass, asdict

@dataclass
class NERData:
    ner: List[str]
    tokens: List[str]

    # Let's use Pandas to make it pretty in a notebook
    def _repr_html_(self):
        return pd.DataFrame(asdict(self)).T._repr_html_()

def extract_ner_data(annotation) -> NERData:
    tokens = [token for sentence in annotation.sentence for token in sentence.token]
    return NERData(tokens=[t.word for t in tokens], ner=[t.coarseNER for t in tokens])
A relatively simple ingredient works well.
extract_ner_data(annotations[0])
|        | 0        | 1    | 2  | 3      | 4    |
|--------|----------|------|----|--------|------|
| ner    | QUANTITY | UNIT | O  | TEMP   | NAME |
| tokens | 1        | cup  | of | frozen | peas |
A more complex sentence does quite badly, perhaps because nothing like it appeared in the training data.
extract_ner_data(annotations[1])
|        | 0        | 1    | 2    | 3    | 4    | 5    | 6  | 7     |
|--------|----------|------|------|------|------|------|----|-------|
| ner    | QUANTITY | UNIT | NAME | NAME | NAME | NAME | O  | O     |
| tokens | A        | dash | of   | salt | .    | Or   | to | taste |
extract_ner_data(annotations[2])
|        | 0        | 1      | 2        | 3     | 4       | 5        | 6     | 7     | 8     |
|--------|----------|--------|----------|-------|---------|----------|-------|-------|-------|
| ner    | QUANTITY | UNIT   | NAME     | O     | O       | O        | O     | O     | O     |
| tokens | 12       | slices | pancetta | -LRB- | Italian | unsmoked | cured | bacon | -RRB- |
We can chain these functions together to get from text to NER.
def ner_extract(ner_model_file: str, texts: List[str], tokenize_whitespace: bool = True) -> List[NERData]:
    annotations = annotate_ner(ner_model_file, texts, tokenize_whitespace)
    return [extract_ner_data(ann) for ann in annotations]
And then for each model, and test data we can calculate the predictions.
preds = {}
for model, modelfile in models.items():
    preds[model] = {}
    for test_source, token_data in test_data.items():
        texts = [' '.join([x[0] for x in text]) for text in token_data]
        preds[model][test_source] = ner_extract(modelfile, texts)
0%| | 0/483 [00:00<?, ?it/s]
0%| | 0/1705 [00:00<?, ?it/s]
0%| | 0/483 [00:00<?, ?it/s]
0%| | 0/1705 [00:00<?, ?it/s]
0%| | 0/483 [00:00<?, ?it/s]
0%| | 0/1705 [00:00<?, ?it/s]
Let's check that the same tokens come out of the model as were put in.
for test_source, token_data in test_data.items():
    tokens = [[x[0] for x in tokens] for tokens in token_data]
    for model in models:
        model_preds = preds[model][test_source]
        model_tokens = [p.tokens for p in model_preds]
        if tokens != model_tokens:
            raise ValueError("Tokenization issue in %s with model %s" % (test_source, model))
!pip install seqeval
Collecting seqeval
Downloading seqeval-1.2.2.tar.gz (43 kB)
|████████████████████████████████| 43 kB 102 kB/s
Preparing metadata (setup.py) ... - done
Requirement already satisfied: numpy>=1.14.0 in /opt/conda/lib/python3.7/site-packages (from seqeval) (1.20.3)
Requirement already satisfied: scikit-learn>=0.21.3 in /opt/conda/lib/python3.7/site-packages (from seqeval) (1.0.1)
Requirement already satisfied: joblib>=0.11 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval) (3.0.0)
Requirement already satisfied: scipy>=1.1.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval) (1.7.3)
Building wheels for collected packages: seqeval
Building wheel for seqeval (setup.py) ... - \ | done
Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16181 sha256=117220ab957b2dfbf6fad8b7cf7fb429b409f1fb1b62fef7ea14d20e38b36203
Stored in directory: /root/.cache/pip/wheels/05/96/ee/7cac4e74f3b19e3158dce26a20a1c86b3533c43ec72a549fd7
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Seqeval expects the data to be in one of the following tagging schemes: IOB1, IOB2, IOE1, IOE2, IOBES or BILOU.
These all become important when trying to distinguish distinct entities that are adjacent, which are quite rare in practice. See Wikipedia for a detailed explanation of IOB (inside-outside-beginning) tagging.
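Here's a small made-up illustration of why the scheme matters for adjacent entities: with B- prefixes two adjacent one-token NAME entities stay separate, while the all-I- encoding we use below merges them into a single entity, so seqeval gives no credit.

from seqeval.metrics import f1_score

two_entities = [['B-NAME', 'B-NAME']]   # e.g. two distinct one-token names next to each other
one_entity = [['I-NAME', 'I-NAME']]     # the same tokens read as a single two-token entity
print(f1_score(two_entities, one_entity))  # 0.0: no entity spans match exactly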
In this case it's assumed there's only one entity of each type (which can be wrong when multiple names are listed in a single ingredient). We can easily convert it to IOB1 using this assumption by prefixing every tag other than 'O' with an 'I-'.
def convert_to_iob1(tokens):
    return ['I-' + label if label != 'O' else 'O' for label in tokens]

assert convert_to_iob1(['QUANTITY', 'SIZE', 'NAME', 'NAME', 'O', 'STATE']) == \
    ['I-QUANTITY', 'I-SIZE', 'I-NAME', 'I-NAME', 'O', 'I-STATE']
Let's check the classification report for a single example and compare it to the report from Stanford NER.
The classification report doesn't show TP, FP and FN, but instead the support - the number of true entities in the data, which is just TP + FN - so it carries equivalent information.
The results are the same.
from seqeval.metrics import classification_report
test_source = 'ar'
model = 'ar'
actual_ner = [convert_to_iob1([x[1] for x in ann]) for ann in test_data[test_source]]
pred_ner = [convert_to_iob1(p.ner) for p in preds[model][test_source]]
print(classification_report(actual_ner, pred_ner, digits=4))
              precision    recall  f1-score   support

          DF     1.0000    0.9608    0.9800        51
        NAME     0.9297    0.9279    0.9288       499
    QUANTITY     1.0000    0.9962    0.9981       524
        SIZE     1.0000    1.0000    1.0000        20
       STATE     0.9601    0.9633    0.9617       300
        TEMP     0.8750    0.7000    0.7778        10
        UNIT     0.9819    0.9841    0.9830       441

   micro avg     0.9696    0.9669    0.9682      1845
   macro avg     0.9638    0.9332    0.9471      1845
weighted avg     0.9695    0.9669    0.9682      1845
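The support column does line up with TP + FN from the Stanford report above, for example:

# support = TP + FN, checked against a few rows of the Stanford ar report.
assert 463 + 36 == 499   # NAME
assert 522 + 2 == 524    # QUANTITY
assert 434 + 7 == 441    # UNIT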
We can get the micro f1-score directly.
from seqeval.metrics import f1_score
'%0.4f' % f1_score(actual_ner, pred_ner)
'0.9682'
We can then try to reproduce Table IV by computing the f1-score for each model and data.
scores = {model: {} for model in models}
for test_source, data in test_data.items():
    actual_ner = [convert_to_iob1([x[1] for x in ann]) for ann in data]
    for model in models:
        pred_ner = [convert_to_iob1(p.ner) for p in preds[model][test_source]]
        scores[model][test_source] = f1_score(actual_ner, pred_ner)
We also need to calculate the scores on the combined test set, by concatenating them.
actual_ner = [convert_to_iob1([x[1] for x in ann]) for data in test_data.values() for ann in data]
for model in models:
    pred_ner = [convert_to_iob1(p.ner) for test_source in test_data for p in preds[model][test_source]]
    scores[model]['combined'] = f1_score(actual_ner, pred_ner)

pd.DataFrame(scores).style.format('{:0.4f}')
|          | ar     | gk     | ar_gk  |
|----------|--------|--------|--------|
| ar       | 0.9682 | 0.9331 | 0.9704 |
| gk       | 0.8666 | 0.9511 | 0.9499 |
| combined | 0.8911 | 0.9469 | 0.9549 |
The results are slightly different to those in the paper, but every cell agrees within 0.01.
So we've successfully reproduced the results in the paper, and shown the evaluation from the Stanford NER toolkit is very close to that of seqeval (if you work around hallucinated entities).
reported_scores = pd.DataFrame([[0.9682, 0.9317, 0.9709],
                                [0.8672, 0.9519, 0.9498],
                                [0.8972, 0.9472, 0.9611]],
                               columns=['AllRecipes', 'FOOD.com', 'BOTH'],
                               index=['AllRecipes', 'FOOD.com', 'BOTH'])
reported_scores
|            | AllRecipes | FOOD.com | BOTH   |
|------------|------------|----------|--------|
| AllRecipes | 0.9682     | 0.9317   | 0.9709 |
| FOOD.com   | 0.8672     | 0.9519   | 0.9498 |
| BOTH       | 0.8972     | 0.9472   | 0.9611 |
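As a final check of the "within 0.01" claim, here's a small sketch that assumes our scores table lines up with reported_scores by position (ar with AllRecipes, gk with FOOD.com, and combined/ar_gk with BOTH):

import numpy as np

# Largest absolute gap between our f1-scores and the paper's Table IV,
# comparing the two tables cell by cell (alignment assumption noted above).
print(np.abs(pd.DataFrame(scores).values - reported_scores.values).max())  # about 0.006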