This notebook contains the fourth part of the model training and analysis code from our CoNLL-2020 paper, "Identifying Incorrect Labels in the CoNLL-2003 Corpus".
If you're new to the Text Extensions for Pandas library, we recommend that you start
by reading through the notebook Analyze_Model_Outputs.ipynb, which explains the
portions of the library that we use in the notebooks in this directory.
This notebook repeats the model training process from CoNLL_3.ipynb, but performs a 10-fold cross-validation. This process involves training a total of 170 models: 10 groups of 17. Next, this notebook evaluates each group of models over the holdout set from the associated fold of the cross-validation. Then it aggregates these outputs and uses the same techniques as CoNLL_2.ipynb to flag potentially incorrect labels. Finally, the notebook writes out CSV files containing ranked lists of potentially incorrect labels.
# Libraries
import numpy as np
import pandas as pd
import os
import sys
import time
import torch
import transformers
from typing import *
import sklearn.model_selection
import sklearn.pipeline
import matplotlib.pyplot as plt
import multiprocessing
import gc
# And of course we need the text_extensions_for_pandas library itself.
try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    raise Exception("text_extensions_for_pandas package not found on the Jupyter "
                    "kernel's path. Please either run:\n"
                    "   ln -s ../../text_extensions_for_pandas .\n"
                    "from the directory containing this notebook, or use a Python "
                    "environment on which you have used `pip` to install the package.")
from text_extensions_for_pandas import cleaning
# BERT Configuration
# Keep this in sync with `CoNLL_3.ipynb`.
#bert_model_name = "bert-base-uncased"
#bert_model_name = "bert-large-uncased"
bert_model_name = "dslim/bert-base-NER"
tokenizer = transformers.BertTokenizerFast.from_pretrained(bert_model_name,
add_special_tokens=True)
bert = transformers.BertModel.from_pretrained(bert_model_name)
# If False, use cached values, provided those values are present on disk
_REGENERATE_EMBEDDINGS = True
_REGENERATE_MODELS = True
# Number of dimensions that we reduce the BERT embeddings down to when
# training reduced-quality models.
#_REDUCED_DIMS = [8, 16, 32, 64, 128, 256]
_REDUCED_DIMS = [32, 64, 128, 256]
# How many models we train at each level of dimensionality reduction
_MODELS_AT_DIM = [4] * len(_REDUCED_DIMS)
# Consistent set of random seeds to use when generating dimension-reduced
# models. Index is [index into _REDUCED_DIMS, model number], and there are
# lots of extra entries so we don't need to resize this matrix.
from numpy.random import default_rng
_MASTER_SEED = 42
rng = default_rng(_MASTER_SEED)
_MODEL_RANDOM_SEEDS = rng.integers(0, 1e6, size=(8, 8))
# Create a Pandas categorical type for consistent encoding of categories
# across all documents.
_ENTITY_TYPES = ["LOC", "MISC", "ORG", "PER"]
token_class_dtype, int_to_label, label_to_int = tp.io.conll.make_iob_tag_categories(_ENTITY_TYPES)
# Parameters for splitting the corpus into folds
_KFOLD_RANDOM_SEED = _MASTER_SEED
_KFOLD_NUM_FOLDS = 10
Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertModel: ['classifier.weight', 'classifier.bias'] - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
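As a quick sanity check on the configuration above, the following sketch enumerates the model names that one fold produces and confirms the totals quoted in the introduction. The constants are copied from the configuration cell. `FULL_DIMS = 768` is an assumption (the embedding width of BERT-base models, not read from the model here), and the helper name `model_names_for_fold` is hypothetical.

```python
from typing import List

# Copied from the configuration cell above.
REDUCED_DIMS = [32, 64, 128, 256]
MODELS_AT_DIM = [4] * len(REDUCED_DIMS)
NUM_FOLDS = 10
# Assumption: width of the BERT-base embeddings (not read from the model here).
FULL_DIMS = 768


def model_names_for_fold() -> List[str]:
    """Hypothetical helper: list the '<dims>_<run>' names trained for one fold."""
    names = [f"{FULL_DIMS}_1"]  # one model trained on the full embeddings
    for num_dims, num_models in zip(REDUCED_DIMS, MODELS_AT_DIM):
        names.extend(f"{num_dims}_{j + 1}" for j in range(num_models))
    return names


names = model_names_for_fold()
models_per_fold = len(names)
total_models = models_per_fold * NUM_FOLDS
print(models_per_fold, total_models)
```

One full-width model plus four models at each of four reduced widths gives 17 models per fold, and ten folds give 170 models overall.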
Read in the corpus, retokenize it with the BERT tokenizer, add BERT embeddings, and convert to a single dataframe.
# Download and cache the data set.
# NOTE: This data set is licensed for research use only. Be sure to adhere
# to the terms of the license when using this data set!
data_set_info = tp.io.conll.maybe_download_conll_data("outputs")
data_set_info
{'train': 'outputs/eng.train', 'dev': 'outputs/eng.testa', 'test': 'outputs/eng.testb'}
# The raw dataset in its original tokenization
corpus_raw = {}
for fold_name, file_name in data_set_info.items():
    df_list = tp.io.conll.conll_2003_to_dataframes(file_name,
                                                   ["pos", "phrase", "ent"],
                                                   [False, True, True])
    corpus_raw[fold_name] = [
        df.drop(columns=["pos", "phrase_iob", "phrase_type"])
        for df in df_list
    ]
# Retokenize with the BERT tokenizer and regenerate embeddings.
corpus_df,token_class_dtype, int_to_label, label_to_int = cleaning.preprocess.preprocess_documents(corpus_raw,'ent_type',True,carry_cols=['line_num'],iob_col='ent_iob')
preprocessing fold train
Token indices sequence length is longer than the specified maximum sequence length for this model (559 > 512). Running this sequence through the model will result in indexing errors
preprocessing fold dev
preprocessing fold test
We divide the documents of the corpus into 10 random samples.
# IDs for each of the keys
doc_keys = corpus_df[["fold", "doc_num"]].drop_duplicates().reset_index(drop=True)
doc_keys
| | fold | doc_num |
---|---|---|
0 | train | 0 |
1 | train | 1 |
2 | train | 2 |
3 | train | 3 |
4 | train | 4 |
... | ... | ... |
1388 | test | 226 |
1389 | test | 227 |
1390 | test | 228 |
1391 | test | 229 |
1392 | test | 230 |
1393 rows × 2 columns
# We want to split the documents randomly into _KFOLD_NUM_FOLDS sets, then
# for each stage of cross-validation train a model on the union of
# (_KFOLD_NUM_FOLDS - 1) of them while testing on the remaining fold.
# sklearn.model_selection doesn't implement this approach directly,
# but we can piece it together with some help from Numpy.
#from numpy.random import default_rng
rng = np.random.default_rng(seed=_KFOLD_RANDOM_SEED)
iloc_order = rng.permutation(len(doc_keys.index))
kf = sklearn.model_selection.KFold(n_splits=_KFOLD_NUM_FOLDS)
train_keys = []
test_keys = []
for train_ix, test_ix in kf.split(iloc_order):
    # sklearn.model_selection.KFold gives us a partitioning of the
    # numbers from 0 to len(iloc_order). Use that partitioning to
    # choose elements from iloc_order, then use those elements to
    # index into doc_keys.
    train_iloc = iloc_order[train_ix]
    test_iloc = iloc_order[test_ix]
    train_keys.append(doc_keys.iloc[train_iloc])
    test_keys.append(doc_keys.iloc[test_iloc])
train_keys[1].head(10)
| | fold | doc_num |
---|---|---|
146 | train | 146 |
1164 | test | 2 |
483 | train | 483 |
1190 | test | 28 |
20 | train | 20 |
237 | train | 237 |
86 | train | 86 |
408 | train | 408 |
1252 | test | 90 |
1213 | test | 51 |
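The split above can be reproduced in miniature with the standard library alone. This is a sketch of the same idea (shuffle the document indices once, then cut the shuffled order into contiguous folds), not the notebook's code: it uses `random.Random` instead of NumPy's generator, so the resulting permutation differs, and `shuffled_kfold` is a hypothetical name.

```python
import random
from typing import List, Tuple


def shuffled_kfold(num_docs: int, num_folds: int, seed: int
                   ) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs, one per fold."""
    order = list(range(num_docs))
    random.Random(seed).shuffle(order)  # one fixed permutation of documents

    # Same sizing rule as sklearn's KFold: the first (num_docs % num_folds)
    # folds receive one extra element.
    base, extra = divmod(num_docs, num_folds)
    folds = []
    start = 0
    for fold_ix in range(num_folds):
        size = base + (1 if fold_ix < extra else 0)
        test = order[start:start + size]
        train = order[:start] + order[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = shuffled_kfold(num_docs=1393, num_folds=10, seed=42)
print([len(test) for _, test in folds])
```

Every document lands in exactly one test fold, and the training side of each fold is the union of the other nine. Because the shuffle ignores the original `train`/`dev`/`test` labels, each fold mixes documents from all three of the original corpus splits, as the `train_keys[1]` output shows.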
Train models on the first of our 10 folds and manually examine some of the model outputs.
# Gather the training set together by joining our list of documents
# with the entire corpus on the composite key <fold, doc_num>
train_inputs_df = corpus_df.merge(train_keys[0])
train_inputs_df
| | fold | doc_num | token_id | span | input_id | token_type_id | attention_mask | special_tokens_mask | raw_span | line_num | raw_span_id | ent_iob | ent_type | embedding | token_class | token_class_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | train | 0 | 0 | [0, 0): '' | 101 | 0 | 1 | True | NaN | NaN | NaN | O | <NA> | [ -0.098505184, -0.4050192, 0.7428884... | O | 0 |
1 | train | 0 | 1 | [0, 1): '-' | 118 | 0 | 1 | False | [0, 10): '-DOCSTART-' | 0.0 | 0.0 | O | <NA> | [ -0.057021223, -0.48112097, 0.989868... | O | 0 |
2 | train | 0 | 2 | [1, 2): 'D' | 141 | 0 | 1 | False | [0, 10): '-DOCSTART-' | 0.0 | 0.0 | O | <NA> | [ -0.04824195, -0.25330004, 1.167191... | O | 0 |
3 | train | 0 | 3 | [2, 4): 'OC' | 9244 | 0 | 1 | False | [0, 10): '-DOCSTART-' | 0.0 | 0.0 | O | <NA> | [ -0.26682988, -0.31008753, 1.007472... | O | 0 |
4 | train | 0 | 4 | [4, 6): 'ST' | 9272 | 0 | 1 | False | [0, 10): '-DOCSTART-' | 0.0 | 0.0 | O | <NA> | [ -0.22296889, -0.21308492, 0.9331016... | O | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
371472 | test | 230 | 314 | [1386, 1393): 'brother' | 1711 | 0 | 1 | False | [1386, 1393): 'brother' | 50345.0 | 267.0 | O | <NA> | [ -0.028172785, -0.08062388, 0.9804888... | O | 0 |
371473 | test | 230 | 315 | [1393, 1394): ',' | 117 | 0 | 1 | False | [1393, 1394): ',' | 50346.0 | 268.0 | O | <NA> | [ 0.11817408, -0.07008513, 0.865484... | O | 0 |
371474 | test | 230 | 316 | [1395, 1400): 'Bobby' | 5545 | 0 | 1 | False | [1395, 1400): 'Bobby' | 50347.0 | 269.0 | B | PER | [ -0.35689482, 0.31400457, 1.573853... | B-PER | 3 |
371475 | test | 230 | 317 | [1400, 1401): '.' | 119 | 0 | 1 | False | [1400, 1401): '.' | 50348.0 | 270.0 | O | <NA> | [ -0.18957126, -0.24581163, 0.66257... | O | 0 |
371476 | test | 230 | 318 | [0, 0): '' | 102 | 0 | 1 | True | NaN | NaN | NaN | O | <NA> | [ -0.44689128, -0.31665266, 0.779688... | O | 0 |
371477 rows × 16 columns
# Repeat the same process for the test set
test_inputs_df = corpus_df.merge(test_keys[0])
test_inputs_df
| | fold | doc_num | token_id | span | input_id | token_type_id | attention_mask | special_tokens_mask | raw_span | line_num | raw_span_id | ent_iob | ent_type | embedding | token_class | token_class_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | train | 12 | 0 | [0, 0): '' | 101 | 0 | 1 | True | NaN | NaN | NaN | O | <NA> | [ -0.101977676, -0.42442498, 0.8440171... | O | 0 |
1 | train | 12 | 1 | [0, 1): '-' | 118 | 0 | 1 | False | [0, 10): '-DOCSTART-' | 2664.0 | 0.0 | O | <NA> | [ -0.09124618, -0.47710702, 1.120292... | O | 0 |
2 | train | 12 | 2 | [1, 2): 'D' | 141 | 0 | 1 | False | [0, 10): '-DOCSTART-' | 2664.0 | 0.0 | O | <NA> | [ -0.1695277, -0.27063507, 1.209566... | O | 0 |
3 | train | 12 | 3 | [2, 4): 'OC' | 9244 | 0 | 1 | False | [0, 10): '-DOCSTART-' | 2664.0 | 0.0 | O | <NA> | [ -0.27648172, -0.3675844, 1.092024... | O | 0 |
4 | train | 12 | 4 | [4, 6): 'ST' | 9272 | 0 | 1 | False | [0, 10): '-DOCSTART-' | 2664.0 | 0.0 | O | <NA> | [ -0.24050614, -0.24247544, 1.07511... | O | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
45059 | test | 225 | 75 | [208, 213): 'fight' | 2147 | 0 | 1 | False | [208, 213): 'fight' | 49418.0 | 29.0 | O | <NA> | [ -0.09621397, -0.48016888, 0.510937... | O | 0 |
45060 | test | 225 | 76 | [214, 216): 'on' | 1113 | 0 | 1 | False | [214, 216): 'on' | 49419.0 | 30.0 | O | <NA> | [ -0.0858628, -0.2341724, 0.832928... | O | 0 |
45061 | test | 225 | 77 | [217, 225): 'Saturday' | 4306 | 0 | 1 | False | [217, 225): 'Saturday' | 49420.0 | 31.0 | O | <NA> | [ -0.012238501, -0.4282664, 0.619483... | O | 0 |
45062 | test | 225 | 78 | [225, 226): '.' | 119 | 0 | 1 | False | [225, 226): '.' | 49421.0 | 32.0 | O | <NA> | [ -0.042955935, -0.36315423, 0.660203... | O | 0 |
45063 | test | 225 | 79 | [0, 0): '' | 102 | 0 | 1 | True | NaN | NaN | NaN | O | <NA> | [ -0.9504192, 0.012983555, 0.7374987... | O | 0 |
45064 rows × 16 columns
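The two merges above act as filters. With no `on` argument, `DataFrame.merge` performs an inner join on the columns the two frames share, here `fold` and `doc_num`, so only the token rows whose document appears in the key list survive. A plain-Python sketch of that filter, using made-up token rows:

```python
from typing import Dict, List, Set, Tuple

# Toy token rows; the real corpus_df has one row per BERT token.
tokens: List[Dict[str, object]] = [
    {"fold": "train", "doc_num": 0, "token_id": 0},
    {"fold": "train", "doc_num": 0, "token_id": 1},
    {"fold": "train", "doc_num": 12, "token_id": 0},
    {"fold": "test", "doc_num": 0, "token_id": 0},
]

# Documents selected for one side of a cross-validation fold.
selected_docs: Set[Tuple[str, int]] = {("train", 12), ("test", 0)}

# Keep a token row only if its <fold, doc_num> pair is selected.
subset = [t for t in tokens if (t["fold"], t["doc_num"]) in selected_docs]
print(subset)
```

The toy data also shows why the key has to be composite: `doc_num` 0 exists in both the `train` and `test` files, so `doc_num` alone would not identify a document.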
import importlib
import sklearn.linear_model
import ray
ray.init()
# Wrap train_reduced_model in a Ray task
@ray.remote
def train_reduced_model_task(
        x_values: np.ndarray, y_values: np.ndarray, n_components: int,
        seed: int, max_iter: int = 10000) -> sklearn.base.BaseEstimator:
    return cleaning.ensemble.train_reduced_model(x_values, y_values, n_components, seed, max_iter)

# Ray task that trains a model using the entire embedding
@ray.remote
def train_full_model_task(x_values: np.ndarray, y_values: np.ndarray,
                          max_iter: int = 10000) -> sklearn.base.BaseEstimator:
    return (
        sklearn.linear_model.LogisticRegression(
            multi_class="multinomial", max_iter=max_iter
        )
        .fit(x_values, y_values)
    )
def train_models(train_df: pd.DataFrame) \
        -> Dict[str, sklearn.base.BaseEstimator]:
    """
    Train an ensemble of models with different levels of noise.

    :param train_df: DataFrame of labeled training documents, with one
     row per token. Must contain the columns "embedding" (precomputed
     BERT embeddings) and "token_class_id" (integer ID of token type)

    :returns: A mapping from mnemonic model name to trained model
    """
    X = train_df["embedding"].values
    Y = train_df["token_class_id"]

    # Push the X and Y values to Plasma so that our tasks can share them.
    X_id = ray.put(X.to_numpy().copy())
    Y_id = ray.put(Y.to_numpy().copy())

    names_list = []
    futures_list = []

    print(f"Training model using all of "
          f"{X._tensor.shape[1]}-dimension embeddings.")
    names_list.append(f"{X._tensor.shape[1]}_1")
    futures_list.append(train_full_model_task.remote(X_id, Y_id))

    for i in range(len(_REDUCED_DIMS)):
        num_dims = _REDUCED_DIMS[i]
        num_models = _MODELS_AT_DIM[i]
        for j in range(num_models):
            model_name = f"{num_dims}_{j + 1}"
            seed = _MODEL_RANDOM_SEEDS[i, j]
            print(f"Training model '{model_name}' (#{j + 1} "
                  f"at {num_dims} dimensions) with seed {seed}")
            names_list.append(model_name)
            futures_list.append(train_reduced_model_task.remote(X_id, Y_id,
                                                                num_dims, seed))

    # Block until all training tasks have completed and fetch the resulting models.
    models_list = ray.get(futures_list)
    models = {
        n: m for n, m in zip(names_list, models_list)
    }
    return models
def maybe_train_models(train_df: pd.DataFrame, fold_num: int):
    import pickle
    _CACHED_MODELS_FILE = f"outputs/fold_{fold_num}_models.pickle"
    if _REGENERATE_MODELS or not os.path.exists(_CACHED_MODELS_FILE):
        m = train_models(train_df)
        print(f"Trained {len(m)} models.")
        with open(_CACHED_MODELS_FILE, "wb") as f:
            pickle.dump(m, f)
    else:
        # Use a cached model when using cached embeddings
        with open(_CACHED_MODELS_FILE, "rb") as f:
            m = pickle.load(f)
            print(f"Loaded {len(m)} models from {_CACHED_MODELS_FILE}.")
    return m
models = maybe_train_models(train_inputs_df, 0)
print(f"Model names after loading or training: {', '.join(models.keys())}")
2021-07-12 18:16:33,117 INFO services.py:1267 -- View the Ray dashboard at http://127.0.0.1:8265
Training model using all of 768-dimension embeddings. Training model '32_1' (#1 at 32 dimensions) with seed 89250 Training model '32_2' (#2 at 32 dimensions) with seed 773956 Training model '32_3' (#3 at 32 dimensions) with seed 654571 Training model '32_4' (#4 at 32 dimensions) with seed 438878 Training model '64_1' (#1 at 64 dimensions) with seed 201469 Training model '64_2' (#2 at 64 dimensions) with seed 94177 Training model '64_3' (#3 at 64 dimensions) with seed 526478 Training model '64_4' (#4 at 64 dimensions) with seed 975622 Training model '128_1' (#1 at 128 dimensions) with seed 513226 Training model '128_2' (#2 at 128 dimensions) with seed 128113 Training model '128_3' (#3 at 128 dimensions) with seed 839748 Training model '128_4' (#4 at 128 dimensions) with seed 450385 Training model '256_1' (#1 at 256 dimensions) with seed 781567 Training model '256_2' (#2 at 256 dimensions) with seed 643865 Training model '256_3' (#3 at 256 dimensions) with seed 402414 Training model '256_4' (#4 at 256 dimensions) with seed 822761 (pid=72363) Training model with n_components=256 and seed=781567. (pid=72362) Training model with n_components=256 and seed=402414. (pid=72366) Training model with n_components=128 and seed=839748. (pid=72365) Training model with n_components=256 and seed=643865. (pid=72372) Training model with n_components=32 and seed=438878. (pid=72368) Training model with n_components=32 and seed=89250. (pid=72375) Training model with n_components=32 and seed=654571. (pid=72371) Training model with n_components=32 and seed=773956. (pid=72364) Training model with n_components=128 and seed=128113. (pid=72370) Training model with n_components=64 and seed=526478. (pid=72374) Training model with n_components=64 and seed=94177. (pid=72376) Training model with n_components=64 and seed=975622. (pid=72373) Training model with n_components=64 and seed=201469. (pid=72369) Training model with n_components=128 and seed=513226. 
(pid=72367) Training model with n_components=128 and seed=450385. (pid=72368) Training model with n_components=256 and seed=822761. Trained 17 models. Model names after loading or training: 768_1, 32_1, 32_2, 32_3, 32_4, 64_1, 64_2, 64_3, 64_4, 128_1, 128_2, 128_3, 128_4, 256_1, 256_2, 256_3, 256_4
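`maybe_train_models` follows a common compute-or-load pattern. The sketch below isolates that pattern with the standard library. `cached_compute` and its arguments are hypothetical names, and a cheap stand-in function takes the place of model training.

```python
import os
import pickle
import tempfile
from typing import Any, Callable, List


def cached_compute(path: str, compute: Callable[[], Any],
                   regenerate: bool = False) -> Any:
    """Return compute()'s result, reusing a pickle at `path` when allowed."""
    if regenerate or not os.path.exists(path):
        result = compute()
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result
    with open(path, "rb") as f:
        return pickle.load(f)


calls: List[int] = []


def expensive() -> dict:
    calls.append(1)  # record that the expensive step actually ran
    return {"768_1": "stand-in for a trained model"}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "fold_0_models.pickle")
    first = cached_compute(cache_file, expensive)    # computes and writes
    second = cached_compute(cache_file, expensive)   # loads from disk
    third = cached_compute(cache_file, expensive, regenerate=True)  # recomputes

print(len(calls))
```

Because `_REGENERATE_MODELS` is set to `True` in this notebook, the cached branch is only taken if that flag is changed. As with any pickle file, cached models should only be loaded from a location you trust.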
# Uncomment this code if you need to have the cells that follow ignore
# some of the models saved to disk.
# _MODEL_SIZES_TO_KEEP = [32, 64, 128, 256]
# _RUNS_TO_KEEP = [4] * len(_MODEL_SIZES_TO_KEEP)
# _OTHER_MODELS_TO_KEEP = ["768_1"]
# to_keep = _OTHER_MODELS_TO_KEEP.copy()
# for size, num_runs in zip(_MODEL_SIZES_TO_KEEP, _RUNS_TO_KEEP):
#     for i in range(num_runs):
#         to_keep.append(f"{size}_{i+1}")
# models = {k: v for k, v in models.items() if k in to_keep}
# print(f"Model names after filtering: {', '.join(models.keys())}")
def eval_models(models: Dict[str, sklearn.base.BaseEstimator],
                test_df: pd.DataFrame):
    """
    Bulk-evaluate an ensemble of models generated by :func:`train_models`.

    :param models: Output of :func:`train_models`
    :param test_df: DataFrame of labeled test documents, with one
     row per token. Must contain the columns "embedding" (precomputed
     BERT embeddings) and "token_class_id" (integer ID of token type)

    :returns: A dictionary from model name to results of
     :func:`util.analyze_model`
    """
    todo = [(name, model) for name, model in models.items()]
    results = tp.jupyter.run_with_progress_bar(
        len(todo),
        lambda i: cleaning.infer_and_extract_entities_iob(test_df, corpus_raw, int_to_label, todo[i][1]),
        "model"
    )
    return {t[0]: result for t, result in zip(todo, results)}
evals = eval_models(models, test_inputs_df)
# display one of the results
evals[list(evals.keys())[0]].head()
| | span | ent_type | fold | doc_num |
---|---|---|---|---|
0 | [11, 16): 'Saudi' | MISC | train | 12 |
1 | [59, 65): 'MANAMA' | LOC | train | 12 |
2 | [86, 91): 'Saudi' | MISC | train | 12 |
3 | [259, 264): 'Saudi' | MISC | train | 12 |
0 | [55, 65): 'MONTGOMERY' | LOC | train | 20 |
# Summarize how each of the models does on the test set.
gold_elts = cleaning.preprocess.combine_raw_spans_docs_to_match(corpus_raw,evals[list(evals.keys())[0]],label_col = 'ent_type')
def make_summary_df(evals: Dict[str, pd.DataFrame]) -> pd.DataFrame:
    # Use the function argument (not the notebook-level `evals` variable) so
    # that per-fold calls summarize the fold that was actually passed in.
    gold_elts = cleaning.preprocess.combine_raw_spans_docs_to_match(corpus_raw, evals['256_4'], label_col='ent_type')
    summary_df = cleaning.analysis.create_f1_report_ensemble_iob(evals, gold_elts)
    summary_df['dims'] = [int(name.split('_')[0]) for name in evals.keys()]
    return summary_df
summary_df = make_summary_df(evals)
summary_df
| | precision | recall | f1-score | dims |
---|---|---|---|---|
768_1 | 0.947149 | 0.938839 | 0.942976 | 768 |
32_1 | 0.924075 | 0.863742 | 0.892890 | 32 |
32_2 | 0.924755 | 0.875355 | 0.899377 | 32 |
32_3 | 0.925028 | 0.866065 | 0.894576 | 32 |
32_4 | 0.932949 | 0.876129 | 0.903647 | 32 |
64_1 | 0.940086 | 0.902968 | 0.921153 | 64 |
64_2 | 0.938321 | 0.902968 | 0.920305 | 64 |
64_3 | 0.936808 | 0.895226 | 0.915545 | 64 |
64_4 | 0.940828 | 0.902710 | 0.921375 | 64 |
128_1 | 0.944401 | 0.924903 | 0.934550 | 128 |
128_2 | 0.947577 | 0.923613 | 0.935442 | 128 |
128_3 | 0.943212 | 0.921548 | 0.932254 | 128 |
128_4 | 0.940991 | 0.921806 | 0.931300 | 128 |
256_1 | 0.949201 | 0.935484 | 0.942293 | 256 |
256_2 | 0.943396 | 0.929032 | 0.936159 | 256 |
256_3 | 0.945478 | 0.930839 | 0.938101 | 256 |
256_4 | 0.945055 | 0.932129 | 0.938547 | 256 |
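The `f1-score` column is the harmonic mean of the precision and recall columns. The sketch below recomputes it for two rows from the rounded values printed in the table, so agreement is only expected to about five decimal places. The function name `f1_score` is chosen for this illustration and is not the library's API.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Values copied from the 768_1 and 32_1 rows of the table above.
f1_full = f1_score(0.947149, 0.938839)
f1_small = f1_score(0.924075, 0.863742)
print(f1_full, f1_small)
```

Reading the table with this formula in mind, the drop in F1 at 32 dimensions comes mostly from recall (about 0.86 to 0.88) rather than precision (about 0.92 to 0.93).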
# Plot the tradeoff between dimensionality and F1 score
x = summary_df["dims"]
y = summary_df["f1-score"]
plt.figure(figsize=(4,4))
plt.scatter(x, y)
#plt.yscale("log")
#plt.xscale("log")
plt.xlabel("Number of Dimensions")
plt.ylabel("F1 Score")
# Also dump the raw data to a local file.
pd.DataFrame({"num_dims": x, "f1_score": y}).to_csv("outputs/dims_vs_f1_score_xval.csv",
index=False)
plt.show()
full_results = cleaning.flag_suspicious_labels(evals,'ent_type','ent_type',label_name='ent_type',gold_feats=gold_elts,align_over_cols=['fold','doc_num','span'],keep_cols=[],split_doc=False)
full_results
| | fold | doc_num | span | ent_type | in_gold | count | models |
---|---|---|---|---|---|---|---|
4927 | train | 907 | [590, 598): 'Gorleben' | LOC | True | 17 | [GOLD, 768_1, 32_1, 32_2, 32_3, 32_4, 64_1, 64... |
4925 | train | 907 | [63, 67): 'BONN' | LOC | True | 17 | [GOLD, 768_1, 32_1, 32_2, 32_3, 32_4, 64_1, 64... |
4924 | train | 907 | [11, 17): 'German' | MISC | True | 17 | [GOLD, 768_1, 32_1, 32_2, 32_3, 32_4, 64_1, 64... |
4923 | train | 896 | [523, 528): 'China' | LOC | True | 17 | [GOLD, 768_1, 32_1, 32_2, 32_3, 32_4, 64_1, 64... |
4922 | train | 896 | [512, 518): 'Mexico' | LOC | True | 17 | [GOLD, 768_1, 32_1, 32_2, 32_3, 32_4, 64_1, 64... |
... | ... | ... | ... | ... | ... | ... | ... |
374 | dev | 149 | [81, 93): 'Major League' | MISC | True | 0 | [GOLD] |
246 | dev | 120 | [63, 70): 'English' | MISC | True | 0 | [GOLD] |
78 | dev | 64 | [2571, 2575): 'AIDS' | MISC | True | 0 | [GOLD] |
3 | dev | 21 | [86, 90): 'UEFA' | ORG | True | 0 | [GOLD] |
0 | dev | 21 | [25, 39): 'STANDARD LIEGE' | ORG | True | 0 | [GOLD] |
4928 rows × 7 columns
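The `count` and `in_gold` columns can be read as a vote tally: for every (document, span, entity type) combination, how many models produced it, and whether the corpus labels contain it. The following sketch illustrates that tally on made-up data. It is not the implementation of `cleaning.flag_suspicious_labels`, whose internals are not shown in this notebook; `tally_votes` and the toy entities are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (fold, doc_num, (begin, end), ent_type)
Entity = Tuple[str, int, Tuple[int, int], str]


def tally_votes(gold: Set[Entity],
                model_outputs: Dict[str, Set[Entity]]) -> List[dict]:
    """One record per entity seen in the gold set or in any model output."""
    voters: Dict[Entity, List[str]] = defaultdict(list)
    for name, entities in model_outputs.items():
        for ent in entities:
            voters[ent].append(name)
    records = []
    for ent in sorted(gold | set(voters)):
        found_by = voters.get(ent, [])
        records.append({
            "entity": ent,
            "in_gold": ent in gold,
            "count": len(found_by),
            "models": found_by,
        })
    return records


# Invented example: both models extend the second span past the gold boundary.
gold = {("dev", 1, (0, 5), "LOC"), ("dev", 1, (10, 14), "ORG")}
model_outputs = {
    "model_a": {("dev", 1, (0, 5), "LOC"), ("dev", 1, (10, 20), "ORG")},
    "model_b": {("dev", 1, (0, 5), "LOC"), ("dev", 1, (10, 20), "ORG")},
}
records = tally_votes(gold, model_outputs)
for rec in records:
    print(rec)
```

In the real output, a gold entity with a `count` of 0 was found by no model, and a non-gold entity with a `count` of 17 was produced by every model. Both extremes are candidates for manual review.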
# Drop Boolean columns for now
results = full_results[["fold", "doc_num", "span", "ent_type", "in_gold", "count"]]
results
| | fold | doc_num | span | ent_type | in_gold | count |
---|---|---|---|---|---|---|
4927 | train | 907 | [590, 598): 'Gorleben' | LOC | True | 17 |
4925 | train | 907 | [63, 67): 'BONN' | LOC | True | 17 |
4924 | train | 907 | [11, 17): 'German' | MISC | True | 17 |
4923 | train | 896 | [523, 528): 'China' | LOC | True | 17 |
4922 | train | 896 | [512, 518): 'Mexico' | LOC | True | 17 |
... | ... | ... | ... | ... | ... | ... |
374 | dev | 149 | [81, 93): 'Major League' | MISC | True | 0 |
246 | dev | 120 | [63, 70): 'English' | MISC | True | 0 |
78 | dev | 64 | [2571, 2575): 'AIDS' | MISC | True | 0 |
3 | dev | 21 | [86, 90): 'UEFA' | ORG | True | 0 |
0 | dev | 21 | [25, 39): 'STANDARD LIEGE' | ORG | True | 0 |
4928 rows × 6 columns
(results[results["in_gold"] == True][["count", "span"]]
.groupby("count").count()
.rename(columns={"span": "num_ents"}))
count | num_ents |
---|---|
0 | 115 |
1 | 31 |
2 | 23 |
3 | 20 |
4 | 17 |
5 | 18 |
6 | 23 |
7 | 23 |
8 | 19 |
9 | 29 |
10 | 28 |
11 | 41 |
12 | 48 |
13 | 62 |
14 | 75 |
15 | 115 |
16 | 248 |
17 | 2940 |
(results[results["in_gold"] == False][["count", "span"]]
.groupby("count").count()
.rename(columns={"span": "num_ents"}))
count | num_ents |
---|---|
1 | 468 |
2 | 174 |
3 | 94 |
4 | 61 |
5 | 52 |
6 | 26 |
7 | 36 |
8 | 16 |
9 | 17 |
10 | 12 |
11 | 9 |
12 | 9 |
13 | 8 |
14 | 11 |
15 | 14 |
16 | 15 |
17 | 31 |
# Pull out some hard-to-find examples, sorting by document to make labeling easier
hard_to_get = results[results["in_gold"]].sort_values(["count", "fold", "doc_num"]).head(20)
hard_to_get
| | fold | doc_num | span | ent_type | in_gold | count |
---|---|---|---|---|---|---|
3 | dev | 21 | [86, 90): 'UEFA' | ORG | True | 0 |
0 | dev | 21 | [25, 39): 'STANDARD LIEGE' | ORG | True | 0 |
78 | dev | 64 | [2571, 2575): 'AIDS' | MISC | True | 0 |
246 | dev | 120 | [63, 70): 'English' | MISC | True | 0 |
374 | dev | 149 | [81, 93): 'Major League' | MISC | True | 0 |
498 | dev | 182 | [2173, 2177): 'Ruch' | ORG | True | 0 |
462 | dev | 182 | [662, 670): 'division' | MISC | True | 0 |
512 | dev | 203 | [879, 881): '90' | LOC | True | 0 |
622 | dev | 214 | [1689, 1705): 'Schindler's List' | MISC | True | 0 |
621 | dev | 214 | [1643, 1648): 'Oscar' | PER | True | 0 |
583 | dev | 214 | [285, 305): 'Venice Film Festival' | MISC | True | 0 |
569 | dev | 214 | [187, 202): 'Michael Collins' | MISC | True | 0 |
802 | test | 15 | [44, 56): 'WORLD SERIES' | MISC | True | 0 |
801 | test | 15 | [32, 43): 'WEST INDIES' | LOC | True | 0 |
942 | test | 21 | [719, 725): 'Wijaya' | PER | True | 0 |
896 | test | 21 | [22, 38): 'WORLD GRAND PRIX' | MISC | True | 0 |
1057 | test | 23 | [1117, 1127): 'NY RANGERS' | ORG | True | 0 |
1052 | test | 23 | [1106, 1113): 'TORONTO' | ORG | True | 0 |
1025 | test | 23 | [673, 689): 'CENTRAL DIVISION' | MISC | True | 0 |
1016 | test | 23 | [599, 611): 'NY ISLANDERS' | ORG | True | 0 |
# Hardest results not in the gold standard for models to avoid
hard_to_avoid = results[~results["in_gold"]].sort_values(["count", "fold", "doc_num"], ascending=[False, True, True]).head(20)
hard_to_avoid
| | fold | doc_num | span | ent_type | in_gold | count |
---|---|---|---|---|---|---|
373 | dev | 149 | [81, 102): 'Major League Baseball' | MISC | False | 17 |
570 | dev | 214 | [187, 202): 'Michael Collins' | PER | False | 17 |
983 | test | 23 | [94, 116): 'National Hockey League' | MISC | False | 17 |
1110 | test | 25 | [856, 864): 'NFC East' | MISC | False | 17 |
1109 | test | 25 | [823, 835): 'Philadelphia' | ORG | False | 17 |
1184 | test | 41 | [674, 688): 'Sporting Gijon' | ORG | False | 17 |
1323 | test | 114 | [51, 61): 'sales-USDA' | ORG | False | 17 |
1367 | test | 118 | [776, 791): 'mid-Mississippi' | LOC | False | 17 |
1362 | test | 118 | [535, 550): 'mid-Mississippi' | LOC | False | 17 |
1509 | test | 178 | [1787, 1800): 'Uruguay Round' | MISC | False | 17 |
1560 | test | 180 | [588, 592): 'BILO' | ORG | False | 17 |
1558 | test | 180 | [579, 583): 'TOPS' | ORG | False | 17 |
1550 | test | 180 | [395, 399): 'BILO' | ORG | False | 17 |
1544 | test | 180 | [286, 293): 'Malysia' | ORG | False | 17 |
1542 | test | 180 | [259, 263): 'BILO' | ORG | False | 17 |
1649 | test | 207 | [1041, 1047): 'Oxford' | ORG | False | 17 |
1786 | test | 219 | [368, 381): 'Koo Jeon Woon' | PER | False | 17 |
1807 | test | 222 | [218, 225): 'EASTERN' | MISC | False | 17 |
1805 | test | 222 | [92, 114): 'National Hockey League' | MISC | False | 17 |
2054 | train | 48 | [885, 899): 'Sjeng Schalken' | ORG | False | 17 |
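The two ranked lists above use opposite sort orders. Gold entities are ranked by ascending `count` (the fewest models agree with the label), while non-gold entities are ranked by descending `count` (the most models agree with each other against the label). A small standard-library sketch of the same ordering on invented records:

```python
from typing import List, Tuple

# (fold, doc_num, in_gold, count): invented values for illustration only.
Record = Tuple[str, int, bool, int]
records: List[Record] = [
    ("train", 5, True, 17),
    ("dev", 3, True, 0),
    ("test", 9, False, 17),
    ("dev", 7, False, 1),
    ("dev", 2, True, 0),
    ("dev", 4, False, 17),
]

# Gold entities that the fewest models found, grouped by document.
hard_to_get = sorted(
    (r for r in records if r[2]),
    key=lambda r: (r[3], r[0], r[1]),
)

# Non-gold entities that the most models produced: negate the count so
# that it sorts descending while fold and doc_num stay ascending.
hard_to_avoid = sorted(
    (r for r in records if not r[2]),
    key=lambda r: (-r[3], r[0], r[1]),
)
print(hard_to_get[0], hard_to_avoid[0])
```

Using `fold` and `doc_num` as secondary keys keeps entities from the same document next to each other, which is what makes the lists convenient for manual labeling.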
For each of the 10 folds, train a model on the fold's training set and run analysis on the fold's test set.
def handle_fold(fold_ix: int) -> Dict[str, Any]:
    """
    The per-fold processing of the previous section's cells, collapsed into
    a single function.

    :param fold_ix: 0-based index of fold

    :returns: a dictionary that maps data structure name to data structure
    """
    # To avoid accidentally picking up leftover data from a previous cell,
    # variables local to this function are named with a leading underscore
    _train_inputs_df = corpus_df.merge(train_keys[fold_ix])
    _test_inputs_df = corpus_df.merge(test_keys[fold_ix])
    _models = maybe_train_models(_train_inputs_df, fold_ix)
    _evals = eval_models(_models, _test_inputs_df)
    _summary_df = make_summary_df(_evals)
    # Index with this fold's own keys rather than the notebook-level `evals`.
    _gold_elts = cleaning.preprocess.combine_raw_spans_docs_to_match(corpus_raw, _evals[list(_evals.keys())[0]])
    _full_results = cleaning.flag_suspicious_labels(_evals, 'ent_type', 'ent_type',
                                                    label_name='ent_type',
                                                    gold_feats=_gold_elts,
                                                    align_over_cols=['fold', 'doc_num', 'span'],
                                                    keep_cols=[], split_doc=False)
    _results = _full_results[["fold", "doc_num", "span",
                              "ent_type", "in_gold", "count"]]
    return {
        "models": _models,
        "summary_df": _summary_df,
        "full_results": _full_results,
        "results": _results
    }
# Start with the (already computed) results for fold 0
results_by_fold = [
    {
        "models": models,
        "summary_df": summary_df,
        "full_results": full_results,
        "results": results
    }
]
for fold in range(1, _KFOLD_NUM_FOLDS):
    print(f"Starting fold {fold}.")
    results_by_fold.append(handle_fold(fold))
    print(f"Done with fold {fold}.")
Starting fold 1. Training model using all of 768-dimension embeddings. Training model '32_1' (#1 at 32 dimensions) with seed 89250 Training model '32_2' (#2 at 32 dimensions) with seed 773956 Training model '32_3' (#3 at 32 dimensions) with seed 654571 Training model '32_4' (#4 at 32 dimensions) with seed 438878 Training model '64_1' (#1 at 64 dimensions) with seed 201469 Training model '64_2' (#2 at 64 dimensions) with seed 94177 Training model '64_3' (#3 at 64 dimensions) with seed 526478 Training model '64_4' (#4 at 64 dimensions) with seed 975622 Training model '128_1' (#1 at 128 dimensions) with seed 513226 Training model '128_2' (#2 at 128 dimensions) with seed 128113 Training model '128_3' (#3 at 128 dimensions) with seed 839748 Training model '128_4' (#4 at 128 dimensions) with seed 450385 Training model '256_1' (#1 at 256 dimensions) with seed 781567 Training model '256_2' (#2 at 256 dimensions) with seed 643865 Training model '256_3' (#3 at 256 dimensions) with seed 402414 Training model '256_4' (#4 at 256 dimensions) with seed 822761 (pid=72363) Training model with n_components=32 and seed=438878. (pid=72362) Training model with n_components=32 and seed=654571. (pid=72366) Training model with n_components=64 and seed=975622. (pid=72365) Training model with n_components=32 and seed=773956. (pid=72368) Training model with n_components=32 and seed=89250. (pid=72375) Training model with n_components=256 and seed=643865. (pid=72371) Training model with n_components=256 and seed=781567. (pid=72364) Training model with n_components=64 and seed=201469. (pid=72370) Training model with n_components=128 and seed=839748. (pid=72374) Training model with n_components=128 and seed=513226. (pid=72376) Training model with n_components=128 and seed=450385. (pid=72373) Training model with n_components=128 and seed=128113. (pid=72369) Training model with n_components=64 and seed=526478. (pid=72367) Training model with n_components=64 and seed=94177. 
(pid=72372) Training model with n_components=256 and seed=402414. (pid=72368) Training model with n_components=256 and seed=822761. Trained 17 models.
Done with fold 1.
Starting fold 2.
Training model using all of 768-dimension embeddings.
Training model '32_1' (#1 at 32 dimensions) with seed 89250
Training model '32_2' (#2 at 32 dimensions) with seed 773956
Training model '32_3' (#3 at 32 dimensions) with seed 654571
Training model '32_4' (#4 at 32 dimensions) with seed 438878
Training model '64_1' (#1 at 64 dimensions) with seed 201469
Training model '64_2' (#2 at 64 dimensions) with seed 94177
Training model '64_3' (#3 at 64 dimensions) with seed 526478
Training model '64_4' (#4 at 64 dimensions) with seed 975622
Training model '128_1' (#1 at 128 dimensions) with seed 513226
Training model '128_2' (#2 at 128 dimensions) with seed 128113
Training model '128_3' (#3 at 128 dimensions) with seed 839748
Training model '128_4' (#4 at 128 dimensions) with seed 450385
Training model '256_1' (#1 at 256 dimensions) with seed 781567
Training model '256_2' (#2 at 256 dimensions) with seed 643865
Training model '256_3' (#3 at 256 dimensions) with seed 402414
Training model '256_4' (#4 at 256 dimensions) with seed 822761
(pid=72363) Training model with n_components=256 and seed=643865.
(pid=72362) Training model with n_components=256 and seed=781567.
(pid=72366) Training model with n_components=128 and seed=450385.
(pid=72365) Training model with n_components=256 and seed=402414.
(pid=72372) Training model with n_components=32 and seed=654571.
(pid=72368) Training model with n_components=32 and seed=89250.
(pid=72375) Training model with n_components=32 and seed=438878.
(pid=72371) Training model with n_components=32 and seed=773956.
(pid=72364) Training model with n_components=128 and seed=128113.
(pid=72370) Training model with n_components=64 and seed=526478.
(pid=72374) Training model with n_components=64 and seed=975622.
(pid=72376) Training model with n_components=64 and seed=94177.
(pid=72373) Training model with n_components=64 and seed=201469.
(pid=72369) Training model with n_components=128 and seed=839748.
(pid=72367) Training model with n_components=128 and seed=513226.
(pid=72368) Training model with n_components=256 and seed=822761.
Trained 17 models.
Done with fold 2.
Starting fold 3.
Training model using all of 768-dimension embeddings.
Training model '32_1' (#1 at 32 dimensions) with seed 89250
Training model '32_2' (#2 at 32 dimensions) with seed 773956
Training model '32_3' (#3 at 32 dimensions) with seed 654571
Training model '32_4' (#4 at 32 dimensions) with seed 438878
Training model '64_1' (#1 at 64 dimensions) with seed 201469
Training model '64_2' (#2 at 64 dimensions) with seed 94177
Training model '64_3' (#3 at 64 dimensions) with seed 526478
Training model '64_4' (#4 at 64 dimensions) with seed 975622
Training model '128_1' (#1 at 128 dimensions) with seed 513226
Training model '128_2' (#2 at 128 dimensions) with seed 128113
Training model '128_3' (#3 at 128 dimensions) with seed 839748
Training model '128_4' (#4 at 128 dimensions) with seed 450385
Training model '256_1' (#1 at 256 dimensions) with seed 781567
Training model '256_2' (#2 at 256 dimensions) with seed 643865
Training model '256_3' (#3 at 256 dimensions) with seed 402414
Training model '256_4' (#4 at 256 dimensions) with seed 822761
(pid=72363) Training model with n_components=32 and seed=773956.
(pid=72362) Training model with n_components=32 and seed=654571.
(pid=72366) Training model with n_components=64 and seed=975622.
(pid=72365) Training model with n_components=32 and seed=438878.
(pid=72368) Training model with n_components=32 and seed=89250.
(pid=72375) Training model with n_components=256 and seed=781567.
(pid=72371) Training model with n_components=256 and seed=643865.
(pid=72364) Training model with n_components=64 and seed=201469.
(pid=72370) Training model with n_components=128 and seed=839748.
(pid=72374) Training model with n_components=128 and seed=450385.
(pid=72376) Training model with n_components=128 and seed=513226.
(pid=72373) Training model with n_components=128 and seed=128113.
(pid=72369) Training model with n_components=64 and seed=94177.
(pid=72367) Training model with n_components=64 and seed=526478.
(pid=72372) Training model with n_components=256 and seed=402414.
(pid=72368) Training model with n_components=256 and seed=822761.
Trained 17 models.
Done with fold 3.
Starting fold 4.
Training model using all of 768-dimension embeddings.
Training model '32_1' (#1 at 32 dimensions) with seed 89250
Training model '32_2' (#2 at 32 dimensions) with seed 773956
Training model '32_3' (#3 at 32 dimensions) with seed 654571
Training model '32_4' (#4 at 32 dimensions) with seed 438878
Training model '64_1' (#1 at 64 dimensions) with seed 201469
Training model '64_2' (#2 at 64 dimensions) with seed 94177
Training model '64_3' (#3 at 64 dimensions) with seed 526478
Training model '64_4' (#4 at 64 dimensions) with seed 975622
Training model '128_1' (#1 at 128 dimensions) with seed 513226
Training model '128_2' (#2 at 128 dimensions) with seed 128113
Training model '128_3' (#3 at 128 dimensions) with seed 839748
Training model '128_4' (#4 at 128 dimensions) with seed 450385
Training model '256_1' (#1 at 256 dimensions) with seed 781567
Training model '256_2' (#2 at 256 dimensions) with seed 643865
Training model '256_3' (#3 at 256 dimensions) with seed 402414
Training model '256_4' (#4 at 256 dimensions) with seed 822761
(pid=72362) Training model with n_components=256 and seed=643865.
(pid=72366) Training model with n_components=128 and seed=450385.
(pid=72365) Training model with n_components=256 and seed=781567.
(pid=72372) Training model with n_components=32 and seed=438878.
(pid=72368) Training model with n_components=32 and seed=89250.
(pid=72375) Training model with n_components=32 and seed=654571.
(pid=72371) Training model with n_components=32 and seed=773956.
(pid=72364) Training model with n_components=128 and seed=839748.
(pid=72370) Training model with n_components=64 and seed=975622.
(pid=72374) Training model with n_components=64 and seed=94177.
(pid=72376) Training model with n_components=64 and seed=526478.
(pid=72373) Training model with n_components=64 and seed=201469.
(pid=72369) Training model with n_components=128 and seed=513226.
(pid=72367) Training model with n_components=128 and seed=128113.
(pid=72363) Training model with n_components=256 and seed=402414.
(pid=72368) Training model with n_components=256 and seed=822761.
Trained 17 models.
Done with fold 4.
Starting fold 5.
Training model using all of 768-dimension embeddings.
Training model '32_1' (#1 at 32 dimensions) with seed 89250
Training model '32_2' (#2 at 32 dimensions) with seed 773956
Training model '32_3' (#3 at 32 dimensions) with seed 654571
Training model '32_4' (#4 at 32 dimensions) with seed 438878
Training model '64_1' (#1 at 64 dimensions) with seed 201469
Training model '64_2' (#2 at 64 dimensions) with seed 94177
Training model '64_3' (#3 at 64 dimensions) with seed 526478
Training model '64_4' (#4 at 64 dimensions) with seed 975622
Training model '128_1' (#1 at 128 dimensions) with seed 513226
Training model '128_2' (#2 at 128 dimensions) with seed 128113
Training model '128_3' (#3 at 128 dimensions) with seed 839748
Training model '128_4' (#4 at 128 dimensions) with seed 450385
Training model '256_1' (#1 at 256 dimensions) with seed 781567
Training model '256_2' (#2 at 256 dimensions) with seed 643865
Training model '256_3' (#3 at 256 dimensions) with seed 402414
Training model '256_4' (#4 at 256 dimensions) with seed 822761
(pid=72363) Training model with n_components=32 and seed=438878.
(pid=72362) Training model with n_components=32 and seed=773956.
(pid=72366) Training model with n_components=64 and seed=526478.
(pid=72365) Training model with n_components=32 and seed=654571.
(pid=72368) Training model with n_components=32 and seed=89250.
(pid=72364) Training model with n_components=64 and seed=94177.
(pid=72374) Training model with n_components=128 and seed=513226.
(pid=72369) Training model with n_components=64 and seed=975622.
(pid=72367) Training model with n_components=64 and seed=201469.
(pid=72373) Training model with n_components=128 and seed=128113.
(pid=72376) Training model with n_components=128 and seed=839748.
(pid=72370) Training model with n_components=128 and seed=450385.
(pid=72371) Training model with n_components=256 and seed=781567.
(pid=72375) Training model with n_components=256 and seed=643865.
(pid=72372) Training model with n_components=256 and seed=402414.
(pid=72362) Training model with n_components=256 and seed=822761.
Trained 17 models.
Done with fold 5.
Starting fold 6.
Training model using all of 768-dimension embeddings.
Training model '32_1' (#1 at 32 dimensions) with seed 89250
Training model '32_2' (#2 at 32 dimensions) with seed 773956
Training model '32_3' (#3 at 32 dimensions) with seed 654571
Training model '32_4' (#4 at 32 dimensions) with seed 438878
Training model '64_1' (#1 at 64 dimensions) with seed 201469
Training model '64_2' (#2 at 64 dimensions) with seed 94177
Training model '64_3' (#3 at 64 dimensions) with seed 526478
Training model '64_4' (#4 at 64 dimensions) with seed 975622
Training model '128_1' (#1 at 128 dimensions) with seed 513226
Training model '128_2' (#2 at 128 dimensions) with seed 128113
Training model '128_3' (#3 at 128 dimensions) with seed 839748
Training model '128_4' (#4 at 128 dimensions) with seed 450385
Training model '256_1' (#1 at 256 dimensions) with seed 781567
Training model '256_2' (#2 at 256 dimensions) with seed 643865
Training model '256_3' (#3 at 256 dimensions) with seed 402414
Training model '256_4' (#4 at 256 dimensions) with seed 822761
(pid=72363) Training model with n_components=256 and seed=781567.
(pid=72362) Training model with n_components=32 and seed=89250.
(pid=72366) Training model with n_components=128 and seed=839748.
(pid=72372) Training model with n_components=32 and seed=438878.
(pid=72368) Training model with n_components=256 and seed=643865.
(pid=72375) Training model with n_components=32 and seed=654571.
(pid=72371) Training model with n_components=32 and seed=773956.
(pid=72364) Training model with n_components=128 and seed=513226.
(pid=72370) Training model with n_components=64 and seed=94177.
(pid=72374) Training model with n_components=64 and seed=526478.
(pid=72376) Training model with n_components=64 and seed=975622.
(pid=72373) Training model with n_components=64 and seed=201469.
(pid=72369) Training model with n_components=128 and seed=450385.
(pid=72367) Training model with n_components=128 and seed=128113.
(pid=72365) Training model with n_components=256 and seed=402414.
(pid=72375) Training model with n_components=256 and seed=822761.
Trained 17 models.
Done with fold 6.
Starting fold 7.
Training model using all of 768-dimension embeddings.
Training model '32_1' (#1 at 32 dimensions) with seed 89250
Training model '32_2' (#2 at 32 dimensions) with seed 773956
Training model '32_3' (#3 at 32 dimensions) with seed 654571
Training model '32_4' (#4 at 32 dimensions) with seed 438878
Training model '64_1' (#1 at 64 dimensions) with seed 201469
Training model '64_2' (#2 at 64 dimensions) with seed 94177
Training model '64_3' (#3 at 64 dimensions) with seed 526478
Training model '64_4' (#4 at 64 dimensions) with seed 975622
Training model '128_1' (#1 at 128 dimensions) with seed 513226
Training model '128_2' (#2 at 128 dimensions) with seed 128113
Training model '128_3' (#3 at 128 dimensions) with seed 839748
Training model '128_4' (#4 at 128 dimensions) with seed 450385
Training model '256_1' (#1 at 256 dimensions) with seed 781567
Training model '256_2' (#2 at 256 dimensions) with seed 643865
Training model '256_3' (#3 at 256 dimensions) with seed 402414
Training model '256_4' (#4 at 256 dimensions) with seed 822761
(pid=72363) Training model with n_components=32 and seed=654571.
(pid=72366) Training model with n_components=64 and seed=975622.
(pid=72365) Training model with n_components=32 and seed=438878.
(pid=72372) Training model with n_components=256 and seed=781567.
(pid=72368) Training model with n_components=32 and seed=773956.
(pid=72375) Training model with n_components=32 and seed=89250.
(pid=72371) Training model with n_components=256 and seed=402414.
(pid=72364) Training model with n_components=64 and seed=94177.
(pid=72370) Training model with n_components=128 and seed=513226.
(pid=72374) Training model with n_components=128 and seed=128113.
(pid=72376) Training model with n_components=128 and seed=450385.
(pid=72373) Training model with n_components=128 and seed=839748.
(pid=72369) Training model with n_components=64 and seed=526478.
(pid=72367) Training model with n_components=64 and seed=201469.
(pid=72362) Training model with n_components=256 and seed=643865.
(pid=72368) Training model with n_components=256 and seed=822761.
Trained 17 models.
Done with fold 7.
Starting fold 8.
Training model using all of 768-dimension embeddings.
Training model '32_1' (#1 at 32 dimensions) with seed 89250
Training model '32_2' (#2 at 32 dimensions) with seed 773956
Training model '32_3' (#3 at 32 dimensions) with seed 654571
Training model '32_4' (#4 at 32 dimensions) with seed 438878
Training model '64_1' (#1 at 64 dimensions) with seed 201469
Training model '64_2' (#2 at 64 dimensions) with seed 94177
Training model '64_3' (#3 at 64 dimensions) with seed 526478
Training model '64_4' (#4 at 64 dimensions) with seed 975622
Training model '128_1' (#1 at 128 dimensions) with seed 513226
Training model '128_2' (#2 at 128 dimensions) with seed 128113
Training model '128_3' (#3 at 128 dimensions) with seed 839748
Training model '128_4' (#4 at 128 dimensions) with seed 450385
Training model '256_1' (#1 at 256 dimensions) with seed 781567
Training model '256_2' (#2 at 256 dimensions) with seed 643865
Training model '256_3' (#3 at 256 dimensions) with seed 402414
Training model '256_4' (#4 at 256 dimensions) with seed 822761
(pid=72363) Training model with n_components=256 and seed=781567.
(pid=72362) Training model with n_components=32 and seed=773956.
(pid=72366) Training model with n_components=128 and seed=450385.
(pid=72365) Training model with n_components=256 and seed=402414.
(pid=72372) Training model with n_components=32 and seed=438878.
(pid=72368) Training model with n_components=32 and seed=89250.
(pid=72375) Training model with n_components=256 and seed=643865.
(pid=72371) Training model with n_components=32 and seed=654571.
(pid=72364) Training model with n_components=128 and seed=513226.
(pid=72370) Training model with n_components=64 and seed=94177.
(pid=72374) Training model with n_components=64 and seed=201469.
(pid=72376) Training model with n_components=64 and seed=975622.
(pid=72373) Training model with n_components=64 and seed=526478.
(pid=72369) Training model with n_components=128 and seed=839748.
(pid=72367) Training model with n_components=128 and seed=128113.
(pid=72368) Training model with n_components=256 and seed=822761.
Trained 17 models.
Done with fold 8.
Starting fold 9.
Training model using all of 768-dimension embeddings.
Training model '32_1' (#1 at 32 dimensions) with seed 89250
Training model '32_2' (#2 at 32 dimensions) with seed 773956
Training model '32_3' (#3 at 32 dimensions) with seed 654571
Training model '32_4' (#4 at 32 dimensions) with seed 438878
Training model '64_1' (#1 at 64 dimensions) with seed 201469
Training model '64_2' (#2 at 64 dimensions) with seed 94177
Training model '64_3' (#3 at 64 dimensions) with seed 526478
Training model '64_4' (#4 at 64 dimensions) with seed 975622
Training model '128_1' (#1 at 128 dimensions) with seed 513226
Training model '128_2' (#2 at 128 dimensions) with seed 128113
Training model '128_3' (#3 at 128 dimensions) with seed 839748
Training model '128_4' (#4 at 128 dimensions) with seed 450385
Training model '256_1' (#1 at 256 dimensions) with seed 781567
Training model '256_2' (#2 at 256 dimensions) with seed 643865
Training model '256_3' (#3 at 256 dimensions) with seed 402414
Training model '256_4' (#4 at 256 dimensions) with seed 822761
(pid=72363) Training model with n_components=32 and seed=773956.
(pid=72366) Training model with n_components=64 and seed=975622.
(pid=72365) Training model with n_components=32 and seed=438878.
(pid=72368) Training model with n_components=32 and seed=89250.
(pid=72375) Training model with n_components=32 and seed=654571.
(pid=72364) Training model with n_components=64 and seed=526478.
(pid=72370) Training model with n_components=128 and seed=513226.
(pid=72369) Training model with n_components=64 and seed=94177.
(pid=72367) Training model with n_components=64 and seed=201469.
(pid=72374) Training model with n_components=128 and seed=128113.
(pid=72376) Training model with n_components=128 and seed=839748.
(pid=72373) Training model with n_components=128 and seed=450385.
(pid=72372) Training model with n_components=256 and seed=781567.
(pid=72371) Training model with n_components=256 and seed=643865.
(pid=72362) Training model with n_components=256 and seed=402414.
(pid=72368) Training model with n_components=256 and seed=822761.
Trained 17 models.
Done with fold 9.
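The training log lists 17 models per fold: one trained on the full 768-dimension embeddings and four at each of 32, 64, 128, and 256 dimensions, with the same seeds reused in every fold. As a rough check on that bookkeeping, the sketch below enumerates the model names as they appear in the log. The variable names and the `"768_1"` label for the full-dimension model are assumptions made for illustration and are not taken from the notebook's training code.

```python
# Illustrative sketch only: rebuilds the model names seen in the log output.
reduced_dims = [32, 64, 128, 256]   # reduced dimensionalities shown in the log
models_per_dim = 4                  # '32_1' through '32_4', and so on
num_folds = 10                      # 10-fold cross-validation

# Hypothetical label for the model trained on all 768 dimensions.
model_names = ["768_1"]
model_names += [
    f"{dim}_{i}"
    for dim in reduced_dims
    for i in range(1, models_per_dim + 1)
]

models_per_fold = len(model_names)
total_models = models_per_fold * num_folds
print(models_per_fold, total_models)
```

This matches the totals stated at the top of the notebook: 10 groups of 17 models, 170 in all.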
# Combine all the results into a single dataframe for the entire corpus
all_results = pd.concat([r["results"] for r in results_by_fold])
all_results
| | fold | doc_num | span | ent_type | in_gold | count |
|---|---|---|---|---|---|---|
| 4927 | train | 907 | [590, 598): 'Gorleben' | LOC | True | 17 |
| 4925 | train | 907 | [63, 67): 'BONN' | LOC | True | 17 |
| 4924 | train | 907 | [11, 17): 'German' | MISC | True | 17 |
| 4923 | train | 896 | [523, 528): 'China' | LOC | True | 17 |
| 4922 | train | 896 | [512, 518): 'Mexico' | LOC | True | 17 |
| ... | ... | ... | ... | ... | ... | ... |
| 271 | dev | 93 | [469, 481): 'JAKARTA POST' | ORG | True | 0 |
| 183 | dev | 76 | [1285, 1312): 'Chicago Purchasing Managers' | ORG | True | 0 |
| 126 | dev | 49 | [1920, 1925): 'Tajik' | MISC | True | 0 |
| 25 | dev | 15 | [109, 133): 'National Football League' | ORG | True | 0 |
| 17 | dev | 15 | [15, 40): 'AMERICAN FOOTBALL-RANDALL' | MISC | True | 0 |
44802 rows × 6 columns
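In the cells that follow, these results are split by `in_gold` and ranked using the `count` column, which appears to record how many of a fold's 17 models produced each span. The sketch below mimics that ranking on a hand-made dataframe. It reflects only the ordering visible in the outputs (ascending `count` for spans in the corpus, descending for spans the models added) and is not the implementation of `cleaning.analysis.csv_prep`; the toy rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for all_results; the values are invented for illustration.
toy_results = pd.DataFrame({
    "fold": ["dev", "dev", "dev", "test", "test"],
    "doc_num": [2, 2, 6, 230, 230],
    "ent_type": ["ORG", "LOC", "PER", "PER", "LOC"],
    "in_gold": [True, False, False, True, False],
    "count": [0, 17, 17, 17, 1],
})

# Corpus labels that few models reproduced are the most suspicious,
# so they are listed first (lowest count first).
in_gold = toy_results[toy_results["in_gold"]].sort_values(
    "count", kind="stable")

# Model outputs missing from the corpus are most interesting when many
# models agree on them, so they are listed first (highest count first).
not_in_gold = toy_results[~toy_results["in_gold"]].sort_values(
    "count", ascending=False, kind="stable")

print(in_gold[["fold", "doc_num", "count"]])
print(not_in_gold[["fold", "doc_num", "count"]])
```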
# Reformat for output
dev_and_test_results = all_results[all_results["fold"].isin(["dev", "test"])]
in_gold_to_write, not_in_gold_to_write = cleaning.analysis.csv_prep(dev_and_test_results, "count")
in_gold_to_write
| | count | fold | doc_offset | corpus_span | corpus_ent_type | error_type | correct_span | correct_ent_type | notes | time_started | time_stopped | time_elapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | 0 | dev | 2 | [760, 765): 'Leeds' | ORG | | | | | | | |
| 21 | 0 | dev | 2 | [614, 634): 'Duke of Norfolk's XI' | ORG | | | | | | | |
| 5 | 0 | dev | 2 | [189, 218): 'Test and County Cricket Board' | ORG | | | | | | | |
| 3 | 0 | dev | 2 | [87, 92): 'Ashes' | MISC | | | | | | | |
| 0 | 0 | dev | 2 | [25, 30): 'ASHES' | MISC | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1738 | 17 | test | 230 | [230, 238): 'Charlton' | PER | | | | | | | |
| 1737 | 17 | test | 230 | [177, 187): 'Englishman' | MISC | | | | | | | |
| 1736 | 17 | test | 230 | [135, 142): 'Ireland' | LOC | | | | | | | |
| 1735 | 17 | test | 230 | [87, 100): 'Jack Charlton' | PER | | | | | | | |
| 1734 | 17 | test | 230 | [69, 75): 'DUBLIN' | LOC | | | | | | | |
11590 rows × 12 columns
not_in_gold_to_write
| | count | fold | doc_offset | model_span | model_ent_type | error_type | corpus_span | corpus_ent_type | correct_span | correct_ent_type | notes | time_started | time_stopped | time_elapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | 17 | dev | 2 | [760, 765): 'Leeds' | LOC | | | | | | | | | |
| 25 | 17 | dev | 6 | [567, 572): 'Rotor' | PER | | | | | | | | | |
| 20 | 17 | dev | 6 | [399, 404): 'Rotor' | PER | | | | | | | | | |
| 16 | 17 | dev | 6 | [262, 267): 'Rotor' | PER | | | | | | | | | |
| 142 | 17 | dev | 11 | [1961, 1975): 'Czech Republic' | LOC | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1708 | 1 | test | 228 | [771, 784): 'De Graafschap' | ORG | | | | | | | | | |
| 1690 | 1 | test | 228 | [269, 287): 'Brazilian defender' | MISC | | | | | | | | | |
| 1679 | 1 | test | 228 | [40, 43): 'SIX' | ORG | | | | | | | | | |
| 1724 | 1 | test | 230 | [19, 29): 'ENGLISHMAN' | LOC | | | | | | | | | |
| 1727 | 1 | test | 230 | [19, 38): 'ENGLISHMAN CHARLTON' | PER | | | | | | | | | |
4366 rows × 14 columns
in_gold_to_write.to_csv("outputs/CoNLL_4_in_gold.csv", index=False)
not_in_gold_to_write.to_csv("outputs/CoNLL_4_not_in_gold.csv", index=False)
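Because `index=False` is passed, the row labels shown in the tables above are not written to the files, and the blank columns (`error_type`, `notes`, and so on) are written as empty fields. A minimal sketch of that round trip follows, using an in-memory buffer and an invented two-row dataframe with a subset of the columns rather than the real outputs. The `keep_default_na=False` choice when reading back is an assumption about how one might want to treat the blanks.

```python
import io

import pandas as pd

# Toy stand-in for in_gold_to_write; only a subset of the real columns.
toy = pd.DataFrame({
    "count": [0, 17],
    "fold": ["dev", "test"],
    "doc_offset": [2, 230],
    "corpus_span": ["[760, 765): 'Leeds'", "[69, 75): 'DUBLIN'"],
    "corpus_ent_type": ["ORG", "LOC"],
    "error_type": ["", ""],
    "notes": ["", ""],
})

# Write to an in-memory buffer instead of a file under outputs/.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # index=False omits the row labels
buffer.seek(0)

# keep_default_na=False keeps blank cells as empty strings instead of NaN.
reloaded = pd.read_csv(buffer, keep_default_na=False)
print(list(reloaded.columns))
print(reloaded["error_type"].tolist())
```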
# Repeat for the contents of the original training set
train_results = all_results[all_results["fold"] == "train"]
in_gold_to_write, not_in_gold_to_write = cleaning.analysis.csv_prep(train_results, "count")
in_gold_to_write
| | count | fold | doc_offset | corpus_span | corpus_ent_type | error_type | correct_span | correct_ent_type | notes | time_started | time_stopped | time_elapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1486 | 0 | train | 6 | [121, 137): 'Toronto Dominion' | PER | | | | | | | |
| 1358 | 0 | train | 24 | [384, 388): 'FLNC' | ORG | | | | | | | |
| 1355 | 0 | train | 24 | [161, 169): 'Africans' | MISC | | | | | | | |
| 1965 | 0 | train | 25 | [141, 151): 'mid-Norway' | MISC | | | | | | | |
| 1383 | 0 | train | 28 | [1133, 1135): 'EU' | ORG | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4132 | 17 | train | 945 | [130, 137): 'Preston' | ORG | | | | | | | |
| 4131 | 17 | train | 945 | [119, 127): 'Plymouth' | ORG | | | | | | | |
| 4130 | 17 | train | 945 | [72, 79): 'English' | MISC | | | | | | | |
| 4129 | 17 | train | 945 | [43, 49): 'LONDON' | LOC | | | | | | | |
| 4128 | 17 | train | 945 | [19, 26): 'ENGLISH' | MISC | | | | | | | |
23499 rows × 12 columns
not_in_gold_to_write
| | count | fold | doc_offset | model_span | model_ent_type | error_type | corpus_span | corpus_ent_type | correct_span | correct_ent_type | notes | time_started | time_stopped | time_elapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1738 | 17 | train | 3 | [0, 10): '-DOCSTART-' | LOC | | | | | | | | | |
| 1485 | 17 | train | 6 | [121, 137): 'Toronto Dominion' | LOC | | | | | | | | | |
| 1964 | 17 | train | 25 | [141, 151): 'mid-Norway' | LOC | | | | | | | | | |
| 2022 | 17 | train | 29 | [762, 774): 'Mark O'Meara' | PER | | | | | | | | | |
| 1996 | 17 | train | 29 | [454, 468): 'Phil Mickelson' | PER | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4416 | 1 | train | 943 | [25, 46): 'SAN MARINO GRAND PRIX' | PER | | | | | | | | | |
| 4461 | 1 | train | 944 | [25, 32): 'MASTERS' | MISC | | | | | | | | | |
| 4462 | 1 | train | 944 | [25, 32): 'MASTERS' | PER | | | | | | | | | |
| 4463 | 1 | train | 944 | [17, 32): 'BRITISH MASTERS' | LOC | | | | | | | | | |
| 4458 | 1 | train | 944 | [11, 15): 'GOLF' | LOC | | | | | | | | | |
5347 rows × 14 columns
in_gold_to_write.to_csv("outputs/CoNLL_4_train_in_gold.csv", index=False)
not_in_gold_to_write.to_csv("outputs/CoNLL_4_train_not_in_gold.csv", index=False)