To get started: consult start
In the entities notebook we saw how we could use third-party features with our corpus. There we promised to show how we can upgrade such features so that they also work against newer versions of the corpus.
Text-Fabric has machinery to help with that. It turns out that we have to make a mapping between the slots of both versions, and then Text-Fabric can do the rest.
With that mapping in hand, we can port all features, past, present and future, automatically from the older version to the newer, and vice versa.
But: there will be imperfections, unavoidably.
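To make the idea concrete, here is a minimal sketch in plain Python of what porting a feature through a node map amounts to. It does not use Text-Fabric: the function `portFeature`, the dictionary `nodeMap` and the toy data are hypothetical names chosen for illustration, and the real machinery deals with many more cases.

```python
def portFeature(featureData, nodeMap):
    """Carry feature values from source nodes to the target nodes they map to.

    featureData: dict of source node -> value
    nodeMap: dict of source node -> list of target nodes
    Source nodes without a mapping are collected, not guessed.
    """
    ported = {}
    unmapped = []
    for node, value in featureData.items():
        targets = nodeMap.get(node)
        if not targets:
            unmapped.append(node)
            continue
        for target in targets:
            # the first value wins if several source nodes reach the same target
            ported.setdefault(target, value)
    return ported, unmapped


# hypothetical toy data: nodes shift by one, node 3 is split in two, node 4 is lost
nodeMap = {1: [2], 2: [3], 3: [4, 5]}
entityKind = {1: "PER", 3: "LOC", 4: "SHP"}

ported, unmapped = portFeature(entityKind, nodeMap)
print(ported)
print(unmapped)
```

The list of unmapped nodes is where the unavoidable imperfections show up: a value that cannot be carried over is lost in the newer version.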
%load_ext autoreload
%autoreload 2
import collections
from tf.app import use
In this notebook we map the slot nodes from version 0.8.1 (source version) to 1.0 (target version).
Basically this means that we map all slots from the source version to corresponding slots in the target version.
Some slots have an empty text (most of them contain some punctuation).
We do not want to be fussy about those slots. We map them onto corresponding empty slots if possible, otherwise we map them onto the nearest non-empty slot.
After establishing the slot mapping, we extend the mapping to all nodes in a generic way. The code for this is already in the TF library.
Note that at this stage we do not need the entity features at all. All we do is compare two versions of the base corpus.
from tf.dataset import Versions
va = "0.8.1"
vb = "1.0"
We load the data for both versions.
This time, we work in our GitHub clone, because we want to make the resulting map available to everyone, after pushing the clone to GitHub.
A = {}

for v in (va, vb):
    A[v] = use(
        "CLARIAH/wp6-missieven:clone",
        checkout="clone",
        silent="deep",
        version=v,
    )
We walk through the slots of the source and target versions in parallel.
At each step we check whether the source and target slots have the same value for the trans feature.
If not, and one of them is empty or is merely the beginning or the end of the other, we advance on one side and try the next slot.
But if both are non-empty and neither is the beginning or the end of the other, we have a real problem: a mismatch.
In that case we stop, and you have to inspect what is happening.
def makeSlotMap():
    Fa = A[va].api.F
    Fb = A[vb].api.F
    transA = Fa.trans.v
    transB = Fb.trans.v
    maxSlotA = Fa.otype.maxSlot
    maxSlotB = Fb.otype.maxSlot

    print(
        f"""\
Computing slotMap between:
{va}: {maxSlotA:>8} slots,
{vb}: {maxSlotB:>8} slots.\
"""
    )

    # slotMap maps each source slot to a dict whose keys are target slots
    slotMap = {}
    good = True
    wA = 1
    wB = 1

    while wB <= maxSlotB and wA <= maxSlotA:
        textA = transA(wA) or ""
        textB = transB(wB) or ""
        if textA == textB:
            # identical text: map and advance on both sides
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
            wB += 1
        elif textA.startswith(textB):
            # target is empty or a leading piece of source: map, advance target
            slotMap.setdefault(wA, {})[wB] = None
            wB += 1
        elif textA.endswith(textB):
            # target is a trailing piece of source: advance both, record nothing
            wA += 1
            wB += 1
        elif textB.startswith(textA):
            # source is empty or a leading piece of target: map, advance source
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
        elif textB.endswith(textA):
            # source is a trailing piece of target: map and advance both
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
            wB += 1
        else:
            print("Mismatch:")
            print(f"A: {wA:>8} = `{textA}`")
            print(f"B: {wB:>8} = `{textB}`")
            good = False
            break

    # default=0 avoids a ValueError when nothing could be mapped at all
    maxSlotMap = max(slotMap, default=0)
    if maxSlotMap > maxSlotA:
        print(f"maxSlot in A version {va} exceeded")
        print(f"Found {maxSlotMap}, but it should be <= {maxSlotA}")
        good = False

    if good:
        print(
            f"""\
slotMap succesfully created: {len(slotMap)} slots mapped.
"""
        )
    return slotMap
slotMap = makeSlotMap()
Computing slotMap between:
0.8.1:  5316429 slots,
1.0:  5977367 slots.
slotMap succesfully created: 5316429 slots mapped.
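The behaviour of the loop is easier to see on toy data. The following sketch mirrors the branches of `makeSlotMap` on two plain lists of strings, without Text-Fabric. The function name `alignTokens` and the token lists are hypothetical, and raising an exception on a mismatch (instead of printing and stopping) is a choice made for this sketch only.

```python
def alignTokens(tokensA, tokensB):
    """Align two token lists, using 1-based positions like slot numbers."""
    slotMap = {}
    wA = 1
    wB = 1
    while wA <= len(tokensA) and wB <= len(tokensB):
        textA = tokensA[wA - 1] or ""
        textB = tokensB[wB - 1] or ""
        if textA == textB:
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
            wB += 1
        elif textA.startswith(textB):
            slotMap.setdefault(wA, {})[wB] = None
            wB += 1
        elif textA.endswith(textB):
            wA += 1
            wB += 1
        elif textB.startswith(textA):
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
        elif textB.endswith(textA):
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
            wB += 1
        else:
            raise ValueError(f"Mismatch: A {wA}=`{textA}` B {wB}=`{textB}`")
    return slotMap


# toy data: the source has an empty slot that the target no longer has
tokensA = ["Generaal", "", "aan", "boord"]
tokensB = ["Generaal", "aan", "boord"]

aligned = alignTokens(tokensA, tokensB)
simplified = {a: sorted(bs) for (a, bs) in aligned.items()}
print(simplified)  # {1: [1], 2: [2], 3: [2], 4: [3]}
```

The empty source slot 2 is mapped onto the next non-empty target slot, and from there on the target slot numbers lag one behind the source slot numbers.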
Note that as of version 1.0 volume 14 has been included.
So we expect a discrepancy there.
And of course, we will not have entity feature values for volume 14.
When we encounter problems, we can do a bit of checking to see what is going on.
The next function shows the line around a slot node, and can do so in both versions.
def show(v, n):
    F = A[v].api.F
    L = A[v].api.L
    T = A[v].api.T
    # find the line containing slot n; if there is none, try its neighbours
    lines = L.u(n, otype="line")
    if not lines:
        lines = L.u(n + 1, otype="line")
    if not lines:
        lines = L.u(n - 1, otype="line")
    if not lines:
        print("no such line")
        return
    line = lines[0]
    print(T.sectionFromNode(line))
    words = L.d(line, otype="word")
    print(" ".join(f"[{w}={F.trans.v(w)}]" for w in words))
    print(T.text(line))
show(va, 49)
show(vb, 49)
(1, 3, 4)
[46=Generaal] [47=aan] [48=boord] [49=vertoefde] [50=verandert] [51=daaraan] [52=niets] [53=Ile] [54=de] [55=Mayo] [56=is] [57=een] [58=der] [59=Kaap] [60=Verdische]
Generaal aan boord vertoefde, verandert daaraan niets. Ile de Mayo is een der Kaap-Verdische
(1, 3, 4)
[45=Generaal] [46=aan] [47=boord] [48=vertoefde] [49=verandert] [50=daaraan] [51=niets] [52=Ile] [53=de] [54=Mayo] [55=is] [56=een] [57=der] [58=Kaap] [59=Verdische]
Generaal aan boord vertoefde, verandert daaraan niets. Ile de Mayo is een der Kaap-Verdische
We now extend the slotMap to a full node map.
See dataset.Versions in the Text-Fabric documentation.
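As a rough intuition for how a slot map can be extended to other nodes, consider the following simplified sketch. It is not the algorithm in `tf.dataset.Versions`: the function `mapNode`, the labels it returns and the toy data are all hypothetical. The assumption is only that a node is characterised by its set of slots, so that target candidates can be found by pushing those slots through the slot map and measuring how much the slot sets differ.

```python
def mapNode(slotsA, slotMap, nodesB):
    """Find target nodes for one source node by way of its slots.

    slotsA: set of source slots of the node
    slotMap: dict of source slot -> list of target slots
    nodesB: dict of target node -> set of target slots (same node type)
    Returns (label, candidates); candidates are (dissimilarity, node) pairs.
    """
    mapped = set()
    for slot in slotsA:
        mapped.update(slotMap.get(slot, ()))
    if not mapped:
        return ("not mapped", [])

    candidates = []
    for node, slotsB in nodesB.items():
        if mapped & slotsB:
            # symmetric difference: slots that occur in only one of the two sets
            candidates.append((len(mapped ^ slotsB), node))
    candidates.sort()
    if not candidates:
        return ("not mapped", [])

    best = candidates[0][0]
    label = "perfect" if best == 0 and len(candidates) == 1 else "imperfect"
    return (label, candidates)


# hypothetical toy data
slotMap = {1: [1], 2: [2], 3: [2], 4: [3]}
nodesB = {100: {1, 2, 3}, 101: {4, 5}}

print(mapNode({1, 2, 3, 4}, slotMap, nodesB))
print(mapNode({4, 5}, slotMap, nodesB))
print(mapNode({9}, slotMap, nodesB))
```

In this toy setting, a node whose mapped slots coincide exactly with a single target node gets a perfect match, a node with only partial overlap gets an imperfect one, and a node none of whose slots are mapped gets none.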
V = Versions({v: A[v].api for v in (va, vb)}, va, vb, silent="auto", slotMap=slotMap)
V.makeVersionMapping()
12s Mapping volume nodes 0.8.1 ==> 1.0
23s Statistics for 0.8.1 ==> 1.0 (volume)
    | TOTAL                      : 100.00%       13x
    | unique, perfect            :  92.31%       12x
    | multiple, non-perfect      :   7.69%        1x
23s Mapping letter nodes 0.8.1 ==> 1.0
34s Statistics for 0.8.1 ==> 1.0 (letter)
    | TOTAL                      : 100.00%      589x
    | unique, perfect            :  92.36%      544x
    | multiple, non-perfect      :   7.64%       45x
34s Mapping page nodes 0.8.1 ==> 1.0
45s Statistics for 0.8.1 ==> 1.0 (page)
    | TOTAL                      : 100.00%    10149x
    | unique, perfect            :  98.51%     9998x
    | unique, imperfect          :   0.10%       10x
    | multiple, non-perfect      :   1.39%      141x
45s Mapping table nodes 0.8.1 ==> 1.0
46s Statistics for 0.8.1 ==> 1.0 (table)
    | TOTAL                      : 100.00%      322x
    | unique, perfect            :  82.61%      266x
    | unique, imperfect          :  17.08%       55x
    | multiple, non-perfect      :   0.31%        1x
46s Mapping para nodes 0.8.1 ==> 1.0
53s Statistics for 0.8.1 ==> 1.0 (para)
    | TOTAL                      : 100.00%    33881x
    | unique, perfect            :  95.55%    32374x
    | unique, imperfect          :   3.51%     1188x
    | multiple, cleanly composed :   0.02%        8x
    | multiple, non-perfect      :   0.91%      307x
    | not mapped                 :   0.01%        4x
53s Mapping remark nodes 0.8.1 ==> 1.0
57s Statistics for 0.8.1 ==> 1.0 (remark)
    | TOTAL                      : 100.00%    22922x
    | unique, perfect            :  14.21%     3257x
    | unique, imperfect          :  85.79%    19664x
    | not mapped                 :   0.00%        1x
57s Mapping note nodes 0.8.1 ==> 1.0
57s Statistics for 0.8.1 ==> 1.0 (note)
    | TOTAL                      : 100.00%    12290x
    | unique, perfect            :   2.39%      294x
    | unique, imperfect          :  97.53%    11987x
    | multiple, cleanly composed :   0.01%        1x
    | multiple, non-perfect      :   0.07%        8x
57s Mapping line nodes 0.8.1 ==> 1.0
1m 10s Statistics for 0.8.1 ==> 1.0 (line)
    | TOTAL                      : 100.00%   465366x
    | unique, perfect            :  99.44%   462737x
    | unique, imperfect          :   0.04%      199x
    | multiple, cleanly composed :   0.05%      213x
    | multiple, non-perfect      :   0.48%     2217x
1m 10s Mapping row nodes 0.8.1 ==> 1.0
1m 10s Statistics for 0.8.1 ==> 1.0 (row)
    | TOTAL                      : 100.00%     4566x
    | unique, perfect            :  98.38%     4492x
    | unique, imperfect          :   1.14%       52x
    | multiple, non-perfect      :   0.48%       22x
1m 10s Mapping folio nodes 0.8.1 ==> 1.0
1m 10s Statistics for 0.8.1 ==> 1.0 (folio)
    | TOTAL                      : 100.00%     2551x
    | unique, perfect            :  98.28%     2507x
    | unique, imperfect          :   1.72%       44x
1m 10s Mapping cell nodes 0.8.1 ==> 1.0
1m 10s Statistics for 0.8.1 ==> 1.0 (cell)
    | TOTAL                      : 100.00%    20593x
    | unique, perfect            :  99.56%    20502x
    | unique, imperfect          :   0.17%       34x
    | multiple, cleanly composed :   0.11%       23x
    | multiple, non-perfect      :   0.15%       31x
    | not mapped                 :   0.01%        3x
1m 10s Mapping subhead nodes 0.8.1 ==> 1.0
1m 10s Statistics for 0.8.1 ==> 1.0 (subhead)
    | TOTAL                      : 100.00%     1360x
    | unique, perfect            :  99.93%     1359x
    | multiple, non-perfect      :   0.07%        1x
1m 10s Write edge as TF feature omap@0.8.1-1.0
0.00s Exporting 0 node and 1 edge and 0 config features to ~/github/CLARIAH/wp6-missieven/tf/1.0:
    | 8.80s T omap@0.8.1-1.0 to ~/github/CLARIAH/wp6-missieven/tf/1.0
8.80s Exported 0 node features and 1 edge features and 0 config features to ~/github/CLARIAH/wp6-missieven/tf/1.0
Now we return to the entity features.
The node map is not perfect, but we did not expect it to be.
We migrate the entity features nevertheless.
Remember that they are not in the corpus, but in a third-party module of features.
For the sake of persistence, I have copied the features to this repo in directory voc-missives/export/tf.
We load the older version of the corpus again, now with the entity features for that version. We leave the loaded newer version of the corpus in memory.
# THIRD_PARTY = "cltl/voc-missives/export/tf"
THIRD_PARTY = "CLARIAH/wp6-missieven/voc-missives/export/tf"
api = {}
api[vb] = A[vb].api
A[va] = use(
    "CLARIAH/wp6-missieven:clone",
    checkout="clone",
    mod=f"{THIRD_PARTY}:clone",
    silent="deep",
    version=va,
)
api[va] = A[va].api
We are going to produce the upgraded entities in voc-missives/migrated/tf.
V = Versions(api, va, vb, silent="auto")
V.migrateFeatures(
    ("entityId", "entityKind"),
    location="~/github/CLARIAH/wp6-missieven/voc-missives/migrated/tf",
)
20s start migrating
    | 47s T omap@0.8.1-1.0 from ~/github/CLARIAH/wp6-missieven/tf/1.0
48s All additional features loaded - for details use TF.isLoaded()
48s Mapping entityId (node)
48s Mapping entityKind (node)
0.00s Exporting 2 node and 0 edge and 0 config features to ~/github/CLARIAH/wp6-missieven/voc-missives/migrated/tf/1.0:
    | 0.03s T entityId to ~/github/CLARIAH/wp6-missieven/voc-missives/migrated/tf/1.0
    | 0.03s T entityKind to ~/github/CLARIAH/wp6-missieven/voc-missives/migrated/tf/1.0
0.05s Exported 2 node features and 0 edge features and 0 config features to ~/github/CLARIAH/wp6-missieven/voc-missives/migrated/tf/1.0
0.05s Done
A = use(
    "CLARIAH/wp6-missieven:clone",
    checkout="clone",
    mod="CLARIAH/wp6-missieven/voc-missives/migrated/tf:clone",
    hoist=globals(),
    version="1.0",
    silent="verbose",
)
This is Text-Fabric 10.2.6
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
47 features found and 2 ignored
5.47s All features loaded/computed - for details use TF.isLoaded()
    | 0.17s T entityId from ~/github/CLARIAH/wp6-missieven/voc-missives/migrated/tf/1.0
    | 0.14s T entityKind from ~/github/CLARIAH/wp6-missieven/voc-missives/migrated/tf/1.0
0.75s All additional features loaded - for details use TF.isLoaded()
Note that you can click the triangle before CLARIAH/wp6-missieven/voc-missives/migrated/tf, to see which features are used. You can then click further on the triangle before the feature data type, to see more information about that feature, including the fact that it is an upgraded feature.
creator: Sophie Arnoult
dateWritten: 2022-10-11T10:42:45Z
upgraded: ‼️ from version 0.8.1 to 1.0
writtenBy: Text-Fabric
We are going to do a bit of research into the upgraded features.
F.entityId.freqList()[0:20]
(('e_n12_2_632', 8), ('e_n13_15_2306', 8), ('e_n7_8_809', 8), ('e_n13_15_1302', 7), ('e_n7_8_1080', 7), ('e_t10_15_108', 7), ('e_t10_15_273', 7), ('e_n10_11_715', 6), ('e_n12_14_130', 6), ('e_n12_2_383', 6), ('e_n12_2_578', 6), ('e_n13_15_154', 6), ('e_n13_15_1582', 6), ('e_n13_15_1894', 6), ('e_n13_15_285', 6), ('e_n5_28_103', 6), ('e_n5_28_34', 6), ('e_n5_28_675', 6), ('e_n5_7_515', 6), ('e_n8_6_710', 6))
len(F.entityId.freqList())
24500
F.entityKind.freqList()
(('LOC', 12790), ('PER', 10393), ('LOCderiv', 4279), ('ORG', 3841), ('SHP', 2922), ('GPE', 1153), ('RELderiv', 261), ('ORGpart', 58), ('LOCpart', 45), ('RELpart', 28), ('REL', 19))
query = """
word entityId entityKind*
"""
results = A.search(query)
1.96s 32249 results
A.show(results, condensed=True, end=10)
[Rendered display of the first 10 condensed results, line 1 to line 10, not reproduced here.]
Let's view the distribution of named entities over the volumes.
We run a query looking for words with a named entity within a volume.
query = """
volume
word entityId
"""
results = A.search(query)
2.46s 32249 results
Now we process the results, which are tuples consisting of a volume node and a word node.
eDist = collections.Counter()

for (vol, word) in results:
    eDist[F.n.v(vol)] += 1

eDist
Counter({1: 1451, 2: 1150, 3: 1536, 4: 1758, 5: 3032, 6: 2864, 7: 2343, 8: 1685, 9: 1870, 10: 1933, 11: 4695, 12: 1909, 13: 6023})
It is apparent that there are no entities in volume 14, because version 0.8.1 had no volume 14.
So it is preferable that the third party repeats the entity recognition on the new version of the corpus, so that the entities in volume 14 get recognized too.
This has in fact happened. Sophie Arnoult has run the machinery again.
Let's quickly load that version and compute the distribution of entities there.
A = use(
    "CLARIAH/wp6-missieven:clone",
    checkout="clone",
    mod="CLARIAH/wp6-missieven/voc-missives/export/tf:clone",
    hoist=globals(),
    version="1.0",
    silent="verbose",
)
This is Text-Fabric 10.2.6
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
47 features found and 2 ignored
5.66s All features loaded/computed - for details use TF.isLoaded()
0.54s All additional features loaded - for details use TF.isLoaded()
results = A.search(query)
eDist = collections.Counter()

for (vol, word) in results:
    eDist[F.n.v(vol)] += 1

eDist
2.46s 29159 results
Counter({1: 2208, 2: 1799, 3: 2297, 4: 1987, 5: 1858, 6: 4295, 7: 3068, 8: 1251, 9: 2825, 10: 1745, 11: 2455, 12: 1350, 13: 977, 14: 1044})
Clearly, there have been additional changes leading to a very different version 1.0 than 0.8.1, so at this point in time the migrated features (from 0.8.1 to 1.0) are practically obsolete.
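To see how far the two distributions diverge, we can put them side by side. The sketch below copies the two counters printed above into plain Python, so it runs without Text-Fabric. The variable names `migrated` and `rerun` are introduced here for illustration only.

```python
import collections

# entity words per volume, copied from the two outputs above
migrated = collections.Counter(
    {1: 1451, 2: 1150, 3: 1536, 4: 1758, 5: 3032, 6: 2864, 7: 2343,
     8: 1685, 9: 1870, 10: 1933, 11: 4695, 12: 1909, 13: 6023}
)
rerun = collections.Counter(
    {1: 2208, 2: 1799, 3: 2297, 4: 1987, 5: 1858, 6: 4295, 7: 3068,
     8: 1251, 9: 2825, 10: 1745, 11: 2455, 12: 1350, 13: 977, 14: 1044}
)

for vol in sorted(set(migrated) | set(rerun)):
    old = migrated[vol]  # a Counter yields 0 for a missing volume
    new = rerun[vol]
    print(f"volume {vol:>2}: migrated {old:>5}  re-run {new:>5}  difference {new - old:>+6}")

print(f"total    : migrated {sum(migrated.values())}  re-run {sum(rerun.values())}")
```

The totals agree with the result counts of the two queries (32249 and 29159), and the per-volume differences go in both directions, which is what we would expect if the entity recognition itself changed and not just the corpus.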
CC-BY Dirk Roorda