You might want to start with the `start` tutorial of this corpus first. There are also short introductions to other TF datasets.
Consider the semantic actor features in ch-jensen/participants/actor/tf.
There we see only features for version `c` of the BHSA, but we prefer to work with version `2021` of the BHSA.
When we try to load the features by simply saying
A = use("ETCBC/bhsa", mod="ch-jensen/participants/actor/tf")
we have no luck, because there is no ch-jensen/participants/actor/tf/2021
on GitHub.
But one of the features in the BHSA is `omap@c-2021.tf`, which contains the information to map all nodes in version `c` to the nodes of version `2021`, as faithfully as is reasonably possible.
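Conceptually, migrating a feature through such a node map is a relabelling of annotated nodes. Here is a toy sketch in plain Python; it is not the actual `tf.dataset.nodemaps` implementation, and `migrate_feature` plus the node numbers are made up for illustration:

```python
def migrate_feature(feature_old, node_map):
    """Carry a node feature from an old version to a new one.

    feature_old: dict old_node -> value
    node_map:    dict old_node -> list of counterpart nodes in the new version
    Returns dict new_node -> value. Old nodes without a counterpart are
    silently dropped, which is exactly where annotations get lost.
    """
    feature_new = {}
    for old_node, value in feature_old.items():
        for new_node in node_map.get(old_node, ()):
            feature_new[new_node] = value
    return feature_new

# toy example: node 3 has no counterpart in the new version
old_actor = {1: "ISR", 2: "JHWH", 3: "CNH"}
mapping = {1: [11], 2: [12, 13], 3: []}
print(migrate_feature(old_actor, mapping))
# {11: 'ISR', 12: 'JHWH', 13: 'JHWH'} - the value of node 3 is lost
```

Note that a single old node may fan out to several new nodes, and an unmapped old node silently drops its annotation; both effects show up later in this notebook.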
My homework as Text-Fabric developer is to make the statement above work, by steering Text-Fabric to download version `c` and to use the mapping feature to produce upgraded data in the right place.
But I have not got round to that yet.
So, here is what you can do about it 😎.
Below we take you through the upgrade process by hand and evaluate how well it fares.
%load_ext autoreload
%autoreload 2
The ins and outs of installing Text-Fabric, getting the corpus, and initializing a notebook are explained in the start tutorial.
import collections
from tf.app import use
from tf.fabric import Fabric
from tf.dataset.nodemaps import Versions
We need the current version (`2021`) of the BHSA anyway, so we are going to load it.
We will have two versions of the corpus in our notebook and in our variables, so it is handy to have a consistent naming scheme:
N (the now version): 2021
P (the previous version): c
N = use("ETCBC/bhsa")
We have forked Christian's repo to etcbc/participants, so make sure to clone it to your computer:
cd ~/github/etcbc
git clone https://github.com/ETCBC/participants
LOCATION = "data:~/github/etcbc/participants/actor/tf"
Now we can load the actor features for version `c`.
P = use(LOCATION, version="c")
By clicking the triangles you can find more information about these features.
We are going to upgrade the participant features from version `c` to version `2021`.
For that, we use tf.dataset.nodemaps.Versions.
We initialize the Versions object with two text-fabric API objects:
apis = {"2021": N.api, "c": P.api}
V = Versions(apis, "c", "2021")
Finally we migrate the features from "c" to "2021" and save them in the correct location.
We skip the `otext` feature, since it is a special config feature, not a data feature made by Christian.
V.migrateFeatures(("actor", "coref", "prs_actor"), location=LOCATION)
49s start migrating
0.03s Done
Here it is handy to make the migration a bit more verbose. We do it again:
V.migrateFeatures(("actor", "coref", "prs_actor"), location=LOCATION, silent="auto")
57s start migrating
0.32s All additional features loaded - for details use TF.isLoaded()
0.32s Mapping actor (node)
0.33s Mapping coref (edge)
0.40s Mapping prs_actor (node)
0.00s Exporting 2 node and 1 edge and 0 config features to data:~/github/etcbc/participants/actor/tf/2021:
| 0.00s T actor to data:~/github/etcbc/participants/actor/tf/2021
| 0.00s T prs_actor to data:~/github/etcbc/participants/actor/tf/2021
| 0.03s T coref to data:~/github/etcbc/participants/actor/tf/2021
0.03s Exported 2 node features and 1 edge features and 0 config features to data:~/github/etcbc/participants/actor/tf/2021
0.03s Done
Now we are in a position that we can load version 2021 of the BHSA together with the migrated module of participant features.
Note that we point Text-Fabric to the forked repo (`etcbc` instead of `ch-jensen`) and then to our local clone (`:clone`).
We increase the verbosity, in order to display more metadata of the features.
N = use("etcbc/bhsa", mod="etcbc/participants/actor/tf:clone", silent="verbose")
This is Text-Fabric 10.2.0
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
125 features found and 0 ignored
0.67s Dataset without structure sections in otext:no structure functions in the T-API
2.18s All features loaded/computed - for details use TF.isLoaded()
1.48s All additional features loaded - for details use TF.isLoaded()
If you click the triangles and navigate to the full metadata of the participants features, you see a line
upgraded: ‼️ from version c to 2021
Let's do a few checks to see how well the upgrade process has worked.
First we load the `c` version of the BHSA and Christian's original features.
P = use("etcbc/bhsa", mod="ch-jensen/participants/actor/tf", version="c")
Below we are going to peek into the corpus by means of pretty displays. Here we tweak what is displayed and in what style.
N.load("omap@c-2021", silent="deep")
N.isLoaded("omap@c-2021")
hiddenTypes="half_verse,sentence_atom,clause,clause_atom"
N.displaySetup(hiddenTypes=hiddenTypes, condenseType="sentence", withNodes=True, fmt="text-phono-full")
P.displaySetup(hiddenTypes=hiddenTypes, condenseType="sentence", withNodes=True, fmt="text-phono-full")
omap@c-2021 edge (int) ⚠️ Maps the nodes of version c to 2021
What are the node types that have an actor value?
{P.api.F.otype.v(n) for n in P.api.N.walk() if P.api.F.actor.v(n) is not None}
{'phrase_atom', 'subphrase'}
{N.api.F.otype.v(n) for n in N.api.N.walk() if N.api.F.actor.v(n) is not None}
{'phrase_atom', 'subphrase'}
Let's inspect the frequency lists of actor, per node type.
for otype in ("phrase_atom", "subphrase"):
    frequenciesN = N.api.F.actor.freqList(nodeTypes={otype})
    frequenciesP = P.api.F.actor.freqList(nodeTypes={otype})
    freqDictN = {v: f for (v, f) in frequenciesN}
    freqDictP = {v: f for (v, f) in frequenciesP}
    goodOnes = []
    badOnes = []
    for v in sorted(set(freqDictN) | set(freqDictP)):
        fN = freqDictN.get(v, 0)
        fP = freqDictP.get(v, 0)
        if fN == fP:
            goodOnes.append(v)
        else:
            badOnes.append((v, fN, fP))
    print(f"\nComparing frequencies on {otype}s: {len(goodOnes)} OK; {len(badOnes)} discrepancies")
    for (v, fN, fP) in badOnes[0:100]:
        print(f"{fN:>3} {fP:>3} {v}")
Comparing frequencies on phrase_atoms: 361 OK; 2 discrepancies
 91  94 >JC
  7   9 CNH

Comparing frequencies on subphrases: 135 OK; 0 discrepancies
Most actors on phrase atoms carry over well. But e.g. CNH has discrepancies.
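The comparison loop above can be packaged as a small reusable helper. This is merely a convenience sketch over plain (value, frequency) lists of the shape that `freqList()` returns; the helper name and the toy data are our own:

```python
def compare_freqs(freqsN, freqsP):
    """Compare two (value, frequency) lists.

    Returns (good, bad): values with equal counts in both lists,
    and (value, fN, fP) triples for the discrepancies.
    """
    dN = dict(freqsN)
    dP = dict(freqsP)
    good, bad = [], []
    for v in sorted(set(dN) | set(dP)):
        fN, fP = dN.get(v, 0), dP.get(v, 0)
        if fN == fP:
            good.append(v)
        else:
            bad.append((v, fN, fP))
    return good, bad

# toy data mimicking the CNH discrepancy
good, bad = compare_freqs([("CNH", 7), ("ISR", 5)], [("CNH", 9), ("ISR", 5)])
print(good, bad)
# ['ISR'] [('CNH', 7, 9)]
```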
Let's get a feel of why we get the discrepancies.
actorCNH = """
phrase_atom
actor=CNH
"""
resultsN = N.search(actorCNH)
resultsP = P.search(actorCNH)
0.09s 7 results
0.09s 9 results
N.table(resultsN)
P.table(resultsP)
n | p | phrase_atom |
---|---|---|
1 | Leviticus 25:11 | 945873tihyˈeh |
2 | Leviticus 25:12 | 945886yôvˈēl |
3 | Leviticus 25:12 | 945887hˈiw |
4 | Leviticus 25:12 | 945888qˌōḏeš |
5 | Leviticus 25:12 | 945889tihyˈeh |
6 | Leviticus 25:51 | 946353baššānˈîm |
7 | Leviticus 25:52 | 946362baššānˈîm |
n | p | phrase_atom |
---|---|---|
1 | Leviticus 25:10 | 945830šānˈā |
2 | Leviticus 25:11 | 945851šānˌā |
3 | Leviticus 25:11 | 945852tihyˈeh |
4 | Leviticus 25:12 | 945865yôvˈēl |
5 | Leviticus 25:12 | 945866hˈiw |
6 | Leviticus 25:12 | 945867qˌōḏeš |
7 | Leviticus 25:12 | 945868tihyˈeh |
8 | Leviticus 25:51 | 946332baššānˈîm |
9 | Leviticus 25:52 | 946341baššānˈîm |
Clearly, there is something interesting in Leviticus 25, verses 10 and 11.
We compare verse 10 in both versions.
Here are the original actors in version `c`:
P.show(resultsP, start=1, end=1, condensed=True)
sentence 1
Let's find the same sentence in version `2021`.
sP = 1181939
mappedSb = N.api.Es("omap@c-2021").f(sP)
mappedSb
((1181957, None),)
N.pretty(mappedSb[0][0])
Aha: in version 2021 there is no counterpart of the phrase atom 945830, the one which carried `actor=CNH`.
This phrase atom has morphed into a subphrase, and hence we lose the connection and this particular annotation.
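Losses of this kind can be detected up front: any old node that carries an annotation but maps to nothing will drop its annotation. A toy sketch (the helper name and the data shapes are hypothetical, mirroring the `omap` edge as a plain dict of old node to counterpart list):

```python
def unmigratable(feature_old, node_map):
    """Old nodes that carry an annotation but have no counterpart
    in the new version: their annotations are lost in migration."""
    return sorted(n for n in feature_old if not node_map.get(n))

# node 945830 (actor=CNH) maps to nothing, as in the case above
actor_c = {945830: "CNH", 945851: "CNH"}
omap = {945830: [], 945851: [945873]}
print(unmigratable(actor_c, omap))
# [945830]
```

Running such a check over all annotated nodes before migrating gives an inventory of the annotations that cannot survive the upgrade.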
coref
We also have an edge feature in the module: `coref`. Let's test that as well.
First we explore the edge feature a little bit: from which node type to which node type do the edges go?
We constrain our displays to phrases from now on.
N.displaySetup(condenseType="phrase")
P.displaySetup(condenseType="phrase")
nodeTypes = collections.Counter()
for (f, ts) in P.api.E.coref.items():
    fromType = P.api.F.otype.v(f)
    for t in ts:
        toType = P.api.F.otype.v(t)
        nodeTypes[(fromType, toType)] += 1
nodeTypes
Counter({('word', 'subphrase'): 471, ('word', 'phrase_atom'): 20254, ('word', 'word'): 19884, ('phrase_atom', 'phrase_atom'): 34404, ('phrase_atom', 'subphrase'): 1621, ('phrase_atom', 'word'): 20254, ('subphrase', 'word'): 471, ('subphrase', 'subphrase'): 1086, ('subphrase', 'phrase_atom'): 1621})
The coref
relation seems to be symmetrical, so when we check cases, we can skip a number
of pairs.
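Whether the relation really is symmetric can be verified directly on the edge data. A sketch over a plain dict in the shape that `E.coref.items()` yields (the helper name is our own):

```python
def is_symmetric(edges):
    """Check that an edge relation (node -> iterable of targets)
    contains the reverse of every link."""
    pairs = {(f, t) for (f, ts) in edges.items() for t in ts}
    return all((t, f) in pairs for (f, t) in pairs)

print(is_symmetric({1: [2], 2: [1, 3], 3: [2]}))  # True
print(is_symmetric({1: [2], 2: []}))              # False: 2 -> 1 is missing
```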
done = set()
for (fromType, toType) in nodeTypes:
    if (fromType, toType) in done:
        continue
    done.add((fromType, toType))
    done.add((toType, fromType))
    print(f"{fromType:<15} - {toType:<15}")
    template = f"""
{fromType}
-coref> {toType}
"""
    resultsN = N.search(template)
    resultsP = P.search(template)
    goodOnes = []
    badOnes = []
    phonoN = lambda n: N.api.T.text(n, fmt="text-phono-full")
    phonoP = lambda n: P.api.T.text(n, fmt="text-phono-full")
    for ((fN, tN), (fP, tP)) in zip(resultsN, resultsP):
        fNp = phonoN(fN)
        fPp = phonoP(fP)
        tNp = phonoN(tN)
        tPp = phonoP(tP)
        if fNp == fPp and tNp == tPp:
            goodOnes.append(f"{fNp} => {tNp}")
        else:
            fDif = fNp if fNp == fPp else f"{fNp} != {fPp}"
            tDif = tNp if tNp == tPp else f"{tNp} != {tPp}"
            badOnes.append((f"{fDif} => {tDif}", fN, fP, tN, tP))
    print(f"good: {len(goodOnes):>5}\nbad : {len(badOnes):>5}")
    if len(goodOnes):
        print("Good:")
        for rep in goodOnes[0:3]:
            print(f"\t{rep}")
    if len(badOnes):
        print("Bad:")
        for (rep, fN, fP, tN, tP) in badOnes[0:3]:
            print(f"\t{rep} {fN} {fP} => {tN} {tP}")
    print("-" * 40)
    print("")
word            - subphrase
0.09s 471 results
0.08s 471 results
good:   471
bad :     0
Good:
	bānˈāʸw => ʔˈel-ʔahᵃrˈōn
	zivḥêhem => bᵊnˈê yiśrāʔˈēl
	zivḥêhem => mibbᵊnˈê yiśrāʔˈēl
----------------------------------------
word            - phrase_atom
0.17s 20188 results
0.16s 20254 results
good:  3785
bad : 16403
Good:
	ʔᵃlêhˈem => ʔˈel-ʔahᵃrˈōn wᵊʔel-bānˈāʸw wᵊʔˌel kol-bᵊnˈê yiśrāʔˈēl
	hᵉvîʔˌô => šˌôr ʔô-ḵˈeśev ʔô-ʕˌēz
	ʕammˈô . => ʔˌîš ʔîš
Bad:
	zzarʕˈô => ʔˈîš ʔîš != ʔˈîš 64423 64422 => 944121 944096
	zzarʕˈô => yittˈēn != ʔîš 64423 64422 => 944127 944097
	zzarʕˈô => yûmˈāṯ != yittˈēn 64423 64422 => 944131 944103
----------------------------------------
word            - word
0.22s 19884 results
0.22s 19884 results
good: 19884
bad :     0
Good:
	zivḥêhem => zivḥêhˈem
	zivḥêhem => lāhˌem
	zivḥêhem => ḏōrōṯˈām .
----------------------------------------
phrase_atom     - phrase_atom
0.16s 34215 results
0.16s 34404 results
good:   745
bad : 33470
Good:
	yᵊḏabbˌēr => [yᵊhwˌāh]
	yᵊḏabbˌēr => llēʔmˈōr .
	yᵊḏabbˌēr => ṣiwwˌā
Bad:
	ʔˌîš ʔˈîš != ʔˌîš => ʔˌîš ʔˈîš != ʔˈîš 943311 943285 => 943311 943286
	mibbˈêṯ yiśrāʔˈēl ûmin-haggˌēr != ʔˈîš => mibbˈêṯ yiśrāʔˈēl ûmin-haggˌēr != ʔˌîš 943312 943286 => 943292 943285
	ggˈār != mibbˈêṯ yiśrāʔˈēl ûmin-haggˌēr => yāḡˈûr != mibbˈêṯ yiśrāʔˈēl ûmin-haggˌēr 943314 943287 => 943294 943266
----------------------------------------
phrase_atom     - subphrase
0.06s 1599 results
0.07s 1621 results
good:   220
bad :  1379
Good:
	yᵊḏabbˌēr => [yᵊhwˈāh]
	yᵊḏabbˌēr => [yᵊhwˈāh]
	[yᵊhwˌāh] => [yᵊhwˈāh]
Bad:
	ʔˌîš ʔˈîš != ʔˌîš => ʔîš 943311 943285 => 1317262 1317261
	ʔˌîš ʔˈîš != ʔˌîš => ʔˈîš 943311 943285 => 1317334 1317331
	ggˈār != ʔˈîš => min-haggˌēr != ʔîš 943314 943286 => 1317308 1317261
----------------------------------------
subphrase       - subphrase
0.05s 1086 results
0.04s 1086 results
good:  1086
bad :     0
Good:
	bᵊnˈê yiśrāʔˈēl => mibbᵊnˈê yiśrāʔˈēl
	yiśrāʔˈēl => yiśrāʔˈēl
	yiśrāʔˈēl => yiśrāʔˈēl
----------------------------------------
All `coref` links between words and subphrases match perfectly.
But where phrase atoms are involved, we get bad ones, sometimes more bad ones than good ones.
We inspect a few bad cases.
zzarʕˈô => ʔˈîš ʔîš != ʔˈîš 64423 64422 => 944121 944096
fP = 64422
tP = 944096
pfP = P.api.L.u(fP, otype="phrase")[0]
ptP = P.api.L.u(tP, otype="phrase")[0]
highlightsP = {fP: "orange", tP: "cyan"}
fN = 64423
tN = 944121
pfN = N.api.L.u(fN, otype="phrase")[0]
ptN = N.api.L.u(tN, otype="phrase")[0]
highlightsN = {fN: "orange", tN: "cyan"}
# original `coref` link
P.pretty(pfP, highlights=highlightsP)
if pfP != ptP:
    P.pretty(ptP, highlights=highlightsP)
# mapped `coref` link
N.pretty(pfN, highlights=highlightsN)
if pfN != ptN:
    N.pretty(ptN, highlights=highlightsN)
Force majeure! The phrase atom in the original has changed. In the new version it is combined with its neighbour, and the two constituting parts are now subphrases.
ʔˌîš ʔˈîš != ʔˌîš => ʔˌîš ʔˈîš != ʔˈîš 943311 943285 => 943311 943286
fP = 943285
tP = 943286
pfP = P.api.L.u(fP, otype="phrase")[0]
ptP = P.api.L.u(tP, otype="phrase")[0]
highlightsP = {fP: "orange", tP: "cyan"}
fN = 943311
tN = 943311
pfN = N.api.L.u(fN, otype="phrase")[0]
ptN = N.api.L.u(tN, otype="phrase")[0]
highlightsN = {fN: "orange", tN: "cyan"}
P.pretty(pfP, highlights=highlightsP)
if pfP != ptP:
    P.pretty(ptP, highlights=highlightsP)
N.pretty(pfN, highlights=highlightsN)
if pfN != ptN:
    N.pretty(ptN, highlights=highlightsN)
The same kind of force majeure.
In this case the link was between two original phrase atoms.
In the new version these have merged into one phrase atom, and now there is a `coref` self-link!
ʔˌîš ʔˈîš != ʔˌîš => ʔîš 943311 943285 => 1317262 1317261
fP = 943285
tP = 1317261
pfP = P.api.L.u(fP, otype="phrase")[0]
ptP = P.api.L.u(tP, otype="phrase")[0]
highlightsP = {fP: "orange", tP: "cyan"}
fN = 943311
tN = 1317262
pfN = N.api.L.u(fN, otype="phrase")[0]
ptN = N.api.L.u(tN, otype="phrase")[0]
highlightsN = {fN: "orange", tN: "cyan"}
# original `coref` link
P.pretty(pfP, highlights=highlightsP)
if pfP != ptP:
    P.pretty(ptP, highlights=highlightsP)
# mapped `coref` link
N.pretty(pfN, highlights=highlightsN)
if pfN != ptN:
    N.pretty(ptN, highlights=highlightsN)
The same kind of force majeure.
Clearly, there has been a massive reorganization of phrase atoms in version `2021` as compared to version `c`.
It is great to be able to upgrade features from a version against which they have been created to a newer version. But the corpus may have been changed in unforeseen ways, and not every node in the old corpus can be necessarily matched with a unique node in the new corpus. If there are annotations on such nodes, then they either do not carry over to the new version, or they may carry over to unintended extra nodes in the new version.
We saw a lot of "bad" cases. And yet, all these discrepancies are really not that bad: the mapping has always picked the closest node in the new version that corresponds to the original node in the old version.
There are ways to detect such discrepancies, and the node mapping already has relevant information about the quality of the mapping.
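As the output `((1181957, None),)` above showed, mapping candidates come back as (node, quality) pairs, where None marks a clean, unique match. A hypothetical sketch of picking the best candidate from such a list; the assumption that a higher numeric value means a better match is ours and should be checked against the `tf.dataset.nodemaps` documentation:

```python
def best_candidate(candidates):
    """Pick the best target node from omap candidates [(node, quality), ...].

    Convention assumed here: quality None marks a perfect, unique match;
    otherwise a higher numeric value is taken to mean a better match.
    Returns None when the old node has no counterpart at all.
    """
    if not candidates:
        return None
    perfect = [n for (n, q) in candidates if q is None]
    if perfect:
        return perfect[0]
    return max(candidates, key=lambda c: c[1])[0]

print(best_candidate([(1181957, None)]))   # a clean match wins outright
print(best_candidate([(10, 1), (11, 5)]))  # otherwise the best-scoring node
print(best_candidate([]))                  # no counterpart: None
```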
In fact, the `migrateFeatures` function of Text-Fabric uses this quality information when it assigns feature values to nodes.
But nothing beats generating the features against the new version by the same code that generated them against the old version. If there are issues due to important version differences, the author of the generated feature knows best how to handle that.
CC-BY Dirk Roorda