
In [1]:

%load_ext autoreload
%autoreload 2

Make node mapping between two versions of the TF dataset

In this notebook we map the slot nodes from version 0.4 (the source version) to version 1.0 (the target version).

Basically, this means that we map all slots from the source version to the corresponding slots in the target version. The target version has more slots, because footnotes also occupy slots there. In the source version, by contrast, footnotes only appear inside feature values of the slot that precedes the footnote mark.

Some slots have an empty text (most of them contain some punctuation).

We do not want to be fussy about those slots. We map them onto corresponding empty slots if possible; otherwise we map them onto the nearest non-empty slot.

After establishing the slot mapping, we extend the mapping to all nodes in a generic way. The code for this is already in the TF library.

In [2]:

from tf.fabric import Fabric
from tf.dataset import Versions

from lib import TF_DIR

va = "0.4"
# vb = "0.9.1"
vb = "1.0"

We load the two versions of the TF data by means of the lower level Fabric method, and we only load the features we need.

In [3]:

TF = {}
api = {}
E = {}
Es = {}
F = {}
Fs = {}
L = {}
T = {}
maxSlot = {}
features = {
    va: "trans",
    vb: "trans isnote",
}

In [4]:

for v in (va, vb):
    TF[v] = Fabric(locations=TF_DIR, modules=v)
    api[v] = TF[v].load(features[v])

This is Text-Fabric 9.4.3
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

35 features found and 0 ignored
  3.83s All features loaded/computed - for details use TF.isLoaded()
This is Text-Fabric 9.4.3
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

43 features found and 0 ignored
  4.78s All features loaded/computed - for details use TF.isLoaded()

We put the parts of the TF API that we need in the various dictionaries.

In [5]:

for v in (va, vb):
    E[v] = api[v].E
    Es[v] = api[v].Es
    F[v] = api[v].F
    Fs[v] = api[v].Fs
    L[v] = api[v].L
    T[v] = api[v].T
    maxSlot[v] = F[v].otype.maxSlot

Make the slot mapping

We walk through the slots of the target version (1.0) and skip its footnote slots. For each target slot we advance through the slots of the source version (0.4) and check whether the source and target slots have the same value for the trans feature. If not, and one of them is empty, we skip the empty slot and try the next one. But if both are non-empty and unequal, we have a real problem: a mismatch. In that case we stop, and you have to inspect what is happening.

However, version 0.4 separates numbers and words imperfectly. So sometimes a word in one version corresponds to two slots in the other, and we have to split words.
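The walk described above can be illustrated on toy data. The sketch below is not the notebook's own function: it works on plain Python lists instead of TF features, and the names `align`, `source`, `target` and `notes` are made up for this illustration. It skips footnote slots in the target, pairs slots with equal text, and lets one source slot map to two target slots when its text has been split. Empty texts are not handled here, and, unlike `makeSlotMap` below, the sketch also records the second half of a split word.

```python
def align(source, target, notes):
    """Map 1-based source positions to lists of 1-based target positions.

    source, target: lists of slot texts (toy data).
    notes: set of target positions that are footnote slots
           (hypothetical stand-in for the isnote feature).
    """
    slot_map = {}
    a, b = 1, 1
    while a <= len(source) and b <= len(target):
        if b in notes:
            # footnote slots have no counterpart in the source
            b += 1
            continue
        text_a, text_b = source[a - 1], target[b - 1]
        if text_a == text_b:
            slot_map.setdefault(a, []).append(b)
            a += 1
            b += 1
        elif text_b and text_a.startswith(text_b):
            # target holds the first part of a split source word
            slot_map.setdefault(a, []).append(b)
            b += 1
        elif text_b and text_a.endswith(text_b):
            # target holds the last part of a split source word
            slot_map.setdefault(a, []).append(b)
            a += 1
            b += 1
        else:
            raise ValueError(f"mismatch at source {a} / target {b}")
    return slot_map


# toy data: the source glues a number to a word, the target splits them
source = ["betaald", "12stuivers", "contant"]
target = ["betaald", "zie", "noot", "12", "stuivers", "contant"]
notes = {2, 3}

result = align(source, target, notes)
print(result)
```

In this toy run the second source slot ends up mapped to two target slots, which is the situation that the split handling is meant to cover.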

In [6]:

def makeSlotMapOld():
    Fa = F[va]
    Fb = F[vb]
    transA = Fa.trans.v
    transB = Fb.trans.v
    isNote = Fb.isnote.v
    maxSlotA = maxSlot[va]
    maxSlotB = maxSlot[vb]

    print(
        f"""\
    Computing slotMap between:
    {va}: {maxSlotA:>8} slots,
    {vb}: {maxSlotB:>8} slots.\
"""
    )

    slotMap = {}

    good = True
    wA = 1
    emptyA = 0
    emptyB = 0

    for wB in range(1, maxSlotB + 1):
        if isNote(wB):
            continue
        textA = transA(wA) or ""
        textB = transB(wB) or ""

        if textB == "":
            if textA != "":
                emptyB += 1
                continue
        else:
            while textA == "" and wA < maxSlotA:
                wA += 1
                emptyA += 1
                textA = transA(wA) or ""

        if textA != textB:
            print("Mismatch:")
            print(f"A: {wA:>8} = `{textA}`")
            print(f"B: {wB:>8} = `{textB}`")
            good = False
            break

        if wA <= maxSlotA:
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
        else:
            if textB:
                print(f"No more slots in {va} to match slot {wB} in {vb}")
                break

    maxSlotMap = max(slotMap, default=0)
    if maxSlotMap > maxSlotA:
        print(f"maxSlot in A version {va} exceeded")
        print(f"Found {maxSlotMap}, but it should be <= {maxSlot[va]}")
        good = False

    if good:
        print(
            f"""\
slotMap successfully created: {len(slotMap)} slots mapped.
{va}: {emptyA:>6} empty slots,
{vb}: {emptyB:>6} empty slots.\
"""
        )
    return slotMap

In [7]:

def makeSlotMap():
    Fa = F[va]
    Fb = F[vb]
    transA = Fa.trans.v
    transB = Fb.trans.v
    isNote = Fb.isnote.v
    maxSlotA = maxSlot[va]
    maxSlotB = maxSlot[vb]

    print(
        f"""\
    Computing slotMap between:
    {va}: {maxSlotA:>8} slots,
    {vb}: {maxSlotB:>8} slots.\
"""
    )

    slotMap = {}

    good = True
    wA = 1
    wB = 1

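    while wB <= maxSlotB and wA <= maxSlotA:
        # footnote slots exist only in the target version: skip them
        if isNote(wB):
            wB += 1
            continue

        textA = transA(wA) or ""
        textB = transB(wB) or ""

        if textA == textB:
            # same text: map the slots one to one
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
            wB += 1

        elif textA.startswith(textB):
            # B holds the first part of A (or B is empty): map, stay on A
            slotMap.setdefault(wA, {})[wB] = None
            wB += 1
        elif textA.endswith(textB):
            # B holds the last part of A: move on in both versions
            # (this branch does not record wB in the map)
            wA += 1
            wB += 1

        elif textB.startswith(textA):
            # A holds the first part of B (or A is empty): map, stay on B
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
        elif textB.endswith(textA):
            # A holds the last part of B: map and move on in both versions
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
            wB += 1

        else:
            # unrelated texts: report the mismatch and stop
            print("Mismatch:")
            print(f"A: {wA:>8} = `{textA}`")
            print(f"B: {wB:>8} = `{textB}`")
            good = False
            break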

    maxSlotMap = max(slotMap, default=0)
    if maxSlotMap > maxSlotA:
        print(f"maxSlot in A version {va} exceeded")
        print(f"Found {maxSlotMap}, but it should be <= {maxSlot[va]}")
        good = False

    if good:
        print(
            f"""\
slotMap successfully created: {len(slotMap)} slots mapped.
"""
        )
    return slotMap

In [8]:

slotMap = makeSlotMap()

    Computing slotMap between:
    0.4:  5030444 slots,
    1.0:  5977367 slots.
slotMap successfully created: 5030444 slots mapped.

When we encounter problems, we can do a bit of checking to see what is going on.

The next function shows the line around a slot node, and can do so in both versions.

In [9]:

def show(v, n):
    lines = L[v].u(n, otype="line")
    if not lines:
        lines = L[v].u(n + 1, otype="line")
    if not lines:
        lines = L[v].u(n - 1, otype="line")
    if not lines:
        print("no such line")
        return
    line = lines[0]
    print(T[v].sectionFromNode(line))
    words = L[v].d(line, otype="word")
    print(" ".join(f"[{w}={F[v].trans.v(w)}]" for w in words))
    print(T[v].text(line))

In [10]:

show(va, 49)
show(vb, 96)

(1, None, 2)
[45=961] [46=copie] [47=5] [48=folio] [49=s]
961, copie, 5 folio's.
(1, 3, 4)
[95=„Journaelsgewijse] [96=reisbeschrijving]
„Journaelsgewijse" reisbeschrijving »

Make the complete node map

We now extend the slotMap to a full node map.

See dataset.Versions in the Text-Fabric documentation.
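The library does the real work here. The sketch below only illustrates the general idea on toy data and makes no claim about how `Versions` is implemented: the function `map_node` and the node names `t1` and `t2` are hypothetical. A source node is taken as a list of slots; its slots are pushed through the slot map, and target nodes are then compared with the resulting slot set. An identical slot set counts as a perfect match, and any overlap makes a target node a candidate.

```python
def map_node(slots, slot_map, target_nodes):
    """Push the slots of one source node through the slot map.

    slot_map has the same shape as in this notebook:
    {source_slot: {target_slot: None}}.
    target_nodes maps a (hypothetical) target node name to its slot set.
    Returns the mapped slot set, the perfect matches, and all candidates.
    """
    mapped = set()
    for s in slots:
        mapped.update(slot_map.get(s, {}))
    perfect = sorted(n for n, ts in target_nodes.items() if ts == mapped)
    overlapping = sorted(n for n, ts in target_nodes.items() if ts & mapped)
    return mapped, perfect, overlapping


# toy data: target slots 3 and 4 are footnote slots without a source slot
slot_map = {1: {1: None}, 2: {2: None}, 3: {5: None}, 4: {6: None}}
target_nodes = {"t1": {1, 2}, "t2": {3, 4, 5, 6}}

first = map_node([1, 2], slot_map, target_nodes)
second = map_node([3, 4], slot_map, target_nodes)
print(first)
print(second)
```

The first source node has a perfect counterpart. The second one only overlaps with a target node that also contains footnote slots, so its match is imperfect.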

In [11]:

V = Versions(api, va, vb, slotMap)

In [12]:

V.makeVersionMapping()

**********************************************************************************************
*                                                                                            *
*       0.00s Mapping volume nodes 0.4 ==> 1.0                                               *
*                                                                                            *
**********************************************************************************************

|       0.00s Extending slot mapping 0.4 ==> 1.0 for volume nodes
|         10s 	Done
..............................................................................................
.         10s Statistics for 0.4 ==> 1.0 (volume)                                            .
..............................................................................................
|         10s 	TOTAL                          : 100.00%      13x
|         10s 	unique, imperfect              : 100.00%      13x

**********************************************************************************************
*                                                                                            *
*         10s Mapping letter nodes 0.4 ==> 1.0                                               *
*                                                                                            *
**********************************************************************************************

|         10s Extending slot mapping 0.4 ==> 1.0 for letter nodes
|         20s 	Done
..............................................................................................
.         20s Statistics for 0.4 ==> 1.0 (letter)                                            .
..............................................................................................
|         20s 	TOTAL                          : 100.00%     589x
|         20s 	unique, perfect                :  21.05%     124x
|         20s 	unique, imperfect              :  77.93%     459x
|         20s 	multiple, non-perfect          :   1.02%       6x

**********************************************************************************************
*                                                                                            *
*         20s Mapping page nodes 0.4 ==> 1.0                                                 *
*                                                                                            *
**********************************************************************************************

|         20s Extending slot mapping 0.4 ==> 1.0 for page nodes
|         30s 	Done
..............................................................................................
.         30s Statistics for 0.4 ==> 1.0 (page)                                              .
..............................................................................................
|         30s 	TOTAL                          : 100.00%   10149x
|         30s 	unique, perfect                :  47.53%    4824x
|         30s 	unique, imperfect              :  51.63%    5240x
|         30s 	multiple, non-perfect          :   0.84%      85x

**********************************************************************************************
*                                                                                            *
*         30s Mapping table nodes 0.4 ==> 1.0                                                *
*                                                                                            *
**********************************************************************************************

|         30s Extending slot mapping 0.4 ==> 1.0 for table nodes
|         30s 	Done
..............................................................................................
.         30s Statistics for 0.4 ==> 1.0 (table)                                             .
..............................................................................................
|         30s 	TOTAL                          : 100.00%     322x
|         30s 	unique, perfect                :  92.55%     298x
|         30s 	unique, imperfect              :   7.14%      23x
|         30s 	multiple, non-perfect          :   0.31%       1x

**********************************************************************************************
*                                                                                            *
*         30s Mapping para nodes 0.4 ==> 1.0                                                 *
*                                                                                            *
**********************************************************************************************

|         30s Extending slot mapping 0.4 ==> 1.0 for para nodes
|         37s 	Done
..............................................................................................
.         37s Statistics for 0.4 ==> 1.0 (para)                                              .
..............................................................................................
|         37s 	TOTAL                          : 100.00%   33885x
|         37s 	unique, perfect                :  77.44%   26242x
|         37s 	unique, imperfect              :  22.50%    7625x
|         37s 	multiple, non-perfect          :   0.01%       5x
|         37s 	not mapped                     :   0.04%      13x

**********************************************************************************************
*                                                                                            *
*         37s Mapping remark nodes 0.4 ==> 1.0                                               *
*                                                                                            *
**********************************************************************************************

|         37s Extending slot mapping 0.4 ==> 1.0 for remark nodes
|         40s 	Done
..............................................................................................
.         40s Statistics for 0.4 ==> 1.0 (remark)                                            .
..............................................................................................
|         40s 	TOTAL                          : 100.00%   22922x
|         40s 	unique, perfect                :  97.36%   22318x
|         40s 	unique, imperfect              :   2.64%     604x

**********************************************************************************************
*                                                                                            *
*         40s Mapping head nodes 0.4 ==> 1.0                                                 *
*                                                                                            *
**********************************************************************************************

|         40s Extending slot mapping 0.4 ==> 1.0 for head nodes
|         40s 	Done
..............................................................................................
.         40s Statistics for 0.4 ==> 1.0 (head)                                              .
..............................................................................................
|         40s 	TOTAL                          : 100.00%     589x
|         40s 	unique, perfect                :  91.17%     537x
|         40s 	unique, imperfect              :   8.83%      52x

**********************************************************************************************
*                                                                                            *
*         40s Mapping line nodes 0.4 ==> 1.0                                                 *
*                                                                                            *
**********************************************************************************************

|         40s Extending slot mapping 0.4 ==> 1.0 for line nodes
|         51s 	Done
..............................................................................................
.         51s Statistics for 0.4 ==> 1.0 (line)                                              .
..............................................................................................
|         51s 	TOTAL                          : 100.00%  444978x
|         51s 	unique, perfect                :  97.15%  432317x
|         51s 	unique, imperfect              :   0.40%    1770x
|         51s 	multiple, cleanly composed     :   0.31%    1368x
|         51s 	multiple, non-perfect          :   2.14%    9523x

**********************************************************************************************
*                                                                                            *
*         51s Mapping row nodes 0.4 ==> 1.0                                                  *
*                                                                                            *
**********************************************************************************************

|         51s Extending slot mapping 0.4 ==> 1.0 for row nodes
|         51s 	Done
..............................................................................................
.         51s Statistics for 0.4 ==> 1.0 (row)                                               .
..............................................................................................
|         51s 	TOTAL                          : 100.00%    4566x
|         51s 	unique, perfect                :  98.90%    4516x
|         51s 	unique, imperfect              :   1.10%      50x

**********************************************************************************************
*                                                                                            *
*         51s Mapping folio nodes 0.4 ==> 1.0                                                *
*                                                                                            *
**********************************************************************************************

|         51s Extending slot mapping 0.4 ==> 1.0 for folio nodes
|         51s 	Done
..............................................................................................
.         51s Statistics for 0.4 ==> 1.0 (folio)                                             .
..............................................................................................
|         51s 	TOTAL                          : 100.00%    2555x
|         51s 	unique, perfect                :  95.58%    2442x
|         51s 	unique, imperfect              :   4.11%     105x
|         51s 	not mapped                     :   0.31%       8x

**********************************************************************************************
*                                                                                            *
*         51s Mapping cell nodes 0.4 ==> 1.0                                                 *
*                                                                                            *
**********************************************************************************************

|         51s Extending slot mapping 0.4 ==> 1.0 for cell nodes
|         52s 	Done
..............................................................................................
.         52s Statistics for 0.4 ==> 1.0 (cell)                                              .
..............................................................................................
|         52s 	TOTAL                          : 100.00%   20593x
|         52s 	unique, perfect                :  99.71%   20533x
|         52s 	unique, imperfect              :   0.28%      58x
|         52s 	multiple, cleanly composed     :   0.00%       1x
|         52s 	not mapped                     :   0.00%       1x

**********************************************************************************************
*                                                                                            *
*         52s Mapping subhead nodes 0.4 ==> 1.0                                              *
*                                                                                            *
**********************************************************************************************

|         52s Extending slot mapping 0.4 ==> 1.0 for subhead nodes
|         52s 	Done
..............................................................................................
.         52s Statistics for 0.4 ==> 1.0 (subhead)                                           .
..............................................................................................
|         52s 	TOTAL                          : 100.00%    1360x
|         52s 	unique, perfect                :  99.71%    1356x
|         52s 	unique, imperfect              :   0.29%       4x
..............................................................................................
.         52s Write edge as TF feature omap@0.4-1.0                                          .
..............................................................................................
  0.00s Exporting 0 node and 1 edge and 0 config features to ~/github/clariah/wp6-missieven/tf/1.0:
   |     8.50s T omap@0.4-1.0         to ~/github/clariah/wp6-missieven/tf/1.0
  8.50s Exported 0 node features and 1 edge features and 0 config features to ~/github/clariah/wp6-missieven/tf/1.0

The result is a new feature in the latest version of the dataset:

In [13]:

!ls -l ~/github/clariah/wp6-missieven/tf/{vb}/omap@*.tf

-rw-r--r--  1 werk  staff  43874572 May  6 15:48 /Users/werk/github/clariah/wp6-missieven/tf/1.0/omap@0.4-1.0.tf

Observations

In the vast majority of cases, a node in the source version has just one obvious counterpart in the target version.
Most cases of ambiguity arise in the node type line.
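The second observation can be checked with a little arithmetic on the statistics printed above. The sketch below copies the counts of the label "multiple, non-perfect" per node type from that output and computes the share that falls on line nodes. The variable names are made up for this illustration.

```python
# counts of "multiple, non-perfect" per node type, copied from the output above
multiple_non_perfect = {
    "letter": 6,
    "page": 85,
    "table": 1,
    "para": 5,
    "line": 9523,
}

total = sum(multiple_non_perfect.values())
share = multiple_non_perfect["line"] / total
print(f"{multiple_non_perfect['line']} of {total} cases are lines: {share:.1%}")
```

Line nodes account for roughly 99% of these cases.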

Maybe we can shed some light on those cases.

The mapping has been written as a TF edge feature, and it is also still available in memory as V.edge.

We are interested in mappings between line nodes which are diagnosed as multiple, non-perfect. To find them, we first ask for the list of diagnostic labels:

In [14]:

V.legend()

b = unique, perfect
d = multiple, one perfect
c = unique, imperfect
f = multiple, cleanly composed
e = multiple, non-perfect
a = not mapped

We need to inspect line nodes in version 0.4 that have label e:

In [15]:

diags = V.getDiagnosis(node="line", label="e")
print(type(diags))
print(len(diags))

<class 'tuple'>
9523

In [16]:

T[va].text(diags[0])

Out[16]:

'VOOR ILE DE MAYO 25 februari 1610.'

In [17]:

V.edge[diags[0]]

Out[17]:

{6018784: 18, 6018788: 4}

In [18]:

for (lnb, dis) in V.edge[diags[0]].items():
    print(f"dis={dis:>2} text={T[vb].text(lnb)}")

dis=18 text=VOOR ILE DE MAYO De eerste drie brieven, door Both op reis naar Indië geschrovon, wijken niet af van 
dis= 4 text=25 februari 1610.
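If a single counterpart is needed for such a node, one option is to take the candidate with the lowest dissimilarity. The sketch below does this for the values shown in Out[17]. It is only an illustration of that choice, not something the notebook or the library does for you, and the names `candidates`, `best_node` and `best_dis` are made up.

```python
# the candidates for diags[0], copied from Out[17]: {target node: dissimilarity}
candidates = {6018784: 18, 6018788: 4}

# pick the target node with the smallest dissimilarity
best_node, best_dis = min(candidates.items(), key=lambda item: item[1])
print(best_node, best_dis)
```

Whether the closest candidate is good enough depends on the use case. Here it covers only the date part of the source line.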

Explanation

In the target version, footnotes occupy lines of their own. Line breaks in footnotes now become line breaks in the text as a whole. So a line in the source version may be split into several parts when it has a reference to a multiline footnote.

The mapping then detects the two target lines, each of which is an imperfect target of the source line. We cannot do much about it.
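The two dissimilarity values of the example can be reproduced on toy data. The sketch below rests on an assumption that is not taken from the library documentation: the dissimilarity is the size of the symmetric difference between the mapped slots of the source line and the slots of a target line. That assumption is consistent with the values 18 and 4 shown above if each word is one slot. The slot numbers are invented for the illustration.

```python
def dissimilarity(mapped, target_slots):
    """Assumed measure: number of slots that occur in only one of the two sets."""
    return len(mapped ^ target_slots)


# invented slot numbers in the target version:
#  1..4   VOOR ILE DE MAYO
#  5..19  the 15 words of the first footnote line
# 20..22  25 februari 1610
target_line_1 = set(range(1, 20))
target_line_2 = set(range(20, 23))

# the source line has no footnote slots, so it maps onto these 7 slots
mapped = {1, 2, 3, 4, 20, 21, 22}

dis_1 = dissimilarity(mapped, target_line_1)
dis_2 = dissimilarity(mapped, target_line_2)
print(dis_1, dis_2)
```

Under this assumption the first target line is penalized both for the footnote words it adds and for the date it lacks, while the second target line is penalized only for the four heading words it lacks.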

We could have made another coding decision: to treat line breaks in footnotes differently from line breaks in the body text. Then we would have a good correspondence between the lines in both versions.