We check the correctness of the conversion of Abegg's data files to TF.
In this notebook we concentrate on the main fields in the data files: fullo, lang, lexo, and morpho. We also keep track of the source location: the biblical or non-biblical file, and the line number in that file.
We show that all this material has been transferred to TF completely and faithfully.
%load_ext autoreload
%autoreload 2
import os
import re
import yaml
from tf.app import use
from checksLib import Compare
A = use("ETCBC/dss:clone", checkout="clone", hoist=globals(), silent=False)
Using TF-app in /Users/dirk/github/annotation/app-dss/code: repo clone offline under ~/github (local github)
Using data in /Users/dirk/github/etcbc/dss/tf/0.5: repo clone offline under ~/github (local github)
Using data in /Users/dirk/github/etcbc/dss/parallels/tf/0.5: repo clone offline under ~/github (local github)
Dead Sea Scrolls: after alt biblical book chapter cl cl2 cor fragment full fulle fullo glex glexe glexo glyph glyphe glypho gn gn2 gn3 halfverse intl lang lex lexe lexo line md merr morpho nu nu2 nu3 otype ps ps2 ps3 punc punce punco rec rem script scroll sp srcLn st type unc vac verse vs vt occ oslots
Parallel Passages: sim
We compare the material in the source files with the o-style features of the TF dataset.
The o-style features fullo, lexo, and morpho contain the unmodified strings corresponding to fields in the lines of the source files. We add the lang feature to the mix.
We'll compile two lists of this material, one based directly on the source files, and one based on the TF features.
Both lists consist of tuples, one for each word, and inside each tuple we also store whether the word comes from the biblical or non-biblical file and what the line number is.
Then we'll compare the tuples of both lists one by one.
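The pairwise comparison described above can be sketched as follows. This is a minimal illustration with a hypothetical helper name (firstDiffIndex); the actual comparison in this notebook is done by the Compare class from checksLib, used further down.

```python
# Minimal sketch of a pairwise tuple comparison (hypothetical helper;
# the real work is done by the Compare class from checksLib).
def firstDiffIndex(listA, listB):
    """Index of the first position where the lists differ, or None if equal."""
    for (i, (a, b)) in enumerate(zip(listA, listB)):
        if a != b:
            return i
    # if one list is a prefix of the other, they differ at the shorter length
    if len(listA) != len(listB):
        return min(len(listA), len(listB))
    return None

same = [(False, 4, "w", "", "w◊", "Pc")]
other = [(False, 4, "w", "a", "w◊", "Pc")]
print(firstDiffIndex(same, same))   # None
print(firstDiffIndex(same, other))  # 0
```

Because the tuples carry the file flag and line number, the index of the first difference immediately points at a concrete source line.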
We determine the node of the first word in the biblical source file.
ln = T.nodeFromSection(("1Q1", "f1", "1"))
words = L.d(ln, otype="word")
firstBibWord = words[0]
firstBibWord
1889878
We determine the words for which the feature biblical is 2: these are the words that occur in both source files.
We have chosen to retain the biblical entries of these words and to ignore the non-biblical entries, because the non-biblical version turned out to be either identical to the biblical version, or to lack material where the biblical version has a reconstruction.
So, when we compare the source material with the TF material, we have to leave these words out of the non-biblical part of the source material.
In order to do that, we make a set of the lines involved, marked by their scroll, fragment and line number.
bib2Lines = {
    "{} {}:{}".format(*T.sectionFromNode(ln))
    for ln in F.otype.s("line")
    if F.biblical.v(ln) == 2
}
bib2Lines
{'2Q29 f1:1', '2Q29 f1:2', '2Q29 f1:3', '4Q249j f1:1', '4Q249j f1:2', '4Q249j f1:3', '4Q249j f1:4', '4Q249j f1:5', '4Q249j f1:6', '4Q483 f1:1', '4Q483 f1:2', '4Q483 f1:3', '4Q483 f2:1', '4Q483 f2:2'}
Build the list based on TF: wordsTf.
wordsTf = []
for w in F.otype.s("word"):
    biblical = F.biblical.v(w)
    bib = biblical in {1, 2}
    wordsTf.append(
        (
            bib,
            F.srcLn.v(w),
            F.fullo.v(w),
            F.lang.v(w) or "",
            F.lexo.v(w) or "",
            F.morpho.v(w) or "",
        )
    )
We sort the words by source file first and then by source line number.
wordsTf.sort(key=lambda x: (x[0], x[1]))
len(wordsTf)
500995
wordsTf[0:5]
[(False, 4, 'w', '', 'w◊', 'Pc'), (False, 5, 'oth', '', 'oAt;Dh', 'Pd'), (False, 6, 'Cmow', '', 'vmo', 'vqvmp'), (False, 7, 'kl', '', 'k;Ol', 'ncmsc'), (False, 8, 'ywdoy', '', 'ydo', 'vqPmpc')]
Build the list according to the source files.
We have applied fixes during conversion. We should apply the same fixes here.
FIXES_DECL = os.path.expanduser("~/github/etcbc/dss/yaml/fixes.yaml")
def readYaml(fileName):
    # use an explicit Loader: plain yaml.load() without one is deprecated
    with open(fileName) as fh:
        return yaml.load(fh, Loader=yaml.FullLoader)
fixesDecl = readYaml(FIXES_DECL)
lineFixes = fixesDecl["lineFixes"]
fieldFixes = fixesDecl["fieldFixes"]
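We do not display fixes.yaml itself here, but the way lineFixes and fieldFixes are indexed in the code below suggests a shape like the following. This is a hypothetical sketch (note the Example suffix to avoid clobbering the real fixesDecl): the from/to values are taken from the fix reports printed further down, while the explanation strings are made up; YAML would also yield lists rather than tuples, which the unpacking code handles equally well.

```python
# Hypothetical shape of the loaded fixes declaration:
# biblical flag -> line number -> (from, to, explanation) for lineFixes,
# biblical flag -> line number -> {field: (from, to, explanation)} for fieldFixes.
fixesDeclExample = {
    "lineFixes": {
        False: {  # non-biblical file
            348900: ("3:13,3,1", "3:13,3.1", "comma should be a period"),
        },
        True: {},  # biblical file
    },
    "fieldFixes": {
        False: {},
        True: {
            48768: {"morph": ("vp12ms", "vp1ms", "stray digit in the tag")},
        },
    },
}
```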
We read the source files and apply line fixes.
sourceDir = os.path.expanduser("~/local/dss/sanitized")
bibSource = "dss_bib"
nonbibSource = "dss_nonbib"
sources = ("nonbib", "bib")
sourceLines = {}
for src in sources:
    biblical = src == "bib"
    lineFix = lineFixes[biblical]
    srcPath = f"{sourceDir}/dss_{src}.txt"
    with open(srcPath) as fh:
        sourceLines[src] = list(fh)
    for (i, line) in enumerate(sourceLines[src]):
        ln = i + 1
        if ln in lineFix:
            (fr, to, expl) = lineFix[ln]
            if fr in line:
                oline = line
                line = line.replace(fr, to)
                sourceLines[src][i] = line
                print(f"{src} line {ln} fixed:\n\t{oline}\t{line}")
nonbib line 256841 fixed:
	4Q491 f36:2,4.1 [\\] \\\@0
	4Q491 f36:2,4.1 [\\] \\\@0
nonbib line 348565 fixed:
	11Q19 2:1,2.1 -- \0
	11Q19 2:1,2.1 -- \@0
nonbib line 348900 fixed:
	11Q19 3:13,3,1 -- \@0
	11Q19 3:13,3.1 -- \@0
bib line 36238 fixed:
	Is 44:21 1Q8 19:1 [\ \\\@0 21829
	Is 44:21 1Q8 19:1 [\ \\\@0 21829
bib line 99010 fixed:
	Deut 33:29 4Q29 f10:2 -- 2895
	Deut 33:29 4Q29 f10:2 -- \@0 2895
bib line 143765 fixed:
	Is 56:2 4Q56 f48:3 -- 30427
	Is 56:2 4Q56 f48:3 -- \@0 30427
bib line 186962 fixed:
	Dan 2:10 4Q112 f1ii:3 l|] l\\%@0 516
	Dan 2:10 4Q112 f1ii:3 l|] l\\%0 516
bib line 208179 fixed:
	8Q3 f12_16:17 8Q3 f12_16:17 -- \@0 949
	8Q3 f12_16:17 8Q3 f12_16:17 -- \@0 949
bib line 217582 fixed:
	Ps 135:9 11Q5 14:17 -- 11023
	Ps 135:9 11Q5 14:17 -- \@0 11023
We split the lines into fields and apply the field fixes.
Not all lines in the source correspond to words: if a line has no word material, we skip it. We also remember whether the material is in Greek.
Some source lines contain an escape character; we call those lines control lines. If a control line contains (f0), the material from there on is in Greek; Greek terminates at (fy).
We also skip the words from the non-biblical file that have an entry in the biblical file as well: these are the words occurring in the lines we collected in bib2Lines in step 2.
Furthermore, we must treat a transcription of the form ]d[, where d is any decimal number, as a line number rather than a real transcription, so we skip these lines as well.
wordlessRe = re.compile(r"^[\\\[\]≤≥?{}<>()\^]*$")
isNumber = re.compile(r"\][0-9]+\[$")
wordsSrc = []
skippedWordLines = []
for src in sources:
    bib = src == "bib"
    fieldFix = fieldFixes[bib]
    sep = "\t" if bib else " "
    greek = False
    for (i, line) in enumerate(sourceLines[src]):
        if "\u001b" in line:
            if "(f0)" in line:
                greek = True
            elif "(fy)" in line:
                greek = False
            continue
        fields = line.rstrip("\n").split(sep)
        nFields = len(fields)
        ln = i + 1
        if nFields < 3:
            continue
        if not bib:
            scroll = fields[0]
            label = fields[1].split(",")[0]
            passage = f"{scroll} {label}"
            if passage in bib2Lines:
                skippedWordLines.append(ln)
                continue
        word = fields[2]
        lex = fields[3] if nFields >= 4 else ""
        lang = ""
        parts = lex.split("@", maxsplit=1)
        if len(parts) > 1:
            (lex, morph) = parts
        else:
            parts = lex.split("%", maxsplit=1)
            if len(parts) > 1:
                (lex, morph) = parts
                lang = "a"
            else:
                morph = ""
        if ln in fieldFix:
            for (field, (fr, to, expl)) in fieldFix[ln].items():
                iVal = (
                    word
                    if field == "trans"
                    else lex
                    if field == "lex"
                    else morph
                    if field == "morph"
                    else None
                )
                if iVal == fr:
                    if field == "trans":
                        word = to
                    elif field == "lex":
                        lex = to
                    elif field == "morph":
                        morph = to
                    print(f"{src} line {ln} field {field} fixed:\n\t{iVal}\t{to}")
        if (
            word == "/" or wordlessRe.match(word) or isNumber.match(word)
        ) and lex == "":
            continue
        theLang = "g" if greek else lang
        wordsSrc.append((bib, i + 1, word, theLang, lex, morph))
print(f"{len(wordsSrc)} lines, {len(skippedWordLines)} word lines skipped")
print(f"{len(wordsSrc)} lines, {len(skippedWordLines)} word lines skipped")
nonbib line 38512 field trans fixed:
	≤]	≥≤
nonbib line 48129 field morph fixed:
	vhp3cpX3mp{2}	vhp3cp{2}X3mp
nonbib line 59593 field trans fixed:
	±	±
nonbib line 127763 field morph fixed:
	vhp3cpX3ms{2}	vhp3cp{2}X3ms
nonbib line 153845 field trans fixed:
	b]	b
nonbib line 153970 field trans fixed:
	b]	b
nonbib line 154026 field trans fixed:
	b]	b
nonbib line 173512 field trans fixed:
	^b	^b^
nonbib line 211343 field trans fixed:
	y»tkwØ_nw	y»tkwØnw
nonbib line 248844 field trans fixed:
	t_onh]	tonh]
nonbib line 263123 field lex fixed:
	82	kj
nonbib line 287243 field trans fixed:
	oyN_	oyN
nonbib line 290592 field trans fixed:
	a	A
nonbib line 291886 field trans fixed:
	a	A
nonbib line 324473 field trans fixed:
	[˝w»b|a|]	[w»b|a|]
nonbib line 335846 field trans fixed:
	3	
bib line 48768 field morph fixed:
	vp12ms	vp1ms
bib line 109489 field morph fixed:
	0ncfp	ncfp
bib line 115544 field morph fixed:
	\	0
bib line 124566 field lex fixed:
	jll-2	jll_2
bib line 146637 field morph fixed:
	0ncfs	ncfs
bib line 147953 field trans fixed:
	[^≥	[≥
bib line 154933 field trans fixed:
	≥1a≤	≥a≤
bib line 154949 field trans fixed:
	≥2a≤	≥a≤
bib line 157840 field morph fixed:
	2	0
bib line 158371 field morph fixed:
	4	0
bib line 158401 field morph fixed:
	3	0
bib line 158493 field trans fixed:
	[\\]^	[\\]
bib line 185650 field trans fixed:
	h«\\wØ(	h«\\wØ
bib line 186373 field morph fixed:
	Pp@0	Pp
bib line 202206 field trans fixed:
	alwhiM	alwhyM
500995 lines, 113 word lines skipped
wordsSrc[0:5]
[(False, 4, 'w', '', 'w◊', 'Pc'), (False, 5, 'oth', '', 'oAt;Dh', 'Pd'), (False, 6, 'Cmow', '', 'vmo', 'vqvmp'), (False, 7, 'kl', '', 'k;Ol', 'ncmsc'), (False, 8, 'ywdoy', '', 'ydo', 'vqPmpc')]
The comparison.
In the companion module checksLib.py we have defined a few handy functions.
CC = Compare(sourceLines, wordsSrc, A.api, wordsTf)
We demonstrate a few functions that help with the comparison.
The function showSrc peeks into a source file at a given line number, with some context.
CC.showSrc(True, 18)
B16: Gen 1:20 ┃1Q1 f1:1 ┃w ┃w◊@Pc ┃41.5
B17: Gen 1:20 ┃1Q1 f1:1 ┃yamr[ ┃amr_1@vqw3ms ┃42
>>> B18: Gen 1:20 ┃1Q1 f1:1 ┃/ ┃ ┃54
B19: Gen 1:20 ┃1Q1 f1:2 ┃]alhyM ┃aTløhIyM@ncmp ┃55
B20: Gen 1:20 ┃1Q1 f1:2 ┃yC[rwxw ┃vrX@vqi3mp ┃56
The function showTf looks up a line number in TF.
CC.showTf(True, 18)
B16: Gen 1:20 ┃1Q1 f1:1 ┃w ┃ ┃w◊ ┃Pc┃1889893┃
B17: Gen 1:20 ┃1Q1 f1:1 ┃yamr[ ┃ ┃amr_1 ┃vqw3ms┃1889894┃
>>> B18: no nodes
B19: Gen 1:20 ┃1Q1 f1:2 ┃]alhyM ┃ ┃aTløhIyM ┃ncmp┃1889895┃
B20: Gen 1:20 ┃1Q1 f1:2 ┃yC[rwxw ┃ ┃vrX ┃vqi3mp┃1889896┃
And showDiff combines firstDiff, showSrc, and showTf to get a meaningful display of the first difference, as we'll see later.
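A rough sketch of how such a combination might work (hypothetical; the real showDiff lives in checksLib and may differ): find the first differing item, report both versions, and then show the context on each side.

```python
# Hypothetical sketch of a showDiff-like helper (the real one is in
# checksLib): report the first differing item with context on both sides.
def showDiffSketch(wordsSrc, wordsTf, showSrc=None, showTf=None):
    """Print the first differing item and return its index, or None if equal."""
    for (i, (s, t)) in enumerate(zip(wordsSrc, wordsTf)):
        if s != t:
            print(f"item {i}:\n\tSRC {s}\n\tTF  {t}")
            (bib, ln) = s[0:2]
            if showSrc:
                showSrc(bib, ln)  # source file context around this line
            if showTf:
                showTf(bib, ln)  # TF context around this line
            return i
    print("EQUAL")
    return None
```

Since each tuple starts with the file flag and the source line number, those two values are all the display functions need to locate the context.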
Now we can start comparing!
CC.showDiff()
EQUAL
That verdict is easy to verify independently: we can compare the two lists directly:
wordsSrc == wordsTf
True
Let's deliberately distort something and run the comparison again.
nr = 200000
item = list(wordsSrc[nr])
item
[False, 258361, 'm|\\]', '', 'm\\\\', '0']
item[3] = "a"
wordsSrc[nr] = tuple(item)
CC.showDiff()
item 200000:
	TF  N258361 m|\] ┃ ┃m\\ ┃0
	SRC N258361 m|\] ┃a ┃m\\ ┃0
TF:
N258360: 4Q496 f20:2 ┃[\\ ┃ ┃\\\ ┃0┃1807071┃
>>> N258361: 4Q496 f20:2 ┃m|\] ┃ ┃m\\ ┃0┃1807072┃
N258362: 4Q496 f20:2 ┃-- ┃ ┃\ ┃0┃1807073┃
SRC:
N258360: 4Q496 ┃f20:2,3.1 ┃[\\ ┃\\\@0
>>> N258361: 4Q496 ┃f20:2,4.1 ┃m|\] ┃m\\@0
N258362: 4Q496 ┃f20:2,5.1 ┃-- ┃\@0