We check the correctness of the conversion of Abegg's data files to TF.
In this notebook we concentrate on the main fields in the data files: fullo, lang, lexo, and morpho. We also keep track of the source location: the biblical or non-biblical file, and the line number in that file.
We show that all this material has been transferred to TF completely and faithfully.
%load_ext autoreload
%autoreload 2
import os
import re
import yaml
from tf.app import use
from checksLib import Compare
A = use("ETCBC/dss:clone", checkout="clone", hoist=globals(), silent=False)
Using TF-app in /Users/dirk/github/annotation/app-dss/code: repo clone offline under ~/github (local github)
Using data in /Users/dirk/github/etcbc/dss/tf/0.5: repo clone offline under ~/github (local github)
Using data in /Users/dirk/github/etcbc/dss/parallels/tf/0.5: repo clone offline under ~/github (local github)
Dead Sea Scrolls: after alt biblical book chapter cl cl2 cor fragment full fulle fullo glex glexe glexo glyph glyphe glypho gn gn2 gn3 halfverse intl lang lex lexe lexo line md merr morpho nu nu2 nu3 otype ps ps2 ps3 punc punce punco rec rem script scroll sp srcLn st type unc vac verse vs vt occ oslots
Parallel Passages: sim
We compare the material in the source files with the o-style features of the TF dataset.
The o-style features fullo, lexo, and morpho contain the unmodified strings corresponding to fields in the lines of the source files. We add the lang feature to the mix.
We'll compile two lists of this material, one based directly on the source files, and one based on the TF features.
Both lists consist of tuples, one for each word, and inside each tuple we also store whether the word comes from the biblical or non-biblical file and what the line number is.
Then we'll compare the tuples of both lists one by one.
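The pairwise comparison described above can be sketched as follows. This is a minimal illustration with a hypothetical helper name (firstDiffIndex); the actual comparison in this notebook is done by the Compare class from checksLib, used further down.

```python
# Minimal sketch of a pairwise tuple comparison (hypothetical helper;
# the real work is done by the Compare class from checksLib).
def firstDiffIndex(listA, listB):
    """Index of the first position where the lists differ, or None if equal."""
    for (i, (a, b)) in enumerate(zip(listA, listB)):
        if a != b:
            return i
    # if one list is a prefix of the other, they differ at the shorter length
    if len(listA) != len(listB):
        return min(len(listA), len(listB))
    return None

same = [(False, 4, "w", "", "w◊", "Pc")]
other = [(False, 4, "w", "a", "w◊", "Pc")]
print(firstDiffIndex(same, same))   # None
print(firstDiffIndex(same, other))  # 0
```

Because the tuples carry the file flag and line number, the index of the first difference immediately points at a concrete source line.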
We determine the node of the first word in the biblical source file.
ln = T.nodeFromSection(("1Q1", "f1", "1"))
words = L.d(ln, otype="word")
firstBibWord = words[0]
firstBibWord
1889878
We determine the words for which the feature biblical is 2: these are the words that occur in both source files.
We have chosen to retain the biblical entries of these words and to ignore the non-biblical entries, because the non-biblical version turned out to be either identical to the biblical version, or to lack material where the biblical version has a reconstruction.
So, when we compare the source material with the TF material, we have to leave these words out of the non-biblical part of the source material.
In order to do that, we make a set of the lines involved, marked by their scroll, fragment and line number.
bib2Lines = {
    "{} {}:{}".format(*T.sectionFromNode(ln))
    for ln in F.otype.s("line")
    if F.biblical.v(ln) == 2
}
bib2Lines
{'2Q29 f1:1', '2Q29 f1:2', '2Q29 f1:3', '4Q249j f1:1', '4Q249j f1:2', '4Q249j f1:3', '4Q249j f1:4', '4Q249j f1:5', '4Q249j f1:6', '4Q483 f1:1', '4Q483 f1:2', '4Q483 f1:3', '4Q483 f2:1', '4Q483 f2:2'}
Build the list based on TF: wordsTf.
wordsTf = []
for w in F.otype.s("word"):
    biblical = F.biblical.v(w)
    bib = biblical in {1, 2}
    wordsTf.append(
        (
            bib,
            F.srcLn.v(w),
            F.fullo.v(w),
            F.lang.v(w) or "",
            F.lexo.v(w) or "",
            F.morpho.v(w) or "",
        )
    )
We sort the words by source file first and then by source line number.
wordsTf.sort(key=lambda x: (x[0], x[1]))
len(wordsTf)
500995
wordsTf[0:5]
[(False, 4, 'w', '', 'w◊', 'Pc'), (False, 5, 'oth', '', 'oAt;Dh', 'Pd'), (False, 6, 'Cmow', '', 'vmo', 'vqvmp'), (False, 7, 'kl', '', 'k;Ol', 'ncmsc'), (False, 8, 'ywdoy', '', 'ydo', 'vqPmpc')]
Build the list according to the source files.
We have applied fixes during conversion. We should apply the same fixes here.
FIXES_DECL = os.path.expanduser("~/github/etcbc/dss/yaml/fixes.yaml")
def readYaml(fileName):
    # use an explicit Loader: plain yaml.load() without one is deprecated
    with open(fileName) as fh:
        return yaml.load(fh, Loader=yaml.FullLoader)
fixesDecl = readYaml(FIXES_DECL)
lineFixes = fixesDecl["lineFixes"]
fieldFixes = fixesDecl["fieldFixes"]
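We do not display fixes.yaml itself here, but the way lineFixes and fieldFixes are indexed in the code below suggests a shape like the following. This is a hypothetical sketch (note the Example suffix to avoid clobbering the real fixesDecl): the from/to values are taken from the fix reports printed further down, while the explanation strings are made up; YAML would also yield lists rather than tuples, which the unpacking code handles equally well.

```python
# Hypothetical shape of the loaded fixes declaration:
# biblical flag -> line number -> (from, to, explanation) for lineFixes,
# biblical flag -> line number -> {field: (from, to, explanation)} for fieldFixes.
fixesDeclExample = {
    "lineFixes": {
        False: {  # non-biblical file
            348900: ("3:13,3,1", "3:13,3.1", "comma should be a period"),
        },
        True: {},  # biblical file
    },
    "fieldFixes": {
        False: {},
        True: {
            48768: {"morph": ("vp12ms", "vp1ms", "stray digit in the tag")},
        },
    },
}
```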
We read the source files and apply line fixes.
sourceDir = os.path.expanduser("~/local/dss/sanitized")
bibSource = "dss_bib"
nonbibSource = "dss_nonbib"
sources = ("nonbib", "bib")
sourceLines = {}
for src in sources:
    biblical = src == "bib"
    lineFix = lineFixes[biblical]
    srcPath = f"{sourceDir}/dss_{src}.txt"
    with open(srcPath) as fh:
        sourceLines[src] = list(fh)
    for (i, line) in enumerate(sourceLines[src]):
        ln = i + 1
        if ln in lineFix:
            (fr, to, expl) = lineFix[ln]
            if fr in line:
                oline = line
                line = line.replace(fr, to)
                sourceLines[src][i] = line
                print(f"{src} line {ln} fixed:\n\t{oline}\t{line}")
nonbib line 256841 fixed:
	4Q491 f36:2,4.1 [\\] \\\@0
	4Q491 f36:2,4.1 [\\] \\\@0
nonbib line 348565 fixed:
	11Q19 2:1,2.1 -- \0
	11Q19 2:1,2.1 -- \@0
nonbib line 348900 fixed:
	11Q19 3:13,3,1 -- \@0
	11Q19 3:13,3.1 -- \@0
bib line 36238 fixed:
	Is 44:21 1Q8 19:1 [\ \\\@0 21829
	Is 44:21 1Q8 19:1 [\ \\\@0 21829
bib line 99010 fixed:
	Deut 33:29 4Q29 f10:2 -- 2895
	Deut 33:29 4Q29 f10:2 -- \@0 2895
bib line 143765 fixed:
	Is 56:2 4Q56 f48:3 -- 30427
	Is 56:2 4Q56 f48:3 -- \@0 30427
bib line 186962 fixed:
	Dan 2:10 4Q112 f1ii:3 l|] l\\%@0 516
	Dan 2:10 4Q112 f1ii:3 l|] l\\%0 516
bib line 208179 fixed:
	8Q3 f12_16:17 8Q3 f12_16:17 -- \@0 949
	8Q3 f12_16:17 8Q3 f12_16:17 -- \@0 949
bib line 217582 fixed:
	Ps 135:9 11Q5 14:17 -- 11023
	Ps 135:9 11Q5 14:17 -- \@0 11023
We split the lines into fields and apply the field fixes.
Not all lines in the source correspond to words: if a line has no word material, we skip it. We also remember whether the material is in Greek.
Some source lines contain an escape character; we call those lines control lines. If a control line contains (f0), the material from there on is in Greek; Greek terminates at (fy).
We also skip the words from the non-biblical file that have an entry in the biblical file as well: these are the words occurring in the lines we collected in bib2Lines in step 2.
Furthermore, we must treat a transcription of the form ]d[, where d is any decimal number, as a line number rather than a real transcription, so we skip these lines as well.
wordlessRe = re.compile(r"^[\\\[\]≤≥?{}<>()\^]*$")
isNumber = re.compile(r"\][0-9]+\[$")
wordsSrc = []
skippedWordLines = []
for src in sources:
    bib = src == "bib"
    fieldFix = fieldFixes[bib]
    sep = "\t" if bib else " "
    greek = False
    for (i, line) in enumerate(sourceLines[src]):
        if "\u001b" in line:
            if "(f0)" in line:
                greek = True
            elif "(fy)" in line:
                greek = False
            continue
        fields = line.rstrip("\n").split(sep)
        nFields = len(fields)
        ln = i + 1
        if nFields < 3:
            continue
        if not bib:
            scroll = fields[0]
            label = fields[1].split(",")[0]
            passage = f"{scroll} {label}"
            if passage in bib2Lines:
                skippedWordLines.append(ln)
                continue
        word = fields[2]
        lex = fields[3] if nFields >= 4 else ""
        lang = ""
        parts = lex.split("@", maxsplit=1)
        if len(parts) > 1:
            (lex, morph) = parts
        else:
            parts = lex.split("%", maxsplit=1)
            if len(parts) > 1:
                (lex, morph) = parts
                lang = "a"
            else:
                morph = ""
        if ln in fieldFix:
            for (field, (fr, to, expl)) in fieldFix[ln].items():
                iVal = (
                    word
                    if field == "trans"
                    else lex
                    if field == "lex"
                    else morph
                    if field == "morph"
                    else None
                )
                if iVal == fr:
                    if field == "trans":
                        word = to
                    elif field == "lex":
                        lex = to
                    elif field == "morph":
                        morph = to
                    print(f"{src} line {ln} field {field} fixed:\n\t{iVal}\t{to}")
        if (
            word == "/" or wordlessRe.match(word) or isNumber.match(word)
        ) and lex == "":
            continue
        theLang = "g" if greek else lang
        wordsSrc.append((bib, i + 1, word, theLang, lex, morph))
print(f"{len(wordsSrc)} lines, {len(skippedWordLines)} word lines skipped")
print(f"{len(wordsSrc)} lines, {len(skippedWordLines)} word lines skipped")
nonbib line 38512 field trans fixed:
	≤]	≥≤
nonbib line 48129 field morph fixed:
	vhp3cpX3mp{2}	vhp3cp{2}X3mp
nonbib line 59593 field trans fixed:
	±	±
nonbib line 127763 field morph fixed:
	vhp3cpX3ms{2}	vhp3cp{2}X3ms
nonbib line 153845 field trans fixed:
	b]	b
nonbib line 153970 field trans fixed:
	b]	b
nonbib line 154026 field trans fixed:
	b]	b
nonbib line 173512 field trans fixed:
	^b	^b^
nonbib line 211343 field trans fixed:
	y»tkwØ_nw	y»tkwØnw
nonbib line 248844 field trans fixed:
	t_onh]	tonh]
nonbib line 263123 field lex fixed:
	82	kj
nonbib line 287243 field trans fixed:
	oyN_	oyN
nonbib line 290592 field trans fixed:
	a	A
nonbib line 291886 field trans fixed:
	a	A
nonbib line 324473 field trans fixed:
	[˝w»b|a|]	[w»b|a|]
nonbib line 335846 field trans fixed:
	3	
bib line 48768 field morph fixed:
	vp12ms	vp1ms
bib line 109489 field morph fixed:
	0ncfp	ncfp
bib line 115544 field morph fixed:
	\	0
bib line 124566 field lex fixed:
	jll-2	jll_2
bib line 146637 field morph fixed:
	0ncfs	ncfs
bib line 147953 field trans fixed:
	[^≥	[≥
bib line 154933 field trans fixed:
	≥1a≤	≥a≤
bib line 154949 field trans fixed:
	≥2a≤	≥a≤
bib line 157840 field morph fixed:
	2	0
bib line 158371 field morph fixed:
	4	0
bib line 158401 field morph fixed:
	3	0
bib line 158493 field trans fixed:
	[\\]^	[\\]
bib line 185650 field trans fixed:
	h«\\wØ(	h«\\wØ
bib line 186373 field morph fixed:
	Pp@0	Pp
bib line 202206 field trans fixed:
	alwhiM	alwhyM
500995 lines, 113 word lines skipped
wordsSrc[0:5]
[(False, 4, 'w', '', 'w◊', 'Pc'), (False, 5, 'oth', '', 'oAt;Dh', 'Pd'), (False, 6, 'Cmow', '', 'vmo', 'vqvmp'), (False, 7, 'kl', '', 'k;Ol', 'ncmsc'), (False, 8, 'ywdoy', '', 'ydo', 'vqPmpc')]
The comparison.
In the companion module checksLib.py we have defined a few handy functions.
CC = Compare(sourceLines, wordsSrc, A.api, wordsTf)
We demonstrate a few functions that help with the comparison.
The function showSrc peeks into a source file at a given line number, with some context.
CC.showSrc(True, 18)
B16: Gen 1:20 ┃1Q1 f1:1 ┃w ┃w◊@Pc ┃41.5
B17: Gen 1:20 ┃1Q1 f1:1 ┃yamr[ ┃amr_1@vqw3ms ┃42
>>> B18: Gen 1:20 ┃1Q1 f1:1 ┃/ ┃ ┃54
B19: Gen 1:20 ┃1Q1 f1:2 ┃]alhyM ┃aTløhIyM@ncmp ┃55
B20: Gen 1:20 ┃1Q1 f1:2 ┃yC[rwxw ┃vrX@vqi3mp ┃56
The function showTf looks up a line number in TF.
CC.showTf(True, 18)
B16: Gen 1:20 ┃1Q1 f1:1 ┃w ┃ ┃w◊ ┃Pc┃1889893┃
B17: Gen 1:20 ┃1Q1 f1:1 ┃yamr[ ┃ ┃amr_1 ┃vqw3ms┃1889894┃
>>> B18: no nodes
B19: Gen 1:20 ┃1Q1 f1:2 ┃]alhyM ┃ ┃aTløhIyM ┃ncmp┃1889895┃
B20: Gen 1:20 ┃1Q1 f1:2 ┃yC[rwxw ┃ ┃vrX ┃vqi3mp┃1889896┃
And showDiff combines firstDiff, showSrc, and showTf to get a meaningful display of the first difference, as we'll see later.
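A rough sketch of how such a combination might work (hypothetical; the real showDiff lives in checksLib and may differ): find the first differing item, report both versions, and then show the context on each side.

```python
# Hypothetical sketch of a showDiff-like helper (the real one is in
# checksLib): report the first differing item with context on both sides.
def showDiffSketch(wordsSrc, wordsTf, showSrc=None, showTf=None):
    """Print the first differing item and return its index, or None if equal."""
    for (i, (s, t)) in enumerate(zip(wordsSrc, wordsTf)):
        if s != t:
            print(f"item {i}:\n\tSRC {s}\n\tTF  {t}")
            (bib, ln) = s[0:2]
            if showSrc:
                showSrc(bib, ln)  # source file context around this line
            if showTf:
                showTf(bib, ln)  # TF context around this line
            return i
    print("EQUAL")
    return None
```

Since each tuple starts with the file flag and the source line number, those two values are all the display functions need to locate the context.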
Now we can start comparing!
CC.showDiff()
EQUAL
That verdict is easy to verify independently: we can compare the two lists directly:
wordsSrc == wordsTf
True
Let's deliberately distort something and run the comparison again.
nr = 200000
item = list(wordsSrc[nr])
item
[False, 258361, 'm|\\]', '', 'm\\\\', '0']
item[3] = "a"
wordsSrc[nr] = tuple(item)
CC.showDiff()
item 200000:
	TF  N258361 m|\] ┃ ┃m\\ ┃0
	SRC N258361 m|\] ┃a ┃m\\ ┃0
TF:
N258360: 4Q496 f20:2 ┃[\\ ┃ ┃\\\ ┃0┃1807071┃
>>> N258361: 4Q496 f20:2 ┃m|\] ┃ ┃m\\ ┃0┃1807072┃
N258362: 4Q496 f20:2 ┃-- ┃ ┃\ ┃0┃1807073┃
SRC:
N258360: 4Q496 ┃f20:2,3.1 ┃[\\ ┃\\\@0
>>> N258361: 4Q496 ┃f20:2,4.1 ┃m|\] ┃m\\@0
N258362: 4Q496 ┃f20:2,5.1 ┃-- ┃\@0