In [1]:
import sys
import os
from IPython.display import display, HTML, Markdown
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

Handling Biblical data with IKEA logistics

Dirk Roorda - 2018-03-20 ETEN Workshop

Perspective

  • Research data
  • Researchers are ...
    • not engineers
    • not consumers
    • tinkerers (in their own sheds ... or labs as they like to say)
  • Programming theologians, cuneiform decipherers, humanists

Value proposition

In [ ]:
def effectiveness(IT_effort, PT_effort):
    effect = IT_effort * PT_effort
    euros = "€" * effect
    display(HTML(f"<p>Return on investment<p><p>{euros}</p><p>{effect} €</p>"))
In [2]:
widget = interactive(effectiveness, IT_effort=(10, 100), PT_effort=(1, 10))
display(widget)

Modular resources

The IKEA way

Lessons of a cabinet ...

(images: an IKEA kitchen and a METOD cabinet)

... applied to data

Modular resources

  1. help to separate concerns
  2. help to finely tune data sets by recombination
  3. are usable in novel settings

Research questions often involve new demands on data and creation of new treasures.

problem-oriented annotation: take a corpus and add to it your own form of annotation, oriented towards your own research goal

standards for corpus annotation: annotations should be separable

Hence, annotation should be modular: separate text (sic!) and annotation
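A toy illustration of that separation in plain Python (hypothetical token indices and feature names, not a real TF format): the text lives in one structure, and each annotation layer lives in its own, keyed by token number, so layers can be shipped, swapped, or extended without touching the text.

```python
# the text: a list of tokens, addressable by position
tokens = ["In", "the", "beginning"]

# two independent annotation layers, each keyed by token index
pos_layer = {0: "prep", 1: "art", 2: "subs"}
my_research_layer = {2: "theologically-loaded"}  # your own, problem-oriented layer

# recombine at will: the text is never modified
for (i, token) in enumerate(tokens):
    note = my_research_layer.get(i, "")
    print(f"{token}\t{pos_layer[i]}\t{note}")
```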

Thanks to Johan de Joode, Leuven

Think of the IKEA warehouse: everything is nicely piled together

  • a pallet with words
  • a corridor with linguistic objects
  • shelves with clauses and phrases
  • crates with books, chapters, and verses
  • every part is labeled with a node number

Open Scriptures: peeling off the packaging

The OSM source

Have a look in the Open Scriptures Morphology source files: Lam.xml

 <div type="book" osisID="Lam">
      <chapter osisID="Lam.1">
        <verse osisID="Lam.1.1">
          <w lemma="349 b" n="1.1.1.0" morph="HTi">אֵיכָ֣ה</w>
          <seg type="x-paseq">׀</seg>
          <w lemma="3427" morph="HVqp3fs">יָשְׁבָ֣ה</w>
          <w lemma="910" n="1.1.1" morph="HNcmsa">בָדָ֗ד</w>

Book names

Lam
Lam.1
Lam.1.1

Lam = Lmt = Lament = Lamentations = Threni = Klagesangene = ሰቆቃው_ኤርምያስ

Sections

 book
      chapter
        verse
          w
          w
          w

Identifiers

                  osisID= Lam
               osisID= Lam.1
               osisID= Lam.1.1
             lemma= 349 b  n= 1.1.1.0

             lemma= 3427
             lemma= 910  n= 1.1.1

Full text

אֵיכָ֣ה
׀
יָשְׁבָ֣ה
בָדָ֗ד

XML markup

 <div type="   " >
      <"">
        <"">
          <"" "" morph=""></>
          <seg type="x-paseq"></seg>
          <"" morph=""></>
          <"" "" morph=""></>

Morph

but all you want is the treasure: morph

HTi
HVqp3fs
HNcmsa

from

 <div type="book" osisID="Lam">
      <chapter osisID="Lam.1">
        <verse osisID="Lam.1.1">
          <w lemma="349 b" n="1.1.1.0" morph="HTi">אֵיכָ֣ה</w>
          <seg type="x-paseq">׀</seg>
          <w lemma="3427" morph="HVqp3fs">יָשְׁבָ֣ה</w>
          <w lemma="910" n="1.1.1" morph="HNcmsa">בָדָ֗ד</w>
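Peeling the morph treasure out of that packaging takes only a few lines with Python's standard library. The sketch below hard-codes the fragment shown above, completed with closing tags; real OSM files also carry an XML namespace, which is glossed over here.

```python
import xml.etree.ElementTree as ET

# the fragment above, completed so it is well-formed XML
fragment = """
<div type="book" osisID="Lam">
  <chapter osisID="Lam.1">
    <verse osisID="Lam.1.1">
      <w lemma="349 b" n="1.1.1.0" morph="HTi">אֵיכָ֣ה</w>
      <seg type="x-paseq">׀</seg>
      <w lemma="3427" morph="HVqp3fs">יָשְׁבָ֣ה</w>
      <w lemma="910" n="1.1.1" morph="HNcmsa">בָדָ֗ד</w>
    </verse>
  </chapter>
</div>
"""

root = ET.fromstring(fragment)
morphs = [w.get("morph") for w in root.iter("w")]
print(morphs)  # ['HTi', 'HVqp3fs', 'HNcmsa']
```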

In a nutshell

  1. you get more than you want
  2. what you want is intricately wrapped up
  3. we suffer from leaking concerns
  4. we are being micro-managed at several levels

but we do need better logistics in treasure sharing

Text-Fabric: weave your own web

AD 1425 Hausbücher der Nürnberger Zwölfbrüderstiftungen

Text Fabric

TF

Warp and weft

Every TF resource must have a few special features: the warp features.

All other features are weft features, they are woven into the warp.

(image: Wikipedia)

Warp features

  • otype: each node has a type
  • oslots: each non-slot node is linked to a set of slot nodes
  • otext: specification of sections and text formats
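In miniature, the warp idea looks like this (plain Python dicts with made-up node numbers, not the on-disk TF format):

```python
# slots 1-3 are words; node 4 is a phrase, node 5 a clause (hypothetical numbers)
otype = {1: "word", 2: "word", 3: "word", 4: "phrase", 5: "clause"}

# oslots: each non-slot node is linked to the set of slots it occupies
oslots = {4: {1, 2}, 5: {1, 2, 3}}

# a weft feature: concrete data, woven into the same node numbers
g_word = {1: "אֵיכָה", 2: "יָשְׁבָה", 3: "בָדָד"}

def textOf(node):
    """Render a node by collecting the words in its slots."""
    slots = oslots.get(node, {node})
    return " ".join(g_word[s] for s in sorted(slots))

print(textOf(4))  # the phrase: its first two words
```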

Weft features

These contain the concrete, tangible information:

  • the text
  • the linguistic annotations
  • additional data that is linked to the text

Resources and modules

A TF resource is a bunch of TF files.

A TF file contains the data for a single feature.

  • one fixed set of warp features: otype oslots otext
  • arbitrarily many weft features: sp, g_word_utf8, ...
  • can be augmented with wefts from TF modules.

A TF module

  • has only weft features
  • uses the warp of a main resource.

A weave

Fabric model + IKEA logistics => Workbench

Workbench for Cuneiform Tablets

In [3]:
LOC = ("~/github", "Nino-cunei/uruk", "Copenhagen2018")
from tf.extra.cunei import Cunei  # noqa E402

CN = Cunei(*LOC)
CN.api.makeAvailableIn(globals())
Found 2095 ideograph linearts
Found 2724 tablet linearts
Found 5495 tablet photos

This notebook online: NBViewer GitHub

In [4]:
pNumX = "P005381"
In [5]:
CN.photo(pNumX, width="400")
Out[5]:
In [6]:
CN.lineart(pNumX, width="300")
Out[6]:
In [7]:
tabletX = T.nodeFromSection((pNumX,))
sourceLines = CN.getSource(tabletX)
print("\n".join(sourceLines))
&P005381 = MSVO 3, 70
#atf: lang qpc 
@obverse 
@column 1 
1.a. 2(N14) , SZE~a SAL TUR3~a NUN~a 
1.b. 3(N19) , |GISZ.TE| 
2. 1(N14) , NAR NUN~a SIG7 
3. 2(N04)# , PIRIG~b1 SIG7 URI3~a NUN~a 
@column 2 
1. 3(N04) , |GISZ.TE| GAR |SZU2.((HI+1(N57))+(HI+1(N57)))| GI4~a 
2. , GU7 AZ SI4~f 
@reverse 
@column 1 
1. 3(N14) , SZE~a 
2. 3(N19) 5(N04) , 
3. , GU7 
@column 2 
1. , AZ SI4~f 
In [8]:
case = CN.nodeFromCase((pNumX, "obverse:1", "1a"))
In [9]:
CN.lineart(CN.getOuterQuads(case), width=50)

Tablet calculator

In [10]:
pNums = """
    P005381
    P005447
    P005448
""".strip().split()

pNumPat = "|".join(pNums)
In [11]:
shinPP = dict(
    N41=0.2,
    N04=1,
    N19=6,
    N46=60,
    N36=180,
    N49=1800,
)

shinPPPat = "|".join(shinPP)

We query for shinPP numerals on the faces of selected tablets. The result of the query is a list of tuples (t, f, s) consisting of a tablet node, a face node and a sign node, which is a shinPP numeral.

In [12]:
query = f"""
tablet catalogId={pNumPat}
    face
        sign type=numeral grapheme={shinPPPat}
"""
In [13]:
results = list(S.search(query))
len(results)
Out[13]:
20

We have found 20 numerals. We group the results by tablet and by face.

In [14]:
numerals = {}
for (tablet, face, sign) in results:
    numerals.setdefault(tablet, {}).setdefault(face, []).append(sign)

We show the tablets, the shinPP numerals per face, and we add up the numerals per face.
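The per-face addition reduces to: multiply each numeral's repeat count by its shinPP value, and sum. In isolation, with the (repeat, grapheme) pairs hard-coded from a line like `3(N19) 5(N04)` in the tablet source above:

```python
shinPP = dict(N41=0.2, N04=1, N19=6, N46=60, N36=180, N49=1800)

# hypothetical (repeat, grapheme) pairs as a query would deliver them for one face
signs = [(3, "N19"), (5, "N04")]

total = sum(repeat * shinPP[grapheme] for (repeat, grapheme) in signs)
print(total)  # 3 * 6 + 5 * 1 = 23
```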

In [15]:
def dm(x):
    display(Markdown(x))


def showResult(pNum, tabletLineart):
    dm("---\n")
    tablet = T.nodeFromSection((pNum,))
    if tabletLineart:
        display(CN.lineart(tablet, withCaption="top", width="200"))
    faces = numerals[tablet]
    for (face, signs) in faces.items():
        dm(f"### {F.type.v(face)}")
        distinctSigns = {}
        for s in signs:
            distinctSigns.setdefault(CN.atfFromSign(s), []).append(s)
        display(CN.lineart(distinctSigns))
        total = 0
        for (signAtf, signs) in distinctSigns.items():
            # note that all signs for the same signAtf have the same grapheme and repeat
            value = 0
            for s in signs:
                value += F.repeat.v(s) * shinPP[F.grapheme.v(s)]
            total += value
            amount = len(signs)
            shinPPval = shinPP[F.grapheme.v(signs[0])]
            repeat = F.repeat.v(signs[0])
            print(f"{amount} x {signAtf} = {amount} x {repeat} x {shinPPval} = {value}")
        dm(f"**total** = **{total}**")
In [16]:
widget = interactive(showResult, pNum=sorted(pNums), tabletLineart=False)
display(widget)

Workbench for Syriac Linking


In [17]:
from tf.fabric import Fabric  # noqa E402

REPO = "~/github/etcbc/linksyr"
SOURCE = "syrnt"
CORPUS = f"{REPO}/data/tf/{SOURCE}"
In [18]:
TF = Fabric(locations=[CORPUS], modules=[""], silent=False)
api = TF.load("", silent=True)
allFeatures = TF.explore(silent=True, show=True)
loadableFeatures = allFeatures["nodes"] + allFeatures["edges"]
TF.load(loadableFeatures, add=True, silent=True)
api.makeAvailableIn(globals())
This is Text-Fabric 3.2.5
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

37 features found and 0 ignored
In [19]:
SYRIACA = os.path.expanduser(f"{REPO}/data/syriaca")
SC_PEOPLE = f"{SYRIACA}/index_of_persons.csv"
SC_PLACES = f"{SYRIACA}/index_of_places.csv"

SC_URL = "http://syriaca.org"
SC_PLACE = "place"
SC_PERSON = "person"

SC_CONFIG = (
    (SC_PERSON, SC_URL, SC_PEOPLE),
    (SC_PLACE, SC_URL, SC_PLACES),
)

SC_TYPES = tuple(x[0] for x in SC_CONFIG)

SC_FIELDS = ("trans", "syriac", "id")

NA_SYRIAC = {
    "[Syriac Not Available]",
    "[Syriac Not",
    "[Syriac",
}
In [20]:
HTML(
    """
<style>
.syc {
    font-family: Estrangelo Edessa;
    font-size: 14pt;
}
</style>
"""
)
Out[20]:
In [21]:
tables = {}
irregular = {}

(transF, syriacF, idF) = SC_FIELDS

for (dataType, baseUrl, dataFile) in SC_CONFIG:
    tables[dataType] = {field: {} for field in SC_FIELDS}
    irregular[dataType] = set()
    dest = tables[dataType]
    irreg = irregular[dataType]
    table = dest[idF]
    indexTrans = dest[transF]
    indexSyriac = dest[syriacF]
    with open(dataFile) as fh:
        for (i, line) in enumerate(fh):
            (transV, syriacV, idV) = line.rstrip("\n").split("\t")
            prefix = f"{baseUrl}/{dataType}/"
            if idV.startswith(prefix):
                idV = idV.replace(prefix, "", 1)
            else:
                irreg.add(idV)
            table[idV] = (transV, syriacV)
            indexTrans.setdefault(transV, set()).add(idV)
            if syriacV not in NA_SYRIAC:
                if "[" in syriacV:
                    print(f'WARNING {dataType} line {i+1}: syriac value "{syriacV}"')
                indexSyriac.setdefault(syriacV, set()).add(idV)
In [23]:
for (dataType, data) in tables.items():
    table = data[idF]
    irreg = irregular[dataType]
    print(
        f"""
{dataType:>12}s: {len(table):>5} (irregular: {len(irreg):>4})
{"by syriac":>12} : {len(data[syriacF]):>5}
{"by trans":>12} : {len(data[transF]):>5}
"""
    )
      persons:  2371 (irregular:    0)
   by syriac :  1503
    by trans :  1964


       places:  2488 (irregular:    0)
   by syriac :   527
    by trans :  2165

In [22]:
hits = {dataType: {} for dataType in SC_TYPES}

for lx in F.otype.s("lexeme"):
    lex = F.lexeme.v(lx)
    for dataType in SC_TYPES:
        idV = tables[dataType][syriacF].get(lex, None)
        if idV is not None:
            hits[dataType][lx] = idV
In [23]:
for (dataType, theseHits) in hits.items():
    print(f"{dataType:>12}s: {len(theseHits):>5} hits")
      persons:    98 hits
       places:    37 hits

We show the hits by picking the first occurrence of each lexeme and showing it in context.

In [24]:
for (dataType, theseHits) in hits.items():
    markdown = f"""### {dataType}s
lexeme | linked | n-occs | passage | verse text
--- | --- | --- | --- | ---
"""
    for (lx, linked) in sorted(
        theseHits.items(),
        key=lambda x: F.lexeme.v(x[0]),
    )[0:10]:
        lex = F.lexeme.v(lx)
        ids = " ".join(sorted(linked))
        occs = L.d(lx, otype="word")
        passage = "{} {}:{}".format(*T.sectionFromNode(occs[0]))
        verse = L.u(occs[0], otype="verse")[0]
        text = T.text(L.d(verse, otype="word"))
        markdown += (
            f'<span class="syc">{lex}</span> | {ids} | {len(occs)} | {passage} |'
            f' <span class="syc">{text}</span>\n'
        )
    display(Markdown(markdown))

persons

lexeme linked n-occs passage verse text
ܐܒܐ 1094 2582 308 9 Matthew 2:22 ܟܕ ܕܝܢ ܫܡܥ ܕܐܪܟܠܐܘܣ ܗܘܐ ܡܠܟܐ ܒܝܗܘܕ ܚܠܦ ܗܪܘܕܣ ܐܒܘܗܝ ܕܚܠ ܕܢܐܙܠ ܠܬܡܢ ܘܐܬܚܙܝ ܠܗ ܒܚܠܡܐ ܕܢܐܙܠ ܠܐܬܪܐ ܕܓܠܝܠܐ
ܐܒܪܗܡ 1108 1109 1110 1546 1547 1548 1549 1551 1552 1553 1554 2202 964 2 Matthew 1:1 ܟܬܒܐ ܕܝܠܝܕܘܬܗ ܕܝܫܘܥ ܡܫܝܚܐ ܒܪܗ ܕܕܘܝܕ ܒܪܗ ܕܐܒܪܗܡ
ܐܕܝ 1117 1118 2203 2 Luke 3:28 ܒܪ ܡܠܟܝ ܒܪ ܐܕܝ ܒܪ ܩܘܣܡ ܒܪ ܐܠܡܘܕܕ ܒܪ ܥܝܪ
ܐܕܡ 1560 208 Luke 3:38 ܒܪ ܐܢܘܫ ܒܪ ܫܝܬ ܒܪ ܐܕܡ ܕܡܢ ܐܠܗܐ
ܐܗܪܘܢ 1012 1092 1533 1534 3 Luke 1:5 ܗܘܐ ܒܝܘܡܬܗ ܕܗܪܘܕܣ ܡܠܟܐ ܕܝܗܘܕܐ ܟܗܢܐ ܚܕ ܕܫܡܗ ܗܘܐ ܙܟܪܝܐ ܡܢ ܬܫܡܫܬܐ ܕܒܝܬ ܐܒܝܐ ܘܐܢܬܬܗ ܡܢ ܒܢܬܗ ܕܐܗܪܘܢ ܫܡܗ ܗܘܐ ܐܠܝܫܒܥ
ܐܘܒܘܠܘܣ 3028 1 2_Timothy 4:21 ܢܬܒܛܠ ܠܟ ܕܩܕܡ ܣܬܘܐ ܬܐܬܐ ܫܐܠ ܒܫܠܡܟ ܐܘܒܘܠܘܣ ܘܦܘܕܣ ܘܠܝܢܘܣ ܘܩܠܘܕܝܐ ܘܐܚܐ ܟܠܗܘܢ
ܐܚܐ 1122 1123 1740 3 Matthew 1:2 ܐܒܪܗܡ ܐܘܠܕ ܠܐܝܣܚܩ ܐܝܣܚܩ ܐܘܠܕ ܠܝܥܩܘܒ ܝܥܩܘܒ ܐܘܠܕ ܠܝܗܘܕܐ ܘܠܐܚܘܗܝ
ܐܝܣܚܩ 1788 1789 1790 1791 1792 2578 3 Matthew 1:2 ܐܒܪܗܡ ܐܘܠܕ ܠܐܝܣܚܩ ܐܝܣܚܩ ܐܘܠܕ ܠܝܥܩܘܒ ܝܥܩܘܒ ܐܘܠܕ ܠܝܗܘܕܐ ܘܠܐܚܘܗܝ
ܐܠܝܐ 1698 1699 1700 1703 1704 1705 2541 3145 945 1 Matthew 2:18 ܩܠܐ ܐܫܬܡܥ ܒܪܡܬܐ ܒܟܝܐ ܘܐܠܝܐ ܣܓܝܐܐ ܪܚܝܠ ܒܟܝܐ ܥܠ ܒܢܝܗ ܘܠܐ ܨܒܝܐ ܠܡܬܒܝܐܘ ܡܛܠ ܕܠܐ ܐܝܬܝܗܘܢ
ܐܠܟܣܢܕܪܘܣ 1574 887 1 Mark 15:21 ܘܫܚܪܘ ܚܕ ܕܥܒܪ ܗܘܐ ܫܡܥܘܢ ܩܘܪܝܢܝܐ ܕܐܬܐ ܗܘܐ ܡܢ ܩܪܝܬܐ ܐܒܘܗܝ ܕܐܠܟܣܢܕܪܘܣ ܘܕܪܘܦܘܣ ܕܢܫܩܘܠ ܙܩܝܦܗ

places

lexeme linked n-occs passage verse text
ܐܘܪܫܠܡ 104 2 Matthew 2:1 ܟܕ ܕܝܢ ܐܬܝܠܕ ܝܫܘܥ ܒܒܝܬ-ܠܚܡ ܕܝܗܘܕܐ ܒܝܘܡܝ ܗܪܘܕܣ ܡܠܟܐ ܐܬܘ ܡܓܘܫܐ ܡܢ ܡܕܢܚܐ ܠܐܘܪܫܠܡ
ܐܠܟܣܢܕܪܝܐ 572 2 Acts 6:9 ܘܩܡܘ ܗܘܘ ܐܢܫܐ ܡܢ ܟܢܘܫܬܐ ܕܡܬܩܪܝܐ ܕܠܝܒܪܛܝܢܘ ܘܩܘܪܝܢܝܐ ܘܐܠܟܣܢܕܪܝܐ ܘܕܡܢ ܩܝܠܝܩܝܐ ܘܡܢ ܐܣܝܐ ܘܕܪܫܝܢ ܗܘܘ ܥܡ ܐܣܛܦܢܘܣ
ܐܢܛܝܘܟܝܐ 10 995 44 Acts 6:5 ܘܫܦܪܬ ܗܕܐ ܡܠܬܐ ܩܕܡ ܟܠܗ ܥܡܐ ܘܓܒܘ ܠܐܣܛܦܢܘܣ ܓܒܪܐ ܕܡܠܐ ܗܘܐ ܗܝܡܢܘܬܐ ܘܪܘܚܐ ܕܩܘܕܫܐ ܘܠܦܝܠܝܦܘܣ ܘܠܦܪܟܪܘܣ ܘܠܢܝܩܢܘܪ ܘܠܛܝܡܘܢ ܘܠܦܪܡܢܐ ܘܠܢܝܩܠܐܘܣ ܓܝܘܪܐ ܐܢܛܝܘܟܝܐ
ܐܣܦܣ 288 5 Romans 3:13 ܩܒܪܐ ܦܬܝܚܐ ܓܓܪܬܗܘܢ ܘܠܫܢܝܗܘܢ ܢܟܘܠܬܢܝܢ ܘܚܡܬܐ ܕܐܣܦܣ ܬܚܝܬ ܣܦܘܬܗܘܢ
ܐܦܣܘܣ 623 69 Acts 18:19 ܘܡܛܝܘ ܠܐܦܣܘܣ ܘܥܠ ܦܘܠܘܣ ܠܟܢܘܫܬܐ ܘܡܡܠܠ ܗܘܐ ܥܡ ܝܗܘܕܝܐ
ܐܪܟ 515 1 Matthew 23:5 ܘܟܠܗܘܢ ܥܒܕܝܗܘܢ ܥܒܕܝܢ ܕܢܬܚܙܘܢ ܠܒܢܝ ܐܢܫܐ ܡܦܬܝܢ ܓܝܪ ܬܦܠܝܗܘܢ ܘܡܘܪܟܝܢ ܬܟܠܬܐ ܕܡܪܛܘܛܝܗܘܢ
ܓܐܝܘܣ 1494 1 Acts 19:29 ܘܐܫܬܓܫܬ ܟܠܗ ܡܕܝܢܬܐ ܘܪܗܛܘ ܐܟܚܕܐ ܘܐܙܠܘ ܠܬܐܛܪܘܢ ܘܚܛܦܘ ܐܘܒܠܘ ܥܡܗܘܢ ܠܓܐܝܘܣ ܘܠܐܪܣܛܪܟܘܣ ܓܒܪܐ ܡܩܕܘܢܝܐ ܒܢܝ ܠܘܝܬܗ ܕܦܘܠܘܣ
ܕܪܐ 67 32 Matthew 21:44 ܘܡܢ ܕܢܦܠ ܥܠ ܟܐܦܐ ܗܕܐ ܢܬܪܥܥ ܘܟܠ ܡܢ ܕܗܝ ܬܦܠ ܥܠܘܗܝ ܬܕܪܝܘܗܝ
ܕܪܡܣܘܩ 66 24 Acts 9:2 ܘܫܐܠ ܠܗ ܐܓܪܬܐ ܡܢ ܪܒ ܟܗܢܐ ܕܢܬܠ ܠܗ ܠܕܪܡܣܘܩ ܠܟܢܘܫܬܐ ܕܐܢ ܗܘ ܕܢܫܟܚ ܕܪܕܝܢ ܒܗܕܐ ܐܘܪܚܐ ܓܒܪܐ ܐܘ ܢܫܐ ܢܐܣܘܪ ܢܝܬܐ ܐܢܘܢ ܠܐܘܪܫܠܡ
ܚܘܪܐ 1456 4 Matthew 5:36 ܐܦܠܐ ܒܪܫܟ ܬܐܡܐ ܕܠܐ ܡܫܟܚ ܐܢܬ ܠܡܥܒܕ ܒܗ ܡܢܬܐ ܚܕܐ ܕܣܥܪܐ ܐܘܟܡܬܐ ܐܘ ܚܘܪܬܐ
In [26]:
syriacaResolve = os.path.expanduser(f"{REPO}/data/user/syriacaSyrNT.csv")
In [27]:
fieldNames = ("lexeme", "trans", "url", "applicable")

fh = open(syriacaResolve, "w")
for (dataType, theseHits) in hits.items():
    tsv = "\t".join(fieldNames) + "\n"
    markdown = f"""### {dataType}s
{" | ".join(fieldNames)}
--- | --- | --- | ---
"""
    table = tables[dataType]
    data = table[idF]
    for (lx, linked) in sorted(
        theseHits.items(),
        key=lambda x: F.lexeme.v(x[0]),
    )[0:3]:
        lex = F.lexeme.v(lx)
        for lid in linked:
            trans = data[lid][0]
            url = f"{SC_URL}/{dataType}/{lid}"
            markdown += f'<span class="syc">{lex}</span> | {trans} | {url} | no\n'
            tsv += f"{lex}\t{trans}\t{url}\tno\n"

    fh.write(tsv)
    display(Markdown(markdown))
fh.close()

persons

lexeme trans url applicable
ܐܒܐ Aba of Nineveh http://syriaca.org/person/1094 no
ܐܒܐ Abba http://syriaca.org/person/2582 no
ܐܒܐ Aba http://syriaca.org/person/308 no
ܐܒܪܗܡ Abraham http://syriaca.org/person/964 no
ܐܒܪܗܡ Abraham http://syriaca.org/person/1548 no
ܐܒܪܗܡ Abraham of Harran http://syriaca.org/person/1549 no
ܐܒܪܗܡ Abraham http://syriaca.org/person/1546 no
ܐܒܪܗܡ Abraham of the High Mountain http://syriaca.org/person/1109 no
ܐܒܪܗܡ Abraham http://syriaca.org/person/1110 no
ܐܒܪܗܡ Abraham http://syriaca.org/person/1547 no
ܐܒܪܗܡ Abraham II of Adiabene http://syriaca.org/person/1552 no
ܐܒܪܗܡ Abraham of Adiabene http://syriaca.org/person/1551 no
ܐܒܪܗܡ Abraham http://syriaca.org/person/2202 no
ܐܒܪܗܡ Abraham the Egyptian http://syriaca.org/person/1553 no
ܐܒܪܗܡ Abraham the Priest http://syriaca.org/person/1554 no
ܐܒܪܗܡ Abraham, bishop of Arbela http://syriaca.org/person/1108 no
ܐܕܝ Addai http://syriaca.org/person/1118 no
ܐܕܝ Addai http://syriaca.org/person/1117 no
ܐܕܝ Addai http://syriaca.org/person/2203 no

places

lexeme trans url applicable
ܐܘܪܫܠܡ Jerusalem (settlement) http://syriaca.org/place/104 no
ܐܠܟܣܢܕܪܝܐ Alexandria (settlement) http://syriaca.org/place/572 no
ܐܢܛܝܘܟܝܐ Antioch (settlement) http://syriaca.org/place/10 no
ܐܢܛܝܘܟܝܐ Antioch (region) http://syriaca.org/place/995 no

Workbench (BHSA)


In [28]:
import sys  # noqa F401
import os  # noqa F401
import collections  # noqa F401
from utils import structure, layout  # noqa F401
from IPython.display import display, HTML, Markdown  # noqa F401
from ipywidgets import interact, interactive, fixed, interact_manual  # noqa F401
import ipywidgets as widgets  # noqa F401

from tf.fabric import Fabric  # noqa F401
from tf.extra.bhsa import Bhsa  # noqa F401
In [32]:
VERSION = "2017"
BASE = "~/github"
ETCBC = f"{BASE}/etcbc"
BHSA = f"bhsa/tf/{VERSION}"
TREES = f"lingo/trees/tf/{VERSION}"  # derived wefts
OSM = f"bridging/tf/{VERSION}"  # wefts from the OSM crafts shop
PHONO = f"phono/tf/{VERSION}"  # derived wefts
PARALLELS = f"parallels/tf/{VERSION}"  # derived wefts
In [34]:
TF = Fabric(locations=[ETCBC], modules=[BHSA, PHONO, PARALLELS, TREES, OSM])
api = TF.load(
    """
    g_word_utf8 g_cons_utf8
    voc_lex_utf8 gloss
    phono crossref tree
    osm
""",
    silent=True,
)
api.makeAvailableIn(globals())

B = Bhsa(api, "Copenhagen2018", version="2017")
This is Text-Fabric 3.2.5
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

124 features found and 0 ignored

This notebook online: NBViewer GitHub

In [35]:
verse = T.nodeFromSection(("Genesis", 1, 7))
In [36]:
B.pretty(verse)
1414196
sentence 1172226
clause 427571 NA WayX
phrase 651562 Conj CP
וַ
and
phrase 651563 Pred VP
יַּ֣עַשׂ
make
phrase 651564 Subj NP
אֱלֹהִים֮
god(s)
phrase 651565 Objc PP
אֶת־
<object marker>
הָ
the
רָקִיעַ֒
firmament
sentence 1172227
clause 427572 NA Way0
phrase 651566 Conj CP
וַ
and
phrase 651567 Pred VP
יַּבְדֵּ֗ל
separate
phrase 651568 Cmpl PP
בֵּ֤ין
interval
הַ
the
מַּ֨יִם֙
water
clause 427573 Attr NmCl
phrase 651569 Rela CP
אֲשֶׁר֙
<relative>
phrase 651570 PreC PP
מִ
from
תַּ֣חַת
under part
phrase 651570 PreC PP
לָ
to
the
רָקִ֔יעַ
firmament
clause 427572 NA Way0
phrase 651568 Cmpl PP
וּ
and
phrase 651568 Cmpl PP
בֵ֣ין
interval
הַ
the
מַּ֔יִם
water
clause 427574 Attr NmCl
phrase 651571 Rela CP
אֲשֶׁ֖ר
<relative>
phrase 651572 PreC PP
מֵ
from
עַ֣ל
upon
לָ
to
the
רָקִ֑יעַ
firmament
sentence 1172228
clause 427575 NA Way0
phrase 651573 Conj CP
וַֽ
and
phrase 651574 Pred VP
יְהִי־
be
phrase 651575 Modi AdvP
כֵֽן׃
thus
In [37]:
clause = 427572
B.pretty(clause)
clause 427572 NA Way0
phrase 651566 Conj CP
וַ
and
phrase 651567 Pred VP
יַּבְדֵּ֗ל
separate
phrase 651568 Cmpl PP
בֵּ֤ין
interval
הַ
the
מַּ֨יִם֙
water
clause 427572 NA Way0
phrase 651568 Cmpl PP
וּ
and
phrase 651568 Cmpl PP
בֵ֣ין
interval
הַ
the
מַּ֔יִם
water
In [38]:
HTML(B.shbLink(clause))
Out[38]:

Queries


In [39]:
ellipQuery = """
sentence
  c1:clause
    phrase function=Pred
      word pdp=verb
  c2:clause
    phrase function=Pred
  c3:clause typ=Ellp
    phrase function=Objc
      word pdp=subs|nmpr|prps|prde|prin
  c1 << c2
  c2 << c3
"""
In [40]:
results = B.search(ellipQuery)
len(results)
Out[40]:
1410
In [41]:
def f(n):
    B.show(results, n, n + 1, withNodes=True)
In [42]:
interact(f, n=widgets.IntSlider(min=0, max=len(results) - 1, step=1, value=0))
Out[42]:
<function __main__.f(n)>

You can do this!

because:

  • the text model works with proper logic:
    • graph = nodes + edges + feature annotations
    • very similar to the model of Emdros (MQL)
  • the data packaging is for efficient logistics
  • but do take a beginner's course in Python

Trees

In 2013/2014 we extracted tree structures from the BHSA data.

Every sentence has a tree associated with it, like this:

(S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))

The numbers refer to the words in the sentence.
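The notebook's `structure` helper (from utils) turns such a string into nested lists. Here is a minimal reimplementation sketch of that bracket format, in which leaves are `(tag number)` pairs; it is illustrative, not the actual helper:

```python
import re

def parseTree(ts):
    """Parse '(S(C(VP(vb 0))))' into nested lists; leaves become (tag, n) tuples."""
    tokens = re.findall(r"[()]|[^\s()]+", ts)
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                        # consume '('
        tag = tokens[pos]; pos += 1
        if tokens[pos] not in "()":     # leaf: "(vb 0)"
            node = [(tag, int(tokens[pos]))]
            pos += 1
        else:                           # internal node: tag plus children
            node = [tag]
            while tokens[pos] == "(":
                node.append(parse())
        pos += 1                        # consume ')'
        return node

    return parse()

print(parseTree("(S(C(CP(cj 0))(VP(vb 1))))"))
```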

Trees as feature

The trees are available in a feature tree, defined for sentences.

@node
@converter=Dirk Roorda
@convertor=trees.ipynb
@coreData=BHSA
@coreVersion=2017
@description=penn treebank represententation for sentences
@url=https://github.com/etcbc/lingo/trees/trees.ipynb
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2018-01-21T18:53:06Z

1172209 (S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))
(S(C(CP(cj 0))(NP(dt 1)(n 2))(VP(vb 3))(NP(U(n 4))(cj 5)(U(n 6)))))
(S(C(CP(cj 0))(NP(n 1))(PP(pp 2)(U(n 3))(U(n 4)))))
(S(C(CP(cj 0))(NP(U(n 1))(U(n 2)))(VP(vb 3))(PP(pp 4)(U(n 5))(U(dt 6)(n 7)))))

... and 60,000 more lines

Trees are nice. But this output does not look nice.

Display

We want

  • multiline view
  • see the words
  • phonetically
  • with gloss
  • and with Open Scriptures Morphology tag!
In [43]:
passage = ("Job", 3, 16)
passageStr = "{} {}:{}".format(*passage)
verse = T.nodeFromSection(passage)
sentence = L.d(verse, otype="sentence")[0]
firstSlot = L.d(sentence, otype="word")[0]
stringTree = F.tree.v(sentence)
print(f"{passageStr} - first word = {firstSlot}\n\ntree =\n{stringTree}")
Job 3:16 - first word = 336986

tree =
(S(C(Ccoor(CP(cj 0))(PP(pp 1)(U(n 2))(U(vb 3)))(NegP(ng 4))(VP(vb 5)))(Ccoor(PP(pp 6)(n 7)(Cattr(NegP(ng 8))(VP(vb 9))(NP(n 10)))))))

Parsing

Parse it into a structure:

In [44]:
tree = structure(stringTree)
tree
Out[44]:
['S',
 ['C',
  ['Ccoor',
   ['CP', [('cj', 0)]],
   ['PP', [('pp', 1)], ['U', [('n', 2)]], ['U', [('vb', 3)]]],
   ['NegP', [('ng', 4)]],
   ['VP', [('vb', 5)]]],
  ['Ccoor',
   ['PP',
    [('pp', 6)],
    [('n', 7)],
    ['Cattr',
     ['NegP', [('ng', 8)]],
     ['VP', [('vb', 9)]],
     ['NP', [('n', 10)]]]]]]]

We can display it a bit more friendly:

In [45]:
print(layout(tree, firstSlot, str))
  S
    C
      Ccoor
        CP
          cj 336986
        PP
          pp 336987
          U
            n 336988
          U
            vb 336989
        NegP
          ng 336990
        VP
          vb 336991
      Ccoor
        PP
          pp 336992
          n 336993
          Cattr
            NegP
              ng 336994
            VP
              vb 336995
            NP
              n 336996

Note that layout() has replaced the relative word numbers in the sentence with absolute slot numbers in the dataset.
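A sketch of what a layout() helper does under the hood: walk the nested list, indent by depth, and map the relative word number n to the absolute slot firstSlot + n. This is a hypothetical reimplementation, not the notebook's actual utils.layout:

```python
def layoutSketch(tree, firstSlot, fmtWord, level=0):
    """Render a parsed tree with two-space indentation per level."""
    indent = "  " * (level + 1)
    head = tree[0]
    if isinstance(head, tuple):                     # leaf: [('vb', 3)]
        (tag, n) = head
        return f"{indent}{tag} {fmtWord(firstSlot + n)}"  # relative -> absolute
    lines = [f"{indent}{head}"]
    for child in tree[1:]:
        lines.append(layoutSketch(child, firstSlot, fmtWord, level + 1))
    return "\n".join(lines)

tree = ["S", ["C", ["VP", [("vb", 0)]]]]
print(layoutSketch(tree, 336986, str))
```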

Weaving the wefts ...

All wefts are there, we have to weave them around each warp.

In [46]:
def osmPhonoGloss(n):
    lexNode = L.u(n, otype="lex")[0]
    return '{{{}}} "{}" [{}] = {}'.format(
        F.osm.v(n),
        F.g_word_utf8.v(n),
        F.phono.v(n),
        F.gloss.v(lexNode),  # gloss is a feature on lexemes, not words
        # F.voc_lex_utf8.v(lexNode),
    )

... into a weave

In [47]:
print(layout(tree, firstSlot, osmPhonoGloss, withLevel=True))
 1  S
 2    C
 3      Ccoor
 4        CP
 5          cj {HC} "אֹ֚ו" [ˈʔô] = or
 4        PP
 5          pp {HR} "כְ" [ḵᵊ] = as
 5          U
 6            n {HNcmsa} "נֵ֣פֶל" [nˈēfel] = miscarriage
 5          U
 6            vb {HVqsmsa} "טָ֭מוּן" [ˈṭāmûn] = hide
 4        NegP
 5          ng {HTn} "לֹ֣א" [lˈō] = not
 4        VP
 5          vb {HVqi1cs} "אֶהְיֶ֑ה" [ʔehyˈeh] = be
 3      Ccoor
 4        PP
 5          pp {HR} "כְּ֝" [ˈkᵊ] = as
 5          n {HNcmpa} "עֹלְלִ֗ים" [ʕōlᵊlˈîm] = child
 5          Cattr
 6            NegP
 7              ng {HTn} "לֹא" [lō-] = not
 6            VP
 7              vb {HVqp3cp} "רָ֥אוּ" [rˌāʔû] = see
 6            NP
 7              n {HNcbsa} "אֹֽור" [ʔˈôr] = light
In [48]:
def showTree(s):
    t = F.tree.v(s)
    tree = structure(t)
    firstSlot = L.d(s, otype="word")[0]
    label = "{} {}:{}".format(*T.sectionFromNode(firstSlot))
    print(label)
    print(layout(tree, firstSlot, osmPhonoGloss, withLevel=True))
    return 0


sentenceInfo = [c for c in C.levels.data if c[0] == "sentence"][0]
minSentence = sentenceInfo[2]
maxSentence = sentenceInfo[3]
In [49]:
interact(
    showTree,
    s=widgets.IntSlider(
        min=minSentence,
        max=maxSentence,
        step=1,
        value=minSentence,
    ),
)
Out[49]:
<function __main__.showTree(s)>

No leaking of concerns

  • The TREES module knows nothing of OS morphology
  • OS morphology is not aware of TREES
  • thank goodness
  • But they are woven cosily together in one display


On Perseus, via Scaife

Carried away by tree structures

The raw tree strings lend themselves to structural analysis in a way that the woven displays do not.

Let us see how many distinct tree structures we've got.

liberate yourselves from micro-management

In [50]:
treeDistribution = F.tree.freqList()

distinct = len(treeDistribution)
total = sum(x[1] for x in treeDistribution)

print(f"{distinct} distinct trees of {total} in total")
28096 distinct trees of 63711 in total
In [51]:
for (tree, amount) in treeDistribution[0:10]:
    print(f"{amount:>4} x {tree}")
3772 x (S(C(CP(cj 0))(VP(vb 1))))
1238 x (S(C(VP(vb 0))))
1173 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2))))
 857 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(n 3))))
 749 x (S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))
 577 x (S(C(CP(cj 0))(VP(vb 1))(PrNP(n-pr 2))))
 568 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(dt 3)(n 4))))
 554 x (S(C(VP(vb 0))(NP(n 1))))
 441 x (S(C(CP(cj 0))(NegP(ng 1))(VP(vb 2))))
 406 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(n-pr 3))))

I'm intrigued by the most frequent tree structure.

Which verbs occur in such a sentence? Let's find out.

In [52]:
lexemes = collections.Counter()
short = treeDistribution[0][0]
for s in F.otype.s("sentence"):
    if F.tree.v(s) == short:
        verb = L.d(s, otype="word")[1]
        lexeme = L.u(verb, otype="lex")[0]
        lexemes[lexeme] += 1
print(f"{len(lexemes)} lexemes found")
501 lexemes found
In [53]:
for (lex, amount) in sorted(
    lexemes.items(),
    key=lambda x: (-x[1], x[0]),
)[0:10]:
    print(f'{amount:>4} x {lex} "{F.voc_lex_utf8.v(lex)}" = {F.gloss.v(lex)}')
1045 x 1437422 "אמר" = say
 203 x 1437412 "היה" = be
 107 x 1437561 "הלך" = walk
 106 x 1437570 "מות" = die
  87 x 1437424 "ראה" = see
  80 x 1437574 "בוא" = come
  71 x 1437645 "שׁוב" = return
  70 x 1437569 "אכל" = eat
  52 x 1437685 "קום" = arise
  45 x 1437654 "חיה" = be alive

Open Scriptures Morphology

  • align the WLC with the BHS
  • compare the OSM with the BHSA

Aligning

See BHSAbridgeOSM.ipynb

  • performs a consonant by consonant alignment between the WLC and BHS
  • stumbled on a few cases requiring a hint:
exceptions = {
    215253: 1,
    266189: 1,
    287360: 2,
    376865: 1,
    383405: 2,
    384049: 1,
    384050: 1,
    405102: -2,
}
Succeeded in aligning BHS with OSM
420103 BHS words matched against 469448 OSM morphemes with 8 known exceptions
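A toy sketch of how such an anomaly surfaces: reduce both editions to consonant strings and report the first position where they diverge; the exceptions dict above then supplies a manual correction at those points. The function below is illustrative, not the actual BHSAbridgeOSM code:

```python
def firstDivergence(a, b):
    """Return the index of the first differing position, or None if equal."""
    for (i, (ca, cb)) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

# the Isaiah 9:6 case shown below: final mem versus regular mem in position 0
print(firstDivergence("מרבה", "םרבה"))
```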

Spotting the anomalies

With a bit of weaving, these exceptions are:

Isaiah 9:6
                    BHS 215253         = מרבה
                    OSM w1             = םרבה
Ezekiel 4:6
                    BHS 266189         = ימוני
                    OSM w7             = ימיני
Ezekiel 43:11
                    BHS 287360         = צורתו
                    OSM w17, w17       = צורת/י
Daniel 10:19
                    BHS 376865         = כְ
                    OSM w10            = בְ
Ezra 10:44
                    BHS 383405         = נשׂאו
                    OSM w3, w3         = נשא/י
Nehemiah 2:13
                    BHS 384049         = הם
                    OSM w17            = ה
Nehemiah 2:13
                    BHS 384050         = פרוצים
                    OSM w17            = מפרוצים
1_Chronicles 27:12
                    BHS 405102, 405103 = בן/ימיני
                    OSM w6             = בנימיני

Word breaking

There are cases where the OSM and the BHSA differ in the breaking-up of words.

OSM morphemes without BHSA word:          0
OSM morphemes with multiple BHSA words: 130
OSM morphemes with 2        BHSA words: 123
OSM morphemes with 3        BHSA words:   7

Unfinished

The OSM is not yet finished.

We made a list of word nodes for which no morpheme has been tagged.

53841 =~ 10% unfinished.

Non-marked-up stretches having length x: y times
   1: 14990
   2:  8336
   3:  2802
   4:  1090
   5:   493
   6:   285
   7:   162
   8:    90
   9:    70
  10:    37
  11:    33
  12:    19
  13:    11
  14:    17
  15:     9
  16:     7
  17:     2
  18:     2
  19:     6
  20:     1
  21:     1
  22:     3
  23:     2
  25:     2
  26:     2
  27:     1
  28:     1
  29:     1
  32:     1
  33:     1
  35:     1
  36:     1
  38:     2
  41:     1
  47:     1
  60:     1
  61:     1
  72:     1
  74:     1
  75:     1
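The distribution above is a plain run-length count over a per-word "has an OSM tag" flag; a self-contained sketch, assuming one boolean per word:

```python
from collections import Counter

def stretchLengths(tagged):
    """Count maximal runs of untagged words: {run length: number of runs}."""
    dist = Counter()
    run = 0
    for hasTag in tagged:
        if hasTag:
            if run:
                dist[run] += 1
            run = 0
        else:
            run += 1
    if run:                  # a run may end at the last word
        dist[run] += 1
    return dist

print(stretchLengths([True, False, False, True, False, True, True]))
```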

What remains is: filling in the dots!

We will carry out the comparison for unproblematic words:

  • good alignment ( 8 BHSA words excluded)
  • same word breaks (276 BHSA words excluded)
  • morph tags available

Result: OSM module

Two new TF features:

  • osm.tf (main words)
  • osm_sf.tf (suffixes)

Together: the OSM module

@node
@conversion=notebook openscriptures in BHSA repo
@conversion_author=Dirk Roorda
@coreData=BHSA
@coreVersion=2017
@description=primary morphology string according to OpenScriptures
@source=Open Scriptures
@source_url=https://github.com/openscriptures/morphhb
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2018-01-12T13:21:01Z

HR
HNcfsa
HVqp3ms
HNcmpa
HTo
HTd
HNcmpa
HC
HTo
HTd
HNcbsa

... and 400,000 more lines

Comparing

We compare categories.

In OSM:

  • part-of-speech
  • and their subtypes

In BHSA: the features sp, ls, and nametype.

OSM categories

pspOSM = {
    '': dict(
        A='adjective',
        C='conjunction',
        D='adverb',
        N='noun',
        P='pronoun',
        R='preposition',
        S='suffix',
        T='particle',
        V='verb',
    ),
    'A': dict(
        a='adjective',
        c='cardinal number',
        g='gentilic',
        o='ordinal number',
    ),
    'N': dict(
        c='common',
        g='gentilic',
        p='proper name',
    ),
    'P': dict(
        d='demonstrative',
        f='indefinite',
        i='interrogative',
        p='personal',
        r='relative',
    ),
    'R': dict(
        d='definite article',
    ),
    'S': dict(
        d='directional he',
        h='paragogic he',
        n='paragogic nun',
        p='pronominal',
    ),
    'T': dict(
        a='affirmation',
        d='definite article',
        e='exhortation',
        i='interrogative',
        j='interjection',
        m='demonstrative',
        n='negative',
        o='direct object marker',
        r='relative',
    ),
}
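A sketch of reading an OSM code with these tables (excerpted below). The function is illustrative only: real OSM codes continue with stem, tense, person, gender, and number, which are not decoded here.

```python
# excerpt of the pspOSM table above
pspOSM = {
    "": dict(C="conjunction", N="noun", R="preposition", T="particle", V="verb"),
    "N": dict(c="common", g="gentilic", p="proper name"),
    "T": dict(d="definite article", n="negative", o="direct object marker"),
}

def decodeOSM(morph):
    """Decode language, part of speech, and (if present) subtype of an OSM code."""
    lang = {"H": "Hebrew", "A": "Aramaic"}[morph[0]]
    psp = pspOSM[""][morph[1]]
    subtype = pspOSM.get(morph[1], {}).get(morph[2]) if len(morph) > 2 else None
    return (lang, psp, subtype)

print(decodeOSM("HNcmsa"))  # ('Hebrew', 'noun', 'common')
```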

BHSA categories

spBHS = dict(
    art='article',
    verb='verb',
    subs='noun',
    nmpr='proper noun',
    advb='adverb',
    prep='preposition',
    conj='conjunction',
    prps='personal pronoun',
    prde='demonstrative pronoun',
    prin='interrogative pronoun',
    intj='interjection',
    nega='negative particle',
    inrg='interrogative particle',
    adjv='adjective',
)
lsBHS = dict(
    nmdi='distributive noun',
    nmcp='copulative noun',
    padv='potential adverb',
    afad='anaphoric adverb',
    ppre='potential preposition',
    cjad='conjunctive adverb',
    ordn='ordinal',
    vbcp='copulative verb',
    mult='noun of multitude',
    focp='focus particle',
    ques='interrogative particle',
    gntl='gentilic',
    quot='quotation verb',
    card='cardinal',
    none=MISSING,
)
nametypeBHS = dict(
    pers='person',
    mens='measurement unit',
    gens='people',
    topo='place',
    ppde='demonstrative personal pronoun',
)
nametypeBHS.update({
    'pers,gens,topo': 'person',
    'pers,gens': 'person',
    'gens,topo': 'gentilic',
    'pers,god': 'person',
})

Better dumb than smart

We just counted the pairs of OSM, BHSA categories that co-occurred on words.

A selection of the outcomes.
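"Dumb" counting is just a Counter over (OSM category, BHSA category) pairs, one pair per word; a self-contained sketch with made-up pairs:

```python
from collections import Counter

# hypothetical per-word category pairs, as the comparison would yield them
pairs = [
    ("verb", "verb"), ("verb", "verb"), ("verb", "quotation verb"),
    ("preposition", "preposition"), ("preposition", "potential preposition"),
]

cooc = Counter(pairs)
total = len(pairs)
for ((osmCat, bhsaCat), n) in cooc.most_common():
    print(f"{osmCat}:{bhsaCat:<25} ({100 * n // total:>3}% = {n:>2}x)")
```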

This is OSM versus BHSA

Verbs

verb
    verb::                         ( 84% =  50691x)
    verb:quotation verb:           ( 10% =   6137x)
    verb:copulative verb:          (  5% =   3246x)
    noun::                         (  0% =      6x)
    adjective::                    (  0% =      3x)
    preposition::                  (  0% =      1x)
    proper noun::                  (  0% =      1x)

Excellent! Just 11 discrepancies in some 60,000 cases: 99.98% agreement!

Prepositions

preposition
    preposition::                  ( 96% =  50697x)
    noun:potential preposition:    (  3% =   1643x)
    adverb:conjunctive adverb:     (  0% =    194x)
    interrogative particle::       (  0% =    169x)
    noun:cardinal:                 (  0% =     13x)
    conjunction::                  (  0% =      5x)
    noun::                         (  0% =      2x)
    proper noun::                  (  0% =      2x)
    article::                      (  0% =      1x)
    verb::                         (  0% =      1x)
In [54]:
disc = 194 + 169 + 13 + 5 + 2 + 2 + 1 + 1
tot = 50697 + disc
discPerc = round(100 * disc / tot, 2)
print(f"Discrepancies: {discPerc}% = {disc}x out of {tot}")
Discrepancies: 0.76% = 387x out of 51084

Attention needed!

  • all rare cases have been collected into a big list
  • context info has been woven into the list
  • there are 645 such cases
  • see allCategoriesCases.tsv on GitHub


Follow up?

  • inspect the rare cases:
    • these might be glitches, in BHSA or in OSM or in both
    • these might be disputable cases: add them to the docs
  • inspect the majority cases: which categories map to which?
    • maybe some categories can be harmonized
    • if that is not desirable: we can generate an exhaustive mapping
  • in the end: we can make a BHSA-OSM category mapping that is
    • comprehensive
    • machine-readable
    • documented

Conclusions

BHSA versus OSM

BHSA is awesome

OSM is terrific

BHSA $+$ OSM $\gt$ awesome $+$ terrific

Data, Logic, and Logistics: Text-Fabric

Text-Fabric is a tool to support the logistics of the interchange of textual treasures

So that you ...

  • ... researcher and tinkerer
  • ... programming theologian

  • can grab parts from GitHub

  • bring them to your shed
  • and join them together on your workbench

Designed especially for you - Thank you

[email protected]