import os
from itertools import chain
from tf.app import use
from tf.core.files import dirMake
A = use("ETCBC/bhsa:clone", checkout="clone", hoist=globals())
Locating corpus resources ...
Name | # of nodes | # slots / node | % coverage
---|---|---|---
book | 39 | 10938.21 | 100
chapter | 929 | 459.19 | 100
lex | 9230 | 46.22 | 100
verse | 23213 | 18.38 | 100
half_verse | 45179 | 9.44 | 100
sentence | 63717 | 6.70 | 100
sentence_atom | 64514 | 6.61 | 100
clause | 88131 | 4.84 | 100
clause_atom | 90704 | 4.70 | 100
phrase | 253203 | 1.68 | 100
phrase_atom | 267532 | 1.59 | 100
subphrase | 113850 | 1.42 | 38
word | 426590 | 1.00 | 100
query = """
phrase
=: fi:word
# la:word
:=
fi .nu. la
"""
resultsQ = A.search(query)
0.90s 15522 results
A.show(resultsQ, end=3, condenseType="clause")
(the rendered displays of the first three results are omitted here)
# the same task, hand-coded against the TF API
resultsH = []

for p in F.otype.s("phrase"):
    ws = L.d(p, otype="word")
    if len(ws) < 2:
        continue
    fi = ws[0]
    la = ws[-1]
    if F.nu.v(fi) != F.nu.v(la):
        continue
    resultsH.append((p, fi, la))

len(resultsH)
15522
set(resultsQ) == set(resultsH)
True
Challenges:

Given a phrase, we need to find its words. A phrase is an annotation with key `otype` and value `phrase`, whose targets are word annotations. These targets are easy to get, by means of the `annotations()` method on annotations. But then we have to find the first and the last word among these targets, and that is currently difficult!

You need a concept of order between annotations. One possibility is to put sequence numbers in the data, as annotations. But that is very cumbersome, because you need to refer to yet another level of annotation, and it will inflate the data.

The other possibility is "canonical ordering". Annotations that target the text can be ordered by their targets. A target is a subset of textual positions, and two such subsets can be ordered as follows:

* the annotation whose target starts at the earlier position comes first;
* if both targets start at the same position, the annotation whose target ends at the later position comes first, so that embedding annotations precede the annotations they embed.

As part of the index building, you could compute the rank of each annotation in this ordering. Annotations that target annotations that are already canonically ordered can themselves be canonically ordered with respect to their targets. Without this, the user has to implement sorting in ad-hoc ways, as in the sketch below.
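To make this concrete, here is a minimal sketch of such a canonical ordering, assuming annotations with a single contiguous text target; canonKey and canonRanks are hypothetical helpers, not part of the STAM API, built only on the textselections() accessor that we also use further down.

def canonKey(anno):
    # the begin and end of the annotation's single, contiguous text target
    t = anno.textselections()[0]
    # earlier start first; on equal starts, the larger span first,
    # so that embedding annotations precede the annotations they embed
    return (t.begin(), -t.end())

def canonRanks(annos):
    # the rank of each annotation in the canonical ordering,
    # as it could be computed once during index building
    return {anno.id(): i for (i, anno) in enumerate(sorted(annos, key=canonKey))}

With such ranks available, finding the first and the last word of a phrase reduces to taking the minimum and maximum rank among its targets.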
Given the annotations for the first and the last word in a phrase, we have to find the annotations with key `nu` that target these words, and read off their values. That too is currently difficult! A way out is this.

As preparation, before looping through the phrases, make a dict that associates word annotation identifiers with `nu`-values:

* get all annotations with key `nu`;
* for each such annotation: read off its value and the identifier of its target word, and store the value under that identifier.

Then, for each phrase with at least two words:

* look up the `nu`-value for its first word;
* look up the `nu`-value for its last word;
* compare the two values.

This can be improved if the API offers an efficient function to look up values. That could be a precomputation of all those dictionaries. Even better: those dictionaries could be the primary data! A sketch of such a precomputed lookup follows below.
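A minimal sketch of that preparation step, using only stam-python calls that also occur in the code below; featureValues is a hypothetical helper, not part of the STAM API.

def featureValues(dataSet, keyName):
    # precompute, for one data key, a dict from target annotation id to value
    values = {}
    for anno in dataSet.key(keyName).annotations():
        value = anno.data()[0].value()
        target = list(anno.annotations())[0]
        values[target.id()] = value
    return values

# usage: build the dict once, after which every lookup is a plain dict access
# nuValue = featureValues(aDataSet, "nu")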
import stam
from memutil import memUsage
memUsage()
workDir = f"{A.tempDir}/stam"
storeC = stam.AnnotationStore(file=f"{workDir}/bhsa.store.stam.csv")
Current: 3.01 GB Delta: 3.01 GB
aDataSet = list(storeC.annotationsets())[0]

# all annotations of a given object type,
# i.e. the targets of the data item with key otype and the given value
def stamOtype(otype):
    otypeData = aDataSet.find_data("otype", otype)
    otypeAnnos = otypeData[0].annotations()
    return otypeAnnos

# the set of identifiers of a bunch of annotations
def idsOf(annos):
    return {a.id() for a in annos}
# get the word annotations, sorted, and the phrase annotations
def getPos(wordAnno):
    t = wordAnno.textselections()[0]
    return (t.begin(), t.end())

wordAnnos = stamOtype("word")
wordIds = idsOf(wordAnnos)
phraseAnnos = stamOtype("phrase")
wordAnnos = sorted(wordAnnos, key=getPos)

# make a rank of the word annos
wordRank = {anno.id(): i for (i, anno) in enumerate(wordAnnos)}
# get the phrase annotations together with their first and last word
phrases = []

for pAnno in phraseAnnos:
    words = list(pAnno.annotations())
    if len(words) < 2:
        continue
    sortedWords = sorted(words, key=lambda x: wordRank[x.id()])
    phrases.append((pAnno, sortedWords[0], sortedWords[-1]))
len(phrases)
78754
# intermediate check with TF
query = """
phrase
=: word
# word
:=
"""
results = A.search(query)
0.90s 78754 results
# get the `nu` information ready
# we collect a dict keyed by word id with values the grammatical number of the word
nuKey = aDataSet.key("nu")
nuAnnos = nuKey.annotations()
nuValue = {}

for nuAnno in nuAnnos:
    value = nuAnno.data()[0].value()
    word = list(nuAnno.annotations())[0]
    nuValue[word.id()] = value
# check some values
for wordAnno in wordAnnos[0:11]:
    print(f"{wordAnno} {nuValue[wordAnno.id()]}")
בְּ NA
רֵאשִׁ֖ית sg
בָּרָ֣א sg
אֱלֹהִ֑ים pl
אֵ֥ת NA
הַ NA
שָּׁמַ֖יִם pl
וְ NA
אֵ֥ת NA
הָ NA
אָֽרֶץ00 sg
So far so good!
# now compute the final result
resultsSTAM = [x for x in phrases if nuValue[x[1].id()] == nuValue[x[2].id()]]
len(resultsSTAM)
15522
# The complete task in one go
def getNicePhrases():
    aDataSet = list(storeC.annotationsets())[0]
    wordAnnos = sorted(stamOtype("word"), key=getPos)
    wordIds = idsOf(wordAnnos)
    wordRank = {anno.id(): i for (i, anno) in enumerate(wordAnnos)}
    phraseAnnos = stamOtype("phrase")

    phrases = []
    for pAnno in phraseAnnos:
        words = list(pAnno.annotations())
        if len(words) < 2:
            continue
        sortedWords = sorted(words, key=lambda x: wordRank[x.id()])
        phrases.append((pAnno, sortedWords[0], sortedWords[-1]))

    nuKey = aDataSet.key("nu")
    nuAnnos = nuKey.annotations()
    nuValue = {}
    for nuAnno in nuAnnos:
        value = nuAnno.data()[0].value()
        word = list(nuAnno.annotations())[0]
        nuValue[word.id()] = value

    results = [x for x in phrases if nuValue[x[1].id()] == nuValue[x[2].id()]]
    print(len(results))
    return results
resultsSTAM = getNicePhrases()
15522
The execution times for this task were:

TF query | TF hand coding | STAM
---|---|---
0.9 s | 0.2 s | 2.0 s
But in STAM we can move quite a bit of the work out of the task itself:

* the lookup of the `nu` values should be optimized, e.g. by precomputing the dictionary once (could save 0.9 sec); see the sketch below.
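As a thought experiment, here is a minimal sketch of that split, reusing stamOtype, getPos and aDataSet from above and the hypothetical featureValues helper sketched earlier; none of this is an existing STAM API. The preparation runs once, e.g. at index-building time, and the task proper only walks the phrases.

def prepare():
    # done once: order and rank the words, and precompute the nu values
    wordAnnos = sorted(stamOtype("word"), key=getPos)
    wordRank = {anno.id(): i for (i, anno) in enumerate(wordAnnos)}
    nuValue = featureValues(aDataSet, "nu")
    return (wordRank, nuValue)

def task(wordRank, nuValue):
    # the task proper: walk the phrases and compare first and last word
    results = []
    for pAnno in stamOtype("phrase"):
        words = list(pAnno.annotations())
        if len(words) < 2:
            continue
        sortedWords = sorted(words, key=lambda x: wordRank[x.id()])
        (fi, la) = (sortedWords[0], sortedWords[-1])
        if nuValue[fi.id()] == nuValue[la.id()]:
            results.append((pAnno, fi, la))
    return results

# usage:
# (wordRank, nuValue) = prepare()
# resultsSTAM = task(wordRank, nuValue)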