
You might want to consider the start of this tutorial.

Short introductions to other TF datasets are also available, among them the Q'uran.

Trees

The textual objects of the BHSA text are syntactic, but they are not syntax trees.

The BHSA is the result of a data-driven parsing strategy with occasional human decisions. It results in functional objects such as sentences, clauses, and phrases, which are built from chunks called sentence-atoms, clause-atoms, and phrase-atoms.

There is no deeper nesting of clauses within phrases, or even clauses within clauses or phrases within phrases. Instead, whenever objects are linguistically nested, there is an edge called mother between the objects in question.

For people who prefer to think in trees, we have unwrapped the mother relationship between clauses and made tree structures out of the data.
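
To give a feel for what unwrapping means, here is a toy sketch. It is not the actual generation process (that one has quirks and is documented separately); the clause names and the `mother` dictionary are hypothetical stand-ins for flat BHSA objects and their mother edges, and `unwrap` is a name invented for this illustration.

```python
from collections import defaultdict

# Hypothetical toy data: three flat clauses, plus mother edges that point
# from a dependent clause to the clause it depends on.
clauses = ["clause_a", "clause_b", "clause_c"]
mother = {"clause_b": "clause_a", "clause_c": "clause_b"}


def unwrap(clauses, mother):
    """Turn flat objects plus mother edges into nested lists."""
    # Invert the edges: for every mother, collect her daughters
    daughters = defaultdict(list)
    for daughter, mom in mother.items():
        daughters[mom].append(daughter)

    def build(clause):
        # A node is its own name followed by the subtrees of its daughters
        return [clause] + [build(d) for d in daughters[clause]]

    # Clauses without a mother are the roots of the trees
    roots = [c for c in clauses if c not in mother]
    return [build(root) for root in roots]


print(unwrap(clauses, mother))
```

The flat list stays as it is; the nesting comes entirely from following the edges.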

The whole process of generating the trees, including the quirks encountered along the way, is documented in the notebook trees.ipynb, where you see it done for version 2017. We have used an ordinary Python program to generate trees for all versions of the BHSA: alltrees.py

Those trees are available as a feature on sentence nodes, and you can load those features alongside the BHSA data.

Here we show some examples of what you can do with them.

In [1]:

%load_ext autoreload
%autoreload 2

Incantation

The ins and outs of installing Text-Fabric, getting the corpus, and initializing a notebook are explained in the start tutorial.

In [2]:

from utils import structure, layout
from tf.app import use

Note that we load the trees module.

We also load the morphology of Open Scriptures for example usage later on.

In [3]:

A = use("ETCBC/bhsa", mod="ETCBC/trees/tf,ETCBC/bridging/tf", hoist=globals())

Locating corpus resources ...

app: ~/text-fabric-data/github/ETCBC/bhsa/app

data: ~/text-fabric-data/github/ETCBC/bhsa/tf/2021

data: ~/text-fabric-data/github/ETCBC/trees/tf/2021

data: ~/text-fabric-data/github/ETCBC/bridging/tf/2021

data: ~/text-fabric-data/github/ETCBC/phono/tf/2021

data: ~/text-fabric-data/github/ETCBC/parallels/tf/2021

   |     0.89s T osm                  from ~/text-fabric-data/github/ETCBC/bridging/tf/2021
   |     0.12s T osm_sf               from ~/text-fabric-data/github/ETCBC/bridging/tf/2021
   |     0.21s T tree                 from ~/text-fabric-data/github/ETCBC/trees/tf/2021
   |     0.30s T treen                from ~/text-fabric-data/github/ETCBC/trees/tf/2021

Text-Fabric: Text-Fabric API 12.0.4, ETCBC/bhsa/app v3, Search Reference
Data: ETCBC - bhsa 2021, Character table, Feature docs

Node types

Name	# of nodes	# slots/node	% coverage
book	39	10938.21	100
chapter	929	459.19	100
lex	9230	46.22	100
verse	23213	18.38	100
half_verse	45179	9.44	100
sentence	63717	6.70	100
sentence_atom	64514	6.61	100
clause	88131	4.84	100
clause_atom	90704	4.70	100
phrase	253203	1.68	100
phrase_atom	267532	1.59	100
subphrase	113850	1.42	38
word	426590	1.00	100

Sets: no custom sets
Features:

Name	Type	Description

Parallel Passages

crossref	int	🆗 links between similar passages

BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis

book	str	✅ book name in Latin (Genesis; Numeri; Reges1; ...)
book@ll	str	✅ book name in amharic (ኣማርኛ)
chapter	int	✅ chapter number (1; 2; 3; ...)
code	int	✅ identifier of a clause atom relationship (0; 74; 367; ...)
det	str	✅ determinedness of phrase(atom) (det; und; NA.)
domain	str	✅ text type of clause (? (Unknown); N (narrative); D (discursive); Q (Quotation).)
freq_lex	int	✅ frequency of lexemes
function	str	✅ syntactic function of phrase (Cmpl; Objc; Pred; ...)
g_cons	str	✅ word consonantal-transliterated (B R>CJT BR> >LHJM ...)
g_cons_utf8	str	✅ word consonantal-Hebrew (ב ראשׁית ברא אלהים)
g_lex	str	✅ lexeme pointed-transliterated (B.:- R;>CIJT B.@R@> >:ELOH ...)
g_lex_utf8	str	✅ lexeme pointed-Hebrew (בְּ רֵאשִׁית בָּרָא אֱלֹה)
g_word	str	✅ word pointed-transliterated (B.:- R;>CI73JT B.@R@74> >:ELOHI92JM)
g_word_utf8	str	✅ word pointed-Hebrew (בְּ רֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים)
gloss	str	🆗 english translation of lexeme (beginning create god(s))
gn	str	✅ grammatical gender (m; f; NA; unknown.)
label	str	✅ (half-)verse label (half verses: A; B; C; verses: GEN 01,02)
language	str	✅ of word or lexeme (Hebrew; Aramaic.)
lex	str	✅ lexeme consonantal-transliterated (B R>CJT/ BR>[ >LHJM/)
lex_utf8	str	✅ lexeme consonantal-Hebrew (ב ראשׁית֜ ברא אלהים֜)
ls	str	✅ lexical set, subclassification of part-of-speech (card; ques; mult)
nametype	str	⚠️ named entity type (pers; mens; gens; topo; ppde.)
nme	str	✅ nominal ending consonantal-transliterated (absent; n/a; JM, ...)
nu	str	✅ grammatical number (sg; du; pl; NA; unknown.)
number	int	✅ sequence number of an object within its context
otype	str	
pargr	str	🆗 hierarchical paragraph number (1; 1.2; 1.2.3.4; ...)
pdp	str	✅ phrase dependent part-of-speech (art; verb; subs; nmpr, ...)
pfm	str	✅ preformative consonantal-transliterated (absent; n/a; J, ...)
prs	str	✅ pronominal suffix consonantal-transliterated (absent; n/a; W; ...)
prs_gn	str	✅ pronominal suffix gender (m; f; NA; unknown.)
prs_nu	str	✅ pronominal suffix number (sg; du; pl; NA; unknown.)
prs_ps	str	✅ pronominal suffix person (p1; p2; p3; NA; unknown.)
ps	str	✅ grammatical person (p1; p2; p3; NA; unknown.)
qere	str	✅ word pointed-transliterated masoretic reading correction
qere_trailer	str	✅ interword material -pointed-transliterated (Masoretic correction)
qere_trailer_utf8	str	✅ interword material -pointed-transliterated (Masoretic correction)
qere_utf8	str	✅ word pointed-Hebrew masoretic reading correction
rank_lex	int	✅ ranking of lexemes based on freqnuecy
rela	str	✅ linguistic relation between clause/(sub)phrase(atom) (ADJ; MOD; ATR; ...)
sp	str	✅ part-of-speech (art; verb; subs; nmpr, ...)
st	str	✅ state of a noun (a (absolute); c (construct); e (emphatic).)
tab	int	✅ clause atom: its level in the linguistic embedding
trailer	str	✅ interword material pointed-transliterated (& 00 05 00_P ...)
trailer_utf8	str	✅ interword material pointed-Hebrew (־ ׃)
txt	str	✅ text type of clause and surrounding (repetion of ? N D Q as in feature domain)
typ	str	✅ clause/phrase(atom) type (VP; NP; Ellp; Ptcp; WayX)
uvf	str	✅ univalent final consonant consonantal-transliterated (absent; N; J; ...)
vbe	str	✅ verbal ending consonantal-transliterated (n/a; W; ...)
vbs	str	✅ root formation consonantal-transliterated (absent; n/a; H; ...)
verse	int	✅ verse number
voc_lex	str	✅ vocalized lexeme pointed-transliterated (B.: R;>CIJT BR> >:ELOHIJM)
voc_lex_utf8	str	✅ vocalized lexeme pointed-Hebrew (בְּ רֵאשִׁית ברא אֱלֹהִים)
vs	str	✅ verbal stem (qal; piel; hif; apel; pael)
vt	str	✅ verbal tense (perf; impv; wayq; infc)
mother	none	✅ linguistic dependency between textual objects
oslots	none	

ETCBC/bridging/tf

osm	str	🆗 morphology tag (primary morpheme) by OpenScriptures (HR HVqp3ms)
osm_sf	str	🆗 morphology tag (secundary morpheme) by OpenScriptures

Phonetic Transcriptions

phono	str	🆗 phonological transcription (bᵊ rēšˌîṯ bārˈā ʔᵉlōhˈîm)
phono_trailer	str	🆗 interword material in phonological transcription

ETCBC/trees/tf

tree	str	🆗 sentence: penn treebank ((VP(vb 2)))
treen	str	🆗 sentence: penn treebank with node numbers included ((VP{651574}(vb 2)))

Settings:

specified

apiVersion: 3
appName: ETCBC/bhsa
appPath: /Users/me/text-fabric-data/github/ETCBC/bhsa/app
commit: gd905e3fb6e80d0fa537600337614adc2af157309
css: ''
dataDisplay:
- exampleSectionHtml:
  <code>Genesis 1:1</code> (use <a href="https://github.com/{org}/{repo}/blob/master/tf/{version}/book%40en.tf" target="_blank">English book names</a>)
- excludedFeatures:
  - g_uvf_utf8
  - g_vbs
  - kq_hybrid
  - languageISO
  - g_nme
  - lex0
  - is_root
  - g_vbs_utf8
  - g_uvf
  - dist
  - root
  - suffix_person
  - g_vbe
  - dist_unit
  - suffix_number
  - distributional_parent
  - kq_hybrid_utf8
  - crossrefSET
  - instruction
  - g_prs
  - lexeme_count
  - rank_occ
  - g_pfm_utf8
  - freq_occ
  - crossrefLCS
  - functional_parent
  - g_pfm
  - g_nme_utf8
  - g_vbe_utf8
  - kind
  - g_prs_utf8
  - suffix_gender
  - mother_object_type
- noneValues:
  - none
  - unknown
  - no value
  - NA
docs:
- docBase: {docRoot}/{repo}
- docExt: ''
- docPage: ''
- docRoot: https://{org}.github.io
- featurePage: 0_home
interfaceDefaults: {}
isCompatible: True
local: local
localDir: /Users/me/text-fabric-data/github/ETCBC/bhsa/_temp
provenanceSpec:
- corpus: BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis
- doi: 10.5281/zenodo.1007624
- moduleSpecs:
  - :
    backend: no value
    corpus: Phonetic Transcriptions
    docUrl:
    https://nbviewer.jupyter.org/github/etcbc/phono/blob/master/programs/phono.ipynb
    doi: 10.5281/zenodo.1007636
    org: ETCBC
    relative: /tf
    repo: phono
  - :
    backend: no value
    corpus: Parallel Passages
    docUrl:
    https://nbviewer.jupyter.org/github/ETCBC/parallels/blob/master/programs/parallels.ipynb
    doi: 10.5281/zenodo.1007642
    org: ETCBC
    relative: /tf
    repo: parallels
- org: ETCBC
- relative: /tf
- repo: bhsa
- version: 2021
- webBase: https://shebanq.ancient-data.org/hebrew
- webHint: Show this on SHEBANQ
- webLang: la
- webLexId: True
- webUrl:
  {webBase}/text?book=<1>&chapter=<2>&verse=<3>&version={version}&mr=m&qw=q&tp=txt_p&tr=hb&wget=v&qget=v&nget=vt
- webUrlLex: {webBase}/word?version={version}&id=<lid>
release: v1.8
typeDisplay:
- clause:
  - label: {typ} {rela}
  - style: ''
- clause_atom:
  - hidden: True
  - label: {code}
  - level: 1
  - style: ''
- half_verse:
  - hidden: True
  - label: {label}
  - style: ''
  - verselike: True
- lex:
  - featuresBare: gloss
  - label: {voc_lex_utf8}
  - lexOcc: word
  - style: orig
  - template: {voc_lex_utf8}
- phrase:
  - label: {typ} {function}
  - style: ''
- phrase_atom:
  - hidden: True
  - label: {typ} {rela}
  - level: 1
  - style: ''
- sentence:
  - label: {number}
  - style: ''
- sentence_atom:
  - hidden: True
  - label: {number}
  - level: 1
  - style: ''
- subphrase:
  - hidden: True
  - label: {number}
  - style: ''
- word:
  - features: pdp vs vt
  - featuresBare: lex:gloss
writing: hbo

Text-Fabric API: names N F E L T S C TF Fs Fall Es Eall Cs Call directly usable

We first inspect the nature of these features. Let's pick the first, middle, and last sentence of the Hebrew Bible.

In [4]:

sentences = F.otype.s("sentence")
examples = (sentences[0], sentences[len(sentences) // 2], sentences[-1])

We examine feature tree:

In [5]:

for s in examples:
    print(F.tree.v(s))

(S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))
(S(C(VP(vb 0))(NP(n 1))))
(S(C(CP(cj 0))(VP(vb 1))))

Now treen:

In [6]:

for s in examples:
    print(F.treen.v(s))

(S{1172308}(C{427559}(PP{651573}(pp 0)(n 1))(VP{651574}(vb 2))(NP{651575}(n 3))(PP{651576}(U{1300539}(pp 4)(dt 5)(n 6))(cj 7)(U{1300540}(pp 8)(dt 9)(n 10)))))
(S{1204166}(C{471249}(VP{782581}(vb 0))(NP{782582}(n 1))))
(S{1236024}(C{515689}(CP{904774}(cj 0))(VP{904775}(vb 1))))

The structure of the trees is the same, but treen has numbers between braces in the tags of the nodes. These numbers are the Text-Fabric nodes of the sentences, clauses and phrases that the nodes of the tree correspond to.
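
If all you need is the list of nodes mentioned in a treen string, a regular expression is enough. The sketch below is one way to do it with plain Python; `nodes_in_treen` is a name invented here, and the pattern assumes that tags consist of letters only, as in the examples above.

```python
import re

# A non-terminal looks like "(VP{782581}": bracket, tag, node number in braces.
# Terminals such as "(vb 0)" have no braces and are not matched.
TAGGED_NODE = re.compile(r"\(([A-Za-z]+)\{(\d+)\}")


def nodes_in_treen(treen_string):
    """Return (tag, node) pairs for all non-terminals, in order of appearance."""
    return [(tag, int(node)) for tag, node in TAGGED_NODE.findall(treen_string)]


# The treen string of the middle example sentence shown above
example = "(S{1204166}(C{471249}(VP{782581}(vb 0))(NP{782582}(n 1))))"
for tag, node in nodes_in_treen(example):
    print(tag, node)
```

In a session with the corpus loaded, each of these numbers could be handed to a feature lookup such as F.typ.v.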

Using trees

These strings are not very pleasant to the eye. For one thing, we see numbers instead of words. They also seem a bit unwieldy to integrate with the usual Text-Fabric business. But nothing is farther from the truth.

We show how to

produce a multiline view
see the words (in several representations)
add a gloss
add morphological data from another project (Open Scriptures)

Honesty compels us to note that we make use of a bunch of auxiliary functions in an accompanying utils package:

In [7]:

passage = ("Job", 3, 16)
passageStr = "{} {}:{}".format(*passage)
verse = T.nodeFromSection(passage)
sentence = L.d(verse, otype="sentence")[0]
firstSlot = L.d(sentence, otype="word")[0]
stringTree = F.tree.v(sentence)
print(f"{passageStr} - first word = {firstSlot}\n\ntree =\n{stringTree}")

Job 3:16 - first word = 336990

tree =
(S(C(Ccoor(CP(cj 0))(PP(pp 1)(U(n 2))(U(vb 3)))(NegP(ng 4))(VP(vb 5)))(Ccoor(PP(pp 6)(n 7)(Cattr(NegP(ng 8))(VP(vb 9))(NP(n 10)))))))

Parsing

The key to manipulating tree strings effectively is to parse them into tree structures: lists of lists.

Here we use the generic utility structure():

In [8]:

tree = structure(stringTree)
tree

Out[8]:

['S',
 ['C',
  ['Ccoor',
   ['CP', [('cj', 0)]],
   ['PP', [('pp', 1)], ['U', [('n', 2)]], ['U', [('vb', 3)]]],
   ['NegP', [('ng', 4)]],
   ['VP', [('vb', 5)]]],
  ['Ccoor',
   ['PP',
    [('pp', 6)],
    [('n', 7)],
    ['Cattr',
     ['NegP', [('ng', 8)]],
     ['VP', [('vb', 9)]],
     ['NP', [('n', 10)]]]]]]]
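
The structure() utility lives in the accompanying utils package and is not shown here. To give a rough idea of what such a parser involves, here is a minimal sketch. `parse_tree` is a hypothetical name, and its output is a simplification: terminals become plain (tag, number) tuples, which is not exactly the format that structure() returns above.

```python
import re

# Tokens are "(", ")", or a run of characters without brackets or whitespace
TOKEN = re.compile(r"[()]|[^()\s]+")


def parse_tree(tree_string):
    """Parse a bracketed tree string into nested lists.

    Non-terminals become [tag, child, child, ...];
    terminals such as (vb 2) become (tag, number) tuples.
    """
    tokens = TOKEN.findall(tree_string)
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1  # skip the opening bracket
        tag = tokens[pos]
        pos += 1
        if tokens[pos] not in ("(", ")"):
            # Terminal: the tag is followed by a word position
            number = int(tokens[pos])
            pos += 2  # skip the number and the closing bracket
            return (tag, number)
        # Non-terminal: collect children until the closing bracket
        children = []
        while tokens[pos] == "(":
            children.append(parse_node())
        pos += 1  # skip the closing bracket
        return [tag] + children

    return parse_node()


print(parse_tree("(S(C(VP(vb 0))(NP(n 1))))"))
```

The recursion mirrors the bracket nesting: every opening bracket starts a new node, and every closing bracket returns to its parent.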

Apply layout

Having the real tree structure in hand, we can lay it out in all kinds of ways. We use the generic utility layout() to display it in a friendlier way and to replace the numbers with real Text-Fabric slot numbers:

In [9]:

print(layout(tree, firstSlot, str))

That opens up the way to get the words in. The third argument of layout() above is str, a function that is applied to the slot numbers. It returns those numbers as strings, and this is what ends up in the layout.
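
To make the role of that third argument concrete, here is a small sketch of a layout function. It is not the layout() from the utils package; `simple_layout` is a hypothetical stand-in that assumes the nested-list format shown in Out[8] and that the numbers in the tree are word positions relative to the first slot of the sentence.

```python
def simple_layout(tree, first_slot, render, depth=1):
    """Return an indented, multiline view of a parsed tree.

    Assumes the format of Out[8]: a non-terminal is [tag, child, ...]
    and a terminal is [(tag, relative_position)].
    """
    indent = "  " * depth
    head = tree[0]
    if isinstance(head, tuple):
        # Terminal: turn the relative position into an absolute slot number
        # and let the caller decide how that slot is shown
        tag, position = head
        return f"{indent}{tag} {render(first_slot + position)}"
    # Non-terminal: its tag, followed by its children one level deeper
    lines = [f"{indent}{head}"]
    for child in tree[1:]:
        lines.append(simple_layout(child, first_slot, render, depth + 1))
    return "\n".join(lines)


# A cut-down fragment in the format of Out[8]; 336990 is the first slot of Job 3:16
small = ["S", ["C", ["CP", [("cj", 0)]], ["VP", [("vb", 5)]]]]
print(simple_layout(small, 336990, str))
```

Because the rendering is delegated to the function that is passed in, the same layout code can show slot numbers, words, glosses, or anything else that can be looked up from a slot.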

Filling in the words

We can pass any function, so why not the function that looks up the word?

Remember that F.g_word_utf8.v is a function that returns the full Hebrew word given a slot node.

In [10]:

print(layout(tree, firstSlot, F.g_word_utf8.v))

  S
    C
      Ccoor
        CP
          cj אֹ֚ו
        PP
          pp כְ
          U
            n נֵ֣פֶל
          U
            vb טָ֭מוּן
        NegP
          ng לֹ֣א
        VP
          vb אֶהְיֶ֑ה
      Ccoor
        PP
          pp כְּ֝
          n עֹלְלִ֗ים
          Cattr
            NegP
              ng לֹא
            VP
              vb רָ֥אוּ
            NP
              n אֹֽור

Add a gloss

In [11]:

def gloss(n):
    lexNode = L.u(n, otype="lex")[0]
    return f'{F.g_word_utf8.v(n)} "{F.gloss.v(lexNode)}"'


print(layout(tree, firstSlot, gloss))

  S
    C
      Ccoor
        CP
          cj אֹ֚ו "or"
        PP
          pp כְ "as"
          U
            n נֵ֣פֶל "miscarriage"
          U
            vb טָ֭מוּן "hide"
        NegP
          ng לֹ֣א "not"
        VP
          vb אֶהְיֶ֑ה "be"
      Ccoor
        PP
          pp כְּ֝ "as"
          n עֹלְלִ֗ים "child"
          Cattr
            NegP
              ng לֹא "not"
            VP
              vb רָ֥אוּ "see"
            NP
              n אֹֽור "light"

Morphology

In 2018 I compared the morphology of Open Scriptures with that of the BHSA. See bridging.

As a by-product I saved their morphology as a Text-Fabric feature on words. So we can add it to our trees.

We also show the nesting depth in the resulting tree.

In [12]:

def osmPhonoGloss(n):
    lexNode = L.u(n, otype="lex")[0]
    return (
        f'({F.osm.v(n)}) {F.g_word_utf8.v(n)} [{F.phono.v(n)}] "{F.gloss.v(lexNode)}"'
    )


print(layout(tree, firstSlot, osmPhonoGloss, withLevel=True))

 1  S
 2    C
 3      Ccoor
 4        CP
 5          cj (HC) אֹ֚ו [ˈʔô] "or"
 4        PP
 5          pp (HR) כְ [ḵᵊ] "as"
 5          U
 6            n (HNcmsa) נֵ֣פֶל [nˈēfel] "miscarriage"
 5          U
 6            vb (HVqsmsa) טָ֭מוּן [ˈṭāmûn] "hide"
 4        NegP
 5          ng (HTn) לֹ֣א [lˈō] "not"
 4        VP
 5          vb (HVqi1cs) אֶהְיֶ֑ה [ʔehyˈeh] "be"
 3      Ccoor
 4        PP
 5          pp (HR) כְּ֝ [ˈkᵊ] "as"
 5          n (HNcmpa) עֹלְלִ֗ים [ʕōlᵊlˈîm] "child"
 5          Cattr
 6            NegP
 7              ng (HTn) לֹא [lō-] "not"
 6            VP
 7              vb (HVqp3cp) רָ֥אוּ [rˌāʔû] "see"
 6            NP
 7              n (HNcbsa) אֹֽור [ʔˈôr] "light"

Taking it further

We saw how the fact that we have slot numbers in our tree structures opens up all kinds of possibilities for further processing.

However, so far, we have only made use of slot nodes.

What if we want to draw in side information for the non-terminal nodes?

That is where the feature treen comes in. It has node information for all non-terminals between braces, so it is fairly easy to write new structure() and layout() functions that exploit them.
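
As an illustration of what such functions might look like, here is a sketch under stated assumptions. `parse_treen` and `layout_with_nodes` are hypothetical names, not part of the utils package, and the first slot used in the example is a placeholder. In a real session you would pass feature lookups such as F.g_word_utf8.v for words and F.typ.v for non-terminal nodes.

```python
import re

TOKEN = re.compile(r"[()]|[^()\s]+")
TAG_WITH_NODE = re.compile(r"([^{]+)\{(\d+)\}")


def parse_treen(treen_string):
    """Parse a treen string into nested lists.

    Non-terminals become [(tag, node), child, ...];
    terminals become (tag, relative_position) tuples.
    """
    tokens = TOKEN.findall(treen_string)
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1  # skip the opening bracket
        label = tokens[pos]
        pos += 1
        if tokens[pos] not in ("(", ")"):
            position = int(tokens[pos])
            pos += 2  # skip the number and the closing bracket
            return (label, position)
        # Assumes every non-terminal label has the form tag{node}
        match = TAG_WITH_NODE.fullmatch(label)
        tag, node = match.group(1), int(match.group(2))
        children = []
        while tokens[pos] == "(":
            children.append(parse_node())
        pos += 1  # skip the closing bracket
        return [(tag, node)] + children

    return parse_node()


def layout_with_nodes(tree, first_slot, render_word, render_node, depth=1):
    """Indented view that renders words and non-terminal nodes separately."""
    indent = "  " * depth
    if isinstance(tree, tuple):
        tag, position = tree
        return f"{indent}{tag} {render_word(first_slot + position)}"
    tag, node = tree[0]
    lines = [f"{indent}{tag} {render_node(node)}"]
    lines.extend(
        layout_with_nodes(child, first_slot, render_word, render_node, depth + 1)
        for child in tree[1:]
    )
    return "\n".join(lines)


treen = "(S{1204166}(C{471249}(VP{782581}(vb 0))(NP{782582}(n 1))))"
# first_slot=0 is a placeholder; str stands in for real feature lookups
print(layout_with_nodes(parse_treen(treen), 0, str, str))
```

Keeping the node number next to the tag means that every line of the layout, not just the words, can be enriched with Text-Fabric features.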

All steps

start: your first step in mastering the Bible computationally
display: become an expert in creating pretty displays of your text structures
search: turbocharge your hand-coding with search templates
exportExcel: make tailor-made spreadsheets out of your results
share: draw in other people's data and let them use yours
export: export your dataset as an Emdros database
annotate: annotate plain text by means of other tools and import the annotations as TF features
map: map somebody else's annotations to a new version of the corpus
volumes: work with selected books only
trees: work with the BHSA data as syntax trees

CC-BY Dirk Roorda