You might want to consider the start of this tutorial.
The textual objects of the BHSA text are syntactic, but they are not syntax trees.
The BHSA is the result of a data-driven parsing strategy with occasional human decisions. It results in functional objects such as sentences, clauses, and phrases, which are built from chunks called sentence-atoms, clause-atoms, and phrase-atoms.
There is no deeper nesting of clauses within phrases, or even clauses within clauses or phrases within phrases.
Instead, whenever objects are linguistically nested, there is an edge called mother between the objects in question.
For people who prefer to think in trees, we have unwrapped the mother relationship between clauses and made tree structures out of the data.
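To make the idea of unwrapping concrete, here is a toy sketch. It is not the algorithm of alltrees.py, and the clause names and mother links below are made up for illustration. It only shows how a flat set of objects plus mother pointers can be turned into nested lists.

```python
from collections import defaultdict

# Hypothetical toy data: each clause id maps to the id of its mother
# (None means the clause has no mother and becomes a root).
mother = {"c1": None, "c2": "c1", "c3": "c1", "c4": "c3"}


def build_forest(mother):
    """Turn flat mother pointers into nested lists [node, child, child, ...]."""
    children = defaultdict(list)
    roots = []
    for node, parent in mother.items():
        if parent is None:
            roots.append(node)
        else:
            children[parent].append(node)

    def nest(node):
        # A node followed by the nested lists of all its daughters.
        return [node, *(nest(child) for child in children[node])]

    return [nest(root) for root in roots]


forest = build_forest(mother)
print(forest)
```

The real data has quirks that this toy ignores; those are dealt with in the notebook mentioned below.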
The whole generation process of trees, including the quirks encountered along the way, is documented in the notebook trees.ipynb, where it is carried out for version 2017. We have used an ordinary Python program to generate trees for all versions of the BHSA: alltrees.py.
Those trees are available as features on sentence nodes, and you can load those features alongside the BHSA data.
Here we show some examples of what you can do with it.
%load_ext autoreload
%autoreload 2
The ins and outs of installing Text-Fabric, getting the corpus, and initializing a notebook are explained in the start tutorial.
from utils import structure, layout
from tf.app import use
Note that we load the trees module.
We also load the morphology of Open Scriptures for example usage later on.
A = use("ETCBC/bhsa", mod="ETCBC/trees/tf,ETCBC/bridging/tf", hoist=globals())
Locating corpus resources ...
| 0.89s T osm from ~/text-fabric-data/github/ETCBC/bridging/tf/2021
| 0.12s T osm_sf from ~/text-fabric-data/github/ETCBC/bridging/tf/2021
| 0.21s T tree from ~/text-fabric-data/github/ETCBC/trees/tf/2021
| 0.30s T treen from ~/text-fabric-data/github/ETCBC/trees/tf/2021
Name | # of nodes | # slots/node | % coverage |
---|---|---|---|
book | 39 | 10938.21 | 100 |
chapter | 929 | 459.19 | 100 |
lex | 9230 | 46.22 | 100 |
verse | 23213 | 18.38 | 100 |
half_verse | 45179 | 9.44 | 100 |
sentence | 63717 | 6.70 | 100 |
sentence_atom | 64514 | 6.61 | 100 |
clause | 88131 | 4.84 | 100 |
clause_atom | 90704 | 4.70 | 100 |
phrase | 253203 | 1.68 | 100 |
phrase_atom | 267532 | 1.59 | 100 |
subphrase | 113850 | 1.42 | 38 |
word | 426590 | 1.00 | 100 |
We first inspect the nature of these features. Let's pick the first, middle, and last sentence of the Hebrew Bible.
sentences = F.otype.s("sentence")
examples = (sentences[0], sentences[len(sentences) // 2], sentences[-1])
We examine feature tree:
for s in examples:
    print(F.tree.v(s))
(S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))
(S(C(VP(vb 0))(NP(n 1))))
(S(C(CP(cj 0))(VP(vb 1))))
Now treen:
for s in examples:
    print(F.treen.v(s))
(S{1172308}(C{427559}(PP{651573}(pp 0)(n 1))(VP{651574}(vb 2))(NP{651575}(n 3))(PP{651576}(U{1300539}(pp 4)(dt 5)(n 6))(cj 7)(U{1300540}(pp 8)(dt 9)(n 10)))))
(S{1204166}(C{471249}(VP{782581}(vb 0))(NP{782582}(n 1))))
(S{1236024}(C{515689}(CP{904774}(cj 0))(VP{904775}(vb 1))))
The structure of the trees is the same, but treen has numbers between braces in the tags of the nodes.
These numbers are the Text-Fabric nodes of the sentences, clauses and phrases that the nodes of the tree correspond to.
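As a quick illustration of that relationship, the following sketch pulls the node numbers out of a treen string with a regular expression and then strips them again. It works on the string only, so it runs without Text-Fabric; the pattern assumes that a non-terminal tag always has the form label{number}, as in the examples shown here.

```python
import re

# The middle example sentence, copied from the treen output above.
treen = "(S{1204166}(C{471249}(VP{782581}(vb 0))(NP{782582}(n 1))))"

# Assumption: a non-terminal tag is a label directly followed by
# a node number between braces.
TAG = re.compile(r"\((\w+)\{(\d+)\}")

# Pair each non-terminal label with its Text-Fabric node number.
pairs = [(label, int(node)) for label, node in TAG.findall(treen)]
print(pairs)

# Dropping the braces and their contents leaves the plain tree string.
plain = re.sub(r"\{\d+\}", "", treen)
print(plain)
```

The stripped string is identical to the value of the tree feature for the same sentence.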
These strings are not very pleasant to the eye. For one thing, we see numbers instead of words. They also seem a bit unwieldy to integrate with the usual Text-Fabric business. But nothing is farther from the truth.
We show how to parse these strings into tree structures, lay them out more readably, and replace the numbers by words and other information.
Honesty compels us to note that we make use of a couple of auxiliary functions, structure() and layout(), from an accompanying utils package.
passage = ("Job", 3, 16)
passageStr = "{} {}:{}".format(*passage)
verse = T.nodeFromSection(passage)
sentence = L.d(verse, otype="sentence")[0]
firstSlot = L.d(sentence, otype="word")[0]
stringTree = F.tree.v(sentence)
print(f"{passageStr} - first word = {firstSlot}\n\ntree =\n{stringTree}")
Job 3:16 - first word = 336990

tree =
(S(C(Ccoor(CP(cj 0))(PP(pp 1)(U(n 2))(U(vb 3)))(NegP(ng 4))(VP(vb 5)))(Ccoor(PP(pp 6)(n 7)(Cattr(NegP(ng 8))(VP(vb 9))(NP(n 10)))))))
Key to effective manipulation of tree strings is to parse them into tree structures: lists of lists.
Here we use the generic utility structure():
tree = structure(stringTree)
tree
['S', ['C', ['Ccoor', ['CP', [('cj', 0)]], ['PP', [('pp', 1)], ['U', [('n', 2)]], ['U', [('vb', 3)]]], ['NegP', [('ng', 4)]], ['VP', [('vb', 5)]]], ['Ccoor', ['PP', [('pp', 6)], [('n', 7)], ['Cattr', ['NegP', [('ng', 8)]], ['VP', [('vb', 9)]], ['NP', [('n', 10)]]]]]]]
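A parser of this kind can be quite small. The following is a minimal sketch, not the actual structure() from the utils package; the name parse_tree and its tokenizing pattern are assumptions made for illustration. It produces the same nested-list shape as the output above: a non-terminal becomes [tag, child, ...] and a terminal becomes [(pos, number)].

```python
import re

# Tokens are parentheses or runs of characters without parentheses or spaces.
TOKEN = re.compile(r"\(|\)|[^()\s]+")


def parse_tree(string):
    """Parse a bracketed tree string into nested lists (hypothetical helper)."""
    tokens = TOKEN.findall(string)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        tag = tokens[pos + 1]
        pos += 2
        if tokens[pos] != "(":
            # Terminal: a part-of-speech tag followed by a relative slot number.
            number = int(tokens[pos])
            pos += 2  # skip the number and the closing ')'
            return [(tag, number)]
        # Non-terminal: a tag followed by one or more child nodes.
        children = []
        while tokens[pos] == "(":
            children.append(node())
        pos += 1  # skip the closing ')'
        return [tag, *children]

    return node()


# The last example sentence of the Hebrew Bible, as shown earlier.
parsed = parse_tree("(S(C(CP(cj 0))(VP(vb 1))))")
print(parsed)
```

Because tags are taken as they come, the same sketch also accepts strings from treen; the braces then simply stay part of the tag.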
Having the real tree structure in hand, we can lay it out in all kinds of ways.
We use the generic utility layout() to display it in a friendlier way and to replace the numbers by real Text-Fabric slot numbers:
print(layout(tree, firstSlot, str))
S
  C
    Ccoor
      CP
        cj 336990
      PP
        pp 336991
        U
          n 336992
        U
          vb 336993
      NegP
        ng 336994
      VP
        vb 336995
    Ccoor
      PP
        pp 336996
        n 336997
        Cattr
          NegP
            ng 336998
          VP
            vb 336999
          NP
            n 337000
That opens up the way to get the words in.
The third argument of layout() above is str, which is a function that is applied to the slot numbers.
It returns those numbers as strings, and this is what ends up in the layout.
We can pass any function, so why not the function that looks up the word?
Remember that F.g_word_utf8.v is a function that returns the full Hebrew word given a slot node.
print(layout(tree, firstSlot, F.g_word_utf8.v))
S
  C
    Ccoor
      CP
        cj אֹ֚ו
      PP
        pp כְ
        U
          n נֵ֣פֶל
        U
          vb טָ֭מוּן
      NegP
        ng לֹ֣א
      VP
        vb אֶהְיֶ֑ה
    Ccoor
      PP
        pp כְּ֝
        n עֹלְלִ֗ים
        Cattr
          NegP
            ng לֹא
          VP
            vb רָ֥אוּ
          NP
            n אֹֽור
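The mechanism of passing a function to the layout step can be sketched in a few lines. This is not the layout() of the utils package: the name render, the two-space indent, and the small hand-written tree are assumptions for illustration. The point is only that every terminal is shown as the function's result for the absolute slot number, which is the first slot plus the relative number in the tree.

```python
def render(tree, first_slot, label=str, depth=0):
    """Return indented lines for a nested-list tree (hypothetical helper)."""
    head = tree[0]
    indent = "  " * depth
    if isinstance(head, tuple):
        # Terminal: (pos, relative slot); convert to an absolute slot first.
        pos, offset = head
        return [f"{indent}{pos} {label(first_slot + offset)}"]
    # Non-terminal: its own tag, then all children one level deeper.
    lines = [f"{indent}{head}"]
    for child in tree[1:]:
        lines.extend(render(child, first_slot, label, depth + 1))
    return lines


# A small tree in the nested-list shape shown earlier.
small = ["S", ["C", ["CP", [("cj", 0)]], ["VP", [("vb", 1)]]]]

# With str we get slot numbers; any other function of a slot works as well.
plain_lines = render(small, 100, str)
fancy_lines = render(small, 100, lambda slot: f"<{slot}>")
print("\n".join(plain_lines))
```

In the notebook, the lambda would be replaced by a feature lookup such as F.g_word_utf8.v.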
def gloss(n):
    lexNode = L.u(n, otype="lex")[0]
    return f'{F.g_word_utf8.v(n)} "{F.gloss.v(lexNode)}"'
print(layout(tree, firstSlot, gloss))
S
  C
    Ccoor
      CP
        cj אֹ֚ו "or"
      PP
        pp כְ "as"
        U
          n נֵ֣פֶל "miscarriage"
        U
          vb טָ֭מוּן "hide"
      NegP
        ng לֹ֣א "not"
      VP
        vb אֶהְיֶ֑ה "be"
    Ccoor
      PP
        pp כְּ֝ "as"
        n עֹלְלִ֗ים "child"
        Cattr
          NegP
            ng לֹא "not"
          VP
            vb רָ֥אוּ "see"
          NP
            n אֹֽור "light"
def osmPhonoGloss(n):
    lexNode = L.u(n, otype="lex")[0]
    return (
        f'({F.osm.v(n)}) {F.g_word_utf8.v(n)} [{F.phono.v(n)}] "{F.gloss.v(lexNode)}"'
    )
print(layout(tree, firstSlot, osmPhonoGloss, withLevel=True))
1 S
2   C
3     Ccoor
4       CP
5         cj (HC) אֹ֚ו [ˈʔô] "or"
4       PP
5         pp (HR) כְ [ḵᵊ] "as"
5         U
6           n (HNcmsa) נֵ֣פֶל [nˈēfel] "miscarriage"
5         U
6           vb (HVqsmsa) טָ֭מוּן [ˈṭāmûn] "hide"
4       NegP
5         ng (HTn) לֹ֣א [lˈō] "not"
4       VP
5         vb (HVqi1cs) אֶהְיֶ֑ה [ʔehyˈeh] "be"
3     Ccoor
4       PP
5         pp (HR) כְּ֝ [ˈkᵊ] "as"
5         n (HNcmpa) עֹלְלִ֗ים [ʕōlᵊlˈîm] "child"
5         Cattr
6           NegP
7             ng (HTn) לֹא [lō-] "not"
6           VP
7             vb (HVqp3cp) רָ֥אוּ [rˌāʔû] "see"
6           NP
7             n (HNcbsa) אֹֽור [ʔˈôr] "light"
We saw how having slot numbers in our tree structures opens up all kinds of possibilities for further processing.
However, so far, we have only made use of slot nodes.
What if we want to draw in additional information for the non-terminal nodes?
That is where the feature treen comes in.
It has node information for all non-terminals between braces, so it is fairly easy to write new structure() and layout() functions that exploit them.
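As a starting point, here is one possible sketch of such a function. It assumes that the tree has already been parsed into nested lists in which the non-terminal tags still carry their braces, for example 'VP{782581}'; the name nonterminals is hypothetical. It walks the tree and yields the depth, the label, and the Text-Fabric node of every non-terminal.

```python
import re

# Assumption: a non-terminal tag looks like label{node}, e.g. 'VP{782581}'.
TAG = re.compile(r"^(\w+)\{(\d+)\}$")


def nonterminals(tree, depth=0):
    """Yield (depth, label, node) for each non-terminal (hypothetical helper)."""
    head = tree[0]
    if isinstance(head, tuple):
        # Terminals carry slot numbers, not node numbers; skip them here.
        return
    match = TAG.match(head)
    if match:
        yield depth, match.group(1), int(match.group(2))
    else:
        # A tag without braces has no node information.
        yield depth, head, None
    for child in tree[1:]:
        yield from nonterminals(child, depth + 1)


# The middle example sentence, written out by hand as nested lists.
parsed = [
    "S{1204166}",
    ["C{471249}", ["VP{782581}", [("vb", 0)]], ["NP{782582}", [("n", 1)]]],
]

found = list(nonterminals(parsed))
print(found)
```

With the node numbers in hand, any Text-Fabric feature defined for sentences, clauses, or phrases can be looked up and shown next to the label, in the same way that the word features were used for the terminals.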
CC-BY Dirk Roorda