You might want to consider the start of this tutorial.
This tutorial is a companion to the Text-Fabric documentation on data sharing.
The ETCBC has a few other repositories with data that work in conjunction with the BHSA data. One of them you have already seen: phono, for phonetic transcriptions. There is also parallels for detecting parallel passages, and valence for studying patterns around verbs that determine their meanings.
If you study the additional data, you can observe how that data is created and also how it is turned into a text-fabric data module. The last step is incredibly easy. Every Python dictionary whose keys are numbers and whose values are strings or numbers can be written out as a Text-Fabric feature. When you are creating data, you have already constructed those dictionaries, so writing them out is just one method call. See for example how the flowchart notebook in valence writes out verb sense data.
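As a minimal sketch of that last step, assume you have built a dictionary from node numbers to strings. The node numbers, values, metadata, and output location below are invented for illustration. The call to `save` on a `tf.fabric.Fabric` object is how Text-Fabric writes features, but check the documentation of your version for the exact options.

```python
def check_feature(data):
    """Return True if data has the shape of a node feature:
    integer keys (nodes) and string or integer values."""
    return all(
        isinstance(node, int) and isinstance(value, (str, int))
        for node, value in data.items()
    )


def save_features(location, module, node_features, meta_data):
    """Write the dictionaries out as TF feature files (needs text-fabric installed)."""
    from tf.fabric import Fabric  # imported here so the rest runs without text-fabric

    TF = Fabric(locations=location, modules=module)
    return TF.save(nodeFeatures=node_features, metaData=meta_data)


# Hypothetical data: node numbers and values are invented for this sketch.
sense = {1001: "d-", 1002: "--", 1003: "k."}
meta = {"sense": {"valueType": "str", "description": "verb sense label (example)"}}

assert check_feature(sense)
# save_features("~/my-data/tf", "c", {"sense": sense}, meta)  # uncomment to write files
```

The check is only a convenience: it catches dictionaries with the wrong shape before you try to write them.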
You can then easily share your new features on GitHub, so that your colleagues everywhere can try it out for themselves.
Here is how you draw in other data.
You can add such data on the fly by passing a mod={org}/{repo}/{path} parameter, or a bunch of them separated by commas, or packed in a list or tuple.
If the data is there, it will be auto-downloaded and stored on your machine.
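To make the accepted forms concrete, here is a small sketch. The helper `normalize_mod` is hypothetical and not part of Text-Fabric; it only shows that a comma-separated string and a tuple of strings describe the same list of modules.

```python
def normalize_mod(mod):
    """Turn a comma-separated string, a list, or a tuple into a list of specifiers."""
    if isinstance(mod, str):
        parts = mod.split(",")
    else:
        parts = list(mod)
    # Drop surrounding whitespace and empty entries (e.g. from a trailing comma).
    return [part.strip() for part in parts if part.strip()]


as_string = normalize_mod("etcbc/lingo/heads/tf,etcbc/valence/tf")
as_tuple = normalize_mod(("etcbc/lingo/heads/tf", "etcbc/valence/tf"))
print(as_string == as_tuple)
```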
Let's do it.
%load_ext autoreload
%autoreload 2
The ins and outs of installing Text-Fabric, getting the corpus, and initializing a notebook are explained in the start tutorial.
from tf.app import use
First we are going to include the work of Cody Kingham on heads of phrases and some earlier work by Janet Dyk and Dirk Roorda on verbal valence.
A = use('ETCBC/bhsa', mod="etcbc/lingo/heads/tf,etcbc/valence/tf", hoist=globals())
Locating corpus resources ...
rate limit is 5000 requests per hour, with 5000 left for this hour
connecting to online GitHub repo ETCBC/bhsa ... connected
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 39 | 10938.21 | 100 |
chapter | 929 | 459.19 | 100 |
lex | 9230 | 46.22 | 100 |
verse | 23213 | 18.38 | 100 |
half_verse | 45179 | 9.44 | 100 |
sentence | 63717 | 6.70 | 100 |
sentence_atom | 64514 | 6.61 | 100 |
clause | 88131 | 4.84 | 100 |
clause_atom | 90704 | 4.70 | 100 |
phrase | 253203 | 1.68 | 100 |
phrase_atom | 267532 | 1.59 | 100 |
subphrase | 113850 | 1.42 | 38 |
word | 426590 | 1.00 | 100 |
You see that the features from the etcbc/valence/tf and etcbc/lingo/heads/tf modules have been added to the mix.
Click the triangle before etcbc/valence/tf to see what features have been contributed.
Note that edge features are in *bold italic*.
Let's find out more about sense.
You can start by clicking the triangle after "sense str" above. It tells you where the feature comes from, and it shows you the context where it has been constructed. You might go there to see additional documentation.
But we can also dive directly into its data:
F.sense.freqList()
(('--', 17941), ('d-', 9975), ('-p', 6537), ('-i', 3604), ('-c', 3231), ('dp', 1899), ('dc', 1002), ('di', 918), ('l.', 876), ('i.', 630), ('n.', 532), ('-b', 64), ('db', 61), ('c.', 57), ('k.', 54))
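To see what such a frequency list amounts to, here is a sketch on toy data. It imitates the shape of the output above (value and count pairs, most frequent first) with plain Python; the node numbers are invented, and this is not Text-Fabric's own implementation.

```python
from collections import Counter


def freq_list(feature_data):
    """Count how often each value occurs, most frequent first, ties by value."""
    counts = Counter(feature_data.values())
    return tuple(sorted(counts.items(), key=lambda item: (-item[1], item[0])))


# Toy feature: node number -> sense label (invented for this sketch).
sense = {1: "--", 2: "d-", 3: "--", 4: "k.", 5: "--", 6: "d-"}
print(freq_list(sense))
```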
Which nodes have a sense feature?
{F.otype.v(n) for n in N.walk() if F.sense.v(n)}
{'word'}
results = A.search(
"""
word sense
"""
)
0.16s 47381 results
Let's show some of the rarer sense values:
results = A.search(
"""
word sense=k.
"""
)
0.16s 54 results
A.table(results, end=5)
n | p | word |
---|---|---|
1 | Genesis 4:17 | יִּקְרָא֙ |
2 | Genesis 13:16 | שַׂמְתִּ֥י |
3 | Genesis 32:13 | שַׂמְתִּ֤י |
4 | Genesis 34:31 | יַעֲשֶׂ֖ה |
5 | Genesis 48:20 | יְשִֽׂמְךָ֣ |
If we do a pretty display, the sense feature shows up.
A.show(results, start=1, end=1, withNodes=True)
result 1
If you click the triangle before etcbc/lingo/heads/tf you see what features it contributes. Unfortunately, the authors have not provided a description of this feature, but if you click on the triangle after heads none, you see where the feature comes from and who has made it.
Moreover, the fact that heads is in italics makes clear that it is an edge feature.
Since heads is an edge feature, we cannot directly make it visible in pretty displays, but we can use it in queries.
We also want to make the feature sense visible, so we mention the feature in the query, without restricting the results.
Let's use it in a query:
results = A.search(
    """
book book=Genesis
  chapter chapter=1
    clause
      phrase
        -heads> word sense*
"""
)
0.42s 402 results
A.show(results, start=1, end=2)
result 1
result 2
Note how the words that are *heads* of their phrases are highlighted within their phrases.
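The difference between a node feature such as sense and an edge feature such as heads can be sketched with toy data. Everything below is invented for illustration: a node feature maps a node to a value, an edge feature maps a node to a set of other nodes. In the real API the targets of an edge are reached through a call like E.heads.f(node); this sketch does not use Text-Fabric at all.

```python
# Toy data: these node numbers are not real BHSA nodes.
sense = {2: "d-"}            # node feature: node -> value
heads = {10: {2}, 11: {3}}   # edge feature: node -> set of target nodes


def head_words_with_sense(phrase):
    """Follow the heads edges from a phrase and report the sense of each head word.

    Words without a sense are kept (with None), like sense* in the query:
    the feature is shown but does not restrict the results.
    """
    return [(word, sense.get(word)) for word in sorted(heads.get(phrase, ()))]


for phrase in (10, 11):
    print(phrase, head_words_with_sense(phrase))
```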
Now we are going to add another promising module, provided by Christian Canu Højgaard, from this repo: participants.
Let's do it in the straightforward way:
A = use(
'ETCBC/bhsa',
mod=(
"ETCBC/lingo/heads/tf",
"ETCBC/valence/tf",
"ch-jensen/participants/actor/tf"
),
hoist=globals(),
)
Locating corpus resources ...
The requested data is not available offline
~/text-fabric-data/github/ch-jensen/participants/actor/tf/2021 not found
rate limit is 5000 requests per hour, with 5000 left for this hour
connecting to online GitHub repo ch-jensen/participants ... connected
No directory /actor/tf/2021 in #9671910a329c069cfd3d366526ea816de57666dc
Will try something else
Failed
No directory /actor/tf/2021 in #9671910a329c069cfd3d366526ea816de57666dc
Failed
There were problems with loading data. The TF API has not been loaded! The app "ETCBC/bhsa" will not work!
The features are not there!
If we have a look on GitHub in this repo, we see under actor/tf the directory c only. Christian has produced his features against version c of the BHSA.
OK, then we go back and run our command for version c.
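The log message above shows where Text-Fabric looked on your machine: ~/text-fabric-data/github/{org}/{repo}/{path}/{version}. Assuming that layout, the following sketch lists which versions of a module are already present locally. The helper `local_versions` is hypothetical, and it says nothing about what is available on GitHub.

```python
from pathlib import Path


def local_versions(org, repo, path, base="~/text-fabric-data/github"):
    """List the version directories of a module found in the local data directory."""
    module_dir = Path(base).expanduser() / org / repo / path
    if not module_dir.is_dir():
        return []
    return sorted(entry.name for entry in module_dir.iterdir() if entry.is_dir())


# Prints an empty list if the module has never been downloaded.
print(local_versions("ch-jensen", "participants", "actor/tf"))
```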
A = use(
'ETCBC/bhsa',
version="c",
mod=(
"ETCBC/lingo/heads/tf",
"ETCBC/valence/tf",
"ch-jensen/participants/actor/tf"
),
hoist=globals(),
)
Locating corpus resources ...
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 39 | 10938.05 | 100 |
chapter | 929 | 459.19 | 100 |
lex | 9233 | 46.20 | 100 |
verse | 23213 | 18.38 | 100 |
half_verse | 45180 | 9.44 | 100 |
sentence | 63727 | 6.69 | 100 |
sentence_atom | 64525 | 6.61 | 100 |
clause | 88121 | 4.84 | 100 |
clause_atom | 90688 | 4.70 | 100 |
phrase | 253207 | 1.68 | 100 |
phrase_atom | 267541 | 1.59 | 100 |
subphrase | 113812 | 1.42 | 38 |
word | 426584 | 1.00 | 100 |
A.header(allMeta=True)
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 39 | 10938.21 | 100 |
chapter | 929 | 459.19 | 100 |
lex | 9230 | 46.22 | 100 |
verse | 23213 | 18.38 | 100 |
half_verse | 45179 | 9.44 | 100 |
sentence | 63717 | 6.70 | 100 |
sentence_atom | 64514 | 6.61 | 100 |
clause | 88131 | 4.84 | 100 |
clause_atom | 90704 | 4.70 | 100 |
phrase | 253203 | 1.68 | 100 |
phrase_atom | 267532 | 1.59 | 100 |
subphrase | 113850 | 1.42 | 38 |
word | 426590 | 1.00 | 100 |
While this succeeded, there are scenarios where you will have more trouble. For example, you decide that you really, really need the bhsa data as in release 1.7.1.
Then you discover that this does not work:
A = use(
'etcbc/bhsa',
version="c",
checkout="v1.7.1",
mod=("etcbc/lingo/heads/tf", "etcbc/valence/tf", "ch-jensen/participants/actor/tf"),
hoist=globals(),
)
because the BHSA invokes two standard modules, etcbc/phono/tf and etcbc/parallels/tf, and if you go to their GitHub repos, you see that they do not have a release v1.7.1.
You have to walk through their releases and find one with the right data version.
Having found them, you can then get it all like this:
A = use(
'etcbc/bhsa',
version="c",
checkout="v1.7.1",
mod=(
"etcbc/phono/tf:v1.2",
"etcbc/parallels/tf:v1.2",
"etcbc/lingo/heads/tf",
"etcbc/valence/tf",
"ch-jensen/participants/actor/tf",
),
hoist=globals(),
)
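The module specifiers above have the shape {org}/{repo}/{path}, optionally followed by a colon and a checkout such as a release tag. A small sketch of how such a string falls apart; `parse_module` is a hypothetical helper for illustration, not the parser that Text-Fabric uses.

```python
def parse_module(spec):
    """Split '{org}/{repo}/{path}:{checkout}' into its parts; checkout may be absent."""
    location, _, checkout = spec.partition(":")
    org, repo, *rest = location.split("/")
    return {"org": org, "repo": repo, "path": "/".join(rest), "checkout": checkout}


for spec in ("etcbc/parallels/tf:v1.2", "ch-jensen/participants/actor/tf"):
    print(parse_module(spec))
```

A specifier without a checkout leaves that part empty, which corresponds to letting Text-Fabric pick the data by its default rules.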
Let's find out about actor.
Again, we can click on the triangles and see information about the features. Christian has provided descriptions in the metadata of the features.
And we can look into the data itself.
fl = F.actor.freqList()
len(fl)
415
fl[0:10]
(('JHWH', 358), ('BN JFR>L', 205), ('>JC', 101), ('2sm"YOUSgmas"', 67), ('MCH', 60), ('>RY', 58), ('>TM', 45), ('>X "YOUSgmas"', 36), ('JFR>L', 35), ('KHN', 33))
Which nodes have an actor feature?
{F.otype.v(n) for n in N.walk() if F.actor.v(n)}
{'phrase_atom', 'subphrase'}
results = A.search(
"""
phrase_atom actor
"""
)
0.08s 2062 results
Let's show the phrase atoms for one of the actor values, KHN:
results = A.search(
"""
phrase_atom actor=KHN
"""
)
0.10s 30 results
A.table(results)
n | p | phrase_atom |
---|---|---|
1 | Leviticus 17:5 | אֶל־הַכֹּהֵ֑ן |
2 | Leviticus 17:6 | זָרַ֨ק |
3 | Leviticus 17:6 | הַכֹּהֵ֤ן |
4 | Leviticus 17:6 | הִקְטִ֣יר |
5 | Leviticus 19:22 | כִפֶּר֩ |
6 | Leviticus 19:22 | הַכֹּהֵ֜ן |
7 | Leviticus 21:1 | אֶל־הַכֹּהֲנִ֖ים |
8 | Leviticus 21:1 | בְּנֵ֣י אַהֲרֹ֑ן |
9 | Leviticus 21:5 | יִקְרְח֤וּ |
10 | Leviticus 21:5 | יְגַלֵּ֑חוּ |
11 | Leviticus 21:5 | יִשְׂרְט֖וּ |
12 | Leviticus 21:6 | קְדֹשִׁ֤ים |
13 | Leviticus 21:6 | יִהְיוּ֙ |
14 | Leviticus 21:6 | יְחַלְּל֔וּ |
15 | Leviticus 21:6 | הֵ֥ם |
16 | Leviticus 21:6 | מַקְרִיבִ֖ם |
17 | Leviticus 21:6 | הָ֥יוּ |
18 | Leviticus 21:6 | קֹֽדֶשׁ׃ |
19 | Leviticus 21:7 | יִקָּ֔חוּ |
20 | Leviticus 21:7 | יִקָּ֑חוּ |
21 | Leviticus 22:11 | כֹהֵ֗ן |
22 | Leviticus 22:11 | יִקְנֶ֥ה |
23 | Leviticus 22:14 | לַכֹּהֵ֖ן |
24 | Leviticus 23:10 | אֶל־הַכֹּהֵֽן׃ |
25 | Leviticus 23:11 | הֵנִ֧יף |
26 | Leviticus 23:11 | יְנִיפֶ֖נּוּ |
27 | Leviticus 23:11 | הַכֹּהֵֽן׃ |
28 | Leviticus 23:20 | הֵנִ֣יף |
29 | Leviticus 23:20 | הַכֹּהֵ֣ן׀ |
30 | Leviticus 23:20 | לַכֹּהֵֽן׃ |
A.show(results, start=1, end=1)
result 1
We see no highlights! That is because phrase atoms are hidden by default. So let's unhide:
A.displaySetup(hiddenTypes="subphrase clause_atom sentence_atom half_verse")
The next calls to show() will work as if hiddenTypes="subphrase clause_atom sentence_atom half_verse" is passed to them.
A.show(results, start=1, end=1)
result 1
We make the feature sense from the valence module visible:
A.show(results, start=1, end=3, withNodes=True, extraFeatures="sense")
result 1
result 2
result 3
Here is a query that shows results with all features.
results = A.search(
    """
book book=Leviticus
  phrase sense*
    phrase_atom actor=KHN
      -heads> word
"""
)
0.26s 30 results
A.displaySetup(
condensed=True,
condenseType="verse",
hiddenTypes="subphrase clause_atom sentence_atom half_verse",
)
A.show(results, start=8, end=8)
A.displaySetup()
verse 8
See whether you can find the quote in the Easter egg that is in etcbc/lingo/easter/tf!
CC-BY Dirk Roorda