This notebook gets you started with using Text-Fabric for coding in the Hebrew Bible.
Familiarity with the underlying data model is recommended.
Short introductions to other TF datasets are available as well.
If you start computing with this tutorial, first copy its parent directory to somewhere else, outside your repository. If you pull changes from the repository later, your work will not be overwritten. Where you put your tutorial directory is up to you. It will work from any directory.
Text-Fabric will fetch a standard set of features for you from the newest GitHub release binaries. It will fetch version 2021.
The data will be stored in the text-fabric-data directory in your home directory.
The simplest way to get going is by this incantation:
from tf.app import use
For the very latest version, use hot.
For the latest release, use latest.
If you have cloned the repos (TF app and data), use clone.
If you do not want/need to upgrade, leave out the checkout specifiers.
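For example, a pinned load might look like this (a sketch: the specifier after the colon applies to the TF app, the checkout argument to the data; drop both to stay on your current copy):
# hypothetical example of checkout specifiers
A = use("ETCBC/bhsa:latest", checkout="latest", hoist=globals())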
A = use("ETCBC/bhsa", hoist=globals())
Locating corpus resources ...
Name | # of nodes | # slots/node | % coverage |
---|---|---|---|
book | 39 | 10938.21 | 100 |
chapter | 929 | 459.19 | 100 |
lex | 9230 | 46.22 | 100 |
verse | 23213 | 18.38 | 100 |
half_verse | 45179 | 9.44 | 100 |
sentence | 63717 | 6.70 | 100 |
sentence_atom | 64514 | 6.61 | 100 |
clause | 88131 | 4.84 | 100 |
clause_atom | 90704 | 4.70 | 100 |
phrase | 253203 | 1.68 | 100 |
phrase_atom | 267532 | 1.59 | 100 |
subphrase | 113850 | 1.42 | 38 |
word | 426590 | 1.00 | 100 |
The data of the BHSA is organized in features. They are columns of data. Think of the Hebrew Bible as a gigantic spreadsheet, where row 1 corresponds to the first word, row 2 to the second word, and so on, for all 426,590 words.
The information about which part of speech each word is constitutes a column in that spreadsheet. The BHSA contains over 100 columns, not only for the 426,590 words, but also for a million more textual objects.
Instead of putting all that information in one big table, the data is organized in separate columns. We call those columns features.
You can see which features have been loaded, and if you click on a feature name, you find its documentation. If you hover over a name, you see where the feature is located on your system.
Edge features are marked by bold italic formatting.
There are ways to tweak the set of features that is loaded. You can load more or fewer.
See share for examples.
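For instance, extra data modules can be requested at load time via the mod parameter (a sketch; etcbc/valence/tf is one published add-on module, shown purely as an illustration):
# sketch: load the BHSA together with an extra feature module
A = use("ETCBC/bhsa", mod="etcbc/valence/tf", hoist=globals())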
Note that we have phono features. The BHSA data has a special 1-1 transcription from Hebrew to ASCII, but not a phonetic transcription.
I have made a notebook that tries hard to find phonological representations for all the words. The result is a module in text-fabric format. We'll encounter that later.
This module, and the module etcbc/parallels are standard modules of the BHSA app.
The result of the incantation is that we have a bunch of special variables at our disposal that give us access to the text and data of the Hebrew Bible.
At this point it is helpful to throw a quick glance at the text-fabric API documentation (see the links under API Members above).
The most essential thing for now is that we can use F to access the data in the features we've loaded. But there is more, such as N, which helps us to walk over the text, as we will see in a minute.
The API members above show you exactly which new names have been inserted in your namespace. If you click on these names, you go to the API documentation for them.
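If you would rather not have names injected into your namespace, you can omit hoist and reach the same objects under A.api (a sketch):
F = A.api.F  # feature access, the same object as the hoisted F
T = A.api.T  # text API
L = A.api.L  # locality API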
Text-Fabric contains a flexible search engine, that does not only work for the BHSA data, but also for data that you add to it.
Search is the quickest way to come up-to-speed with your data, without too much programming.
Jump to the dedicated search tutorial first, to whet your appetite. And if you already know MQL queries, you can build from that in searchFromMQL.
The real power of search lies in the fact that it is integrated in a programming environment. You can use programming to compose your queries dynamically and to process their results further.
Therefore, the rest of this tutorial is still important when you want to tap that power. If you continue here, you learn all the basics of data-navigation with Text-Fabric.
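To give a first taste here, a minimal search template and how to run it (a sketch: Pred and verb are real values of the BHSA features function and sp, but the query itself is only an example):
query = """
clause
  phrase function=Pred
    word sp=verb
"""
results = A.search(query)  # list of (clause, phrase, word) node tuples
A.table(results, end=3)    # show the first 3 results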
Before we start coding, we load some modules that we need underway:
%load_ext autoreload
%autoreload 2
import os
import collections
from itertools import chain
In order to get acquainted with the data, we start with the simple task of counting.
We use the N.walk() generator to walk through the nodes.
We compared the BHSA data to a gigantic spreadsheet, where the rows correspond to the words.
In Text-Fabric, we call the rows slots, because they are the textual positions that can be filled with words.
We also mentioned that there are 1,000,000 more textual objects. They are the phrases, clauses, sentences, verses, chapters and books. They also correspond to rows in the big spreadsheet.
In Text-Fabric we call all these rows nodes, and the N.walk() generator carries us through those nodes in the textual order.
Just one extra thing: the info statements generate timed messages. If you use them instead of print, you'll get a sense of the amount of time that the various processing steps typically need.
A.indent(reset=True)
A.info("Counting nodes ...")
i = 0
for n in N.walk():
    i += 1
A.info("{} nodes".format(i))
0.00s Counting nodes ...
0.09s 1446831 nodes
Here you see it: 1.4 M nodes!
Every node has a type, like word, phrase, or sentence. We know that we have 426,590 words and about a million other nodes. But what exactly are they?
Text-Fabric has two special features, otype and oslots, that must occur in every Text-Fabric data set. otype tells you the type of each node, and you can ask for the number of slots in the text.
Here we go!
F.otype.slotType
'word'
F.otype.maxSlot
426590
F.otype.maxNode
1446831
F.otype.all
('book', 'chapter', 'lex', 'verse', 'half_verse', 'sentence', 'sentence_atom', 'clause', 'clause_atom', 'phrase', 'phrase_atom', 'subphrase', 'word')
C.levels.data
(('book', 10938.205128205129, 426591, 426629), ('chapter', 459.1926803013994, 426630, 427558), ('lex', 46.21776814734561, 1437602, 1446831), ('verse', 18.377202429673027, 1414389, 1437601), ('half_verse', 9.442218729940901, 606394, 651572), ('sentence', 6.6950735282577645, 1172308, 1236024), ('sentence_atom', 6.612363207985863, 1236025, 1300538), ('clause', 4.840408028956894, 427559, 515689), ('clause_atom', 4.7031001940377495, 515690, 606393), ('phrase', 1.684774666966821, 651573, 904775), ('phrase_atom', 1.5945382234648566, 904776, 1172307), ('subphrase', 1.4213614404918753, 1300539, 1414388), ('word', 1, 1, 426590))
This is interesting: above you see all the textual objects, with the average size of their objects, the node where they start, and the node where they end.
This is an intuitive way to count the number of nodes in each type.
Note in passing how we use indent in conjunction with info to produce neat, timed and indented progress messages.
A.indent(reset=True)
A.info("counting objects ...")
for otype in F.otype.all:
    i = 0
    A.indent(level=1, reset=True)
    for n in F.otype.s(otype):
        i += 1
    A.info("{:>7} {}s".format(i, otype))
A.indent(level=0)
A.info("Done")
0.00s counting objects ...
 | 0.00s 39 books
 | 0.00s 929 chapters
 | 0.00s 9230 lexs
 | 0.00s 23213 verses
 | 0.00s 45179 half_verses
 | 0.00s 63717 sentences
 | 0.00s 64514 sentence_atoms
 | 0.01s 88131 clauses
 | 0.00s 90704 clause_atoms
 | 0.01s 253203 phrases
 | 0.01s 267532 phrase_atoms
 | 0.01s 113850 subphrases
 | 0.02s 426590 words
0.08s Done
We use the A API (the extra power) to peek into the corpus.
First some words. Node 15890 is a word with a dotless shin.
Node 1002 is a word with a yod after a hataf seghol.
Node 100,000 is just a word slot.
Let's inspect them and see where they are.
First the plain view:
F.otype.v(1)
'word'
wordShows = (15890, 1002, 100000)
for word in wordShows:
    A.plain(word, withPassage=True)
You can leave out the passage reference:
for word in wordShows:
    A.plain(word, withPassage=False)
Now we show other objects, both with and without passage reference.
normalShow = dict(
    wordShow=wordShows[0],
    phraseShow=700000,
    clauseShow=500000,
    sentenceShow=1200000,
    lexShow=1437667,
)
sectionShow = dict(
    verseShow=1420000,
    chapterShow=427000,
    bookShow=426598,
)
for (name, n) in normalShow.items():
    A.dm(f"**{name}** = node `{n}`\n")
    A.plain(n)
    A.plain(n, withPassage=False)
    A.dm("\n---\n")
wordShow = node 15890
phraseShow = node 700000
clauseShow = node 500000
sentenceShow = node 1200000
lexShow = node 1437667
Note that for section nodes (except verse and half-verse) the withPassage option has little effect. The passage is the thing that is hyperlinked; the node is represented as a textual reference to the piece of text in question.
for (name, n) in sectionShow.items():
    if name == "verseShow":
        continue
    A.dm(f"**{name}** = node `{n}`\n")
    A.plain(n)
    A.plain(n, withPassage=False)
    A.dm("\n---\n")
We can also dive into the structure of the textual objects, provided they are not too large.
The function pretty gives a display of the object that a node stands for, together with the structure below that node.
for (name, n) in normalShow.items():
    A.dm(f"**{name}** = node `{n}`\n")
    A.pretty(n)
    A.dm("\n---\n")
wordShow = node 15890
phraseShow = node 700000
clauseShow = node 500000
sentenceShow = node 1200000
lexShow = node 1437667
Note: if you need a link to SHEBANQ for just any node:
million = 1000000
A.webLink(million)
We can show some standard features in the display:
for (name, n) in normalShow.items():
    A.dm(f"**{name}** = node `{n}`\n")
    A.pretty(n, standardFeatures=True)
    A.dm("\n---\n")
wordShow = node 15890
phraseShow = node 700000
clauseShow = node 500000
sentenceShow = node 1200000
lexShow = node 1437667
For more display options, see display.
F gives access to all features. Every feature has a method freqList() to generate a frequency list of its values, higher frequencies first. Here are the parts of speech:
F.sp.freqList()
(('subs', 125583), ('verb', 75451), ('prep', 73298), ('conj', 62737), ('nmpr', 35607), ('art', 30387), ('adjv', 10141), ('nega', 6059), ('prps', 5035), ('advb', 4603), ('prde', 2678), ('intj', 1912), ('inrg', 1303), ('prin', 1026))
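freqList() can also be restricted to a node type (a sketch; nodeTypes is a parameter of the generic TF feature API). Counting lex values over word nodes, for instance, gives the most frequent lexemes by occurrence:
F.lex.freqList(nodeTypes={"word"})[0:10]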
verbs = collections.Counter()
A.indent(reset=True)
A.info("Collecting data")
for w in F.otype.s("word"):
    if F.sp.v(w) != "verb":
        continue
    verbs[F.lex.v(w)] += 1
A.info("Done")
print(
    "".join(
        "{}: {}\n".format(verb, cnt)
        for (verb, cnt) in sorted(verbs.items(), key=lambda x: (-x[1], x[0]))[0:10]
    )
)
0.00s Collecting data
0.09s Done
>MR[: 5378
HJH[: 3561
<FH[: 2629
BW>[: 2570
NTN[: 2017
HLK[: 1554
R>H[: 1298
CM<[: 1168
DBR[: 1138
JCB[: 1082
An alternative way to do this is to use the feature freq_lex, defined for lex nodes. Now we walk the lexemes instead of the occurrences.
Note that the feature sp (part-of-speech) is defined for nodes of type word as well as lex. Both also have the lex feature.
verbs = collections.Counter()
A.indent(reset=True)
A.info("Collecting data")
for w in F.otype.s("lex"):
    if F.sp.v(w) != "verb":
        continue
    verbs[F.lex.v(w)] += F.freq_lex.v(w)
A.info("Done")
print(
    "".join(
        "{}: {}\n".format(verb, cnt)
        for (verb, cnt) in sorted(verbs.items(), key=lambda x: (-x[1], x[0]))[0:10]
    )
)
0.00s Collecting data
0.00s Done
>MR[: 5378
HJH[: 3561
<FH[: 2629
BW>[: 2570
NTN[: 2017
HLK[: 1554
R>H[: 1298
CM<[: 1168
DBR[: 1138
JCB[: 1082
This is an order of magnitude faster. In this case, that means the difference between a third of a second and a hundredth of a second, not a big gain in absolute terms. But suppose you need to run this 1000 times in a loop. Then it is the difference between 5 minutes and 10 seconds. A five minute wait is not pleasant in interactive computing!
We make a mapping between lexeme forms and the number of occurrences of those lexemes.
lexeme_dict = {F.lex_utf8.v(n): F.freq_lex.v(n) for n in F.otype.s("word")}
list(lexeme_dict.items())[0:10]
[('ב', 15542), ('ראשׁית', 51), ('ברא', 48), ('אלהים', 2601), ('את', 10987), ('ה', 30386), ('שׁמים', 421), ('ו', 50272), ('ארץ', 2504), ('היה', 3561)]
As a primer of real world work on lexeme distribution, have a look at James Cuénod's notebook on Collocation MI Analysis of the Hebrew Bible.
It is a nice example of how you collect data with TF API calls, then do research with your own methods and tools, and then use TF for presenting results.
In case the name has changed, look for it in its enclosing repo.
A.indent(reset=True)
hapax = []
zero = set()
for lx in F.otype.s("lex"):
    occs = L.d(lx, otype="word")
    n = len(occs)
    if n == 0:  # that's weird: should not happen
        zero.add(lx)
    elif n == 1:  # hapax found!
        hapax.append(lx)
A.info("{} hapaxes found".format(len(hapax)))
if zero:
    A.error("{} zeroes found".format(len(zero)), tm=False)
else:
    A.info("No zeroes found", tm=False)
for h in hapax[0:10]:
    print("\t{:<8} {}".format(F.lex.v(h), F.gloss.v(h)))
0.04s 3071 hapaxes found
No zeroes found
    PJCWN/   Pishon
    CWP[     bruise
    HRWN/    pregnancy
    Z<H/     sweat
    LHV/     flame
    NWD/     Nod
    XNWK=/   Enoch
    MXWJ>L/  Mehujael
    MXJJ>L/  Mehujael
    JBL=/    Jabal
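Note that freq_lex offers a shortcut here: a lexeme is a hapax precisely when its frequency is 1. A one-liner sketch that should agree with the count above:
hapax2 = [lx for lx in F.otype.s("lex") if F.freq_lex.v(lx) == 1]
len(hapax2)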
The occurrence base of a lexeme is the set of verses, chapters and books in which it occurs. Let's look for lexemes that occur in a single chapter.
If a lexeme occurs in a single chapter, its slots are a subset of the slots of that chapter. So, if you go up from the lexeme, you encounter the chapter.
Normally, lexemes occur in many chapters, and then no single chapter includes all of their occurrences, so if you go up from such lexemes, you do not find chapters.
Let's check it out.
Oh yes, we have already found the hapaxes, we will skip them here.
A.indent(reset=True)
A.info("Finding single chapter lexemes")
singleCh = []
multipleCh = []
for lx in F.otype.s("lex"):
    chapters = L.u(lx, "chapter")
    if len(chapters) == 1:
        if lx not in hapax:
            singleCh.append(lx)
    elif len(chapters) > 1:  # should not happen
        multipleCh.append(lx)
A.info("{} single chapter lexemes found".format(len(singleCh)))
if multipleCh:
    A.error(
        "{} chapter embedders of multiple lexemes found".format(len(multipleCh)),
        tm=False,
    )
else:
    A.info("No chapter embedders of multiple lexemes found", tm=False)
for s in singleCh[0:10]:
    print(
        "{:<20} {:<6}".format(
            "{} {}:{}".format(*T.sectionFromNode(s)),
            F.lex.v(s),
        )
    )
0.00s Finding single chapter lexemes
0.05s 450 single chapter lexemes found
No chapter embedders of multiple lexemes found
Genesis 4:1          QJN=/
Genesis 4:2          HBL=/
Genesis 4:18         <JRD/
Genesis 4:18         MTWC>L/
Genesis 4:19         YLH/
Genesis 4:22         TWBL_QJN/
Genesis 10:11        KLX=/
Genesis 14:1         >MRPL/
Genesis 14:1         >RJWK/
Genesis 14:1         >LSR/
As a final exercise with lexemes, let's make a list of all books, and show their total number of lexemes and the number of lexemes that occur exclusively in that book.
A.indent(reset=True)
A.info("Making book-lexeme index")
allBook = collections.defaultdict(set)
allLex = set()
for b in F.otype.s("book"):
    for w in L.d(b, "word"):
        lx = L.u(w, "lex")[0]
        allBook[b].add(lx)
        allLex.add(lx)
A.info("Found {} lexemes".format(len(allLex)))
0.00s Making book-lexeme index
1.08s Found 9230 lexemes
A.indent(reset=True)
A.info("Finding single book lexemes")
singleBook = collections.defaultdict(lambda: 0)
for lx in F.otype.s("lex"):
    book = L.u(lx, "book")
    if len(book) == 1:
        singleBook[book[0]] += 1
A.info("found {} single book lexemes".format(sum(singleBook.values())))
0.00s Finding single book lexemes
0.01s found 4224 single book lexemes
print(
    "{:<20}{:>5}{:>5}{:>5}\n{}".format(
        "book",
        "#all",
        "#own",
        "%own",
        "-" * 35,
    )
)
booklist = []
for b in F.otype.s("book"):
    book = T.bookName(b)
    a = len(allBook[b])
    o = singleBook.get(b, 0)
    p = 100 * o / a
    booklist.append((book, a, o, p))
for x in sorted(booklist, key=lambda e: (-e[3], -e[1], e[0])):
    print("{:<20} {:>4} {:>4} {:>4.1f}%".format(*x))
book                 #all #own %own
-----------------------------------
Daniel               1122  428 38.1%
1_Chronicles         2013  487 24.2%
Ezra                  991  199 20.1%
Joshua               1175  206 17.5%
Esther                472   67 14.2%
Isaiah               2555  350 13.7%
Numbers              1457  197 13.5%
Ezekiel              1719  212 12.3%
Song_of_songs         503   60 11.9%
Job                  1717  202 11.8%
Genesis              1816  208 11.5%
Nehemiah             1076  110 10.2%
Psalms               2250  216  9.6%
Leviticus             960   88  9.2%
Judges               1210   99  8.2%
Ecclesiastes          575   46  8.0%
Proverbs             1356  103  7.6%
Jeremiah             1949  147  7.5%
2_Samuel             1304   89  6.8%
1_Samuel             1256   85  6.8%
2_Kings              1266   85  6.7%
Exodus               1425   92  6.5%
1_Kings              1291   81  6.3%
Deuteronomy          1449   80  5.5%
Lamentations          592   31  5.2%
2_Chronicles         1411   67  4.7%
Nahum                 357   16  4.5%
Hosea                 742   33  4.4%
Ruth                  319   14  4.4%
Habakkuk              393   17  4.3%
Amos                  652   27  4.1%
Joel                  398   14  3.5%
Zechariah             726   25  3.4%
Obadiah               167    5  3.0%
Micah                 586   16  2.7%
Zephaniah             367   10  2.7%
Jonah                 252    5  2.0%
Haggai                208    3  1.4%
Malachi               314    4  1.3%
The book names may sound a bit unfamiliar, they are in Latin here. Later we'll see that you can also get them in English, or in Swahili.
We travel upwards and downwards, forwards and backwards through the nodes.
The Locality API (L) provides functions: u() for going up, d() for going down, n() for going to next nodes, and p() for going to previous nodes.
These directions are indirect notions: nodes are just numbers, but by means of the oslots feature they are linked to slots. One node contains another node if the one is linked to a set of slots that contains the set of slots the other is linked to. And one node is next or previous to another if its slots follow or precede the slots of the other one.
L.u(node) Up: to nodes that embed node.
L.d(node) Down: the opposite direction, to those that are contained in node.
L.n(node) Next: the next adjacent nodes, i.e. nodes whose first slot comes immediately after the last slot of node.
L.p(node) Previous: the previous adjacent nodes, i.e. nodes whose last slot comes immediately before the first slot of node.
All these functions yield nodes of all possible otypes. By passing an optional parameter, you can restrict the results to nodes of that type.
The results are ordered according to the order of things in the text.
The functions always return a tuple, even if there is just one node in the result.
We go from the first word to the book that contains it.
Note the [0] at the end. You expect one book, yet L returns a tuple. To get the only element of that tuple, you need that [0].
If you are like me, you keep forgetting it, and that will lead to weird error messages later on.
firstBook = L.u(1, otype="book")[0]
print(firstBook)
426591
And let's see all the containing objects of word 3:
w = 3
for otype in F.otype.all:
    if otype == F.otype.slotType:
        continue
    up = L.u(w, otype=otype)
    upNode = "x" if len(up) == 0 else up[0]
    print("word {} is contained in {} {}".format(w, otype, upNode))
word 3 is contained in book 426591
word 3 is contained in chapter 426630
word 3 is contained in lex 1437604
word 3 is contained in verse 1414389
word 3 is contained in half_verse 606394
word 3 is contained in sentence 1172308
word 3 is contained in sentence_atom 1236025
word 3 is contained in clause 427559
word 3 is contained in clause_atom 515690
word 3 is contained in phrase 651574
word 3 is contained in phrase_atom 904777
word 3 is contained in subphrase x
Let's go to the next nodes of the first book.
afterFirstBook = L.n(firstBook)
for n in afterFirstBook:
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )
secondBook = L.n(firstBook, otype="book")[0]
  28765: word          first slot=28765 , last slot=28765
 923533: phrase_atom   first slot=28765 , last slot=28765
 669555: phrase        first slot=28765 , last slot=28765
 521826: clause_atom   first slot=28765 , last slot=28769
 433546: clause        first slot=28765 , last slot=28769
 609394: half_verse    first slot=28765 , last slot=28772
1240671: sentence_atom first slot=28765 , last slot=28774
1176925: sentence      first slot=28765 , last slot=28793
1415922: verse         first slot=28765 , last slot=28778
 426680: chapter       first slot=28765 , last slot=29113
 426592: book          first slot=28765 , last slot=52512
And let's see what is right before the second book.
for n in L.p(secondBook):
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )
 426591: book          first slot=1     , last slot=28764
 426679: chapter       first slot=28260 , last slot=28764
1415921: verse         first slot=28747 , last slot=28764
 609393: half_verse    first slot=28755 , last slot=28764
1176924: sentence      first slot=28758 , last slot=28764
1240670: sentence_atom first slot=28758 , last slot=28764
 433545: clause        first slot=28758 , last slot=28764
 521825: clause_atom   first slot=28758 , last slot=28764
 669554: phrase        first slot=28763 , last slot=28764
 923532: phrase_atom   first slot=28763 , last slot=28764
  28764: word          first slot=28764 , last slot=28764
We go to the chapters of the second book, and just count them.
chapters = L.d(secondBook, otype="chapter")
print(len(chapters))
40
We pick the first verse and the first word, and explore what is above and below them.
for n in [1, L.u(1, otype="verse")[0]]:
    A.indent(level=0)
    A.info("Node {}".format(n), tm=False)
    A.indent(level=1)
    A.info("UP", tm=False)
    A.indent(level=2)
    A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    A.indent(level=1)
    A.info("DOWN", tm=False)
    A.indent(level=2)
    A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
A.indent(level=0)
A.info("Done", tm=False)
Node 1
 | UP
 |  | 1437602 lex
 |  | 904776 phrase_atom
 |  | 651573 phrase
 |  | 606394 half_verse
 |  | 515690 clause_atom
 |  | 427559 clause
 |  | 1236025 sentence_atom
 |  | 1172308 sentence
 |  | 1414389 verse
 |  | 426630 chapter
 |  | 426591 book
 | DOWN
 |  |
Node 1414389
 | UP
 |  | 515690 clause_atom
 |  | 427559 clause
 |  | 1236025 sentence_atom
 |  | 1172308 sentence
 |  | 426630 chapter
 |  | 426591 book
 | DOWN
 |  | 1172308 sentence
 |  | 1236025 sentence_atom
 |  | 427559 clause
 |  | 515690 clause_atom
 |  | 606394 half_verse
 |  | 651573 phrase
 |  | 904776 phrase_atom
 |  | 1 word
 |  | 2 word
 |  | 651574 phrase
 |  | 904777 phrase_atom
 |  | 3 word
 |  | 651575 phrase
 |  | 904778 phrase_atom
 |  | 4 word
 |  | 606395 half_verse
 |  | 651576 phrase
 |  | 904779 phrase_atom
 |  | 1300539 subphrase
 |  | 5 word
 |  | 6 word
 |  | 7 word
 |  | 8 word
 |  | 1300540 subphrase
 |  | 9 word
 |  | 10 word
 |  | 11 word
Done
So far, we have mainly seen nodes and their numbers, and the names of node types. You would almost forget that we are dealing with text. So let's try to see some text.
In the same way as F gives access to feature data, T gives access to the text. That is also feature data, but you can tell Text-Fabric which features specifically carry the text, and in return Text-Fabric offers you a Text API: T.
Hebrew text can be represented in a number of ways: in Hebrew characters (fully pointed or consonantal), in transliteration, and phonetically.
If you wonder where the information about text formats is stored: not in the program text-fabric, but in the data set. It has a feature otext, which specifies the formats and which features must be used to produce them. otext is the third special feature in a TF data set, next to otype and oslots.
It is an optional feature. If it is absent, there will be no T API.
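If you are curious what such a specification looks like: a format in otext is a template over feature names, roughly like this (an illustrative, simplified sketch; the actual BHSA definition is richer, e.g. it also takes the qere features into account):
fmt:text-orig-full={g_word_utf8}{trailer_utf8}
This says: render each word slot as its g_word_utf8 value followed by its trailer_utf8 value.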
Here is a list of all available formats in this data set.
sorted(T.formats)
['lex-default', 'lex-orig-full', 'lex-orig-plain', 'lex-trans-full', 'lex-trans-plain', 'text-orig-full', 'text-orig-full-ketiv', 'text-orig-plain', 'text-phono-full', 'text-trans-full', 'text-trans-full-ketiv', 'text-trans-plain']
Note the text-phono-full format here. It does not come from the main data source bhsa, but from the module phono. Look in your data directory, find ~/github/etcbc/phono/tf/2017/otext@phono.tf, and you'll see this format defined there.
We can pretty display in other formats:
for word in wordShows:
    A.pretty(word, fmt="text-phono-full")
This function is central to get text representations of nodes. Its most basic usage is
T.text(nodes, fmt=fmt)
where nodes is a list or iterable of nodes, usually word nodes, and fmt is the name of a format. If you leave out fmt, the default text-orig-full is chosen.
The result is the text in that format for all nodes specified:
T.text([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], fmt="text-orig-plain")
'בראשׁית ברא אלהים את השׁמים ואת הארץ׃ '
There is also another usage of this function:
T.text(node, fmt=fmt)
where node is a single node. In this case, the default format is ntype-orig-full, where ntype is the type of node. So for a lex node, the default format is lex-orig-full.
If the format is defined in the corpus, it will be used. Otherwise, the word nodes contained in node will be looked up and represented with the default format text-orig-full.
In this way we can sensibly represent a lot of different nodes, such as chapters, verses, sentences, words and lexemes.
We compose a set of example nodes and run T.text on them:
exampleNodes = [
    1,
    F.otype.s("sentence")[0],
    F.otype.s("verse")[0],
    F.otype.s("chapter")[0],
    F.otype.s("lex")[1],
]
exampleNodes
[1, 1172308, 1414389, 426630, 1437603]
for n in exampleNodes:
    print(f"This is {F.otype.v(n)} {n}:")
    print(T.text(n))
    print("")
This is word 1: בְּ This is sentence 1172308: בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ This is verse 1414389: בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ This is chapter 426630: בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃ וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃ וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃ וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃ פ וַיֹּ֣אמֶר אֱלֹהִ֔ים יְהִ֥י רָקִ֖יעַ בְּתֹ֣וךְ הַמָּ֑יִם וִיהִ֣י מַבְדִּ֔יל בֵּ֥ין מַ֖יִם לָמָֽיִם׃ וַיַּ֣עַשׂ אֱלֹהִים֮ אֶת־הָרָקִיעַ֒ וַיַּבְדֵּ֗ל בֵּ֤ין הַמַּ֨יִם֙ אֲשֶׁר֙ מִתַּ֣חַת לָרָקִ֔יעַ וּבֵ֣ין הַמַּ֔יִם אֲשֶׁ֖ר מֵעַ֣ל לָרָקִ֑יעַ וַֽיְהִי־כֵֽן׃ וַיִּקְרָ֧א אֱלֹהִ֛ים לָֽרָקִ֖יעַ שָׁמָ֑יִם וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום שֵׁנִֽי׃ פ וַיֹּ֣אמֶר אֱלֹהִ֗ים יִקָּו֨וּ הַמַּ֜יִם מִתַּ֤חַת הַשָּׁמַ֨יִם֙ אֶל־מָקֹ֣ום אֶחָ֔ד וְתֵרָאֶ֖ה הַיַּבָּשָׁ֑ה וַֽיְהִי־כֵֽן׃ וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לַיַּבָּשָׁה֙ אֶ֔רֶץ וּלְמִקְוֵ֥ה הַמַּ֖יִם קָרָ֣א יַמִּ֑ים וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַיֹּ֣אמֶר אֱלֹהִ֗ים תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב מַזְרִ֣יעַ זֶ֔רַע עֵ֣ץ פְּרִ֞י עֹ֤שֶׂה פְּרִי֙ לְמִינֹ֔ו אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו עַל־הָאָ֑רֶץ וַֽיְהִי־כֵֽן׃ וַתֹּוצֵ֨א הָאָ֜רֶץ דֶּ֠שֶׁא עֵ֣שֶׂב מַזְרִ֤יעַ זֶ֨רַע֙ לְמִינֵ֔הוּ וְעֵ֧ץ עֹ֥שֶׂה פְּרִ֛י אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו לְמִינֵ֑הוּ וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום שְׁלִישִֽׁי׃ פ וַיֹּ֣אמֶר אֱלֹהִ֗ים יְהִ֤י מְאֹרֹת֙ בִּרְקִ֣יעַ הַשָּׁמַ֔יִם לְהַבְדִּ֕יל בֵּ֥ין הַיֹּ֖ום וּבֵ֣ין הַלָּ֑יְלָה וְהָי֤וּ לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים וּלְיָמִ֖ים וְשָׁנִֽים׃ וְהָי֤וּ לִמְאֹורֹת֙ בִּרְקִ֣יעַ הַשָּׁמַ֔יִם לְהָאִ֖יר עַל־הָאָ֑רֶץ וַֽיְהִי־כֵֽן׃ וַיַּ֣עַשׂ אֱלֹהִ֔ים אֶת־שְׁנֵ֥י הַמְּאֹרֹ֖ת הַגְּדֹלִ֑ים אֶת־הַמָּאֹ֤ור הַגָּדֹל֙ לְמֶמְשֶׁ֣לֶת הַיֹּ֔ום וְאֶת־הַמָּאֹ֤ור הַקָּטֹן֙ לְמֶמְשֶׁ֣לֶת הַלַּ֔יְלָה וְאֵ֖ת הַכֹּוכָבִֽים׃ וַיִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּרְקִ֣יעַ הַשָּׁמָ֑יִם לְהָאִ֖יר עַל־הָאָֽרֶץ׃ וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה וּֽלֲהַבְדִּ֔יל בֵּ֥ין הָאֹ֖ור וּבֵ֣ין הַחֹ֑שֶׁךְ וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום רְבִיעִֽי׃ פ וַיֹּ֣אמֶר אֱלֹהִ֔ים יִשְׁרְצ֣וּ הַמַּ֔יִם שֶׁ֖רֶץ נֶ֣פֶשׁ חַיָּ֑ה וְעֹוף֙ יְעֹופֵ֣ף עַל־הָאָ֔רֶץ עַל־פְּנֵ֖י רְקִ֥יעַ הַשָּׁמָֽיִם׃ וַיִּבְרָ֣א אֱלֹהִ֔ים אֶת־הַתַּנִּינִ֖ם הַגְּדֹלִ֑ים וְאֵ֣ת כָּל־נֶ֣פֶשׁ הַֽחַיָּ֣ה׀ הָֽרֹמֶ֡שֶׂת אֲשֶׁר֩ שָׁרְצ֨וּ הַמַּ֜יִם לְמִֽינֵהֶ֗ם וְאֵ֨ת כָּל־עֹ֤וף כָּנָף֙ לְמִינֵ֔הוּ וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַיְבָ֧רֶךְ אֹתָ֛ם אֱלֹהִ֖ים לֵאמֹ֑ר פְּר֣וּ וּרְב֗וּ וּמִלְא֤וּ אֶת־הַמַּ֨יִם֙ בַּיַּמִּ֔ים וְהָעֹ֖וף יִ֥רֶב בָּאָֽרֶץ׃ וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום חֲמִישִֽׁי׃ פ וַיֹּ֣אמֶר אֱלֹהִ֗ים תֹּוצֵ֨א הָאָ֜רֶץ נֶ֤פֶשׁ חַיָּה֙ לְמִינָ֔הּ בְּהֵמָ֥ה וָרֶ֛מֶשׂ וְחַֽיְתֹו־אֶ֖רֶץ לְמִינָ֑הּ וַֽיְהִי־כֵֽן׃ וַיַּ֣עַשׂ אֱלֹהִים֩ אֶת־חַיַּ֨ת הָאָ֜רֶץ לְמִינָ֗הּ וְאֶת־הַבְּהֵמָה֙ לְמִינָ֔הּ וְאֵ֛ת כָּל־רֶ֥מֶשׂ הָֽאֲדָמָ֖ה לְמִינֵ֑הוּ וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַיֹּ֣אמֶר אֱלֹהִ֔ים נַֽעֲשֶׂ֥ה אָדָ֛ם בְּצַלְמֵ֖נוּ כִּדְמוּתֵ֑נוּ וְיִרְדּוּ֩ בִדְגַ֨ת הַיָּ֜ם וּבְעֹ֣וף הַשָּׁמַ֗יִם וּבַבְּהֵמָה֙ וּבְכָל־הָאָ֔רֶץ וּבְכָל־הָרֶ֖מֶשׂ הָֽרֹמֵ֥שׂ עַל־הָאָֽרֶץ׃ וַיִּבְרָ֨א אֱלֹהִ֤ים׀ אֶת־הָֽאָדָם֙ בְּצַלְמֹ֔ו בְּצֶ֥לֶם אֱלֹהִ֖ים בָּרָ֣א אֹתֹ֑ו זָכָ֥ר וּנְקֵבָ֖ה בָּרָ֥א אֹתָֽם׃ וַיְבָ֣רֶךְ אֹתָם֮ אֱלֹהִים֒ וַיֹּ֨אמֶר לָהֶ֜ם אֱלֹהִ֗ים פְּר֥וּ וּרְב֛וּ וּמִלְא֥וּ אֶת־הָאָ֖רֶץ וְכִבְשֻׁ֑הָ וּרְד֞וּ 
בִּדְגַ֤ת הַיָּם֙ וּבְעֹ֣וף הַשָּׁמַ֔יִם וּבְכָל־חַיָּ֖ה הָֽרֹמֶ֥שֶׂת עַל־הָאָֽרֶץ׃ וַיֹּ֣אמֶר אֱלֹהִ֗ים הִנֵּה֩ נָתַ֨תִּי לָכֶ֜ם אֶת־כָּל־עֵ֣שֶׂב׀ זֹרֵ֣עַ זֶ֗רַע אֲשֶׁר֙ עַל־פְּנֵ֣י כָל־הָאָ֔רֶץ וְאֶת־כָּל־הָעֵ֛ץ אֲשֶׁר־בֹּ֥ו פְרִי־עֵ֖ץ זֹרֵ֣עַ זָ֑רַע לָכֶ֥ם יִֽהְיֶ֖ה לְאָכְלָֽה׃ וּֽלְכָל־חַיַּ֣ת הָ֠אָרֶץ וּלְכָל־עֹ֨וף הַשָּׁמַ֜יִם וּלְכֹ֣ל׀ רֹומֵ֣שׂ עַל־הָאָ֗רֶץ אֲשֶׁר־בֹּו֙ נֶ֣פֶשׁ חַיָּ֔ה אֶת־כָּל־יֶ֥רֶק עֵ֖שֶׂב לְאָכְלָ֑ה וַֽיְהִי־כֵֽן׃ וַיַּ֤רְא אֱלֹהִים֙ אֶת־כָּל־אֲשֶׁ֣ר עָשָׂ֔ה וְהִנֵּה־טֹ֖וב מְאֹ֑ד וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום הַשִּׁשִּֽׁי׃ פ This is lex 1437603: רֵאשִׁית
Now let's use those formats to print out the first verse of the Hebrew Bible.
for fmt in sorted(T.formats):
    print("{}:\n\t{}".format(fmt, T.text(range(1, 12), fmt=fmt)))
lex-default: בְּ רֵאשִׁית ברא אֱלֹהִים אֵת הַ שָׁמַיִם וְ אֵת הַ אֶרֶץ lex-orig-full: בְּ רֵאשִׁית בָּרָא אֱלֹה אֵת הַ שָּׁמַי וְ אֵת הָ אָרֶץ lex-orig-plain: ב ראשׁית ברא אלהים את ה שׁמים ו את ה ארץ lex-trans-full: B.:- R;>CIJT B.@R@> >:ELOH >;T HA- C.@MAJ W:- >;T H@- >@REY lex-trans-plain: B R>CJT/ BR>[ >LHJM/ >T H CMJM/ W >T H >RY/ text-orig-full: בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ text-orig-full-ketiv: בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ text-orig-plain: בראשׁית ברא אלהים את השׁמים ואת הארץ׃ text-phono-full: bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . text-trans-full: B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 text-trans-full-ketiv: B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 text-trans-plain: BR>CJT BR> >LHJM >T HCMJM W>T H>RY00
Note that lex-default is a format that only works for nodes of type lex.
If we do not specify a format, the default format is used (text-orig-full).
T.text(range(1, 12))
'בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ '
firstVerse = F.otype.s("verse")[0]
T.text(firstVerse)
'בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ '
T.text(firstVerse, fmt="text-phono-full")
'bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . '
The important things to remember are:
you can call T.text(lx) for lexeme nodes lx and it will give the vocalized lexeme (using format lex-default);
you can get the text of a node n in the default format by T.text(n);
you can get the text of a node n in other formats by T.text(n, fmt=fmt, descend=True).
Part of the pleasure of working with computers is that they can crunch massive amounts of data. The text of the Hebrew Bible is a piece of cake.
It takes less than ten seconds to have that cake and eat it. In a dozen formats.
A.indent(reset=True)
A.info("writing plain text of whole Bible in all formats ...")
text = collections.defaultdict(list)
for v in F.otype.s("verse"):
    for fmt in sorted(T.formats):
        text[fmt].append(T.text(v, fmt=fmt, descend=True))
A.info("done {} formats".format(len(text)))
0.00s writing plain text of whole Bible in all formats ...
3.09s done 12 formats
for fmt in sorted(text):
    print("{}\n{}\n".format(fmt, "\n".join(text[fmt][0:5])))
lex-default בְּ רֵאשִׁית ברא אֱלֹהִים אֵת הַ שָׁמַיִם וְ אֵת הַ אֶרֶץ וְ הַ אֶרֶץ היה תֹּהוּ וְ בֹּהוּ וְ חֹשֶׁךְ עַל פָּנֶה תְּהֹום וְ רוּחַ אֱלֹהִים רחף עַל פָּנֶה הַ מַיִם וְ אמר אֱלֹהִים היה אֹור וְ היה אֹור וְ ראה אֱלֹהִים אֵת הַ אֹור כִּי טוב וְ בדל אֱלֹהִים בַּיִן הַ אֹור וְ בַּיִן הַ חֹשֶׁךְ וְ קרא אֱלֹהִים לְ הַ אֹור יֹום וְ לְ הַ חֹשֶׁךְ קרא לַיְלָה וְ היה עֶרֶב וְ היה בֹּקֶר יֹום אֶחָד lex-orig-full בְּ רֵאשִׁית בָּרָא אֱלֹה אֵת הַ שָּׁמַי וְ אֵת הָ אָרֶץ וְ הָ אָרֶץ הָי תֹהוּ וָ בֹהוּ וְ חֹשֶׁךְ עַל פְּן תְהֹום וְ רוּחַ אֱלֹה רַחֶף עַל פְּן הַ מָּי וַ אמֶר אֱלֹה הִי אֹור וַ הִי אֹור וַ רְא אֱלֹה אֶת הָ אֹור כִּי טֹוב וַ בְדֵּל אֱלֹה בֵּין הָ אֹור וּ בֵין הַ חֹשֶׁךְ וַ קְרָא אֱלֹה לָ אֹור יֹום וְ לַ חֹשֶׁךְ קָרָא לָיְלָה וַ הִי עֶרֶב וַ הִי בֹקֶר יֹום אֶחָד lex-orig-plain ב ראשׁית ברא אלהים את ה שׁמים ו את ה ארץ ו ה ארץ היה תהו ו בהו ו חשׁך על פנה תהום ו רוח אלהים רחף על פנה ה מים ו אמר אלהים היה אור ו היה אור ו ראה אלהים את ה אור כי טוב ו בדל אלהים בין ה אור ו בין ה חשׁך ו קרא אלהים ל ה אור יום ו ל ה חשׁך קרא לילה ו היה ערב ו היה בקר יום אחד lex-trans-full B.:- R;>CIJT B.@R@> >:ELOH >;T HA- C.@MAJ W:- >;T H@- >@REY W:- H@- >@REY H@J TOHW. W@- BOHW. W:- XOCEK: <AL P.:N T:HOWM W:- RW.XA >:ELOH RAXEP <AL P.:N HA- M.@J WA- >MER >:ELOH HIJ >OWR WA- HIJ >OWR WA- R:> >:ELOH >ET H@- >OWR K.IJ VOWB WA- B:D.;L >:ELOH B.;JN H@- >OWR W.- B;JN HA- XOCEK: WA- Q:R@> >:ELOH L@- - >OWR JOWM W:- LA- - XOCEK: Q@R@> L@J:L@H WA- HIJ <EREB WA- HIJ BOQER JOWM >EX@D lex-trans-plain B R>CJT/ BR>[ >LHJM/ >T H CMJM/ W >T H >RY/ W H >RY/ HJH[ THW/ W BHW/ W XCK/ <L PNH/ THWM/ W RWX/ >LHJM/ RXP[ <L PNH/ H MJM/ W >MR[ >LHJM/ HJH[ >WR/ W HJH[ >WR/ W R>H[ >LHJM/ >T H >WR/ KJ VWB[ W BDL[ >LHJM/ BJN/ H >WR/ W BJN/ H XCK/ W QR>[ >LHJM/ L H >WR/ JWM/ W L H XCK/ QR>[ LJLH/ W HJH[ <RB/ W HJH[ BQR=/ JWM/ >XD/ text-orig-full בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃ וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃ וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃ וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃ פ text-orig-full-ketiv בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃ וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃ וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃ וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃ פ text-orig-plain בראשׁית ברא אלהים את השׁמים ואת הארץ׃ והארץ היתה תהו ובהו וחשׁך על־פני תהום ורוח אלהים מרחפת על־פני המים׃ ויאמר אלהים יהי אור ויהי־אור׃ וירא אלהים את־האור כי־טוב ויבדל אלהים בין האור ובין החשׁך׃ ויקרא אלהים׀ לאור יום ולחשׁך קרא לילה ויהי־ערב ויהי־בקר יום אחד׃ פ text-phono-full bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . wᵊhāʔˈāreṣ hāyᵊṯˌā ṯˈōhû wāvˈōhû wᵊḥˌōšeḵ ʕal-pᵊnˈê ṯᵊhˈôm wᵊrˈûₐḥ ʔᵉlōhˈîm mᵊraḥˌefeṯ ʕal-pᵊnˌê hammˈāyim . wayyˌōmer ʔᵉlōhˌîm yᵊhˈî ʔˈôr wˈayᵊhî-ʔˈôr . wayyˈar ʔᵉlōhˈîm ʔeṯ-hāʔˌôr kî-ṭˈôv wayyavdˈēl ʔᵉlōhˈîm bˌên hāʔˌôr ûvˌên haḥˈōšeḵ . wayyiqrˌā ʔᵉlōhˈîm lāʔôr yˈôm wᵊlaḥˌōšeḵ qˈārā lˈāyᵊlā wˈayᵊhî-ʕˌerev wˈayᵊhî-vˌōqer yˌôm ʔeḥˈāḏ . 
f text-trans-full B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 W:-H@->@81REY H@J:T@71H TO33HW.03 W@-BO80HW. W:-XO73CEK: <AL&P.:N;74J T:HO92WM W:-R74W.XA >:ELOHI80JM M:RAXE73PET <AL&P.:N;71J HA-M.@75JIM00 WA-J.O71>MER >:ELOHI73JM J:HI74J >O92WR WA45-J:HIJ&>O75WR00 WA-J.A94R:> >:ELOHI91JM >ET&H@->O73WR K.IJ&VO92WB WA-J.AB:D.;74L >:ELOHI80JM B.;71JN H@->O73WR W.-B;71JN HA-XO75CEK:00 WA-J.IQ:R@63> >:ELOHI70JM05 L@-->OWR03 JO80WM W:-LA--XO73CEK: Q@74R@> L@92J:L@H WA45-J:HIJ&<E71REB WA45-J:HIJ&BO73QER JO71WM >EX@75D00_P text-trans-full-ketiv B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 W:-H@->@81REY H@J:T@71H TO33HW.03 W@-BO80HW. W:-XO73CEK: <AL&P.:N;74J T:HO92WM W:-R74W.XA >:ELOHI80JM M:RAXE73PET <AL&P.:N;71J HA-M.@75JIM00 WA-J.O71>MER >:ELOHI73JM J:HI74J >O92WR WA45-J:HIJ&>O75WR00 WA-J.A94R:> >:ELOHI91JM >ET&H@->O73WR K.IJ&VO92WB WA-J.AB:D.;74L >:ELOHI80JM B.;71JN H@->O73WR W.-B;71JN HA-XO75CEK:00 WA-J.IQ:R@63> >:ELOHI70JM05 L@-->OWR03 JO80WM W:-LA--XO73CEK: Q@74R@> L@92J:L@H WA45-J:HIJ&<E71REB WA45-J:HIJ&BO73QER JO71WM >EX@75D00_P text-trans-plain BR>CJT BR> >LHJM >T HCMJM W>T H>RY00 WH>RY HJTH THW WBHW WXCK <L&PNJ THWM WRWX >LHJM MRXPT <L&PNJ HMJM00 WJ>MR >LHJM JHJ >WR WJHJ&>WR00 WJR> >LHJM >T&H>WR KJ&VWB WJBDL >LHJM BJN H>WR WBJN HXCK00 WJQR> >LHJM05 L>WR JWM WLXCK QR> LJLH WJHJ&<RB WJHJ&BQR JWM >XD00_P
We write a few formats to file, in your Downloads folder.
for fmt in """
text-orig-full
text-phono-full
""".strip().split():
    with open(os.path.expanduser(f"~/Downloads/{fmt}.txt"), "w") as f:
        f.write("\n".join(text[fmt]))
T.languages
{'': {'language': 'default', 'languageEnglish': 'default'}, 'am': {'language': 'ኣማርኛ', 'languageEnglish': 'amharic'}, 'ar': {'language': 'العَرَبِية', 'languageEnglish': 'arabic'}, 'bn': {'language': 'বাংলা', 'languageEnglish': 'bengali'}, 'da': {'language': 'Dansk', 'languageEnglish': 'danish'}, 'de': {'language': 'Deutsch', 'languageEnglish': 'german'}, 'el': {'language': 'Ελληνικά', 'languageEnglish': 'greek'}, 'en': {'language': 'English', 'languageEnglish': 'english'}, 'es': {'language': 'Español', 'languageEnglish': 'spanish'}, 'fa': {'language': 'فارسی', 'languageEnglish': 'farsi'}, 'fr': {'language': 'Français', 'languageEnglish': 'french'}, 'he': {'language': 'עברית', 'languageEnglish': 'hebrew'}, 'hi': {'language': 'हिन्दी', 'languageEnglish': 'hindi'}, 'id': {'language': 'Bahasa Indonesia', 'languageEnglish': 'indonesian'}, 'ja': {'language': '日本語', 'languageEnglish': 'japanese'}, 'ko': {'language': '한국어', 'languageEnglish': 'korean'}, 'la': {'language': 'Latina', 'languageEnglish': 'latin'}, 'nl': {'language': 'Nederlands', 'languageEnglish': 'dutch'}, 'pa': {'language': 'ਪੰਜਾਬੀ', 'languageEnglish': 'punjabi'}, 'pt': {'language': 'Português', 'languageEnglish': 'portuguese'}, 'ru': {'language': 'Русский', 'languageEnglish': 'russian'}, 'sw': {'language': 'Kiswahili', 'languageEnglish': 'swahili'}, 'syc': {'language': 'ܠܫܢܐ ܣܘܪܝܝܐ', 'languageEnglish': 'syriac'}, 'tr': {'language': 'Türkçe', 'languageEnglish': 'turkish'}, 'ur': {'language': 'اُردُو', 'languageEnglish': 'urdu'}, 'yo': {'language': 'èdè Yorùbá', 'languageEnglish': 'yoruba'}, 'zh': {'language': '中文', 'languageEnglish': 'chinese'}}
Get the book names in Swahili.
nodeToSwahili = ""
for b in F.otype.s("book"):
    nodeToSwahili += "{} = {}\n".format(b, T.bookName(b, lang="sw"))
print(nodeToSwahili)
426591 = Mwanzo 426592 = Kutoka 426593 = Mambo_ya_Walawi 426594 = Hesabu 426595 = Kumbukumbu_la_Torati 426596 = Yoshua 426597 = Waamuzi 426598 = 1_Samweli 426599 = 2_Samweli 426600 = 1_Wafalme 426601 = 2_Wafalme 426602 = Isaya 426603 = Yeremia 426604 = Ezekieli 426605 = Hosea 426606 = Yoeli 426607 = Amosi 426608 = Obadia 426609 = Yona 426610 = Mika 426611 = Nahumu 426612 = Habakuki 426613 = Sefania 426614 = Hagai 426615 = Zekaria 426616 = Malaki 426617 = Zaburi 426618 = Ayubu 426619 = Mithali 426620 = Ruthi 426621 = Wimbo_Ulio_Bora 426622 = Mhubiri 426623 = Maombolezo 426624 = Esta 426625 = Danieli 426626 = Ezra 426627 = Nehemia 426628 = 1_Mambo_ya_Nyakati 426629 = 2_Mambo_ya_Nyakati
OK, there they are. We copy them into a string, and do the opposite: get the nodes back. We check whether we get exactly the same nodes as the ones we started with.
swahiliNames = """
Mwanzo
Kutoka
Mambo_ya_Walawi
Hesabu
Kumbukumbu_la_Torati
Yoshua
Waamuzi
1_Samweli
2_Samweli
1_Wafalme
2_Wafalme
Isaya
Yeremia
Ezekieli
Hosea
Yoeli
Amosi
Obadia
Yona
Mika
Nahumu
Habakuki
Sefania
Hagai
Zekaria
Malaki
Zaburi
Ayubu
Mithali
Ruthi
Wimbo_Ulio_Bora
Mhubiri
Maombolezo
Esta
Danieli
Ezra
Nehemia
1_Mambo_ya_Nyakati
2_Mambo_ya_Nyakati
""".strip().split()
swahiliToNode = ""
for nm in swahiliNames:
    swahiliToNode += "{} = {}\n".format(T.bookNode(nm, lang="sw"), nm)
if swahiliToNode != nodeToSwahili:
    print("Something is not right with the book names")
else:
    print("Going from nodes to booknames and back yields the original nodes")
Going from nodes to booknames and back yields the original nodes
A section in the Hebrew bible is a book, a chapter or a verse.
Knowledge of sections is not baked into Text-Fabric.
The config feature otext.tf may specify three section levels, and tell what the corresponding node types and features are.
From that knowledge it can construct mappings from nodes to sections, e.g. from verse nodes to tuples of the form
(bookName, chapterNumber, verseNumber)
You can get the section of a node as a tuple of relevant book, chapter, and verse nodes. Or you can get it as a passage label, a string.
You can ask for the passage corresponding to the first slot of a node, or the one corresponding to the last slot.
If you are dealing with book and chapter nodes, you can ask to fill out the verse and chapter parts as well.
Here are examples of getting the section that corresponds to a node and vice versa.
NB: sectionFromNode always delivers a verse specification, either from the first slot belonging to that node, or, if lastSlot, from the last slot belonging to that node.
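As a minimal round trip first (node 100000 is the arbitrary word slot we inspected before):
T.sectionFromNode(100000)  # a (book name, chapter number, verse number) tuple
T.nodeFromSection(T.sectionFromNode(100000))  # and back: the node of the enclosing verse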
for (desc, n) in chain(normalShow.items(), sectionShow.items()):
    for lang in "en la sw".split():
        d = f"{n:>7} {desc}" if lang == "en" else ""
        first = A.sectionStrFromNode(n, lang=lang)
        last = A.sectionStrFromNode(n, lang=lang, lastSlot=True, fillup=True)
        tup = (
            T.sectionTuple(n)
            if lang == "en"
            else T.sectionTuple(n, lastSlot=True, fillup=True)
            if lang == "la"
            else ""
        )
        print(f"{d:<20} {lang} - {first:<30} {last:<30} {tup}")
  15890 wordShow     en - Genesis 30:18                  Genesis 30:18                  (426591, 426659, 1415237)
                     la - Genesis 30:18                  Genesis 30:18                  (426591, 426659, 1415237)
                     sw - Mwanzo 30:18                   Mwanzo 30:18
 700000 phraseShow   en - Numbers 22:31                  Numbers 22:31                  (426594, 426768, 1418795)
                     la - Numeri 22:31                   Numeri 22:31                   (426594, 426768, 1418795)
                     sw - Hesabu 22:31                   Hesabu 22:31
 500000 clauseShow   en - Job 36:27                      Job 36:27                      (426618, 427382, 1432958)
                     la - Iob 36:27                      Iob 36:27                      (426618, 427382, 1432958)
                     sw - Ayubu 36:27                    Ayubu 36:27
1200000 sentenceShow en - 2_Kings 6:5                    2_Kings 6:5                    (426601, 426944, 1423986)
                     la - Reges_II 6:5                   Reges_II 6:5                   (426601, 426944, 1423986)
                     sw - 2_Wafalme 6:5                  2_Wafalme 6:5
1437667 lexShow      en - Genesis 1:16                   2_Chronicles 22:1              (426591, 426630, 1414404)
                     la - Genesis 1:16                   Chronica_II 22:1               (426629, 427544, 1437230)
                     sw - Mwanzo 1:16                    2_Mambo_ya_Nyakati 22:1
1420000 verseShow    en - Deuteronomy 27:25              Deuteronomy 27:25              (426595, 426809, 1420000)
                     la - Deuteronomium 27:25            Deuteronomium 27:25            (426595, 426809, 1420000)
                     sw - Kumbukumbu_la_Torati 27:25     Kumbukumbu_la_Torati 27:25
 427000 chapterShow  en - Isaiah 37                      Isaiah 37:38                   (426602, 427000)
                     la - Jesaia 37                      Jesaia 37:38                   (426602, 427000, 1425295)
                     sw - Isaya 37                       Isaya 37:38
 426598 bookShow     en - 1_Samuel                       1_Samuel 31:13                 (426598,)
                     la - Samuel_I                       Samuel_I 31:13                 (426598, 426892, 1422328)
                     sw - 1_Samweli                      1_Samweli 31:13
And here are examples to get back:
for (lang, section) in (
    ("en", "Ezekiel"),
    ("la", "Ezechiel"),
    ("sw", "Ezekieli"),
    ("en", "Isaiah 43"),
    ("la", "Jesaia 43"),
    ("sw", "Isaya 43"),
    ("en", "Deuteronomy 28:34"),
    ("la", "Deuteronomium 28:34"),
    ("sw", "Kumbukumbu_la_Torati 28:34"),
    ("en", "Job 37:3"),
    ("la", "Iob 37:3"),
    ("sw", "Ayubu 37:3"),
    ("en", "Numbers 22:33"),
    ("la", "Numeri 22:33"),
    ("sw", "Hesabu 22:33"),
    ("en", "Genesis 30:18"),
    ("la", "Genesis 30:18"),
    ("sw", "Mwanzo 30:18"),
    ("en", "Genesis 1:30"),
    ("la", "Genesis 1:30"),
    ("sw", "Mwanzo 1:30"),
    ("en", "Psalms 37:2"),
    ("la", "Psalmi 37:2"),
    ("sw", "Zaburi 37:2"),
):
    n = A.nodeFromSectionStr(section, lang=lang)
    nType = F.otype.v(n)
    print(f"{section:<30} {lang} {nType:<20} {n}")
Ezekiel                        en book    426604
Ezechiel                       la book    426604
Ezekieli                       sw book    426604
Isaiah 43                      en chapter 427006
Jesaia 43                      la chapter 427006
Isaya 43                       sw chapter 427006
Deuteronomy 28:34              en verse   1420035
Deuteronomium 28:34            la verse   1420035
Kumbukumbu_la_Torati 28:34     sw verse   1420035
Job 37:3                       en verse   1432967
Iob 37:3                       la verse   1432967
Ayubu 37:3                     sw verse   1432967
Numbers 22:33                  en verse   1418797
Numeri 22:33                   la verse   1418797
Hesabu 22:33                   sw verse   1418797
Genesis 30:18                  en verse   1415237
Genesis 30:18                  la verse   1415237
Mwanzo 30:18                   sw verse   1415237
Genesis 1:30                   en verse   1414418
Genesis 1:30                   la verse   1414418
Mwanzo 1:30                    sw verse   1414418
Psalms 37:2                    en verse   1430067
Psalmi 37:2                    la verse   1430067
Zaburi 37:2                    sw verse   1430067
If you go up from a sentence node, you expect to find a verse node. But some sentences span multiple verses, and in that case, you will not find the enclosing verse node, because it is not there.
Here is a piece of code to detect and list all cases where sentences span multiple verses.
The idea is to pick the first and the last word of a sentence, use T.sectionFromNode to discover the verse in which that word occurs, and if they are different: bingo!
We show the first 10 of ca. 900 cases.
By the way: doing this in the 2016 version of the data yields 915 results. The splitting up of the text into sentences is not carved in stone!
A.indent(reset=True)
A.info("Get sentences that span multiple verses")
spanSentences = []
for s in F.otype.s("sentence"):
    fs = T.sectionFromNode(s, lastSlot=False)
    ls = T.sectionFromNode(s, lastSlot=True)
    if fs != ls:
        spanSentences.append("{} {}:{}-{}".format(fs[0], fs[1], fs[2], ls[2]))
A.info("Found {} cases".format(len(spanSentences)))
A.info("\n{}".format("\n".join(spanSentences[0:10])))
0.00s Get sentences that span multiple verses
1.09s Found 887 cases
1.09s Genesis 1:17-18
Genesis 1:29-30
Genesis 2:4-7
Genesis 7:2-3
Genesis 7:8-9
Genesis 7:13-14
Genesis 9:9-10
Genesis 10:11-12
Genesis 10:13-14
Genesis 10:15-18
A different way, with better display, is:
A.indent(reset=True)
A.info("Get sentences that span multiple verses")
spanSentences = []
for s in F.otype.s("sentence"):
    words = L.d(s, otype="word")
    fw = words[0]
    lw = words[-1]
    fVerse = L.u(fw, otype="verse")[0]
    lVerse = L.u(lw, otype="verse")[0]
    if fVerse != lVerse:
        spanSentences.append((s, fVerse, lVerse))
A.info("Found {} cases".format(len(spanSentences)))
A.table(spanSentences, end=1)
0.00s Get sentences that span multiple verses
0.38s Found 887 cases
n | p | sentence | verse | verse |
---|---|---|---|---|
1 | Genesis 1:17 | וַיִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּרְקִ֣יעַ הַשָּׁמָ֑יִם לְהָאִ֖יר עַל־הָאָֽרֶץ׃ וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה וּֽלֲהַבְדִּ֔יל בֵּ֥ין הָאֹ֖ור וּבֵ֣ין הַחֹ֑שֶׁךְ |
Wait a second, the columns with the verses are empty. In tables, the content of a verse is not shown. And by default, the passage that is relevant to a row is computed from one of the columns.
But here, we definitely want the passage of columns 2 and 3, so:
A.table(spanSentences, end=10, withPassage={2, 3})
n | sentence | verse | verse |
---|---|---|---|
1 | וַיִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּרְקִ֣יעַ הַשָּׁמָ֑יִם לְהָאִ֖יר עַל־הָאָֽרֶץ׃ וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה וּֽלֲהַבְדִּ֔יל בֵּ֥ין הָאֹ֖ור וּבֵ֣ין הַחֹ֑שֶׁךְ | Genesis 1:17 | Genesis 1:18 |
2 | הִנֵּה֩ נָתַ֨תִּי לָכֶ֜ם אֶת־כָּל־עֵ֣שֶׂב׀ זֹרֵ֣עַ זֶ֗רַע אֲשֶׁר֙ עַל־פְּנֵ֣י כָל־הָאָ֔רֶץ וְאֶת־כָּל־הָעֵ֛ץ אֲשֶׁר־בֹּ֥ו פְרִי־עֵ֖ץ זֹרֵ֣עַ זָ֑רַע וּֽלְכָל־חַיַּ֣ת הָ֠אָרֶץ וּלְכָל־עֹ֨וף הַשָּׁמַ֜יִם וּלְכֹ֣ל׀ רֹומֵ֣שׂ עַל־הָאָ֗רֶץ אֲשֶׁר־בֹּו֙ נֶ֣פֶשׁ חַיָּ֔ה אֶת־כָּל־יֶ֥רֶק עֵ֖שֶׂב לְאָכְלָ֑ה | Genesis 1:29 | Genesis 1:30 |
3 | בְּיֹ֗ום עֲשֹׂ֛ות יְהוָ֥ה אֱלֹהִ֖ים אֶ֥רֶץ וְשָׁמָֽיִם׃ וַיִּיצֶר֩ יְהוָ֨ה אֱלֹהִ֜ים אֶת־הָֽאָדָ֗ם עָפָר֙ מִן־הָ֣אֲדָמָ֔ה | Genesis 2:4 | Genesis 2:7 |
4 | מִכֹּ֣ל׀ הַבְּהֵמָ֣ה הַטְּהֹורָ֗ה תִּֽקַּח־לְךָ֛ שִׁבְעָ֥ה שִׁבְעָ֖ה אִ֣ישׁ וְאִשְׁתֹּ֑ו וּמִן־הַבְּהֵמָ֡ה אֲ֠שֶׁר לֹ֣א טְהֹרָ֥ה הִ֛וא שְׁנַ֖יִם אִ֥ישׁ וְאִשְׁתֹּֽו׃ גַּ֣ם מֵעֹ֧וף הַשָּׁמַ֛יִם שִׁבְעָ֥ה שִׁבְעָ֖ה זָכָ֣ר וּנְקֵבָ֑ה לְחַיֹּ֥ות זֶ֖רַע עַל־פְּנֵ֥י כָל־הָאָֽרֶץ׃ | Genesis 7:2 | Genesis 7:3 |
5 | מִן־הַבְּהֵמָה֙ הַטְּהֹורָ֔ה וּמִן־הַ֨בְּהֵמָ֔ה אֲשֶׁ֥ר אֵינֶ֖נָּה טְהֹרָ֑ה וּמִ֨ן־הָעֹ֔וף וְכֹ֥ל אֲשֶׁר־רֹמֵ֖שׂ עַל־הָֽאֲדָמָֽה׃ שְׁנַ֨יִם שְׁנַ֜יִם בָּ֧אוּ אֶל־נֹ֛חַ אֶל־הַתֵּבָ֖ה זָכָ֣ר וּנְקֵבָ֑ה כַּֽאֲשֶׁ֛ר צִוָּ֥ה אֱלֹהִ֖ים אֶת־נֹֽחַ׃ | Genesis 7:8 | Genesis 7:9 |
6 | בְּעֶ֨צֶם הַיֹּ֤ום הַזֶּה֙ בָּ֣א נֹ֔חַ וְשֵׁם־וְחָ֥ם וָיֶ֖פֶת בְּנֵי־נֹ֑חַ וְאֵ֣שֶׁת נֹ֗חַ וּשְׁלֹ֧שֶׁת נְשֵֽׁי־בָנָ֛יו אִתָּ֖ם אֶל־הַתֵּבָֽה׃ הֵ֜מָּה וְכָל־הַֽחַיָּ֣ה לְמִינָ֗הּ וְכָל־הַבְּהֵמָה֙ לְמִינָ֔הּ וְכָל־הָרֶ֛מֶשׂ הָרֹמֵ֥שׂ עַל־הָאָ֖רֶץ לְמִינֵ֑הוּ וְכָל־הָעֹ֣וף לְמִינֵ֔הוּ כֹּ֖ל צִפֹּ֥ור כָּל־כָּנָֽף׃ | Genesis 7:13 | Genesis 7:14 |
7 | וַאֲנִ֕י הִנְנִ֥י מֵקִ֛ים אֶת־בְּרִיתִ֖י אִתְּכֶ֑ם וְאֶֽת־זַרְעֲכֶ֖ם אַֽחֲרֵיכֶֽם׃ וְאֵ֨ת כָּל־נֶ֤פֶשׁ הַֽחַיָּה֙ אֲשֶׁ֣ר אִתְּכֶ֔ם בָּעֹ֧וף בַּבְּהֵמָ֛ה וּֽבְכָל־חַיַּ֥ת הָאָ֖רֶץ אִתְּכֶ֑ם מִכֹּל֙ יֹצְאֵ֣י הַתֵּבָ֔ה לְכֹ֖ל חַיַּ֥ת הָאָֽרֶץ׃ | Genesis 9:9 | Genesis 9:10 |
8 | וַיִּ֨בֶן֙ אֶת־נִ֣ינְוֵ֔ה וְאֶת־רְחֹבֹ֥ת עִ֖יר וְאֶת־כָּֽלַח׃ וְֽאֶת־רֶ֔סֶן בֵּ֥ין נִֽינְוֵ֖ה וּבֵ֣ין כָּ֑לַח | Genesis 10:11 | Genesis 10:12 |
9 | וּמִצְרַ֡יִם יָלַ֞ד אֶת־לוּדִ֧ים וְאֶת־עֲנָמִ֛ים וְאֶת־לְהָבִ֖ים וְאֶת־נַפְתֻּחִֽים׃ וְֽאֶת־פַּתְרֻסִ֞ים וְאֶת־כַּסְלֻחִ֗ים אֲשֶׁ֨ר יָצְא֥וּ מִשָּׁ֛ם פְּלִשְׁתִּ֖ים וְאֶת־כַּפְתֹּרִֽים׃ ס | Genesis 10:13 | Genesis 10:14 |
10 | וּכְנַ֗עַן יָלַ֛ד אֶת־צִידֹ֥ן בְּכֹרֹ֖ו וְאֶת־חֵֽת׃ וְאֶת־הַיְבוּסִי֙ וְאֶת־הָ֣אֱמֹרִ֔י וְאֵ֖ת הַגִּרְגָּשִֽׁי׃ וְאֶת־הַֽחִוִּ֥י וְאֶת־הַֽעַרְקִ֖י וְאֶת־הַסִּינִֽי׃ וְאֶת־הָֽאַרְוָדִ֥י וְאֶת־הַצְּמָרִ֖י וְאֶת־הַֽחֲמָתִ֑י | Genesis 10:15 | Genesis 10:18 |
We can zoom in:
A.show(spanSentences, condensed=False, start=6, end=6, baseTypes={"sentence_atom"})
result 6
Let us explore where Ketiv/Qere pairs are and how they render.
qeres = [w for w in F.otype.s("word") if F.qere.v(w) is not None]
print("{} qeres".format(len(qeres)))
for w in qeres[0:10]:
    print(
        '{}: ketiv = "{}"+"{}" qere = "{}"+"{}"'.format(
            w,
            F.g_word.v(w),
            F.trailer.v(w),
            F.qere.v(w),
            F.qere_trailer.v(w),
        )
    )
1892 qeres
 3897: ketiv = "*HWY>"+" " qere = "HAJ:Y;74>"+" "
 4420: ketiv = "*>HLH"+"00 " qere = ">@H:@LO75W"+"00"
 5645: ketiv = "*>HLH"+" " qere = ">@H:@LO92W"+" "
 5912: ketiv = "*>HLH"+" " qere = ">@95H:@LOW03"+" "
 6246: ketiv = "*YBJJM"+" " qere = "Y:BOWJI80m"+" "
 6354: ketiv = "*YBJJM"+" " qere = "Y:BOWJI80m"+" "
11762: ketiv = "*W-"+"" qere = "WA"+""
11763: ketiv = "*JJFM"+" " qere = "J.W.FA70m"+" "
12784: ketiv = "*GJJM"+" " qere = "GOWJIm03"+" "
13685: ketiv = "*YJDH"+"00 " qere = "Y@75JID"+"00"
Let us print all text representations of the verse in which the second qere occurs.
refWord = qeres[1]
print(f"Reference word is {refWord}")
vn = L.u(refWord, otype="verse")[0]
print("{} {}:{}".format(*T.sectionFromNode(refWord)))
for fmt in sorted(T.formats):
    if fmt.startswith("text-"):
        print("{:<25} {}".format(fmt, T.text(vn, fmt=fmt, descend=True)))
Reference word is 4420
Genesis 9:21
text-orig-full            וַיֵּ֥שְׁתְּ מִן־הַיַּ֖יִן וַיִּשְׁכָּ֑ר וַיִּתְגַּ֖ל בְּתֹ֥וךְ אָהֳלֹֽו׃
text-orig-full-ketiv      וַיֵּ֥שְׁתְּ מִן־הַיַּ֖יִן וַיִּשְׁכָּ֑ר וַיִּתְגַּ֖ל בְּתֹ֥וךְ אהלה׃
text-orig-plain           וישׁת מן־היין וישׁכר ויתגל בתוך אהלה׃
text-phono-full           wayyˌēšt min-hayyˌayin wayyiškˈār wayyiṯgˌal bᵊṯˌôḵ *ʔohᵒlˈô .
text-trans-full           WA-J.;71C:T.: MIN&HA-J.A73JIN WA-J.IC:K.@92R WA-J.IT:G.A73L B.:-TO71WK: >@H:@LO75W00
text-trans-full-ketiv     WA-J.;71C:T.: MIN&HA-J.A73JIN WA-J.IC:K.@92R WA-J.IT:G.A73L B.:-TO71WK: *>HLH00
text-trans-plain          WJCT MN&HJJN WJCKR WJTGL BTWK >HLH00
We have not talked about edges much. If the nodes correspond to the rows in the big spreadsheet, the edges point from one row to another.
One edge we have encountered: the special feature oslots. Each non-slot node is linked by oslots to all of its slot nodes.
An edge is really a feature as well. Whereas a node feature is a column of information, one cell per node, an edge feature is also a column of information, one cell per pair of nodes.
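To make that concrete, you can inspect the oslots edge directly; E.oslots.s(n) is the same call we used above to show first and last slots (a sketch):
firstPhrase = F.otype.s("phrase")[0]
E.oslots.s(firstPhrase)  # the tuple of word (slot) nodes this phrase is linked to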
Linguists use more relationships between textual objects, for example: linguistic dependency. In the BHSA all cases of linguistic dependency are coded in the edge feature mother.
Let us do a few basic enquiries on an edge feature: mother.
We count how many mothers nodes can have (it turns out to be 0 or 1).
We walk through all nodes and per node we retrieve the mother nodes, and we store the lengths (if non-zero) in a dictionary (motherLen).
We see that nodes have at most one mother.
We also count the inverse relationship: daughters.
A.indent(reset=True)
A.info("Counting mothers")
motherLen = {}
daughterLen = {}
for c in N.walk():
    lms = E.mother.f(c) or []
    lds = E.mother.t(c) or []
    nms = len(lms)
    nds = len(lds)
    if nms:
        motherLen[c] = nms
    if nds:
        daughterLen[c] = nds
A.info("{} nodes have mothers".format(len(motherLen)))
A.info("{} nodes have daughters".format(len(daughterLen)))
motherCount = collections.Counter()
daughterCount = collections.Counter()
for (n, lm) in motherLen.items():
    motherCount[lm] += 1
for (n, ld) in daughterLen.items():
    daughterCount[ld] += 1
print("mothers", motherCount)
print("daughters", daughterCount)
0.00s Counting mothers
0.73s 182269 nodes have mothers
0.73s 144112 nodes have daughters
mothers Counter({1: 182269})
daughters Counter({1: 117986, 2: 17370, 3: 6284, 4: 1851, 5: 470, 6: 125, 7: 21, 8: 5})
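If you want to look at an individual case: E.mother.f(n) gives the node(s) that n points to, and E.mother.t(n) gives the node(s) that point to n (a sketch; the clause picked here is arbitrary):
c = F.otype.s("clause")[100]
E.mother.f(c)  # the mother of this clause, if any
E.mother.t(c)  # its daughters, if any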
Text-Fabric pre-computes data for you, so that it can be loaded faster. If the original data is updated, Text-Fabric detects it, and will recompute that data.
But there are cases, e.g. when the algorithms of Text-Fabric have changed while the data has not, in which you might want to clear the cache of precomputed results.
There are two ways to do that:
go to the .tf directory of your dataset and remove all .tfx files in it; this might be a bit awkward to do, because the .tf directory is hidden on Unix-like systems;
or run TF.clearCache(), which does exactly the same.
, which does exactly the same.It is not handy to execute the following cell all the time, that's why I have commented it out. So if you really want to clear the cache, remove the comment sign below.
# TF.clearCache()
By now you have an impression how to compute around in the Hebrew Bible. While this is still the beginning, I hope you already sense the power of unlimited programmatic access to all the bits and bytes in the data set.
Here are a few directions for unleashing that power.
CC-BY Dirk Roorda