This notebook gets you started with using Text-Fabric for coding in the Hebrew Bible.
Familiarity with the underlying data model is recommended.
Short introductions to other TF datasets are available as well.
If you start computing with this tutorial, first copy its parent directory to somewhere else, outside your repository. If you pull changes from the repository later, your work will not be overwritten. Where you put your tutorial directory is up to you. It will work from any directory.
Text-Fabric will fetch a standard set of features for you from the newest GitHub release binaries. It will fetch version 2021.
The data will be stored in the text-fabric-data directory in your home directory.
The simplest way to get going is by this incantation:
from tf.app import use
For the very latest version, use hot.
For the latest release, use latest.
If you have cloned the repos (TF app and data), use clone.
If you do not want/need to upgrade, leave out the checkout specifiers.
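For example, a pinned load might look like this (a sketch: the specifier after the colon applies to the TF app, the checkout argument to the data; drop both to stay on your current copy):
# hypothetical example of checkout specifiers
A = use("ETCBC/bhsa:latest", checkout="latest", hoist=globals())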
A = use("ETCBC/bhsa", hoist=globals())
Locating corpus resources ...
Name | # of nodes | # slots/node | % coverage |
---|---|---|---|
book | 39 | 10938.21 | 100 |
chapter | 929 | 459.19 | 100 |
lex | 9230 | 46.22 | 100 |
verse | 23213 | 18.38 | 100 |
half_verse | 45179 | 9.44 | 100 |
sentence | 63717 | 6.70 | 100 |
sentence_atom | 64514 | 6.61 | 100 |
clause | 88131 | 4.84 | 100 |
clause_atom | 90704 | 4.70 | 100 |
phrase | 253203 | 1.68 | 100 |
phrase_atom | 267532 | 1.59 | 100 |
subphrase | 113850 | 1.42 | 38 |
word | 426590 | 1.00 | 100 |
The data of the BHSA is organized in features. They are columns of data. Think of the Hebrew Bible as a gigantic spreadsheet, where row 1 corresponds to the first word, row 2 to the second word, and so on, for all 426,590 words.
The information about which part of speech each word is constitutes a column in that spreadsheet. The BHSA contains over 100 columns, not only for the 426,590 words, but also for a million more textual objects.
Instead of putting all that information in one big table, the data is organized in separate columns. We call those columns features.
You can see which features have been loaded, and if you click on a feature name, you find its documentation. If you hover over a name, you see where the feature is located on your system.
Edge features are marked by bold italic formatting.
There are ways to tweak the set of features that is loaded. You can load more or fewer.
See share for examples.
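For instance, extra data modules can be requested at load time via the mod parameter (a sketch; etcbc/valence/tf is one published add-on module, shown purely as an illustration):
# sketch: load the BHSA together with an extra feature module
A = use("ETCBC/bhsa", mod="etcbc/valence/tf", hoist=globals())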
Note that we have phono features. The BHSA data has a special 1-1 transcription from Hebrew to ASCII, but not a phonetic transcription.
I have made a notebook that tries hard to find phonological representations for all the words. The result is a module in text-fabric format. We'll encounter that later.
This module, and the module etcbc/parallels are standard modules of the BHSA app.
The result of the incantation is that we have a bunch of special variables at our disposal that give us access to the text and data of the Hebrew Bible.
At this point it is helpful to throw a quick glance at the text-fabric API documentation (see the links under API Members above).
The most essential thing for now is that we can use F to access the data in the features we've loaded. But there is more, such as N, which helps us to walk over the text, as we will see in a minute.
The API members above show you exactly which new names have been inserted in your namespace. If you click on these names, you go to the API documentation for them.
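If you would rather not have names injected into your namespace, you can omit hoist and reach the same objects under A.api (a sketch):
F = A.api.F  # feature access, the same object as the hoisted F
T = A.api.T  # text API
L = A.api.L  # locality API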
Text-Fabric contains a flexible search engine, that does not only work for the BHSA data, but also for data that you add to it.
Search is the quickest way to come up-to-speed with your data, without too much programming.
Jump to the dedicated search tutorial first, to whet your appetite. And if you already know MQL queries, you can build from that in searchFromMQL.
The real power of search lies in the fact that it is integrated in a programming environment. You can use programming to compose your queries dynamically and to process their results further.
Therefore, the rest of this tutorial is still important when you want to tap that power. If you continue here, you learn all the basics of data-navigation with Text-Fabric.
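To give a first taste here, a minimal search template and how to run it (a sketch: Pred and verb are real values of the BHSA features function and sp, but the query itself is only an example):
query = """
clause
  phrase function=Pred
    word sp=verb
"""
results = A.search(query)  # list of (clause, phrase, word) node tuples
A.table(results, end=3)    # show the first 3 results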
Before we start coding, we load some modules that we need underway:
%load_ext autoreload
%autoreload 2
import os
import collections
from itertools import chain
In order to get acquainted with the data, we start with the simple task of counting.
We use the N.walk() generator to walk through the nodes.
We compared the BHSA data to a gigantic spreadsheet, where the rows correspond to the words.
In Text-Fabric, we call the rows slots, because they are the textual positions that can be filled with words.
We also mentioned that there are 1,000,000 more textual objects. They are the phrases, clauses, sentences, verses, chapters and books. They also correspond to rows in the big spreadsheet.
In Text-Fabric we call all these rows nodes, and the N.walk() generator carries us through those nodes in the textual order.
Just one extra thing: the info statements generate timed messages. If you use them instead of print, you'll get a sense of the amount of time that the various processing steps typically need.
A.indent(reset=True)
A.info("Counting nodes ...")
i = 0
for n in N.walk():
    i += 1
A.info("{} nodes".format(i))
0.00s Counting nodes ...
0.09s 1446831 nodes
Here you see it: 1.4 M nodes!
Every node has a type, like word, phrase, or sentence. We know that we have 426,590 words and about a million other nodes. But what exactly are they?
Text-Fabric has two special features, otype and oslots, that must occur in every Text-Fabric data set. otype tells you the type of each node, and you can ask for the number of slots in the text.
Here we go!
F.otype.slotType
'word'
F.otype.maxSlot
426590
F.otype.maxNode
1446831
F.otype.all
('book', 'chapter', 'lex', 'verse', 'half_verse', 'sentence', 'sentence_atom', 'clause', 'clause_atom', 'phrase', 'phrase_atom', 'subphrase', 'word')
C.levels.data
(('book', 10938.205128205129, 426591, 426629), ('chapter', 459.1926803013994, 426630, 427558), ('lex', 46.21776814734561, 1437602, 1446831), ('verse', 18.377202429673027, 1414389, 1437601), ('half_verse', 9.442218729940901, 606394, 651572), ('sentence', 6.6950735282577645, 1172308, 1236024), ('sentence_atom', 6.612363207985863, 1236025, 1300538), ('clause', 4.840408028956894, 427559, 515689), ('clause_atom', 4.7031001940377495, 515690, 606393), ('phrase', 1.684774666966821, 651573, 904775), ('phrase_atom', 1.5945382234648566, 904776, 1172307), ('subphrase', 1.4213614404918753, 1300539, 1414388), ('word', 1, 1, 426590))
This is interesting: above you see all the textual objects, with the average size of their objects, the node where they start, and the node where they end.
This is an intuitive way to count the number of nodes in each type.
Note in passing how we use indent in conjunction with info to produce neat, timed and indented progress messages.
A.indent(reset=True)
A.info("counting objects ...")
for otype in F.otype.all:
    i = 0
    A.indent(level=1, reset=True)
    for n in F.otype.s(otype):
        i += 1
    A.info("{:>7} {}s".format(i, otype))
A.indent(level=0)
A.info("Done")
0.00s counting objects ...
 | 0.00s 39 books
 | 0.00s 929 chapters
 | 0.00s 9230 lexs
 | 0.00s 23213 verses
 | 0.00s 45179 half_verses
 | 0.00s 63717 sentences
 | 0.00s 64514 sentence_atoms
 | 0.01s 88131 clauses
 | 0.00s 90704 clause_atoms
 | 0.01s 253203 phrases
 | 0.01s 267532 phrase_atoms
 | 0.01s 113850 subphrases
 | 0.02s 426590 words
0.08s Done
We use the A API (the extra power) to peek into the corpus.
First some words. Node 15890 is a word with a dotless shin.
Node 1002 is a word with a yod after a hataf seghol.
Node 100,000 is just a word slot.
Let's inspect them and see where they are.
First the plain view:
F.otype.v(1)
'word'
wordShows = (15890, 1002, 100000)
for word in wordShows:
    A.plain(word, withPassage=True)
You can leave out the passage reference:
for word in wordShows:
    A.plain(word, withPassage=False)
Now we show other objects, both with and without passage reference.
normalShow = dict(
    wordShow=wordShows[0],
    phraseShow=700000,
    clauseShow=500000,
    sentenceShow=1200000,
    lexShow=1437667,
)
sectionShow = dict(
    verseShow=1420000,
    chapterShow=427000,
    bookShow=426598,
)
for (name, n) in normalShow.items():
    A.dm(f"**{name}** = node `{n}`\n")
    A.plain(n)
    A.plain(n, withPassage=False)
    A.dm("\n---\n")
wordShow = node 15890
phraseShow = node 700000
clauseShow = node 500000
sentenceShow = node 1200000
lexShow = node 1437667
Note that for section nodes (except verse and half-verse) the withPassage option has little effect. The passage is the thing that is hyperlinked; the node is represented as a textual reference to the piece of text in question.
for (name, n) in sectionShow.items():
    if name == "verseShow":
        continue
    A.dm(f"**{name}** = node `{n}`\n")
    A.plain(n)
    A.plain(n, withPassage=False)
    A.dm("\n---\n")
We can also dive into the structure of the textual objects, provided they are not too large.
The function pretty gives a display of the object that a node stands for, together with the structure below that node.
for (name, n) in normalShow.items():
    A.dm(f"**{name}** = node `{n}`\n")
    A.pretty(n)
    A.dm("\n---\n")
wordShow = node 15890
phraseShow = node 700000
clauseShow = node 500000
sentenceShow = node 1200000
lexShow = node 1437667
Note: if you need a link to SHEBANQ for just any node:
million = 1000000
A.webLink(million)
We can show some standard features in the display:
for (name, n) in normalShow.items():
    A.dm(f"**{name}** = node `{n}`\n")
    A.pretty(n, standardFeatures=True)
    A.dm("\n---\n")
wordShow = node 15890
phraseShow = node 700000
clauseShow = node 500000
sentenceShow = node 1200000
lexShow = node 1437667
For more display options, see display.
F gives access to all features. Every feature has a method freqList() to generate a frequency list of its values, higher frequencies first. Here are the parts of speech:
F.sp.freqList()
(('subs', 125583), ('verb', 75451), ('prep', 73298), ('conj', 62737), ('nmpr', 35607), ('art', 30387), ('adjv', 10141), ('nega', 6059), ('prps', 5035), ('advb', 4603), ('prde', 2678), ('intj', 1912), ('inrg', 1303), ('prin', 1026))
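freqList() can also be restricted to a node type (a sketch; nodeTypes is a parameter of the generic TF feature API). Counting lex values over word nodes, for instance, gives the most frequent lexemes by occurrence:
F.lex.freqList(nodeTypes={"word"})[0:10]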
verbs = collections.Counter()
A.indent(reset=True)
A.info("Collecting data")
for w in F.otype.s("word"):
    if F.sp.v(w) != "verb":
        continue
    verbs[F.lex.v(w)] += 1
A.info("Done")
print(
    "".join(
        "{}: {}\n".format(verb, cnt)
        for (verb, cnt) in sorted(verbs.items(), key=lambda x: (-x[1], x[0]))[0:10]
    )
)
0.00s Collecting data
0.09s Done
>MR[: 5378
HJH[: 3561
<FH[: 2629
BW>[: 2570
NTN[: 2017
HLK[: 1554
R>H[: 1298
CM<[: 1168
DBR[: 1138
JCB[: 1082
An alternative way to do this is to use the feature freq_lex, defined for lex nodes. Now we walk the lexemes instead of the occurrences.
Note that the feature sp (part-of-speech) is defined for nodes of type word as well as lex. Both also have the lex feature.
verbs = collections.Counter()
A.indent(reset=True)
A.info("Collecting data")
for w in F.otype.s("lex"):
    if F.sp.v(w) != "verb":
        continue
    verbs[F.lex.v(w)] += F.freq_lex.v(w)
A.info("Done")
print(
    "".join(
        "{}: {}\n".format(verb, cnt)
        for (verb, cnt) in sorted(verbs.items(), key=lambda x: (-x[1], x[0]))[0:10]
    )
)
0.00s Collecting data
0.00s Done
>MR[: 5378
HJH[: 3561
<FH[: 2629
BW>[: 2570
NTN[: 2017
HLK[: 1554
R>H[: 1298
CM<[: 1168
DBR[: 1138
JCB[: 1082
This is an order of magnitude faster. In this case, that means the difference between a third of a second and a hundredth of a second, not a big gain in absolute terms. But suppose you need to run this 1000 times in a loop. Then it is the difference between 5 minutes and 10 seconds. A five minute wait is not pleasant in interactive computing!
We make a mapping between lexeme forms and the number of occurrences of those lexemes.
lexeme_dict = {F.lex_utf8.v(n): F.freq_lex.v(n) for n in F.otype.s("word")}
list(lexeme_dict.items())[0:10]
[('ב', 15542), ('ראשׁית', 51), ('ברא', 48), ('אלהים', 2601), ('את', 10987), ('ה', 30386), ('שׁמים', 421), ('ו', 50272), ('ארץ', 2504), ('היה', 3561)]
As a primer of real world work on lexeme distribution, have a look at James Cuénod's notebook on Collocation MI Analysis of the Hebrew Bible.
It is a nice example of how you collect data with TF API calls, then do research with your own methods and tools, and then use TF for presenting results.
In case the name has changed, look for it in its enclosing repo.
A.indent(reset=True)
hapax = []
zero = set()
for lx in F.otype.s("lex"):
    occs = L.d(lx, otype="word")
    n = len(occs)
    if n == 0:  # that's weird: should not happen
        zero.add(lx)
    elif n == 1:  # hapax found!
        hapax.append(lx)
A.info("{} hapaxes found".format(len(hapax)))
if zero:
    A.error("{} zeroes found".format(len(zero)), tm=False)
else:
    A.info("No zeroes found", tm=False)
for h in hapax[0:10]:
    print("\t{:<8} {}".format(F.lex.v(h), F.gloss.v(h)))
0.04s 3071 hapaxes found
No zeroes found
    PJCWN/   Pishon
    CWP[     bruise
    HRWN/    pregnancy
    Z<H/     sweat
    LHV/     flame
    NWD/     Nod
    XNWK=/   Enoch
    MXWJ>L/  Mehujael
    MXJJ>L/  Mehujael
    JBL=/    Jabal
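Note that freq_lex offers a shortcut here: a lexeme is a hapax precisely when its frequency is 1. A one-liner sketch that should agree with the count above:
hapax2 = [lx for lx in F.otype.s("lex") if F.freq_lex.v(lx) == 1]
len(hapax2)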
The occurrence base of a lexeme is the set of verses, chapters and books in which it occurs. Let's look for lexemes that occur in a single chapter.
If a lexeme occurs in a single chapter, its slots are a subset of the slots of that chapter. So, if you go up from the lexeme, you encounter the chapter.
Normally, lexemes occur in many chapters, and then no single chapter includes all of their occurrences, so if you go up from such lexemes, you do not find chapters.
Let's check it out.
Oh yes, we have already found the hapaxes, we will skip them here.
A.indent(reset=True)
A.info("Finding single chapter lexemes")
singleCh = []
multipleCh = []
for lx in F.otype.s("lex"):
    chapters = L.u(lx, "chapter")
    if len(chapters) == 1:
        if lx not in hapax:
            singleCh.append(lx)
    elif len(chapters) > 1:  # should not happen
        multipleCh.append(lx)
A.info("{} single chapter lexemes found".format(len(singleCh)))
if multipleCh:
    A.error(
        "{} chapter embedders of multiple lexemes found".format(len(multipleCh)),
        tm=False,
    )
else:
    A.info("No chapter embedders of multiple lexemes found", tm=False)
for s in singleCh[0:10]:
    print(
        "{:<20} {:<6}".format(
            "{} {}:{}".format(*T.sectionFromNode(s)),
            F.lex.v(s),
        )
    )
0.00s Finding single chapter lexemes
0.05s 450 single chapter lexemes found
No chapter embedders of multiple lexemes found
Genesis 4:1          QJN=/
Genesis 4:2          HBL=/
Genesis 4:18         <JRD/
Genesis 4:18         MTWC>L/
Genesis 4:19         YLH/
Genesis 4:22         TWBL_QJN/
Genesis 10:11        KLX=/
Genesis 14:1         >MRPL/
Genesis 14:1         >RJWK/
Genesis 14:1         >LSR/
As a final exercise with lexemes, let's make a list of all books, and show their total number of lexemes and the number of lexemes that occur exclusively in that book.
A.indent(reset=True)
A.info("Making book-lexeme index")
allBook = collections.defaultdict(set)
allLex = set()
for b in F.otype.s("book"):
    for w in L.d(b, "word"):
        lx = L.u(w, "lex")[0]
        allBook[b].add(lx)
        allLex.add(lx)
A.info("Found {} lexemes".format(len(allLex)))
0.00s Making book-lexeme index
1.08s Found 9230 lexemes
A.indent(reset=True)
A.info("Finding single book lexemes")
singleBook = collections.defaultdict(lambda: 0)
for lx in F.otype.s("lex"):
    book = L.u(lx, "book")
    if len(book) == 1:
        singleBook[book[0]] += 1
A.info("found {} single book lexemes".format(sum(singleBook.values())))
0.00s Finding single book lexemes
0.01s found 4224 single book lexemes
print(
    "{:<20}{:>5}{:>5}{:>5}\n{}".format(
        "book",
        "#all",
        "#own",
        "%own",
        "-" * 35,
    )
)
booklist = []
for b in F.otype.s("book"):
    book = T.bookName(b)
    a = len(allBook[b])
    o = singleBook.get(b, 0)
    p = 100 * o / a
    booklist.append((book, a, o, p))
for x in sorted(booklist, key=lambda e: (-e[3], -e[1], e[0])):
    print("{:<20} {:>4} {:>4} {:>4.1f}%".format(*x))
book                 #all #own %own
-----------------------------------
Daniel               1122  428 38.1%
1_Chronicles         2013  487 24.2%
Ezra                  991  199 20.1%
Joshua               1175  206 17.5%
Esther                472   67 14.2%
Isaiah               2555  350 13.7%
Numbers              1457  197 13.5%
Ezekiel              1719  212 12.3%
Song_of_songs         503   60 11.9%
Job                  1717  202 11.8%
Genesis              1816  208 11.5%
Nehemiah             1076  110 10.2%
Psalms               2250  216  9.6%
Leviticus             960   88  9.2%
Judges               1210   99  8.2%
Ecclesiastes          575   46  8.0%
Proverbs             1356  103  7.6%
Jeremiah             1949  147  7.5%
2_Samuel             1304   89  6.8%
1_Samuel             1256   85  6.8%
2_Kings              1266   85  6.7%
Exodus               1425   92  6.5%
1_Kings              1291   81  6.3%
Deuteronomy          1449   80  5.5%
Lamentations          592   31  5.2%
2_Chronicles         1411   67  4.7%
Nahum                 357   16  4.5%
Hosea                 742   33  4.4%
Ruth                  319   14  4.4%
Habakkuk              393   17  4.3%
Amos                  652   27  4.1%
Joel                  398   14  3.5%
Zechariah             726   25  3.4%
Obadiah               167    5  3.0%
Micah                 586   16  2.7%
Zephaniah             367   10  2.7%
Jonah                 252    5  2.0%
Haggai                208    3  1.4%
Malachi               314    4  1.3%
The book names may sound a bit unfamiliar, they are in Latin here. Later we'll see that you can also get them in English, or in Swahili.
We travel upwards and downwards, forwards and backwards through the nodes.
The Locality API (L) provides functions: u() for going up, d() for going down, n() for going to next nodes, and p() for going to previous nodes.
These directions are indirect notions: nodes are just numbers, but by means of the oslots feature they are linked to slots. One node contains another node if the one is linked to a set of slots that contains the set of slots the other is linked to. And one node is next or previous to another if its slots follow or precede the slots of the other one.
L.u(node) Up: to nodes that embed node.
L.d(node) Down: the opposite direction, to those that are contained in node.
L.n(node) Next: the next adjacent nodes, i.e. nodes whose first slot comes immediately after the last slot of node.
L.p(node) Previous: the previous adjacent nodes, i.e. nodes whose last slot comes immediately before the first slot of node.
All these functions yield nodes of all possible otypes. By passing an optional parameter, you can restrict the results to nodes of that type.
The results are ordered according to the order of things in the text.
The functions always return a tuple, even if there is just one node in the result.
We go from the first word to the book that contains it.
Note the [0] at the end. You expect one book, yet L returns a tuple. To get the only element of that tuple, you need that [0].
If you are like me, you keep forgetting it, and that will lead to weird error messages later on.
firstBook = L.u(1, otype="book")[0]
print(firstBook)
426591
And let's see all the containing objects of word 3:
w = 3
for otype in F.otype.all:
    if otype == F.otype.slotType:
        continue
    up = L.u(w, otype=otype)
    upNode = "x" if len(up) == 0 else up[0]
    print("word {} is contained in {} {}".format(w, otype, upNode))
word 3 is contained in book 426591
word 3 is contained in chapter 426630
word 3 is contained in lex 1437604
word 3 is contained in verse 1414389
word 3 is contained in half_verse 606394
word 3 is contained in sentence 1172308
word 3 is contained in sentence_atom 1236025
word 3 is contained in clause 427559
word 3 is contained in clause_atom 515690
word 3 is contained in phrase 651574
word 3 is contained in phrase_atom 904777
word 3 is contained in subphrase x
Let's go to the next nodes of the first book.
afterFirstBook = L.n(firstBook)
for n in afterFirstBook:
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )
secondBook = L.n(firstBook, otype="book")[0]
  28765: word          first slot=28765 , last slot=28765
 923533: phrase_atom   first slot=28765 , last slot=28765
 669555: phrase        first slot=28765 , last slot=28765
 521826: clause_atom   first slot=28765 , last slot=28769
 433546: clause        first slot=28765 , last slot=28769
 609394: half_verse    first slot=28765 , last slot=28772
1240671: sentence_atom first slot=28765 , last slot=28774
1176925: sentence      first slot=28765 , last slot=28793
1415922: verse         first slot=28765 , last slot=28778
 426680: chapter       first slot=28765 , last slot=29113
 426592: book          first slot=28765 , last slot=52512
And let's see what is right before the second book.
for n in L.p(secondBook):
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )
 426591: book          first slot=1     , last slot=28764
 426679: chapter       first slot=28260 , last slot=28764
1415921: verse         first slot=28747 , last slot=28764
 609393: half_verse    first slot=28755 , last slot=28764
1176924: sentence      first slot=28758 , last slot=28764
1240670: sentence_atom first slot=28758 , last slot=28764
 433545: clause        first slot=28758 , last slot=28764
 521825: clause_atom   first slot=28758 , last slot=28764
 669554: phrase        first slot=28763 , last slot=28764
 923532: phrase_atom   first slot=28763 , last slot=28764
  28764: word          first slot=28764 , last slot=28764
We go to the chapters of the second book, and just count them.
chapters = L.d(secondBook, otype="chapter")
print(len(chapters))
40
We pick the first verse and the first word, and explore what is above and below them.
for n in [1, L.u(1, otype="verse")[0]]:
    A.indent(level=0)
    A.info("Node {}".format(n), tm=False)
    A.indent(level=1)
    A.info("UP", tm=False)
    A.indent(level=2)
    A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    A.indent(level=1)
    A.info("DOWN", tm=False)
    A.indent(level=2)
    A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
A.indent(level=0)
A.info("Done", tm=False)
Node 1
 | UP
 |  | 1437602 lex
 |  | 904776 phrase_atom
 |  | 651573 phrase
 |  | 606394 half_verse
 |  | 515690 clause_atom
 |  | 427559 clause
 |  | 1236025 sentence_atom
 |  | 1172308 sentence
 |  | 1414389 verse
 |  | 426630 chapter
 |  | 426591 book
 | DOWN
 |  |
Node 1414389
 | UP
 |  | 515690 clause_atom
 |  | 427559 clause
 |  | 1236025 sentence_atom
 |  | 1172308 sentence
 |  | 426630 chapter
 |  | 426591 book
 | DOWN
 |  | 1172308 sentence
 |  | 1236025 sentence_atom
 |  | 427559 clause
 |  | 515690 clause_atom
 |  | 606394 half_verse
 |  | 651573 phrase
 |  | 904776 phrase_atom
 |  | 1 word
 |  | 2 word
 |  | 651574 phrase
 |  | 904777 phrase_atom
 |  | 3 word
 |  | 651575 phrase
 |  | 904778 phrase_atom
 |  | 4 word
 |  | 606395 half_verse
 |  | 651576 phrase
 |  | 904779 phrase_atom
 |  | 1300539 subphrase
 |  | 5 word
 |  | 6 word
 |  | 7 word
 |  | 8 word
 |  | 1300540 subphrase
 |  | 9 word
 |  | 10 word
 |  | 11 word
Done
So far, we have mainly seen nodes and their numbers, and the names of node types. You would almost forget that we are dealing with text. So let's try to see some text.
In the same way as F gives access to feature data, T gives access to the text. That is also feature data, but you can tell Text-Fabric which features specifically carry the text, and in return Text-Fabric offers you a Text API: T.
Hebrew text can be represented in a number of ways: in Hebrew characters (fully pointed or consonantal), in transliteration, and phonetically.
If you wonder where the information about text formats is stored: not in the program text-fabric, but in the data set. It has a feature otext, which specifies the formats and which features must be used to produce them. otext is the third special feature in a TF data set, next to otype and oslots.
It is an optional feature. If it is absent, there will be no T API.
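If you are curious what such a specification looks like: a format in otext is a template over feature names, roughly like this (an illustrative, simplified sketch; the actual BHSA definition is richer, e.g. it also takes the qere features into account):
fmt:text-orig-full={g_word_utf8}{trailer_utf8}
This says: render each word slot as its g_word_utf8 value followed by its trailer_utf8 value.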
Here is a list of all available formats in this data set.
sorted(T.formats)
['lex-default', 'lex-orig-full', 'lex-orig-plain', 'lex-trans-full', 'lex-trans-plain', 'text-orig-full', 'text-orig-full-ketiv', 'text-orig-plain', 'text-phono-full', 'text-trans-full', 'text-trans-full-ketiv', 'text-trans-plain']
Note the text-phono-full format here. It does not come from the main data source bhsa, but from the module phono. Look in your data directory, find ~/github/etcbc/phono/tf/2017/otext@phono.tf, and you'll see this format defined there.
We can pretty display in other formats:
for word in wordShows:
    A.pretty(word, fmt="text-phono-full")
This function is central to get text representations of nodes. Its most basic usage is
T.text(nodes, fmt=fmt)
where nodes is a list or iterable of nodes, usually word nodes, and fmt is the name of a format. If you leave out fmt, the default text-orig-full is chosen.
The result is the text in that format for all nodes specified:
T.text([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], fmt="text-orig-plain")
'בראשׁית ברא אלהים את השׁמים ואת הארץ׃ '
There is also another usage of this function:
T.text(node, fmt=fmt)
where node is a single node. In this case, the default format is ntype-orig-full, where ntype is the type of node. So for a lex node, the default format is lex-orig-full.
If the format is defined in the corpus, it will be used. Otherwise, the word nodes contained in node will be looked up and represented with the default format text-orig-full.
In this way we can sensibly represent a lot of different nodes, such as chapters, verses, sentences, words and lexemes.
We compose a set of example nodes and run T.text on them:
exampleNodes = [
    1,
    F.otype.s("sentence")[0],
    F.otype.s("verse")[0],
    F.otype.s("chapter")[0],
    F.otype.s("lex")[1],
]
exampleNodes
[1, 1172308, 1414389, 426630, 1437603]
for n in exampleNodes:
    print(f"This is {F.otype.v(n)} {n}:")
    print(T.text(n))
    print("")
This is word 1: בְּ This is sentence 1172308: בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ This is verse 1414389: בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ This is chapter 426630: בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃ וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃ וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃ וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃ פ וַיֹּ֣אמֶר אֱלֹהִ֔ים יְהִ֥י רָקִ֖יעַ בְּתֹ֣וךְ הַמָּ֑יִם וִיהִ֣י מַבְדִּ֔יל בֵּ֥ין מַ֖יִם לָמָֽיִם׃ וַיַּ֣עַשׂ אֱלֹהִים֮ אֶת־הָרָקִיעַ֒ וַיַּבְדֵּ֗ל בֵּ֤ין הַמַּ֨יִם֙ אֲשֶׁר֙ מִתַּ֣חַת לָרָקִ֔יעַ וּבֵ֣ין הַמַּ֔יִם אֲשֶׁ֖ר מֵעַ֣ל לָרָקִ֑יעַ וַֽיְהִי־כֵֽן׃ וַיִּקְרָ֧א אֱלֹהִ֛ים לָֽרָקִ֖יעַ שָׁמָ֑יִם וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום שֵׁנִֽי׃ פ וַיֹּ֣אמֶר אֱלֹהִ֗ים יִקָּו֨וּ הַמַּ֜יִם מִתַּ֤חַת הַשָּׁמַ֨יִם֙ אֶל־מָקֹ֣ום אֶחָ֔ד וְתֵרָאֶ֖ה הַיַּבָּשָׁ֑ה וַֽיְהִי־כֵֽן׃ וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לַיַּבָּשָׁה֙ אֶ֔רֶץ וּלְמִקְוֵ֥ה הַמַּ֖יִם קָרָ֣א יַמִּ֑ים וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַיֹּ֣אמֶר אֱלֹהִ֗ים תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב מַזְרִ֣יעַ זֶ֔רַע עֵ֣ץ פְּרִ֞י עֹ֤שֶׂה פְּרִי֙ לְמִינֹ֔ו אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו עַל־הָאָ֑רֶץ וַֽיְהִי־כֵֽן׃ וַתֹּוצֵ֨א הָאָ֜רֶץ דֶּ֠שֶׁא עֵ֣שֶׂב מַזְרִ֤יעַ זֶ֨רַע֙ לְמִינֵ֔הוּ וְעֵ֧ץ עֹ֥שֶׂה פְּרִ֛י אֲשֶׁ֥ר זַרְעֹו־בֹ֖ו לְמִינֵ֑הוּ וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום שְׁלִישִֽׁי׃ פ וַיֹּ֣אמֶר אֱלֹהִ֗ים יְהִ֤י מְאֹרֹת֙ בִּרְקִ֣יעַ הַשָּׁמַ֔יִם לְהַבְדִּ֕יל בֵּ֥ין הַיֹּ֖ום וּבֵ֣ין הַלָּ֑יְלָה וְהָי֤וּ לְאֹתֹת֙ וּלְמֹ֣ועֲדִ֔ים וּלְיָמִ֖ים וְשָׁנִֽים׃ וְהָי֤וּ לִמְאֹורֹת֙ בִּרְקִ֣יעַ הַשָּׁמַ֔יִם לְהָאִ֖יר עַל־הָאָ֑רֶץ וַֽיְהִי־כֵֽן׃ וַיַּ֣עַשׂ אֱלֹהִ֔ים אֶת־שְׁנֵ֥י הַמְּאֹרֹ֖ת הַגְּדֹלִ֑ים אֶת־הַמָּאֹ֤ור הַגָּדֹל֙ לְמֶמְשֶׁ֣לֶת הַיֹּ֔ום וְאֶת־הַמָּאֹ֤ור הַקָּטֹן֙ לְמֶמְשֶׁ֣לֶת הַלַּ֔יְלָה וְאֵ֖ת הַכֹּוכָבִֽים׃ וַיִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּרְקִ֣יעַ הַשָּׁמָ֑יִם לְהָאִ֖יר עַל־הָאָֽרֶץ׃ וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה וּֽלֲהַבְדִּ֔יל בֵּ֥ין הָאֹ֖ור וּבֵ֣ין הַחֹ֑שֶׁךְ וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום רְבִיעִֽי׃ פ וַיֹּ֣אמֶר אֱלֹהִ֔ים יִשְׁרְצ֣וּ הַמַּ֔יִם שֶׁ֖רֶץ נֶ֣פֶשׁ חַיָּ֑ה וְעֹוף֙ יְעֹופֵ֣ף עַל־הָאָ֔רֶץ עַל־פְּנֵ֖י רְקִ֥יעַ הַשָּׁמָֽיִם׃ וַיִּבְרָ֣א אֱלֹהִ֔ים אֶת־הַתַּנִּינִ֖ם הַגְּדֹלִ֑ים וְאֵ֣ת כָּל־נֶ֣פֶשׁ הַֽחַיָּ֣ה׀ הָֽרֹמֶ֡שֶׂת אֲשֶׁר֩ שָׁרְצ֨וּ הַמַּ֜יִם לְמִֽינֵהֶ֗ם וְאֵ֨ת כָּל־עֹ֤וף כָּנָף֙ לְמִינֵ֔הוּ וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַיְבָ֧רֶךְ אֹתָ֛ם אֱלֹהִ֖ים לֵאמֹ֑ר פְּר֣וּ וּרְב֗וּ וּמִלְא֤וּ אֶת־הַמַּ֨יִם֙ בַּיַּמִּ֔ים וְהָעֹ֖וף יִ֥רֶב בָּאָֽרֶץ׃ וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום חֲמִישִֽׁי׃ פ וַיֹּ֣אמֶר אֱלֹהִ֗ים תֹּוצֵ֨א הָאָ֜רֶץ נֶ֤פֶשׁ חַיָּה֙ לְמִינָ֔הּ בְּהֵמָ֥ה וָרֶ֛מֶשׂ וְחַֽיְתֹו־אֶ֖רֶץ לְמִינָ֑הּ וַֽיְהִי־כֵֽן׃ וַיַּ֣עַשׂ אֱלֹהִים֩ אֶת־חַיַּ֨ת הָאָ֜רֶץ לְמִינָ֗הּ וְאֶת־הַבְּהֵמָה֙ לְמִינָ֔הּ וְאֵ֛ת כָּל־רֶ֥מֶשׂ הָֽאֲדָמָ֖ה לְמִינֵ֑הוּ וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃ וַיֹּ֣אמֶר אֱלֹהִ֔ים נַֽעֲשֶׂ֥ה אָדָ֛ם בְּצַלְמֵ֖נוּ כִּדְמוּתֵ֑נוּ וְיִרְדּוּ֩ בִדְגַ֨ת הַיָּ֜ם וּבְעֹ֣וף הַשָּׁמַ֗יִם וּבַבְּהֵמָה֙ וּבְכָל־הָאָ֔רֶץ וּבְכָל־הָרֶ֖מֶשׂ הָֽרֹמֵ֥שׂ עַל־הָאָֽרֶץ׃ וַיִּבְרָ֨א אֱלֹהִ֤ים׀ אֶת־הָֽאָדָם֙ בְּצַלְמֹ֔ו בְּצֶ֥לֶם אֱלֹהִ֖ים בָּרָ֣א אֹתֹ֑ו זָכָ֥ר וּנְקֵבָ֖ה בָּרָ֥א אֹתָֽם׃ וַיְבָ֣רֶךְ אֹתָם֮ אֱלֹהִים֒ וַיֹּ֨אמֶר לָהֶ֜ם אֱלֹהִ֗ים פְּר֥וּ וּרְב֛וּ וּמִלְא֥וּ אֶת־הָאָ֖רֶץ וְכִבְשֻׁ֑הָ וּרְד֞וּ 
בִּדְגַ֤ת הַיָּם֙ וּבְעֹ֣וף הַשָּׁמַ֔יִם וּבְכָל־חַיָּ֖ה הָֽרֹמֶ֥שֶׂת עַל־הָאָֽרֶץ׃ וַיֹּ֣אמֶר אֱלֹהִ֗ים הִנֵּה֩ נָתַ֨תִּי לָכֶ֜ם אֶת־כָּל־עֵ֣שֶׂב׀ זֹרֵ֣עַ זֶ֗רַע אֲשֶׁר֙ עַל־פְּנֵ֣י כָל־הָאָ֔רֶץ וְאֶת־כָּל־הָעֵ֛ץ אֲשֶׁר־בֹּ֥ו פְרִי־עֵ֖ץ זֹרֵ֣עַ זָ֑רַע לָכֶ֥ם יִֽהְיֶ֖ה לְאָכְלָֽה׃ וּֽלְכָל־חַיַּ֣ת הָ֠אָרֶץ וּלְכָל־עֹ֨וף הַשָּׁמַ֜יִם וּלְכֹ֣ל׀ רֹומֵ֣שׂ עַל־הָאָ֗רֶץ אֲשֶׁר־בֹּו֙ נֶ֣פֶשׁ חַיָּ֔ה אֶת־כָּל־יֶ֥רֶק עֵ֖שֶׂב לְאָכְלָ֑ה וַֽיְהִי־כֵֽן׃ וַיַּ֤רְא אֱלֹהִים֙ אֶת־כָּל־אֲשֶׁ֣ר עָשָׂ֔ה וְהִנֵּה־טֹ֖וב מְאֹ֑ד וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום הַשִּׁשִּֽׁי׃ פ This is lex 1437603: רֵאשִׁית
Now let's use those formats to print out the first verse of the Hebrew Bible.
for fmt in sorted(T.formats):
    print("{}:\n\t{}".format(fmt, T.text(range(1, 12), fmt=fmt)))
lex-default: בְּ רֵאשִׁית ברא אֱלֹהִים אֵת הַ שָׁמַיִם וְ אֵת הַ אֶרֶץ lex-orig-full: בְּ רֵאשִׁית בָּרָא אֱלֹה אֵת הַ שָּׁמַי וְ אֵת הָ אָרֶץ lex-orig-plain: ב ראשׁית ברא אלהים את ה שׁמים ו את ה ארץ lex-trans-full: B.:- R;>CIJT B.@R@> >:ELOH >;T HA- C.@MAJ W:- >;T H@- >@REY lex-trans-plain: B R>CJT/ BR>[ >LHJM/ >T H CMJM/ W >T H >RY/ text-orig-full: בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ text-orig-full-ketiv: בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ text-orig-plain: בראשׁית ברא אלהים את השׁמים ואת הארץ׃ text-phono-full: bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . text-trans-full: B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 text-trans-full-ketiv: B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 text-trans-plain: BR>CJT BR> >LHJM >T HCMJM W>T H>RY00
Note that lex-default is a format that only works for nodes of type lex.
If we do not specify a format, the default format is used (text-orig-full).
T.text(range(1, 12))
'בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ '
firstVerse = F.otype.s("verse")[0]
T.text(firstVerse)
'בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ '
T.text(firstVerse, fmt="text-phono-full")
'bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . '
The important things to remember are:
you can call T.text(lx) for lexeme nodes lx and it will give the vocalized lexeme (using format lex-default);
you can get the text of a node n in the default format by T.text(n);
you can get the text of a node n in other formats by T.text(n, fmt=fmt, descend=True).
Part of the pleasure of working with computers is that they can crunch massive amounts of data. The text of the Hebrew Bible is a piece of cake.
It takes less than ten seconds to have that cake and eat it. In a dozen formats.
A.indent(reset=True)
A.info("writing plain text of whole Bible in all formats ...")
text = collections.defaultdict(list)
for v in F.otype.s("verse"):
    for fmt in sorted(T.formats):
        text[fmt].append(T.text(v, fmt=fmt, descend=True))
A.info("done {} formats".format(len(text)))
0.00s writing plain text of whole Bible in all formats ...
3.09s done 12 formats
for fmt in sorted(text):
    print("{}\n{}\n".format(fmt, "\n".join(text[fmt][0:5])))
lex-default בְּ רֵאשִׁית ברא אֱלֹהִים אֵת הַ שָׁמַיִם וְ אֵת הַ אֶרֶץ וְ הַ אֶרֶץ היה תֹּהוּ וְ בֹּהוּ וְ חֹשֶׁךְ עַל פָּנֶה תְּהֹום וְ רוּחַ אֱלֹהִים רחף עַל פָּנֶה הַ מַיִם וְ אמר אֱלֹהִים היה אֹור וְ היה אֹור וְ ראה אֱלֹהִים אֵת הַ אֹור כִּי טוב וְ בדל אֱלֹהִים בַּיִן הַ אֹור וְ בַּיִן הַ חֹשֶׁךְ וְ קרא אֱלֹהִים לְ הַ אֹור יֹום וְ לְ הַ חֹשֶׁךְ קרא לַיְלָה וְ היה עֶרֶב וְ היה בֹּקֶר יֹום אֶחָד lex-orig-full בְּ רֵאשִׁית בָּרָא אֱלֹה אֵת הַ שָּׁמַי וְ אֵת הָ אָרֶץ וְ הָ אָרֶץ הָי תֹהוּ וָ בֹהוּ וְ חֹשֶׁךְ עַל פְּן תְהֹום וְ רוּחַ אֱלֹה רַחֶף עַל פְּן הַ מָּי וַ אמֶר אֱלֹה הִי אֹור וַ הִי אֹור וַ רְא אֱלֹה אֶת הָ אֹור כִּי טֹוב וַ בְדֵּל אֱלֹה בֵּין הָ אֹור וּ בֵין הַ חֹשֶׁךְ וַ קְרָא אֱלֹה לָ אֹור יֹום וְ לַ חֹשֶׁךְ קָרָא לָיְלָה וַ הִי עֶרֶב וַ הִי בֹקֶר יֹום אֶחָד lex-orig-plain ב ראשׁית ברא אלהים את ה שׁמים ו את ה ארץ ו ה ארץ היה תהו ו בהו ו חשׁך על פנה תהום ו רוח אלהים רחף על פנה ה מים ו אמר אלהים היה אור ו היה אור ו ראה אלהים את ה אור כי טוב ו בדל אלהים בין ה אור ו בין ה חשׁך ו קרא אלהים ל ה אור יום ו ל ה חשׁך קרא לילה ו היה ערב ו היה בקר יום אחד lex-trans-full B.:- R;>CIJT B.@R@> >:ELOH >;T HA- C.@MAJ W:- >;T H@- >@REY W:- H@- >@REY H@J TOHW. W@- BOHW. W:- XOCEK: <AL P.:N T:HOWM W:- RW.XA >:ELOH RAXEP <AL P.:N HA- M.@J WA- >MER >:ELOH HIJ >OWR WA- HIJ >OWR WA- R:> >:ELOH >ET H@- >OWR K.IJ VOWB WA- B:D.;L >:ELOH B.;JN H@- >OWR W.- B;JN HA- XOCEK: WA- Q:R@> >:ELOH L@- - >OWR JOWM W:- LA- - XOCEK: Q@R@> L@J:L@H WA- HIJ <EREB WA- HIJ BOQER JOWM >EX@D lex-trans-plain B R>CJT/ BR>[ >LHJM/ >T H CMJM/ W >T H >RY/ W H >RY/ HJH[ THW/ W BHW/ W XCK/ <L PNH/ THWM/ W RWX/ >LHJM/ RXP[ <L PNH/ H MJM/ W >MR[ >LHJM/ HJH[ >WR/ W HJH[ >WR/ W R>H[ >LHJM/ >T H >WR/ KJ VWB[ W BDL[ >LHJM/ BJN/ H >WR/ W BJN/ H XCK/ W QR>[ >LHJM/ L H >WR/ JWM/ W L H XCK/ QR>[ LJLH/ W HJH[ <RB/ W HJH[ BQR=/ JWM/ >XD/ text-orig-full בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃ וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃ וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃ וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃ פ text-orig-full-ketiv בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃ וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃ וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃ וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃ פ text-orig-plain בראשׁית ברא אלהים את השׁמים ואת הארץ׃ והארץ היתה תהו ובהו וחשׁך על־פני תהום ורוח אלהים מרחפת על־פני המים׃ ויאמר אלהים יהי אור ויהי־אור׃ וירא אלהים את־האור כי־טוב ויבדל אלהים בין האור ובין החשׁך׃ ויקרא אלהים׀ לאור יום ולחשׁך קרא לילה ויהי־ערב ויהי־בקר יום אחד׃ פ text-phono-full bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . wᵊhāʔˈāreṣ hāyᵊṯˌā ṯˈōhû wāvˈōhû wᵊḥˌōšeḵ ʕal-pᵊnˈê ṯᵊhˈôm wᵊrˈûₐḥ ʔᵉlōhˈîm mᵊraḥˌefeṯ ʕal-pᵊnˌê hammˈāyim . wayyˌōmer ʔᵉlōhˌîm yᵊhˈî ʔˈôr wˈayᵊhî-ʔˈôr . wayyˈar ʔᵉlōhˈîm ʔeṯ-hāʔˌôr kî-ṭˈôv wayyavdˈēl ʔᵉlōhˈîm bˌên hāʔˌôr ûvˌên haḥˈōšeḵ . wayyiqrˌā ʔᵉlōhˈîm lāʔôr yˈôm wᵊlaḥˌōšeḵ qˈārā lˈāyᵊlā wˈayᵊhî-ʕˌerev wˈayᵊhî-vˌōqer yˌôm ʔeḥˈāḏ . 
f text-trans-full B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 W:-H@->@81REY H@J:T@71H TO33HW.03 W@-BO80HW. W:-XO73CEK: <AL&P.:N;74J T:HO92WM W:-R74W.XA >:ELOHI80JM M:RAXE73PET <AL&P.:N;71J HA-M.@75JIM00 WA-J.O71>MER >:ELOHI73JM J:HI74J >O92WR WA45-J:HIJ&>O75WR00 WA-J.A94R:> >:ELOHI91JM >ET&H@->O73WR K.IJ&VO92WB WA-J.AB:D.;74L >:ELOHI80JM B.;71JN H@->O73WR W.-B;71JN HA-XO75CEK:00 WA-J.IQ:R@63> >:ELOHI70JM05 L@-->OWR03 JO80WM W:-LA--XO73CEK: Q@74R@> L@92J:L@H WA45-J:HIJ&<E71REB WA45-J:HIJ&BO73QER JO71WM >EX@75D00_P text-trans-full-ketiv B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 W:-H@->@81REY H@J:T@71H TO33HW.03 W@-BO80HW. W:-XO73CEK: <AL&P.:N;74J T:HO92WM W:-R74W.XA >:ELOHI80JM M:RAXE73PET <AL&P.:N;71J HA-M.@75JIM00 WA-J.O71>MER >:ELOHI73JM J:HI74J >O92WR WA45-J:HIJ&>O75WR00 WA-J.A94R:> >:ELOHI91JM >ET&H@->O73WR K.IJ&VO92WB WA-J.AB:D.;74L >:ELOHI80JM B.;71JN H@->O73WR W.-B;71JN HA-XO75CEK:00 WA-J.IQ:R@63> >:ELOHI70JM05 L@-->OWR03 JO80WM W:-LA--XO73CEK: Q@74R@> L@92J:L@H WA45-J:HIJ&<E71REB WA45-J:HIJ&BO73QER JO71WM >EX@75D00_P text-trans-plain BR>CJT BR> >LHJM >T HCMJM W>T H>RY00 WH>RY HJTH THW WBHW WXCK <L&PNJ THWM WRWX >LHJM MRXPT <L&PNJ HMJM00 WJ>MR >LHJM JHJ >WR WJHJ&>WR00 WJR> >LHJM >T&H>WR KJ&VWB WJBDL >LHJM BJN H>WR WBJN HXCK00 WJQR> >LHJM05 L>WR JWM WLXCK QR> LJLH WJHJ&<RB WJHJ&BQR JWM >XD00_P
We write a few formats to file, in your Downloads folder.
for fmt in """
text-orig-full
text-phono-full
""".strip().split():
    with open(os.path.expanduser(f"~/Downloads/{fmt}.txt"), "w") as f:
        f.write("\n".join(text[fmt]))
T.languages
{'': {'language': 'default', 'languageEnglish': 'default'}, 'am': {'language': 'ኣማርኛ', 'languageEnglish': 'amharic'}, 'ar': {'language': 'العَرَبِية', 'languageEnglish': 'arabic'}, 'bn': {'language': 'বাংলা', 'languageEnglish': 'bengali'}, 'da': {'language': 'Dansk', 'languageEnglish': 'danish'}, 'de': {'language': 'Deutsch', 'languageEnglish': 'german'}, 'el': {'language': 'Ελληνικά', 'languageEnglish': 'greek'}, 'en': {'language': 'English', 'languageEnglish': 'english'}, 'es': {'language': 'Español', 'languageEnglish': 'spanish'}, 'fa': {'language': 'فارسی', 'languageEnglish': 'farsi'}, 'fr': {'language': 'Français', 'languageEnglish': 'french'}, 'he': {'language': 'עברית', 'languageEnglish': 'hebrew'}, 'hi': {'language': 'हिन्दी', 'languageEnglish': 'hindi'}, 'id': {'language': 'Bahasa Indonesia', 'languageEnglish': 'indonesian'}, 'ja': {'language': '日本語', 'languageEnglish': 'japanese'}, 'ko': {'language': '한국어', 'languageEnglish': 'korean'}, 'la': {'language': 'Latina', 'languageEnglish': 'latin'}, 'nl': {'language': 'Nederlands', 'languageEnglish': 'dutch'}, 'pa': {'language': 'ਪੰਜਾਬੀ', 'languageEnglish': 'punjabi'}, 'pt': {'language': 'Português', 'languageEnglish': 'portuguese'}, 'ru': {'language': 'Русский', 'languageEnglish': 'russian'}, 'sw': {'language': 'Kiswahili', 'languageEnglish': 'swahili'}, 'syc': {'language': 'ܠܫܢܐ ܣܘܪܝܝܐ', 'languageEnglish': 'syriac'}, 'tr': {'language': 'Türkçe', 'languageEnglish': 'turkish'}, 'ur': {'language': 'اُردُو', 'languageEnglish': 'urdu'}, 'yo': {'language': 'èdè Yorùbá', 'languageEnglish': 'yoruba'}, 'zh': {'language': '中文', 'languageEnglish': 'chinese'}}
Get the book names in Swahili.
nodeToSwahili = ""
for b in F.otype.s("book"):
    nodeToSwahili += "{} = {}\n".format(b, T.bookName(b, lang="sw"))
print(nodeToSwahili)
426591 = Mwanzo 426592 = Kutoka 426593 = Mambo_ya_Walawi 426594 = Hesabu 426595 = Kumbukumbu_la_Torati 426596 = Yoshua 426597 = Waamuzi 426598 = 1_Samweli 426599 = 2_Samweli 426600 = 1_Wafalme 426601 = 2_Wafalme 426602 = Isaya 426603 = Yeremia 426604 = Ezekieli 426605 = Hosea 426606 = Yoeli 426607 = Amosi 426608 = Obadia 426609 = Yona 426610 = Mika 426611 = Nahumu 426612 = Habakuki 426613 = Sefania 426614 = Hagai 426615 = Zekaria 426616 = Malaki 426617 = Zaburi 426618 = Ayubu 426619 = Mithali 426620 = Ruthi 426621 = Wimbo_Ulio_Bora 426622 = Mhubiri 426623 = Maombolezo 426624 = Esta 426625 = Danieli 426626 = Ezra 426627 = Nehemia 426628 = 1_Mambo_ya_Nyakati 426629 = 2_Mambo_ya_Nyakati
OK, there they are. We copy them into a string, and do the opposite: get the nodes back. We check whether we get exactly the same nodes as the ones we started with.
swahiliNames = """
Mwanzo
Kutoka
Mambo_ya_Walawi
Hesabu
Kumbukumbu_la_Torati
Yoshua
Waamuzi
1_Samweli
2_Samweli
1_Wafalme
2_Wafalme
Isaya
Yeremia
Ezekieli
Hosea
Yoeli
Amosi
Obadia
Yona
Mika
Nahumu
Habakuki
Sefania
Hagai
Zekaria
Malaki
Zaburi
Ayubu
Mithali
Ruthi
Wimbo_Ulio_Bora
Mhubiri
Maombolezo
Esta
Danieli
Ezra
Nehemia
1_Mambo_ya_Nyakati
2_Mambo_ya_Nyakati
""".strip().split()
swahiliToNode = ""
for nm in swahiliNames:
    swahiliToNode += "{} = {}\n".format(T.bookNode(nm, lang="sw"), nm)
if swahiliToNode != nodeToSwahili:
    print("Something is not right with the book names")
else:
    print("Going from nodes to booknames and back yields the original nodes")
Going from nodes to booknames and back yields the original nodes
A section in the Hebrew bible is a book, a chapter or a verse.
Knowledge of sections is not baked into Text-Fabric.
The config feature otext.tf may specify three section levels, and tell what the corresponding node types and features are.
From that knowledge it can construct mappings from nodes to sections, e.g. from verse nodes to tuples of the form
(bookName, chapterNumber, verseNumber)
You can get the section of a node as a tuple of relevant book, chapter, and verse nodes. Or you can get it as a passage label, a string.
You can ask for the passage corresponding to the first slot of a node, or the one corresponding to the last slot.
If you are dealing with book and chapter nodes, you can ask to fill out the verse and chapter parts as well.
Here are examples of getting the section that corresponds to a node and vice versa.
NB: sectionFromNode always delivers a verse specification, either from the first slot belonging to that node, or, if lastSlot, from the last slot belonging to that node.
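As a minimal round trip first (node 100000 is the arbitrary word slot we inspected before):
T.sectionFromNode(100000)  # a (book name, chapter number, verse number) tuple
T.nodeFromSection(T.sectionFromNode(100000))  # and back: the node of the enclosing verse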
for (desc, n) in chain(normalShow.items(), sectionShow.items()):
    for lang in "en la sw".split():
        d = f"{n:>7} {desc}" if lang == "en" else ""
        first = A.sectionStrFromNode(n, lang=lang)
        last = A.sectionStrFromNode(n, lang=lang, lastSlot=True, fillup=True)
        tup = (
            T.sectionTuple(n)
            if lang == "en"
            else T.sectionTuple(n, lastSlot=True, fillup=True)
            if lang == "la"
            else ""
        )
        print(f"{d:<20} {lang} - {first:<30} {last:<30} {tup}")
  15890 wordShow     en - Genesis 30:18                  Genesis 30:18                  (426591, 426659, 1415237)
                     la - Genesis 30:18                  Genesis 30:18                  (426591, 426659, 1415237)
                     sw - Mwanzo 30:18                   Mwanzo 30:18
 700000 phraseShow   en - Numbers 22:31                  Numbers 22:31                  (426594, 426768, 1418795)
                     la - Numeri 22:31                   Numeri 22:31                   (426594, 426768, 1418795)
                     sw - Hesabu 22:31                   Hesabu 22:31
 500000 clauseShow   en - Job 36:27                      Job 36:27                      (426618, 427382, 1432958)
                     la - Iob 36:27                      Iob 36:27                      (426618, 427382, 1432958)
                     sw - Ayubu 36:27                    Ayubu 36:27
1200000 sentenceShow en - 2_Kings 6:5                    2_Kings 6:5                    (426601, 426944, 1423986)
                     la - Reges_II 6:5                   Reges_II 6:5                   (426601, 426944, 1423986)
                     sw - 2_Wafalme 6:5                  2_Wafalme 6:5
1437667 lexShow      en - Genesis 1:16                   2_Chronicles 22:1              (426591, 426630, 1414404)
                     la - Genesis 1:16                   Chronica_II 22:1               (426629, 427544, 1437230)
                     sw - Mwanzo 1:16                    2_Mambo_ya_Nyakati 22:1
1420000 verseShow    en - Deuteronomy 27:25              Deuteronomy 27:25              (426595, 426809, 1420000)
                     la - Deuteronomium 27:25            Deuteronomium 27:25            (426595, 426809, 1420000)
                     sw - Kumbukumbu_la_Torati 27:25     Kumbukumbu_la_Torati 27:25
 427000 chapterShow  en - Isaiah 37                      Isaiah 37:38                   (426602, 427000)
                     la - Jesaia 37                      Jesaia 37:38                   (426602, 427000, 1425295)
                     sw - Isaya 37                       Isaya 37:38
 426598 bookShow     en - 1_Samuel                       1_Samuel 31:13                 (426598,)
                     la - Samuel_I                       Samuel_I 31:13                 (426598, 426892, 1422328)
                     sw - 1_Samweli                      1_Samweli 31:13
And here are examples to get back:
for (lang, section) in (
    ("en", "Ezekiel"),
    ("la", "Ezechiel"),
    ("sw", "Ezekieli"),
    ("en", "Isaiah 43"),
    ("la", "Jesaia 43"),
    ("sw", "Isaya 43"),
    ("en", "Deuteronomy 28:34"),
    ("la", "Deuteronomium 28:34"),
    ("sw", "Kumbukumbu_la_Torati 28:34"),
    ("en", "Job 37:3"),
    ("la", "Iob 37:3"),
    ("sw", "Ayubu 37:3"),
    ("en", "Numbers 22:33"),
    ("la", "Numeri 22:33"),
    ("sw", "Hesabu 22:33"),
    ("en", "Genesis 30:18"),
    ("la", "Genesis 30:18"),
    ("sw", "Mwanzo 30:18"),
    ("en", "Genesis 1:30"),
    ("la", "Genesis 1:30"),
    ("sw", "Mwanzo 1:30"),
    ("en", "Psalms 37:2"),
    ("la", "Psalmi 37:2"),
    ("sw", "Zaburi 37:2"),
):
    n = A.nodeFromSectionStr(section, lang=lang)
    nType = F.otype.v(n)
    print(f"{section:<30} {lang} {nType:<20} {n}")
Ezekiel                        en book    426604
Ezechiel                       la book    426604
Ezekieli                       sw book    426604
Isaiah 43                      en chapter 427006
Jesaia 43                      la chapter 427006
Isaya 43                       sw chapter 427006
Deuteronomy 28:34              en verse   1420035
Deuteronomium 28:34            la verse   1420035
Kumbukumbu_la_Torati 28:34     sw verse   1420035
Job 37:3                       en verse   1432967
Iob 37:3                       la verse   1432967
Ayubu 37:3                     sw verse   1432967
Numbers 22:33                  en verse   1418797
Numeri 22:33                   la verse   1418797
Hesabu 22:33                   sw verse   1418797
Genesis 30:18                  en verse   1415237
Genesis 30:18                  la verse   1415237
Mwanzo 30:18                   sw verse   1415237
Genesis 1:30                   en verse   1414418
Genesis 1:30                   la verse   1414418
Mwanzo 1:30                    sw verse   1414418
Psalms 37:2                    en verse   1430067
Psalmi 37:2                    la verse   1430067
Zaburi 37:2                    sw verse   1430067
If you go up from a sentence node, you expect to find a verse node. But some sentences span multiple verses, and in that case, you will not find the enclosing verse node, because it is not there.
Here is a piece of code to detect and list all cases where sentences span multiple verses.
The idea is to pick the first and the last word of a sentence, use T.sectionFromNode to discover the verse in which that word occurs, and if they are different: bingo!
We show the first 10 of ca. 900 cases.
By the way: doing this in the 2016 version of the data yields 915 results. The splitting up of the text into sentences is not carved in stone!
A.indent(reset=True)
A.info("Get sentences that span multiple verses")
spanSentences = []
for s in F.otype.s("sentence"):
    fs = T.sectionFromNode(s, lastSlot=False)
    ls = T.sectionFromNode(s, lastSlot=True)
    if fs != ls:
        spanSentences.append("{} {}:{}-{}".format(fs[0], fs[1], fs[2], ls[2]))
A.info("Found {} cases".format(len(spanSentences)))
A.info("\n{}".format("\n".join(spanSentences[0:10])))
0.00s Get sentences that span multiple verses
1.09s Found 887 cases
1.09s Genesis 1:17-18
Genesis 1:29-30
Genesis 2:4-7
Genesis 7:2-3
Genesis 7:8-9
Genesis 7:13-14
Genesis 9:9-10
Genesis 10:11-12
Genesis 10:13-14
Genesis 10:15-18
A different way, with better display, is:
A.indent(reset=True)
A.info("Get sentences that span multiple verses")
spanSentences = []
for s in F.otype.s("sentence"):
    words = L.d(s, otype="word")
    fw = words[0]
    lw = words[-1]
    fVerse = L.u(fw, otype="verse")[0]
    lVerse = L.u(lw, otype="verse")[0]
    if fVerse != lVerse:
        spanSentences.append((s, fVerse, lVerse))
A.info("Found {} cases".format(len(spanSentences)))
A.table(spanSentences, end=1)
0.00s Get sentences that span multiple verses
0.38s Found 887 cases
n | p | sentence | verse | verse |
---|---|---|---|---|
1 | Genesis 1:17 | וַיִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּרְקִ֣יעַ הַשָּׁמָ֑יִם לְהָאִ֖יר עַל־הָאָֽרֶץ׃ וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה וּֽלֲהַבְדִּ֔יל בֵּ֥ין הָאֹ֖ור וּבֵ֣ין הַחֹ֑שֶׁךְ |
Wait a second, the columns with the verses are empty. In tables, the content of a verse is not shown. And by default, the passage that is relevant to a row is computed from one of the columns.
But here, we definitely want the passage of columns 2 and 3, so:
A.table(spanSentences, end=10, withPassage={2, 3})
n | sentence | verse | verse |
---|---|---|---|
1 | וַיִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּרְקִ֣יעַ הַשָּׁמָ֑יִם לְהָאִ֖יר עַל־הָאָֽרֶץ׃ וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה וּֽלֲהַבְדִּ֔יל בֵּ֥ין הָאֹ֖ור וּבֵ֣ין הַחֹ֑שֶׁךְ | Genesis 1:17 | Genesis 1:18 |
2 | הִנֵּה֩ נָתַ֨תִּי לָכֶ֜ם אֶת־כָּל־עֵ֣שֶׂב׀ זֹרֵ֣עַ זֶ֗רַע אֲשֶׁר֙ עַל־פְּנֵ֣י כָל־הָאָ֔רֶץ וְאֶת־כָּל־הָעֵ֛ץ אֲשֶׁר־בֹּ֥ו פְרִי־עֵ֖ץ זֹרֵ֣עַ זָ֑רַע וּֽלְכָל־חַיַּ֣ת הָ֠אָרֶץ וּלְכָל־עֹ֨וף הַשָּׁמַ֜יִם וּלְכֹ֣ל׀ רֹומֵ֣שׂ עַל־הָאָ֗רֶץ אֲשֶׁר־בֹּו֙ נֶ֣פֶשׁ חַיָּ֔ה אֶת־כָּל־יֶ֥רֶק עֵ֖שֶׂב לְאָכְלָ֑ה | Genesis 1:29 | Genesis 1:30 |
3 | בְּיֹ֗ום עֲשֹׂ֛ות יְהוָ֥ה אֱלֹהִ֖ים אֶ֥רֶץ וְשָׁמָֽיִם׃ וַיִּיצֶר֩ יְהוָ֨ה אֱלֹהִ֜ים אֶת־הָֽאָדָ֗ם עָפָר֙ מִן־הָ֣אֲדָמָ֔ה | Genesis 2:4 | Genesis 2:7 |
4 | מִכֹּ֣ל׀ הַבְּהֵמָ֣ה הַטְּהֹורָ֗ה תִּֽקַּח־לְךָ֛ שִׁבְעָ֥ה שִׁבְעָ֖ה אִ֣ישׁ וְאִשְׁתֹּ֑ו וּמִן־הַבְּהֵמָ֡ה אֲ֠שֶׁר לֹ֣א טְהֹרָ֥ה הִ֛וא שְׁנַ֖יִם אִ֥ישׁ וְאִשְׁתֹּֽו׃ גַּ֣ם מֵעֹ֧וף הַשָּׁמַ֛יִם שִׁבְעָ֥ה שִׁבְעָ֖ה זָכָ֣ר וּנְקֵבָ֑ה לְחַיֹּ֥ות זֶ֖רַע עַל־פְּנֵ֥י כָל־הָאָֽרֶץ׃ | Genesis 7:2 | Genesis 7:3 |
5 | מִן־הַבְּהֵמָה֙ הַטְּהֹורָ֔ה וּמִן־הַ֨בְּהֵמָ֔ה אֲשֶׁ֥ר אֵינֶ֖נָּה טְהֹרָ֑ה וּמִ֨ן־הָעֹ֔וף וְכֹ֥ל אֲשֶׁר־רֹמֵ֖שׂ עַל־הָֽאֲדָמָֽה׃ שְׁנַ֨יִם שְׁנַ֜יִם בָּ֧אוּ אֶל־נֹ֛חַ אֶל־הַתֵּבָ֖ה זָכָ֣ר וּנְקֵבָ֑ה כַּֽאֲשֶׁ֛ר צִוָּ֥ה אֱלֹהִ֖ים אֶת־נֹֽחַ׃ | Genesis 7:8 | Genesis 7:9 |
6 | בְּעֶ֨צֶם הַיֹּ֤ום הַזֶּה֙ בָּ֣א נֹ֔חַ וְשֵׁם־וְחָ֥ם וָיֶ֖פֶת בְּנֵי־נֹ֑חַ וְאֵ֣שֶׁת נֹ֗חַ וּשְׁלֹ֧שֶׁת נְשֵֽׁי־בָנָ֛יו אִתָּ֖ם אֶל־הַתֵּבָֽה׃ הֵ֜מָּה וְכָל־הַֽחַיָּ֣ה לְמִינָ֗הּ וְכָל־הַבְּהֵמָה֙ לְמִינָ֔הּ וְכָל־הָרֶ֛מֶשׂ הָרֹמֵ֥שׂ עַל־הָאָ֖רֶץ לְמִינֵ֑הוּ וְכָל־הָעֹ֣וף לְמִינֵ֔הוּ כֹּ֖ל צִפֹּ֥ור כָּל־כָּנָֽף׃ | Genesis 7:13 | Genesis 7:14 |
7 | וַאֲנִ֕י הִנְנִ֥י מֵקִ֛ים אֶת־בְּרִיתִ֖י אִתְּכֶ֑ם וְאֶֽת־זַרְעֲכֶ֖ם אַֽחֲרֵיכֶֽם׃ וְאֵ֨ת כָּל־נֶ֤פֶשׁ הַֽחַיָּה֙ אֲשֶׁ֣ר אִתְּכֶ֔ם בָּעֹ֧וף בַּבְּהֵמָ֛ה וּֽבְכָל־חַיַּ֥ת הָאָ֖רֶץ אִתְּכֶ֑ם מִכֹּל֙ יֹצְאֵ֣י הַתֵּבָ֔ה לְכֹ֖ל חַיַּ֥ת הָאָֽרֶץ׃ | Genesis 9:9 | Genesis 9:10 |
8 | וַיִּ֨בֶן֙ אֶת־נִ֣ינְוֵ֔ה וְאֶת־רְחֹבֹ֥ת עִ֖יר וְאֶת־כָּֽלַח׃ וְֽאֶת־רֶ֔סֶן בֵּ֥ין נִֽינְוֵ֖ה וּבֵ֣ין כָּ֑לַח | Genesis 10:11 | Genesis 10:12 |
9 | וּמִצְרַ֡יִם יָלַ֞ד אֶת־לוּדִ֧ים וְאֶת־עֲנָמִ֛ים וְאֶת־לְהָבִ֖ים וְאֶת־נַפְתֻּחִֽים׃ וְֽאֶת־פַּתְרֻסִ֞ים וְאֶת־כַּסְלֻחִ֗ים אֲשֶׁ֨ר יָצְא֥וּ מִשָּׁ֛ם פְּלִשְׁתִּ֖ים וְאֶת־כַּפְתֹּרִֽים׃ ס | Genesis 10:13 | Genesis 10:14 |
10 | וּכְנַ֗עַן יָלַ֛ד אֶת־צִידֹ֥ן בְּכֹרֹ֖ו וְאֶת־חֵֽת׃ וְאֶת־הַיְבוּסִי֙ וְאֶת־הָ֣אֱמֹרִ֔י וְאֵ֖ת הַגִּרְגָּשִֽׁי׃ וְאֶת־הַֽחִוִּ֥י וְאֶת־הַֽעַרְקִ֖י וְאֶת־הַסִּינִֽי׃ וְאֶת־הָֽאַרְוָדִ֥י וְאֶת־הַצְּמָרִ֖י וְאֶת־הַֽחֲמָתִ֑י | Genesis 10:15 | Genesis 10:18 |
We can zoom in:
A.show(spanSentences, condensed=False, start=6, end=6, baseTypes={"sentence_atom"})
result 6
Let us explore where Ketiv/Qere pairs are and how they render.
qeres = [w for w in F.otype.s("word") if F.qere.v(w) is not None]
print("{} qeres".format(len(qeres)))
for w in qeres[0:10]:
    print(
        '{}: ketiv = "{}"+"{}" qere = "{}"+"{}"'.format(
            w,
            F.g_word.v(w),
            F.trailer.v(w),
            F.qere.v(w),
            F.qere_trailer.v(w),
        )
    )
1892 qeres
 3897: ketiv = "*HWY>"+" " qere = "HAJ:Y;74>"+" "
 4420: ketiv = "*>HLH"+"00 " qere = ">@H:@LO75W"+"00"
 5645: ketiv = "*>HLH"+" " qere = ">@H:@LO92W"+" "
 5912: ketiv = "*>HLH"+" " qere = ">@95H:@LOW03"+" "
 6246: ketiv = "*YBJJM"+" " qere = "Y:BOWJI80m"+" "
 6354: ketiv = "*YBJJM"+" " qere = "Y:BOWJI80m"+" "
11762: ketiv = "*W-"+"" qere = "WA"+""
11763: ketiv = "*JJFM"+" " qere = "J.W.FA70m"+" "
12784: ketiv = "*GJJM"+" " qere = "GOWJIm03"+" "
13685: ketiv = "*YJDH"+"00 " qere = "Y@75JID"+"00"
Let us print all text representations of the verse in which the second qere occurs.
refWord = qeres[1]
print(f"Reference word is {refWord}")
vn = L.u(refWord, otype="verse")[0]
print("{} {}:{}".format(*T.sectionFromNode(refWord)))
for fmt in sorted(T.formats):
    if fmt.startswith("text-"):
        print("{:<25} {}".format(fmt, T.text(vn, fmt=fmt, descend=True)))
Reference word is 4420
Genesis 9:21
text-orig-full            וַיֵּ֥שְׁתְּ מִן־הַיַּ֖יִן וַיִּשְׁכָּ֑ר וַיִּתְגַּ֖ל בְּתֹ֥וךְ אָהֳלֹֽו׃
text-orig-full-ketiv      וַיֵּ֥שְׁתְּ מִן־הַיַּ֖יִן וַיִּשְׁכָּ֑ר וַיִּתְגַּ֖ל בְּתֹ֥וךְ אהלה׃
text-orig-plain           וישׁת מן־היין וישׁכר ויתגל בתוך אהלה׃
text-phono-full           wayyˌēšt min-hayyˌayin wayyiškˈār wayyiṯgˌal bᵊṯˌôḵ *ʔohᵒlˈô .
text-trans-full           WA-J.;71C:T.: MIN&HA-J.A73JIN WA-J.IC:K.@92R WA-J.IT:G.A73L B.:-TO71WK: >@H:@LO75W00
text-trans-full-ketiv     WA-J.;71C:T.: MIN&HA-J.A73JIN WA-J.IC:K.@92R WA-J.IT:G.A73L B.:-TO71WK: *>HLH00
text-trans-plain          WJCT MN&HJJN WJCKR WJTGL BTWK >HLH00
We have not talked about edges much. If the nodes correspond to the rows in the big spreadsheet, the edges point from one row to another.
One edge we have encountered: the special feature oslots. Each non-slot node is linked by oslots to all of its slot nodes.
An edge is really a feature as well. Whereas a node feature is a column of information, one cell per node, an edge feature is also a column of information, one cell per pair of nodes.
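To make that concrete, you can inspect the oslots edge directly; E.oslots.s(n) is the same call we used above to show first and last slots (a sketch):
firstPhrase = F.otype.s("phrase")[0]
E.oslots.s(firstPhrase)  # the tuple of word (slot) nodes this phrase is linked to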
Linguists use more relationships between textual objects, for example: linguistic dependency. In the BHSA all cases of linguistic dependency are coded in the edge feature mother.
Let us do a few basic enquiries on an edge feature: mother.
We count how many mothers nodes can have (it turns out to be 0 or 1).
We walk through all nodes and per node we retrieve the mother nodes, and we store the lengths (if non-zero) in a dictionary (motherLen).
We see that nodes have at most one mother.
We also count the inverse relationship: daughters.
A.indent(reset=True)
A.info("Counting mothers")
motherLen = {}
daughterLen = {}
for c in N.walk():
    lms = E.mother.f(c) or []
    lds = E.mother.t(c) or []
    nms = len(lms)
    nds = len(lds)
    if nms:
        motherLen[c] = nms
    if nds:
        daughterLen[c] = nds
A.info("{} nodes have mothers".format(len(motherLen)))
A.info("{} nodes have daughters".format(len(daughterLen)))
motherCount = collections.Counter()
daughterCount = collections.Counter()
for (n, lm) in motherLen.items():
    motherCount[lm] += 1
for (n, ld) in daughterLen.items():
    daughterCount[ld] += 1
print("mothers", motherCount)
print("daughters", daughterCount)
0.00s Counting mothers
0.73s 182269 nodes have mothers
0.73s 144112 nodes have daughters
mothers Counter({1: 182269})
daughters Counter({1: 117986, 2: 17370, 3: 6284, 4: 1851, 5: 470, 6: 125, 7: 21, 8: 5})
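If you want to look at an individual case: E.mother.f(n) gives the node(s) that n points to, and E.mother.t(n) gives the node(s) that point to n (a sketch; the clause picked here is arbitrary):
c = F.otype.s("clause")[100]
E.mother.f(c)  # the mother of this clause, if any
E.mother.t(c)  # its daughters, if any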
Text-Fabric pre-computes data for you, so that it can be loaded faster. If the original data is updated, Text-Fabric detects it, and will recompute that data.
But there are cases, e.g. when the algorithms of Text-Fabric have changed while the data has not, in which you might want to clear the cache of precomputed results.
There are two ways to do that:
go to the .tf directory of your dataset and remove all .tfx files in it; this might be a bit awkward to do, because the .tf directory is hidden on Unix-like systems;
or run TF.clearCache(), which does exactly the same.
, which does exactly the same.It is not handy to execute the following cell all the time, that's why I have commented it out. So if you really want to clear the cache, remove the comment sign below.
# TF.clearCache()
By now you have an impression how to compute around in the Hebrew Bible. While this is still the beginning, I hope you already sense the power of unlimited programmatic access to all the bits and bytes in the data set.
Here are a few directions for unleashing that power.
CC-BY Dirk Roorda