You might want to consider the start of this tutorial.

Annotation outside TF

Task:

  • prepare a text file based on TF data
  • annotate the text file by assigning values to pieces of text
  • generate TF features based on these annotations

We use a device in Text-Fabric that has been developed for this kind of round-trip: the Recorder.

In [1]:
%load_ext autoreload
%autoreload 2

Incantation

The ins and outs of installing Text-Fabric, getting the corpus, and initializing a notebook are explained in the start tutorial.

In [2]:
from tf.app import use
from tf.convert.recorder import Recorder
In [3]:
A = use("etcbc/bhsa", hoist=globals())
TF-app: ~/text-fabric-data/etcbc/bhsa/app
data: ~/text-fabric-data/etcbc/bhsa/tf/2021
data: ~/text-fabric-data/etcbc/phono/tf/2021
data: ~/text-fabric-data/etcbc/parallels/tf/2021
This is Text-Fabric 9.2.3
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

122 features found and 0 ignored
Text-Fabric: Text-Fabric API 9.2.3, etcbc/bhsa/app v3, Search Reference
Data: BHSA, Character table, Feature docs
Features:
Parallel Passages: links between similar passages
BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis: book, chapter and verse labels; word and lexeme forms in transliteration and in Hebrew; morphology (gn, nu, ps, vs, vt, nme, pfm, prs, uvf, vbe, vbs); part-of-speech (sp, pdp, ls); noun state (st); phrase and clause features (det, typ, txt, tab); lexeme frequency and rank; ketiv/qere corrections; linguistic dependency between textual objects
Phonetic Transcriptions: phonological transcription of words and of interword material
Each feature carries metadata: author, dataset, datasetName, dateWritten, encoders, provenance, version (2021), website (https://shebanq.ancient-data.org), writtenBy (Text-Fabric).
Text-Fabric API: names N F E L T S C TF directly usable

We work with Genesis 1 (in fact, only the first 10 clause atoms).

In [4]:
gen1 = T.nodeFromSection(("Genesis", 1))

We prepare our portion of text for annotation outside TF.

We need to produce a text file and remember the positions of the relevant nodes in that text file.

The Recorder is a new part of TF, still in development, that lets you build a string from nodes while remembering the positions of those nodes in the string. You may add any kind of material in between the texts of the nodes, and it is up to you how you represent the nodes.
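To make the idea concrete, here is a minimal sketch of the bookkeeping in plain Python. It is not the TF implementation: the class `ToyRecorder` and the node numbers 100, 200 and 201 are made up for illustration, and the real Recorder may differ in its details.

```python
class ToyRecorder:
    """Illustrative stand-in for the idea behind the Recorder (not TF code)."""

    def __init__(self):
        self.chunks = []          # pieces of text, in the order they were added
        self.active = set()       # nodes that are currently started and not yet ended
        self.node_positions = []  # one frozenset of nodes per character of text

    def start(self, node):
        self.active.add(node)

    def end(self, node):
        self.active.discard(node)

    def add(self, string):
        # every character of the string belongs to all currently active nodes
        self.chunks.append(string)
        current = frozenset(self.active)
        self.node_positions.extend(current for _ in string)

    def text(self):
        return "".join(self.chunks)

    def positions(self):
        return list(self.node_positions)


toy = ToyRecorder()
toy.add("label ")  # belongs to no node
toy.start(100)     # hypothetical clause atom node
toy.start(200)     # hypothetical phrase atom node
toy.add("BR> ")
toy.end(200)
toy.start(201)     # a second hypothetical phrase atom node
toy.add(">LHJM ")
toy.end(201)
toy.end(100)
toy.add("\n")      # belongs to no node

print(repr(toy.text()))
print(toy.positions()[6])
```

The point of the design is that the text and the position list always have the same length, so any character offset in the exported file can be looked up directly.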

We start a recorder.

In [5]:
rec = Recorder()

We can add strings to the recorder, and we can tell nodes to start and to stop.

We add clause atoms and phrase atoms to the recorder.

In [6]:
LIMIT = 10

for (i, cla) in enumerate(L.d(gen1, otype="clause_atom")):
    if i >= LIMIT:  # only first ten clause atoms
        break

    # we want a label in front of each clause atom
    label = "{} {}:{}".format(*T.sectionFromNode(cla))
    rec.add(f"{label}@{i} ")

    # we start a clause atom node:
    #   until we end this node, all text that we add counts as material for this clause atom
    rec.start(cla)

    for pa in L.d(cla, otype="phrase_atom"):
        # we start a phrase atom node
        #   until we end this node, all text that we add also counts as material for this phrase atom
        rec.start(pa)

        # we add text, it belongs to the current clause atom and to the current phrase atom
        rec.add(T.text(pa, fmt="text-trans-plain"))

        # we end the phrase atom
        rec.end(pa)

    # we end the clause atom
    rec.end(cla)

    # every clause atom on its own line
    #  this newline character does not belong to any node
    rec.add("\n")

We can print the recorded text.

In [7]:
print(rec.text())
Genesis 1:1@0 BR>CJT BR> >LHJM >T HCMJM W>T H>RY00 
Genesis 1:2@1 WH>RY HJTH THW WBHW 
Genesis 1:2@2 WXCK <L&PNJ THWM 
Genesis 1:2@3 WRWX >LHJM MRXPT <L&PNJ HMJM00 
Genesis 1:3@4 WJ>MR >LHJM 
Genesis 1:3@5 JHJ >WR 
Genesis 1:3@6 WJHJ&>WR00 
Genesis 1:4@7 WJR> >LHJM >T&H>WR 
Genesis 1:4@8 KJ&VWB 
Genesis 1:4@9 WJBDL >LHJM BJN H>WR WBJN HXCK00 
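An annotation made outside TF has to refer to character offsets in this text. As a rough sketch, an external tool could locate a piece of text and emit a (start, end, value) triple for it. The value "divine" and the convention that the end offset is inclusive are assumptions made for this illustration only.

```python
import re

# the first two lines of the recorded text, each ending in a space and a newline
recorded = (
    "Genesis 1:1@0 BR>CJT BR> >LHJM >T HCMJM W>T H>RY00 \n"
    "Genesis 1:2@1 WH>RY HJTH THW WBHW \n"
)

# annotate every occurrence of >LHJM with a hypothetical value
annotations = [
    (m.start(), m.end() - 1, "divine")  # end offset is inclusive here
    for m in re.finditer(re.escape(">LHJM"), recorded)
]

for (start, end, value) in annotations:
    print(start, end, value, repr(recorded[start : end + 1]))
```

The label at the start of each line takes up 14 characters, which is why the first node material only begins at offset 14.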

We can print the recorded node positions.

In [8]:
print("\n".join(f"pos {i}: {p}" for (i, p) in enumerate(rec.positions()) if p))
pos 14: frozenset({904776, 515690})
pos 15: frozenset({904776, 515690})
pos 16: frozenset({904776, 515690})
pos 17: frozenset({904776, 515690})
pos 18: frozenset({904776, 515690})
pos 19: frozenset({904776, 515690})
pos 20: frozenset({904776, 515690})
pos 21: frozenset({904777, 515690})
pos 22: frozenset({904777, 515690})
pos 23: frozenset({904777, 515690})
pos 24: frozenset({904777, 515690})
pos 25: frozenset({515690, 904778})
pos 26: frozenset({515690, 904778})
pos 27: frozenset({515690, 904778})
pos 28: frozenset({515690, 904778})
pos 29: frozenset({515690, 904778})
pos 30: frozenset({515690, 904778})
pos 31: frozenset({515690, 904779})
pos 32: frozenset({515690, 904779})
pos 33: frozenset({515690, 904779})
pos 34: frozenset({515690, 904779})
pos 35: frozenset({515690, 904779})
pos 36: frozenset({515690, 904779})
pos 37: frozenset({515690, 904779})
pos 38: frozenset({515690, 904779})
pos 39: frozenset({515690, 904779})
pos 40: frozenset({515690, 904779})
pos 41: frozenset({515690, 904779})
pos 42: frozenset({515690, 904779})
pos 43: frozenset({515690, 904779})
pos 44: frozenset({515690, 904779})
pos 45: frozenset({515690, 904779})
pos 46: frozenset({515690, 904779})
pos 47: frozenset({515690, 904779})
pos 48: frozenset({515690, 904779})
pos 49: frozenset({515690, 904779})
pos 50: frozenset({515690, 904779})
pos 66: frozenset({515691, 904780})
pos 67: frozenset({515691, 904781})
pos 68: frozenset({515691, 904781})
pos 69: frozenset({515691, 904781})
pos 70: frozenset({515691, 904781})
pos 71: frozenset({515691, 904781})
pos 72: frozenset({515691, 904782})
pos 73: frozenset({515691, 904782})
pos 74: frozenset({515691, 904782})
pos 75: frozenset({515691, 904782})
pos 76: frozenset({515691, 904782})
pos 77: frozenset({515691, 904783})
pos 78: frozenset({515691, 904783})
pos 79: frozenset({515691, 904783})
pos 80: frozenset({515691, 904783})
pos 81: frozenset({515691, 904783})
pos 82: frozenset({515691, 904783})
pos 83: frozenset({515691, 904783})
pos 84: frozenset({515691, 904783})
pos 85: frozenset({515691, 904783})
pos 101: frozenset({904784, 515692})
pos 102: frozenset({904785, 515692})
pos 103: frozenset({904785, 515692})
pos 104: frozenset({904785, 515692})
pos 105: frozenset({904785, 515692})
pos 106: frozenset({904786, 515692})
pos 107: frozenset({904786, 515692})
pos 108: frozenset({904786, 515692})
pos 109: frozenset({904786, 515692})
pos 110: frozenset({904786, 515692})
pos 111: frozenset({904786, 515692})
pos 112: frozenset({904786, 515692})
pos 113: frozenset({904786, 515692})
pos 114: frozenset({904786, 515692})
pos 115: frozenset({904786, 515692})
pos 116: frozenset({904786, 515692})
pos 117: frozenset({904786, 515692})
pos 133: frozenset({904787, 515693})
pos 134: frozenset({904788, 515693})
pos 135: frozenset({904788, 515693})
pos 136: frozenset({904788, 515693})
pos 137: frozenset({904788, 515693})
pos 138: frozenset({904788, 515693})
pos 139: frozenset({904788, 515693})
pos 140: frozenset({904788, 515693})
pos 141: frozenset({904788, 515693})
pos 142: frozenset({904788, 515693})
pos 143: frozenset({904788, 515693})
pos 144: frozenset({515693, 904789})
pos 145: frozenset({515693, 904789})
pos 146: frozenset({515693, 904789})
pos 147: frozenset({515693, 904789})
pos 148: frozenset({515693, 904789})
pos 149: frozenset({515693, 904789})
pos 150: frozenset({515693, 904790})
pos 151: frozenset({515693, 904790})
pos 152: frozenset({515693, 904790})
pos 153: frozenset({515693, 904790})
pos 154: frozenset({515693, 904790})
pos 155: frozenset({515693, 904790})
pos 156: frozenset({515693, 904790})
pos 157: frozenset({515693, 904790})
pos 158: frozenset({515693, 904790})
pos 159: frozenset({515693, 904790})
pos 160: frozenset({515693, 904790})
pos 161: frozenset({515693, 904790})
pos 162: frozenset({515693, 904790})
pos 163: frozenset({515693, 904790})
pos 179: frozenset({515694, 904791})
pos 180: frozenset({904792, 515694})
pos 181: frozenset({904792, 515694})
pos 182: frozenset({904792, 515694})
pos 183: frozenset({904792, 515694})
pos 184: frozenset({904792, 515694})
pos 185: frozenset({904793, 515694})
pos 186: frozenset({904793, 515694})
pos 187: frozenset({904793, 515694})
pos 188: frozenset({904793, 515694})
pos 189: frozenset({904793, 515694})
pos 190: frozenset({904793, 515694})
pos 206: frozenset({904794, 515695})
pos 207: frozenset({904794, 515695})
pos 208: frozenset({904794, 515695})
pos 209: frozenset({904794, 515695})
pos 210: frozenset({904795, 515695})
pos 211: frozenset({904795, 515695})
pos 212: frozenset({904795, 515695})
pos 213: frozenset({904795, 515695})
pos 229: frozenset({515696, 904796})
pos 230: frozenset({515696, 904797})
pos 231: frozenset({515696, 904797})
pos 232: frozenset({515696, 904797})
pos 233: frozenset({515696, 904797})
pos 234: frozenset({515696, 904798})
pos 235: frozenset({515696, 904798})
pos 236: frozenset({515696, 904798})
pos 237: frozenset({515696, 904798})
pos 238: frozenset({515696, 904798})
pos 239: frozenset({515696, 904798})
pos 255: frozenset({515697, 904799})
pos 256: frozenset({904800, 515697})
pos 257: frozenset({904800, 515697})
pos 258: frozenset({904800, 515697})
pos 259: frozenset({904800, 515697})
pos 260: frozenset({515697, 904801})
pos 261: frozenset({515697, 904801})
pos 262: frozenset({515697, 904801})
pos 263: frozenset({515697, 904801})
pos 264: frozenset({515697, 904801})
pos 265: frozenset({515697, 904801})
pos 266: frozenset({515697, 904802})
pos 267: frozenset({515697, 904802})
pos 268: frozenset({515697, 904802})
pos 269: frozenset({515697, 904802})
pos 270: frozenset({515697, 904802})
pos 271: frozenset({515697, 904802})
pos 272: frozenset({515697, 904802})
pos 273: frozenset({515697, 904802})
pos 289: frozenset({515698, 904803})
pos 290: frozenset({515698, 904803})
pos 291: frozenset({515698, 904803})
pos 292: frozenset({515698, 904804})
pos 293: frozenset({515698, 904804})
pos 294: frozenset({515698, 904804})
pos 295: frozenset({515698, 904804})
pos 311: frozenset({515699, 904805})
pos 312: frozenset({515699, 904806})
pos 313: frozenset({515699, 904806})
pos 314: frozenset({515699, 904806})
pos 315: frozenset({515699, 904806})
pos 316: frozenset({515699, 904806})
pos 317: frozenset({515699, 904807})
pos 318: frozenset({515699, 904807})
pos 319: frozenset({515699, 904807})
pos 320: frozenset({515699, 904807})
pos 321: frozenset({515699, 904807})
pos 322: frozenset({515699, 904807})
pos 323: frozenset({904808, 515699})
pos 324: frozenset({904808, 515699})
pos 325: frozenset({904808, 515699})
pos 326: frozenset({904808, 515699})
pos 327: frozenset({904808, 515699})
pos 328: frozenset({904808, 515699})
pos 329: frozenset({904808, 515699})
pos 330: frozenset({904808, 515699})
pos 331: frozenset({904808, 515699})
pos 332: frozenset({904808, 515699})
pos 333: frozenset({904808, 515699})
pos 334: frozenset({904808, 515699})
pos 335: frozenset({904808, 515699})
pos 336: frozenset({904808, 515699})
pos 337: frozenset({904808, 515699})
pos 338: frozenset({904808, 515699})
pos 339: frozenset({904808, 515699})
pos 340: frozenset({904808, 515699})
pos 341: frozenset({904808, 515699})
pos 342: frozenset({904808, 515699})
pos 343: frozenset({904808, 515699})

We can write the recorded text and the positions to two files:

In [9]:
rec.write("data/gen1.txt")
In [10]:
!head -n 10 data/gen1.txt
Genesis 1:1@0 BR>CJT BR> >LHJM >T HCMJM W>T H>RY00 
Genesis 1:2@1 WH>RY HJTH THW WBHW 
Genesis 1:2@2 WXCK <L&PNJ THWM 
Genesis 1:2@3 WRWX >LHJM MRXPT <L&PNJ HMJM00 
Genesis 1:3@4 WJ>MR >LHJM 
Genesis 1:3@5 JHJ >WR 
Genesis 1:3@6 WJHJ&>WR00 
Genesis 1:4@7 WJR> >LHJM >T&H>WR 
Genesis 1:4@8 KJ&VWB 
Genesis 1:4@9 WJBDL >LHJM BJN H>WR WBJN HXCK00 
In [11]:
!head -n 30 data/gen1.txt.pos












904776	515690
904776	515690
904776	515690
904776	515690
904776	515690
904776	515690
904776	515690
904777	515690
904777	515690
904777	515690
904777	515690
515690	904778
515690	904778
515690	904778
515690	904778
515690	904778

Now we produce a (fake) annotation file, based on the text.

The file is tab delimited, the columns are:

  • start character position
  • end character position
  • feature1 value
  • feature2 value
  • etc

We annotate as follows:

  • every word that starts with a B gets bword=1
  • every word that ends with a T gets tword=1

We then want every phrase atom that contains a b-word to get bword=1, and likewise every clause atom that contains a b-word. The same holds for tword.

In [12]:
def annotate(fileName):
    # Maps (start, end) character spans to a dict of feature values.
    annotations = {}

    with open(fileName) as fh:
        pos = 0
        for line in fh:
            words = line.split(" ")

            # The first two items form the passage label: skip them,
            # but keep counting their characters and the space after each.
            for word in words[0:2]:
                lWord = len(word)
                pos += lWord + 1
            for word in words[2:]:
                word = word.rstrip("\n")
                lWord = len(word)
                start = pos
                end = pos + lWord - 1
                # Advance past the word and the one separator (space or newline) after it.
                pos += lWord + 1
                if lWord:
                    if word[0] == "B":
                        annotations.setdefault((start, end), {})["bword"] = 1
                    if word[-1] == "T":
                        annotations.setdefault((start, end), {})["tword"] = 1

    # Write one row per annotated span; an empty cell means "no value".
    with open(f"{fileName}.ann", "w") as fh:
        fh.write("start\tend\tbword\ttword\n")
        for ((start, end), features) in annotations.items():
            row = "\t".join(
                str(a)
                for a in (
                    start,
                    end,
                    features.get("bword", ""),
                    features.get("tword", ""),
                )
            )
            fh.write(f"{row}\n")
In [13]:
annotate("data/gen1.txt")

Here is the annotation file.

In [14]:
!cat data/gen1.txt.ann
start	end	bword	tword
14	19	1	1
21	23	1	
31	32		1
40	42		1
144	148		1
323	325	1	
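
As a cross-check on the file format, here is a minimal sketch of how such an annotation file could be read back into Python. The function name `read_annotations` is hypothetical and not part of Text-Fabric; the sketch assumes a header row that names the feature columns, as in the file above, and treats empty cells as "no value".

```python
import csv
import io


def read_annotations(handle):
    """Parse a tab-delimited annotation file into {(start, end): {feature: value}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    feature_names = header[2:]  # columns after start and end are feature names
    annotations = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        start, end = int(row[0]), int(row[1])
        annotations[(start, end)] = {
            name: value
            for (name, value) in zip(feature_names, row[2:])
            if value != ""  # an empty cell means the feature is not set
        }
    return annotations


# The first three rows of the annotation file shown above.
sample = "start\tend\tbword\ttword\n14\t19\t1\t1\n21\t23\t1\t\n31\t32\t\t1\n"
parsed = read_annotations(io.StringIO(sample))
print(parsed)
```

The values stay strings here, which matches what the feature dictionaries further on contain.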

Now we want to feed back these annotations as TF features on phrase_atom and clause_atom nodes.

Our recorder knows how to do that.

In [15]:
features = rec.makeFeatures("data/gen1.txt.ann")

Let's see.

In [16]:
features["bword"]
Out[16]:
{904776: '1', 515690: '1', 904777: '1', 904808: '1', 515699: '1'}
In [17]:
features["tword"]
Out[17]:
{904776: '1', 515690: '1', 904779: '1', 904789: '1', 515693: '1'}

Let's check:

In [18]:
for feat in ("bword", "tword"):
    for n in features[feat]:
        print(f'{feat} {F.otype.v(n)} {n}: {T.text(n, fmt="text-trans-plain")}')
bword phrase_atom 904776: BR>CJT 
bword clause_atom 515690: BR>CJT BR> >LHJM >T HCMJM W>T H>RY00 
bword phrase_atom 904777: BR> 
bword phrase_atom 904808: BJN H>WR WBJN HXCK00 
bword clause_atom 515699: WJBDL >LHJM BJN H>WR WBJN HXCK00 
tword phrase_atom 904776: BR>CJT 
tword clause_atom 515690: BR>CJT BR> >LHJM >T HCMJM W>T H>RY00 
tword phrase_atom 904779: >T HCMJM W>T H>RY00 
tword phrase_atom 904789: MRXPT 
tword clause_atom 515693: WRWX >LHJM MRXPT <L&PNJ HMJM00 

What if we want to turn the annotations into word features instead of features on phrase and clause atoms?

Then we should record the text differently.

We only add slots to the mix.

In [19]:
rec = Recorder()
LIMIT = 10

for (i, cla) in enumerate(L.d(gen1, otype="clause_atom")):
    if i >= LIMIT:
        break
    label = "{} {}:{}".format(*T.sectionFromNode(cla))
    rec.add(f"{label}@{i} ")

    for w in L.d(cla, otype="word"):
        rec.start(w)
        rec.add(T.text(w, fmt="text-trans-plain"))
        rec.end(w)

    rec.add("\n")

It gives the same text:

In [20]:
print(rec.text())
Genesis 1:1@0 BR>CJT BR> >LHJM >T HCMJM W>T H>RY00 
Genesis 1:2@1 WH>RY HJTH THW WBHW 
Genesis 1:2@2 WXCK <L&PNJ THWM 
Genesis 1:2@3 WRWX >LHJM MRXPT <L&PNJ HMJM00 
Genesis 1:3@4 WJ>MR >LHJM 
Genesis 1:3@5 JHJ >WR 
Genesis 1:3@6 WJHJ&>WR00 
Genesis 1:4@7 WJR> >LHJM >T&H>WR 
Genesis 1:4@8 KJ&VWB 
Genesis 1:4@9 WJBDL >LHJM BJN H>WR WBJN HXCK00 

but the node positions are different:

In [21]:
print("\n".join(f"pos {i}: {p}" for (i, p) in enumerate(rec.positions()) if p))
pos 14: frozenset({1})
pos 15: frozenset({2})
pos 16: frozenset({2})
pos 17: frozenset({2})
pos 18: frozenset({2})
pos 19: frozenset({2})
pos 20: frozenset({2})
pos 21: frozenset({3})
pos 22: frozenset({3})
pos 23: frozenset({3})
pos 24: frozenset({3})
pos 25: frozenset({4})
pos 26: frozenset({4})
pos 27: frozenset({4})
pos 28: frozenset({4})
pos 29: frozenset({4})
pos 30: frozenset({4})
pos 31: frozenset({5})
pos 32: frozenset({5})
pos 33: frozenset({5})
pos 34: frozenset({6})
pos 35: frozenset({7})
pos 36: frozenset({7})
pos 37: frozenset({7})
pos 38: frozenset({7})
pos 39: frozenset({7})
pos 40: frozenset({8})
pos 41: frozenset({9})
pos 42: frozenset({9})
pos 43: frozenset({9})
pos 44: frozenset({10})
pos 45: frozenset({11})
pos 46: frozenset({11})
pos 47: frozenset({11})
pos 48: frozenset({11})
pos 49: frozenset({11})
pos 50: frozenset({11})
pos 66: frozenset({12})
pos 67: frozenset({13})
pos 68: frozenset({14})
pos 69: frozenset({14})
pos 70: frozenset({14})
pos 71: frozenset({14})
pos 72: frozenset({15})
pos 73: frozenset({15})
pos 74: frozenset({15})
pos 75: frozenset({15})
pos 76: frozenset({15})
pos 77: frozenset({16})
pos 78: frozenset({16})
pos 79: frozenset({16})
pos 80: frozenset({16})
pos 81: frozenset({17})
pos 82: frozenset({18})
pos 83: frozenset({18})
pos 84: frozenset({18})
pos 85: frozenset({18})
pos 101: frozenset({19})
pos 102: frozenset({20})
pos 103: frozenset({20})
pos 104: frozenset({20})
pos 105: frozenset({20})
pos 106: frozenset({21})
pos 107: frozenset({21})
pos 108: frozenset({21})
pos 109: frozenset({22})
pos 110: frozenset({22})
pos 111: frozenset({22})
pos 112: frozenset({22})
pos 113: frozenset({23})
pos 114: frozenset({23})
pos 115: frozenset({23})
pos 116: frozenset({23})
pos 117: frozenset({23})
pos 133: frozenset({24})
pos 134: frozenset({25})
pos 135: frozenset({25})
pos 136: frozenset({25})
pos 137: frozenset({25})
pos 138: frozenset({26})
pos 139: frozenset({26})
pos 140: frozenset({26})
pos 141: frozenset({26})
pos 142: frozenset({26})
pos 143: frozenset({26})
pos 144: frozenset({27})
pos 145: frozenset({27})
pos 146: frozenset({27})
pos 147: frozenset({27})
pos 148: frozenset({27})
pos 149: frozenset({27})
pos 150: frozenset({28})
pos 151: frozenset({28})
pos 152: frozenset({28})
pos 153: frozenset({29})
pos 154: frozenset({29})
pos 155: frozenset({29})
pos 156: frozenset({29})
pos 157: frozenset({30})
pos 158: frozenset({31})
pos 159: frozenset({31})
pos 160: frozenset({31})
pos 161: frozenset({31})
pos 162: frozenset({31})
pos 163: frozenset({31})
pos 179: frozenset({32})
pos 180: frozenset({33})
pos 181: frozenset({33})
pos 182: frozenset({33})
pos 183: frozenset({33})
pos 184: frozenset({33})
pos 185: frozenset({34})
pos 186: frozenset({34})
pos 187: frozenset({34})
pos 188: frozenset({34})
pos 189: frozenset({34})
pos 190: frozenset({34})
pos 206: frozenset({35})
pos 207: frozenset({35})
pos 208: frozenset({35})
pos 209: frozenset({35})
pos 210: frozenset({36})
pos 211: frozenset({36})
pos 212: frozenset({36})
pos 213: frozenset({36})
pos 229: frozenset({37})
pos 230: frozenset({38})
pos 231: frozenset({38})
pos 232: frozenset({38})
pos 233: frozenset({38})
pos 234: frozenset({39})
pos 235: frozenset({39})
pos 236: frozenset({39})
pos 237: frozenset({39})
pos 238: frozenset({39})
pos 239: frozenset({39})
pos 255: frozenset({40})
pos 256: frozenset({41})
pos 257: frozenset({41})
pos 258: frozenset({41})
pos 259: frozenset({41})
pos 260: frozenset({42})
pos 261: frozenset({42})
pos 262: frozenset({42})
pos 263: frozenset({42})
pos 264: frozenset({42})
pos 265: frozenset({42})
pos 266: frozenset({43})
pos 267: frozenset({43})
pos 268: frozenset({43})
pos 269: frozenset({44})
pos 270: frozenset({45})
pos 271: frozenset({45})
pos 272: frozenset({45})
pos 273: frozenset({45})
pos 289: frozenset({46})
pos 290: frozenset({46})
pos 291: frozenset({46})
pos 292: frozenset({47})
pos 293: frozenset({47})
pos 294: frozenset({47})
pos 295: frozenset({47})
pos 311: frozenset({48})
pos 312: frozenset({49})
pos 313: frozenset({49})
pos 314: frozenset({49})
pos 315: frozenset({49})
pos 316: frozenset({49})
pos 317: frozenset({50})
pos 318: frozenset({50})
pos 319: frozenset({50})
pos 320: frozenset({50})
pos 321: frozenset({50})
pos 322: frozenset({50})
pos 323: frozenset({51})
pos 324: frozenset({51})
pos 325: frozenset({51})
pos 326: frozenset({51})
pos 327: frozenset({52})
pos 328: frozenset({53})
pos 329: frozenset({53})
pos 330: frozenset({53})
pos 331: frozenset({53})
pos 332: frozenset({54})
pos 333: frozenset({55})
pos 334: frozenset({55})
pos 335: frozenset({55})
pos 336: frozenset({55})
pos 337: frozenset({56})
pos 338: frozenset({57})
pos 339: frozenset({57})
pos 340: frozenset({57})
pos 341: frozenset({57})
pos 342: frozenset({57})
pos 343: frozenset({57})
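
To see why the same text can carry different node positions, here is a toy stand-in for the Recorder. It is a simplified sketch, not the actual implementation in `tf.convert.recorder`: the class `ToyRecorder` and its attributes are hypothetical, and only the method names `start`, `end`, `add`, `text` and `positions` mirror the calls used in this notebook. The node numbers and word forms are taken from the first clause atom above.

```python
class ToyRecorder:
    """Hypothetical, simplified stand-in for the Text-Fabric Recorder."""

    def __init__(self):
        self.active = set()        # nodes that are currently "open"
        self.chunks = []           # pieces of recorded text
        self.node_positions = []   # one entry per recorded character

    def start(self, node):
        self.active.add(node)

    def end(self, node):
        self.active.discard(node)

    def add(self, text):
        # Every character of the added text is linked to the open nodes.
        self.chunks.append(text)
        current = frozenset(self.active) if self.active else None
        self.node_positions.extend([current] * len(text))

    def text(self):
        return "".join(self.chunks)

    def positions(self):
        return list(self.node_positions)


label = "Genesis 1:1@0 "
words = {1: "B", 2: "R>CJT ", 3: "BR> "}          # slot -> transliterated form
phrase_atoms = {904776: [1, 2], 904777: [3]}      # phrase atom -> its slots

# Recording 1: only word nodes are active.
rec_words = ToyRecorder()
rec_words.add(label)
for (w, form) in words.items():
    rec_words.start(w)
    rec_words.add(form)
    rec_words.end(w)

# Recording 2: only phrase atom nodes are active.
rec_atoms = ToyRecorder()
rec_atoms.add(label)
for (pa, members) in phrase_atoms.items():
    rec_atoms.start(pa)
    for w in members:
        rec_atoms.add(words[w])
    rec_atoms.end(pa)

print(rec_words.text() == rec_atoms.text())
print(rec_words.positions()[14:21])
print(rec_atoms.positions()[14:21])
```

Both recordings yield the same string, but each character position points to whichever nodes were active when that character was added.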

We have produced the same text, so we can use the earlier annotation file to create word features.

In [22]:
features = rec.makeFeatures("data/gen1.txt.ann")
In [23]:
features["bword"]
Out[23]:
{1: '1', 2: '1', 3: '1', 51: '1'}
In [24]:
features["tword"]
Out[24]:
{1: '1', 2: '1', 5: '1', 8: '1', 9: '1', 27: '1'}

Let's check:

In [25]:
for feat in ("bword", "tword"):
    for n in features[feat]:
        print(f'{feat} {F.otype.v(n)} {n}: {T.text(n, fmt="text-trans-plain")}')
bword word 1: B
bword word 2: R>CJT 
bword word 3: BR> 
bword word 51: BJN 
tword word 1: B
tword word 2: R>CJT 
tword word 5: >T 
tword word 8: W
tword word 9: >T 
tword word 27: MRXPT 

Explanation:

The annotator just looked at the string BR>CJT without knowing that it is two words.

In [26]:
!cat data/gen1.txt.ann
start	end	bword	tword
14	19	1	1
21	23	1	
31	32		1
40	42		1
144	148		1
323	325	1	

So it has annotated pos 14-19 as a bword and as a tword.

But TF knows that 14-19 are slots 1 and 2, so when the annotations are applied, slots 1 and 2 are both set to b-words and t-words.
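
The effect can be reproduced with a small sketch. It assumes, consistent with the results above but not confirmed from the Recorder source, that an annotation on a character span is given to every node linked to any position in that span. The function `apply_annotations` is a hypothetical illustration, not a Text-Fabric API.

```python
def apply_annotations(positions, annotations):
    """Assign each annotation to all nodes found at positions start..end (inclusive)."""
    features = {}
    for ((start, end), values) in annotations.items():
        for pos in range(start, end + 1):
            for node in positions[pos] or ():  # None means: no node at this position
                for (name, value) in values.items():
                    features.setdefault(name, {})[node] = value
    return features


# Positions as in the word-based recording:
# 0-13 label (no nodes), 14 slot 1, 15-20 slot 2, 21-24 slot 3.
positions = (
    [None] * 14
    + [frozenset({1})]
    + [frozenset({2})] * 6
    + [frozenset({3})] * 4
)
annotations = {
    (14, 19): {"bword": "1", "tword": "1"},  # the string BR>CJT
    (21, 23): {"bword": "1"},                # the string BR>
}
word_features = apply_annotations(positions, annotations)
print(word_features)
```

Because the span 14-19 touches positions of both slot 1 and slot 2, both slots receive both features.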

We can remedy the situation by giving the annotator another text, one in which slots are always separated by a space.

Let's do that by adding a space after every slot, so that real words end up separated by two spaces.

In [27]:
rec = Recorder()
LIMIT = 10

for (i, cla) in enumerate(L.d(gen1, otype="clause_atom")):
    if i >= LIMIT:
        break
    label = "{} {}:{}".format(*T.sectionFromNode(cla))
    rec.add(f"{label}@{i} ")

    for w in L.d(cla, otype="word"):
        rec.start(w)
        rec.add(T.text(w, fmt="text-trans-plain") + " ")
        rec.end(w)

    rec.add("\n")

Here is the text:

In [28]:
print(rec.text())
Genesis 1:1@0 B R>CJT  BR>  >LHJM  >T  H CMJM  W >T  H >RY00  
Genesis 1:2@1 W H >RY  HJTH  THW  W BHW  
Genesis 1:2@2 W XCK  <L& PNJ  THWM  
Genesis 1:2@3 W RWX  >LHJM  MRXPT  <L& PNJ  H MJM00  
Genesis 1:3@4 W J>MR  >LHJM  
Genesis 1:3@5 JHJ  >WR  
Genesis 1:3@6 W JHJ& >WR00  
Genesis 1:4@7 W JR>  >LHJM  >T& H >WR  
Genesis 1:4@8 KJ& VWB  
Genesis 1:4@9 W JBDL  >LHJM  BJN  H >WR  W BJN  H XCK00  

We write the text to file.

In [29]:
rec.write("data/gen1wx.txt")

We run our annotator again, because we have a different text:

In [30]:
annotate("data/gen1wx.txt")

Here is the new annotation file.

In [31]:
!cat data/gen1wx.txt.ann
start	end	bword	tword
14	14	1	
16	20		1
23	25	1	
35	36		1
49	50		1
99	101	1	
170	174		1
373	375	1	
387	389	1	
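
The first offsets in this file can be checked by hand. The sketch below uses a regular expression instead of the `split(" ")` logic of `annotate`, so it is an alternative technique rather than the function used above. The helper names `token_spans` and `flags` are hypothetical, and the sample line is the beginning of the first recorded line, including its label.

```python
import re


def token_spans(line, skip=2):
    """Return (start, end, token) per whitespace-delimited token, skipping the label."""
    spans = [(m.start(), m.end() - 1, m.group()) for m in re.finditer(r"\S+", line)]
    return spans[skip:]  # the first two tokens are the passage label


def flags(token):
    """Apply the two toy annotation rules to a single token."""
    return {"bword": token.startswith("B"), "tword": token.endswith("T")}


line = "Genesis 1:1@0 B R>CJT  BR>  "
for (start, end, token) in token_spans(line):
    print(start, end, token, flags(token))
```

With every slot followed by its own space, the prefix B and the rest of the word R>CJT become separate tokens with their own spans, so each annotation lands on exactly one slot.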

The features are no surprise:

In [32]:
features = rec.makeFeatures("data/gen1wx.txt.ann")
In [33]:
features["bword"]
Out[33]:
{1: '1', 3: '1', 18: '1', 51: '1', 55: '1'}
In [34]:
features["tword"]
Out[34]:
{2: '1', 5: '1', 9: '1', 27: '1'}

Let's check:

In [35]:
for feat in ("bword", "tword"):
    for n in features[feat]:
        print(f'{feat} {F.otype.v(n)} {n}: {T.text(n, fmt="text-trans-plain")}')
bword word 1: B
bword word 3: BR> 
bword word 18: BHW 
bword word 51: BJN 
bword word 55: BJN 
tword word 2: R>CJT 
tword word 5: >T 
tword word 9: >T 
tword word 27: MRXPT 

All steps

  • start: your first step in mastering the Bible computationally
  • display: become an expert in creating pretty displays of your text structures
  • search: turbo charge your hand-coding with search templates
  • exportExcel: make tailor-made spreadsheets out of your results
  • share: draw in other people's data and let them use yours
  • export: export your dataset as an Emdros database
  • annotate: annotate plain text by means of other tools and import the annotations as TF features
  • volumes: work with selected books only
  • trees: work with the BHSA data as syntax trees

CC-BY Dirk Roorda