Tutorial

This notebook gets you started with using Text-Fabric for coding in the Dead Sea Scrolls.

Familiarity with the underlying data model is recommended.

Cookbook

This tutorial and its sister tutorials are meant to showcase most of the things TF can do.

But we also have a cookbook with a set of focused recipes on tricky things.

Installing Text-Fabric

See here

Tip

If you start computing with this tutorial, first copy its parent directory to somewhere else, outside your repository. If you pull changes from the repository later, your work will not be overwritten. Where you put your tutorial directory is up to you. It will work from any directory.

Data

Text-Fabric will fetch the data set for you from GitHub and check for updates.

The data will be stored in the text-fabric-data directory in your home directory.
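As a rough sketch of where that data ends up: the paths reported when the corpus is loaded later in this notebook have the form ~/text-fabric-data/etcbc/dss/tf/0.9. The helper `expected_data_dir` below is hypothetical and not part of Text-Fabric; it only rebuilds that path so you can check whether the data is already on your machine.

```python
from pathlib import Path


def expected_data_dir(org, repo, version, base=None):
    """Hypothetical helper (not part of Text-Fabric).

    Rebuilds the directory layout seen in the load messages of this
    notebook: <base>/<org>/<repo>/tf/<version>.
    """
    base = Path.home() / "text-fabric-data" if base is None else Path(base)
    return base / org / repo / "tf" / version


dss_dir = expected_data_dir("etcbc", "dss", "0.9")
print(dss_dir)           # the directory the DSS features are expected in
print(dss_dir.exists())  # whether the data has been downloaded yet
```

If the directory does not exist yet, nothing is wrong: the data is fetched the first time the corpus is loaded.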

Features

The data of the corpus is organized in features. They are columns of data. Think of the corpus as a gigantic spreadsheet, where row 1 corresponds to the first sign, row 2 to the second sign, and so on, for all ~ 1.5 M signs, followed by ~ 500 K word nodes and yet another 200 K nodes of other types.

The information about which reading each sign has constitutes one column in that spreadsheet. The DSS corpus contains more than 50 columns.

Instead of putting that information in one big table, the data is organized in separate columns. We call those columns features.
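The column idea can be illustrated with a toy model in plain Python. This is only a sketch: the node numbers, the feature names `reading` and `flag`, and all values are made up, and Text-Fabric's own storage is not shown here. Each feature is a separate mapping from node number to value, and a "row" of the spreadsheet is assembled on demand.

```python
# Toy illustration only: three made-up nodes with made-up values.
# Each feature is its own column: a mapping from node number to value.
reading = {1: "a", 2: "b", 3: "c"}  # hypothetical feature, defined for every node
flag = {2: 1}                       # hypothetical sparse feature, defined for node 2 only


def row(node, **features):
    """Assemble one 'spreadsheet row' for a node from separate columns.

    A node without a value in some column gets None for that column.
    """
    return {name: column.get(node) for name, column in features.items()}


for node in (1, 2, 3):
    print(node, row(node, reading=reading, flag=flag))
```

Keeping columns separate means a sparse feature only stores the nodes that actually have a value, and a new feature can be added without touching the existing ones.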

In [1]:
%load_ext autoreload
%autoreload 2
In [2]:
import os
import collections

Incantation

The simplest way to get going is by this incantation:

In [6]:
from tf.app import use
In [5]:
A = use("etcbc/dss", hoist=globals())
TF-app: ~/text-fabric-data/etcbc/dss/app
data: ~/text-fabric-data/etcbc/dss/tf/0.9
data: ~/text-fabric-data/etcbc/dss/parallels/tf/0.9
This is Text-Fabric 9.3.1
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

67 features found and 1 ignored
Text-Fabric: Text-Fabric API 9.3.1, etcbc/dss/app v3, Search Reference
Data: DSS, Character table, Feature docs
Features:
Parallel Passages
sim
int
similarity between lines, as a percentage of the common material wrt the combined material
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Dirk Roorda
createdDate:
2019-05-09
dateWritten:
2019-06-11T14:51:21Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
sourceCreatedBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
sourceCreatedDate:
2015
sourceDescription:
Dead Sea Scrolls: biblical and non-biblical scrolls
writtenBy:
Text-Fabric
Dead Sea Scrolls
str
space behind the word, if any
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:55Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
(space)
writtenBy:
Text-Fabric
alt
int
alternative reading
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:56Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
1
writtenBy:
Text-Fabric
int
whether we are in biblical material or not
acronym:
dss
applies:
scroll fragment line cluster word
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:56Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
remark:
for lines it means that the material is taken from the bib source while there is also material for this line in the nonbib source. But the nonbib material is either identical or virtually absent, in which case the bib material is a reconstruction and marked as such.
source:
Martin Abegg's data files, personal communication
values:
1=biblical, 2=biblical but also with nonbiblical material
writtenBy:
Text-Fabric
str
acronym of the book in which the word occurs
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:56Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
label of the chapter in which the word occurs
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:56Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
cl
str
class (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:56Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
advb, art, artp, card, cmn, conj, gent, indp, intj, intr, mult, nega, objm, ord, prep, prp, rela, unknown
writtenBy:
Text-Fabric
cl2
str
class (for part 2) (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:57Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
d, h, n, unknown
writtenBy:
Text-Fabric
cor
int
correction made by an ancient or modern editor
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:57Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
1 = modern, 2 = ancient, 3 = ancient supralinear
writtenBy:
Text-Fabric
str
label of a fragment of a scroll
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:57Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
full transcription (Unicode) of a word including flags and brackets
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:57Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
full transcription (ETCBC transliteration) of a word including flags and brackets
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:57Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
full transcription (original source) of a word including flags and brackets
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:58Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
Dead Sea Scrolls: additions based on BHSA and machine learning
acronym:
dss-additions
convertedBy:
Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Martijn Naaijer, ETCBC
createdDate:
2020
dateWritten:
2020-12-29T15:01:58Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martijn Naaijer's data files, personal communication
writtenBy:
Text-Fabric
str
representation (Unicode) of a lexeme leaving out non-letters
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:58Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
representation (ETCBC transliteration) of a lexeme leaving out non-letters
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:59Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
representation (original source) of a lexeme leaving out non-letters
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:01:59Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
representation (Unicode) of a word or sign
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:00Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
representation (ETCBC transliteration) of a word or sign
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:02Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
representation (original source) of a word or sign
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:04Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
gn
str
gender (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:06Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
b, c, f, m, unknown
writtenBy:
Text-Fabric
gn2
str
gender (for part 2) (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:06Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
c, f, m, unknown
writtenBy:
Text-Fabric
gn3
str
gender (for part 3) (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:06Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
c, f, m
writtenBy:
Text-Fabric
str
Dead Sea Scrolls: additions based on BHSA and machine learning
acronym:
dss-additions
convertedBy:
Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Martijn Naaijer, ETCBC
createdDate:
2020
dateWritten:
2020-12-29T15:02:06Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martijn Naaijer's data files, personal communication
writtenBy:
Text-Fabric
str
label of the half-verse in which the word occurs
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:06Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
int
interlinear material, the value indicates the sequence number of the interlinear line
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:06Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
language of a word or sign, only if it is not Hebrew
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:06Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
g=greek, a=aramaic
writtenBy:
Text-Fabric
lex
str
representation (Unicode) of a lexeme
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:06Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
Dead Sea Scrolls: additions based on BHSA and machine learning
acronym:
dss-additions
convertedBy:
Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Martijn Naaijer, ETCBC
createdDate:
2020
dateWritten:
2020-12-29T15:02:07Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martijn Naaijer's data files, personal communication
writtenBy:
Text-Fabric
str
representation (ETCBC transliteration) of a lexeme
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:07Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
representation (original source) of a lexeme
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:08Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
label of a line of a fragment of a scroll
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:08Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
md
str
mood (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:08Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
coho, cons, juss, unknown
writtenBy:
Text-Fabric
str
errors in parsing the morphology tag
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:08Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
morphological tag (by Abegg)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:08Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
nr
str
Dead Sea Scrolls: additions based on BHSA and machine learning
acronym:
dss-additions
convertedBy:
Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Martijn Naaijer, ETCBC
createdDate:
2020
dateWritten:
2020-12-29T15:02:09Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martijn Naaijer's data files, personal communication
writtenBy:
Text-Fabric
nu
str
number (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:09Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
d, p, s, unknown
writtenBy:
Text-Fabric
nu2
str
number (for part 2) (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:09Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
p, s, unknown
writtenBy:
Text-Fabric
nu3
str
number (for part 3) (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:09Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
s
writtenBy:
Text-Fabric
str
Dead Sea Scrolls: additions based on BHSA and machine learning
acronym:
dss-additions
convertedBy:
Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Martijn Naaijer, ETCBC
createdDate:
2020
dateWritten:
2020-12-29T15:02:09Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martijn Naaijer's data files, personal communication
writtenBy:
Text-Fabric
str
Dead Sea Scrolls: biblical and non-biblical scrolls
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:10Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
ps
str
person (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:10Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
1, 2, 3, unknown
writtenBy:
Text-Fabric
ps2
str
person (for part 2) (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:10Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
1, 2, 3
writtenBy:
Text-Fabric
ps3
str
person (for part 3) (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:10Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
1, 2, 3
writtenBy:
Text-Fabric
str
Dead Sea Scrolls: additions based on BHSA and machine learning
acronym:
dss-additions
convertedBy:
Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Martijn Naaijer, ETCBC
createdDate:
2020
dateWritten:
2020-12-29T15:02:10Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martijn Naaijer's data files, personal communication
writtenBy:
Text-Fabric
str
trailing punctuation (Unicode) of a word
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:11Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
trailing punctuation (ETCBC transliteration) of a word
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:11Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
str
trailing punctuation (original source) of a word
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:11Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
rec
int
reconstructed by a modern editor
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:11Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
1
writtenBy:
Text-Fabric
rem
int
removed by an ancient or modern editor
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:12Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
1 = modern, 2 = ancient
writtenBy:
Text-Fabric
str
script in which the word or sign is written if it is not Hebrew
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:12Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
paleohebrew greekcapital
writtenBy:
Text-Fabric
str
acronym of a scroll
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:12Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
sp
str
part of speech (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:12Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
adjv, numr, pron, ptcl, subs, suff, unknown, verb
writtenBy:
Text-Fabric
str
Dead Sea Scrolls: additions based on BHSA and machine learning
acronym:
dss-additions
convertedBy:
Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Martijn Naaijer, ETCBC
createdDate:
2020
dateWritten:
2020-12-29T15:02:12Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martijn Naaijer's data files, personal communication
writtenBy:
Text-Fabric
int
the line number of the word in the source data file
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:12Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
st
str
state (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:13Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
a, c, d, unknown
writtenBy:
Text-Fabric
str
type of sign or cluster
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:13Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
unc
int
uncertain material in various degrees: higher degree is less certain
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:15Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
1 2 3 4
writtenBy:
Text-Fabric
vac
int
empty, unwritten space
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:15Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
1
writtenBy:
Text-Fabric
str
label of the verse in which the word occurs
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:15Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
vs
str
verbal stem (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:15Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
aphel, apoel, haphel, hifil, hishtafel, hishtaphel, hithaphel, hithpaal, hithpeel, hithpolel, hitopel, hitpael, hitpalpel, hitpoel, hofal, hophal, hotpaal, hpealal, ishtaphel, ithpaal, ithpeel, ithpoel, nifal, nitpael, pael, palel, passive, peal, peil, piel, pilpel, poal, poel, polal, polel, pual, pulal, qal, shaphel, tifil, unknown
writtenBy:
Text-Fabric
str
Dead Sea Scrolls: additions based on BHSA and machine learning
acronym:
dss-additions
convertedBy:
Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Martijn Naaijer, ETCBC
createdDate:
2020
dateWritten:
2020-12-29T15:02:15Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martijn Naaijer's data files, personal communication
writtenBy:
Text-Fabric
vt
str
verbal tense/aspect (morphology tag)
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:16Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
values:
impf, impv, infa, infc, perf, ptca, ptcp, unknown, wayy
writtenBy:
Text-Fabric
str
Dead Sea Scrolls: additions based on BHSA and machine learning
acronym:
dss-additions
convertedBy:
Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Martijn Naaijer, ETCBC
createdDate:
2020
dateWritten:
2020-12-29T15:02:16Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martijn Naaijer's data files, personal communication
writtenBy:
Text-Fabric
occ
none
edge feature from a lexeme to its occurrences
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:17Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
none
Dead Sea Scrolls: biblical and non-biblical scrolls
acronym:
dss
convertedBy:
Jarod Jacobs, Martijn Naaijer and Dirk Roorda
createdBy:
Martin G. Abegg, Jr., James E. Bowley, and Edward M. Cook
createdDate:
2015
dateWritten:
2020-12-29T15:02:17Z
license:
Creative Commons Attribution-NonCommercial 4.0 International License
licenseUrl:
http://creativecommons.org/licenses/by-nc/4.0/
source:
Martin Abegg's data files, personal communication
writtenBy:
Text-Fabric
Text-Fabric API: names N F E L T S C TF directly usable

You can see which features have been loaded, and if you click on a feature name, you find its documentation. If you hover over a name, you see where the feature is located on your system.

API

The result of the incantation is that we have a bunch of special variables at our disposal that give us access to the text and data of the corpus.

At this point it is helpful to throw a quick glance at the text-fabric API documentation (see the links under API Members above).

The most essential thing for now is that we can use F to access the data in the features we've loaded. But there is more, such as N, which helps us to walk over the text, as we see in a minute.

The API members above show you exactly which new names have been inserted in your namespace. If you click on these names, you go to the API documentation for them.

Text-Fabric contains a flexible search engine that works not only for the data of this corpus, but also for other corpora and for data that you add to corpora.

Search is the quickest way to come up-to-speed with your data, without too much programming.

Jump to the dedicated search tutorial first, to whet your appetite.

The real power of search lies in the fact that it is integrated in a programming environment. You can use programming to:

  • compose dynamic queries
  • process query results

Therefore, the rest of this tutorial is still important when you want to tap that power. If you continue here, you learn all the basics of data-navigation with Text-Fabric.
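As a small illustration of composing dynamic queries, here is a minimal sketch that builds search templates as plain strings. The helper build_query is hypothetical (it is not part of Text-Fabric); the feature names sp and vs and their values are taken from this corpus. The resulting strings could then be passed to A.search(), which is not called here, so that the sketch runs without the corpus.

```python
def build_query(otype, **features):
    """Build a one-line search template: a node type plus feature constraints."""
    constraints = " ".join(f"{name}={value}" for (name, value) in features.items())
    return f"{otype} {constraints}".strip()


# one template per verbal stem, generated in a loop
queries = {
    stem: build_query("word", sp="verb", vs=stem) for stem in ("qal", "piel", "hifil")
}

for (stem, query) in queries.items():
    print(f"{stem:<6} -> {query}")
```

Because the templates are ordinary strings, any amount of programming can go into producing them before the search engine sees them.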

Counting

In order to get acquainted with the data, we start with the simple task of counting.

Count all nodes

We use the N.walk() generator to walk through the nodes.

We compared the TF data to a gigantic spreadsheet, where the rows correspond to the signs. In Text-Fabric, we call the rows slots, because they are the textual positions that can be filled with signs.

We also mentioned that there are other textual objects, such as words, clusters, lines, fragments and scrolls. They also correspond to rows in the big spreadsheet.

In Text-Fabric we call all these rows nodes, and the N.walk() generator carries us through those nodes in the textual order.

Just one extra thing: the info statements generate timed messages. If you use them instead of print you'll get a sense of the amount of time that the various processing steps typically need.

In [5]:
A.indent(reset=True)
A.info("Counting nodes ...")

i = 0
for n in N.walk():
    i += 1

A.info("{} nodes".format(i))
  0.00s Counting nodes ...
  0.26s 2108303 nodes

Here you see it: over 2M nodes.

What are those nodes?

Every node has a type, like sign, line, or fragment. But what exactly are they?

Text-Fabric has two special features, otype and oslots, that must occur in every Text-Fabric data set. otype tells you the type of each node, and you can also ask it for the number of slots in the text. oslots tells you to which slots each node is linked.

Here we go!

In [6]:
F.otype.slotType
Out[6]:
'sign'
In [7]:
F.otype.maxSlot
Out[7]:
1430241
In [8]:
F.otype.maxNode
Out[8]:
2108303
In [9]:
F.otype.all
Out[9]:
('scroll',
 'lex',
 'fragment',
 'line',
 'clause',
 'cluster',
 'phrase',
 'word',
 'sign')
In [10]:
C.levels.data
Out[10]:
(('scroll', 1428.8121878121879, 1605868, 1606868),
 ('lex', 129.1396172248804, 1542523, 1552972),
 ('fragment', 127.90565194061885, 1531341, 1542522),
 ('line', 27.03924756593251, 1552973, 1605867),
 ('clause', 12.848, 2107864, 2107988),
 ('cluster', 6.678582379647672, 1430242, 1531340),
 ('phrase', 5.098412698412698, 2107989, 2108303),
 ('word', 2.814359424744758, 1606869, 2107863),
 ('sign', 1, 1, 1430241))

This is interesting: above you see all the textual objects, with the average size of their objects, the node where they start, and the node where they end.
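Since the nodes of each type occupy one contiguous range of node numbers, the start and end nodes already tell us how many nodes each type has. The following sketch does that arithmetic on a literal copy of the output above, so it runs without the corpus.

```python
# (type, average size in slots, first node, last node), copied from C.levels.data above
levels = (
    ("scroll", 1428.8121878121879, 1605868, 1606868),
    ("lex", 129.1396172248804, 1542523, 1552972),
    ("fragment", 127.90565194061885, 1531341, 1542522),
    ("line", 27.03924756593251, 1552973, 1605867),
    ("clause", 12.848, 2107864, 2107988),
    ("cluster", 6.678582379647672, 1430242, 1531340),
    ("phrase", 5.098412698412698, 2107989, 2108303),
    ("word", 2.814359424744758, 1606869, 2107863),
    ("sign", 1, 1, 1430241),
)

# both ends of a range are inclusive, hence the + 1
counts = {otype: end - start + 1 for (otype, _avg, start, end) in levels}

for (otype, amount) in counts.items():
    print(f"{amount:>7} {otype}s")
print(f"{sum(counts.values()):>7} nodes in total")
```

The total should agree with F.otype.maxNode.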

Count individual object types

This is an intuitive way to count the number of nodes in each type. Note in passing, how we use the indent in conjunction with info to produce neat timed and indented progress messages.

In [11]:
A.indent(reset=True)
A.info("counting objects ...")

for otype in F.otype.all:
    i = 0

    A.indent(level=1, reset=True)

    for n in F.otype.s(otype):
        i += 1

    A.info("{:>7} {}s".format(i, otype))

A.indent(level=0)
A.info("Done")
  0.00s counting objects ...
   |     0.00s    1001 scrolls
   |     0.00s   10450 lexs
   |     0.00s   11182 fragments
   |     0.01s   52895 lines
   |     0.00s     125 clauses
   |     0.01s  101099 clusters
   |     0.00s     315 phrases
   |     0.05s  500995 words
   |     0.13s 1430241 signs
  0.20s Done

Viewing textual objects

You can use the A API (the extra power) to display the text of the scrolls.

See the display tutorial.

Feature statistics

F gives access to all features. Every feature has a method freqList() to generate a frequency list of its values, higher frequencies first. Here are the parts of speech:

In [12]:
F.sp.freqList()
Out[12]:
(('ptcl', 154464),
 ('subs', 108562),
 ('unknown', 80256),
 ('verb', 58873),
 ('suff', 45747),
 ('adjv', 10633),
 ('numr', 6526),
 ('pron', 5784))
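To get a feel for what a frequency list is, here is a minimal imitation with collections.Counter on a handful of invented values. The function freq_list is a hypothetical stand-in, not the Text-Fabric implementation, and the alphabetical tie-breaking is a choice made for this sketch.

```python
import collections

# invented part-of-speech values for a handful of words
values = ["subs", "verb", "ptcl", "subs", "ptcl", "ptcl", "verb", "subs", "ptcl"]


def freq_list(items):
    """Return (value, frequency) pairs, higher frequencies first."""
    counts = collections.Counter(items)
    # ties are broken alphabetically (an assumption of this sketch)
    return tuple(sorted(counts.items(), key=lambda x: (-x[1], x[0])))


print(freq_list(values))
```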

Signs, words and clusters have types. We can count them separately:

In [13]:
F.type.freqList("cluster")
Out[13]:
(('rec', 93733),
 ('vac', 3522),
 ('cor3', 1582),
 ('unc2', 906),
 ('rem2', 706),
 ('alt', 333),
 ('cor2', 147),
 ('cor', 95),
 ('rem', 75))
In [14]:
F.type.freqList("word")
Out[14]:
(('glyph', 470605), ('punct', 29927), ('numr', 463))
In [15]:
F.type.freqList("sign")
Out[15]:
(('cons', 1156780),
 ('empty', 98407),
 ('missing', 53864),
 ('sep', 46453),
 ('punct', 29927),
 ('unc', 27168),
 ('term', 15532),
 ('numr', 2029),
 ('add', 65),
 ('foreign', 16))

Word matters

Top 20 frequent words

We represent words by their essential symbols, collected in the feature glyph (which also exists for signs).

In [16]:
for (w, amount) in F.glyph.freqList("word")[0:20]:
    print(f"{amount:>5} {w}")
45393 ו
20491 ה
19378 ל
18225 ב
 6389 את
 5863 מ
 4894 אשר
 4789 יהוה
 4355 א
 4236 כול
 4185 על
 4172 אל
 3262 כי
 3091 כ
 3005 לא
 2841 כל
 2424 לוא
 1938 ארץ
 1829 ישראל
 1653 יום

Word distribution

Let's do a bit more fancy word stuff.

Hapaxes

A hapax can be found by picking the words with frequency 1. Since we have lexeme information in this corpus, let's use it to determine hapaxes.

We print 20 hapaxes.

In [17]:
hapaxes1 = sorted(lx for (lx, amount) in F.lex.freqList("word") if amount == 1)
len(hapaxes1)
Out[17]:
3813
In [18]:
for lx in hapaxes1[0:20]:
    print(lx)
 #  #  #  #  # 
 #  #  #  #  #  #  #  #  # 
 #  #  #  #  # ות
 #  #  #  #  # ל #  #  # 
 #  #  #  #  # ם
 #  #  #  # ב
 #  #  #  # ה
 #  #  #  # ו # 
 #  #  #  # ך
 #  #  #  # ל #  # 
 #  #  #  # תא
 #  #  # ד
 #  #  # דב
 #  #  # דה
 #  #  # ה #  # 
 #  #  # הו
 #  #  # הם
 #  #  # ות
 #  #  # ט
 #  #  # כת

Another way to find lexemes with only one occurrence is to use the occ edge feature, which goes from lexeme nodes to the word nodes of their occurrences.

In [19]:
hapaxes2 = sorted(F.lex.v(lx) for lx in F.otype.s("lex") if len(E.occ.f(lx)) == 1)
len(hapaxes2)
Out[19]:
3813
In [20]:
for lx in hapaxes2[0:20]:
    print(lx)
 #  #  #  #  # 
 #  #  #  #  #  #  #  #  # 
 #  #  #  #  # ות
 #  #  #  #  # ל #  #  # 
 #  #  #  #  # ם
 #  #  #  # ב
 #  #  #  # ה
 #  #  #  # ו # 
 #  #  #  # ך
 #  #  #  # ל #  # 
 #  #  #  # תא
 #  #  # ד
 #  #  # דב
 #  #  # דה
 #  #  # ה #  # 
 #  #  # הו
 #  #  # הם
 #  #  # ות
 #  #  # ט
 #  #  # כת
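That both routes lead to the same hapaxes can be checked on a toy example that needs no corpus. All data below is invented for illustration: a dictionary plays the role of the lex feature, and a second dictionary plays the role of the edge from lexemes to occurrences.

```python
import collections

# invented toy data: word node -> lexeme
lexeme_of = {101: "melek", 102: "dabar", 103: "melek", 104: "or", 105: "dabar", 106: "yom"}

# way 1: count how often each lexeme value occurs among the words
freq = collections.Counter(lexeme_of.values())
hapaxes_by_freq = sorted(lx for (lx, amount) in freq.items() if amount == 1)

# way 2: follow an "edge" from each lexeme to its occurrences
occurrences = collections.defaultdict(list)
for (word, lx) in lexeme_of.items():
    occurrences[lx].append(word)
hapaxes_by_edge = sorted(lx for (lx, words) in occurrences.items() if len(words) == 1)

print(hapaxes_by_freq)
print(hapaxes_by_freq == hapaxes_by_edge)
```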

The feature lex contains lexemes that may have uncertain characters in them.

The feature glex has all those characters stripped. Let's use glex instead.

In [21]:
hapaxes1g = sorted(lx for (lx, amount) in F.glex.freqList("word") if amount == 1)
len(hapaxes1)
Out[21]:
3813
In [22]:
for lx in hapaxes1g[0:20]:
    print(lx)
100
115
126
150
300
32
350
50
52
536
54
61
65
66
67
71
83
92
99
 ידה

If we are not interested in the numerals:

In [23]:
for lx in [x for x in hapaxes1g if not x.isdigit()][0:20]:
    print(lx)
 ידה
 לוט
 נַחַל
 שֵׂעָר
ֶ
אֱגֹוז
אֱלִידָד
אֱלִיעָם
אֱלִישֶׁבַע
אֲבִיאֵל
אֲבִיטַל
אֲבִיעֶזְרִי
אֲבִיעֶזֶר
אֲבִישׁוּעַ
אֲבַטִּיחַ
אֲגֹורָה
אֲדַמְדַּם
אֲדָר
אֲדֹנִי
אֲדֹנִיָּה

Small occurrence base

The occurrence base of a word is the set of scrolls in which it occurs.

We compute the occurrence base of each word, based on lexemes according to the glex feature.

In [24]:
occurrenceBase1 = collections.defaultdict(set)

A.indent(reset=True)
A.info("compiling occurrence base ...")
for w in F.otype.s("word"):
    scroll = T.sectionFromNode(w)[0]
    occurrenceBase1[F.glex.v(w)].add(scroll)
A.info(f"{len(occurrenceBase1)} entries")
  0.00s compiling occurrence base ...
  6.19s 8265 entries

Wow, that took long!

We looked up the scroll for each word.

But there is another way:

Start with scrolls, and iterate through their words.

In [25]:
occurrenceBase2 = collections.defaultdict(set)

A.indent(reset=True)
A.info("compiling occurrence base ...")
for s in F.otype.s("scroll"):
    scroll = F.scroll.v(s)
    for w in L.d(s, otype="word"):
        occurrenceBase2[F.glex.v(w)].add(scroll)
A.info("done")
A.info(f"{len(occurrenceBase2)} entries")
  0.00s compiling occurrence base ...
  0.42s done
  0.42s 8265 entries

Much better. Are the results equal?

In [26]:
occurrenceBase1 == occurrenceBase2
Out[26]:
True

Yes.

In [27]:
occurrenceBase = occurrenceBase2

An overview of how many words have how big occurrence bases:

In [28]:
occurrenceSize = collections.Counter()

for (w, scrolls) in occurrenceBase.items():
    occurrenceSize[len(scrolls)] += 1

occurrenceSize = sorted(
    occurrenceSize.items(),
    key=lambda x: (-x[1], x[0]),
)

for (size, amount) in occurrenceSize[0:10]:
    print(f"base size {size:>4} : {amount:>5} words")
print("...")
for (size, amount) in occurrenceSize[-10:]:
    print(f"base size {size:>4} : {amount:>5} words")
base size    1 :  2789 words
base size    2 :  1109 words
base size    3 :   692 words
base size    4 :   462 words
base size    5 :   335 words
base size    6 :   256 words
base size    7 :   219 words
base size    8 :   182 words
base size    9 :   177 words
base size   10 :   122 words
...
base size  457 :     1 words
base size  459 :     1 words
base size  538 :     1 words
base size  600 :     1 words
base size  605 :     1 words
base size  629 :     1 words
base size  745 :     1 words
base size  761 :     1 words
base size  844 :     1 words
base size  997 :     1 words

Let's give the predicate private to those words whose occurrence base is a single scroll.

In [29]:
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)
Out[29]:
2789
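The chain from occurrence base to size distribution to private words can be followed on a toy corpus that needs no data download. The scroll names and lexemes below are invented for illustration.

```python
import collections

# invented toy corpus: scroll name -> lexemes of its words
scrolls = {
    "A": ["w", "l", "melek", "w"],
    "B": ["w", "dabar", "l"],
    "C": ["w", "melek", "yom"],
}

# occurrence base: lexeme -> set of scrolls in which it occurs
occurrence_base = collections.defaultdict(set)
for (scroll, lexemes) in scrolls.items():
    for lx in lexemes:
        occurrence_base[lx].add(scroll)

# how many lexemes have an occurrence base of a given size
size_distribution = collections.Counter(len(base) for base in occurrence_base.values())

# private lexemes occur in exactly one scroll
toy_privates = {lx for (lx, base) in occurrence_base.items() if len(base) == 1}

print(sorted(size_distribution.items()))
print(sorted(toy_privates))
```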

Peculiarity of scrolls

As a final exercise with scrolls, let's make a list of all scrolls, and show their

  • total number of distinct words (lexemes according to glex)
  • number of private words
  • the percentage of private words: a measure of the peculiarity of the scroll
In [30]:
scrollList = []

empty = set()
ordinary = set()

for d in F.otype.s("scroll"):
    scroll = T.scrollName(d)
    words = {F.glex.v(w) for w in L.d(d, otype="word")}
    a = len(words)
    if not a:
        empty.add(scroll)
        continue
    o = len({w for w in words if w in privates})
    if not o:
        ordinary.add(scroll)
        continue
    p = 100 * o / a
    scrollList.append((scroll, a, o, p))

scrollList = sorted(scrollList, key=lambda e: (-e[3], -e[1], e[0]))

print(f"Found {len(empty):>4} empty scrolls")
print(f"Found {len(ordinary):>4} ordinary scrolls (i.e. without private words)")
Found    0 empty scrolls
Found  507 ordinary scrolls (i.e. without private words)
In [31]:
print(
    "{:<20}{:>5}{:>5}{:>5}\n{}".format(
        "scroll",
        "#all",
        "#own",
        "%own",
        "-" * 35,
    )
)

for x in scrollList[0:20]:
    print("{:<20} {:>4} {:>4} {:>4.1f}%".format(*x))
print("...")
for x in scrollList[-20:]:
    print("{:<20} {:>4} {:>4} {:>4.1f}%".format(*x))
scroll               #all #own %own
-----------------------------------
4Q341                  32   21 65.6%
4Q340                  15    5 33.3%
11Q26                   6    2 33.3%
4Q313a                  3    1 33.3%
4Q358                   3    1 33.3%
4Q347                  10    3 30.0%
4Q124                  86   25 29.1%
4Q282d                  7    2 28.6%
1Q70bis                11    3 27.3%
1Q70                   24    6 25.0%
4Q346a                  4    1 25.0%
4Q357                   4    1 25.0%
1Q41                    9    2 22.2%
3Q15                  269   58 21.6%
4Q561                  73   15 20.5%
4Q559                 129   26 20.2%
4Q360a                 20    4 20.0%
1Q58                    5    1 20.0%
4Q250b                  5    1 20.0%
4Q468bb                 5    1 20.0%
...
4Q427                 343    2  0.6%
4Q2                   174    1  0.6%
4Q366                 185    1  0.5%
4Q98                  192    1  0.5%
4Q56                  963    5  0.5%
4Q394                 194    1  0.5%
4Q59                  404    2  0.5%
4Q88                  208    1  0.5%
11Q20                 429    2  0.5%
4Q57                  875    4  0.5%
11Q11                 222    1  0.5%
4Q58                  450    2  0.4%
4Q174                 241    1  0.4%
4Q13                  257    1  0.4%
4Q524                 280    1  0.4%
4Q271                 293    1  0.3%
4Q84                  350    1  0.3%
4Q33                  365    1  0.3%
4Q428                 385    1  0.3%
1QpHab                463    1  0.2%
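The percentage column is plain arithmetic: 100 times the number of private words, divided by the number of distinct words in the scroll. The sketch below redoes that computation for three rows copied from the table above, so it runs without the corpus.

```python
# (scroll, number of distinct words, number of private words), copied from the table above
rows = [("4Q341", 32, 21), ("3Q15", 269, 58), ("1QpHab", 463, 1)]

percentages = {}
for (scroll, n_all, n_own) in rows:
    percentages[scroll] = 100 * n_own / n_all
    print(f"{scroll:<20} {n_all:>4} {n_own:>4} {percentages[scroll]:>4.1f}%")
```

Note that small scrolls reach high percentages easily: a scroll with 3 distinct words needs only 1 private word to score 33.3%.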

Tip

See the lexeme recipe in the cookbook for how you get from a lexeme node to its word occurrence nodes.

Locality API

We travel upwards and downwards, forwards and backwards through the nodes. The Locality-API (L) provides functions: u() for going up, and d() for going down, n() for going to next nodes and p() for going to previous nodes.

These directions are indirect notions: nodes are just numbers, but by means of the oslots feature they are linked to slots. One node contains another node if the one is linked to a set of slots that contains the set of slots that the other is linked to. And one node is next or previous to another if its slots follow or precede the slots of the other one.

L.u(node) Up is going to nodes that embed node.

L.d(node) Down is the opposite direction, to those that are contained in node.

L.n(node) Next are the next adjacent nodes, i.e. nodes whose first slot comes immediately after the last slot of node.

L.p(node) Previous are the previous adjacent nodes, i.e. nodes whose last slot comes immediately before the first slot of node.

All these functions yield nodes of all possible otypes. By passing an optional parameter, you can restrict the results to nodes of that type.

The results are ordered according to the order of things in the text.

The functions always return a tuple, even if there is just one node in the result.

Going up

We go from the first sign to the scroll that contains it. Note the [0] at the end. You expect one scroll, yet L returns a tuple. To get the only element of that tuple, you need to do that [0].

If you are like me, you keep forgetting it, and that will lead to weird error messages later on.

In [32]:
firstScroll = L.u(1, otype="scroll")[0]
print(firstScroll)
1605868

And let's see all the containing objects of sign 3:

In [33]:
s = 3
for otype in F.otype.all:
    if otype == F.otype.slotType:
        continue
    up = L.u(s, otype=otype)
    upNode = "x" if len(up) == 0 else up[0]
    print("sign {} is contained in {} {}".format(s, otype, upNode))
sign 3 is contained in scroll 1605868
sign 3 is contained in lex 1542524
sign 3 is contained in fragment 1531341
sign 3 is contained in line 1552973
sign 3 is contained in clause x
sign 3 is contained in cluster x
sign 3 is contained in phrase x
sign 3 is contained in word 1606870
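The test on len(up) in the cell above guards against empty results, such as the missing clause for sign 3. If you need that guard often, a tiny helper keeps it in one place. The name first_or_none is hypothetical, and the two tuples below are what the output above implies, written out literally so that the sketch runs without the corpus.

```python
def first_or_none(nodes):
    """Return the first node of a result tuple, or None if the tuple is empty."""
    return nodes[0] if nodes else None


up_scroll = (1605868,)  # sign 3 is contained in one scroll
up_clause = ()          # sign 3 is not contained in any clause

print(first_or_none(up_scroll))
print(first_or_none(up_clause))
```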

Going next

Let's go to the next nodes of the first scroll.

In [34]:
afterFirstScroll = L.n(firstScroll)
for n in afterFirstScroll:
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )
secondScroll = L.n(firstScroll, otype="scroll")[0]
  17149: sign          first slot=17149 , last slot=17149 
1612982: word          first slot=17149 , last slot=17149 
1553387: line          first slot=17149 , last slot=17176 
1531359: fragment      first slot=17149 , last slot=18207 
1605869: scroll        first slot=17149 , last slot=33885 

Going previous

And let's see what is right before the second scroll.

In [35]:
for n in L.p(secondScroll):
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )
1605868: scroll        first slot=1     , last slot=17148 
1531358: fragment      first slot=15658 , last slot=17148 
1553386: line          first slot=17099 , last slot=17148 
1612981: word          first slot=17147 , last slot=17148 
  17148: sign          first slot=17148 , last slot=17148 
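The slot numbers in the two listings above make the notions of containment and adjacency concrete. The sketch below copies some of those rows into a dictionary and checks the definitions on them. It assumes that every node covers one uninterrupted range of slots, which holds for these rows but need not hold for every node in a corpus; the function names are hypothetical.

```python
# node -> (type, first slot, last slot), copied from the two listings above
nodes = {
    1605868: ("scroll", 1, 17148),
    1531358: ("fragment", 15658, 17148),
    1553386: ("line", 17099, 17148),
    1612981: ("word", 17147, 17148),
    17148: ("sign", 17148, 17148),
    1605869: ("scroll", 17149, 33885),
    17149: ("sign", 17149, 17149),
}


def slots(n):
    """The set of slots of node n (assumed contiguous in this sketch)."""
    (_otype, first, last) = nodes[n]
    return set(range(first, last + 1))


def contains(a, b):
    """Node a contains node b if the slots of b are a subset of the slots of a."""
    return slots(b) <= slots(a)


def is_next(a, b):
    """Node b is next to node a if its first slot follows the last slot of a."""
    return nodes[b][1] == nodes[a][2] + 1


print(contains(1605868, 1553386))  # does the first scroll contain this line?
print(is_next(1605868, 1605869))   # does the second scroll follow the first?
```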

Going down

We go to the fragments of the first scroll, and just count them.

In [36]:
fragments = L.d(firstScroll, otype="fragment")
print(len(fragments))
18

The first line

We pick two nodes and explore what is above and below them: the first line and the first word.

In [37]:
for n in [
    F.otype.s("word")[0],
    F.otype.s("line")[0],
]:
    A.indent(level=0)
    A.info("Node {}".format(n), tm=False)
    A.indent(level=1)
    A.info("UP", tm=False)
    A.indent(level=2)
    A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    A.indent(level=1)
    A.info("DOWN", tm=False)
    A.indent(level=2)
    A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
A.indent(level=0)
A.info("Done", tm=False)
Node 1606869
   |   UP
   |      |   1542523         lex
   |      |   1552973         line
   |      |   1531341         fragment
   |      |   1605868         scroll
   |   DOWN
   |      |   2               sign
Node 1552973
   |   UP
   |      |   1531341         fragment
   |      |   1605868         scroll
   |   DOWN
   |      |   1430242         cluster
   |      |   1               sign
   |      |   1606869         word
   |      |   2               sign
   |      |   1606870         word
   |      |   3               sign
   |      |   4               sign
   |      |   5               sign
   |      |   1606871         word
   |      |   6               sign
   |      |   7               sign
   |      |   8               sign
   |      |   9               sign
   |      |   1606872         word
   |      |   10              sign
   |      |   11              sign
   |      |   1606873         word
   |      |   12              sign
   |      |   13              sign
   |      |   14              sign
   |      |   15              sign
   |      |   16              sign
   |      |   1606874         word
   |      |   17              sign
   |      |   18              sign
   |      |   19              sign
   |      |   1606875         word
   |      |   20              sign
   |      |   1606876         word
   |      |   21              sign
   |      |   22              sign
   |      |   23              sign
   |      |   24              sign
   |      |   1606877         word
   |      |   25              sign
   |      |   1606878         word
   |      |   26              sign
   |      |   27              sign
   |      |   28              sign
   |      |   29              sign
Done

Text API

So far, we have mainly seen nodes and their numbers, and the names of node types. You would almost forget that we are dealing with text. So let's try to see some text.

In the same way as F gives access to feature data, T gives access to the text. That is also feature data, but you can tell Text-Fabric which features are specifically carrying the text, and in return Text-Fabric offers you a Text API: T.

Formats

DSS text can be represented in a number of ways:

  • orig: unicode
  • trans: ETCBC transcription
  • source: as in Abegg's data files

All three can be represented in two flavours:

  • full: all glyphs, but no bracketings and flags
  • extra: everything

If you wonder where the information about text formats is stored: not in the program text-fabric, but in the data set. It has a feature otext, which specifies the formats and which features must be used to produce them. otext is the third special feature in a TF data set, next to otype and oslots. It is an optional feature. If it is absent, there will be no T API.

Here is a list of all available formats in this data set.

In [38]:
T.formats
Out[38]:
{'lex-default': 'word',
 'lex-orig-full': 'word',
 'lex-source-full': 'word',
 'lex-trans-full': 'word',
 'morph-source-full': 'word',
 'text-orig-extra': 'word',
 'text-orig-full': 'sign',
 'text-source-extra': 'word',
 'text-source-full': 'sign',
 'text-trans-extra': 'word',
 'text-trans-full': 'sign',
 'layout-orig-full': 'sign',
 'layout-source-full': 'sign',
 'layout-trans-full': 'sign'}

Using the formats

The T.text() function is central to get text representations of nodes. Its most basic usage is

T.text(nodes, fmt=fmt)

where nodes is a list or iterable of nodes, usually word nodes, and fmt is the name of a format. If you leave out fmt, the default text-orig-full is chosen.

The result is the text in that format for all nodes specified.

You see for each format in the list above its intended level of operation: sign or word.

If TF formats a node according to a defined text-format, it will descend to constituent nodes and represent those constituent nodes.

In this case, the formats ending in -extra specify the word level as the descend type, because in this dataset the features that contain the text-critical brackets are only defined at the word level. At the sign level, those brackets are no longer visible, but they have left their traces in other features.
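To see at a glance which formats work at which level, we can group the entries of T.formats by their target type. The sketch below works on a literal copy of the output above, so it runs without the corpus.

```python
import collections

# copied from the output of T.formats above
formats = {
    "lex-default": "word",
    "lex-orig-full": "word",
    "lex-source-full": "word",
    "lex-trans-full": "word",
    "morph-source-full": "word",
    "text-orig-extra": "word",
    "text-orig-full": "sign",
    "text-source-extra": "word",
    "text-source-full": "sign",
    "text-trans-extra": "word",
    "text-trans-full": "sign",
    "layout-orig-full": "sign",
    "layout-source-full": "sign",
    "layout-trans-full": "sign",
}

# target type -> list of format names
by_level = collections.defaultdict(list)
for (fmt, level) in formats.items():
    by_level[level].append(fmt)

for (level, fmts) in sorted(by_level.items()):
    print(f"{level} ({len(fmts)}): {', '.join(fmts)}")
```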

If we do not specify a format, the default format is used (text-orig-full).

We examine a portion of biblical material at the start of 1Q1.

In [39]:
fragmentNode = T.nodeFromSection(("1Q1", "f1"))
fragmentNode
Out[39]:
1540222
In [40]:
signs = L.d(fragmentNode, otype="sign")
words = L.d(fragmentNode, otype="word")
lines = L.d(fragmentNode, otype="line")
print(
    f"""
Fragment {T.sectionFromNode(fragmentNode)} with
  {len(signs):>3} signs
  {len(words):>3} words
  {len(lines):>3} lines
"""
)
Fragment ('1Q1', 'f1') with
  157 signs
   57 words
    3 lines

In [41]:
T.text(signs[0:100])
Out[41]:
'וירא אלהים כי טוב ׃ ויהי ערב ויהי בקר יום רביעי ׃ ויאמר ╱ אלהים ישרוצו המים שרץ נפש חיה ועוף יעופף על הארץ על פני רקיע השמים '
In [42]:
T.text(words[0:20])
Out[42]:
'וירא אלהים כי טוב ׃ ויהי ערב ויהי בקר יום רביעי ׃ ויאמר אלהים ישרוצו ה'
In [43]:
T.text(lines[0:2])
Out[43]:
'וירא אלהים כי טוב ׃ ויהי ערב ויהי בקר יום רביעי ׃ ויאמר ╱ אלהים ישרוצו המים שרץ נפש חיה ועוף יעופף על הארץ על פני רקיע השמים ׃ ╱ '

The -extra formats

In order to use non-default formats, we have to specify them in the fmt parameter.

In [44]:
T.text(signs[0:100], fmt="text-orig-extra")
Out[44]:
''

We do not get much. Let's ask why.

In [45]:
T.text(signs[0:2], fmt="text-orig-extra", explain=True)
EXPLANATION: T.text() called with parameters:
	nodes  : iterable of 2 nodes
	fmt    : text-orig-extra targeted at word
	descend: implicit
	func   : no custom format implementation

	NODE: sign 770999
		TARGET LEVEL: word  (descend=None) (format target type)
		EXPANSION: 0 words 
		FORMATTING: explicit text-orig-extra does <function Text._compileFormat.<locals>.g at 0x133d4a680>
		MATERIAL:
	NODE: sign 771000
		TARGET LEVEL: word  (descend=None) (format target type)
		EXPANSION: 0 words 
		FORMATTING: explicit text-orig-extra does <function Text._compileFormat.<locals>.g at 0x133d4a680>
		MATERIAL:
Out[45]:
''

The reason can be found in TARGET LEVEL: word and EXPANSION: 0 words. We are applying the word-targeted format text-orig-extra to a sign, and a sign does not contain words.
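The expansion step can be imitated with a toy model. Everything below is invented for illustration (three signs, two words, one line, and two formats named plain and extra), and the function text is a simplified stand-in for T.text(), not its actual implementation: it expands each node to the nodes of the format's target type that lie within it, and renders those.

```python
# invented toy data: three signs, two words, one line
otype = {1: "sign", 2: "sign", 3: "sign", 10: "word", 11: "word", 20: "line"}
slots_of = {1: {1}, 2: {2}, 3: {3}, 10: {1, 2}, 11: {3}, 20: {1, 2, 3}}

glyph = {1: "a", 2: "b", 3: "c"}  # sign-level feature
full = {10: "[ab] ", 11: "c "}    # word-level feature that carries the brackets

# format name -> (target type, feature that renders a node of that type)
formats = {"plain": ("sign", glyph), "extra": ("word", full)}


def text(nodes, fmt):
    """Render nodes in a format, after expanding them to the format's target type."""
    (target, feature) = formats[fmt]
    material = []
    for n in nodes:
        # the nodes of the target type whose slots lie within the slots of n
        expansion = sorted(
            m for m in otype if otype[m] == target and slots_of[m] <= slots_of[n]
        )
        material.extend(feature[m] for m in expansion)
    return "".join(material)


print(repr(text([1, 2], "extra")))    # signs contain no words
print(repr(text([10, 11], "extra")))  # words render directly
print(repr(text([20], "plain")))      # a line expands to its signs
```

In this model, as in the explanation above, a word-targeted format applied to signs yields the empty string, because no word fits inside a single sign.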

In [46]:
T.text(words[0:20], fmt="text-orig-extra")
Out[46]:
'[ וירא אל ]הים כי [ טוב ׃ ויהי ערב ויהי בקר יום רביעי ׃ ויאמר ] [ אלהים יש ]רוצו ה'
In [47]:
T.text(lines[0:2], fmt="text-orig-extra")
Out[47]:
'[ וירא אל ]הים כי [ טוב ׃ ויהי ערב ויהי בקר יום רביעי ׃ ויאמר ] [ אלהים יש ]רוצו המים שר#[ ץ נפש חיה ועוף יעופף על הארץ על פני רקיע השמים ׃ '

Note that the direction of the brackets looks wrong, because they have not been adapted to the right-to-left writing direction.

We can view them in ETCBC transcription as well:

In [48]:
T.text(words[0:20], fmt="text-trans-extra")
Out[48]:
'[ WJR> >L ]HJm KJ [ VWB 00 WJHJ <RB WJHJ BQR JWm RBJ<J 00 WJ>MR ] [ >LHJm J# ]RWYW H'
In [49]:
T.text(lines[0:2], fmt="text-trans-extra")
Out[49]:
'[ WJR> >L ]HJm KJ [ VWB 00 WJHJ <RB WJHJ BQR JWm RBJ<J 00 WJ>MR ] [ >LHJm J# ]RWYW HMJm #R#[ y NP# XJH W<Wp J<WPp <L H>Ry <L PNJ RQJ< H#MJm 00 '

Or in Abegg's source encoding:

In [50]:
T.text(words[0:20], fmt="text-source-extra")
Out[50]:
']wyra al[hyM ky ]fwb . wyhy orb wyhy bqr ywM rbyoy . wyamr[ ]alhyM yC[rwxw h'
In [51]:
T.text(lines[0:2], fmt="text-source-extra")
Out[51]:
']wyra al[hyM ky ]fwb . wyhy orb wyhy bqr ywM rbyoy . wyamr[ ]alhyM yC[rwxw hmyM Cr«]X npC jyh wowP yowpP ol harX ol pny rqyo hCmyM . '

The function T.text() works with nodes of many types.

We compose a set of example nodes and run T.text on them:

In [52]:
exampleNodes = [
    F.otype.s("sign")[1],
    F.otype.s("word")[1],
    F.otype.s("cluster")[0],
    F.otype.s("line")[0],
    F.otype.s("fragment")[0],
    F.otype.s("scroll")[0],
    F.otype.s("lex")[1],
]
exampleNodes
Out[52]:
[2, 1606870, 1430242, 1552973, 1531341, 1605868, 1542524]
In [53]:
for n in exampleNodes:
    print(f"This is {F.otype.v(n)} {n}:")
    text = T.text(n)
    if len(text) > 200:
        text = text[0:200] + f"\nand {len(text) - 200} characters more"
    print(text)
    print("")
This is sign 2:
ו

This is word 1606870:
עתה 

This is cluster 1430242:
  

This is line 1552973:
  ועתה שמעו כל יודעי צדק ובינו במעשי 

This is fragment 1531341:
  ועתה שמעו כל יודעי צדק ובינו במעשי אל ׃ כי ריב ל׳ו עם כל בשר ומשפט יעשה בכל מנאצי׳ו ׃ כי במועל׳ם אשר עזבו׳הו הסתיר פני׳ו מישראל וממקדש׳ו ויתנ׳ם לחרב ׃ ובזכר׳ו ברית ראשנים השאיר שאירית לישראל ולא נתנ
and 827 characters more

This is scroll 1605868:
  ועתה שמעו כל יודעי צדק ובינו במעשי אל ׃ כי ריב ל׳ו עם כל בשר ומשפט יעשה בכל מנאצי׳ו ׃ כי במועל׳ם אשר עזבו׳הו הסתיר פני׳ו מישראל וממקדש׳ו ויתנ׳ם לחרב ׃ ובזכר׳ו ברית ראשנים השאיר שאירית לישראל ולא נתנ
and 21145 characters more

This is lex 1542524:
h-עַתָּה-<AT.@H-oAt;Dh 

Look at the last case, the lexeme node: obviously, the text-format that has been invoked provides the language (h) of the lexeme, plus its representations in unicode, etcbc, and Abegg transcription.

But what format exactly has been invoked? Let's ask.

In [54]:
T.text(exampleNodes[-1], explain=True)
EXPLANATION: T.text() called with parameters:
	nodes  : single node
	fmt    : implicit
	descend: implicit
	func   : no custom format implementation

	NODE: lex 1542524
		TARGET LEVEL: lex (no expansion needed) (descend=None) (format target type)
		EXPANSION: 1 lex 1542524
		FORMATTING: implicit lex-default does <function Text._compileFormat.<locals>.g at 0x133d49a20>
		MATERIAL:
			lex 1542524 ADDS "h-עַתָּה-<AT.@H-oAt;Dh "
Out[54]:
'h-עַתָּה-<AT.@H-oAt;Dh '

The clue is in FORMATTING: implicit lex-default.

Remember that we saw the format lex-default in T.formats.

The Text-API has matched the type of the lexeme node we provided with this default format and applies it, thereby skipping the expansion of the lexeme node to its occurrences.

But we can force the expansion:

In [55]:
T.text(exampleNodes[-1], fmt="lex-default", descend=True)
Out[55]:
'h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh 
h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh 
h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh 
h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh '

Using the formats

Now let's use those formats to print out the first biblical line in this corpus.

The formats whose names start with layout- are not usable for this, and the format lex-default is not useful either, so we leave both out.

For the layout- formats, see display.

In [56]:
usefulFormats = [
    fmt
    for fmt in sorted(T.formats)
    if not fmt.startswith("layout-") and fmt != "lex-default"
]
len(usefulFormats)
Out[56]:
10
In [57]:
firstLine = T.nodeFromSection(("1Q1", "f1", "1"))
for fmt in usefulFormats:
    # usefulFormats already excludes the layout- formats
    print(f"{fmt}:\n\t{T.text(firstLine, fmt=fmt)}\n")
lex-orig-full:
	h-וְh-ראה h-אֱלֹהִים h-כִּי h-טֹוב h-׃ h-וְh-היה h-עֶרֶב h-וְh-היה h-בֹּקֶר h-יֹום h-רְבִיעִי h-׃ h-וְh-אמר 

lex-source-full:
	h-w◊h-rah h-aTløhIyM h-k;Iy h-føwb h-. h-w◊h-hyh h-oRr®b h-w◊h-hyh h-b;Oq®r h-yøwM h-r√bIyoIy h-. h-w◊h-amr 

lex-trans-full:
	h-W:h-R>H h->:ELOHIJm h-K.IJ h-VOWB h-00 h-W:h-HJH h-<EREB h-W:h-HJH h-B.OQER h-JOWm h-R:BIJ<IJ h-00 h-W:h->MR 

morph-source-full:
	Pcvqw3msj ncmp Pc ams . Pcvqw3msj ncms Pcvqw3msj ncms ncms uomsa . Pcvqw3ms 

text-orig-extra:
	[ וירא אל ]הים כי [ טוב ׃ ויהי ערב ויהי בקר יום רביעי ׃ ויאמר ] 

text-orig-full:
	וירא אלהים כי טוב ׃ ויהי ערב ויהי בקר יום רביעי ׃ ויאמר ╱ 

text-source-extra:
	]wyra al[hyM ky ]fwb . wyhy orb wyhy bqr ywM rbyoy . wyamr[ 

text-source-full:
	wyra alhyM ky fwb . wyhy orb wyhy bqr ywM rbyoy . wyamr ╱ 

text-trans-extra:
	[ WJR> >L ]HJm KJ [ VWB 00 WJHJ <RB WJHJ BQR JWm RBJ<J 00 WJ>MR ] 

text-trans-full:
	WJR> >LHJm KJ VWB 00 WJHJ <RB WJHJ BQR JWm RBJ<J 00 WJ>MR ╱ 

Whole text in all formats in a few seconds

Part of the pleasure of working with computers is that they can crunch massive amounts of data. The text of the Dead Sea Scrolls is a piece of cake.

It takes about ten seconds to have that cake and eat it, in all useful formats.

In [58]:
A.indent(reset=True)
A.info("writing plain text of all scrolls in all text formats")

text = collections.defaultdict(list)

for ln in F.otype.s("line"):
    for fmt in usefulFormats:
        if fmt.startswith("text-"):
            text[fmt].append(T.text(ln, fmt=fmt, descend=True))

A.info("done {} formats".format(len(text)))

for fmt in sorted(text):
    print("{}\n{}\n".format(fmt, "\n".join(text[fmt][0:5])))
  0.00s writing plain text of all scrolls in all text formats
  8.74s done 6 formats
text-orig-extra
ועתה שמעו כל יודעי צדק ובינו במעשי 
אל ׃ כי ריב ל׳ו עם כל בשר ומשפט יעשה בכל מנאצי׳ו ׃ 
כי במועל׳ם אשר עזבו׳הו הסתיר פני׳ו מישראל וממקדש׳ו 
ו?יתנ׳ם לחרב ׃ ובזכר׳ו ברית ראשנים השאיר שאירית 
לישראל ולא נתנ׳ם לכלה ׃ ובקץ חרון שנים שלוש מאות 

text-orig-full
  ועתה שמעו כל יודעי צדק ובינו במעשי 
אל ׃ כי ריב ל׳ו עם כל בשר ומשפט יעשה בכל מנאצי׳ו ׃ 
כי במועל׳ם אשר עזבו׳הו הסתיר פני׳ו מישראל וממקדש׳ו 
ויתנ׳ם לחרב ׃ ובזכר׳ו ברית ראשנים השאיר שאירית 
לישראל ולא נתנ׳ם לכלה ׃ ובקץ חרון שנים שלוש מאות 

text-source-extra
woth Cmow kl ywdoy xdq wbynw bmoCy 
al . ky ryb l/w oM kl bCr wmCpf yoCh bkl mnaxy/w . 
ky bmwol/M aCr ozbw/hw hstyr pny/w myCral wmmqdC/w 
wØytn/M ljrb . wbzkr/w bryt raCnyM hCayr Cayryt 
lyCral wla ntn/M lklh . wbqX jrwN CnyM ClwC mawt 

text-source-full
□ woth Cmow kl ywdoy xdq wbynw bmoCy 
al . ky ryb l/w oM kl bCr wmCpf yoCh bkl mnaxy/w . 
ky bmwol/M aCr ozbw/hw hstyr pny/w myCral wmmqdC/w 
wytn/M ljrb . wbzkr/w bryt raCnyM hCayr Cayryt 
lyCral wla ntn/M lklh . wbqX jrwN CnyM ClwC mawt 

text-trans-extra
W<TH #M<W KL JWD<J YDQ WBJNW BM<#J 
>L 00 KJ RJB L'W <m KL B#R WM#PV J<#H BKL MN>YJ'W 00 
KJ BMW<L'm >#R <ZBW'HW HSTJR PNJ'W MJ#R>L WMMQD#'W 
W?JTN'm LXRB 00 WBZKR'W BRJT R>#NJm H#>JR #>JRJT 
LJ#R>L WL> NTN'm LKLH 00 WBQy XRWn #NJm #LW# M>WT 

text-trans-full
  W<TH #M<W KL JWD<J YDQ WBJNW BM<#J 
>L 00 KJ RJB L'W <m KL B#R WM#PV J<#H BKL MN>YJ'W 00 
KJ BMW<L'm >#R <ZBW'HW HSTJR PNJ'W MJ#R>L WMMQD#'W 
WJTN'm LXRB 00 WBZKR'W BRJT R>#NJm H#>JR #>JRJT 
LJ#R>L WL> NTN'm LKLH 00 WBQy XRWn #NJm #LW# M>WT 

The full plain text

We write all text formats to files in your Downloads folder, one file per format.

In [59]:
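for fmt in T.formats:
    if fmt.startswith("text-"):
        with open(
            os.path.expanduser(f"~/Downloads/{fmt}.txt"),
            "w",
            encoding="utf8",  # explicit, because the platform default may differ
        ) as f:
            f.write("\n".join(text[fmt]))

(The encoding is given explicitly, so the Hebrew text is written correctly even where UTF-8 is not the platform default.)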
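
As a minimal sketch of why the encoding matters, the following writes a few lines of Hebrew text with an explicit encoding and reads them back. It does not use Text-Fabric: the `sample` dictionary and the temporary directory are stand-ins (assumptions for illustration) for the `text` dictionary and the Downloads folder.

```python
import tempfile
from pathlib import Path

# Hypothetical sample: one format name mapped to a few lines of text.
sample = {"text-orig-full": ["וירא אלהים כי טוב", "ויהי ערב ויהי בקר"]}

out_dir = Path(tempfile.mkdtemp())  # stand-in for ~/Downloads

for fmt, lines in sample.items():
    path = out_dir / f"{fmt}.txt"
    # encoding="utf8" makes the result independent of the platform default
    path.write_text("\n".join(lines), encoding="utf8")

# Read the file back with the same encoding and split it into lines again.
round_trip = (out_dir / "text-orig-full.txt").read_text(encoding="utf8").split("\n")
print(round_trip == sample["text-orig-full"])
```

Reading with the same explicit encoding is what guarantees the round trip; relying on the default on both sides only works as long as both sides run on the same platform.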

Sections

A section in the DSS is a scroll, a fragment or a line. Knowledge of sections is not baked into Text-Fabric. Instead, the config feature otext.tf may specify three section levels and state which node types and features correspond to them.

From that information Text-Fabric can construct mappings from nodes to sections, e.g. from line nodes to tuples of the form:

(scroll acronym, fragment label, line number)

You can get the section of a node as a tuple of relevant scroll, fragment, and line nodes. Or you can get it as a passage label, a string.

You can ask for the passage corresponding to the first slot of a node, or the one corresponding to the last slot.

If you are dealing with scroll and fragment nodes, you can ask to fill out the line and fragment parts as well.

Here are examples of getting the section that corresponds to a node and vice versa.

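NB: for a node that is not itself a scroll or a fragment, sectionFromNode delivers a full line specification, taken from the first slot belonging to that node or, if lastSlot is passed, from its last slot. For scroll and fragment nodes, pass fillup to get the fragment and line parts as well.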
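
To make the roles of lastSlot and fillup concrete, here is a toy model in plain Python. It is not Text-Fabric's implementation: the data and the names `LINES`, `line_of_slot`, `section_of` and `label` are made up for illustration, and a node is represented simply by the slots it occupies.

```python
# Toy model of section lookup; all data and names are hypothetical.
# Each entry: a (scroll, fragment, line) tuple and the slots that line covers.
LINES = [
    (("1Q1", "f1", 1), range(1, 6)),
    (("1Q1", "f1", 2), range(6, 11)),
    (("1Q1", "f2", 1), range(11, 16)),
]


def line_of_slot(slot):
    """Return the section tuple of the line that contains the slot."""
    for section, slots in LINES:
        if slot in slots:
            return section
    raise ValueError(f"slot {slot} is not in any line")


def section_of(slots, level=3, last_slot=False, fillup=False):
    """Section tuple of a node, given the slots it occupies.

    level is 1 for a scroll, 2 for a fragment, 3 for any other node.
    """
    slot = max(slots) if last_slot else min(slots)
    full = line_of_slot(slot)
    return full if fillup else full[:level]


def label(section):
    """Format a section tuple as a passage label such as '1Q1 f1:2'."""
    scroll, *rest = section
    if not rest:
        return scroll
    fragment, *line = rest
    return f"{scroll} {fragment}" + (f":{line[0]}" if line else "")


fragment_f1 = range(1, 11)  # slots of the whole fragment f1
word = range(7, 8)          # a single-slot node inside line 2

print(label(section_of(word)))
print(label(section_of(fragment_f1, level=2)))
print(label(section_of(fragment_f1, level=2, last_slot=True, fillup=True)))
```

In this model a fragment node yields only the scroll and fragment parts by default, and the line part appears once fillup is requested, which is the pattern to look for in the real output of the cells that follow.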

In [60]:
someNodes = (
    F.otype.s("sign")[100000],
    F.otype.s("word")[10000],
    F.otype.s("cluster")[5000],
    F.otype.s("line")[15000],
    F.otype.s("fragment")[1000],
    F.otype.s("scroll")[500],
)
In [61]:
for n in someNodes:
    nType = F.otype.v(n)
    d = f"{n:>7} {nType}"
    first = A.sectionStrFromNode(n)
    last = A.sectionStrFromNode(n, lastSlot=True, fillup=True)
    tup = (
        T.sectionTuple(n),
        T.sectionTuple(n, lastSlot=True, fillup=True),
    )
    print(f"{d:<16} - {first:<18} {last:<18} {tup}")
 100001 sign     - 1QHa 25:31         1QHa 25:31         ((1605874, 1531445, 1555227), (1605874, 1531445, 1555227))
1616869 word     - 1QS 8:10           1QS 8:10           ((1605869, 1531366, 1553578), (1605869, 1531366, 1553578))
1435242 cluster  - 1Q29 f2:3          1Q29 f2:3          ((1605890, 1531685, 1556400), (1605890, 1531685, 1556400))
1567973 line     - 4Q368 f3:4         4Q368 f3:4         ((1606221, 1534207, 1567973), (1606221, 1534207, 1567973))
1532341 fragment - 4Q186 f2ii         4Q186 f2ii:3       ((1605991, 1532341), (1605991, 1532341, 1559220))
1606368 scroll   - 4Q471b             4Q471b f1a_d:10    ((1606368,), (1606368, 1536089, 1575660))

Clean caches

Text-Fabric pre-computes data for you, so that it can be loaded faster. If the original data is updated, Text-Fabric detects it, and will recompute that data.

But when the algorithms of Text-Fabric have changed while the data has not, you might want to clear the cache of precomputed results yourself.

There are two ways to do that:

  • Locate the .tf directory of your dataset, and remove all .tfx files in it. This might be a bit awkward to do, because the .tf directory is hidden on Unix-like systems.
  • Call TF.clearCache(), which does exactly the same.
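
The first option can be sketched with the standard library alone. The helper `remove_tfx` is hypothetical and not part of Text-Fabric; it assumes that the cache files may also sit in subdirectories of the .tf directory, so it searches recursively. The demonstration runs on a throw-away directory, not on a real dataset.

```python
import tempfile
from pathlib import Path


def remove_tfx(tf_dir):
    """Delete all .tfx files in tf_dir and below; return their names, sorted."""
    removed = []
    for path in Path(tf_dir).rglob("*.tfx"):
        path.unlink()
        removed.append(path.name)
    return sorted(removed)


# Build a fake hidden .tf directory with two cache files and one other file.
demo = Path(tempfile.mkdtemp()) / ".tf"
(demo / "sub").mkdir(parents=True)
(demo / "otype.tfx").write_bytes(b"")
(demo / "sub" / "oslots.tfx").write_bytes(b"")
(demo / "notes.txt").write_text("keep me", encoding="utf8")

removed = remove_tfx(demo)
remaining = sorted(p.name for p in demo.rglob("*") if p.is_file())
print(removed, remaining)
```

Only files with the .tfx extension are touched, so anything else in the directory survives.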

Running the following cell every time is not handy, so I have commented it out. If you really want to clear the cache, remove the comment sign below.

In [62]:
# TF.clearCache()

Next steps

By now you have an impression of how to compute with the corpus. While this is still the beginning, I hope you already sense the power of unlimited programmatic access to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

  • display: become an expert in creating pretty displays of your text structures
  • search: turbo charge your hand-coding with search templates
  • exportExcel: make tailor-made spreadsheets out of your results
  • share: draw in other people's data and let them use yours
  • similarLines: spot the similarities between lines

See the cookbook for recipes for small, concrete tasks.

CC-BY Dirk Roorda