We just want to grasp what the corpus is about and how we can find our way in the data.
Open a terminal or command prompt and enter the following command:
text-fabric dss
Wait and see a lot happening before your browser starts up and shows you an interface to the corpus:
Text-Fabric needs an app to deal with the corpus-specific things. It downloads/finds/caches the latest version of the app:
Using TF-app in /Users/dirk/text-fabric-data/annotation/app-dss/code:
rv0.6=#304d66fd7eab50bbe4de8505c24d8b3eca30b1f1 (latest release)
It downloads/finds/caches the latest version of the data:
Using data in /Users/dirk/text-fabric-data/etcbc/dss/tf/0.6:
rv0.6=#9b52e40a8a36391b60807357fa94343c510bdee0 (latest release)
The data is preprocessed in order to speed up typical Text-Fabric operations. The result is cached on your computer. Preprocessing costs time. Next time you use this corpus on this machine, the startup time is much quicker.
TF setup done.
Then Text-Fabric starts a local web server that serves the freshly downloaded corpus, opens your browser, and loads the corpus page:
* Running on http://localhost:8107/ (Press CTRL+C to quit)
Opening dss in browser
Listening at port 18987
Indeed, that is what you need. Click the vertical Help tab.
From there, click around a little bit. Don't read closely, just note the kinds of information that are presented to you.
Later on, it will make more sense!
First we browse our data. Click the browse button.
and then, in the table of documents (scrolls), click on a fragment of scroll 1QSb:
Now you're looking at a fragment of a scroll: the writing in Hebrew characters without vowel signs.
Now click the Options tab and select the layout-orig-unicode
format to see the same fragment in a layout that indicates the status
of the pieces of writing.
You can click a triangle to see how a line is broken down:
In this corpus much attention is paid to the uncertainty of signs and to whether they have been corrected, either in antiquity or in more modern times.
Also, the corpus is marked up with part-of-speech for each word.
So we can, for example, search for verbs that have an uncertain or corrected or removed consonant in them.
word sp=verb
sign type=cons
/with/
.. unc=1|2|3|4
/or/
.. cor=1|2|3
/or/
.. rem=1|2
/or/
.. rec=1
/-/
In English: search all words that contain a sign with feature type having value cons (consonant), where at least one of the following holds for that sign:

* unc has value 1, 2, 3, or 4
* cor has value 1, 2, or 3
* rem has value 1 or 2
* rec has value 1
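For readers who prefer plain Python, the same condition can be sketched over hypothetical feature values. The dicts below stand in for corpus features and are not the Text-Fabric API; `None` means a feature is absent on that sign.

```python
# Hypothetical sign records standing in for corpus feature values.
signs = [
    {"type": "cons", "unc": 2, "cor": None, "rem": None, "rec": None},
    {"type": "cons", "unc": None, "cor": None, "rem": None, "rec": None},
    {"type": "vowel", "unc": 3, "cor": None, "rem": None, "rec": None},
]

def matches(sign):
    # the /with/ ... /or/ ... /-/ block: the sign is a consonant AND
    # at least one of the four feature conditions holds
    return sign["type"] == "cons" and (
        sign["unc"] in (1, 2, 3, 4)
        or sign["cor"] in (1, 2, 3)
        or sign["rem"] in (1, 2)
        or sign["rec"] == 1
    )

print([matches(s) for s in signs])  # [True, False, False]
```

Only the first sign matches: the second is a consonant without any uncertainty feature, and the third is uncertain but not a consonant.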
Words with multiple uncertain signs correspond to multiple results. We can condense the results in such a way that all results for the same word are shown as one result.
Click the Options tab, check condense results, and check word as the container into which you want to condense the results.
You can expand results by clicking the triangle.
You can see the result in context by clicking the browse icon.
You can go back to the result list by clicking the results icon.
This triggers other questions.
For example: how many verbs are there in total, if 37344 of them already have uncertain signs?
How is uncertainty distributed over the verbs? I.e. how many verbs have how many uncertain/corrected/removed signs?
This is a typical question where you want to leave the search mode and enter computing mode.
Let's find out.
Extra information: unc, cor, and rem have values 1, 2, 3, 4 that indicate the kind of uncertainty, correction, or removal. We just use those values as the seriousness of the uncertainty. Essentially, we sum up all values of these features for each sign. rec means that the sign is reconstructed. We consider reconstructed signs severely uncertain, and add a penalty of 10 for such signs.

Open your terminal and say
jupyter notebook
Your browser starts up and presents you a local computing environment where you can run Python programs.
You see cells like the one below, where you can type programming statements and execute them by pressing Shift Enter.
First we load the Text-Fabric module, as follows:
from tf.app import use
Now we load the TF-app for the corpus dss, and that app loads the corpus data.
We give a name to the result of all that loading: A.
A = use('ETCBC/dss', hoist=globals())
Locating corpus resources ...
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
scroll | 1001 | 1428.81 | 100 |
lex | 10450 | 129.14 | 94 |
fragment | 11182 | 127.91 | 100 |
line | 52895 | 27.04 | 100 |
clause | 125 | 12.85 | 0 |
cluster | 101099 | 6.68 | 47 |
phrase | 315 | 5.10 | 0 |
word | 500995 | 2.81 | 99 |
sign | 1430241 | 1.00 | 100 |
(The load message further shows the app version and location, the CSS that the app uses to style the corpus display, its configuration settings, and the list of loaded data features.)
Some bits are familiar from above, when you ran the text-fabric command in the terminal.
Other bits are links to the documentation, they point to the same places as the links on the Text-Fabric browser.
You see a list of all the data features that have been loaded.
And a list of references to the API documentation, which tells you how you can use this data in your program statements.
We do the same search again, but now inside our program.
That means that we can capture the results in a list for further processing.
template = '''
word sp=verb
sign type=cons
/with/
.. unc=1|2|3|4
/or/
.. cor=1|2|3
/or/
.. rem=1|2
/or/
.. rec=1
/-/
'''
results = A.search(template)
1.70s 125111 results
In a few seconds, we have all the results!
Let's look at the first one:
results[0]
(1607456, 1742)
Each result is a tuple of two node numbers: one for the word and one for the sign.
Here is the second one:
results[1]
(1607456, 1743)
And here the last one:
results[-1]
(2107836, 1430165)
Now we are only interested in the words that we have encountered. We collect them in a set:
verbs = sorted({result[0] for result in results})
len(verbs)
37344
This corresponds exactly to the number of condensed results!
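The condensing step mirrors a plain set comprehension: each result tuple starts with the word node, so collecting the first members and deduplicating them yields one entry per word. A sketch with toy node numbers (not real corpus nodes):

```python
# Toy result tuples: (word node, sign node); a word recurs once
# for every uncertain sign it contains.
results = [(101, 9001), (101, 9002), (102, 9003), (103, 9004), (103, 9005)]
verbs = sorted({result[0] for result in results})
print(len(results), len(verbs))  # 5 3
```

Five sign-level results condense to three distinct words, just as 125111 results condensed to 37344 verbs above.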
Now we get the number of verbs:
len(F.sp.s('verb'))
58873
In English: take the feature sp (part-of-speech), and collect all nodes that have value verb for this feature. Then take the length of that collection.
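Conceptually, this is a reverse feature lookup: from a value back to the nodes that carry it. A toy mock of that idea in plain Python (the dict and helper are illustrations, not the real Text-Fabric API):

```python
# Toy feature mapping from node number to part-of-speech value
sp = {1: "verb", 2: "noun", 3: "verb", 4: "ptcl", 5: "verb"}

def feature_s(feature, value):
    # all nodes carrying the given value, in node order,
    # mimicking the shape of F.<feature>.s(value)
    return tuple(n for n in sorted(feature) if feature[n] == value)

print(len(feature_s(sp, "verb")))  # 3
```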
Now we want to find out something for each result verb: what is the accumulated uncertainty of that verb? Some verbs have more signs than others, so we divide by the number of signs.
We define a function that collects the uncertainty of a single sign:
def getUncertainty(sign):
return sum((
F.unc.v(sign) or 0,
F.cor.v(sign) or 0,
F.rem.v(sign) or 0,
10 if F.rec.v(sign) else 0
))
Let's see what this gives for the first sign in the 1000th result:
sign = results[999][1]
A.pretty(sign)
unc = getUncertainty(sign)
print(unc)
10
Another one:
sign = results[12][1]
A.pretty(sign)
print(getUncertainty(sign))
4
Now we define a function that gives us the uncertainty of a word. We collect the signs of the word, sum their uncertainties, and divide by the number of signs in the word.
def uncertainty(word):
signs = L.d(word, otype='sign') # go a Level down to signs and collect them in a list
return sum(getUncertainty(sign) for sign in signs) / len(signs)
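With mock data the averaging reads as follows. The dict below stands in for L.d (mapping a word to its signs) and the score table stands in for getUncertainty; both are toy values, not corpus data.

```python
# Mock: word node -> its sign nodes (stand-in for L.d(word, otype='sign'))
word_signs = {42: [1, 2, 3]}
# Mock per-sign uncertainty scores (stand-in for getUncertainty)
score = {1: 10, 2: 0, 3: 2}

def uncertainty(word):
    # average the per-sign scores over all signs of the word
    signs = word_signs[word]
    return sum(score[s] for s in signs) / len(signs)

print(uncertainty(42))  # 4.0
```

One reconstructed sign (penalty 10) among three signs already pushes the average to 4.0.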
We compute the uncertainty of some verbs. Pick one, for example the 1000th, then run the computation:
verb = verbs[999]
unc = uncertainty(verb)
print(unc)
4.0
Another one:
verb = verbs[12]
A.pretty(verb)
print(uncertainty(verb))
We compute all word uncertainties and store them in a dictionary:
verbUncertainty = {verb: uncertainty(verb) for verb in verbs}
len(verbUncertainty)
37344
What is the minimum and the maximum uncertainty?
max(verbUncertainty.values())
14.0
min(verbUncertainty.values())
0.14285714285714285
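If you also want to know which verbs attain these extremes, not just the values, pass the dictionary's getter as the key function. A sketch with toy data:

```python
# Toy uncertainty dictionary: verb node -> average uncertainty
verbUnc = {11: 4.0, 12: 0.5, 13: 14.0}
worst = max(verbUnc, key=verbUnc.get)  # node with the highest value
best = min(verbUnc, key=verbUnc.get)   # node with the lowest value
print(worst, best)  # 13 12
```

On the real verbUncertainty dictionary, the resulting nodes could then be passed to A.pretty() for inspection.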
In order to visualize how many verbs are how uncertain, we make a distribution plot, using the seaborn library.
(You might need to install the Python package seaborn.)
!pip install seaborn
Requirement already satisfied: seaborn in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (0.13.2) (similar lines for the remaining requirements omitted)
import seaborn as sns
sns.set(color_codes=True)
sns.distplot(list(verbUncertainty.values()), axlabel="uncertainty")
/var/folders/nb/_qgnfkns75bcn_p_msyjl81m0000gn/T/ipykernel_55730/2177275149.py:2: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
<Axes: xlabel='uncertainty', ylabel='Density'>
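If seaborn is not available, a quick text-mode impression of such a distribution can be had with a Counter over unit-wide bins (the values below are toy numbers, not the real distribution):

```python
from collections import Counter

# Toy uncertainty values; bucket them into unit-wide integer bins
values = [0.14, 0.5, 1.2, 1.4, 4.0, 10.0]
hist = Counter(int(v) for v in values)
for b in sorted(hist):
    print(f"{b:>2}: {'#' * hist[b]}")
```

Each line shows a bin and a bar of `#` marks, one per value falling in that bin.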
Let's single out the verbs with uncertainty greater than 9, but lower than 10, and inspect a few.
verbHighUnc = [verb for (verb, unc) in verbUncertainty.items() if 9 < unc < 10]
len(verbHighUnc)
10
A.show([[verb] for verb in verbHighUnc], fmt='layout-orig-full', condenseType='word')