Notebook

In [1]:

%load_ext autoreload
%autoreload 2

In [2]:

from tf.app import use

In [3]:

A = use("ETCBC/bhsa:clone", checkout="clone")

Locating corpus resources ...

app: ~/github/ETCBC/bhsa/app

data: ~/github/ETCBC/bhsa/tf/2021

data: ~/github/ETCBC/phono/tf/2021

data: ~/github/ETCBC/parallels/tf/2021

TF: TF API 12.1.2, ETCBC/bhsa/app v3, Search Reference
Data: ETCBC - bhsa 2021, Character table, Feature docs

Node types

Name	# of nodes	# slots / node	% coverage
book	39	10938.21	100
chapter	929	459.19	100
lex	9230	46.22	100
verse	23213	18.38	100
half_verse	45179	9.44	100
sentence	63717	6.70	100
sentence_atom	64514	6.61	100
clause	88131	4.84	100
clause_atom	90704	4.70	100
phrase	253203	1.68	100
phrase_atom	267532	1.59	100
subphrase	113850	1.42	38
word	426590	1.00	100

Sets: no custom sets
Features:

Parallel Passages

crossref

int

🆗 links between similar passages

BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis

book

str

✅ book name in Latin (Genesis; Numeri; Reges1; ...)

book@ll

str

✅ book name in amharic (ኣማርኛ)

chapter

int

✅ chapter number (1; 2; 3; ...)

code

int

✅ identifier of a clause atom relationship (0; 74; 367; ...)

det

str

✅ determinedness of phrase(atom) (det; und; NA.)

domain

str

✅ text type of clause (? (Unknown); N (narrative); D (discursive); Q (Quotation).)

freq_lex

int

✅ frequency of lexemes

function

str

✅ syntactic function of phrase (Cmpl; Objc; Pred; ...)

g_cons

str

✅ word consonantal-transliterated (B R>CJT BR> >LHJM ...)

g_cons_utf8

str

✅ word consonantal-Hebrew (ב ראשׁית ברא אלהים)

g_lex

str

✅ lexeme pointed-transliterated (B.:- R;>CIJT B.@R@> >:ELOH ...)

g_lex_utf8

str

✅ lexeme pointed-Hebrew (בְּ רֵאשִׁית בָּרָא אֱלֹה)

g_word

str

✅ word pointed-transliterated (B.:- R;>CI73JT B.@R@74> >:ELOHI92JM)

g_word_utf8

str

✅ word pointed-Hebrew (בְּ רֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים)

gloss

str

🆗 english translation of lexeme (beginning create god(s))

gn

str

✅ grammatical gender (m; f; NA; unknown.)

label

str

✅ (half-)verse label (half verses: A; B; C; verses: GEN 01,02)

language

str

✅ of word or lexeme (Hebrew; Aramaic.)

lex

str

✅ lexeme consonantal-transliterated (B R>CJT/ BR>[ >LHJM/)

lex_utf8

str

✅ lexeme consonantal-Hebrew (ב ראשׁית֜ ברא אלהים֜)

ls

str

✅ lexical set, subclassification of part-of-speech (card; ques; mult)

nametype

str

⚠️ named entity type (pers; mens; gens; topo; ppde.)

nme

str

✅ nominal ending consonantal-transliterated (absent; n/a; JM, ...)

nu

str

✅ grammatical number (sg; du; pl; NA; unknown.)

number

int

✅ sequence number of an object within its context

otype

str

pargr

str

🆗 hierarchical paragraph number (1; 1.2; 1.2.3.4; ...)

pdp

str

✅ phrase dependent part-of-speech (art; verb; subs; nmpr, ...)

pfm

str

✅ preformative consonantal-transliterated (absent; n/a; J, ...)

prs

str

✅ pronominal suffix consonantal-transliterated (absent; n/a; W; ...)

prs_gn

str

✅ pronominal suffix gender (m; f; NA; unknown.)

prs_nu

str

✅ pronominal suffix number (sg; du; pl; NA; unknown.)

prs_ps

str

✅ pronominal suffix person (p1; p2; p3; NA; unknown.)

ps

str

✅ grammatical person (p1; p2; p3; NA; unknown.)

qere

str

✅ word pointed-transliterated masoretic reading correction

qere_trailer

str

✅ interword material -pointed-transliterated (Masoretic correction)

qere_trailer_utf8

str

✅ interword material -pointed-transliterated (Masoretic correction)

qere_utf8

str

✅ word pointed-Hebrew masoretic reading correction

rank_lex

int

✅ ranking of lexemes based on frequency

rela

str

✅ linguistic relation between clause/(sub)phrase(atom) (ADJ; MOD; ATR; ...)

sp

str

✅ part-of-speech (art; verb; subs; nmpr, ...)

st

str

✅ state of a noun (a (absolute); c (construct); e (emphatic).)

tab

int

✅ clause atom: its level in the linguistic embedding

trailer

str

✅ interword material pointed-transliterated (& 00 05 00_P ...)

trailer_utf8

str

✅ interword material pointed-Hebrew (־ ׃)

txt

str

✅ text type of clause and surrounding (repetition of ? N D Q as in feature domain)

typ

str

✅ clause/phrase(atom) type (VP; NP; Ellp; Ptcp; WayX)

uvf

str

✅ univalent final consonant consonantal-transliterated (absent; N; J; ...)

vbe

str

✅ verbal ending consonantal-transliterated (n/a; W; ...)

vbs

str

✅ root formation consonantal-transliterated (absent; n/a; H; ...)

verse

int

✅ verse number

voc_lex

str

✅ vocalized lexeme pointed-transliterated (B.: R;>CIJT BR> >:ELOHIJM)

voc_lex_utf8

str

✅ vocalized lexeme pointed-Hebrew (בְּ רֵאשִׁית ברא אֱלֹהִים)

vs

str

✅ verbal stem (qal; piel; hif; apel; pael)

vt

str

✅ verbal tense (perf; impv; wayq; infc)

mother

none

✅ linguistic dependency between textual objects

oslots

none

Phonetic Transcriptions

phono

str

🆗 phonological transcription (bᵊ rēšˌîṯ bārˈā ʔᵉlōhˈîm)

phono_trailer

str

🆗 interword material in phonological transcription

Settings:

specified

apiVersion: 3
appName: ETCBC/bhsa
appPath: /Users/me/github/ETCBC/bhsa/app
commit: no value
css: ''
dataDisplay:
- exampleSectionHtml:
  <code>Genesis 1:1</code> (use <a href="https://github.com/{org}/{repo}/blob/master/tf/{version}/book%40en.tf" target="_blank">English book names</a>)
- excludedFeatures:
  - g_uvf_utf8
  - g_vbs
  - kq_hybrid
  - languageISO
  - g_nme
  - lex0
  - is_root
  - g_vbs_utf8
  - g_uvf
  - dist
  - root
  - suffix_person
  - g_vbe
  - dist_unit
  - suffix_number
  - distributional_parent
  - kq_hybrid_utf8
  - crossrefSET
  - instruction
  - g_prs
  - lexeme_count
  - rank_occ
  - g_pfm_utf8
  - freq_occ
  - crossrefLCS
  - functional_parent
  - g_pfm
  - g_nme_utf8
  - g_vbe_utf8
  - kind
  - g_prs_utf8
  - suffix_gender
  - mother_object_type
- noneValues:
  - absent
  - n/a
  - none
  - unknown
  - no value
  - NA
docs:
- docBase: {docRoot}/{repo}
- docExt: ''
- docPage: ''
- docRoot: https://{org}.github.io
- featurePage: 0_home
interfaceDefaults: {}
isCompatible: True
local: clone
localDir: /Users/me/github/ETCBC/bhsa/_temp
provenanceSpec:
- corpus: BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis
- doi: 10.5281/zenodo.1007624
- extraData: ner
- moduleSpecs:
  - :
    backend: no value
    corpus: Phonetic Transcriptions
    docUrl:
    https://nbviewer.jupyter.org/github/etcbc/phono/blob/master/programs/phono.ipynb
    doi: 10.5281/zenodo.1007636
    org: ETCBC
    relative: /tf
    repo: phono
  - :
    backend: no value
    corpus: Parallel Passages
    docUrl:
    https://nbviewer.jupyter.org/github/ETCBC/parallels/blob/master/programs/parallels.ipynb
    doi: 10.5281/zenodo.1007642
    org: ETCBC
    relative: /tf
    repo: parallels
- org: ETCBC
- relative: /tf
- repo: bhsa
- version: 2021
- webBase: https://shebanq.ancient-data.org/hebrew
- webHint: Show this on SHEBANQ
- webLang: la
- webLexId: True
- webUrl:
  {webBase}/text?book=<1>&chapter=<2>&verse=<3>&version={version}&mr=m&qw=q&tp=txt_p&tr=hb&wget=v&qget=v&nget=vt
- webUrlLex: {webBase}/word?version={version}&id=<lid>
release: no value
typeDisplay:
- clause:
  - label: {typ} {rela}
  - style: ''
- clause_atom:
  - hidden: True
  - label: {code}
  - level: 1
  - style: ''
- half_verse:
  - hidden: True
  - label: {label}
  - style: ''
  - verselike: True
- lex:
  - featuresBare: gloss
  - label: {voc_lex_utf8}
  - lexOcc: word
  - style: orig
  - template: {voc_lex_utf8}
- phrase:
  - label: {typ} {function}
  - style: ''
- phrase_atom:
  - hidden: True
  - label: {typ} {rela}
  - level: 1
  - style: ''
- sentence:
  - label: {number}
  - style: ''
- sentence_atom:
  - hidden: True
  - label: {number}
  - level: 1
  - style: ''
- subphrase:
  - hidden: True
  - label: {number}
  - style: ''
- word:
  - features: pdp vs vt
  - featuresBare: lex:gloss
writing: hbo

Before we start, we make a test instruction set.

We select the lexemes that start with BJT_ and occur in more than one consonantal form. We collect these occurrence forms and use them to populate a spreadsheet with instructions.

See the file ner/sheets/places.xlsx

In [4]:

F = A.api.F

In [5]:

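# Map each lexeme to the set of consonantal forms in which it occurs:
# once in transliteration, once in Hebrew script.
candidates = {}
candidates_utf8 = {}

for w in F.otype.s("word"):
    lex = F.lex.v(w)
    # Skip every word whose lexeme does not start with BJT_
    if not lex.startswith("BJT_"):
        continue
    lex_utf8 = F.lex_utf8.v(w)
    candidates.setdefault(lex, set()).add(F.g_cons.v(w))
    candidates_utf8.setdefault(lex_utf8, set()).add(F.g_cons_utf8.v(w))

# Keep only the lexemes that occur in more than one form.
multiples = {lex: shapes for (lex, shapes) in candidates.items() if len(shapes) > 1}
multiples_utf8 = {lex: shapes for (lex, shapes) in candidates_utf8.items() if len(shapes) > 1}


def show(d):
    """Print each lexeme, then its forms with spaces shown as underscores."""
    for (k, vs) in sorted(d.items()):
        print(k)
        print("\t" + (" ; ".join(v.replace(" ", "_") for v in vs)))


show(multiples)
show(multiples_utf8)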

BJT_C>N/
	BJT_C>N ; BJT_CN
BJT_DGWN/
	BJT_DGN ; BJT_DGWN
BJT_HJCMWT/
	BJT_HJCJMT ; BJT_HJCMT ; BJT_HJCMWT
BJT_XWRWN/
	BJT_XRWN ; BJT_XWRN ; BJT_XRN ; BJT_XWRWN
בית דגון
	בית_דגון ; בית_דגן
בית הישׁמות
	בית_הישׁמות ; בית_הישׁימת ; בית_הישׁמת
בית חורון
	בית_חורון ; בית_חורן ; בית_חרן ; בית_חרון
בית שׁאן
	בית_שׁאן ; בית_שׁן
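The step from these dictionaries to the spreadsheet is not shown in this notebook. The following is a minimal sketch of one way to do it, under loud assumptions: the two-column layout (name, occurrence forms) and the helper name sheet_rows are hypothetical, the real layout of ner/sheets/places.xlsx may differ, and the sketch writes CSV text instead of an .xlsx file so that it needs only the standard library. The sample data is copied from the output above, as displayed.

```python
import csv
import io

# Sample copied from the output above; in the notebook this role would be
# played by multiples_utf8.
sample = {
    "בית דגון": {"בית_דגון", "בית_דגן"},
    "בית חורון": {"בית_חורון", "בית_חורן", "בית_חרן", "בית_חרון"},
}


def sheet_rows(variants_by_lexeme):
    """Yield one (name, forms) row per lexeme; forms are sorted for stable output."""
    for lexeme, forms in sorted(variants_by_lexeme.items()):
        yield lexeme, "; ".join(sorted(forms))


buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["name", "occurrence forms"])  # hypothetical header row
writer.writerows(sheet_rows(sample))
csv_text = buffer.getvalue()
print(csv_text)
```

Sorting the forms is a choice made here only to keep the rows reproducible; the sets printed above come out in arbitrary order.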

Now we start with the entity assignment.

In [6]:

NE = A.makeNer()

In [8]:

NE.readInstructions("places", force=True)

4 entities with 11 occurrence specs
0 entities do not have occurrence specifiers
All occurrence specifiers are unambiguous

In [9]:

NE.makeInventory()
NE.showInventory()

בית.דגון                 LOC   בית_דגון                 1 x בית דגון
בית.דגון                 LOC   בית_דגן                  1 x בית דגון
בית.חורון                LOC   בית_חורון                5 x בית חורון
בית.חורון                LOC   בית_חורן                 5 x בית חורון
בית.חורון                LOC   בית_חרון                 3 x בית חורון
בית.חורון                LOC   בית_חרן                  1 x בית חורון
Total 16
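The inventory lists one row per occurrence form, so the number of hits per entity is not shown directly. A small sketch that sums the rows per entity; the triples are copied by hand from the output above, and nothing here calls the NE object.

```python
from collections import Counter

# (entity, occurrence form, count) copied from the inventory output above.
inventory = [
    ("בית.דגון", "בית_דגון", 1),
    ("בית.דגון", "בית_דגן", 1),
    ("בית.חורון", "בית_חורון", 5),
    ("בית.חורון", "בית_חורן", 5),
    ("בית.חורון", "בית_חרון", 3),
    ("בית.חורון", "בית_חרן", 1),
]

# Sum the counts of all forms that belong to the same entity.
per_entity = Counter()
for entity, _form, count in inventory:
    per_entity[entity] += count

total = sum(per_entity.values())

for entity, count in per_entity.items():
    print(f"{count:>3} x {entity}")
print("Total", total)
```

The two entities account for 2 and 14 occurrences, which add up to the reported total of 16.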

In [10]:

NE.setSet("power")

Annotation set power has 16 annotations

In [11]:

NE.resetSet()

Annotation set power has 0 annotations

In [12]:

NE.markEntities()

Already present:     0 x
Added:              16 x

In [14]:

results = NE.filterContent(anyEnt=True, showStats=None)

15 verses

In [15]:

NE.showContent(results)

Joshua 10:10 ויהמם יהוה לפני ישׂראל ויכם מכה־גדולה בגבעון וירדפם דרך מעלה 1בית.חורון LOC 14בית חורן 1ויכם עד־עזקה ועד־מקדה׃

Joshua 10:11 ויהי בנסם׀ מפני ישׂראל הם במורד 1בית.חורון LOC 14בית חורן 1ויהוה השׁליך עליהם אבנים גדלות מן־השׁמים עד־עזקה וימתו רבים אשׁר־מתו באבני הברד מאשׁר הרגו בני ישׂראל בחרב׃ ס

Joshua 15:41 וגדרות 1בית.דגון LOC 2בית דגון 1ונעמה ומקדה ערים שׁשׁ־עשׂרה וחצריהן׃ ס

Joshua 16:3 וירד־ימה אל־גבול היפלטי עד גבול 1בית.חורון LOC 14בית חורן 1תחתון ועד־גזר והיו תצאתו ימה׃

Joshua 16:5 ויהי גבול בני־אפרים למשׁפחתם ויהי גבול נחלתם מזרחה עטרות אדר עד־1בית.חורון LOC 14בית חורן 1עליון׃

Joshua 18:13 ועבר משׁם הגבול לוזה אל־כתף לוזה נגבה היא בית אל וירד הגבול עטרות אדר על־ההר אשׁר מנגב ל1בית.חורון LOC 14בית חרון 1תחתון׃

Joshua 18:14 ותאר הגבול ונסב לפאת־ים נגבה מן־ההר אשׁר על־פני 1בית.חורון LOC 14בית חרון 1נגבה והיה תצאתיו אל־קרית בעל היא קרית יערים עיר בני יהודה זאת פאת־ים׃

Joshua 19:27 ושׁב מזרח השׁמשׁ 1בית.דגון LOC 2בית דגן 1ופגע בזבלון ובגי יפתח אל צפונה בית העמק ונעיאל ויצא אל־כבול משׂמאל׃

Joshua 21:22 ואת־קבצים ואת־מגרשׁה ואת־1בית.חורון LOC 14בית חורן 1ואת־מגרשׁה ערים ארבע׃ ס

1_Samuel 13:18 והראשׁ אחד יפנה דרך 1בית.חורון LOC 14בית חרון 1והראשׁ אחד יפנה דרך הגבול הנשׁקף על־גי הצבעים המדברה׃ ס

1_Kings 9:17 ויבן שׁלמה את־גזר ואת־1בית.חורון LOC 14בית חרן 1תחתון׃

1_Chronicles 6:53 ואת־יקמעם ואת־מגרשׁיה ואת־1בית.חורון LOC 14בית חורון 1ואת־מגרשׁיה׃

1_Chronicles 7:24 ובתו שׁארה ותבן את־1בית.חורון LOC 14בית חורון 1התחתון ואת־העליון ואת אזן שׁארה׃

2_Chronicles 8:5 ויבן את־1בית.חורון LOC 14בית חורון 1העליון ואת־1בית.חורון LOC 14בית חורון 1התחתון ערי מצור חומות דלתים ובריח׃

2_Chronicles 25:13 ובני הגדוד אשׁר השׁיב אמציהו מלכת עמו למלחמה ויפשׁטו בערי יהודה משׁמרון ועד־1בית.חורון LOC 14בית חורון 1ויכו מהם שׁלשׁת אלפים ויבזו בזה רבה׃ ס
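The filter reports 15 verses although 16 annotations were added. A short sketch that reconciles the two numbers; the verse references are copied by hand from the listing above, one entry per marked occurrence.

```python
from collections import Counter

# One entry per marked occurrence, copied from the showContent output above.
occurrences = [
    "Joshua 10:10", "Joshua 10:11", "Joshua 15:41", "Joshua 16:3",
    "Joshua 16:5", "Joshua 18:13", "Joshua 18:14", "Joshua 19:27",
    "Joshua 21:22", "1_Samuel 13:18", "1_Kings 9:17",
    "1_Chronicles 6:53", "1_Chronicles 7:24",
    "2_Chronicles 8:5", "2_Chronicles 8:5",  # two occurrences in one verse
    "2_Chronicles 25:13",
]

per_verse = Counter(occurrences)

# Verses that contain more than one marked occurrence.
repeated = {verse: n for verse, n in per_verse.items() if n > 1}

print(len(occurrences), "occurrences in", len(per_verse), "verses")
print(repeated)
```

The difference comes from 2_Chronicles 8:5, which contains the entity twice.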
