%load_ext autoreload
%autoreload 2
from tf.app import use
A = use("ETCBC/bhsa:clone", checkout="clone")
Locating corpus resources ...
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 39 | 10938.21 | 100 |
chapter | 929 | 459.19 | 100 |
lex | 9230 | 46.22 | 100 |
verse | 23213 | 18.38 | 100 |
half_verse | 45179 | 9.44 | 100 |
sentence | 63717 | 6.70 | 100 |
sentence_atom | 64514 | 6.61 | 100 |
clause | 88131 | 4.84 | 100 |
clause_atom | 90704 | 4.70 | 100 |
phrase | 253203 | 1.68 | 100 |
phrase_atom | 267532 | 1.59 | 100 |
subphrase | 113850 | 1.42 | 38 |
word | 426590 | 1.00 | 100 |
BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis (DOI 10.5281/zenodo.1007624)
Phonetic Transcriptions (ETCBC/phono, DOI 10.5281/zenodo.1007636): https://nbviewer.jupyter.org/github/etcbc/phono/blob/master/programs/phono.ipynb
Parallel Passages (ETCBC/parallels, DOI 10.5281/zenodo.1007642): https://nbviewer.jupyter.org/github/ETCBC/parallels/blob/master/programs/parallels.ipynb
Before we start, we make a test instruction set.
We pick all words whose lexeme starts with BJT_
and that occur in more than one consonantal form.
We collect these forms and use them to populate a spreadsheet with instructions.
See the file ner/sheets/places.xlsx.
F = A.api.F

# Map each lexeme to the set of consonantal forms in which it occurs,
# once in transliteration and once in Hebrew script.
candidates = {}
candidates_utf8 = {}

for w in F.otype.s("word"):
    lex = F.lex.v(w)
    if not lex.startswith("BJT_"):
        continue
    lex_utf8 = F.lex_utf8.v(w)
    candidates.setdefault(lex, set()).add(F.g_cons.v(w))
    candidates_utf8.setdefault(lex_utf8, set()).add(F.g_cons_utf8.v(w))

# Keep only the lexemes that occur in more than one form.
multiples = {lex: shapes for (lex, shapes) in candidates.items() if len(shapes) > 1}
multiples_utf8 = {lex: shapes for (lex, shapes) in candidates_utf8.items() if len(shapes) > 1}


def show(d):
    # Print each lexeme, then its forms with spaces shown as underscores.
    for (k, vs) in sorted(d.items()):
        print(k)
        print("\t" + (" ; ".join(v.replace(" ", "_") for v in vs)))


show(multiples)
show(multiples_utf8)
BJT_C>N/
	BJT_C>N ; BJT_CN
BJT_DGWN/
	BJT_DGN ; BJT_DGWN
BJT_HJCMWT/
	BJT_HJCJMT ; BJT_HJCMT ; BJT_HJCMWT
BJT_XWRWN/
	BJT_XRWN ; BJT_XWRN ; BJT_XRN ; BJT_XWRWN
בית דגון
	בית_דגון ; בית_דגן
בית הישׁמות
	בית_הישׁמות ; בית_הישׁימת ; בית_הישׁמת
בית חורון
	בית_חורון ; בית_חורן ; בית_חרן ; בית_חרון
בית שׁאן
	בית_שׁאן ; בית_שׁן
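The step from collected forms to spreadsheet rows can be sketched as follows. This is a minimal illustration, not the actual layout of ner/sheets/places.xlsx: the column names (name, kind, triggers), the `LOC` default, the `;` separator, and the helpers `make_rows` and `rows_to_csv` are all assumptions made for this example. It writes CSV with the standard library instead of a real .xlsx file.

```python
import csv
import io


def make_rows(multiples, kind="LOC"):
    """Turn {lexeme: set of forms} into one row per lexeme.

    Hypothetical layout: lexeme, entity kind, all forms joined with ";".
    """
    rows = []
    for lex, shapes in sorted(multiples.items()):
        triggers = ";".join(sorted(shapes))
        rows.append((lex, kind, triggers))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with an (assumed) header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(("name", "kind", "triggers"))
    writer.writerows(rows)
    return buffer.getvalue()


# Two entries copied from the transliterated output above.
sample = {
    "BJT_DGWN/": {"BJT_DGN", "BJT_DGWN"},
    "BJT_C>N/": {"BJT_C>N", "BJT_CN"},
}

rows = make_rows(sample)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Sorting both the lexemes and their forms makes the result reproducible, since the forms are held in sets and would otherwise come out in arbitrary order.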
Now we start with the entity assignment.
NE = A.makeNer()
NE.readInstructions("places", force=True)
4 entities with 11 occurrence specs
0 entities do not have occurrence specifiers
All occurrence specifiers are unambiguous
NE.makeInventory()
NE.showInventory()
בית.דגון LOC בית_דגון 1 x בית דגון
בית.דגון LOC בית_דגן 1 x בית דגון
בית.חורון LOC בית_חורון 5 x בית חורון
בית.חורון LOC בית_חורן 5 x בית חורון
בית.חורון LOC בית_חרון 3 x בית חורון
בית.חורון LOC בית_חרן 1 x בית חורון
Total 16
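As a cross-check on the inventory, the per-trigger counts can be summed per entity. The sketch below copies the rows from the output above and reads the columns as entity identifier, kind, trigger, and hit count; that reading, and the helper `totals_by_entity`, are assumptions for illustration and not part of the NER API.

```python
from collections import Counter

# Rows copied from the inventory output: (entity id, kind, trigger, count).
inventory = [
    ("בית.דגון", "LOC", "בית_דגון", 1),
    ("בית.דגון", "LOC", "בית_דגן", 1),
    ("בית.חורון", "LOC", "בית_חורון", 5),
    ("בית.חורון", "LOC", "בית_חורן", 5),
    ("בית.חורון", "LOC", "בית_חרון", 3),
    ("בית.חורון", "LOC", "בית_חרן", 1),
]


def totals_by_entity(rows):
    """Sum the hit counts of all triggers that belong to the same entity."""
    totals = Counter()
    for eid, kind, trigger, n in rows:
        totals[(eid, kind)] += n
    return totals


totals = totals_by_entity(inventory)
grand_total = sum(totals.values())

for (eid, kind), n in sorted(totals.items()):
    print(f"{eid} {kind}: {n} x")
print(f"Total {grand_total}")
```

The grand total computed this way should agree with the `Total` line of the inventory.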
NE.setSet("power")
Annotation set power has 16 annotations
NE.resetSet()
Annotation set power has 0 annotations
NE.markEntities()
Already present: 0 x
Added: 16 x
results = NE.filterContent(anyEnt=True, showStats=None)
15 verses
NE.showContent(results)