Some words in the Hebrew text are contiguous with the preceding or following word: no white space separates them, only nothing at all (the empty string) or a maqaf (hyphen). We call such a sequence of adjacent words an accented unit.
How do you find, given a word occurrence, the complete accented unit it belongs to?
import sys, os
import laf
from laf.fabric import LafFabric
#from etcbc.preprocess import prepare
fabric = LafFabric()
0.00s This is LAF-Fabric 4.8.2
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html
version = '4b'
API = fabric.load('etcbc{}'.format(version), 'lexicon,para', 'paragraphs', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype monads
        g_word_utf8 trailer_utf8
    ''',
    '''
    '''),
#   "prepare": prepare,
    "primary": False,
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))
0.00s LOADING API: please wait ...
0.00s DETAIL: COMPILING m: etcbc4b: UP TO DATE
0.00s USING main: etcbc4b DATA COMPILED AT: 2015-11-02T15-08-56
0.00s DETAIL: COMPILING a: lexicon: UP TO DATE
0.01s USING annox: lexicon DATA COMPILED AT: 2016-07-08T14-32-54
0.01s DETAIL: COMPILING a: para: UP TO DATE
0.01s USING annox: para DATA COMPILED AT: 2016-07-08T14-38-37
0.02s DETAIL: load main: G.node_anchor_min
0.10s DETAIL: load main: G.node_anchor_max
0.21s DETAIL: load main: G.node_sort
0.28s DETAIL: load main: G.node_sort_inv
0.68s DETAIL: load main: G.edges_from
0.74s DETAIL: load main: G.edges_to
0.80s DETAIL: load main: F.etcbc4_db_monads [node]
1.48s DETAIL: load main: F.etcbc4_db_otype [node]
2.71s DETAIL: load main: F.etcbc4_ft_g_word_utf8 [node]
2.99s DETAIL: load main: F.etcbc4_ft_trailer_utf8 [node]
3.11s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4b/paragraphs/__log__paragraphs.txt
3.11s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon, para FOR TASK paragraphs AT 2016-09-26T17-56-23
We use trailer_utf8 as the criterion for whether a word and the following word are contiguous. If it is the empty string or a maqaf, we conclude that the word in question and the next one are part of the same accented unit.
In LAF-Fabric it is not straightforward to proceed from a word node to the node of the next or previous word.
One solution is to walk through all the words in text order and build a mapping from words to accented units.
Once that mapping is stored, we can easily find the accented unit (au) of any word node that we encounter.
inf('Compiling index of accented units ...')
word2au = {}         # the mapping from word node to accented unit
aus = set()          # only needed to count the total number of accented units
glue = {'', '־'}     # the interword material that continues the current au
current_au = []
for w in F.otype.s('word'):
    current_au.append(w)
    word2au[w] = current_au
    if F.trailer_utf8.v(w) not in glue:  # move to a new au
        aus.add(tuple(current_au))
        current_au = []
if current_au: aus.add(tuple(current_au))
inf('Assembled {} words into {} accented units'.format(
    len(word2au.keys()),
    len(aus),
))
3.05s Compiling index of accented units ...
5.20s Assembled 426568 words into 262497 accented units
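To answer the opening question directly: once the mapping exists, finding the accented unit of a word is a single dictionary lookup. The sketch below illustrates this on toy data rather than on the ETCBC nodes. The `words` list, its integer node numbers, and the helper `build_word2au` are assumptions made for this example only; the grouping rule is the same glue test as in the cell above.

```python
GLUE = {'', '־'}  # trailers that keep the next word in the same unit

def build_word2au(words):
    """Map each word node to the shared list of nodes in its accented unit.

    `words` is a hypothetical list of (node, trailer) pairs in text order.
    """
    word2au = {}
    current_au = []
    for node, trailer in words:
        current_au.append(node)
        word2au[node] = current_au      # all words of a unit share one list
        if trailer not in GLUE:         # real separator: the unit ends here
            current_au = []
    return word2au

# Made-up nodes: 1+2 are glued by the empty string, 4+5 by a maqaf.
words = [(1, ''), (2, ' '), (3, ' '), (4, '־'), (5, ' ')]
word2au = build_word2au(words)

print(word2au[1])                # the unit that word 1 belongs to
print(word2au[5])                # the unit that word 5 belongs to
print(word2au[4] is word2au[5])  # same list object for words in one unit
```

Because every word of a unit points to the same list object, two words can be tested for membership of the same unit with `is`, which is what the display code relies on.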
We display the text of the first 11 verses of the Bible and mark each accented unit with a tuple of the monad numbers of its words.
text = ''
verse = 0
for n in NN():
    otype = F.otype.v(n)
    if otype == 'verse':
        verse += 1
        if verse > 11: break
        text += '\nGenesis 1:{}\n'.format(verse)
        prev_au = None  # start afresh; the pending au of the previous verse gets no label
    elif otype == 'word':
        this_au = word2au[n]
        # a new au starts here: first write the monads of the au that just ended
        if prev_au is not None and this_au is not prev_au:
            text += ' ({}) '.format(','.join(F.monads.v(x) for x in prev_au))
        prev_au = this_au
        text += F.g_word_utf8.v(n)+F.trailer_utf8.v(n)
# write the monads of the last au of verse 11
if prev_au:
    text += ' ({}) '.format(','.join(F.monads.v(x) for x in prev_au))
print(text)
Genesis 1:1 בְּרֵאשִׁ֖ית (1,2) בָּרָ֣א (3) אֱלֹהִ֑ים (4) אֵ֥ת (5) הַשָּׁמַ֖יִם (6,7) וְאֵ֥ת (8,9) הָאָֽרֶץ׃ Genesis 1:2 וְהָאָ֗רֶץ (12,13,14) הָיְתָ֥ה (15) תֹ֨הוּ֙ (16) וָבֹ֔הוּ (17,18) וְחֹ֖שֶׁךְ (19,20) עַל־פְּנֵ֣י (21,22) תְהֹ֑ום (23) וְר֣וּחַ (24,25) אֱלֹהִ֔ים (26) מְרַחֶ֖פֶת (27) עַל־פְּנֵ֥י (28,29) הַמָּֽיִם׃ Genesis 1:3 וַיֹּ֥אמֶר (32,33) אֱלֹהִ֖ים (34) יְהִ֣י (35) אֹ֑ור (36) וַֽיְהִי־אֹֽור׃ Genesis 1:4 וַיַּ֧רְא (40,41) אֱלֹהִ֛ים (42) אֶת־הָאֹ֖ור (43,44,45) כִּי־טֹ֑וב (46,47) וַיַּבְדֵּ֣ל (48,49) אֱלֹהִ֔ים (50) בֵּ֥ין (51) הָאֹ֖ור (52,53) וּבֵ֥ין (54,55) הַחֹֽשֶׁךְ׃ Genesis 1:5 וַיִּקְרָ֨א (58,59) אֱלֹהִ֤ים ׀ (60) לָאֹור֙ (61,63) יֹ֔ום (62,64) וְלַחֹ֖שֶׁךְ (65,66,68) קָ֣רָא (67,69) לָ֑יְלָה (70) וַֽיְהִי־עֶ֥רֶב (71,72,73) וַֽיְהִי־בֹ֖קֶר (74,75,76) יֹ֥ום (77) אֶחָֽד׃ פ Genesis 1:6 וַיֹּ֣אמֶר (79,80) אֱלֹהִ֔ים (81) יְהִ֥י (82) רָקִ֖יעַ (83) בְּתֹ֣וךְ (84,85) הַמָּ֑יִם (86,87) וִיהִ֣י (88,89) מַבְדִּ֔יל (90) בֵּ֥ין (91) מַ֖יִם (92) לָמָֽיִם׃ Genesis 1:7 וַיַּ֣עַשׂ (95,96) אֱלֹהִים֮ (97) אֶת־הָרָקִיעַ֒ (98,99,100) וַיַּבְדֵּ֗ל (101,102) בֵּ֤ין (103) הַמַּ֨יִם֙ (104,105) אֲשֶׁר֙ (106) מִתַּ֣חַת (107,108) לָרָקִ֔יעַ (109,111) וּבֵ֣ין (110,112,113) הַמַּ֔יִם (114,115) אֲשֶׁ֖ר (116) מֵעַ֣ל (117,118) לָרָקִ֑יעַ (119,121) וַֽיְהִי־כֵֽן׃ Genesis 1:8 וַיִּקְרָ֧א (125,126) אֱלֹהִ֛ים (127) לָֽרָקִ֖יעַ (128,130) שָׁמָ֑יִם (129,131) וַֽיְהִי־עֶ֥רֶב (132,133,134) וַֽיְהִי־בֹ֖קֶר (135,136,137) יֹ֥ום (138) שֵׁנִֽי׃ פ Genesis 1:9 וַיֹּ֣אמֶר (140,141) אֱלֹהִ֗ים (142) יִקָּו֨וּ (143) הַמַּ֜יִם (144,145) מִתַּ֤חַת (146,147) הַשָּׁמַ֨יִם֙ (148,149) אֶל־מָקֹ֣ום (150,151) אֶחָ֔ד (152) וְתֵרָאֶ֖ה (153,154) הַיַּבָּשָׁ֑ה (155,156) וַֽיְהִי־כֵֽן׃ Genesis 1:10 וַיִּקְרָ֨א (160,161) אֱלֹהִ֤ים ׀ (162) לַיַּבָּשָׁה֙ (163,165) אֶ֔רֶץ (164,166) וּלְמִקְוֵ֥ה (167,168,169) הַמַּ֖יִם (170,171) קָרָ֣א (172) יַמִּ֑ים (173) וַיַּ֥רְא (174,175) אֱלֹהִ֖ים (176) כִּי־טֹֽוב׃ Genesis 1:11 וַיֹּ֣אמֶר (179,180) אֱלֹהִ֗ים (181) תַּֽדְשֵׁ֤א (182) הָאָ֨רֶץ֙ (183,184) דֶּ֔שֶׁא (185) עֵ֚שֶׂב (186) מַזְרִ֣יעַ 
(187) זֶ֔רַע (188) עֵ֣ץ (189) פְּרִ֞י (190) עֹ֤שֶׂה (191) פְּרִי֙ (192) לְמִינֹ֔ו (193,194) אֲשֶׁ֥ר (195) זַרְעֹו־בֹ֖ו (196,197) עַל־הָאָ֑רֶץ (198,199,200) וַֽיְהִי־כֵֽן׃ (201,202,203)
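In the output above, the last accented unit of verses 1 to 10 carries no label, because `prev_au` is reset at each verse node before the label of that unit has been written. One possible variant, sketched here on toy data, writes the pending label before a new verse starts. The function name `render`, the tagged `stream`, and the sample dictionaries are hypothetical stand-ins for the nodes and features used above, not part of LAF-Fabric.

```python
def render(stream, word2au, label_of, text_of):
    """Render words and put a label after every accented unit.

    stream   : hypothetical list of ('verse', heading) / ('word', node) items
    word2au  : word node -> shared list of the nodes in its unit
    label_of : word node -> label string (stands in for F.monads.v)
    text_of  : word node -> word text plus trailer
    """
    out = []
    prev_au = None

    def flush():
        # Write the label of the unit that has just ended, if there is one.
        if prev_au is not None:
            out.append(' ({}) '.format(','.join(label_of[x] for x in prev_au)))

    for kind, value in stream:
        if kind == 'verse':
            flush()                 # label the last unit of the previous verse
            prev_au = None
            out.append('\n{}\n'.format(value))
        else:
            this_au = word2au[value]
            if this_au is not prev_au:
                flush()             # a new unit starts: label the previous one
            prev_au = this_au
            out.append(text_of[value])
    flush()                         # label the very last unit
    return ''.join(out)

# Toy data: words 1 and 2 form one unit, words 3 and 4 stand alone.
au_a, au_b, au_c = [1, 2], [3], [4]
word2au = {1: au_a, 2: au_a, 3: au_b, 4: au_c}
label_of = {n: str(n) for n in word2au}
text_of = {1: 'a', 2: 'b ', 3: 'c ', 4: 'd '}
stream = [('verse', 'V1'), ('word', 1), ('word', 2), ('word', 3),
          ('verse', 'V2'), ('word', 4)]
result = render(stream, word2au, label_of, text_of)
print(result)
```

Here the unit made up of word 3 is labelled before the heading `V2` appears, so no unit is left unmarked at a verse boundary.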