We want to list all nouns together with their adjectival modifiers. To that end we produce a tab-separated file of phrases that contain a noun and its adjectival modifiers. The columns are: passage, phrase_text, phrase_gloss, head, attributive, #words_mother, #nouns_mother.
Hebrew text is represented in ETCBC consonantal transcription, which makes the file easy to import into Excel. Generating fully vocalized Hebrew is not difficult, but then you need OpenOffice to open the csv file.
import sys, os
import collections
import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
fabric = LafFabric()
0.00s This is LAF-Fabric 4.8.3
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html
version = '4b'
API = fabric.load('etcbc{}'.format(version), 'lexicon', 'adjectives', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype
        function rela sp
        gloss
        g_word_utf8 trailer_utf8
        book chapter verse number
    ''',
    '''
        mother
    '''),
    "prepare": prepare,
    "primary": False,
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))
0.00s LOADING API: please wait ...
0.01s DETAIL: COMPILING m: etcbc4b: UP TO DATE
0.01s USING main: etcbc4b DATA COMPILED AT: 2015-11-02T15-08-56
0.01s DETAIL: COMPILING a: lexicon: UP TO DATE
0.01s USING annox: lexicon DATA COMPILED AT: 2016-07-08T14-32-54
0.03s DETAIL: load main: G.node_anchor_min
0.12s DETAIL: load main: G.node_anchor_max
0.21s DETAIL: load main: G.node_sort
0.27s DETAIL: load main: G.node_sort_inv
0.65s DETAIL: load main: G.edges_from
0.71s DETAIL: load main: G.edges_to
0.77s DETAIL: load main: F.etcbc4_db_otype [node]
1.36s DETAIL: load main: F.etcbc4_ft_function [node]
1.47s DETAIL: load main: F.etcbc4_ft_g_word_utf8 [node]
1.74s DETAIL: load main: F.etcbc4_ft_number [node]
2.61s DETAIL: load main: F.etcbc4_ft_rela [node]
3.15s DETAIL: load main: F.etcbc4_ft_sp [node]
3.49s DETAIL: load main: F.etcbc4_ft_trailer_utf8 [node]
3.75s DETAIL: load main: F.etcbc4_sft_book [node]
3.78s DETAIL: load main: F.etcbc4_sft_chapter [node]
3.80s DETAIL: load main: F.etcbc4_sft_verse [node]
3.81s DETAIL: load main: F.etcbc4_ft_mother [e]
3.93s DETAIL: load main: C.etcbc4_ft_mother ->
4.28s DETAIL: load main: C.etcbc4_ft_mother <-
4.41s DETAIL: load annox lexicon: F.etcbc4_lex_gloss [node]
4.61s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4b/adjectives/__log__adjectives.txt
4.63s INFO: LOADING PREPARED data: please wait ...
4.63s prep prep: G.node_sort
4.69s prep prep: G.node_sort_inv
5.12s prep prep: L.node_up
7.93s prep prep: L.node_down
13s prep prep: V.verses
13s prep prep: V.books_la
13s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
15s INFO: LOADED PREPARED data
15s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK adjectives AT 2016-11-18T18-41-00
Let us first collect the subphrases that have rela = atr.
attr_subphrases = set()
inf('Finding subphrases ...')
for s in F.otype.s('subphrase'):
    # keep only subphrases whose relation to their mother is attributive
    if F.rela.v(s) != 'atr':
        continue
    attr_subphrases.add(s)
inf('{} attributive subphrases'.format(len(attr_subphrases)))
7.65s Finding subphrases ...
8.88s 3106 attributive subphrases
Now let us add the mothers to those subphrases. If a subphrase has no mother, we leave it out. A subphrase should not have multiple mothers, but we check that anyway.
attr_subphrase_mother = dict()
multiple_mothers = set()
no_mothers = set()
for s in attr_subphrases:
    mothers = list(C.mother.v(s))
    if len(mothers) == 0:
        no_mothers.add(s)
        continue
    if len(mothers) > 1:
        multiple_mothers.add(s)
        continue
    # exactly one mother: record it
    attr_subphrase_mother[s] = mothers[0]
if len(multiple_mothers):
    msg('{} subphrases with multiple mothers'.format(len(multiple_mothers)))
else:
    inf('No subphrases with multiple mothers')
if len(no_mothers):
    msg('{} subphrases without mothers'.format(len(no_mothers)))
else:
    inf('No subphrases without mothers')
inf('{} attributive subphrases with a single mother'.format(len(attr_subphrase_mother)))
15s No subphrases with multiple mothers
15s 12 subphrases without mothers
15s 3094 attributive subphrases with a single mother
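The single-mother check above does not depend on LAF-Fabric itself, so it can be tried out on toy data. The following is a minimal sketch under stated assumptions: `partition_by_mother_count` and the `edges` dictionary are hypothetical names introduced for illustration, and the callable passed in plays the role that `C.mother.v` has in the notebook.

```python
def partition_by_mother_count(nodes, get_mothers):
    """Split nodes into (single, none, multiple) by their number of mothers.

    get_mothers is any callable that returns an iterable of mother nodes.
    """
    single = {}       # node -> its only mother
    none = set()      # nodes without a mother
    multiple = set()  # nodes with more than one mother
    for n in nodes:
        mothers = list(get_mothers(n))
        if not mothers:
            none.add(n)
        elif len(mothers) > 1:
            multiple.add(n)
        else:
            single[n] = mothers[0]
    return single, none, multiple


# Toy data: hypothetical node numbers, not taken from the ETCBC database.
edges = {10: [1], 11: [], 12: [2, 3], 13: [4]}
single, none, multiple = partition_by_mother_count(edges, lambda n: edges[n])
print(single, none, multiple)
```

Keeping the three outcomes in separate containers makes it easy to report the exceptional cases, as the notebook does, without losing track of which nodes they are.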
Let us get some information about the mothers of those subphrases. What kind of objects are they?
mother_types = collections.Counter()
for (s, m) in attr_subphrase_mother.items():
    mother_types[F.otype.v(m)] += 1
for t in sorted(mother_types):
    print('{:>4} subphrases with a mother of type {}'.format(mother_types[t], t))
3094 subphrases with a mother of type subphrase
So the mother is always a subphrase. What about the length of that subphrase?
mother_length = collections.Counter()
for (s, m) in attr_subphrase_mother.items():
    # length of the mother, measured in words
    mother_length[len(L.d('word', m))] += 1
for t in sorted(mother_length):
    print('{:>4} subphrases with a mother of length {:>2}'.format(mother_length[t], t))
2085 subphrases with a mother of length  1
 919 subphrases with a mother of length  2
  62 subphrases with a mother of length  3
  14 subphrases with a mother of length  4
  11 subphrases with a mother of length  5
   1 subphrases with a mother of length  7
   1 subphrases with a mother of length  8
   1 subphrases with a mother of length  9
How many nouns does the mother contain?
mother_nouns = collections.Counter()
for (s, m) in attr_subphrase_mother.items():
    # count the words in the mother whose part of speech is noun (subs)
    mother_nouns[len([w for w in L.d('word', m) if F.sp.v(w) == 'subs'])] += 1
for t in sorted(mother_nouns):
    print('{:>4} subphrases with a mother having {:>2} nouns'.format(mother_nouns[t], t))
  63 subphrases with a mother having  0 nouns
2867 subphrases with a mother having  1 nouns
 137 subphrases with a mother having  2 nouns
  12 subphrases with a mother having  3 nouns
   6 subphrases with a mother having  4 nouns
   8 subphrases with a mother having  5 nouns
   1 subphrases with a mother having  6 nouns
Let us now assemble all data into the final output. We also produce a row of column headers.
fields = '''
passage
phrase_text
phrase_gloss
head
attributive
#words_mother
#nouns_mother
'''.strip().split()
nfields = len(fields)
row_template = ('{}\t' * (nfields - 1))+'{}\n'
of_path_template = 'attributives_{}.csv'
for fmt in ['ec', 'ha']:
    of_path = of_path_template.format(fmt)
    with open(of_path, 'w') as of:
        of.write('{}\n'.format('\t'.join(fields)))
        for s in sorted(attr_subphrase_mother, key=NK):
            sw = list(L.d('word', s))     # words of the attributive subphrase
            p = L.u('phrase', s)          # the phrase that contains it
            pw = list(L.d('word', p))     # words of that phrase
            m = attr_subphrase_mother[s]  # the mother subphrase (the head)
            mw = list(L.d('word', m))     # words of the mother
            of.write(row_template.format(
                T.passage(s),
                T.words(pw, fmt=fmt).replace('\n', ' '),
                ' '.join(F.gloss.v(w) for w in pw),
                T.words(mw, fmt=fmt).replace('\n', ' '),
                T.words(sw, fmt=fmt).replace('\n', ' '),
                len(mw),
                len([w for w in mw if F.sp.v(w) == 'subs']),
            ))
    inf('Written {} lines to {}'.format(len(attr_subphrase_mother) + 1, of_path))
25s Written 3095 lines to attributives_ec.csv
25s Written 3095 lines to attributives_ha.csv
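As an alternative to formatting each row by hand, the standard `csv` module can write tab-separated rows and takes care of quoting when a field happens to contain a tab or newline. The sketch below is only an illustration: `write_tsv` and the file name in the trailing comment are hypothetical, and the single data row is copied from the notebook output for Genesis 1:8. The `utf-8-sig` encoding mentioned in the comment prepends a byte order mark, which some spreadsheet programs use to detect UTF-8; whether that helps with the vocalized Hebrew file in a given Excel version is an assumption to verify.

```python
import csv
import io

fields = ['passage', 'phrase_text', 'phrase_gloss', 'head',
          'attributive', '#words_mother', '#nouns_mother']

# One sample row, copied from the notebook output for Genesis 1:8.
rows = [('Genesis 1:8', 'JWM #NJ00', 'day second', 'JWM', '#NJ00', 1, 1)]


def write_tsv(handle, header, data):
    """Write a header row and data rows as tab-separated text."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    writer.writerow(header)
    writer.writerows(data)


# Write to an in-memory buffer so the sketch runs without touching the disk.
buffer = io.StringIO()
write_tsv(buffer, fields, rows)
tsv_text = buffer.getvalue()
print(tsv_text)

# To write a real file instead (hypothetical file name):
# with open('attributives_demo.csv', 'w', encoding='utf-8-sig', newline='') as fh:
#     write_tsv(fh, fields, rows)
```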
print(open(of_path_template.format('ec')).read()[0:1000])
passage	phrase_text	phrase_gloss	head	attributive	#words_mother	#nouns_mother
Genesis 1:8	JWM #NJ00	day second	JWM	#NJ00	1	1
Genesis 1:13	JWM #LJ#J00	day third	JWM	#LJ#J00	1	1
Genesis 1:16	>T&#NJ HM>RT HGDLJM	<object marker> two the lamp the great	HM>RT	HGDLJM	2	1
Genesis 1:16	>T&HM>WR HGDL LMM#LT HJWM	<object marker> the lamp the great to dominion the day	HM>WR	HGDL	2	1
Genesis 1:16	>T&HM>WR HQVN LMM#LT HLJLH	<object marker> the lamp the small to dominion the night	HM>WR	HQVN	2	1
Genesis 1:19	JWM RBJ<J00	day fourth	JWM	RBJ<J00	1	1
Genesis 1:20	#RY NP# XJH	swarming creatures soul alive	NP#	XJH	1	1
Genesis 1:21	>T&HTNJNM HGDLJM W>T KL&NP#	<object marker> the sea-monster the great and <object marker> whole soul	HTNJNM	HGDLJM	2	1
Genesis 1:23	JWM XMJ#J00	day fifth	JWM	XMJ#J00	1	1
Genesis 1:24	NP# XJH	soul alive	NP#	XJH	1	1
Genesis 1:30	NP# XJH	soul alive	NP#	XJH	1	1
Genesis 2:2	BJWM H#BJ<J	in the day the seventh	JWM	H#BJ<J	2	1
Genesis 2:2	BJWM H#BJ<J	i
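The resulting file can also be read back in Python for further filtering. The sketch below is a minimal illustration that assumes the column layout shown above. `read_rows` is a hypothetical helper, and the sample text holds two rows copied from the output so that the example runs without the real file.

```python
import csv
import io

# Header plus two rows copied from the notebook output; tabs separate the columns.
sample = (
    'passage\tphrase_text\tphrase_gloss\thead\tattributive\t#words_mother\t#nouns_mother\n'
    'Genesis 1:8\tJWM #NJ00\tday second\tJWM\t#NJ00\t1\t1\n'
    'Genesis 1:16\t>T&#NJ HM>RT HGDLJM\t<object marker> two the lamp the great\tHM>RT\tHGDLJM\t2\t1\n'
)


def read_rows(handle):
    """Read tab-separated rows into a list of dicts keyed by column header."""
    # QUOTE_NONE: the notebook writes fields verbatim, without any quoting.
    reader = csv.DictReader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_rows(io.StringIO(sample))

# Example filter: keep rows whose head (mother) consists of a single word.
one_word_heads = [r for r in rows if int(r['#words_mother']) == 1]
for r in one_word_heads:
    print(r['passage'], r['head'], r['attributive'])
```

For the real data, pass an open file handle, for example `open('attributives_ec.csv', newline='')`, to `read_rows` instead of the in-memory sample.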