This notebook composes syntax trees out of the BHSA data.
To this end, it imports the module etcbc.trees, which contains the logic to synthesize trees from the data as it lies encoded in an EMDROS database and has been translated to LAF.
etcbc.trees can also work for other data, such as the CALAP database of parts of the Syriac Peshitta.
This notebook invokes the functions of etcbc.trees to generate trees for every sentence in the Hebrew Bible.
After that it performs a dozen sanity checks on the trees.
In version 3 of the data, 13 sentences (out of 71354) did not pass some of these tests; in version 4, which is used here, sanity check 6 flags 3 sentences (out of 66045).
They will be excluded from further processing, since they probably have been coded wrong in the first place.
The process of tree construction is not straightforward, since the BHSA data have not been coded as syntax trees. Rather, they take the shape of a collection of features that describe observable characteristics of the words, phrases, clauses and sentences. Moreover, if a phrase, clause or sentence is discontinuous, it is divided into phrase_atoms, clause_atoms, or sentence_atoms, respectively, which are by definition continuous.
There are no explicit hierarchical relationships between these objects. But there is an implicit hierarchy: embedding. Every object carries with it the set of word occurrences it contains.
The module etcbc.trees constructs a hierarchy of words, subphrases, phrases, clauses and sentences based on the embedding relationship.
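The idea of deriving a hierarchy from embedding alone can be illustrated with a small sketch. This is not the code of etcbc.trees: the toy objects, the names `TYPE_RANK` and `embedding_parents`, and the rule "the parent is the smallest object of a higher type whose monad set contains the child" are assumptions made for the illustration.

```python
# Toy illustration of deriving a hierarchy from monad set embedding.
# Not the etcbc.trees implementation; all names are hypothetical.

TYPE_RANK = {'word': 0, 'phrase': 1, 'clause': 2, 'sentence': 3}

# Each object is (name, type, monad set)
objects = [
    ('S1', 'sentence', {1, 2, 3, 4}),
    ('C1', 'clause', {1, 2}),
    ('C2', 'clause', {3, 4}),
    ('P1', 'phrase', {1}),
    ('P2', 'phrase', {2}),
    ('P3', 'phrase', {3, 4}),
    ('w1', 'word', {1}),
    ('w2', 'word', {2}),
    ('w3', 'word', {3}),
    ('w4', 'word', {4}),
]

def embedding_parents(objs):
    """Map each object to its smallest embedder of a higher type (assumed rule)."""
    parent = {}
    for (name, otype, monads) in objs:
        candidates = [
            (len(m), TYPE_RANK[t], n)
            for (n, t, m) in objs
            if TYPE_RANK[t] > TYPE_RANK[otype] and monads <= m
        ]
        if candidates:
            # smallest monad set first, lowest type as tie breaker
            parent[name] = min(candidates)[2]
    return parent

parents = embedding_parents(objects)
```

Note that no object stores a pointer to its parent; the links are computed from the monad sets alone, which is the situation described above.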
But this is not all. The BHSA data contains a mother relationship, which in some cases denotes a parent relationship between objects. The module etcbc.trees reconstructs the tree obtained from the embedding by using the mother relationship as a set of instructions to move certain nodes below others.
In some cases extra nodes will be constructed as well.
The BHSA data is coded in such a way that every object is associated with a type and a monad set.
The type of an object, $T(O)$, determines which features an object has. ETCBC types are sentence, sentence_atom, clause, clause_atom, phrase, phrase_atom, subphrase, word.
There is an implicit ordering of object types, given by the sequence above read in reverse, so that word comes first and sentence comes last. We denote this ordering by $<$.
The monad set of an object, $m(O)$, is the set of word occurrences contained by that object. Every word occurrence in the source has a unique sequence number, so monad sets are sets of sequence numbers.
Note that when a sentence contains a clause which contains a phrase, the sentence, clause, and phrase contain words directly as monad sets. The fact that a sentence contains a clause is not marked directly, it is a consequence of the monad set embedding.
There is a natural order on monad sets, following the intuition that monad sets with smaller elements come before monad sets with bigger elements, and embedding monad sets come before embedded monad sets. Hence, if you enumerate a set of objects that happens to constitute a tree hierarchy based on monad set embedding, and you enumerate those objects in the monad set order, you will walk the tree in pre-order.
This order is a modification of the one as described in (Doedens 1994, 3.6.3).
For a lot of processing, it is handy to have the stack of embedding elements available when working with an element. That is the advantage of pre-order over post-order. It is very much like SAX parsing.
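The intuition behind this order can be sketched as a comparison function. The rule used below (a proper superset comes first; otherwise the set whose smallest non-shared element is smaller comes first) is an assumption that matches the informal description above; it is not taken from etcbc.trees, and the name `compare_monad_sets` is hypothetical.

```python
from functools import cmp_to_key

def compare_monad_sets(m1, m2):
    """Assumed ordering: embedders before embedded, smaller elements first."""
    if m1 == m2:
        return 0
    if m1 > m2:      # m1 is a proper superset: it embeds m2, so it comes first
        return -1
    if m1 < m2:      # m1 is embedded in m2
        return 1
    # neither embeds the other: both differences are non-empty
    return -1 if min(m1 - m2) < min(m2 - m1) else 1

monad_sets = [{3, 4}, {2}, {1, 2, 3, 4}, {1}, {1, 2}]
in_order = sorted(monad_sets, key=cmp_to_key(compare_monad_sets))
```

With the toy sets above, the sentence-like set `{1, 2, 3, 4}` comes first, followed by `{1, 2}`, `{1}`, `{2}` and `{3, 4}`, which is the pre-order walk of the corresponding embedding tree.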
Here is the formal definition of my order:
$m_1 < m_2$ if either of the following holds:
We will not base our trees on all object types, since in the BHSA data they do not constitute a single hierarchy.
We will restrict ourselves to the set $\cal O = \{$ sentence, clause, phrase, word $\}$.
Object type $T_1$ is directly below $T_2$ ( $T_1 <_1 T_2 $ ) in $\cal O$ if $T_1 < T_2$ and there is no $T$ in $\cal O$ with $T_1 < T < T_2$.
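As a small sketch of the "directly below" relation: given the full type ordering and a selection $\cal O$ of types, each selected type is paired with the next selected type above it. The names `TYPE_ORDER` and `directly_below` are hypothetical and only serve this illustration.

```python
# Full ordering of ETCBC object types, word first and sentence last.
TYPE_ORDER = [
    'word', 'subphrase', 'phrase_atom', 'phrase',
    'clause_atom', 'clause', 'sentence_atom', 'sentence',
]

def directly_below(selected):
    """Map each type in `selected` to the type directly above it within `selected`."""
    ranked = sorted(selected, key=TYPE_ORDER.index)
    # pair each type with its successor in the restricted ordering
    return dict(zip(ranked, ranked[1:]))

below = directly_below({'sentence', 'clause', 'phrase', 'word'})
```

Types that are left out of the selection, such as phrase_atom, simply disappear from the chain, so word ends up directly below phrase.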
Now we can introduce the notion of (tree) parent with respect to a set of object types $\cal O$ (e.g. ):
Object $A$ is a parent of object $B$ if the following are true:
While using the embedding got us trees, using the mother relationship will give us more interesting trees.
In general, the mother in the ETCBC points to an object on which the object in question is, in some sense, dependent.
The nature of this dependency is coded in a specific feature on clauses, the clause_constituent_relation.
For a list of values this feature can take and the associated meanings, see the notebook clause_phrase_types in this directory.
Here is a description of what we do with the mother relationship.
If a clause has a mother, there are three cases for the clause_constituent_relation of this clause:
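1. its clause_constituent_relation is in $\{$ Adju, Objc, Subj, PrAd, PreC, Cmpl, Attr, RgRc, Spec $\}$
2. its clause_constituent_relation is Coor
3. its clause_constituent_relation is in $\{$ Resu, ReVo, none $\}$

In case 3 we do nothing.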
In case 1 we remove the link of the clause to its parent and add the clause as a child to either the object that the mother points to, or to the parent of the mother. We do the latter only if the mother is a word. We will not add children to words.
In the diagrams, the red arrows represent the mother relationship, the black arrows the embedding relationships, and the fat black arrows the new parent relationships. The gray arrows indicate severed parent links.
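Case 1 can be sketched on plain dictionaries. This is a simplified illustration, not the code of etcbc.trees: the function name `reattach_to_mother` and the toy node names are hypothetical, and the sketch appends the clause at the end of its new sisters instead of placing it by monad set order.

```python
def reattach_to_mother(clause, mother, parent, children, otype):
    """Case 1: move `clause` below its mother, or below the mother's parent
    when the mother is a word (words do not get children)."""
    target = parent[mother] if otype[mother] == 'word' else mother
    old = parent.get(clause)
    if old is not None:
        children[old].remove(clause)   # sever the old parent link
    parent[clause] = target
    children.setdefault(target, []).append(clause)

# Toy embedding tree: sentence S contains clauses C1 and C2,
# C1 contains phrase P1, which contains word w1.
otype = {'S': 'sentence', 'C1': 'clause', 'C2': 'clause', 'P1': 'phrase', 'w1': 'word'}
parent = {'C1': 'S', 'C2': 'S', 'P1': 'C1', 'w1': 'P1'}
children = {'S': ['C1', 'C2'], 'C1': ['P1'], 'P1': ['w1']}

# Suppose C2 has the word w1 as its mother: it ends up below P1, the parent of w1.
reattach_to_mother('C2', 'w1', parent, children, otype)
```

After the call, C2 is no longer a child of S; it has become a sister of w1 under P1.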
In case 2 we create a node between the mother and its parent.
This node takes the name of the mother, and the mother will be added as child, but with name Ccoor, and the clause which points to the mother is added as a sister.
This is a rather complicated case, but the intuition is not that difficult. Consider the sentence:
John thinks that Mary said it and did it
We have a compound object sentence, with "Mary said it" and "did it" as coordinated components.
The way this has been marked up in the BHSA database is as follows:
"Mary said it", clause with clause_constituent_relation=Objc, mother="John thinks" (clause)
"and did it", clause with clause_constituent_relation=Coor, mother="Mary said it" (clause)
So the second coordinated clause is simply linked to the first coordinated clause. Restructuring means to create a parent for both coordinated clauses and treat both as sisters at the same hierarchical level. See the diagram.
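The restructuring for case 2 can be sketched with the example sentence. This is a simplification under stated assumptions: the sketch creates an explicit new node in plain dictionaries, whereas etcbc.trees may represent the same structure differently (for instance through its sister relationships). The function name `coordinate` and the node names are hypothetical.

```python
def coordinate(clause, mother, new_node, parent, children, label):
    """Case 2: insert `new_node` between `mother` and its parent,
    and make `mother` and `clause` sisters below `new_node`."""
    old = parent.get(clause)
    if old is not None:
        children[old].remove(clause)       # sever the old parent link
    grand = parent[mother]
    position = children[grand].index(mother)
    children[grand][position] = new_node   # new node takes the mother's place
    parent[new_node] = grand
    label[new_node] = label[mother]        # new node takes the mother's name
    label[mother] = 'Ccoor'                # mother is renamed
    children[new_node] = [mother, clause]
    parent[mother] = new_node
    parent[clause] = new_node

# State after case 1: "Mary said it" (said) is already below "John thinks" (thinks);
# "and did it" (did) still hangs directly below the sentence S.
parent = {'thinks': 'S', 'said': 'thinks', 'did': 'S'}
children = {'S': ['thinks', 'did'], 'thinks': ['said']}
label = {'S': 'S', 'thinks': 'C', 'said': 'Cobjc', 'did': 'Ccoor'}

coordinate('did', 'said', 'said+did', parent, children, label)
```

The new node `said+did` carries the label Cobjc and has the two coordinated clauses as children, both labelled Ccoor.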
When we add nodes to new parents, we let them occupy the sequential position among their new sisters that corresponds with the monad set ordering.
Sentences, clauses and phrases are not always continuous. Before restructuring it will not always be the case that if you walk the tree in pre-order, you will end up with the leaves (the words) in the same order as the original sentence. Restructuring generally improves that, because it often puts an object under a non-continuous parent object precisely at the location that corresponds with a gap in the parent.
However, there is no guarantee that every discontinuity will be resolved in this graceful manner. When we create the trees, we also output the list of monad numbers that you get when you walk the tree in pre-order. Whenever this list is not monotonic, there is an issue with the ordering.
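The monotonicity test can be sketched as follows. The helper names `preorder_leaves` and `is_monotonic` are hypothetical, and the toy trees use monad numbers directly as leaves; the real trees are produced by etcbc.trees.

```python
def preorder_leaves(node, children):
    """Collect the leaves below `node` in pre-order."""
    kids = children.get(node)
    if not kids:
        return [node]
    leaves = []
    for kid in kids:
        leaves.extend(preorder_leaves(kid, children))
    return leaves

def is_monotonic(monads):
    """True if the monad numbers are strictly increasing."""
    return all(a < b for (a, b) in zip(monads, monads[1:]))

# Before restructuring: clause C1 is discontinuous (monads 1 and 4),
# clause C2 (monads 2 and 3) fills its gap but hangs next to it.
before = {'S': ['C1', 'C2'], 'C1': [1, 4], 'C2': [2, 3]}
# After restructuring: C2 sits inside C1, exactly at the gap.
after = {'S': ['C1'], 'C1': [1, 'C2', 4], 'C2': [2, 3]}

leaves_before = preorder_leaves('S', before)
leaves_after = preorder_leaves('S', after)
```

In the toy example the walk yields 1, 4, 2, 3 before restructuring and 1, 2, 3, 4 afterwards, so only the first list signals an ordering issue.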
If a mother points to itself or a descendant of itself, we have a grave form of incest. In these cases, the restructuring algorithm will disconnect a parent link without introducing a new link to the tree above it: a whole fragment of the tree becomes disconnected and will get lost.
Sanity check 6 below reveals that this occurs in fact 4 times in the BHSA version 4 (it occurred 13 times in the BHSA 3 version). We will exclude these trees from further processing.
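A direct test for this situation could look like the sketch below. It is not part of etcbc.trees, which is why the notebook detects these cases indirectly through lost monads; the function name `mother_is_cyclic` and the toy nodes are hypothetical.

```python
def mother_is_cyclic(node, mother, children):
    """True if `mother` is `node` itself or one of its descendants."""
    stack = [node]
    while stack:
        current = stack.pop()
        if current == mother:
            return True
        stack.extend(children.get(current, []))
    return False

# Toy tree: clause C1 contains phrase P1 and clause C2; C2 contains phrase P2.
children = {'C1': ['P1', 'C2'], 'C2': ['P2']}

points_down = mother_is_cyclic('C1', 'P2', children)   # mother is a descendant of C1
points_self = mother_is_cyclic('C1', 'C1', children)   # mother is C1 itself
points_away = mother_is_cyclic('C2', 'P1', children)   # mother is outside C2
```

Moving C1 below P2 would cut C1 loose from its parent and hang it under its own descendant, which is the disconnected fragment described above.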
If a mother points outside the sentence of the clause on which it is specified we have a form of adultery. This should not happen. Mothers may point outside their sentences, but not in the cases that trigger restructuring. Yet, the sanity checks below reveal that this occurs twice. We will exclude these cases from further processing.
import sys
import collections
import random
%load_ext autoreload
%autoreload 2
import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
from etcbc.lib import Transcription, monad_set
from etcbc.trees import Tree
fabric = LafFabric()
tr = Transcription()
0.00s This is LAF-Fabric 4.8.3 API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html
The engines are fired up now, all ETCBC data we need is accessible through the fabric and tr objects.
Next we select the information we want and load it into memory.
API = fabric.load('etcbc4', '--', 'trees', {
"xmlids": {"node": False, "edge": False},
"features": ('''
oid otype monads
g_cons_utf8
sp
rela typ
label
''','''
mother
'''),
"prepare": prepare,
}, verbose='NORMAL')
exec(fabric.localnames.format(var='fabric'))
0.00s LOADING API: please wait ... 0.00s USING main: etcbc4 DATA COMPILED AT: 2014-07-23T09-31-37 4.53s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4/trees/__log__trees.txt 4.53s INFO: LOADING PREPARED data: please wait ... 4.53s prep prep: G.node_sort 4.53s PREPARING prep: G.node_sort 4.53s LOADING API with EXTRAs: please wait ... 4.53s USING main: etcbc4 DATA COMPILED AT: 2014-07-23T09-31-37 6.11s NORMAL: DATA LOADED FROM SOURCE etcbc4 AND ANNOX FOR TASK trees AT 2016-11-28T10-46-41 6.11s SORTING nodes ... 53s WRITING prep: G.node_sort 53s prep prep: G.node_sort_inv 53s PREPARING prep: G.node_sort_inv 53s SORTING nodes (inv) ... 53s WRITING prep: G.node_sort_inv 54s prep prep: L.node_up 54s PREPARING prep: L.node_up 54s LOADING API with EXTRAs: please wait ... 54s USING main: etcbc4 DATA COMPILED AT: 2014-07-23T09-31-37 54s NORMAL: DATA LOADED FROM SOURCE etcbc4 AND ANNOX FOR TASK trees AT 2016-11-28T10-47-29 54s Objects contained in books 1m 03s Objects contained in chapters 1m 13s Objects contained in verses 1m 26s Objects contained in half_verses 1m 36s Objects contained in sentences 1m 47s Objects contained in sentence_atoms 1m 58s Objects contained in clauses 2m 05s Objects contained in clause_atoms 2m 12s Objects contained in phrases 2m 21s Objects contained in phrase_atoms 2m 28s Objects contained in subphrases 2m 31s Objects contained in words 2m 35s WRITING prep: L.node_up 2m 38s prep prep: L.node_down 2m 38s PREPARING prep: L.node_down 2m 38s WRITING prep: L.node_down 2m 42s prep prep: V.verses 2m 42s PREPARING prep: V.verses 2m 42s LOADING API with EXTRAs: please wait ... 2m 42s USING main: etcbc4 DATA COMPILED AT: 2014-07-23T09-31-37 2m 42s NORMAL: DATA LOADED FROM SOURCE etcbc4 AND ANNOX FOR TASK trees AT 2016-11-28T10-49-17 2m 42s Making verse index 2m 44s Done. 23213 verses 2m 44s WRITING prep: V.verses 2m 44s prep prep: V.books_la 2m 44s PREPARING prep: V.books_la 2m 44s Listing books 2m 45s Done. 
39 books 2m 45s WRITING prep: V.books_la 2m 45s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html 2m 48s INFO: LOADED PREPARED data 2m 48s INFO: DATA LOADED FROM SOURCE etcbc4 AND ANNOX lexicon FOR TASK trees AT 2016-11-28T10-49-23
# count how often each value of the clause constituent relation occurs
show_ccr = collections.defaultdict(lambda: 0)
for n in NN():
    otype = F.otype.v(n)
    if otype == 'clause':
        show_ccr[F.rela.v(n)] += 1
for c in sorted(show_ccr):
    print("{:<8}: {}x".format(c, show_ccr[c]))
Adju    : 5872x
Attr    : 5930x
Cmpl    : 241x
CoVo    : 305x
Coor    : 2858x
NA      : 69400x
Objc    : 1347x
PrAd    : 1x
PreC    : 156x
Resu    : 1193x
RgRc    : 198x
Spec    : 41x
Subj    : 436x
Before we can apply tree construction, we have to specify the objects and features we work with, since the tree constructing algorithm can also work on slightly different data, with other object types and different feature names.
We also construct an index that tells to which verse each sentence belongs.
type_info = (
("word", ''),
("subphrase", 'U'),
("phrase", 'P'),
("clause", 'C'),
("sentence", 'S'),
)
type_table = dict(t for t in type_info)
type_order = [t[0] for t in type_info]
pos_table = {
'adjv': 'aj',
'advb': 'av',
'art': 'dt',
'conj': 'cj',
'intj': 'ij',
'inrg': 'ir',
'nega': 'ng',
'subs': 'n',
'nmpr': 'n-pr',
'prep': 'pp',
'prps': 'pr-ps',
'prde': 'pr-dem',
'prin': 'pr-int',
'verb': 'vb',
}
ccr_info = {
'Adju': ('r', 'Cadju'),
'Attr': ('r', 'Cattr'),
'Cmpl': ('r', 'Ccmpl'),
'CoVo': ('n', 'Ccovo'),
'Coor': ('x', 'Ccoor'),
'Objc': ('r', 'Cobjc'),
'PrAd': ('r', 'Cprad'),
'PreC': ('r', 'Cprec'),
'Resu': ('n', 'Cresu'),
'RgRc': ('r', 'Crgrc'),
'Spec': ('r', 'Cspec'),
'Subj': ('r', 'Csubj'),
'NA': ('n', 'C'),
}
tree_types = ('sentence', 'clause', 'phrase', 'subphrase', 'word')
(root_type, leaf_type, clause_type) = (tree_types[0], tree_types[-1], 'clause')
ccr_table = dict((c[0],c[1][1]) for c in ccr_info.items())
ccr_class = dict((c[0],c[1][0]) for c in ccr_info.items())
# index: root node (sentence) => label of the verse it belongs to
root_verse = {}
for n in NN():
    otype = F.otype.v(n)
    if otype == 'verse': cur_verse = F.label.v(n)
    elif otype == root_type: root_verse[n] = cur_verse
Now we can actually construct the tree by initializing a tree object.
After that we call its restructure_clauses() method.
Then we have two tree structures for each sentence:
We have several tree relationships at our disposal:
Coor (case 2) to its daughter clauses
where elder_sister and sisters only occur in the rtree.
This will take a while (25 seconds approx on a MacBook Air 2012).
tree = Tree(API, otypes=tree_types,
clause_type=clause_type,
ccr_feature='rela',
pt_feature='typ',
pos_feature='sp',
mother_feature = 'mother',
)
tree.restructure_clauses(ccr_class)
results = tree.relations()
parent = results['rparent']
sisters = results['sisters']
children = results['rchildren']
elder_sister = results['elder_sister']
msg("Ready for processing")
0.00s LOADING API with EXTRAs: please wait ... 0.00s INFO: USING DATA COMPILED AT: 2014-07-23T09-31-37 1.48s INFO: DATA LOADED FROM SOURCE etcbc4 AND ANNOX -- FOR TASK trees AT 2015-03-26T06-42-46 0.00s Start computing parent and children relations for objects of type sentence, clause, phrase, subphrase, word 1.50s 100000 nodes 3.03s 200000 nodes 4.47s 300000 nodes 6.01s 400000 nodes 7.45s 500000 nodes 8.98s 600000 nodes 11s 700000 nodes 12s 800000 nodes 13s 900000 nodes 14s 947471 nodes: 881423 have parents and 520916 have children 14s Restructuring clauses: deep copying tree relations 21s Pass 0: Storing mother relationship 22s 18580 clauses have a mother 22s All clauses have mothers of types in {'word', 'phrase', 'subphrase', 'sentence', 'clause'} 22s Pass 1: all clauses except those of type Coor 23s Pass 2: clauses of type Coor only 24s Mothers applied. Found 0 motherless clauses. 24s 2497 nodes have 1 sisters 24s 167 nodes have 2 sisters 24s 9 nodes have 3 sisters 24s There are 2858 sisters, 2673 nodes have sisters. 24s Ready for processing
Let us see whether the trees we have constructed satisfy some sanity constraints. After all, the algorithm is based on certain assumptions about the data, but are those assumptions valid? And restructuring is a tricky operation, do we have confidence that nothing went wrong?
clause_constituent_relation has different values than the ones that should trigger restructuring.
#1
msg("Counting {}s ... (expecting 66045 (in etcbc3: 71354))".format(root_type))
msg("There are {} {}s".format(len(set(NN(test=F.otype.v, value=root_type))), root_type))
1m 26s Counting sentences ... (expecting 66045 (in etcbc3: 71354)) 1m 27s There are 66045 sentences
#2
msg("Checking parents of {}s ... (expecting none)".format(root_type))
exceptions = set()
for node in NN(test=F.otype.v, value=root_type):
    if node in parent: exceptions.add(node)
if len(exceptions) == 0:
    msg("No {} has a parent".format(root_type))
else:
    msg("{} {}s have a parent:".format(len(exceptions), root_type))
    for n in sorted(exceptions):
        p = parent[n]
        msg("{} {} [{}] has {} parent {} [{}]".format(
            root_type, n, F.monads.v(n),
            F.otype.v(p), p, F.monads.v(p)
        ))
1m 30s Checking parents of sentences ... (expecting none) 1m 31s No sentence has a parent
#3 (again a check on #1)
msg("Checking the types of root nodes ... (should all be sentences)")
exceptions = collections.defaultdict(lambda: [])
sn = 0
for node in NN():
    otype = F.otype.v(node)
    if otype not in type_table: continue
    if otype == root_type: sn += 1
    if node not in parent and node not in elder_sister and otype != root_type:
        exceptions[otype].append(node)
if len(exceptions) == 0:
    msg("All top nodes are {}s".format(root_type))
else:
    msg("Top nodes which are not {}s:".format(root_type))
    for t in sorted(exceptions):
        msg("{}: {}x".format(t, len(exceptions[t])), withtime=False)
msg("{} {}s seen".format(sn, root_type))
# show where the offending clauses (if any) are located
for c in exceptions[clause_type]:
    (s, st) = tree.get_root(c, 'e')
    v = root_verse[s]
    msg("{}={}, {}={}={}, verse={}".format(clause_type, c, root_type, st, s, v), withtime=False)
1m 37s Checking the types of root nodes ... (should all be sentences) 1m 39s Top nodes which are not sentences: subphrase: 3x 1m 39s 66045 sentences seen
#4, 5
def get_top(kind, rel, rela, multi):
    # follow the relations rel and rela from all nodes of type kind
    # and report the types of the nodes where the walk ends
    seen = set()
    top_nodes = set()
    start_nodes = set(NN(test=F.otype.v, value=kind))
    next_nodes = start_nodes
    msg("Starting from {} nodes ...".format(kind))
    while len(next_nodes):
        new_next_nodes = set()
        for node in next_nodes:
            if node in seen: continue
            seen.add(node)
            is_top = True
            if node in rel:
                is_top = False
                if multi:
                    for c in rel[node]: new_next_nodes.add(c)
                else:
                    new_next_nodes.add(rel[node])
            if node in rela:
                is_top = False
                if multi:
                    for c in rela[node]: new_next_nodes.add(c)
                else:
                    new_next_nodes.add(rela[node])
            if is_top: top_nodes.add(node)
        next_nodes = new_next_nodes
    top_types = collections.defaultdict(lambda: 0)
    for t in top_nodes:
        top_types[F.otype.v(t)] += 1
    for t in top_types:
        msg("From {} {} nodes reached {} {} nodes".format(len(start_nodes), kind, top_types[t], t), withtime=False)

msg("Embedding trees")
get_top(leaf_type, tree.eparent, {}, False)
get_top(root_type, tree.echildren, {}, True)
msg("Restructd trees")
get_top(leaf_type, tree.rparent, tree.elder_sister, False)
get_top(root_type, tree.rchildren, tree.sisters, True)
msg("Done")
1m 47s Embedding trees 1m 48s Starting from word nodes ... From 426555 word nodes reached 66045 sentence nodes From 426555 word nodes reached 3 subphrase nodes 1m 50s Starting from sentence nodes ... From 66045 sentence nodes reached 426552 word nodes 1m 51s Restructd trees 1m 52s Starting from word nodes ... From 426555 word nodes reached 66045 sentence nodes From 426555 word nodes reached 3 subphrase nodes 1m 54s Starting from sentence nodes ... From 66045 sentence nodes reached 425602 word nodes 1m 56s Done
#6
msg("Verifying whether all monads are preserved under restructuring")
errors = []
#i = 10
for snode in NN(test=F.otype.v, value=root_type):
    declared_monads = monad_set(F.monads.v(snode))
    results = {}
    thisgood = {}
    for kind in ('e', 'r'):
        results[kind] = set(int(F.monads.v(l)) for l in tree.get_leaves(snode, kind) if F.otype.v(l) == leaf_type)
        thisgood[kind] = declared_monads == results[kind]
        #if not thisgood[kind]:
            #print("{} D={}\nL={}".format(kind, declared_monads, results[kind]))
            #i -= 1
    #if i == 0: break
    if False in thisgood.values(): errors.append((snode, thisgood['e'], thisgood['r']))
msg("{} mismatches:".format(len(errors)))
mine = min(20, len(errors))
# the sentences collected here are skipped when the trees are written
skip = {e[0] for e in errors}
for (s, e, r) in errors[0:mine]:
    msg("{} embedding: {}; restructd: {}".format(s, 'OK' if e else 'XX', 'OK' if r else 'XX'), withtime=False)
1m 59s Verifying whether all monads are preserved under restructuring 2m 07s 3 mismatches: 1176165 embedding: XX; restructd: XX 1176166 embedding: XX; restructd: XX 1188063 embedding: XX; restructd: XX
#7
msg("Which types embed which types and how often? ...")
for kind in ('e', 'r'):
    plinked_types = collections.defaultdict(lambda: 0)
    parent = tree.eparent if kind == 'e' else tree.rparent
    kindrep = 'embedding' if kind == 'e' else 'restructd'
    for (c, p) in parent.items():
        plinked_types[(F.otype.v(c), F.otype.v(p))] += 1
    msg("Found {} parent ({}) links between types".format(len(parent), kindrep))
    for lt in sorted(plinked_types):
        msg("{}: {}x".format(lt, plinked_types[lt]), withtime=False)
2m 13s Which types embed which types and how often? ... 2m 15s Found 881423 parent (embedding) links between types ('clause', 'sentence'): 87978x ('phrase', 'clause'): 254662x ('phrase', 'subphrase'): 2x ('subphrase', 'phrase'): 83122x ('subphrase', 'subphrase'): 29104x ('word', 'phrase'): 303715x ('word', 'subphrase'): 122840x 2m 17s Found 878565 parent (restructd) links between types ('clause', 'clause'): 8117x ('clause', 'phrase'): 5395x ('clause', 'sentence'): 70898x ('clause', 'subphrase'): 710x ('phrase', 'clause'): 254662x ('phrase', 'subphrase'): 2x ('subphrase', 'phrase'): 83122x ('subphrase', 'subphrase'): 29104x ('word', 'phrase'): 303715x ('word', 'subphrase'): 122840x
#8
msg("How many mothers can nodes have? ...")
mother_len = {}
for c in NN():
    lms = list(C.mother.v(c))
    nms = len(lms)
    if nms: mother_len[c] = nms
count = collections.defaultdict(lambda: 0)
for c in tree.mother: count[mother_len[c]] += 1
msg("There are {} tree nodes with a mother".format(len(tree.mother)))
for cnt in sorted(count):
    msg("{} nodes have {} mother{}".format(count[cnt], cnt, 's' if cnt != 1 else ''), withtime=False)
2m 21s How many mothers can nodes have? ... 2m 27s There are 18580 tree nodes with a mother 18580 nodes have 1 mother
#9
msg("Which types have mother links to which types and how often? ...")
mlinked_types = collections.defaultdict(lambda: set())
for (c, m) in tree.mother.items():
    ctype = F.otype.v(c)
    mlinked_types[(ctype, F.rela.v(c), F.otype.v(m))].add(c)
# NB: `parent` is still tree.rparent from check #7, so the number reported here
# is the number of restructured parent links; len(tree.mother) is the number of mother links
msg("Found {} mother links between types".format(len(parent)))
for lt in sorted(mlinked_types):
    msg("{}: {}x".format(lt, len(mlinked_types[lt])), withtime=False)
2m 33s Which types have mother links to which types and how often? ... 2m 33s Found 878565 mother links between types ('clause', 'Adju', 'clause'): 5872x ('clause', 'Attr', 'clause'): 45x ('clause', 'Attr', 'phrase'): 5128x ('clause', 'Attr', 'word'): 757x ('clause', 'Cmpl', 'clause'): 241x ('clause', 'CoVo', 'clause'): 305x ('clause', 'Coor', 'clause'): 2845x ('clause', 'Coor', 'phrase'): 13x ('clause', 'NA', 'clause'): 2x ('clause', 'Objc', 'clause'): 1347x ('clause', 'PrAd', 'clause'): 1x ('clause', 'PreC', 'clause'): 156x ('clause', 'Resu', 'clause'): 1193x ('clause', 'RgRc', 'clause'): 4x ('clause', 'RgRc', 'phrase'): 1x ('clause', 'RgRc', 'word'): 193x ('clause', 'Spec', 'clause'): 15x ('clause', 'Spec', 'phrase'): 25x ('clause', 'Spec', 'word'): 1x ('clause', 'Subj', 'clause'): 436x
#10
msg("Counting {}s with mothers in another {}".format(clause_type, root_type))
exceptions = set()
for node in tree.mother:
    if F.otype.v(node) not in type_table: continue
    mnode = tree.mother[node]
    snode = tree.get_root(node, 'e')
    smnode = tree.get_root(mnode, 'e')
    if snode != smnode:
        exceptions.add((node, snode, smnode))
msg("{} nodes have a mother in another {}".format(len(exceptions), root_type))
for (n, sn, smn) in exceptions:
    msg("[{} {}]({}) occurs in {} but has mother in {}".format(F.otype.v(n), F.monads.v(n), n, sn, smn), withtime=False)
2m 42s Counting clauses with mothers in another sentence 2m 42s 0 nodes have a mother in another sentence
#11
msg("Computing lengths and depths")
ntrees = 0
rntrees = 0
total_depth = {'e': 0, 'r': 0}
rtotal_depth = {'e': 0, 'r': 0}
max_depth = {'e': 0, 'r':0}
rmax_depth = {'e': 0, 'r': 0}
total_length = 0
for node in NN(test=F.otype.v, value=root_type):
    ntrees += 1
    total_length += tree.length(node)
    this_depth = {}
    for kind in ('e', 'r'):
        this_depth[kind] = tree.depth(node, kind)
    different = this_depth['e'] != this_depth['r']
    if different: rntrees += 1
    for kind in ('e', 'r'):
        if this_depth[kind] > max_depth[kind]: max_depth[kind] = this_depth[kind]
        total_depth[kind] += this_depth[kind]
        if different:
            if this_depth[kind] > rmax_depth[kind]: rmax_depth[kind] = this_depth[kind]
            rtotal_depth[kind] += this_depth[kind]
msg("{} trees seen, of which in {} cases restructuring makes a difference in depth".format(ntrees, rntrees))
if ntrees > 0:
    msg("Embedding trees: max depth = {:>2}, average depth = {:.2g}".format(max_depth['e'], total_depth['e'] / ntrees))
    msg("Restructd trees: max depth = {:>2}, average depth = {:.2g}".format(max_depth['r'], total_depth['r'] / ntrees))
if rntrees > 0:
    msg("Statistics for cases where restructuring makes a difference:")
    msg("Embedding trees: max depth = {:>2}, average depth = {:.2g}".format(rmax_depth['e'], rtotal_depth['e'] / rntrees))
    msg("Restructd trees: max depth = {:>2}, average depth = {:.2g}".format(rmax_depth['r'], rtotal_depth['r'] / rntrees))
msg("Total number of leaves in the trees: {}, average number of leaves = {:.2g}".format(total_length, total_length / ntrees))
2m 47s Computing lengths and depths 2m 52s 66045 trees seen, of which in 9655 cases restructuring makes a difference in depth 2m 52s Embedding trees: max depth = 13, average depth = 3.6 2m 52s Restructd trees: max depth = 19, average depth = 3.8 2m 52s Statistics for cases where restructuring makes a difference: 2m 52s Embedding trees: max depth = 13, average depth = 3.7 2m 52s Restructd trees: max depth = 19, average depth = 5.3 2m 52s Total number of leaves in the trees: 426555, average number of leaves = 6.5
After all these checks we can proceed to print out the tree structures as plain, bracketed text strings.
Per tree we also print a string of the monad numbers that you get when you walk the tree in pre-order. And we produce object ids from the EMDROS database and node ids from the LAF version.
First we apply our algorithms to a limited set of interesting trees and a random sample.
For those cases we also apply a debug_write_tree() method that outputs considerably more information.
This output has been visually checked by Constantijn Sikkel and Dirk Roorda.
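The bracketed notation itself is easy to sketch. The function below is a toy stand-in for what tree.write_tree produces, not its implementation: the name `bracketed`, the dictionaries and the English placeholder words are assumptions for illustration.

```python
def bracketed(node, children, tag, text):
    """Write the tree below `node` as a bracketed string."""
    kids = children.get(node)
    if not kids:
        # a leaf: tag followed by the quoted word
        return '({} "{}")'.format(tag[node], text[node])
    inner = ''.join(bracketed(kid, children, tag, text) for kid in kids)
    return '({}{})'.format(tag[node], inner)

# Toy sentence with one clause, a verb phrase and a noun phrase.
children = {'s': ['c'], 'c': ['vp', 'np'], 'vp': ['w1'], 'np': ['w2']}
tag = {'s': 'S', 'c': 'C', 'vp': 'VP', 'np': 'NP', 'w1': 'vb', 'w2': 'n'}
text = {'w1': 'be', 'w2': 'light'}

tree_string = bracketed('s', children, tag, text)
```

For the toy data this yields `(S(C(VP(vb "be"))(NP(n "light"))))`, which has the same shape as the strings written to trees.txt.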
def get_tag(node):
    # compute the label of a node, plus part of speech, monad and text for words
    otype = F.otype.v(node)
    tag = type_table[otype]
    if tag == 'P': tag = F.typ.v(node)
    elif tag == 'C': tag = ccr_table[F.rela.v(node)]
    is_word = tag == ''
    pos = pos_table[F.sp.v(node)] if is_word else None
    monad = int(F.monads.v(node)) if is_word else None
    text = '"{}"'.format(F.g_cons_utf8.v(node)) if is_word else None
    return (tag, pos, monad, text, is_word)

def passage_roots(verse_label):
    # collect the root nodes (sentences) that occur in the verse with this label
    sought = []
    grab = -1
    for n in NN():
        if grab == 1: continue
        otype = F.otype.v(n)
        if otype == 'verse':
            check = F.label.v(n) == verse_label
            if check: grab = 0
            elif grab == 0: grab = 1
        if grab == 0 and otype == root_type: sought.append(n)
    return sought

def showcases(cases, ofile):
    # write both trees (embedding and restructured) of each case to a file
    out = outfile(ofile)
    for snode in cases:
        out.write("\n====================\n{}\n{}\n{} bhs_id={} laf_node={}:\n".format(
            root_verse[snode], cases[snode], root_type, F.oid.v(snode), snode,
        ))
        for kind in ('e', 'r'):
            out.write("\nTree based on monad embedding {}\n\n".format(
                "only" if kind == 'e' else " and mother+clause_constituent relation"
            ))
            (tree_rep, words_rep, bmonad) = tree.write_tree(snode, kind, get_tag, rev=False, leafnumbers=False)
            out.write("{}\n\n{}\n".format(words_rep, tree_rep))
            out.write("\nDepth={}\n".format(tree.depth(snode, kind)))
            out.write(tree.debug_write_tree(snode, kind, legenda=kind=='r'))
    out.close()
# below holds for etcbc3, in etcbc4 we have less problem cases
problem_desc = collections.OrderedDict((
(1131739, "debug reorder"),
(1131712, "interesting"),
(1131701, "interesting"),
(1140469, "subject clause order"),
(passage_roots(' GEN 01,16')[1], "interesting"),
(1164864, "interesting"),
(1143081, "cyclic mothers"),
(1153973, "cyclic mothers"),
(1158971, "cyclic mothers"),
(1158971, "cyclic mothers"),
(1160416, "cyclic mothers"),
(1160464, "cyclic mothers"),
(1161141, "nested cyclic mothers: C.coor => C.attr => P below first C.coor"),
(1163666, "cyclic mothers"),
(1164830, "cyclic mothers"),
(1167680, "cyclic mothers"),
(1170057, "cyclic mothers"),
(1193065, "cyclic mothers"),
(1199681, "cyclic mothers"),
(1199682, "mother points outside sentence"),
))
fixed_sample = (
1167680,
1167152,
1145250,
1154339,
1136677,
1166385,
1198984,
1152969,
1153930,
1150648,
1168396,
1151917,
1164750,
1156719,
1148048,
1138673,
1134184,
1156789,
1156600,
1140469,
)
sample_size = 20
sample = {}
fsample = collections.OrderedDict()
mother_keys = list(sorted(tree.mother))
# draw random sentences that contain a clause with a mother
for s in range(sample_size):
    r = random.randint(0, len(mother_keys) - 1)
    snode = tree.get_root(tree.mother[mother_keys[r]], 'e')[0]
    sample[snode] = 'random sample in {}s with {}s with mothers'.format(root_type, clause_type)
for snode in fixed_sample:
    fsample[snode] = 'random sample in {}s with {}s with mothers'.format(root_type, clause_type)

#showcases(problem_desc, "tree_notabene.txt")
showcases(sample, 'trees_random_{}.txt'.format(sample_size))
#showcases(fsample, 'trees_fixed_{}.txt'.format(len(fsample)))
Finally, here is the production of the whole set of trees.
msg("Writing {} trees".format(root_type))
trees = outfile("trees.txt")
verse_label = ''
s = 0
chunk = 10000
sc = 0
for node in NN():
    # sentences that failed sanity check 6 are left out
    if node in skip: continue
    otype = F.otype.v(node)
    oid = F.oid.v(node)
    if otype == 'verse':
        verse_label = F.label.v(node)
        continue
    if otype != root_type: continue
    (tree_rep, words_rep, bmonad) = tree.write_tree(node, 'r', get_tag, rev=False, leafnumbers=False)
    trees.write("\n#{}\tnode={}\toid={}\tbmonad={}\t{}\n{}\n".format(
        verse_label, node, oid, bmonad, words_rep, tree_rep,
    ))
    s += 1
    sc += 1
    if sc == chunk:
        msg("{} trees written".format(s))
        sc = 0
trees.close()
msg("{} trees written".format(s))
2m 59s Writing sentence trees 3m 07s 10000 trees written 3m 15s 20000 trees written 3m 23s 30000 trees written 3m 31s 40000 trees written 3m 38s 50000 trees written 3m 44s 60000 trees written 3m 49s 66042 trees written
close()
28m 19s Results directory: /Users/dirk/laf-fabric-output/etcbc4/trees __log__trees.txt 4903 Mon Sep 29 17:18:53 2014 anomalies.txt 0 Tue Jul 15 14:40:09 2014 coor.txt 157892 Tue Jul 15 17:45:28 2014 depths.txt 661357 Tue Jul 15 17:45:26 2014 tgrep_result.txt 4619456 Tue Jul 15 17:45:14 2014 tree_notabene.txt 52483 Tue Jul 15 17:28:29 2014 trees.t2c 12099621 Tue Jul 15 17:31:12 2014 trees.txt 11737304 Mon Sep 29 16:54:23 2014 trees_fixed_20.txt 14133 Tue Jul 15 17:28:29 2014 trees_random_20.txt 105541 Mon Sep 29 16:53:31 2014
Here are the first lines of the output.
!head -n 25 {my_file('trees.txt')}
# GEN 01,01 node=1127306 oid=11 bmonad=1 0 1 2 3 4 5 6 7 8 9 10
(S(C(PP(pp "ב")(n "ראשׁית"))(VP(vb "ברא"))(NP(n "אלהים"))(PP(U(pp "את")(dt "ה")(n "שׁמים"))(cj "ו")(U(pp "את")(dt "ה")(n "ארץ")))))

# GEN 01,02 node=1127307 oid=39 bmonad=12 0 1 2 3 4 5 6
(S(C(CP(cj "ו"))(NP(dt "ה")(n "ארץ"))(VP(vb "היתה"))(NP(U(n "תהו"))(cj "ו")(U(n "בהו")))))

# GEN 01,02 node=1127308 oid=60 bmonad=19 0 1 2 3 4
(S(C(CP(cj "ו"))(NP(n "חשׁך"))(PP(pp "על")(U(n "פני"))(U(n "תהום")))))

# GEN 01,02 node=1127309 oid=78 bmonad=24 0 1 2 3 4 5 6 7
(S(C(CP(cj "ו"))(NP(U(n "רוח"))(U(n "אלהים")))(VP(vb "מרחפת"))(PP(pp "על")(U(n "פני"))(U(dt "ה")(n "מים")))))

# GEN 01,03 node=1127310 oid=104 bmonad=32 0 1 2
(S(C(CP(cj "ו"))(VP(vb "יאמר"))(NP(n "אלהים"))))

# GEN 01,03 node=1127311 oid=117 bmonad=35 0 1
(S(C(VP(vb "יהי"))(NP(n "אור"))))

# GEN 01,03 node=1127312 oid=128 bmonad=37 0 1 2
(S(C(CP(cj "ו"))(VP(vb "יהי"))(NP(n "אור"))))

# GEN 01,04 node=1127313 oid=143 bmonad=40 0 1 2 3 4 5 6 7
(S(C(CP(cj "ו"))(VP(vb "ירא"))(NP(n "אלהים"))(PP(pp "את")(dt "ה")(n "אור"))(Cobjc(CP(cj "כי"))(VP(vb "טוב")))))