
Trees - for BHSA data (Hebrew)¶

This notebook composes syntax trees out of the BHSA data. To this end, it imports the module etcbc.trees, which contains the logic to synthesize trees from the data as it lies encoded in an EMDROS database and has been translated to LAF.

etcbc.trees can also work for other data, such as the CALAP database of parts of the Syriac Peshitta.

This notebook invokes the functions of etcbc.trees to generate trees for every sentence in the Hebrew Bible. After that it performs a dozen sanity checks on the trees. There are 13 sentences that do not pass some of these tests (out of 71354).

They will be excluded from further processing, since they have probably been coded incorrectly in the first place.

BHSA data and syntax trees¶

The process of tree construction is not straightforward, since the BHSA data have not been coded as syntax trees. Rather they take the shape of a collection of features that describe observable characteristics of the words, phrases, clauses and sentences. Moreover, if a phrase, clause or sentence is discontinuous, it is divided into phrase_atoms, clause_atoms, or sentence_atoms, respectively, which are by definition continuous.

There are no explicit hierarchical relationships between these objects. But there is an implicit hierarchy: embedding. Every object carries with it the set of word occurrences it contains.

The module etcbc.trees constructs a hierarchy of words, subphrases, phrases, clauses and sentences based on the embedding relationship.

But this is not all. The BHSA data contains a mother relationship, which in some cases denotes a parent relationship between objects. The module etcbc.trees reconstructs the tree obtained from the embedding by using the mother relationship as a set of instructions to move certain nodes below others. In some cases extra nodes will be constructed as well.

The embedding relationship¶

Objects:¶

The BHSA data is coded in such a way that every object is associated with a type and a monad set.

The type of an object, $T(O)$, determines which features an object has. ETCBC types are sentence, sentence_atom, clause, clause_atom, phrase, phrase_atom, subphrase, word.

There is an implicit ordering of object types, given by the sequence above read in reverse, so that word comes first and sentence comes last. We denote this ordering by $<$.

The monad set of an object, $m(O)$, is the set of word occurrences contained by that object. Every word occurrence in the source has a unique sequence number, so monad sets are sets of sequence numbers.

Note that when a sentence contains a clause which contains a phrase, the sentence, clause, and phrase contain words directly as monad sets. The fact that a sentence contains a clause is not marked directly; it is a consequence of the monad set embedding.

Definition (monad set order):¶

There is a natural order on monad sets, following the intuition that monad sets with smaller elements come before monad sets with bigger elements, and embedding monad sets come before embedded monad sets. Hence, if you enumerate a set of objects that happens to constitute a tree hierarchy based on monad set embedding, and you enumerate those objects in the monad set order, you will walk the tree in pre-order.

This order is a modification of the one as described in (Doedens 1994, 3.6.3).

> Doedens, Crist-Jan (1994), *Text Databases. One Database Model and Several Retrieval Languages*, number 14 in Language and Computers, Editions Rodopi, Amsterdam, Netherlands and Atlanta, USA. ISBN: 90-5183-729-1, http://books.google.nl/books?id=9ggOBRz1dO4C. The order as defined by Doedens corresponds to walking trees in post-order.

For a lot of processing, it is handy to have the stack of embedding elements available when working with an element. That is the advantage of pre-order over post-order. It is very much like SAX parsing.

Here is the formal definition of my order:

$m_1 < m_2$ if either of the following holds:

$m_2 \subset m_1$ (note that the embedder comes first!)
$m_1 \not\subset m_2 \wedge m_2 \not\subset m_1 \wedge \min(m_1 \setminus m_2) < \min(m_2 \setminus m_1)$
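The two conditions above can be sketched as a comparison function. This is only an illustration on hypothetical toy objects (the names `monad_set_cmp` and `toy` are not part of etcbc.trees); it shows that sorting by this order yields a pre-order walk when the objects form a hierarchy.

```python
from functools import cmp_to_key

def monad_set_cmp(m1, m2):
    """Compare two monad sets: negative if m1 comes first, positive if m2 does."""
    if m1 == m2:
        return 0
    if m2 < m1:   # m2 is a proper subset of m1: the embedder m1 comes first
        return -1
    if m1 < m2:   # m1 is embedded in m2: m2 comes first
        return 1
    # Neither embeds the other: compare the smallest monads they do not share.
    return -1 if min(m1 - m2) < min(m2 - m1) else 1

# Hypothetical toy objects: one sentence, two clauses, four phrases.
toy = {
    "phrase_4": frozenset({5}),
    "clause_b": frozenset({4, 5}),
    "phrase_2": frozenset({3}),
    "sentence": frozenset({1, 2, 3, 4, 5}),
    "phrase_3": frozenset({4}),
    "phrase_1": frozenset({1, 2}),
    "clause_a": frozenset({1, 2, 3}),
}

ordered = sorted(toy, key=cmp_to_key(lambda a, b: monad_set_cmp(toy[a], toy[b])))
print(ordered)
```

Each embedder is listed before the objects it embeds, and siblings appear in text order.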

We will not base our trees on all object types, since in the BHSA data they do not constitute a single hierarchy. We will restrict ourselves to the set $\cal O = \{$ sentence, clause, phrase, word $\}$.

Definition (directly below):¶

Object type $T_1$ is directly below $T_2$ ( $T_1 <_1 T_2 $ ) in $\cal O$ if $T_1 < T_2$ and there is no $T$ in $\cal O$ with $T_1 < T < T_2$.

Now we can introduce the notion of (tree) parent with respect to a set of object types $\cal O$:

Definition (parent)¶

Object $A$ is a parent of object $B$ if the following are true:

$m(B) \subseteq\ m(A)$
$T(B) <_1 T(A)$ in $\cal O$.
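A minimal sketch of the parent notion, taking the parent to be the object that embeds the child and whose type is directly above the child's type in $\cal O$. The object names, the `TYPE_ORDER` tuple and the helper functions are hypothetical and only serve the illustration.

```python
# The set O of object types, lowest type first (assumption for this sketch).
TYPE_ORDER = ("word", "phrase", "clause", "sentence")

def directly_below(t1, t2, order=TYPE_ORDER):
    """True if type t1 is directly below type t2 in the given type order."""
    return order.index(t1) + 1 == order.index(t2)

def parents(child, objects):
    """Return all objects that embed `child` and have the next higher type."""
    ctype, cmonads = objects[child]
    return [
        name
        for name, (ptype, pmonads) in objects.items()
        if directly_below(ctype, ptype) and cmonads <= pmonads
    ]

# Hypothetical toy data: name -> (type, monad set).
objects = {
    "S":  ("sentence", {1, 2, 3, 4}),
    "C1": ("clause", {1, 2}),
    "C2": ("clause", {3, 4}),
    "P1": ("phrase", {1, 2}),
    "P2": ("phrase", {3}),
    "P3": ("phrase", {4}),
    "w1": ("word", {1}),
}

print(parents("P2", objects))  # the clause that embeds phrase P2
```

Note that the sentence S also embeds P2, but it is not a parent of P2 because the clause type lies between phrase and sentence.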

The mother relationship¶

While using the embedding got us trees, using the mother relationship will give us more interesting trees. In general, the mother in the ETCBC points to an object on which the object in question is, in some sense, dependent. The nature of this dependency is coded in a specific feature on clauses, the clause_constituent_relation. For a list of values this feature can take and the associated meanings, see the notebook clause_phrase_types in this directory.

Here is a description of what we do with the mother relationship.

If a clause has a mother, there are three cases for the clause_constituent_relation of this clause:

1. its value is in $\{$ Adju, Objc, Subj, PrAd, PreC, Cmpl, Attr, RgRc, Spec $\}$
2. its value is Coor
3. its value is in $\{$ Resu, ReVo, none $\}$

In case 3 we do nothing.

In case 1 we remove the link of the clause to its parent and add the clause as a child to either the object that the mother points to, or to the parent of the mother. We do the latter only if the mother is a word. We will not add children to words.
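The case 1 move can be sketched on plain dictionaries. This is an assumption-laden illustration, not the code of etcbc.trees: the function `reattach_clause` and the toy node names are hypothetical, and the moved clause is simply appended as the last child rather than placed in monad set order.

```python
def reattach_clause(clause, mother, parent, children, otype):
    """Case 1: move `clause` below its mother, or below the mother's parent
    if the mother is a word (words do not get children)."""
    old_parent = parent.get(clause)
    if old_parent is not None:
        children[old_parent].remove(clause)  # sever the embedding link
    target = parent[mother] if otype[mother] == "word" else mother
    parent[clause] = target
    children.setdefault(target, []).append(clause)  # simplification: appended last
    return target

# Hypothetical toy tree: sentence S with clauses C1 and C2; C2 has word w1 as mother.
otype = {"S": "sentence", "C1": "clause", "C2": "clause", "P1": "phrase", "w1": "word"}
parent = {"C1": "S", "C2": "S", "P1": "C1", "w1": "P1"}
children = {"S": ["C1", "C2"], "C1": ["P1"], "P1": ["w1"]}

target = reattach_clause("C2", "w1", parent, children, otype)
print(target, children)
```

Because the mother is a word, the clause ends up under the phrase that contains that word.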

In the diagrams, the red arrows represent the mother relationship, the black arrows the embedding relationships, and the fat black arrows the new parent relationships. The gray arrows indicate severed parent links.

In case 2 we create a node between the mother and its parent. This node takes the name of the mother, and the mother will be added as child, but with name Ccoor, and the clause which points to the mother is added as a sister.

This is a rather complicated case, but the intuition is not that difficult. Consider the sentence:

John thinks that Mary said it and did it

We have a compound object sentence, with Mary said it and did it as coordinated components. The way this has been marked up in the BHSA database is as follows:

Mary said it, clause with clause_constituent_relation=Objc, mother=John thinks(clause)

and did it, clause with clause_constituent_relation=Coor, mother=Mary said it(clause)

So the second coordinated clause is simply linked to the first coordinated clause. Restructuring means to create a parent for both coordinated clauses and treat both as sisters at the same hierarchical level. See the diagram.
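The restructuring for the example sentence can be sketched as follows. It is a hedged illustration on hypothetical dictionaries (`coordinate`, the node names and the id `N` are made up here); the actual module may represent the coordination differently.

```python
def coordinate(clause, mother, parent, children, label, new_id):
    """Case 2: put a new node between `mother` and its parent, and make
    `mother` and `clause` sisters below that new node."""
    old_parent = parent.get(clause)
    if old_parent is not None:
        children[old_parent].remove(clause)       # detach the Coor clause
    grand = parent[mother]
    position = children[grand].index(mother)
    children[grand][position] = new_id            # new node takes the mother's place
    parent[new_id] = grand
    label[new_id] = label[mother]                 # ... and the mother's name
    label[mother] = "Ccoor"
    children[new_id] = [mother, clause]
    parent[mother] = new_id
    parent[clause] = new_id

# Toy tree after case 1 has moved "Mary said it" below "John thinks".
label = {"S": "S", "C_john": "C", "C_mary": "Cobjc", "C_did": "Ccoor"}
parent = {"C_john": "S", "C_mary": "C_john", "C_did": "S"}
children = {"S": ["C_john", "C_did"], "C_john": ["C_mary"]}

coordinate("C_did", "C_mary", parent, children, label, new_id="N")
print(children, label)
```

After the call, the new node N carries the name Cobjc and has both coordinated clauses as children, each named Ccoor.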

Note on order¶

When we add nodes to new parents, we let them occupy the sequential position among their new sisters that corresponds with the monad set ordering.

Note on discontinuity¶

Sentences, clauses and phrases are not always continuous. Before restructuring it will not always be the case that if you walk the tree in pre-order, you will end up with the leaves (the words) in the same order as the original sentence. Restructuring generally improves that, because it often puts an object under a non-continuous parent object precisely at the location that corresponds with a gap in the parent.

However, there is no guarantee that every discontinuity will be resolved in this graceful manner. When we create the trees, we also output the list of monad numbers that you get when you walk the tree in pre-order. Whenever this list is not monotonic, there is an issue with the ordering.
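The monotonicity test on the leaf order can be sketched like this. The nested-tuple tree format and the helper names are assumptions made for the illustration; leaves are represented by their monad numbers.

```python
def leaves(tree):
    """Collect the monad numbers of a nested (label, children) tree in pre-order."""
    if isinstance(tree, int):
        return [tree]
    _label, kids = tree
    result = []
    for kid in kids:
        result.extend(leaves(kid))
    return result

def is_monotonic(monads):
    """True if the monad numbers are strictly increasing."""
    return all(a < b for a, b in zip(monads, monads[1:]))

# Hypothetical discontinuous clause {1, 2, 5} with an embedded clause {3, 4}.
before = ("S", [("C", [("P", [1, 2]), ("P", [5])]), ("Cattr", [("P", [3, 4])])])
after = ("S", [("C", [("P", [1, 2]), ("Cattr", [("P", [3, 4])]), ("P", [5])])])

print(leaves(before), is_monotonic(leaves(before)))
print(leaves(after), is_monotonic(leaves(after)))
```

In the toy example the embedded clause lands in the gap of its discontinuous parent, which restores the original word order.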

Note on incest¶

If a mother points to itself or a descendant of itself, we have a grave form of incest. In these cases, the restructuring algorithm will disconnect a parent link without introducing a new link to the tree above it: a whole fragment of the tree becomes disconnected and will get lost.

Sanity check 6 below reveals that this occurs in fact 4 times in the BHSA version 4 (it occurred 13 times in the BHSA 3 version). We will exclude these trees from further processing.
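A mother that points to the clause itself or to one of its descendants can be detected before restructuring. The sketch below assumes a plain `children` dictionary and hypothetical node names; it is not how etcbc.trees reports these cases.

```python
def descendants(node, children):
    """Return the set of all nodes below `node` in the children relation."""
    seen = set()
    stack = list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(children.get(current, ()))
    return seen

def is_incestuous(clause, mother, children):
    """True if the mother is the clause itself or one of its descendants."""
    return mother == clause or mother in descendants(clause, children)

# Hypothetical toy tree: clause C2 is embedded in clause C1.
children = {"S": ["C1"], "C1": ["P1", "C2"], "C2": ["P2"]}

print(is_incestuous("C1", "C2", children))  # C1's mother lies below C1
print(is_incestuous("C2", "C1", children))  # C2's mother lies above C2
```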

Note on adultery¶

If a mother points outside the sentence of the clause on which it is specified we have a form of adultery. This should not happen. Mothers may point outside their sentences, but not in the cases that trigger restructuring. Yet, the sanity checks below reveal that this occurs twice. We will exclude these cases from further processing.

In [1]:

import sys
import collections
import random
%load_ext autoreload
%autoreload 2

import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
from etcbc.lib import Transcription, monad_set
from etcbc.trees import Tree

fabric = LafFabric()
tr = Transcription()

  0.00s This is LAF-Fabric 4.8.3
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html

The engines are fired up now, all ETCBC data we need is accessible through the fabric and tr objects.

Next we select the information we want and load it into memory.

In [2]:

API = fabric.load('etcbc4', '--', 'trees', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        oid otype monads
        g_cons_utf8
        sp
        rela typ
        label
    ''','''
        mother
    '''),
    "prepare": prepare,
}, verbose='NORMAL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s USING main: etcbc4 DATA COMPILED AT: 2014-07-23T09-31-37
  4.53s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4/trees/__log__trees.txt
  4.53s INFO: LOADING PREPARED data: please wait ... 
  4.53s prep prep: G.node_sort
  4.53s PREPARING prep: G.node_sort
  4.53s LOADING API with EXTRAs: please wait ... 
  4.53s USING main: etcbc4 DATA COMPILED AT: 2014-07-23T09-31-37
  6.11s NORMAL: DATA LOADED FROM SOURCE etcbc4 AND ANNOX  FOR TASK trees AT 2016-11-28T10-46-41
  6.11s SORTING nodes ...
    53s WRITING prep: G.node_sort
    53s prep prep: G.node_sort_inv
    53s PREPARING prep: G.node_sort_inv
    53s SORTING nodes (inv) ...
    53s WRITING prep: G.node_sort_inv
    54s prep prep: L.node_up
    54s PREPARING prep: L.node_up
    54s LOADING API with EXTRAs: please wait ... 
    54s USING main: etcbc4 DATA COMPILED AT: 2014-07-23T09-31-37
    54s NORMAL: DATA LOADED FROM SOURCE etcbc4 AND ANNOX  FOR TASK trees AT 2016-11-28T10-47-29
    54s Objects contained in books
 1m 03s Objects contained in chapters
 1m 13s Objects contained in verses
 1m 26s Objects contained in half_verses
 1m 36s Objects contained in sentences
 1m 47s Objects contained in sentence_atoms
 1m 58s Objects contained in clauses
 2m 05s Objects contained in clause_atoms
 2m 12s Objects contained in phrases
 2m 21s Objects contained in phrase_atoms
 2m 28s Objects contained in subphrases
 2m 31s Objects contained in words
 2m 35s WRITING prep: L.node_up
 2m 38s prep prep: L.node_down
 2m 38s PREPARING prep: L.node_down
 2m 38s WRITING prep: L.node_down
 2m 42s prep prep: V.verses
 2m 42s PREPARING prep: V.verses
 2m 42s LOADING API with EXTRAs: please wait ... 
 2m 42s USING main: etcbc4 DATA COMPILED AT: 2014-07-23T09-31-37
 2m 42s NORMAL: DATA LOADED FROM SOURCE etcbc4 AND ANNOX  FOR TASK trees AT 2016-11-28T10-49-17
 2m 42s Making verse index
 2m 44s Done. 23213 verses
 2m 44s WRITING prep: V.verses
 2m 44s prep prep: V.books_la
 2m 44s PREPARING prep: V.books_la
 2m 44s Listing books
 2m 45s Done. 39 books
 2m 45s WRITING prep: V.books_la
 2m 45s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
 2m 48s INFO: LOADED PREPARED data
 2m 48s INFO: DATA LOADED FROM SOURCE etcbc4 AND ANNOX lexicon FOR TASK trees AT 2016-11-28T10-49-23

In [3]:

show_ccr = collections.defaultdict(lambda: 0)
for n in NN():
    otype = F.otype.v(n)
    if otype == 'clause':
        show_ccr[F.rela.v(n)] += 1
for c in sorted(show_ccr):
    print("{:<8}: {}x".format(c, show_ccr[c]))

Adju    : 5872x
Attr    : 5930x
Cmpl    : 241x
CoVo    : 305x
Coor    : 2858x
NA      : 69400x
Objc    : 1347x
PrAd    : 1x
PreC    : 156x
Resu    : 1193x
RgRc    : 198x
Spec    : 41x
Subj    : 436x

Before we can apply tree construction, we have to specify the objects and features we work with, since the tree constructing algorithm can also work on slightly different data, with other object types and different feature names.

We also construct an index that tells to which verse each sentence belongs.

In [4]:

type_info = (
    ("word", ''),
    ("subphrase", 'U'),
    ("phrase", 'P'),
    ("clause", 'C'),
    ("sentence", 'S'),
)
type_table = dict(t for t in type_info)
type_order = [t[0] for t in type_info]

pos_table = {
 'adjv': 'aj',
 'advb': 'av',
 'art': 'dt',
 'conj': 'cj',
 'intj': 'ij',
 'inrg': 'ir',
 'nega': 'ng',
 'subs': 'n',
 'nmpr': 'n-pr',
 'prep': 'pp',
 'prps': 'pr-ps',
 'prde': 'pr-dem',
 'prin': 'pr-int',
 'verb': 'vb',
}

ccr_info = {
    'Adju': ('r', 'Cadju'),
    'Attr': ('r', 'Cattr'),
    'Cmpl': ('r', 'Ccmpl'),
    'CoVo': ('n', 'Ccovo'),
    'Coor': ('x', 'Ccoor'),
    'Objc': ('r', 'Cobjc'),
    'PrAd': ('r', 'Cprad'),
    'PreC': ('r', 'Cprec'),
    'Resu': ('n', 'Cresu'),
    'RgRc': ('r', 'Crgrc'),
    'Spec': ('r', 'Cspec'),
    'Subj': ('r', 'Csubj'),
    'NA':   ('n', 'C'),

}

tree_types = ('sentence', 'clause', 'phrase', 'subphrase', 'word')
(root_type, leaf_type, clause_type) = (tree_types[0], tree_types[-1], 'clause')
ccr_table = dict((c[0],c[1][1]) for c in ccr_info.items())
ccr_class = dict((c[0],c[1][0]) for c in ccr_info.items())

root_verse = {}

for n in NN():
    otype = F.otype.v(n)
    if otype == 'verse': cur_verse = F.label.v(n)
    elif otype == root_type: root_verse[n] = cur_verse

Now we can actually construct the tree by initializing a tree object. After that we call its restructure_clauses() method.

Then we have two tree structures for each sentence:

the etree, i.e. the tree obtained by working out the embedding relationships and nothing else
the rtree, i.e. the tree obtained by restructuring the etree

We have several tree relationships at our disposal:

eparent and its inverse echildren
rparent and its inverse rchildren
elder_sister going from a clause of kind Coor (case 2) to the node its mother points to
sisters being the inverse of elder_sister

where elder_sister and sisters only occur in the rtree.
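What "inverse" means for these relations can be sketched as follows, assuming that a parent-like relation maps each node to exactly one node. The function `invert` and the toy names are hypothetical; sorting by name stands in for the monad set order.

```python
from collections import defaultdict

def invert(parent):
    """Turn a child -> parent mapping into a parent -> [children] mapping."""
    children = defaultdict(list)
    for child in sorted(parent):  # stand-in for the monad set order
        children[parent[child]].append(child)
    return dict(children)

toy_parent = {"C1": "S", "C2": "S", "P1": "C1", "P2": "C1", "P3": "C2"}
toy_children = invert(toy_parent)
print(toy_children)
```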

This will take a while (25 seconds approx on a MacBook Air 2012).

In [5]:

tree = Tree(API, otypes=tree_types, 
    clause_type=clause_type,
    ccr_feature='rela',
    pt_feature='typ',
    pos_feature='sp',
    mother_feature = 'mother',
)
tree.restructure_clauses(ccr_class)
results = tree.relations()
parent = results['rparent']
sisters = results['sisters']
children = results['rchildren']
elder_sister = results['elder_sister']
msg("Ready for processing")

  0.00s LOADING API with EXTRAs: please wait ... 
  0.00s INFO: USING DATA COMPILED AT: 2014-07-23T09-31-37
  1.48s INFO: DATA LOADED FROM SOURCE etcbc4 AND ANNOX -- FOR TASK trees AT 2015-03-26T06-42-46
  0.00s Start computing parent and children relations for objects of type sentence, clause, phrase, subphrase, word
  1.50s 100000 nodes
  3.03s 200000 nodes
  4.47s 300000 nodes
  6.01s 400000 nodes
  7.45s 500000 nodes
  8.98s 600000 nodes
    11s 700000 nodes
    12s 800000 nodes
    13s 900000 nodes
    14s 947471 nodes: 881423 have parents and 520916 have children
    14s Restructuring clauses: deep copying tree relations
    21s Pass 0: Storing mother relationship
    22s 18580 clauses have a mother
    22s All clauses have mothers of types in {'word', 'phrase', 'subphrase', 'sentence', 'clause'}
    22s Pass 1: all clauses except those of type Coor
    23s Pass 2: clauses of type Coor only
    24s Mothers applied. Found 0 motherless clauses.
    24s 2497 nodes have 1 sisters
    24s 167 nodes have 2 sisters
    24s 9 nodes have 3 sisters
    24s There are 2858 sisters, 2673 nodes have sisters.
    24s Ready for processing

Checking for sanity¶

Let us see whether the trees we have constructed satisfy some sanity constraints. After all, the algorithm is based on certain assumptions about the data, but are those assumptions valid? And restructuring is a tricky operation, do we have confidence that nothing went wrong?

1. How many sentence nodes? From earlier queries we know what to expect.
2. Does any sentence have a parent? If so, there is something wrong with our assumptions or algorithm.
3. Is every top node a sentence? If not, we have material outside a sentence, which contradicts the assumptions.
4. Do you reach all sentences if you go up from words? If not, some sentences do not contain words.
5. Do you reach all words if you go down from sentences? If not, some words have become disconnected from their sentences.
6. Do you reach the same words in reconstructed trees as in embedded trees? If not, some sentence material has got lost during the restructuring process.
7. From what object types to what object types does the parent relationship link? Here we check that parents do not link object types that are too distant in the object type ranking.
8. How many nodes have mothers and how many mothers can a node have? We expect at most one.
9. From what object types to what object types does the mother relationship link?
10. Is the mother of a clause always in the same sentence? If not, foreign sentences will be drawn in, leading to (very) big chunks. This may occur when we use mother relationships in cases where clause_constituent_relation has different values than the ones that should trigger restructuring.
11. Has the max/average tree depth increased after restructuring? By how much? This is meant as an indication by how much our tree structures improve in significant hierarchy when we take the mother relationship into account.

In [6]:

#1
msg("Counting {}s ... (expecting 66045 (in etcbc3: 71354))".format(root_type))
msg("There are {} {}s".format(len(set(NN(test=F.otype.v, value=root_type))), root_type))

 1m 26s Counting sentences ... (expecting 66045 (in etcbc3: 71354))
 1m 27s There are 66045 sentences

In [7]:

#2
msg("Checking parents of {}s ... (expecting none)".format(root_type))
exceptions = set()
for node in NN(test=F.otype.v, value=root_type):
    if node in parent: exceptions.add(node)
if len(exceptions) == 0:
    msg("No {} has a parent".format(root_type))
else:
    msg("{} {}s have a parent:".format(len(exceptions), root_type))
    for n in sorted(exceptions):
        p = parent[n]
        msg("{} {} [{}] has {} parent {} [{}]".format(
            root_type, n, F.monads.v(n), 
            F.otype.v(p), p, F.monads.v(p)
        ))

 1m 30s Checking parents of sentences ... (expecting none)
 1m 31s No sentence has a parent

In [8]:

#3 (again a check on #1)
msg("Checking the types of root nodes ... (should all be sentences)")
exceptions = collections.defaultdict(lambda: [])
sn = 0
for node in NN():
    otype = F.otype.v(node)
    if otype not in type_table: continue
    if otype == root_type: sn += 1
    if node not in parent and node not in elder_sister and otype != root_type: 
        exceptions[otype].append(node)
if len(exceptions) == 0:
    msg("All top nodes are {}s".format(root_type))
else:
    msg("Top nodes which are not {}s:".format(root_type))
    for t in sorted(exceptions):
        msg("{}: {}x".format(t, len(exceptions[t])), withtime=False)
msg("{} {}s seen".format(sn, root_type))

for c in exceptions[clause_type]:
    (s, st) = tree.get_root(c, 'e')
    v = root_verse[s]
    msg("{}={}, {}={}={}, verse={}".format(clause_type, c, root_type, st, s, v), withtime=False)

 1m 37s Checking the types of root nodes ... (should all be sentences)
 1m 39s Top nodes which are not sentences:
subphrase: 3x
 1m 39s 66045 sentences seen

In [9]:

#4, 5
def get_top(kind, rel, rela, multi):
    seen = set()
    top_nodes = set()
    start_nodes = set(NN(test=F.otype.v, value=kind))
    next_nodes = start_nodes
    msg("Starting from {} nodes ...".format(kind))
    while len(next_nodes):
        new_next_nodes = set()
        for node in next_nodes:
            if node in seen: continue
            seen.add(node)
            is_top = True
            if node in rel: 
                is_top = False
                if multi:
                    for c in rel[node]: new_next_nodes.add(c)
                else:
                    new_next_nodes.add(rel[node])
            if node in rela: 
                is_top = False
                if multi:
                    for c in rela[node]: new_next_nodes.add(c)
                else:
                    new_next_nodes.add(rela[node])
            if is_top: top_nodes.add(node)
        next_nodes = new_next_nodes
    top_types = collections.defaultdict(lambda: 0)
    for t in top_nodes:
        top_types[F.otype.v(t)] += 1
    for t in top_types:
        msg("From {} {} nodes reached {} {} nodes".format(len(start_nodes), kind, top_types[t], t), withtime=False)

msg("Embedding trees")
get_top(leaf_type, tree.eparent, {}, False)
get_top(root_type, tree.echildren, {}, True)
msg("Restructd trees")
get_top(leaf_type, tree.rparent, tree.elder_sister, False)
get_top(root_type, tree.rchildren, tree.sisters, True)
msg("Done")

 1m 47s Embedding trees
 1m 48s Starting from word nodes ...
From 426555 word nodes reached 66045 sentence nodes
From 426555 word nodes reached 3 subphrase nodes
 1m 50s Starting from sentence nodes ...
From 66045 sentence nodes reached 426552 word nodes
 1m 51s Restructd trees
 1m 52s Starting from word nodes ...
From 426555 word nodes reached 66045 sentence nodes
From 426555 word nodes reached 3 subphrase nodes
 1m 54s Starting from sentence nodes ...
From 66045 sentence nodes reached 425602 word nodes
 1m 56s Done

In [10]:

#6
msg("Verifying whether all monads are preserved under restructuring")
errors = []
#i = 10
for snode in NN(test=F.otype.v, value=root_type):
    declared_monads = monad_set(F.monads.v(snode))
    results = {}
    thisgood = {}
    for kind in ('e', 'r'):
        results[kind] = set(int(F.monads.v(l)) for l in tree.get_leaves(snode, kind) if F.otype.v(l) == leaf_type)
        thisgood[kind] = declared_monads == results[kind]
        #if not thisgood[kind]:
            #print("{} D={}\nL={}".format(kind, declared_monads, results[kind]))
            #i -= 1
    #if i == 0: break
    if False in thisgood.values(): errors.append((snode, thisgood['e'], thisgood['r']))
msg("{} mismatches:".format(len(errors)))
mine = min(20, len(errors))
skip = {e[0] for e in errors}
for (s, e, r) in errors[0:mine]:
    msg("{} embedding: {}; restructd: {}".format(s, 'OK' if e else 'XX', 'OK' if r else 'XX'), withtime=False)

 1m 59s Verifying whether all monads are preserved under restructuring
 2m 07s 3 mismatches:
1176165 embedding: XX; restructd: XX
1176166 embedding: XX; restructd: XX
1188063 embedding: XX; restructd: XX

In [11]:

#7
msg("Which types embed which types and how often? ...")
for kind in ('e', 'r'):
    plinked_types = collections.defaultdict(lambda: 0)
    parent = tree.eparent if kind == 'e' else tree.rparent
    kindrep = 'embedding' if kind == 'e' else 'restructd'
    for (c, p) in parent.items():
        plinked_types[(F.otype.v(c), F.otype.v(p))] += 1
    msg("Found {} parent ({}) links between types".format(len(parent), kindrep))
    for lt in sorted(plinked_types):
        msg("{}: {}x".format(lt, plinked_types[lt]), withtime=False)

 2m 13s Which types embed which types and how often? ...
 2m 15s Found 881423 parent (embedding) links between types
('clause', 'sentence'): 87978x
('phrase', 'clause'): 254662x
('phrase', 'subphrase'): 2x
('subphrase', 'phrase'): 83122x
('subphrase', 'subphrase'): 29104x
('word', 'phrase'): 303715x
('word', 'subphrase'): 122840x
 2m 17s Found 878565 parent (restructd) links between types
('clause', 'clause'): 8117x
('clause', 'phrase'): 5395x
('clause', 'sentence'): 70898x
('clause', 'subphrase'): 710x
('phrase', 'clause'): 254662x
('phrase', 'subphrase'): 2x
('subphrase', 'phrase'): 83122x
('subphrase', 'subphrase'): 29104x
('word', 'phrase'): 303715x
('word', 'subphrase'): 122840x

In [12]:

#8
msg("How many mothers can nodes have? ...")
mother_len = {}
for c in NN():
    lms = list(C.mother.v(c))
    nms = len(lms)
    if nms: mother_len[c] = nms
count = collections.defaultdict(lambda: 0)
for c in tree.mother: count[mother_len[c]] += 1
msg("There are {} tree nodes with a mother".format(len(tree.mother)))
for cnt in sorted(count):
    msg("{} nodes have {} mother{}".format(count[cnt], cnt, 's' if cnt != 1 else ''), withtime=False)      

 2m 21s How many mothers can nodes have? ...
 2m 27s There are 18580 tree nodes with a mother
18580 nodes have 1 mother

In [13]:

#9
msg("Which types have mother links to which types and how often? ...")
mlinked_types = collections.defaultdict(lambda: set())
for (c, m) in tree.mother.items():
    ctype = F.otype.v(c)
    mlinked_types[(ctype, F.rela.v(c), F.otype.v(m))].add(c)
msg("Found {} mother links between types".format(len(tree.mother)))
for lt in sorted(mlinked_types):
    msg("{}: {}x".format(lt, len(mlinked_types[lt])), withtime=False)

 2m 33s Which types have mother links to which types and how often? ...
 2m 33s Found 18580 mother links between types
('clause', 'Adju', 'clause'): 5872x
('clause', 'Attr', 'clause'): 45x
('clause', 'Attr', 'phrase'): 5128x
('clause', 'Attr', 'word'): 757x
('clause', 'Cmpl', 'clause'): 241x
('clause', 'CoVo', 'clause'): 305x
('clause', 'Coor', 'clause'): 2845x
('clause', 'Coor', 'phrase'): 13x
('clause', 'NA', 'clause'): 2x
('clause', 'Objc', 'clause'): 1347x
('clause', 'PrAd', 'clause'): 1x
('clause', 'PreC', 'clause'): 156x
('clause', 'Resu', 'clause'): 1193x
('clause', 'RgRc', 'clause'): 4x
('clause', 'RgRc', 'phrase'): 1x
('clause', 'RgRc', 'word'): 193x
('clause', 'Spec', 'clause'): 15x
('clause', 'Spec', 'phrase'): 25x
('clause', 'Spec', 'word'): 1x
('clause', 'Subj', 'clause'): 436x

In [14]:

#10
msg("Counting {}s with mothers in another {}".format(clause_type, root_type))
exceptions = set()
for node in tree.mother:
    if F.otype.v(node) not in type_table: continue
    mnode = tree.mother[node]
    snode = tree.get_root(node, 'e')
    smnode = tree.get_root(mnode, 'e')
    if snode != smnode:
        exceptions.add((node, snode, smnode))
msg("{} nodes have a mother in another {}".format(len(exceptions), root_type))
for (n, sn, smn) in exceptions:
    msg("[{} {}]({}) occurs in {} but has mother in {}".format(F.otype.v(n), F.monads.v(n), n, sn, smn), withtime=False)

 2m 42s Counting clauses with mothers in another sentence
 2m 42s 0 nodes have a mother in another sentence

In [15]:

#11
msg("Computing lengths and depths")
ntrees = 0
rntrees = 0
total_depth = {'e': 0, 'r': 0}
rtotal_depth = {'e': 0, 'r': 0}
max_depth = {'e': 0, 'r':0}
rmax_depth = {'e': 0, 'r': 0}
total_length = 0

for node in NN(test=F.otype.v, value=root_type):
    ntrees += 1
    total_length += tree.length(node)
    this_depth = {}
    for kind in ('e', 'r'):
        this_depth[kind] = tree.depth(node, kind)
    different = this_depth['e'] != this_depth['r']
    if different: rntrees += 1
    for kind in ('e', 'r'):
        if this_depth[kind] > max_depth[kind]: max_depth[kind] = this_depth[kind]
        total_depth[kind] += this_depth[kind]
        if different:
            if this_depth[kind] > rmax_depth[kind]: rmax_depth[kind] = this_depth[kind]
            rtotal_depth[kind] += this_depth[kind]
                
msg("{} trees seen, of which in {} cases restructuring makes a difference in depth".format(ntrees, rntrees))
if ntrees > 0:
    msg("Embedding trees: max depth = {:>2}, average depth = {:.2g}".format(max_depth['e'], total_depth['e'] / ntrees))
    msg("Restructd trees: max depth = {:>2}, average depth = {:.2g}".format(max_depth['r'], total_depth['r'] / ntrees))
if rntrees > 0:
    msg("Statistics for cases where restructuring makes a difference:")
    msg("Embedding trees: max depth = {:>2}, average depth = {:.2g}".format(rmax_depth['e'], rtotal_depth['e'] / rntrees))
    msg("Restructd trees: max depth = {:>2}, average depth = {:.2g}".format(rmax_depth['r'], rtotal_depth['r'] / rntrees))
msg("Total number of leaves in the trees: {}, average number of leaves = {:.2g}".format(total_length, total_length / ntrees))

 2m 47s Computing lengths and depths
 2m 52s 66045 trees seen, of which in 9655 cases restructuring makes a difference in depth
 2m 52s Embedding trees: max depth = 13, average depth = 3.6
 2m 52s Restructd trees: max depth = 19, average depth = 3.8
 2m 52s Statistics for cases where restructuring makes a difference:
 2m 52s Embedding trees: max depth = 13, average depth = 3.7
 2m 52s Restructd trees: max depth = 19, average depth = 5.3
 2m 52s Total number of leaves in the trees: 426555, average number of leaves = 6.5

Writing Trees¶

After all these checks we can proceed to print out the tree structures as plain, bracketed text strings.

Per tree we also print a string of the monad numbers that you get when you walk the tree in pre-order. And we produce object ids from the EMDROS database and node ids from the LAF version.
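To give an idea of what a plain, bracketed text string looks like, here is a small sketch. The bracket layout and the nested-tuple input are assumptions for illustration only; the exact output format of the write_tree() method may differ.

```python
def write_bracketed(node):
    """Render a nested (tag, content) tree as a bracketed string.
    A leaf has a string as content; an inner node has a list of child nodes."""
    tag, content = node
    if isinstance(content, str):
        return "({} {})".format(tag, content)
    return "({}{})".format(tag, "".join(write_bracketed(child) for child in content))

# Hypothetical toy tree with made-up tags.
toy_tree = ("S", [("C", [("VP", [("vb", "said")]), ("NP", [("n", "Mary")])])])
bracketed = write_bracketed(toy_tree)
print(bracketed)
```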

First we apply our algorithms to a limited set of interesting trees and a random sample. For those cases we also apply a debug_write() method that outputs considerably more information.

This output has been visually checked by Constantijn Sikkel and Dirk Roorda.

In [16]:

def get_tag(node):
    otype = F.otype.v(node)
    tag = type_table[otype]
    if tag == 'P': tag = F.typ.v(node)
    elif tag == 'C': tag = ccr_table[F.rela.v(node)]
    is_word = tag == ''
    pos = pos_table[F.sp.v(node)] if is_word else None
    monad = int(F.monads.v(node)) if is_word else None
    text = '"{}"'.format(F.g_cons_utf8.v(node)) if is_word else None
    return (tag, pos, monad, text, is_word)

def passage_roots(verse_label):
    sought = []
    grab = -1
    for n in NN():
        if grab == 1: continue
        otype = F.otype.v(n)
        if otype == 'verse': 
            check = F.label.v(n) == verse_label
            if check: grab = 0
            elif grab == 0: grab = 1
        if grab == 0 and otype == root_type: sought.append(n)
    return sought

def showcases(cases, ofile):
    out = outfile(ofile)
    for snode in cases:
        out.write("\n====================\n{}\n{}\n{} bhs_id={} laf_node={}:\n".format(
            root_verse[snode], cases[snode], root_type, F.oid.v(snode), snode,
        ))
        for kind in ('e', 'r'):
            out.write("\nTree based on monad embedding {}\n\n".format(
                "only" if kind == 'e' else " and mother+clause_constituent relation"
            ))
            (tree_rep, words_rep, bmonad) = tree.write_tree(snode, kind, get_tag, rev=False, leafnumbers=False)
            out.write("{}\n\n{}\n".format(words_rep, tree_rep))
            out.write("\nDepth={}\n".format(tree.depth(snode, kind)))
            out.write(tree.debug_write_tree(snode, kind, legenda=kind=='r'))
    out.close()
    

# below holds for etcbc3, in etcbc4 we have fewer problem cases

problem_desc = collections.OrderedDict((
    (1131739, "debug reorder"),
    (1131712, "interesting"), 
    (1131701, "interesting"),
    (1140469, "subject clause order"),
    (passage_roots(' GEN 01,16')[1], "interesting"), 
    (1164864, "interesting"),
    (1143081, "cyclic mothers"),
    (1153973, "cyclic mothers"),
    (1158971, "cyclic mothers"),
    (1160416, "cyclic mothers"),
    (1160464, "cyclic mothers"),
    (1161141, "nested cyclic mothers: C.coor => C.attr => P below first C.coor"), 
    (1163666, "cyclic mothers"), 
    (1164830, "cyclic mothers"), 
    (1167680, "cyclic mothers"), 
    (1170057, "cyclic mothers"), 
    (1193065, "cyclic mothers"), 
    (1199681, "cyclic mothers"), 
    (1199682, "mother points outside sentence"),
))
fixed_sample = (
    1167680,
    1167152,
    1145250,
    1154339,
    1136677,
    1166385,
    1198984,
    1152969,
    1153930,
    1150648,
    1168396,
    1151917,
    1164750,
    1156719,
    1148048,
    1138673,
    1134184,
    1156789,
    1156600,
    1140469,
)
sample_size = 20
sample = {}
fsample = collections.OrderedDict()
mother_keys = list(sorted(tree.mother))
for s in range(sample_size):
    r = random.randint(0, len(mother_keys) - 1)
    snode = tree.get_root(tree.mother[mother_keys[r]], 'e')[0]
    sample[snode] = 'random sample in {}s with {}s with mothers'.format(root_type, clause_type)
for snode in fixed_sample:
    fsample[snode] = 'fixed sample in {}s with {}s with mothers'.format(root_type, clause_type)

#showcases(problem_desc, "tree_notabene.txt")
showcases(sample, 'trees_random_{}.txt'.format(sample_size))
#showcases(fsample, 'trees_fixed_{}.txt'.format(len(fsample)))
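
The sampling loop above draws indices with random.randint, so the same root can come up more than once and the resulting dictionary may hold fewer than sample_size entries. The following is a minimal sketch of drawing distinct roots reproducibly. It does not use etcbc.trees: draw_distinct_roots, root_of and the toy data are hypothetical names for illustration, with root_of standing in for the tree.get_root lookup.

```python
import random

def draw_distinct_roots(candidates, root_of, size, seed=None):
    """Return up to `size` distinct roots, visiting candidates in random order.

    candidates: iterable of nodes (stand-in for the keys of tree.mother)
    root_of:    function from a node to its root (hypothetical stand-in)
    seed:       fix it to get the same sample on every run
    """
    rng = random.Random(seed)
    pool = sorted(candidates)
    rng.shuffle(pool)
    roots = []
    seen = set()
    for node in pool:
        root = root_of(node)
        if root in seen:
            continue  # this root is already in the sample
        seen.add(root)
        roots.append(root)
        if len(roots) == size:
            break
    return roots

# Toy data: nodes 0..99, ten nodes under each of ten roots.
def toy_root(n):
    return 1000 + n // 10

picked = draw_distinct_roots(range(100), toy_root, size=5, seed=42)
```

With a fixed seed the sample is the same on every run, which makes the written showcase file comparable between runs. If fewer distinct roots exist than requested, the function returns all of them.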

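Finally, we produce the whole set of trees. Every node of type root_type that is not in skip yields one tree, written to trees.txt below a header line with the verse label, node, oid, bmonad and leaf numbers.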

In [17]:

msg("Writing {} trees".format(root_type))
trees = outfile("trees.txt")
verse_label = ''
s = 0
chunk = 10000
sc = 0
for node in NN():
    if node in skip: continue
    otype = F.otype.v(node)
    oid = F.oid.v(node)
    if otype == 'verse':
        verse_label = F.label.v(node)
        continue
    if otype != root_type: continue
    (tree_rep, words_rep, bmonad) = tree.write_tree(node, 'r', get_tag, rev=False, leafnumbers=False)
    trees.write("\n#{}\tnode={}\toid={}\tbmonad={}\t{}\n{}\n".format(
        verse_label, node, oid, bmonad, words_rep, tree_rep,
    ))
    s += 1
    sc += 1
    if sc == chunk:
        msg("{} trees written".format(s))
        sc = 0
trees.close()    
msg("{} trees written".format(s))

 2m 59s Writing sentence trees
 3m 07s 10000 trees written
 3m 15s 20000 trees written
 3m 23s 30000 trees written
 3m 31s 40000 trees written
 3m 38s 50000 trees written
 3m 44s 60000 trees written
 3m 49s 66042 trees written

In [18]:

close()

28m 19s Results directory:
/Users/dirk/laf-fabric-output/etcbc4/trees

__log__trees.txt                       4903 Mon Sep 29 17:18:53 2014
anomalies.txt                             0 Tue Jul 15 14:40:09 2014
coor.txt                             157892 Tue Jul 15 17:45:28 2014
depths.txt                           661357 Tue Jul 15 17:45:26 2014
tgrep_result.txt                    4619456 Tue Jul 15 17:45:14 2014
tree_notabene.txt                     52483 Tue Jul 15 17:28:29 2014
trees.t2c                          12099621 Tue Jul 15 17:31:12 2014
trees.txt                          11737304 Mon Sep 29 16:54:23 2014
trees_fixed_20.txt                    14133 Tue Jul 15 17:28:29 2014
trees_random_20.txt                  105541 Mon Sep 29 16:53:31 2014

Preview

Here are the first lines of the output.

In [19]:

!head -n 25 {my_file('trees.txt')}

# GEN 01,01	node=1127306	oid=11	bmonad=1	0 1 2 3 4 5 6 7 8 9 10
(S(C(PP(pp "ב")(n "ראשׁית"))(VP(vb "ברא"))(NP(n "אלהים"))(PP(U(pp "את")(dt "ה")(n "שׁמים"))(cj "ו")(U(pp "את")(dt "ה")(n "ארץ")))))

# GEN 01,02	node=1127307	oid=39	bmonad=12	0 1 2 3 4 5 6
(S(C(CP(cj "ו"))(NP(dt "ה")(n "ארץ"))(VP(vb "היתה"))(NP(U(n "תהו"))(cj "ו")(U(n "בהו")))))

# GEN 01,02	node=1127308	oid=60	bmonad=19	0 1 2 3 4
(S(C(CP(cj "ו"))(NP(n "חשׁך"))(PP(pp "על")(U(n "פני"))(U(n "תהום")))))

# GEN 01,02	node=1127309	oid=78	bmonad=24	0 1 2 3 4 5 6 7
(S(C(CP(cj "ו"))(NP(U(n "רוח"))(U(n "אלהים")))(VP(vb "מרחפת"))(PP(pp "על")(U(n "פני"))(U(dt "ה")(n "מים")))))

# GEN 01,03	node=1127310	oid=104	bmonad=32	0 1 2
(S(C(CP(cj "ו"))(VP(vb "יאמר"))(NP(n "אלהים"))))

# GEN 01,03	node=1127311	oid=117	bmonad=35	0 1
(S(C(VP(vb "יהי"))(NP(n "אור"))))

# GEN 01,03	node=1127312	oid=128	bmonad=37	0 1 2
(S(C(CP(cj "ו"))(VP(vb "יהי"))(NP(n "אור"))))

# GEN 01,04	node=1127313	oid=143	bmonad=40	0 1 2 3 4 5 6 7
(S(C(CP(cj "ו"))(VP(vb "ירא"))(NP(n "אלהים"))(PP(pp "את")(dt "ה")(n "אור"))(Cobjc(CP(cj "כי"))(VP(vb "טוב")))))
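
The records shown above can be read back with a few lines of Python. This is a sketch only, assuming each record consists of one tab-separated header line followed by a one-line bracketed tree, and that words never contain a double quote. The names parse_header, parse_tree and words are hypothetical helpers, not part of etcbc.trees.

```python
import re

# Tokens: parentheses, quoted words, and bare tags.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')

def parse_header(line):
    """Split a header line into (label, info dict, leaf numbers)."""
    fields = line.lstrip('#').split('\t')
    label = fields[0].strip()
    # node=..., oid=..., bmonad=... (assumed to be integers, as in the preview)
    info = {k: int(v) for (k, v) in (f.split('=', 1) for f in fields[1:4])}
    leaves = [int(x) for x in fields[4].split()]
    return (label, info, leaves)

def parse_tree(text):
    """Parse a bracketed tree into nested tuples.

    Inner nodes become (tag, [children]); leaves become (tag, word).
    """
    tokens = TOKEN.findall(text)
    pos = 0

    def expect(tok):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != tok:
            raise ValueError('expected {!r} at token {}'.format(tok, pos))
        pos += 1

    def node():
        nonlocal pos
        expect('(')
        tag = tokens[pos]
        pos += 1
        if pos < len(tokens) and tokens[pos].startswith('"'):
            result = (tag, tokens[pos][1:-1])  # leaf: strip the quotes
            pos += 1
        else:
            children = []
            while pos < len(tokens) and tokens[pos] == '(':
                children.append(node())
            result = (tag, children)
        expect(')')
        return result

    tree = node()
    if pos != len(tokens):
        raise ValueError('trailing tokens after the tree')
    return tree

def words(tree):
    """Return the words at the leaves, in the order they are written."""
    (tag, rest) = tree
    if isinstance(rest, str):
        return [rest]
    return [w for child in rest for w in words(child)]

header = '# GEN 01,03\tnode=1127310\toid=104\tbmonad=32\t0 1 2'
tree_line = '(S(C(CP(cj "ו"))(VP(vb "יאמר"))(NP(n "אלהים"))))'
(label, info, leaves) = parse_header(header)
parsed = parse_tree(tree_line)
```

For this record the number of leaf numbers in the header equals the number of words in the tree, which gives a simple consistency check when reading the file.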

In [15]: