Weight calculation PCFG model (GBI treebank / N1904GBI)

Table of contents

1 - Introduction
2 - Create sum of transitions
3 - Average probabilities for the complete set
4 - Normalizing probabilities per source status

1 - Introduction

PCFG stands for Probabilistic Context-Free Grammar: a context-free grammar that assigns a probability to each production rule, indicating how likely that rule is to be used in a derivation.
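As a rough illustration of how rule probabilities are used: the probability of a derivation is the product of the probabilities of the rules applied in it. The rules and values below are invented for this sketch and are not taken from the treebank; summing logarithms is only one common way to avoid underflow on long derivations.

```python
import math

# Invented rule probabilities, keyed by (left-hand side, right-hand side).
rule_prob = {
    ("S", "np"): 0.6,
    ("np", "Term"): 0.9,
}

# A derivation is represented here simply as the list of rules it applies.
derivation = [("S", "np"), ("np", "Term")]

# Sum log-probabilities, then convert back to a probability.
log_p = sum(math.log(rule_prob[rule]) for rule in derivation)
p = math.exp(log_p)

print(f"derivation probability: {p:.4f}")
```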

The maximum-likelihood estimate of the probability of a transition $\alpha → \beta$ is:

$q_{ML}(\alpha → \beta) =\frac{count (\alpha → \beta)}{count (\alpha)}$

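Consequently, the probabilities of all $n$ transitions $\alpha → \beta_i$ that share the same left-hand side $\alpha$ sum to one:

$\sum_{i=1}^{n} q_{ML}(\alpha → \beta_i) = 1$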
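The estimate can be sketched in a few lines of Python. The counts are invented toy values, not treebank results, and the names `counts`, `lhs_totals` and `q_ml` are chosen here for illustration only.

```python
from collections import defaultdict

# Toy counts for transitions (alpha, beta); invented values for illustration.
counts = {
    ("S", "np"): 6,
    ("S", "CL"): 4,
    ("np", "Term"): 9,
    ("np", "np"): 1,
}

# count(alpha): total number of transitions that start from each left-hand side.
lhs_totals = defaultdict(int)
for (alpha, _beta), n in counts.items():
    lhs_totals[alpha] += n

# q_ML(alpha -> beta) = count(alpha -> beta) / count(alpha)
q_ml = {(alpha, beta): n / lhs_totals[alpha] for (alpha, beta), n in counts.items()}

for (alpha, beta), q in q_ml.items():
    print(f"{alpha} -> {beta}: {q:.2f}")
```

Because every count is divided by the total of its own left-hand side, the values for each left-hand side add up to 1.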

Testing dataset: N1904 treebank (GBI)

2 - Create sum of transitions

In [2]:

import pandas as pd
import sys
import os
import time
import pickle

import re  # used for regular expressions
from os import listdir
from os.path import isfile, join
import xml.etree.ElementTree as ET

In [3]:

BaseDir = 'C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\'
InputDir = BaseDir+'inputfiles\\'
bo='26-jude'
InputFile = os.path.join(InputDir, f'{bo}.xml')
tree = ET.parse(InputFile)
root = tree.getroot()

# Dictionary to store transition frequencies
transition_frequencies = {}

Several sets of books are defined here so that the calculated probability values can be compared between them.

In [4]:

booklist = ['01-matthew', '02-mark', '03-luke', '04-john', '05-acts', '06-romans',
           '07-1corinthians','08-2corinthians', '09-galatians', '10-ephesians',
           '11-philippians', '12-colossians', '13-1thessalonians', '14-2thessalonians',
           '15-1timothy', '16-2timothy', '17-titus', '18-philemon', '19-hebrews', 
           '20-james', '21-1peter', '22-2peter', '23-1john', '24-2john', '25-3john',
           '26-jude', '27-revelation']
paullist= ['06-romans', '07-1corinthians','08-2corinthians', '09-galatians', '10-ephesians',
           '11-philippians', '12-colossians', '13-1thessalonians', '14-2thessalonians',
           '15-1timothy', '16-2timothy', '17-titus', '18-philemon']
peterlist= ['21-1peter', '22-2peter']
lukelist= ['03-luke','05-acts']
johnlist = ['23-1john', '24-2john', '25-3john']
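One possible way to switch between these sets without editing the loop header is to keep them in a dictionary. This is only a sketch: the dictionary `corpora`, the helper `input_files` and the directory name `inputfiles` are hypothetical names introduced here, and only three of the sets are repeated to keep the example short.

```python
import os

# Hypothetical registry of book sets, keyed by a short name.
corpora = {
    "luke": ["03-luke", "05-acts"],
    "peter": ["21-1peter", "22-2peter"],
    "john": ["23-1john", "24-2john", "25-3john"],
}

def input_files(corpus_name, input_dir):
    """Return the XML file path of every book in the named set."""
    return [os.path.join(input_dir, f"{book}.xml") for book in corpora[corpus_name]]

# Example: list the files that make up the 'john' set.
paths = input_files("john", "inputfiles")
for path in paths:
    print(path)
```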

3 - Average probabilities for the complete set

Each count is divided by the total number of transitions, i.e. all rules together sum up to p=1.

In [5]:

import xml.etree.ElementTree as ET

def addParentInfo(parent, element):
    for child in element:
        child.attrib['parent'] = parent
        addParentInfo(child, child)

def getParent(element):
    if 'parent' in element.attrib:
        return element.attrib['parent']
    else:
        return None

# Dictionary to store transition frequencies
transition_frequencies = {}
total_transitions = 0    
# Dictionary to store transitions grouped by ('from', 'to') value
grouped_transitions = {}

for bo in paullist:
    InputFile = os.path.join(InputDir, f'{bo}.xml')
    print (f'Reading file {InputFile}')
    
    # Load the XML file
    tree = ET.parse(InputFile)
    root = tree.getroot()
    
    # Add 'parent' attribute to each child element
    addParentInfo(None, root)
    
    # Iterate over 'Tree' elements
    for tree in root.findall('.//Tree'):
        # Iterate over child nodes of the current 'Tree' element
        for node in tree.findall('.//Node'):
            # Check if the node has child nodes
            has_children = bool(list(node))

            # Determine the current rule
            node_cat = node.get('Cat') if has_children else 'Term'

            # Get the parent node using the 'getParent' function
            parent_node = getParent(node)

            # Check if there is a parent node
            if parent_node is not None:
                parent_cat = parent_node.get('Cat')
                # A parent without 'Cat' means this is a top-level node: skip it
                if parent_cat is None and node_cat is not None:
                    continue

                # Combine parent and current rule to form the transition
                transition = (parent_cat, node_cat)

                # Update the frequency count in the dictionary
                total_transitions += 1
                transition_frequencies[transition] = transition_frequencies.get(transition, 0) + 1

print (f'number of transitions: {total_transitions}')
            
# Group transitions based on ('from', 'to') value
for (from_value, to_value), frequency in transition_frequencies.items():
    grouped_transitions.setdefault(from_value, []).append((from_value, to_value, frequency))

# Print separate tables for each group
for from_value, transitions in grouped_transitions.items():
    print(f"Transition table for starting condition: {from_value}")
    print("From\tTo\tTransitions\tAverage Occurrence")
    
    for from_val, to_val, frequency in transitions:
        weight = frequency / total_transitions
        print(f'{from_val}\t{to_val}\t{frequency}\t{weight:.4}')
    
    print('\n')

Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\06-romans.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\07-1corinthians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\08-2corinthians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\09-galatians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\10-ephesians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\11-philippians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\12-colossians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\13-1thessalonians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\14-2thessalonians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\15-1timothy.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\16-2timothy.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\17-titus.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\18-philemon.xml
number of transitions: 95065
Transition table for starting condition: S
From	To	Transitions	Average Occurrence
S	CL	1929	0.02029
S	np	2285	0.02404
S	adjp	4	4.208e-05


Transition table for starting condition: CL
From	To	Transitions	Average Occurrence
CL	S	2299	0.02418
CL	V	4816	0.05066
CL	ADV	4784	0.05032
CL	O	2271	0.02389
CL	VC	637	0.006701
CL	P	1115	0.01173
CL	CL	7937	0.08349
CL	IO	406	0.004271
CL	Term	3410	0.03587
CL	conj	56	0.0005891
CL	np	148	0.001557
CL	intj	14	0.0001473
CL	advp	136	0.001431
CL	O2	57	0.0005996
CL	ptcl	37	0.0003892


Transition table for starting condition: np
From	To	Transitions	Average Occurrence
np	np	11942	0.1256
np	Term	15789	0.1661
np	adjp	1927	0.02027
np	CL	955	0.01005
np	advp	301	0.003166
np	pp	285	0.002998
np	conj	5	5.26e-05
np	nump	16	0.0001683
np	intj	2	2.104e-05


Transition table for starting condition: adjp
From	To	Transitions	Average Occurrence
adjp	Term	2378	0.02501
adjp	CL	113	0.001189
adjp	adj	44	0.0004628
adjp	adjp	197	0.002072
adjp	pp	9	9.467e-05
adjp	advp	20	0.0002104
adjp	np	9	9.467e-05


Transition table for starting condition: V
From	To	Transitions	Average Occurrence
V	vp	4816	0.05066


Transition table for starting condition: vp
From	To	Transitions	Average Occurrence
vp	Term	5618	0.0591
vp	vp	165	0.001736
vp	CL	23	0.0002419
vp	advp	7	7.363e-05


Transition table for starting condition: ADV
From	To	Transitions	Average Occurrence
ADV	pp	2365	0.02488
ADV	adjp	69	0.0007258
ADV	advp	1260	0.01325
ADV	CL	479	0.005039
ADV	np	622	0.006543
ADV	ADV	20	0.0002104
ADV	Term	7	7.363e-05


Transition table for starting condition: pp
From	To	Transitions	Average Occurrence
pp	Term	3102	0.03263
pp	np	3039	0.03197
pp	advp	76	0.0007995
pp	pp	322	0.003387
pp	prep	42	0.0004418


Transition table for starting condition: O
From	To	Transitions	Average Occurrence
O	np	2011	0.02115
O	CL	259	0.002724
O	adjp	1	1.052e-05


Transition table for starting condition: VC
From	To	Transitions	Average Occurrence
VC	vp	637	0.006701


Transition table for starting condition: P
From	To	Transitions	Average Occurrence
P	pp	221	0.002325
P	np	492	0.005175
P	CL	19	0.0001999
P	adjp	352	0.003703
P	advp	31	0.0003261


Transition table for starting condition: advp
From	To	Transitions	Average Occurrence
advp	Term	1792	0.01885
advp	advp	72	0.0007574
advp	adjp	20	0.0002104
advp	np	27	0.000284
advp	adv	39	0.0004102


Transition table for starting condition: IO
From	To	Transitions	Average Occurrence
IO	np	406	0.004271


Transition table for starting condition: conj
From	To	Transitions	Average Occurrence
conj	Term	61	0.0006417


Transition table for starting condition: adj
From	To	Transitions	Average Occurrence
adj	Term	44	0.0004628


Transition table for starting condition: prep
From	To	Transitions	Average Occurrence
prep	Term	42	0.0004418


Transition table for starting condition: intj
From	To	Transitions	Average Occurrence
intj	Term	16	0.0001683


Transition table for starting condition: O2
From	To	Transitions	Average Occurrence
O2	adjp	14	0.0001473
O2	np	39	0.0004102
O2	CL	4	4.208e-05


Transition table for starting condition: adv
From	To	Transitions	Average Occurrence
adv	Term	39	0.0004102


Transition table for starting condition: ptcl
From	To	Transitions	Average Occurrence
ptcl	Term	37	0.0003892


Transition table for starting condition: nump
From	To	Transitions	Average Occurrence
nump	Term	19	0.0001999
nump	nump	3	3.156e-05
nump	adjp	3	3.156e-05
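The weights in the tables above are shares of all 95065 transitions, so they sum to 1 over the whole set and not per starting condition. The sketch below contrasts that with a per-condition weight, using the three counts printed for starting condition S. The variable names are chosen here for illustration.

```python
# Counts for starting condition S, copied from the table above.
s_counts = {"CL": 1929, "np": 2285, "adjp": 4}
total_transitions = 95065  # total over all starting conditions, as printed above

# Weight as computed in this section: share of all transitions.
global_weight = {to: n / total_transitions for to, n in s_counts.items()}

# Weight per starting condition: share of the transitions that start in S.
s_total = sum(s_counts.values())
local_weight = {to: n / s_total for to, n in s_counts.items()}

print("From\tTo\tGlobal\tPer condition")
for to in s_counts:
    print(f"S\t{to}\t{global_weight[to]:.4}\t{local_weight[to]:.4}")
```

Only the per-condition weights correspond to $q_{ML}$ as defined in the introduction.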

4 - Normalizing probabilities per source status

In [98]:

# Averages for each separate transition (i.e. all rules sum up to p=1 per starting condition)

import xml.etree.ElementTree as ET

def addParentInfo(parent, element):
    for child in element:
        child.attrib['parent'] = parent
        addParentInfo(child, child)

def getParent(element):
    if 'parent' in element.attrib:
        return element.attrib['parent']
    else:
        return None

# Dictionary to store transition frequencies
transition_frequencies = {}
total_transitions = 0

# Dictionary to store transitions grouped by ('from', 'to') value
grouped_transitions = {}
print('loading books ',end='')

for bo in johnlist:
    InputFile = os.path.join(InputDir, f'{bo}.xml')
    #print (f'Reading file {InputFile}')
    print ('.',end='')
    
    # Load the XML file
    tree = ET.parse(InputFile)
    root = tree.getroot()
    
    # Add 'parent' attribute to each child element
    addParentInfo(None, root)

    # Iterate over 'Tree' elements
    for tree in root.findall('.//Tree'):
        # Iterate over child nodes of the current 'Tree' element
        for node in tree.findall('.//Node'):
            # Check if the node has child nodes
            has_children = bool(list(node))

            # Determine the current rule
            node_cat = node.get('Cat') if has_children else 'Term'

            # Get the parent node using the 'getParent' function
            parent_node = getParent(node)

            # Check if there is a parent node
            if parent_node is not None:
                parent_cat = parent_node.get('Cat')
                # A parent without 'Cat' means this is a top-level node: skip it
                if parent_cat is None and node_cat is not None:
                    continue

                # Combine parent and current rule to form the transition
                transition = (parent_cat, node_cat)

                # Update the frequency count in the dictionary
                total_transitions += 1
                transition_frequencies[transition] = transition_frequencies.get(transition, 0) + 1

print (f'\nFinished\tNumber of transitions: {total_transitions}\n')

# Group transitions based on ('from', 'to') value
for (from_value, to_value), frequency in transition_frequencies.items():
    grouped_transitions.setdefault(from_value, []).append((from_value, to_value, frequency))

# Print separate tables for each group with sorted transitions
for from_value, transitions in grouped_transitions.items():
    print(f"Transition table for starting condition: {from_value}")
    print("From\tTo\tOcc.\tWeigth")
    
    # Sort transitions based on frequency in descending order
    sorted_transitions = sorted(transitions, key=lambda x: x[2], reverse=True)

    # Calculate total occurrences for the current table
    total_occurrences = sum(occurrence for _, _, occurrence in sorted_transitions)

    for from_val, to_val, frequency in sorted_transitions:
        # Calculate the average occurrence for each transition
        average_occurrence = frequency / total_occurrences
        print(f'{from_val}\t{to_val}\t{frequency}\t{average_occurrence:.4}')

    print('\n')

loading books ...
Finished	Number of transitions: 7678

Transition table for starting condition: S
From	To	Occ.	Weigth
S	np	223	0.5533
S	CL	180	0.4467


Transition table for starting condition: CL
From	To	Occ.	Weigth
CL	CL	743	0.2964
CL	V	425	0.1695
CL	Term	295	0.1177
CL	ADV	271	0.1081
CL	O	246	0.09813
CL	S	223	0.08895
CL	P	111	0.04428
CL	VC	104	0.04148
CL	IO	36	0.01436
CL	np	28	0.01117
CL	conj	12	0.004787
CL	advp	9	0.00359
CL	O2	4	0.001596


Transition table for starting condition: np
From	To	Occ.	Weigth
np	Term	1267	0.5599
np	np	757	0.3345
np	adjp	113	0.04993
np	CL	95	0.04198
np	advp	16	0.00707
np	pp	15	0.006628


Transition table for starting condition: VC
From	To	Occ.	Weigth
VC	vp	104	1.0


Transition table for starting condition: vp
From	To	Occ.	Weigth
vp	Term	540	0.98
vp	vp	11	0.01996


Transition table for starting condition: P
From	To	Occ.	Weigth
P	np	47	0.4234
P	pp	46	0.4144
P	adjp	18	0.1622


Transition table for starting condition: pp
From	To	Occ.	Weigth
pp	Term	228	0.479
pp	np	221	0.4643
pp	pp	21	0.04412
pp	advp	6	0.01261


Transition table for starting condition: O
From	To	Occ.	Weigth
O	np	218	0.8862
O	CL	28	0.1138


Transition table for starting condition: V
From	To	Occ.	Weigth
V	vp	425	1.0


Transition table for starting condition: ADV
From	To	Occ.	Weigth
ADV	pp	152	0.5507
ADV	advp	92	0.3333
ADV	np	20	0.07246
ADV	CL	8	0.02899
ADV	ADV	2	0.007246
ADV	Term	1	0.003623
ADV	adjp	1	0.003623


Transition table for starting condition: IO
From	To	Occ.	Weigth
IO	np	36	1.0


Transition table for starting condition: adjp
From	To	Occ.	Weigth
adjp	Term	135	0.9783
adjp	adjp	2	0.01449
adjp	CL	1	0.007246


Transition table for starting condition: advp
From	To	Occ.	Weigth
advp	Term	122	0.9683
advp	adjp	2	0.01587
advp	advp	2	0.01587


Transition table for starting condition: conj
From	To	Occ.	Weigth
conj	Term	12	1.0


Transition table for starting condition: O2
From	To	Occ.	Weigth
O2	np	4	1.0
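The cells above store each parent element inside the child's `attrib` dictionary. A common alternative with ElementTree is to build a separate child-to-parent dictionary, which leaves the parsed attributes untouched. The sketch below applies that idea to an invented toy document. Its Tree/Node/Cat layout only mimics what the code above expects and is not copied from the treebank files, and the names `parent_of`, `transitions` and `weights` are chosen here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented toy document; the real treebank files are assumed to be richer.
xml_text = """
<Sentences>
  <Tree>
    <Node Cat="S">
      <Node Cat="CL">
        <Node Cat="V"><Node Cat="vp"><Node Cat="verb"/></Node></Node>
        <Node Cat="O"><Node Cat="np"><Node Cat="noun"/></Node></Node>
      </Node>
    </Node>
  </Tree>
</Sentences>
"""
root = ET.fromstring(xml_text)

# Map every element to its parent without modifying the elements themselves.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Count (parent category, node category) pairs; childless nodes count as 'Term'.
transitions = Counter()
for node in root.iter("Node"):
    parent_cat = parent_of[node].get("Cat")
    if parent_cat is None:  # parent is the 'Tree' element: top-level node
        continue
    node_cat = node.get("Cat") if len(node) else "Term"
    transitions[(parent_cat, node_cat)] += 1

# Normalize per starting condition so each group sums to 1.
lhs_totals = Counter()
for (parent_cat, _node_cat), n in transitions.items():
    lhs_totals[parent_cat] += n
weights = {t: n / lhs_totals[t[0]] for t, n in transitions.items()}

for (parent_cat, node_cat), w in weights.items():
    print(f"{parent_cat}\t{node_cat}\t{transitions[(parent_cat, node_cat)]}\t{w:.4}")
```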
