PCFG stands for Probabilistic Context-Free Grammar. It is a context-free grammar in which every production rule is assigned a probability, indicating how likely that rule is to be used in a derivation.
The maximum-likelihood estimate of the probability of a transition $\alpha → \beta$ is:
$q_{ML}(\alpha → \beta) =\frac{count (\alpha → \beta)}{count (\alpha)}$
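Consequently, the probabilities of all $n$ transitions $\alpha → \beta_i$ that share the same starting condition $\alpha$ sum to one:
$\sum_{i=1}^{n} q_{ML}(\alpha → \beta_i) = 1$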
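As an illustration of the formula, the following minimal sketch turns a table of transition counts into maximum-likelihood estimates. The function name `ml_rule_probabilities` and the toy counts are assumptions made for this example only; they are not taken from the treebank.

```python
from collections import Counter, defaultdict

def ml_rule_probabilities(transition_counts):
    """Return q_ML(alpha -> beta) = count(alpha -> beta) / count(alpha)."""
    # count(alpha): total number of transitions that start from alpha
    lhs_totals = defaultdict(int)
    for (alpha, _beta), count in transition_counts.items():
        lhs_totals[alpha] += count
    # Divide every transition count by the total of its starting condition
    return {
        (alpha, beta): count / lhs_totals[alpha]
        for (alpha, beta), count in transition_counts.items()
    }

# Hypothetical toy counts, for illustration only
counts = Counter({("S", "np"): 3, ("S", "CL"): 1, ("np", "Term"): 4})
q = ml_rule_probabilities(counts)

print(q[("S", "np")])   # 0.75
print(q[("S", "CL")])   # 0.25
```

Because each count is divided by the total for its own starting condition, the estimates for any single $\alpha$ add up to one by construction.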
Testing dataset: N1904 treebank (GBI)
import pandas as pd
import sys
import os
import time
import pickle
import re # used for regular expressions
from os import listdir
from os.path import isfile, join
import xml.etree.ElementTree as ET
BaseDir = 'C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\'
InputDir = BaseDir+'inputfiles\\'
bo='26-jude'
InputFile = os.path.join(InputDir, f'{bo}.xml')
tree = ET.parse(InputFile)
root = tree.getroot()
# Dictionary to store transition frequencies
transition_frequencies = {}
Several sets of books are defined here, so that the probability values calculated for one set can be compared with those of another.
booklist = ['01-matthew', '02-mark', '03-luke', '04-john', '05-acts', '06-romans',
'07-1corinthians','08-2corinthians', '09-galatians', '10-ephesians',
'11-philippians', '12-colossians', '13-1thessalonians', '14-2thessalonians',
'15-1timothy', '16-2timothy', '17-titus', '18-philemon', '19-hebrews',
'20-james', '21-1peter', '22-2peter', '23-1john', '24-2john', '25-3john',
'26-jude', '27-revelation']
paullist= ['06-romans', '07-1corinthians','08-2corinthians', '09-galatians', '10-ephesians',
'11-philippians', '12-colossians', '13-1thessalonians', '14-2thessalonians',
'15-1timothy', '16-2timothy', '17-titus', '18-philemon']
peterlist= ['21-1peter', '22-2peter']
lukelist= ['03-luke','05-acts']
johnlist = ['23-1john', '24-2john', '25-3john']
import xml.etree.ElementTree as ET

def addParentInfo(parent, element):
    # Recursively store a reference to the parent element in each child's attributes
    for child in element:
        child.attrib['parent'] = parent
        addParentInfo(child, child)

def getParent(element):
    # Return the stored parent element, or None if no parent was recorded
    if 'parent' in element.attrib:
        return element.attrib['parent']
    else:
        return None

# Dictionary to store transition frequencies
transition_frequencies = {}
total_transitions = 0

# Dictionary to store transitions grouped by 'from' value
grouped_transitions = {}

for bo in paullist:
    InputFile = os.path.join(InputDir, f'{bo}.xml')
    print(f'Reading file {InputFile}')

    # Load the XML file
    tree = ET.parse(InputFile)
    root = tree.getroot()

    # Add 'parent' attribute to each child element
    addParentInfo(None, root)

    # Iterate over 'Tree' elements
    for tree in root.findall('.//Tree'):
        # Iterate over all 'Node' descendants of the current 'Tree' element
        for node in tree.findall('.//Node'):
            # Check if the node has child nodes
            has_children = bool(list(node))

            # Determine the current rule (leaf nodes are counted as 'Term')
            node_cat = node.get('Cat') if has_children else 'Term'

            # Get the parent node using the 'getParent' function
            parent_node = getParent(node)

            # Check if there is a parent node
            if parent_node is not None:
                parent_cat = parent_node.get('Cat')
                if parent_cat is None and node_cat is not None:
                    # The parent carries no 'Cat' (top of the tree):
                    # this transition is skipped and not counted
                    parent_cat = "Start"
                    continue

                # Combine parent and current rule to form the transition
                transition = (parent_cat, node_cat)

                # Update the frequency count in the dictionary
                total_transitions += 1
                transition_frequencies[transition] = transition_frequencies.get(transition, 0) + 1

print(f'number of transitions: {total_transitions}')

# Group transitions by their 'from' value
for (from_value, to_value), frequency in transition_frequencies.items():
    grouped_transitions.setdefault(from_value, []).append((from_value, to_value, frequency))

# Print separate tables for each group
for from_value, transitions in grouped_transitions.items():
    print(f"Transition table for starting condition: {from_value}")
    print("From\tTo\tTransitions\tAverage Occurrence")
    for from_val, to_val, frequency in transitions:
        # Share of this transition among ALL counted transitions
        # (not normalized per starting condition)
        weight = frequency / total_transitions
        print(f'{from_val}\t{to_val}\t{frequency}\t{weight:.4}')
    print('\n')
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\06-romans.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\07-1corinthians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\08-2corinthians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\09-galatians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\10-ephesians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\11-philippians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\12-colossians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\13-1thessalonians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\14-2thessalonians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\15-1timothy.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\16-2timothy.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\17-titus.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\18-philemon.xml
number of transitions: 95065

Transition table for starting condition: S
From	To	Transitions	Average Occurrence
S	CL	1929	0.02029
S	np	2285	0.02404
S	adjp	4	4.208e-05

Transition table for starting condition: CL
From	To	Transitions	Average Occurrence
CL	S	2299	0.02418
CL	V	4816	0.05066
CL	ADV	4784	0.05032
CL	O	2271	0.02389
CL	VC	637	0.006701
CL	P	1115	0.01173
CL	CL	7937	0.08349
CL	IO	406	0.004271
CL	Term	3410	0.03587
CL	conj	56	0.0005891
CL	np	148	0.001557
CL	intj	14	0.0001473
CL	advp	136	0.001431
CL	O2	57	0.0005996
CL	ptcl	37	0.0003892

Transition table for starting condition: np
From	To	Transitions	Average Occurrence
np	np	11942	0.1256
np	Term	15789	0.1661
np	adjp	1927	0.02027
np	CL	955	0.01005
np	advp	301	0.003166
np	pp	285	0.002998
np	conj	5	5.26e-05
np	nump	16	0.0001683
np	intj	2	2.104e-05

Transition table for starting condition: adjp
From	To	Transitions	Average Occurrence
adjp	Term	2378	0.02501
adjp	CL	113	0.001189
adjp	adj	44	0.0004628
adjp	adjp	197	0.002072
adjp	pp	9	9.467e-05
adjp	advp	20	0.0002104
adjp	np	9	9.467e-05

Transition table for starting condition: V
From	To	Transitions	Average Occurrence
V	vp	4816	0.05066

Transition table for starting condition: vp
From	To	Transitions	Average Occurrence
vp	Term	5618	0.0591
vp	vp	165	0.001736
vp	CL	23	0.0002419
vp	advp	7	7.363e-05

Transition table for starting condition: ADV
From	To	Transitions	Average Occurrence
ADV	pp	2365	0.02488
ADV	adjp	69	0.0007258
ADV	advp	1260	0.01325
ADV	CL	479	0.005039
ADV	np	622	0.006543
ADV	ADV	20	0.0002104
ADV	Term	7	7.363e-05

Transition table for starting condition: pp
From	To	Transitions	Average Occurrence
pp	Term	3102	0.03263
pp	np	3039	0.03197
pp	advp	76	0.0007995
pp	pp	322	0.003387
pp	prep	42	0.0004418

Transition table for starting condition: O
From	To	Transitions	Average Occurrence
O	np	2011	0.02115
O	CL	259	0.002724
O	adjp	1	1.052e-05

Transition table for starting condition: VC
From	To	Transitions	Average Occurrence
VC	vp	637	0.006701

Transition table for starting condition: P
From	To	Transitions	Average Occurrence
P	pp	221	0.002325
P	np	492	0.005175
P	CL	19	0.0001999
P	adjp	352	0.003703
P	advp	31	0.0003261

Transition table for starting condition: advp
From	To	Transitions	Average Occurrence
advp	Term	1792	0.01885
advp	advp	72	0.0007574
advp	adjp	20	0.0002104
advp	np	27	0.000284
advp	adv	39	0.0004102

Transition table for starting condition: IO
From	To	Transitions	Average Occurrence
IO	np	406	0.004271

Transition table for starting condition: conj
From	To	Transitions	Average Occurrence
conj	Term	61	0.0006417

Transition table for starting condition: adj
From	To	Transitions	Average Occurrence
adj	Term	44	0.0004628

Transition table for starting condition: prep
From	To	Transitions	Average Occurrence
prep	Term	42	0.0004418

Transition table for starting condition: intj
From	To	Transitions	Average Occurrence
intj	Term	16	0.0001683

Transition table for starting condition: O2
From	To	Transitions	Average Occurrence
O2	adjp	14	0.0001473
O2	np	39	0.0004102
O2	CL	4	4.208e-05

Transition table for starting condition: adv
From	To	Transitions	Average Occurrence
adv	Term	39	0.0004102

Transition table for starting condition: ptcl
From	To	Transitions	Average Occurrence
ptcl	Term	37	0.0003892

Transition table for starting condition: nump
From	To	Transitions	Average Occurrence
nump	Term	19	0.0001999
nump	nump	3	3.156e-05
nump	adjp	3	3.156e-05
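The `addParentInfo` helper above records each node's parent by writing an element object into the node's attribute dictionary. A possible alternative, sketched here under the assumption that the transitions only need parent lookups, is to build a separate child-to-parent dictionary and leave the XML attributes untouched. The miniature XML string is hypothetical and far simpler than the real treebank files; only the `Tree` and `Node` tags and the `Cat` attribute are taken from the code above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, for illustration only
xml_text = """
<Sentences>
  <Tree>
    <Node Cat="S">
      <Node Cat="CL">
        <Node Cat="np"><Node Cat="noun"/></Node>
        <Node Cat="vp"><Node Cat="verb"/></Node>
      </Node>
    </Node>
  </Tree>
</Sentences>
"""
root = ET.fromstring(xml_text)

# Map every element to its parent in a single pass over the tree
parent_map = {child: parent for parent in root.iter() for child in parent}

transitions = []
for node in root.iter("Node"):
    parent = parent_map.get(node)
    # Skip nodes whose parent carries no 'Cat' (e.g. the <Tree> element)
    if parent is None or parent.get("Cat") is None:
        continue
    # Leaf nodes are counted as 'Term', as in the code above
    node_cat = node.get("Cat") if len(node) else "Term"
    transitions.append((parent.get("Cat"), node_cat))

print(transitions)
```

The dictionary comprehension keeps the parse tree unmodified, which may matter if the tree is later written back to disk, since attribute values are expected to be strings when serializing.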
# Averages for each separate transition (i.e. all rules sum up to p=1 per starting condition)
import xml.etree.ElementTree as ET

def addParentInfo(parent, element):
    # Recursively store a reference to the parent element in each child's attributes
    for child in element:
        child.attrib['parent'] = parent
        addParentInfo(child, child)

def getParent(element):
    # Return the stored parent element, or None if no parent was recorded
    if 'parent' in element.attrib:
        return element.attrib['parent']
    else:
        return None

# Dictionary to store transition frequencies
transition_frequencies = {}
total_transitions = 0

# Dictionary to store transitions grouped by 'from' value
grouped_transitions = {}

print('loading books ', end='')
for bo in johnlist:
    InputFile = os.path.join(InputDir, f'{bo}.xml')
    #print (f'Reading file {InputFile}')
    print('.', end='')

    # Load the XML file
    tree = ET.parse(InputFile)
    root = tree.getroot()

    # Add 'parent' attribute to each child element
    addParentInfo(None, root)

    # Iterate over 'Tree' elements
    for tree in root.findall('.//Tree'):
        # Iterate over all 'Node' descendants of the current 'Tree' element
        for node in tree.findall('.//Node'):
            # Check if the node has child nodes
            has_children = bool(list(node))

            # Determine the current rule (leaf nodes are counted as 'Term')
            node_cat = node.get('Cat') if has_children else 'Term'

            # Get the parent node using the 'getParent' function
            parent_node = getParent(node)

            # Check if there is a parent node
            if parent_node is not None:
                parent_cat = parent_node.get('Cat')
                if parent_cat is None and node_cat is not None:
                    # The parent carries no 'Cat' (top of the tree):
                    # this transition is skipped and not counted
                    parent_cat = "Start"
                    continue

                # Combine parent and current rule to form the transition
                transition = (parent_cat, node_cat)

                # Update the frequency count in the dictionary
                total_transitions += 1
                transition_frequencies[transition] = transition_frequencies.get(transition, 0) + 1

print(f'\nFinished\tNumber of transitions: {total_transitions}\n')

# Group transitions by their 'from' value
for (from_value, to_value), frequency in transition_frequencies.items():
    grouped_transitions.setdefault(from_value, []).append((from_value, to_value, frequency))

# Print separate tables for each group with sorted transitions
for from_value, transitions in grouped_transitions.items():
    print(f"Transition table for starting condition: {from_value}")
    print("From\tTo\tOcc.\tWeight")

    # Sort transitions based on frequency in descending order
    sorted_transitions = sorted(transitions, key=lambda x: x[2], reverse=True)

    # Calculate total occurrences for the current table
    total_occurrences = sum(occurrence for _, _, occurrence in sorted_transitions)

    for from_val, to_val, frequency in sorted_transitions:
        # Weight of this transition within its own starting condition
        average_occurrence = frequency / total_occurrences
        print(f'{from_val}\t{to_val}\t{frequency}\t{average_occurrence:.4}')
    print('\n')
loading books ...
Finished	Number of transitions: 7678

Transition table for starting condition: S
From	To	Occ.	Weight
S	np	223	0.5533
S	CL	180	0.4467

Transition table for starting condition: CL
From	To	Occ.	Weight
CL	CL	743	0.2964
CL	V	425	0.1695
CL	Term	295	0.1177
CL	ADV	271	0.1081
CL	O	246	0.09813
CL	S	223	0.08895
CL	P	111	0.04428
CL	VC	104	0.04148
CL	IO	36	0.01436
CL	np	28	0.01117
CL	conj	12	0.004787
CL	advp	9	0.00359
CL	O2	4	0.001596

Transition table for starting condition: np
From	To	Occ.	Weight
np	Term	1267	0.5599
np	np	757	0.3345
np	adjp	113	0.04993
np	CL	95	0.04198
np	advp	16	0.00707
np	pp	15	0.006628

Transition table for starting condition: VC
From	To	Occ.	Weight
VC	vp	104	1.0

Transition table for starting condition: vp
From	To	Occ.	Weight
vp	Term	540	0.98
vp	vp	11	0.01996

Transition table for starting condition: P
From	To	Occ.	Weight
P	np	47	0.4234
P	pp	46	0.4144
P	adjp	18	0.1622

Transition table for starting condition: pp
From	To	Occ.	Weight
pp	Term	228	0.479
pp	np	221	0.4643
pp	pp	21	0.04412
pp	advp	6	0.01261

Transition table for starting condition: O
From	To	Occ.	Weight
O	np	218	0.8862
O	CL	28	0.1138

Transition table for starting condition: V
From	To	Occ.	Weight
V	vp	425	1.0

Transition table for starting condition: ADV
From	To	Occ.	Weight
ADV	pp	152	0.5507
ADV	advp	92	0.3333
ADV	np	20	0.07246
ADV	CL	8	0.02899
ADV	ADV	2	0.007246
ADV	Term	1	0.003623
ADV	adjp	1	0.003623

Transition table for starting condition: IO
From	To	Occ.	Weight
IO	np	36	1.0

Transition table for starting condition: adjp
From	To	Occ.	Weight
adjp	Term	135	0.9783
adjp	adjp	2	0.01449
adjp	CL	1	0.007246

Transition table for starting condition: advp
From	To	Occ.	Weight
advp	Term	122	0.9683
advp	adjp	2	0.01587
advp	advp	2	0.01587

Transition table for starting condition: conj
From	To	Occ.	Weight
conj	Term	12	1.0

Transition table for starting condition: O2
From	To	Occ.	Weight
O2	np	4	1.0
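To compare two book sets directly, the counts for one starting condition can be normalized and placed side by side. The sketch below is not part of the original notebook. It copies the occurrence counts for starting condition S from the two result sets above (`paullist` and `johnlist`), and the helper name `normalize` is an assumption made for this example.

```python
def normalize(counts):
    """Turn raw transition counts for one starting condition into weights."""
    total = sum(counts.values())
    return {child: n / total for child, n in counts.items()}

# Occurrence counts for starting condition S, copied from the output above
paul_S = {"CL": 1929, "np": 2285, "adjp": 4}
john_S = {"np": 223, "CL": 180}

paul_w = normalize(paul_S)
john_w = normalize(john_S)

print("From\tTo\tPaul\tJohn\tDiff")
for child in sorted(set(paul_w) | set(john_w)):
    # A transition missing from one book set gets weight 0 there
    p = paul_w.get(child, 0.0)
    j = john_w.get(child, 0.0)
    print(f"S\t{child}\t{p:.4f}\t{j:.4f}\t{p - j:+.4f}")
```

Normalizing per starting condition is what makes the two sets comparable: the raw counts differ by an order of magnitude simply because the Pauline corpus is much larger.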