This Jupyter Notebook analyzes the punctuation marks used in the corpus.
%load_ext autoreload
%autoreload 2
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment.
from tf.fabric import Fabric
from tf.app import use
# load the app and data
N1904 = use("tonyjurg/Nestle1904LFT:latest", hoist=globals())
Locating corpus resources ...
| 0.21s T otype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 2.46s T oslots from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.61s T unicode from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.48s T verse from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.50s T chapter from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.57s T wordtranslit from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.60s T word from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.59s T normalized from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.58s T wordunacc from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.50s T book from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.50s T after from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | | 0.06s C __levels__ from otype, oslots, otext | | 1.83s C __order__ from otype, oslots, __levels__ | | 0.07s C __rank__ from otype, __order__ | | 3.93s C __levUp__ from otype, oslots, __rank__ | | 2.16s C __levDown__ from otype, __levUp__, __rank__ | | 0.21s C __characters__ from otext | | 0.94s C __boundary__ from otype, oslots, __rank__ | | 0.04s C __sections__ from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse | | 0.23s C __structure__ from otype, oslots, otext, __rank__, __levUp__, book, chapter, verse | 0.36s T appos from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.43s T booknumber from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.49s T bookshort from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.49s T case from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.34s T clausetype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.55s T containedclause from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.41s T degree from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.56s T gloss from 
~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.46s T gn from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.35s T junction from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.56s T lemma from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.51s T lex_dom from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.53s T ln from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.41s T markafter from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.41s T markbefore from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.41s T markorder from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.45s T monad from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.43s T mood from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.53s T morph from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.53s T nodeID from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.49s T nu from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.50s T number from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.46s T orig_order from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.45s T person from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.44s T punctuation from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.67s T ref from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.52s T roleclausedistance from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.45s T sentence from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.52s T sp from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.50s T sp_full from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.54s T strongs from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.43s T subj_ref from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.43s T tense from 
~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.45s T type from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.43s T voice from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.40s T wgclass from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.36s T wglevel from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.38s T wgnum from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.37s T wgrole from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.37s T wgrolelong from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.41s T wgrule from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.35s T wgtype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.51s T wordlevel from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.49s T wordrole from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5 | 0.50s T wordrolelong from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 27 | 5102.93 | 100 |
chapter | 260 | 529.92 | 100 |
verse | 7943 | 17.35 | 100 |
sentence | 8011 | 17.20 | 100 |
wg | 113447 | 7.58 | 624 |
word | 137779 | 1.00 | 100 |
App config error(s) in wg: label: feature rule not loaded
This code generates a table that displays the frequency of the punctuation marks that follow words in the Text-Fabric corpus. The search query returns a list of result tuples, one per word node that ends in a punctuation mark. The subsequent code tallies the last character of each matched word in a Python dictionary and presents the counts as a table. Note that because the query is based on the 'word' feature, there is no space after the word, so any punctuation mark is the final character of the value.
# Library to format the table
from tabulate import tabulate
# The actual query (see section 3.2 about the RegExp used in this query).
# A raw string keeps the backslash in the pattern intact.
SearchPunctuations = r'''
word word~([\.·—,;])$
'''
PunctuationList = N1904.search(SearchPunctuations)
ResultDict = {}
# Each result is a tuple of nodes; the first element is the word node
for result in PunctuationList:
    node = result[0]
    # The punctuation mark is the last character of the word
    Punctuation = F.word.v(node)[-1]
    # Check if this Punctuation already exists in ResultDict
    if Punctuation in ResultDict:
        # If it exists, increment the existing count
        ResultDict[Punctuation] += 1
    else:
        # If it doesn't exist, initialize the count to 1
        ResultDict[Punctuation] = 1
# Convert the dictionary into a list of key-value pairs
TableData = [[key, value] for key, value in ResultDict.items()]
# Produce the table
headers = ["Punctuation", "Frequency"]
print(tabulate(TableData, headers=headers, tablefmt='fancy_grid'))
  0.12s 18507 results
╒═══════════════╤═════════════╕
│ Punctuation   │   Frequency │
╞═══════════════╪═════════════╡
│ .             │        5712 │
├───────────────┼─────────────┤
│ ,             │        9441 │
├───────────────┼─────────────┤
│ ·             │        2355 │
├───────────────┼─────────────┤
│ ;             │         969 │
├───────────────┼─────────────┤
│ —             │          30 │
╘═══════════════╧═════════════╛
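The tallying step can also be written with `collections.Counter` from the Python standard library. The following is a minimal sketch, not the notebook's own method: the list `matched_words` is a hypothetical stand-in for the strings that `F.word.v(node)` would return, so that the example runs without loading the corpus.

```python
from collections import Counter

# Hypothetical stand-in for the word strings returned by F.word.v(node)
matched_words = ["λόγος.", "θεόν,", "αὐτοῦ·", "εἶπεν,", "τί;", "ἀμήν."]

# Tally the final character of every matched word
punctuation_counts = Counter(word[-1] for word in matched_words)

# most_common() yields (character, count) pairs in descending order of count
table_rows = [[mark, count] for mark, count in punctuation_counts.most_common()]

for mark, count in table_rows:
    print(f"{mark}\t{count}")
```

Unlike the plain dictionary, `most_common()` returns the rows sorted by frequency, which may make the resulting table easier to read.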
The regular expression `[\.·—,;]$` matches any one character from the set containing `.`, `·`, `—`, `,`, or `;`. The `$` anchor ensures that this character is at the end of the string. Hence, the regular expression only matches if one of these characters is found at the last position of a word node. If the `$` anchor is omitted, there might be false positives due to the existence of 16 word nodes that start with the character `—`.
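The effect of the anchor can be illustrated with Python's `re` module. This is a small sketch under two assumptions: the strings in `samples` are made-up examples rather than values taken from the corpus, and the pattern is applied with search semantics (a match anywhere in the string counts).

```python
import re

# Made-up sample strings standing in for values of the 'word' feature
samples = ["λόγος.", "—ἐγὼ", "θεόν,", "καὶ", "αὐτοῦ·", "τί;"]

anchored = re.compile(r"[\.·—,;]$")   # punctuation must be the last character
unanchored = re.compile(r"[\.·—,;]")  # punctuation may occur anywhere

anchored_hits = [w for w in samples if anchored.search(w)]
unanchored_hits = [w for w in samples if unanchored.search(w)]

print("anchored:  ", anchored_hits)
print("unanchored:", unanchored_hits)
```

The sample that starts with `—` is reported only by the unanchored pattern, which is the kind of false positive the `$` anchor prevents.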