This Jupyter Notebook investigates the presence of 'odd' values for the feature 'after'.
%load_ext autoreload
%autoreload 2
# Loading the New Testament TextFabric code
# Note: it is assumed Text-Fabric is installed in your environment.
from tf.fabric import Fabric
from tf.app import use
# load the app and data
N1904 = use ("tonyjurg/Nestle1904LFT:latest", hoist=globals())
Locating corpus resources ...
The requested data is not available offline
	~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3 not found
  | 0.30s T otype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 3.07s T oslots from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.01s T book from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.58s T chapter from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.70s T word from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.57s T after from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.57s T verse from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  |   | 0.08s C __levels__ from otype, oslots, otext
  |   | 1.79s C __order__ from otype, oslots, __levels__
  |   | 0.08s C __rank__ from otype, __order__
  |   | 4.63s C __levUp__ from otype, oslots, __rank__
  |   | 2.70s C __levDown__ from otype, __levUp__, __rank__
  |   | 0.06s C __characters__ from otext
  |   | 1.19s C __boundary__ from otype, oslots, __rank__
  |   | 0.05s C __sections__ from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse
  |   | 0.26s C __structure__ from otype, oslots, otext, __rank__, __levUp__, book, chapter, verse
  | 0.54s T appos from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.58s T book_long from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.50s T booknumber from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.57s T bookshort from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.55s T case from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.47s T clausetype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.66s T containedclause from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.50s T degree from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.65s T gloss from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.54s T gn from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.72s T id from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.48s T junction from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.63s T lemma from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.58s T lex_dom from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.60s T ln from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.52s T monad from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.51s T mood from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.59s T morph from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.61s T nodeID from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.68s T normalized from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.56s T nu from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.56s T number from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.50s T orig_order from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.51s T person from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.75s T ref from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.57s T roleclausedistance from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.55s T rule from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.51s T sentence from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.58s T sp from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.57s T sp_full from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.61s T strongs from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.51s T subj_ref from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.51s T tense from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.53s T type from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.71s T unicode from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.52s T voice from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.55s T wgclass from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.48s T wglevel from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.51s T wgnum from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.50s T wgrole from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.50s T wgrolelong from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.53s T wgtype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.07s T wordgroup from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.57s T wordlevel from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.56s T wordrole from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.58s T wordrolelong from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
Name | # of nodes | # slots/node | % coverage |
---|---|---|---|
book | 27 | 5102.93 | 100 |
chapter | 260 | 529.92 | 100 |
verse | 7943 | 17.35 | 100 |
sentence | 12160 | 11.33 | 100 |
wg | 132460 | 6.59 | 633 |
word | 137779 | 1.00 | 100 |
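The feature 'after' stores whatever follows a word in the running text, normally a space or a punctuation mark. As a quick illustration (a minimal sketch, assuming the F and L objects hoisted into globals() by the use() call above), the text of the first verse can be reconstructed by concatenating each word with its 'after' value:
# Sketch: reconstruct the first verse from 'word' + 'after'
# (assumes F and L were hoisted into globals() by the use() call above)
firstVerse = F.otype.s('verse')[0]
print(''.join(F.word.v(w) + (F.after.v(w) or '') for w in L.d(firstVerse, otype='word')))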
The following shows the presence of a few 'odd' values for the feature 'after':
result = F.after.freqList()
print ('frequency: {0}'.format(result))
frequency: ((' ', 119271), (',', 9443), ('.', 5717), ('·', 2355), (';', 970), ('—', 7), ('ε', 3), ('ς', 3), ('ὶ', 2), ('ί', 1), ('α', 1), ('ι', 1), ('χ', 1), ('ἱ', 1), ('ὁ', 1), ('ὰ', 1), ('ὸ', 1))
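Most of these values are ordinary word separators; the handful of single Greek letters at the end of the list are the suspect ones. As a quick check, they can be filtered out directly (a sketch; the set of expected separators below is an assumption based on the frequency list above):
# Sketch: list 'after' values outside the expected separator set
expected = {' ', ',', '.', '·', ';', '—'}
oddValues = [(value, count) for value, count in F.after.freqList() if value not in expected]
print(oddValues)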
# Library to format table
from tabulate import tabulate
# The actual query: find words whose 'after' value does not start with
# whitespace or one of the expected punctuation marks
SearchOddAfters = r'''
word after~^(?!([\s\.·—,;]))
'''
OddAfterList = N1904.search(SearchOddAfters)
# Postprocess the query results
Results = []
for item in OddAfterList:
    node = item[0]
    location = "{} {}:{}".format(F.book.v(node), F.chapter.v(node), F.verse.v(node))
    result = (location, F.word.v(node), F.after.v(node))
    Results.append(result)
# Produce the table
headers = ["location","word","after"]
print(tabulate(Results, headers=headers, tablefmt='fancy_grid'))
0.11s 16 results
╒═════════════════════╤══════════════╤═════════╕
│ location            │ word         │ after   │
╞═════════════════════╪══════════════╪═════════╡
│ Luke 23:51          │ —οὗτο        │ ς       │
├─────────────────────┼──────────────┼─────────┤
│ Luke 2:35           │ —κα          │ ὶ       │
├─────────────────────┼──────────────┼─────────┤
│ John 4:2            │ —καίτοιγ     │ ε       │
├─────────────────────┼──────────────┼─────────┤
│ John 7:22           │ —οὐ          │ χ       │
├─────────────────────┼──────────────┼─────────┤
│ Acts 22:2           │ —ἀκούσαντε   │ ς       │
├─────────────────────┼──────────────┼─────────┤
│ Romans 15:25        │ —νυν         │ ὶ       │
├─────────────────────┼──────────────┼─────────┤
│ I_Corinthians 9:15  │ —τ           │ ὸ       │
├─────────────────────┼──────────────┼─────────┤
│ II_Corinthians 12:2 │ —ἁρπαγέντ    │ α       │
├─────────────────────┼──────────────┼─────────┤
│ II_Corinthians 12:2 │ —εἴτ         │ ε       │
├─────────────────────┼──────────────┼─────────┤
│ II_Corinthians 12:3 │ —εἴτ         │ ε       │
├─────────────────────┼──────────────┼─────────┤
│ II_Corinthians 6:2  │ —λέγε        │ ι       │
├─────────────────────┼──────────────┼─────────┤
│ Galatians 2:6       │ —ὁποῖο       │ ί       │
├─────────────────────┼──────────────┼─────────┤
│ Ephesians 5:10      │ —δοκιμάζοντε │ ς       │
├─────────────────────┼──────────────┼─────────┤
│ Ephesians 5:9       │ —            │ ὁ       │
├─────────────────────┼──────────────┼─────────┤
│ Hebrews 7:20        │ —ο           │ ἱ       │
├─────────────────────┼──────────────┼─────────┤
│ Hebrews 7:22        │ —κατ         │ ὰ       │
╘═════════════════════╧══════════════╧═════════╛
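To look at one of the flagged cases in its verse context, something like the following can be used (a sketch relying on the standard Text-Fabric F, L, and T helpers hoisted by use()):
# Sketch: show the first flagged word in its verse context
# (relies on the F, L and T objects hoisted by the use() call above)
if OddAfterList:
    node = OddAfterList[0][0]
    verseNode = L.u(node, otype='verse')[0]    # go up to the containing verse node
    print(T.sectionFromNode(node))             # (book, chapter, verse)
    print(T.text(verseNode))                   # full verse text
    print('word:', F.word.v(node), 'after:', repr(F.after.v(node)))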
The regular expression, broken down into its components:
^
: This symbol is called a caret and represents the start of a string. It ensures that the following pattern is applied at the beginning of the string.
(?!...)
: This is a negative lookahead assertion. It checks if the pattern inside the parentheses does not match at the current position.
[…]
: This denotes a character class, which matches any single character that is within the brackets.
[\s\.·—,;]
: This character class contains the following characters:
\s
: matches any whitespace character, including spaces, tabs, and newlines.
\.
: matches a literal period (dot).
·
: matches the Unicode middle dot character.
—
: matches an em dash.
,
: matches a comma.
;
: matches a semicolon.
In summary, the character class [\s\.·—,;] matches any single character that is a whitespace character, a period, a middle dot, an em dash, a comma, or a semicolon.
The regular expression therefore selects any 'after' value that does not start with one of these characters.
The site regex101.com can be used to build and verify a regular expression (choose the 'Python' flavor).
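The pattern can also be tested locally with Python's re module; the sample values below are illustrative and not taken from the corpus:
import re
# The same pattern as used in the Text-Fabric query above
oddAfterPattern = re.compile(r'^(?!([\s\.·—,;]))')
# Illustrative sample values (not taken from the corpus)
for value in [' ', ',', '·', '—', 'ς', 'ὶ']:
    verdict = 'odd' if oddAfterPattern.search(value) else 'expected'
    print(repr(value), '->', verdict)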
The observed behaviour is due to a bug: when the text of a node starts with punctuation, the @after attribute contains the last character of the word. This is a bug in the transformation to XML LowFat Tree data; issue #76 was opened on the issue tracker.
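As a cross-check of this explanation, the sketch below verifies the pattern visible in the table above: every flagged word starts with an em dash and its 'after' value is a single character.
# Sketch: verify the reported pattern for every flagged node
for result in OddAfterList:
    node = result[0]
    word = F.word.v(node)
    after = F.after.v(node)
    print(F.book.v(node), F.chapter.v(node), F.verse.v(node),
          '| starts with em dash:', word.startswith('—'),
          '| single-character after:', len(after) == 1)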