This Jupyter Notebook investigates the presence of 'odd' values for the feature 'after'.
%load_ext autoreload
%autoreload 2
# Loading the New Testament TextFabric code
# Note: it is assumed Text-Fabric is installed in your environment.
from tf.fabric import Fabric
from tf.app import use
# load the app and data
N1904 = use ("tonyjurg/Nestle1904LFT:latest", hoist=globals())
Locating corpus resources ...
The requested data is not available offline
	~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3 not found
  | 0.30s T otype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 3.07s T oslots from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.01s T book from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.58s T chapter from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.70s T word from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.57s T after from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.57s T verse from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  |   | 0.08s C __levels__ from otype, oslots, otext
  |   | 1.79s C __order__ from otype, oslots, __levels__
  |   | 0.08s C __rank__ from otype, __order__
  |   | 4.63s C __levUp__ from otype, oslots, __rank__
  |   | 2.70s C __levDown__ from otype, __levUp__, __rank__
  |   | 0.06s C __characters__ from otext
  |   | 1.19s C __boundary__ from otype, oslots, __rank__
  |   | 0.05s C __sections__ from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse
  |   | 0.26s C __structure__ from otype, oslots, otext, __rank__, __levUp__, book, chapter, verse
  | 0.54s T appos from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.58s T book_long from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.50s T booknumber from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.57s T bookshort from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.55s T case from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.47s T clausetype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.66s T containedclause from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.50s T degree from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.65s T gloss from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.54s T gn from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.72s T id from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.48s T junction from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.63s T lemma from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.58s T lex_dom from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.60s T ln from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.52s T monad from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.51s T mood from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.59s T morph from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.61s T nodeID from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.68s T normalized from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.56s T nu from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.56s T number from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.50s T orig_order from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.51s T person from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.75s T ref from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.57s T roleclausedistance from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.55s T rule from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.51s T sentence from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.58s T sp from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.57s T sp_full from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.61s T strongs from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.51s T subj_ref from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.51s T tense from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.53s T type from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.71s T unicode from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.52s T voice from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.55s T wgclass from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.48s T wglevel from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.51s T wgnum from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.50s T wgrole from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.50s T wgrolelong from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.53s T wgtype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.07s T wordgroup from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.57s T wordlevel from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.56s T wordrole from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
  | 0.58s T wordrolelong from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3
Name | # of nodes | # slots/node | % coverage |
---|---|---|---|
book | 27 | 5102.93 | 100 |
chapter | 260 | 529.92 | 100 |
verse | 7943 | 17.35 | 100 |
sentence | 12160 | 11.33 | 100 |
wg | 132460 | 6.59 | 633 |
word | 137779 | 1.00 | 100 |
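The feature 'after' stores whatever follows a word in the running text, normally a space or a punctuation mark. As a quick illustration (a minimal sketch, assuming the F and L objects hoisted into globals() by the use() call above), the text of the first verse can be reconstructed by concatenating each word with its 'after' value:
# Sketch: reconstruct the first verse from 'word' + 'after'
# (assumes F and L were hoisted into globals() by the use() call above)
firstVerse = F.otype.s('verse')[0]
print(''.join(F.word.v(w) + (F.after.v(w) or '') for w in L.d(firstVerse, otype='word')))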
The following shows the presence of a few 'odd' values for the feature 'after':
result = F.after.freqList()
print ('frequency: {0}'.format(result))
frequency: ((' ', 119271), (',', 9443), ('.', 5717), ('·', 2355), (';', 970), ('—', 7), ('ε', 3), ('ς', 3), ('ὶ', 2), ('ί', 1), ('α', 1), ('ι', 1), ('χ', 1), ('ἱ', 1), ('ὁ', 1), ('ὰ', 1), ('ὸ', 1))
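Most of these values are ordinary word separators; the handful of single Greek letters at the end of the list are the suspect ones. As a quick check, they can be filtered out directly (a sketch; the set of expected separators below is an assumption based on the frequency list above):
# Sketch: list 'after' values outside the expected separator set
expected = {' ', ',', '.', '·', ';', '—'}
oddValues = [(value, count) for value, count in F.after.freqList() if value not in expected]
print(oddValues)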
# Library to format table
from tabulate import tabulate
# The actual query: find words whose 'after' value does not start with
# whitespace or one of the expected punctuation marks
SearchOddAfters = r'''
word after~^(?!([\s\.·—,;]))
'''
OddAfterList = N1904.search(SearchOddAfters)
# Postprocess the query results
Results = []
for item in OddAfterList:
    node = item[0]
    location = "{} {}:{}".format(F.book.v(node), F.chapter.v(node), F.verse.v(node))
    result = (location, F.word.v(node), F.after.v(node))
    Results.append(result)
# Produce the table
headers = ["location","word","after"]
print(tabulate(Results, headers=headers, tablefmt='fancy_grid'))
0.11s 16 results
╒═════════════════════╤══════════════╤═════════╕
│ location            │ word         │ after   │
╞═════════════════════╪══════════════╪═════════╡
│ Luke 23:51          │ —οὗτο        │ ς       │
├─────────────────────┼──────────────┼─────────┤
│ Luke 2:35           │ —κα          │ ὶ       │
├─────────────────────┼──────────────┼─────────┤
│ John 4:2            │ —καίτοιγ     │ ε       │
├─────────────────────┼──────────────┼─────────┤
│ John 7:22           │ —οὐ          │ χ       │
├─────────────────────┼──────────────┼─────────┤
│ Acts 22:2           │ —ἀκούσαντε   │ ς       │
├─────────────────────┼──────────────┼─────────┤
│ Romans 15:25        │ —νυν         │ ὶ       │
├─────────────────────┼──────────────┼─────────┤
│ I_Corinthians 9:15  │ —τ           │ ὸ       │
├─────────────────────┼──────────────┼─────────┤
│ II_Corinthians 12:2 │ —ἁρπαγέντ    │ α       │
├─────────────────────┼──────────────┼─────────┤
│ II_Corinthians 12:2 │ —εἴτ         │ ε       │
├─────────────────────┼──────────────┼─────────┤
│ II_Corinthians 12:3 │ —εἴτ         │ ε       │
├─────────────────────┼──────────────┼─────────┤
│ II_Corinthians 6:2  │ —λέγε        │ ι       │
├─────────────────────┼──────────────┼─────────┤
│ Galatians 2:6       │ —ὁποῖο       │ ί       │
├─────────────────────┼──────────────┼─────────┤
│ Ephesians 5:10      │ —δοκιμάζοντε │ ς       │
├─────────────────────┼──────────────┼─────────┤
│ Ephesians 5:9       │ —            │ ὁ       │
├─────────────────────┼──────────────┼─────────┤
│ Hebrews 7:20        │ —ο           │ ἱ       │
├─────────────────────┼──────────────┼─────────┤
│ Hebrews 7:22        │ —κατ         │ ὰ       │
╘═════════════════════╧══════════════╧═════════╛
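To look at one of the flagged cases in its verse context, something like the following can be used (a sketch relying on the standard Text-Fabric F, L, and T helpers hoisted by use()):
# Sketch: show the first flagged word in its verse context
# (relies on the F, L and T objects hoisted by the use() call above)
if OddAfterList:
    node = OddAfterList[0][0]
    verseNode = L.u(node, otype='verse')[0]    # go up to the containing verse node
    print(T.sectionFromNode(node))             # (book, chapter, verse)
    print(T.text(verseNode))                   # full verse text
    print('word:', F.word.v(node), 'after:', repr(F.after.v(node)))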
The regular expression, broken down into its components:
^
: This symbol is called a caret and represents the start of a string. It ensures that the following pattern is applied at the beginning of the string.
(?!...)
: This is a negative lookahead assertion. It checks if the pattern inside the parentheses does not match at the current position.
[…]
: This denotes a character class, which matches any single character that is within the brackets.
[\s\.·—,;]
: This character class contains the following characters:
\s
: matches any whitespace character, including spaces, tabs, and newlines.
\.
: matches a literal period (dot).
·
: matches the Unicode middle dot character.
—
: matches an em dash.
,
: matches a comma.
;
: matches a semicolon.
In summary, the character class [\s\.·—,;] matches any single character that is a whitespace character, a period, a middle dot, an em dash, a comma, or a semicolon.
The regular expression therefore selects any 'after' value that does not start with one of these characters.
The site regex101.com can be used to build and verify a regular expression (choose the 'Python' flavor).
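The pattern can also be tested locally with Python's re module; the sample values below are illustrative and not taken from the corpus:
import re
# The same pattern as used in the Text-Fabric query above
oddAfterPattern = re.compile(r'^(?!([\s\.·—,;]))')
# Illustrative sample values (not taken from the corpus)
for value in [' ', ',', '·', '—', 'ς', 'ὶ']:
    verdict = 'odd' if oddAfterPattern.search(value) else 'expected'
    print(repr(value), '->', verdict)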
The observed behaviour is due to a bug: when the text of a node starts with punctuation, the @after attribute contains the last character of the word. This is a bug in the transformation to XML LowFat Tree data; issue #76 was opened on the issue tracker.
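As a cross-check of this explanation, the sketch below verifies the pattern visible in the table above: every flagged word starts with an em dash and its 'after' value is a single character.
# Sketch: verify the reported pattern for every flagged node
for result in OddAfterList:
    node = result[0]
    word = F.word.v(node)
    after = F.after.v(node)
    print(F.book.v(node), F.chapter.v(node), F.verse.v(node),
          '| starts with em dash:', word.startswith('—'),
          '| single-character after:', len(after) == 1)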