#!/usr/bin/env python
# coding: utf-8
# # Identifying 'odd' characters for feature 'after' (N1904LFT)
# ## Table of contents
# * 1 - Introduction
# * 2 - Load Text-Fabric app and data
# * 3 - Performing the queries
# * 3.1 - Showing the issue
# * 3.2 - Setting up a query to find them
# * 3.3 - Explanation of the regular expression
# * 3.4 - Bug
# # 1 - Introduction
# ##### [Back to TOC](#TOC)
#
# This Jupyter Notebook investigates the presence of 'odd' values for feature 'after'.
# # 2 - Load Text-Fabric app and data
# ##### [Back to TOC](#TOC)
# In[2]:
get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')
# In[3]:
# Loading the New Testament TextFabric code
# Note: it is assumed Text-Fabric is installed in your environment.
from tf.fabric import Fabric
from tf.app import use
# In[4]:
# Load the app and data
N1904 = use("tonyjurg/Nestle1904LFT:latest", hoist=globals())
# # 3 - Performing the queries
# ##### [Back to TOC](#TOC)
# ## 3.1 - Showing the issue
# ##### [Back to TOC](#TOC)
# The frequency list below shows a few 'odd' values for feature 'after':
# In[31]:
result = F.after.freqList()
print('frequency: {0}'.format(result))
# ## 3.2 - Setting up a query to find them
# ##### [Back to TOC](#TOC)
# In[51]:
# Library to format the table
from tabulate import tabulate
# The actual query (raw string, so the regex backslashes are passed on unchanged)
SearchOddAfters = r'''
word after~^(?!([\s\.·—,;]))
'''
OddAfterList = N1904.search(SearchOddAfters)
# Postprocess the query results: one (location, word, after) row per hit
Results = []
for hit in OddAfterList:
    node = hit[0]
    location = "{} {}:{}".format(F.book.v(node), F.chapter.v(node), F.verse.v(node))
    Results.append((location, F.word.v(node), F.after.v(node)))
# Produce the table
headers = ["location", "word", "after"]
print(tabulate(Results, headers=headers, tablefmt='fancy_grid'))
# ## 3.3 - Explanation of the regular expression
# ##### [Back to TOC](#TOC)
#
# The regular expression broken down in its components:
#
# `^`: This symbol is called a caret and represents the start of a string. It ensures that the following pattern is applied at the beginning of the string.
#
# `(?!...)`: This is a negative lookahead assertion. It checks if the pattern inside the parentheses does not match at the current position.
#
# `[…]`: This denotes a character class, which matches any single character that is within the brackets.
#
# `[\s\.·—,;]`: This is the character class used in the query. Its components are:
#
# * `\s`: This is a shorthand character class that matches any whitespace character, including spaces, tabs, and newlines.
# * `\.`: This matches a literal period (dot).
# * `·`: This matches a specific Unicode character, which is a middle dot.
# * `—`: This matches an em dash character.
# * `,`: This matches a comma.
# * `;`: This matches a semicolon.
#
# In summary, the character class `[\s\.·—,;]` matches any single character that is a whitespace character, a period, a middle dot, an em dash, a comma, or a semicolon.
#
# The regular expression therefore selects any string that does not start with a whitespace character, period, middle dot, em dash, comma, or semicolon.
#
# The following site can be used to build and verify a regular expression: [regex101.com](https://regex101.com/) (choose the 'Python' flavor).
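#
# The pattern can also be checked locally with Python's `re` module. The following is a minimal sketch: the sample values are illustrative assumptions (not values taken from the dataset), and it assumes the pattern is applied with search semantics.

```python
import re

# Same pattern as used in the Text-Fabric query of section 3.2.
ODD_AFTER = re.compile(r'^(?!([\s\.·—,;]))')

# Hypothetical sample values for feature 'after' (not taken from the dataset).
samples = [' ', ', ', '. ', '· ', '; ', '—', 'ν', 'ς ']

# A value is 'odd' when the negative lookahead succeeds at the start of the string,
# i.e. when its first character is not one of the expected ones.
odd = [value for value in samples if ODD_AFTER.search(value)]
print(odd)
```

# Only the last two sample values are reported as 'odd'. Note that an empty string also passes the lookahead, since it has no first character to reject.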
# ## 3.4 - Bug
# ##### [Back to TOC](#TOC)
#
# The observed behaviour is caused by a bug in the transformation to XML LowFat Tree data: when the text of a node starts with punctuation, the `@after` attribute contains the last character of the word. The bug was reported as [Issue tracker #76](https://github.com/Clear-Bible/macula-greek/issues/76).
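#
# The symptom described above suggests a simple check on (word, after) pairs. The following is a rough sketch only: the helper `repeats_word_end` and the sample rows are hypothetical and not part of the notebook or the dataset.

```python
def repeats_word_end(word, after):
    """Return True when `after` starts with the final character of `word`."""
    return bool(word) and bool(after) and after[0] == word[-1]

# Hypothetical (word, after) pairs, for illustration only.
rows = [('λόγος', ' '), ('θεόν', ','), ('λόγος', 'ς')]

# Keep the pairs whose 'after' value repeats the end of the word.
suspects = [row for row in rows if repeats_word_end(*row)]
print(suspects)
```

# In the notebook, such a check could be applied to the rows collected in `Results` in section 3.2.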