#!/usr/bin/env python
# coding: utf-8

# # Identifying 'odd' characters for feature 'after' (N1904LFT)

# ## Table of contents
# * 1 - Introduction
# * 2 - Load Text-Fabric app and data
# * 3 - Performing the queries
#     * 3.1 - Showing the issue
#     * 3.2 - Setting up a query to find them
#     * 3.3 - Explanation of the regular expression
#     * 3.4 - Bug

# # 1 - Introduction
# ##### [Back to TOC](#TOC)
#
# This Jupyter Notebook investigates the presence of 'odd' values for feature 'after'.

# # 2 - Load Text-Fabric app and data
# ##### [Back to TOC](#TOC)

# In[2]:


get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')


# In[3]:


# Loading the New Testament Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment.
from tf.fabric import Fabric
from tf.app import use


# In[4]:


# Load the app and data
N1904 = use("tonyjurg/Nestle1904LFT:latest", hoist=globals())


# # 3 - Performing the queries
# ##### [Back to TOC](#TOC)

# ## 3.1 - Showing the issue
# ##### [Back to TOC](#TOC)

# The frequency list below shows that feature 'after' has a few 'odd' values:

# In[31]:


result = F.after.freqList()
print('frequency: {0}'.format(result))


# ## 3.2 - Setting up a query to find them
# ##### [Back to TOC](#TOC)

# In[51]:


# Library to format table
from tabulate import tabulate

# The actual query: select words whose 'after' value does not start with
# whitespace or one of the expected punctuation marks.
# A raw string keeps the regex backslashes (\s and \.) intact.
SearchOddAfters = r'''
word after~^(?!([\s\.·—,;]))
'''
OddAfterList = N1904.search(SearchOddAfters)

# Postprocess the query results: one (location, word, after) row per hit
Results = []
for result_tuple in OddAfterList:
    node = result_tuple[0]
    location = "{} {}:{}".format(F.book.v(node), F.chapter.v(node), F.verse.v(node))
    result = (location, F.word.v(node), F.after.v(node))
    Results.append(result)

# Produce the table
headers = ["location", "word", "after"]
print(tabulate(Results, headers=headers, tablefmt='fancy_grid'))


# ## 3.3 - Explanation of the regular expression
# ##### [Back to TOC](#TOC)
#
# The regular expression `^(?!([\s\.·—,;]))`, broken down into its components:
#
# `^`: The caret anchors the match to the start of the string, so the pattern that follows is tested at the first character only.
#
# `(?!...)`: This is a negative lookahead assertion. It succeeds only if the pattern inside the parentheses does not match at the current position.
#
# `[…]`: This denotes a character class, which matches any single character listed within the brackets.
#
# `[\s\.·—,;]`: This is the character class used in the query. It contains the following items:
#
# * `\s`: A shorthand character class that matches any whitespace character, including spaces, tabs, and newlines.
# * `\.`: Matches a literal period (dot).
# * `·`: Matches a middle dot.
# * `—`: Matches an em dash.
# * `,`: Matches a comma.
# * `;`: Matches a semicolon.
#
# In summary, the character class `[\s\.·—,;]` matches any single character that is a whitespace character, a period, a middle dot, an em dash, a comma, or a semicolon.
#
# Combined with the anchor and the negative lookahead, the regular expression selects any string which does not start with a whitespace character, period, middle dot, em dash, comma, or semicolon.

# The following site can be used to build and verify a regular expression: [regex101.com](https://regex101.com/) (choose the 'Python' flavor).

# ## 3.4 - Bug
# ##### [Back to TOC](#TOC)
#
# The observed behaviour was due to a bug, for which [Issue tracker #76](https://github.com/Clear-Bible/macula-greek/issues/76) was opened. When the text of a node starts with punctuation, the @after attribute contains the last character of the word. This is a bug in the transformation to XML LowFat Tree data.

# In[ ]:
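

# The behaviour of the regular expression from section 3.3 can be checked outside Text-Fabric
# with Python's standard `re` module. The following is a minimal sketch, not part of the original
# notebook: the helper name `is_odd_after` and the sample values are hypothetical and chosen for
# illustration only; they are not taken from the N1904 dataset.

```python
import re

# Same pattern as in the query of section 3.2, compiled with Python's re module.
ODD_AFTER_PATTERN = re.compile(r'^(?!([\s\.·—,;]))')


def is_odd_after(value):
    """Return True when value does not start with whitespace or an expected punctuation mark."""
    return ODD_AFTER_PATTERN.search(value) is not None


# Hypothetical sample values for feature 'after' (illustration only).
samples = [' ', ', ', '. ', '· ', '; ', '—', 'ν', '']
for value in samples:
    print(repr(value), is_odd_after(value))
```

# Note that in Python's `re` the empty string also satisfies the pattern, because the negative
# lookahead finds no character to reject. Whether Text-Fabric reports nodes with an empty 'after'
# value is not covered by this sketch.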