#!/usr/bin/env python
# coding: utf-8

# # Identify punctuations (Nestle1904GBI)

# ## Table of contents
# * 1 - Introduction
# * 2 - Load Text-Fabric app and data
# * 3 - Performing the queries
#     * 3.1 - Frequency of punctuations in corpus
#     * 3.2 - Explanation of the Regular Expression
#     * 3.3 - Notes

# # 1 - Introduction
#
# This Jupyter Notebook analyzes the punctuation marks used in the corpus.

# # 2 - Load Text-Fabric app and data
# ##### [Back to TOC](#TOC)

# In[2]:


get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')


# In[3]:


# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment.
from tf.fabric import Fabric
from tf.app import use


# In[4]:


# load the app and data
N1904 = use("tonyjurg/Nestle1904GBI:latest", hoist=globals())


# # 3 - Performing the queries

# ## 3.1 - Frequency of punctuations in corpus
# ##### [Back to TOC](#TOC)
#
# This code generates a table that displays the frequency of the punctuation marks that follow words in the Text-Fabric corpus. The query is run with `N1904.search`, which returns a list of result tuples. The subsequent code tallies the last character of each matched word in a Python dictionary and prints the table. Note that because the query is based on the 'word' feature, the punctuation mark is the last character of the feature value: there are no spaces behind the words.
# In[5]:


# Library to format table
from tabulate import tabulate

# The actual query (see section 3.2 about the used RegExp in this query).
# A raw string keeps the backslash in the RegExp from being read as a Python escape.
SearchPunctuations = r'''
word word~([\.·—,;])$
'''

PunctuationList = N1904.search(SearchPunctuations)

ResultDict = {}
for result in PunctuationList:
    # Each result is a tuple of nodes; the word node is its first element
    node = result[0]
    # The punctuation is the last character of the word feature
    Punctuation = F.word.v(node)[-1]
    # Check if this Punctuation already exists in ResultDict
    if Punctuation in ResultDict:
        # If it exists, increment the existing count
        ResultDict[Punctuation] += 1
    else:
        # If it doesn't exist, start the count at 1
        ResultDict[Punctuation] = 1

# Convert the dictionary into a list of key-value pairs
TableData = [[key, value] for key, value in ResultDict.items()]

# Produce the table
headers = ["Punctuation", "Frequency"]
print(tabulate(TableData, headers=headers, tablefmt='fancy_grid'))


# ## 3.2 - Explanation of the Regular Expression
# ##### [Back to TOC](#TOC)
#
# The regular expression `([\.·—,;])$` matches any one character from the set containing `.`, `·`, `—`, `,`, or `;`. The `$` anchor ensures that this character is at the end of the string. Hence, the regular expression matches only if one of these characters is found at the last position of a word node. If the `$` anchor is omitted, there might be false positives due to the existence of 16 word nodes that start with the character `—`.

# ## 3.3 - Notes
# ##### [Back to TOC](#TOC)
#
# Starting from version 0.3, this Text-Fabric dataset will include a new feature called 'after'. This feature aims to enhance the presentation of the data by providing information about the punctuations that come after a particular word.
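
# The effect of the `$` anchor described in section 3.2 can be tried out with Python's standard `re` module. The sketch below is a stand-alone illustration and not part of the notebook's query. The sample strings are made up for the example rather than taken from the corpus, and it assumes that Text-Fabric applies the pattern with search semantics (a match anywhere in the feature value), as `re.search` does.

```python
import re

# Made-up sample values standing in for the 'word' feature (not corpus data)
samples = ["λόγος.", "λόγος,", "—καὶ", "λόγος", "θεός·", "τί;"]

anchored = re.compile(r"([\.·—,;])$")   # pattern as used in the query
unanchored = re.compile(r"([\.·—,;])")  # same pattern without the $ anchor

# Words matched by each variant of the pattern
anchored_hits = [w for w in samples if anchored.search(w)]
unanchored_hits = [w for w in samples if unanchored.search(w)]

# Words matched only when the anchor is omitted
false_positives = [w for w in unanchored_hits if w not in anchored_hits]

print("anchored:       ", anchored_hits)
print("unanchored:     ", unanchored_hits)
print("false positives:", false_positives)
```

# In this sketch, the only sample reported as a false positive is the one that starts with `—` but does not end in a punctuation mark, which is the situation the `$` anchor guards against.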