Identify punctuations (Nestle1904LFT)

Table of contents

1 - Introduction
2 - Load Text-Fabric app and data
3 - Performing the queries
- 3.1 - Frequency of punctuations in corpus
- 3.2 - Explanation of the Regular Expression

1 - Introduction

This Jupyter Notebook analyzes the punctuation marks used in the corpus and how often each of them occurs.

2 - Load Text-Fabric app and data

Back to TOC

In [1]:

%load_ext autoreload
%autoreload 2

In [2]:

# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment.
from tf.fabric import Fabric
from tf.app import use

In [3]:

# Load the app and data
N1904 = use("tonyjurg/Nestle1904LFT:latest", hoist=globals())

Locating corpus resources ...

Status: latest release online v0.5 versus v03 locally

downloading app, main data and requested additions ...

app: ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/app

data: ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5

   |     0.21s T otype                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     2.46s T oslots               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.61s T unicode              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.48s T verse                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.50s T chapter              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.57s T wordtranslit         from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.60s T word                 from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.59s T normalized           from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.58s T wordunacc            from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.50s T book                 from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.50s T after                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |      |     0.06s C __levels__           from otype, oslots, otext
   |      |     1.83s C __order__            from otype, oslots, __levels__
   |      |     0.07s C __rank__             from otype, __order__
   |      |     3.93s C __levUp__            from otype, oslots, __rank__
   |      |     2.16s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.21s C __characters__       from otext
   |      |     0.94s C __boundary__         from otype, oslots, __rank__
   |      |     0.04s C __sections__         from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse
   |      |     0.23s C __structure__        from otype, oslots, otext, __rank__, __levUp__, book, chapter, verse
   |     0.36s T appos                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.43s T booknumber           from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.49s T bookshort            from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.49s T case                 from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.34s T clausetype           from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.55s T containedclause      from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.41s T degree               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.56s T gloss                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.46s T gn                   from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.35s T junction             from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.56s T lemma                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.51s T lex_dom              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.53s T ln                   from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.41s T markafter            from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.41s T markbefore           from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.41s T markorder            from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.45s T monad                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.43s T mood                 from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.53s T morph                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.53s T nodeID               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.49s T nu                   from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.50s T number               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.46s T orig_order           from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.45s T person               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.44s T punctuation          from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.67s T ref                  from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.52s T roleclausedistance   from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.45s T sentence             from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.52s T sp                   from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.50s T sp_full              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.54s T strongs              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.43s T subj_ref             from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.43s T tense                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.45s T type                 from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.43s T voice                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.40s T wgclass              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.36s T wglevel              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.38s T wgnum                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.37s T wgrole               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.37s T wgrolelong           from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.41s T wgrule               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.35s T wgtype               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.51s T wordlevel            from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.49s T wordrole             from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5
   |     0.50s T wordrolelong         from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5

TF: TF API 12.1.5, tonyjurg/Nestle1904LFT/app v3, Search Reference
Data: tonyjurg - Nestle1904LFT 0.5, Character table, Feature docs

Node types

Name	# of nodes	# slots / node	% coverage
book	27	5102.93	100
chapter	260	529.92	100
verse	7943	17.35	100
sentence	8011	17.20	100
wg	113447	7.58	624
word	137779	1.00	100

Sets: no custom sets
Features:

Nestle 1904 (Low Fat Tree)

after

str

Characters (e.g. punctuation) following the word

appos

str

Apposition details

book

str

Book name

booknumber

int

NT book number (Matthew=1, Mark=2, ..., Revelation=27)

bookshort

str

Book name (abbreviated)

case

str

Grammatical case (Nominative, Genitive, Dative, Accusative, Vocative)

chapter

int

Chapter number inside book

clausetype

str

Clause type details

containedclause

str

Contained clause (WG number)

degree

str

Degree (e.g. Comparative, Superlative)

gloss

str

English gloss

gn

str

Grammatical gender (Masculine, Feminine, Neuter)

junction

str

Junction data related to a wordgroup

lemma

str

Lexeme (lemma)

lex_dom

str

Lexical domain according to Semantic Dictionary of Biblical Greek, SDBG (not present everywhere?)

ln

str

Louw-Nida lexical classification (not present everywhere?)

markafter

str

Text critical marker after word

markbefore

str

Text critical marker before word

markorder

str

Order of punctuation and text critical marker

monad

int

Monad (word order in the corpus)

mood

str

Grammatical mood of the verb (passive, etc)

morph

str

Morphological tag (Sandborg-Petersen morphology)

nodeID

str

Node ID (as in the XML source data, not yet post-processed)

normalized

str

Surface word with accents normalized and trailing punctuations removed

nu

str

Grammatical number (Singular, Plural)

number

str

Grammatical number of the verb

orig_order

int

Word order (in source XML file)

otype

str

person

str

Grammatical person of the verb (first, second, third)

punctuation

str

Punctuation after word

ref

str

ref ID

roleclausedistance

str

Distance to wordgroup defining the role of this word

sentence

int

Sentence number (counted per chapter)

sp

str

Part of Speech (abbreviated)

sp_full

str

Part of Speech (long description)

strongs

str

Strongs number

subj_ref

str

Subject reference (to nodeID in XML source data, not yet post-processed)

tense

str

Grammatical tense of the verb (e.g. Present, Aorist)

type

str

Grammatical type of noun or pronoun (e.g. Common, Personal)

unicode

str

Word as it appears in the text in Unicode (incl. punctuations)

verse

int

Verse number inside chapter

voice

str

Grammatical voice of the verb

wgclass

str

Class of the wordgroup ()

wglevel

int

Number of parent wordgroups for a wordgroup

wgnum

int

Wordgroup number (counted per book)

wgrole

str

Role of the wordgroup (abbreviated)

wgrolelong

str

Role of the wordgroup (full)

wgrule

str

Wordgroup rule information

wgtype

str

Wordgroup type details

word

str

Word as it appears in the text (excl. punctuations)

wordlevel

str

Number of parent wordgroups for a word

wordrole

str

Role of the word (abbreviated)

wordrolelong

str

Role of the word (full)

wordtranslit

str

Transliteration of the text (in Latin letters, excl. punctuations)

wordunacc

str

Word without accents (excl. punctuations)

oslots

none

Settings:

specified

apiVersion: 3
appName: tonyjurg/Nestle1904LFT
appPath:
C:/Users/tonyj/text-fabric-data/github/tonyjurg/Nestle1904LFT/app
commit: f2eb5e2b0f8805ad720d91a5cb9e2aa2fdc6c99a
css: ''
dataDisplay:
- excludedFeatures: [reference]
- noneValues:
  - none
  - unknown
  - no value
  - NA
  - ''
interfaceDefaults: {fmt: layout-orig-full}
isCompatible: True
local: no value
localDir:
C:/Users/tonyj/text-fabric-data/github/tonyjurg/Nestle1904LFT/_temp
provenanceSpec:
- corpus: Nestle 1904 (Low Fat Tree)
- org: tonyjurg
- relative: /tf
- repo: Nestle1904LFT
- repro: Nestle1904LFT
- version: 0.5
release: v03
showVerseInTuple: 0
typeDisplay:
- book:
  - label: {book}
  - style: ''
- chapter:
  - label: {chapter}
  - style: ''
- sentence:
  - hidden: 0
  - label: {sentence}
  - style: ''
- verse:
  - label: {verse}
  - style: ''
- wg:
  - hidden: 0
  - label: {rule} {clausetype} {wgrolelong} {junction}
  - style: ''
- word:
  - base: True
  - features:
    lemma
    strongs
  - featuresBare: [gloss]

App config error(s) in wg:
	label: feature rule not loaded

TF API: names N F E L T S C TF Fs Fall Es Eall Cs Call directly usable

3 - Performing the queries

3.1 - Frequency of punctuations in corpus

This code generates a table showing how often each punctuation mark occurs directly behind a word in the Text-Fabric corpus. A search query with a regular expression (explained in section 3.2) selects the word nodes that end in a punctuation mark. The subsequent code tallies the last character of each matching word in a Python dictionary and prints the result as a table. Note that the query is based on the 'word' feature, so there are no spaces behind the words and the last character is the punctuation mark itself.

In [5]:

# Library to format table
from tabulate import tabulate

# The actual query (see section 3.2 about the RegExp used in this query).
# A raw string keeps the backslash in the RegExp intact.
SearchPunctuations = r'''
word word~([\.·—,;])$
'''
PunctuationList = N1904.search(SearchPunctuations)

# Count how often each punctuation mark ends a word node
ResultDict = {}
for result in PunctuationList:
    node = result[0]
    # The punctuation mark is the last character of the word
    Punctuation = F.word.v(node)[-1]
    # Add 1 to the running count (starting from 0 for a new punctuation mark)
    ResultDict[Punctuation] = ResultDict.get(Punctuation, 0) + 1

# Convert the dictionary into a list of key-value pairs
TableData = [[key, value] for key, value in ResultDict.items()]

# Produce the table
headers = ["Punctuation", "Frequency"]
print(tabulate(TableData, headers=headers, tablefmt='fancy_grid'))

  0.12s 18507 results
╒═══════════════╤═════════════╕
│ Punctuation   │   Frequency │
╞═══════════════╪═════════════╡
│ .             │        5712 │
├───────────────┼─────────────┤
│ ,             │        9441 │
├───────────────┼─────────────┤
│ ·             │        2355 │
├───────────────┼─────────────┤
│ ;             │         969 │
├───────────────┼─────────────┤
│ —             │          30 │
╘═══════════════╧═════════════╛
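
The tallying step can also be written more compactly with `collections.Counter`, which makes it easy to sort the table by frequency as well. The following is a minimal sketch and not the notebook's own code: the `words` list is hypothetical sample data standing in for the values that `F.word.v(node)` returns for the matched nodes, and `table_data` is an illustrative name.

```python
from collections import Counter

# Hypothetical sample values standing in for F.word.v(node) of matched nodes.
words = ["λόγος,", "θεόν.", "ἀρχῇ,", "τί;", "αὐτοῦ,"]

# Tally the final character of every word.
counts = Counter(word[-1] for word in words)

# most_common() returns (character, count) pairs sorted by descending count.
table_data = [[char, freq] for char, freq in counts.most_common()]

for char, freq in table_data:
    print(char, freq)
```

The resulting `table_data` has the same shape as `TableData` above, so it could be passed to `tabulate` in the same way.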

3.2 - Explanation of the Regular Expression

Back to TOC

The regular expression [\.·—,;]$ matches a single character from the set ., ·, —, , or ;. The $ anchor requires this character to be at the end of the string. The query therefore selects only word nodes whose last character is one of these punctuation marks. Without the $ anchor, the expression would match these characters at any position in the word, which would produce false positives: 16 word nodes start with the character —.
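
The effect of the anchor can be checked outside Text-Fabric with Python's standard `re` module. This is a minimal sketch under the assumption that the pattern is applied as a search over the whole feature value; the sample strings are hypothetical and serve only to illustrate the matching behavior.

```python
import re

# Same character class as in the query, with and without the end anchor.
anchored = re.compile(r"[\.·—,;]$")
unanchored = re.compile(r"[\.·—,;]")

# Hypothetical sample strings (illustrative only, not corpus output).
samples = ["λόγος.", "λόγος", "—καὶ", "θεόν,"]

# Keep the strings in which each pattern finds a match.
anchored_hits = [s for s in samples if anchored.search(s)]
unanchored_hits = [s for s in samples if unanchored.search(s)]

print(anchored_hits)
print(unanchored_hits)
```

In this sketch the string starting with — is found only by the unanchored pattern, which is the kind of false positive the $ anchor prevents.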
