Various text formats (N1904-TF)¶

Table of contents (TOC) ¶

1 - Introduction
- 1.1 - Naming schema for text formatting
2 - Load Text-Fabric app and data
3 - Examining the text format
4 - Notebook version

1 - Introduction ¶

Back to TOC ¶

This Jupyter Notebook is designed to demonstrate the predefined text formats available in this Text-Fabric dataset, specifically focusing on displaying the Greek surface text of the New Testament.

Text-Fabric's data design allows for flexible representation of the corpus text but requires at least one text format to be specified as its default (in this dataset: text-orig-full). During the creation of the dataset, additional formats relevant to this corpus were defined. Each of them is built from a subset of the following features related to the surface text:

after: All material found after a word (including text-critical signs).
before: All material found before a word.
criticalsign: Text-critical signs.
normalized: Normalized Greek text.
punctuation: Punctuation found after a word.
text: Word without punctuations and text-critical signs.
trailer: All material found after a word (excluding text-critical signs).
translit: Transliteration of the word surface texts.
unaccent: Word without accents and diacritical markers.
unicode: Unicode presentation including all material before and after word.

The relation between these features and the surface text is shown in the following image.

1.1 - Naming schema for text formatting ¶

The text formats in this Text-Fabric database are identified by unique names that reflect their actual formats. These names follow a structured naming schema, consisting of a string of keywords separated by hyphens (-).

 `what`-`how`-`fullness`

In our database the following keywords are used:

Keyword	Value	Meaning
what	text	words as they belong to the text
what	lex	lexemes of the words
how	orig	the original Greek script (all Unicode)
how	unaccent	the original Greek script without accents
how	translit	transliteration into Latin alphabet
fullness	full	complete text with text-critical markers
fullness	plain	complete text without text-critical markers
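Because the names follow a fixed three-part schema, they can be taken apart mechanically. The following is a minimal sketch, not part of the Text-Fabric API: the `FormatName` type and the `parse_format_name` helper are hypothetical names introduced here for illustration only.

```python
from typing import NamedTuple


class FormatName(NamedTuple):
    """The three keywords of a text format name (hypothetical helper type)."""
    what: str
    how: str
    fullness: str


def parse_format_name(name: str) -> FormatName:
    """Split a name such as 'text-orig-full' into its what/how/fullness parts."""
    parts = name.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected what-how-fullness, got {name!r}")
    return FormatName(*parts)


fmt = parse_format_name("text-orig-full")
print(fmt.what, fmt.how, fmt.fullness)
```

A name that does not consist of exactly three hyphen-separated keywords raises a `ValueError` in this sketch.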

Not all possible combinations are defined or relevant. The following text-formatting options are defined:

Format	Usage	Template
lex-orig-plain	Lexemes of the Greek surface text	{lemma}{trailer}
lex-translit-plain	Transliteration of the lexemes of the Greek surface text	{lemmatranslit}{trailer}
text-orig-full (default)	The Greek surface text in Unicode including text-critical markers	{before}{text}{after}
text-orig-plain	The Greek surface text in Unicode	{text}{trailer}
text-translit-plain	Transliteration of the Greek surface text	{translit}{trailer}
text-unaccent-plain	The Greek surface text in Unicode without accents	{unaccent}{trailer}
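One way to read a template is as a per-word pattern in which each `{feature}` is replaced by that word's feature value, with the results concatenated. The sketch below imitates this with Python's `str.format`; it does not call Text-Fabric. The feature values are hand-written assumptions for the last two words of Mark 1:1, not values read from the dataset, and `render` is a hypothetical helper.

```python
# Two of the templates from the table above.
templates = {
    "text-orig-full": "{before}{text}{after}",
    "text-orig-plain": "{text}{trailer}",
}

# Illustrative, hand-written feature values (assumed, not read from the dataset).
words = [
    {"before": "(", "text": "Υἱοῦ", "after": " ", "trailer": " "},
    {"before": "", "text": "Θεοῦ", "after": "). ", "trailer": ". "},
]


def render(word_features, template):
    """Fill the template for every word and join the pieces."""
    return "".join(template.format(**features) for features in word_features)


for name, template in templates.items():
    print(f"{name}: {render(words, template)!r}")
```

Under these assumed values the full format keeps the parentheses, while the plain format drops them because `trailer` excludes text-critical signs.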

2 - Load Text-Fabric app and data ¶

Back to TOC ¶

In [1]:

%load_ext autoreload
%autoreload 2

In [3]:

# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use

In [5]:

# Load the N1904 app and data
N1904 = use("CenterBLC/N1904", version="1.0.0", hoist=globals())

Locating corpus resources ...

app: ~/text-fabric-data/github/CenterBLC/N1904/app

data: ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0

TF: TF API 12.5.3, CenterBLC/N1904/app v3, Search Reference
Data: CenterBLC - N1904 1.0.0, Character table, Feature docs

Node types

Name	# of nodes	# slots / node	% coverage
book	27	5102.93	100
chapter	260	529.92	100
verse	7944	17.34	100
sentence	8011	17.20	100
group	8945	7.01	46
clause	42506	8.36	258
wg	106868	6.88	533
phrase	69007	1.90	95
subphrase	116178	1.60	135
word	137779	1.00	100

Sets: no custom sets
Features:

Nestle 1904 Greek New Testament

after

str

material after the end of the word

appositioncontainer

int

1 if it is an apposition container

articular

int

1 if the sentence, group, clause, phrase or wg has an article

before

str

this is XML attribute before

book

str

book name (full name)

bookshort

str

book name (abbreviated) from ref attribute in xml

case

str

grammatical case

chapter

int

chapter number, from ref attribute in xml

clausetype

str

clause type

cls

str

this is XML attribute cls

cltype

str

clause type

criticalsign

str

this is XML attribute criticalsign

crule

str

clause rule (from xml attribute Rule)

degree

str

grammatical degree

discontinuous

int

1 if the word is out of sequence in the xml

domain

str

domain

framespec

str

this is XML attribute framespec

function

str

this is XML attribute function

gender

str

grammatical gender

gloss

str

English gloss (BGVB)

id

str

xml id

junction

str

type of junction

lang

str

language the text is in

lemma

str

lexical lemma

lemmatranslit

str

transliteration of the word lemma

ln

str

ln

mood

str

verbal mood

morph

str

morphological code

nodeid

str

node id (as in the XML source data)

normalized

str

lemma normalized

note

str

annotation of linguistic nature

num

int

generated number (not in xml): book: (Matthew=1, Mark=2, ..., Revelation=27); sentence: numbered per chapter; word: numbered per verse.

number

str

grammatical number

otype

str

person

str

grammatical person

punctuation

str

punctuation found after a word

ref

str

biblical reference with word counting

referent

str

number of referent

rela

str

this is XML attribute rela

role

str

role

rule

str

syntactical rule

sp

str

part-of-speach

strong

int

strong number

subjrefspec

str

this is XML attribute subjrefspec

tense

str

verbal tense

text

str

the text of a word

trailer

str

material after the end of the word (excluding critical signs)

trans

str

translation of the word surface text according to the Berean Interlinear Bible

translit

str

transliteration of the word surface text

typ

str

syntactical type (on sentence, group, clause or phrase)

typems

str

morphological type (on word), syntactical type (on sentence, group, clause, phrase or wg)

unaccent

str

word in unicode characters without accents and diacritical markers

unicode

str

word in unicode characters plus material after it

variant

str

this is XML attribute variant

verse

int

verse number, from ref attribute in xml

voice

str

verbal voice

frame

str

frame

oslots

none

parent

none

parent relationship between words

sibling

int

this is XML attribute sibling

subjref

none

number of subject referent

Settings:

specified

apiVersion: 3
appName: CenterBLC/N1904
appPath: C:/Users/tonyj/text-fabric-data/github/CenterBLC/N1904/app
commit: gdb630837ae89b9468c9e50d13bda05cfd3de4f18
css: ''
dataDisplay:
- excludedFeatures: []
- noneValues:
  - none
  - unknown
  - no value
  - NA
- sectionSep1:
- sectionSep2: :
- textFormat: text-orig-full
docs:
- docBase: https://github.com/CenterBLC/N1904/tree/main/docs
- docPage: about
- docRoot: https://github.com/CenterBLC/N1904
- featureBase:
  https://github.com/CenterBLC/N1904/blob/main/docs/features/<feature>.md
- featurePage: README
interfaceDefaults: {fmt: text-orig-full}
isCompatible: True
local: local
localDir:
C:/Users/tonyj/text-fabric-data/github/CenterBLC/N1904/_temp
provenanceSpec:
- branch: main
- corpus: Nestle 1904 Greek New Testament
- doi: 10.5281/zenodo.13117910
- moduleSpecs: []
- org: CenterBLC
- relative: /tf
- repo: N1904
- repro: N1904
- version: 1.0.0
- webBase: https://learner.bible/text/show_text/nestle1904/
- webHint: Show this on the website
- webLang: en
- webUrl:
  https://learner.bible/text/show_text/nestle1904/<1>/<2>/<3>
- webUrlLex: {webBase}/word?version={version}&id=<lid>
release: 1.0.0
typeDisplay:
- clause:
  - condense: True
  - label: {typ} {function} {rela} \\ {cls} {role} {junction}
  - style: ''
- group:
  - label: {typ} {function} {rela} \\ {typems} {role} {rule}
  - style: ''
- phrase:
  - condense: True
  - label: {typ} {function} {rela} \\ {typems} {role} {rule}
  - style: ''
- sentence:
  - label: {typ} {function} {rela} \\ {role} {rule}
  - style: ''
- subphrase:
  - label: {typ} {function} {rela} \\ {typems} {role} {rule}
  - style: ''
- verse:
  - condense: True
  - label: {book} {chapter}:{verse}
  - style: ''
- wg:
  - condense: True
  - label: {typems} {role} {rule} {junction}
  - style: ''
- word:
  - features:
    lemma
    sp
  - featuresBare: [gloss]
writing: grc

TF API: names N F E L T S C TF Fs Fall Es Eall Cs Call directly usable

Display is setup for viewtype syntax-view

See here for more information on viewtypes

In [7]:

# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)
N1904.dh(N1904.getCss())

3 - Examining the text format ¶

Back to TOC ¶

3.1 - Display the text formatting options available for this corpus ¶

The output of the following command provides details on available formats to present the text of the corpus.

In [9]:

N1904.showFormats()

format	level	template
`lex-orig-plain`	word	`{lemma}{trailer}`
`lex-translit-plain`	word	`{lemmatranslit}{trailer}`
`text-orig-full`	word	`{before}{text}{after}`
`text-orig-plain`	word	`{text}{trailer}`
`text-translit-plain`	word	`{translit}{trailer}`
`text-unaccent-plain`	word	`{unaccent}{trailer}`

Note 1: This data originates from the file otext.tf:

@config
...
@fmt:text-orig-full={before}{text}{after}
...

Note 2: The names of the available formats can also be obtained with the following call, although it does not show the features that make up each format. It returns a dictionary that maps each format name to the node type it applies to, which can easily be post-processed:

In [11]:

T.formats

Out[11]:

{'lex-orig-plain': 'word',
 'lex-translit-plain': 'word',
 'text-orig-full': 'word',
 'text-orig-plain': 'word',
 'text-translit-plain': 'word',
 'text-unaccent-plain': 'word'}
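As an example of such post-processing, the sketch below groups the format names by their `how` keyword. To stay runnable without the corpus, it works on a literal copy of the dictionary shown in Out[11] rather than on `T.formats` itself; the variable names are chosen here for illustration.

```python
# Literal copy of the dictionary returned by T.formats (see Out[11]).
formats = {
    'lex-orig-plain': 'word',
    'lex-translit-plain': 'word',
    'text-orig-full': 'word',
    'text-orig-plain': 'word',
    'text-translit-plain': 'word',
    'text-unaccent-plain': 'word',
}

# Group the format names by the middle keyword (orig, translit, unaccent).
by_how = {}
for name in formats:
    how = name.split('-')[1]
    by_how.setdefault(how, []).append(name)

for how, names in by_how.items():
    print(f"{how:<10}{', '.join(names)}")

# Every format in this dataset is defined at the word level.
levels = set(formats.values())
print(levels)
```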

3.2 - Showcasing the various formats ¶

This section will demonstrate the differences in how various text formats are displayed, using the verse Mark 1:1 as an example. To locate the corresponding verse node for Mark 1:1 in this dataset, the following command can be executed.

In [13]:

T.nodeFromSection(['Mark', 1, 1])

Out[13]:

383782

The returned integer is the node number of the verse node for Mark 1:1. This value can now be used in the following Python snippet to iterate through the defined text formats.

In [15]:

for fmt in T.formats:
    print(f'fmt={fmt}\t: {T.text(383782, fmt=fmt)}')

fmt=lex-orig-plain	: ἀρχή ὁ εὐαγγέλιον Ἰησοῦς Χριστός υἱός θεός. 
fmt=lex-translit-plain	: arkhe o euaggelion Iesous Khristos uios theos. 
fmt=text-orig-full	: Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ Χριστοῦ (Υἱοῦ Θεοῦ). 
fmt=text-orig-plain	: Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ Χριστοῦ Υἱοῦ Θεοῦ. 
fmt=text-translit-plain	: Arkhe tou euaggeliou Iesou Khristou Uiou Theou. 
fmt=text-unaccent-plain	: Αρχη του ευαγγελιου Ιησου Χριστου Υιου Θεου.
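The difference between the text-orig-plain and text-unaccent-plain lines can be approximated with the standard library alone. The sketch below decomposes each character (Unicode NFD) and drops the combining marks. This is an assumption about how unaccented text can be derived in general, not a description of how the `unaccent` feature of this dataset was actually produced; `strip_accents` is a hypothetical helper.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove accents and breathings by dropping Unicode combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category 'Mn' (nonspacing mark) covers accents, breathings and iota subscript.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


plain = "Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ Χριστοῦ Υἱοῦ Θεοῦ."
print(strip_accents(plain))
```

For this verse the result should correspond to the text-unaccent-plain line above; other verses may contain characters that the dataset treats differently.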

3.3 - Transliterated text ¶

Using transliterated text can be convenient for crafting queries, as it allows you to use your regular keyboard without needing to input Greek characters. The following example query retrieves all words whose transliteration is 'de', with the aim of finding the occurrences of the Greek conjunction 'δὲ'.

In [17]:

LatinQuery = '''
word translit=de
'''
Result = N1904.search(LatinQuery) 

from collections import Counter
# Initialize a counter to store word frequencies
word_counts = Counter()
# Loop through the results and count the occurrences of each word
for result in Result:
    word = F.text.v(result[0])
    word_counts[word] += 1
# Convert the counter into a list of tuples (word, frequency)
word_frequencies = word_counts.most_common()
# Print the word frequency table
print(f"{'Word':<20}{'Frequency'}")
print("-" * 30)
for word, freq in word_frequencies:
    print(f"{word:<20}{freq}")

  0.09s 2769 results
Word                Frequency
------------------------------
δὲ                  2620
δέ                  144
δὴ                  4
δή                  1

This example highlights the importance of careful use of transliteration. While the vast majority of the results match the expected word, an additional 5 results (approximately 0.18% of the total) correspond to a different - but sound-alike - word, the emphatic particle δὴ.
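The share quoted above can be recomputed from the frequency table. The sketch below uses a literal copy of the printed counts instead of the live query result, and the set of expected forms is an assumption made for this illustration.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize to NFC so that visually identical forms compare equal."""
    return unicodedata.normalize("NFC", text)


# Literal copy of the frequency table printed above.
word_counts = {"δὲ": 2620, "δέ": 144, "δὴ": 4, "δή": 1}

# Forms regarded as the intended conjunction (assumption for this sketch).
expected_forms = {nfc("δὲ"), nfc("δέ")}

total = sum(word_counts.values())
unexpected_total = sum(
    freq for form, freq in word_counts.items() if nfc(form) not in expected_forms
)
share = unexpected_total / total

print(f"total results     : {total}")
print(f"unexpected results: {unexpected_total} ({share:.2%})")
```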

3.4 - Text with text critical markers ¶

The base text of this Text-Fabric dataset is based upon the Nestle version of 1913, as explained on sites.google.com/site/nestle1904/faq:

What are your sources? For the text, I used the scanned books available at the Internet Archive (The first edition of 1904, and a reprinting from 1913 – the latter one has a better quality).

This version contains a limited number of text-critical markers embedded in the base text. We have preserved these in the text format 'text-orig-full', which can be printed using the following command.

In [19]:

T.text(383782,fmt='text-orig-full')

Out[19]:

'Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ Χριστοῦ (Υἱοῦ Θεοῦ). '

3.5 - Nestle version 1904 and version 1913 (Mark 1:1)¶

The previous result can be verified by examining the scans of the following printed versions:

Nestle version 1904: @ archive.org
Nestle version 1913: @ archive.org

Or, in an image, placed side by side:

4 - Notebook version ¶

Back to TOC ¶

Author	Tony Jurg
Version	1.0
Date	9 October 2024
