This Jupyter Notebook is designed to demonstrate the predefined text formats available in this Text-Fabric dataset, specifically focusing on displaying the Greek surface text of the New Testament.
Text-Fabric's data design allows for flexible representation of the corpus text but requires at least one text format to be specified as its default (in this dataset: text-orig-full). During the creation of the dataset, additional formats relevant to this corpus were defined, each built from a subset of the following surface-text features:
The relation between these features and the surface text is shown in the following image.
The text formats in this Text-Fabric database are identified by unique names that reflect their actual formats. These names follow a structured naming schema, consisting of a string of keywords separated by hyphens (-).
`what`-`how`-`fullness`
In our database the following keywords are used:
Keyword | Value | Meaning |
---|---|---|
what | text | words as they belong to the text |
what | lex | lexemes of the words |
how | orig | the original Greek script (all Unicode) |
how | unaccent | the original Greek script without accents |
how | translit | transliteration into Latin alphabet |
fullness | full | complete text with text-critical markers |
fullness | plain | complete text without text-critical markers |
Not all possible combinations are defined or relevant. The following text-formatting options are defined:
Format | Usage | Template |
---|---|---|
lex-orig-plain | Lexemes of the Greek surface text | {lemma}{trailer} |
lex-translit-plain | Transliteration of the lexemes of the Greek surface text | {lemmatranslit}{trailer} |
text-orig-full (default) | The Greek surface text in Unicode including text-critical markers | {before}{text}{after} |
text-orig-plain | The Greek surface text in Unicode | {text}{trailer} |
text-translit-plain | Transliteration of the Greek surface text | {translit}{trailer} |
text-unaccent-plain | The Greek surface text in Unicode without accents | {unaccent}{trailer} |
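The naming schema can be illustrated with a small, self-contained sketch in plain Python that needs no Text-Fabric session. It splits a format name into its three keywords and lists the combinations that the schema allows but that this dataset does not define. The names `FormatName`, `parse_format_name`, and `DEFINED` are hypothetical helpers introduced only for this illustration.

```python
from typing import NamedTuple

# Keyword values copied from the keyword table above.
WHAT = {"text", "lex"}
HOW = {"orig", "unaccent", "translit"}
FULLNESS = {"full", "plain"}


class FormatName(NamedTuple):
    """Hypothetical container for the three parts of a format name."""
    what: str
    how: str
    fullness: str


def parse_format_name(name: str) -> FormatName:
    """Split a name such as 'text-orig-full' into what, how, and fullness."""
    parts = name.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected what-how-fullness, got {name!r}")
    what, how, fullness = parts
    if what not in WHAT or how not in HOW or fullness not in FULLNESS:
        raise ValueError(f"unknown keyword in {name!r}")
    return FormatName(what, how, fullness)


# The six formats listed in the table above.
DEFINED = [
    "lex-orig-plain",
    "lex-translit-plain",
    "text-orig-full",
    "text-orig-plain",
    "text-translit-plain",
    "text-unaccent-plain",
]

parsed = [parse_format_name(name) for name in DEFINED]

# Every combination the schema permits, minus the ones actually defined.
all_combinations = {f"{w}-{h}-{f}" for w in WHAT for h in HOW for f in FULLNESS}
undefined = sorted(all_combinations - set(DEFINED))

print(parsed[2])
print(undefined)
```

The set difference makes concrete that only half of the twelve possible keyword combinations are in use.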
%load_ext autoreload
%autoreload 2
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use
# load the N1904 app and data
N1904 = use("CenterBLC/N1904", version="1.0.0", hoist=globals())
Locating corpus resources ...
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 27 | 5102.93 | 100 |
chapter | 260 | 529.92 | 100 |
verse | 7944 | 17.34 | 100 |
sentence | 8011 | 17.20 | 100 |
group | 8945 | 7.01 | 46 |
clause | 42506 | 8.36 | 258 |
wg | 106868 | 6.88 | 533 |
phrase | 69007 | 1.90 | 95 |
subphrase | 116178 | 1.60 | 135 |
word | 137779 | 1.00 | 100 |
Display is setup for viewtype syntax-view
See here for more information on viewtypes
# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)
N1904.dh(N1904.getCss())
The output of the following command provides details on available formats to present the text of the corpus.
N1904.showFormats()
format | level | template |
---|---|---|
lex-orig-plain | word | {lemma}{trailer} |
lex-translit-plain | word | {lemmatranslit}{trailer} |
text-orig-full | word | {before}{text}{after} |
text-orig-plain | word | {text}{trailer} |
text-translit-plain | word | {translit}{trailer} |
text-unaccent-plain | word | {unaccent}{trailer} |
Note 1: This data originates from the file otext.tf:
@config
...
@fmt:text-orig-full={before}{text}{after}
...
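To make the link between the configuration lines and the format table concrete, the following is a minimal sketch of how `@fmt:` lines could be read into a name-to-template mapping. It is not Text-Fabric's own loader. The helper `read_formats` and the variable `sample_config` are hypothetical, and the sample contains only two lines: the one quoted above and one taken from the `showFormats()` table.

```python
# Sample header lines in the style of otext.tf (an assumption for illustration).
# Only the text-orig-full line is quoted in the note above; the lex-orig-plain
# line is reconstructed from the showFormats() table.
sample_config = """\
@config
@fmt:lex-orig-plain={lemma}{trailer}
@fmt:text-orig-full={before}{text}{after}
"""


def read_formats(config_text):
    """Collect '@fmt:name=template' lines into a dict (hypothetical helper)."""
    formats = {}
    for line in config_text.splitlines():
        if line.startswith("@fmt:"):
            # Everything before the first '=' is the name, the rest is the template.
            name, _, template = line[len("@fmt:"):].partition("=")
            formats[name] = template
    return formats


formats = read_formats(sample_config)
print(formats["text-orig-full"])
```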
Note 2: The names of the available formats can also be obtained from the attribute shown below. It does not display the features that make up each format. Instead, it returns a dictionary that maps each format name to the node type it applies to, which is easy to post-process:
T.formats
{'lex-orig-plain': 'word', 'lex-translit-plain': 'word', 'text-orig-full': 'word', 'text-orig-plain': 'word', 'text-translit-plain': 'word', 'text-unaccent-plain': 'word'}
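One possible way to post-process this mapping is to check that a requested format exists before using it and to fall back to the default otherwise. The sketch below works on a literal copy of the output above, so it runs without a Text-Fabric session. The function `pick_format` is a hypothetical helper, not part of the Text-Fabric API.

```python
# Literal copy of the T.formats output shown above.
formats = {
    "lex-orig-plain": "word",
    "lex-translit-plain": "word",
    "text-orig-full": "word",
    "text-orig-plain": "word",
    "text-translit-plain": "word",
    "text-unaccent-plain": "word",
}


def pick_format(requested, available, default="text-orig-full"):
    """Return the requested format if it is defined, else the default."""
    return requested if requested in available else default


# Formats that render lexemes rather than surface words.
lex_formats = sorted(name for name in formats if name.startswith("lex-"))

print(lex_formats)
print(pick_format("text-translit-plain", formats))
print(pick_format("text-translit-full", formats))  # not defined, falls back
```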
This section will demonstrate the differences in how various text formats are displayed, using the verse Mark 1:1 as an example. To locate the corresponding verse node for Mark 1:1 in this dataset, the following command can be executed.
T.nodeFromSection(['Mark', 1, 1])
383782
The returned integer represents the numeric value of the verse node for Mark 1:1. This value can now be used in the following Python snippet to iterate through the defined text formats.
for fmt in T.formats:
    print(f'fmt={fmt}\t: {T.text(383782, fmt=fmt)}')
fmt=lex-orig-plain : ἀρχή ὁ εὐαγγέλιον Ἰησοῦς Χριστός υἱός θεός.
fmt=lex-translit-plain : arkhe o euaggelion Iesous Khristos uios theos.
fmt=text-orig-full : Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ Χριστοῦ (Υἱοῦ Θεοῦ).
fmt=text-orig-plain : Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ Χριστοῦ Υἱοῦ Θεοῦ.
fmt=text-translit-plain : Arkhe tou euaggeliou Iesou Khristou Uiou Theou.
fmt=text-unaccent-plain : Αρχη του ευαγγελιου Ιησου Χριστου Υιου Θεου.
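The difference between text-orig-full and text-orig-plain comes down to which features each template concatenates per word. The sketch below imitates that with Python's `str.format` on the last two words of the verse. The feature values in `words` are assumptions inferred from the output above, not values read from the dataset, and `render` is a hypothetical helper rather than Text-Fabric's implementation.

```python
# Assumed feature values for the last two words of Mark 1:1, inferred from
# the printed output above (not read from the dataset).
words = [
    {"before": "(", "text": "Υἱοῦ", "after": " ", "trailer": " "},
    {"before": "", "text": "Θεοῦ", "after": "). ", "trailer": ". "},
]

# Templates as listed by showFormats().
TEMPLATES = {
    "text-orig-full": "{before}{text}{after}",
    "text-orig-plain": "{text}{trailer}",
}


def render(word_features, template):
    """Fill the template for every word and concatenate the results."""
    return "".join(template.format(**features) for features in word_features)


full = render(words, TEMPLATES["text-orig-full"])
plain = render(words, TEMPLATES["text-orig-plain"])
print(repr(full))
print(repr(plain))
```

Under these assumed values the parentheses appear only in the full rendering, because they are carried by `before` and `after` rather than by `text` or `trailer`.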
Using transliterated text can be convenient for crafting queries, as it allows you to use your regular keyboard without needing to input Greek characters. The following example query retrieves all words whose transliteration is 'de', which is intended to find the occurrences of the Greek conjunction 'δὲ':
LatinQuery = '''
word translit=de
'''
Result = N1904.search(LatinQuery)

from collections import Counter

# Initialize a counter to store word frequencies
word_counts = Counter()

# Loop through the results and count the occurrences of each surface form
for result in Result:
    # Each result is a tuple of nodes; the first one is the matched word
    word = F.text.v(result[0])
    word_counts[word] += 1

# Convert the counter into a list of tuples (word, frequency)
word_frequencies = word_counts.most_common()

# Print the word frequency table
print(f"{'Word':<20}{'Frequency'}")
print("-" * 30)
for word, freq in word_frequencies:
    print(f"{word:<20}{freq}")
0.09s 2769 results
Word                Frequency
------------------------------
δὲ                  2620
δέ                  144
δὴ                  4
δή                  1
This example highlights the importance of careful use of transliteration. While the vast majority of the results match the expected word, an additional 5 results (approximately 0.18% of the total) correspond to a different but similar-sounding word, the emphatic particle δὴ, because both ε and η are transliterated as 'e'.
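The figures above can be checked with a short, self-contained calculation on the counts from the frequency table. The sketch also strips accents using Unicode decomposition to show that the two words stay distinct once only the accents are removed. This is one common way to remove accents and is not necessarily how the dataset's `unaccent` feature was produced; `strip_accents` is a hypothetical helper.

```python
import unicodedata
from collections import Counter

# Counts copied from the frequency table above.
counts = {"δὲ": 2620, "δέ": 144, "δὴ": 4, "δή": 1}


def strip_accents(text):
    """Decompose to NFD and drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Group the accented forms by their unaccented base form.
by_base = Counter()
for form, freq in counts.items():
    by_base[strip_accents(form)] += freq

total = sum(counts.values())
particle = strip_accents("δὴ")
share = 100 * by_base[particle] / total

print(dict(by_base))
print(f"{by_base[particle]} of {total} results = {share:.2f}%")
```

Grouping by the unaccented form separates the conjunction from the particle, which suggests that a query on the `unaccent` feature would avoid the collision seen with `translit`.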
The base text of this Text-Fabric dataset is based upon the Nestle version of 1913, as explained on sites.google.com/site/nestle1904/faq:
What are your sources? For the text, I used the scanned books available at the Internet Archive (The first edition of 1904, and a reprinting from 1913 – the latter one has a better quality).
This version has a limited number of text-critical markers embedded in the base text. We have preserved these in the text format 'text-orig-full', which can be printed using the following command.
T.text(383782,fmt='text-orig-full')
'Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ Χριστοῦ (Υἱοῦ Θεοῦ). '
The previous result can be verified by examining the scans of the following printed versions:
Or, in an image, placed side by side: