Work in progress!
This Jupyter Notebook showcases several examples of statistical analysis performed on a Text-Fabric corpus. For demonstration purposes, various methods of collecting and presenting the data are employed.
%load_ext autoreload
%autoreload 2
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use
# load the N1904 app and data
N1904 = use ("tonyjurg/Nestle1904LFT", version="0.6", hoist=globals())
Locating corpus resources ...
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 27 | 5102.93 | 100 |
chapter | 260 | 529.92 | 100 |
verse | 7943 | 17.35 | 100 |
sentence | 8011 | 17.20 | 100 |
wg | 105430 | 6.85 | 524 |
word | 137779 | 1.00 | 100 |
# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)
N1904.dh(N1904.getCss())
# Set default view in a way to limit noise as much as possible.
N1904.displaySetup(condensed=True, multiFeatures=False, queryFeatures=False)
The method freqList returns a tuple of (value, frequency) items, ordered by frequency, highest frequencies first.
print("Amount\tword")
for (w, amount) in F.word.freqList("word")[0:25]:
    print(f"{amount}\t{w}")
Amount	word
8545	καὶ
2769	ὁ
2684	ἐν
2620	δὲ
2497	τοῦ
1755	εἰς
1658	τὸ
1556	τὸν
1518	τὴν
1411	αὐτοῦ
1300	τῆς
1281	ὅτι
1221	τῷ
1201	τῶν
1069	οἱ
941	ἡ
921	γὰρ
902	μὴ
859	τῇ
849	αὐτῷ
817	τὰ
767	οὐκ
722	τοὺς
689	Θεοῦ
670	πρὸς
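To make the shape of that return value concrete, here is a minimal sketch on a hand-made word list. It does not use Text-Fabric: `freq_list` is a hypothetical helper that only mimics the ordering described above, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

def freq_list(values):
    """Return (value, frequency) pairs, highest frequency first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically
    # (an assumption of this sketch, not a documented Text-Fabric rule).
    return tuple(sorted(counts.items(), key=lambda item: (-item[1], item[0])))

# Hand-made sample, not corpus data
sample = ["καὶ", "ὁ", "καὶ", "ἐν", "καὶ", "ὁ"]
top = freq_list(sample)
for value, amount in top[:2]:
    print(f"{amount}\t{value}")
```

Slicing the result, as in `top[:2]`, works the same way as the `[0:25]` slice used in the cell above.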
This code generates a table that displays the frequency of characters within the Text-Fabric corpus. The API call 'C.characters.data' produces a Python dictionary structure that contains the data. The remaining code unpacks and sorts this structure to present the results in a formatted table.
Note that each table in the output is preceded by a line such as 'Format: text-orig-full'. This line names the pre-defined text format to which the character counts apply.
# Library to format table
from tabulate import tabulate
# The following API call will result in a Python dictionary structure
FrequencyDictionary=C.characters.data
# Present the results
KeyList = list(FrequencyDictionary.keys())
for Key in KeyList:
    print('Format: ',Key)
    # 'key' refers to the pre-defined formats the text will be displayed
    FrequencyList=FrequencyDictionary[Key]
    SortedFrequencyList=sorted(FrequencyList, key=lambda x: x[1], reverse=True)
    # In this example the table will be truncated to the first 15 entries
    max_rows = 15 # Set your desired number of rows here
    TruncatedTable = SortedFrequencyList[:max_rows]
    headers = ["character", "frequency"]
    print(tabulate(TruncatedTable, headers=headers, tablefmt='fancy_grid'))
    # Add a warning using markdown (API call A.dm) allowing it to be printed in bold type
    N1904.dm("**Warning: table truncated!**")
Format: text-critical
character | frequency |
---|---|
ν | 56230 |
α | 51892 |
τ | 50599 |
ο | 45151 |
ε | 38597 |
ς | 27090 |
ι | 26131 |
σ | 24095 |
ρ | 22871 |
κ | 22630 |
π | 20308 |
μ | 19218 |
λ | 18228 |
δ | 12476 |
ἐ | 12116 |
Warning: table truncated!
Format: text-normalized
character | frequency |
---|---|
(space) | 137779 |
ν | 56230 |
α | 52127 |
τ | 50599 |
ο | 45516 |
ε | 38807 |
ς | 27090 |
ι | 26404 |
σ | 24095 |
ρ | 22871 |
κ | 22630 |
ί | 21518 |
π | 20308 |
μ | 19218 |
λ | 18228 |
Warning: table truncated!
Format: text-orig-full
character | frequency |
---|---|
(space) | 137779 |
ν | 56230 |
α | 51892 |
τ | 50599 |
ο | 45151 |
ε | 38597 |
ς | 27090 |
ι | 26131 |
σ | 24095 |
ρ | 22871 |
κ | 22630 |
π | 20308 |
μ | 19218 |
λ | 18228 |
δ | 12476 |
Warning: table truncated!
Format: text-transliterated
character | frequency |
---|---|
(space) | 137779 |
e | 93371 |
o | 87008 |
a | 75119 |
i | 62778 |
t | 60011 |
n | 56230 |
s | 52132 |
u | 39287 |
k | 27300 |
p | 25081 |
r | 22871 |
h | 20033 |
m | 19218 |
l | 18228 |
Warning: table truncated!
Format: text-unaccented
character | frequency |
---|---|
(space) | 137779 |
α | 75119 |
ε | 66656 |
ο | 65731 |
ι | 62834 |
ν | 56230 |
τ | 50599 |
υ | 39287 |
ς | 27090 |
η | 26715 |
σ | 24095 |
ρ | 23046 |
κ | 22630 |
ω | 21277 |
π | 20308 |
Warning: table truncated!
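The sort-and-truncate step of that cell can be tried without a corpus. The sketch below uses a hypothetical helper, `top_rows`, and assumes each dictionary entry is a list of (character, frequency) pairs, as the sorting key in the cell suggests. It also reports whether anything was actually cut off, so the warning could be limited to truncated tables.

```python
def top_rows(freq_pairs, max_rows=15):
    """Sort (character, frequency) pairs by frequency and keep the first max_rows."""
    ordered = sorted(freq_pairs, key=lambda pair: pair[1], reverse=True)
    # True only when rows were really dropped
    truncated = len(ordered) > max_rows
    return ordered[:max_rows], truncated

# Toy data standing in for one dictionary entry (values copied from the tables above)
sample = [("τ", 50599), ("ν", 56230), ("α", 51892), ("ο", 45151)]
rows, was_truncated = top_rows(sample, max_rows=3)
print(rows)
if was_truncated:
    print("Warning: table truncated!")
```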
C.levels.data
(('book', 5102.925925925926, 137780, 137806), ('chapter', 529.9192307692308, 137807, 138066), ('verse', 17.345965000629484, 146078, 154020), ('sentence', 17.198726750717764, 138067, 146077), ('wg', 7.583849727185382, 154021, 267467), ('word', 1, 1, 137779))
This is not strictly a statistical function, but it is still important in relation to the corpus. The output of this command provides details on the formats available to present the text of the corpus. See also module tf.advanced.options Display Settings.
N1904.showFormats()
format | level | template |
---|---|---|
text-critical | word | {unicode} |
text-normalized | word | {normalized}{after} |
text-orig-full | word | {word}{after} |
text-transliterated | word | {wordtranslit}{after} |
text-unaccented | word | {wordunacc}{after} |
The same result (although formatted differently) can be obtained by the following call:
T.formats
{'text-critical': 'word', 'text-normalized': 'word', 'text-orig-full': 'word', 'text-transliterated': 'word', 'text-unaccented': 'word'}
Note that this data originates from the file otext.tf:
@config
...
@fmt:text-orig-full={word}{after}
...
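As a rough illustration of what a template such as `{word}{after}` means, the sketch below fills the placeholders with feature values for each word and concatenates the results. This is not Text-Fabric's own implementation: `fill_template` is a hypothetical helper, the feature values are made up, and the sketch assumes every placeholder is a plain feature name.

```python
import re

def fill_template(template, features):
    """Replace each {name} in the template with that feature's value ('' if absent)."""
    return re.sub(
        r"\{(\w+)\}",
        lambda match: str(features.get(match.group(1), "")),
        template,
    )

# Hypothetical feature values for two consecutive words
words = [
    {"word": "ἐν", "after": " "},
    {"word": "ἀρχῇ", "after": ", "},
]
text = "".join(fill_template("{word}{after}", w) for w in words)
print(repr(text))
```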
This code generates a lot of output! For that reason we will cut it off after 5 lines per feature.
FeatureList=Fall()
LinesToPrint=5
for Feature in FeatureList:
    if Feature!='otype':
        print ('Feature:',Feature,'\n\n\t value\t frequency')
        FeatureFrequenceLists=Fs(Feature).freqList()
        PrintedLine=0
        for item, freq in FeatureFrequenceLists:
            PrintedLine+=1
            print ('\t',item,'\t',freq)
            if PrintedLine==LinesToPrint: break
        print ('\n')
Feature: after
	value	frequency
		119270
	,	9462
	.	5717
	·	2359
	;	971

Feature: book
	value	frequency
	Luke	21785
	Matthew	20529
	Acts	20307
	John	17582
	Mark	12695

Feature: booknumber
	value	frequency
	3	19457
	5	18394
	1	18300
	4	15644
	2	11278

Feature: bookshort
	value	frequency
	Luke	19457
	Acts	18394
	Matt	18300
	John	15644
	Mark	11278

Feature: case
	value	frequency
		58261
	nominative	24197
	accusative	23031
	genitive	19515
	dative	12126

Feature: chapter
	value	frequency
	1	12922
	2	10923
	3	9652
	4	9631
	5	8788

Feature: clausetype
	value	frequency
		102662
	VerbElided	1009
	Verbless	929
	Minor	830

Feature: containedclause
	value	frequency
		8372
	2	148
	172	69
	97	69
	389	68

Feature: degree
	value	frequency
		137266
	comparative	313
	superlative	200

Feature: gloss
	value	frequency
	the	9857
	and	6212
	-	5496
	in	2320
	And	2218

Feature: gn
	value	frequency
		63804
	masculine	41486
	feminine	18736
	neuter	13753

Feature: headverse
	value	frequency
	1	298
	7	270
	12	267
	9	264
	13	260

Feature: junction
	value	frequency
		103128
	apposition	2302

Feature: lemma
	value	frequency
	ὁ	19783
	καί	8978
	αὐτός	5561
	σύ	2892
	δέ	2787

Feature: lex_dom
	value	frequency
	092004	26322
		10487
	089017	4370
	093001	3672
	033006	3225

Feature: ln
	value	frequency
	92.24	19781
		10488
	92.11	4718
	89.92	2903
	89.87	2756

Feature: markafter
	value	frequency
		137728
	—	31
	)	11
	]]	7
	(	1

Feature: markbefore
	value	frequency
		137745
	—	16
	(	10
	[[	7
	[	1

Feature: markorder
	value	frequency
		137694
	0	34
	3	32
	2	10
	1	9

Feature: monad
	value	frequency
	1	1
	2	1
	3	1
	4	1
	5	1

Feature: mood
	value	frequency
		109422
	indicative	15617
	participle	6653
	infinitive	2285
	imperative	1877

Feature: morph
	value	frequency
	CONJ	16316
	PREP	10568
	ADV	3808
	N-NSM	3475
	N-GSM	2935

Feature: nodeID
	value	frequency
		52046
	common	14186
	personal	6040
	proper	2192
	relative	885

Feature: normalized
	value	frequency
	καί	8576
	ὁ	2769
	δέ	2764
	ἐν	2684
	τοῦ	2497

Feature: nu
	value	frequency
	singular	69846
		38842
	plural	29091

Feature: number
	value	frequency
	singular	69846
		38842
	plural	29091

Feature: person
	value	frequency
		118360
	third	12747
	second	3729
	first	2943

Feature: punctuation
	value	frequency
		119270
	,	9462
	.	5717
	·	2359
	;	971

Feature: ref
	value	frequency
	1CO 10:1!1	1
	1CO 10:1!10	1
	1CO 10:1!11	1
	1CO 10:1!12	1
	1CO 10:1!13	1

Feature: reference
	value	frequency
	1CO 10:1!1	1
	1CO 10:1!10	1
	1CO 10:1!11	1
	1CO 10:1!12	1
	1CO 10:1!13	1

Feature: roleclausedistance
	value	frequency
	0	56129
	1	37597
	2	22297
	3	12084
	4	5277

Feature: sentence
	value	frequency
	3	1130
	4	987
	1	810
	5	774
	6	707

Feature: sp
	value	frequency
	noun	28455
	verb	28357
	det	19786
	conj	18227
	pron	16177

Feature: sp_full
	value	frequency
	Noun	28455
	Verb	28357
	Determiner	19786
	Conjunction	18227
	Pronoun	16177

Feature: strongs
	value	frequency
	3588	19783
	2532	8978
	846	5561
	4771	2892
	1161	2787

Feature: subj_ref
	value	frequency
		121204
	n46003022002	172
	n66001009002	131
	n45001001001	104
	n47010001004	104

Feature: tense
	value	frequency
		109422
	aorist	11803
	present	11579
	imperfect	1689
	future	1626

Feature: type
	value	frequency
		93321
	common	23644
	personal	11521
	proper	4639
	demonstrative	1722

Feature: unicode
	value	frequency
	καὶ	8541
	ὁ	2768
	ἐν	2683
	δὲ	2619
	τοῦ	2497

Feature: verse
	value	frequency
	10	4928
	12	4910
	4	4800
	9	4800
	1	4793

Feature: voice
	value	frequency
		109422
	active	20742
	passive	3493
	middle	2408
	middlepassive	1714

Feature: wgclass
	value	frequency
	np	33710
	cl	30857
	cl*	16378
		12760
	pp	11169

Feature: wglevel
	value	frequency
	5	16862
	4	16527
	6	15520
	7	12162
	3	10442

Feature: wgnum
	value	frequency
	2	27
	3	27
	4	27
	5	27
	6	27

Feature: wgrole
	value	frequency
		69235
	adv	16710
	o	9329
	s	6710
	p	1770

Feature: wgrolelong
	value	frequency
		69263
	Adverbial	16710
	Object	9329
	Subject	6710
	Predicate	1770

Feature: wgrule
	value	frequency
	DetNP	15696
		14701
	PrepNp	11044
	NPofNP	6819
	Conj-CL	5571

Feature: wgtype
	value	frequency
		92932
	group	9699
	apposition	2799

Feature: word
	value	frequency
	καὶ	8545
	ὁ	2769
	ἐν	2684
	δὲ	2620
	τοῦ	2497

Feature: wordlevel
	value	frequency
	6	21857
	7	20984
	5	20538
	8	16755
	9	12772

Feature: wordrole
	value	frequency
	adv	41598
	v	25817
	s	22908
	o	21929
		9347

Feature: wordrolelong
	value	frequency
	Adverbial	41598
	Verbal	25817
	Subject	22908
	Object	21929
		9347

Feature: wordtranslit
	value	frequency
	kai	8576
	en	3152
	o	3149
	to	2885
	de	2769

Feature: wordunacc
	value	frequency
	και	8576
	ο	3019
	δε	2764
	εν	2752
	του	2497
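The manual line counter in that loop can also be expressed with `itertools.islice`. The following is a minimal sketch on toy data: `first_items` is a hypothetical helper, and the sample values are copied from the 'sp' feature in the output.

```python
from itertools import islice

def first_items(freq_list, limit=5):
    """Return at most `limit` leading (value, frequency) pairs."""
    return list(islice(freq_list, limit))

# Toy frequency list (values taken from the 'sp' feature above)
sample = (("noun", 28455), ("verb", 28357), ("det", 19786),
          ("conj", 18227), ("pron", 16177))
for value, freq in first_items(sample, 3):
    print("\t", value, "\t", freq)
```

A limit larger than the list simply returns every item, so no separate length check is needed.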
Make a list of punctuation marks with their Unicode values. The function used here prints markdown-formatted strings, although the desired result (a single rendered table) has not yet been achieved.
result = F.after.freqList()
N1904.dm(" String | Unicode | Frequency\n--- | --- | ---")
for (string, freq) in result:
    # important: the string contains two characters in the case of punctuation
    frequency=str(freq) # convert it to a string
    unicode_value = str(ord(string[0])) # convert it to a string
    N1904.dm(" `{}` | {} | {} ".format(string[0],unicode_value,frequency))
String | Unicode | Frequency |
---|---|---|
| 32 | 119272
,
| 44 | 9441
.
| 46 | 5712
·
| 183 | 2355
;
| 59 | 969
—
| 8212 | 30
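One possible way to get a single table is to build the complete markdown string first and render it once. This rests on the assumption that each separate markdown call is rendered as its own fragment, which the output above suggests but which is not verified here. `punctuation_table` is a hypothetical helper, and the sample values are copied from the output.

```python
def punctuation_table(freq_list):
    """Build one markdown table from (string, frequency) pairs."""
    lines = ["String | Unicode | Frequency", "--- | --- | ---"]
    for string, freq in freq_list:
        if not string:
            continue  # skip empty values: ord() needs a character
        char = string[0]  # only the first character is reported
        lines.append(f"`{char}` | {ord(char)} | {freq}")
    return "\n".join(lines)

# Toy data copied from the output above
sample = [(" ", 119272), (",", 9441), (".", 5712)]
markdown = punctuation_table(sample)
print(markdown)
# In the notebook, the whole string could then be rendered in one call:
# N1904.dm(markdown)
```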
The node number ranges are readily available by calling F.otype.sInterval for each node type in F.otype.all, which lists all node types.
for NodeType in F.otype.all:
    print (NodeType, F.otype.sInterval(NodeType))
book (137780, 137806)
chapter (137807, 138066)
verse (146078, 154020)
sentence (138067, 146077)
wg (154021, 268899)
word (1, 137779)
Using the same API call, we can also produce another list that counts the number of nodes for each type.
for otype in F.otype.all:
    i = 0
    for n in F.otype.s(otype):
        i += 1
    print ("{:>7} {}s".format(i, otype))
     27 books
    260 chapters
   7943 verses
   8011 sentences
 114879 wgs
 137779 words
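Since each node type occupies one contiguous range of node numbers, the same counts follow from the interval boundaries without looping over the nodes. The sketch below uses the intervals printed by the sInterval cell, hard-coded here so that it runs without a corpus.

```python
# Intervals copied from the sInterval output above
intervals = {
    "book": (137780, 137806),
    "chapter": (137807, 138066),
    "verse": (146078, 154020),
    "sentence": (138067, 146077),
    "wg": (154021, 268899),
    "word": (1, 137779),
}

# Both boundaries are inclusive, hence the +1
counts = {otype: end - start + 1 for otype, (start, end) in intervals.items()}
for otype, count in counts.items():
    print("{:>7} {}s".format(count, otype))
```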
N1904.showProvenance(...)
This can be useful if you want to process all features in a script.
# Just print the metadata dictionary of the feature
FeatureName='word'
MetaData=Fs(FeatureName).meta
print (MetaData)
{'Availability': 'Creative Commons Attribution 4.0 International (CC BY 4.0)', 'Converter_author': 'Tony Jurg, ReMa Student Vrije Universiteit Amsterdam, Netherlands', 'Converter_execution': 'Tony Jurg, ReMa Student Vrije Universiteit Amsterdam, Netherlands', 'Converter_version': '0.3', 'Convertor_source': 'https://github.com/tonyjurg/Nestle1904LFT/tree/main/tools', 'Data source': 'MACULA Greek Linguistic Datasets, available at https://github.com/Clear-Bible/macula-greek/tree/main/Nestle1904/lowfat', 'Editors': 'Eberhard Nestle', 'Name': 'Greek New Testament (Nestle 1904 based on Low Fat Tree)', 'TextFabric version': '11.4.10', 'description': 'Word as it appears in the text (excl. punctuations)', 'valueType': 'str', 'writtenBy': 'Text-Fabric', 'dateWritten': '2023-06-19T15:13:46Z'}
Now perform a very basic check on the data:
print ('feature ',FeatureName, end='')
if MetaData['valueType']=='str':
    print (' is of type str.')
else:
    print (' is not of type str.')
feature word is of type str.
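When all features are processed in a script, the same check can be applied to every metadata dictionary at once. The sketch below groups feature names by their 'valueType'. `group_by_value_type` is a hypothetical helper and the sample metadata is made up for illustration. In the notebook, the input could presumably be collected with something like `{name: Fs(name).meta for name in Fall()}`.

```python
from collections import defaultdict

def group_by_value_type(meta_by_feature):
    """Map each valueType to the sorted list of features declaring it."""
    groups = defaultdict(list)
    for name, meta in meta_by_feature.items():
        # Features without a declared valueType end up under 'unknown'
        groups[meta.get("valueType", "unknown")].append(name)
    return {vtype: sorted(names) for vtype, names in groups.items()}

# Hypothetical metadata, reduced to the one key used here
sample = {
    "word": {"valueType": "str"},
    "lemma": {"valueType": "str"},
    "monad": {"valueType": "int"},
}
groups = group_by_value_type(sample)
print(groups)
```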
origText=T.text(node,fmt='text-orig-full')
critText=T.text(node,fmt='text-critical-signs')
'fmt:text-orig-full': '{word}{after}',
'fmt:text-normalized': '{normalized}{after}',
'fmt:text-unaccented': '{wordunacc}{after}',
'fmt:text-transliterated':'{wordtranslit}{after}',
'fmt:text-critical':