This Jupyter Notebook showcases several examples of statistical analysis performed on a Text-Fabric corpus. For demonstration purposes, various methods of collecting and presenting the data are employed.
%load_ext autoreload
%autoreload 2
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use
# load the N1904 app and data
N1904 = use("tonyjurg/Nestle1904GBI", version="0.4", hoist=globals())
Locating corpus resources ...
The requested app is not available offline ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/app not found
The requested data is not available offline ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4 not found
| 0.19s T otype from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 1.85s T oslots from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.68s T book from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.53s T chapter from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.51s T verse from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.64s T word from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.50s T after from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| | 0.05s C __levels__ from otype, oslots, otext
| | 1.62s C __order__ from otype, oslots, __levels__
| | 0.07s C __rank__ from otype, __order__
| | 2.23s C __levUp__ from otype, oslots, __rank__
| | 1.46s C __levDown__ from otype, __levUp__, __rank__
| | 0.06s C __characters__ from otext
| | 0.88s C __boundary__ from otype, oslots, __rank__
| | 0.04s C __sections__ from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse
| | 0.21s C __structure__ from otype, oslots, otext, __rank__, __levUp__, book, chapter, verse
| 0.50s T booknum from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.58s T bookshort from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.48s T case from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.50s T clause from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.07s T clauserule from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.02s T clausetype from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.43s T degree from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.53s T formaltag from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.54s T functionaltag from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.57s T gloss from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.47s T gn from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.55s T lemma from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.51s T lex_dom from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.53s T ln from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.44s T monad from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.43s T mood from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.64s T nodeID from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.59s T normalized from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.49s T nu from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.50s T number from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.43s T person from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.70s T phrase from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.26s T phrasefunction from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.28s T phrasefunctionlong from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.27s T phrasetype from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.46s T sentence from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.51s T sp from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.51s T splong from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.54s T strongs from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.45s T subj_ref from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.45s T tense from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.45s T type from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.45s T voice from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 27 | 5102.93 | 100 |
chapter | 260 | 529.92 | 100 |
sentence | 5720 | 24.09 | 100 |
verse | 7943 | 17.35 | 100 |
clause | 16124 | 8.54 | 100 |
phrase | 72674 | 1.90 | 100 |
word | 137779 | 1.00 | 100 |
# The following pushes the Text-Fabric stylesheet to this notebook (to facilitate proper display in the notebook viewer)
N1904.dh(N1904.getCss())
The method `freqList` returns a tuple of (value, frequency) pairs, ordered by frequency, highest frequencies first.
print("Amount\tword")
for (w, amount) in F.word.freqList("word")[0:25]:
print(f"{amount}\t{w}")
Amount	word
8541	καὶ
2768	ὁ
2683	ἐν
2620	δὲ
2497	τοῦ
1755	εἰς
1657	τὸ
1556	τὸν
1518	τὴν
1410	αὐτοῦ
1300	τῆς
1281	ὅτι
1221	τῷ
1201	τῶν
1068	οἱ
941	ἡ
921	γὰρ
902	μὴ
859	τῇ
849	αὐτῷ
817	τὰ
767	οὐκ
722	τοὺς
688	Θεοῦ
670	πρὸς
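Because `freqList` returns the full frequency spectrum, it can also feed other statistics. As an illustrative sketch (using the data loaded above), the following counts the hapax legomena, i.e. surface forms that occur exactly once:

# freqList returns (value, frequency) pairs for all word surface forms
wordFrequencies = F.word.freqList("word")
hapaxes = [w for (w, freq) in wordFrequencies if freq == 1]
print("distinct surface forms:", len(wordFrequencies))
print("hapax legomena:        ", len(hapaxes))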
This code generates a table that displays the frequency of characters within the Text-Fabric corpus. The API call `C.characters.data` produces a Python dictionary containing the data. The remaining code unpacks and sorts this structure to present the results in a formatted table.
Note that the first line of the output is 'Format: text-orig-full'. This is the dictionary key: the character frequencies are reported per text format (on formats, see further below).
# Library to format the table
from tabulate import tabulate

# The following API call results in a Python dictionary structure
FrequencyDictionary = C.characters.data

# Present the results
KeyList = list(FrequencyDictionary.keys())
for Key in KeyList:
    # 'Key' is one of the pre-defined formats in which the text can be displayed
    print('Format: ', Key)
    FrequencyList = FrequencyDictionary[Key]
    SortedFrequencyList = sorted(FrequencyList, key=lambda x: x[1], reverse=True)
    # In this example the table is truncated to the first 15 entries
    max_rows = 15  # Set your desired number of rows here
    TruncatedTable = SortedFrequencyList[:max_rows]
    headers = ["character", "frequency"]
    print(tabulate(TruncatedTable, headers=headers, tablefmt='fancy_grid'))
    # Add a warning using markdown (API call A.dm) so it is printed in bold type
    N1904.dm("**Warning: table truncated!**")
Format:  text-orig-full
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│             │      137779 │
├─────────────┼─────────────┤
│ ν           │       56230 │
├─────────────┼─────────────┤
│ α           │       51892 │
├─────────────┼─────────────┤
│ τ           │       50599 │
├─────────────┼─────────────┤
│ ο           │       45151 │
├─────────────┼─────────────┤
│ ε           │       38597 │
├─────────────┼─────────────┤
│ ς           │       27090 │
├─────────────┼─────────────┤
│ ι           │       26131 │
├─────────────┼─────────────┤
│ σ           │       24095 │
├─────────────┼─────────────┤
│ ρ           │       22871 │
├─────────────┼─────────────┤
│ κ           │       22630 │
├─────────────┼─────────────┤
│ π           │       20308 │
├─────────────┼─────────────┤
│ μ           │       19218 │
├─────────────┼─────────────┤
│ λ           │       18228 │
├─────────────┼─────────────┤
│ δ           │       12476 │
╘═════════════╧═════════════╛
Warning: table truncated!
C.levels.data
(('book', 5102.925925925926, 137780, 137806),
 ('chapter', 529.9192307692308, 137807, 138066),
 ('sentence', 24.087237762237763, 226865, 232584),
 ('verse', 17.345965000629484, 232585, 240527),
 ('clause', 8.54496402877698, 138067, 154190),
 ('phrase', 1.8958499600957701, 154191, 226864),
 ('word', 1, 1, 137779))
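Each entry of this tuple is (node type, average number of slots per node, first node number, last node number); the averages match the '# slots / node' column in the load overview above. A minimal sketch (assuming the corpus loaded above) to print it more readably:

# Unpack the levels tuple: (node type, avg slots per node, first node, last node)
for (nodeType, avgSlots, first, last) in C.levels.data:
    print(f"{nodeType:10} {avgSlots:10.2f} slots/node   nodes {first}-{last}")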
Not a statistical function as such, but still important in relation to the corpus: the output of this command provides details on the available formats for presenting the text of the corpus. See also the module tf.advanced.options (Display Settings).
N1904.showFormats()
format | level | template |
---|---|---|
text-orig-full | word | {word}{after} |
The same result (although formatted differently, since a dictionary is returned) can be obtained by the following call:
T.formats
{'text-orig-full': 'word'}
Note that this data originates from the file otext.tf:
@config
...
@fmt:text-orig-full={word}{after}
...
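To see such a format in action, the following sketch renders one verse using the format defined above (the section labels 'Matthew', 1, 1 follow the book/chapter/verse sectioning of this corpus):

# Look up the node for Matthew 1:1 and render it with the text-orig-full format
verseNode = T.nodeFromSection(('Matthew', 1, 1))
print(T.text(verseNode, fmt='text-orig-full'))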
FeatureList = Fall()
LinesToPrint = 5
for Feature in FeatureList:
    # 'otype' cannot be processed this way; note that break also ends the
    # loop, so features sorting after 'otype' are not reported (see below)
    if Feature == 'otype': break
    print('Feature:', Feature, '\n\n\t value\t frequency')
    FeatureFrequencyLists = Fs(Feature).freqList()
    PrintedLine = 0
    for item, freq in FeatureFrequencyLists:
        PrintedLine += 1
        print('\t', item, '\t', freq)
        if PrintedLine == LinesToPrint: break
    print('\n')
Feature: after
	value	frequency
		119272
	,	9441
	.	5712
	·	2355
	;	969

Feature: book
	value	frequency
	Luke	22801
	Matthew	21334
	Acts	21290
	John	18389
	Mark	13247

Feature: booknum
	value	frequency
	3	22801
	1	21334
	5	21290
	4	18389
	2	13247

Feature: bookshort
	value	frequency
	Luke	22801
	Matt	21334
	Acts	21290
	John	18389
	Mark	13247

Feature: case
	value	frequency
		58261
	Nominative	24197
	Accusative	23031
	Genitive	19515
	Dative	12126

Feature: chapter
	value	frequency
	1	13795
	2	11590
	3	10239
	4	10187
	5	9270

Feature: clause
	value	frequency
	1	481
	6	347
	44	314
	35	310
	4	301

Feature: clauserule
	value	frequency
	CLaCL	1841
	Conj-CL	1740
	sub-CL	1525
	V-O	690
	V2CL	653

Feature: clausetype
	value	frequency
	VerbElided	1355
	Verbless	1330
	Minor	1161

Feature: degree
	value	frequency
		137266
	Comparative	313
	Superlative	200

Feature: formaltag
	value	frequency
	CONJ	16316
	PREP	10568
	ADV	3808
	N-NSM	3475
	N-GSM	2935

Feature: functionaltag
	value	frequency
	CONJ	16316
	PREP	10568
	ADV	3808
	N-NSM	3475
	N-GSM	2935

Feature: gloss
	value	frequency
	the	9857
	and	6212
	-	5496
	in	2320
	And	2218

Feature: gn
	value	frequency
		63804
	Masculine	41486
	Feminine	18736
	Neuter	13753

Feature: lemma
	value	frequency
	ὁ	19783
	καί	8978
	αὐτός	5561
	σύ	2892
	δέ	2787

Feature: lex_dom
	value	frequency
	092004	26322
		10487
	089017	4370
	093001	3672
	033006	3225

Feature: ln
	value	frequency
	92.24	19781
		10488
	92.11	4718
	89.92	2903
	89.87	2756

Feature: monad
	value	frequency
	1	1
	2	1
	3	1
	4	1
	5	1

Feature: mood
	value	frequency
		109422
	Indicative	15617
	Participle	6653
	Infinitive	2285
	Imperative	1877

Feature: nodeID
	value	frequency
	n40001001001	1
	n40001001002	1
	n40001001003	1
	n40001001004	1
	n40001001005	1

Feature: normalized
	value	frequency
	καί	8576
	ὁ	2769
	δέ	2764
	ἐν	2684
	τοῦ	2497

Feature: nu
	value	frequency
	Singular	69846
		38842
	Plural	29091

Feature: number
	value	frequency
	Singular	69846
		38842
	Plural	29091
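Note that the break in the cell above ends the whole loop as soon as 'otype' is reached, so features that sort after it (person, phrase, sp, tense, voice, etc.) are not listed. A variant using continue, a sketch under the same loaded corpus, skips only 'otype' and processes all remaining features:

LinesToPrint = 5
for Feature in Fall():
    if Feature == 'otype':
        continue  # skip only this feature; the loop proceeds with the next one
    print('Feature:', Feature)
    for item, freq in Fs(Feature).freqList()[:LinesToPrint]:
        print('\t', item, '\t', freq)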
Make a list of punctuation marks with their Unicode values. Here the function used prints markdown-formatted strings, although the desired result (a single table) has not yet been achieved: each call renders separately.
result = F.after.freqList()
N1904.dm(" String | Unicode | Frequency\n--- | --- | ---")
for (string, freq) in result:
    # important: the string contains two characters in the case of punctuation
    frequency = str(freq)                # convert it to a string
    unicode_value = str(ord(string[0]))  # convert it to a string
    N1904.dm(" `{}` | {} | {} ".format(string[0], unicode_value, frequency))
String | Unicode | Frequency |
---|---|---|
` ` | 32 | 119272 |
`,` | 44 | 9441 |
`.` | 46 | 5712 |
`·` | 183 | 2355 |
`;` | 59 | 969 |
`—` | 8212 | 30 |
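One way to achieve the desired result (a single, properly rendered table) is to assemble the complete markdown string first and invoke N1904.dm only once; a minimal sketch:

# Build the whole markdown table in one string, then render it with a single dm() call
markdownTable = "String | Unicode | Frequency\n--- | --- | ---\n"
for (string, freq) in F.after.freqList():
    markdownTable += "`{}` | {} | {}\n".format(string[0], ord(string[0]), freq)
N1904.dm(markdownTable)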
The node number ranges are readily available: F.otype.all returns a list of all node types, and F.otype.sInterval returns the node number interval for each type.
for NodeType in F.otype.all:
    print(NodeType, F.otype.sInterval(NodeType))
book (137780, 137806)
chapter (137807, 138066)
sentence (226865, 232584)
verse (232585, 240527)
clause (138067, 154190)
phrase (154191, 226864)
word (1, 137779)
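The inverse lookup is also available: F.otype.v(n) returns the node type for a given node number. For example (node numbers taken from the intervals above):

print(F.otype.v(1))       # a slot node: 'word'
print(F.otype.v(137780))  # the first node in the book interval: 'book'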
Using the same API call, we can also produce another list, this time counting the number of nodes for each type.
for otype in F.otype.all:
    i = 0
    for n in F.otype.s(otype):
        i += 1
    print("{:>7} {}s".format(i, otype))
     27 books
    260 chapters
   5720 sentences
   7943 verses
  16124 clauses
  72674 phrases
 137779 words
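Since F.otype.s(otype) returns the full range of nodes for a type, the manual counter can be replaced by len(); a sketch producing the same overview:

for otype in F.otype.all:
    print("{:>7} {}s".format(len(F.otype.s(otype)), otype))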
N1904.showProvenance(...)
The scripts in this notebook require (besides text-fabric) the following Python libraries to be installed in the environment:

tabulate

You can install any missing library from within Jupyter Notebook using either pip or pip3.
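For example, tabulate (the package used above) can be installed from a notebook cell as follows; depending on your environment, pip3 or the %pip magic may be required instead:

!pip install tabulate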