
Some corpus statistics (Nestle1904LFT)¶

Work in progress!

Table of contents ¶

1 - Introduction
2 - Load Text-Fabric app and data
3 - Performing the queries

1 - Introduction ¶

Back to TOC ¶

This Jupyter Notebook showcases several examples of statistical analysis performed on a Text-Fabric corpus. For demonstration purposes, it uses various methods of collecting and presenting the data.

2 - Load Text-Fabric app and data ¶

Back to TOC ¶

In [1]:

%load_ext autoreload
%autoreload 2

In [2]:

# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use

In [3]:

# load the N1904 app and data
N1904 = use ("tonyjurg/Nestle1904LFT", version="0.6", hoist=globals())

Locating corpus resources ...

The requested app is not available offline
	~/text-fabric-data/github/tonyjurg/Nestle1904LFT/app not found

Status: latest release online v0.6 versus None locally

downloading app, main data and requested additions ...

app: ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/app

The requested data is not available offline
	~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6 not found

Status: latest release online v0.6 versus None locally

downloading app, main data and requested additions ...

data: ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6

   |     0.21s T otype                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     2.31s T oslots               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.56s T wordtranslit         from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.48s T after                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.59s T normalized           from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.49s T chapter              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.61s T unicode              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.56s T book                 from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.46s T verse                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.59s T wordunacc            from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.59s T word                 from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |      |     0.06s C __levels__           from otype, oslots, otext
   |      |     1.79s C __order__            from otype, oslots, __levels__
   |      |     0.07s C __rank__             from otype, __order__
   |      |     3.35s C __levUp__            from otype, oslots, __rank__
   |      |     1.94s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.21s C __characters__       from otext
   |      |     0.92s C __boundary__         from otype, oslots, __rank__
   |      |     0.04s C __sections__         from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse
   |      |     0.23s C __structure__        from otype, oslots, otext, __rank__, __levUp__, book, chapter, verse
   |     0.43s T booknumber           from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.51s T bookshort            from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.48s T case                 from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.33s T clausetype           from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.57s T containedclause      from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.42s T degree               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.57s T gloss                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.46s T gn                   from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.03s T headverse            from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.32s T junction             from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.57s T lemma                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.52s T lex_dom              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.53s T ln                   from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.41s T markafter            from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.41s T markbefore           from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.41s T markorder            from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.45s T monad                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.44s T mood                 from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.52s T morph                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.53s T nodeID               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.48s T nu                   from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.49s T number               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.43s T person               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.44s T punctuation          from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.65s T ref                  from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.67s T reference            from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.49s T roleclausedistance   from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.47s T sentence             from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.50s T sp                   from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.51s T sp_full              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.53s T strongs              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.45s T subj_ref             from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.44s T tense                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.46s T type                 from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.45s T voice                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.40s T wgclass              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.35s T wglevel              from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.43s T wgnum                from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.36s T wgrole               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.35s T wgrolelong           from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.41s T wgrule               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.33s T wgtype               from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.52s T wordlevel            from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.49s T wordrole             from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6
   |     0.51s T wordrolelong         from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6

TF: TF API 12.1.5, tonyjurg/Nestle1904LFT/app v3, Search Reference
Data: tonyjurg - Nestle1904LFT 0.6, Character table, Feature docs

Node types

Name	# of nodes	# slots / node	% coverage
book	27	5102.93	100
chapter	260	529.92	100
verse	7943	17.35	100
sentence	8011	17.20	100
wg	105430	6.85	524
word	137779	1.00	100

Sets: no custom sets
Features:

Nestle 1904 (Low Fat Tree)

after

str

✅ Characters (eg. punctuations) following the word

book

str

✅ Book name (in English language)

booknumber

int

✅ NT book number (Matthew=1, Mark=2, ..., Revelation=27)

bookshort

str

✅ Book name (abbreviated)

case

str

✅ Gramatical case (Nominative, Genitive, Dative, Accusative, Vocative)

chapter

int

✅ Chapter number inside book

clausetype

str

✅ Clause type details (e.g. Verbless, Minor)

containedclause

str

🆗 Contained clause (WG number)

degree

str

✅ Degree (e.g. Comparitative, Superlative)

gloss

str

✅ English gloss

gn

str

✅ Gramatical gender (Masculine, Feminine, Neuter)

headverse

str

✅ Start verse number of a sentence

junction

str

✅ Junction data related to a wordgroup

lemma

str

✅ Lexeme (lemma)

lex_dom

str

✅ Lexical domain according to Semantic Dictionary of Biblical Greek, SDBG (not present everywhere?)

ln

str

✅ Lauw-Nida lexical classification (not present everywhere?)

markafter

str

🆗 Text critical marker after word

markbefore

str

🆗 Text critical marker before word

markorder

str

Order of punctuation and text critical marker

monad

int

✅ Monad (smallest token matching word order in the corpus)

mood

str

✅ Gramatical mood of the verb (passive, etc)

morph

str

✅ Morphological tag (Sandborg-Petersen morphology)

nodeID

str

✅ Node ID (as in the XML source data)

normalized

str

✅ Surface word with accents normalized and trailing punctuations removed

nu

str

✅ Gramatical number (Singular, Plural)

number

str

✅ Gramatical number of the verb (e.g. singular, plural)

otype

str

person

str

✅ Gramatical person of the verb (first, second, third)

punctuation

str

✅ Punctuation after word

ref

str

✅ Value of the ref ID (taken from XML sourcedata)

reference

str

✅ Reference (to nodeID in XML source data, not yet post-processes)

roleclausedistance

str

⚠️ Distance to the wordgroup defining the syntactical role of this word

sentence

int

✅ Sentence number (counted per chapter)

sp

str

✅ Part of Speech (abbreviated)

sp_full

str

✅ Part of Speech (long description)

strongs

str

✅ Strongs number

subj_ref

str

🆗 Subject reference (to nodeID in XML source data, not yet post-processes)

tense

str

✅ Gramatical tense of the verb (e.g. Present, Aorist)

type

str

✅ Gramatical type of noun or pronoun (e.g. Common, Personal)

unicode

str

✅ Word as it apears in the text in Unicode (incl. punctuations)

verse

int

✅ Verse number inside chapter

voice

str

✅ Gramatical voice of the verb (e.g. active,passive)

wgclass

str

✅ Class of the wordgroup (e.g. cl, np, vp)

wglevel

int

🆗 Number of the parent wordgroups for a wordgroup

wgnum

int

✅ Wordgroup number (counted per book)

wgrole

str

✅ Syntactical role of the wordgroup (abbreviated)

wgrolelong

str

✅ Syntactical role of the wordgroup (full)

wgrule

str

✅ Wordgroup rule information (e.g. Np-Appos, ClCl2, PrepNp)

wgtype

str

✅ Wordgroup type details (e.g. group, apposition)

word

str

✅ Word as it appears in the text (excl. punctuations)

wordlevel

str

🆗 Number of the parent wordgroups for a word

wordrole

str

✅ Syntactical role of the word (abbreviated)

wordrolelong

str

✅ Syntactical role of the word (full)

wordtranslit

str

🆗 Transliteration of the text (in latin letters, excl. punctuations)

wordunacc

str

✅ Word without accents (excl. punctuations)

oslots

none

Settings:

specified

apiVersion: 3
appName: tonyjurg/Nestle1904LFT
appPath:
C:/Users/tonyj/text-fabric-data/github/tonyjurg/Nestle1904LFT/app
commit: no value
css: ''
dataDisplay:
- excludedFeatures:
  - orig_order
  - verse
  - book
  - chapter
- noneValues:
  - none
  - unknown
  - no value
  - NA
  - ''
- showVerseInTuple: 0
- textFormat: text-orig-full
docs:
- docBase: https://github.com/tonyjurg/Nestle1904LFT/blob/main/docs/
- docPage: about
- docRoot: https://github.com/tonyjurg/Nestle1904LFT
- featureBase:
  https://github.com/tonyjurg/Nestle1904LFT/blob/main/docs/features/<feature>.md
interfaceDefaults: {fmt: layout-orig-full}
isCompatible: True
local: no value
localDir:
C:/Users/tonyj/text-fabric-data/github/tonyjurg/Nestle1904LFT/_temp
provenanceSpec:
- corpus: Nestle 1904 (Low Fat Tree)
- doi: notyet
- org: tonyjurg
- relative: /tf
- repo: Nestle1904LFT
- repro: Nestle1904LFT
- version: 0.6
- webBase: https://learner.bible/text/show_text/nestle1904/
- webHint: Show this on the Bible Online Learner website
- webLang: en
- webUrl:
  https://learner.bible/text/show_text/nestle1904/<1>/<2>/<3>
- webUrlLex: {webBase}/word?version={version}&id=<lid>
release: no value
typeDisplay:
- book:
  - condense: True
  - hidden: True
  - label: {book}
  - style: ''
- chapter:
  - condense: True
  - hidden: True
  - label: {chapter}
  - style: ''
- sentence:
  - hidden: 0
  - label: #{sentence} (start: {book} {chapter}:{headverse})
  - style: ''
- verse:
  - condense: True
  - excludedFeatures: chapter verse
  - label: {book} {chapter}:{verse}
  - style: ''
- wg:
  - hidden: 0
  - label:
    #{wgnum}: {wgtype} {wgclass} {clausetype} {wgrole} {wgrule} {junction}
  - style: ''
- word:
  - base: True
  - features: lemma
  - featuresBare: gloss
  - surpress: chapter verse
writing: grc

TF API: names N F E L T S C TF Fs Fall Es Eall Cs Call directly usable

In [6]:

# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)
N1904.dh(N1904.getCss())

In [12]:

# Set default view in a way to limit noise as much as possible.
N1904.displaySetup(condensed=True, multiFeatures=False, queryFeatures=False)

3 - Performing the queries ¶

Back to TOC ¶

3.1 - The 25 most frequent words in the corpus ¶

Back to TOC ¶

The method freqList returns a tuple of (value, frequency) items, ordered by frequency, with the highest frequencies first.

In [4]:

print("Amount\tword")
for (w, amount) in F.word.freqList({"word"})[0:25]:
    print(f"{amount}\t{w}")

Amount	word
8545	καὶ
2769	ὁ
2684	ἐν
2620	δὲ
2497	τοῦ
1755	εἰς
1658	τὸ
1556	τὸν
1518	τὴν
1411	αὐτοῦ
1300	τῆς
1281	ὅτι
1221	τῷ
1201	τῶν
1069	οἱ
941	ἡ
921	γὰρ
902	μὴ
859	τῇ
849	αὐτῷ
817	τὰ
767	οὐκ
722	τοὺς
689	Θεοῦ
670	πρὸς
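
To make the ordering explicit, the following is a minimal sketch of the kind of computation freqList performs, written against a plain Python dictionary rather than a Text-Fabric feature. The function name freq_list and the sample data are hypothetical, and Text-Fabric's own implementation may differ in detail.

```python
from collections import Counter


def freq_list(feature_data):
    """Return (value, frequency) pairs, highest frequency first.

    feature_data is a hypothetical stand-in for a feature:
    a dict that maps node numbers to feature values.
    """
    counts = Counter(feature_data.values())
    # Sort by descending frequency; ties are broken by value.
    return tuple(sorted(counts.items(), key=lambda item: (-item[1], item[0])))


# Hypothetical sample: five word nodes and their surface forms
sample = {1: "καὶ", 2: "ὁ", 3: "καὶ", 4: "ἐν", 5: "καὶ"}
for value, amount in freq_list(sample)[0:2]:
    print(f"{amount}\t{value}")
```

Slicing the returned tuple, as in the cell above, is enough to limit the list to the most frequent values.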

3.2 - Frequency of characters in corpus ¶

Back to TOC ¶

This code generates a table that displays the frequency of characters within the Text-Fabric corpus. The API call 'C.characters.data' produces a Python dictionary that maps each text format to a list of (character, frequency) pairs. The remaining code unpacks and sorts this structure to present the results in a formatted table.

Note that each block of output starts with a line such as 'Format: text-orig-full', which names the text format that the table describes.

In [5]:

# Library to format table
from tabulate import tabulate

# The following API call will result in a Python dictionary structure
FrequencyDictionary=C.characters.data

# Present the results
KeyList = list(FrequencyDictionary.keys())
for Key in KeyList:
    print('Format: ',Key)
    # Key is the name of one of the pre-defined formats in which the text can be displayed
    FrequencyList=FrequencyDictionary[Key]
    SortedFrequencyList=sorted(FrequencyList, key=lambda x: x[1], reverse=True)
    
    # In this example the table will be truncated to the first 15 entries
    max_rows = 15  # Set your desired number of rows here
    TruncatedTable = SortedFrequencyList[:max_rows]
    
    headers = ["character", "frequency"]
    print(tabulate(TruncatedTable, headers=headers, tablefmt='fancy_grid'))
    
    # Add a warning using markdown (API call N1904.dm) so that it is printed in bold type
    N1904.dm("**Warning: table truncated!**")

Format:  text-critical
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│ ν           │       56230 │
├─────────────┼─────────────┤
│ α           │       51892 │
├─────────────┼─────────────┤
│ τ           │       50599 │
├─────────────┼─────────────┤
│ ο           │       45151 │
├─────────────┼─────────────┤
│ ε           │       38597 │
├─────────────┼─────────────┤
│ ς           │       27090 │
├─────────────┼─────────────┤
│ ι           │       26131 │
├─────────────┼─────────────┤
│ σ           │       24095 │
├─────────────┼─────────────┤
│ ρ           │       22871 │
├─────────────┼─────────────┤
│ κ           │       22630 │
├─────────────┼─────────────┤
│ π           │       20308 │
├─────────────┼─────────────┤
│ μ           │       19218 │
├─────────────┼─────────────┤
│ λ           │       18228 │
├─────────────┼─────────────┤
│ δ           │       12476 │
├─────────────┼─────────────┤
│ ἐ           │       12116 │
╘═════════════╧═════════════╛

Warning: table truncated!

Format:  text-normalized
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│             │      137779 │
├─────────────┼─────────────┤
│ ν           │       56230 │
├─────────────┼─────────────┤
│ α           │       52127 │
├─────────────┼─────────────┤
│ τ           │       50599 │
├─────────────┼─────────────┤
│ ο           │       45516 │
├─────────────┼─────────────┤
│ ε           │       38807 │
├─────────────┼─────────────┤
│ ς           │       27090 │
├─────────────┼─────────────┤
│ ι           │       26404 │
├─────────────┼─────────────┤
│ σ           │       24095 │
├─────────────┼─────────────┤
│ ρ           │       22871 │
├─────────────┼─────────────┤
│ κ           │       22630 │
├─────────────┼─────────────┤
│ ί           │       21518 │
├─────────────┼─────────────┤
│ π           │       20308 │
├─────────────┼─────────────┤
│ μ           │       19218 │
├─────────────┼─────────────┤
│ λ           │       18228 │
╘═════════════╧═════════════╛

Warning: table truncated!

Format:  text-orig-full
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│             │      137779 │
├─────────────┼─────────────┤
│ ν           │       56230 │
├─────────────┼─────────────┤
│ α           │       51892 │
├─────────────┼─────────────┤
│ τ           │       50599 │
├─────────────┼─────────────┤
│ ο           │       45151 │
├─────────────┼─────────────┤
│ ε           │       38597 │
├─────────────┼─────────────┤
│ ς           │       27090 │
├─────────────┼─────────────┤
│ ι           │       26131 │
├─────────────┼─────────────┤
│ σ           │       24095 │
├─────────────┼─────────────┤
│ ρ           │       22871 │
├─────────────┼─────────────┤
│ κ           │       22630 │
├─────────────┼─────────────┤
│ π           │       20308 │
├─────────────┼─────────────┤
│ μ           │       19218 │
├─────────────┼─────────────┤
│ λ           │       18228 │
├─────────────┼─────────────┤
│ δ           │       12476 │
╘═════════════╧═════════════╛

Warning: table truncated!

Format:  text-transliterated
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│             │      137779 │
├─────────────┼─────────────┤
│ e           │       93371 │
├─────────────┼─────────────┤
│ o           │       87008 │
├─────────────┼─────────────┤
│ a           │       75119 │
├─────────────┼─────────────┤
│ i           │       62778 │
├─────────────┼─────────────┤
│ t           │       60011 │
├─────────────┼─────────────┤
│ n           │       56230 │
├─────────────┼─────────────┤
│ s           │       52132 │
├─────────────┼─────────────┤
│ u           │       39287 │
├─────────────┼─────────────┤
│ k           │       27300 │
├─────────────┼─────────────┤
│ p           │       25081 │
├─────────────┼─────────────┤
│ r           │       22871 │
├─────────────┼─────────────┤
│ h           │       20033 │
├─────────────┼─────────────┤
│ m           │       19218 │
├─────────────┼─────────────┤
│ l           │       18228 │
╘═════════════╧═════════════╛

Warning: table truncated!

Format:  text-unaccented
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│             │      137779 │
├─────────────┼─────────────┤
│ α           │       75119 │
├─────────────┼─────────────┤
│ ε           │       66656 │
├─────────────┼─────────────┤
│ ο           │       65731 │
├─────────────┼─────────────┤
│ ι           │       62834 │
├─────────────┼─────────────┤
│ ν           │       56230 │
├─────────────┼─────────────┤
│ τ           │       50599 │
├─────────────┼─────────────┤
│ υ           │       39287 │
├─────────────┼─────────────┤
│ ς           │       27090 │
├─────────────┼─────────────┤
│ η           │       26715 │
├─────────────┼─────────────┤
│ σ           │       24095 │
├─────────────┼─────────────┤
│ ρ           │       23046 │
├─────────────┼─────────────┤
│ κ           │       22630 │
├─────────────┼─────────────┤
│ ω           │       21277 │
├─────────────┼─────────────┤
│ π           │       20308 │
╘═════════════╧═════════════╛

Warning: table truncated!
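
The cell above prints the warning for every format, whether or not rows were actually dropped. A minimal sketch of a conditional warning is given below; the helper name truncate_rows is hypothetical, and the three sample pairs are copied from the text-critical table above.

```python
def truncate_rows(rows, max_rows):
    """Return the first max_rows rows and a flag telling whether rows were dropped."""
    truncated = len(rows) > max_rows
    return rows[:max_rows], truncated


# (character, frequency) pairs, already sorted by frequency
rows = [("ν", 56230), ("α", 51892), ("τ", 50599)]
shown, truncated = truncate_rows(rows, max_rows=2)
for character, frequency in shown:
    print(character, frequency)
if truncated:
    print("Warning: table truncated!")
```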

3.3 - Some stats on node types ¶

Back to TOC ¶

In [8]:

C.levels.data

Out[8]:

(('book', 5102.925925925926, 137780, 137806),
 ('chapter', 529.9192307692308, 137807, 138066),
 ('verse', 17.345965000629484, 146078, 154020),
 ('sentence', 17.198726750717764, 138067, 146077),
 ('wg', 7.583849727185382, 154021, 267467),
 ('word', 1, 1, 137779))
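
Each entry of C.levels.data holds the node type, the average number of slots per node, and the first and last node number of that type. The sketch below gives these positions names and prints them as an aligned table; the class name Level and its field names are assumptions made for readability, and the values are copied from the output above.

```python
from typing import NamedTuple


class Level(NamedTuple):
    node_type: str
    avg_slots: float
    first_node: int
    last_node: int


# Values copied from the output of C.levels.data above
levels_data = (
    ("book", 5102.925925925926, 137780, 137806),
    ("chapter", 529.9192307692308, 137807, 138066),
    ("verse", 17.345965000629484, 146078, 154020),
    ("sentence", 17.198726750717764, 138067, 146077),
    ("wg", 7.583849727185382, 154021, 267467),
    ("word", 1, 1, 137779),
)

levels = [Level(*entry) for entry in levels_data]
for level in levels:
    print(
        f"{level.node_type:<10}{level.avg_slots:>10.2f}"
        f"{level.first_node:>9}{level.last_node:>9}"
    )
```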

3.4 - The available text formats ¶

Back to TOC ¶

This is not strictly a statistical function, but it is still relevant to the corpus. The output of this command lists the formats that are available to present the text of the corpus. See also the Display Settings in module tf.advanced.options.

In [8]:

N1904.showFormats()

format	level	template
`text-critical`	word	`{unicode}`
`text-normalized`	word	`{normalized}{after}`
`text-orig-full`	word	`{word}{after}`
`text-transliterated`	word	`{wordtranslit}{after}`
`text-unaccented`	word	`{wordunacc}{after}`

The same result (although formatted differently) can be obtained by the following call:

In [9]:

T.formats

Out[9]:

{'text-critical': 'word',
 'text-normalized': 'word',
 'text-orig-full': 'word',
 'text-transliterated': 'word',
 'text-unaccented': 'word'}

Note that this data originates from file otext.tf:

@config
...
@fmt:text-orig-full={word}{after}
...
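
A template such as {word}{after} names the features whose values are concatenated for each word. The following sketch illustrates that idea with Python's own str.format; it is not how Text-Fabric implements formats, and the function name render and the feature values are hypothetical.

```python
def render(template, words):
    """Fill a format template for each word and join the results."""
    return "".join(template.format(**features) for features in words)


# Hypothetical feature values for two consecutive words
words = [
    {"word": "Βίβλος", "after": " "},
    {"word": "γενέσεως", "after": " "},
]
print(render("{word}{after}", words))
```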

3.5 - List of feature frequencies ¶

Back to TOC ¶

This code generates a lot of output! For that reason we will cut it off after 5 lines per feature.

In [10]:

FeatureList=Fall()
LinesToPrint=5
for Feature in FeatureList: 
    if Feature!='otype':
        print ('Feature:',Feature,'\n\n\t value\t frequency')
        FeatureFrequenceLists=Fs(Feature).freqList()
        PrintedLine=0
        for item, freq in FeatureFrequenceLists:
            PrintedLine+=1
            print ('\t',item,'\t',freq)
            if PrintedLine==LinesToPrint: break
        print ('\n')

Feature: after 

	 value	 frequency
	   	 119270
	 ,  	 9462
	 .  	 5717
	 ·  	 2359
	 ;  	 971


Feature: appos 

	 value	 frequency
	  	 100949
	 group 	 9699
	 apposition 	 2799


Feature: book 

	 value	 frequency
	 Luke 	 19457
	 Acts 	 18394
	 Matthew 	 18300
	 John 	 15644
	 Mark 	 11278


Feature: booknumber 

	 value	 frequency
	 3 	 19457
	 5 	 18394
	 1 	 18300
	 4 	 15644
	 2 	 11278


Feature: bookshort 

	 value	 frequency
	 Luke 	 19457
	 Acts 	 18394
	 Matt 	 18300
	 John 	 15644
	 Mark 	 11278


Feature: case 

	 value	 frequency
	  	 58261
	 nominative 	 24197
	 accusative 	 23031
	 genitive 	 19515
	 dative 	 12126


Feature: chapter 

	 value	 frequency
	 1 	 12868
	 2 	 10923
	 3 	 9652
	 4 	 9631
	 5 	 8788


Feature: clausetype 

	 value	 frequency
	  	 110679
	 VerbElided 	 1009
	 Verbless 	 929
	 Minor 	 830


Feature: containedclause 

	 value	 frequency
	 2 	 338
	 2036 	 167
	 97 	 82
	 172 	 81
	 1083 	 79


Feature: degree 

	 value	 frequency
	  	 137266
	 comparative 	 313
	 superlative 	 200


Feature: gloss 

	 value	 frequency
	 the 	 9857
	 and 	 6212
	 - 	 5496
	 in 	 2320
	 And 	 2218


Feature: gn 

	 value	 frequency
	  	 63804
	 masculine 	 41486
	 feminine 	 18736
	 neuter 	 13753


Feature: junction 

	 value	 frequency
	  	 93392
	 coordinate 	 9178
	 subordinate 	 8491
	 apposition 	 2386


Feature: lemma 

	 value	 frequency
	 ὁ 	 19783
	 καί 	 8978
	 αὐτός 	 5561
	 σύ 	 2892
	 δέ 	 2787


Feature: lex_dom 

	 value	 frequency
	 092004 	 26322
	  	 10487
	 089017 	 4370
	 093001 	 3672
	 033006 	 3225


Feature: ln 

	 value	 frequency
	 92.24 	 19781
	  	 10488
	 92.11 	 4718
	 89.92 	 2903
	 89.87 	 2756


Feature: markafter 

	 value	 frequency
	  	 137728
	 — 	 31
	 ) 	 11
	 ]] 	 7
	 ( 	 1


Feature: markbefore 

	 value	 frequency
	  	 137745
	 — 	 16
	 ( 	 10
	 [[ 	 7
	 [ 	 1


Feature: markorder 

	 value	 frequency
	  	 137694
	 0 	 34
	 3 	 32
	 2 	 10
	 1 	 9


Feature: monad 

	 value	 frequency
	 1 	 1
	 2 	 1
	 3 	 1
	 4 	 1
	 5 	 1


Feature: mood 

	 value	 frequency
	  	 109422
	 indicative 	 15617
	 participle 	 6653
	 infinitive 	 2285
	 imperative 	 1877


Feature: morph 

	 value	 frequency
	 CONJ 	 16316
	 PREP 	 10568
	 ADV 	 3808
	 N-NSM 	 3475
	 N-GSM 	 2935


Feature: nodeID 

	 value	 frequency
	  	 52046
	 common 	 14186
	 personal 	 6040
	 proper 	 2192
	 relative 	 885


Feature: normalized 

	 value	 frequency
	 καί 	 8576
	 ὁ 	 2769
	 δέ 	 2764
	 ἐν 	 2684
	 τοῦ 	 2497


Feature: nu 

	 value	 frequency
	 singular 	 69846
	  	 38842
	 plural 	 29091


Feature: number 

	 value	 frequency
	 singular 	 69846
	  	 38842
	 plural 	 29091


Feature: orig_order 

	 value	 frequency
	 1 	 1
	 2 	 1
	 3 	 1
	 4 	 1
	 5 	 1


Feature: person 

	 value	 frequency
	  	 118360
	 third 	 12747
	 second 	 3729
	 first 	 2943


Feature: punctuation 

	 value	 frequency
	  	 119270
	 , 	 9462
	 . 	 5717
	 · 	 2359
	 ; 	 971


Feature: ref 

	 value	 frequency
	 1CO 10:1!1 	 1
	 1CO 10:1!10 	 1
	 1CO 10:1!11 	 1
	 1CO 10:1!12 	 1
	 1CO 10:1!13 	 1


Feature: roleclausedistance 

	 value	 frequency
	 0 	 56129
	 1 	 37597
	 2 	 22297
	 3 	 12084
	 4 	 5277


Feature: sentence 

	 value	 frequency
	 3 	 1103
	 4 	 960
	 1 	 810
	 5 	 747
	 6 	 680


Feature: sp 

	 value	 frequency
	 noun 	 28455
	 verb 	 28357
	 det 	 19786
	 conj 	 18227
	 pron 	 16177


Feature: sp_full 

	 value	 frequency
	 Noun 	 28455
	 Verb 	 28357
	 Determiner 	 19786
	 Conjunction 	 18227
	 Pronoun 	 16177


Feature: strongs 

	 value	 frequency
	 3588 	 19783
	 2532 	 8978
	 846 	 5561
	 4771 	 2892
	 1161 	 2787


Feature: subj_ref 

	 value	 frequency
	  	 121204
	 n46003022002 	 172
	 n66001009002 	 131
	 n45001001001 	 104
	 n47010001004 	 104


Feature: tense 

	 value	 frequency
	  	 109422
	 aorist 	 11803
	 present 	 11579
	 imperfect 	 1689
	 future 	 1626


Feature: type 

	 value	 frequency
	  	 93321
	 common 	 23644
	 personal 	 11521
	 proper 	 4639
	 demonstrative 	 1722


Feature: unicode 

	 value	 frequency
	 καὶ 	 8541
	 ὁ 	 2768
	 ἐν 	 2683
	 δὲ 	 2619
	 τοῦ 	 2497


Feature: verse 

	 value	 frequency
	 10 	 5180
	 12 	 5177
	 1 	 5064
	 9 	 5064
	 4 	 5024


Feature: voice 

	 value	 frequency
	  	 109422
	 active 	 20742
	 passive 	 3493
	 middle 	 2408
	 middlepassive 	 1714


Feature: wgclass 

	 value	 frequency
	 np 	 33710
	 cl 	 30857
	 cl* 	 16378
	  	 12760
	 pp 	 11169


Feature: wglevel 

	 value	 frequency
	 5 	 16862
	 4 	 16527
	 6 	 15520
	 7 	 12163
	 3 	 10447


Feature: wgnum 

	 value	 frequency
	 1 	 27
	 2 	 27
	 3 	 27
	 4 	 27
	 5 	 27


Feature: wgrole 

	 value	 frequency
	  	 77251
	 adv 	 16710
	 o 	 9329
	 s 	 6710
	 p 	 1770


Feature: wgrolelong 

	 value	 frequency
	  	 77280
	 Adverbial 	 16710
	 Object 	 9329
	 Subject 	 6710
	 Predicate 	 1770


Feature: wgrule 

	 value	 frequency
	  	 22718
	 DetNP 	 15696
	 PrepNp 	 11044
	 NPofNP 	 6819
	 Conj-CL 	 5571


Feature: wgtype 

	 value	 frequency
	  	 100949
	 group 	 9699
	 apposition 	 2799


Feature: word 

	 value	 frequency
	 καὶ 	 8545
	 ὁ 	 2769
	 ἐν 	 2684
	 δὲ 	 2620
	 τοῦ 	 2497


Feature: wordlevel 

	 value	 frequency
	 6 	 21857
	 7 	 20984
	 5 	 20538
	 8 	 16755
	 9 	 12772


Feature: wordrole 

	 value	 frequency
	 adv 	 41598
	 v 	 25817
	 s 	 22908
	 o 	 21929
	  	 9347


Feature: wordrolelong 

	 value	 frequency
	 Adverbial 	 41598
	 Verbal 	 25817
	 Subject 	 22908
	 Object 	 21929
	  	 9347


Feature: wordtranslit 

	 value	 frequency
	 kai 	 8576
	 en 	 3152
	 o 	 3149
	 to 	 2885
	 de 	 2769


Feature: wordunacc 

	 value	 frequency
	 και 	 8576
	 ο 	 3019
	 δε 	 2764
	 εν 	 2752
	 του 	 2497
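
The manual line counter in the loop above can be replaced by slicing the frequency list. The sketch below shows this with a hypothetical helper top_values, which can also skip the blank value that leads many of the lists. It assumes that the blank value shown in the output is the empty string; the sample pairs are copied from the degree output above.

```python
def top_values(freq_list, limit=5, skip_empty=False):
    """Return at most `limit` (value, frequency) pairs from a frequency list."""
    if skip_empty:
        # Assumption: a missing value is stored as the empty string
        freq_list = [(value, freq) for value, freq in freq_list if value != ""]
    return list(freq_list[:limit])


# Pairs copied from the 'degree' output above
degree = (("", 137266), ("comparative", 313), ("superlative", 200))
for value, freq in top_values(degree, limit=5, skip_empty=True):
    print("\t", value, "\t", freq)
```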

3.6 - Frequency list of punctuations ¶

Back to TOC ¶

Make a list of punctuation marks with their Unicode code points. The rows are printed with the markdown display function N1904.dm, although this does not yet render as a proper table.

In [11]:

result = F.after.freqList()
N1904.dm(" String | Unicode | Frequency\n--- | --- | ---")
for (string, freq) in result:
    # Note: for punctuation the string holds two characters; only the first one is used
    frequency = str(freq)                # convert the count to a string
    unicode_value = str(ord(string[0]))  # code point of the first character, as a string
    N1904.dm(" `{}` | {} | {} ".format(string[0],unicode_value,frequency))  

String	Unicode	Frequency

| 32 | 119272

, | 44 | 9441

. | 46 | 5712

· | 183 | 2355

; | 59 | 969

— | 8212 | 30
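
One likely reason the table does not render is that each row is passed to the markdown function in a separate call. The sketch below builds the whole table as a single string instead. The function name punctuation_table is hypothetical, and the sample pairs are copied from the output above, with each string reduced to a single character.

```python
def punctuation_table(freq_list):
    """Build one markdown table from (string, frequency) pairs."""
    lines = ["String | Unicode | Frequency", "--- | --- | ---"]
    for string, freq in freq_list:
        if not string:
            continue  # skip empty values, which have no first character
        first = string[0]  # only the first character is reported
        lines.append(f"`{first}` | {ord(first)} | {freq}")
    return "\n".join(lines)


sample = ((" ", 119272), (",", 9441), (".", 5712))
print(punctuation_table(sample))
```

With the corpus loaded, the result of punctuation_table(F.after.freqList()) could then be passed to N1904.dm in one call.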

3.7 - Node number ranges ¶

Back to TOC ¶

The node number ranges are readily available: F.otype.all lists all node types, and F.otype.sInterval returns the first and last node number for a given node type.

In [26]:

for NodeType in F.otype.all:
    print (NodeType, F.otype.sInterval(NodeType))

book (137780, 137806)
chapter (137807, 138066)
verse (146078, 154020)
sentence (138067, 146077)
wg (154021, 268899)
word (1, 137779)
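
Because every node type occupies one contiguous range, the type of a node follows from its number alone. The sketch below looks up the type in the intervals printed above; the function name node_type_of is hypothetical.

```python
# Intervals copied from the output above: node type -> (first node, last node)
intervals = {
    "book": (137780, 137806),
    "chapter": (137807, 138066),
    "verse": (146078, 154020),
    "sentence": (138067, 146077),
    "wg": (154021, 268899),
    "word": (1, 137779),
}


def node_type_of(node):
    """Return the node type whose interval contains the node number, or None."""
    for node_type, (first, last) in intervals.items():
        if first <= node <= last:
            return node_type
    return None


print(node_type_of(1), node_type_of(137780), node_type_of(150000))
```

With the corpus loaded, F.otype.v(node) gives the node type directly, so this lookup is only an illustration of how the ranges partition the node numbers.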

3.8 - Count the objects per type ¶

Back to TOC ¶

Using the same list of node types, we can also count the number of nodes for each type.

In [27]:

for otype in F.otype.all:
    i = 0
    for n in F.otype.s(otype):
        i += 1
    print ("{:>7} {}s".format(i, otype))

     27 books
    260 chapters
   7943 verses
   8011 sentences
 114879 wgs
 137779 words
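
The same counts can be derived without visiting each node, since a range from first to last contains last - first + 1 nodes. The sketch below applies this to the intervals printed in section 3.7; the function name count_nodes is hypothetical.

```python
# Intervals copied from the output in section 3.7
intervals = {
    "book": (137780, 137806),
    "chapter": (137807, 138066),
    "verse": (146078, 154020),
    "sentence": (138067, 146077),
    "wg": (154021, 268899),
    "word": (1, 137779),
}


def count_nodes(first, last):
    """Number of nodes in the inclusive range first..last."""
    return last - first + 1


counts = {node_type: count_nodes(*bounds) for node_type, bounds in intervals.items()}
for node_type, count in counts.items():
    print("{:>7} {}s".format(count, node_type))
```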

In [7]:

N1904.showProvenance(...)

Job:

Ellipsis

Author:

program author

Created:

2023-07-28T23:07:21+02:00

Data:

Nestle 1904

version

0.5

release

none

download

tonyjurg/Nestle1904LFT/tf v:0.5(unknown release or commit)

DOI

no DOI

Tool:

Text-Fabric 11.4.10 10.5281/zenodo.592193

TF App:

tonyjurg/Nestle1904LFT on GitHub

commit

f2eb5e2b0f8805ad720d91a5cb9e2aa2fdc6c99a

3.9 - Obtain meta data for a feature ¶

Back to TOC ¶

This can be useful if you want to process all features in a script.

In [12]:

# Just print the metadata dictionary of the feature
FeatureName='word'
MetaData=Fs(FeatureName).meta
print (MetaData)

{'Availability': 'Creative Commons Attribution 4.0 International (CC BY 4.0)', 'Converter_author': 'Tony Jurg, ReMa Student Vrije Universiteit Amsterdam, Netherlands', 'Converter_execution': 'Tony Jurg, ReMa Student Vrije Universiteit Amsterdam, Netherlands', 'Converter_version': '0.3', 'Convertor_source': 'https://github.com/tonyjurg/Nestle1904LFT/tree/main/tools', 'Data source': 'MACULA Greek Linguistic Datasets, available at https://github.com/Clear-Bible/macula-greek/tree/main/Nestle1904/lowfat', 'Editors': 'Eberhard Nestle', 'Name': 'Greek New Testament (Nestle 1904 based on Low Fat Tree)', 'TextFabric version': '11.4.10', 'description': 'Word as it appears in the text (excl. punctuations)', 'valueType': 'str', 'writtenBy': 'Text-Fabric', 'dateWritten': '2023-06-19T15:13:46Z'}

Now perform a very basic check on this data:

In [13]:

print ('feature ',FeatureName, end='')
if MetaData['valueType']=='str':
    print (' is of type str.')
else:
    print (' is not of type str.')

feature  word is of type str.
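
When processing all features in a script, the metadata can be used to group features by value type. The sketch below works on a plain dictionary so that it runs without the corpus. The function name group_by_value_type is hypothetical, and the abbreviated metadata is made up for illustration, with the types following the feature overview in section 2.

```python
from collections import defaultdict


def group_by_value_type(meta_by_feature):
    """Map each valueType to the sorted list of feature names that have it."""
    groups = defaultdict(list)
    for name, meta in meta_by_feature.items():
        groups[meta.get("valueType", "unknown")].append(name)
    return {value_type: sorted(names) for value_type, names in groups.items()}


# Hypothetical, abbreviated metadata for three features
meta_by_feature = {
    "word": {"valueType": "str"},
    "chapter": {"valueType": "int"},
    "verse": {"valueType": "int"},
}
print(group_by_value_type(meta_by_feature))
```

With the corpus loaded, the input could be built from the calls used above, for example as {name: Fs(name).meta for name in Fall()}.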

Trying the various formats ¶

In [ ]:

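# Work in progress: render one node in two of the formats listed in section 3.4.
# Only format names reported by T.formats are valid here.
node = 1  # the first word node (word nodes run from 1 to 137779)
origText = T.text(node, fmt='text-orig-full')
critText = T.text(node, fmt='text-critical')

# Fragment of the format definitions:
#        'fmt:text-orig-full':     '{word}{after}',
#        'fmt:text-normalized':    '{normalized}{after}',
#        'fmt:text-unaccented':    '{wordunacc}{after}',
#        'fmt:text-transliterated':'{wordtranslit}{after}',
#        'fmt:text-critical':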
