This Jupyter Notebook showcases several examples of statistical analysis performed on a Text-Fabric corpus. For demonstration purposes, various methods of collecting and presenting the data are employed.
%load_ext autoreload
%autoreload 2
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use
# load the N1904 app and data
N1904 = use("tonyjurg/Nestle1904GBI", version="0.4", hoist=globals())
Locating corpus resources ...
The requested app is not available offline ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/app not found
The requested data is not available offline ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4 not found
| 0.19s T otype from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 1.85s T oslots from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.68s T book from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.53s T chapter from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.51s T verse from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.64s T word from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.50s T after from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| | 0.05s C __levels__ from otype, oslots, otext
| | 1.62s C __order__ from otype, oslots, __levels__
| | 0.07s C __rank__ from otype, __order__
| | 2.23s C __levUp__ from otype, oslots, __rank__
| | 1.46s C __levDown__ from otype, __levUp__, __rank__
| | 0.06s C __characters__ from otext
| | 0.88s C __boundary__ from otype, oslots, __rank__
| | 0.04s C __sections__ from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse
| | 0.21s C __structure__ from otype, oslots, otext, __rank__, __levUp__, book, chapter, verse
| 0.50s T booknum from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.58s T bookshort from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.48s T case from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.50s T clause from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.07s T clauserule from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.02s T clausetype from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.43s T degree from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.53s T formaltag from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.54s T functionaltag from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.57s T gloss from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.47s T gn from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.55s T lemma from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.51s T lex_dom from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.53s T ln from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.44s T monad from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.43s T mood from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.64s T nodeID from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.59s T normalized from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.49s T nu from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.50s T number from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.43s T person from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.70s T phrase from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.26s T phrasefunction from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.28s T phrasefunctionlong from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.27s T phrasetype from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.46s T sentence from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.51s T sp from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.51s T splong from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.54s T strongs from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.45s T subj_ref from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.45s T tense from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.45s T type from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
| 0.45s T voice from ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.4
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 27 | 5102.93 | 100 |
chapter | 260 | 529.92 | 100 |
sentence | 5720 | 24.09 | 100 |
verse | 7943 | 17.35 | 100 |
clause | 16124 | 8.54 | 100 |
phrase | 72674 | 1.90 | 100 |
word | 137779 | 1.00 | 100 |
# The following pushes the Text-Fabric stylesheet to this notebook (to facilitate proper display in the notebook viewer)
N1904.dh(N1904.getCss())
The method `freqList` returns a tuple of (value, frequency) pairs, ordered by frequency, highest frequencies first.
print("Amount\tword")
for (w, amount) in F.word.freqList("word")[0:25]:
print(f"{amount}\t{w}")
Amount	word
8541	καὶ
2768	ὁ
2683	ἐν
2620	δὲ
2497	τοῦ
1755	εἰς
1657	τὸ
1556	τὸν
1518	τὴν
1410	αὐτοῦ
1300	τῆς
1281	ὅτι
1221	τῷ
1201	τῶν
1068	οἱ
941	ἡ
921	γὰρ
902	μὴ
859	τῇ
849	αὐτῷ
817	τὰ
767	οὐκ
722	τοὺς
688	Θεοῦ
670	πρὸς
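Because `freqList` returns the full frequency spectrum, it can also feed other statistics. As an illustrative sketch (using the data loaded above), the following counts the hapax legomena, i.e. surface forms that occur exactly once:

# freqList returns (value, frequency) pairs for all word surface forms
wordFrequencies = F.word.freqList("word")
hapaxes = [w for (w, freq) in wordFrequencies if freq == 1]
print("distinct surface forms:", len(wordFrequencies))
print("hapax legomena:        ", len(hapaxes))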
This code generates a table that displays the frequency of characters within the Text-Fabric corpus. The API call `C.characters.data` produces a Python dictionary containing the data. The remaining code unpacks and sorts this structure to present the results in a formatted table.
Note that the first line of the output is 'Format: text-orig-full'. This is the dictionary key: the character frequencies are reported per text format (on formats, see further below).
# Library to format the table
from tabulate import tabulate

# The following API call results in a Python dictionary structure
FrequencyDictionary = C.characters.data

# Present the results
KeyList = list(FrequencyDictionary.keys())
for Key in KeyList:
    # 'Key' is one of the pre-defined formats in which the text can be displayed
    print('Format: ', Key)
    FrequencyList = FrequencyDictionary[Key]
    SortedFrequencyList = sorted(FrequencyList, key=lambda x: x[1], reverse=True)
    # In this example the table is truncated to the first 15 entries
    max_rows = 15  # Set your desired number of rows here
    TruncatedTable = SortedFrequencyList[:max_rows]
    headers = ["character", "frequency"]
    print(tabulate(TruncatedTable, headers=headers, tablefmt='fancy_grid'))
    # Add a warning using markdown (API call A.dm) so it is printed in bold type
    N1904.dm("**Warning: table truncated!**")
Format:  text-orig-full
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│             │      137779 │
├─────────────┼─────────────┤
│ ν           │       56230 │
├─────────────┼─────────────┤
│ α           │       51892 │
├─────────────┼─────────────┤
│ τ           │       50599 │
├─────────────┼─────────────┤
│ ο           │       45151 │
├─────────────┼─────────────┤
│ ε           │       38597 │
├─────────────┼─────────────┤
│ ς           │       27090 │
├─────────────┼─────────────┤
│ ι           │       26131 │
├─────────────┼─────────────┤
│ σ           │       24095 │
├─────────────┼─────────────┤
│ ρ           │       22871 │
├─────────────┼─────────────┤
│ κ           │       22630 │
├─────────────┼─────────────┤
│ π           │       20308 │
├─────────────┼─────────────┤
│ μ           │       19218 │
├─────────────┼─────────────┤
│ λ           │       18228 │
├─────────────┼─────────────┤
│ δ           │       12476 │
╘═════════════╧═════════════╛
Warning: table truncated!
C.levels.data
(('book', 5102.925925925926, 137780, 137806),
 ('chapter', 529.9192307692308, 137807, 138066),
 ('sentence', 24.087237762237763, 226865, 232584),
 ('verse', 17.345965000629484, 232585, 240527),
 ('clause', 8.54496402877698, 138067, 154190),
 ('phrase', 1.8958499600957701, 154191, 226864),
 ('word', 1, 1, 137779))
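Each entry of this tuple is (node type, average number of slots per node, first node number, last node number); the averages match the '# slots / node' column in the load overview above. A minimal sketch (assuming the corpus loaded above) to print it more readably:

# Unpack the levels tuple: (node type, avg slots per node, first node, last node)
for (nodeType, avgSlots, first, last) in C.levels.data:
    print(f"{nodeType:10} {avgSlots:10.2f} slots/node   nodes {first}-{last}")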
Not a statistical function as such, but still important in relation to the corpus: the output of this command provides details on the available formats for presenting the text of the corpus. See also the module tf.advanced.options (Display Settings).
N1904.showFormats()
format | level | template |
---|---|---|
text-orig-full | word | {word}{after} |
The same result (although formatted differently, since a dictionary is returned) can be obtained by the following call:
T.formats
{'text-orig-full': 'word'}
Note that this data originates from the file otext.tf:
@config
...
@fmt:text-orig-full={word}{after}
...
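To see such a format in action, the following sketch renders one verse using the format defined above (the section labels 'Matthew', 1, 1 follow the book/chapter/verse sectioning of this corpus):

# Look up the node for Matthew 1:1 and render it with the text-orig-full format
verseNode = T.nodeFromSection(('Matthew', 1, 1))
print(T.text(verseNode, fmt='text-orig-full'))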
FeatureList = Fall()
LinesToPrint = 5
for Feature in FeatureList:
    # 'otype' cannot be processed this way; note that break also ends the
    # loop, so features sorting after 'otype' are not reported (see below)
    if Feature == 'otype': break
    print('Feature:', Feature, '\n\n\t value\t frequency')
    FeatureFrequencyLists = Fs(Feature).freqList()
    PrintedLine = 0
    for item, freq in FeatureFrequencyLists:
        PrintedLine += 1
        print('\t', item, '\t', freq)
        if PrintedLine == LinesToPrint: break
    print('\n')
Feature: after
	value	frequency
		119272
	,	9441
	.	5712
	·	2355
	;	969

Feature: book
	value	frequency
	Luke	22801
	Matthew	21334
	Acts	21290
	John	18389
	Mark	13247

Feature: booknum
	value	frequency
	3	22801
	1	21334
	5	21290
	4	18389
	2	13247

Feature: bookshort
	value	frequency
	Luke	22801
	Matt	21334
	Acts	21290
	John	18389
	Mark	13247

Feature: case
	value	frequency
		58261
	Nominative	24197
	Accusative	23031
	Genitive	19515
	Dative	12126

Feature: chapter
	value	frequency
	1	13795
	2	11590
	3	10239
	4	10187
	5	9270

Feature: clause
	value	frequency
	1	481
	6	347
	44	314
	35	310
	4	301

Feature: clauserule
	value	frequency
	CLaCL	1841
	Conj-CL	1740
	sub-CL	1525
	V-O	690
	V2CL	653

Feature: clausetype
	value	frequency
	VerbElided	1355
	Verbless	1330
	Minor	1161

Feature: degree
	value	frequency
		137266
	Comparative	313
	Superlative	200

Feature: formaltag
	value	frequency
	CONJ	16316
	PREP	10568
	ADV	3808
	N-NSM	3475
	N-GSM	2935

Feature: functionaltag
	value	frequency
	CONJ	16316
	PREP	10568
	ADV	3808
	N-NSM	3475
	N-GSM	2935

Feature: gloss
	value	frequency
	the	9857
	and	6212
	-	5496
	in	2320
	And	2218

Feature: gn
	value	frequency
		63804
	Masculine	41486
	Feminine	18736
	Neuter	13753

Feature: lemma
	value	frequency
	ὁ	19783
	καί	8978
	αὐτός	5561
	σύ	2892
	δέ	2787

Feature: lex_dom
	value	frequency
	092004	26322
		10487
	089017	4370
	093001	3672
	033006	3225

Feature: ln
	value	frequency
	92.24	19781
		10488
	92.11	4718
	89.92	2903
	89.87	2756

Feature: monad
	value	frequency
	1	1
	2	1
	3	1
	4	1
	5	1

Feature: mood
	value	frequency
		109422
	Indicative	15617
	Participle	6653
	Infinitive	2285
	Imperative	1877

Feature: nodeID
	value	frequency
	n40001001001	1
	n40001001002	1
	n40001001003	1
	n40001001004	1
	n40001001005	1

Feature: normalized
	value	frequency
	καί	8576
	ὁ	2769
	δέ	2764
	ἐν	2684
	τοῦ	2497

Feature: nu
	value	frequency
	Singular	69846
		38842
	Plural	29091

Feature: number
	value	frequency
	Singular	69846
		38842
	Plural	29091
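Note that the break in the cell above ends the whole loop as soon as 'otype' is reached, so features that sort after it (person, phrase, sp, tense, voice, etc.) are not listed. A variant using continue, a sketch under the same loaded corpus, skips only 'otype' and processes all remaining features:

LinesToPrint = 5
for Feature in Fall():
    if Feature == 'otype':
        continue  # skip only this feature; the loop proceeds with the next one
    print('Feature:', Feature)
    for item, freq in Fs(Feature).freqList()[:LinesToPrint]:
        print('\t', item, '\t', freq)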
Make a list of punctuation marks with their Unicode values. Here the function used prints markdown-formatted strings, although the desired result (a single table) has not yet been achieved: each call renders separately.
result = F.after.freqList()
N1904.dm(" String | Unicode | Frequency\n--- | --- | ---")
for (string, freq) in result:
    # important: the string contains two characters in the case of punctuation
    frequency = str(freq)                # convert it to a string
    unicode_value = str(ord(string[0]))  # convert it to a string
    N1904.dm(" `{}` | {} | {} ".format(string[0], unicode_value, frequency))
String | Unicode | Frequency |
---|---|---|
` ` | 32 | 119272 |
`,` | 44 | 9441 |
`.` | 46 | 5712 |
`·` | 183 | 2355 |
`;` | 59 | 969 |
`—` | 8212 | 30 |
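One way to achieve the desired result (a single, properly rendered table) is to assemble the complete markdown string first and invoke N1904.dm only once; a minimal sketch:

# Build the whole markdown table in one string, then render it with a single dm() call
markdownTable = "String | Unicode | Frequency\n--- | --- | ---\n"
for (string, freq) in F.after.freqList():
    markdownTable += "`{}` | {} | {}\n".format(string[0], ord(string[0]), freq)
N1904.dm(markdownTable)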
The node number ranges are readily available: F.otype.all returns a list of all node types, and F.otype.sInterval returns the node number interval for each type.
for NodeType in F.otype.all:
    print(NodeType, F.otype.sInterval(NodeType))
book (137780, 137806)
chapter (137807, 138066)
sentence (226865, 232584)
verse (232585, 240527)
clause (138067, 154190)
phrase (154191, 226864)
word (1, 137779)
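The inverse lookup is also available: F.otype.v(n) returns the node type for a given node number. For example (node numbers taken from the intervals above):

print(F.otype.v(1))       # a slot node: 'word'
print(F.otype.v(137780))  # the first node in the book interval: 'book'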
Using the same API call, we can also produce another list, this time counting the number of nodes for each type.
for otype in F.otype.all:
    i = 0
    for n in F.otype.s(otype):
        i += 1
    print("{:>7} {}s".format(i, otype))
     27 books
    260 chapters
   5720 sentences
   7943 verses
  16124 clauses
  72674 phrases
 137779 words
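Since F.otype.s(otype) returns the full range of nodes for a type, the manual counter can be replaced by len(); a sketch producing the same overview:

for otype in F.otype.all:
    print("{:>7} {}s".format(len(F.otype.s(otype)), otype))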
N1904.showProvenance(...)
The scripts in this notebook require (besides text-fabric) the following Python libraries to be installed in the environment:

tabulate

You can install any missing library from within Jupyter Notebook using either pip or pip3.
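For example, tabulate (the package used above) can be installed from a notebook cell as follows; depending on your environment, pip3 or the %pip magic may be required instead:

!pip install tabulate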