Notebook

Doc4TF ¶

automatic creation of feature documentation for existing Text-Fabric datasets¶

Table of content ¶

1 - Introduction
2 - Setting up the environment
3 - Load Text-Fabric data
4 - Creation of the dataset
- 4.1 - Setting up some production values
- 4.2 - Store data in dictionaries
5 - Create the pages
- 5.1 - Create set of feature pages
- 5.2 - Create overview page
6 - Licence

1 - Introduction ¶

Back to TOC ¶

Ideally, a comprehensive documentation set should be created as part of developing a Text-Fabric dataset. However, in practice, this is not always completed during the initial phase or after changes to features. This Jupyter Notebook contains Python code to automatically generate (and thus ensure consistency) a documentation set for any Text-Fabric dataset. It serves as a robust starting point for the development of a brand new documentation set or as validation for an existing one. One major advantage is that the resulting documentation set is fully hyperlinked, a task that can be laborious if done manually.

The main steps in producing the documentation set are:

Load a Text-Fabric database
Execute the code pressent in the subsequent cells. The code will:
- construct a few python dictionaries with relevant data from the TF datase
- create separate files for each feature
- create an overview page of all featers per node type

2. Setting up the environment ¶

Back to TOC ¶

Your environment should (obviously) include the Python package Text-Fabric. In the current implementation of the script, the Python package markdown2 is also required. If not installed yet, it can be installed using pip. (note: possibly in a future version this dependancy might be removed).

In [68]:

!pip install markdown2

Collecting markdown2
  Downloading markdown2-2.4.11-py2.py3-none-any.whl.metadata (2.0 kB)
Downloading markdown2-2.4.11-py2.py3-none-any.whl (41 kB)
   ---------------------------------------- 0.0/41.1 kB ? eta -:--:--
   --------- ------------------------------ 10.2/41.1 kB ? eta -:--:--
   ------------------- -------------------- 20.5/41.1 kB 217.9 kB/s eta 0:00:01
   ---------------------------------------- 41.1/41.1 kB 328.0 kB/s eta 0:00:00
Installing collected packages: markdown2
Successfully installed markdown2-2.4.11

3 - Load Text-Fabric data ¶

Back to TOC ¶

At this stage the Text-Fabric dataset is loaded which will be used to create a documentation set. See documentation for function use for various options regaring storage locations.

In [1]:

%load_ext autoreload
%autoreload 2

In [2]:

# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use

In [3]:

# load the N1904 app and data
N1904 = use ("tonyjurg/Nestle1904LFT", version="0.6", hoist=globals())

Locating corpus resources ...

app: ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/app

data: ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.6

TF: TF API 12.2.2, tonyjurg/Nestle1904LFT/app v3, Search Reference
Data: tonyjurg - Nestle1904LFT 0.6, Character table, Feature docs

Node types

Name	# of nodes	# slots / node	% coverage
book	27	5102.93	100
chapter	260	529.92	100
verse	7943	17.35	100
sentence	8011	17.20	100
wg	105430	6.85	524
word	137779	1.00	100

Sets: no custom sets
Features:

Nestle 1904 (Low Fat Tree)

after

str

✅ Characters (eg. punctuations) following the word

book

str

✅ Book name (in English language)

booknumber

int

✅ NT book number (Matthew=1, Mark=2, ..., Revelation=27)

bookshort

str

✅ Book name (abbreviated)

case

str

✅ Gramatical case (Nominative, Genitive, Dative, Accusative, Vocative)

chapter

int

✅ Chapter number inside book

clausetype

str

✅ Clause type details (e.g. Verbless, Minor)

containedclause

str

🆗 Contained clause (WG number)

degree

str

✅ Degree (e.g. Comparitative, Superlative)

gloss

str

✅ English gloss

str

✅ Gramatical gender (Masculine, Feminine, Neuter)

headverse

str

✅ Start verse number of a sentence

junction

str

✅ Junction data related to a wordgroup

lemma

str

✅ Lexeme (lemma)

lex_dom

str

✅ Lexical domain according to Semantic Dictionary of Biblical Greek, SDBG (not present everywhere?)

str

✅ Lauw-Nida lexical classification (not present everywhere?)

markafter

str

🆗 Text critical marker after word

markbefore

str

🆗 Text critical marker before word

markorder

str

Order of punctuation and text critical marker

monad

int

✅ Monad (smallest token matching word order in the corpus)

mood

str

✅ Gramatical mood of the verb (passive, etc)

morph

str

✅ Morphological tag (Sandborg-Petersen morphology)

nodeID

str

✅ Node ID (as in the XML source data)

normalized

str

✅ Surface word with accents normalized and trailing punctuations removed

str

✅ Gramatical number (Singular, Plural)

number

str

✅ Gramatical number of the verb (e.g. singular, plural)

otype

str

person

str

✅ Gramatical person of the verb (first, second, third)

punctuation

str

✅ Punctuation after word

ref

str

✅ Value of the ref ID (taken from XML sourcedata)

reference

str

✅ Reference (to nodeID in XML source data, not yet post-processes)

roleclausedistance

str

⚠️ Distance to the wordgroup defining the syntactical role of this word

sentence

int

✅ Sentence number (counted per chapter)

str

✅ Part of Speech (abbreviated)

sp_full

str

✅ Part of Speech (long description)

strongs

str

✅ Strongs number

subj_ref

str

🆗 Subject reference (to nodeID in XML source data, not yet post-processes)

tense

str

✅ Gramatical tense of the verb (e.g. Present, Aorist)

type

str

✅ Gramatical type of noun or pronoun (e.g. Common, Personal)

unicode

str

✅ Word as it apears in the text in Unicode (incl. punctuations)

verse

int

✅ Verse number inside chapter

voice

str

✅ Gramatical voice of the verb (e.g. active,passive)

wgclass

str

✅ Class of the wordgroup (e.g. cl, np, vp)

wglevel

int

🆗 Number of the parent wordgroups for a wordgroup

wgnum

int

✅ Wordgroup number (counted per book)

wgrole

str

✅ Syntactical role of the wordgroup (abbreviated)

wgrolelong

str

✅ Syntactical role of the wordgroup (full)

wgrule

str

✅ Wordgroup rule information (e.g. Np-Appos, ClCl2, PrepNp)

wgtype

str

✅ Wordgroup type details (e.g. group, apposition)

word

str

✅ Word as it appears in the text (excl. punctuations)

wordlevel

str

🆗 Number of the parent wordgroups for a word

wordrole

str

✅ Syntactical role of the word (abbreviated)

wordrolelong

str

✅ Syntactical role of the word (full)

wordtranslit

str

🆗 Transliteration of the text (in latin letters, excl. punctuations)

wordunacc

str

✅ Word without accents (excl. punctuations)

oslots

none

Settings:

specified

apiVersion: 3
appName: tonyjurg/Nestle1904LFT
appPath:
C:/Users/tonyj/text-fabric-data/github/tonyjurg/Nestle1904LFT/app
commit: e68bd68c7c4c862c1464d995d51e27db7691254f
css: ''
dataDisplay:
- excludedFeatures:
  - orig_order
  - verse
  - book
  - chapter
- noneValues:
  - none
  - unknown
  - no value
  - NA
  - ''
- showVerseInTuple: 0
- textFormat: text-orig-full
docs:
- docBase: https://github.com/tonyjurg/Nestle1904LFT/blob/main/docs/
- docPage: about
- docRoot: https://github.com/tonyjurg/Nestle1904LFT
- featureBase:
  https://github.com/tonyjurg/Nestle1904LFT/blob/main/docs/features/<feature>.md
interfaceDefaults: {fmt: layout-orig-full}
isCompatible: True
local: local
localDir:
C:/Users/tonyj/text-fabric-data/github/tonyjurg/Nestle1904LFT/_temp
provenanceSpec:
- corpus: Nestle 1904 (Low Fat Tree)
- doi: 10.5281/zenodo.10182594
- org: tonyjurg
- relative: /tf
- repo: Nestle1904LFT
- repro: Nestle1904LFT
- version: 0.6
- webBase: https://learner.bible/text/show_text/nestle1904/
- webHint: Show this on the Bible Online Learner website
- webLang: en
- webUrl:
  https://learner.bible/text/show_text/nestle1904/<1>/<2>/<3>
- webUrlLex: {webBase}/word?version={version}&id=<lid>
release: v0.6
typeDisplay:
- book:
  - condense: True
  - hidden: True
  - label: {book}
  - style: ''
- chapter:
  - condense: True
  - hidden: True
  - label: {chapter}
  - style: ''
- sentence:
  - hidden: 0
  - label: #{sentence} (start: {book} {chapter}:{headverse})
  - style: ''
- verse:
  - condense: True
  - excludedFeatures: chapter verse
  - label: {book} {chapter}:{verse}
  - style: ''
- wg:
  - hidden: 0
  - label:
    #{wgnum}: {wgtype} {wgclass} {clausetype} {wgrole} {wgrule} {junction}
  - style: ''
- word:
  - base: True
  - features: lemma
  - featuresBare: gloss
  - surpress: chapter verse
writing: grc

TF API: names N F E L T S C TF Fs Fall Es Eall Cs Call directly usable

4 - Creation of the dataset ¶

4.1 - Setting up some production values ¶

Back to TOC ¶

In [4]:

# set the title for all pages (indicating the dataset the documentation is describing)
pageTitle="N1904 Greek New Testament Text-Fabric dataset [tonyjurg/Nestle1904LFT - 0.6](https://github.com/tonyjurg/Nestle1904LFT)"

# location to store the resulting files.For now the same location as where the notebook resides (no ending slash)
resultLocation = ""

# Set verbose to True if you want dictionaries printed. Setting to False does mute the output
verbose=True

4.2 - Store data in dictionaries ¶

4.2.1 - Get node types and their node ranges ¶

Back to TOC ¶

The following will create a dictionary containing the mapping from node type to node number.

In [5]:

# Initialize an empty dictionary
nodeDict = {}

# Iterate over C.levels.data
for item in C.levels.data:
    node,_,start,end = item
    # Create empty node list
    nodeDict[node] = []
    # Append the tuple (start, end) to the node's list
    nodeDict[node].append((start, end))
    
# Print resulting dictionary depending on setting 'verbose' 
if verbose: print(nodeDict)
print('finished')

{'book': [(137780, 137806)], 'chapter': [(137807, 138066)], 'verse': [(146078, 154020)], 'sentence': [(138067, 146077)], 'wg': [(154021, 259450)], 'word': [(1, 137779)]}
finished

Or alternative (with identical result)

In [6]:

# Initialize an empty dictionary
nodeDict = {}
# Iterate over node types
for NodeType in F.otype.all:
    nodeDict[NodeType] = []
    start, end = F.otype.sInterval(NodeType)
    # Append the tuple (start, end) to the node's list
    nodeDict[NodeType].append((start, end))
    
# Print resulting dictionary depending on setting 'verbose' 
if verbose: print(nodeDict)
print('finished')

{'book': [(137780, 137806)], 'chapter': [(137807, 138066)], 'verse': [(146078, 154020)], 'sentence': [(138067, 146077)], 'wg': [(154021, 259450)], 'word': [(1, 137779)]}
finished

4.2.2 - Determine which node types have specific features ¶

Back to TOC ¶

The following will create a feature list with information about the node types that contain values for that specific feature.

In [7]:

# Initialize an empty dictionary
featureDict = {}
# Iterate over Fall(), all features
for item in Fall():
    # Use a set to store unique values for each feature
    featureDict[item] = set()  
    for node, content in Fs(item).items():
        featureDict[item].add(F.otype.v(node))
        
# Print the resulting dictionary depending on setting 'verbose' 
if verbose: print(featureDict)
print('finished')

{'after': {'word'}, 'book': {'chapter', 'sentence', 'word', 'book', 'verse'}, 'booknumber': {'word', 'book'}, 'bookshort': {'word', 'book'}, 'case': {'word'}, 'chapter': {'word', 'sentence', 'verse', 'chapter'}, 'clausetype': {'wg'}, 'containedclause': {'word'}, 'degree': {'word'}, 'gloss': {'word'}, 'gn': {'word'}, 'headverse': {'sentence'}, 'junction': {'wg'}, 'lemma': {'word'}, 'lex_dom': {'word'}, 'ln': {'word'}, 'markafter': {'word'}, 'markbefore': {'word'}, 'markorder': {'word'}, 'monad': {'word'}, 'mood': {'word'}, 'morph': {'word'}, 'nodeID': {'word'}, 'normalized': {'word'}, 'nu': {'word'}, 'number': {'word'}, 'otype': {'wg', 'chapter', 'sentence', 'word', 'book', 'verse'}, 'person': {'word'}, 'punctuation': {'word'}, 'ref': {'word'}, 'reference': {'word'}, 'roleclausedistance': {'word'}, 'sentence': {'word', 'sentence'}, 'sp': {'word'}, 'sp_full': {'word'}, 'strongs': {'word'}, 'subj_ref': {'word'}, 'tense': {'word'}, 'type': {'word'}, 'unicode': {'word'}, 'verse': {'word', 'verse'}, 'voice': {'word'}, 'wgclass': {'wg'}, 'wglevel': {'wg'}, 'wgnum': {'wg'}, 'wgrole': {'wg'}, 'wgrolelong': {'wg'}, 'wgrule': {'wg'}, 'wgtype': {'wg'}, 'word': {'word'}, 'wordlevel': {'word'}, 'wordrole': {'word'}, 'wordrolelong': {'word'}, 'wordtranslit': {'word'}, 'wordunacc': {'word'}}
finished

4.2.3 - Create dictionairy with description and valuefrequency per feature ¶

Back to TOC ¶

The following will create a dictionairy with the description per feature (taken from the meta data)

In [8]:

# Initialize an empty dictionary
featureMetaDict = {}
# Iterate over Fall(), all features
for item in Fall():
    featureMetaDict[item] = []
    featureMetaData=Fs(item).meta
    # Check if 'description' key exists in the meta dictionary
    if 'description' in featureMetaData:
        featureDescription = featureMetaData['description']
    else:
        featureDescription = "No feature description"
        
    # Check if 'valueType' key exists in the meta dictionary    
    if 'valueType' in featureMetaData:
        featureType = "unknown"
        if featureMetaData["valueType"] == 'str': featureType = "string" 
        if featureMetaData["valueType"] == 'int': featureType = "integer" 
    else:
        featureType = "not found"
    
    if item!='otype':
        FeatureFrequenceLists=Fs(item).freqList()
        FoundItems=0
        FeatureValueSetList = []  # Initialize an empty list to store feature value sets
        for value, freq in FeatureFrequenceLists:
            FoundItems+=1
            FeatureValueSet = value
            FeatureFrequencySet = freq
            FeatureValueSetList.append((FeatureValueSet,FeatureFrequencySet))
            if FoundItems==10: break


    featureMetaDict[item].append((featureDescription, featureType, FeatureValueSetList))

# Print resulting dictionary depending on setting 'verbose' 
if verbose: print(featureMetaDict)
print('finished')

{'after': [('✅ Characters (eg. punctuations) following the word', 'string', [(' ', 119270), (', ', 9462), ('. ', 5717), ('· ', 2359), ('; ', 971)])], 'book': [('✅ Book name (in English language)', 'string', [('Luke', 21785), ('Matthew', 20529), ('Acts', 20307), ('John', 17582), ('Mark', 12695), ('Revelation', 10726), ('Romans', 8014), ('I_Corinthians', 7798), ('Hebrews', 5513), ('II_Corinthians', 4992)])], 'booknumber': [('✅ NT book number (Matthew=1, Mark=2, ..., Revelation=27)', 'integer', [(3, 19457), (5, 18394), (1, 18300), (4, 15644), (2, 11278), (27, 9833), (6, 7101), (7, 6821), (19, 4956), (8, 4470)])], 'bookshort': [('✅ Book name (abbreviated)', 'string', [('Luke', 19457), ('Acts', 18394), ('Matt', 18300), ('John', 15644), ('Mark', 11278), ('Rev', 9833), ('Rom', 7101), ('1Cor', 6821), ('Heb', 4956), ('2Cor', 4470)])], 'case': [('✅ Gramatical case (Nominative, Genitive, Dative, Accusative, Vocative)', 'string', [('', 58261), ('nominative', 24197), ('accusative', 23031), ('genitive', 19515), ('dative', 12126), ('vocative', 649)])], 'chapter': [('✅ Chapter number inside book', 'integer', [(1, 12922), (2, 10923), (3, 9652), (4, 9631), (5, 8788), (6, 7561), (9, 7157), (7, 7069), (11, 7007), (12, 6970)])], 'clausetype': [('✅ Clause type details (e.g. Verbless, Minor)', 'string', [('', 102662), ('VerbElided', 1009), ('Verbless', 929), ('Minor', 830)])], 'containedclause': [('🆗 Contained clause (WG number)', 'string', [('', 8372), ('2', 148), ('172', 69), ('97', 69), ('389', 68), ('346', 63), ('822', 62), ('1455', 60), ('129', 59), ('1083', 58)])], 'degree': [('✅ Degree (e.g. Comparitative, Superlative)', 'string', [('', 137266), ('comparative', 313), ('superlative', 200)])], 'gloss': [('✅ English gloss', 'string', [('the', 9857), ('and', 6212), ('-', 5496), ('in', 2320), ('And', 2218), ('not', 2042), ('of the', 1551), ('for', 1501), ('that', 1498), ('you', 1226)])], 'gn': [('✅ Gramatical gender (Masculine, Feminine, Neuter)', 'string', [('', 63804), ('masculine', 41486), ('feminine', 18736), ('neuter', 13753)])], 'headverse': [('✅ Start verse number of a sentence', 'string', [('1', 298), ('7', 270), ('12', 267), ('9', 264), ('13', 260), ('11', 256), ('5', 255), ('8', 254), ('10', 252), ('6', 252)])], 'junction': [('✅ Junction data related to a wordgroup', 'string', [('', 103128), ('apposition', 2302)])], 'lemma': [('✅ Lexeme (lemma)', 'string', [('ὁ', 19783), ('καί', 8978), ('αὐτός', 5561), ('σύ', 2892), ('δέ', 2787), ('ἐν', 2743), ('ἐγώ', 2567), ('εἰμί', 2457), ('λέγω', 2255), ('εἰς', 1766)])], 'lex_dom': [('✅ Lexical domain according to Semantic Dictionary of Biblical Greek, SDBG (not present everywhere?)', 'string', [('092004', 26322), ('', 10487), ('089017', 4370), ('093001', 3672), ('033006', 3225), ('012001', 3000), ('089015', 2810), ('092003', 2444), ('069002', 1857), ('092001', 1846)])], 'ln': [('✅ Lauw-Nida lexical classification (not present everywhere?)', 'string', [('92.24', 19781), ('', 10488), ('92.11', 4718), ('89.92', 2903), ('89.87', 2756), ('33.69', 2336), ('69.3', 1736), ('92.1', 1732), ('92.7', 1494), ('12.1', 1247)])], 'markafter': [('🆗 Text critical marker after word', 'string', [('', 137728), ('—', 31), (')', 11), (']]', 7), ('(', 1), (']', 1)])], 'markbefore': [('🆗 Text critical marker before word', 'string', [('', 137745), ('—', 16), ('(', 10), ('[[', 7), ('[', 1)])], 'markorder': [(' Order of punctuation and text critical marker', 'string', [('', 137694), ('0', 34), ('3', 32), ('2', 10), ('1', 9)])], 'monad': [('✅ Monad (smallest token matching word order in the corpus)', 'integer', [(1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)])], 'mood': [('✅ Gramatical mood of the verb (passive, etc)', 'string', [('', 109422), ('indicative', 15617), ('participle', 6653), ('infinitive', 2285), ('imperative', 1877), ('subjunctive', 1856), ('optative', 69)])], 'morph': [('✅ Morphological tag (Sandborg-Petersen morphology)', 'string', [('CONJ', 16316), ('PREP', 10568), ('ADV', 3808), ('N-NSM', 3475), ('N-GSM', 2935), ('T-NSM', 2905), ('N-ASF', 2870), ('PRT-N', 2701), ('N-ASM', 2456), ('V-PAI-3S', 2271)])], 'nodeID': [('✅ Node ID (as in the XML source data)', 'string', [('', 52046), ('common', 14186), ('personal', 6040), ('proper', 2192), ('relative', 885), ('demonstrative', 755), ('verb', 436), ('noun', 391), ('det', 359), ('interrogative', 341)])], 'normalized': [('✅ Surface word with accents normalized and trailing punctuations removed', 'string', [('καί', 8576), ('ὁ', 2769), ('δέ', 2764), ('ἐν', 2684), ('τοῦ', 2497), ('εἰς', 1755), ('τό', 1664), ('τόν', 1562), ('τήν', 1523), ('αὐτοῦ', 1411)])], 'nu': [('✅ Gramatical number (Singular, Plural)', 'string', [('singular', 69846), ('', 38842), ('plural', 29091)])], 'number': [('✅ Gramatical number of the verb (e.g. singular, plural)', 'string', [('singular', 69846), ('', 38842), ('plural', 29091)])], 'otype': [('No feature description', 'string', [('singular', 69846), ('', 38842), ('plural', 29091)])], 'person': [('✅ Gramatical person of the verb (first, second, third)', 'string', [('', 118360), ('third', 12747), ('second', 3729), ('first', 2943)])], 'punctuation': [('✅ Punctuation after word', 'string', [('', 119270), (',', 9462), ('.', 5717), ('·', 2359), (';', 971)])], 'ref': [('✅ Value of the ref ID (taken from XML sourcedata)', 'string', [('1CO 10:1!1', 1), ('1CO 10:1!10', 1), ('1CO 10:1!11', 1), ('1CO 10:1!12', 1), ('1CO 10:1!13', 1), ('1CO 10:1!14', 1), ('1CO 10:1!15', 1), ('1CO 10:1!16', 1), ('1CO 10:1!17', 1), ('1CO 10:1!18', 1)])], 'reference': [('✅ Reference (to nodeID in XML source data, not yet post-processes)', 'string', [('1CO 10:1!1', 1), ('1CO 10:1!10', 1), ('1CO 10:1!11', 1), ('1CO 10:1!12', 1), ('1CO 10:1!13', 1), ('1CO 10:1!14', 1), ('1CO 10:1!15', 1), ('1CO 10:1!16', 1), ('1CO 10:1!17', 1), ('1CO 10:1!18', 1)])], 'roleclausedistance': [('⚠️ Distance to the wordgroup defining the syntactical role of this word', 'string', [('0', 56129), ('1', 37597), ('2', 22297), ('3', 12084), ('4', 5277), ('5', 2527), ('6', 1041), ('7', 412), ('8', 141), ('9', 64)])], 'sentence': [('✅ Sentence number (counted per chapter)', 'integer', [(3, 1130), (4, 987), (1, 810), (5, 774), (6, 707), (10, 703), (7, 694), (11, 688), (15, 636), (9, 632)])], 'sp': [('✅ Part of Speech (abbreviated)', 'string', [('noun', 28455), ('verb', 28357), ('det', 19786), ('conj', 18227), ('pron', 16177), ('prep', 10914), ('adj', 8452), ('adv', 6147), ('ptcl', 773), ('num', 476)])], 'sp_full': [('✅ Part of Speech (long description)', 'string', [('Noun', 28455), ('Verb', 28357), ('Determiner', 19786), ('Conjunction', 18227), ('Pronoun', 16177), ('Preposition', 10914), ('Adjective', 8452), ('', 6147), ('Particle', 773), ('Numeral', 476)])], 'strongs': [('✅ Strongs number', 'string', [('3588', 19783), ('2532', 8978), ('846', 5561), ('4771', 2892), ('1161', 2787), ('1722', 2743), ('1473', 2567), ('1510', 2457), ('3004', 2255), ('1519', 1766)])], 'subj_ref': [('🆗 Subject reference (to nodeID in XML source data, not yet post-processes)', 'string', [('', 121204), ('n46003022002', 172), ('n66001009002', 131), ('n45001001001', 104), ('n47010001004', 104), ('n41006030007', 95), ('n50001001001', 92), ('n40005001015', 78), ('n49003001013', 73), ('n51001002007', 71)])], 'tense': [('✅ Gramatical tense of the verb (e.g. Present, Aorist)', 'string', [('', 109422), ('aorist', 11803), ('present', 11579), ('imperfect', 1689), ('future', 1626), ('perfect', 1572), ('pluperfect', 88)])], 'type': [('✅ Gramatical type  of noun or pronoun (e.g. Common, Personal)', 'string', [('', 93321), ('common', 23644), ('personal', 11521), ('proper', 4639), ('demonstrative', 1722), ('relative', 1674), ('interrogative', 633), ('indefinite', 552), ('possessive', 70), ('adverbial', 3)])], 'unicode': [('✅ Word as it apears in the text in Unicode (incl. punctuations)', 'string', [('καὶ', 8541), ('ὁ', 2768), ('ἐν', 2683), ('δὲ', 2619), ('τοῦ', 2497), ('εἰς', 1755), ('τὸ', 1657), ('τὸν', 1556), ('τὴν', 1518), ('τῆς', 1300)])], 'verse': [('✅ Verse number inside chapter', 'integer', [(10, 4928), (12, 4910), (4, 4800), (9, 4800), (1, 4793), (3, 4768), (5, 4757), (11, 4734), (8, 4727), (2, 4723)])], 'voice': [('✅ Gramatical voice of the verb (e.g. active,passive)', 'string', [('', 109422), ('active', 20742), ('passive', 3493), ('middle', 2408), ('middlepassive', 1714)])], 'wgclass': [('✅ Class of the wordgroup (e.g. cl, np, vp)', 'string', [('np', 33710), ('cl', 30857), ('cl*', 16378), ('', 12760), ('pp', 11169), ('vp', 207), ('adjp', 168), ('advp', 166), ('adv', 7), ('nump', 7)])], 'wglevel': [('🆗 Number of the parent wordgroups for a wordgroup', 'integer', [(5, 16862), (4, 16527), (6, 15520), (7, 12162), (3, 10442), (8, 9027), (2, 8011), (9, 6135), (10, 4005), (11, 2483)])], 'wgnum': [('✅ Wordgroup number (counted per book)', 'integer', [(2, 27), (3, 27), (4, 27), (5, 27), (6, 27), (7, 27), (8, 27), (11, 27), (12, 27), (13, 27)])], 'wgrole': [('✅ Syntactical role of the wordgroup (abbreviated)', 'string', [('', 69235), ('adv', 16710), ('o', 9329), ('s', 6710), ('p', 1770), ('io', 702), ('v', 405), ('aux', 360), ('o2', 171), ('topic', 25)])], 'wgrolelong': [('✅ Syntactical role of the wordgroup (full)', 'string', [('', 69263), ('Adverbial', 16710), ('Object', 9329), ('Subject', 6710), ('Predicate', 1770), ('Indirect Object', 702), ('Verbal', 405), ('Auxiliar', 360), ('Second Object', 171), ('Verbal Copula', 10)])], 'wgrule': [('✅ Wordgroup rule information (e.g. Np-Appos, ClCl2, PrepNp)', 'string', [('DetNP', 15696), ('', 14701), ('PrepNp', 11044), ('NPofNP', 6819), ('Conj-CL', 5571), ('CLaCL', 3668), ('sub-CL', 3114), ('V2CL', 2753), ('V-O', 2660), ('DetCL', 2011)])], 'wgtype': [('✅ Wordgroup type details (e.g. group, apposition)', 'string', [('', 92932), ('group', 9699), ('apposition', 2799)])], 'word': [('✅ Word as it appears in the text (excl. punctuations)', 'string', [('καὶ', 8545), ('ὁ', 2769), ('ἐν', 2684), ('δὲ', 2620), ('τοῦ', 2497), ('εἰς', 1755), ('τὸ', 1658), ('τὸν', 1556), ('τὴν', 1518), ('αὐτοῦ', 1411)])], 'wordlevel': [('🆗 Number of the parent wordgroups for a word', 'string', [('6', 21857), ('7', 20984), ('5', 20538), ('8', 16755), ('9', 12772), ('4', 12110), ('10', 8756), ('3', 8537), ('11', 5797), ('12', 3571)])], 'wordrole': [('✅ Syntactical role of the word (abbreviated)', 'string', [('adv', 41598), ('v', 25817), ('s', 22908), ('o', 21929), ('', 9347), ('p', 7474), ('io', 3948), ('vc', 2603), ('aux', 1654), ('o2', 501)])], 'wordrolelong': [('✅ Syntactical role of the word (full)', 'string', [('Adverbial', 41598), ('Verbal', 25817), ('Subject', 22908), ('Object', 21929), ('', 9347), ('Predicate', 7474), ('Indirect Object', 3948), ('Verbal Copula', 2603), ('Auxiliar', 1654), ('Second Object', 501)])], 'wordtranslit': [('🆗 Transliteration of the text (in latin letters, excl. punctuations)', 'string', [('kai', 8576), ('en', 3152), ('o', 3149), ('to', 2885), ('de', 2769), ('ton', 2763), ('tou', 2497), ('eis', 1851), ('ten', 1523), ('auton', 1514)])], 'wordunacc': [('✅ Word without accents (excl. punctuations)', 'string', [('και', 8576), ('ο', 3019), ('δε', 2764), ('εν', 2752), ('του', 2497), ('εις', 1851), ('το', 1664), ('τον', 1562), ('την', 1523), ('αυτου', 1411)])]}
finished

5 - Create the pages ¶

5.1 - Create set of feature pages ¶

Back to TOC ¶

In [9]:

import markdown2
import os

filesCreated=0
for feature in featureDict:
    # prepare the data
    featureName = feature
    nodeList = ''
    featureValues=''
    for node in featureDict[feature]:
        nodeList += f' <A HREF=\"featurebynodetype.md#{node}\">`{node}`</A>'
    featureDescription, featureType, valueFreq = featureMetaDict[feature][0]

    featureValues="Value|Frequency|\n---|---|\n"
    for value, freq in valueFreq:
        if value=='':
           featureValues+=f"empty |{freq}|\n"
        else:
           featureValues+=f"`{value}` | {freq} |\n"
    featureValues+="Note: only the first 10 items are shown"
    
    # define the template for the feature description pages
    FeaturePageTemplate = f"{pageTitle}\n#Feature: {featureName}\nData type|Available for node types|\n---|---|\n`{featureType}` |{nodeList}|\n## Description\n{featureDescription}\n## Values\n{featureValues}\n"

    # create the feature file
    FeaturePageContent = FeaturePageTemplate.format(featureName=feature, featureType=featureType, nodeList=nodeList)

    # Convert the plain text to Markdown
    markdown_content = markdown2.markdown(FeaturePageContent, extras=['tables'])

    # set up path to location to store the resulting file
    fileName = os.path.join(resultLocation, f"{feature}.md")

    try:
        with open(fileName, "w", encoding="utf-8") as file:
            file.write(markdown_content)
            filesCreated+=1
            # Write the Markdown content to a file
            if verbose: print(f"Markdown content written to {fileName}")
    except Exception as e:
        print(f"Error writing to file {fileName} (please create directory \'{resultLocation}\' first)")
        break
if filesCreated!=0: print(f'finished (writing {filesCreated} files)') 

Markdown content written to after.md
Markdown content written to book.md
Markdown content written to booknumber.md
Markdown content written to bookshort.md
Markdown content written to case.md
Markdown content written to chapter.md
Markdown content written to clausetype.md
Markdown content written to containedclause.md
Markdown content written to degree.md
Markdown content written to gloss.md
Markdown content written to gn.md
Markdown content written to headverse.md
Markdown content written to junction.md
Markdown content written to lemma.md
Markdown content written to lex_dom.md
Markdown content written to ln.md
Markdown content written to markafter.md
Markdown content written to markbefore.md
Markdown content written to markorder.md
Markdown content written to monad.md
Markdown content written to mood.md
Markdown content written to morph.md
Markdown content written to nodeID.md
Markdown content written to normalized.md
Markdown content written to nu.md
Markdown content written to number.md
Markdown content written to otype.md
Markdown content written to person.md
Markdown content written to punctuation.md
Markdown content written to ref.md
Markdown content written to reference.md
Markdown content written to roleclausedistance.md
Markdown content written to sentence.md
Markdown content written to sp.md
Markdown content written to sp_full.md
Markdown content written to strongs.md
Markdown content written to subj_ref.md
Markdown content written to tense.md
Markdown content written to type.md
Markdown content written to unicode.md
Markdown content written to verse.md
Markdown content written to voice.md
Markdown content written to wgclass.md
Markdown content written to wglevel.md
Markdown content written to wgnum.md
Markdown content written to wgrole.md
Markdown content written to wgrolelong.md
Markdown content written to wgrule.md
Markdown content written to wgtype.md
Markdown content written to word.md
Markdown content written to wordlevel.md
Markdown content written to wordrole.md
Markdown content written to wordrolelong.md
Markdown content written to wordtranslit.md
Markdown content written to wordunacc.md
finished (writing 55 files)

5.2 - Create overview page ¶

Back to TOC ¶

In [10]:

overviewPage = f"{pageTitle}\n#Features per node type\n"

# Iterate over node types
for NodeType in F.otype.all:
    # Initialize an empty list to store keys
    FeaturesWithNodeType = []
    # Check each set in featureDict for the presence of this nodetype
    for feature, value_set in featureDict.items():
        if NodeType in value_set:
             FeaturesWithNodeType.append(feature)
    NodeItemText=f"##{NodeType}\nFeature|Datatype|Description|Examples\n|---|---|---|---|\n"
    for item in FeaturesWithNodeType:
        featureDescription =featureMetaDict[item][0][0]
        DataType="`"+featureMetaDict[item][0][1]+"` "
        #Get some example values
        FoundItems=0
        valueExamples=''
        for value, freq in featureMetaDict[item][0][2]:
           FoundItems+=1
           valueExamples+='`'+str(value)+'` '
           if FoundItems==2: break
        NodeItemText+=f'<A HREF=\"{item}.md#readme\">{item}</A>| {DataType} | {featureDescription} | {valueExamples} \n'
    overviewPage+=NodeItemText
    

# create the feature overview file
# Convert the plain text to Markdown
markdown_content = markdown2.markdown(overviewPage, extras=['tables'])
    
# set up path to location to store the resulting file
fileName = os.path.join(resultLocation, "featurebynodetype.md")
try:
    with open(fileName, "w", encoding="utf-8") as file:
        file.write(markdown_content)
        filesCreated+=1
        # Write the Markdown content to a file
        if verbose: print(f"Markdown content written to {fileName}")
        print('Overview page created successfully')
except Exception as e:
    print(f"Error writing to file {fileName} (please create directory \'{resultLocation}\' first)")

Markdown content written to featurebynodetype.md
Overview page created successfully

6 - License ¶

Back to TOC ¶

Licenced under Creative Commons Attribution 4.0 International (CC BY 4.0)

automatic creation of feature documentation for existing Text-Fabric datasets¶

Table of content ¶

1 - Introduction ¶

2. Setting up the environment¶

3 - Load Text-Fabric data ¶

4 - Creation of the dataset¶

4.1 - Setting up some production values¶

4.2 - Store data in dictionaries¶

4.2.1 - Get node types and their node ranges¶

4.2.2 - Determine which node types have specific features¶

4.2.3 - Create dictionairy with description and valuefrequency per feature¶

5 - Create the pages¶

5.1 - Create set of feature pages¶

5.2 - Create overview page¶

6 - License¶

2. Setting up the environment ¶

4 - Creation of the dataset ¶

4.1 - Setting up some production values ¶

4.2 - Store data in dictionaries ¶

4.2.1 - Get node types and their node ranges ¶

4.2.2 - Determine which node types have specific features ¶

4.2.3 - Create dictionairy with description and valuefrequency per feature ¶

5 - Create the pages ¶

5.1 - Create set of feature pages ¶

5.2 - Create overview page ¶

6 - License ¶