Book names (multilingual)

This notebook adds multilingual book names to a BHSA dataset in Text-Fabric format.

Discussion

We add the features book@iso, where iso is an ISO 639 language code: usually the two-letter code of a modern language, although Syriac uses the three-letter code syc. We use a source file, blang.py, that contains the names of the books of the Bible in around 25 languages; most big languages are covered. The run recorded below produces 26 such features, Latin included. This data has been gleaned mostly from Wikipedia.

We assume that the dataset has the book feature present, holding Latin book names.

This program works for all datasets and versions that have this feature with the intended meaning.
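
As a rough illustration of the naming scheme, the sketch below builds `book@<code>` feature names from a small stand-in for the tables in blang.py. The shapes of the two dictionaries (language code mapped to an English and a native language name, and language code mapped to book names in canonical order) are inferred from how this notebook uses `bookLangs` and `bookNames`; the sample entries and the helper `featureName` are ours, not taken from the source file.

```python
# Stand-ins for the tables in blang.py; the shapes are an assumption
# based on how the notebook iterates over bookLangs and bookNames.
sampleBookLangs = {
    "en": ("english", "English"),
    "fr": ("french", "Français"),
}
sampleBookNames = {
    "en": ("Genesis", "Exodus"),
    "fr": ("Genèse", "Exode"),
}


def featureName(langCode):
    """Return the feature name for a language code (hypothetical helper)."""
    return "book@{}".format(langCode)


featureNames = sorted(featureName(code) for code in sampleBookLangs)
syriacFeature = featureName("syc")  # three-letter codes work the same way

print(featureNames)
print(syriacFeature)
```

The feature name is nothing more than the prefix `book@` followed by the language code, so the length of the code does not matter.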

In [1]:

import os
import sys
import utils
from tf.fabric import Fabric
from blang import bookLangs, bookNames

Pipeline

See operation for how to run this script in the pipeline.

In [2]:

if "SCRIPT" not in locals():
    SCRIPT = False
    FORCE = True
    CORE_NAME = "bhsa"
    VERSION = "2021"


def stop(good=False):
    if SCRIPT:
        sys.exit(0 if good else 1)
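
The guard in the cell above lets the same code run interactively, with defaults, or under a driver that presets the parameters. Here is a minimal sketch of that idea, assuming a hypothetical driver that executes the cell source with `exec` in a prepared namespace. The function `runCell` and the preset values are illustrative; the mechanism the real pipeline uses may differ.

```python
# Source of the parameter cell, executed in a namespace we control.
CELL_SOURCE = '''
if "SCRIPT" not in locals():
    SCRIPT = False
    FORCE = True
    CORE_NAME = "bhsa"
    VERSION = "2021"
'''


def runCell(preset=None):
    """Execute the cell in a fresh namespace, optionally pre-filled (hypothetical driver)."""
    namespace = dict(preset or {})
    # With a single dict, exec uses it for globals and locals alike,
    # so locals() inside the cell sees the preset names.
    exec(CELL_SOURCE, namespace)
    return {k: namespace[k] for k in ("SCRIPT", "FORCE", "CORE_NAME", "VERSION")}


interactive = runCell()
scripted = runCell(
    {"SCRIPT": True, "FORCE": False, "CORE_NAME": "bhsa", "VERSION": "2017"}
)

print(interactive)
print(scripted)
```

When `SCRIPT` is already defined, the defaults are skipped entirely, so a driver has to supply all four parameters, not just `SCRIPT`.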

Setting up the context: source file and target directories

The conversion is executed in an environment of directories, so that sources, temp files and results are in convenient places and do not have to be shifted around.

In [3]:

repoBase = os.path.expanduser("~/github/etcbc")
thisRepo = "{}/{}".format(repoBase, CORE_NAME)

thisTemp = "{}/_temp/{}".format(thisRepo, VERSION)
thisTempTf = "{}/tf".format(thisTemp)

thisTf = "{}/tf/{}".format(thisRepo, VERSION)
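
The directory layout can be sketched as a small function, which makes it easy to inspect the paths for another dataset or version. This is an illustration only: the helper `contextDirs` and the base directory `/data/etcbc` are made up for the example, and it uses `os.path.join` where the cell above uses string formatting.

```python
import os


def contextDirs(repoBase, coreName, version):
    """Return the working directories for one dataset version (hypothetical helper)."""
    thisRepo = os.path.join(repoBase, coreName)
    thisTemp = os.path.join(thisRepo, "_temp", version)
    return dict(
        thisRepo=thisRepo,
        thisTemp=thisTemp,
        thisTempTf=os.path.join(thisTemp, "tf"),  # freshly generated features
        thisTf=os.path.join(thisRepo, "tf", version),  # final destination
    )


dirs = contextDirs("/data/etcbc", "bhsa", "2021")
for (name, path) in dirs.items():
    print("{:<10} {}".format(name, path))
```

New features are written under `thisTempTf` first and reach `thisTf` only in the delivery step.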

Collect

We collect the book names.

In [4]:

utils.caption(4, "Book names")

metaData = {
    "": dict(
        dataset="BHSA",
        version=VERSION,
        datasetName="Biblia Hebraica Stuttgartensia Amstelodamensis",
        author="Eep Talstra Centre for Bible and Computer",
        provenance="book names from wikipedia and other sources",
        encoders="Dirk Roorda (TF)",
        website="https://shebanq.ancient-data.org",
        email="shebanq@ancient-data.org",
    ),
}

for (langCode, (langEnglish, langName)) in bookLangs.items():
    metaData["book@{}".format(langCode)] = {
        "valueType": "str",
        "language": langName,
        "languageCode": langCode,
        "languageEnglish": langEnglish,
    }

newFeatures = sorted(m for m in metaData if m != "")
newFeaturesStr = " ".join(newFeatures)

utils.caption(0, "{} languages ...".format(len(newFeatures)))

..............................................................................................
.       0.00s Book names                                                                     .
..............................................................................................
|       0.00s 26 languages ...

Test

Check whether this conversion is needed in the first place. This check is performed only when the notebook is run as a script.

In [5]:

if SCRIPT:
    (good, work) = utils.mustRun(
        None, "{}/.tf/{}.tfx".format(thisTf, newFeatures[0]), force=FORCE
    )
    if not good:
        stop(good=False)
    if not work:
        stop(good=True)
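
The helper `utils.mustRun` is not shown in this notebook. To make the `(good, work)` pair concrete, here is a sketch of how such a check might work, assuming it compares the existence and modification times of a source and a target file. The function `needsRun` is our own stand-in and not the code in `utils`.

```python
import os
import tempfile


def needsRun(fileIn, fileOut, force=False):
    """Return (good, work): can we proceed, and is there work to do? (hypothetical)"""
    if fileIn is not None and not os.path.exists(fileIn):
        return (False, False)  # the source is missing: cannot proceed
    if force or not os.path.exists(fileOut):
        return (True, True)  # forced, or the target has never been built
    if fileIn is not None and os.path.getmtime(fileIn) > os.path.getmtime(fileOut):
        return (True, True)  # the source is newer than the target
    return (True, False)  # the target is up to date


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "book@am.tfx")
    firstCheck = needsRun(None, target)  # target absent
    open(target, "w").close()
    secondCheck = needsRun(None, target)  # target present, no source to compare
    forcedCheck = needsRun(None, target, force=True)

print(firstCheck, secondCheck, forcedCheck)
```

In the cell above, `good=False` stops the script with an error status, while `work=False` stops it with a success status because nothing needs to be done.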

Load existing data

In [6]:

utils.caption(4, "Loading relevant features")

TF = Fabric(locations=thisTf, modules=[""])
api = TF.load("book")
api.makeAvailableIn(globals())

nodeFeatures = {}
nodeFeatures["book@la"] = {}

bookNodes = []
for b in F.otype.s("book"):
    bookNodes.append(b)
    nodeFeatures["book@la"][b] = F.book.v(b)

for (langCode, langBookNames) in bookNames.items():
    nodeFeatures["book@{}".format(langCode)] = dict(zip(bookNodes, langBookNames))
utils.caption(0, "{} book name features created".format(len(nodeFeatures)))

..............................................................................................
.       4.78s Loading relevant features                                                      .
..............................................................................................
This is Text-Fabric 8.5.13
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

88 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  3.58s All features loaded/computed - for details use loadLog()
|       8.37s 26 book name features created
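
The cell above pairs book nodes with names by position, using `zip`. Since `zip` silently stops at the shorter sequence, a list of names that is one book short would leave the last book without a name and raise no error. A minimal sketch of a guard against that follows; the helper `namesByNode` and the node numbers are invented for the example.

```python
def namesByNode(bookNodes, langBookNames, langCode):
    """Map book nodes to names, refusing lists of unequal length (hypothetical helper)."""
    if len(bookNodes) != len(langBookNames):
        raise ValueError(
            "book@{}: {} names for {} book nodes".format(
                langCode, len(langBookNames), len(bookNodes)
            )
        )
    return dict(zip(bookNodes, langBookNames))


toyNodes = [101, 102, 103]  # made-up node numbers, not real BHSA nodes
mapping = namesByNode(toyNodes, ("Genesis", "Exodus", "Leviticus"), "en")

try:
    namesByNode(toyNodes, ("Genesis", "Exodus"), "xx")
    mismatchCaught = False
except ValueError as e:
    mismatchCaught = True
    print(e)

print(mapping)
```

The pairing also relies on the names being listed in the same order as the book nodes that `F.otype.s("book")` delivers.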

Write new features

In [7]:

utils.caption(4, "Write book name features as TF")
TF = Fabric(locations=thisTempTf, silent=True)
TF.save(nodeFeatures=nodeFeatures, edgeFeatures={}, metaData=metaData)

..............................................................................................
.         18s Write book name features as TF                                                 .
..............................................................................................

Out[7]:

True

Diffs

Check differences with previous versions.

In [8]:

utils.checkDiffs(thisTempTf, thisTf, only=set(newFeatures))

..............................................................................................
.         21s Check differences with previous version                                        .
..............................................................................................
|         21s 	26 features to add
|         21s 		book@am
|         21s 		book@ar
|         21s 		book@bn
|         21s 		book@da
|         21s 		book@de
|         21s 		book@el
|         21s 		book@en
|         21s 		book@es
|         21s 		book@fa
|         21s 		book@fr
|         21s 		book@he
|         21s 		book@hi
|         21s 		book@id
|         21s 		book@ja
|         21s 		book@ko
|         21s 		book@la
|         21s 		book@nl
|         21s 		book@pa
|         21s 		book@pt
|         21s 		book@ru
|         21s 		book@sw
|         21s 		book@syc
|         21s 		book@tr
|         21s 		book@ur
|         21s 		book@yo
|         21s 		book@zh
|         21s 	no features to delete
|         21s 	0 features in common
|         21s Done
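
The three counts in this report (features to add, to delete, and in common) amount to set arithmetic on feature names. The sketch below shows only that part, under the assumption that the comparison is restricted to the names passed in `only`. The function `summarizeFeatures` is hypothetical; the real `utils.checkDiffs` may do more, such as comparing the contents of features that exist on both sides.

```python
def summarizeFeatures(newNames, existingNames, only=None):
    """Classify feature names as to add, to delete, or in common (hypothetical)."""
    new = set(newNames)
    existing = set(existingNames)
    if only is not None:
        # Ignore features that this conversion step is not responsible for.
        new &= set(only)
        existing &= set(only)
    return dict(
        toAdd=sorted(new - existing),
        toDelete=sorted(existing - new),
        common=sorted(new & existing),
    )


summary = summarizeFeatures(
    newNames={"book@en", "book@nl"},
    existingNames={"book@nl", "otype"},
    only={"book@en", "book@nl"},
)
print(summary)
```

In the run above all 26 features are new, so the report lists 26 features to add and none in common.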

Deliver

Copy the new Text-Fabric features from the temporary location where they have been created to their final destination.

In [9]:

utils.deliverFeatures(thisTempTf, thisTf, newFeatures)

..............................................................................................
.         23s Deliver features to /Users/dirk/github/etcbc/bhsa/tf/2021                      .
..............................................................................................
|         23s 	book@am
|         23s 	book@ar
|         23s 	book@bn
|         23s 	book@da
|         23s 	book@de
|         23s 	book@el
|         23s 	book@en
|         23s 	book@es
|         23s 	book@fa
|         23s 	book@fr
|         23s 	book@he
|         23s 	book@hi
|         23s 	book@id
|         23s 	book@ja
|         23s 	book@ko
|         23s 	book@la
|         23s 	book@nl
|         23s 	book@pa
|         23s 	book@pt
|         23s 	book@ru
|         23s 	book@sw
|         23s 	book@syc
|         23s 	book@tr
|         23s 	book@ur
|         23s 	book@yo
|         23s 	book@zh
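
Delivery boils down to copying one `.tf` file per feature from the temporary directory to the final one. A minimal sketch of such a copy step, run here on throwaway directories, is given below. The function `deliver` is our own illustration and not the implementation of `utils.deliverFeatures`.

```python
import os
import shutil
import tempfile


def deliver(srcDir, dstDir, featureNames):
    """Copy the .tf file of each named feature from srcDir to dstDir (hypothetical)."""
    os.makedirs(dstDir, exist_ok=True)
    delivered = []
    for name in featureNames:
        fileName = "{}.tf".format(name)
        shutil.copy(os.path.join(srcDir, fileName), os.path.join(dstDir, fileName))
        delivered.append(name)
    return delivered


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "_temp")
    dst = os.path.join(tmp, "tf")
    os.makedirs(src)
    # Create two placeholder feature files to copy.
    for name in ("book@en", "book@nl"):
        with open(os.path.join(src, name + ".tf"), "w", encoding="utf8") as fh:
            fh.write("@node\n")
    delivered = deliver(src, dst, ["book@en", "book@nl"])
    deliveredFiles = sorted(os.listdir(dst))

print(deliveredFiles)
```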

Compile TF

In [10]:

utils.caption(4, "Load and compile the new TF features")

TF = Fabric(locations=thisTf, modules=[""])
api = TF.load("")
api.makeAvailableIn(globals())

..............................................................................................
.         27s Load and compile the new TF features                                           .
..............................................................................................
This is Text-Fabric 8.5.13
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

114 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.00s T book@ko              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@fr              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@zh              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@en              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@ja              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@syc             from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@he              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@es              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@id              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@bn              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@yo              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@la              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@da              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@ru              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@pt              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@tr              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@ur              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@hi              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@de              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@ar              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@el              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@pa              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@sw              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@fa              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@nl              from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T book@am              from ~/github/etcbc/bhsa/tf/2021
  3.62s All features loaded/computed - for details use loadLog()

Out[10]:

[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

Examples

In [11]:

utils.caption(4, "Genesis in all languages")
genesisNode = F.otype.s("book")[0]

for (lang, langInfo) in sorted(T.languages.items()):
    language = langInfo["language"]
    langEng = langInfo["languageEnglish"]
    book = T.sectionFromNode(genesisNode, lang=lang)[0]
    utils.caption(
        0,
        "{:<2} = {:<20} Genesis is {:<20} in {:<20}".format(
            lang, langEng, book, language
        ),
    )

utils.caption(0, "Done")

..............................................................................................
.         33s Genesis in all languages                                                       .
..............................................................................................
|         33s    = default              Genesis is Genesis              in default             
|         33s am = amharic              Genesis is ኦሪት_ዘፍጥረት            in ኣማርኛ                
|         33s ar = arabic               Genesis is تكوين                in العَرَبِية          
|         33s bn = bengali              Genesis is আদিপুস্তক            in বাংলা               
|         33s da = danish               Genesis is 1.Mosebog            in Dansk               
|         33s de = german               Genesis is Genesis              in Deutsch             
|         33s el = greek                Genesis is Γένεση               in Ελληνικά            
|         33s en = english              Genesis is Genesis              in English             
|         33s es = spanish              Genesis is Génesis              in Español             
|         33s fa = farsi                Genesis is پيدايش               in فارسی               
|         33s fr = french               Genesis is Genèse               in Français            
|         33s he = hebrew               Genesis is בראשית               in עברית               
|         33s hi = hindi                Genesis is उत्पाति              in हिन्दी              
|         33s id = indonesian           Genesis is Kejadian             in Bahasa Indonesia    
|         33s ja = japanese             Genesis is 創世記                  in 日本語                 
|         33s ko = korean               Genesis is 창세기                  in 한국어                 
|         33s la = latin                Genesis is Genesis              in Latina              
|         33s nl = dutch                Genesis is Genesis              in Nederlands          
|         33s pa = punjabi              Genesis is ਉਤਪਤ                 in ਪੰਜਾਬੀ              
|         33s pt = portuguese           Genesis is Gênesis              in Português           
|         33s ru = russian              Genesis is Бытия                in Русский             
|         33s sw = swahili              Genesis is Mwanzo               in Kiswahili           
|         33s syc = syriac               Genesis is ܒܪܝܬܐ                in ܠܫܢܐ ܣܘܪܝܝܐ         
|         33s tr = turkish              Genesis is Yaratılış            in Türkçe              
|         33s ur = urdu                 Genesis is پیدائش               in اُردُو              
|         33s yo = yoruba               Genesis is Genesisi             in èdè Yorùbá          
|         33s zh = chinese              Genesis is 创世记                  in 中文                  
|         33s Done
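
In the table above the language code is formatted with a fixed width of 2, so the three-letter code `syc` shifts its row one position to the right. One way to avoid this is to derive the column width from the longest code, as in the sketch below. The function `formatRows` and the sample rows are illustrative, and the sketch does not address scripts whose characters are displayed wider or narrower than one column.

```python
def formatRows(rows):
    """Format (code, englishName, bookName) rows with a code column that fits all codes."""
    width = max(len(code) for (code, _, _) in rows)
    return [
        "{:<{w}} = {:<12} Genesis is {}".format(code, eng, name, w=width)
        for (code, eng, name) in rows
    ]


sampleRows = [
    ("", "default", "Genesis"),
    ("en", "english", "Genesis"),
    ("syc", "syriac", "ܒܪܝܬܐ"),
]
lines = formatRows(sampleRows)
for line in lines:
    print(line)
```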
