This notebook adds statistical features to a BHSA dataset in Text-Fabric format.
We add the features freq_occ, freq_lex, rank_occ, and rank_lex.
We assume that the dataset has these features present:

- languageISO: for determining whether the word is Hebrew or Aramaic;
- g_cons: to get the word string in consonantal transcription;
- lex: to get the lexical identifier in consonantal transcription.

This program works for all datasets and versions that have these features with the intended meanings. The exact names of these features can be passed as parameters. Note that the old version 3 uses very different names for many features.
We do not identify lexemes or word occurrences across languages: if two occurrences or lexemes have the same string but are categorized as belonging to different languages, they are counted separately. Occurrences are grouped by their consonantal transcription, so two occurrences that differ only in pointing count as the same value. Lexemes are identified by the lex feature within each biblical language.
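The per-language grouping described above can be sketched with plain Python data structures. The language codes and lexeme strings below are invented for illustration; the point is that identical strings in different languages are counted separately.

```python
import collections

# Per-language frequency table: outer key is a language code, inner Counter
# maps a value (lexeme or occurrence string) to its count.
freqs = collections.defaultdict(collections.Counter)

observations = [
    ("hbo", "DBR"),  # Hebrew occurrence (invented data)
    ("hbo", "DBR"),
    ("arc", "DBR"),  # same string, but Aramaic: counted separately
]
for (lan, lex) in observations:
    freqs[lan][lex] += 1

# freqs["hbo"]["DBR"] == 2 and freqs["arc"]["DBR"] == 1
```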
import os
import sys
import collections
import utils
from tf.fabric import Fabric
if "SCRIPT" not in locals():
SCRIPT = False
FORCE = True
CORE_NAME = "bhsa"
VERSION = "2021"
LANG_FEATURE = "languageISO"
OCC_FEATURE = "g_cons"
LEX_FEATURE = "lex"
def stop(good=False):
if SCRIPT:
sys.exit(0 if good else 1)
The conversion runs in a fixed directory layout, so that sources, temp files, and results are in convenient places and do not have to be shifted around.
repoBase = os.path.expanduser("~/github/etcbc")
thisRepo = "{}/{}".format(repoBase, CORE_NAME)
thisTemp = "{}/_temp/{}".format(thisRepo, VERSION)
thisTempTf = "{}/tf".format(thisTemp)
thisTf = "{}/tf/{}".format(thisRepo, VERSION)
newFeaturesStr = """
freq_occ
freq_lex
rank_occ
rank_lex
"""
newFeatures = newFeaturesStr.strip().split()
Check whether this conversion is needed in the first place. Only when run as a script.
if SCRIPT:
(good, work) = utils.mustRun(
None, "{}/.tf/{}.tfx".format(thisTf, newFeatures[0]), force=FORCE
)
if not good:
stop(good=False)
if not work:
stop(good=True)
We collect the statistics.
utils.caption(4, "Loading relevant features")
TF = Fabric(locations=thisTf, modules=[""])
api = TF.load("{} {} {}".format(LANG_FEATURE, LEX_FEATURE, OCC_FEATURE))
api.makeAvailableIn(globals())
hasLex = "lex" in set(F.otype.all)
..............................................................................................
.     0.00s Loading relevant features                                                        .
..............................................................................................
This is Text-Fabric 9.0.2
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
114 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext: no structure functions in the T-API
  4.03s All features loaded/computed - for details use TF.isLoaded()
utils.caption(0, "Counting occurrences")
wstats = {
"freqs": {
"lex": collections.defaultdict(lambda: collections.Counter()),
"occ": collections.defaultdict(lambda: collections.Counter()),
},
"ranks": {
"lex": collections.defaultdict(lambda: {}),
"occ": collections.defaultdict(lambda: {}),
},
}
langs = set()
for w in F.otype.s("word"):
occ = Fs(OCC_FEATURE).v(w)
lex = Fs(LEX_FEATURE).v(w)
lan = Fs(LANG_FEATURE).v(w)
wstats["freqs"]["lex"][lan][lex] += 1
wstats["freqs"]["occ"][lan][occ] += 1
langs.add(lan)
for lan in langs:
for tp in ["lex", "occ"]:
rank = -1
prev_n = -1
amount = 1
for (x, n) in sorted(
wstats["freqs"][tp][lan].items(), key=lambda y: (-y[1], y[0])
):
if n == prev_n:
amount += 1
else:
rank += amount
amount = 1
prev_n = n
wstats["ranks"][tp][lan][x] = rank
| 5.66s Counting occurrences
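The ranking loop above assigns zero-based competition-style ranks: tied frequencies share a rank, and the next distinct frequency skips ahead by the number of tied items. Here is a self-contained sketch of the same logic on invented counts:

```python
import collections

# Invented frequency table: "a" and "b" are tied, so they share rank 0;
# "c" then jumps to rank 2, and "d" gets rank 3.
freqs = collections.Counter({"a": 5, "b": 5, "c": 2, "d": 1})

ranks = {}
rank = -1
prev_n = -1
amount = 1
# Sort by descending count, then by key, as in the loop above.
for (x, n) in sorted(freqs.items(), key=lambda y: (-y[1], y[0])):
    if n == prev_n:
        amount += 1          # another tie: keep the current rank
    else:
        rank += amount       # skip ahead past all tied items
        amount = 1
        prev_n = n
    ranks[x] = rank

# ranks == {"a": 0, "b": 0, "c": 2, "d": 3}
```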
utils.caption(0, "Making statistical features")
metaData = {
"": dict(
dataset="BHSA",
version=VERSION,
datasetName="Biblia Hebraica Stuttgartensia Amstelodamensis",
author="Eep Talstra Centre for Bible and Computer",
provenance="computed addition to core set of features",
encoders="Dirk Roorda (TF)",
website="https://shebanq.ancient-data.org",
email="shebanq@ancient-data.org",
),
}
nodeFeatures = {}
edgeFeatures = {}
for ft in newFeatures:
nodeFeatures[ft] = {}
metaData.setdefault(ft, {})["valueType"] = "int"
for w in F.otype.s("word"):
lan = Fs(LANG_FEATURE).v(w)
occ = Fs(OCC_FEATURE).v(w)
lex = Fs(LEX_FEATURE).v(w)
nodeFeatures["freq_occ"][w] = str(wstats["freqs"]["occ"][lan][occ])
nodeFeatures["rank_occ"][w] = str(wstats["ranks"]["occ"][lan][occ])
nodeFeatures["freq_lex"][w] = str(wstats["freqs"]["lex"][lan][lex])
nodeFeatures["rank_lex"][w] = str(wstats["ranks"]["lex"][lan][lex])
if hasLex:
for lx in F.otype.s("lex"):
firstOcc = L.d(lx, otype="word")[0]
nodeFeatures["freq_lex"][lx] = nodeFeatures["freq_lex"][firstOcc]
nodeFeatures["rank_lex"][lx] = nodeFeatures["rank_lex"][firstOcc]
| 20s Making statistical features
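The propagation of lexeme statistics to lex nodes can be illustrated without Text-Fabric. The node numbers below are invented, and lexWords stands in for L.d(lx, otype="word"): every occurrence of a lexeme carries the same freq_lex value, so copying from the first occurrence suffices.

```python
# Invented node numbers; lexWords stands in for L.d(lx, otype="word").
lexWords = {1000: [1, 5, 9]}           # lex node -> its word occurrences
freqLex = {1: "3", 5: "3", 9: "3"}     # word-level values (all occurrences agree)

# Copy the value of the first occurrence onto the lex node itself.
for (lx, words) in lexWords.items():
    freqLex[lx] = freqLex[words[0]]

# freqLex[1000] == "3"
```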
utils.caption(4, "Write statistical features as TF")
TF = Fabric(locations=thisTempTf, silent=True)
TF.save(nodeFeatures=nodeFeatures, edgeFeatures=edgeFeatures, metaData=metaData)
..............................................................................................
.       22s Write statistical features as TF                                                 .
..............................................................................................
True
Check differences with previous versions.
utils.checkDiffs(thisTempTf, thisTf, only=set(newFeatures))
..............................................................................................
.       26s Check differences with previous version                                          .
..............................................................................................
|      26s no features to add
|      26s no features to delete
|      26s 4 features in common
|      26s freq_lex ... no changes
|      27s freq_occ ... no changes
|      27s rank_lex ... no changes
|      27s rank_occ ... no changes
|      27s Done
Copy the new TF features from the temporary location where they have been created to their final destination.
utils.deliverFeatures(thisTempTf, thisTf, newFeatures)
..............................................................................................
.       29s Deliver features to /Users/dirk/github/etcbc/bhsa/tf/2021                        .
..............................................................................................
|      29s freq_occ
|      29s freq_lex
|      29s rank_occ
|      29s rank_lex
utils.caption(4, "Load and compile the new TF features")
TF = Fabric(locations=thisTf, modules=[""])
api = TF.load("{} {}".format(LEX_FEATURE, newFeaturesStr))
api.makeAvailableIn(globals())
..............................................................................................
.   11m 04s Load and compile the new TF features                                             .
..............................................................................................
This is Text-Fabric 9.0.2
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
114 features found and 0 ignored
lex freq_occ freq_lex rank_occ rank_lex
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext: no structure functions in the T-API
   |     0.62s T freq_lex from ~/github/etcbc/bhsa/tf/2021
   |     0.59s T freq_occ from ~/github/etcbc/bhsa/tf/2021
   |     0.60s T rank_lex from ~/github/etcbc/bhsa/tf/2021
   |     0.61s T rank_occ from ~/github/etcbc/bhsa/tf/2021
  6.22s All features loaded/computed - for details use TF.isLoaded()
[('Computed', 'computed-data', ('C Computed', 'Call AllComputeds', 'Cs ComputedString')), ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')), ('Fabric', 'loading', ('TF',)), ('Locality', 'locality', ('L Locality',)), ('Nodes', 'navigating-nodes', ('N Nodes',)), ('Features', 'node-features', ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')), ('Search', 'search', ('S Search',)), ('Text', 'text', ('T Text',))]
utils.caption(4, "Basic test")
mostFrequent = set()
topX = 10
lexIndex = {}
utils.caption(0, "Top {} frequent lexemes (computed on otype=word)".format(topX))
for w in sorted(F.otype.s("word"), key=lambda w: -F.freq_lex.v(w)):
lex = Fs(LEX_FEATURE).v(w)
mostFrequent.add(lex)
lexIndex[lex] = w
if len(mostFrequent) == topX:
break
mostFrequentWord = sorted((-F.freq_lex.v(lexIndex[lex]), lex) for lex in mostFrequent)
for (freq, lex) in mostFrequentWord:
utils.caption(0, "{:<10} {:>6}x".format(lex, -freq))
if hasLex:
    utils.caption(4, "Top {} frequent lexemes (computed on otype=lex)".format(topX))
    mostFrequentLex = sorted(
        (-F.freq_lex.v(lx), F.lex.v(lx)) for lx in F.otype.s("lex")
    )[0:topX]
for (freq, lex) in mostFrequentLex:
utils.caption(0, "{:<10} {:>6}x".format(lex, -freq))
if mostFrequentWord != mostFrequentLex:
utils.caption(
0, "\tWARNING: Mismatch in lexeme frequencies computed by lex vs by word"
)
else:
utils.caption(0, "\tINFO: Same lexeme frequencies computed by lex vs by word")
utils.caption(0, "Done")
..............................................................................................
.       42s Basic test                                                                       .
..............................................................................................
|      42s Top 10 frequent lexemes (computed on otype=word)
|      42s W           50272x
|      42s H           30386x
|      42s L           20069x
|      42s B           15542x
|      42s >T          10987x
|      42s MN           7562x
|      42s JHWH/        6828x
|      42s <L           5766x
|      42s >L           5517x
|      42s >CR          5500x
..............................................................................................
.       42s Top 10 frequent lexemes (computed on otype=lex)                                  .
..............................................................................................
|      42s W           50272x
|      42s H           30386x
|      42s L           20069x
|      42s B           15542x
|      42s >T          10987x
|      42s MN           7562x
|      42s JHWH/        6828x
|      42s <L           5766x
|      42s >L           5517x
|      42s >CR          5500x
|      42s INFO: Same lexeme frequencies computed by lex vs by word
|      42s Done
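The basic test walks words in order of descending lexeme frequency and collects the first topX distinct lexemes. The same idea, stripped of Text-Fabric, on invented (lexeme, frequency) pairs:

```python
# Invented (lexeme, frequency) pairs; duplicates mimic repeated occurrences.
words = [("W", 50272), ("W", 50272), ("H", 30386), ("L", 20069)]
topX = 2

mostFrequent = []
for (lex, freq) in sorted(words, key=lambda w: -w[1]):
    if lex not in mostFrequent:
        mostFrequent.append(lex)   # first sighting of this lexeme
    if len(mostFrequent) == topX:
        break                      # stop once topX distinct lexemes are seen

# mostFrequent == ["W", "H"]
```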
if SCRIPT:
stop(good=True)