This notebook adds statistical features to a BHSA dataset in Text-Fabric format.
We add the features freq_occ, freq_lex, rank_occ, and rank_lex.
We assume that the dataset has these features present:

languageISO: for determining whether a word is Hebrew or Aramaic;
g_cons: to get the word string in consonantal transcription;
lex: to get the lexical identifier in consonantal transcription.

This program works for all datasets and versions that have these features with the
intended meanings. The exact names of these features can be passed as parameters.
Note that the old version 3 uses very different names for many features.
We will not identify lexemes and word occurrences across languages: if two occurrences or lexemes exhibit the same string but are categorized as belonging to different languages, they are not identified with each other.

We group occurrences by their consonantal transcriptions, so if two occurrences differ only in pointing, we count them as two occurrences of the same value.

Lexemes are identified by the lex feature within a biblical language; lexemes are never identified across languages.
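The per-language grouping described above can be sketched with a nested counter: one Counter of consonantal strings per language code, so that identical strings in different languages are counted separately. The occurrence data and the language codes 'hbo' and 'arc' below are made-up illustrations, not taken from the dataset.

```python
import collections

# toy occurrences: (language, consonantal string) pairs; data is made up
occurrences = [
    ('hbo', 'MLK'), ('hbo', 'MLK'),  # two Hebrew occurrences, same consonants
    ('arc', 'MLK'),                  # same string in Aramaic: counted separately
]

# one Counter per language, created on demand
freqs = collections.defaultdict(collections.Counter)
for (lan, occ) in occurrences:
    freqs[lan][occ] += 1

print(freqs['hbo']['MLK'])  # 2
print(freqs['arc']['MLK'])  # 1
```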
import os, sys, re, collections

import utils
from tf.fabric import Fabric

if 'SCRIPT' not in locals():
    SCRIPT = False
    FORCE = True
    CORE_NAME = 'bhsa'
    VERSION = 'c'
    LANG_FEATURE = 'languageISO'
    OCC_FEATURE = 'g_cons'
    LEX_FEATURE = 'lex'

def stop(good=False):
    if SCRIPT:
        sys.exit(0 if good else 1)
The conversion is executed in an environment of directories, so that sources, temp files and results are in convenient places and do not have to be shifted around.
repoBase = os.path.expanduser('~/github/etcbc')
thisRepo = '{}/{}'.format(repoBase, CORE_NAME)
thisTemp = '{}/_temp/{}'.format(thisRepo, VERSION)
thisTempTf = '{}/tf'.format(thisTemp)
thisTf = '{}/tf/{}'.format(thisRepo, VERSION)
newFeaturesStr = '''
freq_occ
freq_lex
rank_occ
rank_lex
'''
newFeatures = newFeaturesStr.strip().split()
Check whether this conversion is needed in the first place. Only when run as a script.
if SCRIPT:
    (good, work) = utils.mustRun(None, '{}/.tf/{}.tfx'.format(thisTf, newFeatures[0]), force=FORCE)
    if not good:
        stop(good=False)
    if not work:
        stop(good=True)
We collect the statistics.
utils.caption(4, 'Loading relevant features')
TF = Fabric(locations=thisTf, modules=[''])
api = TF.load('{} {} {}'.format(LANG_FEATURE, LEX_FEATURE, OCC_FEATURE))
api.makeAvailableIn(globals())
hasLex = 'lex' in set(F.otype.all)
..............................................................................................
.     0.00s Loading relevant features                                                        .
..............................................................................................
This is Text-Fabric 3.0.2
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data
103 features found and 0 ignored
  0.00s loading features ...
   |     0.15s B g_cons               from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.15s B language             from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.16s B lex                  from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.00s Feature overview: 98 for nodes; 4 for edges; 1 configs; 7 computed
  6.06s All features loaded/computed - for details use loadLog()
utils.caption(0, 'Counting occurrences')
wstats = {
    'freqs': {
        'lex': collections.defaultdict(collections.Counter),
        'occ': collections.defaultdict(collections.Counter),
    },
    'ranks': {
        'lex': collections.defaultdict(dict),
        'occ': collections.defaultdict(dict),
    },
}
langs = set()
for w in F.otype.s('word'):
    occ = Fs(OCC_FEATURE).v(w)
    lex = Fs(LEX_FEATURE).v(w)
    lan = Fs(LANG_FEATURE).v(w)
    wstats['freqs']['lex'][lan][lex] += 1
    wstats['freqs']['occ'][lan][occ] += 1
    langs.add(lan)

for lan in langs:
    for tp in ['lex', 'occ']:
        # competition ranking: tied frequencies share a rank, and the next
        # distinct frequency skips as many ranks as there were ties
        rank = -1
        prev_n = -1
        amount = 1
        for (x, n) in sorted(wstats['freqs'][tp][lan].items(), key=lambda y: (-y[1], y[0])):
            if n == prev_n:
                amount += 1
            else:
                rank += amount
                amount = 1
                prev_n = n
            wstats['ranks'][tp][lan][x] = rank
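The ranking loop above is a 0-based competition ranking. Isolated as a standalone function (the name competitionRanks and the toy frequencies are mine, for illustration only), it can be exercised on a small input:

```python
def competitionRanks(freqs):
    # 0-based competition ranking: tied frequencies share a rank,
    # and the next distinct frequency skips as many ranks as there were ties
    ranks = {}
    rank = -1
    prev_n = -1
    amount = 1
    # sort by descending frequency, then by key, as in the notebook
    for (x, n) in sorted(freqs.items(), key=lambda y: (-y[1], y[0])):
        if n == prev_n:
            amount += 1
        else:
            rank += amount
            amount = 1
            prev_n = n
        ranks[x] = rank
    return ranks

print(competitionRanks({'W': 5, 'H': 5, 'L': 3, 'B': 1}))
# {'H': 0, 'W': 0, 'L': 2, 'B': 3}
```

Note that the two items tied at frequency 5 both get rank 0, and the next item gets rank 2, not 1.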
| 6.14s Counting occurrences
utils.caption(0, 'Making statistical features')
metaData = {
    '': dict(
        dataset='BHSA',
        version=VERSION,
        datasetName='Biblia Hebraica Stuttgartensia Amstelodamensis',
        author='Eep Talstra Centre for Bible and Computer',
        provenance='computed addition to core set of features',
        encoders='Dirk Roorda (TF)',
        website='https://shebanq.ancient-data.org',
        email='shebanq@ancient-data.org',
    ),
}
nodeFeatures = {}
edgeFeatures = {}
for ft in newFeatures:
    nodeFeatures[ft] = {}
    metaData.setdefault(ft, {})['valueType'] = 'int'

for w in F.otype.s('word'):
    lan = Fs(LANG_FEATURE).v(w)
    occ = Fs(OCC_FEATURE).v(w)
    lex = Fs(LEX_FEATURE).v(w)
    nodeFeatures['freq_occ'][w] = str(wstats['freqs']['occ'][lan][occ])
    nodeFeatures['rank_occ'][w] = str(wstats['ranks']['occ'][lan][occ])
    nodeFeatures['freq_lex'][w] = str(wstats['freqs']['lex'][lan][lex])
    nodeFeatures['rank_lex'][w] = str(wstats['ranks']['lex'][lan][lex])

if hasLex:
    # a lex node inherits the lexeme statistics of its first word occurrence
    for lx in F.otype.s('lex'):
        firstOcc = L.d(lx, otype='word')[0]
        nodeFeatures['freq_lex'][lx] = nodeFeatures['freq_lex'][firstOcc]
        nodeFeatures['rank_lex'][lx] = nodeFeatures['rank_lex'][firstOcc]
| 7.54s Making statistical features
utils.caption(4, 'Write statistical features as TF')
TF = Fabric(locations=thisTempTf, silent=True)
TF.save(nodeFeatures=nodeFeatures, edgeFeatures=edgeFeatures, metaData=metaData)
..............................................................................................
.       10s Write statistical features as TF                                                 .
..............................................................................................
   |     0.76s T freq_lex             to /Users/dirk/github/etcbc/bhsa/_temp/c/tf
   |     0.72s T freq_occ             to /Users/dirk/github/etcbc/bhsa/_temp/c/tf
   |     0.89s T rank_lex             to /Users/dirk/github/etcbc/bhsa/_temp/c/tf
   |     0.88s T rank_occ             to /Users/dirk/github/etcbc/bhsa/_temp/c/tf
Check differences with previous versions.
utils.checkDiffs(thisTempTf, thisTf, only=set(newFeatures))
..............................................................................................
.       13s Check differences with previous version                                          .
..............................................................................................
   |       13s 4 features to add
   |       13s   freq_lex
   |       13s   freq_occ
   |       13s   rank_lex
   |       13s   rank_occ
   |       13s no features to delete
   |       13s 0 features in common
   |       13s Done
Copy the new TF features from the temporary location where they have been created to their final destination.
utils.deliverFeatures(thisTempTf, thisTf, newFeatures)
..............................................................................................
.       13s Deliver features to /Users/dirk/github/etcbc/bhsa/tf/c                           .
..............................................................................................
   |       13s freq_occ
   |       13s freq_lex
   |       13s rank_occ
   |       13s rank_lex
utils.caption(4, 'Load and compile the new TF features')
TF = Fabric(locations=thisTf, modules=[''])
api = TF.load('{} {}'.format(LEX_FEATURE, newFeaturesStr))
api.makeAvailableIn(globals())
..............................................................................................
.       13s Load and compile the new TF features                                             .
..............................................................................................
This is Text-Fabric 3.0.2
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data
107 features found and 0 ignored
  0.00s loading features ...
   |     0.28s B lex                  from /Users/dirk/github/etcbc/bhsa/tf/c
   |     1.02s T freq_occ             from /Users/dirk/github/etcbc/bhsa/tf/c
   |     1.00s T freq_lex             from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.95s T rank_occ             from /Users/dirk/github/etcbc/bhsa/tf/c
   |     1.00s T rank_lex             from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.00s Feature overview: 102 for nodes; 4 for edges; 1 configs; 7 computed
    11s All features loaded/computed - for details use loadLog()
utils.caption(4, 'Basic test')
mostFrequent = set()
topX = 10
lexIndex = {}
utils.caption(0, 'Top {} frequent lexemes (computed on otype=word)'.format(topX))
for w in sorted(F.otype.s('word'), key=lambda w: -F.freq_lex.v(w)):
    lex = Fs(LEX_FEATURE).v(w)
    mostFrequent.add(lex)
    lexIndex[lex] = w
    if len(mostFrequent) == topX:
        break
mostFrequentWord = sorted((-F.freq_lex.v(lexIndex[lex]), lex) for lex in mostFrequent)
for (freq, lex) in mostFrequentWord:
    utils.caption(0, '{:<10} {:>6}x'.format(lex, -freq))
if hasLex:
    utils.caption(4, 'Top {} frequent lexemes (computed on otype=lex)'.format(topX))
    mostFrequentLex = sorted((-F.freq_lex.v(lx), F.lex.v(lx)) for lx in F.otype.s('lex'))[0:topX]
    for (freq, lex) in mostFrequentLex:
        utils.caption(0, '{:<10} {:>6}x'.format(lex, -freq))
    if mostFrequentWord != mostFrequentLex:
        utils.caption(0, '\tWARNING: Mismatch in lexeme frequencies computed by lex vs by word')
    else:
        utils.caption(0, '\tINFO: Same lexeme frequencies computed by lex vs by word')
utils.caption(0, 'Done')
..............................................................................................
.       24s Basic test                                                                       .
..............................................................................................
   |       24s Top 10 frequent lexemes (computed on otype=word)
   |       25s W          50272x
   |       25s H          30384x
   |       25s L          20069x
   |       25s B          15542x
   |       25s >T         11002x
   |       25s MN          7562x
   |       25s JHWH/       6828x
   |       25s <L          5766x
   |       25s >L          5517x
   |       25s >CR         5500x
..............................................................................................
.       25s Top 10 frequent lexemes (computed on otype=lex)                                  .
..............................................................................................
   |       25s W          50272x
   |       25s H          30384x
   |       25s L          20069x
   |       25s B          15542x
   |       25s >T         11002x
   |       25s MN          7562x
   |       25s JHWH/       6828x
   |       25s <L          5766x
   |       25s >L          5517x
   |       25s >CR         5500x
   |       25s     INFO: Same lexeme frequencies computed by lex vs by word
   |       25s Done