This notebook exports the BHSA database to an R data frame. The nodes are exported as rows; they correspond to the text objects such as word, phrase, clause, sentence, verse, chapter, book, and a few others.
The BHSA features become the columns, so each row records the values that the features have for the corresponding object.
The edges corresponding to the BHSA features mother, functional_parent, and distributional_parent are exported as extra columns. For each row, such a column indicates the target of the corresponding outgoing edge.
We also write the data that says which objects are contained in which. To each row we add the following columns:

* for each object type except word, there is a column named after that object type, containing the identifier of the containing object of that type for the row object (if any).

Extra data, such as the lexicon (including frequency and rank features), the phonetic transcription, and the ketiv-qere, is also included.
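As an illustration, once the Text-Fabric API is loaded (which happens below), the value of such a containment column for a single node can be computed as follows. This is a minimal sketch: the node number is arbitrary and phrase is just one of the relevant object types.

n = 1                                     # an arbitrary word node (illustrative only)
phrases = L.u(n, otype='phrase')          # the containing phrase(s) of n, if any
inPhrase = phrases[0] if phrases else ''  # the value of the in.phrase column for n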
We compose the big table and save it as a tab-delimited file. The result can be processed by R and Pandas, which can convert the table to their internal formats for quicker loading. It turns out that for data of this size Pandas is a bit quicker than R.
Also, because we remain in a Python environment, working with Pandas is easier when you want to use configurations and libraries from the text-fabric sphere.
See bigTablesR and bigTablesP.
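For example, once the table has been written, a minimal Pandas sketch for reading it and caching it in a faster binary format could look as follows (assuming pandas is installed; tableFile is composed in the next cell):

import pandas as pd

# Read the tab-delimited export; low_memory=False avoids chunked dtype guessing
df = pd.read_csv(tableFile, sep='\t', low_memory=False)

# Cache as a pickle for quicker reloads later
df.to_pickle(tableFile.replace('.txt', '.pkl'))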
import os, sys, collections
from tf.fabric import Fabric

# Where the BHSA data lives and which modules and version to use
locations = '~/github/etcbc'
coreModule = 'bhsa'
sources = [coreModule, 'phono']
version = '2017'

# Destination of the export: a tab-delimited text file in a temp directory
tempDir = os.path.expanduser('{}/{}/_temp/{}/r'.format(locations, coreModule, version))
tableFile = '{}/{}{}.txt'.format(tempDir, coreModule, version)

modules = ['{}/tf/{}'.format(s, version) for s in sources]
TF = Fabric(locations=locations, modules=modules)
This is Text-Fabric 3.0.9
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data : https://github.com/Dans-labs/text-fabric-data
117 features found and 0 ignored
api = TF.load('')  # initial load: just the core, so that we can explore all features
allFeatures = TF.explore(silent=False, show=True)
loadableFeatures = allFeatures['nodes'] + allFeatures['edges']
api = TF.load(loadableFeatures)  # now load every loadable node and edge feature
api.makeAvailableIn(globals())   # expose F, Fs, E, Es, L, N, T, info, ... as globals
0.00s loading features ...
 | 0.00s Feature overview: 110 for nodes; 5 for edges; 2 configs; 7 computed
5.91s All features loaded/computed - for details use loadLog()
 | 0.00s Feature overview: 110 for nodes; 5 for edges; 2 configs; 7 computed
0.00s loading features ...
 | 0.03s B code from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.19s B det from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.19s B dist from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.23s B dist_unit from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 1.61s B distributional_parent from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.02s B domain from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.12s B freq_lex from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.10s B freq_occ from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B function from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 3.08s B functional_parent from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.10s B g_nme from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.14s B g_nme_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B g_pfm from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.10s B g_pfm_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B g_prs from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B g_prs_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.08s B g_uvf from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.07s B g_uvf_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.08s B g_vbe from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B g_vbe_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B g_vbs from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B g_vbs_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.01s B gloss from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.13s B gn from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.03s B instruction from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.03s B is_root from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.04s B kind from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B kq_hybrid from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B kq_hybrid_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.02s B label from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B language from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.24s B lex from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B lexeme_count from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.22s B ls from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 1.48s B mother from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.06s B mother_object_type from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.00s B nametype from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.13s B nme from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.13s B nu from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.26s B number from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.90s B omap@2016-2017 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.03s B pargr from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.14s B pdp from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B pfm from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.14s B prs from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.14s B prs_gn from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B prs_nu from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.16s B prs_ps from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.23s B ps from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.18s B rank_lex from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.14s B rank_occ from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.37s B rela from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.01s B root from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B sp from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.13s B st from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B suffix_gender from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B suffix_number from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B suffix_person from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.02s B tab from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.03s B txt from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.26s B typ from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.13s B uvf from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.13s B vbe from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B vbs from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.01s B voc_lex from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.01s B voc_lex_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.17s B vs from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B vt from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.00s Feature overview: 110 for nodes; 5 for edges; 2 configs; 7 computed
18s All features loaded/computed - for details use loadLog()
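Now that makeAvailableIn(globals()) has exposed the API members, we can run a quick sanity check on the loaded data. This is a hedged sketch, not part of the export itself:

# Count nodes per object type and show the five most frequent types
typeCounts = collections.Counter(F.otype.v(n) for n in N())
print(typeCounts.most_common(5))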
# info("Writing R feature data")
if not os.path.exists(tempDir):
    os.makedirs(tempDir)
hr = open(tableFile, 'w')

# Features not exported as columns: otype and oslots are handled separately,
# and features with '@' in the name (such as the version mapping omap@2016-2017)
# are skipped as well
skipFeatures = '''
    otype
    oslots
'''.strip().split()
for f in (Fall() + Eall()):
    if '@' in f:
        skipFeatures.append(f)

# Object types for which we add a containment column ('in.<type>')
levelFeatures = '''
    subphrase phrase_atom phrase clause_atom clause sentence_atom sentence
    half_verse verse chapter book
'''.strip().split()
inLevelFeatures = ['in.' + x for x in levelFeatures]

allNodeFeatures = sorted(set(Fall()) - set(skipFeatures))
allEdgeFeatures = sorted(set(Eall()) - set(skipFeatures))

# Header line: node number, object type, containment columns,
# edge columns, node feature columns
hr.write('{}\t{}\t{}\t{}\t{}\n'.format(
    'n',
    'otype',
    '\t'.join(inLevelFeatures),
    '\t'.join(allEdgeFeatures),
    '\t'.join(allNodeFeatures),
))

chunkSize = 100000
i = 0
s = 0
NA = ['']

for n in N():
    # the containing object of each relevant type (if any)
    levelValues = [(L.u(n, otype=level) or NA)[0] for level in levelFeatures]
    # the target of the first outgoing edge for each edge feature (if any)
    edgeValues = [str((Es(f).f(n) or NA)[0]) for f in allEdgeFeatures]
    # the value of each node feature (empty string if undefined)
    nodeValues = [str(Fs(f).v(n) or '') for f in allNodeFeatures]
    hr.write('{}\t{}\t{}\t{}\t{}\n'.format(
        n,
        F.otype.v(n),
        '\t'.join(str(x) for x in levelValues),
        '\t'.join(edgeValues),
        '\t'.join(nodeValues).replace('\n', ''),
    ))
    i += 1
    s += 1
    if s == chunkSize:
        s = 0
        info('{:>7} nodes written'.format(i))
hr.close()
info('{:>7} nodes written and done'.format(i))
2h 34m 57s  100000 nodes written
2h 35m 21s  200000 nodes written
2h 35m 44s  300000 nodes written
2h 36m 10s  400000 nodes written
2h 36m 32s  500000 nodes written
2h 36m 54s  600000 nodes written
2h 37m 18s  700000 nodes written
2h 37m 41s  800000 nodes written
2h 38m 04s  900000 nodes written
2h 38m 27s 1000000 nodes written
2h 38m 49s 1100000 nodes written
2h 39m 14s 1200000 nodes written
2h 39m 36s 1300000 nodes written
2h 39m 58s 1400000 nodes written
2h 40m 09s 1446635 nodes written and done
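As a quick check (a hedged sketch, not part of the export), we can verify that the first data row has as many columns as the header:

with open(tableFile) as fh:
    header = fh.readline().rstrip('\n').split('\t')
    firstRow = fh.readline().rstrip('\n').split('\t')
print(len(header), len(firstRow))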
!ls -lh {tempDir}
total 758624
-rw-r--r--  1 dirk  staff    41M Oct 13 12:20 bhsa2017.rds
-rw-r--r--@ 1 dirk  staff   324M Oct 13 14:11 bhsa2017.txt
-rw-r--r--  1 dirk  staff   5.1M Oct 13 12:24 plainTextFromR.txt
The R export is ready now, but it is a bit large. We can get a much leaner file by using R to load this file and save it in .rds format.
We do that in a separate notebook, running not Python but R: bigTablesR in this same directory.