This notebook exports the BHSA database to an R data frame. The nodes are exported as rows; they correspond to the text objects such as word, phrase, clause, sentence, verse, chapter, book, and a few others.
The BHSA features become the columns, so each row records the values that the features have for the corresponding object.
The edges corresponding to the BHSA features mother, functional_parent, and distributional_parent are exported as extra columns. For each row, such a column indicates the target of the corresponding outgoing edge.
We also write the data that says which objects are contained in which. To each row we add the following columns:

* for each object type except word, there is a column named after that object type, containing the identifier of the containing object of that type for the row object (if any).

Extra data, such as the lexicon (including frequency and rank features), the phonetic transcription, and the ketiv-qere, is also included.
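As an illustration, once the Text-Fabric API is loaded (which happens below), the value of such a containment column for a single node can be computed as follows. This is a minimal sketch: the node number is arbitrary and phrase is just one of the relevant object types.

n = 1                                     # an arbitrary word node (illustrative only)
phrases = L.u(n, otype='phrase')          # the containing phrase(s) of n, if any
inPhrase = phrases[0] if phrases else ''  # the value of the in.phrase column for n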
We compose the big table and save it as a tab-delimited file. The result can be processed by R and Pandas, which can convert the table to their internal formats for quicker loading. It turns out that for data of this size Pandas is a bit quicker than R.
Also, because we remain in a Python environment, working with Pandas is easier when you want to use configurations and libraries from the text-fabric sphere.
See bigTablesR and bigTablesP.
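For example, once the table has been written, a minimal Pandas sketch for reading it and caching it in a faster binary format could look as follows (assuming pandas is installed; tableFile is composed in the next cell):

import pandas as pd

# Read the tab-delimited export; low_memory=False avoids chunked dtype guessing
df = pd.read_csv(tableFile, sep='\t', low_memory=False)

# Cache as a pickle for quicker reloads later
df.to_pickle(tableFile.replace('.txt', '.pkl'))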
import os, sys, collections
from tf.fabric import Fabric

# Where the BHSA data lives and which modules and version to use
locations = '~/github/etcbc'
coreModule = 'bhsa'
sources = [coreModule, 'phono']
version = '2017'

# Destination of the export: a tab-delimited text file in a temp directory
tempDir = os.path.expanduser('{}/{}/_temp/{}/r'.format(locations, coreModule, version))
tableFile = '{}/{}{}.txt'.format(tempDir, coreModule, version)

modules = ['{}/tf/{}'.format(s, version) for s in sources]
TF = Fabric(locations=locations, modules=modules)
This is Text-Fabric 3.0.9
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data : https://github.com/Dans-labs/text-fabric-data
117 features found and 0 ignored
api = TF.load('')  # initial load: just the core, so that we can explore all features
allFeatures = TF.explore(silent=False, show=True)
loadableFeatures = allFeatures['nodes'] + allFeatures['edges']
api = TF.load(loadableFeatures)  # now load every loadable node and edge feature
api.makeAvailableIn(globals())   # expose F, Fs, E, Es, L, N, T, info, ... as globals
0.00s loading features ...
 | 0.00s Feature overview: 110 for nodes; 5 for edges; 2 configs; 7 computed
5.91s All features loaded/computed - for details use loadLog()
 | 0.00s Feature overview: 110 for nodes; 5 for edges; 2 configs; 7 computed
0.00s loading features ...
 | 0.03s B code from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.19s B det from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.19s B dist from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.23s B dist_unit from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 1.61s B distributional_parent from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.02s B domain from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.12s B freq_lex from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.10s B freq_occ from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B function from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 3.08s B functional_parent from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.10s B g_nme from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.14s B g_nme_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B g_pfm from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.10s B g_pfm_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B g_prs from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B g_prs_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.08s B g_uvf from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.07s B g_uvf_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.08s B g_vbe from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B g_vbe_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B g_vbs from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B g_vbs_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.01s B gloss from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.13s B gn from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.03s B instruction from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.03s B is_root from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.04s B kind from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B kq_hybrid from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.09s B kq_hybrid_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.02s B label from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B language from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.24s B lex from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B lexeme_count from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.22s B ls from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 1.48s B mother from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.06s B mother_object_type from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.00s B nametype from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.13s B nme from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.13s B nu from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.26s B number from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.90s B omap@2016-2017 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.03s B pargr from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.14s B pdp from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B pfm from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.14s B prs from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.14s B prs_gn from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B prs_nu from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.16s B prs_ps from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.23s B ps from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.18s B rank_lex from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.14s B rank_occ from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.37s B rela from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.01s B root from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B sp from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.13s B st from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B suffix_gender from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B suffix_number from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B suffix_person from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.02s B tab from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.03s B txt from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.26s B typ from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.13s B uvf from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.13s B vbe from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B vbs from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.01s B voc_lex from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.01s B voc_lex_utf8 from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.17s B vs from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.15s B vt from /Users/dirk/github/etcbc/bhsa/tf/2017
 | 0.00s Feature overview: 110 for nodes; 5 for edges; 2 configs; 7 computed
18s All features loaded/computed - for details use loadLog()
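Now that makeAvailableIn(globals()) has exposed the API members, we can run a quick sanity check on the loaded data. This is a hedged sketch, not part of the export itself:

# Count nodes per object type and show the five most frequent types
typeCounts = collections.Counter(F.otype.v(n) for n in N())
print(typeCounts.most_common(5))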
# info("Writing R feature data")
if not os.path.exists(tempDir):
    os.makedirs(tempDir)
hr = open(tableFile, 'w')

# Features not exported as columns: otype and oslots are handled separately,
# and features with '@' in the name (such as the version mapping omap@2016-2017)
# are skipped as well
skipFeatures = '''
    otype
    oslots
'''.strip().split()
for f in (Fall() + Eall()):
    if '@' in f:
        skipFeatures.append(f)

# Object types for which we add a containment column ('in.<type>')
levelFeatures = '''
    subphrase phrase_atom phrase clause_atom clause sentence_atom sentence
    half_verse verse chapter book
'''.strip().split()
inLevelFeatures = ['in.' + x for x in levelFeatures]

allNodeFeatures = sorted(set(Fall()) - set(skipFeatures))
allEdgeFeatures = sorted(set(Eall()) - set(skipFeatures))

# Header line: node number, object type, containment columns,
# edge columns, node feature columns
hr.write('{}\t{}\t{}\t{}\t{}\n'.format(
    'n',
    'otype',
    '\t'.join(inLevelFeatures),
    '\t'.join(allEdgeFeatures),
    '\t'.join(allNodeFeatures),
))

chunkSize = 100000
i = 0
s = 0
NA = ['']

for n in N():
    # the containing object of each relevant type (if any)
    levelValues = [(L.u(n, otype=level) or NA)[0] for level in levelFeatures]
    # the target of the first outgoing edge for each edge feature (if any)
    edgeValues = [str((Es(f).f(n) or NA)[0]) for f in allEdgeFeatures]
    # the value of each node feature (empty string if undefined)
    nodeValues = [str(Fs(f).v(n) or '') for f in allNodeFeatures]
    hr.write('{}\t{}\t{}\t{}\t{}\n'.format(
        n,
        F.otype.v(n),
        '\t'.join(str(x) for x in levelValues),
        '\t'.join(edgeValues),
        '\t'.join(nodeValues).replace('\n', ''),
    ))
    i += 1
    s += 1
    if s == chunkSize:
        s = 0
        info('{:>7} nodes written'.format(i))
hr.close()
info('{:>7} nodes written and done'.format(i))
2h 34m 57s  100000 nodes written
2h 35m 21s  200000 nodes written
2h 35m 44s  300000 nodes written
2h 36m 10s  400000 nodes written
2h 36m 32s  500000 nodes written
2h 36m 54s  600000 nodes written
2h 37m 18s  700000 nodes written
2h 37m 41s  800000 nodes written
2h 38m 04s  900000 nodes written
2h 38m 27s 1000000 nodes written
2h 38m 49s 1100000 nodes written
2h 39m 14s 1200000 nodes written
2h 39m 36s 1300000 nodes written
2h 39m 58s 1400000 nodes written
2h 40m 09s 1446635 nodes written and done
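As a quick check (a hedged sketch, not part of the export), we can verify that the first data row has as many columns as the header:

with open(tableFile) as fh:
    header = fh.readline().rstrip('\n').split('\t')
    firstRow = fh.readline().rstrip('\n').split('\t')
print(len(header), len(firstRow))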
!ls -lh {tempDir}
total 758624
-rw-r--r--  1 dirk  staff    41M Oct 13 12:20 bhsa2017.rds
-rw-r--r--@ 1 dirk  staff   324M Oct 13 14:11 bhsa2017.txt
-rw-r--r--  1 dirk  staff   5.1M Oct 13 12:24 plainTextFromR.txt
The R export is ready now, but it is a bit large. We can get a much leaner file by using R to load this file and save it in .rds format.
We do that in a separate notebook, running not Python but R: bigTablesR in this same directory.