You might want to consider the start of this tutorial first. Short introductions to other TF datasets are available as well.
Text-Fabric is not a world to stay in forever. When you go to other worlds, you can travel with the corpus data in your backpack.
Here we show two destinations (and one of them is also an origin): Pandas and Emdros.
Before we go there, we load the corpus.
%load_ext autoreload
%autoreload 2
The ins and outs of installing Text-Fabric, getting the corpus, and initializing a notebook are explained in the start tutorial.
from tf.app import use
A = use("ETCBC/bhsa", hoist=globals())
Locating corpus resources ...
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 39 | 10938.21 | 100 |
chapter | 929 | 459.19 | 100 |
lex | 9230 | 46.22 | 100 |
verse | 23213 | 18.38 | 100 |
half_verse | 45179 | 9.44 | 100 |
sentence | 63717 | 6.70 | 100 |
sentence_atom | 64514 | 6.61 | 100 |
clause | 88131 | 4.84 | 100 |
clause_atom | 90704 | 4.70 | 100 |
phrase | 253203 | 1.68 | 100 |
phrase_atom | 267532 | 1.59 | 100 |
subphrase | 113850 | 1.42 | 38 |
word | 426590 | 1.00 | 100 |
The first journey is to Pandas.
We convert the data to a data frame, via a tab-separated text file.
The nodes are exported as rows; they correspond to the text objects such as word, phrase, clause, sentence, verse, chapter, book, and a few others.
The BHSA features become the columns, so each row tells what values the features have for the corresponding node.
The edges corresponding to the BHSA features mother, functional_parent, and distributional_parent are exported as extra columns. For each row, such a column indicates the target of the corresponding outgoing edge.
We also write the data that says which objects are contained in which. To each row we add the following columns: for each node type there is a column named after it (in_book, in_chapter, and so on, down to in_word); the value in that column is the node of that type that contains the row node (if any).
Extra data, such as the lexicon (including frequency and rank features), the phonetic transcription, and the ketiv-qere, is also included.
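These containment columns make it easy to relate rows of different node types with an ordinary join. Here is a minimal sketch: the column names (nd, in_verse, gloss, label) follow the export, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical fragments of the exported table: a few word rows
# and the verse row that contains them (values are made up).
words = pd.DataFrame(
    {"nd": [1, 2], "in_verse": [1414389, 1414389], "gloss": ["in", "beginning"]}
)
verses = pd.DataFrame({"nd": [1414389], "label": ["GEN 01,01"]})

# Attach each word to its containing verse via the in_verse column.
joined = words.merge(
    verses, left_on="in_verse", right_on="nd", suffixes=("_word", "_verse")
)
print(joined[["nd_word", "gloss", "label"]].to_dict("records"))
```

The same pattern works for any of the containment columns (in_clause, in_sentence, and so on).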
While exporting the data to Pandas format, the program composes the big table and saves it as a tab-delimited file, stored in a temporary directory (not visible on GitHub).
This temporary file can also be read by R, but we proceed with Pandas. Pandas offers functions in the same spirit as R, but it is more Pythonic and also faster.
A.exportPandas()
0.00s Create tsv file ... | 2.96s 5% 72342 nodes written | 5.89s 10% 144684 nodes written | 8.80s 15% 217026 nodes written | 12s 20% 289368 nodes written | 15s 25% 361710 nodes written | 18s 30% 434052 nodes written | 21s 35% 506394 nodes written | 24s 40% 578736 nodes written | 27s 45% 651078 nodes written | 30s 50% 723420 nodes written | 33s 55% 795762 nodes written | 36s 60% 868104 nodes written | 39s 65% 940446 nodes written | 42s 70% 1012788 nodes written | 45s 75% 1085130 nodes written | 48s 80% 1157472 nodes written | 50s 85% 1229814 nodes written | 53s 90% 1302156 nodes written | 56s 95% 1374498 nodes written | 59s 95% 1446831 nodes written and done 59s TSV file is ~/text-fabric-data/github/ETCBC/bhsa/_temp/data-2021.tsv 59s Columns 72: 59s nd 59s otype 59s g_cons 59s g_cons_utf8 59s g_lex 59s g_lex_utf8 59s g_word 59s g_word_utf8 59s lex 59s lex_utf8 59s phono 59s phono_trailer 59s qere 59s qere_trailer 59s qere_trailer_utf8 59s qere_utf8 59s trailer 59s trailer_utf8 59s voc_lex_utf8 59s in_book 59s in_chapter 59s in_verse 59s in_lex 59s in_half_verse 59s in_sentence 59s in_sentence_atom 59s in_clause 59s in_clause_atom 59s in_phrase 59s in_phrase_atom 59s in_subphrase 59s in_word 59s crossref 59s mother 59s book 59s chapter 59s code 59s det 59s domain 59s freq_lex 59s function 59s gloss 59s gn 59s label 59s language 59s ls 59s nametype 59s nme 59s nu 59s number 59s pargr 59s pdp 59s pfm 59s prs 59s prs_gn 59s prs_nu 59s prs_ps 59s ps 59s rank_lex 59s rela 59s sp 59s st 59s tab 59s txt 59s typ 59s uvf 59s vbe 59s vbs 59s verse 59s voc_lex 59s vs 59s vt 1m 00s 1446832 rows 1m 00s 273843208 characters 1m 00s Importing into Pandas ... | 0.00s Reading tsv file ... | 13s Done. Size = 104171832 | 13s Saving as Parquet file ... | 19s Saved 1m 19s PD in ~/text-fabric-data/github/ETCBC/bhsa/pandas/data-2021.pd
F.otype.s("verse")[0]
1414389
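Once exportPandas() has run, the resulting file can be loaded and analysed with ordinary Pandas operations (the log above reports it was saved as a Parquet file, so pd.read_parquet on that path should work). Since that file only exists after running the export, the sketch below applies the same kind of operations to a tiny mock table that reuses real column names from the log (nd, otype, in_verse, lex) with made-up values.

```python
import pandas as pd

# In a real session you would read the file that exportPandas() reported, e.g.
#   df = pd.read_parquet("~/text-fabric-data/github/ETCBC/bhsa/pandas/data-2021.pd")
# To keep this sketch self-contained, we use a mock table instead.
df = pd.DataFrame(
    {
        "nd": [1, 2, 3],
        "otype": ["word", "word", "word"],
        "in_verse": [1414389, 1414389, 1414390],
        "lex": ["B", "R>CJT/", "B"],
    }
)

# Count words per verse via the containment column ...
per_verse = df[df.otype == "word"].groupby("in_verse").size().to_dict()

# ... and rank lexemes by how often they occur.
lex_freq = df.lex.value_counts().to_dict()

print(per_verse)  # {1414389: 2, 1414390: 1}
print(lex_freq)   # {'B': 2, 'R>CJT/': 1}
```

On the real table the same two lines give you words-per-verse counts and lexeme frequencies for the whole Hebrew Bible.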
The next journey is to MQL, a text-database format not unlike SQL, supported by the Emdros software.
Emdros, written by Ulrik Petersen, is a text database system built around the powerful topographic query language MQL. Its ideas are based on a model devised by Christ-Jan Doedens in Text Databases: One Database Model and Several Retrieval Languages.
Text-Fabric's model of slots, nodes and edges is a fairly straightforward translation of the models of Christ-Jan Doedens and Ulrik Petersen.
SHEBANQ uses Emdros to let users execute and save MQL queries against the Hebrew Text Database of the ETCBC.
So it is rather logical and convenient to be able to work with a Text-Fabric resource through MQL.
If you have obtained an MQL dataset somehow, you can turn it into a Text-Fabric dataset with importMQL(), which we will not show here.
And if you want to export a Text-Fabric dataset to MQL, that is also possible. After the Fabric(modules=...) call, you can call exportMQL() to save all features of the indicated modules into a big MQL dump, which can be imported into an Emdros database.
A.exportMQL("mybhsa", exportDir="~/Downloads/mql")
0.00s Checking features of dataset mybhsa
| 4m 45s feature "book@am" => "book_am" | 4m 45s feature "book@ar" => "book_ar" | 4m 45s feature "book@bn" => "book_bn" | 4m 45s feature "book@da" => "book_da" | 4m 45s feature "book@de" => "book_de" | 4m 45s feature "book@el" => "book_el" | 4m 45s feature "book@en" => "book_en" | 4m 45s feature "book@es" => "book_es" | 4m 45s feature "book@fa" => "book_fa" | 4m 45s feature "book@fr" => "book_fr" | 4m 45s feature "book@he" => "book_he" | 4m 45s feature "book@hi" => "book_hi" | 4m 45s feature "book@id" => "book_id" | 4m 45s feature "book@ja" => "book_ja" | 4m 45s feature "book@ko" => "book_ko" | 4m 45s feature "book@la" => "book_la" | 4m 45s feature "book@nl" => "book_nl" | 4m 45s feature "book@pa" => "book_pa" | 4m 45s feature "book@pt" => "book_pt" | 4m 45s feature "book@ru" => "book_ru" | 4m 45s feature "book@sw" => "book_sw" | 4m 45s feature "book@syc" => "book_syc" | 4m 45s feature "book@tr" => "book_tr" | 4m 45s feature "book@ur" => "book_ur" | 4m 45s feature "book@yo" => "book_yo" | 4m 45s feature "book@zh" => "book_zh" | 4m 45s feature "omap@2017-2021" => "omap_2017_2021" | 4m 45s feature "omap@c-2021" => "omap_c_2021"
0.02s 118 features to export to MQL ... 0.02s Loading 118 features | 0.07s T crossrefLCS from ~/text-fabric-data/github/ETCBC/parallels/tf/2021 | 0.04s T crossrefSET from ~/text-fabric-data/github/ETCBC/parallels/tf/2021 | 1.20s T dist_unit from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 3.18s T distributional_parent from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.77s T freq_occ from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 3.98s T functional_parent from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.77s T g_nme from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.79s T g_nme_utf8 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.71s T g_pfm from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.73s T g_pfm_utf8 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.71s T g_prs from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.71s T g_prs_utf8 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.68s T g_uvf from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.68s T g_uvf_utf8 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.72s T g_vbe from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.70s T g_vbe_utf8 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.70s T g_vbs from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.68s T g_vbs_utf8 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.18s T instruction from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.18s T is_root from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.17s T kind from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.68s T kq_hybrid from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.69s T kq_hybrid_utf8 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.84s T languageISO from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.94s T lex0 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.76s T lexeme_count from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.40s T mother_object_type from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 6.39s T 
omap@2017-2021 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 6.31s T omap@c-2021 from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.75s T rank_occ from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.17s T root from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.83s T suffix_gender from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.83s T suffix_number from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 | 0.82s T suffix_person from ~/text-fabric-data/github/ETCBC/bhsa/tf/2021 39s Writing enumerations book_am : 39 values, 39 not a name, e.g. «መኃልየ_መኃልይ_ዘሰሎሞን» book_ar : 39 values, 39 not a name, e.g. «1_اخبار» book_bn : 39 values, 39 not a name, e.g. «আদিপুস্তক» book_da : 39 values, 13 not a name, e.g. «1.Kongebog» book_de : 39 values, 7 not a name, e.g. «1_Chronik» book_el : 39 values, 39 not a name, e.g. «Άσμα_Ασμάτων» book_en : 39 values, 6 not a name, e.g. «1_Chronicles» book_es : 39 values, 22 not a name, e.g. «1_Crónicas» book_fa : 39 values, 39 not a name, e.g. «استر» book_fr : 39 values, 19 not a name, e.g. «1_Chroniques» book_he : 39 values, 39 not a name, e.g. «איוב» book_hi : 39 values, 39 not a name, e.g. «1_इतिहास» book_id : 39 values, 7 not a name, e.g. «1_Raja-raja» book_ja : 39 values, 39 not a name, e.g. «アモス書» book_ko : 39 values, 39 not a name, e.g. «나훔» book_nl : 39 values, 8 not a name, e.g. «1_Koningen» book_pa : 39 values, 39 not a name, e.g. «1_ਇਤਹਾਸ» book_pt : 39 values, 21 not a name, e.g. «1_Crônicas» book_ru : 39 values, 39 not a name, e.g. «1-я_Паралипоменон» book_sw : 39 values, 6 not a name, e.g. «1_Mambo_ya_Nyakati» book_syc : 39 values, 39 not a name, e.g. «ܐ_ܒܪܝܡܝܢ» book_tr : 39 values, 16 not a name, e.g. «1_Krallar» book_ur : 39 values, 39 not a name, e.g. «احبار» book_yo : 39 values, 8 not a name, e.g. «Amọsi» book_zh : 38 values, 37 not a name, e.g. «以斯帖记» domain : 4 values, 1 not a name, e.g. «?» g_nme : 108 values, 108 not a name, e.g. «» g_nme_utf8 : 106 values, 106 not a name, e.g. 
«» g_pfm : 87 values, 87 not a name, e.g. «» g_pfm_utf8 : 86 values, 86 not a name, e.g. «» g_prs : 127 values, 127 not a name, e.g. «» g_prs_utf8 : 126 values, 126 not a name, e.g. «» g_uvf : 19 values, 19 not a name, e.g. «» g_uvf_utf8 : 17 values, 17 not a name, e.g. «» g_vbe : 101 values, 101 not a name, e.g. «» g_vbe_utf8 : 97 values, 97 not a name, e.g. «» g_vbs : 66 values, 66 not a name, e.g. «» g_vbs_utf8 : 65 values, 65 not a name, e.g. «» instruction : 35 values, 20 not a name, e.g. «.#» nametype : 10 values, 5 not a name, e.g. «gens,topo» nme : 20 values, 7 not a name, e.g. «» pfm : 11 values, 4 not a name, e.g. «» phono_trailer : 4 values, 4 not a name, e.g. «» prs : 22 values, 4 not a name, e.g. «H=» qere_trailer : 5 values, 5 not a name, e.g. «» qere_trailer_utf8: 5 values, 5 not a name, e.g. «» root : 757 values, 212 not a name, e.g. «<Assyrian>» trailer : 13 values, 13 not a name, e.g. «» trailer_utf8 : 13 values, 13 not a name, e.g. «» txt : 136 values, 59 not a name, e.g. «?» uvf : 6 values, 1 not a name, e.g. «>» vbe : 19 values, 6 not a name, e.g. «» vbs : 11 values, 3 not a name, e.g. «>» | 0.36s Writing an all-in-one enum with 232 values 39s Mapping 118 features onto 13 object types 42s Writing 118 features as data in 13 object types | 0.00s word data ... | | 1.24s batch of size 49.9MB with 50000 of 50000 words | | 2.49s batch of size 50.0MB with 50000 of 100000 words | | 3.74s batch of size 50.2MB with 50000 of 150000 words | | 4.99s batch of size 50.2MB with 50000 of 200000 words | | 6.24s batch of size 50.4MB with 50000 of 250000 words | | 7.50s batch of size 50.4MB with 50000 of 300000 words | | 8.76s batch of size 50.5MB with 50000 of 350000 words | | 10s batch of size 50.4MB with 50000 of 400000 words | | 11s batch of size 26.8MB with 26590 of 426590 words | 11s word data: 426590 objects | 0.00s subphrase data ... 
| | 0.18s batch of size 8.6MB with 50000 of 50000 subphrases | | 0.35s batch of size 8.5MB with 50000 of 100000 subphrases | | 0.40s batch of size 2.4MB with 13850 of 113850 subphrases | 0.40s subphrase data: 113850 objects | 0.00s phrase_atom data ... | | 0.26s batch of size 12.0MB with 50000 of 50000 phrase_atoms | | 0.51s batch of size 12.0MB with 50000 of 100000 phrase_atoms | | 0.77s batch of size 12.2MB with 50000 of 150000 phrase_atoms | | 1.03s batch of size 12.2MB with 50000 of 200000 phrase_atoms | | 1.28s batch of size 12.1MB with 50000 of 250000 phrase_atoms | | 1.37s batch of size 4.3MB with 17532 of 267532 phrase_atoms | 1.37s phrase_atom data: 267532 objects | 0.00s phrase data ... | | 0.23s batch of size 10.9MB with 50000 of 50000 phrases | | 0.45s batch of size 11.0MB with 50000 of 100000 phrases | | 0.68s batch of size 11.0MB with 50000 of 150000 phrases | | 0.92s batch of size 11.0MB with 50000 of 200000 phrases | | 1.15s batch of size 11.0MB with 50000 of 250000 phrases | | 1.16s batch of size 724.2KB with 3203 of 253203 phrases | 1.16s phrase data: 253203 objects | 0.00s clause_atom data ... | | 0.32s batch of size 14.4MB with 50000 of 50000 clause_atoms | | 0.59s batch of size 11.7MB with 40704 of 90704 clause_atoms | 0.59s clause_atom data: 90704 objects | 0.00s clause data ... | | 0.28s batch of size 13.3MB with 50000 of 50000 clauses | | 0.49s batch of size 10.2MB with 38131 of 88131 clauses | 0.49s clause data: 88131 objects | 0.00s sentence_atom data ... | | 0.18s batch of size 7.8MB with 50000 of 50000 sentence_atoms | | 0.23s batch of size 2.3MB with 14514 of 64514 sentence_atoms | 0.23s sentence_atom data: 64514 objects | 0.00s sentence data ... | | 0.14s batch of size 6.3MB with 50000 of 50000 sentences | | 0.18s batch of size 1.7MB with 13717 of 63717 sentences | 0.18s sentence data: 63717 objects | 0.00s half_verse data ... 
| | 0.13s batch of size 5.5MB with 45179 of 45179 half_verses | 0.13s half_verse data: 45179 objects | 0.00s verse data ... | | 0.12s batch of size 4.8MB with 23213 of 23213 verses | 0.12s verse data: 23213 objects | 0.00s lex data ... | | 0.16s batch of size 5.5MB with 9230 of 9230 lexs | 0.16s lex data: 9230 objects | 0.00s chapter data ... | | 0.02s batch of size 131.1KB with 929 of 929 chapters | 0.02s chapter data: 929 objects | 0.00s book data ... | | 0.02s batch of size 29.2KB with 39 of 39 books | 0.02s book data: 39 objects 57s MQL in ~/Downloads/mql 57s Done
Now you have a file ~/Downloads/mql/mybhsa.mql of 530 MB.
You can import it into an Emdros database by saying:
cd ~/Downloads/mql
rm mybhsa
mql -b 3 < mybhsa.mql
The result is an SQLite3 database mybhsa in the same directory (168 MB).
You can run a query against it by creating a text file test.mql with this content:
select all objects where
[lex gloss ~ 'make'
  [word FOCUS]
]
And then say
mql -b 3 -d mybhsa test.mql
You will see raw query results: all word occurrences that belong to lexemes with make in their gloss.
It is not very pretty, and you should probably use a more visual Emdros tool to run these queries. You see a lot of node numbers, but the good thing is that you can look those node numbers up in Text-Fabric.
CC-BY Dirk Roorda