You might want to start with the beginning of this tutorial first.
Text-Fabric is not a world to stay in forever. When you go to other worlds, you can travel with the corpus data in your backpack.
Here we show two destinations (and one of them is also an origin): Pandas and Emdros.
Before we go there, we load the corpus.
%load_ext autoreload
%autoreload 2
The ins and outs of installing Text-Fabric, getting the corpus, and initializing a notebook are explained in the start tutorial.
from tf.app import use
A = use("ETCBC/bhsa", hoist=globals())
Locating corpus resources ...
Name | # of nodes | # slots/node | % coverage |
---|---|---|---|
book | 39 | 10938.21 | 100 |
chapter | 929 | 459.19 | 100 |
lex | 9230 | 46.22 | 100 |
verse | 23213 | 18.38 | 100 |
half_verse | 45179 | 9.44 | 100 |
sentence | 63717 | 6.70 | 100 |
sentence_atom | 64514 | 6.61 | 100 |
clause | 88131 | 4.84 | 100 |
clause_atom | 90704 | 4.70 | 100 |
phrase | 253203 | 1.68 | 100 |
phrase_atom | 267532 | 1.59 | 100 |
subphrase | 113850 | 1.42 | 38 |
word | 426590 | 1.00 | 100 |
The first journey is to Pandas.
We convert the data to a dataframe, via a tab-separated text file.
The nodes are exported as rows; they correspond to the textual objects such as word, phrase, clause, sentence, verse, chapter, book, and a few others.
The BHSA features become the columns, so each row tells what values the features have for the corresponding node.
The edges corresponding to the BHSA features mother, functional_parent, distributional_parent are exported as extra columns. For each row, such a column indicates the target of a corresponding outgoing edge.
We also write the data that says which objects are contained in which. To each row we add the following columns: for every node type there is a column with that node type as name (prefixed with `in_` in the export); its value is the node of that type that contains the row node (if any).

Extra data such as the lexicon (including frequency and rank features), phonetic transcription, and ketiv-qere are also included.
While exporting the data to Pandas format, the program composes the big table and saves it as a tab-delimited file, stored in a temporary directory (not visible on GitHub).

This temporary file can also be read by R, but we proceed with Pandas. Pandas offers functions in the same spirit as R, but is more Pythonic and also faster.
A.exportPandas()
0.00s Create tsv file ... | 2.92s 5% 72342 nodes written | 5.82s 10% 144684 nodes written | 8.75s 15% 217026 nodes written | 12s 20% 289368 nodes written | 15s 25% 361710 nodes written | 18s 30% 434052 nodes written | 21s 35% 506394 nodes written | 24s 40% 578736 nodes written | 26s 45% 651078 nodes written | 29s 50% 723420 nodes written | 32s 55% 795762 nodes written | 35s 60% 868104 nodes written | 38s 65% 940446 nodes written | 41s 70% 1012788 nodes written | 44s 75% 1085130 nodes written | 47s 80% 1157472 nodes written | 50s 85% 1229814 nodes written | 53s 90% 1302156 nodes written | 56s 95% 1374498 nodes written | 59s 95% 1446831 nodes written and done 59s TSV file is ~/text-fabric-data/github/etcbc/bhsa/_temp/data-2021.tsv 59s Columns 72: 59s nd 59s otype 59s g_cons 59s g_cons_utf8 59s g_lex 59s g_lex_utf8 59s g_word 59s g_word_utf8 59s lex 59s lex_utf8 59s phono 59s phono_trailer 59s qere 59s qere_trailer 59s qere_trailer_utf8 59s qere_utf8 59s trailer 59s trailer_utf8 59s voc_lex_utf8 59s in_book 59s in_chapter 59s in_verse 59s in_lex 59s in_half_verse 59s in_sentence 59s in_sentence_atom 59s in_clause 59s in_clause_atom 59s in_phrase 59s in_phrase_atom 59s in_subphrase 59s in_word 59s crossref 59s mother 59s book 59s chapter 59s code 59s det 59s domain 59s freq_lex 59s function 59s gloss 59s gn 59s label 59s language 59s ls 59s nametype 59s nme 59s nu 59s number 59s pargr 59s pdp 59s pfm 59s prs 59s prs_gn 59s prs_nu 59s prs_ps 59s ps 59s rank_lex 59s rela 59s sp 59s st 59s tab 59s txt 59s typ 59s uvf 59s vbe 59s vbs 59s verse 59s voc_lex 59s vs 59s vt 1m 00s 1446832 rows 1m 00s 273843208 characters 1m 00s Importing into Pandas ... | 0.00s Reading tsv file ... | 12s Done. Size = 104171832 | 12s Saving as Parquet file ... | 15s Saved 1m 15s PD in ~/text-fabric-data/github/etcbc/bhsa/pandas/data-2021.pd
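As the log shows, the intermediate file is ordinary tab-separated text, so any TSV reader can process it. Here is a minimal sketch of reading such a file with Pandas; it uses a tiny hand-made file with invented values, not the real export (whose path appears in the log above):

```python
import pandas as pd

# Write a toy tab-separated file with a few of the columns
# that occur in the real export (the values are made up).
tsv = "nd\totype\tgloss\n1\tword\tin\n2\tword\tbeginning\n"
with open("toy-export.tsv", "w") as fh:
    fh.write(tsv)

# Read it back the way one would read the real export file.
df = pd.read_csv("toy-export.tsv", sep="\t", low_memory=False)
print(df.shape)  # → (2, 3)
```

For the real file, `low_memory=False` helps Pandas infer consistent column types across the 1.4 million rows.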
F.otype.s("verse")[0]
1414389
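Once the table is in a dataframe, the containment columns (`in_verse`, `in_clause`, etc.) make it straightforward to collect all rows belonging to one object. A sketch on a hand-made stand-in for the real dataframe (node numbers and glosses are invented, except 1414389, the first verse node shown above):

```python
import pandas as pd

# Hand-made stand-in for the exported table: every node is a row,
# `otype` names its node type, `in_verse` its containing verse node.
df = pd.DataFrame(
    {
        "nd": [1, 2, 3, 1414389, 1414390],
        "otype": ["word", "word", "word", "verse", "verse"],
        "in_verse": [1414389, 1414389, 1414390, None, None],
        "gloss": ["in", "beginning", "create", None, None],
    }
)

# All word rows contained in the first verse node:
words = df[(df.otype == "word") & (df.in_verse == 1414389)]
print(words.gloss.tolist())  # → ['in', 'beginning']
```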
The next journey is to MQL, a text-database format not unlike SQL, supported by the Emdros software.
Emdros, written by Ulrik Petersen, is a text database system with the powerful topographic query language MQL. The ideas are based on a model devised by Crist-Jan Doedens in *Text Databases: One Database Model and Several Retrieval Languages*.

Text-Fabric's model of slots, nodes and edges is a fairly straightforward translation of the models of Crist-Jan Doedens and Ulrik Petersen.

SHEBANQ uses Emdros to let users execute and save MQL queries against the Hebrew Text Database of the ETCBC.

So it is both natural and convenient to be able to work with a Text-Fabric resource through MQL.
If you have obtained an MQL dataset somehow, you can turn it into a Text-Fabric dataset with `importMQL()`, which we will not show here.

Conversely, if you want to export a Text-Fabric dataset to MQL, that is also possible: after the `Fabric(modules=...)` call, you can call `exportMQL()` to save all features of the indicated modules into a big MQL dump, which can then be imported into an Emdros database.
A.exportMQL("mybhsa", exportDir="~/Downloads/mql")
0.00s Checking features of dataset mybhsa
| 1.75s feature "book@am" => "book_am" | 1.75s feature "book@ar" => "book_ar" | 1.75s feature "book@bn" => "book_bn" | 1.75s feature "book@da" => "book_da" | 1.75s feature "book@de" => "book_de" | 1.75s feature "book@el" => "book_el" | 1.75s feature "book@en" => "book_en" | 1.75s feature "book@es" => "book_es" | 1.75s feature "book@fa" => "book_fa" | 1.75s feature "book@fr" => "book_fr" | 1.75s feature "book@he" => "book_he" | 1.75s feature "book@hi" => "book_hi" | 1.75s feature "book@id" => "book_id" | 1.75s feature "book@ja" => "book_ja" | 1.75s feature "book@ko" => "book_ko" | 1.75s feature "book@la" => "book_la" | 1.75s feature "book@nl" => "book_nl" | 1.75s feature "book@pa" => "book_pa" | 1.75s feature "book@pt" => "book_pt" | 1.75s feature "book@ru" => "book_ru" | 1.75s feature "book@sw" => "book_sw" | 1.76s feature "book@syc" => "book_syc" | 1.76s feature "book@tr" => "book_tr" | 1.76s feature "book@ur" => "book_ur" | 1.76s feature "book@yo" => "book_yo" | 1.76s feature "book@zh" => "book_zh" | 1.76s feature "omap@2017-2021" => "omap_2017_2021" | 1.76s feature "omap@c-2021" => "omap_c_2021"
0.02s 118 features to export to MQL ... 0.02s Loading 118 features 1.88s Writing enumerations book_am : 39 values, 39 not a name, e.g. «መኃልየ_መኃልይ_ዘሰሎሞን» book_ar : 39 values, 39 not a name, e.g. «1_اخبار» book_bn : 39 values, 39 not a name, e.g. «আদিপুস্তক» book_da : 39 values, 13 not a name, e.g. «1.Kongebog» book_de : 39 values, 7 not a name, e.g. «1_Chronik» book_el : 39 values, 39 not a name, e.g. «Άσμα_Ασμάτων» book_en : 39 values, 6 not a name, e.g. «1_Chronicles» book_es : 39 values, 22 not a name, e.g. «1_Crónicas» book_fa : 39 values, 39 not a name, e.g. «استر» book_fr : 39 values, 19 not a name, e.g. «1_Chroniques» book_he : 39 values, 39 not a name, e.g. «איוב» book_hi : 39 values, 39 not a name, e.g. «1_इतिहास» book_id : 39 values, 7 not a name, e.g. «1_Raja-raja» book_ja : 39 values, 39 not a name, e.g. «アモス書» book_ko : 39 values, 39 not a name, e.g. «나훔» book_nl : 39 values, 8 not a name, e.g. «1_Koningen» book_pa : 39 values, 39 not a name, e.g. «1_ਇਤਹਾਸ» book_pt : 39 values, 21 not a name, e.g. «1_Crônicas» book_ru : 39 values, 39 not a name, e.g. «1-я_Паралипоменон» book_sw : 39 values, 6 not a name, e.g. «1_Mambo_ya_Nyakati» book_syc : 39 values, 39 not a name, e.g. «ܐ_ܒܪܝܡܝܢ» book_tr : 39 values, 16 not a name, e.g. «1_Krallar» book_ur : 39 values, 39 not a name, e.g. «احبار» book_yo : 39 values, 8 not a name, e.g. «Amọsi» book_zh : 38 values, 37 not a name, e.g. «以斯帖记» domain : 4 values, 1 not a name, e.g. «?» g_nme : 108 values, 108 not a name, e.g. «» g_nme_utf8 : 106 values, 106 not a name, e.g. «» g_pfm : 87 values, 87 not a name, e.g. «» g_pfm_utf8 : 86 values, 86 not a name, e.g. «» g_prs : 127 values, 127 not a name, e.g. «» g_prs_utf8 : 126 values, 126 not a name, e.g. «» g_uvf : 19 values, 19 not a name, e.g. «» g_uvf_utf8 : 17 values, 17 not a name, e.g. «» g_vbe : 101 values, 101 not a name, e.g. «» g_vbe_utf8 : 97 values, 97 not a name, e.g. «» g_vbs : 66 values, 66 not a name, e.g. «» g_vbs_utf8 : 65 values, 65 not a name, e.g. 
«» instruction : 35 values, 20 not a name, e.g. «.#» nametype : 10 values, 5 not a name, e.g. «gens,topo» nme : 20 values, 7 not a name, e.g. «» pfm : 11 values, 4 not a name, e.g. «» phono_trailer : 4 values, 4 not a name, e.g. «» prs : 22 values, 4 not a name, e.g. «H=» qere_trailer : 5 values, 5 not a name, e.g. «» qere_trailer_utf8: 5 values, 5 not a name, e.g. «» root : 757 values, 212 not a name, e.g. «<Assyrian>» trailer : 13 values, 13 not a name, e.g. «» trailer_utf8 : 13 values, 13 not a name, e.g. «» txt : 136 values, 59 not a name, e.g. «?» uvf : 6 values, 1 not a name, e.g. «>» vbe : 19 values, 6 not a name, e.g. «» vbs : 11 values, 3 not a name, e.g. «>» | 0.47s Writing an all-in-one enum with 232 values 2.35s Mapping 118 features onto 13 object types 4.92s Writing 118 features as data in 13 object types | 0.00s word data ... | | 1.59s batch of size 49.9MB with 50000 of 50000 words | | 3.16s batch of size 50.0MB with 50000 of 100000 words | | 4.74s batch of size 50.2MB with 50000 of 150000 words | | 6.32s batch of size 50.2MB with 50000 of 200000 words | | 7.90s batch of size 50.4MB with 50000 of 250000 words | | 9.47s batch of size 50.4MB with 50000 of 300000 words | | 11s batch of size 50.5MB with 50000 of 350000 words | | 13s batch of size 50.4MB with 50000 of 400000 words | | 13s batch of size 26.8MB with 26590 of 426590 words | 13s word data: 426590 objects | 0.00s subphrase data ... | | 0.23s batch of size 8.6MB with 50000 of 50000 subphrases | | 0.45s batch of size 8.5MB with 50000 of 100000 subphrases | | 0.51s batch of size 2.4MB with 13850 of 113850 subphrases | 0.51s subphrase data: 113850 objects | 0.00s phrase_atom data ... 
| | 0.32s batch of size 12.0MB with 50000 of 50000 phrase_atoms | | 0.64s batch of size 12.0MB with 50000 of 100000 phrase_atoms | | 0.96s batch of size 12.2MB with 50000 of 150000 phrase_atoms | | 1.28s batch of size 12.2MB with 50000 of 200000 phrase_atoms | | 1.60s batch of size 12.1MB with 50000 of 250000 phrase_atoms | | 1.71s batch of size 4.3MB with 17532 of 267532 phrase_atoms | 1.72s phrase_atom data: 267532 objects | 0.00s phrase data ... | | 0.29s batch of size 10.9MB with 50000 of 50000 phrases | | 0.58s batch of size 11.0MB with 50000 of 100000 phrases | | 0.87s batch of size 11.0MB with 50000 of 150000 phrases | | 1.17s batch of size 11.0MB with 50000 of 200000 phrases | | 1.46s batch of size 11.0MB with 50000 of 250000 phrases | | 1.47s batch of size 724.2KB with 3203 of 253203 phrases | 1.47s phrase data: 253203 objects | 0.00s clause_atom data ... | | 0.41s batch of size 14.4MB with 50000 of 50000 clause_atoms | | 0.74s batch of size 11.7MB with 40704 of 90704 clause_atoms | 0.74s clause_atom data: 90704 objects | 0.00s clause data ... | | 0.35s batch of size 13.3MB with 50000 of 50000 clauses | | 0.62s batch of size 10.2MB with 38131 of 88131 clauses | 0.62s clause data: 88131 objects | 0.00s sentence_atom data ... | | 0.21s batch of size 7.8MB with 50000 of 50000 sentence_atoms | | 0.27s batch of size 2.3MB with 14514 of 64514 sentence_atoms | 0.27s sentence_atom data: 64514 objects | 0.00s sentence data ... | | 0.17s batch of size 6.3MB with 50000 of 50000 sentences | | 0.22s batch of size 1.7MB with 13717 of 63717 sentences | 0.22s sentence data: 63717 objects | 0.00s half_verse data ... | | 0.16s batch of size 5.5MB with 45179 of 45179 half_verses | 0.16s half_verse data: 45179 objects | 0.00s verse data ... | | 0.14s batch of size 4.8MB with 23213 of 23213 verses | 0.14s verse data: 23213 objects | 0.00s lex data ... | | 0.20s batch of size 5.5MB with 9230 of 9230 lexs | 0.20s lex data: 9230 objects | 0.00s chapter data ... 
| | 0.02s batch of size 131.1KB with 929 of 929 chapters | 0.02s chapter data: 929 objects | 0.00s book data ... | | 0.02s batch of size 29.2KB with 39 of 39 books | 0.02s book data: 39 objects 24s MQL in ~/Downloads/mql 24s Done
Now you have a file `~/Downloads/mql/mybhsa.mql` of 530 MB. You can import it into an Emdros database by saying:
cd ~/Downloads/mql
rm -f mybhsa  # remove any previous database with this name; the import reads mybhsa.mql
mql -b 3 < mybhsa.mql
The result is an SQLite3 database `mybhsa` in the same directory (168 MB).
You can run a query against it by creating a text file `test.mql` with these contents:
select all objects where
[lex gloss ~ 'make'
[word FOCUS]
]
And then say
mql -b 3 -d mybhsa test.mql
You will see raw query results: all word occurrences that belong to lexemes with `make` in their gloss.
The output is not very pretty, and you should probably use a more visual Emdros tool to run such queries. You will see a lot of node numbers, but the good thing is that you can look those node numbers up in Text-Fabric.
CC-BY Dirk Roorda