#!/usr/bin/env python
# coding: utf-8
#
# # Tutorial
#
# This notebook gets you started with using
# [Text-Fabric](https://dans-labs.github.io/text-fabric/) for coding in the Hebrew Bible.
#
# Chances are that a bit of reading about the underlying
# [data model](https://dans-labs.github.io/text-fabric/Model/Data-Model/)
# helps you to follow the exercises below, and vice versa.
# ## Installing Text-Fabric
#
# ### Python
#
# You need to have Python on your system. Most systems have it out of the box,
# but alas, that is Python 2, and we need at least Python **3.6**.
#
# Install it from [python.org](https://www.python.org) or from
# [Anaconda](https://www.anaconda.com/download).
#
# ### Jupyter notebook
#
# You need [Jupyter](http://jupyter.org).
#
# If it is not already installed:
#
# ```
# pip3 install jupyter
# ```
#
# ### TF itself
#
# ```
# pip3 install text-fabric
# ```
# In[1]:
get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')
# In[2]:
import sys, os, collections
from IPython.display import HTML
# In[3]:
from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa
# # Call Text-Fabric
#
# Everything starts by calling up Text-Fabric.
# It needs to know where to look for data.
#
# The Hebrew Bible is in the same repository as this tutorial.
# I assume you have cloned [bhsa](https://github.com/etcbc/bhsa)
# and [phono](https://github.com/etcbc/phono)
# in your directory `~/github/etcbc`, so that your directory structure looks like this
#
# your home directory\
# | - github\
# | | - etcbc\
# | | | - bhsa
# | | | - phono
#
# ## Tip
# If you start computing with this tutorial, first copy its parent directory to somewhere else,
# outside your `bhsa` directory.
# If you pull changes from the `bhsa` repository later, your work will not be overwritten.
# Where you put your tutorial directory is up to you.
# It will work from any directory.
# In[4]:
VERSION = '2017'
DATABASE = '~/github/etcbc'
BHSA = f'bhsa/tf/{VERSION}'
PHONO = f'phono/tf/{VERSION}'
TF = Fabric(locations=[DATABASE], modules=[BHSA, PHONO], silent=False)
# Note that we have added a module `phono`.
# The BHSA data has a special 1-1 transcription from Hebrew to ASCII,
# but not a *phonetic* transcription.
#
# I have made a
# [notebook](https://github.com/etcbc/phono/blob/master/programs/phono.ipynb)
# that tries hard to find phonological representations for all the words.
# The result is a module in text-fabric format.
# We'll encounter that later.
#
# **NB:** This is a real-world example of how to add data to an existing data source as a module.
# # Load Features
# The data of the BHSA is organized in features.
# They are *columns* of data.
# Think of the Hebrew Bible as a gigantic spreadsheet, where row 1 corresponds to the
# first word, row 2 to the second word, and so on, for all 425,000 words.
#
# The information about which part of speech each word is constitutes a column in that spreadsheet.
# The BHSA contains over 100 columns, not only for the 425,000 words, but also for a million more
# textual objects.
#
# Instead of putting that information in one big table, the data is organized in separate columns.
# We call those columns **features**.
#
# We just load the features we need for this tutorial.
# Later on, where we use them, it will become clear what they mean.
# In[10]:
api = TF.load('''
sp lex voc_lex_utf8
g_word trailer
g_lex_utf8
qere qere_trailer
language freq_lex gloss
mother
''')
api.makeAvailableIn(globals())
# The result of this all is that we have a bunch of special variables at our disposal
# that give us access to the text and data of the Hebrew Bible.
#
# At this point it is helpful to throw a quick glance at the text-fabric
# [API documentation](https://dans-labs.github.io/text-fabric/Api/General/).
#
# The most essential thing for now is that we can use `F` to access the data in the features
# we've loaded.
# But there is more, such as `N`, which helps us to walk over the text, as we'll see in a minute.
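#
# For instance, once the features are loaded, you can look up feature values per node.
# A quick sketch (node 1 is the first word of the text):
#
# ```
# F.otype.v(1)   # the node type of node 1: 'word'
# F.sp.v(1)      # its part of speech
# F.lex.v(1)     # its lexeme
# ```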
# ## More power
#
# There are extra functions on top of Text-Fabric that know about the Hebrew Bible.
# Let's acquire additional power.
# In[11]:
B = Bhsa(api, 'start', version=VERSION)
# A few things to note:
#
# * You supply the `api` as first argument to `Bhsa()`
# * You supply the plain *name* of the notebook that you are writing as the second argument
# * You supply the *version* of the BHSA data as the third argument
#
# The result is that you have a few handy links to
#
# * the data provenance and documentation
# * the BHSA API and the Text-Fabric API
# * the online versions of this notebook on GitHub and NBViewer.
# ## Search
# Text-Fabric contains a flexible search engine, which works not only on the BHSA data,
# but also on data that you add to it.
#
# **Search is the quickest way to come up-to-speed with your data, without too much programming.**
#
# Jump to the dedicated [search](search.ipynb) tutorial first, to whet your appetite.
# And if you already know MQL queries, you can build from that in
# [searchFromMQL](searchFromMQL.ipynb).
#
# The real power of search lies in the fact that it is integrated in a programming environment.
# You can use programming to:
#
# * compose dynamic queries
# * process query results
#
# Therefore, the rest of this tutorial is still important when you want to tap that power.
# If you continue here, you learn all the basics of data-navigation with Text-Fabric.
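#
# To give a first taste here: a search template describes a pattern of nested objects
# with feature conditions, and `S.search()` yields the matching node tuples.
# A minimal sketch, using only features loaded in this tutorial
# (`BR>[` is assumed to be the BHSA transcription of the lexeme of *bara*, 'create'):
#
# ```
# query = '''
# clause
#   word sp=verb lex=BR>[
# '''
# results = sorted(S.search(query))
# B.table(results, end=5)
# ```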
# # Counting
#
# In order to get acquainted with the data, we start with the simple task of counting.
#
# ## Count all nodes
# We use the
# [`N()` generator](https://dans-labs.github.io/text-fabric/Api/General/#navigating-nodes)
# to walk through the nodes.
#
# We compared the BHSA data to a gigantic spreadsheet, where the rows correspond to the words.
# In Text-Fabric, we call the rows `slots`, because they are the textual positions that can be filled with words.
#
# We also mentioned that there are about 1,000,000 more textual objects.
# They are the phrases, clauses, sentences, verses, chapters and books.
# They also correspond to rows in the big spreadsheet.
#
# In Text-Fabric we call all these rows *nodes*, and the `N()` generator
# carries us through those nodes in the textual order.
#
# Just one extra thing: the `info` statements generate timed messages.
# If you use them instead of `print` you'll get a sense of the amount of time that
# the various processing steps typically need.
# In[6]:
indent(reset=True)
info('Counting nodes ...')
i = 0
for n in N(): i += 1
info('{} nodes'.format(i))
# Here you see it: 1.4 M nodes!
# ## What are those million nodes?
# Every node has a type, like word, phrase, or sentence.
# We know that we have approximately 425,000 words and a million other nodes.
# But what exactly are they?
#
# Text-Fabric has two special features, `otype` and `oslots`, that must occur in every Text-Fabric data set.
# `otype` tells you the type of each node, and from it you can also ask for the number of slots in the text.
#
# Here we go!
# In[7]:
F.otype.slotType
# In[8]:
F.otype.maxSlot
# In[9]:
F.otype.maxNode
# In[10]:
F.otype.all
# In[11]:
C.levels.data
# This is interesting: above you see all the textual objects, with the average size of their objects,
# the node where they start, and the node where they end.
# ## Count individual object types
# This is an intuitive way to count the number of nodes in each type.
# Note in passing, how we use the `indent` in conjunction with `info` to produce neat timed
# and indented progress messages.
# In[12]:
indent(reset=True)
info('counting objects ...')
for otype in F.otype.all:
    i = 0
    indent(level=1, reset=True)
    for n in F.otype.s(otype):
        i += 1
    info('{:>7} {}s'.format(i, otype))
indent(level=0)
info('Done')
# # Viewing textual objects
#
# We use the BHSA API (the extra power) to peek into the corpus.
# First a word. Node 100,000 is a slot. Let's see what it is and where it is.
# In[13]:
wordShow = 100000
B.pretty(wordShow)
# Note
# * if you click on the word
# you go to a page in SHEBANQ that shows a list of all occurrences of this lexeme;
# * if you hover on the part-of-speech (`prep` here), you see the passage,
# and if you click on it, you go to SHEBANQ, to exactly this verse.
# Let us do the same for more complex objects, such as phrases, sentences, etc.
# In[14]:
phraseShow = 700001
B.pretty(phraseShow)
# In[15]:
clauseShow = 500002
B.pretty(clauseShow)
# In[16]:
sentenceShow = 1200001
B.pretty(sentenceShow)
# In[17]:
verseShow = 1420000
B.pretty(verseShow)
# In[18]:
chapterShow = 427000
print(F.otype.v(chapterShow))
B.pretty(chapterShow)
# If you need a link to SHEBANQ for just any node:
# In[19]:
million = 1000000
B.shbLink(million)
# # Feature statistics
#
# `F`
# gives access to all features.
# Every feature has a method
# `freqList()`
# to generate a frequency list of its values, higher frequencies first.
# Here are the parts of speech:
# In[20]:
F.sp.freqList()
# # Lexeme matters
#
# ## Top 10 frequent verbs
#
# If we count the frequency of words, we usually mean the frequency of their
# corresponding lexemes.
#
# There are several methods for working with lexemes.
#
# ### Method 1: counting words
# In[21]:
verbs = collections.Counter()
indent(reset=True)
info('Collecting data')
for w in F.otype.s('word'):
    if F.sp.v(w) != 'verb':
        continue
    verbs[F.lex.v(w)] += 1
info('Done')
print(''.join(
    '{}: {}\n'.format(verb, cnt)
    for (verb, cnt) in sorted(verbs.items(), key=lambda x: (-x[1], x[0]))[0:10]
))
# ### Method 2: counting lexemes
#
# An alternative way to do this is to use the feature `freq_lex`, defined for `lex` nodes.
# Now we walk the lexemes instead of the occurrences.
#
# Note that the feature `sp` (part-of-speech) is defined for nodes of type `word` as well as `lex`.
# Both also have the `lex` feature.
# In[22]:
verbs = collections.Counter()
indent(reset=True)
info('Collecting data')
for w in F.otype.s('lex'):
    if F.sp.v(w) != 'verb':
        continue
    verbs[F.lex.v(w)] += F.freq_lex.v(w)
info('Done')
print(''.join(
    '{}: {}\n'.format(verb, cnt)
    for (verb, cnt) in sorted(verbs.items(), key=lambda x: (-x[1], x[0]))[0:10]
))
# This is an order of magnitude faster. In this case, that means the difference between a third of a second and a
# hundredth of a second, not a big gain in absolute terms.
# But suppose you need to run this 1000 times in a loop.
# Then it is the difference between 5 minutes and 10 seconds.
# A five minute wait is not pleasant in interactive computing!
# ### A frequency mapping of lexemes
#
# We make a mapping between lexeme forms and the number of occurrences of those lexemes.
# In[17]:
lexeme_dict = {
    F.g_lex_utf8.v(n): F.freq_lex.v(n)
    for n in F.otype.s('word')
}
# In[18]:
list(lexeme_dict.items())[0:10]
# ### Real work
#
# As a primer on real-world work on lexeme distribution, have a look at James Cuénod's notebook on
# [Collocation MI Analysis of the Hebrew Bible](https://nbviewer.jupyter.org/github/jcuenod/hebrewCollocations/blob/master/Collocation%20MI%20Analysis%20of%20the%20Hebrew%20Bible.ipynb)
#
# It is a nice example of how you can collect data with TF API calls, then do research with your own methods and tools, and then use TF to present the results.
#
# In case the name has changed, the enclosing repo is
# [here](https://nbviewer.jupyter.org/github/jcuenod/hebrewCollocations/tree/master/).
# ## Lexeme distribution
#
# Let's do a bit more fancy lexeme stuff.
#
# ### Hapaxes
#
# A hapax can be found by inspecting lexemes and seeing how many word nodes they are linked to.
# If that number is one, we have a hapax.
#
# We print 10 hapaxes with their glosses.
# In[23]:
indent(reset=True)
hapax = []
zero = set()
for l in F.otype.s('lex'):
    occs = L.d(l, otype='word')
    n = len(occs)
    if n == 0:  # that's weird: should not happen
        zero.add(l)
    elif n == 1:  # hapax found!
        hapax.append(l)
info('{} hapaxes found'.format(len(hapax)))
if zero:
    error('{} zeroes found'.format(len(zero)), tm=False)
else:
    info('No zeroes found', tm=False)
for h in hapax[0:10]:
    print('\t{:<8} {}'.format(F.lex.v(h), F.gloss.v(h)))
# ### Small occurrence base
#
# The occurrence base of a lexeme is the set of verses, chapters and books in which it occurs.
# Let's look for lexemes that occur in a single chapter.
#
# If a lexeme occurs in a single chapter, its slots are a subset of the slots of that chapter.
# So, if you go *up* from the lexeme, you encounter the chapter.
#
# Normally, a lexeme occurs in many chapters, and then no single chapter contains all of its occurrences,
# so if you go up from such a lexeme, you do not find any chapter.
#
# Let's check it out.
#
# Oh yes, we have already found the hapaxes, we will skip them here.
# In[24]:
indent(reset=True)
info('Finding single chapter lexemes')
singleCh = []
multiple = []
for l in F.otype.s('lex'):
    chapters = L.u(l, 'chapter')
    if len(chapters) == 1:
        if l not in hapax:
            singleCh.append(l)
    elif len(chapters) > 1:  # should not happen: chapters do not overlap
        multiple.append(l)
info('{} single chapter lexemes found'.format(len(singleCh)))
if multiple:
    error('{} lexemes in multiple chapters found'.format(len(multiple)), tm=False)
else:
    info('No lexemes in multiple chapters found', tm=False)
for s in singleCh[0:10]:
    print('{:<20} {:<6}'.format(
        '{} {}:{}'.format(*T.sectionFromNode(s)),
        F.lex.v(s),
    ))
# ### Confined to books
#
# As a final exercise with lexemes, let's make a list of all books, and show their total number of lexemes and
# the number of lexemes that occur exclusively in that book.
# In[25]:
indent(reset=True)
info('Making book-lexeme index')
allBook = collections.defaultdict(set)
allLex = set()
for b in F.otype.s('book'):
    for w in L.d(b, 'word'):
        l = L.u(w, 'lex')[0]
        allBook[b].add(l)
        allLex.add(l)
info('Found {} lexemes'.format(len(allLex)))
# In[26]:
indent(reset=True)
info('Finding single book lexemes')
singleBook = collections.defaultdict(lambda: 0)
for l in F.otype.s('lex'):
    book = L.u(l, 'book')
    if len(book) == 1:
        singleBook[book[0]] += 1
info('found {} single book lexemes'.format(sum(singleBook.values())))
# In[27]:
print('{:<20}{:>5}{:>5}{:>5}\n{}'.format(
    'book', '#all', '#own', '%own',
    '-' * 35,
))
booklist = []
for b in F.otype.s('book'):
    book = T.bookName(b)
    a = len(allBook[b])
    o = singleBook.get(b, 0)
    p = 100 * o / a
    booklist.append((book, a, o, p))
for x in sorted(booklist, key=lambda e: (-e[3], -e[1], e[0])):
    print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))
# The book names may sound a bit unfamiliar; they are in Latin here.
# Later we'll see that you can also get them in English, or in Swahili.
# # Locality API
# We travel upwards and downwards, forwards and backwards through the nodes.
# The Locality-API (`L`) provides functions: `u()` for going up, and `d()` for going down,
# `n()` for going to next nodes and `p()` for going to previous nodes.
#
# These directions are indirect notions: nodes are just numbers, but by means of the
# `oslots` feature they are linked to slots. One node *contains* another node if its set of slots contains the set of slots that the other is linked to.
# And one node is *next* or *previous* to another if its slots follow or precede the slots of the other one.
#
# `L.u(node)` **Up** is going to nodes that embed `node`.
#
# `L.d(node)` **Down** is the opposite direction, to those that are contained in `node`.
#
# `L.n(node)` **Next** are the next *adjacent* nodes, i.e. nodes whose first slot comes immediately after the last slot of `node`.
#
# `L.p(node)` **Previous** are the previous *adjacent* nodes, i.e. nodes whose last slot comes immediately before the first slot of `node`.
#
# All these functions yield nodes of all possible otypes.
# By passing an optional parameter, you can restrict the results to nodes of that type.
#
# The results are ordered according to the order of things in the text.
#
# The functions always return a tuple, even if there is just one node in the result.
#
# ## Going up
# We go from the first word to the book that contains it.
# Note the `[0]` at the end. You expect one book, yet `L` returns a tuple.
# To get the only element of that tuple, you need that `[0]`.
#
# If you are like me, you keep forgetting it, and that will lead to weird error messages later on.
# In[28]:
firstBook = L.u(1, otype='book')[0]
print(firstBook)
# And let's see all the containing objects of word 3:
# In[29]:
w = 3
for otype in F.otype.all:
    if otype == F.otype.slotType:
        continue
    up = L.u(w, otype=otype)
    upNode = 'x' if len(up) == 0 else up[0]
    print('word {} is contained in {} {}'.format(w, otype, upNode))
# ## Going next
# Let's go to the next nodes of the first book.
# In[30]:
afterFirstBook = L.n(firstBook)
for n in afterFirstBook:
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n, F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))
secondBook = L.n(firstBook, otype='book')[0]
# ## Going previous
#
# And let's see what is right before the second book.
# In[31]:
for n in L.p(secondBook):
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n, F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))
# ## Going down
# We go to the chapters of the second book, and just count them.
# In[32]:
chapters = L.d(secondBook, otype='chapter')
print(len(chapters))
# ## The first verse
# We pick the first verse and the first word, and explore what is above and below them.
# In[33]:
for n in [1, L.u(1, otype='verse')[0]]:
    indent(level=0)
    info('Node {}'.format(n), tm=False)
    indent(level=1)
    info('UP', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    indent(level=1)
    info('DOWN', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(d, F.otype.v(d)) for d in L.d(n)]), tm=False)
indent(level=0)
info('Done', tm=False)
# # Text API
#
# So far, we have mainly seen nodes and their numbers, and the names of node types.
# You would almost forget that we are dealing with text.
# So let's try to see some text.
#
# In the same way as `F` gives access to feature data,
# `T` gives access to the text.
# That is also feature data, but you can tell Text-Fabric which features are specifically
# carrying the text, and in return Text-Fabric offers you
# a Text API: `T`.
#
# ## Formats
# Hebrew text can be represented in a number of ways:
#
# * fully pointed (vocalized and accented), or consonantal,
# * in transliteration, phonetic transcription or in Hebrew characters,
# * showing the actual text or only the lexemes,
# * following the ketiv or the qere, at places where they deviate from each other.
#
# If you wonder where the information about text formats is stored:
# not in the Text-Fabric program, but in the data set.
# It has a feature `otext`, which specifies the formats and which features
# must be used to produce them. `otext` is the third special feature in a TF data set,
# next to `otype` and `oslots`.
# It is an optional feature.
# If it is absent, there will be no `T` API.
#
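# Schematically, the definitions in `otext` look like this
# (a sketch only; the actual `otext.tf` in your data directory is authoritative):
#
# ```
# @fmt:text-orig-full={g_word_utf8}{trailer_utf8}
# @sectionFeatures=book,chapter,verse
# @sectionTypes=book,chapter,verse
# ```
#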
# Here is a list of all available formats in this data set.
# In[34]:
sorted(T.formats)
# Note the `text-phono-full` format here.
# It does not come from the main data source `bhsa`, but from the module `phono`.
# Look in your data directory, find `~/github/etcbc/phono/tf/2017/otext@phono.tf`,
# and you'll see this format defined there.
# ## Using the formats
# Now let's use those formats to print out the first verse of the Hebrew Bible.
# In[35]:
for fmt in sorted(T.formats):
    print('{}:\n\t{}'.format(fmt, T.text(range(1, 12), fmt=fmt)))
# If we do not specify a format, the **default** format is used (`text-orig-full`).
# In[36]:
print(T.text(range(1,12)))
# ## Whole text in all formats in just 10 seconds
# Part of the pleasure of working with computers is that they can crunch massive amounts of data.
# The text of the Hebrew Bible is a piece of cake.
#
# It takes just ten seconds to have that cake and eat it.
# In nearly a dozen formats.
# In[37]:
indent(reset=True)
info('writing plain text of whole Bible in all formats')
text = collections.defaultdict(list)
for v in F.otype.s('verse'):
    words = L.d(v, 'word')
    for fmt in sorted(T.formats):
        text[fmt].append(T.text(words, fmt=fmt))
info('done {} formats'.format(len(text)))
for fmt in sorted(text):
    print('{}\n{}\n'.format(fmt, '\n'.join(text[fmt][0:5])))
# ### The full plain text
# We write a few formats to file, in your `Downloads` folder.
# In[38]:
T.formats
# In[39]:
for fmt in '''
    text-orig-full
    text-phono-full
'''.strip().split():
    with open(os.path.expanduser(f'~/Downloads/{fmt}.txt'), 'w') as f:
        f.write('\n'.join(text[fmt]))
# ## Book names
#
# For Bible book names, we can use several languages.
#
# ### Languages
# Here are the languages that we can use for book names.
# These languages come from the features `book@ll`, where `ll` is a two letter
# ISO language code. Have a look in your data directory, you can't miss them.
# In[40]:
T.languages
# ### Book names in Swahili
# Get the book names in Swahili.
# In[41]:
nodeToSwahili = ''
for b in F.otype.s('book'):
    nodeToSwahili += '{} = {}\n'.format(b, T.bookName(b, lang='sw'))
print(nodeToSwahili)
# ## Book nodes from Swahili
# OK, there they are. We copy them into a string, and do the opposite: get the nodes back.
# We check whether we get exactly the same nodes as the ones we started with.
# In[42]:
swahiliNames = '''
Mwanzo
Kutoka
Mambo_ya_Walawi
Hesabu
Kumbukumbu_la_Torati
Yoshua
Waamuzi
1_Samweli
2_Samweli
1_Wafalme
2_Wafalme
Isaya
Yeremia
Ezekieli
Hosea
Yoeli
Amosi
Obadia
Yona
Mika
Nahumu
Habakuki
Sefania
Hagai
Zekaria
Malaki
Zaburi
Ayubu
Mithali
Ruthi
Wimbo_Ulio_Bora
Mhubiri
Maombolezo
Esta
Danieli
Ezra
Nehemia
1_Mambo_ya_Nyakati
2_Mambo_ya_Nyakati
'''.strip().split()
swahiliToNode = ''
for nm in swahiliNames:
    swahiliToNode += '{} = {}\n'.format(T.bookNode(nm, lang='sw'), nm)
if swahiliToNode != nodeToSwahili:
    print('Something is not right with the book names')
else:
    print('Going from nodes to booknames and back yields the original nodes')
# ## Sections
#
# A section in the Hebrew Bible is a book, a chapter or a verse.
# Knowledge of sections is not baked into Text-Fabric.
# The config feature `otext.tf` may specify three section levels, and tell
# what the corresponding node types and features are.
#
# From that knowledge it can construct mappings from nodes to sections, e.g. from verse
# nodes to tuples of the form:
#
# (bookName, chapterNumber, verseNumber)
#
# Here are examples of getting the section that corresponds to a node and vice versa.
#
# **NB:** `sectionFromNode` always delivers a verse specification, computed either from the
# first slot belonging to that node, or, if `lastSlot=True`, from the last slot
# belonging to that node.
# In[43]:
for x in (
    ('section of first word', T.sectionFromNode(1)),
    ('node of Gen 1:1', T.nodeFromSection(('Genesis', 1, 1))),
    ('idem', T.nodeFromSection(('Mwanzo', 1, 1), lang='sw')),
    ('node of book Genesis', T.nodeFromSection(('Genesis',))),
    ('node of Genesis 1', T.nodeFromSection(('Genesis', 1))),
    ('section of book node', T.sectionFromNode(1367534)),
    ('idem, now last word', T.sectionFromNode(1367534, lastSlot=True)),
    ('section of chapter node', T.sectionFromNode(1367573)),
    ('idem, now last word', T.sectionFromNode(1367573, lastSlot=True)),
):
    print('{:<30} {}'.format(*x))
# ## Sentences spanning multiple verses
# If you go up from a sentence node, you expect to find a verse node.
# But some sentences span multiple verses, and in that case, you will not find the enclosing
# verse node, because it is not there.
#
# Here is a piece of code to detect and list all cases where sentences span multiple verses.
#
# The idea is to pick the first and the last word of a sentence, use `T.sectionFromNode` to
# discover the verse in which that word occurs, and if they are different: bingo!
#
# We show the first 10 of ca. 900 cases.
# By the way: doing this in the `2016` version of the data yields 915 results.
# The splitting up of the text into sentences is not carved in stone!
# In[44]:
indent(reset=True)
info('Get sentences that span multiple verses')
spanSentences = []
for s in F.otype.s('sentence'):
    f = T.sectionFromNode(s, lastSlot=False)
    l = T.sectionFromNode(s, lastSlot=True)
    if f != l:
        spanSentences.append('{} {}:{}-{}'.format(f[0], f[1], f[2], l[2]))
info('Found {} cases'.format(len(spanSentences)))
info('\n{}'.format('\n'.join(spanSentences[0:10])))
# A different way, with better display, is:
# In[45]:
indent(reset=True)
info('Get sentences that span multiple verses')
spanSentences = []
for s in F.otype.s('sentence'):
    words = L.d(s, otype='word')
    fw = words[0]
    lw = words[-1]
    fVerse = L.u(fw, otype='verse')[0]
    lVerse = L.u(lw, otype='verse')[0]
    if fVerse != lVerse:
        spanSentences.append((s, fVerse, lVerse))
info('Found {} cases'.format(len(spanSentences)))
B.table(spanSentences, end=10, linked=2)
# We can zoom in:
# In[46]:
B.show(spanSentences, condensed=False, start=6, end=6)
# In[47]:
B.pretty(spanSentences[5][0])
# # Ketiv Qere
# Let us explore where Ketiv/Qere pairs are and how they render.
# In[46]:
qeres = [w for w in F.otype.s('word') if F.qere.v(w) is not None]
print('{} qeres'.format(len(qeres)))
for w in qeres[0:10]:
    print('{}: ketiv = "{}"+"{}" qere = "{}"+"{}"'.format(
        w, F.g_word.v(w), F.trailer.v(w), F.qere.v(w), F.qere_trailer.v(w),
    ))
# ## Show a ketiv-qere pair
# Let us print all text representations of the verse in which word node 4419 occurs.
# In[47]:
refWord = 4419
vn = L.u(refWord, otype='verse')[0]
ws = L.d(vn, otype='word')
print('{} {}:{}'.format(*T.sectionFromNode(refWord)))
for fmt in sorted(T.formats):
    if fmt.startswith('text-'):
        print('{:<25} {}'.format(fmt, T.text(ws, fmt=fmt)))
# # Edge features: mother
#
# We have not talked about edges much. If the nodes correspond to the rows in the big spreadsheet,
# the edges point from one row to another.
#
# One edge we have encountered: the special feature `oslots`.
# Each non-slot node is linked by `oslots` to all of its slot nodes.
#
# An edge is really a feature as well.
# Whereas a node feature is a column of information,
# one cell per node,
# an edge feature is also a column of information, one cell per pair of nodes.
#
# Linguists use more relationships between textual objects, for example:
# linguistic dependency.
# In the BHSA all cases of linguistic dependency are coded in the edge feature `mother`.
#
# Let us make a few basic enquiries on an edge feature:
# [mother](https://etcbc.github.io/bhsa/features/hebrew/2017/mother).
#
# We count how many mothers nodes can have (it turns out to be 0 or 1).
# We walk through all nodes; per node we retrieve the mother nodes, and
# we store the lengths (if non-zero) in a dictionary (`motherLen`).
#
# We see that nodes have at most one mother.
#
# We also count the inverse relationship: daughters.
# In[48]:
info('Counting mothers')
motherLen = {}
daughterLen = {}
for c in N():
    lms = E.mother.f(c) or []
    lds = E.mother.t(c) or []
    nms = len(lms)
    nds = len(lds)
    if nms:
        motherLen[c] = nms
    if nds:
        daughterLen[c] = nds
info('{} nodes have mothers'.format(len(motherLen)))
info('{} nodes have daughters'.format(len(daughterLen)))
motherCount = collections.Counter()
daughterCount = collections.Counter()
for (n, lm) in motherLen.items():
    motherCount[lm] += 1
for (n, ld) in daughterLen.items():
    daughterCount[ld] += 1
print('mothers', motherCount)
print('daughters', daughterCount)
# # Next steps
#
# By now you have an impression of how to compute around in the Hebrew Bible.
# While this is still the beginning, I hope you already sense the power of unlimited programmatic access
# to all the bits and bytes in the data set.
#
# Here are a few directions for unleashing that power.
#
# ## Explore additional data
# The ETCBC has a few other repositories with data that work in conjunction with the BHSA data.
# One of them you have already seen:
# [phono](https://github.com/ETCBC/phono),
# for phonetic transcriptions.
#
# There is also
# [parallels](https://github.com/ETCBC/parallels)
# for detecting parallel passages,
# and
# [valence](https://github.com/ETCBC/valence)
# for studying patterns around verbs that determine their meanings.
#
# ## Add your own data
# If you study the additional data, you can observe how that data is created and also
# how it is turned into a text-fabric data module.
# The last step is incredibly easy. You can write out every Python dictionary whose keys are node numbers
# and whose values are strings or numbers as a Text-Fabric feature.
# When you are creating data, you have already constructed those dictionaries, so writing
# them out is just one method call.
# See for example how the
# [flowchart](https://github.com/ETCBC/valence/blob/master/programs/flowchart.ipynb#Add-sense-feature-to-valence-module)
# notebook in valence writes out verb sense data.
# ![flow](images/valence.png)
#
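# That one method call is `TF.save()`. A minimal sketch (the feature name `sense`,
# the node-value pairs, and the module path are invented here for illustration):
#
# ```
# nodeFeatures = dict(
#     sense={
#         38500: 'give',  # hypothetical node -> value pairs
#         38501: 'take',
#     },
# )
# metaData = dict(
#     sense=dict(
#         valueType='str',
#         description='verb sense, computed in this notebook',
#     ),
# )
# TF.save(nodeFeatures=nodeFeatures, metaData=metaData, module='sense/tf/2017')
# ```
#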
# You can then easily share your new features on GitHub, so that your colleagues everywhere
# can try it out for themselves.
# ## Export to Emdros MQL
#
# [EMDROS](http://emdros.org), written by Ulrik Petersen,
# is a text database system with the powerful *topographic* query language MQL.
# The ideas are based on a model devised by Crist-Jan Doedens in
# [Text Databases: One Database Model and Several Retrieval Languages](https://books.google.nl/books?id=9ggOBRz1dO4C).
#
# Text-Fabric's model of slots, nodes and edges is a fairly straightforward translation of the models of Crist-Jan Doedens and Ulrik Petersen.
#
# [SHEBANQ](https://shebanq.ancient-data.org) uses EMDROS to let users execute and save MQL queries against the Hebrew Text Database of the ETCBC.
#
# So it is kind of logical and convenient to be able to work with a Text-Fabric resource through MQL.
#
# If you have obtained an MQL dataset somehow, you can turn it into a text-fabric data set by `importMQL()`,
# which we will not show here.
#
# And if you want to export a Text-Fabric data set to MQL, that is also possible.
#
# After the `Fabric(modules=...)` call, you can call `exportMQL()` in order to save all features of the
# indicated modules into a big MQL dump, which can be imported by an EMDROS database.
# In[39]:
TF.exportMQL('mybhsa', '~/Downloads')
# Now you have a file `~/Downloads/mybhsa.mql` of 530 MB.
# You can import it into an Emdros database by saying:
#
#     cd ~/Downloads
#     rm mybhsa
#     mql -b 3 < mybhsa.mql
#
# (The `rm` removes a previous version of the database, if it exists; the first time you can skip it.)
#
# The result is an SQLite3 database `mybhsa` in the same directory (168 MB).
# You can run a query against it by creating a text file `test.mql` with this content:
#
#     select all objects where
#     [lex gloss ~ 'make'
#       [word FOCUS]
#     ]
#
# And then say
#
#     mql -b 3 -d mybhsa test.mql
#
# You will see raw query results: all word occurrences that belong to lexemes with `make` in their gloss.
#
# It is not very pretty, and probably you should use a more visual Emdros tool to run those queries.
# You see a lot of node numbers, but the good thing is, you can look those node numbers up in Text-Fabric.
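#
# For example, if a raw MQL result mentions a node number, you can inspect that node
# back here. A quick sketch (the node number is arbitrary):
#
# ```
# n = 123456                   # a node number taken from an MQL result
# print(F.otype.v(n))          # what kind of object is it?
# print(T.sectionFromNode(n))  # where is it in the Bible?
# B.pretty(n)                  # rich display
# ```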
# # Clean caches
#
# Text-Fabric pre-computes data for you, so that it can be loaded faster.
# If the original data is updated, Text-Fabric detects it, and will recompute that data.
#
# But there are cases, e.g. when the algorithms of Text-Fabric have changed while the data has not, in which you might
# want to clear the cache of precomputed results.
#
# There are two ways to do that:
#
# * Locate the `.tf` directory of your dataset, and remove all `.tfx` files in it.
# This might be a bit awkward to do, because the `.tf` directory is hidden on Unix-like systems.
# * Call `TF.clearCache()`, which does exactly the same.
#
# It is not handy to execute the following cell all the time, that's why I have commented it out.
# So if you really want to clear the cache, remove the comment sign below.
# In[39]:
# TF.clearCache()