#!/usr/bin/env python
# coding: utf-8

#
# # Tutorial
#
# This notebook gets you started with using
# [Text-Fabric](https://dans-labs.github.io/text-fabric/) for coding in the Hebrew Bible.
#
# Chances are that a bit of reading about the underlying
# [data model](https://dans-labs.github.io/text-fabric/Model/Data-Model/)
# helps you to follow the exercises below, and vice versa.

# ## Installing Text-Fabric
#
# ### Python
#
# You need to have Python on your system. Most systems have it out of the box,
# but alas, that is python2 and we need at least python **3.6**.
#
# Install it from [python.org](https://www.python.org) or from
# [Anaconda](https://www.anaconda.com/download).
#
# ### Jupyter notebook
#
# You need [Jupyter](http://jupyter.org).
#
# If it is not already installed:
#
# ```
# pip3 install jupyter
# ```
#
# ### TF itself
#
# ```
# pip3 install text-fabric
# ```

# In[1]:

get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')

# In[2]:

import sys, os, collections
from IPython.display import HTML

# In[3]:

from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa

# # Call Text-Fabric
#
# Everything starts by calling up Text-Fabric.
# It needs to know where to look for data.
#
# The Hebrew Bible is in the same repository as this tutorial.
# I assume you have cloned [bhsa](https://github.com/etcbc/bhsa)
# and [phono](https://github.com/etcbc/phono)
# in your directory `~/github/etcbc`, so that your directory structure looks like this
#
# your home directory\
# | - github\
# | | - etcbc\
# | | | - bhsa
# | | | - phono
#
# ## Tip
# If you start computing with this tutorial, first copy its parent directory to somewhere else,
# outside your `bhsa` directory.
# If you pull changes from the `bhsa` repository later, your work will not be overwritten.
# Where you put your tutorial directory is up to you.
# It will work from any directory.

# In[4]:

VERSION = '2017'
DATABASE = '~/github/etcbc'
BHSA = f'bhsa/tf/{VERSION}'
PHONO = f'phono/tf/{VERSION}'

TF = Fabric(locations=[DATABASE], modules=[BHSA, PHONO], silent=False)

# Note that we have added a module `phono`.
# The BHSA data has a special 1-1 transcription from Hebrew to ASCII,
# but not a *phonetic* transcription.
#
# I have made a
# [notebook](https://github.com/etcbc/phono/blob/master/programs/phono.ipynb)
# that tries hard to find phonological representations for all the words.
# The result is a module in text-fabric format.
# We'll encounter that later.
#
# **NB:** This is a real-world example of how to add data to an existing data source as a module.

# # Load Features
# The data of the BHSA is organized in features.
# They are *columns* of data.
# Think of the Hebrew Bible as a gigantic spreadsheet, where row 1 corresponds to the
# first word, row 2 to the second word, and so on, for all 425,000 words.
#
# The information about which part of speech each word is constitutes a column in that spreadsheet.
# The BHSA contains over 100 columns, not only for the 425,000 words, but also for a million more
# textual objects.
#
# Instead of putting all that information in one big table, the data is organized in separate columns.
# We call those columns **features**.
#
# We just load the features we need for this tutorial.
# Later on, where we use them, it will become clear what they mean.
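# To make the "column" metaphor concrete: you can picture each feature as a mapping
# from node numbers to values. The cell below is only a toy illustration with made-up
# values, not part of the BHSA data; the real features are loaded with `TF.load()` in the
# next cell, after which `F.sp.v(node)` plays exactly this role.

# In[ ]:

# A hypothetical miniature "feature": part of speech for the first few word nodes.
toy_sp = {
    1: 'prep',   # node 1: a preposition
    2: 'subs',   # node 2: a noun
    3: 'verb',   # node 3: a verb
}

# Looking up the value of a feature for a node is essentially a dictionary lookup.
print(toy_sp[3])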
# In[10]:

api = TF.load('''
    sp lex voc_lex_utf8
    g_word trailer
    g_lex_utf8
    qere qere_trailer
    language freq_lex gloss
    mother
''')
api.makeAvailableIn(globals())

# The result of this all is that we have a bunch of special variables at our disposal
# that give us access to the text and data of the Hebrew Bible.
#
# At this point it is helpful to throw a quick glance at the text-fabric
# [API documentation](https://dans-labs.github.io/text-fabric/Api/General/).
#
# The most essential thing for now is that we can use `F` to access the data in the features
# we've loaded.
# But there is more, such as `N`, which helps us to walk over the text, as we will see in a minute.

# ## More power
#
# There are extra functions on top of Text-Fabric that know about the Hebrew Bible.
# Let's acquire that additional power.

# In[11]:

B = Bhsa(api, 'start', version=VERSION)

# A few things to note:
#
# * You supply the `api` as first argument to `Bhsa()`
# * You supply the plain *name* of the notebook that you are writing as the second argument
# * You supply the *version* of the BHSA data as the third argument
#
# The result is that you have a few handy links to
#
# * the data provenance and documentation
# * the BHSA API and the Text-Fabric API
# * the online versions of this notebook on GitHub and NBViewer.

# ## Search
# Text-Fabric contains a flexible search engine that works not only for the BHSA data,
# but also for data that you add to it.
#
# **Search is the quickest way to come up to speed with your data, without too much programming.**
#
# Jump to the dedicated [search](search.ipynb) tutorial first, to whet your appetite.
# And if you already know MQL queries, you can build from that in
# [searchFromMQL](searchFromMQL.ipynb).
#
# The real power of search lies in the fact that it is integrated in a programming environment.
# You can use programming to:
#
# * compose dynamic queries
# * process query results
#
# Therefore, the rest of this tutorial is still important when you want to tap that power.
# If you continue here, you learn all the basics of data navigation with Text-Fabric.

# # Counting
#
# In order to get acquainted with the data, we start with the simple task of counting.
#
# ## Count all nodes
# We use the
# [`N()` generator](https://dans-labs.github.io/text-fabric/Api/General/#navigating-nodes)
# to walk through the nodes.
#
# We compared the BHSA data to a gigantic spreadsheet, where the rows correspond to the words.
# In Text-Fabric, we call the rows `slots`, because they are the textual positions that can be filled with words.
#
# We also mentioned that there are about 1,000,000 more textual objects.
# They are the phrases, clauses, sentences, verses, chapters and books.
# They also correspond to rows in the big spreadsheet.
#
# In Text-Fabric we call all these rows *nodes*, and the `N()` generator
# carries us through those nodes in the textual order.
#
# Just one extra thing: the `info` statements generate timed messages.
# If you use them instead of `print` you'll get a sense of the amount of time that
# the various processing steps typically need.

# In[6]:

indent(reset=True)
info('Counting nodes ...')

i = 0
for n in N():
    i += 1

info('{} nodes'.format(i))

# Here you see it: 1.4 M nodes!

# ## What are those million nodes?
# Every node has a type, such as word, phrase, or sentence.
# We know that we have approximately 425,000 words and a million other nodes.
# But what exactly are they?
#
# Text-Fabric has two special features, `otype` and `oslots`, that must occur in every Text-Fabric data set.
# `otype` tells you the type of each node, and you can also ask for the number of `slot`s in the text.
#
# Here we go!

# In[7]:

F.otype.slotType

# In[8]:

F.otype.maxSlot

# In[9]:

F.otype.maxNode

# In[10]:

F.otype.all

# In[11]:

C.levels.data

# This is interesting: above you see all the object types, with the average size of their objects,
# the node where they start, and the node where they end.

# ## Count individual object types
# This is an intuitive way to count the number of nodes in each type.
# Note, in passing, how we use `indent` in conjunction with `info` to produce neat, timed,
# and indented progress messages.

# In[12]:

indent(reset=True)
info('counting objects ...')

for otype in F.otype.all:
    i = 0
    indent(level=1, reset=True)
    for n in F.otype.s(otype):
        i += 1
    info('{:>7} {}s'.format(i, otype))

indent(level=0)
info('Done')

# # Viewing textual objects
#
# We use the BHSA API (the extra power) to peek into the corpus.
# First a word. Node 100,000 is a slot. Let's see what it is and where it is.

# In[13]:

wordShow = 100000
B.pretty(wordShow)

# Note
# * if you click on the word
#   you go to a page in SHEBANQ that shows a list of all occurrences of this lexeme;
# * if you hover on the part-of-speech (`prep` here), you see the passage,
#   and if you click on it, you go to SHEBANQ, to exactly this verse.

# Let us do the same for more complex objects, such as phrases, sentences, etc.

# In[14]:

phraseShow = 700001
B.pretty(phraseShow)

# In[15]:

clauseShow = 500002
B.pretty(clauseShow)

# In[16]:

sentenceShow = 1200001
B.pretty(sentenceShow)

# In[17]:

verseShow = 1420000
B.pretty(verseShow)

# In[18]:

chapterShow = 427000
print(F.otype.v(chapterShow))
B.pretty(chapterShow)

# If you need a link to SHEBANQ for just any node:

# In[19]:

million = 1000000
B.shbLink(million)

# # Feature statistics
#
# `F`
# gives access to all features.
# Every feature has a method
# `freqList()`
# to generate a frequency list of its values, higher frequencies first.
# Here are the parts of speech:

# In[20]:

F.sp.freqList()

# # Lexeme matters
#
# ## Top 10 frequent verbs
#
# If we count the frequency of words, we usually mean the frequency of their
# corresponding lexemes.
#
# There are several methods for working with lexemes.
#
# ### Method 1: counting words

# In[21]:

verbs = collections.Counter()
indent(reset=True)
info('Collecting data')

for w in F.otype.s('word'):
    if F.sp.v(w) != 'verb':
        continue
    verbs[F.lex.v(w)] += 1

info('Done')
print(''.join(
    '{}: {}\n'.format(verb, cnt)
    for (verb, cnt) in sorted(
        verbs.items(), key=lambda x: (-x[1], x[0])
    )[0:10]
))

# ### Method 2: counting lexemes
#
# An alternative way to do this is to use the feature `freq_lex`, defined for `lex` nodes.
# Now we walk the lexemes instead of the occurrences.
#
# Note that the feature `sp` (part-of-speech) is defined for nodes of type `word` as well as `lex`.
# Both also have the `lex` feature.

# In[22]:

verbs = collections.Counter()
indent(reset=True)
info('Collecting data')

for w in F.otype.s('lex'):
    if F.sp.v(w) != 'verb':
        continue
    verbs[F.lex.v(w)] += F.freq_lex.v(w)

info('Done')
print(''.join(
    '{}: {}\n'.format(verb, cnt)
    for (verb, cnt) in sorted(
        verbs.items(), key=lambda x: (-x[1], x[0])
    )[0:10]
))

# This is an order of magnitude faster. In this case, that means the difference between a third of a second and a
# hundredth of a second, not a big gain in absolute terms.
# But suppose you need to run this 1000 times in a loop.
# Then it is the difference between 5 minutes and 10 seconds.
# A five minute wait is not pleasant in interactive computing!

# ### A frequency mapping of lexemes
#
# We make a mapping between lexeme forms and the number of occurrences of those lexemes.

# In[17]:

lexeme_dict = {F.g_lex_utf8.v(n): F.freq_lex.v(n) for n in F.otype.s('word')}

# In[18]:

list(lexeme_dict.items())[0:10]

# ### Real work
#
# As a primer of real world work on lexeme distribution, have a look at James Cuénod's notebook on
# [Collocation MI Analysis of the Hebrew Bible](https://nbviewer.jupyter.org/github/jcuenod/hebrewCollocations/blob/master/Collocation%20MI%20Analysis%20of%20the%20Hebrew%20Bible.ipynb)
#
# It is a nice example of how you collect data with TF API calls, then do research with your own methods and tools, and then use TF for presenting results.
#
# In case the name has changed, the enclosing repo is
# [here](https://nbviewer.jupyter.org/github/jcuenod/hebrewCollocations/tree/master/).

# ## Lexeme distribution
#
# Let's do a bit more fancy lexeme stuff.
#
# ### Hapaxes
#
# A hapax can be found by inspecting lexemes and seeing how many word nodes they are linked to.
# If that number is one, we have a hapax.
#
# We print 10 hapaxes with their glosses.

# In[23]:

indent(reset=True)

hapax = []
zero = set()

for l in F.otype.s('lex'):
    occs = L.d(l, otype='word')
    n = len(occs)
    if n == 0:  # that's weird: should not happen
        zero.add(l)
    elif n == 1:  # hapax found!
        hapax.append(l)

info('{} hapaxes found'.format(len(hapax)))
if zero:
    error('{} zeroes found'.format(len(zero)), tm=False)
else:
    info('No zeroes found', tm=False)

for h in hapax[0:10]:
    print('\t{:<8} {}'.format(F.lex.v(h), F.gloss.v(h)))

# ### Small occurrence base
#
# The occurrence base of a lexeme consists of the verses, chapters and books in which it occurs.
# Let's look for lexemes that occur in a single chapter.
#
# If a lexeme occurs in a single chapter, its slots are a subset of the slots of that chapter.
# So, if you go *up* from the lexeme, you encounter the chapter.
#
# Normally, lexemes occur in many chapters, and then no single chapter includes all of their occurrences,
# so if you go up from such lexemes, you do not find chapters.
#
# Let's check it out.
#
# Oh yes, we have already found the hapaxes, so we will skip them here.

# In[24]:

indent(reset=True)
info('Finding single chapter lexemes')

singleCh = []
multiple = []

for l in F.otype.s('lex'):
    chapters = L.u(l, 'chapter')
    if len(chapters) == 1:
        if l not in hapax:
            singleCh.append(l)
    elif len(chapters) > 1:  # should not happen
        multiple.append(l)

info('{} single chapter lexemes found'.format(len(singleCh)))
if multiple:
    error('{} lexemes embedded in multiple chapters found'.format(len(multiple)), tm=False)
else:
    info('No lexemes embedded in multiple chapters found', tm=False)

for s in singleCh[0:10]:
    print('{:<20} {:<6}'.format(
        '{} {}:{}'.format(*T.sectionFromNode(s)),
        F.lex.v(s),
    ))

# ### Confined to books
#
# As a final exercise with lexemes, let's make a list of all books, and show their total number of lexemes and
# the number of lexemes that occur exclusively in that book.
# In[25]:

indent(reset=True)
info('Making book-lexeme index')

allBook = collections.defaultdict(set)
allLex = set()

for b in F.otype.s('book'):
    for w in L.d(b, 'word'):
        l = L.u(w, 'lex')[0]
        allBook[b].add(l)
        allLex.add(l)

info('Found {} lexemes'.format(len(allLex)))

# In[26]:

indent(reset=True)
info('Finding single book lexemes')

singleBook = collections.defaultdict(lambda: 0)
for l in F.otype.s('lex'):
    book = L.u(l, 'book')
    if len(book) == 1:
        singleBook[book[0]] += 1

info('found {} single book lexemes'.format(sum(singleBook.values())))

# In[27]:

print('{:<20}{:>5}{:>5}{:>5}\n{}'.format(
    'book', '#all', '#own', '%own',
    '-' * 35,
))

booklist = []

for b in F.otype.s('book'):
    book = T.bookName(b)
    a = len(allBook[b])
    o = singleBook.get(b, 0)
    p = 100 * o / a
    booklist.append((book, a, o, p))

for x in sorted(booklist, key=lambda e: (-e[3], -e[1], e[0])):
    print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))

# The book names may sound a bit unfamiliar; they are in Latin here.
# Later we'll see that you can also get them in English, or in Swahili.

# # Locality API
# We travel upwards and downwards, forwards and backwards through the nodes.
# The Locality-API (`L`) provides functions: `u()` for going up, and `d()` for going down,
# `n()` for going to next nodes and `p()` for going to previous nodes.
#
# These directions are indirect notions: nodes are just numbers, but by means of the
# `oslots` feature they are linked to slots. One node *contains* another node if the one is linked to a set of slots that contains the set of slots that the other is linked to.
# And one node is next or previous to another if its slots follow or precede the slots of the other one.
#
# `L.u(node)` **Up** is going to nodes that embed `node`.
#
# `L.d(node)` **Down** is the opposite direction, to those that are contained in `node`.
#
# `L.n(node)` **Next** are the next *adjacent* nodes, i.e. nodes whose first slot comes immediately after the last slot of `node`.
#
# `L.p(node)` **Previous** are the previous *adjacent* nodes, i.e. nodes whose last slot comes immediately before the first slot of `node`.
#
# All these functions yield nodes of all possible otypes.
# By passing an optional parameter, you can restrict the results to nodes of that type.
#
# The results are ordered according to the order of things in the text.
#
# The functions always return a tuple, even if there is just one node in the result.
#
# ## Going up
# We go from the first word to the book that contains it.
# Note the `[0]` at the end. You expect one book, yet `L` returns a tuple.
# To get the only element of that tuple, you need to do that `[0]`.
#
# If you are like me, you keep forgetting it, and that will lead to weird error messages later on.

# In[28]:

firstBook = L.u(1, otype='book')[0]
print(firstBook)

# And let's see all the containing objects of word 3:

# In[29]:

w = 3
for otype in F.otype.all:
    if otype == F.otype.slotType:
        continue
    up = L.u(w, otype=otype)
    upNode = 'x' if len(up) == 0 else up[0]
    print('word {} is contained in {} {}'.format(w, otype, upNode))

# ## Going next
# Let's go to the next nodes of the first book.

# In[30]:

afterFirstBook = L.n(firstBook)
for n in afterFirstBook:
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n,
        F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))
secondBook = L.n(firstBook, otype='book')[0]

# ## Going previous
#
# And let's see what is right before the second book.
# In[31]:

for n in L.p(secondBook):
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n,
        F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))

# ## Going down
# We go to the chapters of the second book, and just count them.

# In[32]:

chapters = L.d(secondBook, otype='chapter')
print(len(chapters))

# ## The first verse
# We pick the first verse and the first word, and explore what is above and below them.

# In[33]:

for n in [1, L.u(1, otype='verse')[0]]:
    indent(level=0)
    info('Node {}'.format(n), tm=False)
    indent(level=1)
    info('UP', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    indent(level=1)
    info('DOWN', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
indent(level=0)
info('Done', tm=False)

# # Text API
#
# So far, we have mainly seen nodes and their numbers, and the names of node types.
# You would almost forget that we are dealing with text.
# So let's try to see some text.
#
# In the same way as `F` gives access to feature data,
# `T` gives access to the text.
# That is also feature data, but you can tell Text-Fabric which features are specifically
# carrying the text, and in return Text-Fabric offers you
# a Text API: `T`.
#
# ## Formats
# Hebrew text can be represented in a number of ways:
#
# * fully pointed (vocalized and accented), or consonantal,
# * in transliteration, phonetic transcription or in Hebrew characters,
# * showing the actual text or only the lexemes,
# * following the ketiv or the qere, at places where they deviate from each other.
#
# If you wonder where the information about text formats is stored:
# not in the program text-fabric, but in the data set.
# It has a feature `otext`, which specifies the formats and which features
# must be used to produce them. `otext` is the third special feature in a TF data set,
# next to `otype` and `oslots`.
# It is an optional feature.
# If it is absent, there will be no `T` API.
#
# Here is a list of all available formats in this data set.

# In[34]:

sorted(T.formats)

# Note the `text-phono-full` format here.
# It does not come from the main data source `bhsa`, but from the module `phono`.
# Look in your data directory, find `~/github/etcbc/phono/tf/2017/otext@phono.tf`,
# and you'll see this format defined there.

# ## Using the formats
# Now let's use those formats to print out the first verse of the Hebrew Bible.

# In[35]:

for fmt in sorted(T.formats):
    print('{}:\n\t{}'.format(fmt, T.text(range(1, 12), fmt=fmt)))

# If we do not specify a format, the **default** format is used (`text-orig-full`).

# In[36]:

print(T.text(range(1, 12)))

# ## Whole text in all formats in just 10 seconds
# Part of the pleasure of working with computers is that they can crunch massive amounts of data.
# The text of the Hebrew Bible is a piece of cake.
#
# It takes just ten seconds to have that cake and eat it.
# In nearly a dozen formats.

# In[37]:

indent(reset=True)
info('writing plain text of whole Bible in all formats')

text = collections.defaultdict(list)

for v in F.otype.s('verse'):
    words = L.d(v, 'word')
    for fmt in sorted(T.formats):
        text[fmt].append(T.text(words, fmt=fmt))

info('done {} formats'.format(len(text)))

for fmt in sorted(text):
    print('{}\n{}\n'.format(fmt, '\n'.join(text[fmt][0:5])))

# ### The full plain text
# We write a few formats to file, in your `Downloads` folder.
# In[38]:

T.formats

# In[39]:

for fmt in '''
    text-orig-full
    text-phono-full
'''.strip().split():
    with open(os.path.expanduser(f'~/Downloads/{fmt}.txt'), 'w') as f:
        f.write('\n'.join(text[fmt]))

# ## Book names
#
# For Bible book names, we can use several languages.
#
# ### Languages
# Here are the languages that we can use for book names.
# These languages come from the features `book@ll`, where `ll` is a two letter
# ISO language code. Have a look in your data directory; you can't miss them.

# In[40]:

T.languages

# ### Book names in Swahili
# Get the book names in Swahili.

# In[41]:

nodeToSwahili = ''
for b in F.otype.s('book'):
    nodeToSwahili += '{} = {}\n'.format(b, T.bookName(b, lang='sw'))
print(nodeToSwahili)

# ## Book nodes from Swahili
# OK, there they are. We copy them into a string, and do the opposite: get the nodes back.
# We check whether we get exactly the same nodes as the ones we started with.

# In[42]:

swahiliNames = '''
    Mwanzo Kutoka Mambo_ya_Walawi Hesabu Kumbukumbu_la_Torati
    Yoshua Waamuzi 1_Samweli 2_Samweli 1_Wafalme 2_Wafalme
    Isaya Yeremia Ezekieli Hosea Yoeli Amosi Obadia Yona Mika
    Nahumu Habakuki Sefania Hagai Zekaria Malaki
    Zaburi Ayubu Mithali Ruthi Wimbo_Ulio_Bora Mhubiri Maombolezo Esta
    Danieli Ezra Nehemia 1_Mambo_ya_Nyakati 2_Mambo_ya_Nyakati
'''.strip().split()

swahiliToNode = ''
for nm in swahiliNames:
    swahiliToNode += '{} = {}\n'.format(T.bookNode(nm, lang='sw'), nm)

if swahiliToNode != nodeToSwahili:
    print('Something is not right with the book names')
else:
    print('Going from nodes to booknames and back yields the original nodes')

# ## Sections
#
# A section in the Hebrew Bible is a book, a chapter or a verse.
# Knowledge of sections is not baked into Text-Fabric.
# The config feature `otext` may specify three section levels, and tell
# what the corresponding node types and features are.
#
# From that knowledge it can construct mappings from nodes to sections, e.g. from verse
# nodes to tuples of the form:
#
#     (bookName, chapterNumber, verseNumber)
#
# Here are examples of getting the section that corresponds to a node and vice versa.
#
# **NB:** `sectionFromNode` always delivers a verse specification, either from the
# first slot belonging to that node, or, if `lastSlot`, from the last slot
# belonging to that node.

# In[43]:

for x in (
    ('section of first word', T.sectionFromNode(1)),
    ('node of Gen 1:1', T.nodeFromSection(('Genesis', 1, 1))),
    ('idem', T.nodeFromSection(('Mwanzo', 1, 1), lang='sw')),
    ('node of book Genesis', T.nodeFromSection(('Genesis',))),
    ('node of Genesis 1', T.nodeFromSection(('Genesis', 1))),
    ('section of book node', T.sectionFromNode(1367534)),
    ('idem, now last word', T.sectionFromNode(1367534, lastSlot=True)),
    ('section of chapter node', T.sectionFromNode(1367573)),
    ('idem, now last word', T.sectionFromNode(1367573, lastSlot=True)),
):
    print('{:<30} {}'.format(*x))

# ## Sentences spanning multiple verses
# If you go up from a sentence node, you expect to find a verse node.
# But some sentences span multiple verses, and in that case, you will not find the enclosing
# verse node, because it is not there.
#
# Here is a piece of code to detect and list all cases where sentences span multiple verses.
#
# The idea is to pick the first and the last word of a sentence, use `T.sectionFromNode` to
# discover the verse in which that word occurs, and if they are different: bingo!
#
# We show the first 10 of ca. 900 cases.
# By the way: doing this in the `2016` version of the data yields 915 results.
# The splitting up of the text into sentences is not carved in stone!

# In[44]:

indent(reset=True)
info('Get sentences that span multiple verses')

spanSentences = []
for s in F.otype.s('sentence'):
    f = T.sectionFromNode(s, lastSlot=False)
    l = T.sectionFromNode(s, lastSlot=True)
    if f != l:
        spanSentences.append('{} {}:{}-{}'.format(f[0], f[1], f[2], l[2]))

info('Found {} cases'.format(len(spanSentences)))
info('\n{}'.format('\n'.join(spanSentences[0:10])))

# A different way, with better display, is:

# In[45]:

indent(reset=True)
info('Get sentences that span multiple verses')

spanSentences = []
for s in F.otype.s('sentence'):
    words = L.d(s, otype='word')
    fw = words[0]
    lw = words[-1]
    fVerse = L.u(fw, otype='verse')[0]
    lVerse = L.u(lw, otype='verse')[0]
    if fVerse != lVerse:
        spanSentences.append((s, fVerse, lVerse))

info('Found {} cases'.format(len(spanSentences)))
B.table(spanSentences, end=10, linked=2)

# We can zoom in:

# In[46]:

B.show(spanSentences, condensed=False, start=6, end=6)

# In[47]:

B.pretty(spanSentences[5][0])

# # Ketiv Qere
# Let us explore where Ketiv/Qere pairs are and how they render.

# In[46]:

qeres = [w for w in F.otype.s('word') if F.qere.v(w) is not None]
print('{} qeres'.format(len(qeres)))
for w in qeres[0:10]:
    print('{}: ketiv = "{}"+"{}" qere = "{}"+"{}"'.format(
        w,
        F.g_word.v(w), F.trailer.v(w),
        F.qere.v(w), F.qere_trailer.v(w),
    ))

# ## Show a ketiv-qere pair
# Let us print all text representations of the verse in which word node 4419 occurs.

# In[47]:

refWord = 4419
vn = L.u(refWord, otype='verse')[0]
ws = L.d(vn, otype='word')
print('{} {}:{}'.format(*T.sectionFromNode(refWord)))
for fmt in sorted(T.formats):
    if fmt.startswith('text-'):
        print('{:<25} {}'.format(fmt, T.text(ws, fmt=fmt)))

# # Edge features: mother
#
# We have not talked about edges much. If the nodes correspond to the rows in the big spreadsheet,
# the edges point from one row to another.
#
# One edge we have encountered: the special feature `oslots`.
# Each non-slot node is linked by `oslots` to all of its slot nodes.
#
# An edge is really a feature as well.
# Whereas a node feature is a column of information,
# one cell per node,
# an edge feature is also a column of information, one cell per pair of nodes.
#
# Linguists use more relationships between textual objects, for example:
# linguistic dependency.
# In the BHSA all cases of linguistic dependency are coded in the edge feature `mother`.
#
# Let us do a few basic enquiries on an edge feature:
# [mother](https://etcbc.github.io/bhsa/features/hebrew/2017/mother).
#
# We count how many mothers nodes can have (it turns out to be 0 or 1).
# We walk through all nodes, per node we retrieve the mother nodes, and
# we store the lengths (if non-zero) in a dictionary (`motherLen`).
#
# We see that nodes have at most one mother.
#
# We also count the inverse relationship: daughters.

# In[48]:

info('Counting mothers')

motherLen = {}
daughterLen = {}

for c in N():
    lms = E.mother.f(c) or []
    lds = E.mother.t(c) or []
    nms = len(lms)
    nds = len(lds)
    if nms:
        motherLen[c] = nms
    if nds:
        daughterLen[c] = nds

info('{} nodes have mothers'.format(len(motherLen)))
info('{} nodes have daughters'.format(len(daughterLen)))

motherCount = collections.Counter()
daughterCount = collections.Counter()

for (n, lm) in motherLen.items():
    motherCount[lm] += 1
for (n, ld) in daughterLen.items():
    daughterCount[ld] += 1

print('mothers', motherCount)
print('daughters', daughterCount)

# # Next steps
#
# By now you have an impression of how to compute around in the Hebrew Bible.
# While this is still the beginning, I hope you already sense the power of unlimited programmatic access
# to all the bits and bytes in the data set.
#
# Here are a few directions for unleashing that power.
#
# ## Explore additional data
# The ETCBC has a few other repositories with data that work in conjunction with the BHSA data.
# One of them you have already seen:
# [phono](https://github.com/ETCBC/phono),
# for phonetic transcriptions.
#
# There is also
# [parallels](https://github.com/ETCBC/parallels)
# for detecting parallel passages,
# and
# [valence](https://github.com/ETCBC/valence)
# for studying patterns around verbs that determine their meanings.
#
# ## Add your own data
# If you study the additional data, you can observe how that data is created and also
# how it is turned into a text-fabric data module.
# The last step is incredibly easy. You can write out any Python dictionary whose keys are numbers
# and whose values are strings or numbers as a Text-Fabric feature.
# When you are creating data, you have already constructed those dictionaries, so writing
# them out is just one method call.
# See for example how the
# [flowchart](https://github.com/ETCBC/valence/blob/master/programs/flowchart.ipynb#Add-sense-feature-to-valence-module)
# notebook in valence writes out verb sense data.

# ![flow](images/valence.png)
#
# You can then easily share your new features on GitHub, so that your colleagues everywhere
# can try them out for themselves.

# ## Export to Emdros MQL
#
# [EMDROS](http://emdros.org), written by Ulrik Petersen,
# is a text database system with the powerful *topographic* query language MQL.
# The ideas are based on a model devised by Christ-Jan Doedens in
# [Text Databases: One Database Model and Several Retrieval Languages](https://books.google.nl/books?id=9ggOBRz1dO4C).
#
# Text-Fabric's model of slots, nodes and edges is a fairly straightforward translation of the models of Christ-Jan Doedens and Ulrik Petersen.
#
# [SHEBANQ](https://shebanq.ancient-data.org) uses EMDROS to let users execute and save MQL queries against the Hebrew Text Database of the ETCBC.
#
# So it is kind of logical and convenient to be able to work with a Text-Fabric resource through MQL.
#
# If you have obtained an MQL dataset somehow, you can turn it into a text-fabric data set by `importMQL()`,
# which we will not show here.
#
# And if you want to export a Text-Fabric data set to MQL, that is also possible.
#
# After the `Fabric(modules=...)` call, you can call `exportMQL()` in order to save all features of the
# indicated modules into a big MQL dump, which can be imported by an EMDROS database.

# In[39]:

TF.exportMQL('mybhsa', '~/Downloads')

# Now you have a file `~/Downloads/mybhsa.mql` of 530 MB.
# You can import it into an Emdros database by saying:
#
#     cd ~/Downloads
#     rm mybhsa
#     mql -b 3 < mybhsa.mql
#
# The `rm` removes a previous import, if any.
# The result is an SQLite3 database `mybhsa` in the same directory (168 MB).
# You can run a query against it by creating a text file `test.mql` with these contents:
#
#     select all objects where
#     [lex gloss ~ 'make'
#       [word FOCUS]
#     ]
#
# And then say
#
#     mql -b 3 -d mybhsa test.mql
#
# You will see raw query results: all word occurrences that belong to lexemes with `make` in their gloss.
#
# It is not very pretty, and probably you should use a more visual Emdros tool to run those queries.
# You see a lot of node numbers, but the good thing is, you can look those node numbers up in Text-Fabric.
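# Here is a minimal sketch of such a lookup, using only API calls shown earlier in this
# notebook. The node number below is an arbitrary placeholder standing in for one you
# copied from the raw MQL output.

# In[ ]:

nodeFromMql = 651573  # hypothetical node number taken from the MQL results

# What kind of object is it, and where does it sit in the text?
print(F.otype.v(nodeFromMql))
print('{} {}:{}'.format(*T.sectionFromNode(nodeFromMql)))

# Show its text: for a non-slot node take the words below it,
# for a slot node (a word) fall back to the node itself.
words = L.d(nodeFromMql, otype='word') or [nodeFromMql]
print(T.text(words))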
# # Clean caches
#
# Text-Fabric pre-computes data for you, so that it can be loaded faster.
# If the original data is updated, Text-Fabric detects it, and will recompute that data.
#
# But there are cases, e.g. when the algorithms of Text-Fabric have changed without any changes in the data,
# where you might want to clear the cache of precomputed results.
#
# There are two ways to do that:
#
# * Locate the `.tf` directory of your dataset, and remove all `.tfx` files in it.
#   This might be a bit awkward to do, because the `.tf` directory is hidden on Unix-like systems.
# * Call `TF.clearCache()`, which does exactly the same.
#
# It is not handy to execute the following cell all the time; that's why I have commented it out.
# So if you really want to clear the cache, remove the comment sign below.

# In[39]:

# TF.clearCache()

# In[ ]:
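# If you prefer the manual route but do not want to hunt for hidden directories by hand,
# a small sketch like the one below can locate the precomputed `.tfx` files for you.
# It assumes the data lives under `~/github/etcbc`, as set up earlier in this tutorial,
# and it only prints what it finds; uncomment the `os.remove` line to actually delete.

# In[ ]:

import os

dataDir = os.path.expanduser('~/github/etcbc')

for (dirPath, dirNames, fileNames) in os.walk(dataDir):
    for fileName in fileNames:
        if fileName.endswith('.tfx'):
            tfxPath = os.path.join(dirPath, fileName)
            print(tfxPath)
            # os.remove(tfxPath)  # uncomment to really clear the cache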