Let's compare two tools that give users computational power over annotated corpora:
Text-Fabric and Alpino, especially its graph-based query language.
If you have an annotated corpus there are three ways you could consume the data: you can read and browse it, you can query it, or you can write programs that compute over it.
The advantage of computing and querying over browsing and reading is that you can find needles in haystacks, and filter and process the corpus in ways that are infeasible by the human eye and hand.
Yet, when computing and querying are done, it is vitally important that users can read and browse around the results, in order to see what is happening in the corpus and get ideas for new computations and queries.
The advantage of querying is that it can be done without programming, although the query language must be mastered. But that is still much easier and less time consuming than the art of programming.
The disadvantage of querying is that sooner or later, when the research questions become increasingly complex, the query language tends to become a straitjacket. Users have to become over-ingenious to find the queries that work for them. They approach a point where it pays off to compute.
In an ideal world, it should be easy to bridge the gap between querying and hand-coding smoothly.
In the real world, this gap tends to be an enormous barrier.
Results from the query engine are serialized from an internal representation to an external one. In the worst case, the user has access to the results only through a web interface. In better cases the user can download the results as data, in JSON, TSV, or plain text.
Even then, important parts of the context tend to get lost. Where exactly are the results located in the corpus? Can I get from a result sentence to the sentence that immediately follows it?
This gap becomes surmountable if the system has an addressing scheme for every possible text-fragment in the corpus and can deliver those addresses within the results.
There are also design characteristics of the query language that help to lessen the gap.
First of all, a query should expose the terms of the corpus clearly and unaltered, and should be minimalistic in all other respects.
Secondly, a query should mimic the pattern it is designed to retrieve; this helps when you want to search by example.
Thirdly, a query should be able to express spatial relationships between text-fragments, such as "contained-in", "overlapping", "completely before", "adjacent", etc.
Fourthly, a query should be able to combine spatial relationships with all other features that the corpus has on offer.
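These spatial relationships are easy to picture if a text-fragment is modelled as a half-open interval of slot positions. The following is a minimal illustrative sketch, not Text-Fabric code; the interval representation and the function names are assumptions made for the example:

```python
# Sketch: a text-fragment as a half-open interval [start, end) of slot positions.
# This is NOT Text-Fabric's implementation, only an illustration of the relations.

def contained_in(a, b):
    # every slot of a is also a slot of b
    return b[0] <= a[0] and a[1] <= b[1]

def overlaps(a, b):
    # a and b share at least one slot
    return a[0] < b[1] and b[0] < a[1]

def completely_before(a, b):
    # every slot of a precedes every slot of b
    return a[1] <= b[0]

def adjacent(a, b):
    # a ends exactly where b begins
    return a[1] == b[0]

clause = (10, 20)
phrase = (12, 15)
word = (14, 15)
assert contained_in(word, phrase) and contained_in(phrase, clause)
```

A real engine combines such predicates with feature conditions, which is exactly the fourth requirement above.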
In Text-Fabric we aim to produce such a query language. Here are a few characteristics:
Queries are topographical
: a query is a relationship pattern, and the results are all instantiations of that
pattern found in the corpus.
The query language is data-agnostic, but it can use the names of all features defined in the corpus.
The results of queries are tuples of nodes, and nodes are integers.
Suppose the corpus has sentences, clauses, phrases, and words; a feature `typ` for clauses and phrases; and features `sp` (part-of-speech) and `g_cons` (consonantal transcription) for words. Then we can make a query that looks for NP phrases with a verb in them, where such phrases occur in clauses of type `Ptcp`. Oh, and the verb should begin with an `M`.
N.B.: This is a real-world example. You can reproduce this on your own computer.
```
sentence
  clause typ=Ptcp
    phrase typ=NP
      word sp=verb g_cons~^M
```
The query results form a sequence of individual results, where each individual result is a tuple `(s, c, p, w)` of a sentence node, a clause node, a phrase node, and a word node.
```python
# assumed: pip install 'text-fabric[all]'
from tf.app import use
```
Load the data and give a handle to the API that gives access to it:
```python
A = use("ETCBC/bhsa")  # the corpus data is retrieved from github.com/ETCBC/bhsa and then cached locally
```
Locating corpus resources ...
Name | # of nodes | # slots/node | % coverage |
---|---|---|---|
book | 39 | 10938.21 | 100 |
chapter | 929 | 459.19 | 100 |
lex | 9230 | 46.22 | 100 |
verse | 23213 | 18.38 | 100 |
half_verse | 45179 | 9.44 | 100 |
sentence | 63717 | 6.70 | 100 |
sentence_atom | 64514 | 6.61 | 100 |
clause | 88131 | 4.84 | 100 |
clause_atom | 90704 | 4.70 | 100 |
phrase | 253203 | 1.68 | 100 |
phrase_atom | 267532 | 1.59 | 100 |
subphrase | 113850 | 1.42 | 38 |
word | 426590 | 1.00 | 100 |
Write a query:
```python
query = """
sentence
  clause typ=Ptcp
    phrase typ=NP
      word sp=verb g_cons~^M
"""
```
Run the query:
results = A.search(query)
0.38s 20 results
Show the results as nodes:
results
[(1174454, 430355, 660050, 14127), (1177926, 434807, 673387, 35121), (1187343, 448669, 715553, 112534), (1187717, 449226, 717232, 115558), (1187736, 449248, 717296, 115657), (1202000, 468385, 774745, 213096), (1202370, 468879, 776134, 215412), (1210183, 479652, 805689, 262762), (1217009, 488825, 831501, 304239), (1217856, 489988, 834658, 309591), (1217882, 490016, 834736, 309691), (1217899, 490036, 834790, 309767), (1226243, 501363, 863729, 350269), (1226332, 501471, 864016, 350676), (1226471, 501648, 864488, 351373), (1226625, 501858, 865033, 352145), (1227200, 502606, 866980, 355006), (1227224, 502635, 867058, 355103), (1229566, 506177, 876769, 370968), (1231842, 509729, 887037, 390501)]
Dress up the results and show the first two of them in a table:
A.table(results, end=2)
n | p | sentence | clause | phrase | word |
---|---|---|---|---|---|
1 | Genesis 27:29 | וּֽ מְבָרֲכֶ֖יךָ בָּרֽוּךְ׃ | וּֽ מְבָרֲכֶ֖יךָ בָּרֽוּךְ׃ | מְבָרֲכֶ֖יךָ | מְבָרֲכֶ֖יךָ |
2 | Exodus 12:19 | כִּ֣י׀ כָּל־ אֹכֵ֣ל מַחְמֶ֗צֶת וְ נִכְרְתָ֞ה הַנֶּ֤פֶשׁ הַהִוא֙ מֵעֲדַ֣ת יִשְׂרָאֵ֔ל בַּגֵּ֖ר וּבְאֶזְרַ֥ח הָאָֽרֶץ׃ | אֹכֵ֣ל מַחְמֶ֗צֶת | מַחְמֶ֗צֶת | מַחְמֶ֗צֶת |
Change to the ASCII transliteration of the consonants of each word, and show all results in a table, but hide the last three columns.
A.table(results, fmt="text-trans-plain", skipCols={2, 3, 4})
n | p | sentence | clause | phrase | word |
---|---|---|---|---|---|
1 | Genesis 27:29 | W MBRKJK BRWK00 | |||
2 | Exodus 12:19 | KJ05 KL& >KL MXMYT W NKRTH HNPC HHW> M<DT JFR>L BGR WB>ZRX H>RY00 | |||
3 | Deuteronomy 33:20 | BRWK MRXJB GD | |||
4 | Joshua 6:9 | W HM>SP HLK >XRJ H>RWN | |||
5 | Joshua 6:13 | W HM>SP HLK >XRJ >RWN JHWH | |||
6 | Isaiah 3:12 | <MJ M>CRJK MT<JM W DRK >RXTJK BL<W00_S | |||
7 | Isaiah 9:15 | W M>CRJW MBL<JM00 | |||
8 | Jeremiah 51:1 | HNNJ M<JR <L&BBL W>L&JCBJ LB QMJ RWX MCXJT00 | |||
9 | Haggai 1:6 | W HMFTKR MFTKR >L&YRWR NQWB00_P | |||
10 | Malachi 1:7 | MGJCJM <L&MZBXJ LXM MG>L | |||
11 | Malachi 1:11 | W BKL&MQWM MQVR MGC LCMJ WMNXH VHWRH | |||
12 | Malachi 1:14 | W ZBX MCXT L>DNJ | |||
13 | Proverbs 13:12 | TWXLT MMCKH MXLH& LB | |||
14 | Proverbs 14:31 | W MKBDW XNN >BJWN00 | |||
15 | Proverbs 17:4 | MR< MQCJB <L&FPT&>WN | |||
16 | Proverbs 20:2 | MT<BRW XWV> NPCW00 | |||
17 | Proverbs 29:15 | W N<R MCLX MBJC >MW00 | |||
18 | Proverbs 29:26 | RBJM MBQCJM PNJ&MWCL | |||
19 | Daniel 2:22 | HW> GL> <MJQT> WMSTRT> | |||
20 | Nehemiah 12:47 | W KL&JFR>L BJMJ ZRBBL WBJMJ NXMJH NTNJM MNJWT HMCRRJM WHC<RJM DBR&JWM BJWMW |
Make the `text-trans-plain` format the default and show results by sentence until further notice.
A.displaySetup(fmt="text-trans-plain", condenseType="sentence")
Expand result 11, because something is going on there: a phrase gets interrupted!
A.show(results, start=11, end=11)
result 11
N.B.: None of the words `sentence`, `clause`, `phrase`, `word`, `typ`, `sp`, `g_cons`, `Ptcp`, `NP` is built into Text-Fabric; they are all taken from the corpus organisation.
So the text of the query is almost entirely made up of terms that are familiar if you know the corpus.
In order to get to know the corpus, the user needs to consult the feature documentation. For the Hebrew Bible that looks like this.
There is much more to search in Text-Fabric.
Here are the search docs and here is a search tutorial for the Hebrew Bible.
The fact that the results are just tuples of integers makes it easy to post-process them with your own code.
Suppose you want to limit the results to those sentences that do not contain hapaxes; you can write your own Python code to do that.
Suppose the corpus has a word feature `freq_lex` that gives, for each word occurrence, the number of occurrences of that word's lexeme in the corpus.
Then we can filter like this:
```python
F = A.api.F  # the API to retrieve feature values
L = A.api.L  # the API to navigate to nodes in the neighbourhood

unwantedResults = []
wantedResults = []

for result in results:
    s = result[0]
    words = L.d(s, otype="word")
    hasHapax = any(F.freq_lex.v(w) == 1 for w in words)
    if hasHapax:
        unwantedResults.append(result)
    else:
        wantedResults.append(result)

print(f"{len(wantedResults)=} {len(unwantedResults)=}")
```
len(wantedResults)=19 len(unwantedResults)=1
Let's show the unwanted result, displaying the `freq_lex` feature for all words:
A.show(unwantedResults, extraFeatures="freq_lex")
This ends the Text-Fabric demo.
The Alpino system has a graph-based query language.
Let's make an explorative comparison between this Alpino way of searching and Text-Fabric, because there seems to be quite a bit of convergence between the two.
Alpino works with *nodes* and *words*.
Text-Fabric works with *nodes*. Some nodes are the atomic ones, called *slots*; they are the textual positions.
What you find at a slot depends on what the corpus modeller has chosen, but quite often slots correspond to words.
Alpino nodes have categories (`cat`), e.g. `NP`, `PP`, `SMAIN`.
Text-Fabric nodes have a type (`otype` = object type). Above we saw node types `sentence`, `clause`, `phrase`, `word`, but this is the choice of the corpus modeller. Text-Fabric expects in every corpus a feature file `otype` that maps all nodes to types.
For example, the BHSA above has this `otype.tf`:
```
1-426590 word
426591-426629 book
426630-427558 chapter
427559-515689 clause
515690-606393 clause_atom
606394-651572 half_verse
651573-904775 phrase
904776-1172307 phrase_atom
1172308-1236024 sentence
1236025-1300538 sentence_atom
1300539-1414388 subphrase
1414389-1437601 verse
1437602-1446831 lex
```
This is shorthand for a mapping of the integers 1..1446831 to the strings `word`, `book`, ..., `lex`.
By the way, all data files of a TF corpus are in this format, and each file specifies a mapping from numbers (or pairs of numbers) to values, which can be numbers or strings.
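To illustrate the shorthand, here is a sketch only: `parse_otype` is a hypothetical helper, and real `.tf` files also carry a metadata header and more compact conventions that are ignored here.

```python
def parse_otype(lines):
    """Expand 'start-end value' lines into a dict from node number to type.

    Hypothetical helper for illustration; not Text-Fabric's actual loader.
    """
    otype = {}
    for line in lines:
        rng, value = line.split(maxsplit=1)
        if "-" in rng:
            start, end = (int(x) for x in rng.split("-"))
        else:
            # a single number maps just that one node
            start = end = int(rng)
        for node in range(start, end + 1):
            otype[node] = value
    return otype

# A toy corpus with 3 words and 1 book:
otype = parse_otype(["1-3 word", "4 book"])
```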
In a graph there are also edges. How do you search for nodes that are connected by certain edges?
Both in Alpino and Text-Fabric edges may have properties that can be used in queries.
An example Alpino query:
```
match (n:node{cat:'pp'})-[:rel{rel:'hdf'}]->(:nw)
return n
```
Look for a PP node that is connected to another node or word by means of an edge whose `rel` property is the string `hdf`.
In Text-Fabric we can also make queries like this.
We have edges between similar verses, and these edges are labelled with the similarity of the two verses as a percentage.
First we look for verses that are exactly 90% similar, and then for verses that are more than 90% similar.
```python
query1 = """
verse
-crossref=90> verse
"""

query2 = """
verse
-crossref>90> verse
"""

results1 = A.search(query1)
results2 = A.search(query2)
```
0.03s 240 results
0.04s 9574 results
A.table(results1, end=2, condenseType="verse", full=True)
n | p | verse | verse |
---|---|---|---|
1 | Genesis 25:31 | W J>MR J<QB MKRH KJWM >T&BKRTK LJ00 | W J>MR J<QB HCB<H LJ KJWM W JCB< LW W JMKR >T&BKRTW LJ<QB00 |
2 | Genesis 25:33 | W J>MR J<QB HCB<H LJ KJWM W JCB< LW W JMKR >T&BKRTW LJ<QB00 | W J>MR J<QB MKRH KJWM >T&BKRTK LJ00 |
Alpino does a good job of understanding linguistics. It has quite a few meaningful linguistic relationships, and its corpora supply the data for them.
Text-Fabric is different. It is much more agnostic. It only assumes that there is an ordered set of slots plus nodes that represent certain subsets of slots; the nodes are divided into types.
Both the types and the subsets are given with the corpus as TF data. Above we saw the `otype.tf` file, but there is also an `oslots.tf` file, which maps each non-slot node to the set of slots it is linked to.
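A sketch of the idea behind such an `oslots` mapping, with made-up node and slot numbers, not the real BHSA data: once each non-slot node is mapped to its set of slots, embedding between nodes is simply set inclusion.

```python
# Made-up oslots mapping: non-slot node -> set of slot numbers it occupies.
oslots = {
    100: {1, 2, 3, 4, 5},  # say, a clause
    200: {2, 3},           # say, a phrase inside it
    300: {6, 7},           # a phrase outside it
}

def embeds(container, contained):
    # `contained` lies within `container` iff its slots form a subset
    return oslots[contained] <= oslots[container]
```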
What do queries return?
In Alpino they return nodes and/or values that certain features have for those nodes.
In Text-Fabric they return tuples of nodes. In a TF query, most lines specify a node with properties that the query has to instantiate in the corpus. If a query specifies 10 such nodes, then the query results are 10-tuples of nodes in the corresponding order.
The nodes returned are naked, not dressed-up with features.
In real life, people issue TF queries either in the Text-Fabric browser, where they can customise how query results are displayed, or they catch the result nodes in their own programs (which typically run in a Jupyter notebook, but by no means necessarily so).
In the Text-Fabric browser users can influence the features that must be displayed in various ways, one of them being: if a feature is mentioned in a query, then it is displayed. Users can also categorically request features, or inhibit certain standard features that are displayed by default. (These defaults are not Text-Fabric things, but set by the corpus modeller).
In Jupyter notebooks, users can achieve the same effects programmatically by calling the functions `table()` and `show()` with various keyword arguments.
All in all, given the difference in purpose, technology and scope between Text-Fabric and Alpino, the underlying concepts map fairly well from the one to the other.
It is not always the case that a query is a neatly nested template. In fact, such queries use only the "embedding" relationship, but there are many more relationships.
In Alpino, you can give names to the nodes in a query, and the same is true for Text-Fabric.
For example:
```python
query = """
clause
  phrase
    := w1:word sp=verb
  <: phrase
    =: w2:word sp=verb
w1 .lex. w2
"""
```
This means: find a verb in each of two phrases of a clause. The first verb should be the last word in its phrase, the second verb the first word in its phrase.
The last line states a relationship between the two words: their lexeme values should be identical.
results = A.search(query)
0.66s 475 results
A.table(results, end=10, skipCols={2, 3, 4, 5})
n | p | clause | phrase | word | phrase | word |
---|---|---|---|---|---|---|
1 | Genesis 2:16 | MKL <Y&HGN >KL T>KL00 | ||||
2 | Genesis 2:17 | KJ BJWM MWT TMWT00 | ||||
3 | Genesis 3:4 | L>& MWT TMTWN00 | ||||
4 | Genesis 3:16 | HRBH >RBH <YBWNK WHRNK | ||||
5 | Genesis 8:7 | W JY> JYW> WCWB | ||||
6 | Genesis 12:3 | W >BRKH MBRKJK | ||||
7 | Genesis 15:13 | JD< TD< | ||||
8 | Genesis 16:10 | HRBH >RBH >T&ZR<K | ||||
9 | Genesis 17:13 | HMWL05 JMWL JLJD BJTK WMQNT KSPK | ||||
10 | Genesis 18:10 | CWB >CWB >LJK K<T XJH |
In Alpino you can use quantifiers. These are parts of a query where you look for the existence or non-existence of certain patterns. This seems to be a bit of a problematic device, because there are certain conditions on quantifiers.
In Text-Fabric it is no different: there are quantifiers here too, and there are likewise restrictions on the quantifier expressions.
In Text-Fabric, quantified parts of the query do not contribute to the result tuple.
Here is an example.
Let's see how many `VP` phrases there are:
resultsVP = A.search("""phrase typ=VP""")
0.16s 69024 results
Sometimes VPs contain a noun:
```python
query = """
phrase typ=VP
  word sp=subs
"""
```
resultsWithNoun = A.search(query, shallow=True)
0.31s 234 results
Note the `shallow=True`: this means that we deliver the results differently: not as a tuple of tuples, but as a set of nodes that correspond to the first node in the query: the `phrase`.
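In fact, the effect of `shallow=True` on a deep result set can be mimicked with a one-line set comprehension (made-up node numbers):

```python
# Deep results: tuples of (phrase, word) node numbers (made-up values).
deepResults = [(900, 1), (900, 2), (901, 7)]

# Shallow results: the set of distinct first nodes of the tuples.
shallowResults = {r[0] for r in deepResults}
```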
If we want the VPs without nouns:
```python
resultsWithoutNoun = A.search("""
phrase typ=VP
/without/
  word sp=subs
/-/
""")
```
0.35s 68790 results
A check to see if the results add up:
len(resultsVP) == len(resultsWithNoun) + len(resultsWithoutNoun)
True
OK, the numbers of results of the different queries are as expected, but we can also compare the results themselves:
set(r[0] for r in resultsVP) == resultsWithNoun | set(r[0] for r in resultsWithoutNoun)
True
If we want the VPs with only verbs:
```python
resultsVerb1 = A.search("""
phrase typ=VP
/without/
  word sp#verb
/-/
""")
```
0.45s 62771 results
Or, in a slightly different way, showing a different quantifier:
```python
resultsVerb2 = A.search("""
phrase typ=VP
/where/
  w:word
/have/
  w sp=verb
/-/
""")
```
0.70s 62771 results
Unlike Alpino, Text-Fabric has no SQL-like constructs to count, group, and aggregate results, except for the `shallow=True` parameter, which reduces all results sharing the same first node to a single result consisting of that node only.
Text-Fabric relies on post-processing by the user, either in the program in which the query was issued, or in other programs into which results exported from the Text-Fabric browser are imported.
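For instance, grouping and counting, which SQL would do with GROUP BY, takes only a few lines of ordinary Python over the result tuples. A sketch with made-up node numbers:

```python
from collections import Counter

# Made-up query results: tuples (sentence, clause, phrase, word) of node ints.
queryResults = [
    (1000, 2000, 3000, 10),
    (1000, 2000, 3001, 11),
    (1001, 2002, 3002, 12),
]

# How many results fall in each sentence (the first node of each tuple)?
perSentence = Counter(r[0] for r in queryResults)
```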
There is also an API function A.export() for use in your programs to export results as tab-separated tables of dressed-up nodes.
Concerning sorting: the search function can be passed a sort key to order the results. If `sort=True` is passed, the results are ordered by the text-induced ordering of the result tuples.
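The text-induced ordering itself can be approximated in plain Python: order result tuples by the textual positions of their nodes. A sketch with a made-up node-to-first-slot mapping; this is only an approximation of Text-Fabric's canonical order:

```python
# Made-up mapping from node to its first slot (textual position).
firstSlot = {500: 7, 501: 3, 502: 3, 10: 3, 11: 4}

resultTuples = [(500, 10), (501, 11), (502, 10)]

# Order tuples by comparing, position by position, the first slots of their nodes.
orderedResults = sorted(resultTuples, key=lambda r: tuple(firstSlot[n] for n in r))
```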
Alpino provides set-theoretic operations on results, in Text-Fabric the user has to do that by means of post-processing.
Note that Text-Fabric constrains search templates: they have to be connected components, in the sense that between every pair of nodes in the query template there must be a path of relationships.
If a search template would consist of multiple connected components, the result would be the cartesian product of the results of the individual queries.
Such result sets are potentially monstrous, and it is unlikely that the user can and will deal with them, so they are prohibited.
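The blow-up is easy to quantify: two unconnected components with m and n results each would yield m × n combined results. A sketch with made-up numbers:

```python
from itertools import product

# Two made-up, unconnected sub-queries and their result lists.
resultsA = [(i,) for i in range(300)]    # 300 results
resultsB = [(j,) for j in range(4000)]   # 4000 results

# An unconnected template would implicitly deliver the cartesian product,
# one combined tuple per pair of sub-results:
combined = (a + b for (a, b) in product(resultsA, resultsB))

# The result count multiplies: 300 * 4000 combined tuples.
combinedCount = len(resultsA) * len(resultsB)
```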
Alpino search is based on Cypher and SQL. Limitations in Cypher queries can sometimes be compensated by excursions into SQL.
Also in Text-Fabric the expressive power of queries is limited. Moreover, there are also queries that can be expressed but require too much time to execute.
That is partly due to lack of sophistication in the Text-Fabric engine and partly due to inherent complexity of spatial relationships between nodes.
In Text-Fabric we do not have an escape to another query language. Instead, the escape is to hand-coding.
There is a tutorial notebook in which we explore a difficult query task. Although we can solve it by a query, we also do it by hand-coding. We make sure both give the same result and then we save the result as a named set to disk.
We can then invoke Text-Fabric later on with a parameter to include this named set. At that moment the name of the set can be used in queries in place where a node type is expected.
In the above notebook the section Custom sets for (non-)gapped phrases shows how that works.
In Alpino there are pre-computed pieces of data, e.g. the feature `vor_feld`. This is advertised as a device that simplifies queries and makes them much more efficient.
In Text-Fabric pre-computation is also used, at several levels.
Some of the pre-computation belongs to the internals of Text-Fabric, such as
spatial indexes that facilitate the computation of embedding relations between nodes, among
other things.
See for example `levUp`.
Such precomputed data is also made available in raw form to the end user
through the Computed API.
At the second level there are computed features that the corpus modeller has included into the corpus. For example, in the BHSA there are features for the frequency and rank of words and lexemes. See freq/rank.
There are also features that have been added by others to the corpus as a separate module. The similarity feature that we encountered before is an example of that.
The named sets mentioned before are an example of how end users themselves can compute data that is helpful in subsequent queries.
This is the ethos of Text-Fabric: that end users, corpus modellers, and researchers have maximum scope and flexibility to compute with the corpus.
Alpino and Text-Fabric are very different in the size of corpora they deal with and the specific features they assume to be present in the corpora.
Alpino corpora are linguistic corpora, Text-Fabric corpora do not have to be linguistic. There are even corpora in Text-Fabric whose texts are not in a language, such as the proto-cuneiform tablets of Uruk.
Can there be synergies between the Alpino world and the Text-Fabric world?
There are several corpora in Text-Fabric that could benefit from linguistic tools, especially tokenisers, pos-taggers and morphological taggers. Whether Alpino can help depends on the languages that are supported by Alpino, because up till now Text-Fabric deals with historical corpora using mixtures of historical languages with a variety of spelling idiosyncrasies.
When end users want to combine close reading with data analysis, Text-Fabric is a handy tool. Although Text-Fabric cannot deal with huge corpora, it can handle corpora of 5 million words with dozens of features, or 0.5 million words with over a hundred features.
Text-Fabric also has machinery to deal with volumes of a corpus.
So one could use Alpino to make a top-level search on a huge corpus, and then export results volume by volume, where Text-Fabric can be used to deal with individual volumes.
Regardless of whether there is real synergy between Alpino and Text-Fabric, it is encouraging to see that once text is regarded as a graph, there is a certain logic in how a query language should work. Both Alpino and Text-Fabric have hit upon the same elements, driven by how other tools have tackled these things.
Whereas Alpino rests on Cypher (at least for this part of its query capability), Text-Fabric has been inspired by Emdros by Ulrik Sandborg-Petersen.