Let's compare two tools that give users computational power over annotated corpora:
Text-Fabric and Alpino, especially its graph-based query language.
If you have an annotated corpus there are three ways you could consume the data: you can read and browse it, you can query it, or you can write programs that compute over it.
The advantage of computing and querying over browsing and reading is that you can find needles in haystacks, and filter and process the corpus in ways that are infeasible by the human eye and hand.
Yet, when computing and querying are done, it is vitally important that users can read and browse around the results, in order to see what is happening in the corpus and get ideas for new computations and queries.
The advantage of querying is that it can be done without programming, although the query language must be mastered. But that is still much easier and less time consuming than the art of programming.
The disadvantage of querying is that sooner or later, when the research questions become increasingly complex, the query language tends to become a straitjacket. Users have to become over-ingenious to find the queries that work for them. They approach a point where it pays off to compute.
In an ideal world, it should be easy to bridge the gap between querying and hand-coding smoothly.
In the real world, this gap tends to be an enormous barrier.
Results from the query engine are serialized from an internal representation to an external one. In the worst case, the user has access to the results only through a web interface. In better cases the user can download the results as data, in JSON, TSV, or plain text.
Even then, important parts of the context tend to get lost. Where exactly are the results located in the corpus? Can I get from a result sentence to the sentence that immediately follows it?
This gap becomes surmountable if the system has an addressing scheme for every possible text-fragment in the corpus and can deliver those addresses within the results.
There are also design characteristics of the query language that help to lessen the gap.
First of all, a query should expose the terms of the corpus clearly and unaltered, and should be minimalistic in all other respects.
Secondly, a query should mimic the pattern it is designed to retrieve; this helps when you want to search by example.
Thirdly, a query should be able to express spatial relationships between text-fragments, such as "contained-in", "overlapping", "completely before", "adjacent", etc.
Fourthly, a query should be able to combine spatial relationships with all other features that the corpus has on offer.
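These spatial relationships are easy to picture if a text-fragment is modelled as a half-open interval of slot positions. The following is a minimal illustrative sketch, not Text-Fabric code; the interval representation and the function names are assumptions made for the example:

```python
# Sketch: a text-fragment as a half-open interval [start, end) of slot positions.
# This is NOT Text-Fabric's implementation, only an illustration of the relations.

def contained_in(a, b):
    # every slot of a is also a slot of b
    return b[0] <= a[0] and a[1] <= b[1]

def overlaps(a, b):
    # a and b share at least one slot
    return a[0] < b[1] and b[0] < a[1]

def completely_before(a, b):
    # every slot of a precedes every slot of b
    return a[1] <= b[0]

def adjacent(a, b):
    # a ends exactly where b begins
    return a[1] == b[0]

clause = (10, 20)
phrase = (12, 15)
word = (14, 15)
assert contained_in(word, phrase) and contained_in(phrase, clause)
```

A real engine combines such predicates with feature conditions, which is exactly the fourth requirement above.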
In Text-Fabric we aim to produce such a query language. Here are a few characteristics:
Queries are topographical
: a query is a relationship pattern, and the results are all instantiations of that
pattern found in the corpus.
The query language is data-agnostic, but it can use the names of all features defined in the corpus.
The results of queries are tuples of nodes, and nodes are integers.
Suppose the corpus has sentences, clauses, phrases, and words; a feature `typ` for clauses and phrases; and features `sp` (part-of-speech) and `g_cons` (consonantal transcription) for words. Then we can make a query that looks for NP phrases with a verb in them, where such phrases occur in clauses of type `Ptcp`. Oh, and the verb should begin with an `M`.
N.B.: This is a real-world example. You can reproduce this on your own computer.
```
sentence
  clause typ=Ptcp
    phrase typ=NP
      word sp=verb g_cons~^M
```
The query results form a sequence of individual results, where each individual result is a tuple `(s, c, p, w)` of a sentence node, a clause node, a phrase node, and a word node.
```python
# assumed: pip install 'text-fabric[all]'
from tf.app import use
```
Load the data and give a handle to the API that gives access to it:
```python
A = use("ETCBC/bhsa")  # the corpus data is retrieved from github.com/ETCBC/bhsa and then cached locally
```
Locating corpus resources ...
Name | # of nodes | # slots/node | % coverage |
---|---|---|---|
book | 39 | 10938.21 | 100 |
chapter | 929 | 459.19 | 100 |
lex | 9230 | 46.22 | 100 |
verse | 23213 | 18.38 | 100 |
half_verse | 45179 | 9.44 | 100 |
sentence | 63717 | 6.70 | 100 |
sentence_atom | 64514 | 6.61 | 100 |
clause | 88131 | 4.84 | 100 |
clause_atom | 90704 | 4.70 | 100 |
phrase | 253203 | 1.68 | 100 |
phrase_atom | 267532 | 1.59 | 100 |
subphrase | 113850 | 1.42 | 38 |
word | 426590 | 1.00 | 100 |
Write a query:
```python
query = """
sentence
  clause typ=Ptcp
    phrase typ=NP
      word sp=verb g_cons~^M
"""
```
Run the query:
results = A.search(query)
0.38s 20 results
Show the results as nodes:
results
[(1174454, 430355, 660050, 14127), (1177926, 434807, 673387, 35121), (1187343, 448669, 715553, 112534), (1187717, 449226, 717232, 115558), (1187736, 449248, 717296, 115657), (1202000, 468385, 774745, 213096), (1202370, 468879, 776134, 215412), (1210183, 479652, 805689, 262762), (1217009, 488825, 831501, 304239), (1217856, 489988, 834658, 309591), (1217882, 490016, 834736, 309691), (1217899, 490036, 834790, 309767), (1226243, 501363, 863729, 350269), (1226332, 501471, 864016, 350676), (1226471, 501648, 864488, 351373), (1226625, 501858, 865033, 352145), (1227200, 502606, 866980, 355006), (1227224, 502635, 867058, 355103), (1229566, 506177, 876769, 370968), (1231842, 509729, 887037, 390501)]
Dress up the results and show the first two of them in a table:
A.table(results, end=2)
n | p | sentence | clause | phrase | word |
---|---|---|---|---|---|
1 | Genesis 27:29 | וּֽ מְבָרֲכֶ֖יךָ בָּרֽוּךְ׃ | וּֽ מְבָרֲכֶ֖יךָ בָּרֽוּךְ׃ | מְבָרֲכֶ֖יךָ | מְבָרֲכֶ֖יךָ |
2 | Exodus 12:19 | כִּ֣י׀ כָּל־ אֹכֵ֣ל מַחְמֶ֗צֶת וְ נִכְרְתָ֞ה הַנֶּ֤פֶשׁ הַהִוא֙ מֵעֲדַ֣ת יִשְׂרָאֵ֔ל בַּגֵּ֖ר וּבְאֶזְרַ֥ח הָאָֽרֶץ׃ | אֹכֵ֣ל מַחְמֶ֗צֶת | מַחְמֶ֗צֶת | מַחְמֶ֗צֶת |
Change to the ASCII transliteration of the consonants of each word, and show all results in a table, but hide the last three columns.
A.table(results, fmt="text-trans-plain", skipCols={2, 3, 4})
n | p | sentence | clause | phrase | word |
---|---|---|---|---|---|
1 | Genesis 27:29 | W MBRKJK BRWK00 | |||
2 | Exodus 12:19 | KJ05 KL& >KL MXMYT W NKRTH HNPC HHW> M<DT JFR>L BGR WB>ZRX H>RY00 | |||
3 | Deuteronomy 33:20 | BRWK MRXJB GD | |||
4 | Joshua 6:9 | W HM>SP HLK >XRJ H>RWN | |||
5 | Joshua 6:13 | W HM>SP HLK >XRJ >RWN JHWH | |||
6 | Isaiah 3:12 | <MJ M>CRJK MT<JM W DRK >RXTJK BL<W00_S | |||
7 | Isaiah 9:15 | W M>CRJW MBL<JM00 | |||
8 | Jeremiah 51:1 | HNNJ M<JR <L&BBL W>L&JCBJ LB QMJ RWX MCXJT00 | |||
9 | Haggai 1:6 | W HMFTKR MFTKR >L&YRWR NQWB00_P | |||
10 | Malachi 1:7 | MGJCJM <L&MZBXJ LXM MG>L | |||
11 | Malachi 1:11 | W BKL&MQWM MQVR MGC LCMJ WMNXH VHWRH | |||
12 | Malachi 1:14 | W ZBX MCXT L>DNJ | |||
13 | Proverbs 13:12 | TWXLT MMCKH MXLH& LB | |||
14 | Proverbs 14:31 | W MKBDW XNN >BJWN00 | |||
15 | Proverbs 17:4 | MR< MQCJB <L&FPT&>WN | |||
16 | Proverbs 20:2 | MT<BRW XWV> NPCW00 | |||
17 | Proverbs 29:15 | W N<R MCLX MBJC >MW00 | |||
18 | Proverbs 29:26 | RBJM MBQCJM PNJ&MWCL | |||
19 | Daniel 2:22 | HW> GL> <MJQT> WMSTRT> | |||
20 | Nehemiah 12:47 | W KL&JFR>L BJMJ ZRBBL WBJMJ NXMJH NTNJM MNJWT HMCRRJM WHC<RJM DBR&JWM BJWMW |
Make the `text-trans-plain` format the default and show results by sentence until further notice.
A.displaySetup(fmt="text-trans-plain", condenseType="sentence")
Expand result 11, because something is going on there: a phrase gets interrupted!
A.show(results, start=11, end=11)
result 11
N.B.: None of the words `sentence`, `clause`, `phrase`, `word`, `typ`, `sp`, `g_cons`, `Ptcp`, `NP` is built into Text-Fabric; they are all taken from the corpus organisation.
So the text of the query is almost entirely made up of terms that are familiar if you know the corpus.
In order to get to know the corpus, the user needs to consult the feature documentation. For the Hebrew Bible that looks like this.
There is much more to search in Text-Fabric.
Here are the search docs and here is a search tutorial for the Hebrew Bible.
The fact that the results are just tuples of integers makes it easy to post-process them with your own code.
Suppose you want to limit the results to those sentences that do not contain hapaxes; you can write your own Python code to do that.
Suppose the corpus has a word feature `freq_lex` that gives, for each word occurrence, the number of occurrences of that word's lexeme in the corpus.
Then we can filter like this:
```python
F = A.api.F  # the API to retrieve feature values
L = A.api.L  # the API to navigate to nodes in the neighbourhood

unwantedResults = []
wantedResults = []

for result in results:
    s = result[0]
    words = L.d(s, otype="word")
    hasHapax = any(F.freq_lex.v(w) == 1 for w in words)
    if hasHapax:
        unwantedResults.append(result)
    else:
        wantedResults.append(result)

print(f"{len(wantedResults)=} {len(unwantedResults)=}")
```
len(wantedResults)=19 len(unwantedResults)=1
Let's show the unwanted result, displaying the `freq_lex` feature for all words:
A.show(unwantedResults, extraFeatures="freq_lex")
This ends the Text-Fabric demo.
The Alpino system has a graph-based query language.
Let's make an explorative comparison between this Alpino way of searching and Text-Fabric, because there seems to be quite a bit of convergence between the two.
Alpino works with *nodes* and *words*.
Text-Fabric works with *nodes*. Some nodes are the atomic ones, called *slots*; they are the textual positions.
What you find at a slot depends on what the corpus modeller has chosen, but quite often slots correspond to words.
Alpino nodes have categories (`cat`), e.g. `NP`, `PP`, `SMAIN`.
Text-Fabric nodes have a type (`otype` = object type). Above we saw node types `sentence`, `clause`, `phrase`, `word`, but this is the choice of the corpus modeller. Text-Fabric expects in every corpus a feature file `otype` that maps all nodes to types.
For example, the BHSA above has this `otype.tf`:
```
1-426590 word
426591-426629 book
426630-427558 chapter
427559-515689 clause
515690-606393 clause_atom
606394-651572 half_verse
651573-904775 phrase
904776-1172307 phrase_atom
1172308-1236024 sentence
1236025-1300538 sentence_atom
1300539-1414388 subphrase
1414389-1437601 verse
1437602-1446831 lex
```
This is shorthand for a mapping of the integers 1..1446831 to the strings `word`, `book`, ..., `lex`.
By the way, all data files of a TF corpus are in this format, and each file specifies a mapping from numbers (or pairs of numbers) to values, which can be numbers or strings.
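To illustrate the shorthand, here is a sketch only: `parse_otype` is a hypothetical helper, and real `.tf` files also carry a metadata header and more compact conventions that are ignored here.

```python
def parse_otype(lines):
    """Expand 'start-end value' lines into a dict from node number to type.

    Hypothetical helper for illustration; not Text-Fabric's actual loader.
    """
    otype = {}
    for line in lines:
        rng, value = line.split(maxsplit=1)
        if "-" in rng:
            start, end = (int(x) for x in rng.split("-"))
        else:
            # a single number maps just that one node
            start = end = int(rng)
        for node in range(start, end + 1):
            otype[node] = value
    return otype

# A toy corpus with 3 words and 1 book:
otype = parse_otype(["1-3 word", "4 book"])
```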
In a graph there are also edges. How do you search for nodes that are connected by certain edges?
Both in Alpino and Text-Fabric edges may have properties that can be used in queries.
An example Alpino query:
```
match (n:node{cat:'pp'})-[:rel{rel:'hdf'}]->(:nw)
return n
```
Look for a PP node that is connected to another node or word by means of an edge whose `rel` property is the string `hdf`.
In Text-Fabric we can also make queries like this.
We have edges between similar verses, and these edges are labelled with the similarity of the two verses as a percentage.
First we look for verses that are exactly 90% similar, and then for verses that are more than 90% similar.
```python
query1 = """
verse
-crossref=90> verse
"""

query2 = """
verse
-crossref>90> verse
"""

results1 = A.search(query1)
results2 = A.search(query2)
```
0.03s 240 results
0.04s 9574 results
A.table(results1, end=2, condenseType="verse", full=True)
n | p | verse | verse |
---|---|---|---|
1 | Genesis 25:31 | W J>MR J<QB MKRH KJWM >T&BKRTK LJ00 | W J>MR J<QB HCB<H LJ KJWM W JCB< LW W JMKR >T&BKRTW LJ<QB00 |
2 | Genesis 25:33 | W J>MR J<QB HCB<H LJ KJWM W JCB< LW W JMKR >T&BKRTW LJ<QB00 | W J>MR J<QB MKRH KJWM >T&BKRTK LJ00 |
Alpino does a good job of understanding linguistics. It has quite a few meaningful linguistic relationships, and its corpora supply the data for them.
Text-Fabric is different. It is much more agnostic. It only assumes that there is an ordered set of slots plus nodes that represent certain subsets of slots; the nodes are divided into types.
Both the types and the subsets are given with the corpus as TF data. Above we saw the `otype.tf` file, but there is also an `oslots.tf` file, which maps each non-slot node to the set of slots it is linked to.
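A sketch of the idea behind such an `oslots` mapping, with made-up node and slot numbers, not the real BHSA data: once each non-slot node is mapped to its set of slots, embedding between nodes is simply set inclusion.

```python
# Made-up oslots mapping: non-slot node -> set of slot numbers it occupies.
oslots = {
    100: {1, 2, 3, 4, 5},  # say, a clause
    200: {2, 3},           # say, a phrase inside it
    300: {6, 7},           # a phrase outside it
}

def embeds(container, contained):
    # `contained` lies within `container` iff its slots form a subset
    return oslots[contained] <= oslots[container]
```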
What do queries return?
In Alpino they return nodes and/or values that certain features have for those nodes.
In Text-Fabric they return tuples of nodes. In a TF query, most lines specify a node with properties that the query has to instantiate in the corpus. If a query specifies 10 such nodes, then the query results are 10-tuples of nodes in the corresponding order.
The nodes returned are naked, not dressed-up with features.
In real life, people issue TF queries either in the Text-Fabric browser, where they can customise how query results are displayed, or they catch the result nodes in their own programs (which typically run in a Jupyter notebook, but by no means necessarily so).
In the Text-Fabric browser users can influence the features that must be displayed in various ways, one of them being: if a feature is mentioned in a query, then it is displayed. Users can also categorically request features, or inhibit certain standard features that are displayed by default. (These defaults are not Text-Fabric things, but set by the corpus modeller).
In Jupyter notebooks, users can achieve the same effects programmatically by calling the functions `table()` and `show()` with various keyword arguments.
All in all, given the difference in purpose, technology and scope between Text-Fabric and Alpino, the underlying concepts map fairly well from the one to the other.
It is not always the case that a query is a neatly nested template. In fact, such queries use only the "embedding" relationship, but there are many more relationships.
In Alpino, you can give names to the nodes in a query, and the same is true for Text-Fabric.
For example:
```python
query = """
clause
  phrase
    := w1:word sp=verb
  <: phrase
    =: w2:word sp=verb
w1 .lex. w2
"""
```
This means: find a verb in each of two phrases of a clause. The first verb should be the last word in its phrase, the second verb the first word in its phrase.
The last line states a relationship between the two words: their lexeme values should be identical.
results = A.search(query)
0.66s 475 results
A.table(results, end=10, skipCols={2, 3, 4, 5})
n | p | clause | phrase | word | phrase | word |
---|---|---|---|---|---|---|
1 | Genesis 2:16 | MKL <Y&HGN >KL T>KL00 | ||||
2 | Genesis 2:17 | KJ BJWM MWT TMWT00 | ||||
3 | Genesis 3:4 | L>& MWT TMTWN00 | ||||
4 | Genesis 3:16 | HRBH >RBH <YBWNK WHRNK | ||||
5 | Genesis 8:7 | W JY> JYW> WCWB | ||||
6 | Genesis 12:3 | W >BRKH MBRKJK | ||||
7 | Genesis 15:13 | JD< TD< | ||||
8 | Genesis 16:10 | HRBH >RBH >T&ZR<K | ||||
9 | Genesis 17:13 | HMWL05 JMWL JLJD BJTK WMQNT KSPK | ||||
10 | Genesis 18:10 | CWB >CWB >LJK K<T XJH |
In Alpino you can use quantifiers. These are parts of a query where you look for the existence or non-existence of certain patterns. This seems to be a bit of a problematic device, because there are certain conditions on quantifiers.
In Text-Fabric it is no different: there are quantifiers here too, and there are likewise restrictions on the quantifier expressions.
In Text-Fabric, quantified parts of the query do not contribute to the result tuple.
Here is an example.
Let's see how many `VP` phrases there are:
resultsVP = A.search("""phrase typ=VP""")
0.16s 69024 results
Sometimes VPs contain a noun:
```python
query = """
phrase typ=VP
  word sp=subs
"""
```
resultsWithNoun = A.search(query, shallow=True)
0.31s 234 results
Note the `shallow=True`: this means that we deliver the results differently: not as a tuple of tuples, but as a set of nodes that correspond to the first node in the query: the `phrase`.
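In fact, the effect of `shallow=True` on a deep result set can be mimicked with a one-line set comprehension (made-up node numbers):

```python
# Deep results: tuples of (phrase, word) node numbers (made-up values).
deepResults = [(900, 1), (900, 2), (901, 7)]

# Shallow results: the set of distinct first nodes of the tuples.
shallowResults = {r[0] for r in deepResults}
```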
If we want the VPs without nouns:
```python
resultsWithoutNoun = A.search("""
phrase typ=VP
/without/
  word sp=subs
/-/
""")
```
0.35s 68790 results
A check to see if the results add up:
len(resultsVP) == len(resultsWithNoun) + len(resultsWithoutNoun)
True
OK, the numbers of results of the different queries are as expected, but we can also compare the results themselves:
set(r[0] for r in resultsVP) == resultsWithNoun | set(r[0] for r in resultsWithoutNoun)
True
If we want the VPs with only verbs:
```python
resultsVerb1 = A.search("""
phrase typ=VP
/without/
  word sp#verb
/-/
""")
```
0.45s 62771 results
Or, in a slightly different way, showing a different quantifier:
```python
resultsVerb2 = A.search("""
phrase typ=VP
/where/
  w:word
/have/
  w sp=verb
/-/
""")
```
0.70s 62771 results
Unlike Alpino, Text-Fabric has no SQL-like constructs to count, group, and aggregate results, except for the `shallow=True` parameter, which reduces all results sharing the same first node to a single result consisting of that node only.
Text-Fabric relies on post-processing by the user, either in the program in which the query was issued, or in other programs into which results exported from the Text-Fabric browser are imported.
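For instance, grouping and counting, which SQL would do with GROUP BY, takes only a few lines of ordinary Python over the result tuples. A sketch with made-up node numbers:

```python
from collections import Counter

# Made-up query results: tuples (sentence, clause, phrase, word) of node ints.
queryResults = [
    (1000, 2000, 3000, 10),
    (1000, 2000, 3001, 11),
    (1001, 2002, 3002, 12),
]

# How many results fall in each sentence (the first node of each tuple)?
perSentence = Counter(r[0] for r in queryResults)
```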
There is also an API function A.export() for use in your programs to export results as tab-separated tables of dressed-up nodes.
Concerning sorting: the search function can be passed a sort key to order the results. If `sort=True` is passed, the results are ordered by the text-induced ordering of the result tuples.
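The text-induced ordering itself can be approximated in plain Python: order result tuples by the textual positions of their nodes. A sketch with a made-up node-to-first-slot mapping; this is only an approximation of Text-Fabric's canonical order:

```python
# Made-up mapping from node to its first slot (textual position).
firstSlot = {500: 7, 501: 3, 502: 3, 10: 3, 11: 4}

resultTuples = [(500, 10), (501, 11), (502, 10)]

# Order tuples by comparing, position by position, the first slots of their nodes.
orderedResults = sorted(resultTuples, key=lambda r: tuple(firstSlot[n] for n in r))
```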
Alpino provides set-theoretic operations on results, in Text-Fabric the user has to do that by means of post-processing.
Note that Text-Fabric constrains search templates: they have to be connected components, in the sense that between every pair of nodes in the query template there must be a path of relationships.
If a search template would consist of multiple connected components, the result would be the cartesian product of the results of the individual queries.
Such result sets are potentially monstrous, and it is unlikely that the user can and will deal with them, so they are prohibited.
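The blow-up is easy to quantify: two unconnected components with m and n results each would yield m × n combined results. A sketch with made-up numbers:

```python
from itertools import product

# Two made-up, unconnected sub-queries and their result lists.
resultsA = [(i,) for i in range(300)]    # 300 results
resultsB = [(j,) for j in range(4000)]   # 4000 results

# An unconnected template would implicitly deliver the cartesian product,
# one combined tuple per pair of sub-results:
combined = (a + b for (a, b) in product(resultsA, resultsB))

# The result count multiplies: 300 * 4000 combined tuples.
combinedCount = len(resultsA) * len(resultsB)
```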
Alpino search is based on Cypher and SQL. Limitations in Cypher queries can sometimes be compensated by excursions into SQL.
Also in Text-Fabric the expressive power of queries is limited. Moreover, there are also queries that can be expressed but require too much time to execute.
That is partly due to lack of sophistication in the Text-Fabric engine and partly due to inherent complexity of spatial relationships between nodes.
In Text-Fabric we do not have an escape to another query language. Instead, the escape is to hand-coding.
There is a tutorial notebook in which we explore a difficult query task. Although we can solve it by a query, we also do it by hand-coding. We make sure both give the same result and then we save the result as a named set to disk.
We can then invoke Text-Fabric later on with a parameter to include this named set. At that moment the name of the set can be used in queries in place where a node type is expected.
In the above notebook the section Custom sets for (non-)gapped phrases shows how that works.
In Alpino there are pre-computed pieces of data, e.g. the feature `vor_feld`. This is advertised as a device that simplifies queries and makes them much more efficient.
In Text-Fabric pre-computation is also used, at several levels.
Some of the pre-computation belongs to the internals of Text-Fabric, such as
spatial indexes that facilitate the computation of embedding relations between nodes, among
other things.
See for example `levUp`.
Such precomputed data is also made available in raw form to the end user
through the Computed API.
At the second level there are computed features that the corpus modeller has included into the corpus. For example, in the BHSA there are features for the frequency and rank of words and lexemes. See freq/rank.
There are also features that have been added by others to the corpus as a separate module. The similarity feature that we encountered before is an example of that.
The named sets mentioned before are an example of how end users themselves can compute data that is helpful in subsequent queries.
This is the ethos of Text-Fabric: that end users, corpus modellers, and researchers have maximum scope and flexibility to compute with the corpus.
Alpino and Text-Fabric are very different in the size of corpora they deal with and the specific features they assume to be present in the corpora.
Alpino corpora are linguistic corpora, Text-Fabric corpora do not have to be linguistic. There are even corpora in Text-Fabric whose texts are not in a language, such as the proto-cuneiform tablets of Uruk.
Can there be synergies between the Alpino world and the Text-Fabric world?
There are several corpora in Text-Fabric that could benefit from linguistic tools, especially tokenisers, pos-taggers and morphological taggers. Whether Alpino can help depends on the languages that are supported by Alpino, because up till now Text-Fabric deals with historical corpora using mixtures of historical languages with a variety of spelling idiosyncrasies.
When end users want to combine close reading with data analysis, Text-Fabric is a handy tool. Although Text-Fabric cannot deal with huge corpora, it can handle corpora of 5 million words with dozens of features, or 0.5 million words with over a hundred features.
Text-Fabric also has machinery to deal with volumes of a corpus.
So one could use Alpino to make a top-level search on a huge corpus, and then export results volume by volume, where Text-Fabric can be used to deal with individual volumes.
Regardless of whether there is real synergy between Alpino and Text-Fabric, it is encouraging to see that once text is regarded as a graph, there is a certain logic in how a query language should work. Both Alpino and Text-Fabric have hit upon the same elements, driven by how other tools have tackled these things.
Whereas Alpino rests on Cypher (at least for this part of its query capability), Text-Fabric has been inspired by Emdros by Ulrik Sandborg-Petersen.