You might want to consider the start of this tutorial.
Search in Text-Fabric is a template-based way of looking for structural patterns in your dataset.
It is inspired by the idea of topographic query, as worked out in MQL, which has been implemented in Emdros. See also pitfalls of MQL.
Within Text-Fabric we have the unique possibility to combine the ease of formulating search templates for complicated syntactical patterns with the power of programmatically processing the results.
This notebook will show you how to get up and running.
See the notebook searchFromMQL for examples of how MQL queries can be expressed in Text-Fabric search.
Search is a powerful feature for a wide range of purposes.
Quite a bit of the implementation work has been dedicated to optimizing performance. Yet I do not pretend to have found optimal strategies for all possible search templates. Some search tasks may turn out to be somewhat costly or even very costly.
That being said, I think search might turn out helpful in many cases, especially by reducing the amount of hand-coding needed to work with special subsets of your data.
Search is as simple as saying (just an example):
results = A.search(template)
A.show(results)
See all ins and outs in the search template docs.
%load_ext autoreload
%autoreload 2
The ins and outs of installing Text-Fabric, getting the corpus, and initializing a notebook are explained in the start tutorial.
from tf.app import use
A = use('ETCBC/bhsa', hoist=globals())
Locating corpus resources ...
Name | # of nodes | # slots/node | % coverage |
---|---|---|---|
book | 39 | 10938.21 | 100 |
chapter | 929 | 459.19 | 100 |
lex | 9230 | 46.22 | 100 |
verse | 23213 | 18.38 | 100 |
half_verse | 45179 | 9.44 | 100 |
sentence | 63717 | 6.70 | 100 |
sentence_atom | 64514 | 6.61 | 100 |
clause | 88131 | 4.84 | 100 |
clause_atom | 90704 | 4.70 | 100 |
phrase | 253203 | 1.68 | 100 |
phrase_atom | 267532 | 1.59 | 100 |
subphrase | 113850 | 1.42 | 38 |
word | 426590 | 1.00 | 100 |
We start with the simplest form of issuing a query. Let's look for the proper nouns in 1 Samuel. We also want to show the clauses in which they occur.
All work involved in searching takes place under the hood.
query = """
book book=Samuel_I
  clause
    word sp=nmpr
"""
results = A.search(query)
0.22s 1868 results
We have the results. We only need to display them. Here are the first few in a table:
A.table(results, end=3)
n | p | book | clause | word |
---|---|---|---|---|
1 | 1_Samuel 1:1 | 1_Samuel | וַיְהִי֩ אִ֨ישׁ אֶחָ֜ד מִן־הָרָמָתַ֛יִם צֹופִ֖ים מֵהַ֣ר אֶפְרָ֑יִם | אֶפְרָ֑יִם |
2 | 1_Samuel 1:1 | 1_Samuel | וּשְׁמֹ֡ו אֶ֠לְקָנָה בֶּן־יְרֹחָ֧ם בֶּן־אֱלִיה֛וּא בֶּן־תֹּ֥חוּ בֶן־צ֖וּף אֶפְרָתִֽי׃ | אֶ֠לְקָנָה |
3 | 1_Samuel 1:1 | 1_Samuel | וּשְׁמֹ֡ו אֶ֠לְקָנָה בֶּן־יְרֹחָ֧ם בֶּן־אֱלִיה֛וּא בֶּן־תֹּ֥חוּ בֶן־צ֖וּף אֶפְרָתִֽי׃ | יְרֹחָ֧ם |
The hyperlinks in the p column point to SHEBANQ, to the verse most relevant to the individual results.
The column with the book is not very informative. We can leave it out.
You can leave columns out by passing skipCols=xxx, where xxx is a set of numbers, which may also be passed as a space-separated string of numbers.
Note that the book column is the first column (counting starts at 1, right after the p column).
A.table(results, end=10, skipCols="1")
n | p | clause | word |
---|---|---|---|
1 | 1_Samuel 1:1 | וַיְהִי֩ אִ֨ישׁ אֶחָ֜ד מִן־הָרָמָתַ֛יִם צֹופִ֖ים מֵהַ֣ר אֶפְרָ֑יִם | אֶפְרָ֑יִם | |
2 | 1_Samuel 1:1 | וּשְׁמֹ֡ו אֶ֠לְקָנָה בֶּן־יְרֹחָ֧ם בֶּן־אֱלִיה֛וּא בֶּן־תֹּ֥חוּ בֶן־צ֖וּף אֶפְרָתִֽי׃ | אֶ֠לְקָנָה | |
3 | 1_Samuel 1:1 | וּשְׁמֹ֡ו אֶ֠לְקָנָה בֶּן־יְרֹחָ֧ם בֶּן־אֱלִיה֛וּא בֶּן־תֹּ֥חוּ בֶן־צ֖וּף אֶפְרָתִֽי׃ | יְרֹחָ֧ם | |
4 | 1_Samuel 1:1 | וּשְׁמֹ֡ו אֶ֠לְקָנָה בֶּן־יְרֹחָ֧ם בֶּן־אֱלִיה֛וּא בֶּן־תֹּ֥חוּ בֶן־צ֖וּף אֶפְרָתִֽי׃ | אֱלִיה֛וּא | |
5 | 1_Samuel 1:1 | וּשְׁמֹ֡ו אֶ֠לְקָנָה בֶּן־יְרֹחָ֧ם בֶּן־אֱלִיה֛וּא בֶּן־תֹּ֥חוּ בֶן־צ֖וּף אֶפְרָתִֽי׃ | תֹּ֥חוּ | |
6 | 1_Samuel 1:1 | וּשְׁמֹ֡ו אֶ֠לְקָנָה בֶּן־יְרֹחָ֧ם בֶּן־אֱלִיה֛וּא בֶּן־תֹּ֥חוּ בֶן־צ֖וּף אֶפְרָתִֽי׃ | צ֖וּף | |
7 | 1_Samuel 1:2 | שֵׁ֤ם אַחַת֙ חַנָּ֔ה | חַנָּ֔ה | |
8 | 1_Samuel 1:2 | וְשֵׁ֥ם הַשֵּׁנִ֖ית פְּנִנָּ֑ה | פְּנִנָּ֑ה | |
9 | 1_Samuel 1:2 | לִפְנִנָּה֙ יְלָדִ֔ים | פְנִנָּה֙ | |
10 | 1_Samuel 1:2 | וּלְחַנָּ֖ה אֵ֥ין יְלָדִֽים׃ | חַנָּ֖ה |
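The two accepted forms of skipCols described above (a set of numbers or a space-separated string) can be illustrated with a small stand-alone sketch. The helpers parse_skip_cols and drop_columns are hypothetical names made up for this illustration; they are not part of Text-Fabric, and the row values are placeholders rather than real nodes.

```python
def parse_skip_cols(skip_cols):
    """Accept a set of ints or a space-separated string; return a set of ints."""
    if isinstance(skip_cols, str):
        return {int(part) for part in skip_cols.split()}
    return set(skip_cols)


def drop_columns(row, skip_cols):
    """Drop the columns listed in skip_cols from a result tuple (counting from 1)."""
    skip = parse_skip_cols(skip_cols)
    return tuple(cell for (i, cell) in enumerate(row, start=1) if i not in skip)


# Placeholder row shaped like a result of the query above: (book, clause, word).
row = ("book-node", "clause-node", "word-node")

print(drop_columns(row, "1"))    # the book column is column 1
print(drop_columns(row, {1}))    # same effect, passing a set
print(drop_columns(row, "1 2"))  # several columns at once
```

The p column is not part of the result tuple, which is why the book column counts as column 1.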
Here is the first one in a pretty display:
A.show(results, end=1)
result 1
We are going to do some more work where we want to skip column 1, so we make that the temporary default:
A.displaySetup(skipCols="1")
Now we show a few results without the book column:
A.show(results, end=2)
result 1
result 2
or, stopping at the clause level:
A.show(results, end=2, baseTypes={"clause"})
result 1
result 2
We can view the results in phonetic representation as well:
A.table(results, end=3, fmt="text-phono-full")
n | p | clause | word |
---|---|---|---|
1 | 1_Samuel 1:1 | wa yᵊhˌî ʔˌîš ʔeḥˈāḏ min-hārāmāṯˈayim ṣôfˌîm mēhˈar ʔefrˈāyim | ʔefrˈāyim | |
2 | 1_Samuel 1:1 | û šᵊmˈô ʔelqānˌā ben-yᵊrōḥˈām ben-ʔᵉlîhˈû ben-tˌōḥû ven-ṣˌûf ʔefrāṯˈî . | ʔelqānˌā | |
3 | 1_Samuel 1:1 | û šᵊmˈô ʔelqānˌā ben-yᵊrōḥˈām ben-ʔᵉlîhˈû ben-tˌōḥû ven-ṣˌûf ʔefrāṯˈî . | yᵊrōḥˈām |
A.show(results, end=1, fmt="text-phono-full")
result 1
We are done with this query and its results. We reset the skipCols parameter.
A.displayReset("skipCols")
There are two fundamentally different ways of presenting the results: condensed and uncondensed.
In uncondensed view, all results are listed individually. You can keep track of which parts belong to which results. The display can become unwieldy.
This is the default view, because it is the most straightforward and logical answer to your query.
In condensed view all nodes of all results are grouped in containers first (e.g. verses), and then presented container by container. You lose the information of what parts belong to what result.
Here is an example of the difference.
query = """
book book=Genesis
  chapter chapter=1
    verse verse=1
      sentence
% order is not important!
        word nu=sg
        word nu=pl
"""
Note that you can have comments in a search template. Comment lines start with a %.
results = A.search(query)
0.48s 6 results
The book, chapter, verse columns are completely uninformative, so:
A.displaySetup(skipCols="1 2 3")
A.table(results, withPassage=True)
n | p | sentence | word | word |
---|---|---|---|---|
1 | Genesis 1:1 | בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ | רֵאשִׁ֖ית | אֱלֹהִ֑ים | |||
2 | Genesis 1:1 | בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ | רֵאשִׁ֖ית | שָּׁמַ֖יִם | |||
3 | Genesis 1:1 | בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ | בָּרָ֣א | אֱלֹהִ֑ים | |||
4 | Genesis 1:1 | בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ | בָּרָ֣א | שָּׁמַ֖יִם | |||
5 | Genesis 1:1 | בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ | אָֽרֶץ׃ | אֱלֹהִ֑ים | |||
6 | Genesis 1:1 | בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ | אָֽרֶץ׃ | שָּׁמַ֖יִם |
There are two plural and three singular words in Genesis 1:1. Search templates do not specify order, so all six combinations qualify as results.
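The count of six can be checked with a few lines of plain Python. This is only a sketch: the slot numbers below are assumed positions of the number-marked words in Genesis 1:1, and the snippet does not call Text-Fabric at all.

```python
from itertools import product

# Assumed slot numbers of the words in Genesis 1:1 that are marked for number.
singular = [2, 3, 11]
plural = [4, 7]

# Without an order constraint, every singular word pairs with every plural word.
pairs = list(product(singular, plural))

print(len(pairs))
for pair in pairs:
    print(pair)
```

Three singular words times two plural words gives the six result tuples.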
Let's expand the results display:
A.show(results)
result 1
result 2
result 3
result 4
result 5
result 6
As you see, the results are listed per result tuple, even if they all occur in the same verse. This way you can keep track of what exactly belongs to each result.
Now in condensed mode:
A.show(results, condensed=True)
Here we have all words of all results in one display. But we cannot see that each result has two words, let alone which ones.
Note that we can apply different highlight colors to different parts of the result. The words in the pair are members 5 and 6 of the result tuples. The members that we do not map will not be highlighted. The members that we map to the empty string will be highlighted with the default color.
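What condensing does to the results can be sketched in plain Python. This is an illustration of the idea only, not Text-Fabric's implementation: the function condense, the node numbers and the verse_of lookup are all made up for the example.

```python
from collections import defaultdict

# Toy data: each result is (sentence, word, word); all node numbers are made up.
results = [
    (100, 2, 4),
    (100, 2, 7),
    (100, 3, 4),
    (200, 21, 25),
]

# Hypothetical lookup from a node to the verse that contains it.
verse_of = {
    100: "Genesis 1:1", 2: "Genesis 1:1", 3: "Genesis 1:1",
    4: "Genesis 1:1", 7: "Genesis 1:1",
    200: "Genesis 1:2", 21: "Genesis 1:2", 25: "Genesis 1:2",
}


def condense(results, container_of):
    """Pool the nodes of all results per container, forgetting the result tuples."""
    containers = defaultdict(set)
    for result in results:
        for node in result:
            containers[container_of[node]].add(node)
    return {container: sorted(nodes) for (container, nodes) in containers.items()}


condensed = condense(results, verse_of)
for (verse, nodes) in condensed.items():
    print(verse, nodes)
```

After pooling, nothing records that nodes 2 and 4 once formed a pair, which is exactly the information that the condensed view gives up.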
NB: Choose your colors from the CSS specification.
A.show(results, condensed=False, colorMap={4: "", 5: "cyan", 6: "magenta"})
result 1
result 2
result 3
result 4
result 5
result 6
Color mapping works best for uncondensed results. If you condense results, some nodes may occupy different positions in different results. It is unpredictable which color will be used for such nodes:
A.show(results, condensed=True, colorMap={4: "", 5: "cyan", 6: "magenta"})
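Why the color becomes ambiguous can be shown with a small sketch on toy data. The function colors_per_node is a hypothetical helper written for this illustration, and the node numbers are made up; how Text-Fabric actually resolves such conflicts is not modelled here.

```python
def colors_per_node(results, color_map):
    """Collect, for every node, the colors it would receive across all results."""
    colors = {}
    for result in results:
        for (position, node) in enumerate(result, start=1):
            if position in color_map:
                colors.setdefault(node, set()).add(color_map[position])
    return colors


# Toy results in which node 11 is member 1 of one result and member 2 of another.
results = [(10, 11), (11, 12)]
color_map = {1: "cyan", 2: "magenta"}

colors = colors_per_node(results, color_map)
conflicts = {node for (node, found) in colors.items() if len(found) > 1}

print(colors)
print(conflicts)
```

In an uncondensed display each result is drawn separately, so node 11 can be cyan in one and magenta in the other; in a condensed display it is drawn once and only one of the two colors can win.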
You can specify to what container you want to condense. By default, everything is condensed to verses.
Let's change that to phrases:
A.show(
    results,
    condensed=False,
    condenseType="phrase",
    colorMap={4: "", 5: "cyan", 6: "magenta"},
)
result 1
result 2
result 3
result 4
result 5
result 6
You can stipulate an order on the words in your template. You only have to put a relational operator between them. Say we want only results where the plural follows the singular.
query = """
book book=Genesis
  chapter chapter=1
    verse verse=1
      sentence
        word nu=sg
        < word nu=pl
"""
Note that we keep the skipCols="1 2 3" display setting in force, since it is relevant for this query as well.
results = A.search(query)
A.table(results)
0.49s 4 results
n | p | sentence | word | word |
---|---|---|---|---|
1 | Genesis 1:1 | בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ | רֵאשִׁ֖ית | אֱלֹהִ֑ים | |||
2 | Genesis 1:1 | בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ | רֵאשִׁ֖ית | שָּׁמַ֖יִם | |||
3 | Genesis 1:1 | בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ | בָּרָ֣א | אֱלֹהִ֑ים | |||
4 | Genesis 1:1 | בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ | בָּרָ֣א | שָּׁמַ֖יִם |
We can also require the words to be adjacent.
query = """
book book=Genesis
  chapter chapter=1
    verse verse=1
      sentence
        word nu=sg
        <: word nu=pl
"""
results = A.search(query)
colorMap = {5: "lightsalmon", 6: "mediumaquamarine"}
A.table(results, colorMap=colorMap)
A.show(results, condensed=False, colorMap=colorMap)
0.45s 1 result
n | p | sentence | word | word |
---|---|---|---|---|
1 | Genesis 1:1 | בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ | בָּרָ֣א | אֱלֹהִ֑ים |
result 1
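The effect of the two operators used above, < (comes before) and <: (immediately precedes), can be mimicked on plain slot numbers. This is a sketch under the assumption that the number-marked words of Genesis 1:1 sit at the slots listed below; it only imitates the constraints for single words and does not use Text-Fabric.

```python
# Assumed slot numbers of the words in Genesis 1:1 that are marked for number.
singular = [2, 3, 11]
plural = [4, 7]

all_pairs = [(sg, pl) for sg in singular for pl in plural]

# "<": the singular word comes before the plural word.
before = [(sg, pl) for (sg, pl) in all_pairs if sg < pl]

# "<:": the singular word is immediately followed by the plural word.
adjacent = [(sg, pl) for (sg, pl) in all_pairs if sg + 1 == pl]

print(len(all_pairs), len(before), len(adjacent))
print(adjacent)
```

The counts agree with the 6, 4 and 1 results reported by the three queries.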
We would like to see the gender, number and person for words.
The way to do that is to call A.displaySetup(extraFeatures=features) first.
A.displaySetup(
    extraFeatures="ps gn nu", colorMap={2: "lightsalmon", 3: "mediumaquamarine"}
)
A.show(results, condensed=False)
result 1
The features without meaningful values have been left out. We can also change that by passing a set of values we think are not meaningful. The default set is {None, 'NA', 'none', 'unknown'}.
A.displaySetup(noneValues=set())
A.show(results, condensed=False)
result 1
This makes clear that it is convenient to keep None in the noneValues:
A.displaySetup(noneValues={None})
A.show(results, condensed=False)
result 1
We can even choose to suppress other values, e.g. the masculine gender values and the singular number values.
A.displaySetup(noneValues={None, "NA", "unknown", "none", "m", "sg"})
A.show(results, condensed=False)
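How the choice of noneValues decides which features get shown can be sketched without the corpus. The function visible_features is a hypothetical helper for this illustration, not a Text-Fabric function, and the feature values of the example word are illustrative.

```python
DEFAULT_NONE_VALUES = {None, "NA", "none", "unknown"}


def visible_features(features, none_values=DEFAULT_NONE_VALUES):
    """Keep only the features whose value is not considered meaningless."""
    return {
        name: value
        for (name, value) in features.items()
        if value not in none_values
    }


# Illustrative feature values for a noun: person does not apply to it.
word_features = {"ps": "NA", "gn": "m", "nu": "pl"}

print(visible_features(word_features))
print(visible_features(word_features, none_values=set()))
print(visible_features(word_features, {None, "NA", "unknown", "none", "m", "sg"}))
```

With the default set the person feature disappears, with an empty set everything is shown, and with the extended set only the plural number remains.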
In the rest of the notebook we stick to our normal setup, so we reset the extra features.
A.displayReset("extraFeatures")
A.show(results, condensed=False)
Now we completely reset the display customization.
A.displayReset()
So far we have shown the results of searches.
But you can also construct your own tuples and show them.
You can use search to get a pretty good approximation of what you want, but most of the time you do not arrive precisely at your destination.
Here is an example where we use search to come close, and then work our way to produce the end result.
We look for clauses with a one-word subject that does not agree in number with its predicate.
In our search templates we cannot formulate that a feature has different values on two nodes in the template. We could spell out all possible combinations of values and make a search template for each of them, but that is needlessly complex.
Let's first use search to find all clauses containing a one-word subject and a predicate. And, to narrow it down further, we require that the word in the subject and the verb in the predicate are marked for number.
(You may want to consult the feature docs, see the link at the start of the notebook, where use() is called.)
Note that the order of the phrases does not matter.
query = """
clause
  phrase function=Subj
    =: word nu=sg|pl
    :=
  phrase function=Pred|PreO
    word sp=verb
      nu=sg|pl
"""
results = A.search(query)
1.02s 10638 results
Now the hand coding begins. We are going to extract the tuples we want.
wantedResults = tuple(
    (subj, pred)
    for (clause, phraseS, subj, phraseV, pred) in results
    if F.nu.v(subj) != F.nu.v(pred)
)
print(f"{len(wantedResults)} filtered results")
467 filtered results
And now we can show them:
wantedResults[0:4]
((4, 3), (34, 33), (42, 41), (50, 49))
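The same filter-after-search pattern can be tried without the corpus, and extended with a tally of the kinds of mismatch. Everything in this sketch is toy data: the dictionary nu stands in for the feature lookup F.nu.v, and the node numbers and values are made up.

```python
from collections import Counter

# Toy stand-in for F.nu.v: grammatical number per word node (made-up values).
nu = {4: "pl", 3: "sg", 34: "pl", 33: "sg", 60: "sg", 61: "sg", 80: "sg", 81: "pl"}

# Toy results shaped like the search results: (clause, phraseS, subj, phraseV, pred).
results = [
    (1000, 2000, 4, 2001, 3),
    (1001, 2002, 60, 2003, 61),
    (1002, 2004, 34, 2005, 33),
    (1003, 2006, 80, 2007, 81),
]

# Keep only the (subject, predicate) pairs that disagree in number.
mismatches = tuple(
    (subj, pred)
    for (clause, phrase_s, subj, phrase_v, pred) in results
    if nu[subj] != nu[pred]
)

# Tally the mismatches by (subject number, predicate number).
tally = Counter((nu[subj], nu[pred]) for (subj, pred) in mismatches)

print(mismatches)
print(tally)
```

A tally like this is one way to see which direction of disagreement dominates before inspecting individual cases.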
A.table(
    wantedResults, start=1, end=4, colorMap={1: "lightsalmon", 2: "mediumaquamarine"}
)
n | p | word | word |
---|---|---|---|
1 | Genesis 1:1 | אֱלֹהִ֑ים | בָּרָ֣א |
2 | Genesis 1:3 | אֱלֹהִ֖ים | יֹּ֥אמֶר |
3 | Genesis 1:4 | אֱלֹהִ֛ים | יַּ֧רְא |
4 | Genesis 1:4 | אֱלֹהִ֔ים | יַּבְדֵּ֣ל |
A.show(
    wantedResults, start=1, end=4, colorMap={1: "lightsalmon", 2: "mediumaquamarine"}
)
result 1
result 2
result 3
result 4
Now suppose that we want to highlight the non-qal verb forms with a different color.
We have to assign colors to the members of our tuples:
highlights = {}
for (subj, pred) in wantedResults:
    highlights[subj] = "lightsalmon"
    highlights[pred] = "mediumaquamarine" if F.vs.v(pred) == "qal" else "yellow"
Now we can call show with the highlights parameter instead of the colorMap parameter.
A.table(wantedResults, start=1, end=4, highlights=highlights)
n | p | word | word |
---|---|---|---|
1 | Genesis 1:1 | אֱלֹהִ֑ים | בָּרָ֣א |
2 | Genesis 1:3 | אֱלֹהִ֖ים | יֹּ֥אמֶר |
3 | Genesis 1:4 | אֱלֹהִ֛ים | יַּ֧רְא |
4 | Genesis 1:4 | אֱלֹהִ֔ים | יַּבְדֵּ֣ל |
Or in condensed pretty display:
A.show(
    wantedResults,
    condensed=True,
    start=3,
    end=3,
    highlights=highlights,
    extraFeatures="vs",
    withNodes=True,
)
verse 3
As you see, you have total control.
You know how to run queries and show off with their results.
The next thing is to dive deeper into the power of templates:
advanced, sets, relations, quantifiers, fromMQL, rough, gaps
CC-BY Dirk Roorda