You might want to consider the start of this tutorial.
%load_ext autoreload
%autoreload 2
from tf.app import use
A = use("ETCBC/bhsa", hoist=globals())
Locating corpus resources ...
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 39 | 10938.21 | 100 |
chapter | 929 | 459.19 | 100 |
lex | 9230 | 46.22 | 100 |
verse | 23213 | 18.38 | 100 |
half_verse | 45179 | 9.44 | 100 |
sentence | 63717 | 6.70 | 100 |
sentence_atom | 64514 | 6.61 | 100 |
clause | 88131 | 4.84 | 100 |
clause_atom | 90704 | 4.70 | 100 |
phrase | 253203 | 1.68 | 100 |
phrase_atom | 267532 | 1.59 | 100 |
subphrase | 113850 | 1.42 | 38 |
word | 426590 | 1.00 | 100 |
BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis
Searches often do not deliver the results you expect. Besides typos, lack of familiarity with the template formalism and bugs in the system, there is another cause: difficult semantics of the data.
Most users reason about phrases, clauses and sentences as if they are consecutive blocks of words. But in the BHSA this is not the case: each of these objects may have gaps.
Most of the time, verse boundaries coincide with the boundaries of sentences, clauses, and phrases. But not always: there are verse-spanning sentences.
Note
These phenomena may wreak havoc with your intuitive reasoning about what search templates should deliver. Query templates do not require the objects to be consecutive and still they make sense. But that might not be your sense, unless you Mind the gap!
We are going to show these issues in depth.
TF-search has no primitives to deal with gaps directly. Nodes correspond to textual objects such as words, phrases, clauses, verses, books. Usually these are consecutive sequences of one or more words, but in theory they can be arbitrary sets of slots.
And, as far as the BHSA corpus is concerned, in practice too. If we look at phrases, the overwhelming majority is consecutive, without gaps. But there is also a substantial number of phrases with gaps.
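To make the notion of a gap concrete, here is a minimal sketch in plain Python, independent of Text-Fabric. It treats a node as a collection of slot numbers. The function name `find_gaps` and the sample slot numbers are illustrative assumptions, not part of the TF API.

```python
def find_gaps(slots):
    """Return the gaps in a collection of slot numbers as (first, last) pairs.

    A gap is a maximal run of consecutive slots that lies between the first
    and the last slot of the node but does not belong to the node.
    """
    ordered = sorted(set(slots))
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt - prev > 1:
            gaps.append((prev + 1, nxt - 1))
    return gaps


# Hypothetical nodes: one consecutive, one with two gaps.
consecutive = [10, 11, 12]
gapped = [20, 21, 24, 25, 28]

print(find_gaps(consecutive))  # []
print(find_gaps(gapped))       # [(22, 23), (26, 27)]
```

A node is consecutive exactly when this list of gaps is empty.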
People who are familiar with MQL (see from MQL) may remember that in MQL you can search for a gap. The MQL query
SELECT ALL OBJECTS WHERE
[phrase FOCUS
[word lex='L']
[gap]
]
looks for a phrase with a gap in it (i.e. one or more consecutive words between the start and the end of the phrase that do not belong to the phrase). The query then asks additionally for those gap-containing phrases that have a certain word in front of the gap.
We want this too!
We start with a query that aims to get the same results as the MQL query above.
In our template, we require that there is a word wPreGap in the phrase that is just before the gap, and a word wGap that comes right after it, so it is in the gap, and hence does not belong to the phrase.
But this all must happen before the last word wLast of the phrase.
query = """
verse
  p:phrase
    wPreGap:word lex=L
    wLast:word
    :=
wGap:word
wPreGap <: wGap
wGap < wLast
p || wGap
"""
results = A.search(query)
0.46s 12 results
Nice and quick. Let's see the results.
A.table(results, skipCols="1")
n | p | verse | phrase | word | word | word |
---|---|---|---|---|---|---|
1 | Genesis 17:7 | לְךָ֙ וּֽלְזַרְעֲךָ֖ אַחֲרֶֽיךָ׃ | לְךָ֙ | אַחֲרֶֽיךָ׃ | לֵֽ | |
2 | Genesis 28:4 | לְךָ֙ לְךָ֖ וּלְזַרְעֲךָ֣ אִתָּ֑ךְ | לְךָ֙ | אִתָּ֑ךְ | אֶת־ | |
3 | Exodus 30:21 | לָהֶ֧ם לֹ֥ו וּלְזַרְעֹ֖ו | לָהֶ֧ם | זַרְעֹ֖ו | חָק־ | |
4 | Leviticus 25:6 | לָכֶם֙ לְךָ֖ וּלְעַבְדְּךָ֣ וְלַאֲמָתֶ֑ךָ וְלִשְׂכִֽירְךָ֙ וּלְתֹושָׁ֣בְךָ֔ | לָכֶם֙ | תֹושָׁ֣בְךָ֔ | לְ | |
5 | Numbers 20:15 | לָ֛נוּ וְלַאֲבֹתֵֽינוּ׃ | לָ֛נוּ | אֲבֹתֵֽינוּ׃ | מִצְרַ֖יִם | |
6 | Numbers 32:33 | לָהֶ֣ם׀ לִבְנֵי־גָד֩ וְלִבְנֵ֨י רְאוּבֵ֜ן וְלַחֲצִ֣י׀ שֵׁ֣בֶט׀ מְנַשֶּׁ֣ה בֶן־יֹוסֵ֗ף | לָהֶ֣ם׀ | יֹוסֵ֗ף | מֹשֶׁ֡ה | |
7 | Deuteronomy 1:36 | לֹֽו־וּלְבָנָ֑יו | לֹֽו־ | בָנָ֑יו | אֶתֵּ֧ן | |
8 | Deuteronomy 26:11 | לְךָ֛ וּלְבֵיתֶ֑ךָ | לְךָ֛ | בֵיתֶ֑ךָ | יְהוָ֥ה | |
9 | 1_Samuel 25:31 | לְךָ֡ לַאדֹנִ֗י | לְךָ֡ | אדֹנִ֗י | לְ | |
10 | 2_Kings 25:24 | לָהֶ֤ם וּלְאַנְשֵׁיהֶ֔ם | לָהֶ֤ם | אַנְשֵׁיהֶ֔ם | גְּדַלְיָ֨הוּ֙ | |
11 | Jeremiah 40:9 | לָהֶ֜ם וּלְאַנְשֵׁיהֶ֣ם | לָהֶ֜ם | אַנְשֵׁיהֶ֣ם | גְּדַלְיָ֨הוּ | |
12 | Daniel 9:8 | לָ֚נוּ לִמְלָכֵ֥ינוּ לְשָׂרֵ֖ינוּ וְלַאֲבֹתֵ֑ינוּ | לָ֚נוּ | אֲבֹתֵ֑ינוּ | בֹּ֣שֶׁת |
Let's color the word in the gap differently.
A.displaySetup(
    colorMap={2: "aqua", 3: "yellow", 4: "magenta"},
    condenseType="clause",
    skipCols="1",
)
A.table(results, condensed=False)
n | p | verse | phrase | word | word | word |
---|---|---|---|---|---|---|
1 | Genesis 17:7 | לְךָ֙ וּֽלְזַרְעֲךָ֖ אַחֲרֶֽיךָ׃ | לְךָ֙ | אַחֲרֶֽיךָ׃ | לֵֽ | |
2 | Genesis 28:4 | לְךָ֙ לְךָ֖ וּלְזַרְעֲךָ֣ אִתָּ֑ךְ | לְךָ֙ | אִתָּ֑ךְ | אֶת־ | |
3 | Exodus 30:21 | לָהֶ֧ם לֹ֥ו וּלְזַרְעֹ֖ו | לָהֶ֧ם | זַרְעֹ֖ו | חָק־ | |
4 | Leviticus 25:6 | לָכֶם֙ לְךָ֖ וּלְעַבְדְּךָ֣ וְלַאֲמָתֶ֑ךָ וְלִשְׂכִֽירְךָ֙ וּלְתֹושָׁ֣בְךָ֔ | לָכֶם֙ | תֹושָׁ֣בְךָ֔ | לְ | |
5 | Numbers 20:15 | לָ֛נוּ וְלַאֲבֹתֵֽינוּ׃ | לָ֛נוּ | אֲבֹתֵֽינוּ׃ | מִצְרַ֖יִם | |
6 | Numbers 32:33 | לָהֶ֣ם׀ לִבְנֵי־גָד֩ וְלִבְנֵ֨י רְאוּבֵ֜ן וְלַחֲצִ֣י׀ שֵׁ֣בֶט׀ מְנַשֶּׁ֣ה בֶן־יֹוסֵ֗ף | לָהֶ֣ם׀ | יֹוסֵ֗ף | מֹשֶׁ֡ה | |
7 | Deuteronomy 1:36 | לֹֽו־וּלְבָנָ֑יו | לֹֽו־ | בָנָ֑יו | אֶתֵּ֧ן | |
8 | Deuteronomy 26:11 | לְךָ֛ וּלְבֵיתֶ֑ךָ | לְךָ֛ | בֵיתֶ֑ךָ | יְהוָ֥ה | |
9 | 1_Samuel 25:31 | לְךָ֡ לַאדֹנִ֗י | לְךָ֡ | אדֹנִ֗י | לְ | |
10 | 2_Kings 25:24 | לָהֶ֤ם וּלְאַנְשֵׁיהֶ֔ם | לָהֶ֤ם | אַנְשֵׁיהֶ֔ם | גְּדַלְיָ֨הוּ֙ | |
11 | Jeremiah 40:9 | לָהֶ֜ם וּלְאַנְשֵׁיהֶ֣ם | לָהֶ֜ם | אַנְשֵׁיהֶ֣ם | גְּדַלְיָ֨הוּ | |
12 | Daniel 9:8 | לָ֚נוּ לִמְלָכֵ֥ינוּ לְשָׂרֵ֖ינוּ וְלַאֲבֹתֵ֑ינוּ | לָ֚נוּ | אֲבֹתֵ֑ינוּ | בֹּ֣שֶׁת |
A.show(results, end=3, condensed=False)
result 1
result 2
result 3
A.displayReset()
These were particular gaps. Now we want to get all gapped phrases.
We can just lift the special requirement that wPreGap has to satisfy a special lexical condition.
query = """
p:phrase
  wPreGap:word
  wLast:word
  :=
wGap:word
wPreGap <: wGap
wGap < wLast
p || wGap
"""
results = A.search(query)
0.91s 716 results
Not too bad! We could wait for it. Here are some results.
A.table(results, start=5, end=10)
n | p | phrase | word | word | word |
---|---|---|---|---|---|
5 | Genesis 2:25 | שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו | שְׁנֵיהֶם֙ | אִשְׁתֹּ֑ו | עֲרוּמִּ֔ים |
6 | Genesis 4:4 | הֶ֨בֶל גַם־ה֛וּא | הֶ֨בֶל | ה֛וּא | הֵבִ֥יא |
7 | Genesis 7:8 | מִן־הַבְּהֵמָה֙ הַטְּהֹורָ֔ה וּמִן־הַ֨בְּהֵמָ֔ה וּמִ֨ן־הָעֹ֔וף וְכֹ֥ל | בְּהֵמָ֔ה | כֹ֥ל | אֲשֶׁ֥ר |
8 | Genesis 7:14 | הֵ֜מָּה וְכָל־הַֽחַיָּ֣ה לְמִינָ֗הּ וְכָל־הַבְּהֵמָה֙ לְמִינָ֔הּ וְכָל־הָרֶ֛מֶשׂ לְמִינֵ֑הוּ וְכָל־הָעֹ֣וף לְמִינֵ֔הוּ כֹּ֖ל צִפֹּ֥ור כָּל־כָּנָֽף׃ | רֶ֛מֶשׂ | כָּנָֽף׃ | הָ |
9 | Genesis 7:21 | כָּל־בָּשָׂ֣ר׀ בָּעֹ֤וף וּבַבְּהֵמָה֙ וּבַ֣חַיָּ֔ה וּבְכָל־הַשֶּׁ֖רֶץ וְכֹ֖ל הָאָדָֽם׃ | בָּשָׂ֣ר׀ | אָדָֽם׃ | הָ |
10 | Genesis 7:21 | כָּל־בָּשָׂ֣ר׀ בָּעֹ֤וף וּבַבְּהֵמָה֙ וּבַ֣חַיָּ֔ה וּבְכָל־הַשֶּׁ֖רֶץ וְכֹ֖ל הָאָדָֽם׃ | שֶּׁ֖רֶץ | אָדָֽם׃ | הַ |
If a phrase has multiple gaps, we encounter it multiple times in our results.
We show the two results in Genesis 7:21.
A.show(
    results,
    condensed=False,
    condenseType="clause",
    start=9,
    end=10,
    colorMap={1: "lightgreen", 2: "orange", 4: "magenta"},
)
result 9
result 10
If we want just the phrases, and only once, we can run the query in shallow mode, see advanced:
gapQueryResults = A.search(query, shallow=True)
1.08s 672 results
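The difference between the two counts can be mimicked in plain Python. The sketch below uses made-up result tuples (the node numbers are hypothetical) and assumes that a shallow result is the set of first members of the full result tuples.

```python
# Hypothetical deep results: (phrase, wPreGap, wLast, wGap).
# Phrase 700 has two gaps, so it shows up twice.
deep_results = [
    (700, 11, 19, 12),
    (700, 15, 19, 16),
    (701, 30, 35, 31),
]

# Keep each phrase only once, as shallow mode does for the first member.
shallow_results = {result[0] for result in deep_results}

print(len(deep_results))        # 3
print(sorted(shallow_results))  # [700, 701]
```

This is why the shallow count can be lower than the number of full results: every extra gap in a phrase adds a tuple, but not a new phrase.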
Sometimes there are two subphrases with exactly the same words in them. They only differ in their values for the feature rela. Let's find such a case.
We do have to show the subphrases, though.
Some types are hidden; let's find out which ones:
A.displayShow()
hiddenTypes: clause_atom sentence_atom phrase_atom subphrase half_verse
dResults = A.search("""
s1:subphrase
== s2:subphrase
s1 < s2
""")
0.20s 2135 results
Let's pass a different set of hidden types when showing the results:
A.show(
    dResults,
    end=3,
    withNodes=True,
    condensed=True,
    hiddenTypes="clause_atom half_verse phrase_atom",
    condenseType="phrase",
)
phrase 1
phrase 2
phrase 3
More research has shown that these subphrases differ in the presence of the rela feature.
A.show(
    dResults,
    end=3,
    withNodes=True,
    colorMap={1: "lightsalmon", 2: "lightblue"},
    extraFeatures="rela",
    hiddenTypes="clause_atom half_verse phrase_atom",
    condenseType="phrase",
)
result 1
result 2
result 3
The red one has a feature rela='par', the blue one not.
Two nodes with the same node type and the same slots. Yet: different nodes, different feature annotations.
At the moment I do not know why the encoders of the BHSA have chosen to do this.
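The situation can be illustrated without Text-Fabric. The sketch below groups made-up subphrase nodes by their slots and reports the slot sets that are shared by more than one node. The node numbers, slots and the dictionary layout are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical subphrase nodes, each with its slots and a rela value
# (None stands for "no value").
subphrases = {
    9001: {"slots": (5, 6, 7), "rela": "par"},
    9002: {"slots": (5, 6, 7), "rela": None},
    9003: {"slots": (8, 9), "rela": None},
}

# Group the nodes by the exact slots they occupy.
by_slots = defaultdict(list)
for node, info in subphrases.items():
    by_slots[info["slots"]].append(node)

# Slot sets occupied by more than one node.
duplicates = {slots: nodes for slots, nodes in by_slots.items() if len(nodes) > 1}
print(duplicates)  # {(5, 6, 7): [9001, 9002]}
```

The two nodes in the duplicate group occupy the same slots, yet they carry different values for rela.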
We can make an equivalent query to get the gaps.
query = """
p:phrase
  =: wFirst:word
  wLast:word
  :=
wGap:word
wFirst < wGap
wLast > wGap
p || wGap
"""
Experience has shown that this is a slow query, so we handle it with care.
S.study(query)
S.showPlan(details=True)
0.00s Checking search template ...
0.00s Setting up search space for 4 objects ...
0.20s Constraining search space with 7 relations ...
0.50s 2 edges thinned
0.50s Setting up retrieval plan with strategy small_choice_multi ...
0.53s Ready to deliver results from 1186199 nodes
Iterate over S.fetch() to get the results
See S.showPlan() to interpret the results
Search with 4 objects and 6 relations
Results are instantiations of the following objects:
node 0-phrase 253203 choices
node 1-word 253203 choices
node 2-word 253203 choices
node 3-word 426590 choices
Performance parameters:
yarnRatio = 1.25
tryLimitFrom = 40
tryLimitTo = 40
Instantiations are computed along the following relations:
node 0-phrase 253203 choices
edge 0-phrase [[ 2-word 1.0 choices
edge 0-phrase := 2-word 0 choices
edge 0-phrase =: 1-word 1.0 choices (thinned)
edge 1-word ]] 0-phrase 0 choices
edge 1,2-word <,> 3-word 21329.5 choices
edge 3-word || 0-phrase 0 choices
0.58s The results are connected to the original search template as follows:
0
1 R0 p:phrase
2 R1 =: wFirst:word
3 R2 wLast:word
4 :=
5
6 R3 wGap:word
7 wFirst < wGap
8 wLast > wGap
9
10 p || wGap
11
S.count(progress=1, limit=4)
0.00s Counting results per 1 up to 4 ...
| 4.29s 1
| 4.29s 2
| 4.29s 3
| 4.29s 4
4.29s Done: 5 results
This is a good example of a query that is slow to deliver even its first result. And that is bad, because it is such a straightforward query.
Why is this one so slow, while the previous one went so smoothly?
The crucial thing is the wGap word. In the latter template, wGap is not embedded in anything.
It is constrained by wFirst < wGap and wGap < wLast.
However, the way the search strategy works is by examining all possibilities for wFirst < wGap and only then checking whether wGap < wLast.
The algorithm cannot check both conditions at the same time.
With embedding relations, things are better. Text-Fabric is heavily optimized to deal with embedding relationships.
In the former template, we see that wGap is required to be adjacent to wPreGap, and this one is embedded in the phrase. Hence there are few cases to consider for wPreGap, and per instance there is only one wGap.
Lesson
Try to avoid free-floating nodes in your template that are constrained only by spatial relationships other than embedding.
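A rough way to see the difference is to count how many candidate pairs each kind of constraint leaves open. The sketch below is a toy model in plain Python and not the actual Text-Fabric search strategy. The function names and the corpus size are made up for illustration.

```python
def candidates_before(n_words):
    """Pairs (w1, w2) with w1 < w2 among n_words slots: every later word qualifies."""
    return n_words * (n_words - 1) // 2


def candidates_adjacent(n_words):
    """Pairs (w1, w2) with w2 immediately after w1: one candidate per word."""
    return max(n_words - 1, 0)


n = 1000  # hypothetical corpus size in words
print(candidates_before(n))    # 499500
print(candidates_adjacent(n))  # 999
```

A plain "comes before" relation leaves a number of candidates that grows quadratically with the corpus, while adjacency leaves at most one candidate per word.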
The former template had it right. Can we rescue the latter template?
We can assume that the phrase and the gap each contain a word in one and the same verse. Note that phrase and gap may belong to different clauses and sentences. We assume that a phrase cannot belong to more than two verses, so either the first or the last word of the phrase is in the same verse as a word in the gap.
query = """
p:phrase
  =: wFirst:word
  wLast:word
  :=
wGap:word
wFirst < wGap
wLast > wGap
p || wGap
v:verse
v [[ wFirst
v [[ wGap
"""
S.study(query)
S.showPlan(details=True)
S.count(progress=100, limit=3000)
0.00s Checking search template ...
0.00s Setting up search space for 5 objects ...
0.21s Constraining search space with 9 relations ...
0.53s 2 edges thinned
0.53s Setting up retrieval plan with strategy small_choice_multi ...
0.57s Ready to deliver results from 1209412 nodes
Iterate over S.fetch() to get the results
See S.showPlan() to interpret the results
Search with 5 objects and 8 relations
Results are instantiations of the following objects:
node 0-phrase 253203 choices
node 1-word 253203 choices
node 2-word 253203 choices
node 3-word 426590 choices
node 4-verse 23213 choices
Performance parameters:
yarnRatio = 1.25
tryLimitFrom = 40
tryLimitTo = 40
Instantiations are computed along the following relations:
node 4-verse 23213 choices
edge 4-verse [[ 1-word 12.1 choices
edge 1-word ]] 0-phrase 1.0 choices
edge 0-phrase =: 1-word 0 choices
edge 0-phrase := 2-word 1.0 choices (thinned)
edge 0-phrase [[ 2-word 0 choices
edge 4-verse [[ 3-word 18.9 choices
edge 1,2-word <,> 3-word 0 choices
edge 3-word || 0-phrase 0 choices
0.58s The results are connected to the original search template as follows:
0
1 R0 p:phrase
2 R1 =: wFirst:word
3 R2 wLast:word
4 :=
5
6 R3 wGap:word
7 wFirst < wGap
8 wLast > wGap
9
10 p || wGap
11
12 R4 v:verse
13
14 v [[ wFirst
15 v [[ wGap
16
0.00s Counting results per 100 up to 3000 ...
| 0.08s 100
| 0.17s 200
| 0.23s 300
| 0.26s 400
| 0.29s 500
| 0.35s 600
| 0.37s 700
| 0.44s 800
| 0.46s 900
| 0.50s 1000
| 0.54s 1100
| 0.57s 1200
| 0.61s 1300
| 0.70s 1400
| 0.90s 1500
| 0.95s 1600
| 1.06s 1700
| 1.11s 1800
| 1.17s 1900
| 1.26s 2000
| 1.32s 2100
| 1.37s 2200
| 1.47s 2300
| 1.80s 2400
| 1.89s 2500
| 1.98s 2600
| 2.09s 2700
2.15s Done: 2739 results
# ignore this
# S.tweakPerformance(yarnRatio=1)
We are going to run this query in shallow mode.
results = A.search(query, shallow=True)
4.37s 672 results
Shallow mode tends to be quicker, but that does not always materialize. The number of results agrees with the first query. Yet we have been lucky, because we required the word in the gap to be in the same verse as the first word in the phrase. What if we require it to be in the same verse as the last word in the phrase?
query = """
p:phrase
  =: wFirst:word
  wLast:word
  :=
wGap:word
wFirst < wGap
wLast > wGap
p || wGap
v:verse
v [[ wLast
v [[ wGap
"""
results = A.search(query, shallow=True)
4.39s 661 results
Then we would not have found all results.
So, this road, although doable, is much less comfortable, performance-wise and logic-wise.
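Why the choice of anchor word matters can be shown on toy data. The sketch below is plain Python with hypothetical verse boundaries and slot numbers. The helper names `verse_of` and `found_via` are ours and do not exist in Text-Fabric.

```python
def verse_of(slot, verses):
    """Return the name of the verse containing slot, or None."""
    for name, (first, last) in verses.items():
        if first <= slot <= last:
            return name
    return None


def found_via(anchor_slot, phrase_slots, verses):
    """True if some gap word shares a verse with the anchor word."""
    full_span = set(range(min(phrase_slots), max(phrase_slots) + 1))
    gap_slots = full_span - set(phrase_slots)
    anchor_verse = verse_of(anchor_slot, verses)
    return any(verse_of(g, verses) == anchor_verse for g in gap_slots)


verses = {"v1": (1, 10), "v2": (11, 20)}  # hypothetical verse boundaries
phrase = [8, 10, 11, 12]                  # gap at slot 9, phrase ends in v2

print(found_via(min(phrase), phrase, verses))  # True
print(found_via(max(phrase), phrase, verses))  # False
```

In this toy phrase the gap lies entirely in the verse of the first word, so anchoring on the last word misses it. A phrase with the mirrored layout would be missed when anchoring on the first word.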
In this misty landscape of gaps we need some corroboration that we found the right results.
1. Is every node in gapQueryResults a phrase?
2. Does every phrase in gapQueryResults have a gap?
3. Is every phrase with a gap contained in gapQueryResults?
We check all this by hand coding.
Here is a function that checks whether a phrase has a gap. If the distance between its end points is greater than the number of words it contains, it must have a gap.
def hasGap(p):
    words = L.d(p, otype="word")
    return words[-1] - words[0] + 1 > len(words)
Now we can perform the checks.
otypesGood = True
haveGaps = True
for p in gapQueryResults:
    otype = F.otype.v(p)
    if otype != "phrase":
        print(f"Non phrase detected: {p} is a {otype}")
        otypesGood = False
        break
    if not hasGap(p):
        print(f"Phrase without a gap: {p}")
        A.pretty(p)
        haveGaps = False
        break
print(f"{len(gapQueryResults)} nodes in query result")
if otypesGood:
    print("1. all nodes are phrases")
if haveGaps:
    print("2. all nodes have gaps")
inResults = True
for p in F.otype.s("phrase"):
    if hasGap(p):
        if p not in gapQueryResults:
            print(f"Gapped phrase outside query results: {p}")
            A.pretty(p)
            inResults = False
            break
if inResults:
    print("3. all gapped phrases are contained in the results")
672 nodes in query result
1. all nodes are phrases
2. all nodes have gaps
3. all gapped phrases are contained in the results
Note that by hand coding we can get the gapped phrases much more quickly and securely!
We have obtained a set with all gapped phrases, and we have paid a price for it, either in query time or in hand-coding effort.
It would be nice if we could kick-start our queries using this set as a given. And that is exactly what we are going to do now.
We make two custom sets and give them a name: gapphrase for gapped phrases and conphrase for non-gapped phrases (consecutive phrases).
customSets = dict(
    gapphrase=gapQueryResults,
    conphrase=set(F.otype.s("phrase")) - gapQueryResults,
)
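The two sets are meant to partition the phrases: every phrase lands in exactly one of them. Here is a small sketch of that idea on toy data in plain Python. The phrase numbers and slots are made up, and `has_gap` is a stand-alone variant of the check used above that works on slot lists instead of nodes.

```python
def has_gap(slots):
    """True if the slots do not form one consecutive run."""
    ordered = sorted(slots)
    return ordered[-1] - ordered[0] + 1 > len(ordered)


# Hypothetical phrases with their slots.
phrases = {501: [1, 2, 3], 502: [4, 5, 8], 503: [6, 7]}

gap_phrases = {p for p, slots in phrases.items() if has_gap(slots)}
con_phrases = set(phrases) - gap_phrases
custom_sets = dict(gapphrase=gap_phrases, conphrase=con_phrases)

print(custom_sets)  # {'gapphrase': {502}, 'conphrase': {501, 503}}
```

Because the second set is defined as a set difference, the two sets cannot overlap, and together they cover all phrases.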
Suppose we want all verbs that occur in a gapped phrase.
query = """
gapphrase
  word sp=verb
"""
Note that we have used the foreign name gapphrase in our search template, instead of phrase.
But we can still run search(), provided we tell it what we mean by gapphrase.
We do that by passing the sets parameter to search(), which should be a dictionary of sets.
Search will look up gapphrase in this dictionary, and will use its value, which should be a node set.
That way, it understands that the expression gapphrase stands for the nodes in the given node set.
Here we go:
results = A.search(query, sets=customSets)
0.20s 94 results
A.show(results, start=1, end=3, condenseType="clause")
result 1
result 2
result 3
That looks good.
We can also apply feature conditions to gapphrase:
query = """
gapphrase function=Subj
"""
results = A.search(query, sets=customSets)
A.table(results, start=1, end=3)
0.00s 177 results
n | p | phrase |
---|---|---|
1 | Genesis 2:25 | שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו |
2 | Genesis 4:4 | הֶ֨בֶל גַם־ה֛וּא |
3 | Genesis 7:14 | הֵ֜מָּה וְכָל־הַֽחַיָּ֣ה לְמִינָ֗הּ וְכָל־הַבְּהֵמָה֙ לְמִינָ֔הּ וְכָל־הָרֶ֛מֶשׂ לְמִינֵ֑הוּ וְכָל־הָעֹ֣וף לְמִינֵ֔הוּ כֹּ֖ל צִפֹּ֥ור כָּל־כָּנָֽף׃ |
A.show(results, start=1, end=3, condenseType="clause")
result 1
result 2
result 3
We reduce the details by setting baseTypes to phrase.
The highlighted phrases will now get a yellow background.
A.show(results, start=3, end=3, baseTypes="phrase")
result 3
Now we set baseTypes to phrase_atom instead.
The highlighted phrases will not get a yellow background now.
A.show(results, start=3, end=3, baseTypes={"phrase_atom"})
result 3
We can find the gaps, but do our minds always reckon with gaps? Gaps cause unexpected semantics. Here is a little puzzle.
Suppose we want to count the clauses consisting of exactly two phrases.
Here follows a little journey. We use a query to find the clauses, check the result with hand-coding, scratch our heads, refine the query, the hand-coding and our question until we are satisfied.
The following template should do it: a clause, starting with a phrase, followed by an adjacent phrase, which terminates the clause.
query = """
clause
  =: phrase
  <: phrase
  :=
"""
# ignore this
# S.tweakPerformance(yarnRatio=1.2)
S.study(query)
0.00s Checking search template ...
0.00s Setting up search space for 3 objects ...
0.08s Constraining search space with 5 relations ...
0.27s 2 edges thinned
0.27s Setting up retrieval plan with strategy small_choice_multi ...
0.29s Ready to deliver results from 264393 nodes
Iterate over S.fetch() to get the results
See S.showPlan() to interpret the results
S.showPlan(details=True)
Search with 3 objects and 5 relations
Results are instantiations of the following objects:
node 0-clause 88131 choices
node 1-phrase 88131 choices
node 2-phrase 88131 choices
Performance parameters:
yarnRatio = 1.25
tryLimitFrom = 40
tryLimitTo = 40
Instantiations are computed along the following relations:
node 0-clause 88131 choices
edge 0-clause [[ 2-phrase 1.0 choices
edge 2-phrase := 0-clause 0 choices
edge 2-phrase :> 1-phrase 0.3 choices
edge 1-phrase ]] 0-clause 0 choices
edge 0-clause =: 1-phrase 0 choices
1.53s The results are connected to the original search template as follows:
0
1 R0 clause
2 R1 =: phrase
3 R2 <: phrase
4 :=
5
results = A.search(query)
A.table(results, end=7)
0.53s 23486 results
n | p | clause | phrase | phrase |
---|---|---|---|---|
1 | Genesis 1:3 | יְהִ֣י אֹ֑ור | יְהִ֣י | אֹ֑ור |
2 | Genesis 1:4 | כִּי־טֹ֑וב | כִּי־ | טֹ֑וב |
3 | Genesis 1:7 | אֲשֶׁר֙ מִתַּ֣חַת לָרָקִ֔יעַ | אֲשֶׁר֙ | מִתַּ֣חַת לָרָקִ֔יעַ |
4 | Genesis 1:7 | אֲשֶׁ֖ר מֵעַ֣ל לָרָקִ֑יעַ | אֲשֶׁ֖ר | מֵעַ֣ל לָרָקִ֑יעַ |
5 | Genesis 1:10 | כִּי־טֹֽוב׃ | כִּי־ | טֹֽוב׃ |
6 | Genesis 1:11 | מַזְרִ֣יעַ זֶ֔רַע | מַזְרִ֣יעַ | זֶ֔רַע |
7 | Genesis 1:12 | כִּי־טֹֽוב׃ | כִּי־ | טֹֽוב׃ |
If we want to have the clauses only, we run it in shallow mode:
clausesByQuery = sorted(A.search(query, shallow=True))
0.43s 23486 results
Note result 3 above: it seems we have 3 phrases. Yet there are only 2. We take a closer look:
focus = results[2][0]
A.pretty(focus)
One phrase is chunked into two phrase atoms, which are hidden by default. Let's make that more clear:
A.pretty(focus, hideTypes=False)
Let us check this with a piece of hand-written code. We want clauses that consist of exactly two phrases.
A.indent(reset=True)
A.info("counting ...")
clausesByHand = []
for clause in F.otype.s("clause"):
    phrases = L.d(clause, otype="phrase")
    if len(phrases) == 2:
        clausesByHand.append(clause)
clausesByHand = sorted(clausesByHand)
A.info(f"Done: found {len(clausesByHand)}")
0.00s counting ... 0.21s Done: found 23864
Strange, we end up with more cases. What is happening? Let us compare the results. We look at the first result where both methods diverge.
We put the difference finding in a little function.
def showDiff(queryResults, handResults):
    diff = [x for x in zip(queryResults, handResults) if x[0] != x[1]]
    if not diff:
        print(
            f"""
{len(queryResults):>6} queryResults
are identical with
{len(handResults):>6} handResults
"""
        )
        return
    (rQuery, rHand) = diff[0]
    if rQuery < rHand:
        print(f"clause {rQuery} is a query result but not found by hand")
        toShow = rQuery
    else:
        print(f"clause {rHand} is not a query result but has been found by hand")
        toShow = rHand
    colors = ["aqua", "aquamarine", "khaki", "lavender", "yellow"]
    highlights = {}
    for (i, phrase) in enumerate(L.d(toShow, otype="phrase")):
        highlights[phrase] = colors[i % len(colors)]
        for atom in L.d(phrase, otype="phrase_atom"):
            highlights[atom] = colors[i % len(colors)]
    A.pretty(
        toShow,
        hideTypes=False,
        withNodes=True,
        suppress={"lex", "sp", "vt", "vs"},
        highlights=highlights,
        baseTypes="phrase_atom",
    )
showDiff(clausesByQuery, clausesByHand)
clause 427937 is not a query result but has been found by hand
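showDiff walks through the two sorted lists position by position and reports the first divergence. As a complementary check, a set-based comparison lists everything that occurs on one side only, regardless of position or list length. The sketch below is plain Python with made-up node numbers. The name `set_diff` is ours and not part of Text-Fabric.

```python
def set_diff(query_results, hand_results):
    """Return (only in query results, only in hand results), both sorted."""
    query_set, hand_set = set(query_results), set(hand_results)
    return sorted(query_set - hand_set), sorted(hand_set - query_set)


# Hypothetical clause nodes found by each method.
only_query, only_hand = set_diff([1, 2, 4, 7], [1, 2, 3, 4])
print(only_query)  # [7]
print(only_hand)   # [3]
```

Two result lists are the same exactly when both returned lists are empty.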
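Lo and behold: the query requires the second phrase to be adjacent to the first one, but the hand-written code does not. We add that check: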
A.indent(reset=True)
A.info("counting ...")
clausesByHand2 = []
for clause in F.otype.s("clause"):
    phrases = L.d(clause, otype="phrase")
    if len(phrases) == 2:
        if L.d(phrases[0], otype="word")[-1] + 1 == L.d(phrases[1], otype="word")[0]:
            clausesByHand2.append(clause)
clausesByHand2 = sorted(clausesByHand2)
A.info(f"Done: found {len(clausesByHand2)}")
0.00s counting ... 0.24s Done: found 23403
Now we have fewer cases. What is going on?
showDiff(clausesByQuery, clausesByHand2)
clause 428698 is a query result but not found by hand
Observe:
This clause has three phrases, but the third one lies inside the second one.
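This configuration is easy to reproduce on toy data. The sketch below is plain Python, with hypothetical slots and phrase labels, and it spells out the three conditions of the template by hand. The function `matches_template` is an illustration, not a Text-Fabric call.

```python
# A hypothetical clause of six words with three phrases.
# Phrase B has a gap (slots 4 and 5) and phrase C sits inside that gap.
clause = [1, 2, 3, 4, 5, 6]
phrases = {"A": [1, 2], "B": [3, 6], "C": [4, 5]}


def matches_template(clause, first, second):
    """Check the three conditions of the clause template on slot lists."""
    return (
        first[0] == clause[0]           # =: the first phrase starts the clause
        and first[-1] + 1 == second[0]  # <: the second phrase follows immediately
        and second[-1] == clause[-1]    # := the second phrase ends the clause
    )


print(matches_template(clause, phrases["A"], phrases["B"]))  # True
print(len(phrases))                                          # 3
```

The template is satisfied by phrases A and B, although the clause contains a third phrase. The template only states what must be present, not what must be absent.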
Can we adjust the pattern to exclude cases like this? Yes, with custom sets, see advanced.
Instead of looking through all phrases, we can consider non-gapped phrases only.
Earlier in this notebook we have constructed the set of non-gapped phrases and put it under the name conphrase in the custom sets.
query = """
clause
  =: conphrase
  <: conphrase
  :=
"""
clausesByQuery2 = sorted(A.search(query, sets=customSets, shallow=True))
0.43s 23330 results
There is still a difference.
showDiff(clausesByQuery2, clausesByHand2)
clause 428380 is not a query result but has been found by hand
Observe:
This clause has two phrases, the second one has a gap, which coincides with a gap in the clause.
A.indent(reset=True)
A.info("counting ...")
clausesByHand3 = []
for clause in F.otype.s("clause"):
    if hasGap(clause):
        continue
    phrases = L.d(clause, otype="phrase")
    if len(phrases) == 2:
        if L.d(phrases[0], otype="word")[-1] + 1 == L.d(phrases[1], otype="word")[0]:
            clausesByHand3.append(clause)
clausesByHand3 = sorted(clausesByHand3)
A.info(f"Done: found {len(clausesByHand3)}")
0.00s counting ... 0.33s Done: found 23330
Now the numbers of results agree. But are they really the same?
showDiff(clausesByQuery2, clausesByHand3)
23330 queryResults are identical with 23330 handResults
It took four attempts to arrive at the final concept of things that we were looking for.
Sometimes the search template had to be modified, sometimes the hand-written code.
The interplay and systematic comparison between the attempts helped to spot all relevant configurations of phrases within clauses.
Here is another cause of wrong query results: there are sentences that span multiple verses. Such sentences are not contained in any verse. As a result, they are easily missed in queries.
We describe a scenario where that happens.
A clause and its mother do not have to be in the same verse. We are going to fetch the cases where they are in different verses.
But first we fetch all pairs of clauses connected by a mother edge.
query = """
clause
-mother> clause
"""
allMotherPairs = A.search(query)
A.table(allMotherPairs, end=7)
0.08s 13917 results
Now we modify the query to the effect that mother and daughter must sit in distinct verses.
query = """
cm:clause
-mother> cd:clause
v1:verse
v2:verse
v1 # v2
cm ]] v1
cd ]] v2
"""
diffMotherPairs = A.search(query)
A.table(diffMotherPairs, end=7, skipCols="3 4", withPassage="1 2")
0.12s 710 results
n | clause | clause | verse | verse |
---|---|---|---|---|
1 | Genesis 1:18 וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה | Genesis 1:17 לְהָאִ֖יר עַל־הָאָֽרֶץ׃ | ||
2 | Genesis 2:7 וַיִּיצֶר֩ יְהוָ֨ה אֱלֹהִ֜ים אֶת־הָֽאָדָ֗ם עָפָר֙ מִן־הָ֣אֲדָמָ֔ה | Genesis 2:4 בְּיֹ֗ום | ||
3 | Genesis 7:3 לְחַיֹּ֥ות זֶ֖רַע עַל־פְּנֵ֥י כָל־הָאָֽרֶץ׃ | Genesis 7:2 מִכֹּ֣ל׀ הַבְּהֵמָ֣ה הַטְּהֹורָ֗ה תִּֽקַּח־לְךָ֛ שִׁבְעָ֥ה שִׁבְעָ֖ה אִ֣ישׁ וְאִשְׁתֹּ֑ו | ||
4 | Genesis 22:17 כִּֽי־בָרֵ֣ךְ אֲבָרֶכְךָ֗ | Genesis 22:16 כִּ֗י | ||
5 | Genesis 24:44 הִ֣וא הָֽאִשָּׁ֔ה | Genesis 24:43 הָֽעַלְמָה֙ | ||
6 | Genesis 27:45 עַד־שׁ֨וּב אַף־אָחִ֜יךָ מִמְּךָ֗ | Genesis 27:44 עַ֥ד אֲשֶׁר־תָּשׁ֖וּב חֲמַ֥ת אָחִֽיךָ׃ | ||
7 | Genesis 36:16 אַלּֽוּף־קֹ֛רַח אַלּ֥וּף גַּעְתָּ֖ם אַלּ֣וּף עֲמָלֵ֑ק | Genesis 36:15 בְּנֵ֤י אֱלִיפַז֙ בְּכֹ֣ור עֵשָׂ֔ו אַלּ֤וּף תֵּימָן֙ אַלּ֣וּף אֹומָ֔ר אַלּ֥וּף צְפֹ֖ו אַלּ֥וּף קְנַֽז׃ |
As a check, we modify the latter query and require v1 and v2 to be the same verse, to get the mother pairs of which both members are in the same verse.
query = """
cm:clause
-mother> cd:clause
v1:verse
v2:verse
v1 = v2
cm ]] v1
cd ]] v2
"""
sameMotherPairs = A.search(query)
A.table(sameMotherPairs, end=7, skipCols="3 4", withPassage="1 2")
0.14s 13181 results
n | clause | clause | verse | verse |
---|---|---|---|---|
1 | Genesis 1:4 כִּי־טֹ֑וב | Genesis 1:4 וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור | ||
2 | Genesis 1:10 כִּי־טֹֽוב׃ | Genesis 1:10 וַיַּ֥רְא אֱלֹהִ֖ים | ||
3 | Genesis 1:12 כִּי־טֹֽוב׃ | Genesis 1:12 וַיַּ֥רְא אֱלֹהִ֖ים | ||
4 | Genesis 1:14 לְהַבְדִּ֕יל בֵּ֥ין הַיֹּ֖ום וּבֵ֣ין הַלָּ֑יְלָה | Genesis 1:14 יְהִ֤י מְאֹרֹת֙ בִּרְקִ֣יעַ הַשָּׁמַ֔יִם | ||
5 | Genesis 1:15 לְהָאִ֖יר עַל־הָאָ֑רֶץ | Genesis 1:15 וְהָי֤וּ לִמְאֹורֹת֙ בִּרְקִ֣יעַ הַשָּׁמַ֔יִם | ||
6 | Genesis 1:17 לְהָאִ֖יר עַל־הָאָֽרֶץ׃ | Genesis 1:17 וַיִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּרְקִ֣יעַ הַשָּׁמָ֑יִם | ||
7 | Genesis 1:18 וּֽלֲהַבְדִּ֔יל בֵּ֥ין הָאֹ֖ור וּבֵ֣ין הַחֹ֑שֶׁךְ | Genesis 1:18 וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה |
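Let's check if the numbers add up. The first query fetched all mother pairs, the second one the pairs in distinct verses, and the third one the pairs in the same verse.
Then the results of the second and third query combined should equal the results of the first query.
That makes sense.
Still, let's check: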
discrepancy = len(allMotherPairs) - len(diffMotherPairs) - len(sameMotherPairs)
print(discrepancy)
26
The numbers do not add up. We are missing cases. Why?
Clauses may cross verse boundaries. In that case they are not part of a verse, and hence our latter two queries do not detect them. Let's count how many verse boundary crossing clauses there are.
query = """
clause
/with/
v1:verse
&& ..
v2:verse
&& ..
v1 < v2
/-/
"""
results = A.search(query)
0.59s 50 results
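The difference between being embedded in a verse and merely overlapping with verses can be shown on toy data. The sketch below is plain Python. The verse boundaries, slot numbers and function names are assumptions for illustration and not part of Text-Fabric.

```python
def embedding_verses(node_slots, verses):
    """Names of the verses that contain every slot of the node."""
    return [name for name, slots in verses.items() if set(node_slots) <= slots]


def overlapping_verses(node_slots, verses):
    """Names of the verses that share at least one slot with the node."""
    return [name for name, slots in verses.items() if set(node_slots) & slots]


# Two hypothetical verses of ten words each.
verses = {"v1": set(range(1, 11)), "v2": set(range(11, 21))}
inside = [3, 4, 5]         # clause within one verse
crossing = [9, 10, 11, 12]  # clause across the verse boundary

print(embedding_verses(inside, verses))      # ['v1']
print(embedding_verses(crossing, verses))    # []
print(overlapping_verses(crossing, verses))  # ['v1', 'v2']
```

A query that asks for a verse embedding the crossing clause finds nothing, while a query based on overlap finds both verses.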
You might think we can speed up the query by requiring v1 <: v2 (both verses are adjacent).
There are fewer possibilities to consider, so maybe we gain something.
query = """
clause
/with/
v1:verse
&& ..
v2:verse
&& ..
v1 <: v2
/-/
"""
results = A.search(query)
0.56s 49 results
Indeed, slightly faster, but one result less! How can that be?
There must be a clause that spans at least two verses and in doing so, skips at least one verse.
Let's find that one:
query = """
clause
/with/
v1:verse
&& ..
v2:verse
|| ..
v3:verse
&& ..
v1 < v2
v2 < v3
v1 < v3
/-/
"""
resultsX = A.search(query)
0.91s 1 result
A.table(resultsX)
A.show(resultsX, baseTypes="clause_atom")
n | p | clause |
---|---|---|
1 | 1_Kings 8:41 | וְגַם֙ אֶל־הַנָּכְרִ֔י אַתָּ֞ה תִּשְׁמַ֤ע הַשָּׁמַ֨יִם֙ מְכֹ֣ון שִׁבְתֶּ֔ךָ |
result 1
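The effect of demanding adjacent verses can be mimicked with a toy clause that touches the first and the third of three verses but skips the middle one. The sketch is plain Python with hypothetical boundaries, and the helper `overlapped` is our own name.

```python
def overlapped(node_slots, verses):
    """Indices of the verses that share at least one slot with the node."""
    return [
        i
        for i, (first, last) in enumerate(verses)
        if any(first <= s <= last for s in node_slots)
    ]


verses = [(1, 10), (11, 20), (21, 30)]  # three consecutive hypothetical verses
clause = [9, 10, 21, 22]                # skips the middle verse entirely

hit = overlapped(clause, verses)
has_ordered_pair = len(hit) >= 2
has_adjacent_pair = any(b - a == 1 for a, b in zip(hit, hit[1:]))

print(hit)                # [0, 2]
print(has_ordered_pair)   # True
print(has_adjacent_pair)  # False
```

Such a clause satisfies the condition with v1 < v2 but not the one with v1 <: v2, which accounts for a result that disappears under the stricter condition.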
A more roundabout way to find the same clauses:
query = """
clause
  =: first:word
  last:word
  :=
v1:verse
  w1:word
v2:verse
  w2:word
first = w1
last = w2
v1 # v2
"""
results = A.search(query)
1.04s 50 results
Some of these verse spanning clauses do not have mothers or are not mothers. Let's count the cases where two clauses are in a mother relation and at least one of them spans a verse.
We need two queries for that. These queries are almost identical. One retrieves the clause pairs where the mother crosses verse boundaries, and the other where the daughter does so.
But we are programmers. We do not have to repeat ourselves:
queryCommon = """
c1:clause
-mother> c2:clause
c3:clause
/with/
v1:verse
&& ..
v2:verse
&& ..
v1 < v2
/-/
"""
query1 = f"""
{queryCommon}
c1 = c3
"""
query2 = f"""
{queryCommon}
c2 = c3
"""
results1 = A.search(query1, silent=True)
results2 = A.search(query2, silent=True)
spannersByQuery = {(r[0], r[1]) for r in results1 + results2}
print(f"{len(spannersByQuery):>3} spanners are missing")
print(f"{discrepancy:>3} missing cases were detected before")
print(f"{discrepancy - len(spannersByQuery):>3} is the resulting disagreement")
 26 spanners are missing
 26 missing cases were detected before
  0 is the resulting disagreement
We may find the mother clause pairs in which at least one member is verse spanning by hand-coding in an easier way:
Starting with the set of all mother pairs, we keep every pair in which at least one clause is not contained in any verse.
spannersByHand = set()
for (c1, c2) in allMotherPairs:
    if not (L.u(c1, otype="verse") and L.u(c2, otype="verse")):
        spannersByHand.add((c1, c2))
len(spannersByHand)
26
And, to be completely sure:
spannersByHand == spannersByQuery
True
If we are content with the clauses that do not span verses, we can put them in a set, and modify the queries by replacing clause by conclause and binding the right set to it.
Here we go. In one cell we run the queries to get all pairs, the mother-daughter-in-separate-verses pairs, and the mother-daughter-in-same-verse pairs, and we do the math of checking.
conClauses = {c for c in F.otype.s("clause") if L.u(c, otype="verse")}
customSets = dict(conclause=conClauses)
print("All pairs")
allPairs = A.search(
    """
conclause
-mother> conclause
""",
    sets=customSets,
)
print("Different verse pairs")
diffPairs = A.search(
    """
cm:conclause
-mother> cd:conclause
v1:verse
v2:verse
v1 # v2
cm ]] v1
cd ]] v2
""",
    sets=customSets,
)
print("Same verse pairs")
samePairs = A.search(
    """
cm:conclause
-mother> cd:conclause
v1:verse
v2:verse
v1 = v2
cm ]] v1
cd ]] v2
""",
    sets=customSets,
)
allPairSet = set(allPairs)
diffPairSet = {(r[0], r[1]) for r in diffPairs}
samePairSet = {(r[0], r[1]) for r in samePairs}
print(f"Intersection same-verse/different-verse pairs: {samePairSet & diffPairSet}")
print(
    f"All pairs is union of same-verse/different-verse pairs: {allPairSet == (samePairSet | diffPairSet)}"
)
All pairs
0.07s 13891 results
Different verse pairs
0.09s 710 results
Same verse pairs
0.11s 13181 results
Intersection same-verse/different-verse pairs: set()
All pairs is union of same-verse/different-verse pairs: True
You have now finished the search tutorial.
Share the work!
CC-BY Dirk Roorda