Notebook

You might want to consider the start of this tutorial.

Short introductions to other TF datasets:

or the Quran

In [1]:

%load_ext autoreload
%autoreload 2

In [2]:

from tf.app import use

In [3]:

A = use("ETCBC/bhsa", hoist=globals())

Locating corpus resources ...

app: ~/text-fabric-data/github/ETCBC/bhsa/app

data: ~/text-fabric-data/github/ETCBC/bhsa/tf/2021

data: ~/text-fabric-data/github/ETCBC/phono/tf/2021

data: ~/text-fabric-data/github/ETCBC/parallels/tf/2021

TF: TF API 12.1.2, ETCBC/bhsa/app v3, Search Reference
Data: ETCBC - bhsa 2021, Character table, Feature docs

Node types

Name	# of nodes	# slots / node	% coverage
book	39	10938.21	100
chapter	929	459.19	100
lex	9230	46.22	100
verse	23213	18.38	100
half_verse	45179	9.44	100
sentence	63717	6.70	100
sentence_atom	64514	6.61	100
clause	88131	4.84	100
clause_atom	90704	4.70	100
phrase	253203	1.68	100
phrase_atom	267532	1.59	100
subphrase	113850	1.42	38
word	426590	1.00	100

Sets: no custom sets
Features:

Parallel Passages

crossref

int

🆗 links between similar passages

BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis

book

str

✅ book name in Latin (Genesis; Numeri; Reges1; ...)

book@ll

str

✅ book name in amharic (ኣማርኛ)

chapter

int

✅ chapter number (1; 2; 3; ...)

code

int

✅ identifier of a clause atom relationship (0; 74; 367; ...)

det

str

✅ determinedness of phrase(atom) (det; und; NA.)

domain

str

✅ text type of clause (? (Unknown); N (narrative); D (discursive); Q (Quotation).)

freq_lex

int

✅ frequency of lexemes

function

str

✅ syntactic function of phrase (Cmpl; Objc; Pred; ...)

g_cons

str

✅ word consonantal-transliterated (B R>CJT BR> >LHJM ...)

g_cons_utf8

str

✅ word consonantal-Hebrew (ב ראשׁית ברא אלהים)

g_lex

str

✅ lexeme pointed-transliterated (B.:- R;>CIJT B.@R@> >:ELOH ...)

g_lex_utf8

str

✅ lexeme pointed-Hebrew (בְּ רֵאשִׁית בָּרָא אֱלֹה)

g_word

str

✅ word pointed-transliterated (B.:- R;>CI73JT B.@R@74> >:ELOHI92JM)

g_word_utf8

str

✅ word pointed-Hebrew (בְּ רֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים)

gloss

str

🆗 english translation of lexeme (beginning create god(s))

str

✅ grammatical gender (m; f; NA; unknown.)

label

str

✅ (half-)verse label (half verses: A; B; C; verses: GEN 01,02)

language

str

✅ of word or lexeme (Hebrew; Aramaic.)

lex

str

✅ lexeme consonantal-transliterated (B R>CJT/ BR>[ >LHJM/)

lex_utf8

str

✅ lexeme consonantal-Hebrew (ב ראשׁית֜ ברא אלהים֜)

str

✅ lexical set, subclassification of part-of-speech (card; ques; mult)

nametype

str

⚠️ named entity type (pers; mens; gens; topo; ppde.)

nme

str

✅ nominal ending consonantal-transliterated (absent; n/a; JM, ...)

str

✅ grammatical number (sg; du; pl; NA; unknown.)

number

int

✅ sequence number of an object within its context

otype

str

pargr

str

🆗 hierarchical paragraph number (1; 1.2; 1.2.3.4; ...)

pdp

str

✅ phrase dependent part-of-speech (art; verb; subs; nmpr, ...)

pfm

str

✅ preformative consonantal-transliterated (absent; n/a; J, ...)

prs

str

✅ pronominal suffix consonantal-transliterated (absent; n/a; W; ...)

prs_gn

str

✅ pronominal suffix gender (m; f; NA; unknown.)

prs_nu

str

✅ pronominal suffix number (sg; du; pl; NA; unknown.)

prs_ps

str

✅ pronominal suffix person (p1; p2; p3; NA; unknown.)

str

✅ grammatical person (p1; p2; p3; NA; unknown.)

qere

str

✅ word pointed-transliterated masoretic reading correction

qere_trailer

str

✅ interword material -pointed-transliterated (Masoretic correction)

qere_trailer_utf8

str

✅ interword material -pointed-transliterated (Masoretic correction)

qere_utf8

str

✅ word pointed-Hebrew masoretic reading correction

rank_lex

int

✅ ranking of lexemes based on frequency

rela

str

✅ linguistic relation between clause/(sub)phrase(atom) (ADJ; MOD; ATR; ...)

str

✅ part-of-speech (art; verb; subs; nmpr, ...)

str

✅ state of a noun (a (absolute); c (construct); e (emphatic).)

tab

int

✅ clause atom: its level in the linguistic embedding

trailer

str

✅ interword material pointed-transliterated (& 00 05 00_P ...)

trailer_utf8

str

✅ interword material pointed-Hebrew (־ ׃)

txt

str

✅ text type of clause and surrounding (repetition of ? N D Q as in feature domain)

typ

str

✅ clause/phrase(atom) type (VP; NP; Ellp; Ptcp; WayX)

uvf

str

✅ univalent final consonant consonantal-transliterated (absent; N; J; ...)

vbe

str

✅ verbal ending consonantal-transliterated (n/a; W; ...)

vbs

str

✅ root formation consonantal-transliterated (absent; n/a; H; ...)

verse

int

✅ verse number

voc_lex

str

✅ vocalized lexeme pointed-transliterated (B.: R;>CIJT BR> >:ELOHIJM)

voc_lex_utf8

str

✅ vocalized lexeme pointed-Hebrew (בְּ רֵאשִׁית ברא אֱלֹהִים)

str

✅ verbal stem (qal; piel; hif; apel; pael)

str

✅ verbal tense (perf; impv; wayq; infc)

mother

none

✅ linguistic dependency between textual objects

oslots

none

Phonetic Transcriptions

phono

str

🆗 phonological transcription (bᵊ rēšˌîṯ bārˈā ʔᵉlōhˈîm)

phono_trailer

str

🆗 interword material in phonological transcription

Settings:

specified

apiVersion: 3
appName: ETCBC/bhsa
appPath: /Users/me/text-fabric-data/github/ETCBC/bhsa/app
commit: gb112c161cfd21eae403d51a2733740d8743460e7
css: ''
dataDisplay:
- exampleSectionHtml:
  <code>Genesis 1:1</code> (use <a href="https://github.com/{org}/{repo}/blob/master/tf/{version}/book%40en.tf" target="_blank">English book names</a>)
- excludedFeatures:
  - g_uvf_utf8
  - g_vbs
  - kq_hybrid
  - languageISO
  - g_nme
  - lex0
  - is_root
  - g_vbs_utf8
  - g_uvf
  - dist
  - root
  - suffix_person
  - g_vbe
  - dist_unit
  - suffix_number
  - distributional_parent
  - kq_hybrid_utf8
  - crossrefSET
  - instruction
  - g_prs
  - lexeme_count
  - rank_occ
  - g_pfm_utf8
  - freq_occ
  - crossrefLCS
  - functional_parent
  - g_pfm
  - g_nme_utf8
  - g_vbe_utf8
  - kind
  - g_prs_utf8
  - suffix_gender
  - mother_object_type
- noneValues:
  - absent
  - n/a
  - none
  - unknown
  - no value
  - NA
docs:
- docBase: {docRoot}/{repo}
- docExt: ''
- docPage: ''
- docRoot: https://{org}.github.io
- featurePage: 0_home
interfaceDefaults: {}
isCompatible: True
local: local
localDir: /Users/me/text-fabric-data/github/ETCBC/bhsa/_temp
provenanceSpec:
- corpus: BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis
- doi: 10.5281/zenodo.1007624
- extraData: ner
- moduleSpecs:
  - :
    backend: no value
    corpus: Phonetic Transcriptions
    docUrl:
    https://nbviewer.jupyter.org/github/etcbc/phono/blob/master/programs/phono.ipynb
    doi: 10.5281/zenodo.1007636
    org: ETCBC
    relative: /tf
    repo: phono
  - :
    backend: no value
    corpus: Parallel Passages
    docUrl:
    https://nbviewer.jupyter.org/github/ETCBC/parallels/blob/master/programs/parallels.ipynb
    doi: 10.5281/zenodo.1007642
    org: ETCBC
    relative: /tf
    repo: parallels
- org: ETCBC
- relative: /tf
- repo: bhsa
- version: 2021
- webBase: https://shebanq.ancient-data.org/hebrew
- webHint: Show this on SHEBANQ
- webLang: la
- webLexId: True
- webUrl:
  {webBase}/text?book=<1>&chapter=<2>&verse=<3>&version={version}&mr=m&qw=q&tp=txt_p&tr=hb&wget=v&qget=v&nget=vt
- webUrlLex: {webBase}/word?version={version}&id=<lid>
release: v1.8.1
typeDisplay:
- clause:
  - label: {typ} {rela}
  - style: ''
- clause_atom:
  - hidden: True
  - label: {code}
  - level: 1
  - style: ''
- half_verse:
  - hidden: True
  - label: {label}
  - style: ''
  - verselike: True
- lex:
  - featuresBare: gloss
  - label: {voc_lex_utf8}
  - lexOcc: word
  - style: orig
  - template: {voc_lex_utf8}
- phrase:
  - label: {typ} {function}
  - style: ''
- phrase_atom:
  - hidden: True
  - label: {typ} {rela}
  - level: 1
  - style: ''
- sentence:
  - label: {number}
  - style: ''
- sentence_atom:
  - hidden: True
  - label: {number}
  - level: 1
  - style: ''
- subphrase:
  - hidden: True
  - label: {number}
  - style: ''
- word:
  - features: pdp vs vt
  - featuresBare: lex:gloss
writing: hbo

TF API: names N F E L T S C TF Fs Fall Es Eall Cs Call directly usable

Gaps and spans

Searches often do not deliver the results you expect. Besides typos, lack of familiarity with the template formalism, and bugs in the system, there is another cause: the difficult semantics of the data.

Most users reason about phrases, clauses and sentences as if they were consecutive blocks of words. But in the BHSA this is not the case: each of these objects may have gaps.

Most of the time, verse boundaries coincide with the boundaries of sentences, clauses, and phrases. But not always: some sentences span verse boundaries.

Note

These phenomena may wreak havoc with your intuitive reasoning about what search templates should deliver. Query templates do not require the objects to be consecutive, and they still make sense. But that might not be the sense you expect, unless you Mind the gap!

We are going to show these issues in depth.

Gaps

TF-search has no primitives to deal with gaps directly. Nodes correspond to textual objects such as words, phrases, clauses, verses, books. Usually these are consecutive sequences of one or more words, but in theory they can be arbitrary sets of slots.

And, as far as the BHSA corpus is concerned, in practice too. If we look at phrases, the overwhelming majority are consecutive, without gaps. But there is also a substantial number of phrases with gaps.
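
To make the notion of a gap concrete, here is a minimal sketch in plain Python, independent of Text-Fabric. It assumes that a node is modelled as a set of integer slot numbers. The function name `gap_ranges` and the sample slots are made up for illustration; they are not part of the TF API or the BHSA data.

```python
def gap_ranges(slots):
    """Return the gaps of a node as (first, last) slot ranges.

    `slots` is any iterable of integer slot numbers (a hypothetical
    stand-in for the words of a phrase). A gap is a maximal run of
    slots that lies between the first and the last slot of the node
    but does not belong to the node.
    """
    ordered = sorted(set(slots))
    gaps = []
    # Compare each slot with its successor inside the node.
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt - prev > 1:
            gaps.append((prev + 1, nxt - 1))
    return gaps


# A consecutive node has no gaps.
consecutive = {10, 11, 12}
# A made-up node that skips slots 13-14 and slot 17.
gapped = {10, 11, 12, 15, 16, 18}

print(gap_ranges(consecutive))  # []
print(gap_ranges(gapped))       # [(13, 14), (17, 17)]
```

The point of the sketch is only that "having a gap" is a property of the slot set of a node, which is why it is easy to compute by hand but awkward to express in a search template.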

People who are familiar with MQL (see from MQL) may remember that in MQL you can search for a gap. The MQL query

SELECT ALL OBJECTS WHERE

[phrase FOCUS
    [word lex='L']
    [gap]
]

looks for a phrase with a gap in it (i.e. one or more consecutive words between the start and the end of the phrase that do not belong to the phrase). The query then asks additionally for those gap-containing phrases that have a certain word in front of the gap.

We want this too!

Find the gap

We start with a query that aims to get the same results as the MQL query above.

In our template, we require a word wPreGap in the phrase that comes just before the gap, and a word wGap that comes right after it. That word wGap is in the gap, so it does not belong to the phrase. All of this must happen before the last word wLast of the phrase.

In [4]:

query = """
verse
    p:phrase
      wPreGap:word lex=L
      wLast:word
      :=

wGap:word
wPreGap <: wGap
wGap < wLast
p || wGap
"""

In [5]:

results = A.search(query)

  0.46s 12 results

Nice and quick. Let's see the results.

In [6]:

A.table(results, skipCols="1")

n	p	phrase	word	word	word
1	Genesis 17:7	לְךָ֙ וּֽלְזַרְעֲךָ֖ אַחֲרֶֽיךָ׃	לְךָ֙	אַחֲרֶֽיךָ׃	לֵֽ
2	Genesis 28:4	לְךָ֙ לְךָ֖ וּלְזַרְעֲךָ֣ אִתָּ֑ךְ	לְךָ֙	אִתָּ֑ךְ	אֶת־
3	Exodus 30:21	לָהֶ֧ם לֹ֥ו וּלְזַרְעֹ֖ו	לָהֶ֧ם	זַרְעֹ֖ו	חָק־
4	Leviticus 25:6	לָכֶם֙ לְךָ֖ וּלְעַבְדְּךָ֣ וְלַאֲמָתֶ֑ךָ וְלִשְׂכִֽירְךָ֙ וּלְתֹושָׁ֣בְךָ֔	לָכֶם֙	תֹושָׁ֣בְךָ֔	לְ
5	Numbers 20:15	לָ֛נוּ וְלַאֲבֹתֵֽינוּ׃	לָ֛נוּ	אֲבֹתֵֽינוּ׃	מִצְרַ֖יִם
6	Numbers 32:33	לָהֶ֣ם׀ לִבְנֵי־גָד֩ וְלִבְנֵ֨י רְאוּבֵ֜ן וְלַחֲצִ֣י׀ שֵׁ֣בֶט׀ מְנַשֶּׁ֣ה בֶן־יֹוסֵ֗ף	לָהֶ֣ם׀	יֹוסֵ֗ף	מֹשֶׁ֡ה
7	Deuteronomy 1:36	לֹֽו־וּלְבָנָ֑יו	לֹֽו־	בָנָ֑יו	אֶתֵּ֧ן
8	Deuteronomy 26:11	לְךָ֛ וּלְבֵיתֶ֑ךָ	לְךָ֛	בֵיתֶ֑ךָ	יְהוָ֥ה
9	1_Samuel 25:31	לְךָ֡ לַאדֹנִ֗י	לְךָ֡	אדֹנִ֗י	לְ
10	2_Kings 25:24	לָהֶ֤ם וּלְאַנְשֵׁיהֶ֔ם	לָהֶ֤ם	אַנְשֵׁיהֶ֔ם	גְּדַלְיָ֨הוּ֙
11	Jeremiah 40:9	לָהֶ֜ם וּלְאַנְשֵׁיהֶ֣ם	לָהֶ֜ם	אַנְשֵׁיהֶ֣ם	גְּדַלְיָ֨הוּ
12	Daniel 9:8	לָ֚נוּ לִמְלָכֵ֥ינוּ לְשָׂרֵ֖ינוּ וְלַאֲבֹתֵ֑ינוּ	לָ֚נוּ	אֲבֹתֵ֑ינוּ	בֹּ֣שֶׁת

Let's color the word in the gap differently.

In [18]:

A.displaySetup(
    colorMap={2: "aqua", 3: "yellow", 4: "magenta"}, condenseType="clause",
    skipCols="1",
)

In [20]:

A.table(results, condensed=False)

n	p	phrase	word	word	word
1	Genesis 17:7	לְךָ֙ וּֽלְזַרְעֲךָ֖ אַחֲרֶֽיךָ׃	לְךָ֙	אַחֲרֶֽיךָ׃	לֵֽ
2	Genesis 28:4	לְךָ֙ לְךָ֖ וּלְזַרְעֲךָ֣ אִתָּ֑ךְ	לְךָ֙	אִתָּ֑ךְ	אֶת־
3	Exodus 30:21	לָהֶ֧ם לֹ֥ו וּלְזַרְעֹ֖ו	לָהֶ֧ם	זַרְעֹ֖ו	חָק־
4	Leviticus 25:6	לָכֶם֙ לְךָ֖ וּלְעַבְדְּךָ֣ וְלַאֲמָתֶ֑ךָ וְלִשְׂכִֽירְךָ֙ וּלְתֹושָׁ֣בְךָ֔	לָכֶם֙	תֹושָׁ֣בְךָ֔	לְ
5	Numbers 20:15	לָ֛נוּ וְלַאֲבֹתֵֽינוּ׃	לָ֛נוּ	אֲבֹתֵֽינוּ׃	מִצְרַ֖יִם
6	Numbers 32:33	לָהֶ֣ם׀ לִבְנֵי־גָד֩ וְלִבְנֵ֨י רְאוּבֵ֜ן וְלַחֲצִ֣י׀ שֵׁ֣בֶט׀ מְנַשֶּׁ֣ה בֶן־יֹוסֵ֗ף	לָהֶ֣ם׀	יֹוסֵ֗ף	מֹשֶׁ֡ה
7	Deuteronomy 1:36	לֹֽו־וּלְבָנָ֑יו	לֹֽו־	בָנָ֑יו	אֶתֵּ֧ן
8	Deuteronomy 26:11	לְךָ֛ וּלְבֵיתֶ֑ךָ	לְךָ֛	בֵיתֶ֑ךָ	יְהוָ֥ה
9	1_Samuel 25:31	לְךָ֡ לַאדֹנִ֗י	לְךָ֡	אדֹנִ֗י	לְ
10	2_Kings 25:24	לָהֶ֤ם וּלְאַנְשֵׁיהֶ֔ם	לָהֶ֤ם	אַנְשֵׁיהֶ֔ם	גְּדַלְיָ֨הוּ֙
11	Jeremiah 40:9	לָהֶ֜ם וּלְאַנְשֵׁיהֶ֣ם	לָהֶ֜ם	אַנְשֵׁיהֶ֣ם	גְּדַלְיָ֨הוּ
12	Daniel 9:8	לָ֚נוּ לִמְלָכֵ֥ינוּ לְשָׂרֵ֖ינוּ וְלַאֲבֹתֵ֑ינוּ	לָ֚נוּ	אֲבֹתֵ֑ינוּ	בֹּ֣שֶׁת

In [21]:

A.show(results, end=3, condensed=False)

result 1

Genesis 17:7

verse

Genesis 17:7

clause InfC Adju

phrase VP Pred

לִ

lex=L

הְיֹ֤ות

lex=HJH[

phrase PP Cmpl

לְךָ֙

lex=L

phrase PP PreC

לֵֽ

lex=L

אלֹהִ֔ים

lex=>LHJM/

phrase PP Cmpl

lex=W

lex=L

lex=ZR</

lex=>XR/

result 2

Genesis 28:4

verse

Genesis 28:4

clause WYq0 NA

phrase CP Conj

וְ

lex=W

phrase VP Pred

יִֽתֶּן־

lex=NTN[

phrase PP Cmpl

לְךָ֙

lex=L

phrase PP Objc

אֶת־

lex=>T

בִּרְכַּ֣ת

lex=BRKH/

אַבְרָהָ֔ם

lex=>BRHM/

phrase PP Cmpl

lex=L

lex=W

lex=L

lex=ZR</

lex=>T==

result 3

Exodus 30:21

verse

Exodus 30:21

clause WQtX NA

phrase CP Conj

וְ

lex=W

phrase VP Pred

הָיְתָ֨ה

lex=HJH[

phrase PP Cmpl

לָהֶ֧ם

lex=L

phrase NP Subj

חָק־

lex=XQ/

עֹולָ֛ם

lex=<WLM/

phrase PP Cmpl

lex=L

lex=W

lex=L

lex=ZR</

phrase PP Adju

לְ

lex=L

דֹרֹתָֽם׃ פ

lex=DWR/

In [22]:

A.displayReset()

All gapped phrases

These were particular gaps. Now we want to get all gapped phrases.

We can simply drop the requirement that the word wPreGap has to satisfy a particular lexical condition.

In [23]:

query = """
p:phrase
  wPreGap:word
  wLast:word
  :=

wGap:word
wPreGap <: wGap
wGap < wLast

p || wGap
"""

In [24]:

results = A.search(query)

  0.91s 716 results

Not too bad! We could wait for it. Here are some results.

In [25]:

A.table(results, start=5, end=10)

n	p	phrase	word	word	word
5	Genesis 2:25	שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו	שְׁנֵיהֶם֙	אִשְׁתֹּ֑ו	עֲרוּמִּ֔ים
6	Genesis 4:4	הֶ֨בֶל גַם־ה֛וּא	הֶ֨בֶל	ה֛וּא	הֵבִ֥יא
7	Genesis 7:8	מִן־הַבְּהֵמָה֙ הַטְּהֹורָ֔ה וּמִן־הַ֨בְּהֵמָ֔ה וּמִ֨ן־הָעֹ֔וף וְכֹ֥ל	בְּהֵמָ֔ה	כֹ֥ל	אֲשֶׁ֥ר
8	Genesis 7:14	הֵ֜מָּה וְכָל־הַֽחַיָּ֣ה לְמִינָ֗הּ וְכָל־הַבְּהֵמָה֙ לְמִינָ֔הּ וְכָל־הָרֶ֛מֶשׂ לְמִינֵ֑הוּ וְכָל־הָעֹ֣וף לְמִינֵ֔הוּ כֹּ֖ל צִפֹּ֥ור כָּל־כָּנָֽף׃	רֶ֛מֶשׂ	כָּנָֽף׃	הָ
9	Genesis 7:21	כָּל־בָּשָׂ֣ר׀ בָּעֹ֤וף וּבַבְּהֵמָה֙ וּבַ֣חַיָּ֔ה וּבְכָל־הַשֶּׁ֖רֶץ וְכֹ֖ל הָאָדָֽם׃	בָּשָׂ֣ר׀	אָדָֽם׃	הָ
10	Genesis 7:21	כָּל־בָּשָׂ֣ר׀ בָּעֹ֤וף וּבַבְּהֵמָה֙ וּבַ֣חַיָּ֔ה וּבְכָל־הַשֶּׁ֖רֶץ וְכֹ֖ל הָאָדָֽם׃	שֶּׁ֖רֶץ	אָדָֽם׃	הַ

If a phrase has multiple gaps, we encounter it multiple times in our results.

We show the two results in Genesis 7:21.

In [49]:

A.show(
    results,
    condensed=False,
    condenseType="clause",
    start=9,
    end=10,
    colorMap={1: "lightgreen", 2: "orange", 4: "magenta"},
)

result 9

Genesis 7:21

clause:428188 WayX NA

phrase:653516 CP Conj

3448 וַ

phrase:653517 VP Pred

3449 יִּגְוַ֞ע

phrase:653518 NP Subj

3450 כָּל־

3451 בָּשָׂ֣ר׀

clause:428188 WayX NA

phrase:653518 NP Subj

3457 בָּ

3458

3459 עֹ֤וף

3460 וּ

3461 בַ

3462

3463 בְּהֵמָה֙

3464 וּ

3465 בַ֣

3466

3467 חַיָּ֔ה

3468 וּ

3469 בְ

3470 כָל־

3471 הַ

3472 שֶּׁ֖רֶץ

clause:428188 WayX NA

phrase:653518 NP Subj

3478 וְ

3479 כֹ֖ל

3480 הָ

3481 אָדָֽם׃

Genesis 7:21

clause:428189 Ptcp Attr

phrase:653519 CP Rela

3452 הָ

phrase:653520 VP PreC

3453 רֹמֵ֣שׂ

phrase:653521 PP Cmpl

3454 עַל־

3455 הָ

3456 אָ֗רֶץ

result 10

Genesis 7:21

clause:428188 WayX NA

phrase:653516 CP Conj

3448 וַ

phrase:653517 VP Pred

3449 יִּגְוַ֞ע

phrase:653518 NP Subj

3450 כָּל־

3451 בָּשָׂ֣ר׀

clause:428188 WayX NA

phrase:653518 NP Subj

3457 בָּ

3458

3459 עֹ֤וף

3460 וּ

3461 בַ

3462

3463 בְּהֵמָה֙

3464 וּ

3465 בַ֣

3466

3467 חַיָּ֔ה

3468 וּ

3469 בְ

3470 כָל־

3471 הַ

3472 שֶּׁ֖רֶץ

clause:428188 WayX NA

phrase:653518 NP Subj

3478 וְ

3479 כֹ֖ל

3480 הָ

3481 אָדָֽם׃

Genesis 7:21

clause:428190 Ptcp Attr

phrase:653522 CP Rela

3473 הַ

phrase:653523 VP PreC

3474 שֹּׁרֵ֣ץ

phrase:653524 PP Cmpl

3475 עַל־

3476 הָ

3477 אָ֑רֶץ
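
The slot numbers in the display above explain the double hit. Here is a small sketch in plain Python (no Text-Fabric needed) that mimics what the template asks for: every word of the phrase whose immediate successor lies outside the phrase, but before the last word, gives one result. The slot ranges are read off the display of phrase 653518; the helper name `pre_gap_hits` is an assumption made for this illustration.

```python
def pre_gap_hits(slots):
    """List (wPreGap, wGap, wLast) triples for one phrase.

    Mirrors the template constraints on plain integers:
    wPreGap is in the phrase, wGap = wPreGap + 1 is not in the phrase,
    and wGap comes before the last word of the phrase.
    """
    members = set(slots)
    w_last = max(members)
    return [
        (w, w + 1, w_last)
        for w in sorted(members)
        if w + 1 not in members and w + 1 < w_last
    ]


# Slots of phrase 653518 as shown above: three stretches, two gaps.
phrase_653518 = (
    set(range(3450, 3452)) | set(range(3457, 3473)) | set(range(3478, 3482))
)

for hit in pre_gap_hits(phrase_653518):
    print(hit)
# (3451, 3452, 3481)
# (3472, 3473, 3481)
```

The two triples correspond to results 9 and 10: one per gap, both for the same phrase.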

If we want just the phrases, and each of them only once, we can run the query in shallow mode (see advanced):

In [72]:

gapQueryResults = A.search(query, shallow=True)

  1.08s 672 results
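
The drop from 716 results to 672 is what we expect when phrases with several gaps are counted once. A rough analogy in plain Python, with made-up node numbers, assuming that shallow mode amounts to keeping the set of first members of the result tuples:

```python
# Made-up full results: tuples (phrase, wPreGap, wLast, wGap).
# Phrase 500 has two gaps, so it occurs in two tuples.
full_results = [
    (500, 11, 30, 12),
    (500, 20, 30, 21),
    (501, 41, 50, 42),
]

# Keeping only the first member of each tuple, as a set,
# yields every phrase exactly once.
shallow_like = {result[0] for result in full_results}

print(len(full_results))     # 3
print(sorted(shallow_like))  # [500, 501]
```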

Excursion

Sometimes there are two subphrases with exactly the same words in them. They only differ in their values for the feature rela. Let's find such a case.

We do have to show the subphrases, though.

Some types are hidden. Let's find out which ones:

In [53]:

A.displayShow()

current display options

baseTypes: {word}
colorMap: no value
condenseType: verse
condensed: 0
edgeFeatures: set()
edgeHighlights: no value
end: no value
extraFeatures:
- ()
- {}
fmt: text-orig-full
forceEdges: 0
full: 0
hiddenTypes:
- clause_atom
- sentence_atom
- phrase_atom
- subphrase
- half_verse
hideTypes: True
highlights: {}
lineNumbers: no value
multiFeatures: 0
noneValues:
- absent
- n/a
- none
- unknown
- no value
- NA
plainGaps: True
prettyTypes: True
queryFeatures: True
showGraphics: no value
showMath: 0
skipCols: set()
standardFeatures: 0
start: no value
suppress: set()
tupleFeatures:
- :
  - 0
  - ()
- :
  - 1
  - ()
- :
  - 2
  - ()
- :
  - 3
  - ()
withLabels: True
withNodes: 0
withPassage: True
withTypes: 0

In [61]:

dResults = A.search("""
s1:subphrase
== s2:subphrase

s1 < s2
""")

  0.20s 2135 results

Let's pass a different set of hidden types when showing the results:

In [68]:

A.show(
    dResults,
    end=3,
    withNodes=True,
    condensed=True,
    hiddenTypes="clause_atom half_verse phrase_atom",
    condenseType="phrase",
)

phrase 1

Genesis 5:32

phrase:653102 PP Objc

subphrase:1301305

2594 אֶת־

2595 שֵׁ֖ם

subphrase:1301306

subphrase:1301307

2596 אֶת־

2597 חָ֥ם

2598 וְ

subphrase:1301308

2599 אֶת־

2600 יָֽפֶת׃

phrase 2

Genesis 6:7

phrase:653192 PP Adju

2746 מֵֽ

2747 אָדָם֙

subphrase:1301341

2748 עַד־

2749 בְּהֵמָ֔ה

subphrase:1301342

subphrase:1301343

2750 עַד־

2751 רֶ֖מֶשׂ

2752 וְ

subphrase:1301346

2753 עַד־

subphrase:1301344

2754 עֹ֣וף

subphrase:1301345

2755 הַ

2756 שָּׁמָ֑יִם

phrase 3

Genesis 6:10

phrase:653214 NP Objc

subphrase:1301355

2786 שְׁלֹשָׁ֣ה

subphrase:1301356

2787 בָנִ֑ים

subphrase:1301357

2788 אֶת־

2789 שֵׁ֖ם

subphrase:1301358

subphrase:1301359

2790 אֶת־

2791 חָ֥ם

2792 וְ

subphrase:1301360

2793 אֶת־

2794 יָֽפֶת׃

More research has shown that these subphrases differ in the presence of the rela feature.

In [71]:

A.show(
    dResults,
    end=3,
    withNodes=True,
    colorMap={1: "lightsalmon", 2: "lightblue"},
    extraFeatures="rela",
    hiddenTypes="clause_atom half_verse phrase_atom",
    condenseType="phrase",
)

result 1

Genesis 5:32

phrase:653102 PP Objc

subphrase:1301305

2594 אֶת־

2595 שֵׁ֖ם

subphrase:1301306

rela=par

subphrase:1301307

2596 אֶת־

2597 חָ֥ם

2598 וְ

subphrase:1301308

rela=par

2599 אֶת־

2600 יָֽפֶת׃

result 2

Genesis 6:7

phrase:653192 PP Adju

2746 מֵֽ

2747 אָדָם֙

subphrase:1301341

2748 עַד־

2749 בְּהֵמָ֔ה

subphrase:1301342

rela=par

subphrase:1301343

2750 עַד־

2751 רֶ֖מֶשׂ

2752 וְ

subphrase:1301346

rela=par

2753 עַד־

subphrase:1301344

2754 עֹ֣וף

subphrase:1301345

rela=rec

2755 הַ

2756 שָּׁמָ֑יִם

result 3

Genesis 6:10

phrase:653214 NP Objc

subphrase:1301355

2786 שְׁלֹשָׁ֣ה

subphrase:1301356

rela=adj

2787 בָנִ֑ים

subphrase:1301357

2788 אֶת־

2789 שֵׁ֖ם

subphrase:1301358

rela=par

subphrase:1301359

2790 אֶת־

2791 חָ֥ם

2792 וְ

subphrase:1301360

rela=par

2793 אֶת־

2794 יָֽפֶת׃

The red one has a feature rela='par', the blue one does not.

Two nodes with the same node type and the same slots. Yet: different nodes, different feature annotations.

At the moment I do not know why the encoders of the BHSA have chosen to do this.
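
The situation can be pictured with a small sketch in plain Python. It assumes a toy representation in which every node number maps to a slot set plus a dictionary of features; the node numbers and slots are invented and do not come from the BHSA.

```python
from collections import defaultdict

# Made-up subphrase nodes: node number -> (slots, features).
# Nodes 901 and 902 occupy exactly the same slots but carry
# different feature annotations.
subphrases = {
    900: (frozenset({1, 2}), {}),
    901: (frozenset({3, 4}), {"rela": "par"}),
    902: (frozenset({3, 4}), {}),
    903: (frozenset({6, 7}), {"rela": "par"}),
}

# Group the nodes by the slots they occupy.
by_slots = defaultdict(list)
for node, (slots, _features) in subphrases.items():
    by_slots[slots].append(node)

# Keep only slot sets shared by more than one node.
same_slot_groups = [sorted(nodes) for nodes in by_slots.values() if len(nodes) > 1]

print(same_slot_groups)  # [[901, 902]]
for node in same_slot_groups[0]:
    print(node, subphrases[node][1].get("rela"))
# 901 par
# 902 None
```

In this picture a node is identified by its number, not by its slots, so two nodes of the same type can coincide in the text and still be annotated differently.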

A different query

We can make an equivalent query to get the gaps.

In [73]:

query = """
p:phrase
    =: wFirst:word
    wLast:word
    :=

wGap:word
wFirst < wGap
wLast > wGap

p || wGap
"""

Experience has shown that this is a slow query, so we handle it with care.

In [74]:

S.study(query)
S.showPlan(details=True)

  0.00s Checking search template ...
  0.00s Setting up search space for 4 objects ...
  0.20s Constraining search space with 7 relations ...
  0.50s 	2 edges thinned
  0.50s Setting up retrieval plan with strategy small_choice_multi ...
  0.53s Ready to deliver results from 1186199 nodes
Iterate over S.fetch() to get the results
See S.showPlan() to interpret the results
Search with 4 objects and 6 relations
Results are instantiations of the following objects:
node  0-phrase                                        253203   choices
node  1-word                                          253203   choices
node  2-word                                          253203   choices
node  3-word                                          426590   choices
Performance parameters:
	yarnRatio            =    1.25
	tryLimitFrom         =      40
	tryLimitTo           =      40
Instantiations are computed along the following relations:
node                                  0-phrase        253203   choices
edge        0-phrase           [[     2-word               1.0 choices
edge        0-phrase           :=     2-word               0   choices
edge        0-phrase           =:     1-word               1.0 choices (thinned)
edge        1-word             ]]     0-phrase             0   choices
edge      1,2-word            <,>     3-word           21329.5 choices
edge        3-word             ||     0-phrase             0   choices
  0.58s The results are connected to the original search template as follows:
 0     
 1 R0  p:phrase
 2 R1      =: wFirst:word
 3 R2      wLast:word
 4         :=
 5     
 6 R3  wGap:word
 7     wFirst < wGap
 8     wLast > wGap
 9     
10     p || wGap
11

In [75]:

S.count(progress=1, limit=4)

  0.00s Counting results per 1 up to 4 ...
   |     4.29s 1
   |     4.29s 2
   |     4.29s 3
   |     4.29s 4
  4.29s Done: 5 results

This is a good example of a query that is slow to deliver even its first result. And that is bad, because it is such a straightforward query.

Why is this one so slow, while the previous one went so smoothly?

The crucial thing is the wGap word. In the latter template, wGap is not embedded in anything. It is constrained by wFirst < wGap and wGap < wLast. However, the way the search strategy works is by examining all possibilities for wFirst < wGap and only then checking whether wGap < wLast. The algorithm cannot check both conditions at the same time.

With embedding relations, things are better. Text-Fabric is heavily optimized to deal with embedding relationships.

In the former template, we see that the wGap is required to be adjacent to wPreGap, and this one is embedded in the phrase. Hence there are few cases to consider for wPreGap, and per instance there is only one wGap.

Lesson

Try to avoid free-floating nodes in your template that are constrained only by spatial relationships other than embedding.
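
The difference can be felt with a toy cost model in plain Python. This is not how the Text-Fabric search engine is implemented; it only counts how many candidates for wGap a naive strategy would have to examine in each case. The corpus size, the phrase and both function names are made up for illustration.

```python
def count_free_floating(phrase, n_words):
    """Candidates examined when wGap is only constrained by `<`.

    For the first word of the phrase, every later word in the corpus
    is a candidate for wGap; the other conditions are checked afterwards.
    """
    w_first, w_last = min(phrase), max(phrase)
    examined = 0
    hits = []
    for w_gap in range(w_first + 1, n_words + 1):  # wFirst < wGap
        examined += 1
        if w_gap < w_last and w_gap not in phrase:
            hits.append(w_gap)
    return examined, hits


def count_adjacent(phrase):
    """Candidates examined when wGap must directly follow a phrase word."""
    w_last = max(phrase)
    examined = 0
    hits = []
    for w_pre in sorted(phrase):  # wPreGap is embedded in the phrase
        examined += 1
        w_gap = w_pre + 1  # wPreGap <: wGap leaves a single candidate
        if w_gap < w_last and w_gap not in phrase:
            hits.append(w_gap)
    return examined, hits


n_words = 1000             # made-up corpus size
phrase = {10, 11, 14, 15}  # made-up phrase with a gap at slots 12-13

print(count_free_floating(phrase, n_words))  # (990, [12, 13])
print(count_adjacent(phrase))                # (4, [12])
```

The first strategy reports every word in the gap and the second only the first word of each gap, but both identify the phrase as gapped. The number of candidates examined differs by orders of magnitude.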

To the rescue

The former template had it right. Can we rescue the latter template?

We can assume that the phrase and the gap each contain a word in one and the same verse. Note that phrase and gap may belong to different clauses and sentences. We assume that a phrase cannot belong to more than two verses, so either the first or the last word of the phrase is in the same verse as a word in the gap.

In [76]:

query = """
p:phrase
    =: wFirst:word
    wLast:word
    :=

wGap:word
wFirst < wGap
wLast > wGap

p || wGap

v:verse

v [[ wFirst
v [[ wGap
"""

In [77]:

S.study(query)
S.showPlan(details=True)
S.count(progress=100, limit=3000)

  0.00s Checking search template ...
  0.00s Setting up search space for 5 objects ...
  0.21s Constraining search space with 9 relations ...
  0.53s 	2 edges thinned
  0.53s Setting up retrieval plan with strategy small_choice_multi ...
  0.57s Ready to deliver results from 1209412 nodes
Iterate over S.fetch() to get the results
See S.showPlan() to interpret the results
Search with 5 objects and 8 relations
Results are instantiations of the following objects:
node  0-phrase                                        253203   choices
node  1-word                                          253203   choices
node  2-word                                          253203   choices
node  3-word                                          426590   choices
node  4-verse                                          23213   choices
Performance parameters:
	yarnRatio            =    1.25
	tryLimitFrom         =      40
	tryLimitTo           =      40
Instantiations are computed along the following relations:
node                                  4-verse          23213   choices
edge        4-verse            [[     1-word              12.1 choices
edge        1-word             ]]     0-phrase             1.0 choices
edge        0-phrase           =:     1-word               0   choices
edge        0-phrase           :=     2-word               1.0 choices (thinned)
edge        0-phrase           [[     2-word               0   choices
edge        4-verse            [[     3-word              18.9 choices
edge      1,2-word            <,>     3-word               0   choices
edge        3-word             ||     0-phrase             0   choices
  0.58s The results are connected to the original search template as follows:
 0     
 1 R0  p:phrase
 2 R1      =: wFirst:word
 3 R2      wLast:word
 4         :=
 5     
 6 R3  wGap:word
 7     wFirst < wGap
 8     wLast > wGap
 9     
10     p || wGap
11     
12 R4  v:verse
13     
14     v [[ wFirst
15     v [[ wGap
16     
  0.00s Counting results per 100 up to 3000 ...
   |     0.08s 100
   |     0.17s 200
   |     0.23s 300
   |     0.26s 400
   |     0.29s 500
   |     0.35s 600
   |     0.37s 700
   |     0.44s 800
   |     0.46s 900
   |     0.50s 1000
   |     0.54s 1100
   |     0.57s 1200
   |     0.61s 1300
   |     0.70s 1400
   |     0.90s 1500
   |     0.95s 1600
   |     1.06s 1700
   |     1.11s 1800
   |     1.17s 1900
   |     1.26s 2000
   |     1.32s 2100
   |     1.37s 2200
   |     1.47s 2300
   |     1.80s 2400
   |     1.89s 2500
   |     1.98s 2600
   |     2.09s 2700
  2.15s Done: 2739 results

In [23]:

# ignore this
# S.tweakPerformance(yarnRatio=1)

We are going to run this query in shallow mode.

In [78]:

results = A.search(query, shallow=True)

  4.37s 672 results

Shallow mode tends to be quicker, but that does not always materialize. The number of results agrees with the first query. Yet we have been lucky, because we required the word in the gap to be in the same verse as the first word in the phrase. What if we require instead that it is in the same verse as the last word in the phrase?

In [79]:

query = """
p:phrase
    =: wFirst:word
    wLast:word
    :=

wGap:word
wFirst < wGap
wLast > wGap

p || wGap

v:verse

v [[ wLast
v [[ wGap
"""

In [80]:

results = A.search(query, shallow=True)

  4.39s 661 results

Then we would not have found all results.

So, this road, although doable, is much less comfortable, performance-wise and logic-wise.
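
Why does the choice between first and last word matter? A constructed example in plain Python shows the mechanism: a phrase that spans two verses and has its gap in the first verse only. The verse layout, the phrase and the helper names are invented for this sketch and are not taken from the BHSA.

```python
def verse_of(slot, verses):
    """Return the verse label whose slot range contains `slot`."""
    for label, (first, last) in verses.items():
        if first <= slot <= last:
            return label
    raise ValueError(f"slot {slot} is not in any verse")


def found_by(anchor, phrase, verses):
    """Would a verse-restricted template find this gapped phrase?

    `anchor` is "first" or "last": the phrase word that must share
    a verse with some word in the gap.
    """
    w_first, w_last = min(phrase), max(phrase)
    anchor_verse = verse_of(w_first if anchor == "first" else w_last, verses)
    gap_words = [w for w in range(w_first, w_last) if w not in phrase]
    return any(verse_of(w, verses) == anchor_verse for w in gap_words)


# Made-up layout: verse "v1" holds slots 1-10, verse "v2" slots 11-20.
verses = {"v1": (1, 10), "v2": (11, 20)}
# Made-up phrase spanning both verses, with its gap (slots 4-5) in v1 only.
phrase = {2, 3, 6, 7, 8, 9, 10, 11, 12}

print(found_by("first", phrase, verses))  # True
print(found_by("last", phrase, verses))   # False
```

A phrase of this shape is found when we anchor the verse on the first word and missed when we anchor it on the last word, which is the kind of case that could account for the difference between 672 and 661.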

Check the gaps

In this misty landscape of gaps we need some corroboration that we found the right results:

1. Is every node in gapQueryResults a phrase?
2. Does every phrase in gapQueryResults have a gap?
3. Is every gapped phrase contained in gapQueryResults?

We check all this by hand coding.

Here is a function that checks whether a phrase has a gap. If the distance between its end points is greater than the number of words it contains, it must have a gap.

In [81]:

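def hasGap(p):
    # the words (slots) contained in phrase p
    words = L.d(p, otype="word")
    # a phrase without gaps spans exactly as many slots as it has words
    return words[-1] - words[0] + 1 > len(words)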
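
As a worked example of this criterion, we can apply it to the slot numbers of phrase 653518 (Genesis 7:21) as displayed earlier. The sketch below is plain Python and does not call Text-Fabric; `has_gap_slots` is a hypothetical variant of hasGap that takes a sorted list of slot numbers instead of a node.

```python
def has_gap_slots(words):
    """Same criterion as hasGap, applied to a sorted list of slot numbers."""
    return words[-1] - words[0] + 1 > len(words)


# Slots of phrase 653518 (Genesis 7:21) as displayed earlier.
words = (
    list(range(3450, 3452)) + list(range(3457, 3473)) + list(range(3478, 3482))
)

span = words[-1] - words[0] + 1
print(span, len(words))          # 32 22
print(has_gap_slots(words))      # True
print(has_gap_slots([5, 6, 7]))  # False
```

The phrase stretches over 32 slots but contains only 22 words, so 10 slots (the two gaps of 5 words each) belong to something else.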

Now we can perform the checks.

In [82]:

otypesGood = True
haveGaps = True

for p in gapQueryResults:
    otype = F.otype.v(p)
    if otype != "phrase":
        print(f"Non phrase detected: {p} is a {otype}")
        otypesGood = False
        break

    if not hasGap(p):
        print(f"Phrase without a gap: {p}")
        A.pretty(p)
        haveGaps = False
        break

print(f"{len(gapQueryResults)} nodes in query result")
if otypesGood:
    print("1. all nodes are phrases")
if haveGaps:
    print("2. all nodes have gaps")

inResults = True
for p in F.otype.s("phrase"):
    if hasGap(p):
        if p not in gapQueryResults:
            print(f"Gapped phrase outside query results: {p}")
            A.pretty(p)
            inResults = False
            break

if inResults:
    print("3. all gapped phrases are contained in the results")

672 nodes in query result
1. all nodes are phrases
2. all nodes have gaps
3. all gapped phrases are contained in the results

Note that by hand coding we can get the gapped phrases much more quickly and reliably!

Custom sets for (non-)gapped phrases

We have obtained a set with all gapped phrases, and we have paid a price:

either an expensive query,
or an inconvenient bit of hand coding.

It would be nice if we could kick-start our queries using this set as a given. And that is exactly what we are going to do now.

We make two custom sets and give them a name, gapphrase for gapped phrases and conphrase for non-gapped phrases (consecutive phrases).

In [83]:

customSets = dict(
    gapphrase=gapQueryResults,
    conphrase=set(F.otype.s("phrase")) - gapQueryResults,
)
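
A quick sanity check on such a pair of sets is that they do not overlap and together cover all phrases. The sketch below shows the idea on made-up node numbers in plain Python; the names `all_phrases`, `gapped` and `custom_sets` are stand-ins for illustration only.

```python
# Made-up node numbers standing in for all phrases and the gapped ones.
all_phrases = {100, 101, 102, 103, 104}
gapped = {101, 104}

custom_sets = dict(
    gapphrase=gapped,
    conphrase=all_phrases - gapped,
)

# The two sets should not overlap and should jointly cover all phrases.
assert custom_sets["gapphrase"].isdisjoint(custom_sets["conphrase"])
assert custom_sets["gapphrase"] | custom_sets["conphrase"] == all_phrases

print(sorted(custom_sets["conphrase"]))  # [100, 102, 103]
```

With the numbers reported above (253203 phrases, of which 672 are gapped), conphrase should hold 253203 - 672 = 252531 nodes.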

Suppose we want all verbs that occur in a gapped phrase.

In [84]:

query = """
gapphrase
  word sp=verb
"""

Note that we have used the foreign name gapphrase in our search template, instead of phrase.

But we can still run search(), provided we tell it what we mean by gapphrase. We do that by passing the sets parameter to search(), which should be a dictionary of sets. Search will look up gapphrase in this dictionary, and will use its value, which should be a node set. That way, it understands that the expression gapphrase stands for the nodes in the given node set.

Here we go:

In [85]:

results = A.search(query, sets=customSets)

  0.20s 94 results

In [86]:

A.show(results, start=1, end=3, condenseType="clause")

result 1

Genesis 30:20

clause ZQtX NA

phrase VP PreO

זְבָדַ֨נִי

sp=verb

phrase NP Subj

אֱלֹהִ֥ים׀

sp=subs

phrase VP PreO

אֹתִי֮

sp=prep

phrase NP Objc

זֵ֣בֶד

sp=subs

טֹוב֒

sp=adjv

result 2

Genesis 30:35

clause Way0 NA

phrase CP Conj

וַ

sp=conj

phrase VP Pred

יָּ֣סַר

sp=verb

phrase PP Time

בַּ

sp=prep

sp=art

יֹּום֩

sp=subs

הַ

sp=art

ה֨וּא

sp=prps

phrase PP Objc

sp=prep

sp=art

sp=subs

sp=art

sp=adjv

sp=conj

sp=art

sp=verb

sp=conj

sp=prep

sp=subs

sp=art

sp=subs

sp=art

sp=adjv

sp=conj

sp=art

sp=verb

sp=subs

clause Way0 NA

phrase PP Objc

sp=conj

sp=subs

sp=adjv

sp=prep

sp=art

כְּשָׂבִ֑ים

sp=subs

result 3

Genesis 30:35

clause Way0 NA

phrase CP Conj

וַ

sp=conj

phrase VP Pred

יָּ֣סַר

sp=verb

phrase PP Time

בַּ

sp=prep

sp=art

יֹּום֩

sp=subs

הַ

sp=art

ה֨וּא

sp=prps

phrase PP Objc

sp=prep

sp=art

sp=subs

sp=art

sp=adjv

sp=conj

sp=art

sp=verb

sp=conj

sp=prep

sp=subs

sp=art

sp=subs

sp=art

sp=adjv

sp=conj

sp=art

sp=verb

sp=subs

clause Way0 NA

phrase PP Objc

sp=conj

sp=subs

sp=adjv

sp=prep

sp=art

כְּשָׂבִ֑ים

sp=subs

That looks good.

We can also apply feature conditions to gapphrase:

In [87]:

query = """
gapphrase function=Subj
"""
results = A.search(query, sets=customSets)
A.table(results, start=1, end=3)

  0.00s 177 results

n	p	phrase
1	Genesis 2:25	שְׁנֵיהֶם֙ הָֽאָדָ֖ם וְאִשְׁתֹּ֑ו
2	Genesis 4:4	הֶ֨בֶל גַם־ה֛וּא
3	Genesis 7:14	הֵ֜מָּה וְכָל־הַֽחַיָּ֣ה לְמִינָ֗הּ וְכָל־הַבְּהֵמָה֙ לְמִינָ֔הּ וְכָל־הָרֶ֛מֶשׂ לְמִינֵ֑הוּ וְכָל־הָעֹ֣וף לְמִינֵ֔הוּ כֹּ֖ל צִפֹּ֥ור כָּל־כָּנָֽף׃

In [88]:

A.show(results, start=1, end=3, condenseType="clause")

result 1

Genesis 2:25

clause WayX NA

phrase CP Conj

function=Conj

וַ

phrase VP Pred

function=Pred

יִּֽהְי֤וּ

phrase NP Subj

function=Subj

שְׁנֵיהֶם֙

phrase AdjP PreC

function=PreC

עֲרוּמִּ֔ים

phrase NP Subj

function=Subj

result 2

clause WXQt NA

phrase CP Conj

function=Conj

וְ

phrase PrNP Subj

function=Subj

הֶ֨בֶל

phrase VP Pred

function=Pred

הֵבִ֥יא

phrase PrNP Subj

function=Subj

גַם־

ה֛וּא

phrase PP Cmpl

function=Cmpl

result 3

clause Ellp NA

phrase PPrP Subj

function=Subj

clause Ellp NA

phrase PPrP Subj

function=Subj

We reduce the details by setting baseTypes to phrase. The highlighted phrases now get a yellow background.

In [89]:

A.show(results, start=3, end=3, baseTypes="phrase")

result 3

Genesis 7:14

verse

sentence 17

clause Ellp NA

phrase הֵ֜מָּה וְכָל־הַֽחַיָּ֣ה לְמִינָ֗הּ וְכָל־הַבְּהֵמָה֙ לְמִינָ֔הּ וְכָל־הָרֶ֛מֶשׂ

function=Subj

clause Ptcp Attr

phrase הָ

function=Rela

phrase רֹמֵ֥שׂ

function=PreC

phrase עַל־הָאָ֖רֶץ

function=Cmpl

clause Ellp NA

phrase לְמִינֵ֑הוּ וְכָל־הָעֹ֣וף לְמִינֵ֔הוּ כֹּ֖ל צִפֹּ֥ור כָּל־כָּנָֽף׃

function=Subj

We reduce the details by setting baseTypes to phrase_atom. This time the highlighted phrases do not get a yellow background.

In [90]:

A.show(results, start=3, end=3, baseTypes={"phrase_atom"})

result 3

Genesis 7:14

verse

sentence 17

clause Ellp NA

phrase PPrP Subj

function=Subj

clause Ptcp Attr

phrase CP Rela

function=Rela

הָ

phrase VP PreC

function=PreC

רֹמֵ֥שׂ

phrase PP Cmpl

function=Cmpl

עַל־

הָ

אָ֖רֶץ

clause Ellp NA

phrase PPrP Subj

function=Subj

Two-phrase clauses

We can find the gaps, but do our minds always reckon with gaps? Gaps cause unexpected semantics. Here is a little puzzle.

Suppose we want to count the clauses consisting of exactly two phrases.

Here follows a little journey. We use a query to find the clauses, check the result with hand-coding, scratch our heads, refine the query, the hand-coding and our question until we are satisfied.

Attempt 1

By query

The following template should do it: a clause, starting with a phrase, followed by an adjacent phrase, which terminates the clause.

In [91]:

query = """
clause
    =: phrase
    <: phrase
    :=
"""
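As a reading aid, the relations in this template can be mimicked on plain slot lists. This is a toy sketch under the assumption that `=:` means "start at the same slot", `<:` means "ends right before the other starts", and `:=` means "end at the same slot"; the function names and slot numbers are made up, and nothing here calls Text-Fabric.

```python
def starts_together(a, b):
    """Toy version of =: (both nodes start at the same slot)."""
    return a[0] == b[0]


def adjacent_before(a, b):
    """Toy version of <: (a ends at the slot right before b starts)."""
    return a[-1] + 1 == b[0]


def ends_together(a, b):
    """Toy version of := (both nodes end at the same slot)."""
    return a[-1] == b[-1]


def embeds(outer, inner):
    """Toy version of embedding: every slot of inner is also a slot of outer."""
    return set(inner) <= set(outer)


# Invented slot lists: a clause of five words split into two phrases.
clause = [1, 2, 3, 4, 5]
phrase1 = [1, 2]
phrase2 = [3, 4, 5]

template_matches = (
    embeds(clause, phrase1)
    and embeds(clause, phrase2)
    and starts_together(clause, phrase1)
    and adjacent_before(phrase1, phrase2)
    and ends_together(phrase2, clause)
)
print(template_matches)
```

These are five conditions, which agrees with the number of relations reported by `S.study()` for this template.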

In [92]:

# ignore this
# S.tweakPerformance(yarnRatio=1.2)

In [93]:

S.study(query)

  0.00s Checking search template ...
  0.00s Setting up search space for 3 objects ...
  0.08s Constraining search space with 5 relations ...
  0.27s 	2 edges thinned
  0.27s Setting up retrieval plan with strategy small_choice_multi ...
  0.29s Ready to deliver results from 264393 nodes
Iterate over S.fetch() to get the results
See S.showPlan() to interpret the results

In [94]:

S.showPlan(details=True)

Search with 3 objects and 5 relations
Results are instantiations of the following objects:
node  0-clause                                         88131   choices
node  1-phrase                                         88131   choices
node  2-phrase                                         88131   choices
Performance parameters:
	yarnRatio            =    1.25
	tryLimitFrom         =      40
	tryLimitTo           =      40
Instantiations are computed along the following relations:
node                                  0-clause         88131   choices
edge        0-clause           [[     2-phrase             1.0 choices
edge        2-phrase           :=     0-clause             0   choices
edge        2-phrase           :>     1-phrase             0.3 choices
edge        1-phrase           ]]     0-clause             0   choices
edge        0-clause           =:     1-phrase             0   choices
  1.53s The results are connected to the original search template as follows:
 0     
 1 R0  clause
 2 R1      =: phrase
 3 R2      <: phrase
 4         :=
 5

In [95]:

results = A.search(query)
A.table(results, end=7)

  0.53s 23486 results

n	p	clause	phrase	phrase
1	Genesis 1:3	יְהִ֣י אֹ֑ור	יְהִ֣י	אֹ֑ור
2	Genesis 1:4	כִּי־טֹ֑וב	כִּי־	טֹ֑וב
3	Genesis 1:7	אֲשֶׁר֙ מִתַּ֣חַת לָרָקִ֔יעַ	אֲשֶׁר֙	מִתַּ֣חַת לָרָקִ֔יעַ
4	Genesis 1:7	אֲשֶׁ֖ר מֵעַ֣ל לָרָקִ֑יעַ	אֲשֶׁ֖ר	מֵעַ֣ל לָרָקִ֑יעַ
5	Genesis 1:10	כִּי־טֹֽוב׃	כִּי־	טֹֽוב׃
6	Genesis 1:11	מַזְרִ֣יעַ זֶ֔רַע	מַזְרִ֣יעַ	זֶ֔רַע
7	Genesis 1:12	כִּי־טֹֽוב׃	כִּי־	טֹֽוב׃

If we want to have the clauses only, we run it in shallow mode:

In [96]:

clausesByQuery = sorted(A.search(query, shallow=True))

  0.43s 23486 results
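What shallow mode delivers can be pictured as keeping only the first member of every result tuple. The sketch below does that reduction on invented tuples; it is a toy model of the idea, not a description of how the search engine computes it.

```python
# Invented result tuples of the form (clause, phrase, phrase).
toy_results = [
    (10, 101, 102),
    (11, 103, 104),
    (11, 103, 105),  # the same clause occurring in a second result
]

# Keep the distinct first members only, in sorted order.
clauses_only = sorted({r[0] for r in toy_results})
print(clauses_only)
```

In the real run above the full search and the shallow search both report 23486 results, so in this case every clause occurs in exactly one result tuple.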

Note result 3 above: it seems we have 3 phrases. Yet there are only 2. We take a closer look:

In [97]:

focus = results[2][0]
A.pretty(focus)

Genesis 1:7

clause NmCl Attr

phrase CP Rela

phrase PP PreC

One phrase is chunked into two phrase atoms, which are hidden by default. Let's make that clearer:

In [98]:

A.pretty(focus, hideTypes=False)

Genesis 1:7

clause NmCl Attr

clause_atom 10

phrase CP Rela

phrase_atom CP NA

אֲשֶׁר֙

phrase PP PreC

phrase_atom PP NA

phrase_atom PP Spec

By hand

Let us check this with a piece of hand-written code. We want clauses that consist of exactly two phrases.

In [99]:

A.indent(reset=True)
A.info("counting ...")

clausesByHand = []
for clause in F.otype.s("clause"):
    phrases = L.d(clause, otype="phrase")
    if len(phrases) == 2:
        clausesByHand.append(clause)
clausesByHand = sorted(clausesByHand)
A.info(f"Done: found {len(clausesByHand)}")

  0.00s counting ...
  0.21s Done: found 23864

The difference

Strange, we end up with more cases. What is happening? Let us compare the results. We look at the first result where both methods diverge.

We put the difference finding in a little function.

In [100]:

def showDiff(queryResults, handResults):
    diff = [x for x in zip(queryResults, handResults) if x[0] != x[1]]
    if not diff:
        print(
            f"""
{len(queryResults):>6} queryResults
         are identical with
{len(handResults):>6} handResults
"""
        )
        return
    (rQuery, rHand) = diff[0]
    if rQuery < rHand:
        print(f"clause {rQuery} is a query result but not found by hand")
        toShow = rQuery
    else:
        print(f"clause {rHand} is not a query result but has been found by hand")
        toShow = rHand
    colors = ["aqua", "aquamarine", "khaki", "lavender", "yellow"]
    highlights = {}
    for (i, phrase) in enumerate(L.d(toShow, otype="phrase")):
        highlights[phrase] = colors[i % len(colors)]
        for atom in L.d(phrase, otype="phrase_atom"):
            highlights[atom] = colors[i % len(colors)]
    A.pretty(
        toShow,
        hideTypes=False,
        withNodes=True,
        suppress={"lex", "sp", "vt", "vs"},
        highlights=highlights,
        baseTypes="phrase_atom",
    )
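A note on the comparison itself: `showDiff` pairs the two lists position by position, so if one list were a strict prefix of the other, `zip` would stop early and no difference would be seen. A complementary check on sets avoids that. The helper below is a hypothetical addition, not part of the notebook, and works on plain lists of node numbers.

```python
def first_divergence(query_results, hand_results):
    """Return (node, origin) for the smallest node found by only one method, or None."""
    query_set, hand_set = set(query_results), set(hand_results)
    only_one = query_set ^ hand_set  # symmetric difference
    if not only_one:
        return None
    node = min(only_one)
    return (node, "query only" if node in query_set else "hand only")


# Invented node numbers.
print(first_divergence([1, 2, 4], [1, 2, 3, 4]))  # a difference in the middle
print(first_divergence([1, 2], [1, 2, 3]))        # a difference only at the tail
print(first_divergence([1, 2], [1, 2]))           # no difference
```

The resulting node can then be displayed with `A.pretty()` in the same way as `showDiff` does.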

In [101]:

showDiff(clausesByQuery, clausesByHand)

clause 427937 is not a query result but has been found by hand

Genesis 4:14

clause:427937 XYqt NA

clause_atom:516080 112

phrase:652701 NP Subj

phrase_atom:905945 כָל־

clause:427937 XYqt NA

clause_atom:516082 222

phrase:652703 VP PreO

phrase_atom:905947 יַֽהַרְגֵֽנִי׃

Lo and behold:

the hand-written code is right in a sense: this is a clause that consists exactly of two phrases.
the query is also right in a sense: the two phrases are not adjacent: there is a gap in the clause between them!
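The two readings can be put side by side on a toy clause. The slot numbers are invented: the clause owns slots 1-2 and 5-6, while slots 3-4 belong to some other clause, just as in the Genesis 4:14 case.

```python
# Invented slot lists for the two phrases of a clause with a gap between them.
toy_clause_phrases = [[1, 2], [5, 6]]


def has_two_phrases(phrases):
    """The counting criterion: exactly two phrases."""
    return len(phrases) == 2


def has_two_adjacent_phrases(phrases):
    """The adjacency criterion: two phrases, the second starting right after the first."""
    return len(phrases) == 2 and phrases[0][-1] + 1 == phrases[1][0]


print(has_two_phrases(toy_clause_phrases))
print(has_two_adjacent_phrases(toy_clause_phrases))
```

The first criterion accepts the clause and the second rejects it, which is the disagreement we just observed.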

Attempt 2

By hand

We modify the hand-written code such that only clauses qualify if the two phrases are adjacent.

In [102]:

A.indent(reset=True)
A.info("counting ...")

clausesByHand2 = []
for clause in F.otype.s("clause"):
    phrases = L.d(clause, otype="phrase")
    if len(phrases) == 2:
        if L.d(phrases[0], otype="word")[-1] + 1 == L.d(phrases[1], otype="word")[0]:
            clausesByHand2.append(clause)
clausesByHand2 = sorted(clausesByHand2)
A.info(f"Done: found {len(clausesByHand2)}")

  0.00s counting ...
  0.24s Done: found 23403

The difference

Now we have fewer cases. What is going on?

In [103]:

showDiff(clausesByQuery, clausesByHand2)

clause 428698 is a query result but not found by hand

Genesis 14:16

clause:428698 WxQ0 NA

clause_atom:516896 427

phrase:655130  CP Conj
phrase_atom:908523  וְ
phrase:655131  PP Objc
phrase_atom:908524  גַם֩ אֶת־לֹ֨וט 
phrase_atom:908525  אָחִ֤יו 
phrase_atom:908526  וּ
phrase_atom:908527  רְכֻשֹׁו֙ 
phrase:655132  VP Pred
phrase_atom:908528  הֵשִׁ֔יב 
phrase:655131  PP Objc
phrase_atom:908529  וְ
phrase_atom:908530  גַ֥ם אֶת־הַנָּשִׁ֖ים וְאֶת־הָעָֽם׃ 

Observe:

This clause has three phrases, but the third one lies inside the second one.

the hand-written code is right in a sense: this clause has three phrases.
the query is right in a sense: it contains two adjacent phrases that together span the whole clause.
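A toy model shows why the query accepts such a clause: the template only asks that some pair of phrases satisfies the relations, not that these are all the phrases. The slot lists below are invented, and the functions are a plain-Python paraphrase of the two approaches, not Text-Fabric calls.

```python
from itertools import permutations

# Invented slots: phrase B has a gap at slot 4, which is filled by phrase C.
toy_clause = [1, 2, 3, 4, 5, 6]
toy_phrases = {"A": [1], "B": [2, 3, 5, 6], "C": [4]}


def query_matches(clause, phrases):
    """True if SOME ordered pair of phrases starts the clause, is adjacent, and ends the clause."""
    return any(
        p1[0] == clause[0] and p1[-1] + 1 == p2[0] and p2[-1] == clause[-1]
        for p1, p2 in permutations(phrases.values(), 2)
    )


def hand_matches(phrases):
    """True if there are EXACTLY two phrases and they are adjacent."""
    ordered = sorted(phrases.values())
    return len(ordered) == 2 and ordered[0][-1] + 1 == ordered[1][0]


print(query_matches(toy_clause, toy_phrases))
print(hand_matches(toy_phrases))
```

The pair (A, B) satisfies the template, while the count of three phrases makes the hand-written check fail.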

Attempt 3

By query

Can we adjust the pattern to exclude cases like this? Yes, with custom sets, see advanced.

Instead of looking through all phrases, we can restrict ourselves to non-gapped phrases.

Earlier in this notebook we have constructed the set of non-gapped phrases and put it under the name conphrase in the custom sets.

In [104]:

query = """
clause
    =: conphrase
    <: conphrase
    :=
"""

clausesByQuery2 = sorted(A.search(query, sets=customSets, shallow=True))

  0.43s 23330 results

The difference

There is still a difference.

In [105]:

showDiff(clausesByQuery2, clausesByHand2)

clause 428380 is not a query result but has been found by hand

Genesis 10:14

clause:428380 Ellp NA

clause_atom:516560 402

phrase:654133 CP Conj

phrase_atom:907448 וְֽ

phrase:654134 PP Objc

phrase_atom:907449 אֶת־פַּתְרֻסִ֞ים וְאֶת־כַּסְלֻחִ֗ים

clause:428380 Ellp NA

clause_atom:516562 220

phrase:654134 PP Objc

phrase_atom:907454 וְ

phrase_atom:907455 אֶת־כַּפְתֹּרִֽים׃ ס

Observe:

This clause has two phrases; the second one has a gap, which coincides with a gap in the clause.

the hand-written code is right in a sense: this clause has two phrases, adjacent, and they span the whole clause, nothing left out.
the query is right in a sense: the second phrase is not consecutive.
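Again a toy model may help. Slots are invented: the clause skips slots 4-5, and so does its second phrase. The helper `has_gap` is an illustrative stand-in for whatever gap test is used in the notebook.

```python
def has_gap(slots):
    """Return True if the sorted slot list skips at least one position."""
    return any(b - a > 1 for a, b in zip(slots, slots[1:]))


# Invented slots: the gap of the second phrase coincides with the gap of the clause.
toy_clause = [1, 2, 3, 6, 7]
toy_phrases = [[1], [2, 3, 6, 7]]

phrases_adjacent = toy_phrases[0][-1] + 1 == toy_phrases[1][0]
all_phrases_consecutive = not any(has_gap(p) for p in toy_phrases)
clause_has_gap = has_gap(toy_clause)

print(phrases_adjacent)         # what the hand-written check looks at
print(all_phrases_consecutive)  # what the restriction to non-gapped phrases looks at
print(clause_has_gap)
```

The adjacency test passes, the non-gapped test fails, and the clause itself turns out to be gapped.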

Attempt 4

By hand

We modify the hand-written code, so that only consecutive clauses qualify.

In [106]:

A.indent(reset=True)
A.info("counting ...")

clausesByHand3 = []
for clause in F.otype.s("clause"):
    if hasGap(clause):
        continue
    phrases = L.d(clause, otype="phrase")
    if len(phrases) == 2:
        if L.d(phrases[0], otype="word")[-1] + 1 == L.d(phrases[1], otype="word")[0]:
            clausesByHand3.append(clause)
clausesByHand3 = sorted(clausesByHand3)
A.info(f"Done: found {len(clausesByHand3)}")

  0.00s counting ...
  0.33s Done: found 23330

The difference

Now the numbers of results agree. But are the results really the same?

In [107]:

showDiff(clausesByQuery2, clausesByHand3)

 23330 queryResults
         are identical with
 23330 handResults

Conclusion

It took four attempts to pin down exactly what we were looking for.

Sometimes the search template had to be modified, sometimes the hand-written code.

The interplay and systematic comparison between the attempts helped to spot all relevant configurations of phrases within clauses.

Spans

Here is another cause of wrong query results: there are sentences that span multiple verses. Such sentences are not contained in any single verse, so they are easily missed by queries.

We describe a scenario where that happens.

Mother clauses

A clause and its mother do not have to be in the same verse. We are going to fetch the cases where they are in different verses.

All mother clauses

But first we fetch all pairs of clauses connected by a mother edge.

In [108]:

query = """
clause
-mother> clause
"""
allMotherPairs = A.search(query)
A.table(allMotherPairs, end=7)

  0.08s 13917 results

Mother in another verse

Now we modify the query to the effect that mother and daughter must sit in distinct verses.

In [109]:

query = """
cm:clause
-mother> cd:clause

v1:verse
v2:verse
v1 # v2

cm ]] v1
cd ]] v2
"""
diffMotherPairs = A.search(query)
A.table(diffMotherPairs, end=7, skipCols="3 4", withPassage="1 2")

  0.12s 710 results

n	clause	clause
1	Genesis 1:18 וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה	Genesis 1:17 לְהָאִ֖יר עַל־הָאָֽרֶץ׃
2	Genesis 2:7 וַיִּיצֶר֩ יְהוָ֨ה אֱלֹהִ֜ים אֶת־הָֽאָדָ֗ם עָפָר֙ מִן־הָ֣אֲדָמָ֔ה	Genesis 2:4 בְּיֹ֗ום
3	Genesis 7:3 לְחַיֹּ֥ות זֶ֖רַע עַל־פְּנֵ֥י כָל־הָאָֽרֶץ׃	Genesis 7:2 מִכֹּ֣ל׀ הַבְּהֵמָ֣ה הַטְּהֹורָ֗ה תִּֽקַּח־לְךָ֛ שִׁבְעָ֥ה שִׁבְעָ֖ה אִ֣ישׁ וְאִשְׁתֹּ֑ו
4	Genesis 22:17 כִּֽי־בָרֵ֣ךְ אֲבָרֶכְךָ֗	Genesis 22:16 כִּ֗י
5	Genesis 24:44 הִ֣וא הָֽאִשָּׁ֔ה	Genesis 24:43 הָֽעַלְמָה֙
6	Genesis 27:45 עַד־שׁ֨וּב אַף־אָחִ֜יךָ מִמְּךָ֗	Genesis 27:44 עַ֥ד אֲשֶׁר־תָּשׁ֖וּב חֲמַ֥ת אָחִֽיךָ׃
7	Genesis 36:16 אַלּֽוּף־קֹ֛רַח אַלּ֥וּף גַּעְתָּ֖ם אַלּ֣וּף עֲמָלֵ֑ק	Genesis 36:15 בְּנֵ֤י אֱלִיפַז֙ בְּכֹ֣ור עֵשָׂ֔ו אַלּ֤וּף תֵּימָן֙ אַלּ֣וּף אֹומָ֔ר אַלּ֥וּף צְפֹ֖ו אַלּ֥וּף קְנַֽז׃

Mother in same verse

As a check, we modify the latter query and require v1 and v2 to be the same verse, to get the mother pairs of which both members are in the same verse.

In [110]:

query = """
cm:clause
-mother> cd:clause

v1:verse
v2:verse
v1 = v2

cm ]] v1
cd ]] v2
"""
sameMotherPairs = A.search(query)
A.table(sameMotherPairs, end=7, skipCols="3 4", withPassage="1 2")

  0.14s 13181 results

n	clause	clause
1	Genesis 1:4 כִּי־טֹ֑וב	Genesis 1:4 וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור
2	Genesis 1:10 כִּי־טֹֽוב׃	Genesis 1:10 וַיַּ֥רְא אֱלֹהִ֖ים
3	Genesis 1:12 כִּי־טֹֽוב׃	Genesis 1:12 וַיַּ֥רְא אֱלֹהִ֖ים
4	Genesis 1:14 לְהַבְדִּ֕יל בֵּ֥ין הַיֹּ֖ום וּבֵ֣ין הַלָּ֑יְלָה	Genesis 1:14 יְהִ֤י מְאֹרֹת֙ בִּרְקִ֣יעַ הַשָּׁמַ֔יִם
5	Genesis 1:15 לְהָאִ֖יר עַל־הָאָ֑רֶץ	Genesis 1:15 וְהָי֤וּ לִמְאֹורֹת֙ בִּרְקִ֣יעַ הַשָּׁמַ֔יִם
6	Genesis 1:17 לְהָאִ֖יר עַל־הָאָֽרֶץ׃	Genesis 1:17 וַיִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּרְקִ֣יעַ הַשָּׁמָ֑יִם
7	Genesis 1:18 וּֽלֲהַבְדִּ֔יל בֵּ֥ין הָאֹ֖ור וּבֵ֣ין הַחֹ֑שֶׁךְ	Genesis 1:18 וְלִמְשֹׁל֙ בַּיֹּ֣ום וּבַלַּ֔יְלָה

The difference

Let's check if the numbers add up:

the first query asked for all pairs
the second query asked for pairs with members in different verses
the third query asked for pairs with members in the same verse

Then the results of the second and third query combined should equal the results of the first query.

That makes sense.

Still, let's check:

In [111]:

discrepancy = len(allMotherPairs) - len(diffMotherPairs) - len(sameMotherPairs)
print(discrepancy)

26

The numbers do not add up. We are missing cases. Why?
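Plugging in the result counts reported by the three searches above makes the gap explicit. The variable names below are ad hoc; the numbers are the ones printed in the search outputs.

```python
# Result counts as reported by the three searches above.
n_all_pairs = 13917   # all mother pairs
n_diff_verse = 710    # mother and daughter in different verses
n_same_verse = 13181  # mother and daughter in the same verse

missing = n_all_pairs - n_diff_verse - n_same_verse
print(missing)
```

So 26 mother pairs are found by the first query but by neither of the other two.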

Clauses may cross verse boundaries. In that case they are not part of a verse, and hence our latter two queries do not detect them. Let's count how many clauses cross a verse boundary.

In [112]:

query = """
clause
/with/
v1:verse
&& ..
v2:verse
&& ..
v1 < v2
/-/
"""
results = A.search(query)

  0.59s 50 results
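Why a verse-crossing clause drops out of a query with an embedding condition can be seen in a toy model. The verses and slot numbers are invented, and `containing_verses` is a hypothetical helper that mimics looking upward from a node to the verses that contain all of its slots.

```python
# Invented verses, each covering a range of slots.
toy_verses = {"v1": range(1, 6), "v2": range(6, 11)}


def containing_verses(node_slots, verses):
    """Names of the verses that contain every slot of the node."""
    return [
        name
        for name, slots in verses.items()
        if all(s in slots for s in node_slots)
    ]


inside = [2, 3, 4]      # lies completely within v1
spanner = [4, 5, 6, 7]  # starts in v1 and ends in v2

print(containing_verses(inside, toy_verses))
print(containing_verses(spanner, toy_verses))
```

For the spanner the list is empty: it shares slots with both verses but is embedded in neither, which is why the query above uses an overlap relation instead of embedding.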

You might think we can speed up the query by requiring v1 <: v2 (both verses are adjacent). There are fewer possibilities to consider, so maybe we gain something.

In [113]:

query = """
clause
/with/
v1:verse
&& ..
v2:verse
&& ..
v1 <: v2
/-/
"""
results = A.search(query)

  0.56s 49 results

Indeed, slightly faster, but one result fewer! How can that be?

There must be a clause that spans at least two verses and in doing so, skips at least one verse.

Let's find that one:

In [114]:

query = """
clause
/with/
v1:verse
&& ..
v2:verse
|| ..
v3:verse
&& ..
v1 < v2
v2 < v3
v1 < v3
/-/
"""
resultsX = A.search(query)

  0.91s 1 result

In [115]:

A.table(resultsX)
A.show(resultsX, baseTypes="clause_atom")

n	p	clause
1	1_Kings 8:41	וְגַם֙ אֶל־הַנָּכְרִ֔י אַתָּ֞ה תִּשְׁמַ֤ע הַשָּׁמַ֨יִם֙ מְכֹ֣ון שִׁבְתֶּ֔ךָ

result 1

1_Kings 8:41

verse

sentence 92

clause WXYq NA

phrase וְ

phrase גַם֙ אֶל־הַנָּכְרִ֔י

clause NmCl Attr

phrase אֲשֶׁ֛ר

phrase לֹא־

phrase מֵעַמְּךָ֥ יִשְׂרָאֵ֖ל

phrase ה֑וּא

clause WQt0 Coor

phrase וּ

phrase בָ֛א

phrase מֵאֶ֥רֶץ רְחֹוקָ֖ה

phrase לְמַ֥עַן שְׁמֶֽךָ׃

1_Kings 8:43

verse

sentence 92

clause WXYq NA

phrase אַתָּ֞ה

phrase תִּשְׁמַ֤ע

phrase הַשָּׁמַ֨יִם֙ מְכֹ֣ון שִׁבְתֶּ֔ךָ

sentence 94

clause WQt0 NA

phrase וְ

phrase עָשִׂ֕יתָ

phrase כְּכֹ֛ל

clause xYqX RgRc

phrase אֲשֶׁר־

phrase יִקְרָ֥א

phrase אֵלֶ֖יךָ

phrase הַנָּכְרִ֑י

sentence 95

clause xYqX NA

phrase לְמַ֣עַן

phrase יֵדְעוּן֩

phrase כָּל־עַמֵּ֨י הָאָ֜רֶץ

phrase אֶת־שְׁמֶ֗ךָ

clause InfC Adju

phrase לְיִרְאָ֤ה

phrase אֹֽתְךָ֙

phrase כְּעַמְּךָ֣ יִשְׂרָאֵ֔ל

clause InfC Coor

phrase וְ

phrase לָדַ֕עַת

clause XQtl Objc

phrase כִּי־

phrase שִׁמְךָ֣

phrase נִקְרָ֔א

phrase עַל־הַבַּ֥יִת הַזֶּ֖ה

clause xQt0 Attr

phrase אֲשֶׁ֥ר

phrase בָּנִֽיתִי׃
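The situation of this clause can be mimicked with a toy model: a gapped clause that shares slots with verses 41 and 43 but not with verse 42. The slot numbers are invented, and adjacency of verses is modelled here as "the last slot of one verse is followed immediately by the first slot of the other", which is an assumption about how `<:` behaves.

```python
# Invented verses with their slot ranges, and a gapped clause that skips verse 42.
toy_verses = {41: range(1, 6), 42: range(6, 11), 43: range(11, 16)}
toy_clause = [3, 4, 5, 11, 12]

# Verses that share at least one slot with the clause.
touched = sorted(
    v for v, slots in toy_verses.items() if any(s in slots for s in toy_clause)
)

# Is there a pair of touched verses in canonical order (like v1 < v2)?
some_pair_in_order = any(a < b for a in touched for b in touched)

# Is there a pair of touched verses that are adjacent (like v1 <: v2)?
some_adjacent_pair = any(
    toy_verses[a][-1] + 1 == toy_verses[b][0] for a in touched for b in touched
)

print(touched)
print(some_pair_in_order)
print(some_adjacent_pair)
```

The clause is a verse spanner under the first condition but not under the second, which accounts for the difference of one result.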

A more roundabout way to find the same clauses:

In [116]:

query = """
clause
    =: first:word
    last:word
    :=
v1:verse
    w1:word
v2:verse
    w2:word

first = w1
last = w2
v1 # v2
"""
results = A.search(query)

  1.04s 50 results

Some of these verse spanning clauses do not have mothers or are not mothers. Let's count the cases where two clauses are in a mother relation and at least one of them spans a verse.

We need two queries for that. These queries are almost identical. One retrieves the clause pairs where the mother crosses verse boundaries, and the other where the daughter does so.

But we are programmers. We do not have to repeat ourselves:

In [117]:

queryCommon = """
c1:clause
-mother> c2:clause

c3:clause
/with/
v1:verse
&& ..
v2:verse
&& ..
v1 < v2
/-/
"""

query1 = f"""
{queryCommon}
c1 = c3
"""
query2 = f"""
{queryCommon}
c2 = c3
"""

results1 = A.search(query1, silent=True)
results2 = A.search(query2, silent=True)
spannersByQuery = {(r[0], r[1]) for r in results1 + results2}
print(f"{len(spannersByQuery):>3} spanners are missing")
print(f"{discrepancy:>3} missing cases were detected before")
print(f"{discrepancy - len(spannersByQuery):>3} is the resulting disagreement")

 26 spanners are missing
 26 missing cases were detected before
  0 is the resulting disagreement
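The results of the two queries are merged into a set of pairs rather than simply added up, because a pair in which both clauses span verses turns up in both result lists. A toy illustration with invented node numbers, where each tuple stands for (c1, c2, c3):

```python
# Invented result tuples (c1, c2, c3).
toy_results1 = [(1, 2, 1), (3, 4, 3)]  # c1 is the verse spanner (c1 = c3)
toy_results2 = [(3, 4, 4), (5, 6, 6)]  # c2 is the verse spanner (c2 = c3)

# Keep only the pair (c1, c2); the set removes the pair that occurs twice.
toy_spanners = {(r[0], r[1]) for r in toy_results1 + toy_results2}

print(len(toy_results1) + len(toy_results2))
print(len(toy_spanners))
```

The pair (3, 4) is reported by both toy queries but is counted once in the set.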

We can find the mother clause pairs in which at least one member is verse spanning more easily by hand-coding:

Starting with the set of all mother pairs, we keep every pair in which at least one member is a verse spanner, i.e. is not contained in any verse.

In [118]:

spannersByHand = set()

for (c1, c2) in allMotherPairs:
    if not (L.u(c1, otype="verse") and L.u(c2, otype="verse")):
        spannersByHand.add((c1, c2))

len(spannersByHand)

Out[118]:

26

And, to be completely sure:

In [119]:

spannersByHand == spannersByQuery

Out[119]:

True

By custom sets

If we are content with the clauses that do not span verses, we can put them in a set and modify the queries by replacing clause with conclause, binding the right set to that name.

Here we go. In one cell we run the queries to get all pairs, the mother-daughter-in-separate-verses pairs, and the mother-daughter-in-same-verse pairs, and then we check that the numbers add up.

In [120]:

conClauses = {c for c in F.otype.s("clause") if L.u(c, otype="verse")}
customSets = dict(conclause=conClauses)

print("All pairs")
allPairs = A.search(
    """
conclause
-mother> conclause
""",
    sets=customSets,
)

print("Different verse pairs")
diffPairs = A.search(
    """
cm:conclause
-mother> cd:conclause

v1:verse
v2:verse
v1 # v2

cm ]] v1
cd ]] v2
""",
    sets=customSets,
)

print("Same verse pairs")
samePairs = A.search(
    """
cm:conclause
-mother> cd:conclause

v1:verse
v2:verse
v1 = v2

cm ]] v1
cd ]] v2
""",
    sets=customSets,
)

allPairSet = set(allPairs)
diffPairSet = {(r[0], r[1]) for r in diffPairs}
samePairSet = {(r[0], r[1]) for r in samePairs}

print(f"Intersection same-verse/different-verse pairs: {samePairSet & diffPairSet}")
print(
    f"All pairs is union of same-verse/different-verse pairs: {allPairSet == (samePairSet | diffPairSet)}"
)

All pairs
  0.07s 13891 results
Different verse pairs
  0.09s 710 results
Same verse pairs
  0.11s 13181 results
Intersection same-verse/different-verse pairs: set()
All pairs is union of same-verse/different-verse pairs: True
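The two printed checks amount to saying that the same-verse pairs and the different-verse pairs partition the set of all pairs. A small generic helper for that test could look as follows; `is_partition` and the toy pairs are illustrative and not part of the notebook.

```python
def is_partition(whole, *parts):
    """True if the parts are pairwise disjoint and together equal the whole."""
    union = set().union(*parts)
    pairwise_disjoint = sum(len(set(p)) for p in parts) == len(union)
    return pairwise_disjoint and union == set(whole)


# Invented pairs of clause nodes.
toy_all = {(1, 2), (3, 4), (5, 6)}
toy_same = {(1, 2), (3, 4)}
toy_diff = {(5, 6)}

print(is_partition(toy_all, toy_same, toy_diff))

# The counts reported above are consistent with a partition as well.
print(710 + 13181 == 13891)
```

With the real results one would pass `allPairSet`, `samePairSet` and `diffPairSet` to the same helper.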

Lessons

mix programming with composing queries;
a good way to do so is custom sets;
use programming for processing results;
find the balance between queries and hand-coding.

All steps

start: your first step in mastering the bible computationally
display: become an expert in creating pretty displays of your text structures
search: turbo charge your hand-coding with search templates

advanced sets relations quantifiers from MQL rough gaps

You have now finished the search tutorial.

Share the work!

export Excel: make tailor-made spreadsheets out of your results
share: draw in other people's data and let them use yours
export: export your dataset as an Emdros database
annotate: annotate plain text by means of other tools and import the annotations as TF features
map: map somebody else's annotations to a new version of the corpus
volumes: work with selected books only
trees: work with the BHSA data as syntax trees

CC-BY Dirk Roorda