To get started: consult start
Search in Text-Fabric is a template-based way of looking for structural patterns in your dataset.
It is inspired by the idea of topographic query.
Within Text-Fabric we have the unique possibility to combine the ease of formulating search templates for complicated patterns with the power of programmatically processing the results.
This notebook will show you how to get up and running.
Search is a powerful feature for a wide range of purposes.
Quite a bit of the implementation work has been dedicated to optimizing performance. Yet I do not pretend to have found optimal strategies for all possible search templates. Some search tasks may turn out to be somewhat costly or even very costly.
That being said, I think search might turn out helpful in many cases, especially by reducing the amount of hand-coding needed to work with special subsets of your data.
Search is as simple as saying (just an example)
results = A.search(template)
A.show(results)
See all ins and outs in the search template docs.
%load_ext autoreload
%autoreload 2
from tf.app import use
A = use("Nino-cunei/oldassyrian", hoist=globals())
This is Text-Fabric 9.2.0 Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html 67 features found and 0 ignored
We start with the simplest form of issuing a query. Let's look for the numerals with a repeat greater than 3. We also want to show the words in which they occur.
All work involved in searching takes place under the hood.
query = """
word
sign type=numeral repeat>3
"""
results = A.search(query)
0.73s 8136 results
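For comparison, the same pattern can be hand-coded as a nested loop. The sketch below runs on a tiny made-up data structure rather than on the corpus: the dictionaries, the identifiers and the function name numerals_with_repeat_over are assumptions for illustration only, and it says nothing about how Text-Fabric executes a template internally.

```python
# Toy stand-in for a corpus: words containing signs (all values are made up).
words = [
    {"id": "w1", "signs": [{"id": "s1", "type": "numeral", "repeat": 7}]},
    {
        "id": "w2",
        "signs": [
            {"id": "s2", "type": "reading", "repeat": None},
            {"id": "s3", "type": "numeral", "repeat": 2},
        ],
    },
    {
        "id": "w3",
        "signs": [
            {"id": "s4", "type": "numeral", "repeat": 4},
            {"id": "s5", "type": "reading", "repeat": None},
        ],
    },
]


def numerals_with_repeat_over(words, threshold):
    """Return (word, sign) pairs where the sign is a numeral with repeat > threshold."""
    return [
        (word["id"], sign["id"])
        for word in words
        for sign in word["signs"]
        if sign["type"] == "numeral"
        and sign["repeat"] is not None
        and sign["repeat"] > threshold
    ]


hits = numerals_with_repeat_over(words, 3)
print(hits)  # [('w1', 's1'), ('w3', 's4')]
```

The search template expresses the same nesting (a sign inside a word) and the same two conditions in three lines, without any loop.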
A.table(results, end=10)
n | p | word | sign |
---|---|---|---|
1 | P390626 obverse:9 | 7(u) | 7(u) |
2 | P390626 obverse:19 | 6(disz) | 6(disz) |
3 | P390626 obverse:21 | 7(disz) | 7(disz) |
4 | P390626 obverse:30 | 7(disz) | 7(disz) |
5 | P390626 reverse:11 | 7(disz)-sze3 | 7(disz)- |
6 | P390626 left:3 | 4(disz) | 4(disz) |
7 | P361245 obverse:1 | 7(disz) | 7(disz) |
8 | P361245 obverse:1 | 7(disz) | 7(disz) |
9 | P361245 obverse:1 | 5(disz) | 5(disz) |
10 | P361245 obverse:1 | 7(disz) | 7(disz) |
We can show them in unicode representation as well:
A.table(results, end=10, fmt="text-orig-unicode")
n | p | word | sign |
---|---|---|---|
1 | P390626 obverse:9 | 𒐒 | 𒐒 |
2 | P390626 obverse:19 | 𒐋 | 𒐋 |
3 | P390626 obverse:21 | 𒐌 | 𒐌 |
4 | P390626 obverse:30 | 𒐌 | 𒐌 |
5 | P390626 reverse:11 | 𒐌𒂠 | 𒐌 |
6 | P390626 left:3 | 𒐉 | 𒐉 |
7 | P361245 obverse:1 | 𒐌 | 𒐌 |
8 | P361245 obverse:1 | 𒐌 | 𒐌 |
9 | P361245 obverse:1 | 𒐊 | 𒐊 |
10 | P361245 obverse:1 | 𒐌 | 𒐌 |
The hyperlinks all take us to the CDLI archival page of the document (tablet) in question.
Note that we can choose start and/or end points in the results list.
A.table(results, start=500, end=503, fmt="text-orig-rich")
n | p | word | sign |
---|---|---|---|
500 | P360684 obverse:4 | 4(diš) | 4(diš) |
501 | P360684 obverse:5 | 4(u) | 4(u) |
502 | P360684 obverse:5 | 9(diš) | 9(diš) |
503 | P360684 obverse:11 | 4(diš) | 4(diš) |
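Judging from the tables above, start and end count results from 1 and include both end points. A minimal sketch of that numbering on a plain Python list follows; the helper select is hypothetical and not part of the Text-Fabric API.

```python
def select(results, start=1, end=None):
    """Return (n, result) pairs for the 1-based, inclusive range start..end."""
    if end is None:
        end = len(results)
    # Python slices are 0-based and exclude the end point, hence start - 1.
    return list(enumerate(results[start - 1 : end], start=start))


toy = ["a", "b", "c", "d", "e", "f"]
picked = select(toy, start=2, end=4)
print(picked)  # [(2, 'b'), (3, 'c'), (4, 'd')]
```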
We can show the results more fully with show().
That gives us pretty displays of tablet lines with the results highlighted.
A.show(results, end=3)
result 1
result 2
result 3
There are two fundamentally different ways of presenting the results: condensed and uncondensed.
In uncondensed view, all results are listed individually. You can keep track of which parts belong to which results. The display can become unwieldy.
This is the default view, because it is the straightest, most logical, answer to your query.
In condensed view all nodes of all results are grouped in containers first (e.g. lines), and then presented container by container. You lose the information of what parts belong to what result.
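The idea of condensing can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not the library's implementation: the node numbers are made up, and the mapping container_of (from a node to the container it sits in) is a hypothetical stand-in for what Text-Fabric derives from the corpus.

```python
from collections import defaultdict


def condense(results, container_of):
    """Group all nodes of all result tuples by the container they belong to."""
    grouped = defaultdict(set)
    for result in results:
        for node in result:
            container = container_of.get(node)
            if container is not None:
                grouped[container].add(node)
    return {c: sorted(nodes) for c, nodes in sorted(grouped.items())}


# Hypothetical nodes: signs 1..5 sitting in lines 100 and 101.
container_of = {1: 100, 2: 100, 3: 101, 4: 101, 5: 101}
results = [(1, 2), (3, 4), (3, 5)]

condensed = condense(results, container_of)
print(condensed)  # {100: [1, 2], 101: [3, 4, 5]}
```

Three results collapse into two containers. From the entry for line 101 alone you can no longer tell that sign 3 took part in two different results.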
As an example of the difference, we look for all numerals in a single tablet.
query = """
% we choose a tablet with several numerals
document pnumber=P357880
sign type=numeral
"""
Note that you can have comments in a search template. Comment lines start with a %.
results = A.search(query)
A.table(results, end=10)
0.80s 10 results
n | p | document | sign |
---|---|---|---|
1 | P357880 obverse:1 | P357880 | 1(u) |
2 | P357880 obverse:1 | P357880 | 4(disz) |
3 | P357880 obverse:3 | P357880 | 1/3(disz) |
4 | P357880 obverse:6 | P357880 | 1(u) |
5 | P357880 obverse:6 | P357880 | 5(disz) |
6 | P357880 obverse:9 | P357880 | 1(u) |
7 | P357880 obverse:9 | P357880 | 3(disz) |
8 | P357880 obverse:12 | P357880 | 1/3(disz) |
9 | P357880 obverse:12 | P357880 | 2(disz) |
10 | P357880 obverse:12 | P357880 | 1/2(disz) |
Let's expand the results display:
A.show(results, end=2, skipCols="1")
result 1
result 2
As you see, the results are listed per result tuple, even if they all occur in the same line. This way you can keep track of what exactly belongs to each result.
Now in condensed mode:
A.show(results, condensed=True, withNodes=True)
line 1
line 2
line 3
line 4
line 5
The last line has two results, and both results are highlighted in the same line display.
We can modify the container in which we see our results.
By default, it is line, but we can make it face as well:
A.show(results, end=2, condensed=True, condenseType="face")
face 1
We now see the displays of two faces, one with two numerals in it and one with three.
Let us make a new search where we look for three different things in the same line.
We can apply different highlight colours to different parts of the result. The signs in the pair are member 0 and 1 of the result tuples. The members that we do not map, will not be highlighted. The members that we map to the empty string will be highlighted with the default color.
NB: Choose your colours from the CSS specification.
query = """
line
sign missing=1
sign question=1
sign damage=1
"""
results = A.search(query)
A.table(results, end=10, baseTypes="sign")
2.24s 2370 results
n | p | line | sign | sign | sign |
---|---|---|---|---|---|
1 | P360986 reverse:6' | [...] _ma#-na_ 5(disz) _gin2 ku3-babbar_ szu?-ku-bu?-um [...] | [...] | szu?- | _ma#- |
2 | P360986 reverse:6' | [...] _ma#-na_ 5(disz) _gin2 ku3-babbar_ szu?-ku-bu?-um [...] | [...] | bu?- | _ma#- |
3 | P360986 reverse:6' | [...] _ma#-na_ 5(disz) _gin2 ku3-babbar_ szu?-ku-bu?-um [...] | [...] | szu?- | _ma#- |
4 | P360986 reverse:6' | [...] _ma#-na_ 5(disz) _gin2 ku3-babbar_ szu?-ku-bu?-um [...] | [...] | bu?- | _ma#- |
5 | P361434 obverse:9 | [...] zi#?-tum? | [...] | zi#?- | zi#?- |
6 | P361434 obverse:9 | [...] zi#?-tum? | [...] | tum? | zi#?- |
7 | P360598 obverse:4 | a-ti2-ma a-na 6(disz) _gin2 ku3-babbar#_ [szi2-im?] | [szi2- | im?] | babbar#_ |
8 | P360598 obverse:4 | a-ti2-ma a-na 6(disz) _gin2 ku3-babbar#_ [szi2-im?] | im?] | im?] | babbar#_ |
9 | P360598 edge:2 | szi2-im#? [...] | [...] | im#? | im#? |
10 | P360678 reverse:19 | pu-da-dim isz!?-qu2#-[lu te2-er-ta-ku-nu] | [lu | isz!?- | qu2#- |
A.table(
    results,
    end=10,
    baseTypes="sign",
    colorMap={0: "", 2: "cyan", 3: "magenta", 4: "lightsalmon"},
)
n | p | line | sign | sign | sign |
---|---|---|---|---|---|
1 | P360986 reverse:6' | [...] _ma#-na_ 5(disz) _gin2 ku3-babbar_ szu?-ku-bu?-um [...] | [...] | szu?- | _ma#- |
2 | P360986 reverse:6' | [...] _ma#-na_ 5(disz) _gin2 ku3-babbar_ szu?-ku-bu?-um [...] | [...] | bu?- | _ma#- |
3 | P360986 reverse:6' | [...] _ma#-na_ 5(disz) _gin2 ku3-babbar_ szu?-ku-bu?-um [...] | [...] | szu?- | _ma#- |
4 | P360986 reverse:6' | [...] _ma#-na_ 5(disz) _gin2 ku3-babbar_ szu?-ku-bu?-um [...] | [...] | bu?- | _ma#- |
5 | P361434 obverse:9 | [...] zi#?-tum? | [...] | zi#?- | zi#?- |
6 | P361434 obverse:9 | [...] zi#?-tum? | [...] | tum? | zi#?- |
7 | P360598 obverse:4 | a-ti2-ma a-na 6(disz) _gin2 ku3-babbar#_ [szi2-im?] | [szi2- | im?] | babbar#_ |
8 | P360598 obverse:4 | a-ti2-ma a-na 6(disz) _gin2 ku3-babbar#_ [szi2-im?] | im?] | im?] | babbar#_ |
9 | P360598 edge:2 | szi2-im#? [...] | [...] | im#? | im#? |
10 | P360678 reverse:19 | pu-da-dim isz!?-qu2#-[lu te2-er-ta-ku-nu] | [lu | isz!?- | qu2#- |
A.show(results, end=5, colorMap={0: "", 2: "cyan", 3: "magenta", 4: "lightsalmon"})
result 1
result 2
result 3
result 4
result 5
Color mapping works best for uncondensed results. If you condense results, some nodes may occupy different positions in different results. It is unpredictable which color will be used for such nodes:
A.show(
    results,
    condensed=True,
    end=5,
    colorMap={0: "", 2: "cyan", 3: "magenta", 4: "lightsalmon"},
)
line 1
line 2
line 3
line 4
line 5
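The unpredictability can be made concrete with a small sketch. It assumes that colours are assigned by position in the result tuple and that condensing merges the nodes of several results into one display. The position numbering used here (counting from 0) is a choice of this sketch only, so consult the Text-Fabric documentation for the convention that colorMap actually follows. The function name and the node numbers are made up.

```python
def conflicting_colors(results, color_map):
    """Return the nodes that would receive more than one colour across results."""
    colors_by_node = {}
    for result in results:
        for position, node in enumerate(result):
            if position in color_map:
                colors_by_node.setdefault(node, set()).add(color_map[position])
    return {
        node: colors for node, colors in colors_by_node.items() if len(colors) > 1
    }


# Hypothetical results in line 100: sign 2 is at position 2 in the first
# result and at position 1 in the second.
results = [(100, 1, 2), (100, 2, 3)]
color_map = {1: "cyan", 2: "magenta"}

conflicts = conflicting_colors(results, color_map)
print(sorted(conflicts))  # [2]
```

In an uncondensed display each result is drawn separately, so sign 2 simply gets a different colour in each result. In a condensed display there is only one drawing of line 100, and sign 2 can carry only one of its two candidate colours.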
You can specify to what container you want to condense. By default, everything is condensed to lines.
Let's change that to faces.
Note that the end parameter counts the number of faces now.
A.show(
    results,
    end=2,
    condensed=True,
    condenseType="face",
    colorMap={0: "", 2: "cyan", 3: "magenta", 4: "lightsalmon"},
)
face 1
face 2
You can stipulate an order on the things in your template. You only have to put a relational operator between them. Say we want only results where the damaged sign follows the missing sign.
query = """
line
sign question=1
sign missing=1
< sign damage=1
"""
results = A.search(query)
A.table(results, end=10, baseTypes="sign")
2.22s 1137 results
n | p | line | sign | sign | sign |
---|---|---|---|---|---|
1 | P360986 reverse:6' | [...] _ma#-na_ 5(disz) _gin2 ku3-babbar_ szu?-ku-bu?-um [...] | szu?- | [...] | _ma#- |
2 | P360986 reverse:6' | [...] _ma#-na_ 5(disz) _gin2 ku3-babbar_ szu?-ku-bu?-um [...] | bu?- | [...] | _ma#- |
3 | P361434 obverse:9 | [...] zi#?-tum? | zi#?- | [...] | zi#?- |
4 | P361434 obverse:9 | [...] zi#?-tum? | tum? | [...] | zi#?- |
5 | P360705 reverse:18 | en-um-a-szur u3 a-szur#-[ma]-lik#? | lik#? | [ma]- | lik#? |
6 | P361421 :11' | [...] be? a x ni x sza#-da a-bi4-sa2-mu-tim | be? | [...] | sza#- |
7 | P361244 left:1 | [...] zi#? zi _la2?_ ku-ni | zi#? | [...] | zi#? |
8 | P361244 left:1 | [...] zi#? zi _la2?_ ku-ni | _la2?_ | [...] | zi#? |
9 | P361260 reverse:14 | u2? x a x [x] _ku3#-babbar_ 1(disz) _gin2#_ [...] | u2? | [x] | _ku3#- |
10 | P361260 reverse:14 | u2? x a x [x] _ku3#-babbar_ 1(disz) _gin2#_ [...] | u2? | [x] | _gin2#_ |
We can also require the things to be adjacent.
query = """
line
sign question=1
sign missing=1
<: sign damage=1
"""
results = A.search(query)
A.table(results, end=10, baseTypes="sign")
A.show(
    results,
    end=10,
    baseTypes="sign",
    colorMap={0: "", 2: "cyan", 3: "magenta", 4: "lightsalmon"},
)
2.18s 425 results
n | p | line | sign | sign | sign |
---|---|---|---|---|---|
1 | P360986 reverse:6' | [...] _ma#-na_ 5(disz) _gin2 ku3-babbar_ szu?-ku-bu?-um [...] | szu?- | [...] | _ma#- |
2 | P360986 reverse:6' | [...] _ma#-na_ 5(disz) _gin2 ku3-babbar_ szu?-ku-bu?-um [...] | bu?- | [...] | _ma#- |
3 | P361434 obverse:9 | [...] zi#?-tum? | zi#?- | [...] | zi#?- |
4 | P361434 obverse:9 | [...] zi#?-tum? | tum? | [...] | zi#?- |
5 | P360705 reverse:18 | en-um-a-szur u3 a-szur#-[ma]-lik#? | lik#? | [ma]- | lik#? |
6 | P361244 left:1 | [...] zi#? zi _la2?_ ku-ni | zi#? | [...] | zi#? |
7 | P361244 left:1 | [...] zi#? zi _la2?_ ku-ni | _la2?_ | [...] | zi#? |
8 | P361260 reverse:14 | u2? x a x [x] _ku3#-babbar_ 1(disz) _gin2#_ [...] | u2? | [x] | _ku3#- |
9 | P361262 obverse:24 | _szunigin_ 5(disz)? [ma]-na# 1(u) 2(disz)# [5/6(disz) _gin2_] | 5(disz)? | [ma]- | na# |
10 | P361033 obverse:10' | [u2]-la2#? i-ta-mu-u2 [...] | la2#? | [u2]- | la2#? |
result 1
result 2
result 3
result 4
result 5
result 6
result 7
result 8
result 9
result 10
Finally, we make the three things fully adjacent in fixed order:
query = """
line
sign question=1
<: sign missing=1
<: sign damage=1
"""
results = A.search(query)
A.table(results, end=10, baseTypes="sign")
A.show(results, end=5, colorMap={0: "", 2: "cyan", 3: "magenta", 4: "lightsalmon"})
2.16s 29 results
n | p | line | sign | sign | sign |
---|---|---|---|---|---|
1 | P361262 obverse:24 | _szunigin_ 5(disz)? [ma]-na# 1(u) 2(disz)# [5/6(disz) _gin2_] | 5(disz)? | [ma]- | na# |
2 | P290310 edge:1 | ni-il5-qe2!?-[ma] szi2#-ti2 _uruda_ | qe2!?- | [ma] | szi2#- |
3 | P297363 reverse:9 | la2 tu3#?-[sza]-ar# / a-hi a-ta | tu3#?- | [sza]- | ar# |
4 | P358697 reverse:10 | u2-sze2-ba#?-[la2]-ni# ki a-bu-um#-[x] | ba#?- | [la2]- | ni# |
5 | P358999 obverse:5 | szu?-ma? [...] nim# | ma? | [...] | nim# |
6 | P273584 reverse:7 | lu-be-er-szu / u2? [hu]-la#-li | u2? | [hu]- | la#- |
7 | P359580 reverse:7 | a-la2-qe2-[am? szu]-ma# [_la2?_] ki-am | [am? | szu]- | ma# |
8 | P359585 reverse:7 | la#? [i]-qi2#-pu a-di2 _ku3-babbar_ [e]-ru-bu | la#? | [i]- | qi2#- |
9 | P359667 obverse:8 | e?-[ru]-ba#-am | e?- | [ru]- | ba#- |
10 | P359713 obverse:32 | i3-li2-[dan? x] _ma#-na ku3-babbar_ isz-ti2 | [dan? | x] | _ma#- |
result 1
result 2
result 3
result 4
result 5
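The effect of the relational operators in the last three templates can be pictured on plain position numbers. The sketch below is a mental model under stated assumptions, not a description of the search engine: it treats each sign as an integer position in its line, reads < as "comes earlier" and <: as "comes immediately before", and uses made-up positions and a hypothetical helper find.

```python
from itertools import product


def anything(a, b):
    """No constraint between two positions."""
    return True


def before(a, b):
    """Model of `<` on positions: a comes earlier than b."""
    return a < b


def adjacent_before(a, b):
    """Model of `<:` on positions: a comes immediately before b."""
    return a + 1 == b


def find(question, missing, damage, q_m=anything, m_d=anything):
    """All (question, missing, damage) triples that satisfy both relations."""
    return [
        (q, m, d)
        for q, m, d in product(sorted(question), sorted(missing), sorted(damage))
        if q_m(q, m) and m_d(m, d)
    ]


# Made-up sign positions within one line.
question, missing, damage = {3, 7}, {4, 9}, {5, 10}

unordered = find(question, missing, damage)
ordered = find(question, missing, damage, m_d=before)
adjacent = find(question, missing, damage, m_d=adjacent_before)
strict = find(question, missing, damage, q_m=adjacent_before, m_d=adjacent_before)

print(len(unordered), len(ordered), len(adjacent), strict)  # 8 6 4 [(3, 4, 5)]
```

Each extra constraint removes candidates, which matches the shrinking result counts reported above.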
We would like to see the original ATF and the flags for signs.
The way to do that is to call A.displaySetup() with the extraFeatures parameter first.
We concentrate on one specific result.
A.displaySetup(extraFeatures="atf flags")
A.show(
    results,
    start=4,
    end=4,
    baseTypes="sign",
    colorMap={0: "", 2: "cyan", 3: "magenta", 4: "lightsalmon"},
)
result 4
The features without meaningful values have been left out. We can also change that by passing a set of values we think are not meaningful. The default set is
{None, 'NA', 'none', 'unknown'}
A.displaySetup(noneValues=set())
A.show(
    results,
    start=4,
    end=4,
    baseTypes="sign",
    colorMap={0: "", 2: "cyan", 3: "magenta", 4: "lightsalmon"},
)
result 4
This makes clear that it is convenient to keep None in the noneValues:
A.displaySetup(noneValues={None})
A.show(
    results,
    start=4,
    end=4,
    baseTypes="sign",
    colorMap={0: "", 2: "cyan", 3: "magenta", 4: "lightsalmon"},
)
result 4
We can even choose to suppress other values, e.g. the value 1.
That will remove all the features such as question and missing.
A.displaySetup(noneValues={None, "NA", "unknown", 1})
A.show(
    results,
    start=4,
    end=4,
    baseTypes="sign",
    colorMap={0: "", 2: "cyan", 3: "magenta", 4: "lightsalmon"},
)
result 4
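What noneValues does to a display can be sketched as a simple filter over feature values. The feature dictionary below is made up for illustration, and visible_features is a hypothetical helper, not a Text-Fabric function; only the default set of values is taken from the text above.

```python
DEFAULT_NONE_VALUES = frozenset({None, "NA", "none", "unknown"})


def visible_features(features, none_values=DEFAULT_NONE_VALUES):
    """Keep only the features whose value is not considered meaningless."""
    return {
        name: value for name, value in features.items() if value not in none_values
    }


# Made-up feature values for a single sign.
sign = {
    "atf": "ba#?-",
    "flags": "#?",
    "question": 1,
    "damage": 1,
    "missing": None,
    "type": "reading",
}

default_view = visible_features(sign)
everything = visible_features(sign, none_values=set())
without_ones = visible_features(sign, none_values={None, "NA", "unknown", 1})

print(sorted(default_view))  # ['atf', 'damage', 'flags', 'question', 'type']
print(sorted(without_ones))  # ['atf', 'flags', 'type']
```

With an empty set nothing is filtered, so the feature missing shows up with the value None, which is why keeping None in the set is convenient.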
In the rest of the notebook we stick to our normal setup, so we reset the extra features.
A.displayReset()
A.show(
    results,
    start=4,
    end=4,
    baseTypes="sign",
    colorMap={0: "", 2: "cyan", 3: "magenta", 4: "lightsalmon"},
)
result 4
In earlier displays we saw the types of signs, because the query mentioned it.
Suppose we want to display the type here as well. Then we can modify the query by mentioning the feature type.
But we do not want to impose extra limitations, so we say type*, meaning: no conditions on type whatsoever.
query = """
line
sign question=1 type*
<: sign missing=1
<: sign damage=1
"""
results = A.search(query)
A.show(
    results, start=4, end=4, colorMap={0: "", 2: "cyan", 3: "magenta", 4: "lightsalmon"}
)
2.19s 29 results
result 4
We do not see the features, because they are sign features, and our display stops at the word level.
But we can improve on that:
A.show(
    results,
    start=4,
    end=4,
    baseTypes="sign",
    colorMap={0: "", 2: "cyan", 3: "magenta", 4: "lightsalmon"},
)
result 4
So far we have shown the results of searches.
But you can also construct your own tuples and show them.
Whereas you can use search to get a pretty good approximation of what you want, most of the time you do not arrive precisely at your destination.
Here is an example where we use search to come close, and then work our way to produce the end result.
We look for lines that have more missing signs than damaged signs.
In our search templates we cannot formulate that a feature has different values on two nodes in the template. We could spell out all possible combinations of values and make a search template for each of them, but that is needlessly complex.
Let's first use search to find all lines containing missing and damaged signs.
query = """
line
sign missing
sign damage
"""
results = A.search(query)
1.13s 17685 results
Now the hand coding begins. We are going to extract the tuples we want.
lines = {}
for (l, m, d) in results:
    lines.setdefault(l, (set(), set()))
    lines[l][0].add(m)
    lines[l][1].add(d)
print(f"{len(lines)} lines")
7297 lines
Now we have all lines with both missing and damaged signs, without duplicates.
For each line we have a set with its missing signs and one with its damaged signs.
We filter in order to retain the lines with more missing than damaged signs. We put all missing signs in one big set and all damaged signs in one big set.
answer = []
missing = set()
damage = set()
for (l, (m, d)) in lines.items():
    if len(m) > len(d):
        answer.append((l, *m, *d))
        missing |= m
        damage |= d
len(answer)
3547
answer[0]
(865517, 1875, 1871, 1873)
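To see the shape of these tuples without loading the corpus, here is the same two-step filtering run on a handful of made-up results. The node numbers and the function name lines_with_more_missing are hypothetical; the signs are sorted only to make the toy output deterministic.

```python
def lines_with_more_missing(results):
    """Collect per line its missing and damaged signs; keep lines with more missing."""
    per_line = {}
    for line, m, d in results:
        missing_signs, damaged_signs = per_line.setdefault(line, (set(), set()))
        missing_signs.add(m)
        damaged_signs.add(d)
    return [
        (line, *sorted(missing_signs), *sorted(damaged_signs))
        for line, (missing_signs, damaged_signs) in per_line.items()
        if len(missing_signs) > len(damaged_signs)
    ]


# Made-up (line, missing sign, damaged sign) results.
toy_results = [
    (900, 1, 5),
    (900, 2, 5),
    (901, 3, 6),
    (901, 3, 7),
    (902, 8, 9),
    (902, 10, 9),
    (902, 11, 9),
]

toy_answer = lines_with_more_missing(toy_results)
print(toy_answer)  # [(900, 1, 2, 5), (902, 8, 10, 11, 9)]
```

Each tuple starts with the line, followed by all of its missing signs and then all of its damaged signs. The tuples therefore differ in length, so a table of them has a varying number of sign columns.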
We are going to make a dictionary of highlights: one color for the missing signs and one for the damaged.
highlights = {}
colorM = "lightsalmon"
colorD = "mediumaquamarine"
for s in missing:
    highlights[s] = colorM
for s in damage:
    highlights[s] = colorD
And now we can show them:
A.table(answer, start=1, end=10, highlights=highlights)
n | p | line | sign | sign | sign | ||||
---|---|---|---|---|---|---|---|---|---|
1 | P360977 reverse:12 | x-[x]-szi2 i#-x-[...] | [...] | [x]- | i#- | ||||
2 | P360980 obverse:3 | [qi2-bi-ma um-ma ma]-num#-ki-{d}iszkur-ma | um- | ma | ma]- | [qi2- | bi- | ma | num#- |
3 | P360980 obverse:10 | [um-ma] a#-na-ku-ma t,up-pa2-am | [um- | ma] | a#- | ||||
4 | P360982 obverse:3 | me-er-u2-tim il5#-[qe2-u2] | [qe2- | u2] | il5#- | ||||
5 | P360986 reverse:3' | [...]-ba-ni 2/3(disz) _ma-na_ 5(disz) _gin2#_ [...] | [...] | [...]- | _gin2#_ | ||||
6 | P360986 reverse:6' | [...] _ma#-na_ 5(disz) _gin2 ku3-babbar_ szu?-ku-bu?-um [...] | [...] | [...] | _ma#- | ||||
7 | P360989 obverse:2' | [...] szi2#-ip-ru-um szu-[...] | [...] | [...] | szi2#- | ||||
8 | P361431 envelope - obverse:2 | [wa-bar]-tum# sza sza-la2-ti2-wa-ar | [wa- | bar]- | tum# | ||||
9 | P361431 envelope - obverse:3 | [di2-nam i]-di2#-ma i-na | [di2- | nam | i]- | di2#- | |||
10 | P361431 envelope - obverse:5 | [sza ba]-al#-t,u3-a u2 {d}utu-e-nam a-na | [sza | ba]- | al#- |
As you see, you have total control.
See the cookbook for recipes for small, concrete tasks.
CC-BY Dirk Roorda