To get started: consult the start tutorial.
We look at the many similarities between lines in the corpus.
There are ca. 25,000 lines in the corpus; comparing them all pairwise requires roughly 300 million comparisons. That is a costly operation: on this laptop it took six whole minutes.
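As a quick sanity check on that number (a self-contained arithmetic sketch, not part of the corpus code): the number of unordered pairs among n lines is n(n-1)/2.

```python
# Number of unordered line pairs to compare among n lines
n = 25000  # approximate number of lines in the corpus
pairs = n * (n - 1) // 2
print(f"{pairs:,}")  # 312,487,500, i.e. roughly 300 million
```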
The good news is that we have stored the outcome in an extra feature.
This feature is packaged in a TF data module, which we will load below by passing the mod parameter to the use() call.
%load_ext autoreload
%autoreload 2
import collections
from tf.app import use
A = use(
    "Nino-cunei/oldbabylonian",
    mod="Nino-cunei/oldbabylonian/parallels/tf:clone",
    hoist=globals(),
)
This is Text-Fabric 9.2.2 Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html 68 features found and 0 ignored
The new feature is sim and it is an edge feature. It annotates pairs of lines $(l, m)$ where $l$ and $m$ have similar content. The degree of similarity is a percentage (between 90 and 100), and this value is annotated onto the edges.
Here is an example:
exampleLine = F.otype.s("line")[0]
sisters = E.sim.b(exampleLine)
print(f"{len(sisters)} similar lines")
print("\n".join(f"{s[0]} with similarity {s[1]}" for s in sisters[0:10]))
A.table(tuple((s[0],) for s in sisters), end=10)
75 similar lines
235394 with similarity 100
235421 with similarity 100
235434 with similarity 100
235464 with similarity 100
235478 with similarity 100
235503 with similarity 100
235529 with similarity 100
235585 with similarity 100
235615 with similarity 100
235629 with similarity 100
n | p | line |
---|---|---|
1 | P510729 obverse:1 | a-na {d}suen-i-din-[nam] |
2 | P510730 obverse:1 | a-na {d}suen-i-din-nam |
3 | P510731 obverse:1 | a-na {d}suen-i-din-nam |
4 | P510732 obverse:1 | a-na {d}suen#-i-din-nam# |
5 | P497779 obverse:1 | a-na {d}suen#-[i]-din-nam |
6 | P510733 obverse:1 | [a-na] {d}[suen-i-din-nam] |
7 | P510734 obverse:1 | [a-na {d}suen-i-din-nam] |
8 | P510736 obverse:1 | a-na {d}suen-i-din-nam |
9 | P510737 obverse:1 | a-na {d}suen-i-din-nam# |
10 | P370926 obverse:1 | a-na {d}suen-i-din-nam |
Let's first find out the range of similarities:
minSim = None
maxSim = None

for ln in F.otype.s("line"):
    sisters = E.sim.f(ln)
    if not sisters:
        continue
    thisMin = min(s[1] for s in sisters)
    thisMax = max(s[1] for s in sisters)
    if minSim is None or thisMin < minSim:
        minSim = thisMin
    if maxSim is None or thisMax > maxSim:
        maxSim = thisMax

print(f"minimum similarity is {minSim:>3}")
print(f"maximum similarity is {maxSim:>3}")
minimum similarity is  90
maximum similarity is 100
We give a few examples of the least similar lines.
N.B.: when lines are less than 90% similar, they have not made it into the sim feature at all!
We can use a search template to find the lines with similarity exactly 90.
query = """
line
-sim=90> line
"""
In words: find a line connected via a sim edge with value 90 to another line.
results = A.search(query)
0.19s 722 results
Not very many indeed. It seems that lines are either very similar, or not similar at all.
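To see the full picture, one could count how many sim edges carry each similarity value. Here is a self-contained sketch of that counting with made-up edge data (the real computation would iterate E.sim.f over all lines; the edges dict below is purely hypothetical):

```python
import collections

# hypothetical (target, similarity) edges for a few lines
edges = {
    "l1": [("l2", 100), ("l3", 100), ("l4", 90)],
    "l2": [("l3", 100)],
    "l5": [("l6", 95)],
}

# count how many edges carry each similarity value
simDist = collections.Counter(
    v for targets in edges.values() for (_, v) in targets
)
print(sorted(simDist.items()))  # [(90, 1), (95, 1), (100, 3)]
```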
A.table(results, start=1, end=10)
n | p | line | line |
---|---|---|---|
1 | P509373 obverse:10 | _a-sza3 a-gar3_ na-ag-[ma-lum] _uru_ x x x{ki} | _a-[sza3 a-gar3_ na-ag]-ma-lum _uru gan2_ x x{ki} |
2 | P509374 obverse:4 | _{d}utu_ u3 _{d}marduk_ da-ri-[isz] _u4_-[mi x] | {d}utu# u3 {d}marduk# [da-ri-isz _u4_-mi-im] |
3 | P509374 obverse:4 | _{d}utu_ u3 _{d}marduk_ da-ri-[isz] _u4_-[mi x] | _{d}utu_ u3 _{d}marduk_ da-ri-isz u4-mi-im |
4 | P509374 obverse:4 | _{d}utu_ u3 _{d}marduk_ da-ri-[isz] _u4_-[mi x] | {d}utu u3 {d}[marduk da-ri-isz _u4_]-mi#-im |
5 | P509376 obverse:11 | it-ti-szu a-na _a-sza3_ ri-id-ma | [it-ti]-szu#-nu a-na _a-sza3_ ri-id-ma |
6 | P510527 obverse:4 | {d}utu u3 {d}marduk li-ba-al-li-t,u2-ka | {d}utu u3 {d}marduk li-ba-al-li-t,u2-ka!(KI) |
7 | P510527 obverse:4 | {d}utu u3 {d}marduk li-ba-al-li-t,u2-ka | {d}utu u3 {d}marduk tu-ba-al-li-t,u2-ka |
8 | P510529 obverse:4 | {d}utu u3 {d}marduk da-ri-isz _u4_-mi | {d}utu# u3 {d}marduk# [da-ri-isz _u4_-mi-im] |
9 | P510529 obverse:4 | {d}utu u3 {d}marduk da-ri-isz _u4_-mi | _{d}utu_ u3 _{d}marduk_ da-ri-isz u4-mi-im |
10 | P510529 obverse:4 | {d}utu u3 {d}marduk da-ri-isz _u4_-mi | {d}utu u3 {d}[marduk da-ri-isz _u4_]-mi#-im |
In case the ATF flags and clusters are a bit heavy on the eye, you can switch to a more pleasing rich text layout:
A.table(results, start=1, end=10, fmt="layout-orig-rich")
n | p | line | line |
---|---|---|---|
1 | P509373 obverse:10 | a-ša₃ a-gar₃ na-ag-ma-lum uru x x xki | a-ša₃ a-gar₃ na-ag-ma-lum uru gan₂ x xki |
2 | P509374 obverse:4 | dutu u₃ dmarduk da-ri-iš u₄-mi x | dutu u₃ dmarduk da-ri-iš u₄-mi-im |
3 | P509374 obverse:4 | dutu u₃ dmarduk da-ri-iš u₄-mi x | dutu u₃ dmarduk da-ri-iš u₄-mi-im |
4 | P509374 obverse:4 | dutu u₃ dmarduk da-ri-iš u₄-mi x | dutu u₃ dmarduk da-ri-iš u₄-mi-im |
5 | P509376 obverse:11 | it-ti-šu a-na a-ša₃ ri-id-ma | it-ti-šu-nu a-na a-ša₃ ri-id-ma |
6 | P510527 obverse:4 | dutu u₃ dmarduk li-ba-al-li-ṭu₂-ka | dutu u₃ dmarduk li-ba-al-li-ṭu₂-ka=⌈KI⌉ |
7 | P510527 obverse:4 | dutu u₃ dmarduk li-ba-al-li-ṭu₂-ka | dutu u₃ dmarduk tu-ba-al-li-ṭu₂-ka |
8 | P510529 obverse:4 | dutu u₃ dmarduk da-ri-iš u₄-mi | dutu u₃ dmarduk da-ri-iš u₄-mi-im |
9 | P510529 obverse:4 | dutu u₃ dmarduk da-ri-iš u₄-mi | dutu u₃ dmarduk da-ri-iš u₄-mi-im |
10 | P510529 obverse:4 | dutu u₃ dmarduk da-ri-iš u₄-mi | dutu u₃ dmarduk da-ri-iš u₄-mi-im |
Or even in cuneiform unicode:
A.table(results, start=1, end=10, fmt="layout-orig-unicode")
n | p | line | line |
---|---|---|---|
1 | P509373 obverse:10 | 𒀀𒊮 𒀀𒃼 𒈾𒀝𒈠𒈝 𒌷 x x x𒆠 | 𒀀𒊮 𒀀𒃼 𒈾𒀝𒈠𒈝 𒌷 𒃷 x x𒆠 |
2 | P509374 obverse:4 | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪 x | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪𒅎 |
3 | P509374 obverse:4 | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪 x | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪𒅎 |
4 | P509374 obverse:4 | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪 x | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪𒅎 |
5 | P509376 obverse:11 | 𒀉𒋾𒋗 𒀀𒈾 𒀀𒊮 𒊑𒀉𒈠 | 𒀉𒋾𒋗𒉡 𒀀𒈾 𒀀𒊮 𒊑𒀉𒈠 |
6 | P510527 obverse:4 | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒇷𒁀𒀠𒇷𒌅𒅗 | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒇷𒁀𒀠𒇷𒌅𒅗=⌈𒆠⌉ |
7 | P510527 obverse:4 | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒇷𒁀𒀠𒇷𒌅𒅗 | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒌅𒁀𒀠𒇷𒌅𒅗 |
8 | P510529 obverse:4 | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪 | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪𒅎 |
9 | P510529 obverse:4 | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪 | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪𒅎 |
10 | P510529 obverse:4 | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪 | 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪𒅎 |
From now on we forget about the exact level of similarity and just ask whether two lines are "similar", meaning that they have a high degree of similarity.
Before we try to find them, let's see if we can group the lines into clusters of similar lines.
CLUSTER_THRESHOLD = 0.5
def makeClusters():
    A.indent(reset=True)
    chunkSize = 1000
    b = 0  # lines processed in the current progress chunk
    j = 0  # total lines processed
    clusters = []
    for ln in F.otype.s("line"):
        j += 1
        b += 1
        if b == chunkSize:
            b = 0
            A.info(f"{j:>5} lines and {len(clusters):>5} clusters")
        # all lines similar to this line (edges in either direction)
        lSisters = {x[0] for x in E.sim.b(ln)}
        lAdded = False
        for cl in clusters:
            # greedy: join the first cluster of which this line is
            # similar to more than CLUSTER_THRESHOLD of the members
            if len(cl & lSisters) > CLUSTER_THRESHOLD * len(cl):
                cl.add(ln)
                lAdded = True
                break
        if not lAdded:
            clusters.append({ln})
    A.info(f"{j} lines and {len(clusters)} clusters")
    return clusters
clusters = makeClusters()
0.10s  1000 lines and   858 clusters
0.31s  2000 lines and  1691 clusters
0.65s  3000 lines and  2509 clusters
1.12s  4000 lines and  3338 clusters
1.72s  5000 lines and  4135 clusters
2.42s  6000 lines and  4885 clusters
3.21s  7000 lines and  5659 clusters
4.05s  8000 lines and  6358 clusters
5.07s  9000 lines and  7125 clusters
6.23s 10000 lines and  7894 clusters
7.49s 11000 lines and  8715 clusters
8.82s 12000 lines and  9450 clusters
10s 13000 lines and 10166 clusters
12s 14000 lines and 11011 clusters
14s 15000 lines and 11774 clusters
15s 16000 lines and 12592 clusters
17s 17000 lines and 13219 clusters
19s 18000 lines and 13893 clusters
21s 19000 lines and 14637 clusters
24s 20000 lines and 15380 clusters
26s 21000 lines and 16095 clusters
28s 22000 lines and 16799 clusters
31s 23000 lines and 17505 clusters
33s 24000 lines and 18235 clusters
36s 25000 lines and 19005 clusters
39s 26000 lines and 19722 clusters
41s 27000 lines and 20446 clusters
43s 27375 lines and 20735 clusters
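The greedy criterion above can be seen in isolation on toy data: an item joins the first existing cluster of which it is similar to more than half of the members (with CLUSTER_THRESHOLD = 0.5), otherwise it starts a new cluster. A self-contained sketch, where small integers stand in for line nodes and SISTERS is a made-up similarity relation:

```python
CLUSTER_THRESHOLD = 0.5

# hypothetical symmetric similarity relation on toy items
SISTERS = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2},
    4: {5},
    5: {4},
    6: set(),
}

def makeToyClusters(items, sistersOf):
    clusters = []
    for item in items:
        sisters = sistersOf.get(item, set())
        for cl in clusters:
            # join a cluster if similar to more than half of its members
            if len(cl & sisters) > CLUSTER_THRESHOLD * len(cl):
                cl.add(item)
                break
        else:
            clusters.append({item})
    return clusters

print(makeToyClusters(sorted(SISTERS), SISTERS))  # [{1, 2, 3}, {4, 5}, {6}]
```

Note that the result depends on the order in which items are visited: the clustering is greedy, not globally optimal, which is the trade-off that keeps it fast.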
What is the distribution of the clusters, in terms of how many similar lines they contain? We count them.
clusterSizes = collections.Counter()

for cl in clusters:
    clusterSizes[len(cl)] += 1

for (size, amount) in sorted(
    clusterSizes.items(),
    key=lambda x: (-x[0], x[1]),
):
    print(f"clusters of size {size:>4}: {amount:>5}")
clusters of size 1006:     1
clusters of size  129:     1
clusters of size  126:     1
clusters of size  125:     1
clusters of size   84:     1
clusters of size   78:     1
clusters of size   76:     1
clusters of size   74:     1
clusters of size   69:     1
clusters of size   64:     1
clusters of size   56:     1
clusters of size   52:     1
clusters of size   51:     1
clusters of size   49:     1
clusters of size   48:     1
clusters of size   45:     1
clusters of size   44:     1
clusters of size   43:     1
clusters of size   39:     1
clusters of size   35:     1
clusters of size   34:     1
clusters of size   32:     1
clusters of size   30:     3
clusters of size   29:     1
clusters of size   28:     4
clusters of size   27:     2
clusters of size   26:     2
clusters of size   25:     3
clusters of size   24:     1
clusters of size   23:     3
clusters of size   22:     3
clusters of size   20:     4
clusters of size   19:     2
clusters of size   18:     3
clusters of size   17:     3
clusters of size   16:     2
clusters of size   15:     4
clusters of size   14:     9
clusters of size   13:     7
clusters of size   12:     9
clusters of size   11:    12
clusters of size   10:    17
clusters of size    9:    14
clusters of size    8:    28
clusters of size    7:    30
clusters of size    6:    49
clusters of size    5:    58
clusters of size    4:   123
clusters of size    3:   276
clusters of size    2:   998
clusters of size    1: 19043
Let's investigate some interesting groups that lie in a few sweet spots of this distribution.
All chapters:
See the cookbook for recipes for small, concrete tasks.
CC-BY Dirk Roorda