To get started: consult start
We spot the many similarities between lines in the corpus.
There are about 100,000 lines in the corpus. Comparing each line with every other line takes about 5 billion comparisons, which is a costly operation: on this laptop it took a full 96 minutes.
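The number of comparisons follows from counting unordered pairs: n lines give n(n-1)/2 pairs. Here is a minimal sketch of that arithmetic; the function name `pair_count` is ours and not part of Text-Fabric.

```python
def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# The rounded figure used in the text.
print(f"{pair_count(100_000):,} comparisons for 100,000 lines")

# The exact line count reported by the clustering log further down.
print(f"{pair_count(109_860):,} comparisons for 109,860 lines")
```

With the rounded figure this gives just under 5 billion pairs; with the exact line count it is closer to 6 billion.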
The good news is that we have stored the outcome in an extra feature.
This feature is packaged in a TF data module, which we load below by passing the parameter mod to the use() statement.
%load_ext autoreload
%autoreload 2
import collections
from tf.app import use
A = use(
    "Nino-cunei/oldassyrian",
    mod="Nino-cunei/oldassyrian/parallels/tf:clone",
    hoist=globals(),
)
This is Text-Fabric 9.2.2
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
68 features found and 0 ignored
The new feature is sim, and it is an edge feature. It annotates pairs of lines $(l, m)$ where $l$ and $m$ have similar content. The degree of similarity is a percentage (between 90 and 100), and this value is annotated onto the edges.
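To make the idea of a valued edge feature concrete, here is a toy model in plain Python. It is a sketch for illustration only, not Text-Fabric's implementation. The class `ToyEdgeFeature` and its data are hypothetical; the method names assume that, as in the Text-Fabric edge API, `f` follows edges from a node, `t` follows edges to a node, and `b` looks in both directions.

```python
from collections import defaultdict


class ToyEdgeFeature:
    """Toy stand-in for a valued edge feature (not Text-Fabric's own code)."""

    def __init__(self, edges):
        # edges: iterable of (from_node, to_node, value) triples
        self._out = defaultdict(dict)
        self._in = defaultdict(dict)
        for (n, m, value) in edges:
            self._out[n][m] = value
            self._in[m][n] = value

    def f(self, n):
        """Edges going out from n, as sorted (node, value) pairs."""
        return tuple(sorted(self._out.get(n, {}).items()))

    def t(self, n):
        """Edges coming in to n, as sorted (node, value) pairs."""
        return tuple(sorted(self._in.get(n, {}).items()))

    def b(self, n):
        """Edges in both directions, as sorted (node, value) pairs."""
        both = dict(self._in.get(n, {}))
        both.update(self._out.get(n, {}))
        return tuple(sorted(both.items()))


# Hypothetical line nodes 1..4 with similarity percentages on the edges.
sim = ToyEdgeFeature([(1, 2, 100), (1, 3, 92), (4, 1, 95)])
print(sim.f(1))  # only the edges that start at node 1
print(sim.b(1))  # edges that start or end at node 1
```

In this toy model each pair is stored in one direction only, which is why looking in both directions finds more similar lines than following outgoing edges alone.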
Here is an example:
exampleLine = F.otype.s("line")[1000]
sisters = E.sim.b(exampleLine)
print(f"{len(sisters)} similar lines")
print("\n".join(f"{s[0]} with similarity {s[1]}" for s in sisters[0:10]))
A.table(tuple((s[0],) for s in sisters), end=10)
36 similar lines
866817 with similarity 100
874947 with similarity 100
879396 with similarity 100
894252 with similarity 100
904016 with similarity 100
904905 with similarity 100
907962 with similarity 100
910310 with similarity 100
914251 with similarity 100
914820 with similarity 100
n | p | line |
---|---|---|
1 | P360493 reverse:3 | ma-al-u2-tim |
2 | P390646 envelope - reverse:6 | [ma]-al-u2-tim |
3 | P358334 obverse:12 | ma-al-u2-tim |
4 | P358498 reverse:2 | ma-al-u2-tim |
5 | P358744 obverse:13 | ma-al-u2-tim |
6 | P358783 reverse:4 | ma-al-u2-tim |
7 | P358910 obverse:11 | ma-al-u2-tim |
8 | P273595 reverse:8 | ma-al-u2-tim |
9 | P359405 reverse:2 | ma-al-u2-tim |
10 | P359420 envelope - reverse:10 | ma-al-u2-tim |
Let's first find out the range of similarities:
minSim = None
maxSim = None

for ln in F.otype.s("line"):
    sisters = E.sim.f(ln)
    if not sisters:
        continue
    thisMin = min(s[1] for s in sisters)
    thisMax = max(s[1] for s in sisters)
    if minSim is None or thisMin < minSim:
        minSim = thisMin
    if maxSim is None or thisMax > maxSim:
        maxSim = thisMax

print(f"minimum similarity is {minSim:>3}")
print(f"maximum similarity is {maxSim:>3}")
minimum similarity is 90
maximum similarity is 100
We give a few examples of the least similar lines.
N.B. When lines are less than 90% similar, they have not made it into the sim feature!
We can use a search template to get the 90% lines.
query = """
line
  -sim=90> line
"""
In words: find a line connected via a sim-edge with value 90 to another line.
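In plain Python terms, the template selects the pairs whose edge carries exactly the requested value. The sketch below shows that selection on a handful of made-up edges; `toy_edges` and `pairs_with_similarity` are hypothetical names, and the search engine itself works differently under the hood.

```python
# Hypothetical toy edges: (line, similar line, similarity percentage).
toy_edges = [
    (10, 11, 100),
    (10, 12, 90),
    (13, 14, 95),
    (15, 16, 90),
]


def pairs_with_similarity(edges, value):
    """Return the (line, line) pairs whose edge value equals `value`."""
    return [(n, m) for (n, m, sim) in edges if sim == value]


results_90 = pairs_with_similarity(toy_edges, 90)
print(results_90)
```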
results = A.search(query)
0.35s 3784 results
Not very much indeed. It seems that lines are either very similar, or not so similar at all.
A.table(results, start=1, end=10)
n | p | line | line |
---|---|---|---|
1 | P361250 obverse:3 | i-a-ti2 a-na a-szur-i-mi3-ti2 / qi2-bi-ma / um-ma | a-na a-szur-i-[mi3-ti2 qi2-bi-ma] |
2 | P361250 obverse:3 | i-a-ti2 a-na a-szur-i-mi3-ti2 / qi2-bi-ma / um-ma | qi2-bi-ma um-ma a-szur-i-mi3-ti2 |
3 | P361250 obverse:3 | i-a-ti2 a-na a-szur-i-mi3-ti2 / qi2-bi-ma / um-ma | qi2-bi-ma um-ma a-szur-i-mi3-ti2-ma |
4 | P361250 obverse:3 | i-a-ti2 a-na a-szur-i-mi3-ti2 / qi2-bi-ma / um-ma | qi2-bi-ma um-ma a-szur-i-mi3-ti2-ma# |
5 | P360467 envelope - obverse:4 | [_kiszib3_ i]-ri#-szi2-im _dumu_ a-mur-{d}utu | i-ri-szi2-im _dumu_ a-mur-{d}utu |
6 | P360467 envelope - obverse:5 | [8(disz) ma]-na _ku3-babbar_ s,a-ru-pa2-am | 1/3(disz) _ma-na_ 3(disz) _gin2 ku3-babbar_ s,a-ru-pa2-am |
7 | P360467 envelope - obverse:5 | [8(disz) ma]-na _ku3-babbar_ s,a-ru-pa2-am | 4(u) 4(disz) _ma-na ku3-babbar_ s,a-ru-pa2-am |
8 | P360467 envelope - obverse:5 | [8(disz) ma]-na _ku3-babbar_ s,a-ru-pa2-am | 1(u) 5(disz) _ma-na ku3-babbar_ s,a-ru-pa2-am |
9 | P360467 envelope - obverse:5 | [8(disz) ma]-na _ku3-babbar_ s,a-ru-pa2-am | 2/3(disz) _ma-na_ 5(disz) _gin2 ku3-babbar_ s,a-ru-pa2-am |
10 | P360467 envelope - obverse:5 | [8(disz) ma]-na _ku3-babbar_ s,a-ru-pa2-am | 4(u) 4(disz) _ma-na ku3-babbar_ s,a-ru-pa2-am |
In case the ATF flags and clusters are a bit heavy on the eye, you can switch to a more pleasing rich text layout:
A.table(results, start=1, end=10, fmt="layout-orig-rich")
n | p | line | line |
---|---|---|---|
1 | P361250 obverse:3 | i-a-ti₂ a-na a-šur-i-mi₃-ti₂ / qi₂-bi-ma / um-ma | a-na a-šur-i-mi₃-ti₂ qi₂-bi-ma |
2 | P361250 obverse:3 | i-a-ti₂ a-na a-šur-i-mi₃-ti₂ / qi₂-bi-ma / um-ma | qi₂-bi-ma um-ma a-šur-i-mi₃-ti₂ |
3 | P361250 obverse:3 | i-a-ti₂ a-na a-šur-i-mi₃-ti₂ / qi₂-bi-ma / um-ma | qi₂-bi-ma um-ma a-šur-i-mi₃-ti₂-ma |
4 | P361250 obverse:3 | i-a-ti₂ a-na a-šur-i-mi₃-ti₂ / qi₂-bi-ma / um-ma | qi₂-bi-ma um-ma a-šur-i-mi₃-ti₂-ma |
5 | P360467 envelope - obverse:4 | kišib₃ i-ri-ši₂-im dumu a-mur-dutu | i-ri-ši₂-im dumu a-mur-dutu |
6 | P360467 envelope - obverse:5 | 8⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am | 1/3⌈diš⌉ ma-na 3⌈diš⌉ gin₂ ku₃-babbar ṣa-ru-pa₂-am |
7 | P360467 envelope - obverse:5 | 8⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am | 4⌈u⌉ 4⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am |
8 | P360467 envelope - obverse:5 | 8⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am | 1⌈u⌉ 5⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am |
9 | P360467 envelope - obverse:5 | 8⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am | 2/3⌈diš⌉ ma-na 5⌈diš⌉ gin₂ ku₃-babbar ṣa-ru-pa₂-am |
10 | P360467 envelope - obverse:5 | 8⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am | 4⌈u⌉ 4⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am |
Or even in cuneiform unicode:
A.table(results, start=1, end=10, fmt="layout-orig-unicode")
n | p | line | line |
---|---|---|---|
1 | P361250 obverse:3 | 𒄿𒀀𒊹 𒀀𒈾 𒀀𒋩𒄿𒈨𒊹 𒁹 𒆠𒁉𒈠 𒁹 𒌝𒈠 | 𒀀𒈾 𒀀𒋩𒄿𒈨𒊹 𒆠𒁉𒈠 |
2 | P361250 obverse:3 | 𒄿𒀀𒊹 𒀀𒈾 𒀀𒋩𒄿𒈨𒊹 𒁹 𒆠𒁉𒈠 𒁹 𒌝𒈠 | 𒆠𒁉𒈠 𒌝𒈠 𒀀𒋩𒄿𒈨𒊹 |
3 | P361250 obverse:3 | 𒄿𒀀𒊹 𒀀𒈾 𒀀𒋩𒄿𒈨𒊹 𒁹 𒆠𒁉𒈠 𒁹 𒌝𒈠 | 𒆠𒁉𒈠 𒌝𒈠 𒀀𒋩𒄿𒈨𒊹𒈠 |
4 | P361250 obverse:3 | 𒄿𒀀𒊹 𒀀𒈾 𒀀𒋩𒄿𒈨𒊹 𒁹 𒆠𒁉𒈠 𒁹 𒌝𒈠 | 𒆠𒁉𒈠 𒌝𒈠 𒀀𒋩𒄿𒈨𒊹𒈠 |
5 | P360467 envelope - obverse:4 | 𒁾 𒄿𒊑𒋛𒅎 𒌉 𒀀𒄯𒀭𒌓 | 𒄿𒊑𒋛𒅎 𒌉 𒀀𒄯𒀭𒌓 |
6 | P360467 envelope - obverse:5 | 𒐍 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠 | 𒑚 𒈠𒈾 𒐈 𒂅 𒆬𒌓 𒍝𒊒𒁀𒄠 |
7 | P360467 envelope - obverse:5 | 𒐍 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠 | 𒐏 𒐉 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠 |
8 | P360467 envelope - obverse:5 | 𒐍 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠 | 𒌋 𒐊 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠 |
9 | P360467 envelope - obverse:5 | 𒐍 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠 | 𒑛 𒈠𒈾 𒐊 𒂅 𒆬𒌓 𒍝𒊒𒁀𒄠 |
10 | P360467 envelope - obverse:5 | 𒐍 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠 | 𒐏 𒐉 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠 |
From now on we forget about the level of similarity, and focus on whether two lines are just "similar", meaning that they have a high degree of similarity.
Before we go looking for similar lines, let's see whether we can group the lines into clusters of similar lines.
CLUSTER_THRESHOLD = 0.5


def makeClusters():
    A.indent(reset=True)
    chunkSize = 1000
    b = 0
    j = 0
    clusters = []
    for ln in F.otype.s("line"):
        j += 1
        b += 1
        if b == chunkSize:
            b = 0
            A.info(f"{j:>5} lines and {len(clusters):>5} clusters")
        lSisters = {x[0] for x in E.sim.b(ln)}
        lAdded = False
        for cl in clusters:
            if len(cl & lSisters) > CLUSTER_THRESHOLD * len(cl):
                cl.add(ln)
                lAdded = True
                break
        if not lAdded:
            clusters.append({ln})
    A.info(f"{j} lines and {len(clusters)} clusters")
    return clusters


clusters = makeClusters()
0.11s 1000 lines and 957 clusters
0.40s 2000 lines and 1808 clusters
0.82s 3000 lines and 2570 clusters
1.43s 4000 lines and 3489 clusters
2.14s 5000 lines and 4276 clusters
3.05s 6000 lines and 5137 clusters
4.17s 7000 lines and 6065 clusters
5.46s 8000 lines and 6968 clusters
6.88s 9000 lines and 7830 clusters
8.43s 10000 lines and 8666 clusters
10s 11000 lines and 9459 clusters
12s 12000 lines and 10314 clusters
14s 13000 lines and 11162 clusters
16s 14000 lines and 12040 clusters
19s 15000 lines and 12832 clusters
21s 16000 lines and 13731 clusters
24s 17000 lines and 14554 clusters
27s 18000 lines and 15392 clusters
30s 19000 lines and 16281 clusters
34s 20000 lines and 17143 clusters
37s 21000 lines and 17849 clusters
40s 22000 lines and 18589 clusters
44s 23000 lines and 19322 clusters
48s 24000 lines and 20176 clusters
52s 25000 lines and 21040 clusters
56s 26000 lines and 21795 clusters
1m 00s 27000 lines and 22556 clusters
1m 05s 28000 lines and 23344 clusters
1m 09s 29000 lines and 24011 clusters
1m 13s 30000 lines and 24693 clusters
1m 18s 31000 lines and 25420 clusters
1m 23s 32000 lines and 26261 clusters
1m 29s 33000 lines and 27075 clusters
1m 34s 34000 lines and 27900 clusters
1m 40s 35000 lines and 28703 clusters
1m 46s 36000 lines and 29507 clusters
1m 53s 37000 lines and 30330 clusters
1m 59s 38000 lines and 31112 clusters
2m 05s 39000 lines and 31916 clusters
2m 11s 40000 lines and 32567 clusters
2m 17s 41000 lines and 33303 clusters
2m 23s 42000 lines and 33988 clusters
2m 30s 43000 lines and 34756 clusters
2m 36s 44000 lines and 35407 clusters
2m 43s 45000 lines and 36205 clusters
2m 50s 46000 lines and 36883 clusters
2m 57s 47000 lines and 37593 clusters
3m 05s 48000 lines and 38415 clusters
3m 12s 49000 lines and 39116 clusters
3m 20s 50000 lines and 39816 clusters
3m 27s 51000 lines and 40473 clusters
3m 34s 52000 lines and 41061 clusters
3m 41s 53000 lines and 41691 clusters
3m 49s 54000 lines and 42339 clusters
3m 56s 55000 lines and 42946 clusters
4m 05s 56000 lines and 43680 clusters
4m 13s 57000 lines and 44372 clusters
4m 22s 58000 lines and 45140 clusters
4m 31s 59000 lines and 45791 clusters
4m 38s 60000 lines and 46406 clusters
4m 48s 61000 lines and 47183 clusters
4m 58s 62000 lines and 47920 clusters
5m 05s 63000 lines and 48503 clusters
5m 15s 64000 lines and 49178 clusters
5m 25s 65000 lines and 49942 clusters
5m 34s 66000 lines and 50646 clusters
5m 43s 67000 lines and 51240 clusters
5m 52s 68000 lines and 51942 clusters
6m 03s 69000 lines and 52709 clusters
6m 13s 70000 lines and 53372 clusters
6m 24s 71000 lines and 54111 clusters
6m 34s 72000 lines and 54782 clusters
6m 45s 73000 lines and 55556 clusters
6m 57s 74000 lines and 56332 clusters
7m 07s 75000 lines and 57016 clusters
7m 20s 76000 lines and 57840 clusters
7m 30s 77000 lines and 58527 clusters
7m 41s 78000 lines and 59184 clusters
7m 50s 79000 lines and 59742 clusters
8m 02s 80000 lines and 60439 clusters
8m 14s 81000 lines and 61150 clusters
8m 25s 82000 lines and 61763 clusters
8m 35s 83000 lines and 62383 clusters
8m 47s 84000 lines and 63059 clusters
8m 59s 85000 lines and 63707 clusters
9m 11s 86000 lines and 64310 clusters
9m 24s 87000 lines and 65057 clusters
9m 37s 88000 lines and 65802 clusters
9m 49s 89000 lines and 66478 clusters
10m 02s 90000 lines and 67194 clusters
10m 16s 91000 lines and 67978 clusters
10m 29s 92000 lines and 68671 clusters
10m 43s 93000 lines and 69400 clusters
10m 58s 94000 lines and 70186 clusters
11m 13s 95000 lines and 70980 clusters
11m 27s 96000 lines and 71690 clusters
11m 40s 97000 lines and 72340 clusters
11m 51s 98000 lines and 72891 clusters
12m 04s 99000 lines and 73513 clusters
12m 14s 100000 lines and 73986 clusters
12m 29s 101000 lines and 74708 clusters
12m 44s 102000 lines and 75414 clusters
12m 56s 103000 lines and 75974 clusters
13m 09s 104000 lines and 76618 clusters
13m 23s 105000 lines and 77253 clusters
13m 38s 106000 lines and 77929 clusters
13m 54s 107000 lines and 78646 clusters
14m 10s 108000 lines and 79387 clusters
14m 25s 109000 lines and 80022 clusters
14m 36s 109860 lines and 80540 clusters
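The clustering step can be tried out without the corpus. The sketch below applies the same greedy rule to a small, made-up similarity relation: an item joins the first cluster of which more than a threshold fraction of the members are its sisters, and otherwise starts a new cluster. The names `greedy_clusters` and `toy_neighbours` are hypothetical.

```python
def greedy_clusters(items, neighbours, threshold=0.5):
    """Greedily assign each item to the first cluster it sufficiently overlaps.

    neighbours maps an item to the set of items similar to it.
    """
    clusters = []
    for item in items:
        sisters = neighbours.get(item, set())
        for cl in clusters:
            # Join if more than `threshold` of the cluster members are sisters.
            if len(cl & sisters) > threshold * len(cl):
                cl.add(item)
                break
        else:
            # No cluster qualified: start a new one with just this item.
            clusters.append({item})
    return clusters


# Made-up similarity relation: 1, 2, 3 are mutually similar, and so are 4 and 5.
toy_neighbours = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2},
    4: {5},
    5: {4},
    6: set(),
}
toy_clusters = greedy_clusters([1, 2, 3, 4, 5, 6], toy_neighbours)
print(toy_clusters)
```

Two properties of this approach are worth keeping in mind. The outcome depends on the order in which items are visited, because each item joins the first cluster that qualifies. And every item is checked against all clusters found so far, so the work per item grows with the number of clusters, which fits the way the progress log above slows down.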
What is the distribution of the clusters, in terms of how many similar lines they contain? We count them.
clusterSizes = collections.Counter()

for cl in clusters:
    clusterSizes[len(cl)] += 1

for (size, amount) in sorted(
    clusterSizes.items(),
    key=lambda x: (-x[0], x[1]),
):
    print(f"clusters of size {size:>4}: {amount:>5}")
clusters of size 455: 1
clusters of size 352: 1
clusters of size 267: 1
clusters of size 199: 1
clusters of size 146: 1
clusters of size 137: 1
clusters of size 131: 1
clusters of size 129: 1
clusters of size 123: 1
clusters of size 115: 1
clusters of size 112: 1
clusters of size 109: 1
clusters of size 106: 1
clusters of size 100: 1
clusters of size 99: 2
clusters of size 97: 1
clusters of size 94: 2
clusters of size 91: 1
clusters of size 89: 1
clusters of size 86: 1
clusters of size 85: 1
clusters of size 75: 1
clusters of size 73: 1
clusters of size 72: 1
clusters of size 71: 2
clusters of size 70: 1
clusters of size 69: 1
clusters of size 68: 2
clusters of size 65: 1
clusters of size 63: 1
clusters of size 62: 1
clusters of size 61: 2
clusters of size 60: 2
clusters of size 59: 1
clusters of size 58: 2
clusters of size 57: 4
clusters of size 56: 2
clusters of size 55: 1
clusters of size 54: 2
clusters of size 53: 2
clusters of size 52: 2
clusters of size 51: 2
clusters of size 50: 1
clusters of size 48: 4
clusters of size 47: 3
clusters of size 46: 2
clusters of size 45: 1
clusters of size 44: 2
clusters of size 42: 3
clusters of size 41: 2
clusters of size 40: 1
clusters of size 39: 3
clusters of size 38: 3
clusters of size 37: 4
clusters of size 36: 1
clusters of size 35: 3
clusters of size 34: 2
clusters of size 33: 3
clusters of size 32: 2
clusters of size 31: 1
clusters of size 30: 8
clusters of size 29: 9
clusters of size 28: 6
clusters of size 27: 6
clusters of size 26: 15
clusters of size 25: 8
clusters of size 24: 9
clusters of size 23: 7
clusters of size 22: 12
clusters of size 21: 10
clusters of size 20: 18
clusters of size 19: 11
clusters of size 18: 19
clusters of size 17: 20
clusters of size 16: 31
clusters of size 15: 28
clusters of size 14: 50
clusters of size 13: 38
clusters of size 12: 61
clusters of size 11: 54
clusters of size 10: 92
clusters of size 9: 95
clusters of size 8: 126
clusters of size 7: 167
clusters of size 6: 259
clusters of size 5: 372
clusters of size 4: 676
clusters of size 3: 1435
clusters of size 2: 4714
clusters of size 1: 72086
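A size distribution like this can be condensed into a single figure: how many lines share a cluster with at least one other line. The sketch below does that for a small, made-up counter; `toy_cluster_sizes` and `lines_in_clusters` are hypothetical names.

```python
import collections

# Made-up distribution: 4 clusters of size 1, 2 of size 2, 1 of size 5.
toy_cluster_sizes = collections.Counter({1: 4, 2: 2, 5: 1})


def lines_in_clusters(cluster_sizes, min_size=2):
    """Count the lines that sit in clusters of at least `min_size` members."""
    return sum(
        size * amount
        for (size, amount) in cluster_sizes.items()
        if size >= min_size
    )


total = lines_in_clusters(toy_cluster_sizes, min_size=1)
shared = lines_in_clusters(toy_cluster_sizes, min_size=2)
print(f"{shared} of {total} lines have at least one cluster mate")
```

Applied to the real counts, the same reasoning says that 109860 - 72086 = 37774 lines ended up in a cluster with at least one other line, since every line lands in exactly one cluster.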
Let's investigate some interesting groups that lie in some sweet spots.
See the cookbook for recipes for small, concrete tasks.
CC-BY Dirk Roorda