This notebook gets you started with using Text-Fabric for coding in the Old-Babylonian Letter corpus (cuneiform).
Familiarity with the underlying data model is recommended.
If you start computing with this tutorial, first copy its parent directory to somewhere else, outside your copy of the repository. Then, if you pull changes from the repository later, your work will not be overwritten. Where you put your tutorial directory is up to you: it will work from any directory.
Text-Fabric will fetch the data set for you from the newest GitHub release binaries.
The data will be stored in the `text-fabric-data` directory in your home directory.
The data of the corpus is organized in features. They are columns of data. Think of the corpus as a gigantic spreadsheet, where row 1 corresponds to the first sign, row 2 to the second sign, and so on, for all 200,000 signs.
The information about which reading each sign has constitutes a column in that spreadsheet. The Old Babylonian corpus contains nearly 60 columns, not only for the signs, but also for thousands of other textual objects, such as clusters, lines, columns, faces, and documents.
Instead of putting that information in one big table, the data is organized in separate columns. We call those columns features.
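As a rough mental model (this is not how Text-Fabric stores data internally), a feature behaves like a mapping from row numbers to cell values. The feature name and values below are made up for illustration:

```python
# Hypothetical miniature "feature": one column of the big spreadsheet,
# mapping node numbers (rows) to values. These values are invented.
reading = {1: "a", 2: "na", 3: "qi2"}

# Looking up the value of a feature for a node is then just a lookup:
print(reading[2])  # na
```

Text-Fabric exposes this kind of lookup through its `F` API, which we meet below.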
%load_ext autoreload
%autoreload 2
import os
import collections
The simplest way to get going is by this incantation:
from tf.app import use
- For the very latest version, use `hot`.
- For the latest release, use `latest`.
- If you have cloned the repos (TF app and data), use `clone`.
- If you do not want or need to upgrade, leave out the checkout specifiers.
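For illustration, a checkout specifier is typically appended to the corpus name after a colon (older Text-Fabric versions used a `checkout=` keyword argument instead). The values below are just strings; the actual call is shown commented out, since it fetches data from GitHub:

```python
# Illustrative checkout specifiers (not executed against GitHub here):
spec_hot = "Nino-cunei/oldbabylonian:hot"        # newest commit
spec_latest = "Nino-cunei/oldbabylonian:latest"  # latest release
spec_clone = "Nino-cunei/oldbabylonian:clone"    # your local clone

# e.g.:
# A = use(spec_latest, hoist=globals())
```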
A = use("Nino-cunei/oldbabylonian", hoist=globals())
Locating corpus resources ...
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
document | 1285 | 158.15 | 100 |
face | 2834 | 71.71 | 100 |
line | 27375 | 7.42 | 100 |
word | 76505 | 2.64 | 100 |
cluster | 23449 | 1.78 | 21 |
sign | 203219 | 1.00 | 100 |
You can see which features have been loaded, and if you click on a feature name, you find its documentation. If you hover over a name, you see where the feature is located on your system.
The result of the incantation is that we have a bunch of special variables at our disposal that give us access to the text and data of the corpus.
At this point it is helpful to throw a quick glance at the text-fabric API documentation (see the links under API Members above).
The most essential thing for now is that we can use `F` to access the data in the features we've loaded. But there is more, such as `N`, which helps us to walk over the text, as we will see in a minute.
The API members above show you exactly which new names have been inserted in your namespace. If you click on these names, you go to the API documentation for them.
Text-Fabric contains a flexible search engine that works not only for the data of this corpus, but also for other corpora and for data that you add to corpora.
Search is the quickest way to come up-to-speed with your data, without too much programming.
Jump to the dedicated search tutorial first, to whet your appetite.
The real power of search lies in the fact that it is integrated into a programming environment, so you can combine queries with ordinary code.
Therefore, the rest of this tutorial is still important when you want to tap that power. If you continue here, you learn all the basics of data-navigation with Text-Fabric.
In order to get acquainted with the data, we start with the simple task of counting.
We use the `N.walk()` generator to walk through the nodes.
We compared the TF data to a gigantic spreadsheet, where the rows correspond to the signs.
In Text-Fabric, we call the rows slots, because they are the textual positions that can be filled with signs.
We also mentioned that there are also other textual objects. They are the clusters, lines, faces and documents. They also correspond to rows in the big spreadsheet.
In Text-Fabric we call all these rows nodes, and the `N.walk()` generator carries us through those nodes in the textual order.
Just one extra thing: the `info` statements generate timed messages. If you use them instead of `print` you'll get a sense of the amount of time that the various processing steps typically need.
A.indent(reset=True)
A.info("Counting nodes ...")
i = 0
for n in N.walk():
i += 1
A.info("{} nodes".format(i))
0.00s Counting nodes ... 0.03s 334667 nodes
Here you see it: over 300,000 nodes.
Every node has a type, like sign, line, or face. But what exactly are they?
Text-Fabric has two special features, `otype` and `oslots`, that must occur in every Text-Fabric data set. `otype` tells you the type of each node, and you can ask for the number of slots in the text.
Here we go!
F.otype.slotType
'sign'
F.otype.maxSlot
203219
F.otype.maxNode
334667
F.otype.all
('document', 'face', 'line', 'word', 'cluster', 'sign')
C.levels.data
(('document', 158.14708171206226, 226669, 227953), ('face', 71.70748059280169, 227954, 230787), ('line', 7.423525114155251, 230788, 258162), ('word', 2.6436180641788116, 258163, 334667), ('cluster', 1.782122905027933, 203220, 226668), ('sign', 1, 1, 203219))
This is interesting: above you see all the textual object types, with the average size (in slots) of their objects, the first node of each type, and the last node of each type.
This is an intuitive way to count the number of nodes in each type.
Note in passing how we use `indent` in conjunction with `info` to produce neatly timed and indented progress messages.
A.indent(reset=True)
A.info("counting objects ...")
for otype in F.otype.all:
i = 0
A.indent(level=1, reset=True)
for n in F.otype.s(otype):
i += 1
A.info("{:>7} {}s".format(i, otype))
A.indent(level=0)
A.info("Done")
0.00s counting objects ... | 0.00s 1285 documents | 0.00s 2834 faces | 0.00s 27375 lines | 0.01s 76505 words | 0.00s 23449 clusters | 0.02s 203219 signs 0.04s Done
`F` gives access to all features. Every feature has a method `freqList()` to generate a frequency list of its values, higher frequencies first. Here are the repeats of numerals (the `-1` comes from a `n(rrr)`):
F.repeat.freqList()
((1, 877), (2, 398), (5, 246), (3, 239), (4, 152), (6, 67), (8, 40), (7, 26), (9, 15), (-1, 3))
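For intuition, `freqList()` behaves roughly like this pure-Python sketch (a simplification of the real method, written here with `collections.Counter`; it is not TF's actual implementation):

```python
from collections import Counter

def freq_list(values):
    # Sort by descending frequency, then by value, like TF's freqList().
    return tuple(sorted(Counter(values).items(), key=lambda kv: (-kv[1], kv[0])))

print(freq_list(["a", "b", "a", "c", "a", "b"]))  # (('a', 3), ('b', 2), ('c', 1))
```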
Signs have types and clusters have types. We can count them separately:
F.type.freqList("cluster")
(('langalt', 7600), ('missing', 7572), ('det', 6794), ('uncertain', 1183), ('supplied', 231), ('excised', 69))
F.type.freqList("sign")
(('reading', 188292), ('unknown', 8761), ('numeral', 2184), ('ellipsis', 1617), ('grapheme', 1272), ('commentline', 969), ('complex', 122), ('comment', 2))
Finally, the flags:
F.flags.freqList()
(('#', 9830), ('?', 421), ('#?', 131), ('!', 91), ('*', 9), ('?#', 7), ('#!', 5), ('!*', 2), ('#*', 1), ('?!*', 1))
for (w, amount) in F.sym.freqList("word")[0:20]:
print(f"{amount:>5} {w}")
7089 x 4071 a-na 2517 u3 2361 sza 1645 um-ma 1440 i-na 1365 ... 1140 qi2-bi2-ma 1075 la 796 u2-ul 776 d⁼utu 626 d⁼marduk 585 ku3-babbar 551 ki-ma 540 asz-szum 407 hi-a 388 lu 381 1(disz) 363 ki-a-am 341 szum-ma
for w in [w for (w, amount) in F.sym.freqList("word") if amount == 1][0:20]:
print(f'"{w}"')
"...-BU-szu" "...-BU-um" "...-DI" "...-IG-ti-sza" "...-SZI" "...-ZU" "...-a-tim" "...-ab-ba-lam" "...-am-ma" "...-an" "...-ar" "...-ar-ra" "...-ba-lam" "...-da-an-ni" "...-d⁼en-lil2" "...-d⁼la-ga-ma-al" "...-ha-ar" "...-hu" "...-im" "...-ir"
The occurrence base of a word is the set of documents in which it occurs.
We compute the occurrence base of each word.
occurrenceBase = collections.defaultdict(set)
for w in F.otype.s("word"):
pNum = T.sectionFromNode(w)[0]
occurrenceBase[F.sym.v(w)].add(pNum)
An overview of how many words have how big occurrence bases:
occurrenceSize = collections.Counter()
for (w, pNums) in occurrenceBase.items():
occurrenceSize[len(pNums)] += 1
occurrenceSize = sorted(
occurrenceSize.items(),
key=lambda x: (-x[1], x[0]),
)
for (size, amount) in occurrenceSize[0:10]:
print(f"base size {size:>4} : {amount:>5} words")
print("...")
for (size, amount) in occurrenceSize[-10:]:
print(f"base size {size:>4} : {amount:>5} words")
base size 1 : 11957 words base size 2 : 1817 words base size 3 : 745 words base size 4 : 367 words base size 5 : 229 words base size 6 : 150 words base size 7 : 128 words base size 9 : 75 words base size 8 : 74 words base size 10 : 64 words ... base size 459 : 1 words base size 624 : 1 words base size 626 : 1 words base size 649 : 1 words base size 736 : 1 words base size 967 : 1 words base size 1031 : 1 words base size 1119 : 1 words base size 1169 : 1 words base size 1255 : 1 words
Let's give the predicate private to those words whose occurrence base is a single document.
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)
11957
As a final exercise with words, let's make a list of all documents and show, for each, its number of distinct words, its number of private words, and the percentage of private words.
docList = []
empty = set()
ordinary = set()
for d in F.otype.s("document"):
pNum = T.documentName(d)
words = {F.sym.v(w) for w in L.d(d, otype="word")}
a = len(words)
if not a:
empty.add(pNum)
continue
o = len({w for w in words if w in privates})
if not o:
ordinary.add(pNum)
continue
p = 100 * o / a
docList.append((pNum, a, o, p))
docList = sorted(docList, key=lambda e: (-e[3], -e[1], e[0]))
print(f"Found {len(empty):>4} empty documents")
print(f"Found {len(ordinary):>4} ordinary documents (i.e. without private words)")
Found 7 empty documents Found 30 ordinary documents (i.e. without private words)
print(
"{:<20}{:>5}{:>5}{:>5}\n{}".format(
"document",
"#all",
"#own",
"%own",
"-" * 35,
)
)
for x in docList[0:20]:
print("{:<20} {:>4} {:>4} {:>4.1f}%".format(*x))
print("...")
for x in docList[-20:]:
print("{:<20} {:>4} {:>4} {:>4.1f}%".format(*x))
document #all #own %own ----------------------------------- P292935 20 11 55.0% P510702 39 21 53.8% P386445 40 20 50.0% P510808 28 14 50.0% P313381 24 12 50.0% P510849 16 8 50.0% P510657 4 2 50.0% P292931 48 23 47.9% P305788 17 8 47.1% P313326 45 21 46.7% P292992 28 13 46.4% P292810 26 12 46.2% P305774 13 6 46.2% P510641 13 6 46.2% P292984 24 11 45.8% P481778 24 11 45.8% P373043 104 47 45.2% P491917 29 13 44.8% P355749 226 100 44.2% P365938 25 11 44.0% ... P510540 21 1 4.8% P372962 22 1 4.5% P275094 24 1 4.2% P372895 24 1 4.2% P385995 24 1 4.2% P510732 24 1 4.2% P386005 27 1 3.7% P510730 27 1 3.7% P372914 28 1 3.6% P372900 30 1 3.3% P372929 30 1 3.3% P510528 30 1 3.3% P413618 31 1 3.2% P510661 64 2 3.1% P510535 32 1 3.1% P365118 34 1 2.9% P510663 34 1 2.9% P510817 36 1 2.8% P275114 40 1 2.5% P372420 61 1 1.6%
We travel upwards and downwards, forwards and backwards through the nodes. The Locality API (`L`) provides functions: `u()` for going up, `d()` for going down, `n()` for going to next nodes, and `p()` for going to previous nodes.

These directions are indirect notions: nodes are just numbers, but by means of the `oslots` feature they are linked to slots. One node contains another node if the one is linked to a set of slots that contains the set of slots that the other is linked to. And one node is next or previous to another if its slots follow or precede the slots of the other one.
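The containment notion can be sketched with plain Python sets. This is a toy model, with made-up node numbers and slot sets, not the actual `oslots` machinery:

```python
# Toy oslots: maps non-slot nodes to the set of slots they are linked to.
oslots = {
    100: {1, 2, 3, 4},  # e.g. a line
    200: {2, 3},        # e.g. a word inside that line
}

def embeds(a, b):
    # Node a contains node b if a's slot set is a superset of b's.
    return a != b and oslots[a] >= oslots[b]

print(embeds(100, 200))  # True
print(embeds(200, 100))  # False
```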
`L.u(node)`: up is going to nodes that embed `node`.
`L.d(node)`: down is the opposite direction, to those that are contained in `node`.
`L.n(node)`: next are the adjacent nodes whose first slot comes immediately after the last slot of `node`.
`L.p(node)`: previous are the adjacent nodes whose last slot comes immediately before the first slot of `node`.
All these functions yield nodes of all possible node types. By passing an optional parameter, you can restrict the results to nodes of that type.
The results are ordered according to the order of things in the text. These functions always return a tuple, even if there is just one node in the result.
We go from the first sign to the document that contains it.
Note the `[0]` at the end. You expect one document, yet `L` returns a tuple. To get the only element of that tuple, you need that `[0]`. If you are like me, you keep forgetting it, and that will lead to weird error messages later on.
firstDoc = L.u(1, otype="document")[0]
print(firstDoc)
226669
And let's see all the containing objects of sign 3:
s = 3
for otype in F.otype.all:
if otype == F.otype.slotType:
continue
up = L.u(s, otype=otype)
upNode = "x" if len(up) == 0 else up[0]
print("sign {} is contained in {} {}".format(s, otype, upNode))
sign 3 is contained in document 226669 sign 3 is contained in face 227954 sign 3 is contained in line 230788 sign 3 is contained in word 258164 sign 3 is contained in cluster 203222
Let's go to the next nodes of the first document.
afterFirstDoc = L.n(firstDoc)
for n in afterFirstDoc:
print(
"{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
n,
F.otype.v(n),
E.oslots.s(n)[0],
E.oslots.s(n)[-1],
)
)
secondDoc = L.n(firstDoc, otype="document")[0]
348: sign first slot=348 , last slot=348 258314: word first slot=348 , last slot=349 230824: line first slot=348 , last slot=355 227956: face first slot=348 , last slot=482 226670: document first slot=348 , last slot=500
And let's see what is right before the second document.
for n in L.p(secondDoc):
print(
"{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
n,
F.otype.v(n),
E.oslots.s(n)[0],
E.oslots.s(n)[-1],
)
)
226669: document first slot=1 , last slot=347 227955: face first slot=164 , last slot=347 230823: line first slot=330 , last slot=347 203293: cluster first slot=345 , last slot=347 258313: word first slot=347 , last slot=347 347: sign first slot=347 , last slot=347
We go to the faces of the first document, and just count them.
faces = L.d(firstDoc, otype="face")
print(len(faces))
2
We pick two nodes and explore what is above and below them: the first line and the first word.
for n in [
F.otype.s("word")[0],
F.otype.s("line")[0],
]:
A.indent(level=0)
A.info("Node {}".format(n), tm=False)
A.indent(level=1)
A.info("UP", tm=False)
A.indent(level=2)
A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
A.indent(level=1)
A.info("DOWN", tm=False)
A.indent(level=2)
A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
A.indent(level=0)
A.info("Done", tm=False)
Node 258163 | UP | | 203220 cluster | | 230788 line | | 227954 face | | 226669 document | DOWN | | 203220 cluster | | 1 sign | | 2 sign Node 230788 | UP | | 227954 face | | 226669 document | DOWN | | 258163 word | | 203220 cluster | | 1 sign | | 2 sign | | 258164 word | | 203221 cluster | | 203222 cluster | | 3 sign | | 4 sign | | 5 sign | | 203223 cluster | | 6 sign | | 7 sign Done
So far, we have mainly seen nodes and their numbers, and the names of node types. You would almost forget that we are dealing with text. So let's try to see some text.
In the same way as `F` gives access to feature data, `T` gives access to the text. That is also feature data, but you can tell Text-Fabric which features specifically carry the text, and in return Text-Fabric offers you a Text API: `T`.
Cuneiform text can be represented in a number of ways; the available formats are listed below.
If you wonder where the information about text formats is stored: not in the program text-fabric, but in the data set. It has a feature `otext`, which specifies the formats and which features must be used to produce them. `otext` is the third special feature in a TF data set, next to `otype` and `oslots`. It is an optional feature: if it is absent, there will be no `T` API.
Here is a list of all available formats in this data set.
sorted(T.formats)
['layout-orig-rich', 'layout-orig-unicode', 'text-orig-full', 'text-orig-plain', 'text-orig-rich', 'text-orig-unicode']
The `T.text()` function is central to getting text representations of nodes. Its most basic usage is `T.text(nodes, fmt=fmt)`, where `nodes` is a list or iterable of nodes, usually word nodes, and `fmt` is the name of a format. If you leave out `fmt`, the default `text-orig-full` is chosen. The result is the text in that format for all nodes specified:
T.text([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], fmt="text-orig-plain")
'a-na d⁼suen-i-din-namqi2-bi2-maum-'
There is also another usage of this function: `T.text(node, fmt=fmt)`, where `node` is a single node. In this case, the default format is `ntype-orig-full`, where `ntype` is the type of `node`. If that format is defined in the corpus, it will be used. Otherwise, the word nodes contained in `node` will be looked up and represented with the default format `text-orig-full`.
In this way we can sensibly represent a lot of different nodes, such as documents, faces, lines, clusters, words and signs.
We compose a set of example nodes and run `T.text()` on them:
exampleNodes = [
F.otype.s("sign")[0],
F.otype.s("word")[0],
F.otype.s("cluster")[0],
F.otype.s("line")[0],
F.otype.s("face")[0],
F.otype.s("document")[0],
]
exampleNodes
[1, 258163, 203220, 230788, 227954, 226669]
for n in exampleNodes:
print(f"This is {F.otype.v(n)} {n}:")
print(T.text(n))
print("")
This is sign 1: [a- This is word 258163: [a-na] This is cluster 203220: [a-na] This is line 230788: [a-na] _{d}suen_-i-[din-nam] This is face 227954: [a-na] _{d}suen_-i-[din-nam]qi2-bi2-[ma]um-ma _{d}en-lil2_-sza-du-u2-ni-ma_{d}utu_ u3 _{d}[marduk]_ a-na da-ri-a-[tim]li-ba-al-li-t,u2-u2-ka{disz}sze-ep-_{d}suen a2-gal2 [dumu] um-mi-a-mesz_ki-a-am u2-lam-mi-da-an-ni um-[ma] szu-u2-[ma]{disz}sa-am-su-ba-ah-li sza-pi2-ir ma-[tim]2(esze3) _a-sza3_ s,i-[bi]-it {disz}[ku]-un-zu-lum _sza3-gud__a-sza3 a-gar3_ na-ag-[ma-lum] _uru_ x x x{ki}sza _{d}utu_-ha-zi-[ir] isz-tu _mu 7(disz) kam_ id-di-nu-szumu3 i-na _uru_ x-szum{ki} sza-ak-nu id-di-a-am-ma2(esze3) _a-sza3 szuku_ i-li-ib-bu s,i-bi-it _nagar-mesz__a-sza3 a-gar3 uru_ ra-bu-um x [...]x x x x x x [...]$ rest broken This is document 226669: [a-na] _{d}suen_-i-[din-nam]qi2-bi2-[ma]um-ma _{d}en-lil2_-sza-du-u2-ni-ma_{d}utu_ u3 _{d}[marduk]_ a-na da-ri-a-[tim]li-ba-al-li-t,u2-u2-ka{disz}sze-ep-_{d}suen a2-gal2 [dumu] um-mi-a-mesz_ki-a-am u2-lam-mi-da-an-ni um-[ma] szu-u2-[ma]{disz}sa-am-su-ba-ah-li sza-pi2-ir ma-[tim]2(esze3) _a-sza3_ s,i-[bi]-it {disz}[ku]-un-zu-lum _sza3-gud__a-sza3 a-gar3_ na-ag-[ma-lum] _uru_ x x x{ki}sza _{d}utu_-ha-zi-[ir] isz-tu _mu 7(disz) kam_ id-di-nu-szumu3 i-na _uru_ x-szum{ki} sza-ak-nu id-di-a-am-ma2(esze3) _a-sza3 szuku_ i-li-ib-bu s,i-bi-it _nagar-mesz__a-sza3 a-gar3 uru_ ra-bu-um x [...]x x x x x x [...]$ rest broken$ beginning broken[x x] x x [...][x x] x [...][x x] x s,i-bi-it _gir3-se3#-ga#_[x x] x x x-ir ub-lamin-na-me-er-max _[a-sza3_ s,i]-bi-it ku-un-zu-lum_a-[sza3 a-gar3_ na-ag]-ma-lum _uru gan2_ x x{ki}a-[na] sa-[am-su]-ba-ah-[la] x x li ig bumi-im-ma _a-sza3_ s,i-bi-[it] _nagar-mesz_u2-ul na-di-isz-szuma-na ki-ma i-[na] _dub e2-gal_-limsza _{d}utu_-ha-zi-ir ub-lam in-na-am-ruasz-t,u3-ra-am _dub_ [usz]-ta-bi-la-kuma-wa-tim sza ma-ah-hi-rum u2-lam-ma-du-u2-maasz-szum _e2-gal_-lim la lum-mu-[di ...]_dub_-pi2 u2-sza-ab-ba-x [...]x ti u2-ul a-ga-am-ma-[x ...]u2-ul a-sza-ap-pa-[x 
...][at-ta] ki-ma ti-du-u2 wa-ar-ka-az-zu pu-ru-us [x x ...]
for fmt in sorted(T.formats):
if fmt.startswith("text-"):
print("{}:\n\t{}".format(fmt, T.text(range(1, 12), fmt=fmt)))
text-orig-full: [a-na] _{d}suen_-i-[din-nam]qi2-bi2-[ma]um- text-orig-plain: a-na d⁼suen-i-din-namqi2-bi2-maum- text-orig-rich: a-na d⁼suen-i-din-namqi₂-bi₂-maum- text-orig-unicode: 𒀀𒈾 𒀭𒂗𒍪𒄿𒁷𒉆𒆠𒉈𒈠𒌝
If we do not specify a format, the default format (`text-orig-full`) is used.
T.text(range(1, 12))
'[a-na] _{d}suen_-i-[din-nam]qi2-bi2-[ma]um-'
firstLine = F.otype.s("line")[0]
T.text(firstLine)
'[a-na] _{d}suen_-i-[din-nam]'
T.text(firstLine, fmt="text-orig-unicode")
'𒀀𒈾 𒀭𒂗𒍪𒄿𒁷𒉆'
The important things to remember are:

- you get the text of node `n` in the default format by `T.text(n)`;
- you get the text of node `n` in other formats by `T.text(n, fmt=fmt, descend=True)`.
Part of the pleasure of working with computers is that they can crunch massive amounts of data. The text of the Old Babylonian Letters is a piece of cake: it takes just a few seconds to have that cake and eat it, in several formats.
A.indent(reset=True)
A.info("writing plain text of all letters in all text formats")
text = collections.defaultdict(list)
for ln in F.otype.s("line"):
for fmt in sorted(T.formats):
if fmt.startswith("text-"):
text[fmt].append(T.text(ln, fmt=fmt, descend=True))
A.info("done {} formats".format(len(text)))
for fmt in sorted(text):
print("{}\n{}\n".format(fmt, "\n".join(text[fmt][0:5])))
0.00s writing plain text of all letters in all text formats 1.70s done 4 formats text-orig-full [a-na] _{d}suen_-i-[din-nam] qi2-bi2-[ma] um-ma _{d}en-lil2_-sza-du-u2-ni-ma _{d}utu_ u3 _{d}[marduk]_ a-na da-ri-a-[tim] li-ba-al-li-t,u2-u2-ka text-orig-plain a-na d⁼suen-i-din-nam qi2-bi2-ma um-ma d⁼en-lil2-sza-du-u2-ni-ma d⁼utu u3 d⁼marduk a-na da-ri-a-tim li-ba-al-li-t,u2-u2-ka text-orig-rich a-na d⁼suen-i-din-nam qi₂-bi₂-ma um-ma d⁼en-lil₂-ša-du-u₂-ni-ma d⁼utu u₃ d⁼marduk a-na da-ri-a-tim li-ba-al-li-ṭu₂-u₂-ka text-orig-unicode 𒀀𒈾 𒀭𒂗𒍪𒄿𒁷𒉆 𒆠𒉈𒈠 𒌝𒈠 𒀭𒂗𒆤𒊭𒁺𒌑𒉌𒈠 𒀭𒌓 𒅇 𒀭𒀫𒌓 𒀀𒈾 𒁕𒊑𒀀𒁴 𒇷𒁀𒀠𒇷𒌅𒌑𒅗
We write all formats to file, in your `Downloads` folder.
for fmt in T.formats:
if fmt.startswith("text-"):
with open(os.path.expanduser(f"~/Downloads/{fmt}.txt"), "w") as f:
f.write("\n".join(text[fmt]))
A section in the letter corpus is a document, a face or a line.
Knowledge of sections is not baked into Text-Fabric.
The config feature `otext.tf` may specify three section levels, and tell what the corresponding node types and features are.
From that knowledge it can construct mappings from nodes to sections, e.g. from line nodes to tuples of the form:
(p-number, face specifier, line number)
You can get the section of a node as a tuple of relevant document, face, and line nodes. Or you can get it as a passage label, a string.
You can ask for the passage corresponding to the first slot of a node, or the one corresponding to the last slot.
If you are dealing with document and face nodes, you can ask to fill out the line and face parts as well.
Here are examples of getting the section that corresponds to a node and vice versa.
NB: `sectionFromNode` always delivers a section specification, either from the first slot belonging to that node, or, if `lastSlot=True`, from the last slot belonging to that node.
someNodes = (
F.otype.s("sign")[100000],
F.otype.s("word")[10000],
F.otype.s("cluster")[5000],
F.otype.s("line")[15000],
F.otype.s("face")[1000],
F.otype.s("document")[500],
)
for n in someNodes:
nType = F.otype.v(n)
d = f"{n:>7} {nType}"
first = A.sectionStrFromNode(n)
last = A.sectionStrFromNode(n, lastSlot=True, fillup=True)
tup = (
T.sectionTuple(n),
T.sectionTuple(n, lastSlot=True, fillup=True),
)
print(f"{d:<16} - {first:<18} {last:<18} {tup}")
100001 sign - P313335 obverse:8 P313335 obverse:8 ((227310, 229370, 244327), (227310, 229370, 244327)) 268163 word - P510665 obverse:9 P510665 obverse:9 ((226821, 228295, 234114), (226821, 228295, 234114)) 208220 cluster - P510766 obverse:9 P510766 obverse:9 ((226925, 228516, 236231), (226925, 228516, 236231)) 245788 line - P313410 obverse:12' P313410 obverse:12' ((227376, 229516, 245788), (227376, 229516, 245788)) 228954 face - P292765 reverse P292765 reverse:12 ((227126, 228954), (227126, 228954, 240157)) 227169 document - P382526 P382526 left:2 ((227169,), (227169, 229057, 241107))
Text-Fabric pre-computes data for you, so that it can be loaded faster. If the original data is updated, Text-Fabric detects it, and will recompute that data.
But there are cases, e.g. when the algorithms of Text-Fabric have changed without any changes in the data, in which you might want to clear the cache of precomputed results.
There are two ways to do that:

- Go to the `.tf` directory of your dataset and remove all `.tfx` files in it. This might be a bit awkward, because the `.tf` directory is hidden on Unix-like systems.
- Run `TF.clearCache()`, which does exactly the same.

It is not handy to execute the following cell all the time, so I have commented it out. If you really want to clear the cache, remove the comment sign below.
# TF.clearCache()
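If you prefer the manual route done programmatically, here is a hedged sketch; the dataset path in the example call is an assumption (adjust it to where your data lives), and `TF.clearCache()` remains the simplest option:

```python
import os

def clear_tfx(dataset_dir):
    # Delete every .tfx file below dataset_dir. The .tf directories are
    # hidden on Unix-like systems, but os.walk still descends into them.
    removed = 0
    base = os.path.expanduser(dataset_dir)
    for root, dirs, files in os.walk(base):
        for name in files:
            if name.endswith(".tfx"):
                os.remove(os.path.join(root, name))
                removed += 1
    return removed

# e.g. (path is an example):
# clear_tfx("~/text-fabric-data/github/Nino-cunei/oldbabylonian")
```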
By now you have an impression how to compute around in the corpus. While this is still the beginning, I hope you already sense the power of unlimited programmatic access to all the bits and bytes in the data set.
Here are a few directions for unleashing that power.
See the cookbook for recipes for small, concrete tasks.
CC-BY Dirk Roorda