To get started: consult start
We descend to a more concrete level, and interact with the data by means of a bit of hand-coding.
Familiarity with the underlying data model is recommended.
%load_ext autoreload
%autoreload 2
import collections
from tf.app import use
A = use("clariah/wp6-missieven", hoist=globals())
The data of the corpus is organized in features. They are columns of data. Think of the text as a gigantic spreadsheet, where row 1 corresponds to the first word, row 2 to the second word, and so on, for all words, several millions, in this corpus.
Each piece of information about the words, including the text of the words, constitutes a column in that spreadsheet.
Instead of putting that information in one big table, the data is organized in separate columns. We call those columns features.
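The column idea can be sketched in plain Python. This is only a toy illustration, not the Text-Fabric API: each feature is a separate dictionary from node number to value, the feature name `punc` is a hypothetical addition, and all values are made up.

```python
# Toy illustration only: plain dictionaries standing in for feature columns.
# The feature name `punc` and all values are invented for this sketch.
trans = {1: "de", 2: "schepen", 3: "zijn", 4: "vertrokken"}
punc = {1: " ", 2: " ", 3: " ", 4: "."}


def feature_value(feature, node):
    """Look up the value of one feature (column) for one node (row)."""
    return feature.get(node)


def row(node, **features):
    """Assemble one spreadsheet row from separately stored columns."""
    return {name: feature_value(column, node) for (name, column) in features.items()}


# Reading two columns in parallel, row by row, reconstructs the text.
text = "".join(trans[n] + punc[n] for n in sorted(trans))
print(text)
print(row(2, trans=trans, punc=punc))
```

Storing columns separately means a column can be loaded, or left out, without touching the others.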
You can see which features have been loaded, and if you click on a feature name, you find its documentation. If you hover over a name, you see where the feature is located on your system.
Edge features are marked by *bold italic* formatting.
A.indent(reset=True)
A.info("Counting nodes ...")
i = 0
for n in N.walk():
    i += 1
A.info("{} nodes".format(i))
0.00s Counting nodes ...
0.46s 6638983 nodes
F.otype.slotType
'word'
F.otype.all
('volume', 'letter', 'page', 'table', 'para', 'remark', 'head', 'note', 'line', 'row', 'folio', 'cell', 'subhead', 'word')
C.levels.data
(('volume', 426954.28571428574, 6638970, 6638983), ('letter', 9847.380560131796, 6018166, 6018772), ('page', 532.979045920642, 6558167, 6569381), ('table', 137.91038696537677, 6638479, 6638969), ('para', 100.7875075489604, 6569382, 6604154), ('remark', 97.49029448361675, 6604155, 6628264), ('head', 31.115321252059307, 6017559, 6018165), ('note', 16.88329592818211, 6545691, 6558166), ('line', 11.344004190405338, 6018773, 6545690), ('row', 8.099520958083833, 6628265, 6636614), ('folio', 2.6304595518420055, 6009660, 6017558), ('cell', 2.0938419146103593, 5977361, 6009659), ('subhead', 1.4248927038626609, 6636615, 6638478), ('word', 1, 1, 5977360))
The second column is the average size (in words) of the node type mentioned in the first column.
The third and fourth columns are the node numbers of the first and the last node of that kind.
for (typ, av, start, end) in C.levels.data:
    print(
        f"{end - start + 1:>7} x {typ:<7}"
        f" having an average size of {int(round(av)):>6} words"
        f" and a total size of {int(round(av * (end - start + 1))):>7} words"
    )
     14 x volume  having an average size of 426954 words and a total size of 5977360 words
    607 x letter  having an average size of   9847 words and a total size of 5977360 words
  11215 x page    having an average size of    533 words and a total size of 5977360 words
    491 x table   having an average size of    138 words and a total size of   67714 words
  34773 x para    having an average size of    101 words and a total size of 3504684 words
  24110 x remark  having an average size of     97 words and a total size of 2350491 words
    607 x head    having an average size of     31 words and a total size of   18887 words
  12476 x note    having an average size of     17 words and a total size of  210636 words
 526918 x line    having an average size of     11 words and a total size of 5977360 words
   8350 x row     having an average size of      8 words and a total size of   67631 words
   7899 x folio   having an average size of      3 words and a total size of   20778 words
  32299 x cell    having an average size of      2 words and a total size of   67629 words
   1864 x subhead having an average size of      1 words and a total size of    2656 words
5977360 x word    having an average size of      1 words and a total size of 5977360 words
The node type `note` corresponds to footnotes. Here we see that there are over 12,000 footnotes in this corpus, with on average 17 words in a footnote.
Note that the node type `folio` corresponds to a reference to a folio, not to the contents of a folio.
That explains its short average length in words.
By inspecting the total size (in words) of a node type, we quickly see which node types cover the corpus and which node types are rare:

- `volume`, `letter`, `page`, `line`, `word` partition the corpus exactly
- `page` nearly partitioned the corpus, but there were some words outside pages. Not anymore. See below.
- `para`s do not cover the whole corpus; the rest consists of other elements (e.g. folios, headings, subheadings, tables)

Let's collect the words outside any page, if any:
outsiders = []
for w in F.otype.s("word"):
    if not L.u(w, otype="page"):
        outsiders.append((w,))
        if len(outsiders) > 10:
            break
print(f"{len(outsiders)} outsiders")
A.table(outsiders, withNodes=True)
0 outsiders
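The coverage observation made earlier can also be checked mechanically from the level data. The sketch below copies a few rows of the `C.levels.data` output shown above into a literal, so it runs without the corpus. The helper name `total_size` is an assumption of this sketch. A total equal to the number of words is a necessary condition for a partition, not a sufficient one: it does not rule out overlapping nodes, which is why the explicit search for outsiders is still useful.

```python
# Rows copied from the C.levels.data output shown earlier:
# (node type, average size in words, first node, last node)
levels = (
    ("volume", 426954.28571428574, 6638970, 6638983),
    ("letter", 9847.380560131796, 6018166, 6018772),
    ("page", 532.979045920642, 6558167, 6569381),
    ("table", 137.91038696537677, 6638479, 6638969),
    ("para", 100.7875075489604, 6569382, 6604154),
    ("line", 11.344004190405338, 6018773, 6545690),
    ("word", 1, 1, 5977360),
)

# The word nodes are the slots, so the last word node is the corpus size.
n_words = 5977360


def total_size(av, start, end):
    """Total number of words in all nodes of a type (hypothetical helper)."""
    return int(round(av * (end - start + 1)))


# Node types whose nodes together contain as many words as the corpus has.
full = [typ for (typ, av, start, end) in levels if total_size(av, start, end) == n_words]

# Node types that contain fewer words, with their totals.
partial = {
    typ: total_size(av, start, end)
    for (typ, av, start, end) in levels
    if total_size(av, start, end) != n_words
}

print("full coverage:", full)
print("partial coverage:", partial)
```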
for (w, amount) in F.trans.freqList("word")[0:20]:
    print(f"{amount:>6} {w}")
267097 de
215289 van
145076 en
123549 te
 92507 in
 84398 het
 59475 den
 58915 dat
 58055 een
 56649 is
 56626 op
 54628 met
 43217 die
 43071 
 42416 voor
 38526 niet
 36956 aan
 34957 tot
 33338 zijn
 31128 door
We look for words that occur only once.
We are only interested in words that are completely alphabetic, i.e. words that do not have numbers or other non-letters in them.
hapaxes1 = sorted(
    w for (w, amount) in F.trans.freqList("word") if amount == 1 and w.isalpha()
)
len(hapaxes1)
85759
for lx in hapaxes1[0:20]:
    print(lx)
AA
AC
ADRIAEN
AF
AFRIKA
AGRA
AJcbar
AND
ANDREASVAN
ANTHONTO
ANTONIOCAENENJOAN
ANTONY
ARDECRÖON
ARE
ARNOUD
ASTELIJN
AUahabad
AUen
AUorkulan
AVR
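The same selection can be tried on a handful of made-up tokens, without loading the corpus. This is a minimal sketch with `collections.Counter` standing in for the frequency list; the token list is invented.

```python
from collections import Counter

# Invented tokens; in the corpus these would be values of the trans feature.
tokens = ["de", "schepen", "van", "de", "Compagnie", "1631", "Makéan", "van", "d'E."]

freq = Counter(tokens)

# A hapax occurs exactly once; isalpha() rejects digits and punctuation
# but accepts accented letters such as the é in Makéan.
hapaxes = sorted(w for (w, amount) in freq.items() if amount == 1 and w.isalpha())

print(hapaxes)
```

Because `sorted` compares strings by code point, capitalized words come before lowercase ones, which is consistent with the listing above starting with uppercase entries.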
The occurrence base of a word is the set of missives (letters) in which the word occurs.
N.B. (terminology) Here letter means a document that has been sent to a recipient. This corpus consists of missives, which are letters.
We look only in the content of the original missives.
occurrenceBase = collections.defaultdict(set)
A.indent(reset=True)
A.info("compiling occurrence base ...")
for s in F.otype.s("letter"):
    title = F.title.v(s)
    for w in L.d(s, otype="word"):
        trans = F.transo.v(w)
        if not trans or not trans.isalpha():
            continue
        occurrenceBase[trans].add(title)
A.info("done")
A.info(f"{len(occurrenceBase)} entries")
0.00s compiling occurrence base ...
1.87s done
1.87s 127166 entries
An overview of how many words have an occurrence base of a given size:
occurrenceSize = collections.Counter()
for (w, letters) in occurrenceBase.items():
    occurrenceSize[len(letters)] += 1
occurrenceSize = sorted(
    occurrenceSize.items(),
    key=lambda x: (-x[1], x[0]),
)
for (size, amount) in occurrenceSize[0:10]:
    print(f"letters {size:>4} : {amount:>6} words")
print("...")
for (size, amount) in occurrenceSize[-10:]:
    print(f"letters {size:>4} : {amount:>6} words")
letters    1 :  69897 words
letters    2 :  16495 words
letters    3 :   8224 words
letters    4 :   5165 words
letters    5 :   3636 words
letters    6 :   2612 words
letters    7 :   2131 words
letters    8 :   1708 words
letters    9 :   1323 words
letters   10 :   1201 words
...
letters  479 :      1 words
letters  481 :      1 words
letters  485 :      1 words
letters  489 :      1 words
letters  493 :      1 words
letters  494 :      1 words
letters  498 :      1 words
letters  501 :      1 words
letters  508 :      1 words
letters  511 :      1 words
Let's give the predicate private to those words whose occurrence base is a single missive.
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)
69897
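To see the three steps (occurrence base, size overview, private words) at a glance, here is a minimal sketch on three invented letters. The titles and word lists are made up; only the logic mirrors the cells above.

```python
import collections

# Invented data: letter title -> words in that letter.
letters = {
    "letter A": ["de", "schepen", "van", "Batavia"],
    "letter B": ["de", "retouren", "van", "Batavia", "1631"],
    "letter C": ["de", "schepen"],
}

# Step 1: for each alphabetic word, the set of letters it occurs in.
occurrence_base = collections.defaultdict(set)
for (title, words) in letters.items():
    for w in words:
        if not w or not w.isalpha():
            continue
        occurrence_base[w].add(title)

# Step 2: how many words have an occurrence base of each size.
occurrence_size = collections.Counter(len(base) for base in occurrence_base.values())

# Step 3: private words occur in exactly one letter.
privates = {w for (w, base) in occurrence_base.items() if len(base) == 1}

print(dict(occurrence_base))
print(occurrence_size)
print(privates)
```

Note that the base is a set, so a word that occurs many times within one letter still counts as private.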
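As a final exercise with missives, let's make a list of all of them, and show for each the number of distinct words, the number of private words, and the percentage of private words.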
letterList = []
empty = set()
ordinary = set()
for d in F.otype.s("letter"):
    letter = F.title.v(d)
    if len(letter) > 50:
        letter = f"{letter[0:22]} .. {letter[-22:]}"
    words = {
        trans
        for w in L.d(d, otype="word")
        if (trans := F.transo.v(w)) and trans.isalpha()
    }
    a = len(words)
    if not a:
        empty.add(letter)
        continue
    o = len({w for w in words if w in privates})
    if not o:
        ordinary.add(letter)
        continue
    p = 100 * o / a
    letterList.append((letter, a, o, p))
letterList = sorted(letterList, key=lambda e: (-e[3], -e[1], e[0]))
print(f"Found {len(empty):>4} empty letters")
print(f"Found {len(ordinary):>4} ordinary letters (i.e. without private words)")
Found    0 empty letters
Found   59 ordinary letters (i.e. without private words)
print(
    "{:<50}{:>5}{:>5}{:>5}\n{}".format(
        "missive",
        "#all",
        "#own",
        "%own",
        "-" * 35,
    )
)
for x in letterList[0:20]:
    print("{:<50} {:>4} {:>4} {:>4.1f}%".format(*x))
print("...")
for x in letterList[-20:]:
    print("{:<50} {:>4} {:>4} {:>4.1f}%".format(*x))
missive                                            #all #own %own
-----------------------------------
Both; zonder plaats, zonder datum                     7    3 42.9%
Both; zonder plaats, zonder datum                     7    3 42.9%
Van Diemen; in het Sch .. an Afrika, 5 juni 1631     17    4 23.5%
Maetsuycker, Verburch, .. via, 25 september 1675     20    4 20.0%
Durven, Hasselaar, Blo .. tavia, 17 oktober 1730    120   22 18.3%
Maetsuycker, Verburch, .. avia, 20 februari 1672     17    3 17.6%
Reynst; Bantam, 26 oktober 1615                     748  130 17.4%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Reael; Kasteel Mauriti .. kéan, 20 augustus 1618   1175  181 15.4%
Reniers, Maetsuycker, .. avia, 24 december 1652    5032  720 14.3%
Brouwer, Van Diemen, L .. atavia, 4 januari 1636   4084  584 14.3%
Coen, Jansz, Lefebvre, .. tavia, 3 november 1628     21    3 14.3%
Coen, Sonck; Schip Nie .. nda-Neira , 6 mei 1621     21    3 14.3%
Both; aan boord van he .. Mayo, 25 februari 1610     14    2 14.3%
Both; aan boord van he .. lbaai, 6 augustus 1610     14    2 14.3%
Coen, De Carpentier, L .. avia, 16 november 1621     15    2 13.3%
Maetsuycker, Hulft, Ha .. tavia, 4 februari 1655     15    2 13.3%
...
Maetsuycker, Hartsinck .. Batavia, 31 juli 1656     310    5  1.6%
Van Goens, Speelman, B .. avia, 30 augustus 1681     63    1  1.6%
Van Hoorn, Van Riebeec .. tavia, 15 januari 1707    256    4  1.6%
Zwaardecroon, De Haan, .. Batavia, 2 april 1724      68    1  1.5%
Maetsuycker, Hartsinck .. avia, 26 februari 1665     72    1  1.4%
De Haan, Huysman, Hass .. tavia, 27 oktober 1727     74    1  1.4%
Mossel, Cluysenaar, Va .. tavia, 10 oktober 1752    297    4  1.3%
Maetsuycker, Verburch, .. avia, 13 december 1672     75    1  1.3%
Mossel, Van der Waeyen .. avia, 31 december 1753    306    4  1.3%
Zwaardecroon, De Haan, .. avia, 14 december 1724     77    1  1.3%
Van Outhoorn, Van Hoor .. Batavia, 20 april 1699     81    1  1.2%
Mossel, Van der Waeyen .. avia, 31 december 1754    250    3  1.2%
Van Imhoff, Pasques de .. tavia, 12 oktober 1746    358    4  1.1%
Van Hoorn, Van Riebeec .. Batavia, 6 maart 1705      99    1  1.0%
Van Imhoff, Cluysenaar .. ; Batavia, 25 mei 1748    200    2  1.0%
Van Cloon, Blom, Van d .. Batavia, 31 maart 1733    218    2  0.9%
Maetsuycker, Hartsinck .. tavia, 29 januari 1663    126    1  0.8%
De Carpentier, Specx, .. avia, 15 december 1626     144    1  0.7%
Durven, Hasselaar, Blo .. avia, 27 februari 1730    146    1  0.7%
Van Imhoff, Pasques de .. tavia, 20 oktober 1745    307    2  0.7%
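The ranking logic behind this table can be reproduced on invented data. In the sketch below the titles, the word lists and the set `toy_privates` are all hypothetical; the percentage is taken over distinct alphabetic words, and letters without private words are left out, as in the cell above.

```python
# Hypothetical toy data: titles, word lists and private words are invented.
toy_letters = {
    "letter A": ["de", "schepen", "fluyt", "fluyt", "1631"],
    "letter B": ["de", "retouren"],
    "letter C": ["de", "schepen"],
}
toy_privates = {"fluyt", "retouren"}

rows = []
for (title, tokens) in toy_letters.items():
    # Distinct alphabetic words only: repeats and numbers do not count.
    distinct = {w for w in tokens if w and w.isalpha()}
    own = len(distinct & toy_privates)
    if not distinct or not own:
        continue  # skip empty letters and letters without private words
    rows.append((title, len(distinct), own, 100 * own / len(distinct)))

# Highest percentage first; ties broken by size (descending), then title.
rows.sort(key=lambda e: (-e[3], -e[1], e[0]))

for r in rows:
    print("{:<12} {:>4} {:>4} {:>4.1f}%".format(*r))
```

Sorting on a tuple with negated numbers gives descending order on the numeric keys while keeping ascending order on the title.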
By now you have an impression of how to compute with the Missieven corpus. While this is still the beginning, I hope you already sense the power of unlimited programmatic access to all the bits and bytes in the data set.
Here are a few directions for unleashing that power.
CC-BY Dirk Roorda