To get started: consult start

Computing "by hand"

We descend to a more concrete level, and interact with the data by means of a bit of hand-coding.

Familiarity with the underlying data model is recommended.

In [1]:

%load_ext autoreload
%autoreload 2

In [2]:

import collections

In [3]:

from tf.app import use

In [4]:

A = use("clariah/wp6-missieven", hoist=globals())

TF-app: ~/text-fabric-data/github/clariah/wp6-missieven/app

data: ~/text-fabric-data/github/clariah/wp6-missieven/tf/1.0

Text-Fabric: Text-Fabric API 10.2.6, clariah/wp6-missieven/app v3, Search Reference
Data: WP6-MISSIEVEN, Character table, Feature docs
Features:

General Missives Dutch East India Company 1600-1800

feature     type  description
author      str   authors of the letter, surnames only
authorFull  str   authors of the letter, full names
col         int   column number of a column in a row in a table
day         int   day part of the date of the letter
isden       int   whether a word is the denominator in fraction, e.g. 4 in 1/4
isemph      str   whether a word is emphasized by typography
isfolio     int   a folio reference
isnote      int   whether a word belongs to footnote text
isnum       int   whether a word is the numerator in fraction, e.g. 1 in 1/4
isorig      int   whether a word belongs to original text
isq         int   whether a word is a numerical fraction, e.g. 1/4
isref       int   whether a word belongs to the text of reference
isremark    int   whether a word belongs to the text of editorial remarks
isspecial   int   whether a word has special typography possibly with OCR mistakes as well
issub       int   whether a word has subscript typography possibly indicating the denominator of a fraction
issuper     int   whether a word has superscript typography possibly indicating the numerator of a fraction
isund       str   whether a word is underlined by typography
mark        int   footnote mark (not necessarily the same as shown on the printed page
month       int   month part of the date of the letter
n           int   number of a volume, letter, page, para, line, table
otype       str
page        str   number of the first page of this letter in this volume
place       str   place from where the letter was sent
punc        str   punctuation and/or whitespace following a wordup to the next word
puncn       str   punctuation and/or whitespace following a word,up to the next word, footnote text only
punco       str   punctuation and/or whitespace following a word,up to the next word, original text only
puncr       str   punctuation and/or whitespace following a word,up to the next word, remark text only
rawdate     str   the date the letter was sent
row         int   row number of a row of column in a table
seq         str   ('sequence number of this letter among the letters of the same author in this volume',)
status      str   status of the letter, e.g. secret, copy
title       str   title of the letter
trans       str   transcription of a word
transn      str   transcription of a word, only for footnote text
transo      str   transcription of a word, only for original text
transr      str   transcription of a word, only for remark text
vol         int   volume number
weblink     str   the page-specific part of web links for page nodes
x           int   column offset of a column in a row in a table
year        int   year part of the date of the letter
note        none  edge between a word and the footnotes associated with it
oslots      none

Text-Fabric API: names N F E L T S C TF directly usable

Features

The data of the corpus is organized in features. They are columns of data. Think of the text as a gigantic spreadsheet, where row 1 corresponds to the first word, row 2 to the second word, and so on, for all words in this corpus, nearly six million of them.

Each piece of information about the words, including the text of the words, constitutes a column in that spreadsheet.

Instead of putting that information in one big table, the data is organized in separate columns. We call those columns features.

You can see which features have been loaded, and if you click on a feature name, you find its documentation. If you hover over a name, you see where the feature is located on your system.

Edge features are marked by *bold italic* formatting.
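
To make the spreadsheet picture concrete, here is a minimal sketch in plain Python. It does not use Text-Fabric: the names `trans_col`, `punc_col`, `features` and `row` are hypothetical, and the four words are invented sample data. Each feature is kept as its own mapping from node number to value, and a spreadsheet row is only assembled on demand.

```python
# Toy model of column-wise storage: one mapping per feature.
# Keys are node numbers (words are numbered from 1), values are feature values.
# The names echo the features `trans` and `punc`, but the data is invented.
trans_col = {1: "de", 2: "schepen", 3: "van", 4: "Batavia"}
punc_col = {1: " ", 2: " ", 3: " ", 4: "."}

features = {"trans": trans_col, "punc": punc_col}


def row(node):
    """Assemble the spreadsheet row of a node from the separate columns."""
    return {name: column.get(node) for (name, column) in features.items()}


# Reading two columns side by side, in node order, rebuilds the text.
text = "".join(trans_col[n] + punc_col[n] for n in sorted(trans_col))

print(row(2))
print(text)
```

Keeping the columns separate means that a computation only has to touch the features it needs.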

Counting

In [5]:

A.indent(reset=True)
A.info("Counting nodes ...")

i = 0
for n in N.walk():
    i += 1

A.info("{} nodes".format(i))

  0.00s Counting nodes ...
  0.46s 6638983 nodes

Node types

In [6]:

F.otype.slotType

Out[6]:

'word'

In [7]:

F.otype.all

Out[7]:

('volume',
 'letter',
 'page',
 'table',
 'para',
 'remark',
 'head',
 'note',
 'line',
 'row',
 'folio',
 'cell',
 'subhead',
 'word')

In [8]:

C.levels.data

Out[8]:

(('volume', 426954.28571428574, 6638970, 6638983),
 ('letter', 9847.380560131796, 6018166, 6018772),
 ('page', 532.979045920642, 6558167, 6569381),
 ('table', 137.91038696537677, 6638479, 6638969),
 ('para', 100.7875075489604, 6569382, 6604154),
 ('remark', 97.49029448361675, 6604155, 6628264),
 ('head', 31.115321252059307, 6017559, 6018165),
 ('note', 16.88329592818211, 6545691, 6558166),
 ('line', 11.344004190405338, 6018773, 6545690),
 ('row', 8.099520958083833, 6628265, 6636614),
 ('folio', 2.6304595518420055, 6009660, 6017558),
 ('cell', 2.0938419146103593, 5977361, 6009659),
 ('subhead', 1.4248927038626609, 6636615, 6638478),
 ('word', 1, 1, 5977360))

The second column is the average size (in words) of the node type mentioned in the first column.

The third and fourth columns are the node numbers of the first and the last node of that kind.

In [9]:

for (typ, av, start, end) in C.levels.data:
    print(
        f"{end - start + 1:>7} x {typ:<7}"
        f" having an average size of {int(round(av)):>6} words"
        f" and a total size of {int(round(av * (end - start + 1))):>7} words"
    )

     14 x volume  having an average size of 426954 words and a total size of 5977360 words
    607 x letter  having an average size of   9847 words and a total size of 5977360 words
  11215 x page    having an average size of    533 words and a total size of 5977360 words
    491 x table   having an average size of    138 words and a total size of   67714 words
  34773 x para    having an average size of    101 words and a total size of 3504684 words
  24110 x remark  having an average size of     97 words and a total size of 2350491 words
    607 x head    having an average size of     31 words and a total size of   18887 words
  12476 x note    having an average size of     17 words and a total size of  210636 words
 526918 x line    having an average size of     11 words and a total size of 5977360 words
   8350 x row     having an average size of      8 words and a total size of   67631 words
   7899 x folio   having an average size of      3 words and a total size of   20778 words
  32299 x cell    having an average size of      2 words and a total size of   67629 words
   1864 x subhead having an average size of      1 words and a total size of    2656 words
5977360 x word    having an average size of      1 words and a total size of 5977360 words

The node type note corresponds to footnotes. Here we see that there are over 12,000 footnotes in this corpus, with on average 17 words in a footnote.

Note that the node type folio corresponds to a reference to a folio, not to the contents of a folio. That explains its short average length in words.

By inspecting the total size (in words) of a node type, we quickly see which node types cover the corpus and which node types are rare:

the types volume, letter, page, line and word each partition the corpus exactly
previously, the type page nearly partitioned the corpus, but there were some words outside pages; that is no longer the case, see below
not all material is divided into paras (e.g. folios, headings, subheadings, tables)
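
The same observations can be derived mechanically. The sketch below copies the tuples shown in the output of `C.levels.data` into a plain Python variable (the names `levels`, `counts` and `full_cover` are ours), so it runs without the corpus. It sums the node counts per type and lists the types whose total size equals the number of words. An equal total is only an indication of a partition, since it does not rule out overlaps and gaps that happen to cancel out.

```python
# (type, average size in words, first node, last node),
# copied from the output of C.levels.data shown earlier.
levels = (
    ("volume", 426954.28571428574, 6638970, 6638983),
    ("letter", 9847.380560131796, 6018166, 6018772),
    ("page", 532.979045920642, 6558167, 6569381),
    ("table", 137.91038696537677, 6638479, 6638969),
    ("para", 100.7875075489604, 6569382, 6604154),
    ("remark", 97.49029448361675, 6604155, 6628264),
    ("head", 31.115321252059307, 6017559, 6018165),
    ("note", 16.88329592818211, 6545691, 6558166),
    ("line", 11.344004190405338, 6018773, 6545690),
    ("row", 8.099520958083833, 6628265, 6636614),
    ("folio", 2.6304595518420055, 6009660, 6017558),
    ("cell", 2.0938419146103593, 5977361, 6009659),
    ("subhead", 1.4248927038626609, 6636615, 6638478),
    ("word", 1, 1, 5977360),
)

# Number of nodes per type: the node numbers of a type form one interval.
counts = {typ: end - start + 1 for (typ, av, start, end) in levels}
total_nodes = sum(counts.values())
n_words = counts["word"]

# Types whose nodes together contain exactly as many words as the corpus has.
full_cover = [
    typ
    for (typ, av, start, end) in levels
    if int(round(av * (end - start + 1))) == n_words
]

print(f"{total_nodes} nodes in total")
print("covering all words:", ", ".join(full_cover))
```

The total agrees with the 6638983 nodes found by walking through all nodes in the Counting section.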

Let's collect the words outside any page, if any:

In [10]:

outsiders = []

for w in F.otype.s("word"):
    if not L.u(w, otype="page"):
        outsiders.append((w,))
        if len(outsiders) > 10:
            break

print(f"{len(outsiders)} outsiders")
A.table(outsiders, withNodes=True)

0 outsiders
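
The test in this cell relies on the upward lookup `L.u(w, otype="page")` being empty for a word that is not contained in any page. The logic can be mimicked without the corpus. In the sketch below pages are modelled as inclusive ranges of word numbers; `pages`, `pages_of` and the numbers are invented for illustration and say nothing about how Text-Fabric implements the lookup.

```python
# Toy model: each page is an inclusive range of word numbers (invented data).
# Words 10 and 11 deliberately fall between the second and the third page.
pages = [(1, 4), (5, 9), (12, 15)]
n_words = 15


def pages_of(word):
    """Return the pages (as ranges) that contain the given word number."""
    return tuple((start, end) for (start, end) in pages if start <= word <= end)


# A word is an outsider if the lookup yields nothing.
outsiders = [w for w in range(1, n_words + 1) if not pages_of(w)]

print(f"{len(outsiders)} outsiders: {outsiders}")
```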

Word matters

We can only work with the surface forms of words, because there is no concept of lexeme in the corpus (yet).

Top 20 frequent words

In [11]:

for (w, amount) in F.trans.freqList("word")[0:20]:
    print(f"{amount:>6} {w}")

267097 de
215289 van
145076 en
123549 te
 92507 in
 84398 het
 59475 den
 58915 dat
 58055 een
 56649 is
 56626 op
 54628 met
 43217 die
 43071 
 42416 voor
 38526 niet
 36956 aan
 34957 tot
 33338 zijn
 31128 door

Hapaxes

We look for words that occur only once.

We are only interested in words that are completely alphabetic, i.e. words that do not have numbers or other non-letters in them.

In [12]:

hapaxes1 = sorted(
    w for (w, amount) in F.trans.freqList("word") if amount == 1 and w.isalpha()
)
len(hapaxes1)

Out[12]:

In [13]:

for lx in hapaxes1[0:20]:
    print(lx)

AA
AC
ADRIAEN
AF
AFRIKA
AGRA
AJcbar
AND
ANDREASVAN
ANTHONTO
ANTONIOCAENENJOAN
ANTONY
ARDECRÖON
ARE
ARNOUD
ASTELIJN
AUahabad
AUen
AUorkulan
AVR
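
The selection of hapaxes can be replayed on a small scale. The sketch below uses an invented token list and `collections.Counter` in place of `F.trans.freqList("word")`; the names `tokens`, `freq_list` and `hapaxes` are ours.

```python
import collections

# Invented sample; in the notebook the counts come from F.trans.freqList("word").
tokens = ["de", "van", "de", "AGRA", "1631", "van", "ARDECRÖON", "d'", "de"]

freq = collections.Counter(tokens)

# Frequency list: most frequent first, ties broken alphabetically.
freq_list = sorted(freq.items(), key=lambda x: (-x[1], x[0]))

# Hapaxes: words that occur once and consist of letters only.
hapaxes = sorted(w for (w, amount) in freq_list if amount == 1 and w.isalpha())

print(freq_list[0:2])
print(hapaxes)
```

Two properties of Python show up in the real list as well: `str.isalpha()` accepts accented letters, so a form like ARDECRÖON passes the test, and the default string sort puts uppercase before lowercase, which is why the list starts with forms in capitals.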

Small occurrence base

The occurrence base of a word is the set of missives (letters) in which the word occurs.

N.B. (terminology) Here letter means a document that has been sent to a recipient. This corpus consists of missives, which are letters in that sense.

We look only at the original text of the missives (feature transo), not at footnotes or editorial remarks.

In [14]:

occurrenceBase = collections.defaultdict(set)

A.indent(reset=True)
A.info("compiling occurrence base ...")
for s in F.otype.s("letter"):
    title = F.title.v(s)
    for w in L.d(s, otype="word"):
        trans = F.transo.v(w)
        if not trans or not trans.isalpha():
            continue
        occurrenceBase[trans].add(title)
A.info("done")
A.info(f"{len(occurrenceBase)} entries")

  0.00s compiling occurrence base ...
  1.87s done
  1.87s 127166 entries
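
For readers who want to see the data structure at work without loading the corpus, here is a minimal sketch with three invented letters. The names `letters`, `occurrence_base` and `sizes` are hypothetical; the filtering mirrors the cell above.

```python
import collections

# Invented sample: title of a letter -> the words of its original text.
letters = {
    "letter A": ["de", "schepen", "van", "Batavia"],
    "letter B": ["de", "retouren", "van", "1631"],
    "letter C": ["de", "schepen"],
}

# word form -> set of titles of the letters in which it occurs
occurrence_base = collections.defaultdict(set)

for (title, words) in letters.items():
    for word in words:
        # Skip empty values and forms that are not purely alphabetic.
        if not word or not word.isalpha():
            continue
        occurrence_base[word].add(title)

# How many words have an occurrence base of a given size?
sizes = collections.Counter(len(base) for base in occurrence_base.values())

for (word, base) in sorted(occurrence_base.items()):
    print(f"{word:<10} {sorted(base)}")
print(dict(sizes))
```

Note that the notebook uses the title of a letter as key. If two letters carry the same title, they count as one missive here; using the letter node itself as key would keep them apart.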

An overview of how many words have an occurrence base of a given size:

In [15]:

occurrenceSize = collections.Counter()

for (w, letters) in occurrenceBase.items():
    occurrenceSize[len(letters)] += 1

occurrenceSize = sorted(
    occurrenceSize.items(),
    key=lambda x: (-x[1], x[0]),
)

for (size, amount) in occurrenceSize[0:10]:
    print(f"letters {size:>4} : {amount:>6} words")
print("...")
for (size, amount) in occurrenceSize[-10:]:
    print(f"letters {size:>4} : {amount:>6} words")

letters    1 :  69897 words
letters    2 :  16495 words
letters    3 :   8224 words
letters    4 :   5165 words
letters    5 :   3636 words
letters    6 :   2612 words
letters    7 :   2131 words
letters    8 :   1708 words
letters    9 :   1323 words
letters   10 :   1201 words
...
letters  479 :      1 words
letters  481 :      1 words
letters  485 :      1 words
letters  489 :      1 words
letters  493 :      1 words
letters  494 :      1 words
letters  498 :      1 words
letters  501 :      1 words
letters  508 :      1 words
letters  511 :      1 words

Let's call a word private if its occurrence base is a single missive.

In [16]:

privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)

Out[16]:

Peculiarity of missives

As a final exercise with missives, let's make a list of all of them, and show for each of them

the total number of distinct alphabetic word forms
the number of private words among them
the percentage of private words: a measure of the peculiarity of the missive

In [17]:

letterList = []

empty = set()
ordinary = set()

for d in F.otype.s("letter"):
    letter = F.title.v(d)
    if len(letter) > 50:
        letter = f"{letter[0:22]} .. {letter[-22:]}"
    words = {
        trans
        for w in L.d(d, otype="word")
        if (trans := F.transo.v(w)) and trans.isalpha()
    }
    a = len(words)
    if not a:
        empty.add(letter)
        continue
    o = len({w for w in words if w in privates})
    if not o:
        ordinary.add(letter)
        continue
    p = 100 * o / a
    letterList.append((letter, a, o, p))

letterList = sorted(letterList, key=lambda e: (-e[3], -e[1], e[0]))

print(f"Found {len(empty):>4} empty letters")
print(f"Found {len(ordinary):>4} ordinary letters (i.e. without private words)")

Found    0 empty letters
Found   59 ordinary letters (i.e. without private words)

In [18]:

print(
    "{:<50}{:>5}{:>5}{:>5}\n{}".format(
        "missive",
        "#all",
        "#own",
        "%own",
        "-" * 35,
    )
)

for x in letterList[0:20]:
    print("{:<50} {:>4} {:>4} {:>4.1f}%".format(*x))
print("...")
for x in letterList[-20:]:
    print("{:<50} {:>4} {:>4} {:>4.1f}%".format(*x))

missive                                            #all #own %own
-----------------------------------
Both; zonder plaats, zonder datum                     7    3 42.9%
Both; zonder plaats, zonder datum                     7    3 42.9%
Van Diemen; in het Sch .. an Afrika, 5 juni 1631     17    4 23.5%
Maetsuycker, Verburch, .. via, 25 september 1675     20    4 20.0%
Durven, Hasselaar, Blo .. tavia, 17 oktober 1730    120   22 18.3%
Maetsuycker, Verburch, .. avia, 20 februari 1672     17    3 17.6%
Reynst; Bantam, 26 oktober 1615                     748  130 17.4%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Reael; Kasteel Mauriti .. kéan, 20 augustus 1618   1175  181 15.4%
Reniers, Maetsuycker,  .. avia, 24 december 1652   5032  720 14.3%
Brouwer, Van Diemen, L .. atavia, 4 januari 1636   4084  584 14.3%
Coen, Jansz, Lefebvre, .. tavia, 3 november 1628     21    3 14.3%
Coen, Sonck; Schip Nie .. nda-Neira , 6 mei 1621     21    3 14.3%
Both; aan boord van he .. Mayo, 25 februari 1610     14    2 14.3%
Both; aan boord van he .. lbaai, 6 augustus 1610     14    2 14.3%
Coen, De Carpentier, L .. avia, 16 november 1621     15    2 13.3%
Maetsuycker, Hulft, Ha .. tavia, 4 februari 1655     15    2 13.3%
...
Maetsuycker, Hartsinck ..  Batavia, 31 juli 1656    310    5  1.6%
Van Goens, Speelman, B .. avia, 30 augustus 1681     63    1  1.6%
Van Hoorn, Van Riebeec .. tavia, 15 januari 1707    256    4  1.6%
Zwaardecroon, De Haan, ..  Batavia, 2 april 1724     68    1  1.5%
Maetsuycker, Hartsinck .. avia, 26 februari 1665     72    1  1.4%
De Haan, Huysman, Hass .. tavia, 27 oktober 1727     74    1  1.4%
Mossel, Cluysenaar, Va .. tavia, 10 oktober 1752    297    4  1.3%
Maetsuycker, Verburch, .. avia, 13 december 1672     75    1  1.3%
Mossel, Van der Waeyen .. avia, 31 december 1753    306    4  1.3%
Zwaardecroon, De Haan, .. avia, 14 december 1724     77    1  1.3%
Van Outhoorn, Van Hoor .. Batavia, 20 april 1699     81    1  1.2%
Mossel, Van der Waeyen .. avia, 31 december 1754    250    3  1.2%
Van Imhoff, Pasques de .. tavia, 12 oktober 1746    358    4  1.1%
Van Hoorn, Van Riebeec ..  Batavia, 6 maart 1705     99    1  1.0%
Van Imhoff, Cluysenaar .. ; Batavia, 25 mei 1748    200    2  1.0%
Van Cloon, Blom, Van d .. Batavia, 31 maart 1733    218    2  0.9%
Maetsuycker, Hartsinck .. tavia, 29 januari 1663    126    1  0.8%
De Carpentier, Specx,  .. avia, 15 december 1626    144    1  0.7%
Durven, Hasselaar, Blo .. avia, 27 februari 1730    146    1  0.7%
Van Imhoff, Pasques de .. tavia, 20 oktober 1745    307    2  0.7%
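
The peculiarity measure can be checked by hand on a tiny example. The sketch below reuses three invented letters (all names and data are hypothetical) and follows the same steps as the cells above: collect the distinct alphabetic word forms per letter, determine the private ones, and compute their percentage.

```python
import collections

# Invented sample: title of a letter -> the words of its original text.
letters = {
    "letter A": ["de", "schepen", "van", "Batavia"],
    "letter B": ["de", "retouren", "van", "1631"],
    "letter C": ["de", "schepen"],
}


def alpha_words(words):
    """Distinct word forms that consist of letters only."""
    return {w for w in words if w.isalpha()}


# Occurrence base: word form -> titles of the letters it occurs in.
base = collections.defaultdict(set)
for (title, words) in letters.items():
    for w in alpha_words(words):
        base[w].add(title)

# Private words occur in exactly one letter.
privates = {w for (w, titles) in base.items() if len(titles) == 1}

rows = []
ordinary = set()
for (title, words) in letters.items():
    vocab = alpha_words(words)
    if not vocab:
        continue  # empty letter: nothing to measure
    own = vocab & privates
    if not own:
        ordinary.add(title)  # no private words at all
        continue
    rows.append((title, len(vocab), len(own), 100 * len(own) / len(vocab)))

# Most peculiar first; ties broken by size, then by title.
rows.sort(key=lambda e: (-e[3], -e[1], e[0]))

for r in rows:
    print("{:<10} {:>4} {:>4} {:>4.1f}%".format(*r))
print(f"ordinary: {sorted(ordinary)}")
```

Because the counts are taken over sets of word forms, a word that is repeated many times within a letter still counts once.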

Next steps

By now you have an impression of how to compute around in the Missieven. While this is still the beginning, I hope you already sense the power of unlimited programmatic access to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

start: start computing with this corpus
search: turbo charge your hand-coding with search templates
compute: sink down a level and compute it yourself
exportExcel: make tailor-made spreadsheets out of your results
annotate: export text, annotate with BRAT, import annotations
share: draw in other people's data and let them use yours
entities: use results of third-party NER (named entity recognition)
porting: port features made against an older version to a newer version
volumes: work with selected volumes only

CC-BY Dirk Roorda