Start with the convert tutorial, which shows how this corpus was created.


Use the Banks example corpus

Load TF

We (down)load the corpus.

In [1]:
from tf.app import use
In [2]:
A = use("annotation/banks", hoist=globals())
TF-app: ~/text-fabric-data/annotation/banks/app
data: ~/text-fabric-data/annotation/banks/tf/0.2
This is Text-Fabric 9.2.2
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

10 features found and 0 ignored
Text-Fabric: Text-Fabric API 9.2.2, annotation/banks/app v3, Search Reference
Data: BANKS, Character table, Feature docs
Features:
Two quotes from Consider Phlebas by Iain M. Banks

author (str): the author of a book
gap (int): 1 for words that occur between [ ], which are inserted by the editor
letters (str): the letters of a word
number (int): number of chapter, or sentence in chapter, or line in sentence
otype (str)
punc (str): the punctuation after a word
terminator (str): the last character of a line
title (str): the title of a book
oslots (none)

All features share the same metadata:
compiler: Dirk Roorda
dateWritten: 2020-02-13T13:37:47Z
name: Culture quotes from Iain Banks
purpose: exposition
source: Good Reads
status: with for similarities in a separate module
url: https://www.goodreads.com/work/quotes/14366-consider-phlebas
version: 0.2
writtenBy: Text-Fabric

In addition, punc has:
remark: a bit more info is needed
Text-Fabric API: names N F E L T S C TF directly usable

Exploration

Let's explore this corpus by means of Text-Fabric.

Frequency list

We can get ordered frequency lists for the values of all features.

First the words:

In [3]:
F.letters.freqList()
Out[3]:
(('the', 8),
 ('of', 5),
 ('and', 4),
 ('in', 3),
 ('we', 3),
 ('everything', 2),
 ('know', 2),
 ('most', 2),
 ('ones', 2),
 ('patterns', 2),
 ('us', 2),
 ('Besides', 1),
 ('Culture', 1),
 ('Everything', 1),
 ('So', 1),
 ('a', 1),
 ('about', 1),
 ('aid', 1),
 ('any', 1),
 ('around', 1),
 ('as', 1),
 ('barbarian', 1),
 ('bottom', 1),
 ('can', 1),
 ('care', 1),
 ('climbing', 1),
 ('composed', 1),
 ('control', 1),
 ('dead', 1),
 ('elegant', 1),
 ('enjoyable', 1),
 ('final', 1),
 ('find', 1),
 ('free', 1),
 ('games', 1),
 ('good', 1),
 ('harness', 1),
 ('have', 1),
 ('high', 1),
 ('humans', 1),
 ('impossible', 1),
 ('is', 1),
 ('it', 1),
 ('languages', 1),
 ('left', 1),
 ('life', 1),
 ('line', 1),
 ('make', 1),
 ('mattered', 1),
 ('mountains', 1),
 ('not', 1),
 ('nothing', 1),
 ('our', 1),
 ('over', 1),
 ('own', 1),
 ('problems', 1),
 ('really', 1),
 ('romance', 1),
 ('safety', 1),
 ('societies', 1),
 ('sports', 1),
 ('studying', 1),
 ('such', 1),
 ('take', 1),
 ('terms', 1),
 ('that', 1),
 ('that’s', 1),
 ('things', 1),
 ('those', 1),
 ('to', 1),
 ('truth', 1),
 ('ultimately', 1),
 ('where', 1),
 ('why', 1),
 ('without', 1))

For the node types we can get info by calling this:

In [4]:
C.levels.data
Out[4]:
(('book', 99.0, 100, 100),
 ('chapter', 49.5, 101, 102),
 ('sentence', 33.0, 115, 117),
 ('line', 7.666666666666667, 103, 114),
 ('word', 1, 1, 99))

Each tuple gives a node type, the average number of words per node of that type, and the first and last node of that type. So chapters are 49.5 words long on average, and the chapter nodes are 101 and 102.

And you see that we have 99 words.
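
As an aside, here is a minimal sketch (not a cell from the original notebook) of how the tuples can be unpacked to list the node counts per type:

for (nodeType, avgWords, first, last) in C.levels.data:
    # first and last are the node numbers bounding this type
    print(f"{nodeType:<8} {last - first + 1:>3} nodes, {avgWords:.1f} words on average")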

Add to the Banks corpus

We are going to create an edge between each pair of words, and we annotate each edge with how similar the two words are.

We measure the similarity by looking at the set of distinct letters in each word (lowercased), and computing the percentage of letters the two words have in common relative to the letters they jointly have (the Jaccard index, expressed as a percentage).
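
For example (a worked illustration, not part of the original notebook), take the words the and that:

a = set("the")   # {'t', 'h', 'e'}
b = set("that")  # {'t', 'h', 'a'}
print(int(round(100 * len(a & b) / len(a | b))))  # 2 shared of 4 joint letters: 50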

This will become a symmetric edge feature. Symmetric means that if a is similar to b, then b is similar to a, with the same similarity value.

We only store one copy of each symmetric pair of edges.

We can then use E.sim.b(node) to find all nodes that are parallel to node.

If words do not have letters in common, their similarity is 0, and we do not make an edge.
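
As a sketch of the intended lookup (hypothetical until the feature is built, saved, and loaded later in this notebook): for a valued edge feature, E.sim.b(node) yields (node, value) pairs in both directions.

# assumes the sim edge feature has been loaded (see the end of this notebook)
for (other, s) in E.sim.b(1):  # node 1 is the word "Everything"
    # only one direction is stored, but .b looks both ways
    if s == 100:
        print(other, F.letters.v(other))  # should print at least: 4 everything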

Preparation

We pre-compute all letter sets for all words.

In [5]:
def makeSet(w):
    # the set of distinct letters in word w, lowercased
    return set(F.letters.v(w).lower())
In [6]:
words = {}

for w in F.otype.s("word"):
    words[w] = makeSet(w)

nWords = len(words)
print(f"{nWords} words")
99 words
In [7]:
def sim(wSet, vSet):
    # percentage of letters in common relative to all letters in either word
    return int(round(100 * len(wSet & vSet) / len(wSet | vSet)))

Compute all similarities

We are going to perform all comparisons.

Since there are 99 words, this amounts to only 99 × 98 / 2 = 4851 comparisons.

For a big corpus, this number grows quadratically with the number of items to be compared.

See for example the similarities in the Quran.

In [8]:
def computeSim():
    similarity = {}

    wordNodes = sorted(words.keys())
    nWords = len(wordNodes)

    # each unordered pair is compared once
    nComparisons = nWords * (nWords - 1) // 2

    print(f"{nComparisons} comparisons to make")

    TF.indent(reset=True)

    co = 0
    si = 0
    for i in range(nWords):
        nodeI = wordNodes[i]
        wordI = words[nodeI]
        for j in range(i + 1, nWords):
            nodeJ = wordNodes[j]
            wordJ = words[nodeJ]
            s = sim(wordI, wordJ)
            co += 1
            if s:
                similarity[(nodeI, nodeJ)] = s
                si += 1

    TF.info(f"{co:>4} comparisons and {si:>4} similarities")
    return similarity
In [9]:
similarity = computeSim()
4851 comparisons to make
  0.01s 4851 comparisons and 3332 similarities
In [10]:
print(min(similarity.values()))
print(max(similarity.values()))
7
100
In [11]:
eq = [x for x in similarity.items() if x[1] >= 100]
neq = [x for x in similarity.items() if x[1] <= 50]
In [12]:
print(eq[0])
print(neq[0])
((1, 4), 100)
((1, 2), 8)
In [13]:
print(len(eq))
print(len(neq))
58
3247
In [14]:
print(eq[0][0][0], F.letters.v(eq[0][0][0]))
print(eq[0][0][1], F.letters.v(eq[0][0][1]))
1 Everything
4 everything
In [15]:
print(neq[0][0][0], F.letters.v(neq[0][0][0]))
print(neq[0][0][1], F.letters.v(neq[0][0][1]))
1 Everything
2 about

Add parallels to the TF dataset

We now add this information to the Banks dataset as an edge feature.

In [22]:
import os
In [25]:
GH_BASE = os.path.expanduser("~/github")

path = f"{A.context.org}/{A.context.repo}/sim/tf"
location = f"{GH_BASE}/{path}"
module = A.context.version
In [26]:
metaData = {
    "": {
        "name": "Banks (similar words)",
        "converters": "Dirk Roorda",
        "sourceUrl": "https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/text-fabric/use.ipynb",
        "version": "0.2",
    },
    "sim": {
        "valueType": "int",
        "edgeValues": True,
        "description": "similarity between words, as a percentage of the common material wrt the combined material",
    },
}
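
The entry under the empty key holds generic metadata that Text-Fabric adds to every feature in the module; the entry under sim is specific to that feature. valueType and edgeValues declare that sim is an edge feature whose edges carry int values.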
In [27]:
simData = {}
for ((f, t), d) in similarity.items():
    simData.setdefault(f, {})[t] = d
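
After this reshaping, simData maps each source node to a dict of its target nodes with the similarity values, which is the shape TF.save expects for valued edge features. We can check it against the pair we inspected above:

print(simData[1][4])  # 100: "Everything" and "everything" are maximally similar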
In [28]:
A.api.TF.save(
    edgeFeatures=dict(sim=simData), metaData=metaData, location=location, module=module
)
  0.00s Exporting 0 node and 1 edge and 0 config features to ~/github/annotation/banks/sim/tf/0.2:
   |     0.01s T sim                  to ~/github/annotation/banks/sim/tf/0.2
  0.01s Exported 0 node features and 1 edge features and 0 config features to ~/github/annotation/banks/sim/tf/0.2
Out[28]:
True
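
To actually use E.sim, the corpus can be reloaded together with the new module. A sketch, assuming the module lives in the local ~/github clone (the :clone specifier tells Text-Fabric to look there rather than online):

A = use("annotation/banks", mod="annotation/banks/sim/tf:clone", hoist=globals())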


CC-BY Dirk Roorda