We start with a baby corpus in a markdown-like format.
This is not meant as a preferred format for a corpus;
the point of this tutorial is to show what it takes to turn arbitrary data into TF.
Below you see a string which contains 1 "book", with 2 "chapters", each having one or two sentences.
Here is the complete corpus source material:
source = '''
# Consider Phlebas
$ author=Iain M. Banks
## 1
Everything about us,
everything around us,
everything we know [and can know of] is composed ultimately of patterns of nothing;
that’s the bottom line, the final truth.
So where we find we have any control over those patterns,
why not make the most elegant ones, the most enjoyable and good ones,
in our own terms?
## 2
Besides,
it left the humans in the Culture free to take care of the things that really mattered in life,
such as [sports, games, romance,] studying dead languages,
barbarian societies and impossible problems,
and climbing high mountains without the aid of a safety harness.
'''
Note a few details:

* # starts a "book" and the line contains its title: section level 1;
* $ starts a line with key=value metadata: the author; this line is not part of the text;
* ## starts a "chapter" and the line contains its number: section level 2;
* words between [ and ] will get a gap=1 feature;
* the punctuation behind a word will go into a punc feature.

Now we start the engines: Text-Fabric, and the walker conversion module.
import os
import re
from tf.fabric import Fabric
from tf.convert.walker import CV
We call up TF and let it look into the directory where the output has to land, in this case a subdirectory of the banks repo in the annotation organization.
# TF_DIR = os.path.expanduser('~/Downloads/banks/tf') # if you want it in your Downloads directory instead
BASE = os.path.expanduser('~/github')
ORG = 'annotation'
REPO = 'banks'
RELATIVE = 'tf'
TF_DIR = os.path.expanduser(f'{BASE}/{ORG}/{REPO}/{RELATIVE}')
VERSION = '0.2'
TF_PATH = f'{TF_DIR}/{VERSION}'
TF = Fabric(locations=TF_PATH, silent=True)
A Text-Fabric dataset is a bunch of individual .tf
files that start with a little bit of metadata and then contain
a stream of data, typically the values of a single feature for each node or edge in the graph.
We specify the metadata bit by bit.
A crucial design aspect of each TF dataset is its granularity. What are the slots?
Words, morphemes, characters?
You decide.
slotType = 'word'
Users who encounter your TF data in the wild will be thankful if you take the trouble to say, in a few key-value pairs, what that data is about.
The metadata you specify here will end up in all generated TF features.
generic = {
'name': 'Culture quotes from Iain Banks',
'compiler': 'Dirk Roorda',
'source': 'Good Reads',
'url': 'https://www.goodreads.com/work/quotes/14366-consider-phlebas',
'version': '0.2',
'purpose': 'exposition',
'status': 'with for similarities in a separate module'
}
A few things concerning the presentation of your text can be specified in the otext feature.
This is a TF feature without data: it has only metadata.
It contains the specs for the section structure of your corpus and for the text formats.
TF assumes that there are two or three section levels it can work with.
For each level you have to specify the corresponding node type and the feature that contains the section name or number
(sectionTypes and sectionFeatures).
But you can also define a more extensive and flexible section structure for your own purposes
(structureTypes and structureFeatures).
Both systems may use the same types and features, but they are completely independent.
When you ask TF to render slot nodes (the nodes with text), TF needs to know which features to render.
A text format is a template with placeholders for the features you want to use.
otext = {
'fmt:text-orig-full': '{letters}{punc} ',
'fmt:line-term': 'line#{terminator} ',
'fmt:line-default': '{letters:XXX}{terminator} ',
'sectionTypes': 'book,chapter,sentence',
'sectionFeatures': 'title,number,number',
'structureTypes': 'book,chapter,sentence,line',
'structureFeatures': 'title,number,number,number',
}
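To get a feel for what a text format does, here is a minimal sketch, my own illustration and not Text-Fabric's actual rendering code, of filling the text-orig-full template with feature values of a few words (the feature values are taken from the corpus above):

```python
# Sketch of how a text format template is filled per slot
# (an illustration, not Text-Fabric's actual rendering code).
fmt = '{letters}{punc} '  # the template of fmt:text-orig-full

# feature values of the first three words of the corpus
words = [
    {'letters': 'Everything', 'punc': ','},
    {'letters': 'about', 'punc': ''},
    {'letters': 'us', 'punc': ','},
]

# render by substituting the features of each word into the template
text = ''.join(fmt.format(**w) for w in words)
print(repr(text))
```

Note that a word without punctuation still contributes the trailing space of the template, which is how words stay separated in the rendered text.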
The values of features are usually strings. But if you know that they are always integers, you can declare a feature as an integer-valued feature.
The only thing you have to do is declare them in the following set:
intFeatures = {
'number',
'gap',
}
You can say per feature what it does. Use as many key-values as you like.
A good description is particularly helpful.
It is surprising how often you want to consult those descriptions yourself.
featureMeta = {
'number': {
'description': 'number of chapter, or sentence in chapter, or line in sentence',
},
'gap': {
'description': '1 for words that occur between [ ], which are inserted by the editor',
},
'title': {
'description': 'the title of a book',
},
'author': {
'description': 'the author of a book',
},
'terminator': {
'description': 'the last character of a line',
},
'letters': {
'description': 'the letters of a word',
},
'punc': {
'description': 'the punctuation after a word',
},
}
This is the heart of the matter.
The director is a function to be written by you.
It needs to plough through your source material and fire actions towards the TF machinery. The TF work is done for you by these actions, so you can concentrate on the particulars of your source data.

Every time the director encounters a new textual object in the source, it issues an action requesting a new node. The director gets a receipt, by which it can issue subsequent actions for that node, such as adding feature values to it.
And when the object is done, the director issues a terminate action.

During all this, the cv machine is busy translating these actions into the construction of a proper TF graph representing all the source material that you have exposed to it.
A few things to note:

* if the director encounters an error in the source, it can issue a cv.stop(message) action;
* the director is responsible for returning control after issuing a cv.stop().

def director(cv):
    counter = dict(
        sentence=0,
        line=0,
    )
    cur = dict(
        book=None,
        chapter=None,
        sentence=None,
    )
    wordRe = re.compile(r'^(.*?)([^A-Za-z0-9]*)$')
    metaRe = re.compile(r'^\$\s*([^= ]+)\s*=\s*(.*)')

    for line in source.strip().split('\n'):
        line = line.rstrip()
        if not line:
            cv.terminate(cur['sentence'])  # action
            for ntp in counter:
                counter[ntp] += 1
            cur['sentence'] = cv.node('sentence')  # action
            cv.feature(
                cur['sentence'],
                number=counter['sentence'],
            )  # action
            continue
        if line.startswith('# '):
            for ntp in ('sentence', 'chapter', 'book'):
                cv.terminate(cur[ntp])  # action
                cur[ntp] = None
            title = line[2:].strip()
            cur['book'] = cv.node('book')  # action
            for ntp in counter:
                counter[ntp] = 0
            cv.feature(
                cur['book'],
                title=title,
            )  # action
            continue
        if line.startswith('## '):
            for ntp in ('sentence', 'chapter'):
                cv.terminate(cur[ntp])  # action
                cur[ntp] = None
            number = line[2:].strip()
            cur['chapter'] = cv.node('chapter')  # action
            for ntp in counter:
                counter[ntp] = 0
            cv.feature(
                cur['chapter'],
                number=number,
            )  # action
            continue
        if line.startswith('$'):
            match = metaRe.match(line)
            if not match:
                cv.stop(
                    f'Malformed metadata line: "{line}"'
                )  # action
                return
            name = match.group(1)
            value = match.group(2)
            cv.feature(
                cur['book'],
                **{name: value},
            )  # action
            continue

        if not cur['sentence']:
            cur['sentence'] = cv.node('sentence')  # action
            counter['sentence'] += 1
            cv.feature(
                cur['sentence'],
                number=counter['sentence'],
            )  # action

        cur['line'] = cv.node('line')  # action
        counter['line'] += 1
        cv.feature(
            cur['line'],
            terminator=line[-1],
            number=counter['line'],
        )  # action

        gap = False
        for word in line.split():
            if word.startswith('['):
                gap = True
                cv.terminate(cur['line'])  # action
                w = cv.slot()  # action
                cv.feature(w, gap=1)  # action
                word = word[1:]
            elif word.endswith(']'):
                w = cv.slot()  # action
                cv.resume(cur['line'])  # action
                cv.feature(w, gap=1)  # action
                gap = False
                word = word[0:-1]
            else:
                w = cv.slot()  # action
                if gap:
                    cv.feature(w, gap=1)  # action
            (letters, punc) = wordRe.findall(word)[0]
            cv.feature(w, letters=letters)  # action
            if punc:
                cv.feature(w, punc=punc)  # action

        cv.terminate(cur['line'])  # action
        cur['line'] = None

    # just for informational purposes
    print('\nINFORMATION:', cv.activeTypes(), '\n')  # action

    for ntp in ('sentence', 'chapter', 'book'):
        cv.terminate(cur[ntp])  # action

    cv.meta(
        'punc', remark='a bit more info is needed',
    )  # action
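The two regular expressions in the director do the low-level tokenizing: wordRe splits a word into its letters and its trailing punctuation, and metaRe picks a key and a value out of a metadata line. A quick check, with sample inputs taken from the corpus above:

```python
import re

# the regexes used by the director
wordRe = re.compile(r'^(.*?)([^A-Za-z0-9]*)$')
metaRe = re.compile(r'^\$\s*([^= ]+)\s*=\s*(.*)')

# wordRe separates trailing punctuation from the letters of a word
print(wordRe.findall('nothing;')[0])  # ('nothing', ';')
print(wordRe.findall('terms?')[0])    # ('terms', '?')

# metaRe parses a $ key=value metadata line
match = metaRe.match('$ author=Iain M. Banks')
print(match.group(1))  # author
print(match.group(2))  # Iain M. Banks
```

Because the first group of wordRe is non-greedy and the second group is anchored at the end, all trailing non-alphanumeric characters end up in the punctuation group.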
We are going to run the conversion and check whether all is well.
Next we initialize the conversion machinery: we obtain an object with methods.
cv = CV(TF)
good = cv.walk(
director,
slotType,
otext=otext,
generic=generic,
intFeatures=intFeatures,
featureMeta=featureMeta,
)
good
  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata...
   |     SECTION   TYPES:    book, chapter, sentence
   |     SECTION   FEATURES: title, number, number
   |     STRUCTURE TYPES:    book, chapter, sentence, line
   |     STRUCTURE FEATURES: title, number, number, number
   |     TEXT      FEATURES:
   |      |   line-default         letters, terminator
   |      |   line-term            terminator
   |      |   text-orig-full       letters, punc
   |     0.01s OK
   |     0.00s Following director...

INFORMATION: {'sentence', 'chapter', 'book'}

   |     0.00s "edge" actions: 0
   |     0.00s "feature" actions: 144
   |     0.00s "node" actions: 20
   |     0.00s "resume" actions: 2
   |     0.00s "slot" actions: 99
   |     0.00s "terminate" actions: 27
   |          1 x "book" node
   |          2 x "chapter" node
   |         12 x "line" node
   |          5 x "sentence" node
   |         99 x "word" node  = slot type
   |        119 nodes of all types
   |     0.01s OK
   |     0.00s Removing unlinked nodes ...
   |      |   -0.00s      2 unlinked "sentence" nodes: [1, 4]
   |      |    0.00s      2 unlinked nodes
   |      |    0.00s Leaving 117 nodes
   |     0.00s checking for nodes and edges ...
   |     0.00s OK
   |     0.00s checking features ...
   |     0.00s OK
   |     0.00s reordering nodes ...
   |     0.00s Sorting 1 nodes of type "book"
   |     0.00s Sorting 2 nodes of type "chapter"
   |     0.00s Sorting 12 nodes of type "line"
   |     0.00s Sorting 3 nodes of type "sentence"
   |     0.00s Max node = 117
   |     0.00s OK
   |     0.00s reassigning feature values ...
   |      |    0.01s node feature "author" with 1 node
   |      |    0.01s node feature "gap" with 7 nodes
   |      |    0.01s node feature "letters" with 99 nodes
   |      |    0.01s node feature "number" with 17 nodes
   |      |    0.01s node feature "punc" with 17 nodes
   |      |    0.01s node feature "terminator" with 12 nodes
   |      |    0.01s node feature "title" with 1 node
   |     0.00s OK
  0.00s Exporting 8 node and 1 edge and 1 config features to /Users/dirk/github/annotation/banks/tf/0.2:
  0.00s VALIDATING oslots feature
  0.00s VALIDATING oslots feature
  0.00s maxSlot=         99
  0.00s maxNode=        117
  0.00s OK: oslots is valid
   |     0.00s T author               to /Users/dirk/github/annotation/banks/tf/0.2
   |     0.00s T gap                  to /Users/dirk/github/annotation/banks/tf/0.2
   |     0.01s T letters              to /Users/dirk/github/annotation/banks/tf/0.2
   |     0.01s T number               to /Users/dirk/github/annotation/banks/tf/0.2
   |     0.01s T otype                to /Users/dirk/github/annotation/banks/tf/0.2
   |     0.00s T punc                 to /Users/dirk/github/annotation/banks/tf/0.2
   |     0.00s T terminator           to /Users/dirk/github/annotation/banks/tf/0.2
   |     0.00s T title                to /Users/dirk/github/annotation/banks/tf/0.2
   |     0.00s T oslots               to /Users/dirk/github/annotation/banks/tf/0.2
   |     0.00s M otext                to /Users/dirk/github/annotation/banks/tf/0.2
  0.05s Exported 8 node features and 1 edge features and 1 config features to /Users/dirk/github/annotation/banks/tf/0.2
True
with open(f'{TF_PATH}/otype.tf') as fh:
    print(fh.read())
@node
@compiler=Dirk Roorda
@name=Culture quotes from Iain Banks
@purpose=exposition
@source=Good Reads
@status=with for similarities in a separate module
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@valueType=str
@version=0.2
@writtenBy=Text-Fabric
@dateWritten=2020-02-13T06:46:28Z

1-99	word
100	book
101-102	chapter
103-114	line
115-117	sentence
otext

with open(f'{TF_PATH}/otext.tf') as fh:
    print(fh.read())
@config
@compiler=Dirk Roorda
@fmt:line-default={letters:XXX}{terminator} 
@fmt:line-term=line#{terminator} 
@fmt:text-orig-full={letters}{punc} 
@name=Culture quotes from Iain Banks
@purpose=exposition
@sectionFeatures=title,number,number
@sectionTypes=book,chapter,sentence
@source=Good Reads
@status=with for similarities in a separate module
@structureFeatures=title,number,number,number
@structureTypes=book,chapter,sentence,line
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@version=0.2
@writtenBy=Text-Fabric
@dateWritten=2020-02-13T06:46:28Z
oslots

with open(f'{TF_PATH}/oslots.tf') as fh:
    print(fh.read())
@edge
@compiler=Dirk Roorda
@name=Culture quotes from Iain Banks
@purpose=exposition
@source=Good Reads
@status=with for similarities in a separate module
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@valueType=str
@version=0.2
@writtenBy=Text-Fabric
@dateWritten=2020-02-13T06:46:28Z

100	1-99
1-55
56-99
1-3
4-6
7-9,14-20
21-27
28-38
39-51
52-55
56
57-75
76-77,81-83
84-88
89-99
1-27
28-55
56-99
The first data line, 100	1-99, tells that node 100 (the first book node, see the otype feature above) is linked to slots 1-99, which are all slots.
The next line contains just 1-55: these are the slots of node 101, since a line without a node spec refers to 1 + the previous node.
And so on.
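This implicit numbering can be sketched in a few lines of Python. This is my own illustration of the convention, not Text-Fabric's actual loader:

```python
# Sketch of implicit node numbering in a .tf data stream
# (an illustration, not Text-Fabric's actual parser):
# a line with a tab carries an explicit node spec before its value;
# a line without one belongs to the previous node + 1.
lines = ['100\t1-99', '1-55', '56-99']

node = None
slots = {}
for ln in lines:
    if '\t' in ln:
        spec, value = ln.split('\t', 1)
        node = int(spec)
    else:
        node += 1
        value = ln
    slots[node] = value

print(slots)  # {100: '1-99', 101: '1-55', 102: '56-99'}
```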
with open(f'{TF_PATH}/number.tf') as fh:
    print(fh.read())
@node
@compiler=Dirk Roorda
@description=number of chapter, or sentence in chapter, or line in sentence
@name=Culture quotes from Iain Banks
@purpose=exposition
@source=Good Reads
@status=with for similarities in a separate module
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@valueType=int
@version=0.2
@writtenBy=Text-Fabric
@dateWritten=2020-02-13T06:46:28Z

101	1
2
1
2
3
4
6
7
8
1
2
3
4
5
1
2
1
Node 101 is the first chapter node, which has chapter number 1.
The next line is about node 102, the second chapter, with number 2.
The following line refers to node 103, and a quick glance at the otype
feature shows that this is a line.
The last three lines are about the three sentences, which are numbered within their chapter: 1, then 2, and then again 1.
with open(f'{TF_PATH}/letters.tf') as fh:
    print(fh.read())
@node
@compiler=Dirk Roorda
@description=the letters of a word
@name=Culture quotes from Iain Banks
@purpose=exposition
@source=Good Reads
@status=with for similarities in a separate module
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@valueType=str
@version=0.2
@writtenBy=Text-Fabric
@dateWritten=2020-02-13T06:46:28Z

Everything
about
us
everything
around
us
everything
we
know
and
can
know
of
is
composed
ultimately
of
patterns
of
nothing
that’s
the
bottom
line
the
final
truth
So
where
we
find
we
have
any
control
over
those
patterns
why
not
make
the
most
elegant
ones
the
most
enjoyable
and
good
ones
in
our
own
terms
Besides
it
left
the
humans
in
the
Culture
free
to
take
care
of
the
things
that
really
mattered
in
life
such
as
sports
games
romance
studying
dead
languages
barbarian
societies
and
impossible
problems
and
climbing
high
mountains
without
the
aid
of
a
safety
harness
The plain, clean text of everything.
punc

with open(f'{TF_PATH}/punc.tf') as fh:
    print(fh.read())
@node
@compiler=Dirk Roorda
@description=the punctuation after a word
@name=Culture quotes from Iain Banks
@purpose=exposition
@remark=a bit more info is needed
@source=Good Reads
@status=with for similarities in a separate module
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@valueType=str
@version=0.2
@writtenBy=Text-Fabric
@dateWritten=2020-02-13T06:46:28Z

3	,
6	,
20	;
24	,
27	.
38	,
45	,
51	,
55	?
,
75	,
78	,
,
,
83	,
88	,
99	.
Note the metadata field remark=a bit more info is needed, which was added "last-minute" by means of a cv.meta() action.