To get started, consult the start tutorial.
When you analyse a corpus, you are likely to produce data that others can reuse. Maybe you have identified a set of proper-name occurrences, or special numerals, or you have computed part-of-speech assignments.
It is possible to turn these insights into new features, i.e. new .tf files with values assigned to specific nodes.
In the first place, new data is a product of your own methods and computations. But how do you turn that data into new TF features? It turns out that the last step is not that difficult.
If you can shape your data as a mapping (dictionary) from node numbers (integers) to values (strings or integers), then TF can turn that data into a feature file for you with one command.
You can then easily share your new features on GitHub, so that your colleagues everywhere can try it out for themselves.
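Concretely, such a mapping is plain Python. Here is a hypothetical example (the feature name isRoman and the node numbers are made up for illustration):

```python
# A node feature is nothing but a dict from node numbers (integers) to
# values (strings or integers); these node numbers are invented.
isRoman = {
    100733: "XIV",
    100856: "III",
}

# A single TF.save(...) call (shown later in this tutorial) turns such a
# dict into an isRoman.tf feature file.
print(len(isRoman))  # 2
```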
You can add such data on the fly, by passing a mod={org}/{repo}/{path} parameter, or several of them separated by commas.
If the data is there, it will be auto-downloaded and stored on your machine.
Let's do it.
%load_ext autoreload
%autoreload 2
import re
import collections
import os
from tf.app import use
A = use("CLARIAH/wp6-missieven", hoist=globals())
VERSION = A.version
We illustrate the data creation part by creating a new feature, number.
The idea is that we compute a number value for each word that looks like a number but contains OCR errors.
We keep things simple.
We are interested in words that contain only digits and letters, where the number of digits is greater than the number of letters. We exclude words that consist of digits only.
We work only in the original letter content.
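The criterion can be written as a small standalone predicate. This is a sketch; like the corpus loop below, it treats every non-digit character as a "letter":

```python
import re

digitRe = re.compile(r"[0-9]")


def isMangledNumber(chars):
    """Mostly digits, with at least one non-digit mixed in."""
    (rest, nDigits) = digitRe.subn("", chars)
    nOther = len(chars) - nDigits
    return bool(nOther) and nDigits > nOther


print(isMangledNumber("0001b"))  # True: 4 digits, 1 letter
print(isMangledNumber("1681"))   # False: digits only
print(isMangledNumber("ab12"))   # False: digits not in the majority
```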
Let's find them with some hand-written code.
results = []
digitRe = re.compile(r"[0-9]")

for w in F.otype.s("word"):
    chars = F.transo.v(w)
    if not chars:
        continue
    (letters, nDigits) = digitRe.subn("", chars)
    nLetters = len(chars) - nDigits
    if nLetters and nDigits > nLetters:
        results.append(w)
print(results[0:10])
len(results)
[11761, 28520, 30481, 31702, 36287, 37982, 37988, 106832, 112548, 119347]
4727
It happens quite a bit.
Let's have a quick look at the text of the results.
print("\n".join(sorted(F.transo.v(w) for w in results)[0:20]))
0001b 0001b 0001b 0001b 0001b 000© 006½ 022H 024½ 03£ 042V2 051| 052J 053f 053f 062| 0753A 084| 086j 087|
We want to map characters to digits. To get a feel for that, let's make an inventory of the characters that occur in these words.
For each character, count how often it occurs and give at most 10 examples.
inventory = collections.defaultdict(list)

for w in results:
    for c in (trans := F.transo.v(w)):
        if not c.isdigit():
            inventory[c].append(trans)
len(inventory)
61
Quite a few different characters.
for c in sorted(inventory):
    examples = inventory[c]
    n = len(examples)
    showExamples = ", ".join(sorted(examples)[0:10])
    print(f"{c} ({n:>4}x) {showExamples}")
? (  15x) 12?, 144?, 1617?, 16?, 18?, 19?, 286?, 29?, 31?, 413?
A (   9x) 0753A, 13A, 273A, 343A, 3933A, 423A, 43A, 4743A, 553A
C (   1x) 540C3
D (   1x) 1685De
E (   1x) 194845En
H (   4x) 022H, 22H, 2328H, 252H
I (   5x) 217IM, I299v, I85, I85, I85
J (  96x) 052J, 1079J, 1092J, 10J, 10J, 110J, 115J, 1191J, 11J, 121378J
M (   3x) 217IM, 4047M, 564M
O (   4x) 1671Op, 27O4508, O86V2, ÏO011
P (   1x) P10
S (   1x) 16S6
U (   1x) 1U8
V (  76x) 042V2, 1014V2, 1019V2, 1062V4, 1062V4, 10V5, 12V2, 1364V2, 13V2, 14V2
a (   5x) 10a, 11a, 11a, 13a, 1684dat
b (  26x) 0001b, 0001b, 0001b, 0001b, 0001b, 1156bls, 121b, 121b, 121b, 121b
c (  59x) 10c, 12c, 12c, 13c, 13c, 13c, 14c, 14c, 14c, 15c
d (  14x) 100d, 14101de, 1684dat, 29d, d08, d08, d08, d08, d08, d08
e (2952x) 10e, 10e, 10e, 10e, 10e, 10e, 10e, 10e, 10e, 10e
f (  58x) 053f, 053f, 09f, 102f, 108f, 121f, 1222f, 137f, 14f, 14f
g (   9x) 16g, 22g, 28g, 36g, 430g, 6000g, 600g, 705g, 74g
h (   2x) 42h, 605h
i (   4x) 302061in, 496159in, 7897io, 8337tis
j (  24x) 086j, 1023j, 12j, 14j, 14j, 16j, 176j, 236j, 30j, 31j
l (   1x) 1156bls
m (   1x) 366m
n (  44x) 10n, 10n, 10n, 10n, 14n, 14n, 150599en, 15n, 15n, 15n
o (   9x) 24o, 24o, 36o, 36o, 438834V36o, 48o, 5622tot, 5957V3óo, 7897io
p (   2x) 1419p, 1671Op
q (   1x) 2901§§q
r (  24x) 128r, 1300r, 1394rv, 1427r, 149r, 189r, 202r, 20r, 2182r, 256r
s (   6x) 1156bls, 167s, 336s, 4395Vs, 50s, 8337tis
t (   8x) 1684dat, 22t, 4t0, 520t, 5622tot, 5622tot, 6t0, 8337tis
u (   2x) 21u, 417u
v (  20x) 124v, 1394rv, 1426v, 148v, 15v, 15v, 16v, 19v, 212v, 286v
x (   4x) 10x, 18x, 31x, 34x
| ( 232x) 051|, 062|, 084|, 087|, 1034|, 104|, 104|, 106|, 108|, 10|
£ (  51x) 03£, 10£, 10£, 10£, 11£, 12£, 14£, 14£, 14£, 16£
§ (  49x) 090§, 10§, 10§, 10§, 1216§, 1372|§, 139§, 146§, 14§, 166§
© (   1x) 000©
® (  25x) 1000®, 10®, 10®, 125®, 15®, 16®, 1719®, 1®11, 2000®, 20®
° (  10x) 16°, 17°, 20°, 24°, 24°, 25°, 28°, 30°, 51°, °1677
± (   4x) 16±, 28±, 32±, 97±
¼ (   1x) 254¼
½ (   7x) 006½, 024½, 117½, 144½, 22½, 27½, 699½
Ï (   2x) 143Ï, ÏO011
Ö (   1x) Ö00
Ü (   4x) 2328Ü, 516Ü, 659Ü, 929Ü
è (   2x) è60, è70
ï (   8x) 10ï, 166ï, 24ï, 28ï, 292ï, 29ï, 42ï, 8ï4
ó (   3x) 169Vó, 25ó, 5957V3óo
ö (   2x) 189öf, 2ö00
ƒ ( 765x) 12ƒ, 14ƒ, 1753ƒ, 17ƒ, 19ƒ, 8ƒ294, ƒ10, ƒ10, ƒ100, ƒ1002
— (   2x) 1—151, 568—
‘ (   1x) 6440‘
’ (  68x) 36’, d’480, ’19, ’20, ’29, ’34, ’34, ’35, ’35, ’35
“ (   1x) 29“
” (   1x) 1681”
„ (  36x) 12143„, 13757„, 1637„, 3096„, 3546„, 44246„, 615„, „10, „114, „116
™ (   1x) 30™
⌊ (   1x) 1706⌊
We decide to translate a few characters to numerals:
charMapping = {
    "o": 0,
    "ó": 0,
    "ö": 0,
    "Ö": 0,
    "I": 1,
    "J": 1,
    "ï": 1,
    "è": 6,
}
Now we translate all these words with this mapping; if the result is numeric and does not start with a 0, we store the result in a mapping from nodes to numbers.
def cmap(chars):
    n = "".join(str(charMapping.get(c, c)) for c in chars)
    return int(n) if not n.startswith("0") and n.isdigit() else None


number = {w: n for w in results if (n := cmap(F.transo.v(w)))}
len(number)
114
print(number)
{11761: 1151, 368089: 670, 379197: 94001, 379568: 131, 396613: 141, 396656: 20621, 407164: 121, 430354: 121, 432757: 128181, 432879: 1241, 434920: 141, 462917: 621, 464624: 1241, 465415: 631, 472907: 3191, 473135: 9581, 483858: 8191, 486913: 10791, 498619: 8541, 533953: 261, 533968: 331, 535684: 6121, 557983: 77841, 618358: 261, 618871: 4021, 618877: 501, 627195: 261, 653407: 1741, 667437: 15301, 675324: 65931, 750255: 3231, 750445: 5021, 1019955: 10921, 1047395: 1371, 1068377: 52141, 1070934: 49141, 1079667: 2000, 1080766: 72771, 1118656: 4061, 1173348: 161, 1178433: 101, 1196647: 191, 1200319: 201, 1211567: 660, 1230723: 3501, 1234154: 171, 1237203: 111, 1237391: 141, 1250144: 8421, 1253186: 32091, 1271818: 121, 1282202: 75621, 1327325: 121, 1346403: 131, 1352127: 421, 1352309: 421, 1372543: 371, 1379628: 161, 1393864: 2228491, 1443457: 161, 1443464: 361, 1443641: 361, 1443657: 361, 1443666: 101, 1451420: 2981, 1548082: 1101, 1554393: 421, 1653139: 2501, 1669175: 151, 1682688: 4041, 1682700: 1441, 1714540: 721, 1833190: 1213781, 1851679: 1441, 1877221: 98771, 1877228: 977381, 1877230: 167081, 1948091: 925981, 1957857: 15361, 1965567: 181, 2089027: 541, 2126313: 701, 2126473: 621, 2126645: 901, 2126699: 731, 2126709: 911, 2126717: 761, 2126753: 561, 2207671: 1321, 2207675: 361, 2207742: 361, 2351417: 151, 2379398: 121, 2945183: 240, 2968542: 480, 2968588: 240, 2993386: 360, 2993418: 360, 3037496: 250, 3704420: 185, 3820516: 9961, 4086262: 101, 4131362: 185, 4188174: 241, 4262757: 2991, 4277217: 281, 4355285: 291, 4355770: 2921, 4394040: 421, 4412289: 814, 4522464: 1661, 4792505: 121, 4993558: 185, 5146359: 11911}
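To get a feel for what cmap does, here is a self-contained rerun on a few made-up inputs (these strings are invented for illustration; the charMapping is the one defined above):

```python
charMapping = {"o": 0, "ó": 0, "ö": 0, "Ö": 0, "I": 1, "J": 1, "ï": 1, "è": 6}


def cmap(chars):
    # map known OCR misreadings to digits, keep all other characters
    n = "".join(str(charMapping.get(c, c)) for c in chars)
    # accept only fully numeric results without a leading zero
    return int(n) if not n.startswith("0") and n.isdigit() else None


print(cmap("1I5J"))  # 1151: I and J are both read as 1
print(cmap("Ö00"))   # None: maps to 000, which has a leading zero
print(cmap("022H"))  # None: leading zero, and H is not mapped
```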
GITHUB = os.path.expanduser("~/github")
ORG = A.context.org
REPO = A.context.repo
PATH = "exercises/numerics"
Later on, we pass this version on, so that users of our data will get the shared data in exactly the same version as their core data.
We have to specify a bit of metadata for this feature:
metaData = {
    "number": dict(
        valueType="int",
        description="numeric value of corrected number-like strings",
        creator="Dirk Roorda",
    ),
}
Now we can give the save command:
location = f"{GITHUB}/{ORG}/{REPO}/{PATH}/tf"

TF.save(
    nodeFeatures=dict(number=number),
    metaData=metaData,
    location=location,
    module=VERSION,
    silent="auto",
)
0.00s Exporting 1 node and 0 edge and 0 config features to ~/github/CLARIAH/wp6-missieven/exercises/numerics/tf/1.0:
   |     0.00s T number to ~/github/CLARIAH/wp6-missieven/exercises/numerics/tf/1.0
  0.00s Exported 1 node features and 0 edge features and 0 config features to ~/github/CLARIAH/wp6-missieven/exercises/numerics/tf/1.0
True
Here is the data in Text-Fabric format: a feature file.
with open(f"{location}/{VERSION}/number.tf") as fh:
    print(fh.read())
@node
@creator=Dirk Roorda
@description=numeric value of corrected number-like strings
@valueType=int
@writtenBy=Text-Fabric
@dateWritten=2022-10-11T14:56:42Z

11761 1151
368089 670
379197 94001
379568 131
396613 141
396656 20621
407164 121
430354 121
432757 128181
432879 1241
434920 141
462917 621
464624 1241
465415 631
472907 3191
473135 9581
483858 8191
486913 10791
498619 8541
533953 261
533968 331
535684 6121
557983 77841
618358 261
618871 4021
618877 501
627195 261
653407 1741
667437 15301
675324 65931
750255 3231
750445 5021
1019955 10921
1047395 1371
1068377 52141
1070934 49141
1079667 2000
1080766 72771
1118656 4061
1173348 161
1178433 101
1196647 191
1200319 201
1211567 660
1230723 3501
1234154 171
1237203 111
1237391 141
1250144 8421
1253186 32091
1271818 121
1282202 75621
1327325 121
1346403 131
1352127 421
1352309 421
1372543 371
1379628 161
1393864 2228491
1443457 161
1443464 361
1443641 361
1443657 361
1443666 101
1451420 2981
1548082 1101
1554393 421
1653139 2501
1669175 151
1682688 4041
1682700 1441
1714540 721
1833190 1213781
1851679 1441
1877221 98771
1877228 977381
1877230 167081
1948091 925981
1957857 15361
1965567 181
2089027 541
2126313 701
2126473 621
2126645 901
2126699 731
2126709 911
2126717 761
2126753 561
2207671 1321
2207675 361
2207742 361
2351417 151
2379398 121
2945183 240
2968542 480
2968588 240
2993386 360
2993418 360
3037496 250
3704420 185
3820516 9961
4086262 101
4131362 185
4188174 241
4262757 2991
4277217 281
4355285 291
4355770 2921
4394040 421
4412289 814
4522464 1661
4792505 121
4993558 185
5146359 11911
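The format is simple enough to read by hand for a file like this. Here is a minimal parser as a sketch (real .tf files have further conventions, such as tab separators, implicit successor nodes, and edge features, which TF itself handles; this only covers the explicit node-value layout shown above):

```python
def parseSimpleTf(text):
    """Parse a simple .tf node feature file: @key=value header lines,
    a blank line, then lines of 'node value' pairs."""
    meta = {}
    data = {}
    (header, body) = text.split("\n\n", 1)
    for line in header.splitlines():
        if line.startswith("@") and "=" in line:
            (key, value) = line[1:].split("=", 1)
            meta[key] = value
    for line in body.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            data[int(parts[0])] = int(parts[1])
    return (meta, data)


sample = "@node\n@valueType=int\n\n11761 1151\n368089 670\n"
(meta, data) = parseSimpleTf(sample)
print(meta)  # {'valueType': 'int'}
print(data)  # {11761: 1151, 368089: 670}
```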
How to share your own data is explained in the documentation.
Here we show it step by step for the number feature.
If you commit your changes to the exercises repo and have done a git push origin master, you have already shared your data!
For small feature datasets, you are done.
If it gets serious, there is support for releases and efficient data transfer. Here is how:
Note (releases)
If you want to make a stable release, so that you can keep developing, while your users fall back on the stable data, you can make a new release.
Go to the GitHub website, navigate to your repo, click Releases, and follow the prompts.
Note (release binaries)
If you want to make it even smoother for your users, you can zip the data and attach it as a binary to the release just created.
We need to zip the data in exactly the right directory structure. Text-Fabric can do that for us.
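Under the hood, this amounts to walking a version directory and storing its files with the right relative paths. Here is a rough sketch using Python's zipfile module, assuming the directory layout used above; the text-fabric-zip command below does this, and more, for you:

```python
import os
import zipfile


def zipVersion(baseDir, version, dest):
    """Zip one version directory so that it unpacks as {version}/..."""
    src = os.path.join(baseDir, version)
    with zipfile.ZipFile(dest, "w", zipfile.ZIP_DEFLATED) as zf:
        for (root, dirs, files) in os.walk(src):
            for name in files:
                full = os.path.join(root, name)
                # store paths relative to baseDir, so the version
                # directory is preserved inside the zip
                zf.write(full, os.path.relpath(full, baseDir))
```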
%%sh
text-fabric-zip CLARIAH/wp6-missieven/exercises/numerics/tf
This is a TF dataset
Create release data for CLARIAH/wp6-missieven/exercises/numerics/tf
Found 2 versions
zip files end up in ~/Downloads/None/CLARIAH-release/wp6-missieven
zipping CLARIAH/wp6-missieven 0.9.1 with 1 features ==> exercises-numerics-tf-0.9.1.zip
zipping CLARIAH/wp6-missieven 1.0 with 1 features ==> exercises-numerics-tf-1.0.zip
All versions have been zipped, but it is fine to attach only the newest version to the newest release.
If a user asks for an older version in this release, the system can still find it.
We can use the data by calling it up when we say use("CLARIAH/wp6-missieven", ...), where we pass the data modules in the mod argument.
We will also call up the entity data we created in the annotate chapter.
Note that for each module we can specify a flag: :latest, :hot, or :clone.
If you are the author of the data and want to test it, use :clone: it takes the data from the place where you saved it.
If you are a new user of the data, use :hot (get the latest commit) or :latest (get the latest release) to download the data.
If you have downloaded the data before, leave out the flag.
A = use(
    "CLARIAH/wp6-missieven",
    hoist=globals(),
    mod=(
        "CLARIAH/wp6-missieven/exercises/entities/tf",
        "CLARIAH/wp6-missieven/exercises/numerics/tf",
    ),
    version=VERSION,
    silent=False,
)
This is Text-Fabric 10.2.6
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
49 features found and 0 ignored
4.57s All features loaded/computed - for details use TF.isLoaded()
0.42s All additional features loaded - for details use TF.isLoaded()
Above you see new sections in the feature list, which you can expand to see which features each module contributed.
Now, suppose we did not know much about these features; then we would like to do a few basic checks.
A good start is to inspect a frequency list of the values of each new feature, and then to run a query for the nodes that carry these features.
We do that for the entity features and for the number feature.
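A frequency list is conceptually just a count of values, sorted by descending frequency. Here is a sketch of what F.someFeature.freqList() computes, on toy values that are not from the corpus:

```python
import collections


def freqList(values):
    """Pairs (value, frequency), most frequent first, ties by value."""
    counts = collections.Counter(values)
    return tuple(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])))


print(freqList(["Person", "GPE", "Person", "Organization", "Person"]))
# (('Person', 3), ('GPE', 1), ('Organization', 1))
```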
F.entityId.freqList()
(('T11', 6), ('T2', 5), ('T13', 3), ('T16', 3), ('T8', 3), ('T9', 3), ('T10', 2), ('T15', 2), ('T17', 2), ('T3', 2), ('T5', 2), ('T1', 1), ('T12', 1), ('T4', 1), ('T6', 1), ('T7', 1))
F.entityKind.freqList()
(('Person', 18), ('GPE', 15), ('Organization', 5))
F.entityComment.freqList()
(('Ternate', 5), ('Amboina', 2))
Let's query all words that have an entity annotation:
query = """
word entityId entityKind* entityComment*
"""
results = A.search(query)
4.40s 23 results
Here we query all words where the entityId feature is present.
We also mention the entityKind and entityComment features, but with a * behind them.
That is a criterion which is always true, so these mentions do not alter the result list.
But now these features occur in the query, so when we show results, they will be displayed as well.
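In plain Python terms, a trailing * turns a feature into a condition that always holds, while still registering the feature for display. A toy illustration (invented node data, not how the search engine is actually implemented):

```python
# toy feature data per node; the values are made up for illustration
nodes = {
    1: {"entityId": "T1", "entityKind": "Person"},
    2: {"entityId": "T2"},  # entityKind absent
    3: {},                  # no entity features at all
}

# entityId without * is a real condition: the feature must be present;
# entityKind* is always satisfied and only marks the feature for display
results = [n for (n, fs) in nodes.items() if "entityId" in fs]
print(results)  # [1, 2]
```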
A.show(results, condensed=True)
(rendered display of 12 result lines omitted)
Observation
It is not only words that have entity features; the lines themselves have also received such annotations.
It turns out that annotating lines with entities in this way is not very useful. It would be better to annotate them with the number of entities they contain. That is our feedback to the creator of these annotations, and because we know the GitHub repo they come from, we can file an issue!
F.number.freqList()
((121, 6), (361, 5), (421, 4), (101, 3), (141, 3), (161, 3), (185, 3), (261, 3), (131, 2), (151, 2), (240, 2), (360, 2), (621, 2), (1241, 2), (1441, 2), (111, 1), (171, 1), (181, 1), (191, 1), (201, 1), (241, 1), (250, 1), (281, 1), (291, 1), (331, 1), (371, 1), (480, 1), (501, 1), (541, 1), (561, 1), (631, 1), (660, 1), (670, 1), (701, 1), (721, 1), (731, 1), (761, 1), (814, 1), (901, 1), (911, 1), (1101, 1), (1151, 1), (1321, 1), (1371, 1), (1661, 1), (1741, 1), (2000, 1), (2501, 1), (2921, 1), (2981, 1), (2991, 1), (3191, 1), (3231, 1), (3501, 1), (4021, 1), (4041, 1), (4061, 1), (5021, 1), (6121, 1), (8191, 1), (8421, 1), (8541, 1), (9581, 1), (9961, 1), (10791, 1), (10921, 1), (11911, 1), (15301, 1), (15361, 1), (20621, 1), (32091, 1), (49141, 1), (52141, 1), (65931, 1), (72771, 1), (75621, 1), (77841, 1), (94001, 1), (98771, 1), (128181, 1), (167081, 1), (925981, 1), (977381, 1), (1213781, 1), (2228491, 1))
We see the values that we generated before.
Let's show the original and the number side by side.
results = A.search(
    """
word number transo*
"""
)
1.87s 114 results
A.show(results, start=1, end=10)
(rendered display of results 1-10 omitted)
If more researchers have shared data modules, you can draw them all in.
Then you can design queries that use features from all these different sources.
In that way, you build your own research on top of the work of others.
Hover over the features to see where they come from: in this case, your local GitHub clone.
See the next tutorial in this series for how you can draw in and make use of additional features, produced by a serious algorithm to detect named entities.
CC-BY Dirk Roorda