To get started, consult the start tutorial.
When you analyse a corpus, you are likely to produce data that others can reuse. Maybe you have identified a set of proper name occurrences, or you have computed sentiments.
It is possible to turn these insights into new features, i.e. new .tf
files with values assigned to specific nodes.
New data like this is the product of your own methods and computations in the first place. But how do you turn that data into new TF features? It turns out that this last step is not that difficult.
If you can shape your data as a mapping (dictionary) from node numbers (integers) to values (strings or integers), then TF can turn that data into a feature file for you with one command.
You can then easily share your new features on GitHub, so that your colleagues everywhere can try it out for themselves.
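The shape of such a mapping is simple. Here is a minimal sketch, with made-up node numbers and sentiment values purely for illustration (nothing corpus-specific is assumed):

```python
# A hypothetical node-to-value mapping: keys are TF node numbers (integers),
# values are feature values (integers or strings). The numbers below are
# invented for illustration only.
sentimentData = {
    142: 2,     # node 142 gets sentiment 2
    143: -1,    # node 143 gets sentiment -1
    8051: 0,    # node 8051 gets sentiment 0
}

# A dict of this shape is all that the save command (shown later in this
# chapter) needs, wrapped as nodeFeatures=dict(sentiment=sentimentData).
print(sorted(sentimentData.items()))
```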
You can add such data on the fly, by passing a mod={org}/{repo}/{path}
parameter, or several of them separated by commas.
If the data is there, it will be auto-downloaded and stored on your machine.
Let's do it.
%load_ext autoreload
%autoreload 2
import collections
import os
from tf.app import use
A = use("q-ran/quran", checkout="clone", hoist=globals())
This is Text-Fabric 9.2.3
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
40 features found and 0 ignored
We illustrate the data creation part by creating a new feature: sentiment.
It is not very meaningful, but it serves to illustrate the workflow.
We consider ayas that start with a vocative particle as a positive context, and ayas that start with a resumptive particle as a negative context.
For each lemma of a noun, verb, or adjective in the corpus, we count how often it occurs in a positive context, and subtract how many times it occurs in a negative context.
The resulting number is the sentiment.
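In pure Python, the counting scheme amounts to the following. This is a toy illustration: the lemma strings here are invented and do not come from the corpus.

```python
import collections

# Toy illustration of the sentiment arithmetic: +1 per occurrence in a
# positive (vocative-initial) context, -1 per occurrence in a negative
# (resumptive-initial) context. The lemma strings are made up.
positiveOccurrences = ["mercy", "mercy", "guidance"]
negativeOccurrences = ["mercy", "punishment", "punishment"]

sentiment = collections.Counter()
for lemma in positiveOccurrences:
    sentiment[lemma] += 1
for lemma in negativeOccurrences:
    sentiment[lemma] -= 1

# mercy: +2 - 1 = 1, guidance: +1, punishment: -2
print(dict(sentiment))
```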
We use a query to fetch the positive contexts and the negative contexts.
contentTypes = set("verb noun adjective".split())
contentTypeCrit = "|".join(contentTypes)
queryP = f"""
aya
=: word posx=vocative
word pos={contentTypeCrit}
"""
queryN = f"""
aya
=: word posx=resumption
word pos={contentTypeCrit}
"""
resultsP = A.search(queryP)
resultsN = A.search(queryN)
0.24s 2513 results
0.23s 7351 results
Here are the first few results of both:
A.displaySetup(extraFeatures="translation@en")
A.show(resultsP, end=2, condensed=True)
aya 1
aya 2
A.show(resultsN, end=2, condensed=True)
aya 1
aya 2
Observe how the positive results indeed have a positive sentiment, and the negative ones are indeed negative.
However, we do not attempt at all to weed out the positive words under negation from the negative contexts.
So our sentiments have to work against a massive "pollution", and are probably not useful.
sentiment = collections.Counter()
for (results, kind) in ((resultsP, 1), (resultsN, -1)):
for (aya, particle, word) in results:
sentiment[F.lemma.v(word)] += kind
Let's check what we found: how many lemmas per sentiment value.
sentimentDist = collections.Counter()
for (lemma, sent) in sentiment.items():
sentimentDist[sent] += 1
for (sent, amount) in sorted(
sentimentDist.items(),
key=lambda x: (-x[1], x[0]),
):
print(f"sentiment {sent:>3} is assigned to {amount:>4} lemmas")
sentiment  -1 is assigned to  870 lemmas
sentiment   1 is assigned to  273 lemmas
sentiment  -2 is assigned to  248 lemmas
sentiment   0 is assigned to  118 lemmas
sentiment  -3 is assigned to  107 lemmas
sentiment  -4 is assigned to   87 lemmas
sentiment   2 is assigned to   56 lemmas
sentiment  -5 is assigned to   53 lemmas
sentiment  -6 is assigned to   38 lemmas
sentiment  -7 is assigned to   30 lemmas
sentiment  -8 is assigned to   20 lemmas
sentiment   3 is assigned to   17 lemmas
sentiment -10 is assigned to   16 lemmas
sentiment  -9 is assigned to   10 lemmas
sentiment -13 is assigned to    7 lemmas
sentiment -11 is assigned to    7 lemmas
sentiment -14 is assigned to    6 lemmas
sentiment -12 is assigned to    6 lemmas
sentiment -17 is assigned to    5 lemmas
sentiment -32 is assigned to    4 lemmas
sentiment -21 is assigned to    4 lemmas
sentiment -49 is assigned to    3 lemmas
sentiment -22 is assigned to    3 lemmas
sentiment -18 is assigned to    3 lemmas
sentiment -16 is assigned to    3 lemmas
sentiment   4 is assigned to    3 lemmas
sentiment   5 is assigned to    3 lemmas
sentiment -36 is assigned to    2 lemmas
sentiment -30 is assigned to    2 lemmas
sentiment -25 is assigned to    2 lemmas
sentiment -24 is assigned to    2 lemmas
sentiment -23 is assigned to    2 lemmas
sentiment -233 is assigned to    1 lemmas
sentiment -195 is assigned to    1 lemmas
sentiment -138 is assigned to    1 lemmas
sentiment -130 is assigned to    1 lemmas
sentiment -73 is assigned to    1 lemmas
sentiment -54 is assigned to    1 lemmas
sentiment -45 is assigned to    1 lemmas
sentiment -39 is assigned to    1 lemmas
sentiment -37 is assigned to    1 lemmas
sentiment -34 is assigned to    1 lemmas
sentiment -33 is assigned to    1 lemmas
sentiment -31 is assigned to    1 lemmas
sentiment -26 is assigned to    1 lemmas
sentiment -20 is assigned to    1 lemmas
sentiment -15 is assigned to    1 lemmas
sentiment  11 is assigned to    1 lemmas
sentiment  29 is assigned to    1 lemmas
sentiment 122 is assigned to    1 lemmas
We show the most negative and most positive sentiments in context.
negaThreshold = -100
posiThreshold = 4
xPlemmas = {lemma for lemma in sentiment if sentiment[lemma] >= posiThreshold}
xNlemmas = {lemma for lemma in sentiment if sentiment[lemma] <= negaThreshold}
xPwords = [
w
for w in F.otype.s("word")
if F.lemma.v(w) in xPlemmas and F.pos.v(w) in contentTypes
]
xNwords = [
w
for w in F.otype.s("word")
if F.lemma.v(w) in xNlemmas and F.pos.v(w) in contentTypes
]
print(f"{len(xPwords)} extremely positive word occurrences")
print(f"{len(xNwords)} extremely negative word occurrences")
929 extremely positive word occurrences
6650 extremely negative word occurrences
We put the words in their ayas, and show a few.
xPayas = collections.defaultdict(list)
xNayas = collections.defaultdict(list)
for w in xPwords:
a = L.u(w, otype="aya")[0]
xPayas[a].append(w)
for w in xNwords:
a = L.u(w, otype="aya")[0]
xNayas[a].append(w)
print(f"{len(xPayas)} ayas with extremely positive word occurrences")
print(f"{len(xNayas)} ayas with extremely negative word occurrences")
xPtuples = [(a, *words) for (a, words) in sorted(xPayas.items())]
xNtuples = [(a, *words) for (a, words) in sorted(xNayas.items())]
692 ayas with extremely positive word occurrences
3558 ayas with extremely negative word occurrences
We show three ayas of each category.
A.show(xPtuples, end=3)
result 1
result 2
result 3
A.show(xNtuples, end=3)
result 1
result 2
result 3
Probably Allah has a negative sentiment because He occurs in many negative contexts as a punisher.
Anyway, we do not try to be sophisticated here.
We move on to export this sentiment feature.
The documentation explains how to save this data into a Text-Fabric data file.
We choose a location to save it to: the exercises
repository in the q-ran
organization, in the folder mining.
In order to do this, we specify the desired output location, which we will pass to the save command below.
GITHUB = os.path.expanduser("~/github")
ORG = "q-ran"
REPO = "exercises"
PATH = "mining"
VERSION = A.version
Note the version: we have built this feature against a specific version of the data:
A.version
'0.4'
Later on, we pass this version on, so that users of our data will get the shared data in exactly the same version as their core data.
We have to specify a bit of metadata for this feature:
metaData = {
"sentiment": dict(
valueType="int",
description="crude sentiments in the Quran",
creator="Dirk Roorda",
),
}
sentimentData = {
w: sentiment[F.lemma.v(w)]
for w in F.otype.s("word")
if F.lemma.v(w) in sentiment and F.pos.v(w) in contentTypes
}
Now we can give the save command:
TF.save(
nodeFeatures=dict(sentiment=sentimentData),
metaData=metaData,
location=f"{GITHUB}/{ORG}/{REPO}/{PATH}/tf",
module=VERSION,
)
0.00s Exporting 1 node and 0 edge and 0 config features to ~/github/q-ran/exercises/mining/tf/0.4:
|   0.05s T sentiment            to ~/github/q-ran/exercises/mining/tf/0.4
0.05s Exported 1 node features and 0 edge features and 0 config features to ~/github/q-ran/exercises/mining/tf/0.4
True
How to share your own data is explained in the documentation.
Here we show it step by step for the sentiment feature.
We need to zip the data in exactly the right directory structure. Text-Fabric can do that for us:
%%sh
text-fabric-zip q-ran/exercises/mining/tf
This is a TF dataset
Create release data for q-ran/exercises/mining/tf
Found 2 versions
zip files end up in ~/Downloads/q-ran-release/exercises
zipping q-ran/exercises 0.3 with 1 features ==> mining-tf-0.3.zip
zipping q-ran/exercises 0.4 with 1 features ==> mining-tf-0.4.zip
Now you have the file in the desired structure in your Downloads folder.
The next thing is: make a new release in your GitHub
repository, in this case q-ran/exercises, and attach
the zip file as a binary.
You have to do this in your web browser, on the GitHub website.
Here is the result for our case:
A = use(
"q-ran/quran",
hoist=globals(),
mod="q-ran/exercises/mining/tf:clone",
)
This is Text-Fabric 9.2.3
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
41 features found and 0 ignored
|   0.22s T sentiment            from ~/github/q-ran/exercises/mining/tf/0.4
Above you see a new section in the feature list, q-ran/exercises/mining/tf, with our foreign feature in it: sentiment.
Now, suppose we did not know much about this feature; then we would like to do a few basic checks:
F.sentiment.freqList()
((-1, 4509), (-2, 3044), (-233, 2699), (-4, 2368), (-3, 2217), (1, 2040), (-5, 1964), (-7, 1693), (-195, 1618), (-8, 1541), (0, 1517), (-6, 1416), (-10, 1406), (-138, 1358), (-49, 1147), (-130, 975), (-32, 865), (-21, 841), (2, 833), (-13, 696), (-12, 690), (-14, 669), (-36, 662), (-25, 589), (-17, 589), (-9, 577), (-30, 571), (29, 537), (-22, 529), (-11, 431), (-23, 410), (-18, 360), (-45, 358), (-33, 289), (-54, 278), (-37, 271), (-31, 271), (-24, 261), (-16, 245), (-26, 236), (3, 205), (-73, 176), (122, 153), (-34, 136), (-39, 127), (5, 115), (-20, 75), (11, 75), (-15, 72), (4, 49))
Which nodes have a sentiment feature?
{F.otype.v(n) for n in N.walk() if F.sentiment.v(n)}
{'word'}
Only words have the feature.
Which part of speech do these words have?
{F.pos.v(n) for n in F.otype.s("word") if F.sentiment.v(n)}
{'adjective', 'noun', 'verb'}
Let's have a look at a table of some words with positive sentiments.
results = A.search(
"""
word sentiment>0
"""
)
0.09s 4007 results
A.table(results, start=1, end=5)
results = A.search(
"""
word sentiment<0
"""
)
0.12s 39229 results
A.table(results, start=1, end=5)
Let's get lines with both positive and negative signs:
results = A.search(
"""
aya
word sentiment>0
word sentiment<0
"""
)
0.29s 40238 results
A.table(results, start=1, end=2, condensed=True)
With highlights:
highlights = {}
for w in F.otype.s("word"):
sent = F.sentiment.v(w)
if sent:
color = "lightsalmon" if sent < 0 else "mediumaquamarine"
highlights[w] = color
A.table(results, start=1, end=10, condensed=True, highlights=highlights)
n | p | aya | word | word | word | word | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1:1 | بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ | سْمِ | ٱللَّهِ | رَّحْمَٰنِ | رَّحِيمِ | ||||||
2 | 1:3 | ٱلرَّحْمَٰنِ ٱلرَّحِيمِ | رَّحْمَٰنِ | رَّحِيمِ | ||||||||
3 | 1:5 | إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ | نَعْبُدُ | نَسْتَعِينُ | ||||||||
4 | 2:3 | ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ | غَيْبِ | يُقِيمُ | صَّلَوٰةَ | رَزَقْ | يُنفِقُ | يُؤْمِنُ | ||||
5 | 2:4 | وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ أُنزِلَ إِلَيْكَ وَمَآ أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ | ءَاخِرَةِ | يُوقِنُ | يُؤْمِنُ | أُنزِلَ | أُنزِلَ | قَبْلِ | ||||
6 | 2:6 | إِنَّ ٱلَّذِينَ كَفَرُوا۟ سَوَآءٌ عَلَيْهِمْ ءَأَنذَرْتَهُمْ أَمْ لَمْ تُنذِرْهُمْ لَا يُؤْمِنُونَ | يُؤْمِنُ | كَفَرُ | سَوَآءٌ | أَنذَرْ | تُنذِرْ | |||||
7 | 2:8 | وَمِنَ ٱلنَّاسِ مَن يَقُولُ ءَامَنَّا بِٱللَّهِ وَبِٱلْيَوْمِ ٱلْءَاخِرِ وَمَا هُم بِمُؤْمِنِينَ | يَوْمِ | ءَاخِرِ | مُؤْمِنِينَ | نَّاسِ | يَقُولُ | ءَامَ | ٱللَّهِ | |||
8 | 2:9 | يُخَٰدِعُونَ ٱللَّهَ وَٱلَّذِينَ ءَامَنُوا۟ وَمَا يَخْدَعُونَ إِلَّآ أَنفُسَهُمْ وَمَا يَشْعُرُونَ | ءَامَنُ | يَشْعُرُ | ٱللَّهَ | أَنفُسَ | ||||||
9 | 2:10 | فِى قُلُوبِهِم مَّرَضٌ فَزَادَهُمُ ٱللَّهُ مَرَضًا وَلَهُمْ عَذَابٌ أَلِيمٌۢ بِمَا كَانُوا۟ يَكْذِبُونَ | مَّرَضٌ | زَادَ | ٱللَّهُ | مَرَضًا | عَذَابٌ | أَلِيمٌۢ | كَانُ | يَكْذِبُ | قُلُوبِ | |
10 | 2:13 | وَإِذَا قِيلَ لَهُمْ ءَامِنُوا۟ كَمَآ ءَامَنَ ٱلنَّاسُ قَالُوٓا۟ أَنُؤْمِنُ كَمَآ ءَامَنَ ٱلسُّفَهَآءُ أَلَآ إِنَّهُمْ هُمُ ٱلسُّفَهَآءُ وَلَٰكِن لَّا يَعْلَمُونَ | سُّفَهَآءُ | سُّفَهَآءُ | يَعْلَمُ | قِيلَ | ءَامِنُ | ءَامَنَ | نَّاسُ | قَالُ | نُؤْمِنُ | ءَامَنَ |
If we do a pretty display, the sentiment feature shows up.
A.show(results, start=1, end=3, condensed=True, withNodes=True, highlights=highlights)
aya 1
aya 2
aya 3
If more researchers have shared data modules, you can draw them all in.
Then you can design queries that use features from all these different sources.
In that way, you build your own research on top of the work of others.
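As a sketch of what such a cross-module query could look like: sentiment is the feature we created in this chapter, while readability is an invented name standing in for a feature from someone else's module.

```python
# Sketch of a query template combining features from different modules.
# `sentiment` comes from our own module; `readability` is a hypothetical
# feature name, used here only to show the idea.
combinedQuery = """
aya
  word sentiment>0
  word readability>5
"""

# The template is just a string; with both modules loaded via the mod=
# parameter, you would run it as A.search(combinedQuery).
print("sentiment>0" in combinedQuery, "readability>5" in combinedQuery)
```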
CC-BY Dirk Roorda