import string
import gzip
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import watermark
%matplotlib inline
%load_ext watermark
List out the versions of all loaded libraries
%watermark -n -v -m -g -iv
Python implementation: CPython
Python version       : 3.11.7
IPython version      : 8.20.0

Compiler    : Clang 14.0.6
OS          : Darwin
Release     : 23.4.0
Machine     : arm64
Processor   : arm
CPU cores   : 16
Architecture: 64bit

Git hash: ed7010f27131287f28d990d9846e4ce6cd87e34d

watermark : 2.4.3
numpy     : 1.26.4
pandas    : 2.1.4
matplotlib: 3.8.0
Set the default style
plt.style.use('d4sci.mplstyle')
We start by taking the simplest approach: simply counting positive and negative words. We'll use Hu and Liu's lexicon from their 2004 KDD paper: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
pos = np.loadtxt('data/positive-words.txt', dtype='str', comments=';')
neg = np.loadtxt('data/negative-words.txt', dtype='str', comments=';')
pos
array(['a+', 'abound', 'abounds', ..., 'zenith', 'zest', 'zippy'], dtype='<U20')
neg
array(['2-faced', '2-faces', 'abnormal', ..., 'zealous', 'zealously', 'zombie'], dtype='<U24')
Create a dictionary and assign the valence to each positive and negative word
valence = {}

for word in pos:
    valence[word.lower()] = +1

for word in neg:
    valence[word.lower()] = -1
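As a quick sanity check we can look up a couple of entries (assuming both words appear in Hu and Liu's lists, which they do):

valence['happy'], valence['hate']
(1, -1)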
Here's the simple word extraction function we defined in Lesson I
def extract_words(text):
    temp = text.split() # Split the text on whitespace
    text_words = []

    for word in temp:
        # Remove any punctuation characters present in the beginning of the word
        while word and word[0] in string.punctuation:
            word = word[1:]

        # Remove any punctuation characters present in the end of the word
        while word and word[-1] in string.punctuation:
            word = word[:-1]

        # Append this word into our list of words, skipping any token
        # that was nothing but punctuation
        if word:
            text_words.append(word.lower())

    return text_words
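For example, punctuation is stripped from both ends of each token and everything is lowercased:

extract_words("Hello, World!!")
['hello', 'world']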
We can now use this to define a sentiment measuring function that returns the valence of a sentence or piece of text. Notice that we use the valence directly from the dictionary instead of treating positive and negative words separately. This will prove useful later on ;)
$$sentiment=\frac{P-N}{P+N}$$

where $P$ and $N$ are the counts of positive and negative words found in the text.

def sentiment(text, valence):
    words = extract_words(text.lower())

    word_count = 0
    score = 0

    for word in words:
        if word in valence:
            score += valence[word]
            word_count += 1

    # Avoid dividing by zero when no lexicon word is present
    if word_count == 0:
        return 0.0

    return score/word_count
Now let's test our simple code with some simple examples
texts = ["I'm very happy",
"The product is pretty annoying, and I hate it",
"I'm sad",
]
for text in texts:
    print(text, ':', sentiment(text, valence))
I'm very happy : 1.0
The product is pretty annoying, and I hate it : -0.3333333333333333
I'm sad : -1.0
This is a bit surprising. One might expect the second sentence to be more negative; after all, "pretty annoying" and "hate" sound pretty negative. However, since each word is taken by itself, regardless of context, we end up with:
words = extract_words(texts[1].lower())

for word in words:
    if word in valence:
        print(word, valence[word])
pretty 1
annoying -1
hate -1
We'll see in a bit how to handle cases like this, but the solution requires two important changes to our current approach: modifier words and real-valued weights.
The first step is to define a dictionary of modifiers
modifiers = {
    "very": 1.5,
    "much": 1.3,
    "not": -1,
    "pretty": 1.5,
    "somewhat": 1.2
}
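The idea is that a modifier rescales the valence of the word it precedes. By hand (assuming "happy" is in the positive lexicon, which it is):

# "very happy" -> modifiers["very"] * valence["happy"] = 1.5 * 1
modifiers["very"] * valence["happy"]
1.5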
And to change our sentiment measuring function to take the modifiers into account.
def sentiment_modified(text, valence, modifiers, verbose=False):
    words = extract_words(text.lower())
    ngrams = [[]]

    # generate ngrams
    for i in range(len(words)):
        word = words[i]

        if word in modifiers:
            ngrams[-1].append(word)
            continue

        if word in valence:
            ngrams[-1].append(word)
        else:
            if len(ngrams[-1]) > 0:
                ngrams.append([])

    # Remove the trailing empty ngram if necessary
    if len(ngrams[-1]) == 0:
        ngrams = ngrams[:-1]

    # Avoid dividing by zero when no lexicon word is present
    if len(ngrams) == 0:
        return 0.0

    score = 0

    for ngram in ngrams:
        value = 1

        for word in ngram:
            if word in modifiers:
                value *= modifiers[word]
            elif word in valence:
                value *= valence[word]

        if verbose:
            print(ngram, value)

        score += value

    return score/len(ngrams)
This implementation is still relatively simple, but, as you can see, the results are already better.
print(texts[1])
The product is pretty annoying, and I hate it
sentiment_modified(texts[1], valence, modifiers, verbose=True)
['pretty', 'annoying'] -1.5
['hate'] -1
-1.25
A more complete implementation would be more careful in handling the modifiers and would build larger ngrams so that cases like this one would also work:
sentiment_modified("It was not very good", valence, modifiers, True)
['not', 'very', 'good'] -1.5
-1.5
And even more complex (and unrealistic) examples work fine:
sentiment_modified("It was not not very very good", valence, modifiers, True)
['not', 'not', 'very', 'very', 'good'] 2.25
2.25
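As a rough sketch of what "more careful" handling might mean (this variant is hypothetical, not part of the lesson's code), we could discard any ngram that ends up containing only modifiers, so that a dangling "not" or "very" with no sentiment word attached contributes nothing to the score:

def sentiment_careful(text, valence, modifiers, verbose=False):
    words = extract_words(text.lower())
    ngrams = [[]]

    # Group consecutive modifier/valence words into ngrams, as before
    for word in words:
        if word in modifiers or word in valence:
            ngrams[-1].append(word)
        elif len(ngrams[-1]) > 0:
            ngrams.append([])

    # Keep only ngrams with at least one non-modifier (valence) word.
    # This also drops any trailing empty ngram.
    ngrams = [ngram for ngram in ngrams
              if any(word not in modifiers for word in ngram)]

    if len(ngrams) == 0:
        return 0.0

    score = 0

    for ngram in ngrams:
        value = 1

        for word in ngram:
            if word in modifiers:
                value *= modifiers[word]
            elif word in valence:
                value *= valence[word]

        if verbose:
            print(ngram, value)

        score += value

    return score/len(ngrams)

With this change, a sentence containing only modifiers (say, "it was very much so") scores 0 instead of inheriting the product of the modifier weights, while sentences like the examples above score exactly as before.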
VADER is a state-of-the-art sentiment analysis tool. Here we will use their excellent and well documented lexicon to explore non-binary weights. Their approach is significantly more advanced than what we present here, but some of the fundamental ideas are the same.
vader = pd.read_csv("data/vader_lexicon.txt", sep='\t', header=None)
The VADER lexicon includes a lot of interesting information:
vader.head()
|   | 0   | 1    | 2       | 3                                        |
|---|-----|------|---------|------------------------------------------|
| 0 | $:  | -1.5 | 0.80623 | [-1, -1, -1, -1, -3, -1, -3, -1, -2, -1] |
| 1 | %)  | -0.4 | 1.01980 | [-1, 0, -1, 0, 0, -2, -1, 2, -1, 0]      |
| 2 | %-) | -1.5 | 1.43178 | [-2, 0, -2, -2, -1, 2, -2, -3, -2, -3]   |
| 3 | &-: | -0.4 | 1.42829 | [-3, -1, 0, 0, -1, -1, -1, 2, -1, 2]     |
| 4 | &:  | -0.7 | 0.64031 | [0, -1, -1, -1, 1, -1, -1, -1, -1, -1]   |
vader.tail()
|      | 0    | 1    | 2       | 3                                        |
|------|------|------|---------|------------------------------------------|
| 7512 | }:   | -2.1 | 0.83066 | [-1, -1, -3, -2, -3, -2, -2, -1, -3, -3] |
| 7513 | }:(  | -2.0 | 0.63246 | [-3, -1, -2, -1, -3, -2, -2, -2, -2, -2] |
| 7514 | }:)  | 0.4  | 1.42829 | [1, 1, -2, 1, 2, -2, 1, -1, 2, 1]        |
| 7515 | }:-( | -2.1 | 0.70000 | [-2, -1, -2, -2, -2, -4, -2, -2, -2, -2] |
| 7516 | }:-) | 0.3  | 1.61555 | [1, 1, -2, 1, -1, -3, 2, 2, 1, 1]        |
Smilies are also included and, in addition to the average sentiment of each word (in column 1) and its standard deviation (in column 2), it provides the raw human-generated scores in column 3, so that we may easily check (and possibly modify) their weights. To extract the raw scores for the word "love" we could simply do:
print(vader.shape)
(7517, 4)
print(vader.iloc[4446])
0                              love
1                               3.2
2                               0.4
3    [3, 3, 3, 3, 3, 3, 3, 4, 4, 3]
Name: 4446, dtype: object
from ast import literal_eval

# literal_eval safely parses the string representation of the list
scores = literal_eval(vader.iloc[4446, 3])
print(scores)
[3, 3, 3, 3, 3, 3, 3, 4, 4, 3]
scores[8]
4
And we can see that 8 out of 10 people thought that the word "love" should receive a score of 3, while the other two gave it a score of 4. This gives us insight into how uniform the scores are. If, for some reason, we thought that there was some problem with the two values of 4, or that they were simply not appropriate for our purposes, we might discard them and recalculate the valence of the word.
One justification for this might be the fact that the scores for the closely related word "loved" are significantly different, with a wider range of variation in the human scores:
vader.iloc[4447]
0                             loved
1                               2.9
2                               0.7
3    [3, 3, 4, 2, 2, 4, 3, 2, 3, 3]
Name: 4447, dtype: object
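As a minimal sketch of what that recalculation might look like (for illustration only), we can drop the two scores of 4 for "love" and recompute the mean:

# Hypothetical re-scoring: discard the two 4s and recompute the valence of "love"
filtered = [s for s in scores if s != 4]
np.mean(filtered)
3.0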
Now we convert this dataset into a dictionary similar to the one we used above
valence_vader = dict(vader[[0, 1]].values)
valence_vader['love']
3.2
To use this new dictionary we just have to modify the arguments to the sentiment_modified function:
sentiment_modified("It was not not very very good", valence_vader, modifiers, verbose=True)
['not', 'not', 'very', 'very', 'good'] 4.2749999999999995
4.2749999999999995
One important detail to keep in mind is that scores obtained through different methods are not comparable. In this example, the score of the sentence "It was not not very very good" went from 2.25 to 4.27 when we switched dictionaries. This is due not only to different levels of coverage in different dictionaries, but also to different choices in the possible ranges of values.
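If we do need roughly comparable numbers, one simple (hypothetical) option is to rescale each lexicon to the [-1, 1] range before scoring:

# Hypothetical rescaling: divide every valence by the largest absolute value
max_abs = max(abs(v) for v in valence_vader.values())
valence_vader_scaled = {word: v / max_abs for word, v in valence_vader.items()}

print(valence_vader_scaled['love'])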
texts = ["I'm very happy",
"The product is pretty annoying, and I hate it",
"I'm sad",
]
for text in texts:
    print(text, ':', sentiment_modified(text, valence_vader, modifiers))
I'm very happy : 4.050000000000001
The product is pretty annoying, and I hate it : -2.625
I'm sad : -2.1