Import the libraries we will be using
import random
import nltk
import gensim
import numpy as np
import pandas as pd
import scipy
from nltk.stem.porter import PorterStemmer
First, we download the necessary NLTK resources
nltk.download('punkt')
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger')
Tokenization involves segmenting text into tokens. It is a common preprocessing step in many NLP applications.
text = "Are you crazy? I don't know."
A simple method is just to split based on white space. Note that this doesn't work for many other languages (like Chinese)!
text.split()
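To see why whitespace splitting fails for languages written without spaces between words, here is a quick illustration (the Chinese sentence below is just an example, not from the Reddit data):

```python
# Chinese text has no spaces between words, so str.split()
# returns the entire sentence as a single "token".
chinese_text = "我喜欢自然语言处理"  # "I like natural language processing"
tokens = chinese_text.split()
print(tokens)       # ['我喜欢自然语言处理']
print(len(tokens))  # 1
```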
We will now explore tokenization as provided by two NLP tools. First, look at how NLTK tokenizes words:
nltk.word_tokenize(text)
Now, let's look at how Gensim handles tokenization:
list(gensim.utils.tokenize(text))
It often makes sense to lowercase the text, as follows:
"HELLO world".lower()
def tokenize(text):
    # One reasonable default: lowercase, then use NLTK's tokenizer.
    # Feel free to swap in another tokenizer or add more preprocessing.
    return nltk.word_tokenize(text.lower())
Make sure the tokenize method works as expected.
tokenize("Hello world!")
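If you prefer not to depend on NLTK for this step, a minimal regex-based tokenizer behaves similarly for English. This is an illustrative sketch (the name `simple_tokenize` is hypothetical, not part of the notebook):

```python
import re

def simple_tokenize(text):
    # Lowercase, then extract runs of word characters and apostrophes.
    return re.findall(r"[\w']+", text.lower())

print(simple_tokenize("Hello world!"))  # ['hello', 'world']
```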
We will now start looking at Reddit data. We will focus on the politics subreddit. First, let's load in the data from October 2016.
file_name = 'reddit_discussion_network_2016_10.csv'
reddit_df = pd.read_csv('../../../data/reddit/' + file_name)
Which columns does this dataset have?
reddit_df.columns
The first post:
reddit_df.head(1)
This generator function yields the tokens of each Reddit comment, using your tokenization method.
def iter_reddit():
for index, row in reddit_df.iterrows():
yield tokenize(str(row["comment"])) # Convert to string, there are some weird entries (NaN)
Count how often each word occurs. Applying to the whole dataset might take some time, so we will only process 10,000 documents.
import itertools
num_documents = 10000
counts = {}
for tokens in itertools.islice(iter_reddit(), num_documents):
for token in tokens:
if token not in counts:
counts[token] = 0
counts[token] = counts[token] + 1
Print out the top 25 most frequent words:
for w in sorted(counts, key=counts.get, reverse=True)[:25]:
    print("%s\t%s" % (w, counts[w]))
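The same count-and-rank pattern is available in the standard library via collections.Counter, which can replace the manual dictionary bookkeeping. A sketch with toy tokens, since the Reddit data isn't loaded here:

```python
from collections import Counter

# Toy token lists standing in for the tokenized Reddit comments.
toy_documents = [["the", "vote", "the"], ["vote", "now"]]

toy_counts = Counter()
for tokens in toy_documents:
    toy_counts.update(tokens)

# most_common(n) returns the n most frequent (word, count) pairs.
print(toy_counts.most_common(2))  # [('the', 2), ('vote', 2)]
```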
For some applications, stemming the words can be helpful.
stemmer = PorterStemmer()
tokens = ['politics', 'agreed', 'trump', 'clinton', 'replied', 'meeting']
print([stemmer.stem(token) for token in tokens])
We now look at an example of part-of-speech tagging using NLTK.
sentence = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))
print(pos_sentence)
NLTK provides a method to retrieve more information about a tag. For example:
nltk.help.upenn_tagset('NNP')
We will do sentiment analysis using Empath (empath.stanford.edu), which is a dictionary tool that counts words in various categories (e.g., positive sentiment). The dictionary is created by first expanding manually provided seed words automatically, and then having crowdworkers filter out incorrect words. First, import the library and create a lexicon.
from empath import Empath
lexicon = Empath()
Let's start analyzing a sentence. Setting normalize to True normalizes the counts by the length of the text.
lexicon.analyze(tokenize("Bullshit, you can't even post FACTS on this sub- like Clinton lying about sniper fire."), normalize=True)
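In essence, the normalization divides each raw category count by the number of tokens in the text. A stdlib sketch of that division (toy numbers; `normalize_counts` is an illustrative helper, not Empath's actual implementation):

```python
def normalize_counts(category_counts, num_tokens):
    # Turn raw per-category counts into proportions of the text length.
    return {cat: count / num_tokens for cat, count in category_counts.items()}

raw = {"negative_emotion": 2, "politics": 1}
print(normalize_counts(raw, 20))  # {'negative_emotion': 0.1, 'politics': 0.05}
```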
Another sentence:
lexicon.analyze(tokenize("Totally agree. Planning to beat your opponent is not a sign of corruption. That's politics. "), normalize=True)
We will now look at topic modeling. The code below is a utility class to help process the reddit data.
class RedditCorpus(object):
    def __init__(self, dictionary):
        """Store the dictionary used to convert tokens to bag-of-words vectors."""
        self.dictionary = dictionary

    def __iter__(self):
        """Yield each document in turn, as a bag-of-words vector."""
        for tokens in iter_reddit():
            yield self.dictionary.doc2bow(tokens)
We will be using the gensim library for topic modeling. The first step involves constructing a dictionary, which is a mapping from identifiers to words.
id2word_reddit = gensim.corpora.Dictionary(iter_reddit())
Save the full dictionary to a file
id2word_reddit.save("full_reddit.dict")
Load a dictionary from file (continue here if you skipped constructing the dictionary)
id2word_reddit = gensim.corpora.dictionary.Dictionary.load("../data/full_reddit.dict")
How big is the dictionary?
len(id2word_reddit)
The first word in the dictionary with identifier 0
id2word_reddit[0]
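Under the hood, doc2bow maps each token to its integer id and counts occurrences, skipping tokens not in the vocabulary. A minimal stand-in using plain dicts (illustrative only; the toy vocabulary and `to_bow` helper are assumptions for this sketch):

```python
# A tiny vocabulary mapping words to integer ids.
word2id = {"the": 0, "vote": 1, "now": 2}

def to_bow(tokens, word2id):
    # Count each known token and return sorted (id, count) pairs,
    # ignoring tokens that are not in the vocabulary.
    bow = {}
    for token in tokens:
        if token in word2id:
            token_id = word2id[token]
            bow[token_id] = bow.get(token_id, 0) + 1
    return sorted(bow.items())

print(to_bow(["the", "vote", "the", "unknown"], word2id))  # [(0, 2), (1, 1)]
```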
It often helps to remove very infrequent and very frequent words. It also speeds things up, which we need; otherwise training a topic model will take a long time.
id2word_reddit.filter_extremes(no_below=50, no_above=0.05)
How big is the dictionary now?
len(id2word_reddit)
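filter_extremes prunes by document frequency: no_below is an absolute document count, while no_above is a fraction of the corpus. A stdlib sketch of the same idea (toy documents and thresholds; `filter_by_doc_freq` is an illustrative helper, not gensim's implementation):

```python
def filter_by_doc_freq(documents, no_below, no_above):
    # Count in how many documents each word appears.
    doc_freq = {}
    for tokens in documents:
        for token in set(tokens):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    num_docs = len(documents)
    # Keep words appearing in at least no_below documents and
    # in at most no_above (a fraction) of all documents.
    return {w for w, df in doc_freq.items()
            if df >= no_below and df / num_docs <= no_above}

docs = [["the", "vote"], ["the", "now"], ["the", "vote", "now"]]
# "the" appears in every document (fraction 1.0 > 0.9) and is dropped.
print(sorted(filter_by_doc_freq(docs, no_below=2, no_above=0.9)))  # ['now', 'vote']
```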
Save the pruned dictionary
id2word_reddit.save("pruned_reddit.dict")
Load the pruned dictionary here in case something went wrong with the previous steps
id2word_reddit = gensim.corpora.dictionary.Dictionary.load("../data/pruned_reddit.dict")
We will now start building a topic model. The pretrained model uses 15 topics, but feel free to explore other settings when training your own model.
NUM_TOPICS = 15
reddit_corpus = RedditCorpus(id2word_reddit)
lda_model_reddit = gensim.models.LdaModel(reddit_corpus, num_topics=NUM_TOPICS, id2word=id2word_reddit, passes=2, update_every=1)
lda_model_reddit.save('reddit_lda.lda')
Load in a trained model if you didn't train a model yourself
lda_model_reddit = gensim.models.ldamodel.LdaModel.load('../data/reddit_lda.lda')
Print out the topics. For each topic, the top words and their probability are shown.
lda_model_reddit.print_topics(-1)
Get the topics for a particular text. If minimum_probability is not specified, only topics above a default probability threshold are returned; setting minimum_probability=0.0 returns all topics.
lda_model_reddit.get_document_topics(id2word_reddit.doc2bow(tokenize("Just because you are selfish and don't want to pay taxes for services that you may/may not use, does not mean you don't have to pay them. ")), minimum_probability=0.0)