Import the libraries we will be using
import random
import nltk
import gensim
import numpy as np
import pandas as pd
import scipy
from nltk.stem.porter import PorterStemmer
First, we download the necessary NLTK resources
nltk.download('punkt')
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger')
Tokenization involves segmenting text into tokens. It is a common preprocessing step in many NLP applications.
text = "Are you crazy? I don't know."
A simple method is just to split based on white space. Note that this doesn't work for many other languages (like Chinese)!
text.split()
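To see why whitespace splitting fails for languages written without spaces between words, here is a quick illustration (the Chinese sentence below is just an example, not from the Reddit data):

```python
# Chinese text has no spaces between words, so str.split()
# returns the entire sentence as a single "token".
chinese_text = "我喜欢自然语言处理"  # "I like natural language processing"
tokens = chinese_text.split()
print(tokens)       # ['我喜欢自然语言处理']
print(len(tokens))  # 1
```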
We will now explore tokenization as provided by two NLP tools. First, look at how NLTK tokenizes words:
nltk.word_tokenize(text)
Now, let's look at how Gensim handles tokenization:
list(gensim.utils.tokenize(text))
It often makes sense to lowercase the text, as follows:
"HELLO world".lower()
def tokenize(text):
    # One reasonable default: lowercase, then use NLTK's tokenizer.
    # Feel free to swap in another tokenizer or add more preprocessing.
    return nltk.word_tokenize(text.lower())
Make sure the tokenize method works as expected.
tokenize("Hello world!")
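If you prefer not to depend on NLTK for this step, a minimal regex-based tokenizer behaves similarly for English. This is an illustrative sketch (the name `simple_tokenize` is hypothetical, not part of the notebook):

```python
import re

def simple_tokenize(text):
    # Lowercase, then extract runs of word characters and apostrophes.
    return re.findall(r"[\w']+", text.lower())

print(simple_tokenize("Hello world!"))  # ['hello', 'world']
```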
We will now start looking at Reddit data. We will focus on the politics subreddit. First, let's load in the data from October 2016.
file_name = 'reddit_discussion_network_2016_10.csv'
reddit_df = pd.read_csv('../../../data/reddit/' + file_name)
Which columns does this dataset have?
reddit_df.columns
The first post:
reddit_df.head(1)
This generator function yields the tokens of each Reddit comment, using your tokenization method.
def iter_reddit():
for index, row in reddit_df.iterrows():
yield tokenize(str(row["comment"])) # Convert to string, there are some weird entries (NaN)
Count how often each word occurs. Applying to the whole dataset might take some time, so we will only process 10,000 documents.
import itertools
num_documents = 10000
counts = {}
for tokens in itertools.islice(iter_reddit(), num_documents):
for token in tokens:
if token not in counts:
counts[token] = 0
counts[token] = counts[token] + 1
Print out the top 25 most frequent words:
for w in sorted(counts, key=counts.get, reverse=True)[:25]:
    print("%s\t%s" % (w, counts[w]))
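The same count-and-rank pattern is available in the standard library via collections.Counter, which can replace the manual dictionary bookkeeping. A sketch with toy tokens, since the Reddit data isn't loaded here:

```python
from collections import Counter

# Toy token lists standing in for the tokenized Reddit comments.
toy_documents = [["the", "vote", "the"], ["vote", "now"]]

toy_counts = Counter()
for tokens in toy_documents:
    toy_counts.update(tokens)

# most_common(n) returns the n most frequent (word, count) pairs.
print(toy_counts.most_common(2))  # [('the', 2), ('vote', 2)]
```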
For some applications, stemming the words can be helpful.
stemmer = PorterStemmer()
tokens = ['politics', 'agreed', 'trump', 'clinton', 'replied', 'meeting']
print([stemmer.stem(token) for token in tokens])
We now look at an example of part-of-speech tagging using NLTK.
sentence = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))
print(pos_sentence)
NLTK provides a method to retrieve more information about a tag. For example:
nltk.help.upenn_tagset('NNP')
We will do sentiment analysis using Empath (empath.stanford.edu), which is a dictionary tool that counts words in various categories (e.g., positive sentiment). The dictionary is created by first expanding manually provided seed words automatically, and then having crowdworkers filter out incorrect words. First, import the library and create a lexicon.
from empath import Empath
lexicon = Empath()
Let's start analyzing a sentence. Setting normalize to True normalizes the counts by the length of the text.
lexicon.analyze(tokenize("Bullshit, you can't even post FACTS on this sub- like Clinton lying about sniper fire."), normalize=True)
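In essence, the normalization divides each raw category count by the number of tokens in the text. A stdlib sketch of that division (toy numbers; `normalize_counts` is an illustrative helper, not Empath's actual implementation):

```python
def normalize_counts(category_counts, num_tokens):
    # Turn raw per-category counts into proportions of the text length.
    return {cat: count / num_tokens for cat, count in category_counts.items()}

raw = {"negative_emotion": 2, "politics": 1}
print(normalize_counts(raw, 20))  # {'negative_emotion': 0.1, 'politics': 0.05}
```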
Another sentence:
lexicon.analyze(tokenize("Totally agree. Planning to beat your opponent is not a sign of corruption. That's politics. "), normalize=True)
We will now look at topic modeling. The code below is a utility class to help process the reddit data.
class RedditCorpus(object):
    def __init__(self, dictionary):
        """Store the dictionary used to convert tokens to bag-of-words vectors."""
        self.dictionary = dictionary

    def __iter__(self):
        """Yield each document in turn, as a bag-of-words vector."""
        for tokens in iter_reddit():
            yield self.dictionary.doc2bow(tokens)
We will be using the gensim library for topic modeling. The first step involves constructing a dictionary, which is a mapping from identifiers to words.
id2word_reddit = gensim.corpora.Dictionary(iter_reddit())
Save the full dictionary to a file
id2word_reddit.save("full_reddit.dict")
Load a dictionary from file (continue here if you skipped constructing the dictionary)
id2word_reddit = gensim.corpora.dictionary.Dictionary.load("../data/full_reddit.dict")
How big is the dictionary?
len(id2word_reddit)
The first word in the dictionary with identifier 0
id2word_reddit[0]
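Under the hood, doc2bow maps each token to its integer id and counts occurrences, skipping tokens not in the vocabulary. A minimal stand-in using plain dicts (illustrative only; the toy vocabulary and `to_bow` helper are assumptions for this sketch):

```python
# A tiny vocabulary mapping words to integer ids.
word2id = {"the": 0, "vote": 1, "now": 2}

def to_bow(tokens, word2id):
    # Count each known token and return sorted (id, count) pairs,
    # ignoring tokens that are not in the vocabulary.
    bow = {}
    for token in tokens:
        if token in word2id:
            token_id = word2id[token]
            bow[token_id] = bow.get(token_id, 0) + 1
    return sorted(bow.items())

print(to_bow(["the", "vote", "the", "unknown"], word2id))  # [(0, 2), (1, 1)]
```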
It often helps to remove very infrequent and very frequent words. It also speeds things up, which we need; otherwise training a topic model will take a long time.
id2word_reddit.filter_extremes(no_below=50, no_above=0.05)
How big is the dictionary now?
len(id2word_reddit)
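filter_extremes prunes by document frequency: no_below is an absolute document count, while no_above is a fraction of the corpus. A stdlib sketch of the same idea (toy documents and thresholds; `filter_by_doc_freq` is an illustrative helper, not gensim's implementation):

```python
def filter_by_doc_freq(documents, no_below, no_above):
    # Count in how many documents each word appears.
    doc_freq = {}
    for tokens in documents:
        for token in set(tokens):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    num_docs = len(documents)
    # Keep words appearing in at least no_below documents and
    # in at most no_above (a fraction) of all documents.
    return {w for w, df in doc_freq.items()
            if df >= no_below and df / num_docs <= no_above}

docs = [["the", "vote"], ["the", "now"], ["the", "vote", "now"]]
# "the" appears in every document (fraction 1.0 > 0.9) and is dropped.
print(sorted(filter_by_doc_freq(docs, no_below=2, no_above=0.9)))  # ['now', 'vote']
```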
Save the pruned dictionary
id2word_reddit.save("pruned_reddit.dict")
Load the pruned dictionary here in case something went wrong with the previous steps
id2word_reddit = gensim.corpora.dictionary.Dictionary.load("../data/pruned_reddit.dict")
We will now start building a topic model. The pretrained model uses 15 topics, but feel free to explore other settings when training your own model.
NUM_TOPICS = 15
reddit_corpus = RedditCorpus(id2word_reddit)
lda_model_reddit = gensim.models.LdaModel(reddit_corpus, num_topics=NUM_TOPICS, id2word=id2word_reddit, passes=2, update_every=1)
lda_model_reddit.save('reddit_lda.lda')
Load in a trained model if you didn't train a model yourself
lda_model_reddit = gensim.models.ldamodel.LdaModel.load('../data/reddit_lda.lda')
Print out the topics. For each topic, the top words and their probability are shown.
lda_model_reddit.print_topics(-1)
Get the topics for a particular text. If minimum_probability is not specified, only topics above a default probability threshold are returned; setting minimum_probability=0.0 returns all topics.
lda_model_reddit.get_document_topics(id2word_reddit.doc2bow(tokenize("Just because you are selfish and don't want to pay taxes for services that you may/may not use, does not mean you don't have to pay them. ")), minimum_probability=0.0)