Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.
___
Description of methods in this notebook: This notebook demonstrates how to do topic modeling on a JSTOR and/or Portico dataset using Python. The following processes are described: cleaning and lemmatizing tokens with nltk, building a gensim dictionary and bag-of-words corpus, training an LDA topic model, and visualizing the model with pyLDAvis.
Difficulty: Intermediate
Purpose: Learning (Optimized for explanation over code)
Knowledge Required:
Knowledge Recommended:
Completion time: 90 minutes
Data Format: JSTOR/Portico JSON Lines (.jsonl)
Libraries Used: gensim, nltk, and pyLDAvis
Topic modeling is a machine learning technique that attempts to discover groupings of words (called topics) that commonly occur together in a body of texts. The body of texts could be anything from journal articles to newspaper articles to tweets.
Topic modeling is an unsupervised clustering technique for text. We give the machine a series of texts, and it attempts to cluster them into a given number of topics. There is also a supervised technique called Topic Classification, where we supply the machine with examples of pre-labeled topics and then see if it can identify those topics in new texts based on the examples.
Topic modeling is usually considered an exploratory technique; it helps us discover new patterns within a set of texts. Topic Classification, using labeled data, is intended to be a predictive technique; we want it to find more things like the examples we give it.
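To make the idea concrete, here is a minimal sketch of unsupervised topic discovery with gensim on a made-up, four-document toy corpus (not part of the analysis below). With so little data the topics are not meaningful, but the workflow mirrors what we do with the real dataset later in this notebook.
import gensim

toy_docs = [
    ["stage", "actor", "play", "theater"],
    ["play", "drama", "stage", "audience"],
    ["senate", "vote", "election", "policy"],
    ["policy", "vote", "law", "senate"],
]
toy_dictionary = gensim.corpora.Dictionary(toy_docs)  # map each word to an id
toy_corpus = [toy_dictionary.doc2bow(doc) for doc in toy_docs]  # bag-of-words counts
toy_model = gensim.models.LdaModel(toy_corpus, id2word=toy_dictionary, num_topics=2, passes=10)
print(toy_model.print_topics())  # each topic is a weighted mix of words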
You'll use the tdm_client library to automatically retrieve your dataset. We import the tdm_client library, which contains functions for connecting to the JSTOR server that hosts our corpus dataset. To analyze your dataset, use the dataset ID provided when you created your dataset. A copy of your dataset ID was sent to your email when you created your corpus. It should look like a long series of characters separated by dashes. If you haven't created a dataset, feel free to use a sample dataset. Here's a list by discipline. Advanced users can also upload a dataset from their local machine.
# Importing your dataset with a dataset ID
import tdm_client

# Load the sample dataset, the full run of Shakespeare Quarterly (1950-2013).
tdm_client.get_dataset("7e41317e-740f-e86a-4729-20dab492e925", "sampleJournalAnalysis")  # Insert your dataset ID on this line

# Load the sample dataset, the full run of Negro American Literature Forum (1967-1976) + Black American Literature Forum (1976-1991) + African American Review (1992-2016).
#tdm_client.get_dataset("b4668c50-a970-c4d7-eb2c-bb6d04313542", "sampleJournalAnalysis")
Define a function for processing tokens from the extracted features for volumes in the curated dataset. This function lowercases each token, discards tokens that are shorter than four characters or not alphabetic, removes stopwords, and lemmatizes the remaining tokens.
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
# If the NLTK data are not already installed, they can be downloaded with
# nltk.download('stopwords') and nltk.download('wordnet').
stop_words = set(stopwords.words('english'))

def process_token(token):
    token = token.lower()
    if len(token) < 4:  # discard short tokens
        return
    if not(token.isalpha()):  # discard tokens with non-alphabetic characters
        return
    if token in stop_words:  # discard stopwords
        return
    return WordNetLemmatizer().lemmatize(token)
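A few illustrative calls show what the function keeps and discards. The expected outputs, shown in the comments, assume NLTK's English stopword list and the default WordNet noun lemmatization.
print(process_token("Studies"))  # 'study'  (lowercased and lemmatized)
print(process_token("the"))      # None     (shorter than four characters)
print(process_token("1950"))     # None     (not alphabetic)
print(process_token("there"))    # None     (stopword)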
Loop through the documents in the dataset and build a list of documents, where each document is a list of tokens.
import json

documents = []
doc_count = 0

# Limit the number of documents; set to None to not limit.
limit_to = 25

with open("./datasets/sampleJournalAnalysis.jsonl") as input_file:
    for line in input_file:
        doc = json.loads(line)
        unigram_count = doc["unigramCount"]
        document_tokens = []
        for token, count in unigram_count.items():
            clean_token = process_token(token)
            if clean_token is None:
                continue
            # Repeat each cleaned token by its count to rebuild a bag of words.
            document_tokens += [clean_token] * count
        documents.append(document_tokens)
        doc_count += 1
        if (limit_to is not None) and (doc_count >= limit_to):
            break
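Before training, an optional sanity check shows how many documents were processed and what the cleaned tokens look like.
# Optional: inspect the processed documents.
print(len(documents), "documents processed")
print(documents[0][:10])  # first ten cleaned tokens of the first document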
Build a gensim dictionary and bag-of-words corpus, then train the model. More information about the parameters can be found at the Gensim LDA Model page.
import gensim

num_topics = 7  # Change the number of topics

dictionary = gensim.corpora.Dictionary(documents)
# Remove terms that appear in less than 10% of documents or in more than 75% of documents.
dictionary.filter_extremes(no_below=doc_count * .10, no_above=0.75)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train the LDA model.
model = gensim.models.LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=20  # Change the number of passes or iterations
)
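It can also help to look at the filtered dictionary and the bag-of-words corpus before examining topics. This optional snippet prints the vocabulary size and the first few (token id, count) pairs of the first document.
# Optional: inspect the dictionary and corpus.
print(len(dictionary), "terms in the dictionary after filtering")
print(bow_corpus[0][:5])  # (token id, count) pairs for the first document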
Print the most significant terms, as determined by the model, for each topic.
for topic_num in range(0, num_topics):
    word_ids = model.get_topic_terms(topic_num)
    words = []
    for wid, weight in word_ids:
        word = dictionary[wid]  # look up the term for this token id
        words.append(word)
    print("Topic {}".format(str(topic_num).ljust(5)), " ".join(words))
Visualize the model using pyLDAvis. This visualization can take from several minutes to an hour to generate, depending on the size of your dataset. Run the cell below to create it.
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)
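To keep a copy of the interactive visualization, pyLDAvis can write the prepared data to a standalone HTML file with pyLDAvis.save_html. The sketch below re-runs prepare, which is slow; if you assign the result of the cell above to a variable, you can pass that instead. The output file name is just an example.
# Optional: save the visualization to a standalone HTML file.
prepared = pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)  # slow step
pyLDAvis.save_html(prepared, "topic_model_visualization.html")     # file name is illustrative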