Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.
___
Description of methods in this notebook: This notebook demonstrates how to do topic modeling on a JSTOR and/or Portico dataset using Python. The following processes are described: cleaning and lemmatizing tokens with nltk, building a gensim dictionary and bag-of-words corpus, training an LDA topic model, and visualizing the model with pyLDAvis.
Difficulty: Intermediate
Purpose: Learning (Optimized for explanation over code)
Knowledge Required:
Knowledge Recommended:
Completion time: 90 minutes
Data Format: JSTOR/Portico JSON Lines (.jsonl)
Libraries Used: gensim, nltk, and pyLDAvis
Topic modeling is a machine learning technique that attempts to discover groupings of words (called topics) that commonly occur together in a body of texts. The body of texts could be anything from journal articles to newspaper articles to tweets.
Topic modeling is an unsupervised clustering technique for text. We give the machine a series of texts, and it attempts to cluster them into a given number of topics. There is also a supervised technique called Topic Classification, where we supply the machine with examples of pre-labeled topics and then see if it can identify those topics in new texts based on the examples.
Topic modeling is usually considered an exploratory technique; it helps us discover new patterns within a set of texts. Topic Classification, using labeled data, is intended to be a predictive technique; we want it to find more things like the examples we give it.
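To make the idea concrete, here is a minimal sketch of unsupervised topic discovery with gensim on a made-up, four-document toy corpus (not part of the analysis below). With so little data the topics are not meaningful, but the workflow mirrors what we do with the real dataset later in this notebook.
import gensim

toy_docs = [
    ["stage", "actor", "play", "theater"],
    ["play", "drama", "stage", "audience"],
    ["senate", "vote", "election", "policy"],
    ["policy", "vote", "law", "senate"],
]
toy_dictionary = gensim.corpora.Dictionary(toy_docs)  # map each word to an id
toy_corpus = [toy_dictionary.doc2bow(doc) for doc in toy_docs]  # bag-of-words counts
toy_model = gensim.models.LdaModel(toy_corpus, id2word=toy_dictionary, num_topics=2, passes=10)
print(toy_model.print_topics())  # each topic is a weighted mix of words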
You'll use the tdm_client library to automatically retrieve your dataset. We import the tdm_client library, which contains functions for connecting to the JSTOR server that hosts our corpus dataset. To analyze your dataset, use the dataset ID provided when you created your dataset. A copy of your dataset ID was sent to your email when you created your corpus. It should look like a long series of characters separated by dashes. If you haven't created a dataset, feel free to use a sample dataset. Here's a list by discipline. Advanced users can also upload a dataset from their local machine.
# Importing your dataset with a dataset ID
import tdm_client

# Load the sample dataset, the full run of Shakespeare Quarterly (1950-2013).
tdm_client.get_dataset("7e41317e-740f-e86a-4729-20dab492e925", "sampleJournalAnalysis")  # Insert your dataset ID on this line

# Load the sample dataset, the full run of Negro American Literature Forum (1967-1976) + Black American Literature Forum (1976-1991) + African American Review (1992-2016).
#tdm_client.get_dataset("b4668c50-a970-c4d7-eb2c-bb6d04313542", "sampleJournalAnalysis")
Define a function for processing tokens from the extracted features for volumes in the curated dataset. This function lowercases each token, discards tokens that are shorter than four characters or not alphabetic, removes stopwords, and lemmatizes the remaining tokens.
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
# If the NLTK data are not already installed, they can be downloaded with
# nltk.download('stopwords') and nltk.download('wordnet').
stop_words = set(stopwords.words('english'))

def process_token(token):
    token = token.lower()
    if len(token) < 4:  # discard short tokens
        return
    if not(token.isalpha()):  # discard tokens with non-alphabetic characters
        return
    if token in stop_words:  # discard stopwords
        return
    return WordNetLemmatizer().lemmatize(token)
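A few illustrative calls show what the function keeps and discards. The expected outputs, shown in the comments, assume NLTK's English stopword list and the default WordNet noun lemmatization.
print(process_token("Studies"))  # 'study'  (lowercased and lemmatized)
print(process_token("the"))      # None     (shorter than four characters)
print(process_token("1950"))     # None     (not alphabetic)
print(process_token("there"))    # None     (stopword)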
Loop through the documents in the dataset and build a list of documents, where each document is a list of tokens.
import json

documents = []
doc_count = 0

# Limit the number of documents; set to None to not limit.
limit_to = 25

with open("./datasets/sampleJournalAnalysis.jsonl") as input_file:
    for line in input_file:
        doc = json.loads(line)
        unigram_count = doc["unigramCount"]
        document_tokens = []
        for token, count in unigram_count.items():
            clean_token = process_token(token)
            if clean_token is None:
                continue
            # Repeat each cleaned token by its count to rebuild a bag of words.
            document_tokens += [clean_token] * count
        documents.append(document_tokens)
        doc_count += 1
        if (limit_to is not None) and (doc_count >= limit_to):
            break
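Before training, an optional sanity check shows how many documents were processed and what the cleaned tokens look like.
# Optional: inspect the processed documents.
print(len(documents), "documents processed")
print(documents[0][:10])  # first ten cleaned tokens of the first document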
Build a gensim dictionary and bag-of-words corpus, then train the model. More information about the parameters can be found at the Gensim LDA Model page.
import gensim

num_topics = 7  # Change the number of topics

dictionary = gensim.corpora.Dictionary(documents)
# Remove terms that appear in less than 10% of documents or in more than 75% of documents.
dictionary.filter_extremes(no_below=doc_count * .10, no_above=0.75)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train the LDA model.
model = gensim.models.LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=20  # Change the number of passes or iterations
)
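It can also help to look at the filtered dictionary and the bag-of-words corpus before examining topics. This optional snippet prints the vocabulary size and the first few (token id, count) pairs of the first document.
# Optional: inspect the dictionary and corpus.
print(len(dictionary), "terms in the dictionary after filtering")
print(bow_corpus[0][:5])  # (token id, count) pairs for the first document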
Print the most significant terms, as determined by the model, for each topic.
for topic_num in range(0, num_topics):
    word_ids = model.get_topic_terms(topic_num)
    words = []
    for wid, weight in word_ids:
        word = dictionary[wid]  # look up the term for this token id
        words.append(word)
    print("Topic {}".format(str(topic_num).ljust(5)), " ".join(words))
Visualize the model using pyLDAvis. This visualization can take from several minutes to an hour to generate, depending on the size of your dataset. Run the cell below to create it.
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)
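To keep a copy of the interactive visualization, pyLDAvis can write the prepared data to a standalone HTML file with pyLDAvis.save_html. The sketch below re-runs prepare, which is slow; if you assign the result of the cell above to a variable, you can pass that instead. The output file name is just an example.
# Optional: save the visualization to a standalone HTML file.
prepared = pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)  # slow step
pyLDAvis.save_html(prepared, "topic_model_visualization.html")     # file name is illustrative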