Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.
____
Description of methods in this notebook: This notebook shows how to explore the word frequencies of your JSTOR and/or Portico dataset using Python. The following processes are described:
- Importing your dataset with the tdm_client library
- Converting a JSON lines (.jsonl) file into a Python list of dictionaries
- Counting word frequencies with a Counter
- Building and refining a stopwords list (including saving it to a CSV file)
- Cleaning tokens and printing the most common words
Difficulty: Intermediate
Purpose: Learning (Optimized for explanation over code)
Knowledge Required:
Knowledge Recommended:
Completion time: 60 minutes
Data Format: JSTOR/Portico JSON Lines (.jsonl)
Libraries Used: tdm_client, json, collections (Counter), nltk, csv
You'll use the tdm_client library to automatically retrieve your dataset. The tdm_client library contains functions for connecting to the JSTOR server containing our corpus dataset. To analyze your dataset, use the dataset ID provided when you created your dataset. A copy of your dataset ID was sent to your email when you created your corpus. It should look like a long series of characters and numbers separated by dashes. If you haven't created a dataset, feel free to use a sample dataset. Here's a list by discipline. Advanced users can also upload a dataset from their local machine.
#Importing your dataset with a dataset ID
import tdm_client
#Load the sample dataset, the full run of Shakespeare Quarterly from 1950-2013.
tdm_client.get_dataset("7e41317e-740f-e86a-4729-20dab492e925", "sampleJournalAnalysis") #Insert your dataset ID on this line
Before we can begin working with our dataset, we need to convert the JSON lines file into Python objects we can work with. Remember that each line of our JSON lines file represents a single text, whether that is a journal article, book, or something else. We will create a Python list that contains every document. Within the list item for each document, we will use a Python dictionary of key/value pairs to store information related to that document. Read more about the dataset format.
Essentially, we will have a list of documents, numbered from zero to the last document. Each list item will be a dictionary of key/value pairs that lets us retrieve information about that particular document by its number. The structure will look something like this:
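A rough sketch (the keys shown here are just the ones we use later in this notebook; each real document contains many more):
[
 {'title': '...', 'isPartOf': '...', 'publicationYear': ..., 'url': '...', 'unigramCount': {...}}, # document 0
 {'title': '...', 'isPartOf': '...', 'publicationYear': ..., 'url': '...', 'unigramCount': {...}}, # document 1
 ...
]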
For each item in our list, we will be able to supply a key and get back its value. We will call our Python list variable all_documents since it will contain all of the documents in our corpus.
# The tdm_client uses a default `file_name` of `sampleJournalAnalysis.jsonl`.
# Unless you have uploaded your own dataset, this code cell requires no modification before running.
file_name = 'sampleJournalAnalysis.jsonl'
# Import the json module
import json
# Create an empty new list variable named `all_documents`
all_documents = []
# Temporarily open the file `file_name` in the datasets/ folder
with open('./datasets/' + file_name) as dataset_file:
    # For each line in the dataset file...
    for line in dataset_file:
        # ...create a variable `document` that uses json.loads to convert the line's
        # JSON key/value pairs into a Python dictionary
        document = json.loads(line)
        # Append a new list item to `all_documents` containing the dictionary we created
        all_documents.append(document)
Now all of our documents have been converted from our original JSON lines file format (.jsonl) into a Python list variable named all_documents. Let's see what we can discover about our corpus with a few simple methods.
First, we can determine how many texts are in our dataset by using the len() function to get the size of all_documents.
len(all_documents)
We will create a new variable called chosen_document and set it equal to the first list item in all_documents. (Remember, in Python lists, 0 is the first item, 1 is the second item, 2 is the third item, etc.)
We'll also use the .get() method to retrieve some information about the item and print it here. This will help us check to make sure this is a suitable article. If it is front matter or back matter, for example, you may want to select another article. You can achieve that by changing the index number in the first line of code. For example, you might change all_documents[0] (the first article in the list) to all_documents[5] (the sixth article in the list).
If you're looking at a JSTOR corpus document, you can also follow the URL to preview it. This will help you determine if it's a good example.
chosen_document = all_documents[0] # Create a dictionary variable that contains the first document from all_documents. Change 0 if you want another document.
print(chosen_document.get('title')) # Get the value for the key title for `chosen_document` and print it
print('written in ' + str(chosen_document.get('isPartOf'))) # Print 'written in' and the journal value stored in the key 'isPartOf'
#print(str(chosen_document.get('publicationYear')) + ', Volume ' + chosen_document.get('volumeNumber')) # Print the value of the key `publicationYear` and `volumeNumber`
print('URL is: ' + chosen_document.get('url')) # Print 'URL is: ' and the value for the key 'url' in `chosen_document`
Now, let's examine the word counts from the chosen_document. First, we create a new variable word_counts that will contain the word counts from our chosen_document. The counts are stored as a Python dictionary under the key 'unigramCount'.
word_counts = chosen_document.get('unigramCount')
dict(list(word_counts.items())[:10]) #This code previews the first 10 items in your dictionary
# It does this by turning the `word_counts` dictionary into a list and then shows 10 items (and then turns it back into a dictionary)
# We could also use a for loop to show all the keys and values using the items() method in the word_counts dictionary
#for k,v in word_counts.items():
# print(k + ': ' + str(v))
In order to help analyze our dictionary, we are going to use a special container datatype called a Counter. A Counter is like a dictionary. In fact, it uses curly braces {} like a dictionary. Here's an example where we turn a dictionary (dictionary_demo) into a Counter (counter_demo) in order to explore the difference between the two:
from collections import Counter # Import Counter datatype
dictionary_demo = {"Random": 23,
'Words': 3,
'For': 4,
'The': 4,
'Example': 553} # Create example dictionary with key/value pairs of words and numbers
counter_demo = Counter(dictionary_demo) # Turn the dictionary into a counter
print(counter_demo)
As you can see, the printed Counter looks almost identical to a dictionary: the same key/value pairs within {}, just wrapped in the parentheses of Counter(). Both dictionaries and Counters can return a value when we supply a key.
print(dictionary_demo['Random']) # Using the Python dictionary `dictionary_demo`, return the value for the key 'Random'
print(counter_demo['Random']) #Using the Python counter `counter_demo`, return the value for the key 'Random'
However, the Counter() datatype has some helpful differences from a dictionary. One difference is that a Counter() returns a 0 when no such key exists.
print(counter_demo['no_such_key_exists']) # With a Counter, the value of the made-up key `no_such_key_exists` is 0.
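(Under the hood, a Counter is a subclass of the dictionary type; for a missing key it simply returns a count of 0 instead of raising an error.)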
If a key is not in a dictionary, Python raises a KeyError.
#print(dictionary_demo['no_such_key_exists']) # With a dictionary, looking up the made-up key `no_such_key_exists` raises a KeyError
While this is useful, we already have similar functionality using the get() method.
# A demonstration of what `get()` returns when no such key exists
print(dictionary_demo.get('no_such_key_exists')) # If no key is found, `None` is returned
print(counter_demo.get('no_such_key_exists', 'No such key')) # We can also supply a second argument that defines a string to be returned
For our purposes, the most useful aspect of the Counter() datatype is that it lets us easily return the most common items through the most_common() method. We can pass an argument to this method to specify how many results we want. Let's try it on our example counter_demo.
counter_demo.most_common(3) # Print the top 3 most common items in `counter_demo`
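If you call most_common() with no argument, it returns every key/value pair in the Counter, ordered from most common to least common.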
Let's return, then, to the dictionary we created to hold all the words in our article. We called that variable word_counts. We can get a preview of the first 10 words in our dictionary using the code below.
dict(list(word_counts.items())[:10]) #This code previews the first 10 items in the dictionary
# It does this by turning the `word_counts` dictionary into a list and then shows 10 items (and then turns it back into a dictionary)
# We could also use a for loop to show all the keys and values using the items() method in the word_counts dictionary
#for k,v in word_counts.items():
# print(k + ': ' + str(v))
Note, the key/value pairs may not be in order from most frequent to least frequent words. We can sort by most frequent words by turning our dictionary word_counts into a Counter and then using the most_common() method. Let's call our new Counter object counter_word_counts and then print out the top 30 most common words.
counter_word_counts = Counter(word_counts) # Create `counter_word_counts`, a Counter version of our original `word_counts` dictionary
for key, value in counter_word_counts.most_common(30): # For each key/value pair in `counter_word_counts`'s top 30 most common words
    print(key.ljust(15), value) # Print the key left-justified in 15 characters, followed by the value
We have successfully created a word frequency list. There are a couple of small issues, however, that we still need to address:
- The most frequent words are common function words ('the', 'of', 'and', etc.) that reveal little about the content of the article.
- Capitalized and lowercase forms of the same word (such as 'The' and 'the') are counted as separate tokens, as sketched below.
To solve these issues, we need to find a way to remove common function words and combine strings that may have capital letters in them. We can solve these issues by:
- using a stopwords list to filter out common function words
- lowercasing every token so capitalized and lowercase forms are combined
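Before we fix them, here is a minimal sketch of the capitalization problem, using made-up counts:
demo_counts = Counter({'The': 3, 'the': 40}) # The same word counted as two separate tokens
print(demo_counts['the']) # Prints 40, missing the 3 capitalized occurrences
print(demo_counts['The'] + demo_counts['the']) # Prints 43, the true total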
We could create our own stopwords list, but luckily there are many examples out there already. We'll use NLTK's stopwords list to get started.
First, we create a new list variable stop_words and initialize it with the common English stopwords from the Natural Language Toolkit library.
# Creating a stop_words list from the NLTK. We could also use the set of stopwords from spaCy or Gensim.
# If the stopwords corpus is not already available, run nltk.download('stopwords') first.
from nltk.corpus import stopwords # Import stopwords from nltk.corpus
stop_words = stopwords.words('english') # Create a list `stop_words` that contains the English stopwords list
If you're curious what is in our stopwords list, we can print a slice of the first ten words in our list to get a preview.
stop_words[:10] #print the first 10 stop words in the list
#list(stop_words) #show the whole stopwords list
It may be that we want to add additional words to our stoplist. For example, we may want to remove character names. We can add single items to the list by using the append() method.
stop_words.append("octopus")
stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable
We can also add multiple words to our stoplist by using the extend() method. Notice that this method requires a set of square brackets [] to clarify that we are adding "kangaroo" and "lemur" as list items.
stop_words.extend(["kangaroo", "lemur"])
stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable
We can also remove words from our list with the remove() method.
stop_words.remove("octopus")
stop_words.remove("kangaroo")
stop_words.remove("lemur")
# Or to remove the last three words:
# del stop_words[-3:]
stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable
We could also store our stop words in a CSV file. A CSV, or "Comma-Separated Values" file, is a plain-text file with commas separating each entry. The file could be opened and modified with a text editor or spreadsheet software such as Excel or Google Sheets. Here's what our NLTK stopwords list will look like as a CSV file opened in a plain text editor.
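It begins roughly like this: one long line, with every word separated by a comma and no spaces:
i,me,my,myself,we,our,ours,ourselves,you,...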
Let's create an example CSV.
# Create a CSV file to store a set of stopwords
import csv # Import the csv module to work with csv files
outputFile = open('stop_words.csv', 'w', newline='') # Create a variable `outputFile` that will be linked to a new csv file called stop_words.csv
outputWriter = csv.writer(outputFile) # Create a writer object to add to our `outputFile`
outputWriter.writerow(stop_words) # Add our list `stop_words` to the CSV file
outputFile.close() # Close the CSV file
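As an aside, the same write can be done with a with block, like the one we used to open our dataset earlier; the file is then closed automatically. This sketch is equivalent to the cell above:
with open('stop_words.csv', 'w', newline='') as output_file: # The file is closed automatically when the block ends
    csv.writer(output_file).writerow(stop_words) # Write the whole stopwords list as a single row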
We have created a new file called stop_words.csv that you can open and modify. Go ahead and make a change to your stop_words.csv (either adding or subtracting words). Remember, there are no spaces between words in the CSV file. If you want to edit the CSV right inside Jupyter Lab, right-click on the file and select "Open With > Editor."
Now go ahead and add in a new word. Remember a few things:
- every word is separated by a comma, with no spaces
- keep the words lowercase, since we will lowercase every token before checking it against this list
- save the file before running the next code cell
Now let's read our CSV file back and overwrite our original stop_words list variable.
# Open the CSV file and list the contents
new_stopwords_file = open('stop_words.csv') # Open `stop_words.csv` as the variable `new_stopwords_file`
new_stopwords_reader = csv.reader(new_stopwords_file) # Create a `new_stopwords_reader` variable to read `new_stopwords_file`
stop_words = list(new_stopwords_reader)[0] # Redefine `stop_words` as the first row of `new_stopwords_reader`
stop_words[-10:] # Return the last ten items of the list stop_words
Refining a stopwords list for your analysis can take time. It depends on:
- the nature of the texts in your corpus
- the goals of your analysis
If your results are not satisfactory, you can always come back and adjust the stopwords. You may need to run your analysis many times to refine a good stopwords list.
___
We can standardize and clean up the tokens in our dataset by passing each token through a series of tests. The code will:
- discard any token shorter than four characters
- discard any token containing non-alphabetic characters
- lowercase each remaining token
- discard any token whose lowercase form appears in stop_words
Of course, depending on your analysis and goals, you may want to change one or more of the tests.
# A series of tests to see whether a token should be added to our final word count.
# In order for a token to be added, it must pass all these tests.
cleaned_word_counts = Counter() # Define a new variable `cleaned_word_counts` that is an empty Counter. We will store our cleaned data in it.
for token, count in counter_word_counts.items(): # For each key (`token`), value (`count`) pair in our `counter_word_counts` Counter, run the following tests...
    if len(token) < 4: # If the token is shorter than four characters, restart the loop with the next token
        continue
    if not token.isalpha(): # If the token contains characters that are not from the alphabet, restart the loop with the next token
        continue
    t = token.lower() # Define a variable `t` that is an all-lowercase version of the token
    if t in stop_words: # If the token `t` is in our stop_words list, restart the loop with the next token
        continue
    cleaned_word_counts[t] += count # Add `t` and `count` to `cleaned_word_counts`
print(cleaned_word_counts)
The resulting Counter cleaned_word_counts contains only lowercased words that are at least four characters long and do not appear in our stopwords list. We can now print the top 25 most common words using the most_common() method for Counters.
for key, value in cleaned_word_counts.most_common(25): # For the top 25 most common key/value pairs in `cleaned_word_counts`
    print(key.ljust(15), value) # Print the key (left-justified in 15 characters) followed by the value
# Remember that the key corresponds to the token and the value corresponds to the number of times that token occurs
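Finally, if you would like to keep these results, you could combine the csv module from earlier with most_common(). This is just a sketch; the file name cleaned_word_counts.csv is arbitrary:
with open('cleaned_word_counts.csv', 'w', newline='') as output_file:
    output_writer = csv.writer(output_file)
    output_writer.writerow(['token', 'count']) # A header row
    for token, count in cleaned_word_counts.most_common(): # Every token, most common first
        output_writer.writerow([token, count])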