Creating a Stopwords List
Description: This notebook explains what a stopwords list is and how to create one.
Use Case: For Learners (Detailed explanation, not ideal for researchers)
Completion time: 20 minutes
Knowledge Recommended: None
Data Format: CSV files
Research Pipeline: None
Many text analytics techniques are based on counting the occurrence of words in a given text or set of texts (called a corpus). The most frequent words can reveal general textual patterns, but the most frequent words for any given text in English tend to look very similar to this:
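To see this pattern for yourself, here is a minimal sketch that counts word frequencies in a short sample passage. The passage and the crude tokenization are purely illustrative, not part of this lesson's corpus:

```python
# A minimal sketch: count word frequencies in a short sample passage.
from collections import Counter

text = """It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness"""

# Crude tokenization for demonstration: lowercase, strip commas, split on whitespace
words = text.lower().replace(",", "").split()
print(Counter(words).most_common(5))
# → [('it', 4), ('was', 4), ('the', 4), ('of', 4), ('times', 2)]
```

Notice that every one of the top words is a common function word; the content words ("wisdom", "foolishness") barely register.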
For this reason, many analysts remove common function words using a stopwords list. There are many sources for stopwords lists. (We'll use the Natural Language Toolkit stopwords list in this lesson.) There is no official, standardized stopwords list for text analysis.
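Removing stopwords is, at bottom, just a membership test: keep only the tokens that are not in the list. A minimal sketch, using a tiny hand-made list as a stand-in for a real one such as NLTK's:

```python
# A minimal sketch of stopword removal; the tiny stopwords list here is
# a hypothetical stand-in for a real list such as NLTK's.
stop_words = ['it', 'was', 'the', 'of', 'a', 'and']

words = ['it', 'was', 'the', 'age', 'of', 'wisdom']
filtered = [w for w in words if w not in stop_words]
print(filtered)  # → ['age', 'wisdom']
```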
An effective stopwords list depends on the texts being analyzed and the goals of the research.
Even if we remove all common function words, texts often contain formulaic repetitions that may be counter-productive for the research goal. The researcher is responsible for making educated decisions about whether or not to include any particular stopword given the research context.
There are many situations, such as corpora with recurring boilerplate or formulaic repetitions, where additional stopwords may be necessary.
Because every research project may require unique stopwords, it is important for researchers to learn to create and modify stopwords lists.
The Natural Language Toolkit (NLTK) stopwords list is well known and a natural starting point for creating your own list. Let's take a look at what it contains before learning to make our own modifications.
We will store our stopwords in a Python list variable called stop_words.
# Creating a stop_words list from the NLTK. We could also use the set of
# stopwords from spaCy or Gensim.
from nltk.corpus import stopwords  # Import stopwords from nltk.corpus

stop_words = stopwords.words('english')  # Create a list `stop_words` containing the English stopwords
If you're curious what is in our stopwords list, we can use the list() function to find out.
list(stop_words) # Show each string in our stopwords list
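You can also modify the list directly in Python before saving it anywhere. A sketch, where the added and removed words ('chapter', 'page', 'my') are hypothetical examples, and a short list stands in for the full NLTK one:

```python
# A sketch of editing a stopwords list in Python; 'chapter' and 'page'
# are hypothetical corpus-specific additions.
stop_words = ['i', 'me', 'my', 'the', 'a']  # Stand-in for the NLTK list

stop_words.extend(['chapter', 'page'])  # Add corpus-specific stopwords
stop_words.remove('my')                 # Remove a word we actually want to count
print(stop_words)
# → ['i', 'me', 'the', 'a', 'chapter', 'page']
```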
Storing the stopwords list in a variable like stop_words is useful for analysis, but we will likely want to keep the list even after the session is over for future changes and analyses. We can store our stopwords list in a CSV file. A CSV, or "Comma-Separated Values" file, is a plain-text file with commas separating each entry. The file can be opened and modified with a text editor or spreadsheet software such as Excel or Google Sheets.
Here's what our NLTK stopwords list will look like as a CSV file opened in a plain text editor.
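A quick way to preview this format is to write a few sample words with Python's csv module into an in-memory buffer instead of a real file; the writer puts each row's entries on a single comma-separated line:

```python
# A sketch of the CSV format: csv.writer puts each row's entries on one
# comma-separated line. An in-memory buffer stands in for a real file.
import csv
import io

buffer = io.StringIO()
csv.writer(buffer).writerow(['i', 'me', 'my', 'myself'])
print(repr(buffer.getvalue()))  # → 'i,me,my,myself\r\n'
```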
Let's create an example CSV file using Python's csv module.
# Create a CSV file to store a set of stopwords
import csv  # Import the csv module to work with CSV files

with open('data/stop_words.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(stop_words)
We have created a new file called data/stop_words.csv that you can open and modify using a basic text editor. Go ahead and make a change to your data/stop_words.csv (either adding or subtracting words) using a text editor. Remember, there are no spaces between words in the CSV file. If you want to edit the CSV right inside Jupyter Lab, right-click on the file and select "Open With > Editor."
Now go ahead and add in a new word. Remember: entries are separated by commas, with no spaces between them.
Now let's read our CSV file back and overwrite our original stop_words list variable.
# Open the CSV file and read the contents back into `stop_words`
with open('data/stop_words.csv', 'r') as f:
    stop_words = f.read().strip().split(",")

stop_words[-10:]  # Show the last ten stopwords in the list
Refining a stopwords list for your analysis can take time. It depends on the texts you are analyzing and the questions you want to answer.
If your results are not satisfactory, you can always come back and adjust the stopwords. You may need to run your analysis many times to arrive at a good stopwords list.
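The refine-and-rerun loop described above can be sketched as: filter the corpus with the current list, inspect the top remaining words, then adjust the list and repeat. The tokens and stopwords below are hypothetical stand-ins for a real corpus and list:

```python
# A sketch of one pass of the refinement loop; tokens and stopwords
# are hypothetical stand-ins for a real corpus and list.
from collections import Counter

stop_words = ['the', 'of', 'and', 'a']
tokens = ['the', 'whale', 'and', 'the', 'sea', 'of', 'whale']

top = Counter(w for w in tokens if w not in stop_words).most_common(3)
print(top)  # → [('whale', 2), ('sea', 1)]
```

If a formulaic word dominates the output, add it to stop_words and rerun.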