Creating a Stopwords List
Description: This notebook explains what a stopwords list is and how to create one.
Use Case: For Learners (Detailed explanation, not ideal for researchers)
Completion time: 20 minutes
Knowledge Recommended: None
Data Format: CSV files
Research Pipeline: None
Many text analytics techniques are based on counting the occurrence of words in a given text or set of texts (called a corpus). The most frequent words can reveal general textual patterns, but the most frequent words for any given text in English tend to look very similar to this:
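To see this pattern for yourself, here is a minimal sketch that counts word frequencies in a short sample passage. The passage and the crude tokenization are purely illustrative, not part of this lesson's corpus:

```python
# A minimal sketch: count word frequencies in a short sample passage.
from collections import Counter

text = """It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness"""

# Crude tokenization for demonstration: lowercase, strip commas, split on whitespace
words = text.lower().replace(",", "").split()
print(Counter(words).most_common(5))
# → [('it', 4), ('was', 4), ('the', 4), ('of', 4), ('times', 2)]
```

Notice that every one of the top words is a common function word; the content words ("wisdom", "foolishness") barely register.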
For this reason, many analysts remove common function words using a stopwords list. There are many sources for stopwords lists. (We'll use the Natural Language Toolkit stopwords list in this lesson.) There is no official, standardized stopwords list for text analysis.
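Removing stopwords is, at bottom, just a membership test: keep only the tokens that are not in the list. A minimal sketch, using a tiny hand-made list as a stand-in for a real one such as NLTK's:

```python
# A minimal sketch of stopword removal; the tiny stopwords list here is
# a hypothetical stand-in for a real list such as NLTK's.
stop_words = ['it', 'was', 'the', 'of', 'a', 'and']

words = ['it', 'was', 'the', 'age', 'of', 'wisdom']
filtered = [w for w in words if w not in stop_words]
print(filtered)  # → ['age', 'wisdom']
```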
An effective stopwords list depends on the texts being analyzed and the goals of the research.
Even if we remove all common function words, texts often contain formulaic repetitions that may be counter-productive for the research goal. The researcher is responsible for making educated decisions about whether or not to include any particular stopword given the research context.
There are many situations, such as corpora with recurring boilerplate or formulaic repetitions, where additional stopwords may be necessary.
Because every research project may require unique stopwords, it is important for researchers to learn to create and modify stopwords lists.
The Natural Language Toolkit (NLTK) stopwords list is well known and a natural starting point for creating your own list. Let's take a look at what it contains before learning to make our own modifications.
We will store our stopwords in a Python list variable called stop_words.
# Creating a stop_words list from the NLTK. We could also use the set of
# stopwords from spaCy or Gensim.
from nltk.corpus import stopwords  # Import stopwords from nltk.corpus

stop_words = stopwords.words('english')  # Create a list `stop_words` containing the English stopwords
If you're curious what is in our stopwords list, we can use the list() function to find out.
list(stop_words) # Show each string in our stopwords list
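You can also modify the list directly in Python before saving it anywhere. A sketch, where the added and removed words ('chapter', 'page', 'my') are hypothetical examples, and a short list stands in for the full NLTK one:

```python
# A sketch of editing a stopwords list in Python; 'chapter' and 'page'
# are hypothetical corpus-specific additions.
stop_words = ['i', 'me', 'my', 'the', 'a']  # Stand-in for the NLTK list

stop_words.extend(['chapter', 'page'])  # Add corpus-specific stopwords
stop_words.remove('my')                 # Remove a word we actually want to count
print(stop_words)
# → ['i', 'me', 'the', 'a', 'chapter', 'page']
```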
Storing the stopwords list in a variable like stop_words is useful for analysis, but we will likely want to keep the list even after the session is over for future changes and analyses. We can store our stopwords list in a CSV file. A CSV, or "Comma-Separated Values" file, is a plain-text file with commas separating each entry. The file can be opened and modified with a text editor or spreadsheet software such as Excel or Google Sheets.
Here's what our NLTK stopwords list will look like as a CSV file opened in a plain text editor.
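A quick way to preview this format is to write a few sample words with Python's csv module into an in-memory buffer instead of a real file; the writer puts each row's entries on a single comma-separated line:

```python
# A sketch of the CSV format: csv.writer puts each row's entries on one
# comma-separated line. An in-memory buffer stands in for a real file.
import csv
import io

buffer = io.StringIO()
csv.writer(buffer).writerow(['i', 'me', 'my', 'myself'])
print(repr(buffer.getvalue()))  # → 'i,me,my,myself\r\n'
```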
Let's create an example CSV file using Python's csv module.
# Create a CSV file to store a set of stopwords
import csv  # Import the csv module to work with CSV files

with open('data/stop_words.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(stop_words)
We have created a new file called data/stop_words.csv that you can open and modify using a basic text editor. Go ahead and make a change to your data/stop_words.csv (either adding or subtracting words) using a text editor. Remember, there are no spaces between words in the CSV file. If you want to edit the CSV right inside Jupyter Lab, right-click on the file and select "Open With > Editor."
Now go ahead and add in a new word. Remember: entries are separated by commas, with no spaces between them.
Now let's read our CSV file back and overwrite our original stop_words list variable.
# Open the CSV file and read the contents back into `stop_words`
with open('data/stop_words.csv', 'r') as f:
    stop_words = f.read().strip().split(",")

stop_words[-10:]  # Show the last ten stopwords in the list
Refining a stopwords list for your analysis can take time. It depends on the texts you are analyzing and the questions you want to answer.
If your results are not satisfactory, you can always come back and adjust the stopwords. You may need to run your analysis many times to arrive at a good stopwords list.
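The refine-and-rerun loop described above can be sketched as: filter the corpus with the current list, inspect the top remaining words, then adjust the list and repeat. The tokens and stopwords below are hypothetical stand-ins for a real corpus and list:

```python
# A sketch of one pass of the refinement loop; tokens and stopwords
# are hypothetical stand-ins for a real corpus and list.
from collections import Counter

stop_words = ['the', 'of', 'and', 'a']
tokens = ['the', 'whale', 'and', 'the', 'sea', 'of', 'whale']

top = Counter(w for w in tokens if w not in stop_words).most_common(3)
print(top)  # → [('whale', 2), ('sea', 1)]
```

If a formulaic word dominates the output, add it to stop_words and rerun.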