Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.
___
Finding Significant Words Using TF/IDF
Description: This notebook shows how to discover significant words. The method for finding significant terms is TF-IDF. The following processes are described:
Using tdm_client to retrieve a dataset
Use Case: For Learners (Detailed explanation, not ideal for researchers)
Difficulty: Intermediate
Completion time: 60 minutes
Knowledge Required:
Knowledge Recommended:
Data Format: JSON Lines (.jsonl)
Libraries Used:
pandas to load a preprocessing list
csv to load a custom stopwords list
Research Pipeline:
TF-IDF is used in machine learning and natural language processing for measuring the significance of particular terms for a given document. It consists of two parts that are multiplied together:
If we were to merely consider word frequency, the most frequent words would be common function words like: "the", "and", "of". We could use a stopwords list to remove the common function words, but that still may not give us results that describe the unique terms in the document since the uniqueness of terms depends on the context of a larger body of documents. In other words, the same term could be significant or insignificant depending on the context. Consider these examples:
The TF-IDF calculation reveals the words that are frequent in this document yet rare in other documents. The goal is to find out what is unique or remarkable about a document given the context (and the given context can change the results of the analysis).
Here is how the calculation is mathematically written:
$$tfidf_{t,d} = tf_{t,d} \cdot idf_{t,D}$$
In plain English, this means: The value of TF-IDF is the product of a given term's frequency and its inverse document frequency. Let's unpack these terms one at a time.
Term frequency, $tf_{t,d}$: the number of times a term (t) occurs in a given document (d).
Inverse document frequency, $idf_{t,D}$: the log of the total number of documents (N) divided by the number of documents that contain the term:
$$idf_{t,D} = \mbox{log} \frac{N}{|\{d \in D : t \in d\}|}$$
There are variations on the TF-IDF formula, but this is the most widely-used version.
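As a minimal sketch of the formula, the following Python function multiplies a term's count by the base-10 log of the total number of documents divided by the number of documents containing the term. The function name `tf_idf`, its parameter names, and the example numbers are illustrative assumptions, not part of any library.

```python
import math

def tf_idf(term_count, total_docs, docs_with_term):
    """Return term_count * log10(total_docs / docs_with_term)."""
    idf = math.log10(total_docs / docs_with_term)
    return term_count * idf

# A term that appears 5 times in a document and occurs in 10 of 1,000 documents
score_rare = tf_idf(5, 1000, 10)

# A term that appears 5 times in a document but occurs in all 1,000 documents
score_common = tf_idf(5, 1000, 1000)

print(score_rare, score_common)
```

The rare term receives a high score, while the term found in every document scores zero no matter how often it appears.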
Let's take a look at an example to illustrate the fundamentals of TF-IDF. First, we need several texts to compare. Our texts will be very simple.
The first step is to discover which unique words are in each text.
text1 | text2 | text3 | text4 |
---|---|---|---|
the | green | green | the |
grass | eggs | sailors | grass |
was | and | were | was |
green | ham | met | green |
and | were | like | |
spread | spread | the | |
out | out | sea | |
into | like | met | |
distance | the | troubles | |
like | book | | |
sea | | | |
Our four texts share some similar words. Next, we create a single list of unique words that occur across all four texts. (When we use the gensim library later, we will call this list a gensim dictionary.)
Unique Words |
---|
and |
book |
distance |
eggs |
grass |
green |
ham |
like |
met |
out |
sailors |
sea |
spread |
the |
troubles |
was |
were |
Now let's count the occurrences of each unique word in each text.
word | text1 | text2 | text3 | text4 |
---|---|---|---|---|
and | 1 | 1 | 0 | 0 |
book | 0 | 1 | 0 | 0 |
distance | 1 | 0 | 0 | 0 |
eggs | 0 | 1 | 0 | 0 |
grass | 1 | 0 | 0 | 1 |
green | 1 | 1 | 1 | 1 |
ham | 0 | 1 | 0 | 0 |
like | 1 | 1 | 1 | 0 |
met | 0 | 0 | 2 | 0 |
out | 1 | 1 | 0 | 0 |
sailors | 0 | 0 | 1 | 0 |
sea | 1 | 0 | 1 | 0 |
spread | 1 | 1 | 0 | 0 |
the | 3 | 1 | 1 | 1 |
troubles | 0 | 0 | 1 | 0 |
was | 1 | 0 | 0 | 1 |
were | 0 | 1 | 1 | 0 |
We have enough information now to compute TF-IDF for every word in our corpus. Recall the plain English formula.
$$(Times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-word)}$$
We can use the formula to compute TF-IDF for the most common word in our corpus: 'the'. In total, we will compute TF-IDF four times (once for each of our texts).
word | text1 | text2 | text3 | text4 |
---|---|---|---|---|
the | 3 | 1 | 1 | 1 |
text1: $$ tf-idf = 3 \cdot \mbox{log} \frac{4}{(4)} = 3 \cdot \mbox{log} 1 = 3 \cdot 0 = 0$$
text2: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(4)} = 1 \cdot \mbox{log} 1 = 1 \cdot 0 = 0$$
text3: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(4)} = 1 \cdot \mbox{log} 1 = 1 \cdot 0 = 0$$
text4: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(4)} = 1 \cdot \mbox{log} 1 = 1 \cdot 0 = 0$$
The results of our analysis suggest 'the' has a weight of 0 in every document. The word 'the' exists in all of our documents, and therefore it is not a significant term to differentiate one document from another.
Given that idf is
$$\mbox{log} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-word)}$$
and
$$\mbox{log} 1 = 0$$
we can see that TF-IDF will be 0 for any word that occurs in every document. That is, if a word occurs in every document, then it is not a significant term for any individual document.
Let's try a second example with the word 'out'. Recall the plain English formula.
$$(Times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-word)}$$
We will compute TF-IDF four times, once for each of our texts.
word | text1 | text2 | text3 | text4 |
---|---|---|---|---|
out | 1 | 1 | 0 | 0 |
text1: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(2)} = 1 \cdot \mbox{log} 2 = 1 \cdot .3010 = .3010$$
text2: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(2)} = 1 \cdot \mbox{log} 2 = 1 \cdot .3010 = .3010$$
text3: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(2)} = 0 \cdot \mbox{log} 2 = 0 \cdot .3010 = 0$$
text4: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(2)} = 0 \cdot \mbox{log} 2 = 0 \cdot .3010 = 0$$
The results of our analysis suggest 'out' has some significance in text1 and text2, but no significance for text3 and text4 where the word does not occur.
Let's try one last example with the word 'met'. Here's the TF-IDF formula again:
$$(Times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-word)}$$
And here's how many times the word 'met' occurs in each text.
word | text1 | text2 | text3 | text4 |
---|---|---|---|---|
met | 0 | 0 | 2 | 0 |
text1: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(1)} = 0 \cdot \mbox{log} 4 = 0 \cdot .6021 = 0$$
text2: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(1)} = 0 \cdot \mbox{log} 4 = 0 \cdot .6021 = 0$$
text3: $$ tf-idf = 2 \cdot \mbox{log} \frac{4}{(1)} = 2 \cdot \mbox{log} 4 = 2 \cdot .6021 = 1.2042$$
text4: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(1)} = 0 \cdot \mbox{log} 4 = 0 \cdot .6021 = 0$$
As should be expected, we can see that the word 'met' is very significant in text3 but not significant in any other text since it does not occur in any other text.
Here are the original sentences for each text:
And here are the corresponding TF-IDF scores for each word in each text:
word | text1 | text2 | text3 | text4 |
---|---|---|---|---|
and | .3010 | .3010 | 0 | 0 |
book | 0 | .6021 | 0 | 0 |
distance | .6021 | 0 | 0 | 0 |
eggs | 0 | .6021 | 0 | 0 |
grass | .3010 | 0 | 0 | .3010 |
green | 0 | 0 | 0 | 0 |
ham | 0 | .6021 | 0 | 0 |
like | .1249 | .1249 | .1249 | 0 |
met | 0 | 0 | 1.2042 | 0 |
out | .3010 | .3010 | 0 | 0 |
sailors | 0 | 0 | .6021 | 0 |
sea | .3010 | 0 | .3010 | 0 |
spread | .3010 | .3010 | 0 | 0 |
the | 0 | 0 | 0 | 0 |
troubles | 0 | 0 | .6021 | 0 |
was | .3010 | 0 | 0 | .3010 |
were | 0 | .3010 | .3010 | 0 |
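The score table can be reproduced with a short Python sketch that starts from the word counts given earlier. The names `demo_counts`, `score_row`, and `tfidf_table` are illustrative assumptions. The IDF is rounded to four decimal places before multiplying, which mirrors the hand calculations above.

```python
import math

# Word counts copied from the count table (columns: text1, text2, text3, text4)
demo_counts = {
    "and":      [1, 1, 0, 0],
    "book":     [0, 1, 0, 0],
    "distance": [1, 0, 0, 0],
    "eggs":     [0, 1, 0, 0],
    "grass":    [1, 0, 0, 1],
    "green":    [1, 1, 1, 1],
    "ham":      [0, 1, 0, 0],
    "like":     [1, 1, 1, 0],
    "met":      [0, 0, 2, 0],
    "out":      [1, 1, 0, 0],
    "sailors":  [0, 0, 1, 0],
    "sea":      [1, 0, 1, 0],
    "spread":   [1, 1, 0, 0],
    "the":      [3, 1, 1, 1],
    "troubles": [0, 0, 1, 0],
    "was":      [1, 0, 0, 1],
    "were":     [0, 1, 1, 0],
}
TOTAL_DOCS = 4

def score_row(row):
    """Return the TF-IDF score of one word in each of the four texts."""
    docs_with_word = sum(1 for count in row if count > 0)
    # Round the IDF first, as in the hand calculation (.3010, .6021, ...)
    idf = round(math.log10(TOTAL_DOCS / docs_with_word), 4)
    return [round(count * idf, 4) for count in row]

tfidf_table = {word: score_row(row) for word, row in demo_counts.items()}

for word, scores in tfidf_table.items():
    print(f"{word:<9}", scores)
```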
There are a few noteworthy things in this data.
Now that you have a basic understanding of how TF-IDF is computed at a small scale, let's try computing TF-IDF on a corpus which could contain millions of words.
We'll use the tdm_client library to automatically retrieve the dataset as a gzipped JSON Lines file.
Enter a dataset ID in the next code cell.
If you don't have a dataset ID, you can:
dataset_id = "b4668c50-a970-c4d7-eb2c-bb6d04313542"
Next, import `tdm_client` and pass the `dataset_id` as an argument to the `get_dataset` method.
# Importing your dataset with a dataset ID
import tdm_client
# Pull in the dataset that matches `dataset_id`
# in the form of a gzipped JSON lines file.
dataset_file = tdm_client.get_dataset(dataset_id)
If you completed pre-processing with the "Exploring Metadata and Pre-processing" notebook, you can use your CSV file of document IDs to automatically filter the dataset. Your pre-processed CSV file must be in the data folder, which is where the code below looks for it.
# Import a pre-processed CSV file of filtered dataset IDs.
# If you do not have a pre-processed CSV file, the analysis
# will run on the full dataset and may take longer to complete.
import pandas as pd
import os

pre_processed_file_name = f'data/pre-processed_{dataset_id}.csv'

if os.path.exists(pre_processed_file_name):
    df = pd.read_csv(pre_processed_file_name)
    filtered_id_list = df["id"].tolist()
    use_filtered_list = True
    print('Pre-Processed CSV found. Successfully read in ' + str(len(df)) + ' documents.')
else:
    use_filtered_list = False
    print('No pre-processed CSV file found. Full dataset will be used.')
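The filtered ID list is later used to decide, one document at a time, whether a document should be kept. For a long list, a Python set may make that membership check faster. The sketch below uses hypothetical IDs and illustrative variable names (`demo_filtered_ids`, `demo_dataset_ids`) in place of the real CSV contents.

```python
# Hypothetical document IDs standing in for the "id" column of the CSV
demo_filtered_ids = ["doc-001", "doc-003", "doc-004"]

# A set supports membership checks without scanning every element
demo_filtered_set = set(demo_filtered_ids)

# Hypothetical IDs as they might be encountered while reading the dataset
demo_dataset_ids = ["doc-001", "doc-002", "doc-003", "doc-005"]

# Keep only the documents whose ID appears in the filtered set
kept_ids = [doc_id for doc_id in demo_dataset_ids if doc_id in demo_filtered_set]
print(kept_ids)
```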
# Load a custom data/stop_words.csv if available
# Otherwise, load the nltk stopwords list in English
# Create an empty Python list to hold the stopwords
stop_words = []
# The filename of the custom data/stop_words.csv file
stopwords_list_filename = 'data/stop_words.csv'
if os.path.exists(stopwords_list_filename):
    import csv
    with open(stopwords_list_filename, 'r') as f:
        stop_words = list(csv.reader(f))[0]
    print('Custom stopwords list loaded from CSV')
else:
    # Load the NLTK stopwords list
    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')
    print('NLTK stopwords list loaded')
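The expression `list(csv.reader(f))[0]` keeps only the first row of the CSV file, so a custom stopwords file is expected to hold all of its words in a single comma-separated row. The following sketch illustrates this with hypothetical file contents; `io.StringIO` stands in for the opened file, and the variable names are assumptions.

```python
import csv
import io

# Hypothetical contents of a stopwords CSV: one comma-separated row
demo_csv_text = "the,and,of,with,from\n"

# io.StringIO behaves like an opened text file
with io.StringIO(demo_csv_text) as f:
    rows = list(csv.reader(f))  # one list of strings per row

demo_stop_words = rows[0]  # only the first row is used
print(demo_stop_words)
```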
In this step, we gather the unigrams. If there is a Pre-Processing Filter, we will only analyze documents from the filtered ID list. We will also process each unigram, assessing them individually. We will complete the following tasks:
We can define this process in a function.
# Define a function that will process individual tokens
# Only a token that passes through all three `if`
# statements will be returned. A `True` result for
# any `if` statement does not return the token.
def process_token(token):
    token = token.lower()
    if token in stop_words:  # If True, do not return token
        return
    if len(token) < 4:  # If True, do not return token
        return
    if not token.isalpha():  # If True, do not return token
        return
    return token  # If all are False, return the lowercased token
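To see the effect of the three checks, here is a self-contained sketch that applies the same logic to a handful of sample tokens. The short stopwords list and the sample tokens are hypothetical, and the `demo_` names are used so the example does not overwrite the notebook's own variables.

```python
# A tiny hypothetical stopwords list, standing in for the one loaded earlier
demo_stop_words = ["the", "with", "from", "that"]

def demo_process_token(token):
    """Return the lowercased token, or None if it should be discarded."""
    token = token.lower()
    if token in demo_stop_words:  # discard stopwords
        return None
    if len(token) < 4:            # discard tokens shorter than four characters
        return None
    if not token.isalpha():       # discard tokens with digits or punctuation
        return None
    return token

samples = ["Whale", "WITH", "sea", "1851", "harpoon"]
cleaned = [demo_process_token(t) for t in samples]
print(cleaned)  # ['whale', None, None, None, 'harpoon']
```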
Next, we process all the unigrams into a list called `documents`. For demonstration purposes, this code runs on a limit of 500 documents, but we can change this to process all the documents.
# Collecting the unigrams and processing them into `documents`
limit = 500 # Change number of documents being analyzed. Set to `None` to do all documents.
n = 0
documents = []
document_ids = []
for document in tdm_client.dataset_reader(dataset_file):
    processed_document = []
    document_id = document['id']
    if use_filtered_list is True:
        # Skip documents not in our filtered_id_list
        if document_id not in filtered_id_list:
            continue
    # Default to an empty dict so `.items()` works when a document has no unigram counts
    unigrams = document.get("unigramCount", {})
    for gram, count in unigrams.items():
        clean_gram = process_token(gram)
        if clean_gram is None:
            continue
        # Add the token once per occurrence so word frequencies are preserved
        processed_document.extend([clean_gram] * count)
    if len(processed_document) > 0:
        # Store the ID together with the document so the two lists stay aligned
        documents.append(processed_document)
        document_ids.append(document_id)
    n += 1
    if (limit is not None) and (n >= limit):
        break
print('Unigrams collected and processed.')
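A term's frequency only carries into the later steps if each token appears in the processed list as many times as it was counted. The sketch below shows one way to expand a token-to-count mapping into such a list. The document shown is hypothetical; only the `id` and `unigramCount` field names come from the code above.

```python
# A hypothetical document with an "id" and a "unigramCount" mapping of token -> count
demo_document = {
    "id": "example-doc-1",
    "unigramCount": {"whale": 3, "ocean": 2, "ship": 1},
}

demo_tokens = []
for gram, count in demo_document.get("unigramCount", {}).items():
    demo_tokens.extend([gram] * count)  # repeat each token `count` times

print(demo_tokens)
```

A document without a `unigramCount` field falls back to the empty dictionary and simply yields an empty token list.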
Now that we have all the cleaned unigrams in a list, we can use Gensim to compute TF/IDF.
It will be helpful to remember the basic steps we did in the explanatory TF-IDF example:
So far, we have completed the first item by creating a list of the frequency of every word in every document. Now we need to create a list of every word in the corpus. In gensim, this is called a "dictionary". A gensim dictionary is similar to a Python dictionary, but here it is called a gensim dictionary to show it is a specialized kind of dictionary.
Let's create our gensim dictionary. A gensim dictionary is a kind of masterlist of all the words across all the documents in our corpus. Each unique word is assigned an ID in the gensim dictionary. The result is a set of key/value pairs of unique tokens and their unique IDs.
import gensim
dictionary = gensim.corpora.Dictionary(documents)
Now that we have a gensim dictionary, we can get a preview that displays the number of unique tokens across all of our texts.
print(dictionary)
The gensim dictionary stores a unique identifier (starting with 0) for every unique token in the corpus. The gensim dictionary does not contain information on word frequencies; it only catalogs all the words in the corpus. You can see the unique ID for each token in the text using the `.token2id` attribute, which is a Python dictionary mapping each token to its ID. Your corpus may have hundreds of thousands of unique words, so here we just give a preview of the first ten.
dict(list(dictionary.token2id.items())[0:10]) # Print the first ten tokens and their associated IDs.
We can also look up the corresponding ID for a token using the `.get` method.
# Get the gensim dictionary ID for the token 'people'.
# Returns 0 if there is no token matching 'people'. Note that 0 is also a valid ID.
dictionary.token2id.get('people', 0)
The next step is to combine the word frequency data found within `documents` with our gensim dictionary token IDs. For every document, we want to know how many times a word (notated by its ID) occurs. We can do a single document first to show how this works. We will create a Python list called `example_bow_corpus` that will turn our word counts into a series of tuples where the first number is the gensim dictionary token ID and the second number is the word frequency.
# Create an example bag of words corpus from the first document in ``documents``
example_bow_corpus = [dictionary.doc2bow(documents[0])]
list(example_bow_corpus[0][:10]) # List out the first ten tuples in ``example_bow_corpus``
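To make the idea of a bag of words concrete, here is a pure-Python sketch of the same transformation on two tiny, hypothetical documents. It is not gensim's implementation, and gensim may assign IDs in a different order. All names prefixed with `toy_`, along with `to_bow`, are assumptions for illustration.

```python
from collections import Counter

# Two tiny hypothetical documents, already tokenized
toy_documents = [
    ["whale", "ocean", "whale", "ship"],
    ["ocean", "storm", "ship", "ship"],
]

# Assign an integer ID to each unique token in order of first appearance
toy_token2id = {}
for doc in toy_documents:
    for token in doc:
        toy_token2id.setdefault(token, len(toy_token2id))

def to_bow(doc):
    """Return sorted (token_id, count) tuples for one document."""
    id_counts = Counter(toy_token2id[token] for token in doc)
    return sorted(id_counts.items())

toy_bow_corpus = [to_bow(doc) for doc in toy_documents]
print(toy_token2id)
print(toy_bow_corpus)
```

Each tuple pairs a token ID with the number of times that token occurs in the document, which is the same shape as the entries in `example_bow_corpus`.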
Using IDs can seem a little abstract, but we can discover the word associated with a particular ID. For demonstration purposes, the following code will replace the token IDs in the last example with the actual tokens.
word_counts = [[(dictionary[token_id], count) for token_id, count in line] for line in example_bow_corpus]
list(word_counts[0][:10])
We saw before that you could discover the gensim dictionary ID number by running:
dictionary.token2id.get('people', 0)
If you wanted to discover the token given only the ID number, the method is a little more involved. You could use a list comprehension to find the key token based on the value ID. Normally, Python dictionaries only map from keys to values (not from values to keys). However, we can write a quick list comprehension to go the other direction. (It is unlikely one would ever use these methods in practice, but they are shown here to demonstrate how the gensim dictionary is connected to the list entries in the gensim `bow_corpus`.)
# Find the corresponding token in our gensim dictionary for the gensim dictionary ID 100
[token for dict_id, token in dictionary.items() if dict_id == 100]
We have seen an example that demonstrates how the gensim bag of words corpus works on a single document. Let's apply it now to all of our documents.
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]
#print(bow_corpus[:3]) #Show the bag of words corpus for the first 3 documents
The next step is to create the TF-IDF model which will set the parameters for our implementation of TF-IDF. In our TF-IDF example, the formula for TF-IDF was:
$$(Times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-word)}$$
In gensim, the default formula for measuring TF-IDF uses log base 2 instead of log base 10, as shown:
$$(Times-the-word-occurs-in-given-document) \cdot \log_{2} \frac{(Total-number-of-documents)}{(Number-of-documents-containing-the-word)}$$
By default, gensim also normalizes the scores in each document to unit length, so the values it returns will not match the raw result of this formula. If you would like to use a different formula for your TF-IDF calculation, the gensim documentation for `TfidfModel` describes the parameters you can pass.
model = gensim.models.TfidfModel(bow_corpus) # Create our gensim TF-IDF model
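The following pure-Python sketch approximates what a log base 2 TF-IDF with unit-length normalization does to a bag of words corpus. It is an illustration written under the assumption that gensim's default settings behave this way, not a copy of gensim's code. The function name `tfidf_sketch` and the toy data are hypothetical.

```python
import math

def tfidf_sketch(bow_corpus):
    """Weight (token_id, count) tuples by count * log2(N / df), then scale each document to unit length."""
    total_docs = len(bow_corpus)

    # Document frequency: in how many documents does each token ID appear?
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _count in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_corpus = []
    for doc in bow_corpus:
        weights = [
            (token_id, count * math.log2(total_docs / doc_freq[token_id]))
            for token_id, count in doc
        ]
        # Drop zero-weight tokens (those found in every document)
        weights = [(token_id, w) for token_id, w in weights if w > 0]
        # Scale the remaining weights so the document vector has length 1
        length = math.sqrt(sum(w * w for _, w in weights))
        weighted_corpus.append([(token_id, w / length) for token_id, w in weights])
    return weighted_corpus

# Hypothetical bag of words corpus with four documents
toy_bow = [
    [(0, 1), (1, 3), (2, 2)],
    [(0, 2), (1, 1)],
    [(0, 1)],
    [(0, 1), (3, 1)],
]
toy_tfidf = tfidf_sketch(toy_bow)
print(toy_tfidf)
```

Token 0 appears in all four documents, so it receives no weight anywhere, and the third document ends up empty.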
Now, we apply our model to `bow_corpus` to create our results in `corpus_tfidf`. Like `bow_corpus`, `corpus_tfidf` holds one entry per document. Instead of listing the frequency next to the gensim dictionary ID, however, each entry contains the [TF-IDF](https://docs.tdm-pilot.org/key-terms/#tf-idf) score for the associated token. Below, we display the first document in `corpus_tfidf`.
corpus_tfidf = model[bow_corpus] # Create TF-IDF scores for the ``bow_corpus`` using our model
list(corpus_tfidf[0][:10]) # List out the TF-IDF scores for the first 10 tokens of the first text in the corpus
Let's display the tokens instead of the gensim dictionary IDs.
example_tfidf_scores = [[(dictionary[token_id], score) for token_id, score in line] for line in corpus_tfidf]
list(example_tfidf_scores[0][:10]) # List out the TF-IDF scores for the first 10 tokens of the first text in the corpus
Finally, let's sort the terms by their TF-IDF weights to find the most significant terms in the document.
# Sort the tuples in our tf-idf scores list
def Sort(tfidf_tuples):
    tfidf_tuples.sort(key=lambda x: x[1], reverse=True)
    return tfidf_tuples

list(Sort(example_tfidf_scores[0])[:10]) # List the top ten tokens in our example document by their TF-IDF scores
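The built-in `sorted` function offers an alternative that returns a new list and leaves the original list of tuples in its existing order. The sketch below is illustrative; the function name `top_terms` and the sample scores are assumptions.

```python
def top_terms(tfidf_tuples, n=10):
    """Return the n highest-scoring (token, score) tuples without modifying the input."""
    return sorted(tfidf_tuples, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical (token, score) tuples
demo_scores = [("ocean", 0.12), ("whale", 0.81), ("ship", 0.35), ("storm", 0.44)]

print(top_terms(demo_scores, n=2))  # [('whale', 0.81), ('storm', 0.44)]
print(demo_scores[0])               # ('ocean', 0.12), so the original order is unchanged
```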
We could also analyze across the entire corpus to find the most unique terms. These are terms that appear frequently in a single text, but rarely or never appear in other texts. (Often, these will be proper names since a particular article may mention a name often but the name may rarely appear in other articles.)
# Build a dictionary ``td`` mapping each term to a TF-IDF score.
# If a term appears in more than one document, the score from the last document read is kept.
td = {
    dictionary.get(_id): value for doc in corpus_tfidf
    for _id, value in doc
}
# Sort the items of ``td`` from highest to lowest score
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)

for term, weight in sorted_td[:25]: # Print the top 25 terms in the entire corpus
    print(term, weight)
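A dictionary comprehension keeps only the last value assigned to each key, so a term that occurs in several documents ends up with the score from whichever document is processed last. If the highest score per term is preferred, one possible approach is sketched below on hypothetical data. The names `demo_corpus_scores`, `best_score`, and `ranked` are illustrative.

```python
# Hypothetical TF-IDF output: one list of (token, score) tuples per document
demo_corpus_scores = [
    [("whale", 0.81), ("ocean", 0.12)],
    [("ocean", 0.40), ("storm", 0.44)],
    [("whale", 0.30), ("ship", 0.35)],
]

best_score = {}
for doc in demo_corpus_scores:
    for token, score in doc:
        # Keep the highest score seen so far for this token
        if score > best_score.get(token, 0.0):
            best_score[token] = score

# Rank the tokens from highest to lowest score
ranked = sorted(best_score.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Here 'whale' keeps its score of 0.81 from the first document even though it scores lower in the third.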
And, finally, we can see the most significant term in every document.
# For each document, print the ID, highest-scoring word, and TF-IDF score
for n, doc in enumerate(corpus_tfidf):
    if len(doc) < 1:
        continue
    word_id, score = max(doc, key=lambda x: x[1])
    print(document_ids[n], dictionary.get(word_id), score)
    if n >= 10:
        break