In this exercise, you'll use Subset TF-IDF to discover what terms characterize grants that were labeled as "woke" in a dataset released by the U.S. Senate Commerce Committee.
U.S. Senate Commerce Committee Chairman Ted Cruz (R-Texas) released a database identifying over 3,400 grants, totaling more than $2.05 billion in federal funding awarded by the National Science Foundation (NSF) during the Biden-Harris administration. According to the Committee's release, this funding was diverted toward questionable projects that promoted Diversity, Equity, and Inclusion (DEI) or advanced "neo-Marxist class warfare propaganda."
You will load and label the grant data, preprocess the grant descriptions, and implement Subset TF-IDF to surface the terms that most distinguish the labeled grants from the rest of the corpus.
The U.S. National Science Foundation (NSF) is an independent agency of the United States federal government that supports fundamental research and education in all the non-medical fields of science and engineering.
The NSF funds approximately 25% of all federally supported basic research conducted by the United States' colleges and universities.
In some fields, such as mathematics, computer science, economics, and the social sciences, the NSF is the major source of federal backing.
For this exercise, we will leverage two datasets which I've already pulled and placed into our shared datasets Google Drive Folder.
This database was released as part of the U.S. Senate Commerce Committee's press release and is downloadable from the following page:
This dataset contains useful information such as recipients and descriptions, but it lacks the "non-woke" grants we need as a comparison corpus for TF-IDF.
USAspending.gov is the official open data source for federal spending information. Detailed, year-by-year records of all grants since 2008 are available for download:
https://www.usaspending.gov/download_center/award_data_archive
This dataset is much larger, more detailed, and most importantly, includes the missing information about the non-woke grants needed to build our corpus for TF-IDF.
Interesting Columns:

- `award_id_fain` - Unique award identifier (join key)
- `total_obligated_amount` - Total grant obligated amount
- `prime_award_base_transaction_description` - Award description
- `recipient_name` - Name of the recipient (often the university name)
- `recipient_state_name` - State of the award recipient
- `recipient_city_name` - City of the award recipient
- `cfda_title` - Catalog of Federal Domestic Assistance title (area of research)

Note: Because these datasets are large, I focused on just the 2022 grants. If you're interested, I encourage you to extend this exercise by incorporating other years.
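Because the real USAspending archives are multi-gigabyte files with far more columns than we need, it can help to restrict the load to just the interesting columns via `usecols`. Below is a minimal sketch of the idea using a tiny in-memory CSV stand-in (the extra `extra_col` column is made up for illustration):

```python
import io
import pandas as pd

# Columns we actually use in this exercise
INTERESTING_COLS = [
    "award_id_fain",
    "total_obligated_amount",
    "prime_award_base_transaction_description",
    "recipient_name",
    "recipient_state_name",
    "recipient_city_name",
    "cfda_title",
]

# Tiny stand-in for the (much wider) USAspending CSV
csv_text = (
    "award_id_fain,total_obligated_amount,prime_award_base_transaction_description,"
    "recipient_name,recipient_state_name,recipient_city_name,cfda_title,extra_col\n"
    "AWD1,100000,Study of coral reefs,State University,Texas,Austin,Ocean Sciences,ignored\n"
)

# usecols drops every column not in the list at parse time,
# which keeps memory usage down on the real archives
df = pd.read_csv(io.StringIO(csv_text), usecols=INTERESTING_COLS)
print(df.columns.tolist())
```

The same `usecols=INTERESTING_COLS` argument can be passed to the real `pd.read_csv` call on the full 2022 file.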
In this cell we will:

- Mount Google Drive and load both datasets.
- Add a boolean flag, `is_woke`, to our dataset. We will use this flag to partition our target subset of documents from the corpus.

I have done the work to pull these datasets and set the `is_woke` column.
# JUST RUN THIS, no changes needed
from google.colab import drive
import pandas as pd
import math
import re
from collections import Counter
drive.mount('/content/gdrive')
# Load the "woke" grants dataset
woke_grants_df = pd.read_csv("/content/gdrive/MyDrive/datasets/woke_grants.tsv", delimiter="\t")
woke_grant_ids = woke_grants_df.dropna(subset="AWARD ID")["AWARD ID"]
# Load all NSF grants from 2022
grants_df = pd.read_csv("/content/gdrive/MyDrive/datasets/FY2022_049_Assistance_Full_20250109_1.csv",
on_bad_lines='skip', low_memory=False)
# Add a boolean "is_woke" column
grants_df["is_woke"] = grants_df["award_id_fain"].isin(woke_grant_ids)
# Print dataset info
print(f"Total grants: {len(grants_df)}")
print(f"Labeled 'woke': {grants_df['is_woke'].sum()}")
print(f"Percentage: {100 * grants_df['is_woke'].mean():.1f}%")
# Visualize the dataframe
grants_df.head()
# JUST RUN THIS, no changes needed
STOP_WORDS = {
"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
"has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
"to", "was", "were", "will", "with", "i", "you", "we", "they",
"this", "their", "our", "or", "but", "if", "then", "so", "such"
}
def preprocess(text):
    """Convert text to a list of lowercase words, removing stop words"""
    if pd.isna(text):
        return []
    # Convert to lowercase
    text = str(text).lower()
    # Split on punctuation and whitespace
    tokens = re.split(r"[,\.\!\?\s\(\)\[\];:\"']+", text)
    # Keep only non-empty tokens that aren't stop words
    processed_tokens = []
    for token in tokens:
        # Remove any remaining punctuation from edges
        token = token.strip("-/")
        if token and token not in STOP_WORDS and len(token) > 2:
            processed_tokens.append(token)
    return processed_tokens
# Test it
test_text = "This research investigates climate change impacts!"
print(preprocess(test_text)) # Should print: ['research', 'investigates', 'climate', 'change', 'impacts']
Your `get_grant_description` function should look a lot like `get_song_lyrics`:
def get_song_lyrics(lyrics_df, artist, title):
    artist_df = lyrics_df[lyrics_df["Artist"] == artist]
    title_df = artist_df[artist_df["Title"] == title]
    return title_df['Lyric'].values[0]
Except this time we only need to filter by a single column (`grant_id`).
def get_grant_description(grants_df, grant_id):
    # Input: grants_df is the DataFrame, grant_id is the award ID to find
    # Output: Returns the description string (or None if not found)
    # TODO: Your code here!
    # 1. Filter to rows where award_id_fain equals grant_id
    # 2. Get the prime_award_base_transaction_description value
    # 3. Handle the case where the grant isn't found
    pass
# Test your function
sample_id = grants_df['award_id_fain'].iloc[0]
description = get_grant_description(grants_df, sample_id)
if description:
print(f"Sample description: {description[:200]}...")
# Look at some examples of "woke" grants
print("\nExamples of labeled grants:")
woke_examples = grants_df[grants_df['is_woke'] == True].head(3)
for _, row in woke_examples.iterrows():
print(f"\n{row['recipient_name']}: {row['prime_award_base_transaction_description'][:150]}...")
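If you get stuck, here is one possible implementation (a sketch mirroring the `get_song_lyrics` pattern, not the only valid answer), demonstrated on a made-up toy DataFrame:

```python
import pandas as pd

def get_grant_description(grants_df, grant_id):
    # Filter to rows whose award_id_fain matches grant_id
    match_df = grants_df[grants_df["award_id_fain"] == grant_id]
    # Handle the case where the grant isn't found
    if match_df.empty:
        return None
    # Return the first matching description string
    return match_df["prime_award_base_transaction_description"].values[0]

# Quick check on a toy DataFrame (values are invented for illustration)
toy_df = pd.DataFrame({
    "award_id_fain": ["AWD1", "AWD2"],
    "prime_award_base_transaction_description": ["Coral reef study", "Robotics outreach"],
})
print(get_grant_description(toy_df, "AWD2"))     # Robotics outreach
print(get_grant_description(toy_df, "MISSING"))  # None
```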
Take a look at the slides at: bigd103.link/slides
def calculate_subset_term_frequency(subset_descriptions):
    # Input: subset_descriptions is a list of NSF grant description strings (our subset of documents)
    # Output: Returns dictionary mapping word -> total count across all descriptions
    subset_tf = {}
    # TODO: Your code here!
    # 1. For each grant description:
    # 2.     Preprocess the description
    # 3.     For each term in the preprocessed description:
    # 4.         Add 1 to that term's count in subset_tf
    return subset_tf
# Calculate TF for "woke" grants
woke_descriptions = grants_df[grants_df['is_woke'] == True]['prime_award_base_transaction_description'].tolist()
print(f"Calculating term frequency for {len(woke_descriptions)} 'woke' grants...")
woke_tf = calculate_subset_term_frequency(woke_descriptions)
print(f"Unique terms in subset: {len(woke_tf)}")
print("Top 10 terms by frequency:", sorted(woke_tf.items(), key=lambda x: x[1], reverse=True)[:10])
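One way to fill in `calculate_subset_term_frequency` is sketched below. To keep the example self-contained it uses `simple_preprocess`, a hypothetical stand-in tokenizer; in the notebook you would call your `preprocess` function instead:

```python
import re
from collections import Counter

def simple_preprocess(text):
    # Simplified stand-in for preprocess(): lowercase, split on non-letters,
    # and drop very short tokens
    return [t for t in re.split(r"[^a-z]+", str(text).lower()) if len(t) > 2]

def calculate_subset_term_frequency(subset_descriptions):
    # Counter handles the "add the count of that term" bookkeeping for us
    subset_tf = Counter()
    for description in subset_descriptions:
        subset_tf.update(simple_preprocess(description))
    return dict(subset_tf)

# Toy subset of two invented descriptions
docs = ["Equity in STEM education", "STEM equity and inclusion"]
tf = calculate_subset_term_frequency(docs)
print(tf)
```

A plain dictionary with `subset_tf[term] = subset_tf.get(term, 0) + 1` works just as well; `Counter` simply makes the intent explicit.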
def calculate_document_frequency(corpus, target_terms):
    # Input: corpus is a list of all NSF grant description strings (all of our documents)
    #        target_terms is a set of terms to check
    # Output: Returns dictionary mapping term -> number of documents containing it
    doc_freq = {}
    # TODO: Your code here!
    # 1. Create a new empty list, preprocessed_corpus
    # 2. For each document in corpus:
    # 3.     Preprocess the document
    # 4.     Append the preprocessed document to preprocessed_corpus
    # 5. For each term in target_terms:
    # 6.     For each preprocessed document in preprocessed_corpus:
    # 7.         If the term is in the document, increment doc_freq for that term
    return doc_freq
# Create corpus and calculate DF
all_descriptions = grants_df['prime_award_base_transaction_description'].tolist()
target_terms = set(woke_tf.keys())
print(f"Calculating document frequency for {len(target_terms)} terms...")
print(f"Corpus size: {len(all_descriptions)} grants")
df_counts = calculate_document_frequency(all_descriptions, target_terms)
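A possible solution sketch for `calculate_document_frequency` follows (again using the hypothetical `simple_preprocess` helper so the example runs on its own). Converting each preprocessed document to a `set` is a deliberate choice: it makes the per-term membership test O(1) instead of scanning a list, which matters when checking thousands of terms against tens of thousands of grants:

```python
import re

def simple_preprocess(text):
    # Simplified stand-in for preprocess(): lowercase, split on non-letters
    return [t for t in re.split(r"[^a-z]+", str(text).lower()) if len(t) > 2]

def calculate_document_frequency(corpus, target_terms):
    # Preprocess each document once, storing its terms as a set
    # so "term in doc_terms" is a fast hash lookup
    preprocessed_corpus = [set(simple_preprocess(doc)) for doc in corpus]
    doc_freq = {}
    for term in target_terms:
        # Count how many documents contain the term at least once
        doc_freq[term] = sum(1 for doc_terms in preprocessed_corpus if term in doc_terms)
    return doc_freq

# Toy corpus of three invented descriptions
docs = ["equity in science", "science education", "rural broadband"]
df_counts = calculate_document_frequency(docs, {"science", "equity"})
print(df_counts)  # science appears in 2 documents, equity in 1
```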
def calculate_subset_tfidf(subset_tf, doc_freq, total_docs):
    # Input: subset_tf is dictionary of term frequencies in the subset
    #        doc_freq is dictionary of document frequencies in full corpus
    #        total_docs is total number of documents in corpus
    # Output: Returns dictionary mapping term -> Subset TF-IDF score
    tfidf = {}
    # TODO: Your code here!
    # For each term in subset_tf:
    # 1. Calculate IDF = math.log(total_docs / doc_freq[term])
    # 2. Calculate Subset TF-IDF = subset_tf[term] * IDF
    # 3. Store in tfidf dictionary
    return tfidf
# Calculate Subset TF-IDF
subset_tfidf_scores = calculate_subset_tfidf(woke_tf, df_counts, len(all_descriptions))
# Display results
sorted_scores = sorted(subset_tfidf_scores.items(), key=lambda x: x[1], reverse=True)
print(f"\nTop 30 terms by Subset TF-IDF among 'woke' grants:")
for term, score in sorted_scores[:30]:
print(f" {term}: {score:.3f}")
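A possible sketch of `calculate_subset_tfidf`, shown on invented toy counts: a term that is frequent in the subset but rare in the corpus (here "equity") earns a high score, while a term that appears in every document gets IDF = log(1) = 0 and drops out entirely:

```python
import math

def calculate_subset_tfidf(subset_tf, doc_freq, total_docs):
    tfidf = {}
    for term, count in subset_tf.items():
        # Rarer terms in the full corpus get a larger IDF boost
        idf = math.log(total_docs / doc_freq[term])
        # Subset TF-IDF = subset term count * corpus-wide IDF
        tfidf[term] = count * idf
    return tfidf

# Toy counts: corpus of 10 documents; "science" appears in all of them,
# "equity" in only 2
subset_tf = {"equity": 4, "science": 5}
doc_freq = {"equity": 2, "science": 10}
scores = calculate_subset_tfidf(subset_tf, doc_freq, total_docs=10)
print(scores)
```

Note that any term in `subset_tf` necessarily appears in the corpus, so `doc_freq[term]` is always at least 1 and the division is safe.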
This concludes the assignment, but with the code above we've established a toolset for deeper analysis.
Questions that may be worth exploring:
Do we see different terms when we isolate the dataset to a single research area (`cfda_title`)?
Do terms change for other Biden-Harris presidency years such as 2021, 2023, and 2024?
How could we scale this methodology into a general classifier that spans NSF grants from other presidencies?