In this exercise, you'll use Subset TF-IDF to discover what terms characterize grants that were labeled as "woke" in a dataset released by the U.S. Senate Commerce Committee.
U.S. Senate Commerce Committee Chairman Ted Cruz (R-Texas) released a database identifying over 3,400 grants, totaling more than $2.05 billion in federal funding, awarded by the National Science Foundation (NSF) during the Biden-Harris administration. According to the committee's press release, this funding was diverted toward questionable projects that promoted Diversity, Equity, and Inclusion (DEI) or advanced "neo-Marxist class warfare propaganda."
You will:

- Load the labeled "woke" grants and the full 2022 NSF grant records, flagging each grant with is_woke
- Preprocess grant descriptions into sets of tokens
- Compute term frequencies over the "woke" subset and document frequencies over the full corpus
- Combine them into Subset TF-IDF scores to surface the subset's most distinctive terms
The U.S. National Science Foundation (NSF) is an independent agency of the United States federal government that supports fundamental research and education in all the non-medical fields of science and engineering.
The NSF funds approximately 25% of all federally supported basic research conducted by the United States' colleges and universities.
In some fields, such as mathematics, computer science, economics, and the social sciences, the NSF is the major source of federal backing.
For this exercise, we will use two datasets that I've already pulled and placed in our shared datasets Google Drive folder.
This database was released as part of the U.S. Senate Commerce Committee's press release and is downloadable from the following page:
This dataset contains useful information such as recipients and descriptions but is missing details about the "non-woke" grants necessary for TF-IDF.
USAspending.gov is the official open data source for federal spending information. Detailed, year-by-year records of all grants since 2008 are available for download:
https://www.usaspending.gov/download_center/award_data_archive
This dataset is much larger, more detailed, and most importantly, includes the missing information about the non-woke grants needed to build our corpus for TF-IDF.
Interesting Columns:

- `award_id_fain` - Unique award identifier (join key)
- `total_obligated_amount` - Total grant obligated amount
- `prime_award_base_transaction_description` - Award description
- `recipient_name` - Name of the recipient (often the university name)
- `recipient_state_name` - State of the award recipient
- `recipient_city_name` - City of the award recipient
- `cfda_title` - Catalog of Federal Domestic Assistance title (area of research)

Note: Because these datasets are large, I focused on just the 2022 grants. If you're interested, I encourage you to extend this exercise by incorporating other years.
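Because the full USAspending file has 113 columns, you can cut memory use by asking pandas to parse only the columns listed above. This is a minimal sketch of the `usecols` parameter of `pd.read_csv`, using a hypothetical in-memory CSV with the same column names rather than the real file:

```python
import io
import pandas as pd

# Hypothetical miniature CSV sharing column names with the USAspending file
csv_text = (
    "award_id_fain,total_obligated_amount,"
    "prime_award_base_transaction_description,recipient_name\n"
    "2025735,370110.0,EXAMPLE DESCRIPTION,EXAMPLE UNIVERSITY\n"
)

# usecols tells pandas to parse only the columns we care about
cols = ["award_id_fain", "prime_award_base_transaction_description"]
df = pd.read_csv(io.StringIO(csv_text), usecols=cols)
print(list(df.columns))  # only the requested columns survive
```

The same `usecols=` argument works unchanged when pointed at the real `FY2022_049_Assistance_Full_20250109_1.csv`.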
In this cell we will add a boolean flag, `is_woke`, to our dataset. We will use this flag to partition our target subset of documents from the corpus. I have done the work to pull these datasets and set the `is_woke` column.
# JUST RUN THIS, no changes needed
from google.colab import drive
import pandas as pd
import math
import re
from collections import Counter
drive.mount('/content/gdrive')
# Load the "woke" grants dataset
woke_grants_df = pd.read_csv("/content/gdrive/MyDrive/datasets/woke_grants.tsv", delimiter="\t")
woke_grant_ids = woke_grants_df.dropna(subset="AWARD ID")["AWARD ID"]
# Load all NSF grants from 2022
grants_df = pd.read_csv("/content/gdrive/MyDrive/datasets/FY2022_049_Assistance_Full_20250109_1.csv",
on_bad_lines='skip', low_memory=False)
# Add a boolean "is_woke" column
grants_df["is_woke"] = grants_df["award_id_fain"].isin(woke_grant_ids)
# Print dataset info
print(f"Total grants: {len(grants_df)}")
print(f"Labeled 'woke': {grants_df['is_woke'].sum()}")
print(f"Percentage: {100 * grants_df['is_woke'].mean():.1f}%")
# Uncomment to sample the dataset for development / testing purposes
#grants_df = grants_df.sample(2000)
grants_df.head(10)
Mounted at /content/gdrive
Total grants: 29425
Labeled 'woke': 1564
Percentage: 5.3%
assistance_transaction_unique_key | assistance_award_unique_key | award_id_fain | modification_number | award_id_uri | sai_number | federal_action_obligation | total_obligated_amount | total_outlayed_amount_for_overall_award | indirect_cost_federal_share_amount | ... | highly_compensated_officer_3_name | highly_compensated_officer_3_amount | highly_compensated_officer_4_name | highly_compensated_officer_4_amount | highly_compensated_officer_5_name | highly_compensated_officer_5_amount | usaspending_permalink | initial_report_date | last_modified_date | is_woke | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4900_2025735_-NONE-_47.050_002 | ASST_NON_2025735_4900 | 2025735 | 002 | NaN | SAI EXEMPT | 112807.0 | 370110.0 | 218741.17 | 31386.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_202... | 2022-07-06 | 2022-07-06 | False |
1 | 4900_2139301_-NONE-_47.074_000 | ASST_NON_2139301_4900 | 2139301 | 000 | NaN | SAI EXEMPT | 86116.0 | 172232.0 | 112878.72 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_213... | 2022-02-04 | 2022-02-04 | False |
2 | 4900_2139301_-NONE-_47.050_000 | ASST_NON_2139301_4900 | 2139301 | 000 | NaN | SAI EXEMPT | 86116.0 | 172232.0 | 112878.72 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_213... | 2022-02-04 | 2022-02-04 | False |
3 | 4900_1763524_-NONE-_47.070_005 | ASST_NON_1763524_4900 | 1763524 | 005 | NaN | SAI EXEMPT | 0.0 | 1325062.0 | 633019.51 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_176... | 2022-07-06 | 2022-07-06 | False |
4 | 4900_2209765_-NONE-_47.074_000 | ASST_NON_2209765_4900 | 2209765 | 000 | NaN | SAI EXEMPT | 252488.0 | 379579.0 | 141017.33 | 77948.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_220... | 2022-09-06 | 2022-09-06 | False |
5 | 4900_2224973_-NONE-_47.050_000 | ASST_NON_2224973_4900 | 2224973 | 000 | NaN | SAI EXEMPT | 497132.0 | 497132.0 | 414390.47 | 140705.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_222... | 2022-09-06 | 2022-09-06 | True |
6 | 4900_2026822_-NONE-_47.049_005 | ASST_NON_2026822_4900 | 2026822 | 005 | NaN | SAI EXEMPT | 200000.0 | 5875000.0 | 4014242.15 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_202... | 2022-09-20 | 2022-09-20 | False |
7 | 4900_1846076_-NONE-_47.070_P002 | ASST_NON_1846076_4900 | 1846076 | P002 | NaN | SAI EXEMPT | -78415.0 | 144791.0 | NaN | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_184... | 2022-05-04 | 2022-05-04 | False |
8 | 4900_1642232_-NONE-_47.050_004 | ASST_NON_1642232_4900 | 1642232 | 004 | NaN | SAI EXEMPT | 0.0 | 382873.0 | 61920.42 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_164... | 2022-04-21 | 2022-04-21 | False |
9 | 4900_1534606_-NONE-_47.075_006 | ASST_NON_1534606_4900 | 1534606 | 006 | NaN | SAI EXEMPT | 0.0 | 149341.0 | 7544.60 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_153... | 2022-06-06 | 2022-06-06 | False |
10 rows × 113 columns
# JUST RUN THIS, no changes needed
STOP_WORDS = {
"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
"has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
"to", "was", "were", "will", "with", "i", "you", "we", "they",
"this", "their", "our", "or", "but", "if", "then", "so", "such"
}
def preprocess(text):
"""Convert text to a list of lowercase words, removing stop words"""
if pd.isna(text):
return []
# Convert to lowercase
text = str(text).lower()
# Split on punctuation and whitespace
tokens = re.split(r"[,\.\!\?\s\(\)\[\];:\"']+", text)
# Keep only non-empty tokens that aren't stop words
processed_tokens = []
for token in tokens:
# Remove any remaining punctuation from edges
token = token.strip("-/")
if token and token not in STOP_WORDS and len(token) > 2:
processed_tokens.append(token)
return set(processed_tokens)
# Test it
test_text = "This research investigates climate change impacts!"
print(preprocess(test_text))  # Should print a set like {'research', 'investigates', 'climate', 'change', 'impacts'}
{'investigates', 'change', 'impacts', 'climate', 'research'}
`get_grant_description` should look a lot like `get_song_lyrics`:
def get_song_lyrics(lyrics_df, artist, title):
artist_df = lyrics_df[lyrics_df["Artist"] == artist]
title_df = artist_df[artist_df["Title"] == title]
return title_df['Lyric'].values[0]
Except this time we only need to filter by a single column (`grant_id`).
def get_grant_description(grants_df, grant_id):
return grants_df[grants_df['award_id_fain'] == grant_id]['prime_award_base_transaction_description'].values[0]
# Test your function
sample_id = grants_df['award_id_fain'].iloc[0]
description = get_grant_description(grants_df, sample_id)
if description:
print(f"Sample description: {description[:200]}...")
# Look at some examples of "woke" grants
print("\nExamples of labeled grants:")
woke_examples = grants_df[grants_df['is_woke'] == True].head(3)
for _, row in woke_examples.iterrows():
print(f"\n{row['recipient_name']}: {row['prime_award_base_transaction_description'][:150]}...")
Sample description: REVISITING THE CAMBRIAN SERIES 1 ANIMAL ORIGINATION CHRONOLOGY...

Examples of labeled grants:

GEORGIA TECH RESEARCH CORP: TRACK I CENTER CATALYST: COLLABORATIVE CENTER FOR LANDSLIDES AND GROUND FAILURE GEOHAZARDS -LANDSLIDES ARE A COMMON HAZARD IN MOUNTAINOUS TERRAIN WORL...

ARIZONA STATE UNIVERSITY: CAREER: IDENTIFYING ASPECTS OF RESEARCH THAT EXACERBATE UNDERGRADUATE AND GRADUATE STUDENT DEPRESSION AND DEVELOPING INTERVENTIONS TO IMPROVE STUDENT ...

ILLINOIS INSTITUTE OF TECHNOLOGY: CIVIC-PG TRACK B: COMMUNITY FOOD MOBILIZATION IN CHICAGO -THE CHICAGOLAND FOOD RESEARCH-CENTERED PILOT PROJECT (CHIFOOD-RCPP) ADDRESSES A PRESSING RES...
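One caveat: `.values[0]` raises an IndexError when the id has no matching row. If you want a lookup that degrades gracefully, a defensive variant might look like the sketch below (`get_grant_description_safe` is my name for it, not part of the exercise; the tiny DataFrame is a stand-in for `grants_df`):

```python
import pandas as pd

def get_grant_description_safe(grants_df, grant_id):
    # Same lookup as get_grant_description, but returns None instead of
    # raising IndexError when grant_id has no matching row
    matches = grants_df.loc[grants_df["award_id_fain"] == grant_id,
                            "prime_award_base_transaction_description"]
    return matches.values[0] if len(matches) else None

# Tiny stand-in DataFrame for demonstration
toy = pd.DataFrame({
    "award_id_fain": ["A1", "A2"],
    "prime_award_base_transaction_description": ["DESC ONE", "DESC TWO"],
})
print(get_grant_description_safe(toy, "A1"))   # DESC ONE
print(get_grant_description_safe(toy, "ZZZ"))  # None
```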
def calculate_subset_term_frequency(descriptions):
# Input: descriptions is a list of description strings
    # Output: Returns dictionary mapping word -> number of descriptions containing it
    #         (preprocess returns a set, so repeats within one description count once)
subset_tf = {}
for grant in descriptions:
for term in preprocess(grant):
subset_tf[term] = subset_tf.get(term, 0) + 1
return subset_tf
# Calculate TF for "woke" grants
woke_descriptions = grants_df[grants_df['is_woke'] == True]['prime_award_base_transaction_description'].tolist()
print(f"Calculating term frequency for {len(woke_descriptions)} 'woke' grants...")
woke_tf = calculate_subset_term_frequency(woke_descriptions)
print(f"Unique terms in subset: {len(woke_tf)}")
print("Top 10 terms by frequency:", sorted(woke_tf.items(), key=lambda x: x[1], reverse=True)[:10])
Calculating term frequency for 1564 'woke' grants...
Unique terms in subset: 20923
Top 10 terms by frequency: [('using', 1563), ('support', 1563), ('impacts', 1563), ('worthy', 1561), ('statutory', 1561), ('intellectual', 1561), ('mission', 1561), ('review', 1561), ('merit', 1561), ('broader', 1561)]
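Note that because `preprocess` returns a set, `subset_tf` is really a per-document count: a word repeated ten times in one description still adds only 1 for that description. A minimal sketch with a simplified tokenizer (`toy_preprocess` stands in for the notebook's `preprocess`) shows the effect:

```python
def toy_preprocess(text):
    # Simplified stand-in for preprocess: lowercase, split, deduplicate
    return set(text.lower().split())

def toy_subset_tf(descriptions):
    # Same accumulation pattern as calculate_subset_term_frequency
    counts = {}
    for d in descriptions:
        for term in toy_preprocess(d):
            counts[term] = counts.get(term, 0) + 1
    return counts

docs = ["climate climate climate research", "climate education"]
counts = toy_subset_tf(docs)
print(counts["climate"], counts["research"])  # 2 1  -- not 4 1
```

This is why the top counts above max out at 1,563-1,564: those words appear in nearly every subset description, not thousands of times each.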
def calculate_document_frequency(corpus, target_terms):
# Input: corpus is a list of all description strings
# target_terms is a set of terms to check
# Output: Returns dictionary mapping term -> number of documents containing it
preprocessed_corpus = []
for doc in corpus:
preprocessed_corpus.append(preprocess(doc))
doc_freq = {}
for term in target_terms:
for doc in preprocessed_corpus:
if term in doc:
doc_freq[term] = doc_freq.get(term, 0) + 1
return doc_freq
# Create corpus and calculate DF
all_descriptions = grants_df['prime_award_base_transaction_description'].tolist()
target_terms = set(woke_tf.keys())
print(f"Calculating document frequency for {len(target_terms)} terms...")
print(f"Corpus size: {len(all_descriptions)} grants")
df_counts = calculate_document_frequency(all_descriptions, target_terms)
Calculating document frequency for 20923 terms...
Corpus size: 29425 grants
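The loop in `calculate_document_frequency` rescans the preprocessed corpus once per term (~20,923 × 29,425 membership checks). Since `Counter` is already imported, an equivalent one-pass version is possible. This is a sketch using the same simplified tokenizer idea rather than the notebook's `preprocess`:

```python
from collections import Counter

def toy_preprocess(text):
    # Simplified stand-in for the notebook's preprocess function
    return set(text.lower().split())

def calculate_document_frequency_fast(corpus):
    # One pass over the corpus: each document contributes each of its
    # unique tokens once, so the Counter ends up holding document frequencies
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(toy_preprocess(doc))
    return doc_freq

corpus = ["climate research", "climate education", "stem education"]
df_fast = calculate_document_frequency_fast(corpus)
print(df_fast["climate"], df_fast["education"], df_fast["stem"])  # 2 2 1
```

A `Counter` is also forgiving at lookup time: a term absent from the corpus returns 0 instead of raising a KeyError.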
def calculate_subset_tfidf(subset_tf, doc_freq, total_docs):
# Input: subset_tf is dictionary of term frequencies in the subset
# doc_freq is dictionary of document frequencies in full corpus
# total_docs is total number of documents in corpus
# Output: Returns dictionary mapping term -> Subset TF-IDF score
tfidf = {}
for term in subset_tf:
idf = math.log(total_docs / doc_freq[term])
tfidf[term] = subset_tf[term] * idf
return tfidf
# Calculate Subset TF-IDF
subset_tfidf_scores = calculate_subset_tfidf(woke_tf, df_counts, len(all_descriptions))
# Display results
sorted_scores = sorted(subset_tfidf_scores.items(), key=lambda x: x[1], reverse=True)
print(f"\nTop 30 most distinctive terms in 'woke' grants:")
for term, score in sorted_scores[:30]:
print(f" {term}: {score:.3f}")
Top 30 most distinctive terms in 'woke' grants:
  underrepresented: 1908.339
  participation: 1629.193
  groups: 1623.682
  stem: 1567.371
  education: 1518.674
  program: 1511.992
  students: 1511.077
  diverse: 1489.686
  knowledge: 1428.619
  community: 1383.398
  how: 1375.194
  worthy: 1368.826
  statutory: 1368.826
  reflects: 1368.826
  deemed: 1368.826
  mission: 1368.698
  criteria: 1368.698
  merit: 1368.443
  intellectual: 1368.316
  review: 1367.551
  been: 1367.551
  broader: 1366.914
  award: 1364.370
  foundation: 1364.370
  project: 1364.268
  evaluation: 1359.039
  impacts: 1355.461
  also: 1350.105
  diversity: 1345.061
  nsf: 1340.023
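To see how a high score arises, here is the Subset TF-IDF arithmetic worked through with hypothetical numbers chosen for illustration (not taken from the run above): a term appearing in 1,500 of the 1,564 subset grants but only 2,000 of all 29,425 grants.

```python
import math

# Hypothetical illustration values, not real counts from the dataset
subset_tf = 1500    # subset documents containing the term
doc_freq = 2000     # corpus documents containing the term
total_docs = 29425  # corpus size

idf = math.log(total_docs / doc_freq)  # ln(14.71) ~= 2.689
score = subset_tf * idf                # high: common in subset, rare overall
print(f"idf={idf:.3f}, score={score:.1f}")
```

A term like "worthy" scores high for the opposite reason: its IDF is tiny (it appears in NSF boilerplate across the corpus), but its subset TF is near the maximum, so it still floats into the top 30.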