In this exercise, you'll use Subset TF-IDF to discover what terms characterize grants that were labeled as "woke" in a dataset released by the U.S. Senate Commerce Committee.
U.S. Senate Commerce Committee Chairman Ted Cruz (R-Texas) released a database identifying over 3,400 grants, totaling more than $2.05 billion in federal funding, awarded by the National Science Foundation (NSF) during the Biden-Harris administration. According to the committee's press release, this funding was diverted toward questionable projects that promoted Diversity, Equity, and Inclusion (DEI) or advanced "neo-Marxist class warfare propaganda."
You will:

- Load the labeled "woke" grants and the full 2022 NSF grant records, flagging each grant with is_woke
- Preprocess grant descriptions into sets of tokens
- Compute term frequencies over the "woke" subset and document frequencies over the full corpus
- Combine them into Subset TF-IDF scores to surface the subset's most distinctive terms
The U.S. National Science Foundation (NSF) is an independent agency of the United States federal government that supports fundamental research and education in all the non-medical fields of science and engineering.
The NSF funds approximately 25% of all federally supported basic research conducted by the United States' colleges and universities.
In some fields, such as mathematics, computer science, economics, and the social sciences, the NSF is the major source of federal backing.
For this exercise, we will use two datasets that I've already pulled and placed in our shared datasets Google Drive folder.
This database was released as part of the U.S. Senate Commerce Committee's press release and is downloadable from the following page:
This dataset contains useful information such as recipients and descriptions but is missing details about the "non-woke" grants necessary for TF-IDF.
USAspending.gov is the official open data source for federal spending information. Detailed, year-by-year records of all grants since 2008 are available for download:
https://www.usaspending.gov/download_center/award_data_archive
This dataset is much larger, more detailed, and most importantly, includes the missing information about the non-woke grants needed to build our corpus for TF-IDF.
Interesting Columns:

- `award_id_fain` - Unique award identifier (join key)
- `total_obligated_amount` - Total grant obligated amount
- `prime_award_base_transaction_description` - Award description
- `recipient_name` - Name of the recipient (often the university name)
- `recipient_state_name` - State of the award recipient
- `recipient_city_name` - City of the award recipient
- `cfda_title` - Catalog of Federal Domestic Assistance title (area of research)

Note: Because these datasets are large, I focused on just the 2022 grants. If you're interested, I encourage you to extend this exercise by incorporating other years.
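Because the full USAspending file has 113 columns, you can cut memory use by asking pandas to parse only the columns listed above. This is a minimal sketch of the `usecols` parameter of `pd.read_csv`, using a hypothetical in-memory CSV with the same column names rather than the real file:

```python
import io
import pandas as pd

# Hypothetical miniature CSV sharing column names with the USAspending file
csv_text = (
    "award_id_fain,total_obligated_amount,"
    "prime_award_base_transaction_description,recipient_name\n"
    "2025735,370110.0,EXAMPLE DESCRIPTION,EXAMPLE UNIVERSITY\n"
)

# usecols tells pandas to parse only the columns we care about
cols = ["award_id_fain", "prime_award_base_transaction_description"]
df = pd.read_csv(io.StringIO(csv_text), usecols=cols)
print(list(df.columns))  # only the requested columns survive
```

The same `usecols=` argument works unchanged when pointed at the real `FY2022_049_Assistance_Full_20250109_1.csv`.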
In this cell we will add a boolean flag, `is_woke`, to our dataset. We will use this flag to partition our target subset of documents from the corpus. I have done the work to pull these datasets and set the `is_woke` column.
# JUST RUN THIS, no changes needed
from google.colab import drive
import pandas as pd
import math
import re
from collections import Counter
drive.mount('/content/gdrive')
# Load the "woke" grants dataset
woke_grants_df = pd.read_csv("/content/gdrive/MyDrive/datasets/woke_grants.tsv", delimiter="\t")
woke_grant_ids = woke_grants_df.dropna(subset="AWARD ID")["AWARD ID"]
# Load all NSF grants from 2022
grants_df = pd.read_csv("/content/gdrive/MyDrive/datasets/FY2022_049_Assistance_Full_20250109_1.csv",
on_bad_lines='skip', low_memory=False)
# Add a boolean "is_woke" column
grants_df["is_woke"] = grants_df["award_id_fain"].isin(woke_grant_ids)
# Print dataset info
print(f"Total grants: {len(grants_df)}")
print(f"Labeled 'woke': {grants_df['is_woke'].sum()}")
print(f"Percentage: {100 * grants_df['is_woke'].mean():.1f}%")
# Uncomment to sample the dataset for development / testing purposes
#grants_df = grants_df.sample(2000)
grants_df.head(10)
Mounted at /content/gdrive
Total grants: 29425
Labeled 'woke': 1564
Percentage: 5.3%
assistance_transaction_unique_key | assistance_award_unique_key | award_id_fain | modification_number | award_id_uri | sai_number | federal_action_obligation | total_obligated_amount | total_outlayed_amount_for_overall_award | indirect_cost_federal_share_amount | ... | highly_compensated_officer_3_name | highly_compensated_officer_3_amount | highly_compensated_officer_4_name | highly_compensated_officer_4_amount | highly_compensated_officer_5_name | highly_compensated_officer_5_amount | usaspending_permalink | initial_report_date | last_modified_date | is_woke | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4900_2025735_-NONE-_47.050_002 | ASST_NON_2025735_4900 | 2025735 | 002 | NaN | SAI EXEMPT | 112807.0 | 370110.0 | 218741.17 | 31386.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_202... | 2022-07-06 | 2022-07-06 | False |
1 | 4900_2139301_-NONE-_47.074_000 | ASST_NON_2139301_4900 | 2139301 | 000 | NaN | SAI EXEMPT | 86116.0 | 172232.0 | 112878.72 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_213... | 2022-02-04 | 2022-02-04 | False |
2 | 4900_2139301_-NONE-_47.050_000 | ASST_NON_2139301_4900 | 2139301 | 000 | NaN | SAI EXEMPT | 86116.0 | 172232.0 | 112878.72 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_213... | 2022-02-04 | 2022-02-04 | False |
3 | 4900_1763524_-NONE-_47.070_005 | ASST_NON_1763524_4900 | 1763524 | 005 | NaN | SAI EXEMPT | 0.0 | 1325062.0 | 633019.51 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_176... | 2022-07-06 | 2022-07-06 | False |
4 | 4900_2209765_-NONE-_47.074_000 | ASST_NON_2209765_4900 | 2209765 | 000 | NaN | SAI EXEMPT | 252488.0 | 379579.0 | 141017.33 | 77948.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_220... | 2022-09-06 | 2022-09-06 | False |
5 | 4900_2224973_-NONE-_47.050_000 | ASST_NON_2224973_4900 | 2224973 | 000 | NaN | SAI EXEMPT | 497132.0 | 497132.0 | 414390.47 | 140705.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_222... | 2022-09-06 | 2022-09-06 | True |
6 | 4900_2026822_-NONE-_47.049_005 | ASST_NON_2026822_4900 | 2026822 | 005 | NaN | SAI EXEMPT | 200000.0 | 5875000.0 | 4014242.15 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_202... | 2022-09-20 | 2022-09-20 | False |
7 | 4900_1846076_-NONE-_47.070_P002 | ASST_NON_1846076_4900 | 1846076 | P002 | NaN | SAI EXEMPT | -78415.0 | 144791.0 | NaN | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_184... | 2022-05-04 | 2022-05-04 | False |
8 | 4900_1642232_-NONE-_47.050_004 | ASST_NON_1642232_4900 | 1642232 | 004 | NaN | SAI EXEMPT | 0.0 | 382873.0 | 61920.42 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_164... | 2022-04-21 | 2022-04-21 | False |
9 | 4900_1534606_-NONE-_47.075_006 | ASST_NON_1534606_4900 | 1534606 | 006 | NaN | SAI EXEMPT | 0.0 | 149341.0 | 7544.60 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.usaspending.gov/award/ASST_NON_153... | 2022-06-06 | 2022-06-06 | False |
10 rows × 113 columns
# JUST RUN THIS, no changes needed
STOP_WORDS = {
"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
"has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
"to", "was", "were", "will", "with", "i", "you", "we", "they",
"this", "their", "our", "or", "but", "if", "then", "so", "such"
}
def preprocess(text):
"""Convert text to a list of lowercase words, removing stop words"""
if pd.isna(text):
return []
# Convert to lowercase
text = str(text).lower()
# Split on punctuation and whitespace
tokens = re.split(r"[,\.\!\?\s\(\)\[\];:\"']+", text)
# Keep only non-empty tokens that aren't stop words
processed_tokens = []
for token in tokens:
# Remove any remaining punctuation from edges
token = token.strip("-/")
if token and token not in STOP_WORDS and len(token) > 2:
processed_tokens.append(token)
return set(processed_tokens)
# Test it
test_text = "This research investigates climate change impacts!"
print(preprocess(test_text))  # Should print a set like {'research', 'investigates', 'climate', 'change', 'impacts'}
{'investigates', 'change', 'impacts', 'climate', 'research'}
`get_grant_description` should look a lot like `get_song_lyrics`:
def get_song_lyrics(lyrics_df, artist, title):
artist_df = lyrics_df[lyrics_df["Artist"] == artist]
title_df = artist_df[artist_df["Title"] == title]
return title_df['Lyric'].values[0]
Except this time we only need to filter by a single column (`grant_id`).
def get_grant_description(grants_df, grant_id):
return grants_df[grants_df['award_id_fain'] == grant_id]['prime_award_base_transaction_description'].values[0]
# Test your function
sample_id = grants_df['award_id_fain'].iloc[0]
description = get_grant_description(grants_df, sample_id)
if description:
print(f"Sample description: {description[:200]}...")
# Look at some examples of "woke" grants
print("\nExamples of labeled grants:")
woke_examples = grants_df[grants_df['is_woke'] == True].head(3)
for _, row in woke_examples.iterrows():
print(f"\n{row['recipient_name']}: {row['prime_award_base_transaction_description'][:150]}...")
Sample description: REVISITING THE CAMBRIAN SERIES 1 ANIMAL ORIGINATION CHRONOLOGY...

Examples of labeled grants:

GEORGIA TECH RESEARCH CORP: TRACK I CENTER CATALYST: COLLABORATIVE CENTER FOR LANDSLIDES AND GROUND FAILURE GEOHAZARDS -LANDSLIDES ARE A COMMON HAZARD IN MOUNTAINOUS TERRAIN WORL...

ARIZONA STATE UNIVERSITY: CAREER: IDENTIFYING ASPECTS OF RESEARCH THAT EXACERBATE UNDERGRADUATE AND GRADUATE STUDENT DEPRESSION AND DEVELOPING INTERVENTIONS TO IMPROVE STUDENT ...

ILLINOIS INSTITUTE OF TECHNOLOGY: CIVIC-PG TRACK B: COMMUNITY FOOD MOBILIZATION IN CHICAGO -THE CHICAGOLAND FOOD RESEARCH-CENTERED PILOT PROJECT (CHIFOOD-RCPP) ADDRESSES A PRESSING RES...
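One caveat: `.values[0]` raises an IndexError when the id has no matching row. If you want a lookup that degrades gracefully, a defensive variant might look like the sketch below (`get_grant_description_safe` is my name for it, not part of the exercise; the tiny DataFrame is a stand-in for `grants_df`):

```python
import pandas as pd

def get_grant_description_safe(grants_df, grant_id):
    # Same lookup as get_grant_description, but returns None instead of
    # raising IndexError when grant_id has no matching row
    matches = grants_df.loc[grants_df["award_id_fain"] == grant_id,
                            "prime_award_base_transaction_description"]
    return matches.values[0] if len(matches) else None

# Tiny stand-in DataFrame for demonstration
toy = pd.DataFrame({
    "award_id_fain": ["A1", "A2"],
    "prime_award_base_transaction_description": ["DESC ONE", "DESC TWO"],
})
print(get_grant_description_safe(toy, "A1"))   # DESC ONE
print(get_grant_description_safe(toy, "ZZZ"))  # None
```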
def calculate_subset_term_frequency(descriptions):
# Input: descriptions is a list of description strings
    # Output: Returns dictionary mapping word -> number of descriptions containing it
    #         (preprocess returns a set, so repeats within one description count once)
subset_tf = {}
for grant in descriptions:
for term in preprocess(grant):
subset_tf[term] = subset_tf.get(term, 0) + 1
return subset_tf
# Calculate TF for "woke" grants
woke_descriptions = grants_df[grants_df['is_woke'] == True]['prime_award_base_transaction_description'].tolist()
print(f"Calculating term frequency for {len(woke_descriptions)} 'woke' grants...")
woke_tf = calculate_subset_term_frequency(woke_descriptions)
print(f"Unique terms in subset: {len(woke_tf)}")
print("Top 10 terms by frequency:", sorted(woke_tf.items(), key=lambda x: x[1], reverse=True)[:10])
Calculating term frequency for 1564 'woke' grants...
Unique terms in subset: 20923
Top 10 terms by frequency: [('using', 1563), ('support', 1563), ('impacts', 1563), ('worthy', 1561), ('statutory', 1561), ('intellectual', 1561), ('mission', 1561), ('review', 1561), ('merit', 1561), ('broader', 1561)]
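Note that because `preprocess` returns a set, `subset_tf` is really a per-document count: a word repeated ten times in one description still adds only 1 for that description. A minimal sketch with a simplified tokenizer (`toy_preprocess` stands in for the notebook's `preprocess`) shows the effect:

```python
def toy_preprocess(text):
    # Simplified stand-in for preprocess: lowercase, split, deduplicate
    return set(text.lower().split())

def toy_subset_tf(descriptions):
    # Same accumulation pattern as calculate_subset_term_frequency
    counts = {}
    for d in descriptions:
        for term in toy_preprocess(d):
            counts[term] = counts.get(term, 0) + 1
    return counts

docs = ["climate climate climate research", "climate education"]
counts = toy_subset_tf(docs)
print(counts["climate"], counts["research"])  # 2 1  -- not 4 1
```

This is why the top counts above max out at 1,563-1,564: those words appear in nearly every subset description, not thousands of times each.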
def calculate_document_frequency(corpus, target_terms):
# Input: corpus is a list of all description strings
# target_terms is a set of terms to check
# Output: Returns dictionary mapping term -> number of documents containing it
preprocessed_corpus = []
for doc in corpus:
preprocessed_corpus.append(preprocess(doc))
doc_freq = {}
for term in target_terms:
for doc in preprocessed_corpus:
if term in doc:
doc_freq[term] = doc_freq.get(term, 0) + 1
return doc_freq
# Create corpus and calculate DF
all_descriptions = grants_df['prime_award_base_transaction_description'].tolist()
target_terms = set(woke_tf.keys())
print(f"Calculating document frequency for {len(target_terms)} terms...")
print(f"Corpus size: {len(all_descriptions)} grants")
df_counts = calculate_document_frequency(all_descriptions, target_terms)
Calculating document frequency for 20923 terms...
Corpus size: 29425 grants
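The loop in `calculate_document_frequency` rescans the preprocessed corpus once per term (~20,923 × 29,425 membership checks). Since `Counter` is already imported, an equivalent one-pass version is possible. This is a sketch using the same simplified tokenizer idea rather than the notebook's `preprocess`:

```python
from collections import Counter

def toy_preprocess(text):
    # Simplified stand-in for the notebook's preprocess function
    return set(text.lower().split())

def calculate_document_frequency_fast(corpus):
    # One pass over the corpus: each document contributes each of its
    # unique tokens once, so the Counter ends up holding document frequencies
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(toy_preprocess(doc))
    return doc_freq

corpus = ["climate research", "climate education", "stem education"]
df_fast = calculate_document_frequency_fast(corpus)
print(df_fast["climate"], df_fast["education"], df_fast["stem"])  # 2 2 1
```

A `Counter` is also forgiving at lookup time: a term absent from the corpus returns 0 instead of raising a KeyError.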
def calculate_subset_tfidf(subset_tf, doc_freq, total_docs):
# Input: subset_tf is dictionary of term frequencies in the subset
# doc_freq is dictionary of document frequencies in full corpus
# total_docs is total number of documents in corpus
# Output: Returns dictionary mapping term -> Subset TF-IDF score
tfidf = {}
for term in subset_tf:
idf = math.log(total_docs / doc_freq[term])
tfidf[term] = subset_tf[term] * idf
return tfidf
# Calculate Subset TF-IDF
subset_tfidf_scores = calculate_subset_tfidf(woke_tf, df_counts, len(all_descriptions))
# Display results
sorted_scores = sorted(subset_tfidf_scores.items(), key=lambda x: x[1], reverse=True)
print(f"\nTop 30 most distinctive terms in 'woke' grants:")
for term, score in sorted_scores[:30]:
print(f" {term}: {score:.3f}")
Top 30 most distinctive terms in 'woke' grants:
  underrepresented: 1908.339
  participation: 1629.193
  groups: 1623.682
  stem: 1567.371
  education: 1518.674
  program: 1511.992
  students: 1511.077
  diverse: 1489.686
  knowledge: 1428.619
  community: 1383.398
  how: 1375.194
  worthy: 1368.826
  statutory: 1368.826
  reflects: 1368.826
  deemed: 1368.826
  mission: 1368.698
  criteria: 1368.698
  merit: 1368.443
  intellectual: 1368.316
  review: 1367.551
  been: 1367.551
  broader: 1366.914
  award: 1364.370
  foundation: 1364.370
  project: 1364.268
  evaluation: 1359.039
  impacts: 1355.461
  also: 1350.105
  diversity: 1345.061
  nsf: 1340.023
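To see how a high score arises, here is the Subset TF-IDF arithmetic worked through with hypothetical numbers chosen for illustration (not taken from the run above): a term appearing in 1,500 of the 1,564 subset grants but only 2,000 of all 29,425 grants.

```python
import math

# Hypothetical illustration values, not real counts from the dataset
subset_tf = 1500    # subset documents containing the term
doc_freq = 2000     # corpus documents containing the term
total_docs = 29425  # corpus size

idf = math.log(total_docs / doc_freq)  # ln(14.71) ~= 2.689
score = subset_tf * idf                # high: common in subset, rare overall
print(f"idf={idf:.3f}, score={score:.1f}")
```

A term like "worthy" scores high for the opposite reason: its IDF is tiny (it appears in NSF boilerplate across the corpus), but its subset TF is near the maximum, so it still floats into the top 30.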