This notebook examines how presidents have used first-person singular versus first-person plural pronouns in their speeches.
import pandas as pd
import json
import nltk
The data used in this notebook comes from Vocativ's collection of presidential addresses, which can be found here: https://github.com/Vocativ-data/presidents_readability
# Load the speeches and close the file handle when done.
with open("../../vocativ_president_data/The original speeches.json") as f:
    objects = json.load(f)["objects"]
speeches_df = pd.DataFrame(objects)
# Whitespace-delimited word count for each speech.
speeches_df["word_count"] = speeches_df["Text"].apply(lambda x: len(x.split()))
speeches_df["tokens"] = speeches_df["Text"].apply(lambda x: nltk.word_tokenize(x))
speeches_df["i"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "i"]), axis=1)
speeches_df["me"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "me"]), axis=1)
speeches_df["my"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "my"]), axis=1)
speeches_df["mine"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "mine"]), axis=1)
speeches_df["myself"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "myself"]), axis=1)
speeches_df["first_person_singular"] = speeches_df.apply(lambda x: x["i"] + x["me"] + x["my"] +\
x["mine"] + x["myself"], axis=1)
speeches_df["we"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "we"]), axis=1)
speeches_df["our"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "our"]), axis=1)
speeches_df["ours"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "ours"]), axis=1)
speeches_df["ourselves"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "ourselves"]), axis=1)
speeches_df["us"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "us"]), axis=1)
speeches_df["first_person_plural"] = speeches_df.apply(lambda x: x["we"] + x["our"] + x["ours"] + x["ourselves"] + x["us"], axis=1)
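The per-pronoun columns above each scan the token list once. As a minimal sketch of an alternative, the same totals can be gathered in a single pass with `collections.Counter`. The function name `count_first_person` and the example token list are hypothetical and not part of the original notebook; the input is assumed to be an already tokenized speech.

```python
from collections import Counter

FIRST_PERSON_SINGULAR = ("i", "me", "my", "mine", "myself")
FIRST_PERSON_PLURAL = ("we", "our", "ours", "ourselves", "us")


def count_first_person(tokens):
    """Count first-person pronouns in a list of tokens, ignoring case."""
    counts = Counter(t.lower() for t in tokens)
    singular = sum(counts[p] for p in FIRST_PERSON_SINGULAR)
    plural = sum(counts[p] for p in FIRST_PERSON_PLURAL)
    return {
        "first_person_singular": singular,
        "first_person_plural": plural,
        "first_person": singular + plural,
    }


# Hypothetical, hand-tokenized sentence used only for illustration.
example_tokens = ["I", "believe", "we", "can", "do", "this",
                  "ourselves", ",", "my", "friends", "."]
result = count_first_person(example_tokens)
print(result)
```

In the notebook's dataframe this could be applied as `speeches_df["tokens"].apply(count_first_person)`, which would yield one dictionary of counts per speech.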
speeches_df["first_person"] = speeches_df.apply(lambda x: x["first_person_singular"] + x["first_person_plural"], axis=1)
speech_analysis = speeches_df[["word_count", "tokens", "President", "first_person",
                               "first_person_singular", "first_person_plural"]]
We restrict the analysis to modern presidents (since 1929), because that is the period covered by our news conference analysis. The list below uses names that match the President column of the address dataframe exactly.
news_conf_presidents = ["Richard Nixon", "Gerald Ford", "George H. W. Bush", "Lyndon B. Johnson", "Jimmy Carter",
                        "Bill Clinton", "Harry S. Truman", "Ronald Reagan", "Barack Obama", "John F. Kennedy",
                        "Franklin D. Roosevelt", "Dwight D. Eisenhower", "Herbert Hoover", "George W. Bush"]
modern_presidents = speech_analysis[speech_analysis["President"].isin(news_conf_presidents)]
# Sum only the numeric count columns; the token lists are not meaningful to add.
presidents = modern_presidents.groupby("President").sum(numeric_only=True)
presidents["pct_first"] = presidents.apply(lambda x: round(100.0 * x["first_person"] / x["word_count"], 2), axis=1)
presidents["pct_first_singular"] = presidents.apply(lambda x: round(100.0 * x["first_person_singular"] / x["word_count"], 2), axis=1)
presidents["pct_first_plural"] = presidents.apply(lambda x: round(100.0 * x["first_person_plural"] / x["word_count"], 2), axis=1)
presidents.sort_values("pct_first_singular", ascending=False)
President | word_count | first_person | first_person_singular | first_person_plural | pct_first | pct_first_singular | pct_first_plural |
---|---|---|---|---|---|---|---|
Richard Nixon | 67445 | 3627 | 1684 | 1943 | 5.38 | 2.50 | 2.88 |
Gerald Ford | 40301 | 2298 | 975 | 1323 | 5.70 | 2.42 | 3.28 |
George H. W. Bush | 89646 | 5032 | 2154 | 2878 | 5.61 | 2.40 | 3.21 |
Lyndon B. Johnson | 246786 | 13120 | 5058 | 8062 | 5.32 | 2.05 | 3.27 |
Jimmy Carter | 91936 | 4818 | 1821 | 2997 | 5.24 | 1.98 | 3.26 |
Bill Clinton | 145846 | 8311 | 2617 | 5694 | 5.70 | 1.79 | 3.90 |
Harry S. Truman | 31802 | 1418 | 566 | 852 | 4.46 | 1.78 | 2.68 |
Ronald Reagan | 206217 | 9975 | 3296 | 6679 | 4.84 | 1.60 | 3.24 |
Barack Obama | 33672 | 1815 | 523 | 1292 | 5.39 | 1.55 | 3.84 |
John F. Kennedy | 160468 | 7242 | 2335 | 4907 | 4.51 | 1.46 | 3.06 |
Franklin D. Roosevelt | 130024 | 4739 | 1517 | 3222 | 3.64 | 1.17 | 2.48 |
Dwight D. Eisenhower | 17919 | 606 | 177 | 429 | 3.38 | 0.99 | 2.39 |
George W. Bush | 45437 | 2222 | 404 | 1818 | 4.89 | 0.89 | 4.00 |
Herbert Hoover | 10718 | 392 | 89 | 303 | 3.66 | 0.83 | 2.83 |
Compute the overall first-person singular rate across all of these presidents so that it can be compared with Obama's 1.55 in the table above.
round(100.0 * presidents["first_person_singular"].sum() / presidents["word_count"].sum(), 2)
1.76
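The 1.76 figure is a pooled rate: total first-person singular pronouns divided by total words, so presidents with larger corpora weigh more heavily. As a rough sketch of how this differs from a simple mean of the per-president percentages, the snippet below recomputes both from the `word_count` and `first_person_singular` values in the table above. The names `counts`, `pooled_pct`, and `unweighted_pct` are illustrative and do not appear in the original notebook.

```python
# (word_count, first_person_singular) per president, copied from the table above.
counts = {
    "Richard Nixon": (67445, 1684),
    "Gerald Ford": (40301, 975),
    "George H. W. Bush": (89646, 2154),
    "Lyndon B. Johnson": (246786, 5058),
    "Jimmy Carter": (91936, 1821),
    "Bill Clinton": (145846, 2617),
    "Harry S. Truman": (31802, 566),
    "Ronald Reagan": (206217, 3296),
    "Barack Obama": (33672, 523),
    "John F. Kennedy": (160468, 2335),
    "Franklin D. Roosevelt": (130024, 1517),
    "Dwight D. Eisenhower": (17919, 177),
    "George W. Bush": (45437, 404),
    "Herbert Hoover": (10718, 89),
}

total_words = sum(words for words, _ in counts.values())
total_singular = sum(singular for _, singular in counts.values())

# Pooled rate: each president is weighted by the number of words spoken.
pooled_pct = round(100.0 * total_singular / total_words, 2)

# Unweighted mean: each president counts equally, regardless of corpus size.
per_president = [100.0 * singular / words for words, singular in counts.values()]
unweighted_pct = round(sum(per_president) / len(per_president), 2)

print(pooled_pct)      # 1.76
print(unweighted_pct)  # 1.67
```

Obama's 1.55 sits below both figures, so the comparison holds under either definition of "average".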