Analysis of Pronoun Usage in Presidential Addresses

This notebook looks at how presidents have used first-person singular pronouns (I, me, my, mine, myself) versus first-person plural pronouns (we, our, ours, ourselves, us) in their speeches.

In [20]:

import pandas as pd
import json
import nltk

Load in Data

The data used in this notebook comes from Vocativ's collection of presidential addresses, which can be found here: https://github.com/Vocativ-data/presidents_readability

In [2]:

with open("../../vocativ_president_data/The original speeches.json") as f:
    objects = json.load(f)["objects"]

In [3]:

speeches_df = pd.DataFrame(objects)

In [4]:

speeches_df["word_count"] = speeches_df["Text"].apply(lambda x: len(x.split()))

In [5]:

speeches_df["tokens"] = speeches_df["Text"].apply(nltk.word_tokenize)

Find and Count All First-Person Singular Pronouns

In [6]:

speeches_df["i"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "i"]), axis=1)
speeches_df["me"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "me"]), axis=1)
speeches_df["my"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "my"]), axis=1)
speeches_df["mine"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "mine"]), axis=1)
speeches_df["myself"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "myself"]), axis=1)

In [7]:

speeches_df["first_person_singular"] = speeches_df.apply(lambda x: x["i"] + x["me"] + x["my"] +\
                                                                x["mine"] + x["myself"], axis=1)

Find and Count All First-Person Plural Pronouns

In [8]:

speeches_df["we"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "we"]), axis=1)
speeches_df["our"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "our"]), axis=1)
speeches_df["ours"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "ours"]), axis=1)
speeches_df["ourselves"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "ourselves"]), axis=1)
speeches_df["us"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "us"]), axis=1)

In [9]:

speeches_df["first_person_plural"] = speeches_df.apply(lambda x: x["we"] + x["our"] + x["ours"] + x["ourselves"] + x["us"], axis=1)

In [10]:

speeches_df["first_person"] = speeches_df.apply(lambda x: x["first_person_singular"] + x["first_person_plural"], axis=1)
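
The per-pronoun cells above each rescan every token list. As a rough sketch of a more compact alternative (not the notebook's own method), a speech's tokens can be tallied once with `collections.Counter` and the pronoun totals read off the tally. The helper name `count_pronouns` and the sample tokens are hypothetical; the token list is hand-written to mimic tokenizer output rather than produced by `nltk`.

```python
from collections import Counter

FIRST_PERSON_SINGULAR = ["i", "me", "my", "mine", "myself"]
FIRST_PERSON_PLURAL = ["we", "our", "ours", "ourselves", "us"]


def count_pronouns(tokens):
    """Tally singular, plural, and combined first-person pronouns in one pass."""
    # Lowercase once so that "We" and "we" land in the same bucket.
    tally = Counter(t.lower() for t in tokens)
    singular = sum(tally[p] for p in FIRST_PERSON_SINGULAR)
    plural = sum(tally[p] for p in FIRST_PERSON_PLURAL)
    return {
        "first_person_singular": singular,
        "first_person_plural": plural,
        "first_person": singular + plural,
    }


# Hypothetical, hand-tokenized sentence (not taken from the dataset).
sample_tokens = ["We", "choose", "our", "path", ",", "and", "I", "ask", "you",
                 "to", "walk", "it", "with", "me", "."]
counts = count_pronouns(sample_tokens)
print(counts)
```

On a dataframe, something along the lines of `speeches_df["tokens"].apply(count_pronouns).apply(pd.Series)` should expand these dictionaries into one column per total.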

Segment Off Necessary Data Points

In [11]:

speech_analysis = speeches_df[["word_count", "tokens", "President", "first_person", 
                               "first_person_singular", "first_person_plural"]]

We restrict the analysis to modern presidents (Herbert Hoover onward, i.e. since 1929) because those are the presidents covered by our news conference analysis. The list below spells each name exactly as it appears in the President column of the speeches dataframe.

In [12]:

news_conf_presidents = ["Richard Nixon", "Gerald Ford", "George H. W. Bush", "Lyndon B. Johnson", "Jimmy Carter", 
                        "Bill Clinton", "Harry S. Truman", "Ronald Reagan", "Barack Obama", "John F. Kennedy", 
                        "Franklin D. Roosevelt", "Dwight D. Eisenhower", "Herbert Hoover", "George W. Bush"]

In [13]:

modern_presidents = speech_analysis[speech_analysis["President"].isin(news_conf_presidents)]

In [14]:

presidents = modern_presidents.groupby("President").sum(numeric_only=True)
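
To illustrate the filter-then-aggregate step on a small scale, the sketch below builds a toy dataframe with made-up counts (none of these numbers come from the speeches data), keeps only the presidents on an allow-list with `isin`, and sums the numeric columns per president. Passing `numeric_only=True` is an assumption about the pandas version in use: it asks `sum` to skip non-numeric columns such as the token lists.

```python
import pandas as pd

# Toy data with made-up counts; "tokens" stands in for the list-valued column.
toy = pd.DataFrame({
    "President": ["Barack Obama", "Barack Obama", "Abraham Lincoln", "Richard Nixon"],
    "word_count": [100, 300, 250, 200],
    "first_person_singular": [2, 4, 5, 6],
    "tokens": [["I", "am"], ["We", "are"], ["Four", "score"], ["My", "fellow"]],
})

allowed = ["Barack Obama", "Richard Nixon"]

# Keep only rows whose President value appears in the allow-list.
modern = toy[toy["President"].isin(allowed)]

# One row per president; numeric_only=True leaves the token lists out of the sum.
totals = modern.groupby("President").sum(numeric_only=True)
print(totals)
```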

Analyze Each President's Total Corpus of Speeches

In [15]:

presidents["pct_first"] = presidents.apply(lambda x: round(100.0 * x["first_person"] / x["word_count"], 2), axis=1)

In [16]:

presidents["pct_first_singular"] = presidents.apply(lambda x: round(100.0 * x["first_person_singular"] / x["word_count"], 2), axis=1)

In [17]:

presidents["pct_first_plural"] = presidents.apply(lambda x: round(100.0 * x["first_person_plural"] / x["word_count"], 2), axis=1)
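
The three percentage cells apply a Python lambda row by row. An equivalent column-wise form is sketched below on two rows whose word and pronoun counts are copied from the notebook's output for Nixon and Obama. The helper `pct_of_words` is a hypothetical name, and rounding with `Series.round` may differ from Python's built-in `round` in rare tie cases.

```python
import pandas as pd

# Word and singular-pronoun counts copied from the notebook's output.
presidents = pd.DataFrame(
    {"word_count": [67445, 33672], "first_person_singular": [1684, 523]},
    index=pd.Index(["Richard Nixon", "Barack Obama"], name="President"),
)


def pct_of_words(frame, column):
    """Share of all words accounted for by `column`, as a percentage with 2 decimals."""
    return (100.0 * frame[column] / frame["word_count"]).round(2)


presidents["pct_first_singular"] = pct_of_words(presidents, "first_person_singular")
print(presidents)
```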

In [18]:

presidents.sort_values("pct_first_singular", ascending=False)

Out[18]:

President	word_count	first_person	first_person_singular	first_person_plural	pct_first	pct_first_singular	pct_first_plural
Richard Nixon	67445	3627	1684	1943	5.38	2.50	2.88
Gerald Ford	40301	2298	975	1323	5.70	2.42	3.28
George H. W. Bush	89646	5032	2154	2878	5.61	2.40	3.21
Lyndon B. Johnson	246786	13120	5058	8062	5.32	2.05	3.27
Jimmy Carter	91936	4818	1821	2997	5.24	1.98	3.26
Bill Clinton	145846	8311	2617	5694	5.70	1.79	3.90
Harry S. Truman	31802	1418	566	852	4.46	1.78	2.68
Ronald Reagan	206217	9975	3296	6679	4.84	1.60	3.24
Barack Obama	33672	1815	523	1292	5.39	1.55	3.84
John F. Kennedy	160468	7242	2335	4907	4.51	1.46	3.06
Franklin D. Roosevelt	130024	4739	1517	3222	3.64	1.17	2.48
Dwight D. Eisenhower	17919	606	177	429	3.38	0.99	2.39
George W. Bush	45437	2222	404	1818	4.89	0.89	4.00
Herbert Hoover	10718	392	89	303	3.66	0.83	2.83

Compute the overall first-person singular rate across all of these presidents' speeches (total singular pronouns divided by total words) so that it can be compared with Obama's 1.55 in the table above.

In [19]:

round(100.0 * presidents["first_person_singular"].sum() / presidents["word_count"].sum(), 2)

Out[19]:

1.76
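
The figure above is a pooled rate: total singular pronouns divided by total words, so presidents with larger corpora (Johnson, Reagan) carry more weight. As a hedged illustration of how much that choice matters, the sketch below recomputes it from the `word_count` and `first_person_singular` columns of the table and compares it with the unweighted mean of the per-president rates. The variable names are illustrative only.

```python
# (word_count, first_person_singular) pairs copied from the table above.
counts = {
    "Richard Nixon": (67445, 1684),
    "Gerald Ford": (40301, 975),
    "George H. W. Bush": (89646, 2154),
    "Lyndon B. Johnson": (246786, 5058),
    "Jimmy Carter": (91936, 1821),
    "Bill Clinton": (145846, 2617),
    "Harry S. Truman": (31802, 566),
    "Ronald Reagan": (206217, 3296),
    "Barack Obama": (33672, 523),
    "John F. Kennedy": (160468, 2335),
    "Franklin D. Roosevelt": (130024, 1517),
    "Dwight D. Eisenhower": (17919, 177),
    "George W. Bush": (45437, 404),
    "Herbert Hoover": (10718, 89),
}

total_words = sum(words for words, _ in counts.values())
total_singular = sum(singular for _, singular in counts.values())

# Pooled rate: every word counts equally, so long corpora dominate.
pooled = round(100.0 * total_singular / total_words, 2)

# Unweighted mean: every president counts equally, whatever the corpus size.
rates = [100.0 * singular / words for words, singular in counts.values()]
unweighted = round(sum(rates) / len(rates), 2)

print(pooled, unweighted)
```

The pooled value reproduces the 1.76 above, and the unweighted mean comes out somewhat lower. Obama's 1.55 sits below both, so the comparison does not hinge on which average is used.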