Exploratory Data Analysis

Introduction

After the data cleaning step, where we put our data into a few standard formats, the next step is to take a look at the data and check that what we're seeing makes sense. Before applying any fancy algorithms, it's always important to explore the data first.

When working with numerical data, some of the exploratory data analysis (EDA) techniques we can use include finding the average of the data set, the distribution of the data, the most common values, etc. The idea is the same when working with text data. We are going to find the more obvious patterns with EDA before identifying the hidden patterns with machine learning (ML) techniques. We are going to look at the following for each comedian (a quick sketch of how the document-term matrix supports each check follows the list):

  1. Most common words - find these and create word clouds
  2. Size of vocabulary - look at the number of unique words and also how quickly someone speaks
  3. Amount of profanity - compare how often each comedian uses common curse words
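
Before diving in, here's a minimal sketch of how the document-term matrix from the cleaning step supports all three of these checks, assuming dtm.pkl (read in below) has comedians as rows and words as columns:

import pandas as pd

# Minimal sketch: word counts for one comedian ('ali' is one of the row labels)
dtm = pd.read_pickle('dtm.pkl')
row = dtm.loc['ali']

# 1. Most common words: the largest counts in that comedian's row
print(row.sort_values(ascending=False).head(10))

# 2. Size of vocabulary: the number of words that appear at least once
print((row > 0).sum())

# 3. Amount of profanity: counts for a hand-picked list of curse words
#    (these columns exist in this corpus; adjust the list for other data)
print(row[['fucking', 'fuck', 'shit']].sum())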

Most Common Words

Analysis

In [1]:
# Read in the document-term matrix
import pandas as pd

data = pd.read_pickle('dtm.pkl')
# Transpose because it's easier to operate across columns than across rows.
# We want to aggregate by comedian, so comedians should be the columns.
data = data.transpose() 
data.head()
Out[1]:
ali anthony bill bo dave hasan jim joe john louis mike ricky
aaaaah 0 0 1 0 0 0 0 0 0 0 0 0
aaaaahhhhhhh 0 0 0 1 0 0 0 0 0 0 0 0
aaaaauuugghhhhhh 0 0 0 1 0 0 0 0 0 0 0 0
aaaahhhhh 0 0 0 1 0 0 0 0 0 0 0 0
aaah 0 0 0 0 1 0 0 0 0 0 0 0
In [2]:
# Find the top 30 words said by each comedian
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False).head(30)
    top_dict[c]= list(zip(top.index, top.values))

top_dict
Out[2]:
{'ali': [('like', 126),
  ('im', 74),
  ('know', 65),
  ('just', 64),
  ('dont', 61),
  ('shit', 34),
  ('thats', 34),
  ('youre', 31),
  ('gonna', 28),
  ('ok', 26),
  ('lot', 24),
  ('wanna', 21),
  ('gotta', 21),
  ('oh', 21),
  ('husband', 20),
  ('got', 19),
  ('right', 19),
  ('time', 19),
  ('cause', 18),
  ('women', 17),
  ('day', 17),
  ('people', 16),
  ('pregnant', 15),
  ('need', 14),
  ('god', 14),
  ('hes', 14),
  ('tell', 13),
  ('yeah', 13),
  ('theyre', 12),
  ('dude', 12)],
 'anthony': [('im', 60),
  ('like', 50),
  ('know', 39),
  ('dont', 38),
  ('joke', 34),
  ('got', 34),
  ('thats', 31),
  ('said', 31),
  ('anthony', 27),
  ('day', 26),
  ('say', 26),
  ('just', 26),
  ('guys', 23),
  ('people', 22),
  ('tell', 19),
  ('youre', 19),
  ('right', 18),
  ('grandma', 18),
  ('time', 17),
  ('think', 17),
  ('thing', 17),
  ('yeah', 16),
  ('jokes', 16),
  ('school', 16),
  ('good', 16),
  ('did', 16),
  ('gonna', 15),
  ('okay', 15),
  ('ive', 15),
  ('baby', 15)],
 'bill': [('like', 200),
  ('just', 149),
  ('right', 131),
  ('im', 107),
  ('know', 99),
  ('dont', 95),
  ('gonna', 77),
  ('got', 72),
  ('fucking', 70),
  ('yeah', 67),
  ('shit', 63),
  ('youre', 59),
  ('thats', 56),
  ('dude', 40),
  ('think', 36),
  ('want', 36),
  ('fuck', 36),
  ('people', 32),
  ('did', 31),
  ('hes', 31),
  ('guy', 30),
  ('didnt', 29),
  ('make', 28),
  ('come', 27),
  ('thing', 26),
  ('going', 26),
  ('theyre', 25),
  ('let', 24),
  ('theres', 24),
  ('little', 23)],
 'bo': [('know', 50),
  ('like', 44),
  ('think', 37),
  ('im', 37),
  ('love', 37),
  ('bo', 35),
  ('just', 35),
  ('stuff', 33),
  ('repeat', 31),
  ('dont', 29),
  ('yeah', 27),
  ('want', 25),
  ('right', 24),
  ('cos', 23),
  ('people', 22),
  ('said', 22),
  ('eye', 22),
  ('fucking', 22),
  ('contact', 21),
  ('um', 21),
  ('prolonged', 21),
  ('youre', 19),
  ('thats', 19),
  ('time', 18),
  ('good', 17),
  ('little', 17),
  ('sluts', 17),
  ('man', 17),
  ('oh', 15),
  ('fuck', 15)],
 'dave': [('like', 103),
  ('know', 79),
  ('said', 63),
  ('just', 61),
  ('im', 47),
  ('shit', 45),
  ('people', 43),
  ('didnt', 39),
  ('ahah', 38),
  ('dont', 38),
  ('time', 36),
  ('thats', 33),
  ('fuck', 33),
  ('fucking', 32),
  ('black', 31),
  ('man', 30),
  ('good', 27),
  ('got', 27),
  ('right', 22),
  ('gonna', 21),
  ('gay', 20),
  ('lot', 20),
  ('nigga', 20),
  ('hes', 19),
  ('did', 19),
  ('yeah', 18),
  ('oj', 18),
  ('oh', 18),
  ('come', 17),
  ('guys', 16)],
 'hasan': [('like', 220),
  ('im', 136),
  ('know', 70),
  ('dont', 64),
  ('dad', 59),
  ('youre', 51),
  ('just', 46),
  ('going', 41),
  ('thats', 39),
  ('want', 38),
  ('got', 35),
  ('love', 34),
  ('shes', 32),
  ('hasan', 31),
  ('say', 30),
  ('right', 30),
  ('time', 27),
  ('life', 25),
  ('mom', 25),
  ('people', 25),
  ('hey', 24),
  ('oh', 24),
  ('look', 22),
  ('did', 22),
  ('brown', 21),
  ('parents', 20),
  ('guys', 20),
  ('white', 20),
  ('girl', 19),
  ('whats', 19)],
 'jim': [('like', 108),
  ('im', 101),
  ('dont', 90),
  ('right', 81),
  ('fucking', 78),
  ('know', 63),
  ('just', 63),
  ('went', 63),
  ('youre', 48),
  ('people', 44),
  ('thats', 42),
  ('day', 40),
  ('oh', 40),
  ('think', 39),
  ('going', 39),
  ('fuck', 37),
  ('thing', 34),
  ('goes', 34),
  ('said', 32),
  ('guns', 30),
  ('theyre', 29),
  ('good', 28),
  ('ive', 27),
  ('theres', 26),
  ('women', 26),
  ('cause', 26),
  ('got', 26),
  ('want', 25),
  ('really', 23),
  ('hes', 23)],
 'joe': [('like', 143),
  ('people', 100),
  ('just', 87),
  ('dont', 79),
  ('fucking', 69),
  ('im', 69),
  ('fuck', 66),
  ('thats', 62),
  ('gonna', 52),
  ('theyre', 49),
  ('know', 46),
  ('youre', 42),
  ('think', 41),
  ('shit', 40),
  ('got', 36),
  ('theres', 34),
  ('right', 31),
  ('man', 30),
  ('house', 27),
  ('oh', 25),
  ('kids', 25),
  ('white', 24),
  ('cause', 24),
  ('say', 23),
  ('real', 22),
  ('life', 21),
  ('time', 20),
  ('really', 20),
  ('gotta', 20),
  ('dude', 20)],
 'john': [('like', 190),
  ('know', 66),
  ('just', 53),
  ('dont', 52),
  ('said', 39),
  ('clinton', 34),
  ('im', 33),
  ('thats', 31),
  ('right', 29),
  ('youre', 28),
  ('little', 26),
  ('hey', 25),
  ('got', 24),
  ('time', 24),
  ('people', 22),
  ('say', 22),
  ('cause', 22),
  ('mom', 22),
  ('think', 21),
  ('way', 21),
  ('day', 21),
  ('old', 21),
  ('oh', 21),
  ('gonna', 21),
  ('cow', 20),
  ('went', 18),
  ('wife', 18),
  ('really', 18),
  ('dad', 17),
  ('real', 17)],
 'louis': [('like', 110),
  ('just', 97),
  ('know', 70),
  ('dont', 53),
  ('thats', 51),
  ('im', 50),
  ('youre', 50),
  ('life', 41),
  ('people', 40),
  ('thing', 31),
  ('gonna', 29),
  ('hes', 29),
  ('cause', 28),
  ('theres', 28),
  ('shit', 25),
  ('time', 22),
  ('good', 22),
  ('tit', 22),
  ('right', 21),
  ('think', 21),
  ('theyre', 21),
  ('really', 20),
  ('course', 19),
  ('kids', 18),
  ('murder', 18),
  ('guy', 18),
  ('ok', 17),
  ('mean', 15),
  ('fuck', 15),
  ('didnt', 15)],
 'mike': [('like', 234),
  ('im', 142),
  ('know', 105),
  ('said', 88),
  ('just', 83),
  ('dont', 76),
  ('think', 51),
  ('thats', 51),
  ('says', 46),
  ('cause', 35),
  ('right', 34),
  ('jenny', 33),
  ('goes', 32),
  ('id', 30),
  ('really', 30),
  ('point', 28),
  ('youre', 28),
  ('mean', 28),
  ('gonna', 27),
  ('got', 25),
  ('yeah', 25),
  ('people', 23),
  ('kind', 23),
  ('uh', 22),
  ('say', 21),
  ('feel', 20),
  ('want', 19),
  ('didnt', 19),
  ('going', 19),
  ('time', 19)],
 'ricky': [('right', 110),
  ('like', 80),
  ('just', 66),
  ('im', 66),
  ('dont', 56),
  ('know', 55),
  ('said', 51),
  ('yeah', 49),
  ('fucking', 47),
  ('got', 44),
  ('say', 43),
  ('youre', 41),
  ('went', 40),
  ('id', 39),
  ('thats', 38),
  ('people', 34),
  ('didnt', 33),
  ('little', 32),
  ('joke', 31),
  ('theyre', 29),
  ('hes', 29),
  ('ive', 28),
  ('thing', 26),
  ('going', 26),
  ('years', 24),
  ('day', 23),
  ('saying', 22),
  ('theres', 22),
  ('ill', 21),
  ('big', 21)]}
In [3]:
# Print the top 14 words said by each comedian
for comedian, top_words in top_dict.items():
    print(comedian)
    print(', '.join([word for word, count in top_words[0:14]]))
    print('---')
ali
like, im, know, just, dont, shit, thats, youre, gonna, ok, lot, wanna, gotta, oh
---
anthony
im, like, know, dont, joke, got, thats, said, anthony, day, say, just, guys, people
---
bill
like, just, right, im, know, dont, gonna, got, fucking, yeah, shit, youre, thats, dude
---
bo
know, like, think, im, love, bo, just, stuff, repeat, dont, yeah, want, right, cos
---
dave
like, know, said, just, im, shit, people, didnt, ahah, dont, time, thats, fuck, fucking
---
hasan
like, im, know, dont, dad, youre, just, going, thats, want, got, love, shes, hasan
---
jim
like, im, dont, right, fucking, know, just, went, youre, people, thats, day, oh, think
---
joe
like, people, just, dont, fucking, im, fuck, thats, gonna, theyre, know, youre, think, shit
---
john
like, know, just, dont, said, clinton, im, thats, right, youre, little, hey, got, time
---
louis
like, just, know, dont, thats, im, youre, life, people, thing, gonna, hes, cause, theres
---
mike
like, im, know, said, just, dont, think, thats, says, cause, right, jenny, goes, id
---
ricky
right, like, just, im, dont, know, said, yeah, fucking, got, say, youre, went, id
---

NOTE: At this point, we could go on and create word clouds. However, by looking at these top words, you can see that some of them have very little meaning and could be added to a stop words list, so let's do just that.

In [5]:
# Look at the most common top words --> add them to the stop word list
from collections import Counter

# Let's first create a list that just has each comedian's top 30 words (even if repeated)
words = []
for comedian in data.columns:
    top = [word for (word, count) in top_dict[comedian]]
    for t in top:
        words.append(t)
        
words
Out[5]:
['like',
 'im',
 'know',
 'just',
 'dont',
 'shit',
 'thats',
 'youre',
 'gonna',
 'ok',
 'lot',
 'wanna',
 'gotta',
 'oh',
 'husband',
 'got',
 'right',
 'time',
 'cause',
 'women',
 'day',
 'people',
 'pregnant',
 'need',
 'god',
 'hes',
 'tell',
 'yeah',
 'theyre',
 'dude',
 'im',
 'like',
 'know',
 'dont',
 'joke',
 'got',
 'thats',
 'said',
 'anthony',
 'day',
 'say',
 'just',
 'guys',
 'people',
 'tell',
 'youre',
 'right',
 'grandma',
 'time',
 'think',
 'thing',
 'yeah',
 'jokes',
 'school',
 'good',
 'did',
 'gonna',
 'okay',
 'ive',
 'baby',
 'like',
 'just',
 'right',
 'im',
 'know',
 'dont',
 'gonna',
 'got',
 'fucking',
 'yeah',
 'shit',
 'youre',
 'thats',
 'dude',
 'think',
 'want',
 'fuck',
 'people',
 'did',
 'hes',
 'guy',
 'didnt',
 'make',
 'come',
 'thing',
 'going',
 'theyre',
 'let',
 'theres',
 'little',
 'know',
 'like',
 'think',
 'im',
 'love',
 'bo',
 'just',
 'stuff',
 'repeat',
 'dont',
 'yeah',
 'want',
 'right',
 'cos',
 'people',
 'said',
 'eye',
 'fucking',
 'contact',
 'um',
 'prolonged',
 'youre',
 'thats',
 'time',
 'good',
 'little',
 'sluts',
 'man',
 'oh',
 'fuck',
 'like',
 'know',
 'said',
 'just',
 'im',
 'shit',
 'people',
 'didnt',
 'ahah',
 'dont',
 'time',
 'thats',
 'fuck',
 'fucking',
 'black',
 'man',
 'good',
 'got',
 'right',
 'gonna',
 'gay',
 'lot',
 'nigga',
 'hes',
 'did',
 'yeah',
 'oj',
 'oh',
 'come',
 'guys',
 'like',
 'im',
 'know',
 'dont',
 'dad',
 'youre',
 'just',
 'going',
 'thats',
 'want',
 'got',
 'love',
 'shes',
 'hasan',
 'say',
 'right',
 'time',
 'life',
 'mom',
 'people',
 'hey',
 'oh',
 'look',
 'did',
 'brown',
 'parents',
 'guys',
 'white',
 'girl',
 'whats',
 'like',
 'im',
 'dont',
 'right',
 'fucking',
 'know',
 'just',
 'went',
 'youre',
 'people',
 'thats',
 'day',
 'oh',
 'think',
 'going',
 'fuck',
 'thing',
 'goes',
 'said',
 'guns',
 'theyre',
 'good',
 'ive',
 'theres',
 'women',
 'cause',
 'got',
 'want',
 'really',
 'hes',
 'like',
 'people',
 'just',
 'dont',
 'fucking',
 'im',
 'fuck',
 'thats',
 'gonna',
 'theyre',
 'know',
 'youre',
 'think',
 'shit',
 'got',
 'theres',
 'right',
 'man',
 'house',
 'oh',
 'kids',
 'white',
 'cause',
 'say',
 'real',
 'life',
 'time',
 'really',
 'gotta',
 'dude',
 'like',
 'know',
 'just',
 'dont',
 'said',
 'clinton',
 'im',
 'thats',
 'right',
 'youre',
 'little',
 'hey',
 'got',
 'time',
 'people',
 'say',
 'cause',
 'mom',
 'think',
 'way',
 'day',
 'old',
 'oh',
 'gonna',
 'cow',
 'went',
 'wife',
 'really',
 'dad',
 'real',
 'like',
 'just',
 'know',
 'dont',
 'thats',
 'im',
 'youre',
 'life',
 'people',
 'thing',
 'gonna',
 'hes',
 'cause',
 'theres',
 'shit',
 'time',
 'good',
 'tit',
 'right',
 'think',
 'theyre',
 'really',
 'course',
 'kids',
 'murder',
 'guy',
 'ok',
 'mean',
 'fuck',
 'didnt',
 'like',
 'im',
 'know',
 'said',
 'just',
 'dont',
 'think',
 'thats',
 'says',
 'cause',
 'right',
 'jenny',
 'goes',
 'id',
 'really',
 'point',
 'youre',
 'mean',
 'gonna',
 'got',
 'yeah',
 'people',
 'kind',
 'uh',
 'say',
 'feel',
 'want',
 'didnt',
 'going',
 'time',
 'right',
 'like',
 'just',
 'im',
 'dont',
 'know',
 'said',
 'yeah',
 'fucking',
 'got',
 'say',
 'youre',
 'went',
 'id',
 'thats',
 'people',
 'didnt',
 'little',
 'joke',
 'theyre',
 'hes',
 'ive',
 'thing',
 'going',
 'years',
 'day',
 'saying',
 'theres',
 'ill',
 'big']
In [6]:
# Aggregate this list and identify the most common words along with how many comedians' routines they occur in
Counter(words).most_common()
Out[6]:
[('like', 12),
 ('im', 12),
 ('know', 12),
 ('just', 12),
 ('dont', 12),
 ('thats', 12),
 ('right', 12),
 ('people', 12),
 ('youre', 11),
 ('got', 10),
 ('time', 9),
 ('gonna', 8),
 ('think', 8),
 ('oh', 7),
 ('yeah', 7),
 ('said', 7),
 ('cause', 6),
 ('hes', 6),
 ('theyre', 6),
 ('say', 6),
 ('fucking', 6),
 ('fuck', 6),
 ('shit', 5),
 ('day', 5),
 ('thing', 5),
 ('good', 5),
 ('want', 5),
 ('didnt', 5),
 ('going', 5),
 ('theres', 5),
 ('really', 5),
 ('did', 4),
 ('little', 4),
 ('dude', 3),
 ('guys', 3),
 ('ive', 3),
 ('man', 3),
 ('life', 3),
 ('went', 3),
 ('ok', 2),
 ('lot', 2),
 ('gotta', 2),
 ('women', 2),
 ('tell', 2),
 ('joke', 2),
 ('guy', 2),
 ('come', 2),
 ('love', 2),
 ('dad', 2),
 ('mom', 2),
 ('hey', 2),
 ('white', 2),
 ('goes', 2),
 ('kids', 2),
 ('real', 2),
 ('mean', 2),
 ('id', 2),
 ('wanna', 1),
 ('husband', 1),
 ('pregnant', 1),
 ('need', 1),
 ('god', 1),
 ('anthony', 1),
 ('grandma', 1),
 ('jokes', 1),
 ('school', 1),
 ('okay', 1),
 ('baby', 1),
 ('make', 1),
 ('let', 1),
 ('bo', 1),
 ('stuff', 1),
 ('repeat', 1),
 ('cos', 1),
 ('eye', 1),
 ('contact', 1),
 ('um', 1),
 ('prolonged', 1),
 ('sluts', 1),
 ('ahah', 1),
 ('black', 1),
 ('gay', 1),
 ('nigga', 1),
 ('oj', 1),
 ('shes', 1),
 ('hasan', 1),
 ('look', 1),
 ('brown', 1),
 ('parents', 1),
 ('girl', 1),
 ('whats', 1),
 ('guns', 1),
 ('house', 1),
 ('clinton', 1),
 ('way', 1),
 ('old', 1),
 ('cow', 1),
 ('wife', 1),
 ('tit', 1),
 ('course', 1),
 ('murder', 1),
 ('says', 1),
 ('jenny', 1),
 ('point', 1),
 ('kind', 1),
 ('uh', 1),
 ('feel', 1),
 ('years', 1),
 ('saying', 1),
 ('ill', 1),
 ('big', 1)]
In [7]:
# If more than half of the comedians have it as a top word, exclude it from the list
add_stop_words = [word for word, count in Counter(words).most_common() if count > 6]
add_stop_words
Out[7]:
['like',
 'im',
 'know',
 'just',
 'dont',
 'thats',
 'right',
 'people',
 'youre',
 'got',
 'time',
 'gonna',
 'think',
 'oh',
 'yeah',
 'said']
In [11]:
# Let's update our document-term matrix with the new list of stop words
from sklearn.feature_extraction import text # Contains the stop word list
from sklearn.feature_extraction.text import CountVectorizer

# Read in cleaned data from corpus
data_clean = pd.read_pickle('data_clean.pkl')

# Add new stop words
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate document-term matrix which excludes our additional stop words
cv = CountVectorizer(stop_words=stop_words)
data_cv = cv.fit_transform(data_clean.transcript)
data_stop = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_stop.index = data_clean.index

# Pickle it for later use
import pickle
pickle.dump(cv, open("cv_stop.pkl", "wb"))
data_stop.to_pickle("dtm_stop.pkl")
In [9]:
# Let's make some word clouds!
# Terminal / Anaconda Prompt: conda install -c conda-forge wordcloud
from wordcloud import WordCloud

wc = WordCloud(stopwords=stop_words, background_color="white", colormap="Dark2",
               max_font_size=150, random_state=42)
In [10]:
# Reset the output dimensions
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [16, 6]

full_names = ['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Bo Burnham', 'Dave Chappelle', 'Hasan Minhaj',
              'Jim Jefferies', 'Joe Rogan', 'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Ricky Gervais']

# Create subplots for each comedian
for index, comedian in enumerate(data.columns):
    wc.generate(data_clean.transcript[comedian])
    
    plt.subplot(3, 4, index+1)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(full_names[index])
    
plt.show()

Findings

  • Ali Wong says the s-word a lot and talks about her husband. I guess that's funny to me.
  • A lot of people use the F-word. Let's dig into that later.

Number of Words

Analysis

In [25]:
# Find the number of unique words that each comedian uses

# Identify the non-zero items in the document-term matrix, meaning that the word occurs at least once
unique_list = []
for comedian in data.columns:
    uniques = data[comedian].to_numpy().nonzero()[0].size  # Series.nonzero() is deprecated; use the underlying NumPy array
    unique_list.append(uniques)

# Create a new dataframe that contains this unique word count
data_words = pd.DataFrame(list(zip(full_names, unique_list)), columns=['comedian', 'unique_words'])
data_unique_sort = data_words.sort_values(by='unique_words')
data_unique_sort
Out[25]:
comedian unique_words
1 Anthony Jeselnik 984
9 Louis C.K. 1098
3 Bo Burnham 1272
6 Jim Jefferies 1313
0 Ali Wong 1341
8 John Mulaney 1391
4 Dave Chappelle 1404
7 Joe Rogan 1435
10 Mike Birbiglia 1494
5 Hasan Minhaj 1559
2 Bill Burr 1633
11 Ricky Gervais 1633
In [29]:
# Calculate the words per minute of each comedian

# Find the total number of words that a comedian uses
total_list = []
for comedian in data.columns:
    totals = sum(data[comedian])
    total_list.append(totals)

# Comedy special run times from IMDB (in minutes)
run_times = [60, 59, 80, 60, 67, 73, 77, 63, 62, 58, 76, 79]

# Let's add some columns to our dataframe
data_words['total_words'] = total_list
data_words['run_times'] = run_times
data_words['words_per_minute'] = data_words['total_words'] / data_words['run_times']

# Sort the dataframe by words per minute to see who talks the slowest and fastest
data_wpm_sort = data_words.sort_values(by='words_per_minute')
data_wpm_sort
Out[29]:
comedian unique_words total_words run_times words_per_minute
1 Anthony Jeselnik 984 2905 59 49.237288
3 Bo Burnham 1272 3165 60 52.750000
0 Ali Wong 1341 3283 60 54.716667
9 Louis C.K. 1098 3332 58 57.448276
4 Dave Chappelle 1404 4094 67 61.104478
6 Jim Jefferies 1313 4764 77 61.870130
10 Mike Birbiglia 1494 4741 76 62.381579
11 Ricky Gervais 1633 4972 79 62.936709
8 John Mulaney 1391 4001 62 64.532258
5 Hasan Minhaj 1559 4777 73 65.438356
2 Bill Burr 1633 5535 80 69.187500
7 Joe Rogan 1435 4579 63 72.682540
In [34]:
# Let's plot our findings
import numpy as np

y_pos = np.arange(len(data_words)) # One evenly spaced bar position per comedian
plt.subplot(1, 2, 1) # plt.subplot (nrows, ncols, index)
plt.barh(y_pos, data_unique_sort.unique_words, align='center')
plt.yticks(y_pos, data_unique_sort.comedian)
plt.title('Number of Unique Words', fontsize=20)

plt.subplot(1, 2, 2)
plt.barh(y_pos, data_wpm_sort.words_per_minute, align='center')
plt.yticks(y_pos, data_wpm_sort.comedian)
plt.title('Number of Words Per Minute', fontsize=20)

plt.tight_layout()
plt.show()

Findings

  • Vocabulary
    • Ricky Gervais (British comedy) and Bill Burr (podcast host) use the largest vocabularies in their routines
    • Louis C.K. (self-deprecating comedy) and Anthony Jeselnik (dark humor) have the smallest vocabularies
  • Talking Speed
    • Joe Rogan (blue comedy) and Bill Burr (podcast host) talk the fastest
    • Bo Burnham (musical comedy) and Anthony Jeselnik (dark humor) talk the slowest

Ali Wong is somewhere in the middle in both cases. Nothing too interesting here.

Amount of Profanity

Analysis

In [35]:
# Earlier I said we'd revisit profanity. Let's take a look at the most common words again.
Counter(words).most_common()
Out[35]:
[('like', 12),
 ('im', 12),
 ('know', 12),
 ('just', 12),
 ('dont', 12),
 ('thats', 12),
 ('right', 12),
 ('people', 12),
 ('youre', 11),
 ('got', 10),
 ('time', 9),
 ('gonna', 8),
 ('think', 8),
 ('oh', 7),
 ('yeah', 7),
 ('said', 7),
 ('cause', 6),
 ('hes', 6),
 ('theyre', 6),
 ('say', 6),
 ('fucking', 6),
 ('fuck', 6),
 ('shit', 5),
 ('day', 5),
 ('thing', 5),
 ('good', 5),
 ('want', 5),
 ('didnt', 5),
 ('going', 5),
 ('theres', 5),
 ('really', 5),
 ('did', 4),
 ('little', 4),
 ('dude', 3),
 ('guys', 3),
 ('ive', 3),
 ('man', 3),
 ('life', 3),
 ('went', 3),
 ('ok', 2),
 ('lot', 2),
 ('gotta', 2),
 ('women', 2),
 ('tell', 2),
 ('joke', 2),
 ('guy', 2),
 ('come', 2),
 ('love', 2),
 ('dad', 2),
 ('mom', 2),
 ('hey', 2),
 ('white', 2),
 ('goes', 2),
 ('kids', 2),
 ('real', 2),
 ('mean', 2),
 ('id', 2),
 ('wanna', 1),
 ('husband', 1),
 ('pregnant', 1),
 ('need', 1),
 ('god', 1),
 ('anthony', 1),
 ('grandma', 1),
 ('jokes', 1),
 ('school', 1),
 ('okay', 1),
 ('baby', 1),
 ('make', 1),
 ('let', 1),
 ('bo', 1),
 ('stuff', 1),
 ('repeat', 1),
 ('cos', 1),
 ('eye', 1),
 ('contact', 1),
 ('um', 1),
 ('prolonged', 1),
 ('sluts', 1),
 ('ahah', 1),
 ('black', 1),
 ('gay', 1),
 ('nigga', 1),
 ('oj', 1),
 ('shes', 1),
 ('hasan', 1),
 ('look', 1),
 ('brown', 1),
 ('parents', 1),
 ('girl', 1),
 ('whats', 1),
 ('guns', 1),
 ('house', 1),
 ('clinton', 1),
 ('way', 1),
 ('old', 1),
 ('cow', 1),
 ('wife', 1),
 ('tit', 1),
 ('course', 1),
 ('murder', 1),
 ('says', 1),
 ('jenny', 1),
 ('point', 1),
 ('kind', 1),
 ('uh', 1),
 ('feel', 1),
 ('years', 1),
 ('saying', 1),
 ('ill', 1),
 ('big', 1)]
In [43]:
# Let's isolate just profanity
data_bad_words = data.transpose()[['fucking', 'fuck', 'shit']]
data_profanity = pd.concat([data_bad_words.fucking + data_bad_words.fuck, data_bad_words.shit], axis=1) # Manually combine fucking and fuck as the same word
data_profanity.columns = ['f_word', 's_word']
data_profanity
Out[43]:
f_word s_word
ali 16 34
anthony 15 9
bill 106 63
bo 37 6
dave 65 45
hasan 24 15
jim 115 20
joe 135 40
john 4 6
louis 21 25
mike 0 0
ricky 60 6
In [48]:
data_profanity.index
Out[48]:
Index(['ali', 'anthony', 'bill', 'bo', 'dave', 'hasan', 'jim', 'joe', 'john',
       'louis', 'mike', 'ricky'],
      dtype='object')
In [52]:
# Let's create a scatter plot of our findings
plt.rcParams['figure.figsize'] = [10, 8] # Set width to 10 inches and height to 8 inches

for i, comedian in enumerate(data_profanity.index):
    x = data_profanity.f_word.loc[comedian]
    y = data_profanity.s_word.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+1.5, y+0.5, full_names[i], fontsize=10) # Offset the label to avoid overlap of names and dot
    plt.xlim(-5, 155)
    
plt.title('Number of Bad Words Used in Routine', fontsize=20)
plt.xlabel('Number of F Bombs', fontsize=15)
plt.ylabel('Number of S Words', fontsize=15)

plt.show()