So far, all of the analysis we've done has been pretty generic: looking at counts, creating scatter plots, and so on. These techniques could be applied to numeric data as well.
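When it comes to text data, there are a few popular techniques that we'll be going through in the next few notebooks, starting with sentiment analysis. A few key points to remember with sentiment analysis:
Polarity: how positive or negative the text is. TextBlob reports it on a scale from -1 (very negative) to +1 (very positive).
Subjectivity: how opinionated the text is. TextBlob reports it on a scale from 0 (fact) to 1 (very much an opinion).
For more detail, TextBlob's source code shows how its sentiment function is coded up.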
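To make the idea of word-level sentiment labels concrete, here is a minimal sketch of a lexicon-based scorer. It is not TextBlob's implementation: the `TOY_LEXICON` values and the `toy_sentiment` function are hypothetical and invented purely for illustration. It only shows the general idea of averaging per-word labels into one polarity score and one subjectivity score.

```python
# Toy lexicon-based sentiment scorer.
# ASSUMPTION: the words and scores below are invented for illustration;
# they are NOT TextBlob's actual labels or algorithm.
TOY_LEXICON = {
    # word: (polarity in [-1, 1], subjectivity in [0, 1])
    "great": (0.8, 0.75),
    "terrible": (-1.0, 1.0),
    "funny": (0.25, 1.0),
    "boring": (-1.0, 1.0),
}

def toy_sentiment(text):
    """Average the polarity and subjectivity of the words found in the lexicon."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scores:
        return 0.0, 0.0  # no labeled words: treat as neutral and objective
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity

print(toy_sentiment("The show was great and funny."))  # two labeled words are averaged
print(toy_sentiment("The venue opens at eight."))      # no labeled words
```

A real lexicon is far larger and a real analyzer also has to deal with things like negation, which this sketch ignores.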
Let's take a look at the sentiment of the various transcripts, both overall and throughout the comedy routine.
# We'll start by reading in the corpus, which preserves word order
import pandas as pd
data = pd.read_pickle('corpus.pkl')
data
| | transcript | full_name |
|---|---|---|
ali | Ladies and gentlemen, please welcome to the st... | Ali Wong |
anthony | Thank you. Thank you. Thank you, San Francisco... | Anthony Jeselnik |
bill | [cheers and applause] All right, thank you! Th... | Bill Burr |
bo | Bo What? Old MacDonald had a farm E I E I O An... | Bo Burnham |
dave | This is Dave. He tells dirty jokes for a livin... | Dave Chappelle |
hasan | [theme music: orchestral hip-hop] [crowd roars... | Hasan Minhaj |
jim | [Car horn honks] [Audience cheering] [Announce... | Jim Jefferies |
joe | [rock music playing] [audience cheering] [anno... | Joe Rogan |
john | All right, Petunia. Wish me luck out there. Yo... | John Mulaney |
louis | Intro\nFade the music out. Let’s roll. Hold th... | Louis C.K. |
mike | Wow. Hey, thank you. Thanks. Thank you, guys. ... | Mike Birbiglia |
ricky | Hello. Hello! How you doing? Great. Thank you.... | Ricky Gervais |
# Create quick lambda functions to find the polarity and subjectivity of each routine
# Terminal / Anaconda Navigator: conda install -c conda-forge textblob
from textblob import TextBlob
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity
# Apply each function to every transcript (Series.apply runs it element-wise)
data['polarity'] = data['transcript'].apply(pol)
data['subjectivity'] = data['transcript'].apply(sub)
data
| | transcript | full_name | polarity | subjectivity |
|---|---|---|---|---|
ali | Ladies and gentlemen, please welcome to the st... | Ali Wong | 0.069359 | 0.482403 |
anthony | Thank you. Thank you. Thank you, San Francisco... | Anthony Jeselnik | 0.055237 | 0.558976 |
bill | [cheers and applause] All right, thank you! Th... | Bill Burr | 0.016479 | 0.537016 |
bo | Bo What? Old MacDonald had a farm E I E I O An... | Bo Burnham | 0.074514 | 0.539368 |
dave | This is Dave. He tells dirty jokes for a livin... | Dave Chappelle | -0.002690 | 0.513958 |
hasan | [theme music: orchestral hip-hop] [crowd roars... | Hasan Minhaj | 0.086856 | 0.460619 |
jim | [Car horn honks] [Audience cheering] [Announce... | Jim Jefferies | 0.044224 | 0.523382 |
joe | [rock music playing] [audience cheering] [anno... | Joe Rogan | 0.004968 | 0.551628 |
john | All right, Petunia. Wish me luck out there. Yo... | John Mulaney | 0.082355 | 0.484137 |
louis | Intro\nFade the music out. Let’s roll. Hold th... | Louis C.K. | 0.056665 | 0.515796 |
mike | Wow. Hey, thank you. Thanks. Thank you, guys. ... | Mike Birbiglia | 0.092927 | 0.518476 |
ricky | Hello. Hello! How you doing? Great. Thank you.... | Ricky Gervais | 0.066489 | 0.497313 |
# Create a simple scatter plot of Polarity and Subjectivity
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 8] # Set width to 10 inches and height to 8 inches
for comedian in data.index:
    x = data.polarity.loc[comedian]
    y = data.subjectivity.loc[comedian]
    plt.scatter(x, y, color='blue')
    # Look up the name by label; integer lookups on a string-indexed Series are deprecated in pandas
    plt.text(x+.001, y+.001, data['full_name'].loc[comedian], fontsize=10) # Offset the label to avoid overlap of label & dot
plt.xlim(-.01, .12)
plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)
plt.show()
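The scatter plot can be hard to read when labels crowd together, so a sorted list is a handy companion. The sketch below simply ranks the overall polarity scores; the numbers are copied by hand from the table above, and the `polarity` dict and `ranked` list are names introduced here for illustration.

```python
# Overall polarity scores copied from the output table above
polarity = {
    "Ali Wong": 0.069359,
    "Anthony Jeselnik": 0.055237,
    "Bill Burr": 0.016479,
    "Bo Burnham": 0.074514,
    "Dave Chappelle": -0.002690,
    "Hasan Minhaj": 0.086856,
    "Jim Jefferies": 0.044224,
    "Joe Rogan": 0.004968,
    "John Mulaney": 0.082355,
    "Louis C.K.": 0.056665,
    "Mike Birbiglia": 0.092927,
    "Ricky Gervais": 0.066489,
}

# Sort from least positive to most positive
ranked = sorted(polarity.items(), key=lambda item: item[1])

for name, score in ranked:
    print(f"{name:<18}{score:+.6f}")

# The whole group sits in a narrow band close to zero
spread = ranked[-1][1] - ranked[0][1]
print(f"Spread between extremes: {spread:.6f}")
```

With the DataFrame in memory, `data.sort_values('polarity')` would give the same ordering without retyping the numbers.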
Instead of looking at the overall sentiment, let's see if there's anything interesting about the sentiment over time throughout each routine.
# Split each routine into 10 parts
import numpy as np
import math
def split_text(text, n=10):
    '''Takes in a string of text and splits it into n equal parts, with a default of 10 equal parts.'''
    length = len(text) # Calculate length of text
    size = math.floor(length / n) # Calculate size of each chunk of text (leftover characters at the end are dropped)
    # Calculate the starting points of each chunk of text
    start = np.arange(0, length, size) # numpy.arange([start, ]stop, [step]) returns evenly spaced values within a given interval
    # Pull out equally sized pieces of text and put them into a list
    split_list = []
    for piece in range(n):
        split_list.append(text[start[piece]:start[piece]+size])
    # Return a list with chunks of text
    return split_list
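One detail worth knowing about `split_text`: because the chunk size is rounded down, up to n-1 characters at the end of the text are left out whenever the length is not divisible by n. For long transcripts that is negligible. If every character should be kept, one possible variant (the name `split_text_keep_all` is hypothetical and not part of the original notebook) spreads the remainder over the first few chunks:

```python
def split_text_keep_all(text, n=10):
    """Split text into n contiguous chunks whose lengths differ by at most one character."""
    base, extra = divmod(len(text), n)  # 'extra' chunks get one additional character
    chunks = []
    start = 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        chunks.append(text[start:start + size])
        start += size
    return chunks

sample = "abcdefghijklmnopqrstuvw"  # 23 characters, not divisible by 10
parts = split_text_keep_all(sample, n=10)
print(parts)
print("".join(parts) == sample)  # True: no characters are lost
```

Both versions split on character positions, so a chunk boundary can land in the middle of a word. That has little effect on sentiment scores for long routines.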
# Let's take a look at our data again
data
| | transcript | full_name | polarity | subjectivity |
|---|---|---|---|---|
ali | Ladies and gentlemen, please welcome to the st... | Ali Wong | 0.069359 | 0.482403 |
anthony | Thank you. Thank you. Thank you, San Francisco... | Anthony Jeselnik | 0.055237 | 0.558976 |
bill | [cheers and applause] All right, thank you! Th... | Bill Burr | 0.016479 | 0.537016 |
bo | Bo What? Old MacDonald had a farm E I E I O An... | Bo Burnham | 0.074514 | 0.539368 |
dave | This is Dave. He tells dirty jokes for a livin... | Dave Chappelle | -0.002690 | 0.513958 |
hasan | [theme music: orchestral hip-hop] [crowd roars... | Hasan Minhaj | 0.086856 | 0.460619 |
jim | [Car horn honks] [Audience cheering] [Announce... | Jim Jefferies | 0.044224 | 0.523382 |
joe | [rock music playing] [audience cheering] [anno... | Joe Rogan | 0.004968 | 0.551628 |
john | All right, Petunia. Wish me luck out there. Yo... | John Mulaney | 0.082355 | 0.484137 |
louis | Intro\nFade the music out. Let’s roll. Hold th... | Louis C.K. | 0.056665 | 0.515796 |
mike | Wow. Hey, thank you. Thanks. Thank you, guys. ... | Mike Birbiglia | 0.092927 | 0.518476 |
ricky | Hello. Hello! How you doing? Great. Thank you.... | Ricky Gervais | 0.066489 | 0.497313 |
# Let's create a list of lists that'll hold all of the pieces of text of all the comedians
list_pieces = []
for t in data.transcript:
    split = split_text(t)
    list_pieces.append(split)
list_pieces
# The list has 12 items, one for each transcript
len(list_pieces)
12
# And then each transcript has been split into 10 pieces of text
len(list_pieces[0])
10
# Calculate the polarity for each piece of text
polarity_transcript = []
for lp in list_pieces:
    polarity_piece = []
    for p in lp:
        polarity_piece.append(TextBlob(p).sentiment.polarity)
    polarity_transcript.append(polarity_piece)
polarity_transcript
[[0.11168482647296207, 0.056407029478458055, 0.09445691155249979, 0.09236886724386723, -0.014671592775041055, 0.09538361348808912, 0.06079713127248339, 0.08721655328798185, 0.030089690638160044, 0.07351994851994852],
 [0.13933883477633482, -0.06333451704545455, -0.056153799903799935, 0.014602659245516405, 0.16377334420812684, 0.09091338259441709, 0.09420031055900621, 0.11566683919944787, -0.04238582919138478, 0.058467487373737366],
 [-0.0326152022580594, 0.006825656825656827, 0.023452001215159095, 0.01934081890331888, -0.026312183887941466, 0.06207506613756614, 0.030250682288725742, -0.020351594027441484, -0.01150485008818343, 0.10757491470108295],
 [0.17481829573934843, -0.04116923483102918, -0.022686011904761886, 0.019912549136687042, 0.0592493946731235, 0.05700242218099361, 0.04407051282051284, 0.11019892033865757, 0.19319944575626394, 0.23029900332225917],
 [-0.05093449586407334, -0.05557354333778966, 0.035829891691960644, 0.08313791054959534, -0.026718682968682954, 0.09785205955660498, -0.12774488467261902, -0.0858667847304211, -0.06019759281122916, 0.15191938178780284],
 [0.13394056175306174, 0.02376372354497352, -0.007972552910052912, 0.04265769467158358, 0.07530352418745274, 0.15949924118027567, 0.10203196740128559, 0.14881118326118326, 0.09374107374107374, 0.08352238478820757],
 [0.09480343501984131, 0.10371625923096511, 0.11308390022675745, -0.013150758122349044, -0.02455058908894137, -0.007404902196568868, 0.17715569885361557, 0.056709288653733085, -0.021307012256379345, -0.03465315598380114],
 [-0.012990848813633627, -0.016682990620490615, -0.030701264880952372, 0.020673450261079116, 0.01936133039074215, -0.005925324675324666, -0.06152908312447788, -0.004201252366389975, 0.03232746840701385, 0.08166467395038819],
 [0.19791810097532989, -0.008067970389398962, 0.08149321079648951, 0.15441811660561658, 0.13565162907268175, 0.0701145477722339, -0.019158122579175218, 0.027327034827034836, 0.14517770876466524, 0.050684369412290604],
 [0.0611880068716006, 0.03561645331664863, 0.1377777723365958, 0.026953933747412007, 0.05593590437340433, 0.029630690660102422, 0.08344959077380953, 0.14363832165641377, 0.022845309452452306, -0.01568181818181818],
 [0.1329261237205162, 0.02408706538170825, 0.17029999672045126, 0.039525430617202775, 0.056102175283209765, 0.22768604411461554, 0.04553152114840428, 0.08756809163059162, 0.07564848302190076, 0.06571405688010165],
 [0.17428735102346216, 0.15246819899189015, 0.05809832528582529, -0.0330954733672125, 0.16723688497882047, 0.013249771062271064, 0.06146913882762937, 0.023207239808802314, 0.012928669410150898, 0.06295056216931218]]
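A raw list of floats is hard to scan, so here is a small sketch of how one comedian's segment scores could be summarized. It assumes the inner lists follow the DataFrame's row order, which makes the fifth list Dave Chappelle's; the values are copied from the output above, and the variable names are introduced here for illustration.

```python
# Segment polarities for the fifth transcript (Dave Chappelle, assuming row order),
# copied from the polarity_transcript output above
dave = [-0.05093449586407334, -0.05557354333778966, 0.035829891691960644,
        0.08313791054959534, -0.026718682968682954, 0.09785205955660498,
        -0.12774488467261902, -0.0858667847304211, -0.06019759281122916,
        0.15191938178780284]

# Segments are numbered 1 through 10 to match "tenths of the routine"
negative_segments = [i + 1 for i, p in enumerate(dave) if p < 0]
lowest_segment = min(range(len(dave)), key=lambda i: dave[i]) + 1

print("Segments with negative polarity:", negative_segments)
print("Lowest-scoring segment:", lowest_segment)
```

The same two lines work on any of the twelve inner lists, for example `polarity_transcript[4]` when the notebook's variables are in memory.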
# Show the plot for one comedian
plt.plot(polarity_transcript[0])
plt.title(data['full_name'].iloc[0]) # Use the comedian's full name as the title
plt.show()
# Show the plot for all comedians
plt.rcParams['figure.figsize'] = [16, 12]
for index, comedian in enumerate(data.index):
    plt.subplot(3, 4, index+1)
    plt.plot(polarity_transcript[index])
    plt.plot(np.arange(0,10), np.zeros(10)) # Zero line for reference
    plt.title(data['full_name'].loc[comedian])
    plt.ylim(bottom=-.2, top=.3) # bottom/top replace the deprecated ymin/ymax arguments
plt.show()
Ali Wong stays generally positive throughout her routine. Louis C.K. and Mike Birbiglia show similar patterns.
On the other hand, some comedians have pretty different patterns: Bo Burnham gets happier as time passes, and Dave Chappelle has some pretty down moments in his routine.
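These readings come from eyeballing the plots. One rough way to put a number on them is to fit a straight line through each comedian's ten segment scores and look at the slope. The sketch below does that with a plain least-squares formula; `linear_slope` is a helper introduced here for illustration, and the two lists are copied from the polarity output above, assuming the inner lists follow the DataFrame's row order (first list Ali Wong, fourth list Bo Burnham).

```python
def linear_slope(values):
    """Least-squares slope of values against their positions 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator

# Segment polarities copied from the output above (row order assumed)
ali = [0.11168482647296207, 0.056407029478458055, 0.09445691155249979,
       0.09236886724386723, -0.014671592775041055, 0.09538361348808912,
       0.06079713127248339, 0.08721655328798185, 0.030089690638160044,
       0.07351994851994852]
bo = [0.17481829573934843, -0.04116923483102918, -0.022686011904761886,
      0.019912549136687042, 0.0592493946731235, 0.05700242218099361,
      0.04407051282051284, 0.11019892033865757, 0.19319944575626394,
      0.23029900332225917]

ali_positive = sum(p > 0 for p in ali)  # how many of Ali's segments score above zero

print(f"Ali Wong:   {ali_positive} of {len(ali)} segments positive, slope {linear_slope(ali):+.4f}")
print(f"Bo Burnham: slope {linear_slope(bo):+.4f}")
```

A positive slope means the routine trends more positive over time. With only ten points per routine, the slope is a rough indicator and not a statistical test.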