Sentiment Analysis

Introduction

So far, all of the analysis we've done has been pretty generic - looking at counts, creating scatter plots, etc. These techniques could be applied to numeric data as well.

When it comes to text data, there are a few popular techniques that we'll be going through in the next few notebooks, starting with sentiment analysis. Here are a few key points to remember with sentiment analysis:

TextBlob Module: Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of a word can vary based on where it appears in a sentence. The TextBlob module allows us to take advantage of these labels.
Sentiment Labels: Each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we're going to ignore them for now). A corpus's sentiment is the average of these word-level labels.
- Polarity: How positive or negative a word is. -1 is very negative. +1 is very positive.
- Subjectivity: How subjective, or opinionated a word is. 0 is fact. +1 is very much an opinion.
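
To make the averaging idea concrete, here is a minimal sketch that scores a sentence from a tiny lexicon. The words and scores in `toy_lexicon` are made up for illustration and are not TextBlob's actual labels, and the helper `toy_sentiment` is hypothetical. TextBlob's real sentiment function does more than a plain average, so treat this only as a picture of the basic idea.

```python
# Hypothetical lexicon: word -> (polarity, subjectivity).
# These scores are invented for illustration, not taken from TextBlob.
toy_lexicon = {
    "great": (0.8, 0.75),
    "terrible": (-1.0, 1.0),
    "funny": (0.25, 1.0),
}

def toy_sentiment(text):
    """Average the polarity and subjectivity of the words found in toy_lexicon."""
    # Lowercase each word and strip basic punctuation so "great," matches "great"
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [toy_lexicon[w] for w in words if w in toy_lexicon]
    if not scores:
        return 0.0, 0.0  # no labeled words: treat as neutral and objective
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity

print(toy_sentiment("The show was great, but the ending was terrible."))
```

Here one positive and one negative word roughly cancel out, which is also why long transcripts tend to end up with polarity scores close to 0.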

For more info, see how TextBlob coded up its sentiment function.

Let's take a look at the sentiment of the various transcripts, both overall and throughout the comedy routine.

Sentiment of Routine

In [1]:

# We'll start by reading in the corpus, which preserves word order
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

Out[1]:

	transcript	full_name
ali	Ladies and gentlemen, please welcome to the st...	Ali Wong
anthony	Thank you. Thank you. Thank you, San Francisco...	Anthony Jeselnik
bill	[cheers and applause] All right, thank you! Th...	Bill Burr
bo	Bo What? Old MacDonald had a farm E I E I O An...	Bo Burnham
dave	This is Dave. He tells dirty jokes for a livin...	Dave Chappelle
hasan	[theme music: orchestral hip-hop] [crowd roars...	Hasan Minhaj
jim	[Car horn honks] [Audience cheering] [Announce...	Jim Jefferies
joe	[rock music playing] [audience cheering] [anno...	Joe Rogan
john	All right, Petunia. Wish me luck out there. Yo...	John Mulaney
louis	Intro\nFade the music out. Let’s roll. Hold th...	Louis C.K.
mike	Wow. Hey, thank you. Thanks. Thank you, guys. ...	Mike Birbiglia
ricky	Hello. Hello! How you doing? Great. Thank you....	Ricky Gervais

In [2]:

# Create quick lambda functions to find the polarity and subjectivity of each routine
# Terminal / Anaconda Navigator: conda install -c conda-forge textblob
from textblob import TextBlob

pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

# Apply each function to every transcript in the column
data['polarity'] = data['transcript'].apply(pol)
data['subjectivity'] = data['transcript'].apply(sub)
data

Out[2]:

	transcript	full_name	polarity	subjectivity
ali	Ladies and gentlemen, please welcome to the st...	Ali Wong	0.069359	0.482403
anthony	Thank you. Thank you. Thank you, San Francisco...	Anthony Jeselnik	0.055237	0.558976
bill	[cheers and applause] All right, thank you! Th...	Bill Burr	0.016479	0.537016
bo	Bo What? Old MacDonald had a farm E I E I O An...	Bo Burnham	0.074514	0.539368
dave	This is Dave. He tells dirty jokes for a livin...	Dave Chappelle	-0.002690	0.513958
hasan	[theme music: orchestral hip-hop] [crowd roars...	Hasan Minhaj	0.086856	0.460619
jim	[Car horn honks] [Audience cheering] [Announce...	Jim Jefferies	0.044224	0.523382
joe	[rock music playing] [audience cheering] [anno...	Joe Rogan	0.004968	0.551628
john	All right, Petunia. Wish me luck out there. Yo...	John Mulaney	0.082355	0.484137
louis	Intro\nFade the music out. Let’s roll. Hold th...	Louis C.K.	0.056665	0.515796
mike	Wow. Hey, thank you. Thanks. Thank you, guys. ...	Mike Birbiglia	0.092927	0.518476
ricky	Hello. Hello! How you doing? Great. Thank you....	Ricky Gervais	0.066489	0.497313

In [13]:

# Create a simple scatter plot of Polarity and Subjectivity
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 8] # Set width to 10 inches and height to 8 inches

for comedian in data.index:
    x = data.polarity.loc[comedian]
    y = data.subjectivity.loc[comedian]
    plt.scatter(x, y, color='blue')
    # Offset the label to avoid overlap of label & dot; look up the name by index label, not by position
    plt.text(x+.001, y+.001, data['full_name'].loc[comedian], fontsize=10)

plt.xlim(-.01, .12) # Set the x-axis limits once, after all points are drawn

plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)

plt.show()

Sentiment of Routine Over Time

Instead of looking at the overall sentiment, let's see if there's anything interesting about the sentiment over time throughout each routine.

In [24]:

# Split each routine into 10 parts
import numpy as np
import math

def split_text(text, n=10):
    '''Takes in a string of text and splits it into n equal-sized parts, with a default of 10. Any leftover characters at the end (fewer than n) are dropped.'''

    length = len(text) # Calculate length of text
    size = math.floor(length / n) # Calculate size of each chunk of text 
    # Calculate the starting points of each chunk of text
    start = np.arange(0, length, size) # numpy.arange([start, ]stop, [step]) ...Return evenly spaced values within a given interval.
    
    # Pull out equally sized pieces of text and put it into a list. Return a list with chunks of text
    split_list = []
    for piece in range(n):
        split_list.append(text[start[piece]:start[piece]+size])
    return split_list
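
Because `split_text` rounds the chunk size down, a few characters at the very end of a transcript can be left out of the last chunk. That hardly matters for long transcripts, but if you want every character kept, one possible variant is sketched below. The name `split_text_even` is made up for this example; it spreads the leftover characters over the first few chunks and only uses plain Python.

```python
def split_text_even(text, n=10):
    """Split text into n consecutive pieces whose lengths differ by at most 1."""
    base, extra = divmod(len(text), n)  # base size, plus how many chunks get one extra character
    pieces = []
    start = 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        pieces.append(text[start:end])
        start = end
    return pieces

sample = "abcdefghijklmnopqrstuvwxy"  # 25 characters
pieces = split_text_even(sample)
print([len(p) for p in pieces])
print("".join(pieces) == sample)
```

Joining the pieces back together gives the original string, so nothing is lost. The trade-off is that the chunks are no longer all exactly the same length.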

In [15]:

# Let's take a look at our data again
data

Out[15]:

	transcript	full_name	polarity	subjectivity
ali	Ladies and gentlemen, please welcome to the st...	Ali Wong	0.069359	0.482403
anthony	Thank you. Thank you. Thank you, San Francisco...	Anthony Jeselnik	0.055237	0.558976
bill	[cheers and applause] All right, thank you! Th...	Bill Burr	0.016479	0.537016
bo	Bo What? Old MacDonald had a farm E I E I O An...	Bo Burnham	0.074514	0.539368
dave	This is Dave. He tells dirty jokes for a livin...	Dave Chappelle	-0.002690	0.513958
hasan	[theme music: orchestral hip-hop] [crowd roars...	Hasan Minhaj	0.086856	0.460619
jim	[Car horn honks] [Audience cheering] [Announce...	Jim Jefferies	0.044224	0.523382
joe	[rock music playing] [audience cheering] [anno...	Joe Rogan	0.004968	0.551628
john	All right, Petunia. Wish me luck out there. Yo...	John Mulaney	0.082355	0.484137
louis	Intro\nFade the music out. Let’s roll. Hold th...	Louis C.K.	0.056665	0.515796
mike	Wow. Hey, thank you. Thanks. Thank you, guys. ...	Mike Birbiglia	0.092927	0.518476
ricky	Hello. Hello! How you doing? Great. Thank you....	Ricky Gervais	0.066489	0.497313

In [37]:

# Let's create a list of lists that'll hold all of the pieces of text of all the comedians
list_pieces = []
for t in data.transcript:
    split = split_text(t)
    list_pieces.append(split)

In [ ]:

list_pieces

In [32]:

# The list has 12 items, one for each transcript
len(list_pieces)

Out[32]:

12

In [33]:

# And then each transcript has been split into 10 pieces of text
len(list_pieces[0])

Out[33]:

10

In [34]:

# Calculate the polarity for each piece of text

polarity_transcript = []
for lp in list_pieces:
    polarity_piece = []
    for p in lp:
        polarity_piece.append(TextBlob(p).sentiment.polarity)
    polarity_transcript.append(polarity_piece)
    
polarity_transcript

Out[34]:

[[0.11168482647296207,
  0.056407029478458055,
  0.09445691155249979,
  0.09236886724386723,
  -0.014671592775041055,
  0.09538361348808912,
  0.06079713127248339,
  0.08721655328798185,
  0.030089690638160044,
  0.07351994851994852],
 [0.13933883477633482,
  -0.06333451704545455,
  -0.056153799903799935,
  0.014602659245516405,
  0.16377334420812684,
  0.09091338259441709,
  0.09420031055900621,
  0.11566683919944787,
  -0.04238582919138478,
  0.058467487373737366],
 [-0.0326152022580594,
  0.006825656825656827,
  0.023452001215159095,
  0.01934081890331888,
  -0.026312183887941466,
  0.06207506613756614,
  0.030250682288725742,
  -0.020351594027441484,
  -0.01150485008818343,
  0.10757491470108295],
 [0.17481829573934843,
  -0.04116923483102918,
  -0.022686011904761886,
  0.019912549136687042,
  0.0592493946731235,
  0.05700242218099361,
  0.04407051282051284,
  0.11019892033865757,
  0.19319944575626394,
  0.23029900332225917],
 [-0.05093449586407334,
  -0.05557354333778966,
  0.035829891691960644,
  0.08313791054959534,
  -0.026718682968682954,
  0.09785205955660498,
  -0.12774488467261902,
  -0.0858667847304211,
  -0.06019759281122916,
  0.15191938178780284],
 [0.13394056175306174,
  0.02376372354497352,
  -0.007972552910052912,
  0.04265769467158358,
  0.07530352418745274,
  0.15949924118027567,
  0.10203196740128559,
  0.14881118326118326,
  0.09374107374107374,
  0.08352238478820757],
 [0.09480343501984131,
  0.10371625923096511,
  0.11308390022675745,
  -0.013150758122349044,
  -0.02455058908894137,
  -0.007404902196568868,
  0.17715569885361557,
  0.056709288653733085,
  -0.021307012256379345,
  -0.03465315598380114],
 [-0.012990848813633627,
  -0.016682990620490615,
  -0.030701264880952372,
  0.020673450261079116,
  0.01936133039074215,
  -0.005925324675324666,
  -0.06152908312447788,
  -0.004201252366389975,
  0.03232746840701385,
  0.08166467395038819],
 [0.19791810097532989,
  -0.008067970389398962,
  0.08149321079648951,
  0.15441811660561658,
  0.13565162907268175,
  0.0701145477722339,
  -0.019158122579175218,
  0.027327034827034836,
  0.14517770876466524,
  0.050684369412290604],
 [0.0611880068716006,
  0.03561645331664863,
  0.1377777723365958,
  0.026953933747412007,
  0.05593590437340433,
  0.029630690660102422,
  0.08344959077380953,
  0.14363832165641377,
  0.022845309452452306,
  -0.01568181818181818],
 [0.1329261237205162,
  0.02408706538170825,
  0.17029999672045126,
  0.039525430617202775,
  0.056102175283209765,
  0.22768604411461554,
  0.04553152114840428,
  0.08756809163059162,
  0.07564848302190076,
  0.06571405688010165],
 [0.17428735102346216,
  0.15246819899189015,
  0.05809832528582529,
  -0.0330954733672125,
  0.16723688497882047,
  0.013249771062271064,
  0.06146913882762937,
  0.023207239808802314,
  0.012928669410150898,
  0.06295056216931218]]

In [35]:

# Show the plot for one comedian
plt.plot(polarity_transcript[0])
plt.title(data['full_name'].iloc[0]) # Use the comedian's full name, not the index label
plt.show()

In [36]:

# Show the plot for all comedians
plt.rcParams['figure.figsize'] = [16, 12]

for index, comedian in enumerate(data.index):
    plt.subplot(3, 4, index+1)
    plt.plot(polarity_transcript[index])
    plt.plot(np.arange(0,10), np.zeros(10)) # Draw a flat line at 0 for reference
    plt.title(data['full_name'].loc[comedian])
    plt.ylim(bottom=-.2, top=.3) # ymin/ymax are deprecated in Matplotlib 3.0+; use bottom/top

plt.show()

Ali Wong stays generally positive throughout her routine. Louis C.K. and Mike Birbiglia show a similar pattern.

On the other hand, there are some pretty different patterns here: Bo Burnham gets happier as time passes, while Dave Chappelle has some pretty down moments in his routine.
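
One rough way to back up these readings with numbers is to compare the first and second half of a routine and count how many chunks are negative. The sketch below copies the fourth and fifth rows of the polarity output above, rounded to three decimals, and assumes the rows follow the order of `data.index` (so they belong to Bo Burnham and Dave Chappelle). The helper name `summarize` is made up for this example.

```python
from statistics import mean

# Polarity per chunk, rounded from the output above.
# Assumption: rows are in the same order as data.index, so these are bo and dave.
bo = [0.175, -0.041, -0.023, 0.020, 0.059, 0.057, 0.044, 0.110, 0.193, 0.230]
dave = [-0.051, -0.056, 0.036, 0.083, -0.027, 0.098, -0.128, -0.086, -0.060, 0.152]

def summarize(polarities):
    """Return the first-half mean, second-half mean, and number of negative chunks."""
    half = len(polarities) // 2
    return {
        'first_half': mean(polarities[:half]),
        'second_half': mean(polarities[half:]),
        'negative_chunks': sum(1 for p in polarities if p < 0),
    }

for name, values in [('Bo Burnham', bo), ('Dave Chappelle', dave)]:
    print(name, summarize(values))
```

With these numbers, Bo Burnham's second-half mean is higher than his first-half mean, and six of Dave Chappelle's ten chunks come out negative, which lines up with what the plots suggest.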