Sentiment Analysis

Introduction

So far, all of the analysis we've done has been pretty generic - looking at counts, creating scatter plots, etc. These techniques could be applied to numeric data as well.

When it comes to text data, there are a few popular techniques that we'll be going through in the next few notebooks, starting with sentiment analysis. Here are a few key points to remember about sentiment analysis:

  1. TextBlob Module: Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of a word can vary based on where it appears in a sentence. The TextBlob module allows us to take advantage of these labels.
  2. Sentiment Labels: Each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we're going to ignore them for now). A corpus's sentiment is the average of these word-level scores.
    • Polarity: How positive or negative a word is. -1 is very negative. +1 is very positive.
    • Subjectivity: How subjective, or opinionated, a word is. 0 is fact. +1 is very much an opinion.
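
To make the "average of word labels" idea concrete, here is a minimal sketch in plain Python. The lexicon, its scores, and the function name toy_sentiment are all made up for illustration; this is not TextBlob's lexicon or its actual implementation, which also takes a word's context into account.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity).
# These scores are invented for illustration and are NOT TextBlob's values.
TOY_LEXICON = {
    "great": (0.8, 0.75),
    "funny": (0.25, 1.0),
    "terrible": (-1.0, 1.0),
    "boring": (-1.0, 1.0),
}

def toy_sentiment(text):
    """Average the (polarity, subjectivity) labels of the known words in text."""
    words = text.lower().split()
    scored = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scored:
        # No labeled words found: treat the text as neutral in this sketch
        return 0.0, 0.0
    polarity = sum(p for p, _ in scored) / len(scored)
    subjectivity = sum(s for _, s in scored) / len(scored)
    return polarity, subjectivity

positive = toy_sentiment("the show was great and funny")
negative = toy_sentiment("the show was boring")
print(positive)
print(negative)
```

Words that are not in the lexicon ("the", "show", "was") simply don't contribute, so the score reflects only the labeled words.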

For more info, take a look at how TextBlob coded up its sentiment function.

Let's take a look at the sentiment of the various transcripts, both overall and throughout the comedy routine.

Sentiment of Routine

In [1]:
# We'll start by reading in the corpus, which preserves word order
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data
Out[1]:
transcript full_name
ali Ladies and gentlemen, please welcome to the st... Ali Wong
anthony Thank you. Thank you. Thank you, San Francisco... Anthony Jeselnik
bill [cheers and applause] All right, thank you! Th... Bill Burr
bo Bo What? Old MacDonald had a farm E I E I O An... Bo Burnham
dave This is Dave. He tells dirty jokes for a livin... Dave Chappelle
hasan [theme music: orchestral hip-hop] [crowd roars... Hasan Minhaj
jim [Car horn honks] [Audience cheering] [Announce... Jim Jefferies
joe [rock music playing] [audience cheering] [anno... Joe Rogan
john All right, Petunia. Wish me luck out there. Yo... John Mulaney
louis Intro\nFade the music out. Let’s roll. Hold th... Louis C.K.
mike Wow. Hey, thank you. Thanks. Thank you, guys. ... Mike Birbiglia
ricky Hello. Hello! How you doing? Great. Thank you.... Ricky Gervais
In [2]:
# Create quick lambda functions to find the polarity and subjectivity of each routine
# Terminal / Anaconda Navigator: conda install -c conda-forge textblob
from textblob import TextBlob

pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

# Apply each function to every transcript in the column
data['polarity'] = data['transcript'].apply(pol)
data['subjectivity'] = data['transcript'].apply(sub)
data
Out[2]:
transcript full_name polarity subjectivity
ali Ladies and gentlemen, please welcome to the st... Ali Wong 0.069359 0.482403
anthony Thank you. Thank you. Thank you, San Francisco... Anthony Jeselnik 0.055237 0.558976
bill [cheers and applause] All right, thank you! Th... Bill Burr 0.016479 0.537016
bo Bo What? Old MacDonald had a farm E I E I O An... Bo Burnham 0.074514 0.539368
dave This is Dave. He tells dirty jokes for a livin... Dave Chappelle -0.002690 0.513958
hasan [theme music: orchestral hip-hop] [crowd roars... Hasan Minhaj 0.086856 0.460619
jim [Car horn honks] [Audience cheering] [Announce... Jim Jefferies 0.044224 0.523382
joe [rock music playing] [audience cheering] [anno... Joe Rogan 0.004968 0.551628
john All right, Petunia. Wish me luck out there. Yo... John Mulaney 0.082355 0.484137
louis Intro\nFade the music out. Let’s roll. Hold th... Louis C.K. 0.056665 0.515796
mike Wow. Hey, thank you. Thanks. Thank you, guys. ... Mike Birbiglia 0.092927 0.518476
ricky Hello. Hello! How you doing? Great. Thank you.... Ricky Gervais 0.066489 0.497313
In [13]:
# Create a simple scatter plot of Polarity and Subjectivity
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 8] # Set width to 10 inches and height to 8 inches

for comedian in data.index:
    x = data.polarity.loc[comedian]
    y = data.subjectivity.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+.001, y+.001, data['full_name'].loc[comedian], fontsize=10) # Offset the label to avoid overlap of label & dot

plt.xlim(-.01, .12) # Set the x-axis range once, after all points are plotted

plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)

plt.show()

Sentiment of Routine Over Time

Instead of looking at the overall sentiment, let's see if there's anything interesting about the sentiment over time throughout each routine.

In [24]:
# Split each routine into 10 parts
import numpy as np
import math

def split_text(text, n=10):
    '''Takes in a string of text and splits it into n equal-sized chunks (default 10).
    Any leftover characters at the end of the text are dropped.'''

    length = len(text) # Calculate length of text
    size = math.floor(length / n) # Calculate size of each chunk of text
    # Calculate the starting point of each chunk of text
    start = np.arange(0, length, size) # Evenly spaced start positions: 0, size, 2*size, ...
    
    # Pull out equally sized pieces of text and return them as a list
    split_list = []
    for piece in range(n):
        split_list.append(text[start[piece]:start[piece]+size])
    return split_list
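
To see what this chunking does on something small, here is a rough sketch of the same arithmetic without numpy. The name split_text_demo and the sample string are just for illustration; it is meant to produce the same slices as split_text above, assuming the text has at least n characters.

```python
def split_text_demo(text, n=10):
    """Split text into n chunks of equal size, dropping any leftover characters."""
    size = len(text) // n  # Size of each chunk, rounded down
    # Chunk i starts at i * size and is size characters long
    return [text[i * size:(i + 1) * size] for i in range(n)]

sample = "abcdefghijklmnopqrstuvw"  # 23 characters
pieces = split_text_demo(sample, n=5)
print(pieces)  # ['abcd', 'efgh', 'ijkl', 'mnop', 'qrst']
```

With 23 characters and 5 chunks, each chunk gets 4 characters and the trailing 'uvw' is left out. For a full transcript, losing fewer than 10 characters at the end makes no practical difference.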
In [15]:
# Let's take a look at our data again
data
Out[15]:
transcript full_name polarity subjectivity
ali Ladies and gentlemen, please welcome to the st... Ali Wong 0.069359 0.482403
anthony Thank you. Thank you. Thank you, San Francisco... Anthony Jeselnik 0.055237 0.558976
bill [cheers and applause] All right, thank you! Th... Bill Burr 0.016479 0.537016
bo Bo What? Old MacDonald had a farm E I E I O An... Bo Burnham 0.074514 0.539368
dave This is Dave. He tells dirty jokes for a livin... Dave Chappelle -0.002690 0.513958
hasan [theme music: orchestral hip-hop] [crowd roars... Hasan Minhaj 0.086856 0.460619
jim [Car horn honks] [Audience cheering] [Announce... Jim Jefferies 0.044224 0.523382
joe [rock music playing] [audience cheering] [anno... Joe Rogan 0.004968 0.551628
john All right, Petunia. Wish me luck out there. Yo... John Mulaney 0.082355 0.484137
louis Intro\nFade the music out. Let’s roll. Hold th... Louis C.K. 0.056665 0.515796
mike Wow. Hey, thank you. Thanks. Thank you, guys. ... Mike Birbiglia 0.092927 0.518476
ricky Hello. Hello! How you doing? Great. Thank you.... Ricky Gervais 0.066489 0.497313
In [37]:
# Let's create a list of lists that'll hold all of the pieces of text of all the comedians
list_pieces = []
for t in data.transcript:
    split = split_text(t)
    list_pieces.append(split)
In [ ]:
list_pieces
In [32]:
# The list has 12 items, one for each transcript
len(list_pieces)
Out[32]:
12
In [33]:
# And then each transcript has been split into 10 pieces of text
len(list_pieces[0])
Out[33]:
10
In [34]:
# Calculate the polarity for each piece of text

polarity_transcript = []
for lp in list_pieces:
    polarity_piece = []
    for p in lp:
        polarity_piece.append(TextBlob(p).sentiment.polarity)
    polarity_transcript.append(polarity_piece)
    
polarity_transcript
Out[34]:
[[0.11168482647296207,
  0.056407029478458055,
  0.09445691155249979,
  0.09236886724386723,
  -0.014671592775041055,
  0.09538361348808912,
  0.06079713127248339,
  0.08721655328798185,
  0.030089690638160044,
  0.07351994851994852],
 [0.13933883477633482,
  -0.06333451704545455,
  -0.056153799903799935,
  0.014602659245516405,
  0.16377334420812684,
  0.09091338259441709,
  0.09420031055900621,
  0.11566683919944787,
  -0.04238582919138478,
  0.058467487373737366],
 [-0.0326152022580594,
  0.006825656825656827,
  0.023452001215159095,
  0.01934081890331888,
  -0.026312183887941466,
  0.06207506613756614,
  0.030250682288725742,
  -0.020351594027441484,
  -0.01150485008818343,
  0.10757491470108295],
 [0.17481829573934843,
  -0.04116923483102918,
  -0.022686011904761886,
  0.019912549136687042,
  0.0592493946731235,
  0.05700242218099361,
  0.04407051282051284,
  0.11019892033865757,
  0.19319944575626394,
  0.23029900332225917],
 [-0.05093449586407334,
  -0.05557354333778966,
  0.035829891691960644,
  0.08313791054959534,
  -0.026718682968682954,
  0.09785205955660498,
  -0.12774488467261902,
  -0.0858667847304211,
  -0.06019759281122916,
  0.15191938178780284],
 [0.13394056175306174,
  0.02376372354497352,
  -0.007972552910052912,
  0.04265769467158358,
  0.07530352418745274,
  0.15949924118027567,
  0.10203196740128559,
  0.14881118326118326,
  0.09374107374107374,
  0.08352238478820757],
 [0.09480343501984131,
  0.10371625923096511,
  0.11308390022675745,
  -0.013150758122349044,
  -0.02455058908894137,
  -0.007404902196568868,
  0.17715569885361557,
  0.056709288653733085,
  -0.021307012256379345,
  -0.03465315598380114],
 [-0.012990848813633627,
  -0.016682990620490615,
  -0.030701264880952372,
  0.020673450261079116,
  0.01936133039074215,
  -0.005925324675324666,
  -0.06152908312447788,
  -0.004201252366389975,
  0.03232746840701385,
  0.08166467395038819],
 [0.19791810097532989,
  -0.008067970389398962,
  0.08149321079648951,
  0.15441811660561658,
  0.13565162907268175,
  0.0701145477722339,
  -0.019158122579175218,
  0.027327034827034836,
  0.14517770876466524,
  0.050684369412290604],
 [0.0611880068716006,
  0.03561645331664863,
  0.1377777723365958,
  0.026953933747412007,
  0.05593590437340433,
  0.029630690660102422,
  0.08344959077380953,
  0.14363832165641377,
  0.022845309452452306,
  -0.01568181818181818],
 [0.1329261237205162,
  0.02408706538170825,
  0.17029999672045126,
  0.039525430617202775,
  0.056102175283209765,
  0.22768604411461554,
  0.04553152114840428,
  0.08756809163059162,
  0.07564848302190076,
  0.06571405688010165],
 [0.17428735102346216,
  0.15246819899189015,
  0.05809832528582529,
  -0.0330954733672125,
  0.16723688497882047,
  0.013249771062271064,
  0.06146913882762937,
  0.023207239808802314,
  0.012928669410150898,
  0.06295056216931218]]
In [35]:
# Show the plot for one comedian
plt.plot(polarity_transcript[0])
plt.title(data['full_name'].iloc[0]) # Use the comedian's full name as the title
plt.show()
In [36]:
# Show the plot for all comedians
plt.rcParams['figure.figsize'] = [16, 12]

for index, comedian in enumerate(data.index):    
    plt.subplot(3, 4, index+1)
    plt.plot(polarity_transcript[index])
    plt.plot(np.arange(0,10), np.zeros(10)) # Draw a zero line for reference
    plt.title(data['full_name'].loc[comedian])
    plt.ylim(bottom=-.2, top=.3)
    
plt.show()

Ali Wong stays generally positive throughout her routine. Louis C.K. and Mike Birbiglia follow a similar pattern.

On the other hand, some of the patterns here are pretty different: Bo Burnham gets happier as time passes, while Dave Chappelle has some pretty down moments in his routine.
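
As a rough numeric check of these observations, the sketch below summarizes the per-piece polarity for four of the comedians. The values are copied from the output above and rounded to four decimals; the summarize helper and the statistics it reports are just one possible way to describe the curves, not part of the original analysis.

```python
# Per-piece polarity, rounded to 4 decimals from the polarity_transcript output above
pieces = {
    "Ali Wong": [0.1117, 0.0564, 0.0945, 0.0924, -0.0147,
                 0.0954, 0.0608, 0.0872, 0.0301, 0.0735],
    "Bo Burnham": [0.1748, -0.0412, -0.0227, 0.0199, 0.0592,
                   0.0570, 0.0441, 0.1102, 0.1932, 0.2303],
    "Dave Chappelle": [-0.0509, -0.0556, 0.0358, 0.0831, -0.0267,
                       0.0979, -0.1277, -0.0859, -0.0602, 0.1519],
    "Mike Birbiglia": [0.1329, 0.0241, 0.1703, 0.0395, 0.0561,
                       0.2277, 0.0455, 0.0876, 0.0756, 0.0657],
}

def summarize(values):
    """Describe a polarity curve with a few simple statistics."""
    half = len(values) // 2
    return {
        "negative_pieces": sum(v < 0 for v in values),  # Pieces below zero
        "first_half_mean": sum(values[:half]) / half,
        "second_half_mean": sum(values[half:]) / (len(values) - half),
        "lowest": min(values),
    }

summary = {name: summarize(values) for name, values in pieces.items()}
for name, stats in summary.items():
    print(name, stats)
```

Dave Chappelle has the most pieces below zero and the lowest single piece, while Bo Burnham's second-half average is clearly higher than his first-half average.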