Introduction to Scattertext

DDSEA17: Understanding Cultures and Perspectives through Text and Emoji Visualization

@jasonkessler

https://github.com/JasonKessler/scattertext

Cite as: Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations. 2017.

Link to preprint: https://arxiv.org/abs/1703.00565

@inproceedings{kessler2017scattertext,
  author    = {Kessler, Jason S.},
  title     = {Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ},
  booktitle = {ACL System Demonstrations},
  year      = {2017},
}

In [1]:
%matplotlib inline
import scattertext as st
import re, io
from pprint import pprint
import pandas as pd
import numpy as np
from scipy.stats import rankdata, hmean, norm
import spacy.en
import os, pkgutil, json, urllib
from urllib.request import urlopen
from IPython.display import IFrame
from IPython.core.display import display, HTML
from scattertext import CorpusFromPandas, produce_scattertext_explorer
display(HTML("<style>.container { width:98% !important; }</style>"))
In [2]:
nlp = spacy.en.English()
# If this doesn't work, please uncomment the following line and use a regex-based parser instead
#nlp = st.whitespace_nlp_with_sentences

Grab the 2012 political convention data set and preview it

In [3]:
convention_df = st.SampleCorpora.ConventionData2012.get_data()
In [4]:
convention_df.iloc[0]
Out[4]:
party                                               democrat
speaker                                         BARACK OBAMA
text       Thank you. Thank you. Thank you. Thank you so ...
Name: 0, dtype: object
In [6]:
print("Document Count")
print(convention_df.groupby('party')['text'].count())
print("Word Count")
convention_df.groupby('party').apply(lambda x: x.text.apply(lambda x: len(x.split())).sum())
convention_df['parsed'] = convention_df.text.apply(nlp)
Document Count
party
democrat      123
republican     66
Name: text, dtype: int64
Word Count

Turn the spaCy-parsed documents into a Scattertext corpus.

In [7]:
corpus = st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parsed').build()

Scattertext has functions to find how strongly words are associated with categories

I've reworded this section since the talk

There are lots of ways to do this. I'm partial to a novel technique called Scaled F-Score.

Intuition:

- Associated terms have a relatively high category-specific precision and a relatively high category-specific term frequency (i.e., the percentage of the category's words that are that term)

- Take the harmonic mean of precision and frequency (both have to be high)

- The hyper-parameters ($\beta$ and the transformation function) are pretty much universal

Given a word $w_i \in W$ and a category $c_j \in C$, define the precision of the word $w_i$ with respect to a category as: $$ \mbox{prec}(w_i, c_j) = \frac{\#(w_i, c_j)}{\sum_{c \in C} \#(w_i, c)}. $$

The function $\#(w_i, c_j)$ represents either the number of times $w_i$ occurs in a document labeled with the category $c_j$ or the number of documents labeled $c_j$ which contain $w_i$.

Similarly, define the frequency with which a word occurs in the category as:

$$ \mbox{freq}(w_i, c_j) = \frac{\#(w_i, c_j)}{\sum_{w \in W} \#(w, c_j)}. $$

The F-Score of these two values is defined as:

$$ \mathcal{F}_\beta(\mbox{prec}, \mbox{freq}) = (1 + \beta^2) \frac{\mbox{prec} \cdot \mbox{freq}}{\beta^2 \cdot \mbox{prec} + \mbox{freq}}. $$

$\beta \in \mathbb{R}^+$ is a scaling factor: precision is favored if $\beta < 1$, frequency if $\beta > 1$, and both are weighted equally if $\beta = 1$, in which case the F-Score is simply the harmonic mean.
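Before scoring the whole vocabulary, here is a quick worked example of how precision, frequency, and the F-Score combine for a single term. The counts for "middle class" are approximate and also appear in the tables below:

# Quick worked example with approximate counts for "middle class"
# (these counts also appear in the tables below).
dem_count, rep_count = 148, 18      # occurrences in Democratic vs. Republican speeches
dem_total_terms = 152000            # approximate total number of Democratic term occurrences

prec = dem_count / (dem_count + rep_count)   # ~0.89: mostly used by Democrats
freq = dem_count / dem_total_terms           # ~0.001: tiny share of all Democratic terms

def f_beta(prec, freq, beta=1.0):
    # F-Score of precision and frequency; beta=1 gives the harmonic mean
    return (1 + beta ** 2) * prec * freq / (beta ** 2 * prec + freq)

print(f_beta(prec, freq))   # ~0.0019, dominated by the tiny frequency value

Because the raw frequency is so small, the harmonic mean is dominated by it, which is exactly the problem the table below illustrates.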

In [8]:
term_freq_df = corpus.get_term_freq_df()
term_freq_df['dem_precision'] = term_freq_df['democrat freq'] * 1./(term_freq_df['democrat freq'] + term_freq_df['republican freq'])
term_freq_df['dem_freq_pct'] = term_freq_df['democrat freq'] * 1./term_freq_df['democrat freq'].sum()
term_freq_df['dem_hmean'] = term_freq_df.apply(lambda x: (hmean([x['dem_precision'], x['dem_freq_pct']])
                                                                   if x['dem_precision'] > 0 and x['dem_freq_pct'] > 0 
                                                                   else 0), axis=1)                                                        
term_freq_df.sort_values(by='dem_hmean', ascending=False).iloc[:10]
Out[8]:
| term | democrat freq | republican freq | dem_precision | dem_freq_pct | dem_hmean |
|------|---------------|-----------------|---------------|--------------|-----------|
| the  | 3402 | 2532 | 0.573306 | 0.022343 | 0.043009 |
| and  | 2709 | 2233 | 0.548159 | 0.017791 | 0.034464 |
| to   | 2340 | 1667 | 0.583978 | 0.015368 | 0.029948 |
| a    | 1602 | 1345 | 0.543604 | 0.010521 | 0.020643 |
| of   | 1569 | 1377 | 0.532587 | 0.010304 | 0.020218 |
| that | 1400 | 1051 | 0.571195 | 0.009195 | 0.018098 |
| we   | 1318 | 1146 | 0.534903 | 0.008656 | 0.017036 |
| in   | 1291 | 986  | 0.566974 | 0.008479 | 0.016708 |
| i    | 1098 | 851  | 0.563366 | 0.007211 | 0.014240 |
| 's   | 1037 | 631  | 0.621703 | 0.006811 | 0.013473 |

Solution:

Take the Normal CDF of the precision and frequency scores, which scales and standardizes both so they fall between 0 and 1.

Define the Normal CDF as:

$$ \Phi(z) = \int_{-\infty}^z \mathcal{N}(x; \mu, \sigma^2)\ \mathrm{d}x.$$

Where $ \mathcal{N} $ is the PDF of the Normal distribution, $\mu$ is the mean, and $\sigma^2$ is the variance.

$\Phi$ is used to scale and standardize the precisions and frequencies, and place them on the same scale $[0,1]$.

Now we can define Scaled F-Score as the harmonic mean of the Normal CDF transformed frequency and precision:

$$ \mathcal{S}_{\beta}(w_i, c_j) = \mathcal{F}_{\beta}(\Phi(\mbox{prec}(w_i, c_j)), \Phi(\mbox{freq}(w_i, c_j))).$$

$\mu$ and $\sigma^2$ are estimated separately for the precision scores and the frequency scores, i.e., each set of scores is standardized using its own mean and variance.

A $\beta$ of 0.5 is recommended and is the default value in Scattertext.

Note that any function with the range of $[0,1]$ (this includes the identity function) may be used in place of $\Phi$.
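The next cell works through the $\beta = 1$ (plain harmonic mean) case by hand. For reference, a minimal sketch of the full $\beta$-weighted Scaled F-Score defined above might look like the following; scaled_f_score is a hypothetical helper, and Scattertext's corpus.get_scaled_f_scores (used further down) is the real implementation, which may differ in details:

import numpy as np
from scipy.stats import norm

def scaled_f_score(cat_counts, other_counts, beta=0.5):
    # Sketch of Scaled F-Score over arrays of per-term counts, following the definition above
    cat_counts = np.asarray(cat_counts, dtype=float)
    other_counts = np.asarray(other_counts, dtype=float)
    prec = cat_counts / (cat_counts + other_counts)   # category-specific precision
    freq = cat_counts / cat_counts.sum()              # category-specific frequency
    # Normal CDF scaling: standardize each score with its own mean and std
    prec_cdf = norm.cdf(prec, prec.mean(), prec.std())
    freq_cdf = norm.cdf(freq, freq.mean(), freq.std())
    # F_beta of the two scaled scores
    return ((1 + beta ** 2) * prec_cdf * freq_cdf
            / (beta ** 2 * prec_cdf + freq_cdf))

# e.g. scaled_f_score(term_freq_df['democrat freq'], term_freq_df['republican freq'])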

In [15]:
def normcdf(x):
    return norm.cdf(x, x.mean(), x.std())
term_freq_df['dem_precision_normcdf'] = normcdf(term_freq_df['dem_precision'])
term_freq_df['dem_freq_pct_normcdf'] = normcdf(term_freq_df['dem_freq_pct'])
term_freq_df['dem_scaled_f_score'] = hmean([term_freq_df['dem_precision_normcdf'], term_freq_df['dem_freq_pct_normcdf']])
term_freq_df.sort_values(by='dem_scaled_f_score', ascending=False).iloc[:10]
Out[15]:
| term | democrat freq | republican freq | dem_precision | dem_freq_pct | dem_hmean | dem_precision_normcdf | dem_freq_pct_normcdf | dem_scaled_f_score |
|------|---------------|-----------------|---------------|--------------|-----------|-----------------------|----------------------|--------------------|
| middle class | 148 | 18 | 0.891566 | 0.000972 | 0.001942 | 0.769762 | 1.000000 | 0.869905 |
| auto | 37 | 0 | 1.000000 | 0.000243 | 0.000486 | 0.836010 | 0.889307 | 0.861835 |
| fair | 45 | 3 | 0.937500 | 0.000296 | 0.000591 | 0.799485 | 0.933962 | 0.861507 |
| insurance | 54 | 6 | 0.900000 | 0.000355 | 0.000709 | 0.775397 | 0.965959 | 0.860251 |
| forward | 105 | 16 | 0.867769 | 0.000690 | 0.001378 | 0.753443 | 0.999858 | 0.859334 |
| president barack | 47 | 4 | 0.921569 | 0.000309 | 0.000617 | 0.789447 | 0.942572 | 0.859241 |
| class | 161 | 25 | 0.865591 | 0.001057 | 0.002112 | 0.751919 | 1.000000 | 0.858395 |
| middle | 164 | 27 | 0.858639 | 0.001077 | 0.002151 | 0.747021 | 1.000000 | 0.855194 |
| the middle | 98 | 17 | 0.852174 | 0.000644 | 0.001286 | 0.742422 | 0.999640 | 0.852041 |
| medicare | 84 | 15 | 0.848485 | 0.000552 | 0.001103 | 0.739778 | 0.998050 | 0.849722 |
In [16]:
term_freq_df['dem_corner_score'] = corpus.get_corner_scores('democrat')
term_freq_df.sort_values(by='dem_corner_score', ascending=False).iloc[:10]
Out[16]:
| term | democrat freq | republican freq | dem_precision | dem_freq_pct | dem_hmean | dem_precision_normcdf | dem_freq_pct_normcdf | dem_scaled_f_score | dem_corner_score |
|------|---------------|-----------------|---------------|--------------|-----------|-----------------------|----------------------|--------------------|------------------|
| auto | 37 | 0 | 1.0 | 0.000243 | 0.000486 | 0.83601 | 0.889307 | 0.861835 | 0.919467 |
| america forward | 28 | 0 | 1.0 | 0.000184 | 0.000368 | 0.83601 | 0.817094 | 0.826444 | 0.919436 |
| auto industry | 24 | 0 | 1.0 | 0.000158 | 0.000315 | 0.83601 | 0.777205 | 0.805536 | 0.919413 |
| insurance companies | 24 | 0 | 1.0 | 0.000158 | 0.000315 | 0.83601 | 0.777205 | 0.805536 | 0.919413 |
| pell | 23 | 0 | 1.0 | 0.000151 | 0.000302 | 0.83601 | 0.766509 | 0.799752 | 0.919404 |
| last week | 22 | 0 | 1.0 | 0.000144 | 0.000289 | 0.83601 | 0.755535 | 0.793738 | 0.919393 |
| pell grants | 21 | 0 | 1.0 | 0.000138 | 0.000276 | 0.83601 | 0.744288 | 0.787487 | 0.919381 |
| platform | 20 | 0 | 1.0 | 0.000131 | 0.000263 | 0.83601 | 0.732776 | 0.780996 | 0.919369 |
| women 's | 20 | 0 | 1.0 | 0.000131 | 0.000263 | 0.83601 | 0.732776 | 0.780996 | 0.919369 |
| millionaires | 18 | 0 | 1.0 | 0.000118 | 0.000236 | 0.83601 | 0.708996 | 0.767282 | 0.919333 |
In [17]:
term_freq_df = corpus.get_term_freq_df()
term_freq_df['Republican Score'] = corpus.get_scaled_f_scores('republican')
term_freq_df['Democratic Score'] = corpus.get_scaled_f_scores('democrat')
print("Top 10 Democratic terms")
pprint(list(term_freq_df.sort_values(by='Democratic Score', ascending=False).index[:10]))
print("Top 10 Republican terms")
pprint(list(term_freq_df.sort_values(by='Republican Score', ascending=False).index[:10]))
Top 10 Democratic terms
['auto',
 'america forward',
 'fought for',
 'insurance companies',
 'auto industry',
 'fair',
 'pell',
 'last week',
 'fighting for',
 'president barack']
Top 10 Republican terms
['unemployment',
 'do better',
 'liberty',
 'olympics',
 'built it',
 'it has',
 'ann',
 'reagan',
 'big government',
 'story of']

Make and visualize the chart, scaling positions by raw frequency.

- A word used 10 times by Republicans will be at position 10 on the x-axis

- This isn't very useful: everything but the most frequent terms is squished into the lower-left corner

- The corner-distance scores are largely stopwords

- By default, words are colored by Scaled F-Score

In [31]:
html = produce_scattertext_explorer(corpus,
                                    category='democrat',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    width_in_pixels=1000,
                                    minimum_term_frequency=5,
                                    transform=st.Scalers.scale,
                                    metadata=convention_df['speaker'])
file_name = 'output/Conventions2012ScattertextScale.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
Out[31]:

Using a log scale seems to help a bit, but blank space and stop words still dominate the graph

The characteristic terms look much more informative

In [32]:
html = st.produce_scattertext_explorer(corpus,
                                       category='democrat',
                                       category_name='Democratic',
                                       not_category_name='Republican',
                                       minimum_term_frequency=5,
                                       width_in_pixels=1000,
                                       transform=st.Scalers.log_scale_standardize)
file_name = 'output/Conventions2012ScattertextLog.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
Out[32]:

Rank terms by frequency percentiles instead of raw frequencies.

A term at the middle of the x-axis is mentioned by Republicans at the median frequency.

This nicely distributes terms throughout the space

But terms occurring with the same frequency in both classes are stacked atop each other.

You can't mouse over points that aren't at the top of a stack.

In [33]:
html = produce_scattertext_explorer(corpus,
                                    category='democrat',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    width_in_pixels=1000,
                                    minimum_term_frequency=5,
                                    transform=st.Scalers.percentile,
                                    metadata=convention_df['speaker'])
file_name = 'output/Conventions2012ScattertextRankData.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
Out[33]:

One solution is to randomly jitter each point

Points don't leave enough space for many labels

The top terms are largely an artifact of the jitter

In [34]:
html = produce_scattertext_explorer(corpus,
                                    category='democrat',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    width_in_pixels=1000,
                                    jitter=0.1,
                                    minimum_term_frequency=5,
                                    transform=st.Scalers.percentile,
                                    metadata=convention_df['speaker'])
file_name = 'output/Conventions2012ScattertextRankDataJitter.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
Out[34]:

The preferred solution is to fall back to alphabetic order among equally frequent terms

Lets you mouseover all points

Leaves a bit of room for labels

Top points may be slightly distorted

In [8]:
html = produce_scattertext_explorer(corpus,
                                    category='democrat',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    width_in_pixels=1000,
                                    minimum_term_frequency=5,
                                    metadata=convention_df['speaker'])
file_name = 'output/Conventions2012ScattertextRankDefault.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
Out[8]:
In [ ]:
html = produce_scattertext_explorer(corpus,
                                    category='democrat',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    width_in_pixels=1000,
                                    minimum_term_frequency=5,
                                    scores = corpus.get_logistic_regression_coefs_l1('democrat'),
                                    grey_zero_scores = True,
                                    metadata=convention_df['speaker'])
file_name = 'output/sparseviz.html'
open(file_name, 'wb').write(html.encode('utf-8'))
In [5]:
IFrame(src='output/sparseviz.html', width = 1200, height=700)
Out[5]:

Scattertext can also be used for alternative visualizations

Visualize L2-penalized logistic regression coefficients vs. log term frequency

In [23]:
def scale(ar): 
    return (ar - ar.min()) / (ar.max() - ar.min())

def zero_centered_scale(ar):
    scores = np.zeros(len(ar))
    scores[ar > 0] = scale(ar[ar > 0])
    scores[ar < 0] = -scale(-ar[ar < 0])
    return (scores + 1) / 2.

# Sum only the count columns; term_freq_df also holds the score columns added above
frequencies_scaled = scale(np.log(term_freq_df[['democrat freq', 'republican freq']].sum(axis=1).values))
In [24]:
from sklearn.linear_model import LogisticRegression
scores = corpus.get_logreg_coefs('democrat',
                                 LogisticRegression(penalty='l2', C=10, max_iter=10000, n_jobs=-1))
scores_scaled = zero_centered_scale(scores)

html = produce_scattertext_explorer(corpus,
                                    category='democrat',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    minimum_term_frequency=5,
                                    width_in_pixels=1000,
                                    x_coords=frequencies_scaled,
                                    y_coords=scores_scaled,
                                    scores=scores,
                                    sort_by_dist=False,
                                    metadata=convention_df['speaker'],
                                    x_label='Log frequency',
                                    y_label='L2-Penalized Log Reg Coef')
file_name = 'output/L2vsLog.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
/Users/kesslej/anaconda3/lib/python3.5/site-packages/sklearn/linear_model/logistic.py:1228: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = -1.
  " = {}.".format(self.n_jobs))
Out[24]:

We can see how this compares to Scaled F-Score

In [30]:
html = produce_scattertext_explorer(corpus,
                                    category='democrat',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    minimum_term_frequency=5,
                                    width_in_pixels=1000,
                                    x_coords=frequencies_scaled,
                                    y_coords=corpus.get_scaled_f_scores('democrat', beta=0.5),
                                    scores=corpus.get_scaled_f_scores('democrat', beta=0.5),
                                    sort_by_dist=False,
                                    metadata=convention_df['speaker'],
                                    x_label='Log Frequency',
                                    y_label='Scaled F-Score')
file_name = 'output/SFSvsLog.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
Out[30]:


Penalized log-odds-ratio

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis.

I like Scattertext, but dislike that particular metric. Lo freq / hi precision terms brittle / idiosyncratic (~like our Fig 2). JMO/YMMV/etc

$$ \mbox{Odds}(\mbox{word}\ |\ \mbox{class}) = \frac{\#(\mbox{word}, \mbox{class})}{\sum_{\mbox{word}' \neq \mbox{word}} \#(\mbox{word}', \mbox{class})} $$

$$ \mbox{Odds Ratio}(\mbox{word}\ |\ \mbox{class}_a, \mbox{class}_b) = \frac{\mbox{Odds}(\mbox{word}\ |\ \mbox{class}_a)}{\mbox{Odds}(\mbox{word}\ |\ \mbox{class}_b)} $$

Note: the log odds ratio quickly goes to $\infty$ if $P(\mbox{word}\ |\ \mbox{class}_b) = 0$, or to $-\infty$ if $P(\mbox{word}\ |\ \mbox{class}_a) = 0$.

Solution: smoothing. Assume every word appeared $\alpha$ (typically 0.01) times more than its observed count in each category. Take the log of the ratio and scale it by its variance (see Monroe et al. for details).

$$ \mbox{Odds}_{\mbox{uninformative Dirichlet prior}}(\mbox{word}\ |\ \mbox{class}) = \frac{\#(\mbox{word}, \mbox{class}) + \alpha}{\sum_{\mbox{word}' \neq \mbox{word}} \left(\#(\mbox{word}', \mbox{class}) + \alpha\right)} $$

Alternatively, one can use word counts from an in-domain, background corpus (the "informative Dirichlet prior").

$$ \mbox{Odds}_{\mbox{informative Dirichlet prior}}(\mbox{word}\ |\ \mbox{class}) = \frac{\#(\mbox{word}, \mbox{class}) + \#(\mbox{word}, \mbox{background corpus})}{\sum_{\mbox{word}' \neq \mbox{word}} \left(\#(\mbox{word}', \mbox{class}) + \#(\mbox{word}', \mbox{background corpus})\right)} $$
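The next few cells implement the uninformative-prior version by hand. As a minimal sketch, either prior could be plugged in the same way; here bg_counts is a hypothetical pandas Series of background-corpus term counts aligned to freq_df.index, and the simpler two-term variance approximation from Monroe et al. is used:

import numpy as np

def log_odds_with_prior(y_i, y_j, prior):
    # Sketch: z-scored log-odds-ratio with per-term prior counts,
    # in the style of Monroe et al. (2008)
    a = np.asarray(prior, dtype=float)
    a_0 = a.sum()
    n_i, n_j = y_i.sum(), y_j.sum()
    delta = (np.log((y_i + a) / (n_i + a_0 - y_i - a))
             - np.log((y_j + a) / (n_j + a_0 - y_j - a)))
    var = 1. / (y_i + a) + 1. / (y_j + a)   # simpler two-term variance approximation
    return delta / np.sqrt(var)

# With an informative prior, pass background-corpus counts aligned to freq_df.index,
# e.g. (bg_counts is hypothetical): log_odds_with_prior(y_i, y_j, bg_counts.values)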
In [10]:
freq_df = corpus.get_term_freq_df().rename(columns={'democrat freq': 'y_dem', 'republican freq': 'y_rep'})
a_w = 0.01
y_i, y_j = freq_df['y_dem'].values, freq_df['y_rep'].values
In [14]:
n_i, n_j = y_i.sum(), y_j.sum()
a_0 = len(freq_df) * a_w
delta_i_j = (  np.log((y_i + a_w) / (n_i + a_0 - y_i - a_w))
                 - np.log((y_j + a_w) / (n_j + a_0 - y_j - a_w)))
var_delta_i_j = ( 1./(y_i + a_w) + 1./(n_i + a_0 - y_i - a_w)
                    + 1./(y_j + a_w) + 1./(n_j + a_0 - y_j - a_w))
zeta_i_j = delta_i_j/np.sqrt(var_delta_i_j)
max_abs_zeta = max(zeta_i_j.max(), -zeta_i_j.min())
zeta_scaled_for_charting = ((((zeta_i_j > 0).astype(float) * (zeta_i_j/max_abs_zeta))*0.5 + 0.5)
                            + ((zeta_i_j < 0).astype(float) * (zeta_i_j/max_abs_zeta) * 0.5))
In [153]:
html = produce_scattertext_explorer(corpus,
                                    category='democrat',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    minimum_term_frequency=5,
                                    width_in_pixels=1000,
                                    x_coords=frequencies_scaled,
                                    y_coords=zeta_scaled_for_charting,
                                    scores=zeta_i_j,
                                    sort_by_dist=False,
                                    metadata=convention_df['speaker'],
                                    x_label='Log Frequency',
                                    y_label='Log Odds Ratio w/ Uninformative Prior (alpha_w=0.01)')
file_name = 'output/LOPriorvsLog.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
Out[153]:

And finally, the corner score
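A rough sketch of the idea (not Scattertext's exact implementation; corpus.get_corner_scores below is the real thing): percentile-rank each term's frequency within each category, then score terms by how close they sit to the corner where the category's rank is maximal and the other category's rank is minimal.

import numpy as np
from scipy.stats import rankdata

def corner_distance_scores(cat_counts, other_counts):
    # Sketch of a corner-distance score: percentile-rank each term's frequency
    # within each category, then measure closeness to the corner where the term
    # is frequent in the category and rare in the other one
    cat_rank = rankdata(cat_counts) / len(cat_counts)
    other_rank = rankdata(other_counts) / len(other_counts)
    dist = np.sqrt((1 - cat_rank) ** 2 + other_rank ** 2)
    return 1 - dist / np.sqrt(2)   # rescale so higher means more associated

# e.g. corner_distance_scores(term_freq_df['democrat freq'], term_freq_df['republican freq'])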

In [38]:
corner_scores = corpus.get_corner_scores('democrat')
html = produce_scattertext_explorer(corpus,
                                    category='democrat',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    minimum_term_frequency=5,
                                    width_in_pixels=1000,
                                    x_coords=frequencies_scaled,
                                    y_coords=corner_scores,
                                    scores=corner_scores,
                                    sort_by_dist=False,
                                    metadata=convention_df['speaker'],
                                    x_label='Log Frequency',
                                    y_label='Corner Scores')
file_name = 'output/CornervsLog.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
Out[38]:
In [ ]: