https://github.com/JasonKessler/scattertext
Cite as: Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations. 2017.
Link to preprint: https://arxiv.org/abs/1703.00565
@inproceedings{kessler2017scattertext,
  author = {Kessler, Jason S.},
  title = {Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ},
  booktitle = {ACL System Demonstrations},
  year = {2017},
}
%matplotlib inline
import scattertext as st
import re, io
from pprint import pprint
import pandas as pd
import numpy as np
from scipy.stats import rankdata, hmean, norm
import spacy
import os, pkgutil, json, urllib
from urllib.request import urlopen
from IPython.display import IFrame
from IPython.display import display, HTML
from scattertext import CorpusFromPandas, produce_scattertext_explorer
display(HTML("<style>.container { width:98% !important; }</style>"))
nlp = spacy.load('en_core_web_sm')
# If spaCy or its English model isn't available, uncomment the following line to use a regex-based parser instead
#nlp = st.whitespace_nlp_with_sentences
convention_df = st.SampleCorpora.ConventionData2012.get_data()
convention_df.iloc[0]
party                                               democrat
speaker                                         BARACK OBAMA
text       Thank you. Thank you. Thank you. Thank you so ...
Name: 0, dtype: object
print("Document Count")
print(convention_df.groupby('party')['text'].count())
print("Word Count")
convention_df.groupby('party').apply(lambda x: x.text.apply(lambda x: len(x.split())).sum())
convention_df['parsed'] = convention_df.text.apply(nlp)
Document Count
party
democrat      123
republican     66
Name: text, dtype: int64
Word Count
corpus = st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parsed').build()
Given a word $w_i \in W$ and a category $c_j \in C$, define the precision of the word $w_i$ with respect to a category as: $$ \mbox{prec}(w_i, c_j) = \frac{\#(w_i, c_j)}{\sum_{c \in C} \#(w_i, c)}. $$
The function $\#(w_i, c_j)$ represents either the number of times $w_i$ occurs in a document labeled with the category $c_j$ or the number of documents labeled $c_j$ which contain $w_i$.
Similarly, define the frequency with which a word occurs in the category as:

$$ \mbox{freq}(w_i, c_j) = \frac{\#(w_i, c_j)}{\sum_{w \in W} \#(w, c_j)}. $$

The F-Score of these two values is defined as:

$$ \mathcal{F}_\beta(\mbox{prec}, \mbox{freq}) = (1 + \beta^2) \frac{\mbox{prec} \cdot \mbox{freq}}{\beta^2 \cdot \mbox{prec} + \mbox{freq}}. $$

$\beta \in \mathbb{R}^+$ is a scaling factor: precision is favored if $\beta < 1$, frequency if $\beta > 1$, and both are weighted equally if $\beta = 1$, in which case the F-Score reduces to the harmonic mean of precision and frequency.
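As a quick numeric sanity check (plain Python, not part of Scattertext's API), the F-Score can be computed directly; the toy values below illustrate how $\beta$ shifts the balance between precision and frequency:

```python
def f_beta(prec, freq, beta=1.0):
    """F-Score: (1 + beta^2) * prec * freq / (beta^2 * prec + freq)."""
    if prec == 0 or freq == 0:
        return 0.0
    return (1 + beta ** 2) * prec * freq / (beta ** 2 * prec + freq)

# A high-precision, low-frequency term (toy numbers):
prec, freq = 0.9, 0.1
print(f_beta(prec, freq, beta=1.0))  # harmonic mean: 0.18
print(f_beta(prec, freq, beta=0.5))  # > 0.18: pulled toward precision (0.9)
print(f_beta(prec, freq, beta=2.0))  # < 0.18: pulled toward frequency (0.1)
```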
term_freq_df = corpus.get_term_freq_df()
term_freq_df['dem_precision'] = term_freq_df['democrat freq'] * 1./(term_freq_df['democrat freq'] + term_freq_df['republican freq'])
term_freq_df['dem_freq_pct'] = term_freq_df['democrat freq'] * 1./term_freq_df['democrat freq'].sum()
term_freq_df['dem_hmean'] = term_freq_df.apply(lambda x: (hmean([x['dem_precision'], x['dem_freq_pct']])
if x['dem_precision'] > 0 and x['dem_freq_pct'] > 0
else 0), axis=1)
term_freq_df.sort_values(by='dem_hmean', ascending=False).iloc[:10]
| term | democrat freq | republican freq | dem_precision | dem_freq_pct | dem_hmean |
|---|---|---|---|---|---|
| the | 3402 | 2532 | 0.573306 | 0.022343 | 0.043009 |
| and | 2709 | 2233 | 0.548159 | 0.017791 | 0.034464 |
| to | 2340 | 1667 | 0.583978 | 0.015368 | 0.029948 |
| a | 1602 | 1345 | 0.543604 | 0.010521 | 0.020643 |
| of | 1569 | 1377 | 0.532587 | 0.010304 | 0.020218 |
| that | 1400 | 1051 | 0.571195 | 0.009195 | 0.018098 |
| we | 1318 | 1146 | 0.534903 | 0.008656 | 0.017036 |
| in | 1291 | 986 | 0.566974 | 0.008479 | 0.016708 |
| i | 1098 | 851 | 0.563366 | 0.007211 | 0.014240 |
| 's | 1037 | 631 | 0.621703 | 0.006811 | 0.013473 |
Define the Normal CDF as:

$$ \Phi(z) = \int_{-\infty}^z \mathcal{N}(x; \mu, \sigma^2)\ \mathrm{d}x, $$

where $\mathcal{N}$ is the PDF of the Normal distribution, $\mu$ is the mean, and $\sigma^2$ is the variance.

$\Phi$ standardizes the precisions and frequencies, placing them on the same $[0, 1]$ scale.
Now we can define Scaled F-Score as the harmonic mean of the Normal CDF transformed frequency and precision:
$$ \mathcal{S}_{\beta}(w_i, c_j) = \mathcal{F}_{\beta}(\Phi(\mbox{prec}(w_i, c_j)), \Phi(\mbox{freq}(w_i, c_j))). $$

Here, $\mu$ and $\sigma^2$ are computed separately for precision and for frequency, as the mean and variance of each.
A $\beta$ of 0.5 is recommended and is the default value in Scattertext.
Note that any function with the range of $[0,1]$ (this includes the identity function) may be used in place of $\Phi$.
def normcdf(x):
return norm.cdf(x, x.mean(), x.std())
term_freq_df['dem_precision_normcdf'] = normcdf(term_freq_df['dem_precision'])
term_freq_df['dem_freq_pct_normcdf'] = normcdf(term_freq_df['dem_freq_pct'])
term_freq_df['dem_scaled_f_score'] = hmean([term_freq_df['dem_precision_normcdf'], term_freq_df['dem_freq_pct_normcdf']])
term_freq_df.sort_values(by='dem_scaled_f_score', ascending=False).iloc[:10]
| term | democrat freq | republican freq | dem_precision | dem_freq_pct | dem_hmean | dem_precision_normcdf | dem_freq_pct_normcdf | dem_scaled_f_score |
|---|---|---|---|---|---|---|---|---|
| middle class | 148 | 18 | 0.891566 | 0.000972 | 0.001942 | 0.769762 | 1.000000 | 0.869905 |
| auto | 37 | 0 | 1.000000 | 0.000243 | 0.000486 | 0.836010 | 0.889307 | 0.861835 |
| fair | 45 | 3 | 0.937500 | 0.000296 | 0.000591 | 0.799485 | 0.933962 | 0.861507 |
| insurance | 54 | 6 | 0.900000 | 0.000355 | 0.000709 | 0.775397 | 0.965959 | 0.860251 |
| forward | 105 | 16 | 0.867769 | 0.000690 | 0.001378 | 0.753443 | 0.999858 | 0.859334 |
| president barack | 47 | 4 | 0.921569 | 0.000309 | 0.000617 | 0.789447 | 0.942572 | 0.859241 |
| class | 161 | 25 | 0.865591 | 0.001057 | 0.002112 | 0.751919 | 1.000000 | 0.858395 |
| middle | 164 | 27 | 0.858639 | 0.001077 | 0.002151 | 0.747021 | 1.000000 | 0.855194 |
| the middle | 98 | 17 | 0.852174 | 0.000644 | 0.001286 | 0.742422 | 0.999640 | 0.852041 |
| medicare | 84 | 15 | 0.848485 | 0.000552 | 0.001103 | 0.739778 | 0.998050 | 0.849722 |
term_freq_df['dem_corner_score'] = corpus.get_corner_scores('democrat')
term_freq_df.sort_values(by='dem_corner_score', ascending=False).iloc[:10]
| term | democrat freq | republican freq | dem_precision | dem_freq_pct | dem_hmean | dem_precision_normcdf | dem_freq_pct_normcdf | dem_scaled_f_score | dem_corner_score |
|---|---|---|---|---|---|---|---|---|---|
| auto | 37 | 0 | 1.0 | 0.000243 | 0.000486 | 0.83601 | 0.889307 | 0.861835 | 0.919467 |
| america forward | 28 | 0 | 1.0 | 0.000184 | 0.000368 | 0.83601 | 0.817094 | 0.826444 | 0.919436 |
| auto industry | 24 | 0 | 1.0 | 0.000158 | 0.000315 | 0.83601 | 0.777205 | 0.805536 | 0.919413 |
| insurance companies | 24 | 0 | 1.0 | 0.000158 | 0.000315 | 0.83601 | 0.777205 | 0.805536 | 0.919413 |
| pell | 23 | 0 | 1.0 | 0.000151 | 0.000302 | 0.83601 | 0.766509 | 0.799752 | 0.919404 |
| last week | 22 | 0 | 1.0 | 0.000144 | 0.000289 | 0.83601 | 0.755535 | 0.793738 | 0.919393 |
| pell grants | 21 | 0 | 1.0 | 0.000138 | 0.000276 | 0.83601 | 0.744288 | 0.787487 | 0.919381 |
| platform | 20 | 0 | 1.0 | 0.000131 | 0.000263 | 0.83601 | 0.732776 | 0.780996 | 0.919369 |
| women 's | 20 | 0 | 1.0 | 0.000131 | 0.000263 | 0.83601 | 0.732776 | 0.780996 | 0.919369 |
| millionaires | 18 | 0 | 1.0 | 0.000118 | 0.000236 | 0.83601 | 0.708996 | 0.767282 | 0.919333 |
term_freq_df = corpus.get_term_freq_df()
term_freq_df['Republican Score'] = corpus.get_scaled_f_scores('republican')
term_freq_df['Democratic Score'] = corpus.get_scaled_f_scores('democrat')
print("Top 10 Democratic terms")
pprint(list(term_freq_df.sort_values(by='Democratic Score', ascending=False).index[:10]))
print("Top 10 Republican terms")
pprint(list(term_freq_df.sort_values(by='Republican Score', ascending=False).index[:10]))
Top 10 Democratic terms
['auto', 'america forward', 'fought for', 'insurance companies', 'auto industry',
 'fair', 'pell', 'last week', 'fighting for', 'president barack']
Top 10 Republican terms
['unemployment', 'do better', 'liberty', 'olympics', 'built it', 'it has', 'ann',
 'reagan', 'big government', 'story of']
html = produce_scattertext_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
width_in_pixels=1000,
minimum_term_frequency=5,
transform=st.Scalers.scale,
metadata=convention_df['speaker'])
file_name = 'output/Conventions2012ScattertextScale.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
html = st.produce_scattertext_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
minimum_term_frequency=5,
width_in_pixels=1000,
transform=st.Scalers.log_scale_standardize)
file_name = 'output/Conventions2012ScattertextLog.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
html = produce_scattertext_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
width_in_pixels=1000,
minimum_term_frequency=5,
transform=st.Scalers.percentile,
metadata=convention_df['speaker'])
file_name = 'output/Conventions2012ScattertextRankData.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
html = produce_scattertext_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
width_in_pixels=1000,
jitter=0.1,
minimum_term_frequency=5,
transform=st.Scalers.percentile,
metadata=convention_df['speaker'])
file_name = 'output/Conventions2012ScattertextRankDataJitter.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
html = produce_scattertext_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
width_in_pixels=1000,
minimum_term_frequency=5,
metadata=convention_df['speaker'])
file_name = 'output/Conventions2012ScattertextRankDefault.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
html = produce_scattertext_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
width_in_pixels=1000,
minimum_term_frequency=5,
scores=corpus.get_logistic_regression_coefs_l1('democrat'),
grey_zero_scores=True,
metadata=convention_df['speaker'])
file_name = 'output/sparseviz.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src='output/sparseviz.html', width = 1200, height=700)
def scale(ar):
return (ar - ar.min()) / (ar.max() - ar.min())
def zero_centered_scale(ar):
scores = np.zeros(len(ar))
scores[ar > 0] = scale(ar[ar > 0])
scores[ar < 0] = -scale(-ar[ar < 0])
return (scores + 1) / 2.
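The helper maps negative scores into $[0, 0.5]$ and positive scores into $[0.5, 1]$, with zero landing at 0.5; a quick check on toy coefficients (the helpers are repeated here so the cell runs standalone):

```python
import numpy as np

# Repeated from the cell above so this example is self-contained.
def scale(ar):
    return (ar - ar.min()) / (ar.max() - ar.min())

def zero_centered_scale(ar):
    scores = np.zeros(len(ar))
    scores[ar > 0] = scale(ar[ar > 0])
    scores[ar < 0] = -scale(-ar[ar < 0])
    return (scores + 1) / 2.

# Note the edge behavior: the least-negative and least-positive values
# both land on 0.5, alongside exact zeros.
coefs = np.array([-4., -1., 0., 2., 8.])
print(zero_centered_scale(coefs))  # maps to [0, 0.5, 0.5, 0.5, 1]
```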
frequencies_scaled = scale(np.log(term_freq_df.sum(axis=1).values))
from sklearn.linear_model import LogisticRegression
scores = corpus.get_logreg_coefs('democrat',
                                 LogisticRegression(penalty='l2', C=10, max_iter=10000))
scores_scaled = zero_centered_scale(scores)
html = produce_scattertext_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
minimum_term_frequency=5,
width_in_pixels=1000,
x_coords=frequencies_scaled,
y_coords=scores_scaled,
scores=scores,
sort_by_dist=False,
metadata=convention_df['speaker'],
x_label='Log frequency',
y_label='L2-Penalized Log Reg Coef')
file_name = 'output/L2vsLog.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
html = produce_scattertext_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
minimum_term_frequency=5,
width_in_pixels=1000,
x_coords=frequencies_scaled,
y_coords=corpus.get_scaled_f_scores('democrat', beta=0.5),
scores=corpus.get_scaled_f_scores('democrat', beta=0.5),
sort_by_dist=False,
metadata=convention_df['speaker'],
x_label='Log Frequency',
y_label='Scaled F-Score')
file_name = 'output/SFSvsLog.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
Given a word $w_i \in W$ and a category $c_j \in C$, define the precision of the word $w_i$ with respect to a category as:

$$ \mbox{prec}(w_i, c_j) = \frac{\#(w_i, c_j)}{\sum_{c \in C} \#(w_i, c)}. $$

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis.
Note: the odds ratio can quickly go to $\infty$ or $-\infty$ when $P(\mbox{word}\ |\ \mbox{class}_a) = 0$ or $P(\mbox{word}\ |\ \mbox{class}_b) = 0$.

Solution: smoothing. Add a pseudo-count $\alpha$ (typically 0.01) to each word's count in each category, take the log of the ratio, and scale by the variance (see Monroe et al. for details).
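To see why the smoothing matters, here is a minimal sketch with toy counts (not Scattertext code): for a word that never appears in one class, the raw log-odds ratio is infinite, while the $\alpha$-smoothed version stays finite:

```python
import numpy as np

# Toy counts: the word appears 10 times in class a, never in class b.
y_a, y_b = 10.0, 0.0
n_a, n_b = 1000.0, 1000.0   # total token counts per class
a_w = 0.01                  # per-word pseudo-count (alpha)
a_0 = 5000 * a_w            # alpha summed over a hypothetical 5,000-word vocabulary

# Unsmoothed log-odds ratio blows up: log(0) on the class-b side.
with np.errstate(divide='ignore'):
    raw = np.log(y_a / (n_a - y_a)) - np.log(y_b / (n_b - y_b))
print(raw)  # inf

# Smoothed (uninformative Dirichlet prior) log-odds ratio is finite.
smoothed = (np.log((y_a + a_w) / (n_a + a_0 - y_a - a_w))
            - np.log((y_b + a_w) / (n_b + a_0 - y_b - a_w)))
print(smoothed)
```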
$$ \mbox{Odds}_{\mbox{uninformative Dirichlet prior}}(\mbox{word}\ |\ \mbox{class}) = \frac{\#(\mbox{word}, \mbox{class}) + \alpha}{\sum_{\mbox{word}' \neq \mbox{word}} \alpha + \#(\mbox{word}', \mbox{class})} $$

Alternatively, one can use word counts from an in-domain background corpus (the "informative Dirichlet prior"):

$$ \mbox{Odds}_{\mbox{informative Dirichlet prior}}(\mbox{word}\ |\ \mbox{class}) = \frac{\#(\mbox{word}, \mbox{class}) + \#(\mbox{word}, \mbox{background corpus})}{\sum_{\mbox{word}' \neq \mbox{word}} \#(\mbox{word}', \mbox{background corpus}) + \#(\mbox{word}', \mbox{class})} $$

freq_df = corpus.get_term_freq_df().rename(columns={'democrat freq': 'y_dem', 'republican freq': 'y_rep'})
a_w = 0.01
y_i, y_j = freq_df['y_dem'].values, freq_df['y_rep'].values
n_i, n_j = y_i.sum(), y_j.sum()
a_0 = len(freq_df) * a_w
delta_i_j = ( np.log((y_i + a_w) / (n_i + a_0 - y_i - a_w))
- np.log((y_j + a_w) / (n_j + a_0 - y_j - a_w)))
var_delta_i_j = ( 1./(y_i + a_w) + 1./(n_i + a_0 - y_i - a_w)
+ 1./(y_j + a_w) + 1./(n_j + a_0 - y_j - a_w))
zeta_i_j = delta_i_j/np.sqrt(var_delta_i_j)
max_abs_zeta = max(zeta_i_j.max(), -zeta_i_j.min())
zeta_scaled_for_charting = ((((zeta_i_j > 0).astype(float) * (zeta_i_j/max_abs_zeta))*0.5 + 0.5)
+ ((zeta_i_j < 0).astype(float) * (zeta_i_j/max_abs_zeta) * 0.5))
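The code above implements only the uninformative prior. A sketch of the informative-prior variant follows, with simulated background counts standing in for a real in-domain corpus; the array names and toy values here are illustrative, not Scattertext API:

```python
import numpy as np

# Toy per-word counts for two classes, plus simulated background-corpus
# counts; in practice the background counts would come from a large
# in-domain corpus aligned with the same vocabulary.
y_i = np.array([148., 37., 3402.])    # class i counts
y_j = np.array([18., 0., 2532.])      # class j counts
bg = np.array([500., 80., 90000.])    # background counts (the prior)

n_i, n_j, n_bg = y_i.sum(), y_j.sum(), bg.sum()

# Same delta / variance / z-score recipe as above, with the background
# counts bg replacing a_w and their total n_bg replacing a_0.
delta = (np.log((y_i + bg) / (n_i + n_bg - y_i - bg))
         - np.log((y_j + bg) / (n_j + n_bg - y_j - bg)))
var = (1. / (y_i + bg) + 1. / (n_i + n_bg - y_i - bg)
       + 1. / (y_j + bg) + 1. / (n_j + n_bg - y_j - bg))
zeta = delta / np.sqrt(var)
print(zeta)
```

Note that the word with zero count in class j no longer produces an infinite score: the background count acts as the smoothing mass.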
html = produce_scattertext_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
minimum_term_frequency=5,
width_in_pixels=1000,
x_coords=frequencies_scaled,
y_coords=zeta_scaled_for_charting,
scores=zeta_i_j,
sort_by_dist=False,
metadata=convention_df['speaker'],
x_label='Log Frequency',
y_label='Log Odds Ratio w/ Uninformative Prior (alpha_w=0.01)')
file_name = 'output/LOPriorvsLog.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
corner_scores = corpus.get_corner_scores('democrat')
html = produce_scattertext_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
minimum_term_frequency=5,
width_in_pixels=1000,
x_coords=frequencies_scaled,
y_coords=corner_scores,
scores=corner_scores,
sort_by_dist=False,
metadata=convention_df['speaker'],
x_label='Log Frequency',
y_label='Corner Scores')
file_name = 'output/CornervsLog.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)