Evaluating Topic Models¶

PyData Berlin 2017 Talk¶

This notebook is a companion to the talk I gave at the PyData Berlin 2017 conference on evaluating topic models¶

Unsupervised models in natural language processing (NLP) have a long history but have recently become very popular. Word2vec, GloVe, LSI and LDA provide powerful computational tools to deal with natural language and make exploring and modelling large document collections feasible.

Often evaluating the model output requires an existing understanding of what should come out. For topic models the output should reflect our understanding of the relatedness of topical categories, for instance sports, travel or machine learning. Distributional models of language such as word2vec and GloVe should capture some, or ideally all, of the semantics of how language is used.

This is a lot to ask! Not necessarily because it isn't learnable, after all we've learned it, but because we are not necessarily able to represent the desired output as an evaluation function and data set that can be optimised. As an example, topic models are often evaluated with respect to the semantic coherence of the topics, based on a set of top words from each topic. It is not clear if a set of words such as {cat, dog, horse, pet} fully captures the semantics of animalness or petsiness. Nevertheless, these methods are useful in determining if the distributed word representations are capturing some of the information conveyed by words and if a topic model is understandable to a human.

This notebook explores a number of these issues in context and aims to provide an overview of the research that has been done in the past 10 or so years, mostly focusing on topic models.

The notebook is split into three parts

Eye Balling models
- ways of making visual, manual inspection of models easier
Intrinsic Evaluation Methods
- how to measure the internal coherence of topic models
Putting a Number on Human Judgements
- quantitative methods for evaluating human judgement

Random collection of other stuff

While preparing the talk and the notebook I experimented with a lot of different software packages and corpora. These are dumped as a somewhat unorganised collection of "other things" at the end of the notebook.

Why Evaluate Models¶

We would like to be able to say if a model is objectively good or bad, and compare different models to each other. This requires us to have an objective measure for the quality of the model but many of the tasks mentioned above require subjective evaluation.

In practical applications one needs to evaluate if "the correct thing" has been learned. Often this means applying implicit knowledge and "eye-balling": documents that talk about football should be in the same category, and cat is more similar to dog than to pen. Ideally this information should be captured in a single metric that can be maximised. It is not clear how to formulate such a metric, however. Over the years there have been numerous attempts, from various angles, at formulating semantic coherence. None capture the desired outcome fully, and there are numerous issues one should be aware of when applying those metrics.

Some of the issues are related to the metrics themselves, but others are related to external factors, like which kind of held out data to use. Natural language is messy, ambiguous and full of interpretation; that's where a lot of the expressive richness comes from. Sometimes trying to cleanse the ambiguity also reduces language to an unnatural form.

Topic Models¶

Topic models aim to capture the way in which words co-occur in the context of a document and divide the source corpus into some number of (soft) clusters. There are a large number of variations on the topic model. Initial work was done by Deerwester in developing Latent Semantic Analysis (LSA/LSI); now the canonical example is Latent Dirichlet Allocation, or LDA. The unit upon which topic models work is a sparse document-term matrix, depicted below.

Each row is a document, each column is a term and the entries in each cell usually represent the frequency of each term in each document.

In [127]:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

documents = ['The cat sat on the mat', 'A red cat sat on the mat', 'No cat sat on the mat']

vectoriser = CountVectorizer().fit(documents)
X = vectoriser.transform(documents)
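# toarray() densifies the sparse matrix; columns are ordered by their index in the vocabulary
pd.DataFrame(X.toarray(), columns=sorted(vectoriser.vocabulary_.keys(), key=lambda k: vectoriser.vocabulary_[k]))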

Out[127]:

	cat	mat	no	on	red	sat	the
0	1	1	0	1	0	1	2
1	1	1	0	1	1	1	1
2	1	1	1	1	0	1	1

In the case of LDA a Bayesian model is fitted using the data from this matrix. Each topic in the model becomes a probability distribution over terms (the columns). Conceptually this is saying that semantic concepts can be represented as probabilities over a set of words. This makes sense, as the topic of discussion acts as a limiting factor on the vocabulary one is likely to use, hear or read in the context of that discussion.

Words relating to political campaigning are much less likely to be observed in documents that discuss ice hockey. Notice, however, that this is unlikely, not impossible: it is simply statistically less likely that caucus or polling will appear in a document that otherwise discusses Teemu Selänne retiring. A topic therefore is a probability distribution over the entire vocabulary, indicating how likely each word is to occur within that topic.
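To make the "unlikely, not impossible" point concrete, here is a minimal sketch with two hand-made topics. The vocabulary, the topic names and every probability are invented for illustration and do not come from any fitted model. Each topic assigns a non-zero probability to every word, so a hockey document containing caucus is still scored, just with a lower likelihood.

```python
import math

# Toy, hand-picked probabilities (illustrative only, not from a fitted model).
vocabulary = ["goal", "puck", "retire", "caucus", "polling", "campaign"]
topics = {
    "ice_hockey": [0.40, 0.35, 0.20, 0.01, 0.02, 0.02],
    "politics":   [0.02, 0.01, 0.07, 0.30, 0.30, 0.30],
}

def log_likelihood(words, topic_probs):
    """Sum of the log-probabilities of the words under a single topic."""
    lookup = dict(zip(vocabulary, topic_probs))
    return sum(math.log(lookup[word]) for word in words)

# A hockey document that happens to mention a political term.
document = ["puck", "goal", "retire", "caucus"]
scores = {name: log_likelihood(document, probs) for name, probs in topics.items()}
best_topic = max(scores, key=scores.get)
print(scores, best_topic)
```

A real LDA model additionally mixes several topics per document; this sketch only compares single topics.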

The documents the model is built over can be as short as a single sentence (a tweet) or as long as a chapter in a book. Typically very short documents tend to be more difficult to build coherent models over than slightly longer documents.

Open source implementations of the models are readily available¶

Latent Semantic Indexing / Latent Semantic Analysis
- http://radimrehurek.com/gensim/models/lsimodel.html
Latent Dirichlet Allocation (LDA)
- and its many many many variants
- http://radimrehurek.com/gensim/models/ldamodel.html
- http://mallet.cs.umass.edu/topics.php
Hierarchical Dirichlet Process (HDP)
- http://radimrehurek.com/gensim/models/hdpmodel.html
Spherical Hierarchical Dirichlet Process (sHDP)
- http://arxiv.org/pdf/1604.00126v1.pdf
- https://github.com/Ardavans/sHDP

A Model¶

In order to evaluate a model, we must of course have one. I'll use the same model(s), built from the Fake News data set on Kaggle, throughout this notebook.

In [4]:

import pandas as pd

df_fake = pd.read_csv('/usr/local/scratch/data/kaggle/fake.csv')
df_fake[['title', 'text', 'language']].head()

Out[4]:

	title	text	language
0	Muslims BUSTED: They Stole Millions In Gov’t B...	Print They should pay all the back all the mon...	english
1	Re: Why Did Attorney General Loretta Lynch Ple...	Why Did Attorney General Loretta Lynch Plead T...	english
2	BREAKING: Weiner Cooperating With FBI On Hilla...	Red State : \nFox News Sunday reported this mo...	english
3	PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...	Email Kayla Mueller was a prisoner and torture...	english
4	FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...	Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...	english

In [5]:

import numpy as np

df_fake = df_fake.loc[(pd.notnull(df_fake.text)) & (df_fake.language == 'english')]
df_fake.shape

Out[5]:

(12357, 20)

There is a total of 12357 non-empty English language documents, which should be enough to build a model. Let's parse the documents using spaCy, getting rid of some non-content words, and chuck that into gensim. I'll use gensim.corpora.MmCorpus to serialise the text onto disk. This both saves memory and allows random access to the corpus, which will become useful later for creating different splits of the data.

In [4]:

import spacy

import gensim
from gensim.models import LdaModel
from gensim.corpora import Dictionary, MmCorpus

spc = spacy.load('en')

# Keep content-bearing parts of speech; string tags avoid depending on version-specific integer IDs
KEEP_POS = {'NOUN', 'VERB', 'ADJ', 'ADV', 'PROPN'}
pipe = spc.pipe(df_fake.text, parse=False, entity=False, n_threads=8)
processed = [[token.lemma_ for token in document if token.pos_ in KEEP_POS] for document in pipe]

vocabulary = Dictionary(processed)
vocabulary.filter_extremes(no_below=3, no_above=0.5)

In [133]:

MmCorpus.serialize('./fake_news.mm', (vocabulary.doc2bow(doc) for doc in processed))

In [8]:

vocabulary.save('./fake_news.vocab')

In [ ]:

del processed

In [5]:

corpus_fake = MmCorpus('./fake_news.mm')
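Random access is what makes splitting cheap: only document indices need to be shuffled, not the documents themselves. The following is a minimal sketch of that idea. The helper name `split_indices`, the 10% held-out fraction and the seed are all assumptions made for illustration, not something the notebook prescribes.

```python
import random

def split_indices(n_documents, held_out_fraction=0.1, seed=42):
    """Shuffle document indices and split them into training and held-out lists."""
    indices = list(range(n_documents))
    random.Random(seed).shuffle(indices)
    n_held_out = int(n_documents * held_out_fraction)
    return sorted(indices[n_held_out:]), sorted(indices[:n_held_out])

# 12357 is the number of documents reported above.
train_idx, held_out_idx = split_indices(12357)
print(len(train_idx), len(held_out_idx))

# Assumption: an indexed gensim corpus can then be subset with these lists,
# e.g. corpus_fake[train_idx]; check this against the installed gensim version.
```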

In [137]:

lda_fake = LdaModel(corpus=corpus_fake, id2word=vocabulary, num_topics=35, chunksize=1500, iterations=200, alpha='auto')

Inspecting the top six words from each topic in the model, we can certainly identify some structure. There are topics about the Flint, Michigan water crisis (Topic 11), the Dakota Access Pipeline protests (Topic 9) and the US elections.

In [7]:

# One row per topic, one column per rank of the top words
pd.DataFrame([[word for word, prob in words]
              for topic_id, words in lda_fake.show_topics(formatted=False, num_words=6, num_topics=35)])

Out[7]:

	0	1	2	3	4	5
0	dna	video	abortion	veritas	project	know
1	al	syrian	force	mosul	aleppo	isis
2	black	obama	white	people	soros	america
3	china	which	alien	would	there	ufo
4	get	go	’	just	know	there
5	russia	russian	putin	nato	missile	nuclear
6	market	cancer	product	bank	study	report
7	use	drug	health	company	report	medical
8	post	comment	facebook	news	video	result
9	pipeline	dakota	police	water	rock	protester
10	child	school	police	family	when	people
11	water	study	health	toxic	flint	tea
12	clinton	campaign	hillary	email	podesta	wikileaks
13	clinton	assange	hillary	there	trump	would
14	obama	money	bank	hillary	president	make
15	party	care	kelly	insurance	obamacare	year
16	trump	vote	clinton	election	hillary	voter
17	people	world	other	man	which	when
18	ukraine	ukrainian	india	japan	fukushima	us
19	turkey	turkish	rebel	adl	erdogan	glacier
20	text	god	which	moon	body	silver
21	fake	earthquake	california	rodgers	jay	quake
22	eddie	government	people	country	would	which
23	which	would	egypt	britain	news	report
24	war	us	syria	russia	military	country
25	israel	israeli	nuclear	zika	palestinian	resolution
26	life	time	other	year	find	people
27	herb	universe	which	korea	planet	x
28	trump	election	day	just	go	get
29	trump	would	president	donald	election	clinton
30	immigration	immigrant	illegal	percent	migrant	new
31	law	court	judge	case	federal	government
32	food	energy	oil	also	use	which
33	clinton	fbi	email	investigation	hillary	comey
34	jews	jewish	which	mars	only	american

In [9]:

lda_fake.save('./fake_news_35.lda')

Eye Ballin'¶

Usually ML models are evaluated and improved based on a scoring function whose gradient can be followed to a hopefully global minimum.
Unsupervised models are tricky to evaluate because there usually isn't a suitable error function to optimise.
The unsupervised models still come with hyperparameters, so how do you know when you've set them correctly?
Furthermore, how do you know if model A is better than model B?

Termite¶

The visualisations in the Termite paper look very promising, but I've been unable to run the code. The original project has been split into two separate projects: a data server and a visualisation client. Unfortunately the data server uses an unknown data format in SQLite databases, the host server where the data sets ought to be is no longer operational, and the project hasn't been maintained since 2014.

The project also relies on web2py, which at the moment only supports Python 2, and there doesn't seem to be any interest in porting it to Python 3.

Anyhow, it would seem to be possible to run the project under a Python 2 environment with the following changes:

modify read_gensim.py to add --sentence-splitter cmd arg
modify bin/apps/SplitSentences.py to have an extra param for sentence_splitter jar location
update code for gensim API breaking changes
- bin/readers/GensimReader.py line 47 ldamodel.show_topics
- bin/readers/GensimReader.py line 51 topic/term distribution does not need enumerate anymore
- bin/readers/GensimReader.py line 52 swap term and value around - they are the wrong way around
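The last fix in the list above (swapping term and value) can be sketched as a small helper that accepts a pair in either order. This is only an illustration: `as_term_weight` is a hypothetical name, the sample entries are mocked up rather than real gensim output, and the claim that older gensim releases yielded (value, term) while newer ones yield (term, value) is an assumption drawn from the list above. The sketch is written for Python 3; under Python 2 the string check would need `basestring`.

```python
def as_term_weight(pair):
    """Return (term, weight) regardless of the order in which the pair arrives."""
    first, second = pair
    if isinstance(first, str):
        return first, float(second)
    return second, float(first)

# Mocked-up entries standing in for one topic's term distribution.
old_style = [(0.012, "israel"), (0.006, "iran")]
new_style = [("israel", 0.012), ("iran", 0.006)]

normalised_old = [as_term_weight(p) for p in old_style]
normalised_new = [as_term_weight(p) for p in new_style]
print(normalised_old == normalised_new)
```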

Termite makes a lot of assumptions about paths, so one needs to be quite careful about which root directory the commands are run from.

In [22]:

import sys

sys.path.append('/home/matti/termite-data-server/bin/')

from modellers import GensimLDA

In [29]:

import re
df_fake['text_oneline'] = df_fake.text.apply(lambda s: re.sub(r'\s+', ' ', str(s)))

In [30]:

df_fake[['uuid', 'text_oneline']].to_csv('./fakenews.termite.tsv', sep='\t', header=False, index=False)
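The two cells above collapse all whitespace before writing a headerless TSV, presumably because the import expects one document per line with the id and the text separated by a single tab. That is an inference from the cells, not documented Termite behaviour. A minimal standalone sketch of the same transformation, with a hypothetical helper name and a made-up document, shows why the collapse matters:

```python
import re

def to_tsv_line(doc_id, text):
    """Collapse every whitespace run (tabs and newlines included) into one space."""
    one_line = re.sub(r'\s+', ' ', str(text))
    return '{}\t{}'.format(doc_id, one_line)

# Made-up document containing both a newline and a tab.
line = to_tsv_line('doc-1', 'Red State :\nFox News\tSunday')
print(repr(line))

# Exactly one tab (the field separator) and no newline remain in the record.
```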

In [82]:

py27 = '/home/matti/miniconda3/envs/py27/bin/python'
termite_server_root = '/home/matti/termite-data-server/'

First we need to import the corpus into termite's own special SQLite format

In [76]:

!mkdir termite;\
    cp ./fakenews.termite.tsv ./termite;\
    cd termite; $py27 /home/matti/termite-data-server/bin/import_corpus.py ./db ./fakenews.termite.tsv

mkdir: cannot create directory ‘termite’: File exists
WARNING:root:Unable to import plural rules: No module named plural_rules
Importing file [./fakenews.termite.tsv] into database [./db/corpus.db]

Then we need to export that SQLite DB back into a text corpus. There are some magic file names and path structures involved here, so you can't just use the original file.

In [79]:

!cd termite; mkdir corpus; $py27 /home/matti/termite-data-server/bin/export_corpus.py ./db ./corpus/corpus.txt

WARNING:root:Unable to import plural rules: No module named plural_rules
Exporting database [./db/corpus.db] to file [./corpus/corpus.txt]

Then train the LDA model. It should be possible to skip this and just use any model trained with gensim.

In [81]:

%capture
!cd termite; $py27 /home/matti/termite-data-server/bin/train_gensim.py --overwrite ./corpus ./models/

--------------------------------------------------------------------------------
Training an LDA topic model using gensim...
       corpus = ./corpus/corpus.txt
        model = ./models/
  token_regex = \w{3,}
       topics = 20
       passes = 1
--------------------------------------------------------------------------------
using symmetric alpha at 0.05
using symmetric eta at 3.17268948888e-05
using serial LDA version on this node
running online (single-pass) LDA training, 20 topics, 1 passes over the supplied corpus of 12357 documents, updating model once every 2000 documents, evaluating perplexity every 12357 documents, iterating 50x with a convergence threshold of 0.001000
too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
PROGRESS: pass 0, at document #2000/12357
merging changes from 2000 documents into a model of 12357 documents
topic #0 (0.050): 0.003*"black" + 0.003*"obama" + 0.003*"syria" + 0.002*"white" + 0.002*"party" + 0.002*"military" + 0.002*"great" + 0.002*"syrian" + 0.002*"anti" + 0.002*"russia"
topic #14 (0.050): 0.003*"party" + 0.003*"syria" + 0.002*"isis" + 0.002*"house" + 0.002*"russia" + 0.002*"city" + 0.002*"military" + 0.002*"white" + 0.002*"democratic" + 0.002*"power"
topic #12 (0.050): 0.002*"black" + 0.002*"obama" + 0.002*"power" + 0.002*"white" + 0.002*"things" + 0.002*"police" + 0.002*"fbi" + 0.002*"city" + 0.002*"change" + 0.001*"vote"
topic #13 (0.050): 0.004*"syria" + 0.003*"russia" + 0.003*"syrian" + 0.003*"black" + 0.003*"police" + 0.002*"military" + 0.002*"obama" + 0.002*"power" + 0.002*"white" + 0.002*"russian"
topic #17 (0.050): 0.004*"white" + 0.003*"syria" + 0.002*"party" + 0.002*"obama" + 0.002*"israel" + 0.002*"house" + 0.002*"email" + 0.002*"russia" + 0.002*"iran" + 0.002*"foreign"
topic diff=7.126265, rho=1.000000
PROGRESS: pass 0, at document #4000/12357
merging changes from 2000 documents into a model of 12357 documents
topic #3 (0.050): 0.003*"power" + 0.003*"black" + 0.003*"white" + 0.002*"law" + 0.002*"fbi" + 0.002*"americans" + 0.002*"emails" + 0.002*"economic" + 0.002*"article" + 0.002*"life"
topic #8 (0.050): 0.008*"email" + 0.004*"emails" + 0.003*"obama" + 0.002*"law" + 0.002*"follow" + 0.002*"information" + 0.002*"saudi" + 0.002*"russia" + 0.002*"000" + 0.002*"wikileaks"
topic #4 (0.050): 0.003*"assange" + 0.003*"nuclear" + 0.002*"power" + 0.002*"control" + 0.002*"weapons" + 0.002*"russia" + 0.002*"high" + 0.002*"wikileaks" + 0.002*"facebook" + 0.001*"department"
topic #16 (0.050): 0.006*"stockman" + 0.006*"party" + 0.004*"obama" + 0.004*"black" + 0.003*"police" + 0.003*"democratic" + 0.003*"social" + 0.002*"left" + 0.002*"white" + 0.002*"david"
topic #9 (0.050): 0.006*"fbi" + 0.003*"department" + 0.002*"emails" + 0.002*"including" + 0.002*"vote" + 0.002*"party" + 0.002*"weiner" + 0.002*"democratic" + 0.002*"saudi" + 0.002*"officials"
topic diff=2.843640, rho=0.707107
PROGRESS: pass 0, at document #6000/12357
merging changes from 2000 documents into a model of 12357 documents
topic #18 (0.050): 0.005*"obama" + 0.002*"women" + 0.002*"great" + 0.002*"military" + 0.002*"history" + 0.002*"party" + 0.002*"life" + 0.002*"house" + 0.002*"power" + 0.002*"white"
topic #8 (0.050): 0.009*"email" + 0.005*"emails" + 0.003*"obama" + 0.003*"wikileaks" + 0.003*"saudi" + 0.003*"information" + 0.003*"podesta" + 0.002*"law" + 0.002*"000" + 0.002*"2015"
topic #19 (0.050): 0.007*"vote" + 0.006*"obama" + 0.004*"voting" + 0.004*"voters" + 0.004*"white" + 0.004*"republican" + 0.003*"party" + 0.003*"presidential" + 0.003*"candidate" + 0.003*"podesta"
topic #5 (0.050): 0.008*"russia" + 0.005*"russian" + 0.003*"putin" + 0.003*"syria" + 0.002*"military" + 0.002*"nuclear" + 0.002*"foreign" + 0.002*"anti" + 0.002*"israel" + 0.002*"obama"
topic #4 (0.050): 0.006*"health" + 0.005*"brain" + 0.004*"assange" + 0.004*"nuclear" + 0.003*"widget" + 0.003*"food" + 0.003*"medical" + 0.002*"life" + 0.002*"fat" + 0.002*"cancer"
topic diff=1.850195, rho=0.577350
PROGRESS: pass 0, at document #8000/12357
merging changes from 2000 documents into a model of 12357 documents
topic #2 (0.050): 0.003*"ukraine" + 0.003*"rights" + 0.003*"party" + 0.003*"anti" + 0.002*"ukrainian" + 0.002*"left" + 0.002*"saakashvili" + 0.002*"order" + 0.002*"british" + 0.002*"jewish"
topic #17 (0.050): 0.012*"israel" + 0.006*"iran" + 0.004*"israeli" + 0.004*"jerusalem" + 0.003*"palestinian" + 0.003*"jewish" + 0.003*"minister" + 0.003*"obama" + 0.003*"foreign" + 0.003*"resolution"
topic #4 (0.050): 0.008*"health" + 0.008*"food" + 0.005*"cancer" + 0.004*"water" + 0.004*"foods" + 0.004*"brain" + 0.003*"body" + 0.003*"diet" + 0.003*"study" + 0.003*"high"
topic #14 (0.050): 0.005*"comey" + 0.004*"fbi" + 0.004*"moore" + 0.003*"investigation" + 0.003*"house" + 0.002*"party" + 0.002*"airbnb" + 0.002*"she" + 0.002*"power" + 0.002*"khan"
topic #16 (0.050): 0.008*"party" + 0.005*"court" + 0.004*"black" + 0.004*"obama" + 0.004*"supreme" + 0.004*"democratic" + 0.003*"police" + 0.003*"rights" + 0.002*"republican" + 0.002*"white"
topic diff=1.401028, rho=0.500000
PROGRESS: pass 0, at document #10000/12357
merging changes from 2000 documents into a model of 12357 documents
topic #14 (0.050): 0.005*"onion" + 0.004*"sheeple" + 0.004*"automatically" + 0.004*"text" + 0.003*"republish" + 0.003*"comey" + 0.003*"moore" + 0.003*"spam" + 0.003*"star" + 0.003*"fbi"
topic #15 (0.050): 0.013*"police" + 0.009*"water" + 0.006*"pipeline" + 0.005*"dakota" + 0.005*"law" + 0.004*"federal" + 0.004*"standing" + 0.004*"rock" + 0.003*"land" + 0.003*"protesters"
topic #5 (0.050): 0.017*"russia" + 0.010*"russian" + 0.008*"putin" + 0.004*"foreign" + 0.003*"ukraine" + 0.003*"obama" + 0.003*"policy" + 0.003*"nuclear" + 0.003*"military" + 0.003*"europe"
topic #18 (0.050): 0.008*"obama" + 0.003*"space" + 0.002*"god" + 0.002*"women" + 0.002*"great" + 0.002*"house" + 0.002*"man" + 0.002*"history" + 0.002*"earth" + 0.002*"kelly"
topic #16 (0.050): 0.012*"obama" + 0.007*"court" + 0.007*"party" + 0.006*"supreme" + 0.004*"black" + 0.004*"police" + 0.004*"constitution" + 0.004*"white" + 0.004*"rights" + 0.003*"house"
topic diff=1.336365, rho=0.447214
PROGRESS: pass 0, at document #12000/12357
merging changes from 2000 documents into a model of 12357 documents
topic #9 (0.050): 0.007*"department" + 0.006*"soros" + 0.006*"foundation" + 0.005*"law" + 0.004*"officials" + 0.004*"fbi" + 0.003*"federal" + 0.003*"secret" + 0.003*"million" + 0.003*"attorney"
topic #19 (0.050): 0.009*"obama" + 0.008*"vote" + 0.006*"voters" + 0.005*"presidential" + 0.005*"voting" + 0.005*"white" + 0.005*"republican" + 0.004*"party" + 0.004*"candidate" + 0.004*"house"
topic #12 (0.050): 0.004*"love" + 0.003*"life" + 0.003*"energy" + 0.003*"things" + 0.003*"human" + 0.003*"self" + 0.002*"earth" + 0.002*"control" + 0.002*"power" + 0.002*"feel"
topic #3 (0.050): 0.004*"power" + 0.003*"obamacare" + 0.003*"americans" + 0.002*"care" + 0.002*"global" + 0.002*"change" + 0.002*"pay" + 0.002*"order" + 0.002*"economic" + 0.002*"white"
topic #16 (0.050): 0.009*"obama" + 0.008*"court" + 0.007*"party" + 0.006*"supreme" + 0.005*"immigration" + 0.004*"black" + 0.004*"rights" + 0.004*"constitution" + 0.004*"police" + 0.004*"white"
topic diff=1.076053, rho=0.408248
-9.063 per-word bound, 535.0 perplexity estimate based on a held-out corpus of 357 documents with 68794 words
PROGRESS: pass 0, at document #12357/12357
merging changes from 357 documents into a model of 12357 documents
topic #14 (0.050): 0.009*"moore" + 0.004*"violence" + 0.003*"manchanda" + 0.003*"onion" + 0.003*"star" + 0.003*"joe" + 0.003*"sheeple" + 0.003*"foster" + 0.003*"musket" + 0.003*"8th"
topic #7 (0.050): 0.010*"com" + 0.006*"www" + 0.006*"facebook" + 0.005*"http" + 0.005*"information" + 0.004*"data" + 0.004*"google" + 0.003*"star" + 0.003*"content" + 0.003*"internet"
topic #16 (0.050): 0.015*"court" + 0.010*"supreme" + 0.009*"eddie" + 0.008*"judges" + 0.007*"party" + 0.007*"obama" + 0.006*"marriage" + 0.006*"law" + 0.005*"immigration" + 0.004*"unelected"
topic #19 (0.050): 0.009*"obama" + 0.009*"vote" + 0.006*"voting" + 0.006*"voters" + 0.005*"presidential" + 0.005*"republican" + 0.005*"democratic" + 0.004*"white" + 0.004*"candidate" + 0.004*"party"
topic #1 (0.050): 0.005*"life" + 0.004*"women" + 0.004*"children" + 0.003*"man" + 0.003*"old" + 0.003*"family" + 0.003*"child" + 0.003*"father" + 0.002*"home" + 0.002*"things"
topic diff=0.888690, rho=0.377964
Saving dictionary to disk: ./models//dictionary.gensim
Saving corpus to disk: ./models//corpus.gensim
/home/matti/miniconda3/envs/py27/lib/python2.7/site-packages/gensim/interfaces.py:60: UserWarning: corpus.save() stores only the (tiny) iteration object; to serialize the actual corpus content, use e.g. MmCorpus.serialize(corpus)
  warnings.warn("corpus.save() stores only the (tiny) iteration object; "
Saving model to disk: ./models//lda.gensim
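Several numbers in the training log above can be reproduced with a few lines of arithmetic, which is a useful sanity check when reading such logs. The sketch below only uses values printed in the log. The relationships it assumes are: perplexity is 2 raised to the negative per-word bound, symmetric alpha is one over the number of topics, eta is one over the vocabulary size, and rho follows (1 + t)^-0.5 for the t-th chunk. They match the logged values, but treat them as assumptions about this gensim version rather than a specification.

```python
# Values copied from the training log above.
per_word_bound = -9.063
num_topics = 20
eta = 3.17268948888e-05

# Perplexity estimate from the per-word bound (the log reports 535.0).
perplexity = 2 ** (-per_word_bound)

# Symmetric alpha (the log reports 0.05).
alpha = 1.0 / num_topics

# Vocabulary size implied by eta, if eta is one over the number of terms.
implied_vocabulary_size = round(1.0 / eta)

# Learning-rate schedule for the seven model updates
# (the log reports 1.0, 0.707107, ..., 0.377964).
rho = [(1.0 + t) ** -0.5 for t in range(7)]

print(round(perplexity, 1), alpha, implied_vocabulary_size)
print([round(r, 6) for r in rho])
```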

Finally, read in the trained gensim LDA model to Termite, creating all the necessary data structures for the visualisations to work. This computes, among other things, term collocations ($N^2$), so it's going to take a while to run, especially for large vocabularies.
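To get a feel for why pairwise term statistics grow quadratically, here is a small sketch that counts co-occurring term pairs in toy sentences and computes how many distinct pairs a vocabulary of a given size admits. It is not Termite's implementation; the sentences and the helper `possible_pairs` are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy tokenised sentences (illustrative only).
sentences = [["cat", "sat", "mat"], ["red", "cat", "sat"], ["cat", "mat"]]

# Count each unordered pair of distinct terms once per sentence.
pair_counts = Counter()
for tokens in sentences:
    for pair in combinations(sorted(set(tokens)), 2):
        pair_counts[pair] += 1

def possible_pairs(n_terms):
    """Number of unordered pairs among n_terms distinct terms: n(n-1)/2."""
    return n_terms * (n_terms - 1) // 2

print(pair_counts.most_common(2))
print(possible_pairs(4), possible_pairs(10000))
```

Four terms give only six pairs, but ten thousand terms already give roughly fifty million, which is why this step dominates the run time for large vocabularies.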

If you set all the paths consistently during the previous steps, this should just work. If not, it's likely there will be some FileNotFound errors.

In [88]:

%capture
!cd termite; cp -r $termite_server_root/tools ./; $py27 /home/matti/termite-data-server/bin/read_gensim.py --overwrite\
  --sentence-split /home/matti/termite-data-server/utils/corenlp/SentenceSplitter.jar\
  gensim_termite ./models/ ./corpus ./db

WARNING:root:Unable to import plural rules: No module named plural_rules
--------------------------------------------------------------------------------
INFO:termite:--------------------------------------------------------------------------------
Import a gensim LDA topic model as a web2py application...
INFO:termite:Import a gensim LDA topic model as a web2py application...
           app_name = gensim_termite
INFO:termite:           app_name = gensim_termite
           app_path = apps/gensim_termite
INFO:termite:           app_path = apps/gensim_termite
         model_path = ./models/
INFO:termite:         model_path = ./models/
    corpus_filename = ./corpus/corpus.txt
INFO:termite:    corpus_filename = ./corpus/corpus.txt
  database_filename = ./db/corpus.db
INFO:termite:  database_filename = ./db/corpus.db
--------------------------------------------------------------------------------
INFO:termite:--------------------------------------------------------------------------------
Creating app: gensim_termite [apps/temp_20170623_234907_893474_8228]
INFO:termite:Creating app: gensim_termite [apps/temp_20170623_234907_893474_8228]
Creating folder: [apps/temp_20170623_234907_893474_8228/data]
INFO:termite:Creating folder: [apps/temp_20170623_234907_893474_8228/data]
Creating folder: [apps/temp_20170623_234907_893474_8228/databases]
INFO:termite:Creating folder: [apps/temp_20170623_234907_893474_8228/databases]
Linking folder: [apps/temp_20170623_234907_893474_8228/models]
INFO:termite:Linking folder: [apps/temp_20170623_234907_893474_8228/models]
Linking folder: [apps/temp_20170623_234907_893474_8228/views]
INFO:termite:Linking folder: [apps/temp_20170623_234907_893474_8228/views]
Linking folder: [apps/temp_20170623_234907_893474_8228/controllers]
INFO:termite:Linking folder: [apps/temp_20170623_234907_893474_8228/controllers]
Linking folder: [apps/temp_20170623_234907_893474_8228/static]
INFO:termite:Linking folder: [apps/temp_20170623_234907_893474_8228/static]
Linking folder: [apps/temp_20170623_234907_893474_8228/modules]
INFO:termite:Linking folder: [apps/temp_20170623_234907_893474_8228/modules]
Creating file: [apps/temp_20170623_234907_893474_8228/__init__.py]
INFO:termite:Creating file: [apps/temp_20170623_234907_893474_8228/__init__.py]
Copying [./db/corpus.db] --> [apps/temp_20170623_234907_893474_8228/databases/corpus.db]
INFO:termite:Copying [./db/corpus.db] --> [apps/temp_20170623_234907_893474_8228/databases/corpus.db]
Copying [./corpus/corpus.txt] --> [apps/temp_20170623_234907_893474_8228/data/corpus.txt]
INFO:termite:Copying [./corpus/corpus.txt] --> [apps/temp_20170623_234907_893474_8228/data/corpus.txt]
Extracting [./corpus/corpus.txt] --> [apps/temp_20170623_234907_893474_8228/data/sentences.txt]
INFO:termite:Extracting [./corpus/corpus.txt] --> [apps/temp_20170623_234907_893474_8228/data/sentences.txt]
Preparing the Stanford CoreNLP pipeline...
DEBUG:termite:Preparing the Stanford CoreNLP pipeline...
Adding annotator tokenize
DEBUG:termite:Adding annotator tokenize
Adding annotator ssplit
DEBUG:termite:Adding annotator ssplit
Processing corpus: [./corpus/corpus.txt] -> [apps/temp_20170623_234907_893474_8228/data/sentences.txt]
DEBUG:termite:Processing corpus: [./corpus/corpus.txt] -> [apps/temp_20170623_234907_893474_8228/data/sentences.txt]
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ₨ (U+20A8, decimal: 8360)
DEBUG:termite:WARNING: Untokenizable: ₨ (U+20A8, decimal: 8360)
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ‒ (U+2012, decimal: 8210)
DEBUG:termite:WARNING: Untokenizable: ‒ (U+2012, decimal: 8210)
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
[Output condensed: PTBLexer repeated this "Untokenizable" warning many more times between 11:49:08 PM and 11:49:11 PM, and every line was echoed a second time with a DEBUG:termite: prefix. Code points reported: high surrogates U+D83C, U+D83D, U+D83E; C1 control characters U+81, U+83, U+8D, U+95, U+99, U+9A, U+9C, U+9D, U+9F; private-use characters U+F0B7, U+F1FA, U+F4A9, U+F50D, U+F682, U+F6A8; and U+200C, U+2012, U+2049, U+20B9, U+FE0F, U+FFFC.]
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable:  (U+F1FA, decimal: 61946)
DEBUG:termite:WARNING: Untokenizable:  (U+F1FA, decimal: 61946)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable:  (U+F602, decimal: 62978)
DEBUG:termite:WARNING: Untokenizable:  (U+F602, decimal: 62978)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83E, decimal: 55358)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83E, decimal: 55358)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ️ (U+FE0F, decimal: 65039)
DEBUG:termite:WARNING: Untokenizable: ️ (U+FE0F, decimal: 65039)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83E, decimal: 55358)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83E, decimal: 55358)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ️ (U+FE0F, decimal: 65039)
DEBUG:termite:WARNING: Untokenizable: ️ (U+FE0F, decimal: 65039)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
DEBUG:termite:WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
Copying [./models/] --> [apps/temp_20170623_234907_893474_8228/data/gensim-lda]
INFO:termite:Copying [./models/] --> [apps/temp_20170623_234907_893474_8228/data/gensim-lda]
Computing bag-of-words statistics
INFO:termite:Computing bag-of-words statistics
          token_regex = \w{3,}
INFO:termite:          token_regex = \w{3,}
             min_freq = 5
INFO:termite:             min_freq = 5
         min_doc_freq = 3
INFO:termite:         min_doc_freq = 3
       max_freq_count = 4000
INFO:termite:       max_freq_count = 4000
    max_co_freq_count = 100000
INFO:termite:    max_co_freq_count = 100000
Computing document-level statistics...
INFO:termite:Computing document-level statistics...
    Loading corpus: apps/temp_20170623_234907_893474_8228/data/corpus.txt
DEBUG:termite:    Loading corpus: apps/temp_20170623_234907_893474_8228/data/corpus.txt
    Computing term freqs (12357 docs)...
INFO:termite:    Computing term freqs (12357 docs)...
    Computing term co-occurrences (12357 docs)...
INFO:termite:    Computing term co-occurrences (12357 docs)...
    Saving term_texts (4000 terms)...
DEBUG:termite:    Saving term_texts (4000 terms)...
        inserting 4000 rows...
DEBUG:termite:        inserting 4000 rows...
    Saving term_freqs (4000 terms)...
DEBUG:termite:    Saving term_freqs (4000 terms)...
        inserting 4000 rows...
DEBUG:termite:        inserting 4000 rows...
    Saving term_probs (4000 terms)...
DEBUG:termite:    Saving term_probs (4000 terms)...
        inserting 4000 rows...
DEBUG:termite:        inserting 4000 rows...
    Saving term_doc_freqs (4000 terms)...
DEBUG:termite:    Saving term_doc_freqs (4000 terms)...
        inserting 4000 rows...
DEBUG:termite:        inserting 4000 rows...
    Saving term_co_freqs (100000 term pairs)...
DEBUG:termite:    Saving term_co_freqs (100000 term pairs)...
        inserting 100000 rows...
DEBUG:termite:        inserting 100000 rows...
    Saving term_co_probs (100000 term pairs)...
DEBUG:termite:    Saving term_co_probs (100000 term pairs)...
        inserting 100000 rows...
DEBUG:termite:        inserting 100000 rows...
    Saving term_g2 (100000 term pairs)...
DEBUG:termite:    Saving term_g2 (100000 term pairs)...
        inserting 100000 rows...
DEBUG:termite:        inserting 100000 rows...
Computing sentence-level term statistics...
INFO:termite:Computing sentence-level term statistics...
    Loading corpus: apps/temp_20170623_234907_893474_8228/data/sentences.txt
DEBUG:termite:    Loading corpus: apps/temp_20170623_234907_893474_8228/data/sentences.txt
    Computing term freqs (369703 docs)...
INFO:termite:    Computing term freqs (369703 docs)...
    Computing term co-occurrences (369703 docs)...
INFO:termite:    Computing term co-occurrences (369703 docs)...
    Saving sentences_co_freqs (100000 term pairs)...
DEBUG:termite:    Saving sentences_co_freqs (100000 term pairs)...
        inserting 100000 rows...
DEBUG:termite:        inserting 100000 rows...
    Saving sentences_co_probs (100000 term pairs)...
DEBUG:termite:    Saving sentences_co_probs (100000 term pairs)...
        inserting 100000 rows...
DEBUG:termite:        inserting 100000 rows...
    Saving sentences_g2 (100000 term pairs)...
DEBUG:termite:    Saving sentences_g2 (100000 term pairs)...
        inserting 100000 rows...
DEBUG:termite:        inserting 100000 rows...
('-->', 'apps/temp_20170623_234907_893474_8228/data/gensim-lda/corpus.gensim')
Reading gensim LDA output...
INFO:termite:Reading gensim LDA output...
    Loading dictionary: apps/temp_20170623_234907_893474_8228/data/gensim-lda/dictionary.gensim
DEBUG:termite:    Loading dictionary: apps/temp_20170623_234907_893474_8228/data/gensim-lda/dictionary.gensim
    Loading corpus: apps/temp_20170623_234907_893474_8228/data/gensim-lda/corpus.gensim
DEBUG:termite:    Loading corpus: apps/temp_20170623_234907_893474_8228/data/gensim-lda/corpus.gensim
('||-->', 'apps/temp_20170623_234907_893474_8228/data/gensim-lda/corpus.gensim')
<class 'modellers.GensimLDA.GensimTermiteCorpusReader'>
    Loading model: apps/temp_20170623_234907_893474_8228/data/gensim-lda/lda.gensim
DEBUG:termite:    Loading model: apps/temp_20170623_234907_893474_8228/data/gensim-lda/lda.gensim
Writing to database...
INFO:termite:Writing to database...
    Saving term_topic_matrix...
DEBUG:termite:    Saving term_topic_matrix...
    Saving doc_topic_matrix...
DEBUG:termite:    Saving doc_topic_matrix...
    Retrieving terms, documents, and topics...
DEBUG:termite:    Retrieving terms, documents, and topics...
    Retrieving top terms and top documents...
DEBUG:termite:    Retrieving top terms and top documents...
    Saving terms...
DEBUG:termite:    Saving terms...
    Saving docs...
DEBUG:termite:    Saving docs...
    Saving topics...
DEBUG:termite:    Saving topics...
Computing derived LDA topic model statistics...
INFO:termite:Computing derived LDA topic model statistics...
    max_co_topic_count = 10000
INFO:termite:    max_co_topic_count = 10000
    Loading doc_topic_matrix...
DEBUG:termite:    Loading doc_topic_matrix...
    Computing topic cooccurrences...
DEBUG:termite:    Computing topic cooccurrences...
    Computing topic covariance...
DEBUG:termite:    Computing topic covariance...
    Saving topic_covariance...
DEBUG:termite:    Saving topic_covariance...
        inserting 400 rows...
DEBUG:termite:        inserting 400 rows...
Moving app into place: gensim_termite [apps/temp_20170623_234907_893474_8228] -> [apps/gensim_termite]
INFO:termite:Moving app into place: gensim_termite [apps/temp_20170623_234907_893474_8228] -> [apps/gensim_termite]

To start the server and see the visualisations

In [91]:

!$py27 $termite_server_root/web2py/web2py.py

web2py Web Framework
Created by Massimo Di Pierro, Copyright 2007-2017
Version 2.9.5-stable+timestamp.2014.03.16.02.35.39
Database drivers available: SQLite(sqlite3), MySQL(pymysql), PostgreSQL(pg8000), MSSQL(pyodbc), DB2(pyodbc), Teradata(pyodbc), Ingres(pyodbc), IMAP(imaplib)

pyLDAvis¶

Some of the work from Termite has been integrated into pyLDAvis, which is actively maintained and has good interoperability with gensim. Below is an interactive visualisation of the fake news model trained earlier. To see how informative the visualisation is overall, I'll train another model on the same dataset but increase the number of topics quite a lot.

For a good description of what you see in the visualisation, you can watch the presentation by the creator himself:

https://www.youtube.com/watch?v=tGxW2BzC_DU&index=4&list=PLykRMO7ZuHwP5cWnbEmP_mUIVgzd5DZgH

In [6]:

lda_fake = LdaModel.load('./fake_news_35.lda')

In [15]:

from gensim.models import LdaModel

import pyLDAvis as ldavis
import pyLDAvis.gensim

ldavis.enable_notebook()
prepared_data = ldavis.gensim.prepare(lda_fake, corpus_fake, vocabulary)

with open('./fake_news_35.lda-LDAVIS.json', 'w') as fh:
    fh.write(prepared_data.to_json())

prepared_data

/home/matti/miniconda3/envs/pydatabln17/lib/python3.6/site-packages/pyLDAvis/_prepare.py:387: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  topic_term_dists = topic_term_dists.ix[topic_order]

Out[15]:

In [6]:

lda_fake_100 = LdaModel(corpus=corpus_fake, id2word=vocabulary, num_topics=100, alpha='auto')

In [8]:

lda_fake_100.save('./fake_news_100.lda')

In [10]:

prepared_data = ldavis.gensim.prepare(lda_fake_100, corpus_fake, vocabulary)

with open('./fake_news_100.lda-LDAVIS.json', 'w') as fh:
    fh.write(prepared_data.to_json())

/home/matti/miniconda3/envs/pydatabln17/lib/python3.6/site-packages/pyLDAvis/_prepare.py:387: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  topic_term_dists = topic_term_dists.ix[topic_order]

In [14]:

prepared_data

Out[14]:

Comparing the two visualisations, one can make some comforting observations. In the bottom right corner of both visualisations there is a cluster of topics relating to the 2016 U.S. presidential election. The 100 topic model has split the documents up in slightly more specific terms, but otherwise both models have captured those semantics and, more importantly, both visualisations display those topics consistently in a cluster.

Similarly, in the visualisation of the 100 topic model the cluster in the top right hand corner is semantically coherent and similar to the cluster in the bottom left hand corner in the visualisation for the 35 topic model. Again, both models have captured the Syrian civil war and related issues and consistently placed those topics close together in the topic panel.

The main problem I find with LDAvis is that the spatial dimensions of the left hand side panel are somewhat meaningless.

The area of the circle shows the prevalence of a topic, but visually determining the relative sizes of circles is difficult to do, so while you do get an understanding of which topics are the most important you can't really determine how much more important those topics are compared to the others.

The second problem is the distance between the topics. While the positioning of the topics to some extent preserves semantic similarity, allowing some related topics to form clusters, it is a little difficult to determine exactly how similar the topics are. To be fair, this is not something that can be blamed on LDAvis, as measuring the semantic similarity of topics and then collapsing the multidimensional similarity vectors into 2 dimensions is not an easy task. Nevertheless, one shouldn't read too much into the topic distances: the locations are computed by what is essentially multidimensional scaling, and different algorithms for doing so can place the same topics differently.
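To make the "collapsing into 2 dimensions" step concrete, here is a minimal sketch using Jensen-Shannon divergence between topic-term distributions followed by classical multidimensional scaling (PCoA). As far as I know this is close to what pyLDAvis does by default, but treat that as an assumption; the `topic_term` matrix and the function names are hypothetical and made up for illustration.

```python
import numpy as np

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with zero probability contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pcoa(dist, n_components=2):
    """Classical MDS (principal coordinate analysis) of a distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # keep the directions with the largest eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# hypothetical topic-term distributions: 4 topics over 6 terms, rows sum to 1
topic_term = np.array([
    [0.40, 0.40, 0.10, 0.05, 0.03, 0.02],
    [0.35, 0.45, 0.10, 0.05, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.10, 0.40, 0.40],
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],
])
k = topic_term.shape[0]
dist = np.array([[jensen_shannon(topic_term[i], topic_term[j])
                  for j in range(k)] for i in range(k)])
coords = pcoa(dist)  # one (x, y) position per topic
print(coords)
```

Topics 0 and 1 have nearly identical distributions and end up next to each other, but the absolute positions depend entirely on the distance measure and the scaling method chosen.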

Intrinsic Evaluation¶

Perplexity is often used as an example of an intrinsic evaluation measure. It comes from the language modelling community and aims to capture how surprised a model is by new data it has not seen before. This is commonly measured as the normalised log-likelihood of a held out test set:

$$ \begin{align} \mathcal{L}(D') &= \frac{\sum_{d \in D'} \log_2 p(w_d;\Theta)}{\mbox{count of tokens}}\\\\ perplexity(D') &= 2^{-\mathcal{L}(D')} \end{align} $$

Focussing on the log-likelihood part, this metric is measuring how probable some new unseen data is given the model that was learned earlier. That is to say, how well does the model represent or reproduce the statistics of the held out data.
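As a small worked example of the formula above, the sketch below computes perplexity for a smoothed unigram model. The unigram model is only a stand-in for a topic model's $p(w_d;\Theta)$, and the training and held out sentences are made up.

```python
import math
from collections import Counter

def perplexity(test_tokens, word_prob):
    """2 ** (negative mean log2 probability per held out token)."""
    log_likelihood = sum(math.log2(word_prob[w]) for w in test_tokens)
    return 2 ** (-log_likelihood / len(test_tokens))

# hypothetical training and held out data
train = "the puck hit the post and the crowd roared".split()
held_out = "the crowd hit the puck".split()

# unigram model with add-one smoothing (an assumption, not a topic model)
vocab = set(train) | set(held_out)
counts = Counter(train)
total = len(train) + len(vocab)
word_prob = {w: (counts[w] + 1) / total for w in vocab}

print(perplexity(held_out, word_prob))

# sanity check: a uniform model over V words has perplexity V
uniform = {w: 1 / len(vocab) for w in vocab}
print(perplexity(held_out, uniform), len(vocab))
```

The sanity check gives a useful intuition: a perplexity of V means the model is as uncertain as if it were picking uniformly among V words.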

Thinking back to what we would like the topic model to do, this makes no sense at all. Let's put aside any specific algorithm for inferring a topic model and focus on what it is that we'd like the model to capture. More often than not the desire is for the model to capture concepts that exist in a particular dataset. Well what is a concept and how can it be represented given the pieces we have?

Let me offer a way of thinking about this that would not pass muster in a bachelor's class in philosophy. Luckily we're not in philosophy class at the moment.

Take the following two documents that talk about ice hockey. I've highlighted terms that I think are related to the subject matter; you may disagree with my judgement. Notice that among the terms that I've highlighted as being part of the topic of Ice Hockey are words such as Penguins, opposition and shots. None of these on the face of it would appear to "belong" to Ice Hockey, but seeing them in context makes it clear that Penguins refers to the ice hockey team, shots refers to disc-shaped pieces of vulcanised rubber being launched at the goal at various speeds, and opposition refers to the opposing team, although it might more commonly be thought to belong to politics or the debate club.

... began his professional career in 1989–90 with Jokerit of the SM-liiga and played 21 seasons in the National Hockey League (NHL) for the Winnipeg Jets ...

Rinne stopped 27 of 28 shots from the Penguins in Game 6 at home Sunday, but that lone goal allowed was enough for the opposition to break out the Stanley Cup trophy for the second straight season.

Given the terms that I've determined to be a partial description of Ice Hockey (the concept), one could conceivably measure the coherence of that concept by counting how many times those terms occur with each other - co-occur that is - in some sufficiently large reference corpus.
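Counting co-occurrences in a reference corpus can be sketched in a few lines. The tiny tokenised corpus and the function name below are hypothetical; a real reference corpus would be something like Wikipedia.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(terms, reference_docs):
    """Count, per term pair, the reference documents that contain both terms."""
    counts = Counter()
    for doc in reference_docs:
        present = sorted(set(terms) & set(doc))
        counts.update(combinations(present, 2))
    return counts

# hypothetical, already tokenised reference corpus
reference_docs = [
    ["penguins", "shots", "goal", "opposition"],
    ["shots", "goal", "season"],
    ["penguins", "zoo", "antarctica"],
    ["opposition", "parliament", "debate"],
]
concept = ["penguins", "shots", "goal", "opposition"]

pair_counts = cooccurrence_counts(concept, reference_docs)
for pair, n_docs in pair_counts.most_common():
    print(pair, n_docs)
```

Pairs are stored in alphabetical order, so `("goal", "shots")` and `("shots", "goal")` are counted as the same pair.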

One of course encounters a problem should the reference corpus never refer to ice hockey. A poorly selected reference corpus could, for instance, be patent applications from the 1800s; it would be unlikely to find those word pairs in that text.

This is precisely what several research papers have aimed to do: take the top words from the topics in a topic model and measure the support for those words forming a coherent concept / topic by looking at the co-occurrences of those terms in a reference corpus. This line of research was eventually wrapped up in a single paper (Röder et al., Exploring the Space of Topic Coherence Measures) whose authors develop a coherence pipeline that allows plugging all the different methods into a single framework. This coherence pipeline is partially implemented in gensim; below are a few examples of how to use it.

In [6]:

import spacy

import gensim
from gensim.models import LdaModel
from gensim.corpora import Dictionary, MmCorpus

spc = spacy.load('en')

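from spacy.symbols import NOUN, VERB, ADJ, ADV, PROPN

# keep only content-bearing parts of speech; symbols avoid version-specific integer ids
KEEP_POS = {NOUN, VERB, ADJ, ADV, PROPN}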
pipe = spc.pipe(df_fake.text, parse=False, entity=False, n_threads=8)
processed = [[token.lemma_ for token in document if token.pos in KEEP_POS] for document in pipe]

vocabulary = Dictionary(processed)
vocabulary.filter_extremes(no_below=3, no_above=0.5)

In [10]:

corpus = MmCorpus('./models/fake_news.mm')

In [14]:

lda_fake_35 = LdaModel.load('./models/fake_news_35.lda')
lda_fake_100 = LdaModel.load('./models/fake_news_100.lda')

In [11]:

from gensim.models import CoherenceModel

cm = CoherenceModel(model=lda_fake_35, corpus=corpus,
                    dictionary=vocabulary, coherence='c_v',
                    texts=[[w for w in d if w in vocabulary.token2id] for d in processed])
cm.get_coherence()

Out[11]:

0.73858095241592736

In [19]:

cm = CoherenceModel(model=lda_fake_100, corpus=corpus,
                    dictionary=vocabulary, coherence='c_v',
                    texts=[[w for w in d if w in vocabulary.token2id] for d in processed])
cm.get_coherence()

Out[19]:

0.6594566862444563

In [12]:

cm = CoherenceModel(model=lda_fake_35, corpus=corpus,
                    dictionary=vocabulary, coherence='c_uci',
                    texts=[[w for w in d if w in vocabulary.token2id] for d in processed])
cm.get_coherence()

Out[12]:

0.51052624873338692

In [15]:

cm = CoherenceModel(model=lda_fake_100, corpus=corpus,
                    dictionary=vocabulary, coherence='c_uci',
                    texts=[[w for w in d if w in vocabulary.token2id] for d in processed])
cm.get_coherence()

Out[15]:

-0.84043127015386343

In [20]:

cm = CoherenceModel(model=lda_fake_35, corpus=corpus,
                    dictionary=vocabulary, coherence='u_mass',
                    texts=[[w for w in d if w in vocabulary.token2id] for d in processed])
cm.get_coherence()

Out[20]:

-1.6535058835918122

In [21]:

cm = CoherenceModel(model=lda_fake_100, corpus=corpus,
                    dictionary=vocabulary, coherence='u_mass',
                    texts=[[w for w in d if w in vocabulary.token2id] for d in processed])
cm.get_coherence()

Out[21]:

-3.8603367679749985

References¶

Papers¶

Chang et al., Reading Tea Leaves: How Humans Interpret Topic Models, NIPS 2009
Wallach et al., Evaluation Methods for Topic Models, ICML 2009
Lau et al., Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality, ACL 2014
Röder et al., Exploring the Space of Topic Coherence Measures, Web Search and Data Mining (WSDM) 2015
Sievert et al., LDAvis: A Method for Visualizing and Interpreting Topics, ACL 2014 Workshop on Interactive Language Learning, Visualization, and Interfaces
Chuang et al., Termite: Visualization Techniques for Assessing Textual Topic Models, AVI 2012
Chuang et al., Topic Model Diagnostics: Assessing Domain Relevance via Topical Alignment, ICML 2013

Software¶

gensim Topic Modelling for Humans (Python)
UMass Machine Learning for Language - Mallet (Java)
Stanford Topic Modelling Toolbox (Java)
Spherical Hierarchical Dirichlet Processes
Termite
scattertext
- scattertext allows you to plot differential word usage patterns from two corpora into an interactive display. It's not exactly an evaluation method for topic models but can be quite useful for analysing corpora
- there's a talk by the creator at PyData Seattle 2017 link

Datasets¶

The model used in this notebook is built on the Kaggle Fake News dataset available here.

Interwebs¶

General stuff about NLP you might be interested in¶

Yoav Goldberg on evaluating NNLMs

Even Remotely Intelligent Stuff Ends Here¶

I am going to start with a slightly silly example that nonetheless nicely illustrates a few important points about evaluating unsupervised models

define what you want out of the model
- running an algorithm won't solve anything unless you have an expectation of what the output should / could look like
- if you don't explicitly know what the output should look like, you probably know it implicitly
applying subjective judgement

I did some analysis on the accepted talks to PyData Berlin 2017 to find out what kind of talks were accepted this year. I plotted the results in a wordcloud (github.com/amueller/word_cloud), but was disappointed that the first approach didn't really reveal the thing I was hoping to analyse. The plot just showed general patterns of English language use.

Raw Frequency of Words

Filtering out high frequency words helped a little bit, but the wordcloud still wasn't that informative. It is hardly a surprise that data is a central theme at a PyData conference.

Raw Frequency of Words, terms with doc_freq > 0.5 removed

So I made some more adjustments to the model and got something that looks more reasonable.

TFIDF filtered scores
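A minimal sketch of the three weightings behind the wordclouds above, on hypothetical toy documents. The function names and the exact TF-IDF weighting are assumptions for illustration, not a record of how the plots were made.

```python
import math
from collections import Counter

def raw_frequency(docs):
    """Total number of occurrences of each token across all documents."""
    return Counter(token for doc in docs for token in doc)

def document_frequency(docs):
    """Number of documents each token appears in."""
    return Counter(token for doc in docs for token in set(doc))

def drop_frequent(docs, max_doc_share=0.5):
    """Remove tokens that appear in more than max_doc_share of the documents."""
    df = document_frequency(docs)
    limit = max_doc_share * len(docs)
    return [[t for t in doc if df[t] <= limit] for doc in docs]

def tfidf_scores(docs):
    """Corpus-level term frequency weighted by log inverse document frequency."""
    df = document_frequency(docs)
    tf = raw_frequency(docs)
    return {t: tf[t] * math.log(len(docs) / df[t]) for t in tf}

# hypothetical talk abstracts, already tokenised
docs = [
    "data science with python data".split(),
    "topic models for text data".split(),
    "deep learning for image data".split(),
    "python tools for topic models".split(),
]

print(raw_frequency(docs).most_common(3))
print(raw_frequency(drop_frequent(docs)).most_common(3))
print(sorted(tfidf_scores(docs).items(), key=lambda kv: -kv[1])[:3])
```

Each step encodes a judgement about which words are uninteresting, which is exactly the kind of implicit expectation discussed here.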

I am not trying to claim that the last model is a good one, or even a valid one, but it does correspond to my previously held beliefs about the contents of the conference. It is not surprising that this is the case, since I arrived at the model by iterating through a number of models that I found to be unsatisfactory. The problem is that I never actually defined what satisfactory means; there was never an explicitly stated goal towards which I was driving.

This is extremely important to keep in mind, as evaluation metrics for unsupervised models often have built-in assumptions about what a good model looks like. Those assumptions may or may not be true for your use case. Some metrics aim to satisfy internal constraints.

So let's start with what we would like to model about text in an unsupervised manner

the distribution of terms
the co-occurrence of terms
- within documents (topic modelling)
- within "sentences" (distributional semantics, word2vec, GloVe)
sequences of terms (language models)

I will focus on evaluating topic models and models of distributional semantics.

open source tools
open access research papers
data visualisation is not my core research area
I am not a political analyst or social scientist, my background is in computer science

In [104]:

import numpy as np
import scipy
from matplotlib import pyplot as plt

fig, ax = plt.subplots(figsize=(1000/72, 750/72), dpi=72)

topics = ['Sports', 'Machine Learning', 'Celebrity', 'Fashion', 'Current Affairs', 'Tennis', 'Medicine', 'Technology', 'Security']
# random 2-D centre for each topic cluster
centers = np.random.randint(low=0, high=20, size=(len(topics), 2))
for topic_name, center in zip(topics, centers):
    topic = np.random.normal(loc=center, scale=1.0, size=(10, 2))
    dots = ax.scatter(topic[:, 0], topic[:, 1], alpha=0.4)
    bbox_props = dict(boxstyle="circle, pad=0.3", fc=dots.get_facecolor().ravel(), ec="none", alpha=0.1, lw=1)
    t = ax.text(*topic.mean(axis=0), topic_name, ha="center", va="center", rotation=0,
                size=16, bbox=bbox_props)
plt.axis("off");
plt.show()
# plt.savefig('../assets/unsupervised-models/ideal-topics.png')

<matplotlib.figure.Figure at 0x7faa9a4655c0>

This is what could be called a coherent interpretable model

all clusters are more or less self-contained
related clusters seem to be close together

The problem here is that the "model" above is entirely made up, and the division is somewhat nonsensical.

Topic models can be evaluated in a number of ways, including

perplexity (might not be such a great measure)
- Chang et al., Reading Tea Leaves: How Humans Interpret Topic Models, NIPS 2009
- Wallach et al., Evaluation Methods for Topic Models, ICML 2009
- Lau et al., Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality, ACL 2014
topic coherence
- Röder et al., Exploring the Space of Topic Coherence Measures, Web Search and Data Mining (WSDM) 2015
human interpretability (word or topic intrusion)
- Lau et al., Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality, ACL 2014
ontological similarity to link overlap and term co-occurrence (WordNet)
inter-annotator agreement on labels for topics
an external task
- information retrieval (Wei et al., LDA-Based Document Models for Ad-hoc Retrieval, SIGIR 2006)
- sentiment analysis (Titov et al., A Joint Model of Text and Aspect Ratings for Sentiment Summarization, ACL 2008)

Perplexity and Other Internal Evaluation Metrics¶

Perplexity is a metric for the goodness of fit; it is derived from the log likelihood of held out data.

$$ 2^{-\sum_x \tilde p(x)\,\log_2\,p(x)}$$

Here $\tilde p(x)$ is the empirical distribution of the held out data and $p(x)$ is the model's estimate. The aim is to capture how well the current estimated probability of words predicts the probability of words in a held out dataset. This measure is used internally by topic models to measure the progress of learning the topics. It is not suitable as a proxy for human evaluation, as a model with low perplexity does not necessarily correspond to a model that is interpretable or informative (Chang et al., Reading Tea Leaves: How Humans Interpret Topic Models).

There is a review of internal evaluation measures in Wallach et al., Evaluation Methods for Topic Models, ICML 2009. These measures borrow from language modelling research.

Topic ID	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14
model
1	get	i'll	light	at	he	come	got	go	was	blues	will	oh	dance	over	she
1	gonna	never	night	down	his	me,	they	we're	now	are	let	do	(i	been	her
1	yeah,	see	shine	old	him	better	up	hey	never	as	heart	want	back	oh,	she's
1	yeah	you're	sun	out	man	if	ain't	-	have	day	our	baby	ah	long	girl
1	wanna	way	tonight	run	he's	do	out	let's	were	good	are	can't	bring	gone	woman
NaN
2	oh	want	get	out	she	down	light	am	wanna	blues	if	are	hey,	he	was
2	baby	do	ya	gloom	her	look	our	thro'	up	old	you,	they	ah	his	now
2	gonna	if	ain't	off	she's	down,	will	lord	get	new	would	where	ha	him	out
2	oh,	can't	got	black	girl	at	as	run	la	-	could	home	ah,	man	one
2	yeah	i'll	na	them	got	stop	rain	jesus	let's	hey	me,	people	my,	he's	at

Topic Coherence Model¶

Röder et al., Exploring the Space of Topic Coherence Measures, WSDM 2015

The topic coherence model combines a number of papers into one framework that allows evaluating the coherence of topics inferred by a topic model. In the context of this work coherence is defined as the mutual support of sets of facts - facts are represented by the top N words from a topic.

create tuples from the top N words in a topic
- pairs of single words {(game), (ball)}, {(team), (ball)}
- pairs of pairs of words {(game, ball)}, {(team, ball)}
- ...
measure the probability of those from a reference corpus
- document probability
- word probability
- ...
calculate a confirmation measure per tuple
- UCI: normalised sum over PMI values
- UMass
- NPMI
- ...
aggregate over all the tuples, for example with the mean

$$ C_{UCI} = \frac{2}{N (N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} PMI(w_i, w_j) $$

where PMI is

$$ PMI(w_i, w_j) = \log \frac{P(w_i, w_j) + \epsilon}{P(w_i)P(w_j)}$$
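The two formulas above can be sketched in a few lines of Python. This is a simplified illustration, not the gensim implementation: probabilities are estimated as the fraction of documents containing the words (the "document probability" option from the list), every top word is assumed to occur in the reference corpus, and the function names and the toy corpus are invented for the example.

```python
import math
from itertools import combinations


def document_probability(words, documents):
    """Fraction of reference documents that contain every word in `words`."""
    hits = sum(1 for doc in documents if all(w in doc for w in words))
    return hits / len(documents)


def pmi(w_i, w_j, documents, epsilon=1e-12):
    """Pointwise mutual information with epsilon smoothing on the joint probability."""
    joint = document_probability((w_i, w_j), documents)
    marginals = document_probability((w_i,), documents) * document_probability((w_j,), documents)
    return math.log((joint + epsilon) / marginals)


def c_uci(top_words, documents):
    """Mean PMI over all word pairs, i.e. 2 / (N (N - 1)) times the sum over pairs."""
    pairs = list(combinations(top_words, 2))
    return sum(pmi(a, b, documents) for a, b in pairs) / len(pairs)


# Hypothetical reference corpus: each document is a set of words
reference = [
    {"game", "ball", "team"},
    {"game", "team", "score"},
    {"ball", "team", "goal"},
    {"tax", "vote", "law"},
    {"vote", "law", "court"},
    {"game", "vote"},
]

coherent = c_uci(["game", "team", "ball"], reference)
mixed = c_uci(["game", "law", "ball"], reference)
print(coherent, mixed)
```

Word pairs that never co-occur are dominated by the epsilon term, so the mixed word set scores far lower than the coherent one.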

As pointed out in Reading Tea Leaves: How Humans Interpret Topic Models [emphasis mine]:

We emphasize that not measuring the internal representation of topic models is at odds with their presentation and development. Most topic modeling papers display qualitative assessments of the inferred topics or simply assert that topics are semantically meaningful ...

As we can see above, it is not immediately clear how the topics are semantically meaningful, even though the fit to the training data is good.

Eye Balling the Model¶

topic - word distributions
document topic distributions (spiky, not spiky)
LDAvis / pyLDAvis
termite (http://vis.stanford.edu)

demo of pyLDAvis¶

pyLDAvis problems¶

visual / eye-balling
- pyLDAvis (PCoA / MMDS: are topics 15 and 10 close to each other?) http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb
- is this a good model? (20 topics, 2000 documents)
- how much does topic 2 cover? what about topic 1?
- left = t-SNE, right = MMDS

Stanford VIS group¶

http://vis.stanford.edu/papers/topic-model-diagnostics

Word Intrusion and Topic Intrusion¶

Chang et al., Reading Tea Leaves: How Humans Interpret Topic Models

word intrusion¶

Find the intruding word in a set made up of top words picked from one topic of a topic model plus an intruder that has low probability in that topic but high probability in some other topic. The more the word picked as the intruder varies between human judges, the less coherent the model is.

{dog, cat, horse, apple, pig, cow}
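A minimal sketch of how such a task could be assembled and scored is given below. The score is the fraction of annotators who pick the true intruder, which Chang et al. call model precision. The helper names, the fixed random seed and the annotator answers are hypothetical.

```python
import random


def build_intrusion_task(topic_top_words, intruder, rng):
    """Mix the intruder into the topic's top words and shuffle the presentation order."""
    words = list(topic_top_words) + [intruder]
    rng.shuffle(words)
    return words


def model_precision(true_intruder, annotator_choices):
    """Fraction of annotators whose pick agrees with the true intruder."""
    agreeing = sum(1 for choice in annotator_choices if choice == true_intruder)
    return agreeing / len(annotator_choices)


rng = random.Random(0)  # fixed seed so the shuffled order is reproducible
task = build_intrusion_task(["dog", "cat", "horse", "pig", "cow"], "apple", rng)

# Hypothetical answers from five annotators who were shown `task`
choices = ["apple", "apple", "apple", "horse", "apple"]
precision = model_precision("apple", choices)
print(task, precision)
```

A precision near 1 means the annotators agree on the odd word out; a precision near chance (1 in 6 here) means the topic's top words do not hang together for them.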

topic intrusion¶

WARNING - WE ARE VEERING INTO PHILOSOPHY¶

What is The Meaning of Meaning??¶

Douven et al., Measuring Coherence https://www.researchgate.net/publication/220607660_Measuring_coherence

Chang et al., Reading Tea Leaves: How Humans Interpret Topic Models

The more the word picked as the intruder varies between human judges, the less coherent the model is.

I need word sets that have several equally plausible interpretations

{dog, cat, horse, apple, pig, cow}

{dog, carrot, horse, apple, pig, corn}

{cat, tuna, yarn, horse, stable, hay}

{cat, airport, yarn, horse, security, hay}

Supervised models are trained on labelled data and optimised against an external metric such as log loss or accuracy. Unsupervised models, on the other hand, at their simplest do frequency counting of terms in context, possibly aiming to fit a predefined parameterised distribution so that it is consistent with the statistics of some unlabelled data set.

More recently, maximising the similarity of words that appear in similar contexts has been put into a neural network context. Evaluating the trained model often starts by "eye-balling" the results, i.e. checking that your own expectations of similarity are fulfilled by the model.

Documents that talk about football should be in the same category and "cat" is more similar to "dog" than to "pen". Tools such as pyLDAvis and gensim provide many different ways to get an overview of the learned model or a single metric that can be maximised: topic coherence, perplexity, ontological similarity, term co-occurrence, word analogy. Using these methods without a good understanding of what the metric represents can give misleading results. The unsupervised models are also often used as part of larger processing pipelines. It is not clear if these intrinsic evaluation measures are appropriate in such cases; perhaps the models should instead be evaluated against an external metric like accuracy for the entire pipeline.

In this talk I will give an intuition of what the evaluation metrics are trying to achieve, give some recommendations for when to use them, point out pitfalls to be aware of when using LDA or word embeddings, and discuss the inherent difficulty of measuring or even defining semantic similarity concisely.

Is "cat" more similar to "tiger" than to "dog"? Ideally this information should be captured in a single metric that can be maximised.

Models like word2vec and GloVe are common ways of creating dense vector representations of word meaning. These allow

You will learn:

Questions and Comments:

what will I learn by attending?
the single metric is interesting! can you incorporate that in the shorter abstract as well?
im not sure you need to tell ppl unsupervised learning is popular, at least not in the shorter abstract imo
i am maybe not target audience but I only half-understand the bullet points. are there less academic or scientific words that can describe the same thing? or perhaps a paraphrase or question next to them like - perplexity: how we do X with Y?

It all depends on what correct and better means¶

Ideally we would be able to say whether a model is intrinsically -- or objectively -- good or bad. Measuring the quality of a topic model, or some other distributional/distributed model, is difficult to do intrinsically, mainly because an objective view of the goodness of the model is elusive. The similarity of pairs of words, or the assignment of documents into topics, is contextual; cat is close to dog if the context is bicycle, but what if the context is kitten, mouse or ball?

This may seem like a silly example but this is how a distributional composition was evaluated not that long ago. Evaluating topic models is easier than evaluating the similarity of certain word pairs as the topic model itself provides some context. Typically the evaluation of a model is done using a list of top words from topics.

topic models
- topic coherence
- human interpretability
- what do topics mean for humans
- a document that talks about regulation being considered for vaping equipment: does it belong in the lifestyle topic or politics?
- a document that talks about the negotiation between Lufthansa pilots and the company: is the document about travel or politics?
distributional semantics and what does it mean for something to mean something
- how do individual words get their meaning?
- what about sentences?
- what about documents?

In [6]:

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np

def randrange(n, vmin, vmax):
    '''
    Helper function to make an array of random numbers having shape (n, )
    with each number distributed Uniform(vmin, vmax).
    '''
    return (vmax - vmin)*np.random.rand(n) + vmin

fig = plt.figure(figsize=(25, 8))
# Three panels side by side: the same points in 3D, projected to 2D, and projected to 1D
ax = fig.add_subplot(131, projection='3d')
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)

n, b  = 2, 8

# For each set of style and range settings, plot n random points in the box
# defined by x in [23, 32], y in [0, 100], z in [zlow, zhigh].
for c, m, zlow, zhigh in [('r', 'o', -50, -25), ('b', '^', -b, -5)]:
    xs = randrange(n, 23, 32)
    ys = randrange(n, 0, 100)
    zs = randrange(n, zlow, zhigh)
    ax.scatter(xs, ys, zs, c=c, marker=m)
    ax2.scatter(xs, ys, c=c, marker=m)
    ax3.scatter(xs, np.zeros(ys.shape), c=c, marker=m)

In [7]:

plt.show()

http://distill.pub/2016/misread-tsne/

other measures¶

topic coherence (Reading Tea Leaves: How Humans Interpret Topic Models)
- word / topic intrusion
- perplexity
- Automatic Word Sense Discrimination Schütze 1998
- Automatic Evaluation of Topic Coherence, Newman et al. 2010
  - ontological similarity to link overlap and term co-occurrence
  - wordnet (path distance, Leacock-Chodorow, Wu-Palmer, Hirst-St Onge, Resnik Information Content, Lin's measure, Jiang-Conrath, LESK)
- inter-annotator agreement, but of what exactly (useful == coherent, unuseful==incoherent)
- inter-annotator agreement on labels for topics
the measures take into account only information that is present, not information that isn't present
- do you want topics that are highly separated or largely overlapping

In [13]:

Out[13]:

array([ 0.,  1.,  2.,  3.,  4.])

In [46]:

np.arange(binom.ppf(0.01, n, p),
          binom.ppf(0.99, n, p))

Out[46]:

array([ 12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,
        23.,  24.,  25.,  26.,  27.])

In [67]:

from scipy.stats import binom
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(1, 1)
# Plot the binomial probability mass function for an increasing number of trials n
x = np.arange(0, 30)
for n, p in [(20, 0.5), (30, 0.5), (40, 0.5)]:
    ax.plot(x, binom.pmf(x, n, p), label=f'(n={n}, p={p})')
plt.legend()
plt.show()

distributional semantics¶

word2vec, glove, APTs and distributional composition
the meaning of a word is "the company it keeps" - what's the meaning of two or more words put together?
intrinsic evaluation is nearly impossible
river delta, river estuary (suisto, estuaari?) - why doesn't Finnish have an equivalent for estuary?
good, bad, pear, apple
blue, red, green?

analogy task¶

king - man + woman == queen

cider - alcohol == applejuice

apple + drink = cider

cider - apple + (hops + barley) == beer

is good closer to apple than it is to bad?
is good closer to maybe than it is to and?
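The arithmetic behind the analogy task can be sketched with hand-made vectors. The three dimensions and all the numbers below are invented for illustration and do not come from word2vec or GloVe. Note that the query words are excluded from the candidates, as is usual in analogy evaluations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the three query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Hypothetical 3-d vectors; the dimensions loosely stand for (royalty, maleness, femaleness)
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

answer = analogy("king", "man", "woman", toy_vectors)
print(answer)
```

With vectors built by hand the analogy works by construction. With learned vectors the result depends on the training corpus and on which words are allowed as candidates.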

forget trying to define what the meaning of meaning is and use the damn thing¶

none of the deep learning models are attempting to understand language, they are all trying to solve a task by possibly understanding language
evaluating on a task is also not always easy because humans tend to be messy creatures (multi-label classification: while sports is clear, celebrity gossip is less clear)

what does this mean for NLP?¶

In [1]:

from spacy import en

In [3]:

spc = en.English()

In [11]:

len(spc.vocab)

Out[11]:

In [10]:

lex_good = spc.vocab['good']
lex_bad = spc.vocab['bad']
lex_good.vector - lex_bad.vector

Out[10]:

array([ 0.03778473, -0.03715277, -0.07361357, -0.01294211,  0.02956894,
        0.00120951,  0.00651724, -0.04558603, -0.04687905, -0.04818122,
       -0.05862857,  0.0269573 , -0.03065341, -0.02568185,  0.0108215 ,
        0.03609811, -0.00549456, -0.00496092,  0.01636782, -0.11323725,
        0.00400667,  0.03294041, -0.03766143, -0.11303058, -0.01312381,
        0.00074617, -0.02796576,  0.03918718, -0.06753656,  0.04056126,
       -0.02435833,  0.03351083,  0.05406209, -0.05045499, -0.03866494,
       -0.11557291, -0.0241404 , -0.02356568, -0.00907446, -0.03900382,
        0.03083578, -0.0040055 ,  0.03728203,  0.01200607, -0.06109978,
        0.01538801,  0.01623785, -0.00788699,  0.01040638, -0.09391987,
        0.03025245,  0.06502789, -0.0257586 , -0.02438577, -0.08735697,
        0.00546392, -0.04054803, -0.01161486,  0.04618217,  0.08720689,
       -0.01748072,  0.07930992, -0.02152181, -0.0473533 ,  0.06699783,
       -0.05474082, -0.01087604, -0.01853025, -0.07290894,  0.10466487,
       -0.05987743, -0.0461891 ,  0.04221539, -0.00659332, -0.06081026,
       -0.00947931, -0.02315397,  0.00336254, -0.01106939, -0.03976184,
        0.02458255,  0.01908594, -0.00534361,  0.05926706, -0.01174925,
       -0.04580118,  0.02993831,  0.03481253, -0.08198168,  0.06468022,
        0.03533567,  0.02087123, -0.02391223,  0.01522605, -0.05712903,
        0.01759095,  0.04063302,  0.04339333, -0.01607532,  0.07680882,
       -0.01245796, -0.03034243, -0.07419388, -0.06572168, -0.07219325,
        0.06293282,  0.04205929, -0.0036073 , -0.00917283,  0.05123605,
        0.05294095,  0.04576715,  0.03751262, -0.01752511, -0.01464052,
        0.055594  , -0.04663217, -0.15819344, -0.07700203, -0.05248307,
       -0.05859297,  0.04869274, -0.030218  ,  0.02480839,  0.04311489,
       -0.03834178, -0.00127464,  0.01092765, -0.00921719,  0.0550067 ,
        0.0086538 ,  0.02380663,  0.05412368, -0.04334639,  0.03546507,
       -0.0240002 ,  0.00161588,  0.00717151, -0.06340153,  0.03611495,
        0.00684508,  0.03980317,  0.00355432, -0.04512753, -0.04908818,
       -0.01253648, -0.0289698 , -0.01168478, -0.08543831,  0.03253756,
        0.02718301, -0.08928521, -0.02979837, -0.01723359, -0.03895465,
        0.03317515,  0.00332391,  0.00104449,  0.00221558,  0.07882038,
        0.01369842,  0.01719748,  0.02513313,  0.04267995,  0.01815877,
        0.06622416,  0.05711883, -0.00110836,  0.05295034, -0.04159397,
       -0.05750373, -0.07719684,  0.06881331,  0.06464667,  0.0088968 ,
        0.05262515, -0.05066305, -0.05115263,  0.11510406, -0.02314681,
       -0.03311234,  0.04257908,  0.12171125,  0.0367321 ,  0.01495446,
       -0.03988375,  0.0986093 , -0.03378087, -0.04653993,  0.0841633 ,
        0.00041698,  0.02687573,  0.01568856,  0.02757556,  0.05654669,
        0.02147451, -0.04082048,  0.08263498, -0.04396538,  0.00794014,
       -0.05222085, -0.12295387,  0.08206398, -0.04020041,  0.0451312 ,
        0.04493903, -0.04647713, -0.04017274, -0.01821278, -0.01989489,
       -0.0391674 ,  0.02027123,  0.07091188,  0.05683799,  0.09986372,
       -0.03073186, -0.0272452 ,  0.0239141 ,  0.0560332 ,  0.03768767,
       -0.01591393,  0.08834616, -0.00477553, -0.07502193, -0.00355799,
        0.01109183, -0.00536651, -0.00906663,  0.06660841,  0.01939166,
       -0.08296614,  0.04513927,  0.00738879, -0.04783604,  0.01958616,
        0.03531285, -0.02142701,  0.03918985, -0.02637605, -0.00247036,
       -0.01580125, -0.0285096 , -0.00751369,  0.0006706 , -0.03076843,
       -0.00575502, -0.07914151,  0.03216527,  0.00512136, -0.02817366,
       -0.03781955,  0.09049586,  0.01508809,  0.0002537 , -0.05053477,
        0.05154572,  0.12314805, -0.01373708, -0.01833541, -0.03917671,
       -0.00880381, -0.05332572, -0.09396337, -0.05920757,  0.03675275,
       -0.02788009,  0.00083914,  0.00773343, -0.04634326, -0.11024204,
       -0.08266747, -0.01772676, -0.0824718 ,  0.05847647,  0.02679466,
        0.07236496, -0.04660835, -0.04092844,  0.05264228, -0.01337559,
        0.02150504,  0.011709  ,  0.03754292, -0.04147887,  0.10324782,
        0.03408377, -0.02170971, -0.00261355,  0.04005083, -0.05963475,
        0.02341744,  0.0116286 , -0.04728064,  0.00368066,  0.00422844,
        0.01396303,  0.01315625,  0.05517897, -0.00925512,  0.00554182], dtype=float32)

https://www.youtube.com/watch?v=uLgn3geod9Q (How a dictionary writer defines English)
https://youtu.be/uLgn3geod9Q?t=2m3s

"when we revise a dictionary, you go through it A-Z and you take all of the instances for the word that you're looking at. You're matching up the word and its contextual use ..."

antidisestablishmentarianism

Demonstration of the Topic Coherence model in gensim.

https://nbviewer.jupyter.org/github/dsquareindia/gensim/blob/280375fe14adea67ce6384ba7eabf362b05e6029/docs/notebooks/topic_coherence_tutorial.ipynb

Topic Coherence

http://qpleple.com/topic-coherence-to-evaluate-topic-models/