Unsupervised models in natural language processing (NLP) have a long history but have recently become very popular. Word2vec, GloVe, LSI and LDA provide powerful computational tools to deal with natural language and make exploring and modelling large document collections feasible.
Often evaluating the model output requires an existing understanding of what should come out. For topic models the output should reflect our understanding of the relatedness of topical categories, for instance sports, travel or machine learning. Distributional models of language such as word2vec and GloVe should capture some, or ideally all, of the semantics of how language is used.
This is a lot to ask! Not necessarily because it isn't learnable (after all, we've learned it), but because we are not necessarily able to represent the desired output as an evaluation function and data set that can be optimised. As an example, topic models are often evaluated with respect to the semantic coherence of the topics, based on a set of top words from each topic. It is not clear if a set of words such as {cat, dog, horse, pet} fully captures the semantics of animalness or petsiness. Nevertheless these methods are useful in determining if the distributed word representations are capturing some of the information conveyed by words and if a topic model is understandable to a human.
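To make the idea of scoring a set of top words concrete, here is a minimal sketch of a document co-occurrence score in the spirit of the UMass coherence measure. The toy documents, the function name `umass_coherence` and the +1 smoothing are assumptions chosen for illustration; this is not the evaluation code used elsewhere in the notebook.

```python
from itertools import combinations
from math import log


def umass_coherence(top_words, documents):
    """Sum log((D(w_i, w_j) + 1) / D(w_j)) over pairs of top words, w_j ranked before w_i.

    Assumes every top word occurs in at least one document, so D(w_j) > 0.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # j < i
        w_j, w_i = top_words[j], top_words[i]
        score += log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score


# Hypothetical toy corpus, already tokenised
toy_documents = [
    ['cat', 'dog', 'pet'],
    ['dog', 'horse', 'pet'],
    ['cat', 'horse', 'pet'],
    ['stock', 'market', 'bank'],
    ['bank', 'market', 'cat'],
]

coherent = umass_coherence(['pet', 'cat', 'dog', 'horse'], toy_documents)
mixed = umass_coherence(['pet', 'cat', 'market', 'bank'], toy_documents)
```

On this toy corpus the animal-themed word set scores higher (closer to zero) than the mixed set, but the score only reflects which words share documents; it says nothing about whether the set fully captures "petsiness".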
This notebook explores a number of these issues in context and aims to provide an overview of the research that has been done in the past 10 or so years, mostly focusing on topic models.
The notebook is split into three parts
Random collection of other stuff
While preparing the talk and the notebook I experimented with a lot of different software packages and corpora. These are dumped as a somewhat unorganised collection of "other things" at the end of the notebook.
We would like to be able to say if a model is objectively good or bad, and compare different models to each other. This requires us to have an objective measure for the quality of the model but many of the tasks mentioned above require subjective evaluation.
In practical applications one needs to evaluate if "the correct thing" has been learned. Often this means applying implicit knowledge and "eye-balling": documents that talk about football should be in the same category, and cat is more similar to dog than to pen. Ideally this information should be captured in a single metric that can be maximised. It is not clear how to formulate such a metric, however. Over the years there have been numerous attempts, from various angles, at formulating semantic coherence. None capture the desired outcome fully, and there are numerous issues one should be aware of when applying those metrics.
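The "cat is more similar to dog than to pen" intuition is usually checked with cosine similarity between word vectors. The sketch below uses made-up three-dimensional vectors purely for illustration; real word2vec or GloVe vectors have hundreds of dimensions and the names here are hypothetical.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, not taken from any trained model
toy_vectors = {
    'cat': [0.9, 0.8, 0.1],
    'dog': [0.8, 0.9, 0.2],
    'pen': [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors['cat'], toy_vectors['dog'])
sim_cat_pen = cosine_similarity(toy_vectors['cat'], toy_vectors['pen'])
```

A check like `sim_cat_dog > sim_cat_pen` is easy to automate, but choosing which word pairs to test is exactly the implicit knowledge discussed above.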
Some of the issues are related to the metrics themselves, but others are related to external factors, such as which kind of held-out data to use. Natural language is messy, ambiguous and full of interpretation; that's where a lot of the expressive richness comes from. Sometimes trying to cleanse the ambiguity also reduces language to an unnatural form.
Topic models aim to capture the way in which words co-occur in the context of a document and divide the source corpus into some number of (soft) clusters. There are a large number of variations on the topic model. Initial work was done by Deerwester in developing Latent Semantic Analysis (LSA/LSI); now the canonical example is Latent Dirichlet Allocation, or LDA. The unit upon which topic models work is a sparse document-term matrix, depicted below.
Each row is a document, each column is a term and the entries in each cell usually represent the frequency of each term in each document.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
documents = ['The cat sat on the mat', 'A red cat sat on the mat', 'No cat sat on the mat']
vectoriser = CountVectorizer().fit(documents)
X = vectoriser.transform(documents)
# toarray() densifies the sparse matrix; get_feature_names_out() returns the terms in column order
pd.DataFrame(X.toarray(), columns=vectoriser.get_feature_names_out())
| | cat | mat | no | on | red | sat | the |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | 0 | 1 | 0 | 1 | 2 |
1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
2 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
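Most entries in a real document-term matrix are zero, so in practice each row is stored sparsely as (term id, count) pairs. The sketch below builds that representation by hand for the same three sentences. It uses a plain whitespace tokeniser, which is an assumption for illustration and differs from CountVectorizer's default (for instance it keeps the single-character token "a").

```python
from collections import Counter

documents = ['The cat sat on the mat', 'A red cat sat on the mat', 'No cat sat on the mat']

# Naive tokenisation: lower-case and split on whitespace (illustrative only)
tokenised = [doc.lower().split() for doc in documents]

# Assign each distinct term an integer id, in alphabetical order
vocabulary = {term: idx
              for idx, term in enumerate(sorted({t for doc in tokenised for t in doc}))}


def to_sparse_row(tokens):
    """Return a sorted list of (term_id, count) pairs; absent terms are simply omitted."""
    counts = Counter(tokens)
    return sorted((vocabulary[term], count) for term, count in counts.items())


sparse_rows = [to_sparse_row(doc) for doc in tokenised]
```

Only the terms that actually occur in a document take up space, which is what makes large vocabularies tractable. The gensim bag-of-words format used later in this notebook follows the same (id, count) idea.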
In the case of LDA a Bayesian model is fitted using the data from this matrix. Each topic in the model becomes a probability distribution over terms (the columns). Conceptually this is saying that semantic concepts can be represented as probabilities over a set of words. This makes sense as the topic of discussion acts as a limiting factor on the vocabulary one is likely to use, hear or read in the context of that discussion.
Words relating to political campaigning are much less likely to be observed in documents that discuss ice hockey. Notice however that this is unlikely, not impossible: it is simply statistically less likely that caucus or polling will appear in a document that otherwise discusses Teemu Selänne retiring. A topic therefore is a probability distribution over the entire vocabulary, indicating how likely each word is to occur within that topic.
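As a small illustration of "a topic is a probability distribution over the entire vocabulary", the sketch below defines two hand-written topics over a five-word vocabulary and samples words from one of them. All words and probabilities are invented for the example and are not taken from any fitted model.

```python
import random

vocabulary = ['caucus', 'polling', 'puck', 'goal', 'retire']

# Hypothetical topic-term distributions; each row sums to 1 over the whole vocabulary
topics = {
    'politics':   [0.45, 0.40, 0.01, 0.04, 0.10],
    'ice_hockey': [0.01, 0.02, 0.42, 0.40, 0.15],
}


def sample_words(topic, n, seed=0):
    """Draw n words from the given topic's distribution."""
    rng = random.Random(seed)
    return rng.choices(vocabulary, weights=topics[topic], k=n)


hockey_words = sample_words('ice_hockey', 20)
```

Note that "caucus" has a small but non-zero probability under the ice hockey topic: it is unlikely to be drawn, not impossible.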
The documents the model is built over can be as short as a single sentence (a tweet) or as long as a chapter in a book. Typically very short documents tend to be more difficult to build coherent models over than slightly longer documents.
In order to evaluate a model, we must of course have one. I'll use the same model(s), built from the Fake News data set on Kaggle, throughout this notebook.
import pandas as pd
df_fake = pd.read_csv('/usr/local/scratch/data/kaggle/fake.csv')
df_fake[['title', 'text', 'language']].head()
| | title | text | language |
---|---|---|---|
0 | Muslims BUSTED: They Stole Millions In Gov’t B... | Print They should pay all the back all the mon... | english |
1 | Re: Why Did Attorney General Loretta Lynch Ple... | Why Did Attorney General Loretta Lynch Plead T... | english |
2 | BREAKING: Weiner Cooperating With FBI On Hilla... | Red State : \nFox News Sunday reported this mo... | english |
3 | PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe... | Email Kayla Mueller was a prisoner and torture... | english |
4 | FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal... | Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ... | english |
import numpy as np
df_fake = df_fake.loc[(pd.notnull(df_fake.text)) & (df_fake.language == 'english')]
df_fake.shape
(12357, 20)
There is a total of 12357 non-empty English-language documents, which should be enough to build a model. Let's parse the documents using spacy, getting rid of some non-content words, and chuck that into gensim. I'll use gensim.corpora.MmCorpus to serialise the text onto disk; this both saves memory and allows random access to the corpus, which will become useful later for creating different splits of the data.
import spacy
import gensim
from gensim.models import LdaModel
from gensim.corpora import Dictionary, MmCorpus
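spc = spacy.load('en')
# Keep only content-bearing parts of speech; string tags avoid relying on version-specific numeric POS ids
KEEP_POS = {'NOUN', 'VERB', 'ADJ', 'ADV', 'PROPN'}
# Skip the dependency parser and entity recogniser, only the tagger output is needed
pipe = spc.pipe(df_fake.text, parse=False, entity=False, n_threads=8)
processed = [[token.lemma_ for token in document if token.pos_ in KEEP_POS] for document in pipe]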
vocabulary = Dictionary(processed)
vocabulary.filter_extremes(no_below=3, no_above=0.5)
MmCorpus.serialize('./fake_news.mm', (vocabulary.doc2bow(doc) for doc in processed))
vocabulary.save('./fake_news.vocab')
del processed
corpus_fake = MmCorpus('./fake_news.mm')
lda_fake = LdaModel(corpus=corpus_fake, id2word=vocabulary, num_topics=35, chunksize=1500, iterations=200, alpha='auto')
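The random access mentioned above is what makes it cheap to carve the corpus into different splits: one only needs lists of document indices. A minimal sketch of producing such index lists follows. The function name `split_indices`, the 10% held-out fraction and the seed are assumptions for illustration, and the example deliberately avoids touching the corpus itself.

```python
import random


def split_indices(n_documents, held_out_fraction=0.1, seed=42):
    """Shuffle document indices and return (training, held_out) index lists."""
    indices = list(range(n_documents))
    random.Random(seed).shuffle(indices)
    n_held_out = int(n_documents * held_out_fraction)
    return sorted(indices[n_held_out:]), sorted(indices[:n_held_out])


# 12357 is the number of documents reported for the filtered data set
train_idx, held_out_idx = split_indices(12357)

# Assumption: with an indexed on-disk corpus, documents could then be fetched
# one at a time, e.g. (corpus_fake[i] for i in held_out_idx)
```

Fixing the seed keeps the split reproducible, so models trained at different times are evaluated against the same held-out documents.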
Inspecting the top six words from each topic in the model, we can certainly identify some structure. There are topics about the Flint, Michigan water crisis (Topic 11), the Dakota Access Pipeline protests (Topic 9) and the US elections.
pd.DataFrame([[word for word, prob in words]
              for topic_id, words in lda_fake.show_topics(formatted=False, num_words=6, num_topics=35)])
| | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
0 | dna | video | abortion | veritas | project | know |
1 | al | syrian | force | mosul | aleppo | isis |
2 | black | obama | white | people | soros | america |
3 | china | which | alien | would | there | ufo |
4 | get | go | ’ | just | know | there |
5 | russia | russian | putin | nato | missile | nuclear |
6 | market | cancer | product | bank | study | report |
7 | use | drug | health | company | report | medical |
8 | post | comment | news | video | result | |
9 | pipeline | dakota | police | water | rock | protester |
10 | child | school | police | family | when | people |
11 | water | study | health | toxic | flint | tea |
12 | clinton | campaign | hillary | podesta | wikileaks | |
13 | clinton | assange | hillary | there | trump | would |
14 | obama | money | bank | hillary | president | make |
15 | party | care | kelly | insurance | obamacare | year |
16 | trump | vote | clinton | election | hillary | voter |
17 | people | world | other | man | which | when |
18 | ukraine | ukrainian | india | japan | fukushima | us |
19 | turkey | turkish | rebel | adl | erdogan | glacier |
20 | text | god | which | moon | body | silver |
21 | fake | earthquake | california | rodgers | jay | quake |
22 | eddie | government | people | country | would | which |
23 | which | would | egypt | britain | news | report |
24 | war | us | syria | russia | military | country |
25 | israel | israeli | nuclear | zika | palestinian | resolution |
26 | life | time | other | year | find | people |
27 | herb | universe | which | korea | planet | x |
28 | trump | election | day | just | go | get |
29 | trump | would | president | donald | election | clinton |
30 | immigration | immigrant | illegal | percent | migrant | new |
31 | law | court | judge | case | federal | government |
32 | food | energy | oil | also | use | which |
33 | clinton | fbi | investigation | hillary | comey | |
34 | jews | jewish | which | mars | only | american |
lda_fake.save('./fake_news_35.lda')
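The "top words" in the table above are simply the highest-probability terms of each topic's distribution. A minimal sketch of that ranking step is shown here on a hand-made distribution; the words, probabilities and the helper name `top_words` are hypothetical and not read from the saved model.

```python
def top_words(topic_term_probs, id2word, n=3):
    """Return the n terms with the highest probability in a single topic."""
    ranked = sorted(range(len(topic_term_probs)),
                    key=lambda i: topic_term_probs[i],
                    reverse=True)
    return [id2word[i] for i in ranked[:n]]


# Hypothetical vocabulary and topic distribution (sums to 1)
toy_id2word = {0: 'water', 1: 'flint', 2: 'pipeline', 3: 'dakota', 4: 'toxic'}
toy_topic = [0.30, 0.25, 0.02, 0.03, 0.40]

toy_top = top_words(toy_topic, toy_id2word)
```

Everything below the cut-off is discarded, which is worth keeping in mind when judging a topic only by its first handful of words.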
The visualisations in the Termite paper look very promising, but I've been unable to run the code. The original project has been split into two separate projects, a data server and a visualisation client. Unfortunately the data server uses an unknown data format in SQLite databases, the host server where the data sets ought to be is no longer operational, and the project hasn't been maintained since 2014.
The project also relies on web2py, which at the moment only supports python 2, and there doesn't seem to be any interest in porting it to python 3.
Anyhow, it would seem to be possible to run the project under a python 2 environment.
- read_gensim.py to add --sentence-splitter cmd arg
- bin/apps/SplitSentences.py to have an extra param for sentence_splitter jar location
- gensim API breaking changes
  - bin/readers/GensimReader.py line 47 ldamodel.show_topics
  - bin/readers/GensimReader.py line 51 topic/term distribution does not need enumerate anymore
  - bin/readers/GensimReader.py line 52 swap term and value around - they are the wrong way around
- termite makes a lot of assumptions about paths, one needs to be quite careful what the root directory is for running the commands
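The gensim-related changes above come down to the shape of the unformatted topic listing, which this notebook already relies on earlier: a list of (topic id, list of (term, value)) pairs. The sketch below walks a stand-in structure of that shape. The data is invented and the code is not the actual GensimReader.py source, which is not reproduced here.

```python
# Hypothetical stand-in for the result of ldamodel.show_topics(formatted=False)
shown_topics = [
    (0, [('obama', 0.009), ('vote', 0.008)]),
    (1, [('court', 0.015), ('supreme', 0.010)]),
]

rows = []
# The topic id is already part of each entry, so no enumerate() is needed
for topic_id, term_values in shown_topics:
    # Each pair is (term, value), in that order
    for term, value in term_values:
        rows.append((topic_id, term, value))
```

Unpacking the pairs explicitly makes it obvious if the term and value ever end up the wrong way around again.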
import sys
sys.path.append('/home/matti/termite-data-server/bin/')
from modellers import GensimLDA
import re
df_fake['text_oneline'] = df_fake.text.apply(lambda s: re.sub(r'\s+', ' ', str(s)))
df_fake[['uuid', 'text_oneline']].to_csv('./fakenews.termite.tsv', sep='\t', header=False, index=False)
py27 = '/home/matti/miniconda3/envs/py27/bin/python'
termite_server_root = '/home/matti/termite-data-server/'
First we need to import the corpus into termite's own special SQLite format
!mkdir termite;\
cp ./fakenews.termite.tsv ./termite;
cd termite; $py27 /home/matti/termite-data-server/bin/import_corpus.py ./db ./fakenews.termite.tsv
mkdir: cannot create directory ‘termite’: File exists
WARNING:root:Unable to import plural rules: No module named plural_rules
Importing file [./fakenews.termite.tsv] into database [./db/corpus.db]
Then we need to export that SQLite DB back into a text corpus. There are some magic file names and path structures involved here, so you can't just use the original file
!cd termite; mkdir corpus; $py27 /home/matti/termite-data-server/bin/export_corpus.py ./db ./corpus/corpus.txt
WARNING:root:Unable to import plural rules: No module named plural_rules
Exporting database [./db/corpus.db] to file [./corpus/corpus.txt]
Then train the LDA model. It should be possible to skip this and just use any model trained with gensim
%capture
!cd termite; $py27 /home/matti/termite-data-server/bin/train_gensim.py --overwrite ./corpus ./models/
--------------------------------------------------------------------------------
Training an LDA topic model using gensim...
corpus = ./corpus/corpus.txt
model = ./models/
token_regex = \w{3,}
topics = 20
passes = 1
--------------------------------------------------------------------------------
using symmetric alpha at 0.05
using symmetric eta at 3.17268948888e-05
using serial LDA version on this node
running online (single-pass) LDA training, 20 topics, 1 passes over the supplied corpus of 12357 documents, updating model once every 2000 documents, evaluating perplexity every 12357 documents, iterating 50x with a convergence threshold of 0.001000
too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
PROGRESS: pass 0, at document #2000/12357
merging changes from 2000 documents into a model of 12357 documents
topic #0 (0.050): 0.003*"black" + 0.003*"obama" + 0.003*"syria" + 0.002*"white" + 0.002*"party" + 0.002*"military" + 0.002*"great" + 0.002*"syrian" + 0.002*"anti" + 0.002*"russia"
topic #14 (0.050): 0.003*"party" + 0.003*"syria" + 0.002*"isis" + 0.002*"house" + 0.002*"russia" + 0.002*"city" + 0.002*"military" + 0.002*"white" + 0.002*"democratic" + 0.002*"power"
topic #12 (0.050): 0.002*"black" + 0.002*"obama" + 0.002*"power" + 0.002*"white" + 0.002*"things" + 0.002*"police" + 0.002*"fbi" + 0.002*"city" + 0.002*"change" + 0.001*"vote"
topic #13 (0.050): 0.004*"syria" + 0.003*"russia" + 0.003*"syrian" + 0.003*"black" + 0.003*"police" + 0.002*"military" + 0.002*"obama" + 0.002*"power" + 0.002*"white" + 0.002*"russian"
topic #17 (0.050): 0.004*"white" + 0.003*"syria" + 0.002*"party" + 0.002*"obama" + 0.002*"israel" + 0.002*"house" + 0.002*"email" + 0.002*"russia" + 0.002*"iran" + 0.002*"foreign"
topic diff=7.126265, rho=1.000000
PROGRESS: pass 0, at document #4000/12357
merging changes from 2000 documents into a model of 12357 documents
topic #3 (0.050): 0.003*"power" + 0.003*"black" + 0.003*"white" + 0.002*"law" + 0.002*"fbi" + 0.002*"americans" + 0.002*"emails" + 0.002*"economic" + 0.002*"article" + 0.002*"life"
topic #8 (0.050): 0.008*"email" + 0.004*"emails" + 0.003*"obama" + 0.002*"law" + 0.002*"follow" + 0.002*"information" + 0.002*"saudi" + 0.002*"russia" + 0.002*"000" + 0.002*"wikileaks"
topic #4 (0.050): 0.003*"assange" + 0.003*"nuclear" + 0.002*"power" + 0.002*"control" + 0.002*"weapons" + 0.002*"russia" + 0.002*"high" + 0.002*"wikileaks" + 0.002*"facebook" + 0.001*"department"
topic #16 (0.050): 0.006*"stockman" + 0.006*"party" + 0.004*"obama" + 0.004*"black" + 0.003*"police" + 0.003*"democratic" + 0.003*"social" + 0.002*"left" + 0.002*"white" + 0.002*"david"
topic #9 (0.050): 0.006*"fbi" + 0.003*"department" + 0.002*"emails" + 0.002*"including" + 0.002*"vote" + 0.002*"party" + 0.002*"weiner" + 0.002*"democratic" + 0.002*"saudi" + 0.002*"officials"
topic diff=2.843640, rho=0.707107
PROGRESS: pass 0, at document #6000/12357
merging changes from 2000 documents into a model of 12357 documents
topic #18 (0.050): 0.005*"obama" + 0.002*"women" + 0.002*"great" + 0.002*"military" + 0.002*"history" + 0.002*"party" + 0.002*"life" + 0.002*"house" + 0.002*"power" + 0.002*"white"
topic #8 (0.050): 0.009*"email" + 0.005*"emails" + 0.003*"obama" + 0.003*"wikileaks" + 0.003*"saudi" + 0.003*"information" + 0.003*"podesta" + 0.002*"law" + 0.002*"000" + 0.002*"2015"
topic #19 (0.050): 0.007*"vote" + 0.006*"obama" + 0.004*"voting" + 0.004*"voters" + 0.004*"white" + 0.004*"republican" + 0.003*"party" + 0.003*"presidential" + 0.003*"candidate" + 0.003*"podesta"
topic #5 (0.050): 0.008*"russia" + 0.005*"russian" + 0.003*"putin" + 0.003*"syria" + 0.002*"military" + 0.002*"nuclear" + 0.002*"foreign" + 0.002*"anti" + 0.002*"israel" + 0.002*"obama"
topic #4 (0.050): 0.006*"health" + 0.005*"brain" + 0.004*"assange" + 0.004*"nuclear" + 0.003*"widget" + 0.003*"food" + 0.003*"medical" + 0.002*"life" + 0.002*"fat" + 0.002*"cancer"
topic diff=1.850195, rho=0.577350
PROGRESS: pass 0, at document #8000/12357
merging changes from 2000 documents into a model of 12357 documents
topic #2 (0.050): 0.003*"ukraine" + 0.003*"rights" + 0.003*"party" + 0.003*"anti" + 0.002*"ukrainian" + 0.002*"left" + 0.002*"saakashvili" + 0.002*"order" + 0.002*"british" + 0.002*"jewish"
topic #17 (0.050): 0.012*"israel" + 0.006*"iran" + 0.004*"israeli" + 0.004*"jerusalem" + 0.003*"palestinian" + 0.003*"jewish" + 0.003*"minister" + 0.003*"obama" + 0.003*"foreign" + 0.003*"resolution"
topic #4 (0.050): 0.008*"health" + 0.008*"food" + 0.005*"cancer" + 0.004*"water" + 0.004*"foods" + 0.004*"brain" + 0.003*"body" + 0.003*"diet" + 0.003*"study" + 0.003*"high"
topic #14 (0.050): 0.005*"comey" + 0.004*"fbi" + 0.004*"moore" + 0.003*"investigation" + 0.003*"house" + 0.002*"party" + 0.002*"airbnb" + 0.002*"she" + 0.002*"power" + 0.002*"khan"
topic #16 (0.050): 0.008*"party" + 0.005*"court" + 0.004*"black" + 0.004*"obama" + 0.004*"supreme" + 0.004*"democratic" + 0.003*"police" + 0.003*"rights" + 0.002*"republican" + 0.002*"white"
topic diff=1.401028, rho=0.500000
PROGRESS: pass 0, at document #10000/12357
merging changes from 2000 documents into a model of 12357 documents
topic #14 (0.050): 0.005*"onion" + 0.004*"sheeple" + 0.004*"automatically" + 0.004*"text" + 0.003*"republish" + 0.003*"comey" + 0.003*"moore" + 0.003*"spam" + 0.003*"star" + 0.003*"fbi"
topic #15 (0.050): 0.013*"police" + 0.009*"water" + 0.006*"pipeline" + 0.005*"dakota" + 0.005*"law" + 0.004*"federal" + 0.004*"standing" + 0.004*"rock" + 0.003*"land" + 0.003*"protesters"
topic #5 (0.050): 0.017*"russia" + 0.010*"russian" + 0.008*"putin" + 0.004*"foreign" + 0.003*"ukraine" + 0.003*"obama" + 0.003*"policy" + 0.003*"nuclear" + 0.003*"military" + 0.003*"europe"
topic #18 (0.050): 0.008*"obama" + 0.003*"space" + 0.002*"god" + 0.002*"women" + 0.002*"great" + 0.002*"house" + 0.002*"man" + 0.002*"history" + 0.002*"earth" + 0.002*"kelly"
topic #16 (0.050): 0.012*"obama" + 0.007*"court" + 0.007*"party" + 0.006*"supreme" + 0.004*"black" + 0.004*"police" + 0.004*"constitution" + 0.004*"white" + 0.004*"rights" + 0.003*"house"
topic diff=1.336365, rho=0.447214
PROGRESS: pass 0, at document #12000/12357
merging changes from 2000 documents into a model of 12357 documents
topic #9 (0.050): 0.007*"department" + 0.006*"soros" + 0.006*"foundation" + 0.005*"law" + 0.004*"officials" + 0.004*"fbi" + 0.003*"federal" + 0.003*"secret" + 0.003*"million" + 0.003*"attorney"
topic #19 (0.050): 0.009*"obama" + 0.008*"vote" + 0.006*"voters" + 0.005*"presidential" + 0.005*"voting" + 0.005*"white" + 0.005*"republican" + 0.004*"party" + 0.004*"candidate" + 0.004*"house"
topic #12 (0.050): 0.004*"love" + 0.003*"life" + 0.003*"energy" + 0.003*"things" + 0.003*"human" + 0.003*"self" + 0.002*"earth" + 0.002*"control" + 0.002*"power" + 0.002*"feel"
topic #3 (0.050): 0.004*"power" + 0.003*"obamacare" + 0.003*"americans" + 0.002*"care" + 0.002*"global" + 0.002*"change" + 0.002*"pay" + 0.002*"order" + 0.002*"economic" + 0.002*"white"
topic #16 (0.050): 0.009*"obama" + 0.008*"court" + 0.007*"party" + 0.006*"supreme" + 0.005*"immigration" + 0.004*"black" + 0.004*"rights" + 0.004*"constitution" + 0.004*"police" + 0.004*"white"
topic diff=1.076053, rho=0.408248
-9.063 per-word bound, 535.0 perplexity estimate based on a held-out corpus of 357 documents with 68794 words
PROGRESS: pass 0, at document #12357/12357
merging changes from 357 documents into a model of 12357 documents
topic #14 (0.050): 0.009*"moore" + 0.004*"violence" + 0.003*"manchanda" + 0.003*"onion" + 0.003*"star" + 0.003*"joe" + 0.003*"sheeple" + 0.003*"foster" + 0.003*"musket" + 0.003*"8th"
topic #7 (0.050): 0.010*"com" + 0.006*"www" + 0.006*"facebook" + 0.005*"http" + 0.005*"information" + 0.004*"data" + 0.004*"google" + 0.003*"star" + 0.003*"content" + 0.003*"internet"
topic #16 (0.050): 0.015*"court" + 0.010*"supreme" + 0.009*"eddie" + 0.008*"judges" + 0.007*"party" + 0.007*"obama" + 0.006*"marriage" + 0.006*"law" + 0.005*"immigration" + 0.004*"unelected"
topic #19 (0.050): 0.009*"obama" + 0.009*"vote" + 0.006*"voting" + 0.006*"voters" + 0.005*"presidential" + 0.005*"republican" + 0.005*"democratic" + 0.004*"white" + 0.004*"candidate" + 0.004*"party"
topic #1 (0.050): 0.005*"life" + 0.004*"women" + 0.004*"children" + 0.003*"man" + 0.003*"old" + 0.003*"family" + 0.003*"child" + 0.003*"father" + 0.002*"home" + 0.002*"things"
topic diff=0.888690, rho=0.377964
Saving dictionary to disk: ./models//dictionary.gensim
Saving corpus to disk: ./models//corpus.gensim
/home/matti/miniconda3/envs/py27/lib/python2.7/site-packages/gensim/interfaces.py:60: UserWarning: corpus.save() stores only the (tiny) iteration object; to serialize the actual corpus content, use e.g. MmCorpus.serialize(corpus)
warnings.warn("corpus.save() stores only the (tiny) iteration object; "
Saving model to disk: ./models//lda.gensim
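The training log above reports both a per-word bound (-9.063) and a perplexity estimate (535.0) for the held-out documents. The two figures are consistent with perplexity being 2 raised to the negative per-word bound; the short check below assumes that relationship and uses only the numbers printed in the log.

```python
# Per-word likelihood bound as printed in the training log
per_word_bound = -9.063

# Assumed relationship: perplexity = 2 ** (-bound)
perplexity = 2 ** (-per_word_bound)

# Round for display; the log prints one decimal place
perplexity_rounded = round(perplexity)
```

A lower perplexity means the model is less "surprised" by the held-out documents, though that is not the same as topics that look coherent to a human reader.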
Finally, read in the trained gensim LDA model to termite, creating all the necessary data structures for the visualisations to work. This computes, among other things, term collocations ($N^2$), so it's going to take a while to run, especially for large vocabularies.
If you set all the paths consistently during the previous steps, this should just work. If not, it's likely there will be some FileNotFound errors.
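To give a feel for why the collocation step is slow, the sketch below counts term pairs that share a document and shows how the number of possible pairs grows with vocabulary size. It is only an illustration of the quadratic growth, written with hypothetical helper names; it is not the algorithm termite itself uses.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for every unordered term pair, the number of documents containing both."""
    counts = Counter()
    for doc in documents:
        for pair in combinations(sorted(set(doc)), 2):
            counts[pair] += 1
    return counts


def n_pairs(vocabulary_size):
    """Number of distinct unordered term pairs for a vocabulary of the given size."""
    return vocabulary_size * (vocabulary_size - 1) // 2


toy_counts = cooccurrence_counts([['cat', 'mat'], ['cat', 'mat', 'sat']])
pairs_10k = n_pairs(10000)
```

A vocabulary of ten thousand terms already allows for roughly fifty million pairs, so trimming the vocabulary before this step pays off quickly.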
%capture
!cd termite; cp -r $termite_server_root/tools ./; $py27 /home/matti/termite-data-server/bin/read_gensim.py --overwrite\
--sentence-split /home/matti/termite-data-server/utils/corenlp/SentenceSplitter.jar\
gensim_termite ./models/ ./corpus ./db
WARNING:root:Unable to import plural rules: No module named plural_rules
--------------------------------------------------------------------------------
Import a gensim LDA topic model as a web2py application...
app_name = gensim_termite
app_path = apps/gensim_termite
model_path = ./models/
corpus_filename = ./corpus/corpus.txt
database_filename = ./db/corpus.db
--------------------------------------------------------------------------------
Creating app: gensim_termite [apps/temp_20170623_234907_893474_8228]
Creating folder: [apps/temp_20170623_234907_893474_8228/data]
Creating folder: [apps/temp_20170623_234907_893474_8228/databases]
Linking folder: [apps/temp_20170623_234907_893474_8228/models]
Linking folder: [apps/temp_20170623_234907_893474_8228/views]
Linking folder: [apps/temp_20170623_234907_893474_8228/controllers]
Linking folder: [apps/temp_20170623_234907_893474_8228/static]
Linking folder: [apps/temp_20170623_234907_893474_8228/modules]
Creating file: [apps/temp_20170623_234907_893474_8228/__init__.py]
Copying [./db/corpus.db] --> [apps/temp_20170623_234907_893474_8228/databases/corpus.db]
Copying [./corpus/corpus.txt] --> [apps/temp_20170623_234907_893474_8228/data/corpus.txt]
Extracting [./corpus/corpus.txt] --> [apps/temp_20170623_234907_893474_8228/data/sentences.txt]
Preparing the Stanford CoreNLP pipeline...
Adding annotator tokenize
Adding annotator ssplit
Processing corpus: [./corpus/corpus.txt] -> [apps/temp_20170623_234907_893474_8228/data/sentences.txt]
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ₨ (U+20A8, decimal: 8360)
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ‒ (U+2012, decimal: 8210)
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83E, decimal: 55358)
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: (U+9C, decimal: 156)
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: (U+9F, decimal: 159)
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: (U+FFFC, decimal: 65532)
Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: (U+99, decimal: 153)
edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9C, decimal: 156) DEBUG:termite:WARNING: Untokenizable: (U+9C, decimal: 156) Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9C, decimal: 156) DEBUG:termite:WARNING: Untokenizable: (U+9C, decimal: 156) Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:08 PM 
edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9C, decimal: 156) DEBUG:termite:WARNING: Untokenizable: (U+9C, decimal: 156) Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9C, decimal: 156) DEBUG:termite:WARNING: Untokenizable: (U+9C, decimal: 156) Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:08 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9C, decimal: 156) DEBUG:termite:WARNING: Untokenizable: (U+9C, decimal: 156) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83C, decimal: 55356) DEBUG:termite:WARNING: Untokenizable: ? 
(U+D83C, decimal: 55356) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83C, decimal: 55356) DEBUG:termite:WARNING: Untokenizable: ? (U+D83C, decimal: 55356) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? 
(U+D83D, decimal: 55357) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ⁉ (U+2049, decimal: 8265) DEBUG:termite:WARNING: Untokenizable: ⁉ (U+2049, decimal: 8265) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 
157) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83C, decimal: 55356) DEBUG:termite:WARNING: Untokenizable: ? (U+D83C, decimal: 55356) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+81, decimal: 129) DEBUG:termite:WARNING: Untokenizable: (U+81, decimal: 129) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83C, decimal: 55356) DEBUG:termite:WARNING: Untokenizable: ? 
(U+D83C, decimal: 55356) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83E, decimal: 55358) DEBUG:termite:WARNING: Untokenizable: ? (U+D83E, decimal: 55358) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9C, decimal: 156) DEBUG:termite:WARNING: Untokenizable: (U+9C, decimal: 156) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83C, decimal: 55356) DEBUG:termite:WARNING: Untokenizable: ? 
(U+D83C, decimal: 55356) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9C, decimal: 156) DEBUG:termite:WARNING: Untokenizable: (U+9C, decimal: 156) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9C, decimal: 156) DEBUG:termite:WARNING: Untokenizable: (U+9C, decimal: 156) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9F, decimal: 159) DEBUG:termite:WARNING: Untokenizable: (U+9F, decimal: 159) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9F, decimal: 159) DEBUG:termite:WARNING: Untokenizable: (U+9F, decimal: 159) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9F, decimal: 159) DEBUG:termite:WARNING: Untokenizable: (U+9F, decimal: 159) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ️ (U+FE0F, decimal: 65039) DEBUG:termite:WARNING: Untokenizable: ️ (U+FE0F, decimal: 
65039) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+95, decimal: 149) DEBUG:termite:WARNING: Untokenizable: (U+95, decimal: 149) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83C, decimal: 55356) DEBUG:termite:WARNING: Untokenizable: ? (U+D83C, decimal: 55356) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:09 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? 
(U+D83D, decimal: 55357) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9C, decimal: 156) DEBUG:termite:WARNING: Untokenizable: (U+9C, decimal: 156) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9F, decimal: 159) DEBUG:termite:WARNING: Untokenizable: (U+9F, decimal: 159) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+83, decimal: 131) DEBUG:termite:WARNING: Untokenizable: (U+83, decimal: 131) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+99, decimal: 153) DEBUG:termite:WARNING: Untokenizable: (U+99, decimal: 153) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83C, decimal: 55356) DEBUG:termite:WARNING: Untokenizable: ? 
(U+D83C, decimal: 55356) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+200C, decimal: 8204) DEBUG:termite:WARNING: Untokenizable: (U+200C, decimal: 8204) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+F4A9, decimal: 62633) DEBUG:termite:WARNING: Untokenizable: (U+F4A9, decimal: 62633) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+83, decimal: 131) DEBUG:termite:WARNING: Untokenizable: (U+83, decimal: 131) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9F, decimal: 159) DEBUG:termite:WARNING: Untokenizable: (U+9F, decimal: 159) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+83, decimal: 131) DEBUG:termite:WARNING: Untokenizable: (U+83, decimal: 131) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+8D, decimal: 141) DEBUG:termite:WARNING: Untokenizable: (U+8D, decimal: 141) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9A, decimal: 154) DEBUG:termite:WARNING: Untokenizable: (U+9A, decimal: 154) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? 
(U+D83D, decimal: 55357) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ‒ (U+2012, decimal: 8210) DEBUG:termite:WARNING: Untokenizable: ‒ (U+2012, decimal: 8210) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ‒ (U+2012, decimal: 8210) DEBUG:termite:WARNING: Untokenizable: ‒ (U+2012, decimal: 8210) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+81, decimal: 129) DEBUG:termite:WARNING: Untokenizable: (U+81, decimal: 129) Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:10 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? 
(U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+F0B7, decimal: 61623) DEBUG:termite:WARNING: Untokenizable: (U+F0B7, decimal: 61623) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ‒ (U+2012, decimal: 8210) DEBUG:termite:WARNING: Untokenizable: ‒ (U+2012, decimal: 8210) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, 
decimal: 157) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: (U+9D, decimal: 157) DEBUG:termite:WARNING: Untokenizable: (U+9D, decimal: 157) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? 
(U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83C, decimal: 55356) DEBUG:termite:WARNING: Untokenizable: ? 
(U+D83C, decimal: 55356) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next DEBUG:termite:Jun 23, 2017 11:49:11 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83C, decimal: 55356) DEBUG:termite:WARNING: Untokenizable: ? 
(U+D83D, decimal: 55357)
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
DEBUG:termite:Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
DEBUG:termite:Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
DEBUG:termite:Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
DEBUG:termite:Jun 23, 2017 11:49:12 PM edu.stanford.nlp.process.PTBLexer next
DEBUG:termite:WARNING: Untokenizable: ? (U+D83C, decimal: 55356)
INFO:termite:Copying [./models/] --> [apps/temp_20170623_234907_893474_8228/data/gensim-lda]
INFO:termite:Computing bag-of-words statistics
INFO:termite: token_regex = \w{3,}
INFO:termite: min_freq = 5
INFO:termite: min_doc_freq = 3
INFO:termite: max_freq_count = 4000
INFO:termite: max_co_freq_count = 100000
INFO:termite:Computing document-level statistics...
DEBUG:termite: Loading corpus: apps/temp_20170623_234907_893474_8228/data/corpus.txt
INFO:termite: Computing term freqs (12357 docs)...
INFO:termite: Computing term co-occurrences (12357 docs)...
DEBUG:termite: Saving term_texts (4000 terms)...
DEBUG:termite: inserting 4000 rows...
DEBUG:termite: Saving term_freqs (4000 terms)...
DEBUG:termite: inserting 4000 rows...
DEBUG:termite: Saving term_probs (4000 terms)...
DEBUG:termite: inserting 4000 rows...
DEBUG:termite: Saving term_doc_freqs (4000 terms)...
DEBUG:termite: inserting 4000 rows...
DEBUG:termite: Saving term_co_freqs (100000 term pairs)...
DEBUG:termite: inserting 100000 rows...
DEBUG:termite: Saving term_co_probs (100000 term pairs)...
DEBUG:termite: inserting 100000 rows...
DEBUG:termite: Saving term_g2 (100000 term pairs)...
DEBUG:termite: inserting 100000 rows...
INFO:termite:Computing sentence-level term statistics...
DEBUG:termite: Loading corpus: apps/temp_20170623_234907_893474_8228/data/sentences.txt
INFO:termite: Computing term freqs (369703 docs)...
INFO:termite: Computing term co-occurrences (369703 docs)...
DEBUG:termite: Saving sentences_co_freqs (100000 term pairs)...
DEBUG:termite: inserting 100000 rows...
DEBUG:termite: Saving sentences_co_probs (100000 term pairs)...
DEBUG:termite: inserting 100000 rows...
DEBUG:termite: Saving sentences_g2 (100000 term pairs)...
DEBUG:termite: inserting 100000 rows...
('-->', 'apps/temp_20170623_234907_893474_8228/data/gensim-lda/corpus.gensim')
INFO:termite:Reading gensim LDA output...
DEBUG:termite: Loading dictionary: apps/temp_20170623_234907_893474_8228/data/gensim-lda/dictionary.gensim
DEBUG:termite: Loading corpus: apps/temp_20170623_234907_893474_8228/data/gensim-lda/corpus.gensim
('||-->', 'apps/temp_20170623_234907_893474_8228/data/gensim-lda/corpus.gensim')
<class 'modellers.GensimLDA.GensimTermiteCorpusReader'>
DEBUG:termite: Loading model: apps/temp_20170623_234907_893474_8228/data/gensim-lda/lda.gensim
INFO:termite:Writing to database...
DEBUG:termite: Saving term_topic_matrix...
DEBUG:termite: Saving doc_topic_matrix...
DEBUG:termite: Retrieving terms, documents, and topics...
DEBUG:termite: Retrieving top terms and top documents...
DEBUG:termite: Saving terms...
DEBUG:termite: Saving docs...
DEBUG:termite: Saving topics...
INFO:termite:Computing derived LDA topic model statistics...
INFO:termite: max_co_topic_count = 10000
DEBUG:termite: Loading doc_topic_matrix...
DEBUG:termite: Computing topic cooccurrences...
DEBUG:termite: Computing topic covariance...
DEBUG:termite: Saving topic_covariance...
DEBUG:termite: inserting 400 rows...
INFO:termite:Moving app into place: gensim_termite [apps/temp_20170623_234907_893474_8228] -> [apps/gensim_termite]
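The bag-of-words settings in the log (token_regex, min_freq, min_doc_freq, max_freq_count) decide which terms end up in the 4000-term vocabulary. The sketch below shows one plausible reading of those settings on a toy corpus. It is not Termite's actual code: the function name build_vocabulary, the lowercasing step, the tie-breaking rule and the toy documents are all assumptions made for illustration.

```python
import re
from collections import Counter


def build_vocabulary(docs, token_regex=r"\w{3,}", min_freq=5,
                     min_doc_freq=3, max_freq_count=4000):
    """Hypothetical helper: filter terms the way the logged settings suggest."""
    pattern = re.compile(token_regex)
    term_freq = Counter()  # total occurrences of each term
    doc_freq = Counter()   # number of documents containing each term
    for doc in docs:
        # Lowercasing is an assumption; the log does not say whether Termite does it
        tokens = pattern.findall(doc.lower())
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    # Keep terms that occur often enough overall and in enough documents
    kept = [t for t in term_freq
            if term_freq[t] >= min_freq and doc_freq[t] >= min_doc_freq]
    # Keep at most max_freq_count terms, most frequent first, ties alphabetical
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:max_freq_count]


toy_docs = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "A cat and a dog are pets",
]
vocab = build_vocabulary(toy_docs, min_freq=2, min_doc_freq=2, max_freq_count=3)
print(vocab)
```

With the regex \w{3,} short tokens such as "on" and "a" never reach the counting step, which is why they cannot appear in the vocabulary regardless of the frequency thresholds.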
To start the server and see the visualisations, run:
!$py27 $termite_server_root/web2py/web2py.py
web2py Web Framework
Created by Massimo Di Pierro, Copyright 2007-2017
Version 2.9.5-stable+timestamp.2014.03.16.02.35.39
Database drivers available: SQLite(sqlite3), MySQL(pymysql), PostgreSQL(pg8000), MSSQL(pyodbc), DB2(pyodbc), Teradata(pyodbc), Ingres(pyodbc), IMAP(imaplib)
Some of the work from Termite has been integrated into pyLDAvis, which is being maintained and has good interoperability with gensim. Below is an interactive visualisation of the fake news model trained earlier. Just to see how informative the visualisation is overall, I'll train another model on the same dataset but increase the number of topics quite a lot.
For a good description of what you see in the visualisation, you can look at the presentation from the creator himself.
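The term bars in the pyLDAvis panel are ordered by the relevance measure from Sievert and Shirley's LDAvis work, which blends a term's probability within a topic with its lift over the corpus-wide probability, weighted by the slider value λ: relevance(w, t | λ) = λ log p(w|t) + (1 - λ) log(p(w|t) / p(w)). A minimal sketch of the effect of λ follows. The function name relevance and all probabilities are made up for illustration and are not taken from the fake news model.

```python
import math


def relevance(p_w_given_t, p_w, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)


# Made-up probabilities: (p(term | topic), p(term) in the whole corpus)
terms = {
    "said":     (0.050, 0.040),   # frequent everywhere
    "election": (0.030, 0.005),   # fairly frequent and topic-specific
    "ballot":   (0.004, 0.0005),  # rare but very topic-specific
}


def rank(lam):
    """Order the toy terms from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam=lam), reverse=True)


for lam in (1.0, 0.6, 0.0):
    print(lam, rank(lam))
```

At λ = 1 the ranking is by raw within-topic probability, so generic words float to the top; at λ = 0 it is by lift alone, which favours rare but topic-specific words. Intermediate values trade the two off.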
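from gensim.models import LdaModel
import pyLDAvis as ldavis
import pyLDAvis.gensim  # newer pyLDAvis releases name this module pyLDAvis.gensim_models
# Load the fake news model trained earlier (the import must come before this call)
lda_fake = LdaModel.load('./fake_news_35.lda')
ldavis.enable_notebook()
# corpus_fake and vocabulary are defined earlier in the notebook
prepared_data = ldavis.gensim.prepare(lda_fake, corpus_fake, vocabulary)
# Save the prepared visualisation data as JSON
with open('./fake_news_35.lda-LDAVIS.json', 'w') as fh:
    fh.write(prepared_data.to_json())
prepared_data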
/home/matti/miniconda3/envs/pydatabln17/lib/python3.6/site-packages/pyLDAvis/_prepare.py:387: DeprecationWarning: .ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix topic_term_dists = topic_term_dists.ix[topic_order]