Note: This is best viewed on NBViewer. It is part of a series on Dirichlet Processes and Nonparametric Bayes.
In 2003, a groundbreaking statistical model called "Latent Dirichlet Allocation" was presented by David Blei, Andrew Ng, and Michael Jordan.
LDA provides a method for summarizing the topics discussed in a document. LDA defines topics to be discrete probability distrbutions over words. For an introduction to LDA, see Edwin Chen's post.
The original LDA model requires the number of topics in the document to be specfied as a known parameter of the model. In 2005, Yee Whye Teh and others published a "nonparametric" version of this model that doesn't require the number of topics to be specified. This model uses a prior distribution over the topics called a hierarchical Dirichlet process. I wrote an introduction to this HDP-LDA model earlier this year.
For the last six months, I have been developing a Python-based Gibbs sampler for the HDP-LDA model. This is part of a larger library of "robust, validated Bayesian nonparametric models for discovering structure in data" known as Data Microscopes.
This notebook demonstrates the functionality of this implementation.
The Data Microscopes library is available on anaconda.org for Linux and OS X. microscopes-lda
can be installed with:
$ conda install -c datamicroscopes -c distributions microscopes-lda
%matplotlib inline
import pyLDAvis
import json
import sys
import cPickle
from microscopes.common.rng import rng
from microscopes.lda.definition import model_definition
from microscopes.lda.model import initialize
from microscopes.lda import utils
from microscopes.lda import model, runner
from numpy import genfromtxt
from numpy import linalg
from numpy import array
dtm.csv
contains a document-term matrix representation of the words used in Econtalk transcripts. The columns of the matrix correspond to the words in vocab.txt
. The rows in the matrix correspond to the show urls in urls.txt
.
Our LDA implementation takes input data as a list of lists of hashable objects (typically words). We can use a utility function to convert the document-term matrix to the list of tokenized documents.
vocab = genfromtxt('./econtalk-data/vocab.txt', delimiter=",", dtype='str').tolist()
dtm = genfromtxt('./econtalk-data/dtm.csv', delimiter=",", dtype=int)
docs = utils.docs_from_document_term_matrix(dtm, vocab=vocab)
urls = [s.strip() for s in open('./econtalk-data/urls.txt').readlines()]
dtm.shape[1] == len(vocab)
True
dtm.shape[0] == len(urls)
True
Here's a utility method to get the title of a webpage that we'll use later.
def get_title(url):
"""Scrape webpage title
"""
import lxml.html
t = lxml.html.parse(url)
return t.find(".//title").text.split("|")[0].strip()
Let's set up our model. First we created a model definition describing the basic structure of our data. Next we initialize an MCMC state object using the model definition, documents, random number generator, and hyper-parameters.
N, V = len(docs), len(vocab)
defn = model_definition(N, V)
prng = rng(12345)
state = initialize(defn, docs, prng,
vocab_hp=1,
dish_hps={"alpha": .6, "gamma": 2})
r = runner.runner(defn, docs, state, )
When we first create a state object, the words are randomly assigned to topics. Thus, our perplexity (model score) is quite high. After we start to run the MCMC, the score will drop quickly.
print "randomly initialized model:"
print " number of documents", defn.n
print " vocabulary size", defn.v
print " perplexity:", state.perplexity(), "num topics:", state.ntopics()
randomly initialized model: number of documents 454 vocabulary size 16445 perplexity: 16523.1820356 num topics: 9
Run one iteration of the MCMC to make sure everything is working.
%%time
r.run(prng, 1)
CPU times: user 12.3 s, sys: 128 ms, total: 12.4 s Wall time: 12.6 s
Now lets run 1000 generations of the MCMC.
Unfortunately, MCMC is slow going.
%%time
r.run(prng, 500)
CPU times: user 2h 9min 8s, sys: 26.8 s, total: 2h 9min 35s Wall time: 2h 10min 12s
with open('./econtalk-data/2015-10-07-state.pkl', 'w') as f:
cPickle.dump(state, f)
%%time
r.run(prng, 500)
CPU times: user 2h 17min 35s, sys: 30 s, total: 2h 18min 6s Wall time: 2h 18min 50s
with open('./econtalk-data/2015-10-07-state.pkl', 'w') as f:
cPickle.dump(state, f)
Now that we've run the MCMC, the perplexity has dropped significantly.
print "after 1000 iterations:"
print " perplexity:", state.perplexity(), "num topics:", state.ntopics()
after 1000 iterations: perplexity: 2363.65138771 num topics: 11
pyLDAvis projects the topics into two dimensions using techniques described by Carson Sievert.
vis = pyLDAvis.prepare(**state.pyldavis_data())
pyLDAvis.display(vis)
We can extract the term relevance (shown in the right hand side of the visualization) right from our state object. Here are the 10 most relevant words for each topic:
relevance = state.term_relevance_by_topic()
for i, topic in enumerate(relevance):
print "topic", i, ":",
for term, _ in topic[:10]:
print term,
print
topic 0 : banks fed bank money financial monetary debt inflation crisis rates topic 1 : party republicans constitution vote democrats republican tax election president stalin topic 2 : fat science diet eat insulin disease immune replication scientific eating topic 3 : growth trade water cities china city development climate inequality oil topic 4 : people think don just going like say things lot way topic 5 : smith hayek moral economics society adam liberty coase theory rules topic 6 : bitcoin internet software google technology store bitcoins computer machines company topic 7 : prison health drug care drugs medicaid medical patients patient women topic 8 : schools teachers school kids teacher education students parents teaching sports topic 9 : bees honey pollination colony ants bee queen cheung ant colonies topic 10 : museum museums art gallery galleries monet seating trustees admission director
We could assign titles to each of these topics. For example, Topic 5 appears to be about the foundations of classical liberalism. Topic 6 is obviously Bitcoin and Software. Topic 0 is the financial system and monetary policy. Topic 4 seems to be generic words used in most episodes; unfortunately, the prevalence of "don" is a result of my preprocessing which splits up the contraction "don't".
We can also get the topic distributions for each document.
topic_distributions = state.topic_distribution_by_document()
Topic 5 appears to be about the theory of classical liberalism. Let's find the 20 episodes which have the highest proportion of words from that topic.
austrian_topic = 5
foundations_episodes = sorted([(dist[austrian_topic], url) for url, dist in zip(urls, topic_distributions)], reverse=True)
for url in [url for _, url in foundations_episodes][:20]:
print get_title(url), url
The Economics of Organ Donations http://www.econtalk.org/archives/2006/06/the_economics_o_4.html Klein on The Theory of Moral Sentiments, Episode 2--A Discussion of Part I http://www.econtalk.org/archives/2009/04/klein_on_the_th_1.html Boudreaux on Law and Legislation http://www.econtalk.org/archives/2006/12/boudreaux_on_la.html Klein on The Theory of Moral Sentiments, Episode 4--A Discussion of Part III http://www.econtalk.org/archives/2009/04/klein_on_the_th_3.html Klein on The Theory of Moral Sentiments, Episode 5--A Discussion of Parts III (cont.), IV, and V http://www.econtalk.org/archives/2009/05/klein_on_the_th_4.html Klein on The Theory of Moral Sentiments, Episode 3--A Discussion of Part II http://www.econtalk.org/archives/2009/04/klein_on_the_th_2.html Klein on The Theory of Moral Sentiments, Episode 1--An Overview http://www.econtalk.org/archives/2009/04/klein_on_the_th.html Klein on The Theory of Moral Sentiments, Episode 6--A Discussion of Parts VI and VII, and Summary http://www.econtalk.org/archives/2009/05/klein_on_the_th_5.html Wolfe on Liberalism http://www.econtalk.org/archives/2009/05/wolfe_on_libera.html Boettke on Katrina and the Economics of Disaster http://www.econtalk.org/archives/2006/12/boettke_on_katr.html Richard Thaler on Libertarian Paternalism http://www.econtalk.org/archives/2006/11/richard_thaler_1.html Marglin on Markets and Community http://www.econtalk.org/archives/2008/03/marglin_on_mark.html Dan Klein on Coordination and Cooperation http://www.econtalk.org/archives/2008/02/dan_klein_on_co.html Leeson on Pirates and the Invisible Hook http://www.econtalk.org/archives/2009/05/leeson_on_pirat.html Vernon Smith on Markets and Experimental Economics http://www.econtalk.org/archives/2007/05/vernon_smith_on.html Boettke on Elinor Ostrom, Vincent Ostrom, and the Bloomington School http://www.econtalk.org/archives/2009/11/boettke_on_elin.html Postrel on Style http://www.econtalk.org/archives/2006/11/postrel_on_styl.html The Economics of Religion http://www.econtalk.org/archives/2006/10/the_economics_o_7.html Boettke on Mises http://www.econtalk.org/archives/2010/12/boettke_on_mise.html The Economics of Medical Malpractice http://www.econtalk.org/archives/2006/05/the_economics_o_3.html
We could also find the episodes that have notable discussion of both politics AND the financial system.
topic_a = 0
topic_b = 1
joint_episodes = [url for url, dist in zip(urls, topic_distributions) if dist[0] > 0.18 and dist[1] > 0.18]
for url in joint_episodes:
print get_title(url), url
Benn Steil on the Battle of Bretton Woods http://www.econtalk.org/archives/2015/02/benn_steil_on_t.html Epstein on the Rule of Law http://www.econtalk.org/archives/2009/06/epstein_on_the.html Shlaes on the Great Depression http://www.econtalk.org/archives/2007/06/shlaes_on_the_g.html Rabushka on the Flat Tax http://www.econtalk.org/archives/2007/04/rabushka_on_the.html Greg Mankiw on Gasoline Taxes, Keynes and Macroeconomics http://www.econtalk.org/archives/2007/01/greg_mankiw_on.html
We can look at the topic distributions as projections of the documents into a much lower dimension (16). We can try to find shows that are similar by comparing the topic distributions of the documents.
def find_similar(url, max_distance=0.2):
"""Find episodes most similar to input url.
"""
index = urls.index(url)
for td, url in zip(topic_distributions, urls):
if linalg.norm(array(topic_distributions[index]) - array(td)) < max_distance:
print get_title(url), url
Which Econtalk episodes are most similar, in content, to "Mike Munger on the Division of Labor"?
find_similar('http://www.econtalk.org/archives/2007/04/mike_munger_on.html')
Brink Lindsey on the Age of Abundance http://www.econtalk.org/archives/2009/03/brink_lindsey_o.html Richard Epstein on Happiness, Inequality, and Envy http://www.econtalk.org/archives/2008/11/richard_epstein.html Sowell on Economic Facts and Fallacies http://www.econtalk.org/archives/2008/02/sowell_on_econo.html Yandle on the Tragedy of the Commons and the Implications for Environmental Regulation http://www.econtalk.org/archives/2007/10/yandle_on_the_t.html McCraw on Schumpeter, Innovation, and Creative Destruction http://www.econtalk.org/archives/2007/10/mccraw_on_schum.html Boudreaux on Market Failure, Government Failure and the Economics of Antitrust Regulation http://www.econtalk.org/archives/2007/10/boudreaux_on_ma.html Mike Munger on the Division of Labor http://www.econtalk.org/archives/2007/04/mike_munger_on.html
How about episodes similar to "Kling on Freddie and Fannie and the Recent History of the U.S. Housing Market"?
find_similar('http://www.econtalk.org/archives/2008/09/kling_on_freddi.html')
Irwin on the Great Depression and the Gold Standard http://www.econtalk.org/archives/2010/10/irwin_on_the_gr.html Rustici on Smoot-Hawley and the Great Depression http://www.econtalk.org/archives/2010/01/rustici_on_smoo.html Reinhart on Financial Crises http://www.econtalk.org/archives/2009/11/reinhart_on_fin.html Posner on the Financial Crisis http://www.econtalk.org/archives/2009/11/posner_on_the_f.html Sumner on Monetary Policy http://www.econtalk.org/archives/2009/11/sumner_on_monet.html Calomiris on the Financial Crisis http://www.econtalk.org/archives/2009/10/calomiris_on_th.html Gary Stern on Too Big to Fail http://www.econtalk.org/archives/2009/10/gary_stern_on_t.html Cohan on the Life and Death of Bear Stearns http://www.econtalk.org/archives/2009/09/cohan_on_the_li.html John Taylor on the Financial Crisis http://www.econtalk.org/archives/2009/07/john_taylor_on_1.html Meltzer on Inflation http://www.econtalk.org/archives/2009/02/meltzer_on_infl.html Boettke on the Austrian Perspective on Business Cycles and Monetary Policy http://www.econtalk.org/archives/2009/01/boettke_on_the.html Selgin on Free Banking http://www.econtalk.org/archives/2008/11/selgin_on_free.html Kling on Credit Default Swaps, Counterparty Risk, and the Political Economy of Financial Regulation http://www.econtalk.org/archives/2008/11/kling_on_credit.html Kling on Freddie and Fannie and the Recent History of the U.S. Housing Market http://www.econtalk.org/archives/2008/09/kling_on_freddi.html John Taylor on Monetary Policy http://www.econtalk.org/archives/2008/08/john_taylor_on.html Gene Epstein on Gold, the Fed, and Money http://www.econtalk.org/archives/2008/06/gene_epstein_on.html Meltzer on the Fed, Money, and Gold http://www.econtalk.org/archives/2008/05/meltzer_on_the.html Cowen on Monetary Policy http://www.econtalk.org/archives/2008/03/cowen_on_moneta.html
The model also gives us distributions over words for each topic.
word_dists = state.word_distribution_by_topic()
We can use this to find the topics a word is most likely to occur in.
def bars(x, scale_factor=10000):
return int(x * scale_factor) * "="
def topics_related_to_word(word, n=10):
for wd, rel in zip(word_dists, relevance):
score = wd[word]
rel_words = ' '.join([w for w, _ in rel][:n])
if bars(score):
print bars(score), rel_words
What topics are most likely to contain the word "Munger" (as in Mike Munger). The number of equal signs indicates the probability the word is generated by the topic. If a topic isn't shown, it's extremely unlikley to generate the word.
topics_related_to_word('munger')
==== growth trade water cities china city development climate inequality oil ================== smith hayek moral economics society adam liberty coase theory rules === bitcoin internet software google technology store bitcoins computer machines company
Where does Munger come up? In discussing the moral foundations of classical liberalism and microeconomics!
How about the word "lovely"? Russ Roberts uses it often when talking about the Theory of Moral Sentiments. It looks like it also comes up when talking about schools.
topics_related_to_word('lovely')
= smith hayek moral economics society adam liberty coase theory rules
If you have feedback on this implementation of HDP-LDA, you can reach me on Twitter or open an issue on Github.