pyLDAvis

pyLDAvis is a Python library for interactive topic model visualization. It is a port of the fabulous R package by Carson Sievert and Kenny Shirley. They did the hard work of crafting an effective visualization. pyLDAvis makes it easy to use the visualization from Python and, in particular, IPython notebooks. To learn more about the method behind the visualization, I suggest reading the original paper explaining it.

This notebook provides a quick overview of how to use pyLDAvis. Refer to the documentation for details.

BYOM - Bring your own model

pyLDAvis is agnostic to how your model was trained. To visualize it, you need to provide the topic-term distributions, document-topic distributions, and basic information about the corpus on which the model was trained. The main function is prepare, which transforms your data into the format needed for the visualization.
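As a rough illustration of what those inputs look like, the sketch below builds a made-up toy model (2 topics, 3 terms, 2 documents) keyed by the same names used for prepare in this notebook, and runs a few sanity checks on it. The helper check_model_data and all of the numbers are hypothetical; they are not part of pyLDAvis.

```python
import math

# Toy inputs, invented purely for illustration.
toy_model_data = {
    # One row per topic, one column per vocabulary term; each row sums to 1.
    'topic_term_dists': [[0.7, 0.2, 0.1],
                         [0.1, 0.3, 0.6]],
    # One row per document, one column per topic; each row sums to 1.
    'doc_topic_dists': [[0.9, 0.1],
                        [0.2, 0.8]],
    # Number of tokens in each document.
    'doc_lengths': [10, 5],
    # Vocabulary, in the same order as the columns of topic_term_dists.
    'vocab': ['film', 'plot', 'actor'],
    # Corpus-wide count of each term.
    'term_frequency': [8, 4, 3],
}


def check_model_data(data, tol=1e-6):
    """Hypothetical helper: check shapes and row sums, return (topics, terms, docs)."""
    n_topics = len(data['topic_term_dists'])
    n_terms = len(data['vocab'])
    n_docs = len(data['doc_lengths'])

    if any(len(row) != n_terms for row in data['topic_term_dists']):
        raise ValueError('topic_term_dists must have one column per vocab term')
    if len(data['doc_topic_dists']) != n_docs:
        raise ValueError('doc_topic_dists must have one row per document')
    if any(len(row) != n_topics for row in data['doc_topic_dists']):
        raise ValueError('doc_topic_dists must have one column per topic')
    if len(data['term_frequency']) != n_terms:
        raise ValueError('term_frequency must have one entry per vocab term')

    # Every distribution (row) should sum to 1 within a small tolerance.
    for row in data['topic_term_dists'] + data['doc_topic_dists']:
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            raise ValueError('each distribution must sum to 1')

    return n_topics, n_terms, n_docs


print(check_model_data(toy_model_data))
```

A dict with these five keys is what gets unpacked into prepare later in this notebook; a model this small is only meant to show the shapes, not to produce a meaningful plot.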

Below we load a model trained in R and then visualize it. The model was trained on a corpus of 2000 movie reviews parsed by Pang and Lee (ACL, 2004), originally gathered from the IMDB archive of the rec.arts.movies.reviews newsgroup.

In [15]:

import json
import numpy as np

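def load_R_model(filename):
    """Load LDAvis input exported from R as JSON and rename its keys for prepare."""
    with open(filename, 'r') as j:
        data_input = json.load(j)
    # Map the R-style names (phi, theta, ...) to the keyword arguments of prepare.
    data = {'topic_term_dists': data_input['phi'],
            'doc_topic_dists': data_input['theta'],
            'doc_lengths': data_input['doc.length'],
            'vocab': data_input['vocab'],
            'term_frequency': data_input['term.frequency']}
    return data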

movies_model_data = load_R_model('data/movie_reviews_input.json')

print('Topic-Term shape: %s' % str(np.array(movies_model_data['topic_term_dists']).shape))
print('Doc-Topic shape: %s' % str(np.array(movies_model_data['doc_topic_dists']).shape))

Topic-Term shape: (20, 14567)
Doc-Topic shape: (2000, 20)
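For reference, the JSON layout can be inferred from the keys load_R_model reads and from the shapes printed above: phi is topics by terms and theta is documents by topics. The following is a minimal sketch under that assumption; the file name, the values, and the names toy_export, R_TO_PREPARE and load_toy_model are all made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical miniature of the exported file. The key names are the ones
# load_R_model reads; the values are invented.
toy_export = {
    'phi': [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],  # topics x terms
    'theta': [[0.9, 0.1], [0.2, 0.8]],          # documents x topics
    'doc.length': [10, 5],
    'vocab': ['film', 'plot', 'actor'],
    'term.frequency': [8, 4, 3],
}

# R-style key name -> keyword argument name used with prepare.
R_TO_PREPARE = {
    'phi': 'topic_term_dists',
    'theta': 'doc_topic_dists',
    'doc.length': 'doc_lengths',
    'vocab': 'vocab',
    'term.frequency': 'term_frequency',
}


def load_toy_model(filename):
    """Read the JSON file and rename its keys, mirroring load_R_model."""
    with open(filename, 'r') as f:
        raw = json.load(f)
    return {new: raw[old] for old, new in R_TO_PREPARE.items()}


# Round trip through a temporary file to show the on-disk layout works.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_input.json')
    with open(path, 'w') as f:
        json.dump(toy_export, f)
    toy_model = load_toy_model(path)

print(sorted(toy_model))
```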

Now that we have the data loaded we use the prepare function:

In [16]:

import pyLDAvis
movies_vis_data = pyLDAvis.prepare(**movies_model_data)

Once you have the visualization data prepared, you can do a number of things with it. You can save the vis to a stand-alone HTML file, serve it, or display it in the notebook. Let's go ahead and display it:

In [17]:

pyLDAvis.display(movies_vis_data)

Out[17]:

Pretty, huh?! Again, you should be thanking the original LDAvis people for that. You may thank me for the IPython integration though. :)

To see other models visualized, check out this notebook.
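The other two options mentioned earlier, saving and serving, might look roughly like the sketch below. It wraps pyLDAvis.save_html and pyLDAvis.show in two small helpers; the helper names and the output file name are my own assumptions, and pyLDAvis is imported inside the functions only so that defining them does not require the package.

```python
def export_vis(vis_data, html_path='movies_vis.html'):
    """Write prepared visualization data to a stand-alone HTML file."""
    import pyLDAvis  # lazy import: only needed when the helper is called
    pyLDAvis.save_html(vis_data, html_path)
    return html_path


def serve_vis(vis_data):
    """Serve the visualization from a local web server (blocks while serving)."""
    import pyLDAvis  # lazy import: only needed when the helper is called
    pyLDAvis.show(vis_data)


# Usage in this notebook, after the prepare cell has run:
# export_vis(movies_vis_data)
# serve_vis(movies_vis_data)
```

The saved HTML file can be shared or opened directly in a browser, whereas serving is handy when working outside a notebook.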

ProTip: To avoid tediously typing display all the time, use:

In [18]:

pyLDAvis.enable_notebook()

Making the common case easy - Gensim and others!

Built on top of the generic prepare function are helper functions for gensim and GraphLab Create. To demonstrate, below I load a trained gensim model and the corresponding dictionary and corpus (see this notebook for how these were created):

In [19]:

import gensim

dictionary = gensim.corpora.Dictionary.load('newsgroups.dict')
corpus = gensim.corpora.MmCorpus('newsgroups.mm')
lda = gensim.models.ldamodel.LdaModel.load('newsgroups_50.model')

In the dark ages, all we had to inspect our topics was show_topics and friends:

In [20]:

lda.show_topics()

Out[20]:

[u'0.020*turks + 0.012*press + 0.010*south + 0.010*international + 0.009*san + 0.009*washington + 0.008*april + 0.008*conference + 0.008*may + 0.008*american',
 u"0.019*players + 0.015*article + 0.014*angeles + 0.014*los + 0.012*university + 0.010*nntp + 0.010*host + 0.010*he's + 0.010*posting + 0.010*alan",
 u'0.298*bike + 0.150*max + 0.068*cnn + 0.041*hst + 0.019*labels + 0.011*dane + 0.011*dilemma + 0.009*nhs + 0.008*lak + 0.008*otc',
 u'0.029*season + 0.028*soviet + 0.019*genocide + 0.013*zone + 0.012*closed + 0.012*beat + 0.011*shots + 0.011*aids + 0.011*article + 0.010*brian',
 u'0.031*drive + 0.019*dos + 0.018*windows + 0.017*disk + 0.013*hard + 0.012*system + 0.010*drives + 0.008*problem + 0.008*controller + 0.008*use',
 u'0.014*one + 0.011*power + 0.009*system + 0.009*secure + 0.008*problem + 0.006*waco + 0.006*light + 0.006*use + 0.006*gaza + 0.005*using',
 u'0.069*posting + 0.066*host + 0.064*nntp + 0.047*edu + 0.026*university + 0.017*article + 0.015*reply + 0.015*distribution + 0.012*usa + 0.011*please',
 u'0.022*president + 0.018*government + 0.015*clinton + 0.013*white + 0.011*house + 0.011*security + 0.010*secret + 0.010*clipper + 0.009*david + 0.009*encryption',
 u'0.033*msg + 0.024*russia + 0.023*detroit + 0.016*patrick + 0.015*adams + 0.013*rangers + 0.013*coach + 0.012*new + 0.012*team + 0.011*racist',
 u'0.029*file + 0.029*output + 0.021*apr + 0.016*gmt + 0.014*program + 0.013*input + 0.012*cancer + 0.012*line + 0.011*entry + 0.011*int']

Thankfully, in addition to these still helpful functions, we can get a feel for all of the topics with this one-liner:

In [21]:

# Note: in newer pyLDAvis releases (3.x) this module is named pyLDAvis.gensim_models.
import pyLDAvis.gensim

pyLDAvis.gensim.prepare(lda, corpus, dictionary)

Out[21]:
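To give a feel for the kind of corpus information such a helper has to derive before calling the generic prepare, here is a rough sketch that computes document lengths and term frequencies from a bag-of-words corpus in the (term_id, count) format gensim uses. This is not the actual pyLDAvis.gensim implementation; corpus_stats and the toy data are hypothetical.

```python
from collections import Counter

# Toy bag-of-words corpus: each document is a list of (term_id, count) pairs.
toy_corpus = [
    [(0, 2), (1, 1)],  # document 0: term 0 twice, term 1 once
    [(1, 3), (2, 2)],  # document 1
    [(0, 1), (2, 4)],  # document 2
]
toy_vocab = ['drive', 'disk', 'windows']


def corpus_stats(corpus, n_terms):
    """Return (doc_lengths, term_frequency) for a bag-of-words corpus."""
    # Length of a document = total number of tokens it contains.
    doc_lengths = [sum(count for _, count in doc) for doc in corpus]

    # Corpus-wide count of every term id.
    totals = Counter()
    for doc in corpus:
        for term_id, count in doc:
            totals[term_id] += count

    # Order the counts by term id so they line up with the vocabulary.
    term_frequency = [totals[i] for i in range(n_terms)]
    return doc_lengths, term_frequency


doc_lengths, term_frequency = corpus_stats(toy_corpus, len(toy_vocab))
print(doc_lengths)     # [3, 5, 5]
print(term_frequency)  # [3, 4, 6]
```

Together with the vocabulary and the model's topic-term and document-topic distributions, these are the inputs the generic prepare function expects.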

GraphLab

As I mentioned above, you can also easily visualize GraphLab TopicModels. Check out this notebook if you are interested in that.

Go forth and visualize!

What are you waiting for? Go ahead and pip install pyldavis.