`pyLDAvis.sklearn`

pyLDAvis now also supports LDA models from scikit-learn. Let's look at this in more detail, using the 20 newsgroups dataset provided by scikit-learn.

In [1]:

import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [2]:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

Load 20 newsgroups dataset

First, the 20 newsgroups dataset available in sklearn is loaded. As usual, the headers, footers and quotes are removed.

In [3]:

newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
docs_raw = newsgroups.data
print(len(docs_raw))

Convert to document-term matrix

Next, the raw documents are converted into a document-term matrix, either as raw counts or in TF-IDF form.

In [4]:

tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5, 
                                min_df = 10)
dtm_tf = tf_vectorizer.fit_transform(docs_raw)
print(dtm_tf.shape)

(11314, 9145)

In [5]:

tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
print(dtm_tfidf.shape)

(11314, 9145)
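To make the difference between the two matrices concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It builds a raw-count matrix and then reweights it with a smoothed IDF and L2 row normalisation, which is how scikit-learn's default TF-IDF settings are understood to work. The corpus and all variable names are illustrative assumptions, and the whitespace tokeniser is much cruder than the `token_pattern` used above.

```python
import math
from collections import Counter

# Hypothetical toy corpus, only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Naive whitespace tokenisation (the notebook uses a regex pattern instead).
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Raw counts: one row per document, one column per vocabulary term.
counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: in how many documents each term appears.
n_docs = len(docs)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are left as-is)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else row


# TF-IDF: reweight the counts by IDF, then normalise each document row.
tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

print(vocab)
print(counts[0])
print([round(x, 3) for x in tfidf[0]])
```

Both matrices have the same shape, which matches the identical `(11314, 9145)` shapes printed above; only the cell values differ.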

Fit Latent Dirichlet Allocation models

Finally, the LDA models are fitted.

In [6]:

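# Note: n_topics was renamed to n_components in later scikit-learn releases.
# for TF DTM
lda_tf = LatentDirichletAllocation(n_topics=20, random_state=0)
lda_tf.fit(dtm_tf)
# for TFIDF DTM (fit on the TF-IDF matrix, not the raw counts)
lda_tfidf = LatentDirichletAllocation(n_topics=20, random_state=0)
lda_tfidf.fit(dtm_tfidf)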

Out[6]:

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_jobs=1, n_topics=20, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

Visualizing the models with pyLDAvis

In [7]:

pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)

Out[7]:

In [8]:

pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)

Out[8]:
