We will be using the u_mass and c_v coherence measures for two different LDA models: a "good" and a "bad" LDA model. The good LDA model will be trained for 50 iterations and the bad one for just 1 iteration. In theory, the good LDA model should therefore be able to come up with better, more human-interpretable topics, so its coherence score should be higher (better) than that of the bad LDA model.
from __future__ import print_function
import os
import logging
import json
import warnings
try:
    import pyLDAvis.gensim
    from IPython.display import display
    pyLDAvis.enable_notebook()
    CAN_VISUALIZE = True
except ImportError:
    print("SKIP: please install pyLDAvis to see the visualizations")
    CAN_VISUALIZE = False
import numpy as np
from gensim.models import CoherenceModel, LdaModel, HdpModel
from gensim.models.wrappers import LdaVowpalWabbit, LdaMallet
from gensim.corpora import Dictionary
warnings.filterwarnings('ignore') # To ignore all warnings that arise here to enhance clarity
As stated in table 2 of this paper, this corpus essentially has two classes of documents: the first five are about human-computer interaction and the other four are about graphs. We will be setting up two LDA models, one trained for 50 iterations and the other for just 1. The one with 50 iterations (the "good" model) should be able to capture this underlying pattern of the corpus better than the "bad" LDA model, so in theory the topic coherence for the good LDA model should be greater than that for the bad one.
texts = [['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
We'll now set up the two LDA topic models: a good one and a bad one. To build the "good" topic model, we simply train it with more iterations than the bad one. The u_mass coherence should therefore be better for the good model, since it should produce more human-interpretable topics.
goodLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=50, num_topics=2)
badLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=1, num_topics=2)
goodcm = CoherenceModel(model=goodLdaModel, corpus=corpus, dictionary=dictionary, coherence='u_mass')
badcm = CoherenceModel(model=badLdaModel, corpus=corpus, dictionary=dictionary, coherence='u_mass')
Following are the pipeline parameters for u_mass coherence. By pipeline parameters, we mean the functions used to calculate segmentation, probability estimation, confirmation measure and aggregation, as shown in figure 1 of this paper.
print(goodcm)
Coherence_Measure(seg=<function s_one_pre at 0x7f6b0a12ed90>, prob=<function p_boolean_document at 0x7f6b0a12eea0>, conf=<function log_conditional_probability at 0x7f6b09c326a8>, aggr=<function arithmetic_mean at 0x7f6b09c32f28>)
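To make these four stages concrete, here is a toy, hand-rolled sketch of a u_mass-style computation on a tiny corpus (a simplified illustration, not gensim's exact implementation): segmentation pairs each topic word with every earlier word (s_one_pre), probabilities are estimated as boolean document frequencies, confirmation is a smoothed log conditional probability, and aggregation is the arithmetic mean.

```python
from math import log

# Tiny illustrative corpus: each document as a set of words
docs = [{'graph', 'trees'}, {'graph', 'minors', 'trees'}, {'graph', 'minors', 'survey'}]
topic = ['graph', 'trees', 'minors']

def p_doc(word):
    # probability estimation: fraction of documents containing the word
    return sum(word in d for d in docs) / len(docs)

def p_joint(w1, w2):
    # fraction of documents containing both words
    return sum(w1 in d and w2 in d for d in docs) / len(docs)

# segmentation (s_one_pre): pair each topic word with every earlier word
pairs = [(topic[i], topic[j]) for i in range(1, len(topic)) for j in range(i)]

# confirmation: smoothed log conditional probability log P(w_i, w_j) / P(w_j)
eps = 1e-12
scores = [log((p_joint(w1, w2) + eps) / p_doc(w2)) for w1, w2 in pairs]

# aggregation: arithmetic mean of the pair scores
coherence = sum(scores) / len(scores)
print(coherence)  # a single number summarizing topic quality
```

More co-occurrence between topic words in the corpus pushes this number toward zero; rarely co-occurring words drag it down, which is why higher (less negative) u_mass values indicate better topics.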
As we will see below using LDA visualization, the better model comes up with two topics composed of distinct sets of words. Therefore the topic coherence for the goodLdaModel should be greater than that for the badLdaModel, since the topics it comes up with are more human-interpretable. We will verify this using the u_mass and c_v topic coherence measures.
if CAN_VISUALIZE:
    prepared = pyLDAvis.gensim.prepare(goodLdaModel, corpus, dictionary)
    display(pyLDAvis.display(prepared))
if CAN_VISUALIZE:
    prepared = pyLDAvis.gensim.prepare(badLdaModel, corpus, dictionary)
    display(pyLDAvis.display(prepared))
print(goodcm.get_coherence())
print(badcm.get_coherence())
-14.695344054692296
-14.722989402972397
goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
print(goodcm)
Coherence_Measure(seg=<function s_one_set at 0x7f6b0a12ee18>, prob=<function p_boolean_sliding_window at 0x7f6b0a1421e0>, conf=<function cosine_similarity at 0x7f6b09c328c8>, aggr=<function arithmetic_mean at 0x7f6b09c32f28>)
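The key difference from the u_mass pipeline is the probability estimation stage: c_v counts word occurrences over boolean sliding windows rather than over whole documents. A minimal sketch of that windowing step (the window size of 3 here is purely illustrative; gensim uses a much larger default window for c_v):

```python
def sliding_windows(tokens, size):
    # boolean sliding window: each window becomes the set of words it contains
    if len(tokens) <= size:
        return [set(tokens)]
    return [set(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

doc = ['graph', 'minors', 'trees', 'graph', 'survey']
windows = sliding_windows(doc, 3)
# P(word) is then estimated as the fraction of windows containing the word
p_survey = sum('survey' in w for w in windows) / len(windows)
print(windows)
print(p_survey)
```

Windowing makes the probabilities sensitive to local word proximity, which is one reason c_v tends to correlate better with human judgements than document-level measures.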
print(goodcm.get_coherence())
print(badcm.get_coherence())
0.3838413553737203
0.3838413553737203
This API also supports gensim's ldavowpalwabbit and ldamallet wrappers as input to the model parameter.
# Replace with path to your Vowpal Wabbit installation
vw_path = '/usr/local/bin/vw'
# Replace with path to your Mallet installation
home = os.path.expanduser('~')
mallet_path = os.path.join(home, 'mallet-2.0.8', 'bin', 'mallet')
model1 = LdaVowpalWabbit(vw_path, corpus=corpus, num_topics=2, id2word=dictionary, passes=50)
model2 = LdaVowpalWabbit(vw_path, corpus=corpus, num_topics=2, id2word=dictionary, passes=1)
(Note: if Vowpal Wabbit is not installed at vw_path, the cell above raises FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/bin/vw'.)
cm1 = CoherenceModel(model=model1, corpus=corpus, coherence='u_mass')
cm2 = CoherenceModel(model=model2, corpus=corpus, coherence='u_mass')
print(cm1.get_coherence())
print(cm2.get_coherence())
model1 = LdaMallet(mallet_path, corpus=corpus, num_topics=2, id2word=dictionary, iterations=50)
model2 = LdaMallet(mallet_path, corpus=corpus, num_topics=2, id2word=dictionary, iterations=1)
cm1 = CoherenceModel(model=model1, texts=texts, coherence='c_v')
cm2 = CoherenceModel(model=model2, texts=texts, coherence='c_v')
print(cm1.get_coherence())
print(cm2.get_coherence())
The gensim topic coherence pipeline can be used with other topic models too; only the tokenized topics need to be made available to the pipeline. For example, with the gensim HDP model:
hm = HdpModel(corpus=corpus, id2word=dictionary)
# To get the topic words from the model
topics = []
for topic_id, topic in hm.show_topics(num_topics=10, formatted=False):
    topic = [word for word, _ in topic]
    topics.append(topic)
topics[:2]
# Initialize CoherenceModel using `topics` parameter
cm = CoherenceModel(topics=topics, corpus=corpus, dictionary=dictionary, coherence='u_mass')
cm.get_coherence()
Hence, as we can see, the u_mass and c_v coherence for the good LDA model is higher (better) than that for the bad LDA model. This is because the good LDA model generally comes up with better topics that are more human-interpretable, whereas the badLdaModel fails to distinguish between the two underlying topic classes and comes up with topics that are not clear to a human. The u_mass and c_v topic coherences capture this by assigning a number to the interpretability of the topics, as we saw above. Hence these coherence measures can be used to compare different topic models based on their human-interpretability.