Author - Murat Apishev (great-mel@yandex.ru)
BigARTM is an open-source library for topic modeling of text collections, based on the ARTM theory. The project's main site is http://bigartm.org/.
An example of BigARTM Python API usage is shown below in the form of a model experiment: we will learn two topic models of a text collection, ARTM and PLSA, and compare them.
One of the most important quality measures is perplexity. Nevertheless, it is not the only way to check how well a model has been learned; the library implements a number of other scores. We will use four of them: perplexity, the sparsities of the $\Phi$ and $\Theta$ matrices, and the topic kernel characteristics (average purity and contrast). Higher sparsity and higher average purity and contrast indicate a more interpretable model.
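For reference, here is a rough sketch of how these measures are defined in the ARTM papers (the kernel threshold $\delta$ corresponds to the probability_mass_threshold parameter used below; see the library documentation for the exact formulations):

$\mathcal{P}(D) = \exp\Bigl(-\frac{1}{n}\sum_{d \in D}\sum_{w \in d} n_{dw}\,\ln p(w \mid d)\Bigr), \qquad p(w \mid d) = \sum_{t} \phi_{wt}\,\theta_{td},$

where $n$ is the total number of token occurrences in the collection. The sparsity of $\Phi$ (or $\Theta$) is the fraction of zero elements in the matrix. The kernel of a topic $t$ is $W_t = \{w : p(t \mid w) > \delta\}$; its purity is $\sum_{w \in W_t} p(w \mid t)$ and its contrast is $\frac{1}{|W_t|}\sum_{w \in W_t} p(t \mid w)$.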
We will try to learn the ARTM model so that it achieves better sparsity and kernel characteristics than PLSA, without a significant decline in perplexity.
The main tool for controlling the learning process is regularization; the library implements several regularizers. In this experiment we will use three of them in the ARTM model: smoothing/sparsing of the $\Phi$ matrix, smoothing/sparsing of the $\Theta$ matrix, and decorrelation of the topics in $\Phi$. ARTM without regularization corresponds to the PLSA model.
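As a brief reminder of the underlying math (a sketch following the ARTM papers; the library may use slightly different sign and weighting conventions), the model is learned by maximizing the regularized log-likelihood

$\sum_{d \in D}\sum_{w \in d} n_{dw}\,\ln \sum_{t} \phi_{wt}\,\theta_{td} \;+\; \sum_{i} \tau_i\,R_i(\Phi, \Theta) \;\to\; \max_{\Phi,\,\Theta}.$

The smoothing/sparsing regularizers have the form $R(\Phi) = \sum_{t}\sum_{w} \ln \phi_{wt}$ and $R(\Theta) = \sum_{d}\sum_{t} \ln \theta_{td}$ (negative $\tau$ sparses the matrix, positive $\tau$ smooths it), and the decorrelation regularizer is $R(\Phi) = -\frac{1}{2}\sum_{t}\sum_{s \neq t}\sum_{w} \phi_{wt}\,\phi_{ws}$.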
Let's use the small 'kos' collection from the UCI Bag-of-Words repository (https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/); it contains 3430 documents over a vocabulary of 6906 distinct words.
First, let's import all the necessary modules (make sure the BigARTM Python API is available on your Python path):
%matplotlib inline
import glob
import os
import matplotlib.pyplot as plt
import artm
print artm.version()
0.8.0
First of all, you need to prepare the input data. BigARTM processes documents in its own format, called batches, and provides tools to create them from the UCI Bag-of-Words and Vowpal Wabbit formats (see http://docs.bigartm.org/en/latest/formats.html for details).
Similarly to scikit-learn, the library's Python API represents the input data with a single class called BatchVectorizer. An object of this class takes batches or UCI / VW files as input and is passed as a parameter to all learning methods. If the given data is not already in batch format, the object will create batches and store them on disk.
So let's create a BatchVectorizer object:
batch_vectorizer = None
if len(glob.glob(os.path.join('kos', '*.batch'))) < 1:
    batch_vectorizer = artm.BatchVectorizer(data_path='', data_format='bow_uci',
                                            collection_name='kos', target_folder='kos')
else:
    batch_vectorizer = artm.BatchVectorizer(data_path='kos', data_format='batches')
ARTM is the class that represents the BigARTM Python API and gives access to almost all of the library's functionality in a scikit-learn style. Let's create two topic models for our experiment. The most important parameter of a model is the number of topics. Optionally, the user can define a list of regularizers and quality measures (scores) to be used in the model; this step can also be done later. Note that each model defines its own namespace for the names of regularizers and scores.
dictionary = artm.Dictionary()
model_plsa = artm.ARTM(topic_names=['topic_{}'.format(i) for i in xrange(15)],
                       scores=[artm.PerplexityScore(name='PerplexityScore',
                                                    use_unigram_document_model=False,
                                                    dictionary=dictionary)],
                       cache_theta=True)

model_artm = artm.ARTM(topic_names=['topic_{}'.format(i) for i in xrange(15)],
                       scores=[artm.PerplexityScore(name='PerplexityScore',
                                                    use_unigram_document_model=False,
                                                    dictionary=dictionary)],
                       regularizers=[artm.SmoothSparseThetaRegularizer(name='SparseTheta',
                                                                       tau=-0.15)],
                       cache_theta=True)
The next step is to initialize the models. This can be done using a dictionary: random values are generated for every token of the dictionary in every topic of the $\Phi$ matrix. Note that this step is optional: the model will be auto-initialized during the first call of fit_offline() / fit_online(). Dictionary is a BigARTM object that contains information about the collection (the vocabulary and various counters and values linked to tokens).
if not os.path.isfile('kos/dictionary.dict'):
    dictionary.gather(data_path=batch_vectorizer.data_path)
    dictionary.save(dictionary_path='kos/dictionary.dict')

dictionary.load(dictionary_path='kos/dictionary.dict')
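As an optional aside (not used in this experiment), very rare or overly frequent tokens can be removed from the dictionary before initialization with Dictionary.filter(); the parameter names below are an assumption based on BigARTM 0.8 and may differ in other versions:

# Keep only tokens that occur in at least 5 and at most 1500 documents
# (parameter names are assumed; check the documentation of your version).
dictionary.filter(min_df=5, max_df=1500)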
The dictionary can then be used to initialize the topic models:
model_plsa.initialize(dictionary=dictionary)
model_artm.initialize(dictionary=dictionary)
As was said earlier, the ARTM class provides access to all of BigARTM's scores. Once a score has been attached to a model, the model will save all of its values, computed at each $\Phi$ matrix update. Let's add the scores we need for our experiment (only the ones that were not already passed to the constructors):
model_plsa.scores.add(artm.SparsityPhiScore(name='SparsityPhiScore'))
model_plsa.scores.add(artm.SparsityThetaScore(name='SparsityThetaScore'))
model_plsa.scores.add(artm.TopicKernelScore(name='TopicKernelScore', probability_mass_threshold=0.3))
model_artm.scores.add(artm.SparsityPhiScore(name='SparsityPhiScore'))
model_artm.scores.add(artm.SparsityThetaScore(name='SparsityThetaScore'))
model_artm.scores.add(artm.TopicKernelScore(name='TopicKernelScore', probability_mass_threshold=0.3))
Now we'll do the same with the regularizers for model_artm (setting their initial regularization coefficients; these values can be changed later):
model_artm.regularizers.add(artm.SmoothSparsePhiRegularizer(name='SparsePhi', tau=-0.1))
model_artm.regularizers.add(artm.DecorrelatorPhiRegularizer(name='DecorrelatorPhi', tau=1.5e+5))
Now let's learn the models in offline mode (i.e. with one $\Phi$ matrix update per pass through the whole collection). Let's start with 15 passes:
model_plsa.num_document_passes = 1
model_artm.num_document_passes = 1
model_plsa.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=15)
model_artm.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=15)
Let's check the results of the first stage of learning by comparing the score values of the two models:
def print_measures(model_plsa, model_artm):
    print 'Sparsity Phi: {0:.3f} (PLSA) vs. {1:.3f} (ARTM)'.format(
        model_plsa.score_tracker['SparsityPhiScore'].last_value,
        model_artm.score_tracker['SparsityPhiScore'].last_value)
    print 'Sparsity Theta: {0:.3f} (PLSA) vs. {1:.3f} (ARTM)'.format(
        model_plsa.score_tracker['SparsityThetaScore'].last_value,
        model_artm.score_tracker['SparsityThetaScore'].last_value)
    print 'Kernel contrast: {0:.3f} (PLSA) vs. {1:.3f} (ARTM)'.format(
        model_plsa.score_tracker['TopicKernelScore'].last_average_contrast,
        model_artm.score_tracker['TopicKernelScore'].last_average_contrast)
    print 'Kernel purity: {0:.3f} (PLSA) vs. {1:.3f} (ARTM)'.format(
        model_plsa.score_tracker['TopicKernelScore'].last_average_purity,
        model_artm.score_tracker['TopicKernelScore'].last_average_purity)
    print 'Perplexity: {0:.3f} (PLSA) vs. {1:.3f} (ARTM)'.format(
        model_plsa.score_tracker['PerplexityScore'].last_value,
        model_artm.score_tracker['PerplexityScore'].last_value)

    plt.plot(xrange(model_plsa.num_phi_updates),
             model_plsa.score_tracker['PerplexityScore'].value, 'b--',
             xrange(model_artm.num_phi_updates),
             model_artm.score_tracker['PerplexityScore'].value, 'r--', linewidth=2)
    plt.xlabel('Iterations count')
    plt.ylabel('PLSA perp. (blue), ARTM perp. (red)')
    plt.grid(True)
    plt.show()
print_measures(model_plsa, model_artm)
Sparsity Phi: 0.000 (PLSA) vs. 0.469 (ARTM)
Sparsity Theta: 0.000 (PLSA) vs. 0.001 (ARTM)
Kernel contrast: 0.466 (PLSA) vs. 0.525 (ARTM)
Kernel purity: 0.215 (PLSA) vs. 0.359 (ARTM)
Perplexity: 2058.027 (PLSA) vs. 1950.717 (ARTM)
We can see an improvement in the sparsities and kernel measures, with no significant harm to perplexity. Let's try to increase the absolute values of the regularization coefficients:
model_artm.regularizers['SparsePhi'].tau = -0.2
model_artm.regularizers['SparseTheta'].tau = -0.2
model_artm.regularizers['DecorrelatorPhi'].tau = 2.5e+5
Besides that, let's add the TopTokensScore measure to each model; it allows us to look at the most probable tokens in each topic:
model_plsa.scores.add(artm.TopTokensScore(name='TopTokensScore', num_tokens=6))
model_artm.scores.add(artm.TopTokensScore(name='TopTokensScore', num_tokens=6))
We'll continue the learning process with 25 more passes through the collection, and then look at the score values:
model_plsa.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=25)
model_artm.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=25)
print_measures(model_plsa, model_artm)
Sparsity Phi: 0.093 (PLSA) vs. 0.841 (ARTM)
Sparsity Theta: 0.000 (PLSA) vs. 0.023 (ARTM)
Kernel contrast: 0.640 (PLSA) vs. 0.740 (ARTM)
Kernel purity: 0.674 (PLSA) vs. 0.822 (ARTM)
Perplexity: 1619.031 (PLSA) vs. 1644.220 (ARTM)
Let's also plot how the sparsity of the matrices changes across iterations:
plt.plot(xrange(model_plsa.num_phi_updates),
         model_plsa.score_tracker['SparsityPhiScore'].value, 'b--',
         xrange(model_artm.num_phi_updates),
         model_artm.score_tracker['SparsityPhiScore'].value, 'r--', linewidth=2)
plt.xlabel('Iterations count')
plt.ylabel('PLSA Phi sp. (blue), ARTM Phi sp. (red)')
plt.grid(True)
plt.show()

plt.plot(xrange(model_plsa.num_phi_updates),
         model_plsa.score_tracker['SparsityThetaScore'].value, 'b--',
         xrange(model_artm.num_phi_updates),
         model_artm.score_tracker['SparsityThetaScore'].value, 'r--', linewidth=2)
plt.xlabel('Iterations count')
plt.ylabel('PLSA Theta sp. (blue), ARTM Theta sp. (red)')
plt.grid(True)
plt.show()
The achieved result seems good enough: regularization helped us improve all the scores at the cost of only a small increase in perplexity. Let's look at the top tokens:
for topic_name in model_plsa.topic_names:
    print topic_name + ': ',
    print model_plsa.score_tracker['TopTokensScore'].last_tokens[topic_name]
topic_0: [u'year', u'tax', u'jobs', u'america', u'president', u'issues']
topic_1: [u'people', u'war', u'service', u'military', u'rights', u'vietnam']
topic_2: [u'november', u'electoral', u'account', u'polls', u'governor', u'contact']
topic_3: [u'republican', u'gop', u'senate', u'senator', u'south', u'conservative']
topic_4: [u'people', u'time', u'country', u'speech', u'talking', u'read']
topic_5: [u'dean', u'democratic', u'edwards', u'primary', u'kerry', u'clark']
topic_6: [u'state', u'party', u'race', u'candidates', u'candidate', u'elections']
topic_7: [u'administration', u'president', u'years', u'bill', u'white', u'cheney']
topic_8: [u'campaign', u'national', u'media', u'local', u'late', u'union']
topic_9: [u'house', u'million', u'money', u'republican', u'committee', u'delay']
topic_10: [u'republicans', u'vote', u'senate', u'election', u'democrats', u'house']
topic_11: [u'iraq', u'war', u'american', u'iraqi', u'military', u'intelligence']
topic_12: [u'kerry', u'poll', u'percent', u'voters', u'polls', u'numbers']
topic_13: [u'news', u'time', u'asked', u'political', u'washington', u'long']
topic_14: [u'bush', u'general', u'bushs', u'kerry', u'oct', u'states']
for topic_name in model_artm.topic_names:
    print topic_name + ': ',
    print model_artm.score_tracker['TopTokensScore'].last_tokens[topic_name]
topic_0: [u'party', u'political', u'issue', u'tax', u'america', u'issues']
topic_1: [u'people', u'military', u'official', u'officials', u'service', u'public']
topic_2: [u'electoral', u'governor', u'account', u'contact', u'ticket', u'experience']
topic_3: [u'gop', u'convention', u'senator', u'debate', u'south', u'sen']
topic_4: [u'country', u'speech', u'bad', u'read', u'end', u'talking']
topic_5: [u'democratic', u'dean', u'john', u'edwards', u'primary', u'clark']
topic_6: [u'percent', u'race', u'candidates', u'candidate', u'win', u'nader']
topic_7: [u'administration', u'years', u'white', u'year', u'bill', u'jobs']
topic_8: [u'campaign', u'national', u'media', u'press', u'local', u'ads']
topic_9: [u'house', u'republican', u'million', u'money', u'elections', u'district']
topic_10: [u'november', u'poll', u'senate', u'republicans', u'vote', u'election']
topic_11: [u'iraq', u'war', u'american', u'iraqi', u'security', u'united']
topic_12: [u'bush', u'kerry', u'general', u'president', u'voters', u'bushs']
topic_13: [u'time', u'news', u'long', u'asked', u'washington', u'political']
topic_14: [u'state', u'states', u'people', u'oct', u'fact', u'ohio']
We can see that the topics are approximately equal in terms of interpretability, but the ARTM topics are more distinct from one another.
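One rough way to quantify "more distinct" (not part of the original experiment, just a quick check based on the scores we have already collected) is to compare the average pairwise overlap of the top-token sets of the two models:

# Average pairwise Jaccard overlap of the topics' top-token sets;
# a lower value means the topics share fewer of their most probable words.
def average_topic_overlap(model):
    tokens = model.score_tracker['TopTokensScore'].last_tokens
    topics = model.topic_names
    overlaps = []
    for i, t1 in enumerate(topics):
        for t2 in topics[i + 1:]:
            a, b = set(tokens[t1]), set(tokens[t2])
            overlaps.append(float(len(a & b)) / len(a | b))
    return sum(overlaps) / len(overlaps)

print 'Top-token overlap: {0:.3f} (PLSA) vs. {1:.3f} (ARTM)'.format(
    average_topic_overlap(model_plsa), average_topic_overlap(model_artm))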
Let's extract the $\Phi$ matrix as a pandas.DataFrame and print it (to perform this operation with more options, use ARTM.get_phi()):
print model_artm.phi_
                  topic_0   topic_1   topic_2   topic_3   topic_4   topic_5  ...
parentheses      0.000000  0.000000  0.000000  0.000000  0.000000  0.000277  ...
opinion          0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  ...
attitude         0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  ...
held             0.000000  0.000385  0.000000  0.000000  0.000000  0.000000  ...
impeachment      0.000000  0.000115  0.000000  0.000000  0.000000  0.000000  ...
platform         0.001717  0.000000  0.000000  0.000000  0.000000  0.000000  ...
msnbc            0.000000  0.000000  0.000000  0.000000  0.000000  0.000194  ...
assault          0.000000  0.000000  0.000000  0.000000  0.001202  0.000000  ...
...                   ...       ...       ...       ...       ...       ...  ...

[6906 rows x 15 columns]
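Since phi_ is an ordinary pandas.DataFrame with tokens in the index and topics in the columns, the usual pandas tools apply. For example, we can look at the distribution of a single token over topics (the token 'kerry' is taken from the output above), or extract only a subset of topics via get_phi() (the topic_names argument is assumed to be available in this API version):

# p(w|t) of one token across all topics
print model_artm.phi_.loc['kerry']

# Phi restricted to two topics (first rows only)
print model_artm.get_phi(topic_names=['topic_5', 'topic_12']).head()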
Let's additionally extract the $\Theta$ matrix and print it:
theta_matrix = model_artm.get_theta()
print theta_matrix
             3001      3002      3003      3004      3005      3006  ...
topic_0  0.076121  0.035935  0.069378  0.136764  0.083617  0.011377  ...
topic_1  0.095387  0.029030  0.034951  0.043516  0.064787  0.022496  ...
topic_2  0.004087  0.048284  0.012102  0.006649  0.015708  0.520682  ...
topic_3  0.057739  0.029823  0.122427  0.085727  0.060414  0.016142  ...
topic_4  0.122660  0.046273  0.084217  0.077540  0.064804  0.014840  ...
...           ...       ...       ...       ...       ...       ...  ...

[15 rows x 3430 columns]
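Since theta_matrix has topics in the rows and document ids in the columns, standard pandas operations can be used to inspect it. For instance, a quick way to see which topic is the most probable one for each document, and how the documents are distributed over topics:

# For every document (column) take the topic with the highest probability,
# then count how many documents fall into each topic.
most_probable_topics = theta_matrix.idxmax(axis=0)
print most_probable_topics.value_counts()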
The model can be used to find $\theta_d$ vectors for new documents via the ARTM.transform() method:
test_batch_vectorizer = artm.BatchVectorizer(data_format='batches', data_path='kos_test', batches=['test_docs.batch'])
test_theta_matrix = model_artm.transform(batch_vectorizer=test_batch_vectorizer)
print test_theta_matrix
Empty DataFrame
Columns: []
Index: []
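The result above is empty simply because the 'kos_test' folder with 'test_docs.batch' was not present in this demo. As a sketch (the file name kos_test.vw is hypothetical), held-out documents can first be converted into batches, for example from a Vowpal Wabbit file, and then passed to transform():

# 'kos_test.vw' is a hypothetical held-out file in Vowpal Wabbit format,
# one document per line, e.g. "doc_3431 bush:2 kerry:1 poll:4".
test_batch_vectorizer = artm.BatchVectorizer(data_path='kos_test.vw',
                                             data_format='vowpal_wabbit',
                                             target_folder='kos_test_batches')
test_theta_matrix = model_artm.transform(batch_vectorizer=test_batch_vectorizer)
print test_theta_matrix.shape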
The topic modeling task has an infinite set of solutions, which gives us freedom of choice. Regularizers make it possible to obtain a result that satisfies several criteria (such as sparsity and interpretability) at the same time.
The example given here is only a demonstration; one can choose more flexible regularization strategies to get better results. Experiments with other, larger collections can be carried out in the same way as described above.