Authors — Nadezhda Chirkova and Artem Popov
In this tutorial we describe principles of building hierarchies in BigARTM. Notebook was writed for Python 2.7.
import numpy as np
import artm
from artm import hARTM
import sys
sys.path.append('utils/')
# you need sklearn for simple loading
from load_collections import load_20newsgroups
import glob
import os
artm.version()
'0.8.3'
Data: documents set $D$, words set $W$, document-word matrix $\{n_{dw}\}_{D \times W}$.
Model: Denote $p(w|d) = \frac{n_{dw}}{\sum_w n_{dw}}$, $T$ is a topics set. The topic model is $$ p(w|d) = \sum_{t \in T} p(w|t) p(t|d) = \sum_{t \in T} \phi_{wt} \theta_{td}, \hspace{3cm} (1) $$ with parameters
Parameter learning: regularizer maximum likelihood maximization $$ \sum_d \sum_w n_{dw} \ln \sum_t \phi_{wt} \theta_{td} + \sum_i \tau_i R_i(\Phi, \Theta) \rightarrow max_{\Phi, \Theta} $$ where regularizers $R(\Phi, \Theta) = \sum_i \tau_i R_i(\Phi, \Theta)$ allows introducing additional subject-specific criterias, $\tau_i$ are regularizers' coefficients.
When we have built parent level, let's denote its topics $a \in A$ (ancestor) and matrices $\Phi^p$ and $\Theta^p$. Now we will build next level model with topics set $T$.
Let's introduce new matrix factorization problem: $$ \phi^p_{wa} = p(w|a) \approx \sum_{t} p(w|t) p(t|a) = \sum_t \phi_{wt} \psi_{ta}$$
with new parameters matrix $\Psi = \{ \psi_{ta} \}_{T \times A}$ containing probabilities p(topic | ancestor topic) calles link weights.
If KL-divergence is a similarity measure between distributions, previous equation produces regularizer for next level model: $$ R(\Phi, \Psi) = \sum_w \sum_a \phi_{wa} \ln \sum_t \phi_{wt} \psi_{ta} \rightarrow max_{\Phi, \Psi} $$.
$$ \sum_d \sum_w n_{dw} \ln \sum_t \phi_{wt} \theta_{td} + \tau R(\Phi, \Psi) \rightarrow max_{\Phi, \Psi, \Theta} $$ Both likelihood and regularizer formulas have common structure. So there is a simple way to train $\Psi$ simultaneously with $\Phi$ and $\Theta$:
we just add $|A|$ pseudodocuments to collection, each representing parent $\Phi$ column: $n{aw} = \tau p(w|a)$._
Hierarchy in BigARTM is implemented in hierarchy_utils module. To build hierarchy, create hARTM instance:
hier = hARTM()
You should pass to hARTM parameters that are common for all levels. These are the same parameters that you pass to usual ARTM model.
Levels will be built one by one. To add first level, use add_level method specifying remaining model parameters (unique for the level):
level0 = hier.add_level(num_topics=10)
level0.scores.add(artm.TopTokensScore(name='TopTokensScore', num_tokens=20,
class_id='text'))
This method returns ARTM object so you can work with it as you used: initialize it, fit offline, add regularizer ans scores etc. For example:
load_20newsgroups()
data_path = 'data/20newsgroups/20newsgroups_train.vw'
batches_path = 'data/20newsgroups/batches'
if len(glob.glob(os.path.join(batches_path + '*.batch'))) < 1:
batch_vectorizer = artm.BatchVectorizer(data_path=data_path, data_format='vowpal_wabbit',
target_folder=batches_path)
else:
batch_vectorizer = artm.BatchVectorizer(data_path=batches_path, data_format='batches')
dictionary = artm.Dictionary('dictionary')
dictionary.gather(batches_path)
dictionary.filter(min_df=5, max_tf=1000)
level0.initialize(dictionary=dictionary)
level0.fit_offline(batch_vectorizer, num_collection_passes=30)
for topic_name in level0.topic_names:
print topic_name + ': ',
print ", ".join(level0.score_tracker['TopTokensScore'].last_tokens[topic_name])
topic_0: gun, state, entry, output, guns, control, bill, crime, states, police, rights, section, tax, firearms, federal, laws, weapons, court, public, ok topic_1: card, dos, mb, scsi, disk, hard, memory, pc, mac, video, price, drives, monitor, apple, controller, ram, bus, speed, mhz, software topic_2: encryption, chip, public, nasa, technology, clipper, president, security, keys, privacy, internet, national, launch, administration, earth, research, access, computer, des, phone topic_3: jesus, true, life, bible, christian, church, fact, mean, faith, christians, religion, christ, evidence, rather, seems, man, word, reason, argument, truth topic_4: window, db, server, motif, application, widget, sun, display, list, mit, code, manager, ms, running, user, screen, lib, run, cs, subject topic_5: she, car, her, little, went, maybe, enough, lot, day, thought, anything, put, tell, remember, heard, told, old, again, left, bike topic_6: game, team, play, games, season, hockey, league, players, win, period, st, la, teams, nhl, vs, player, san, pts, gm, second topic_7: armenian, israel, turkish, war, jews, armenians, israeli, health, during, turkey, medical, against, jewish, history, children, state, anti, killed, turks, population topic_8: pl, tm, di, ei, cx, wm, bhj, giz, hz, ah, ey, lk, um, ww, qs, sl, mw, chz, mq, uw topic_9: image, ftp, files, software, graphics, version, pub, email, send, info, ca, format, images, contact, current, line, code, package, jpeg, source
When first level is fit, you have to add next level:
level1 = hier.add_level(num_topics=20, topic_names=['child_topic_' + str(i) for i in range(20)],
parent_level_weight=1)
When you add this level, parent levels phi matrix will be saved into special, parent level batch. It is the way how pseudoduments are created. This created batch will be added to other batches when you fit model. Explaination of add_level parameters:
These two parameters are ignored during creation of first level.
Now you can learn level1 model by any means. For example:
level1.scores.add(artm.TopTokensScore(name='TopTokensScore', num_tokens=20,
class_id='text'))
level1.initialize(dictionary=dictionary)
level1.fit_offline(batch_vectorizer, num_collection_passes=30)
The part of $\Theta$ matrix corresponding to parent level batch is $\Psi$ matrix. To get it, use get_psi method:
psi = level1.get_psi()
Note than level0 has no get_psi method.
for topic_name in level1.topic_names:
print(topic_name + ': ')
print(" ".join(level1.score_tracker['TopTokensScore'].last_tokens[topic_name]))
child_topic_0: seems anything isn actually post rather real maybe reason try least enough idea wrong seem course nothing perhaps tell little child_topic_1: window db server motif application widget display sun list mit manager code running lib user screen run font xterm subject child_topic_2: group posting important money support discussion feel lot newsgroup groups care article gay business men agree personal freedom postings community child_topic_3: she her went saw told day came started home says left heard took thought old maybe remember tell again little child_topic_4: ftp pub graphics ca software contact package version email fax comp send cs uk anonymous ac unix computer university address child_topic_5: entry output gun ok rules build check line section info entries printf eof stream int char size open title contest child_topic_6: jesus bible christian true life faith christians religion christ truth evidence man word argument belief religious cannot science christianity example child_topic_7: ground wire current circuit wiring neutral voltage box electrical amp high signal connected line cable test audio electronics subject usually child_topic_8: church stephanopoulos president moral mean catholic father pope yes holy society love spirit sex son authority george marriage day morality child_topic_9: game team games baseball san league hit runs won win players lost th average home season run pitching fan cubs child_topic_10: armenian israel turkish jews armenians war israeli turkey medical during health jewish turks against history anti greek armenia arab genocide child_topic_11: card dos mb scsi disk hard memory pc mac video drives price monitor apple controller bus ram mhz sale speed child_topic_12: image files info jpeg software gif color format looking version images faq hi graphics send advance email post quality keyboard child_topic_13: against fire fbi war koresh killed children country military death kill evidence media trial weapons far compound judge batf killing child_topic_14: encryption chip technology clipper nasa public security keys privacy internet launch president administration national earth des research access secure phone child_topic_15: game team play hockey season period games la nhl vs players pts gm st league win goal teams pittsburgh flyers child_topic_16: car bike water cars engine front enough buy miles oil little speed big air road side high money driving lot child_topic_17: pl tm di ei wm giz bhj ey um sl tq kn bxn qax gk bj qq ql tg mf child_topic_18: cx hz ww qs ah lk uw ck mw mv sc mc pl md zd tl mt chz ma cj child_topic_19: gun state states bill police control rights crime laws tax firearms public federal congress court amendment clinton guns weapons house
You can get levels specifying level_index (from 0 as usually in python so first level has index 0):
level_index = 0
some_level = hier.get_level(level_index)
To delete level, use:
level_index = 1
hier.del_level(level_index)
Be careful: if you delete not the last level, all next levels will be deleted too.
You can access number of levels of hierarchy using .num_levels:
print(hier.num_levels)
1
When building a level of hierarchy a problem can occur. Some topics may have low link weights for all ancestor topics: $$ p(t|a) \approx 0 \quad \forall a $$ This may occur due to real lack of appropriate parent topics or because such topic tries to be a subtopic of all ancestors.
To avoid this situation, special hierarchy sparsing regularizer can be used. It affects $\Psi$ matrix and makes all distributions p(ancestor | topic) be sparse. In this case each topic will have a small amount of parents. As with other sparsing regularizers, we maximize KL(uniform distribution | p(a|t) ). After transformations we get regularizer criteria: $$R_2(\Psi) = \sum_a \sum_t p(a | t) = \sum_a \sum_t \frac{p(t|a) p(a)} {\sum_{t'} p(t'|a) p(a)} = \sum_a \sum_t \frac{\psi_{ta} p(a)} {\sum_{t'} \psi_{ta} p(a)} \rightarrow max_{\Psi}$$
Values p(a) don't slightly affect $\Psi$ so can be set uniform. Updated M-step: $$ \psi_{ta} = \text{norm}_{t} \Biggl[ n_{ta} - \biggl( \frac 1 {|A|} - p(a | t) \biggr) \Biggr]$$
If ancestor $a$ has high $p(a | t)$ for some $t$, it will be more increased. Links with low $p(a | t)$ will be reduced.
As $\Psi$ in BigARTM is part of $\Theta$, then HierarchySparsingRegularizer is theta regularizer. It can be used the same way as other BigARTM regularizers:
level1 = hier.add_level(num_topics=20, topic_names=['child_topic_' + str(i) for i in range(20)],
parent_level_weight=1)
level1.regularizers.add(artm.HierarchySparsingThetaRegularizer(name="HierSp", tau=1.0))
level1.scores.add(artm.TopTokensScore(name='TopTokensScore', num_tokens=20,
class_id='text'))
level1.initialize(dictionary=dictionary)
level1.fit_offline(batch_vectorizer, num_collection_passes=30)
This regularizer can affect only special parent levels phi batches. It means that if you add HierarchySparsingRegularizer to usual, not hierarchy level model, regularizer will have no effect. The same with regular batches' theta, it will not be affected by this regularizer.
You can use all BigARTM scores to assess separate level models. Also there are some measures that describe hierarchy structure. They can be easily computed using numpy so they are not implemented in BigARTM.
Usually it is needed to set psi threshold that is min value of $p(t | a)$ so that link a-t will be included to topic graph. But with high threshold some topics will have no parents. We define support as maximum avialable threshold for Psi matrix with which all topics will have at least one ancestor.
psi = level1.get_psi()
print "Psi support:", psi.values.max(axis=1).min()
Psi support: 0.0324899
In BigARTM hierarchy is defined as multilevel topic graph rathjer than topic tree. So it is reasonable to evaluate mean count of parents.
psi_threshold = 0.01
parent_counts = np.zeros(0)
for level_idx in range(1, hier.num_levels):
psi = hier.get_level(level_idx).get_psi().values
parent_counts = np.hstack((parent_counts, (psi > psi_threshold).sum(axis=1)))
print "Mean parents count:", parent_counts.mean()
Mean parents count: 1.15
We can construct parent-children relations with $p(parent|child)$ matrix, but psi matrix gives us only $p(child|parent)$. We can get $p(parent|child)$ this way:
batch = artm.messages.Batch()
batch_name = 'phi1.batch'
with open(batch_name, "rb") as f:
batch.ParseFromString(f.read())
Ntw = np.zeros(len(level0.topic_names))
for i,item in enumerate(batch.item):
for (token_id, token_weight) in zip(item.field[0].token_id, item.field[0].token_weight):
Ntw[i] += token_weight
Nt1t0 = np.array(psi) * Ntw
psi_bayes = (Nt1t0 / Nt1t0.sum(axis=1)[:, np.newaxis]).T
indexes_child = np.argmax(psi_bayes, axis=0)
Look for the topic_4 and its children:
topic_parent_name = 'topic_4'
print(topic_parent_name + ':')
print(" ".join(level0.score_tracker['TopTokensScore'].last_tokens[topic_parent_name]))
print('')
for child in np.where(indexes_child == i)[0]:
print(' ' + level1.topic_names[child] + ': ')
print(" ".join(level1.score_tracker['TopTokensScore'].last_tokens[level1.topic_names[child]]))
print('')
topic_4: window db server motif application widget sun display list mit code manager ms running user screen lib run cs subject child_topic_4: ftp pub graphics ca software contact package version email fax comp send cs uk anonymous ac unix computer university address child_topic_7: ground wire current circuit wiring neutral voltage box electrical amp high signal connected line cable test audio electronics subject usually child_topic_12: image files info jpeg software gif color format looking version images faq hi graphics send advance email post quality keyboard