Notebook

Hierarchies in BigARTM¶

Authors — Nadezhda Chirkova and Artem Popov

In this tutorial we describe principles of building hierarchies in BigARTM. Notebook was writed for Python 2.7.

In [1]:

import numpy as np

import artm
from artm import hARTM

import sys
sys.path.append('utils/')
# you need sklearn for simple loading
from load_collections import load_20newsgroups

import glob 
import os

In [2]:

artm.version()

Out[2]:

'0.8.3'

1. Method explaination¶

Usual ARTM model¶

Data: documents set $D$, words set $W$, document-word matrix $\{n_{dw}\}_{D \times W}$.

Model: Denote $p(w|d) = \frac{n_{dw}}{\sum_w n_{dw}}$, $T$ is a topics set. The topic model is $$ p(w|d) = \sum_{t \in T} p(w|t) p(t|d) = \sum_{t \in T} \phi_{wt} \theta_{td}, \hspace{3cm} (1) $$ with parameters

$\Phi = \{\phi_{wt}\}_{W \times T}$
$\Theta = \{ \theta_{td}\}_{T \times D}$

Parameter learning: regularizer maximum likelihood maximization $$ \sum_d \sum_w n_{dw} \ln \sum_t \phi_{wt} \theta_{td} + \sum_i \tau_i R_i(\Phi, \Theta) \rightarrow max_{\Phi, \Theta} $$ where regularizers $R(\Phi, \Theta) = \sum_i \tau_i R_i(\Phi, \Theta)$ allows introducing additional subject-specific criterias, $\tau_i$ are regularizers' coefficients.

How hierarchy is constructed from several usual models¶

Hierarchy definition:¶

Topic hierarchy is an oriented multipartite (multilevel) graph of topics so that edges connect only topics from neighboring levels.
Zero level consists of the only node called root.
Each none-zero level has more topics than previous one. Previous level is called parent level.
If there is edge topic-subtopic in hierarchy, topic is also called parent topic or ancestor.

Hierarchy construction:¶

Root is associated with the whole collection and doesn't need modeling.
Every non-zero level is a usual topic model.
First level has few topics that are main collection topics. First level topics have the only parent topic (root).
For each level with index > 1 we need to to establish parent-children relationship with previous level topics.

Establishing parent-children relations¶

When we have built parent level, let's denote its topics $a \in A$ (ancestor) and matrices $\Phi^p$ and $\Theta^p$. Now we will build next level model with topics set $T$.

Let's introduce new matrix factorization problem: $$ \phi^p_{wa} = p(w|a) \approx \sum_{t} p(w|t) p(t|a) = \sum_t \phi_{wt} \psi_{ta}$$

with new parameters matrix $\Psi = \{ \psi_{ta} \}_{T \times A}$ containing probabilities p(topic | ancestor topic) calles link weights.

If KL-divergence is a similarity measure between distributions, previous equation produces regularizer for next level model: $$ R(\Phi, \Psi) = \sum_w \sum_a \phi_{wa} \ln \sum_t \phi_{wt} \psi_{ta} \rightarrow max_{\Phi, \Psi} $$.

$$ \sum_d \sum_w n_{dw} \ln \sum_t \phi_{wt} \theta_{td} + \tau R(\Phi, \Psi) \rightarrow max_{\Phi, \Psi, \Theta} $$ Both likelihood and regularizer formulas have common structure. So there is a simple way to train $\Psi$ simultaneously with $\Phi$ and $\Theta$:

we just add $|A|$ pseudodocuments to collection, each representing parent $\Phi$ column: $n{aw} = \tau p(w|a)$._

2. BigARTM implementation¶

Hierarchy in BigARTM is implemented in hierarchy_utils module. To build hierarchy, create hARTM instance:

In [3]:

hier = hARTM()

You should pass to hARTM parameters that are common for all levels. These are the same parameters that you pass to usual ARTM model.

Levels will be built one by one. To add first level, use add_level method specifying remaining model parameters (unique for the level):

In [4]:

level0 = hier.add_level(num_topics=10)
level0.scores.add(artm.TopTokensScore(name='TopTokensScore', num_tokens=20, 
                                      class_id='text'))

This method returns ARTM object so you can work with it as you used: initialize it, fit offline, add regularizer ans scores etc. For example:

In [5]:

load_20newsgroups()

In [6]:

data_path = 'data/20newsgroups/20newsgroups_train.vw'
batches_path = 'data/20newsgroups/batches'

In [7]:

if len(glob.glob(os.path.join(batches_path + '*.batch'))) < 1:
    batch_vectorizer = artm.BatchVectorizer(data_path=data_path, data_format='vowpal_wabbit',
                                            target_folder=batches_path)
else:
    batch_vectorizer = artm.BatchVectorizer(data_path=batches_path, data_format='batches')

In [8]:

dictionary = artm.Dictionary('dictionary')
dictionary.gather(batches_path)
dictionary.filter(min_df=5, max_tf=1000)

In [9]:

level0.initialize(dictionary=dictionary)
level0.fit_offline(batch_vectorizer, num_collection_passes=30)

In [10]:

for topic_name in level0.topic_names:
    print topic_name + ': ',
    print ", ".join(level0.score_tracker['TopTokensScore'].last_tokens[topic_name])

topic_0:  gun, state, entry, output, guns, control, bill, crime, states, police, rights, section, tax, firearms, federal, laws, weapons, court, public, ok
topic_1:  card, dos, mb, scsi, disk, hard, memory, pc, mac, video, price, drives, monitor, apple, controller, ram, bus, speed, mhz, software
topic_2:  encryption, chip, public, nasa, technology, clipper, president, security, keys, privacy, internet, national, launch, administration, earth, research, access, computer, des, phone
topic_3:  jesus, true, life, bible, christian, church, fact, mean, faith, christians, religion, christ, evidence, rather, seems, man, word, reason, argument, truth
topic_4:  window, db, server, motif, application, widget, sun, display, list, mit, code, manager, ms, running, user, screen, lib, run, cs, subject
topic_5:  she, car, her, little, went, maybe, enough, lot, day, thought, anything, put, tell, remember, heard, told, old, again, left, bike
topic_6:  game, team, play, games, season, hockey, league, players, win, period, st, la, teams, nhl, vs, player, san, pts, gm, second
topic_7:  armenian, israel, turkish, war, jews, armenians, israeli, health, during, turkey, medical, against, jewish, history, children, state, anti, killed, turks, population
topic_8:  pl, tm, di, ei, cx, wm, bhj, giz, hz, ah, ey, lk, um, ww, qs, sl, mw, chz, mq, uw
topic_9:  image, ftp, files, software, graphics, version, pub, email, send, info, ca, format, images, contact, current, line, code, package, jpeg, source

When first level is fit, you have to add next level:

In [11]:

level1 = hier.add_level(num_topics=20, topic_names=['child_topic_' + str(i) for i in range(20)], 
                        parent_level_weight=1)

When you add this level, parent levels phi matrix will be saved into special, parent level batch. It is the way how pseudoduments are created. This created batch will be added to other batches when you fit model. Explaination of add_level parameters:

parent_level_weight is regularizer's coefficient $\tau$. Token_values in parent level batch will be multiplied by parent_level_weight during learning.
tmp_files_path is a path where model can save this parent level batch.

These two parameters are ignored during creation of first level.

Now you can learn level1 model by any means. For example:

In [12]:

level1.scores.add(artm.TopTokensScore(name='TopTokensScore', num_tokens=20, 
                                      class_id='text'))

In [13]:

level1.initialize(dictionary=dictionary)
level1.fit_offline(batch_vectorizer, num_collection_passes=30)

The part of $\Theta$ matrix corresponding to parent level batch is $\Psi$ matrix. To get it, use get_psi method:

In [14]:

psi = level1.get_psi()

Note than level0 has no get_psi method.

In [15]:

for topic_name in level1.topic_names:
    print(topic_name + ': ')
    print(" ".join(level1.score_tracker['TopTokensScore'].last_tokens[topic_name]))

child_topic_0: 
seems anything isn actually post rather real maybe reason try least enough idea wrong seem course nothing perhaps tell little
child_topic_1: 
window db server motif application widget display sun list mit manager code running lib user screen run font xterm subject
child_topic_2: 
group posting important money support discussion feel lot newsgroup groups care article gay business men agree personal freedom postings community
child_topic_3: 
she her went saw told day came started home says left heard took thought old maybe remember tell again little
child_topic_4: 
ftp pub graphics ca software contact package version email fax comp send cs uk anonymous ac unix computer university address
child_topic_5: 
entry output gun ok rules build check line section info entries printf eof stream int char size open title contest
child_topic_6: 
jesus bible christian true life faith christians religion christ truth evidence man word argument belief religious cannot science christianity example
child_topic_7: 
ground wire current circuit wiring neutral voltage box electrical amp high signal connected line cable test audio electronics subject usually
child_topic_8: 
church stephanopoulos president moral mean catholic father pope yes holy society love spirit sex son authority george marriage day morality
child_topic_9: 
game team games baseball san league hit runs won win players lost th average home season run pitching fan cubs
child_topic_10: 
armenian israel turkish jews armenians war israeli turkey medical during health jewish turks against history anti greek armenia arab genocide
child_topic_11: 
card dos mb scsi disk hard memory pc mac video drives price monitor apple controller bus ram mhz sale speed
child_topic_12: 
image files info jpeg software gif color format looking version images faq hi graphics send advance email post quality keyboard
child_topic_13: 
against fire fbi war koresh killed children country military death kill evidence media trial weapons far compound judge batf killing
child_topic_14: 
encryption chip technology clipper nasa public security keys privacy internet launch president administration national earth des research access secure phone
child_topic_15: 
game team play hockey season period games la nhl vs players pts gm st league win goal teams pittsburgh flyers
child_topic_16: 
car bike water cars engine front enough buy miles oil little speed big air road side high money driving lot
child_topic_17: 
pl tm di ei wm giz bhj ey um sl tq kn bxn qax gk bj qq ql tg mf
child_topic_18: 
cx hz ww qs ah lk uw ck mw mv sc mc pl md zd tl mt chz ma cj
child_topic_19: 
gun state states bill police control rights crime laws tax firearms public federal congress court amendment clinton guns weapons house

You can get levels specifying level_index (from 0 as usually in python so first level has index 0):

In [16]:

level_index = 0
some_level = hier.get_level(level_index)

To delete level, use:

In [17]:

level_index = 1
hier.del_level(level_index)

Be careful: if you delete not the last level, all next levels will be deleted too.

You can access number of levels of hierarchy using .num_levels:

In [18]:

print(hier.num_levels)

3. Improving hierarchy structure¶

Hierarchy sparsing regularizer¶

When building a level of hierarchy a problem can occur. Some topics may have low link weights for all ancestor topics: $$ p(t|a) \approx 0 \quad \forall a $$ This may occur due to real lack of appropriate parent topics or because such topic tries to be a subtopic of all ancestors.

To avoid this situation, special hierarchy sparsing regularizer can be used. It affects $\Psi$ matrix and makes all distributions p(ancestor | topic) be sparse. In this case each topic will have a small amount of parents. As with other sparsing regularizers, we maximize KL(uniform distribution | p(a|t) ). After transformations we get regularizer criteria: $$R_2(\Psi) = \sum_a \sum_t p(a | t) = \sum_a \sum_t \frac{p(t|a) p(a)} {\sum_{t'} p(t'|a) p(a)} = \sum_a \sum_t \frac{\psi_{ta} p(a)} {\sum_{t'} \psi_{ta} p(a)} \rightarrow max_{\Psi}$$

Values p(a) don't slightly affect $\Psi$ so can be set uniform. Updated M-step: $$ \psi_{ta} = \text{norm}_{t} \Biggl[ n_{ta} - \biggl( \frac 1 {|A|} - p(a | t) \biggr) \Biggr]$$

If ancestor $a$ has high $p(a | t)$ for some $t$, it will be more increased. Links with low $p(a | t)$ will be reduced.

Reularizer usage¶

As $\Psi$ in BigARTM is part of $\Theta$, then HierarchySparsingRegularizer is theta regularizer. It can be used the same way as other BigARTM regularizers:

In [19]:

level1 = hier.add_level(num_topics=20, topic_names=['child_topic_' + str(i) for i in range(20)], 
                        parent_level_weight=1)
level1.regularizers.add(artm.HierarchySparsingThetaRegularizer(name="HierSp", tau=1.0))
level1.scores.add(artm.TopTokensScore(name='TopTokensScore', num_tokens=20, 
                                      class_id='text'))

In [20]:

level1.initialize(dictionary=dictionary)
level1.fit_offline(batch_vectorizer, num_collection_passes=30)

This regularizer can affect only special parent levels phi batches. It means that if you add HierarchySparsingRegularizer to usual, not hierarchy level model, regularizer will have no effect. The same with regular batches' theta, it will not be affected by this regularizer.

4. Hierarchy structure quality measures¶

You can use all BigARTM scores to assess separate level models. Also there are some measures that describe hierarchy structure. They can be easily computed using numpy so they are not implemented in BigARTM.

Support¶

Usually it is needed to set psi threshold that is min value of $p(t | a)$ so that link a-t will be included to topic graph. But with high threshold some topics will have no parents. We define support as maximum avialable threshold for Psi matrix with which all topics will have at least one ancestor.

In [21]:

psi = level1.get_psi()

In [22]:

print "Psi support:", psi.values.max(axis=1).min()

Psi support: 0.0324899

Mean parents count¶

In BigARTM hierarchy is defined as multilevel topic graph rathjer than topic tree. So it is reasonable to evaluate mean count of parents.

In [23]:

psi_threshold = 0.01
parent_counts = np.zeros(0)
for level_idx in range(1, hier.num_levels):
    psi = hier.get_level(level_idx).get_psi().values
    parent_counts = np.hstack((parent_counts, (psi > psi_threshold).sum(axis=1)))
print "Mean parents count:", parent_counts.mean()

Mean parents count: 1.15

Top tokens relations¶

We can construct parent-children relations with $p(parent|child)$ matrix, but psi matrix gives us only $p(child|parent)$. We can get $p(parent|child)$ this way:

In [24]:

batch = artm.messages.Batch()
batch_name = 'phi1.batch'

with open(batch_name, "rb") as f:
    batch.ParseFromString(f.read())
    
Ntw = np.zeros(len(level0.topic_names))
    
for i,item in enumerate(batch.item):
    for (token_id, token_weight) in zip(item.field[0].token_id, item.field[0].token_weight):
        Ntw[i] += token_weight

Nt1t0 = np.array(psi) * Ntw
psi_bayes = (Nt1t0 / Nt1t0.sum(axis=1)[:, np.newaxis]).T

In [25]:

indexes_child = np.argmax(psi_bayes, axis=0)

Look for the topic_4 and its children:

In [26]:

topic_parent_name = 'topic_4'
print(topic_parent_name + ':')
print(" ".join(level0.score_tracker['TopTokensScore'].last_tokens[topic_parent_name]))
print('')

for child in np.where(indexes_child == i)[0]:
    print('    ' + level1.topic_names[child] + ': ')
    print(" ".join(level1.score_tracker['TopTokensScore'].last_tokens[level1.topic_names[child]]))
    print('')

topic_4:
window db server motif application widget sun display list mit code manager ms running user screen lib run cs subject

    child_topic_4: 
ftp pub graphics ca software contact package version email fax comp send cs uk anonymous ac unix computer university address

    child_topic_7: 
ground wire current circuit wiring neutral voltage box electrical amp high signal connected line cable test audio electronics subject usually

    child_topic_12: 
image files info jpeg software gif color format looking version images faq hi graphics send advance email post quality keyboard