This notebook walks through a simple pipeline from a raw text collection to its interactive visualization: BigARTM is used to fit a topic model, and tm-navigator to visualize it.
Currently the navigator runs on a remote server, so you need ssh access to it over the internet; there is also a Dockerfile available, so the navigator can be deployed virtually anywhere.
TMNAV_SERVER='root@ks.plav.in'
TMNAV_PORT=22223 # or 21, for those who have trouble accessing port 22223
TMNAV_PATH='/root/tm_navigator/'
Run these commands with default options in your shell once to allow passwordless ssh to the server:
ssh-keygen
ssh-copy-id -p {TMNAV_PORT} -i ~/.ssh/id_rsa.pub {TMNAV_SERVER}
Now you can easily connect to the server and view the models in your browser: run
ssh -p {TMNAV_PORT} -L 5000:localhost:5000 {TMNAV_SERVER}
in your terminal, and open http://localhost:5000 in a browser. This page lists all previously uploaded datasets and models.
First we need to get the collection in bag-of-words format. In this example, articles from the MMRO conference (in Russian) are used:
!mkdir mmro
!rm -rf mmro/*
mkdir: cannot create directory ‘mmro’: File exists
%cd mmro
/root/work/tm_navigator/dev/mmro
!wget https://s3-eu-west-1.amazonaws.com/artm/vocab.mmro.txt
!wget https://s3-eu-west-1.amazonaws.com/artm/docword.mmro.txt.7z
!7zr e docword.mmro.txt.7z
--2015-11-18 11:42:34--  https://s3-eu-west-1.amazonaws.com/artm/vocab.mmro.txt
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 54.231.133.140
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|54.231.133.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 155766 (152K) [text/plain]
Saving to: ‘vocab.mmro.txt’

vocab.mmro.txt        100%[=====================>] 152.12K  --.-KB/s   in 0.06s

2015-11-18 11:42:34 (2.43 MB/s) - ‘vocab.mmro.txt’ saved [155766/155766]

--2015-11-18 11:42:34--  https://s3-eu-west-1.amazonaws.com/artm/docword.mmro.txt.7z
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 54.231.133.132
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|54.231.133.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 490147 (479K) [application/octet-stream]
Saving to: ‘docword.mmro.txt.7z’

docword.mmro.txt.7z   100%[=====================>] 478.66K  --.-KB/s   in 0.1s

2015-11-18 11:42:34 (4.27 MB/s) - ‘docword.mmro.txt.7z’ saved [490147/490147]

7-Zip (A) [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=C.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)

Processing archive: docword.mmro.txt.7z

Extracting  docword.mmro.txt

Everything is Ok

Size:       3248571
Compressed: 490147
Install a convenience wrapper for writing CSVs:
!pip install csvwriter
from csvwriter import CsvWriter
import itertools  # used later for generating document-contents ids
import numpy as np
from glob import glob
This collection has to be added to tm-navigator. With only the data available in this case, just the simplest visualization can be built, without even document titles or authors.
tm-navigator's native input format is a set of CSV files, each corresponding to a database table. Minimally, a dataset (text collection) is described by the modalities.csv, documents.csv, terms.csv, and document_terms.csv tables, created below:
# list of all modalities
with CsvWriter(open('modalities.csv', 'w')) as out:
out << [dict(id=1, name='words')] # this one is required
# read the ndw counts
with open('docword.mmro.txt') as f:
D = int(f.readline())
W = int(f.readline())
n = int(f.readline())
ndw_s = [map(int, line.split()) for line in f.readlines()]
ndw_s = [(d - 1, w - 1, cnt) for d, w, cnt in ndw_s] # use 0-based indexing
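As a quick sanity check of the parsing (an illustrative aside: in the UCI bag-of-words layout, the three header lines declare the number of documents D, the vocabulary size W, and the number of nonzero entries n), the parsed triples can be validated against the header:

# the parsed triples should agree with the header counts
assert len(ndw_s) == n
assert all(0 <= d < D and 0 <= w < W and cnt > 0 for d, w, cnt in ndw_s)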
# all the documents data
with CsvWriter(open('documents.csv', 'w')) as out:
out << (
dict(id=d,
title='Document #{}'.format(d),
             slug='document-{}'.format(d),  # any unique string identifying the document; appears in short lists and URLs
file_name='.../{}'.format(d), # if applicable, a relative filename of the document
# source='MMRO', # optional, is displayed as-is, e.g. conference name with year
# html=..., # optional, the full HTML content of the document
)
for d in range(D)
)
# terms (in this case, words only)
with open('vocab.mmro.txt') as f, \
CsvWriter(open('terms.csv', 'w')) as out:
out << (
dict(id=i,
modality_id=1, # matches the id in modalities table
text=line.strip()
)
for i, line in enumerate(f)
)
# occurrences of terms in documents
with CsvWriter(open('document_terms.csv', 'w')) as out:
out << (
dict(document_id=d,
modality_id=1,
term_id=w,
count=cnt)
for d, w, cnt in ndw_s
)
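As a quick cross-check (illustrative), the totals written here should match what the describe command of db_manage.py reports later:

# totals to compare against the output of 'db_manage.py describe'
print('{} documents, {} terms, {} occurrences'.format(
    D, W, sum(cnt for _, _, cnt in ndw_s)))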
So, the required CSVs are ready to be loaded into tm-navigator. Upload them to the server:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'mkdir {TMNAV_PATH}data_mmro'
for csv in glob('*.csv'):
!scp -P {TMNAV_PORT} {csv} {TMNAV_SERVER}:{TMNAV_PATH}data_mmro/{csv}
documents.csv                                 100%   41KB  41.3KB/s   00:00
modalities.csv                                100%   18     0.0KB/s   00:00
document_terms.csv                            100% 4091KB   4.0MB/s   00:00
terms.csv                                     100%  212KB 212.0KB/s   00:00
Now the files are on the server, and we need to load them into the database. All database interactions are done with the db_manage.py script, which provides several commands. A list of parameters for each command can be obtained by adding --help:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py'
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py load_dataset --help'
Usage: db_manage.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  add_dataset
  add_topicmodel
  describe
  load_dataset
  load_topicmodel

Usage: db_manage.py load_dataset [OPTIONS]

Options:
  -d, --dataset-id INTEGER     [required]
  -t, --title TEXT
  -dir, --directory DIRECTORY  [required]
  --help                       Show this message and exit.
First we add a new dataset and note the assigned id - it will be used later:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py add_dataset'
Added Dataset #1
And now load the data from the CSV files (note the dataset-id: it is the number reported by the previous command):
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && yes | ./db_manage.py load_dataset --dataset-id 1 --title "Simplest MMRO dataset" -dir data_mmro'
Found files "document_terms.csv", "documents.csv", "modalities.csv", "terms.csv". Not found files "document_contents.csv". Will try to continue with the files present. Proceeding will overwrite the corresponding data in the database. Continue? [Y/n]: Deleting data Deleting data Deleting data Deleting data Deleting data Loading data Loading data Loading data Loading data Loading data
You can check that the dataset was loaded into the DB using the describe command:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py describe'
- Dataset #1: Simplest MMRO dataset, 0 models
    Documents: 1061
    Terms: 7805 words with 314081 occurrences
Or just go to the front page at http://localhost:5000 - the last dataset there should be your newly added one.
Next we build a simple ARTM model of this collection using BigARTM. Of course, you can use other tools for this, if you want.
import artm
batch_vectorizer = artm.BatchVectorizer(data_path='', data_format='bow_uci', collection_name='mmro', target_folder='.')
model_artm = artm.ARTM(num_topics=15,
scores=[artm.PerplexityScore(name='PerplexityScore',
use_unigram_document_model=False,
dictionary_name='dictionary')],
regularizers=[artm.SmoothSparseThetaRegularizer(name='SparseTheta', tau=-0.15)])
model_artm.load_dictionary(dictionary_name='dictionary', dictionary_path='dictionary')
model_artm.initialize(dictionary_name='dictionary')
model_artm.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=15, num_document_passes=1)
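To check that the fitting has converged, the collected perplexity values can be inspected (a hedged aside: the score_tracker accessor below follows the BigARTM Python API of this period and may differ in other versions):

# perplexity after each collection pass; accessor name assumed from the BigARTM API of this era
print(model_artm.score_tracker['PerplexityScore'].value)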
In the simplest possible case, the model is completely described by its matrices $\Phi$ and $\Theta$:
phi = model_artm.get_phi()
theta = model_artm.fit_transform()
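Here $\phi_{wt} = p(w \mid t)$ and $\theta_{td} = p(t \mid d)$, so the model approximates each document's term distribution as $p(w \mid d) = \sum_t \phi_{wt} \theta_{td}$.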
Some other required probabilities are naively computed below; you can use other ways to calculate them.
pwt = phi.as_matrix()               # p(w|t), shape (W, T)
ptd = theta.as_matrix()             # p(t|d), shape (T, D)
pd = 1.0 / theta.shape[1]           # p(d): uniform over the D documents
pt = (ptd * pd).sum(1)              # p(t) = sum_d p(t|d) p(d)
pw = (pwt * pt).sum(1)              # p(w) = sum_t p(w|t) p(t)
ptw = pwt * pt / pw[:, np.newaxis]  # p(t|w), by Bayes' rule
pdt = ptd * pd / pt[:, np.newaxis]  # p(d|t), by Bayes' rule
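A quick sanity check (an illustrative aside, assuming the model matrices are properly normalized; np.allclose tolerates numerical noise): each conditional distribution should sum to one over its first argument.

# each conditional distribution sums to 1 over its first argument
assert np.allclose(pwt.sum(0), 1)  # p(w|t): over words, per topic
assert np.allclose(ptd.sum(0), 1)  # p(t|d): over topics, per document
assert np.allclose(ptw.sum(1), 1)  # p(t|w): over topics, per word
assert np.allclose(pdt.sum(1), 1)  # p(d|t): over documents, per topic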
After the model is built, it has to be converted to CSV files, just as the dataset was. The minimal required set of files is topics.csv, topic_terms.csv, document_topics.csv, and topic_edges.csv, created below:
# the model topics
with CsvWriter(open('topics.csv', 'w')) as out:
out << [dict(id=0,
level=0,
id_in_level=0,
is_background=False,
probability=1)] # the single zero-level topic with id=0 is required
out << (dict(id=1 + t, # any unique ids
level=1, # for a flat non-hierarchical model just leave 1 here
id_in_level=t,
is_background=False, # if you have background topics, they should have True here
probability=p)
for t, p in enumerate(pt))
# probabilities of terms in topics
with CsvWriter(open('topic_terms.csv', 'w')) as out:
out << (dict(topic_id=1 + t, # same ids as above
modality_id=1,
term_id=w,
prob_wt=pwt[w, t],
prob_tw=ptw[w, t])
for w, t in zip(*np.nonzero(pwt)))
# probabilities of topics in documents
with CsvWriter(open('document_topics.csv', 'w')) as out:
out << (dict(topic_id=1 + t, # same ids as above
document_id=d,
prob_td=ptd[t, d],
prob_dt=pdt[t, d])
for t, d in zip(*np.nonzero(ptd)))
# graph of topics, mostly useful for hierarchical topic models
# the navigator assumes that all topics are reachable by edges from the root topic #0
with CsvWriter(open('topic_edges.csv', 'w')) as out:
out << (dict(parent_id=0,
child_id=1 + t,
probability=p)
for t, p in enumerate(pt))
Now, same as with the dataset before, upload the CSV files:
for csv in glob('*.csv'):
!scp -P {TMNAV_PORT} {csv} {TMNAV_SERVER}:{TMNAV_PATH}data_mmro/{csv}
topic_terms.csv                               100% 3492KB   3.4MB/s   00:00
documents.csv                                 100%   41KB  41.3KB/s   00:00
modalities.csv                                100%   18     0.0KB/s   00:00
topic_edges.csv                               100%  260     0.3KB/s   00:00
document_terms.csv                            100% 4091KB   4.0MB/s   00:00
terms.csv                                     100%  212KB 212.0KB/s   00:00
topics.csv                                    100%  416     0.4KB/s   00:00
document_topics.csv                           100%  420KB 420.4KB/s   00:00
Create a new topic model (same dataset-id as in the commands above):
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py add_topicmodel --dataset-id 1'
Added Topic Model #1 for Dataset #1
And load the CSVs into the database (note the topicmodel-id here):
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && yes | ./db_manage.py load_topicmodel --topicmodel-id 1 --title "Simplest model" -dir data_mmro'
Found files "document_topics.csv", "topic_edges.csv", "topic_terms.csv", "topics.csv". Not found files "document_content_topics.csv", "document_similarities.csv", "term_similarities.csv", "topic_similarities.csv". Will try to continue with the files present. Proceeding will overwrite the corresponding data in the database. Continue? [Y/n]: Deleting data Deleting data Deleting data Deleting data Deleting data Loading data Loading data Loading data Loading data Loading data
That's basically all - now visit http://localhost:5000 and follow the link to browse your model! The link should look like http://1.localhost:5000, just with another number at the beginning. If you don't like the debug panel at the side, just click Hide there once.
If you build several topic models (any number of them) for the same collection, you don't have to add a new dataset each time - just run add_topicmodel with the same dataset id.
If that's not enough and you need other features, you can feed the navigator with more data. Some useful cases, with descriptions of how to achieve them:
To display document authors: they are just another modality (alongside words in the given example), so add an authors modality to modalities.csv, the corresponding terms (the individual authors) to terms.csv, and their relation to documents to document_terms.csv, as sketched below.
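A minimal sketch of the extended modalities table (the new id and name are illustrative; the authors' rows in terms.csv and document_terms.csv must reference the same modality_id):

# illustrative: modalities.csv extended with an authors modality
with CsvWriter(open('modalities.csv', 'w')) as out:
    out << [dict(id=1, name='words'),
            dict(id=2, name='authors')]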
Of course, authors can be used in the topic models as well.

To highlight term occurrences inside the documents' HTML content, provide document_contents.csv for the dataset and document_content_topics.csv for the topic model:

# this is for dataset
with CsvWriter(open('document_contents.csv', 'w')) as out:
id_cnt = itertools.count() # it's just one way to generate ids so that they match in both cases
out << (dict(id=next(id_cnt), # must correspond to the ids in document_content_topics below
document_id=d,
modality_id=1, # 1 for words
term_id=w,
start_pos=s, end_pos=e # the start and end positions in the HTML content
)
for d, w, s, e in ...)
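The (d, w, s, e) tuples must come from your own preprocessing of the documents. As a rough illustrative sketch (a naive regex tokenizer over each document's HTML content; find_occurrences and html_by_doc are hypothetical names, and real markup-aware tokenization will be more careful):

import re

def find_occurrences(html_by_doc, vocab):
    # hypothetical helper: yield (document, term, start, end) tuples
    index = {word: i for i, word in enumerate(vocab)}
    for d, html in enumerate(html_by_doc):
        for m in re.finditer(r'\w+', html):
            w = index.get(m.group().lower())
            if w is not None:
                yield d, w, m.start(), m.end()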
# and this is for topicmodel
with CsvWriter(open('document_content_topics.csv', 'w')) as out:
    id_cnt = itertools.count()
out << (dict(document_content_id=next(id_cnt), # same ids as above
topic_id=1 + t # the top topic id, determines the color
)
for d, t in ...)
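To show similar documents, topics, or terms in the navigator, provide the corresponding similarity tables. In each snippet below, distances is assumed to be a precomputed square matrix of pairwise distances between the corresponding entities (documents, topics, or terms, respectively). As one illustrative way to obtain it for documents (cosine distances between topic distributions, using the ptd matrix from above; for nonnegative vectors the cosine distance lies in [0, 1], so 1 - distance is a valid similarity):

from scipy.spatial.distance import pdist, squareform

# illustrative: pairwise cosine distances between the documents' topic profiles
distances = squareform(pdist(ptd.T, metric='cosine'))  # shape (D, D), 0 means identical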
with CsvWriter(open('document_similarities.csv', 'w')) as out:
    out << (dict(a_id=i,  # first document id
                 b_id=sim_i,  # second document id
                 similarity=1 - row[sim_i],  # similarity in [0, 1]; converted here from a distance
                 similarity_type='Topics'  # free-form short name of this similarity type; common choices are Topics and Words
                 )
            for i, row in enumerate(distances)  # the precomputed distance matrix
            # tip: don't write all n^2 entries to the CSV, to avoid bloating it;
            # here we keep the 30 most similar entities for each row
            for sim_i in row.argsort()[:31]
            if sim_i != i)
with CsvWriter(open('topic_similarities.csv', 'w')) as out:
    out << (dict(a_id=1 + i,
                 b_id=1 + sim_i,
                 similarity=1 - row[sim_i],  # converted from a distance, as above
                 similarity_type='Words')
            for i, row in enumerate(distances)
            for sim_i in row.argsort()[:]  # all topics; if you have hundreds or more, limit to the first 50 or so
            if sim_i != i)
with CsvWriter(open('term_similarities.csv', 'w')) as out:
    out << (dict(a_modality_id=1,
                 a_id=i,
                 b_modality_id=1,
                 b_id=sim_i,
                 similarity=1 - row[sim_i],  # converted from a distance, as above
                 similarity_type='Topics')
            for i, row in enumerate(distances)
            for sim_i in row.argsort()[:21]  # the 20 most similar terms
            if sim_i != i)
Remember to upload the CSVs after each change, and load them into the database using load_dataset or load_topicmodel! Use the same dataset or model id as before if you want to overwrite that dataset or model; otherwise, add a new one with add_* first.
Datasets which already have topic models can be changed only by adding data, like document_contents.csv, not by removing anything. The database will raise an error if you try to do anything that would make the data inconsistent.
No permissions system is currently implemented (nor is one planned), so every user on the same server can view and modify all the data. Please treat the server as disposable storage and keep any data you cannot easily regenerate on your own computer. If you want to use the assessment features of the navigator, please contact me beforehand to make sure you don't lose the responses!
This tutorial uses a directory named data_mmro on the server; please replace it with something unique within our group, for convenience.