This notebook walks through a simple pipeline from a raw text collection to its interactive visualization: BigARTM is used to fit a topic model, and tm-navigator to visualize it.
Currently the navigator runs on a remote server, so you need ssh access to it over the internet; there is also a Dockerfile available, so the navigator can be deployed virtually anywhere.
TMNAV_SERVER='root@ks.plav.in'
TMNAV_PORT=22223 # or 21, for those who have trouble accessing port 22223
TMNAV_PATH='/root/tm_navigator/'
Run these commands with default options in your shell once to allow passwordless ssh to the server:
ssh-keygen
ssh-copy-id -p {TMNAV_PORT} -i ~/.ssh/id_rsa.pub {TMNAV_SERVER}
Now you can easily connect to the server and view the models in your browser: run
ssh -p {TMNAV_PORT} -L 5000:localhost:5000 {TMNAV_SERVER}
in your terminal, and open http://localhost:5000 in a browser. This page lists all previously uploaded datasets and models.
First we need to get the collection in bag-of-words format. In this example, articles from the MMRO conference (in Russian) are used:
!mkdir mmro
!rm -rf mmro/*
mkdir: cannot create directory ‘mmro’: File exists
%cd mmro
/root/work/tm_navigator/dev/mmro
!wget https://s3-eu-west-1.amazonaws.com/artm/vocab.mmro.txt
!wget https://s3-eu-west-1.amazonaws.com/artm/docword.mmro.txt.7z
!7zr e docword.mmro.txt.7z
--2015-11-18 11:42:34--  https://s3-eu-west-1.amazonaws.com/artm/vocab.mmro.txt
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 54.231.133.140
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|54.231.133.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 155766 (152K) [text/plain]
Saving to: ‘vocab.mmro.txt’

vocab.mmro.txt        100%[=====================>] 152.12K  --.-KB/s   in 0.06s

2015-11-18 11:42:34 (2.43 MB/s) - ‘vocab.mmro.txt’ saved [155766/155766]

--2015-11-18 11:42:34--  https://s3-eu-west-1.amazonaws.com/artm/docword.mmro.txt.7z
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 54.231.133.132
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|54.231.133.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 490147 (479K) [application/octet-stream]
Saving to: ‘docword.mmro.txt.7z’

docword.mmro.txt.7z   100%[=====================>] 478.66K  --.-KB/s   in 0.1s

2015-11-18 11:42:34 (4.27 MB/s) - ‘docword.mmro.txt.7z’ saved [490147/490147]

7-Zip (A) [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=C.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)

Processing archive: docword.mmro.txt.7z

Extracting  docword.mmro.txt

Everything is Ok

Size:       3248571
Compressed: 490147
Install a convenience wrapper for writing CSVs:
!pip install csvwriter
from csvwriter import CsvWriter
import itertools  # used later for generating document-contents ids
import numpy as np
from glob import glob
This collection has to be added to tm-navigator. With only the data available in this case, just the simplest visualization can be built, without even document titles or authors.
tm-navigator's native input format is a set of CSV files, each corresponding to a database table. Minimally, a dataset (text collection) is described by the modalities.csv, documents.csv, terms.csv, and document_terms.csv tables, created below:
# list of all modalities
with CsvWriter(open('modalities.csv', 'w')) as out:
out << [dict(id=1, name='words')] # this one is required
# read the ndw counts
with open('docword.mmro.txt') as f:
D = int(f.readline())
W = int(f.readline())
n = int(f.readline())
ndw_s = [map(int, line.split()) for line in f.readlines()]
ndw_s = [(d - 1, w - 1, cnt) for d, w, cnt in ndw_s] # use 0-based indexing
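As a quick sanity check of the parsing (an illustrative aside: in the UCI bag-of-words layout, the three header lines declare the number of documents D, the vocabulary size W, and the number of nonzero entries n), the parsed triples can be validated against the header:

# the parsed triples should agree with the header counts
assert len(ndw_s) == n
assert all(0 <= d < D and 0 <= w < W and cnt > 0 for d, w, cnt in ndw_s)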
# all the documents data
with CsvWriter(open('documents.csv', 'w')) as out:
out << (
dict(id=d,
title='Document #{}'.format(d),
             slug='document-{}'.format(d),  # any unique string identifying the document; appears in short lists and URLs
file_name='.../{}'.format(d), # if applicable, a relative filename of the document
# source='MMRO', # optional, is displayed as-is, e.g. conference name with year
# html=..., # optional, the full HTML content of the document
)
for d in range(D)
)
# terms (in this case, words only)
with open('vocab.mmro.txt') as f, \
CsvWriter(open('terms.csv', 'w')) as out:
out << (
dict(id=i,
modality_id=1, # matches the id in modalities table
text=line.strip()
)
for i, line in enumerate(f)
)
# occurrences of terms in documents
with CsvWriter(open('document_terms.csv', 'w')) as out:
out << (
dict(document_id=d,
modality_id=1,
term_id=w,
count=cnt)
for d, w, cnt in ndw_s
)
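As a quick cross-check (illustrative), the totals written here should match what the describe command of db_manage.py reports later:

# totals to compare against the output of 'db_manage.py describe'
print('{} documents, {} terms, {} occurrences'.format(
    D, W, sum(cnt for _, _, cnt in ndw_s)))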
So, the required CSVs are ready to be loaded into tm-navigator. Upload them to the server:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'mkdir {TMNAV_PATH}data_mmro'
for csv in glob('*.csv'):
!scp -P {TMNAV_PORT} {csv} {TMNAV_SERVER}:{TMNAV_PATH}data_mmro/{csv}
documents.csv                                 100%   41KB  41.3KB/s   00:00
modalities.csv                                100%   18     0.0KB/s   00:00
document_terms.csv                            100% 4091KB   4.0MB/s   00:00
terms.csv                                     100%  212KB 212.0KB/s   00:00
Now the files are on the server, and we need to load them into the database. All database interactions are done with the db_manage.py script, which provides several commands. A list of parameters for each command can be obtained by adding --help:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py'
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py load_dataset --help'
Usage: db_manage.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  add_dataset
  add_topicmodel
  describe
  load_dataset
  load_topicmodel

Usage: db_manage.py load_dataset [OPTIONS]

Options:
  -d, --dataset-id INTEGER     [required]
  -t, --title TEXT
  -dir, --directory DIRECTORY  [required]
  --help                       Show this message and exit.
First we add a new dataset and note the assigned id - it will be used later:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py add_dataset'
Added Dataset #1
And now load the data from the CSV files (note the dataset-id: it is the number reported by the previous command):
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && yes | ./db_manage.py load_dataset --dataset-id 1 --title "Simplest MMRO dataset" -dir data_mmro'
Found files "document_terms.csv", "documents.csv", "modalities.csv", "terms.csv". Not found files "document_contents.csv". Will try to continue with the files present. Proceeding will overwrite the corresponding data in the database. Continue? [Y/n]: Deleting data Deleting data Deleting data Deleting data Deleting data Loading data Loading data Loading data Loading data Loading data
You can check that the dataset was loaded into the DB using the describe command:
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py describe'
- Dataset #1: Simplest MMRO dataset, 0 models
    Documents: 1061
    Terms: 7805 words with 314081 occurrences
Or just go to the front page at http://localhost:5000 - the last dataset there should be your newly added one.
Next we build a simple ARTM model of this collection using BigARTM. Of course, you can use other tools for this, if you want.
import artm
batch_vectorizer = artm.BatchVectorizer(data_path='', data_format='bow_uci', collection_name='mmro', target_folder='.')
model_artm = artm.ARTM(num_topics=15,
scores=[artm.PerplexityScore(name='PerplexityScore',
use_unigram_document_model=False,
dictionary_name='dictionary')],
regularizers=[artm.SmoothSparseThetaRegularizer(name='SparseTheta', tau=-0.15)])
model_artm.load_dictionary(dictionary_name='dictionary', dictionary_path='dictionary')
model_artm.initialize(dictionary_name='dictionary')
model_artm.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=15, num_document_passes=1)
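To check that the fitting has converged, the collected perplexity values can be inspected (a hedged aside: the score_tracker accessor below follows the BigARTM Python API of this period and may differ in other versions):

# perplexity after each collection pass; accessor name assumed from the BigARTM API of this era
print(model_artm.score_tracker['PerplexityScore'].value)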
In the simplest possible case, the model is completely described by its matrices $\Phi$ and $\Theta$:
phi = model_artm.get_phi()
theta = model_artm.fit_transform()
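Here $\phi_{wt} = p(w \mid t)$ and $\theta_{td} = p(t \mid d)$, so the model approximates each document's term distribution as $p(w \mid d) = \sum_t \phi_{wt} \theta_{td}$.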
Some other required probabilities are naively computed below; you can use other ways to calculate them.
pwt = phi.as_matrix()               # p(w|t), shape (W, T)
ptd = theta.as_matrix()             # p(t|d), shape (T, D)
pd = 1.0 / theta.shape[1]           # p(d): uniform over the D documents
pt = (ptd * pd).sum(1)              # p(t) = sum_d p(t|d) p(d)
pw = (pwt * pt).sum(1)              # p(w) = sum_t p(w|t) p(t)
ptw = pwt * pt / pw[:, np.newaxis]  # p(t|w), by Bayes' rule
pdt = ptd * pd / pt[:, np.newaxis]  # p(d|t), by Bayes' rule
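A quick sanity check (an illustrative aside, assuming the model matrices are properly normalized; np.allclose tolerates numerical noise): each conditional distribution should sum to one over its first argument.

# each conditional distribution sums to 1 over its first argument
assert np.allclose(pwt.sum(0), 1)  # p(w|t): over words, per topic
assert np.allclose(ptd.sum(0), 1)  # p(t|d): over topics, per document
assert np.allclose(ptw.sum(1), 1)  # p(t|w): over topics, per word
assert np.allclose(pdt.sum(1), 1)  # p(d|t): over documents, per topic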
After the model is built, it has to be converted to CSV files, just as the dataset was. The minimal required set of files is topics.csv, topic_terms.csv, document_topics.csv, and topic_edges.csv, created below:
# the model topics
with CsvWriter(open('topics.csv', 'w')) as out:
out << [dict(id=0,
level=0,
id_in_level=0,
is_background=False,
probability=1)] # the single zero-level topic with id=0 is required
out << (dict(id=1 + t, # any unique ids
level=1, # for a flat non-hierarchical model just leave 1 here
id_in_level=t,
is_background=False, # if you have background topics, they should have True here
probability=p)
for t, p in enumerate(pt))
# probabilities of terms in topics
with CsvWriter(open('topic_terms.csv', 'w')) as out:
out << (dict(topic_id=1 + t, # same ids as above
modality_id=1,
term_id=w,
prob_wt=pwt[w, t],
prob_tw=ptw[w, t])
for w, t in zip(*np.nonzero(pwt)))
# probabilities of topics in documents
with CsvWriter(open('document_topics.csv', 'w')) as out:
out << (dict(topic_id=1 + t, # same ids as above
document_id=d,
prob_td=ptd[t, d],
prob_dt=pdt[t, d])
for t, d in zip(*np.nonzero(ptd)))
# graph of topics, mostly useful for hierarchical topic models
# the navigator assumes that all topics are reachable by edges from the root topic #0
with CsvWriter(open('topic_edges.csv', 'w')) as out:
out << (dict(parent_id=0,
child_id=1 + t,
probability=p)
for t, p in enumerate(pt))
Now, same as with the dataset before, upload the CSV files:
for csv in glob('*.csv'):
!scp -P {TMNAV_PORT} {csv} {TMNAV_SERVER}:{TMNAV_PATH}data_mmro/{csv}
topic_terms.csv                               100% 3492KB   3.4MB/s   00:00
documents.csv                                 100%   41KB  41.3KB/s   00:00
modalities.csv                                100%   18     0.0KB/s   00:00
topic_edges.csv                               100%  260     0.3KB/s   00:00
document_terms.csv                            100% 4091KB   4.0MB/s   00:00
terms.csv                                     100%  212KB 212.0KB/s   00:00
topics.csv                                    100%  416     0.4KB/s   00:00
document_topics.csv                           100%  420KB 420.4KB/s   00:00
Create a new topic model (same dataset-id as in the commands above):
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && ./db_manage.py add_topicmodel --dataset-id 1'
Added Topic Model #1 for Dataset #1
And load the CSVs into the database (note the topicmodel-id here):
!ssh -p {TMNAV_PORT} {TMNAV_SERVER} 'cd {TMNAV_PATH} && yes | ./db_manage.py load_topicmodel --topicmodel-id 1 --title "Simplest model" -dir data_mmro'
Found files "document_topics.csv", "topic_edges.csv", "topic_terms.csv", "topics.csv". Not found files "document_content_topics.csv", "document_similarities.csv", "term_similarities.csv", "topic_similarities.csv". Will try to continue with the files present. Proceeding will overwrite the corresponding data in the database. Continue? [Y/n]: Deleting data Deleting data Deleting data Deleting data Deleting data Loading data Loading data Loading data Loading data Loading data
That's basically all - now visit http://localhost:5000 and follow the link to browse your model! The link should look like http://1.localhost:5000, just with another number at the beginning. If you don't like the debug panel at the side, just click Hide there once.
If you build several topic models (any number of them) for the same collection, you don't have to add a new dataset each time - just run add_topicmodel with the same dataset id.
If that's not enough and you need other features, you can feed the navigator with more data. Some useful cases, with descriptions of how to achieve them:
To display document authors: they are just another modality (alongside words in the given example), so add an authors modality to modalities.csv, the corresponding terms (the individual authors) to terms.csv, and their relation to documents to document_terms.csv, as sketched below.
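A minimal sketch of the extended modalities table (the new id and name are illustrative; the authors' rows in terms.csv and document_terms.csv must reference the same modality_id):

# illustrative: modalities.csv extended with an authors modality
with CsvWriter(open('modalities.csv', 'w')) as out:
    out << [dict(id=1, name='words'),
            dict(id=2, name='authors')]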
Of course, authors can be used in the topic models as well.

To highlight term occurrences inside the documents' HTML content, provide document_contents.csv for the dataset and document_content_topics.csv for the topic model:

# this is for dataset
with CsvWriter(open('document_contents.csv', 'w')) as out:
id_cnt = itertools.count() # it's just one way to generate ids so that they match in both cases
out << (dict(id=next(id_cnt), # must correspond to the ids in document_content_topics below
document_id=d,
modality_id=1, # 1 for words
term_id=w,
start_pos=s, end_pos=e # the start and end positions in the HTML content
)
for d, w, s, e in ...)
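The (d, w, s, e) tuples must come from your own preprocessing of the documents. As a rough illustrative sketch (a naive regex tokenizer over each document's HTML content; find_occurrences and html_by_doc are hypothetical names, and real markup-aware tokenization will be more careful):

import re

def find_occurrences(html_by_doc, vocab):
    # hypothetical helper: yield (document, term, start, end) tuples
    index = {word: i for i, word in enumerate(vocab)}
    for d, html in enumerate(html_by_doc):
        for m in re.finditer(r'\w+', html):
            w = index.get(m.group().lower())
            if w is not None:
                yield d, w, m.start(), m.end()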
# and this is for topicmodel
with CsvWriter(open('document_content_topics.csv', 'w')) as out:
    id_cnt = itertools.count()
out << (dict(document_content_id=next(id_cnt), # same ids as above
topic_id=1 + t # the top topic id, determines the color
)
for d, t in ...)
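To show similar documents, topics, or terms in the navigator, provide the corresponding similarity tables. In each snippet below, distances is assumed to be a precomputed square matrix of pairwise distances between the corresponding entities (documents, topics, or terms, respectively). As one illustrative way to obtain it for documents (cosine distances between topic distributions, using the ptd matrix from above; for nonnegative vectors the cosine distance lies in [0, 1], so 1 - distance is a valid similarity):

from scipy.spatial.distance import pdist, squareform

# illustrative: pairwise cosine distances between the documents' topic profiles
distances = squareform(pdist(ptd.T, metric='cosine'))  # shape (D, D), 0 means identical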
with CsvWriter(open('document_similarities.csv', 'w')) as out:
    out << (dict(a_id=i,  # first document id
                 b_id=sim_i,  # second document id
                 similarity=1 - row[sim_i],  # similarity in [0, 1]; converted here from a distance
                 similarity_type='Topics'  # free-form short name of this similarity type; common choices are Topics and Words
                 )
            for i, row in enumerate(distances)  # the precomputed distance matrix
            # tip: don't write all n^2 entries to the CSV, to avoid bloating it;
            # here we keep the 30 most similar entities for each row
            for sim_i in row.argsort()[:31]
            if sim_i != i)
with CsvWriter(open('topic_similarities.csv', 'w')) as out:
    out << (dict(a_id=1 + i,
                 b_id=1 + sim_i,
                 similarity=1 - row[sim_i],  # converted from a distance, as above
                 similarity_type='Words')
            for i, row in enumerate(distances)
            for sim_i in row.argsort()[:]  # all topics; if you have hundreds or more, limit to the first 50 or so
            if sim_i != i)
with CsvWriter(open('term_similarities.csv', 'w')) as out:
    out << (dict(a_modality_id=1,
                 a_id=i,
                 b_modality_id=1,
                 b_id=sim_i,
                 similarity=1 - row[sim_i],  # converted from a distance, as above
                 similarity_type='Topics')
            for i, row in enumerate(distances)
            for sim_i in row.argsort()[:21]  # the 20 most similar terms
            if sim_i != i)
Remember to upload the CSVs after each change, and load them into the database using load_dataset or load_topicmodel! Use the same dataset or model id as before if you want to overwrite that dataset or model; otherwise, add a new one with add_* first.
Datasets which already have topic models can be changed only by adding data, like document_contents.csv, not by removing anything. The database will raise an error if you try to do anything that would make the data inconsistent.
No permissions system is currently implemented (nor is one planned), so every user on the same server can view and modify all the data. Please treat the server as disposable storage and keep any data you cannot easily regenerate on your own computer. If you want to use the assessment features of the navigator, please contact me beforehand to make sure you don't lose the responses!
This tutorial uses a directory named data_mmro on the server; please replace it with something unique within our group, for convenience.