In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', -1)
In [2]:
import ktrain
ktrain.__version__
Using TensorFlow backend.
using Keras version: 2.2.4
Out[2]:
'0.6.0'

STEP 1: Get Raw Document Data

In [3]:
# 20newsgroups
from sklearn.datasets import fetch_20newsgroups

# we only want to keep the body of the documents!
remove = ('headers', 'footers', 'quotes')

# fetch train and test data
newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)

# compile the texts
texts = newsgroups_train.data + newsgroups_test.data

# let's also store the newsgroup category associated with each document
# we can display this information in visualizations
targets = list(newsgroups_train.target) + list(newsgroups_test.target)
categories = [newsgroups_train.target_names[target] for target in targets]

STEP 2: Train an LDA Topic Model to Discover Topics

The get_topic_model function learns a topic model using Latent Dirichlet Allocation (LDA).

In [4]:
%%time
tm = ktrain.text.get_topic_model(texts, n_features=10000)
n_topics automatically set to 97
preprocessing texts...
fitting model...
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
done.
CPU times: user 16min 18s, sys: 42min 45s, total: 59min 3s
Wall time: 1min 58s
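
The log line "n_topics automatically set to 97" is consistent with a common rule of thumb that sets the number of topics to the square root of half the corpus size. The sketch below shows that arithmetic. It is an assumption about how such a default could be derived, not a statement of ktrain's internals, and estimate_n_topics is a hypothetical helper. The document counts are those of the two 20newsgroups subsets fetched above.

```python
import math

def estimate_n_topics(n_docs):
    """Rule-of-thumb topic count: floor(sqrt(n_docs / 2)), at least 1.

    Hypothetical helper for illustration; not part of ktrain's API.
    """
    return max(1, math.floor(math.sqrt(n_docs / 2)))

# 20newsgroups: 11,314 training documents + 7,532 test documents
n_docs = 11314 + 7532
print(estimate_n_topics(n_docs))  # 97
```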

We can examine the discovered topics using print_topics, get_topics, or topics. Here, we will use print_topics:

In [5]:
tm.print_topics()
topic 0 | tape adam tim case moved bag quote mass marked zionism
topic 1 | image jpeg images format programs tiff files jfif save lossless
topic 2 | alternative movie film static cycles films philips dynamic hou phi
topic 3 | hell humans poster frank reality kent gerard gant eternal bell
topic 4 | air phd chz kit cbc ups w-s rus w47 mot
topic 5 | dog math great figure poster couldn don trying rushdie fatwa
topic 6 | collaboration nazi fact end expression germany philly world certified moore
topic 7 | gif points scale postscript mirror plane rendering algorithm polygon rayshade
topic 8 | fonts font shell converted iii characters slight composite breaks compress
topic 9 | power station supply options option led light tank plastic wall
topic 10 | transmission rider bmw driver automatic shift gear japanese stick highway
topic 11 | tyre ezekiel ruler hernia appeared appointed supreme man land power
topic 12 | space nasa earth data launch surface solar moon mission planet
topic 13 | israel jews jewish israeli arab peace war arabs palestinian kuwait
topic 14 | olvwm xremote animals kinds roughing toolkit close corp glenn imakefile
topic 15 | medical health disease cancer patients drug treatment drugs aids study
topic 16 | biden chip gear like information number automatic mode insurance know
topic 17 | graphics zip amiga shareware formats ftp gif program sgi convert
topic 18 | brilliant mail did god coming christianity people got ideas reading
topic 19 | black red white blue green cross wires lines helmet mask
topic 20 | car engine cars miles clutch new ford rear slip road
topic 21 | list mailing service model small large lists radar available major
topic 22 | key encryption chip keys clipper phone security use government privacy
topic 23 | talking pit nyr stl phi edm mtl wsh hfd cgy
topic 24 | signal input switch connected circuit audio noise output control voltage
topic 25 | stuff deleted die posting beware fantastic motives authentic reluctant hope
topic 26 | adams douglas dc-x garrett ingres tin sdio incremental mcdonnell guide
topic 27 | men homosexual homosexuality women gay sexual homosexuals male kinsey pop
topic 28 | usual leo rs-232 martian reading cooperative unmanned somalia decompress visited
topic 29 | edu university information send new computer research mail internet address
topic 30 | reserve naval marine ret commission one-way irgun prior closure facilities
topic 31 | state intelligence militia units army zone georgia sam croats belongs
topic 32 | says article pain known warning doctor stone bug kidney response
topic 33 | faq rsa ripem lights yes patent nist management wax cipher
topic 34 | wolverine comics hulk appearance special liefeld sabretooth incredible hobgoblin x-force
topic 35 | software ram worth cycles controller available make dram dynamic situation
topic 36 | religion people religious catalog bobby used driven involved long like
topic 37 | intel sites experiment ftp does know family good like mrs
topic 38 | armenian people army russian turkish genocide armenians ottoman turks jews
topic 39 | theft geo available face couldn cover sony people number shop
topic 40 | christianity did exists mail matter mind tool status god reading
topic 41 | propane probe earth orbit orbiter titan cassini space atmosphere gravity
topic 42 | people government right think rights law make public fbi don
topic 43 | god people does say believe bible true think evidence religion
topic 44 | mov phone south key war supply push left just registered
topic 45 | period goal pts play chicago pittsburgh buffalo shots new blues
topic 46 | game team games year hockey season players player baseball league
topic 47 | speed dod student technician just hits right note giant light
topic 48 | sex marriage relationship family married couple depression pregnancy childhood trademark
topic 49 | protects rejecting com4 couple decides taking connect unc nearest richer
topic 50 | president states united american national press april washington america white
topic 51 | card memory windows board ram bus drivers driver cpu problem
topic 52 | window application manager display button xterm path widget event resources
topic 53 | cable win van det bos tor cal nyi chi buf
topic 54 | americans baltimore rochester cape springfield moncton providence utica binghamton adirondack
topic 55 | color monitor screen mouse video colors resolution vga colour monitors
topic 56 | option power ssf flights capability module redesign missions station options
topic 57 | body father son vitamin diet day cells cell form literature
topic 58 | max g9v b8f a86 bhj giz bxn biz qax b4q
topic 59 | bit fast chip ibm faster mode chips scsi-2 speeds quadra
topic 60 | book books law adl islam islamic iran media bullock muslims
topic 61 | armenian russian turkish ottoman people army armenians genocide war turks
topic 62 | oscillator partition tune nun umumiye nezareti mecmuasi muharrerat-i evrak version
topic 63 | tongues seat est didn raise copied lazy schemes adapter leap
topic 64 | com object jim app function motorola heterosexual objects pointers encountered
topic 65 | effective boy projects grow jason ain dump keyboards vastly grants
topic 66 | armenian people russian armenians turks ottoman army turkish genocide muslim
topic 67 | mac apple pin ground wire quicktime macs pins connector simms
topic 68 | bastard turning likes hooks notions turks cited proud pointers chuck
topic 69 | bought dealer cost channel replaced face sony stereo warranty tube
topic 70 | myers food reaction msg writes loop eat dee effects taste
topic 71 | lander contradiction reconcile apparent somebody supplement essential needs produce insulin
topic 72 | re-boost systems virginia voice unix input ken easily summary developing
topic 73 | block tests suck shadow dte screws macedonia sunlight fin message
topic 74 | jesus church christ god lord holy spirit mary shall heaven
topic 75 | gun number year guns rate insurance police years new firearms
topic 76 | rule automatically characteristic wider thumb recommendation inline mr2 halfway width
topic 77 | drive disk hard scsi drives controller floppy ide master transfer
topic 78 | stephanopoulos water gas oil heat energy hot temperature cold nuclear
topic 79 | like know does use don just good thanks need want
topic 80 | starters mlb mov higher signing left accessible argument viola teams
topic 81 | entry rules info define entries year int printf include contest
topic 82 | price new sale offer sell condition shipping interested asking prices
topic 83 | issue germany title magazine german cover race generation origin nazi
topic 84 | armenian armenians people turkish war said killed children russian turkey
topic 85 | dos windows software comp library os/2 version microsoft applications code
topic 86 | probe space launch titan earth cassini orbiter orbit atmosphere mission
topic 87 | housed throws fills daylight occurring activities adjacent presenting punish occuring
topic 88 | statement folk raids thor disarmed anatolia polygon inria arrive smehlik
topic 89 | sound steve pro convert ati ultra fahrenheit orchid hercules blaster
topic 90 | joke tricky wearing golden trickle seen geneva csh course caesar
topic 91 | moral objective values morality child defined bank definition wrong different
topic 92 | files file edu ftp available version server data use sun
topic 93 | catalog tons seal ordering kawasaki tools fax free ultraviolet packages
topic 94 | file program error output use section line code command problem
topic 95 | power ssf module capability option flights redesign missions human station
topic 96 | just don think know like time did going didn people

From the above, we can immediately get a feel for the kinds of subjects discussed in this dataset. For instance, topic #13 appears to be about the Middle East, with top words "israel jews jewish israeli arab peace".

STEP 3: Compute the Document-Topic Matrix

In [6]:
%%time
tm.build(texts, threshold=0.25)
done.
CPU times: user 1min 27s, sys: 3min 26s, total: 4min 53s
Wall time: 12.6 s

Since the build method prunes documents based on threshold, we should prune the original data and any metadata in a similar way for consistency. This can be accomplished with the filter method.

In [7]:
texts = tm.filter(texts)
categories = tm.filter(categories)

This keeps all data and metadata aligned on the same array indices in case we want to use them later (e.g., in visualizations).
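
The idea behind pruning and re-aligning can be sketched with a boolean mask. This is a toy illustration, not ktrain's implementation: keep_mask and apply_mask are hypothetical helpers, and the sketch assumes the threshold is compared against each document's highest topic probability.

```python
def keep_mask(doc_topics, threshold):
    """Flag documents whose strongest topic probability reaches the threshold."""
    return [max(row) >= threshold for row in doc_topics]

def apply_mask(items, mask):
    """Keep only the items whose mask entry is True, preserving order."""
    return [item for item, keep in zip(items, mask) if keep]

# Toy document-topic matrix: 3 documents, 5 topics (each row sums to 1).
doc_topics = [
    [0.70, 0.10, 0.10, 0.05, 0.05],  # clearly about topic 0
    [0.20, 0.20, 0.20, 0.20, 0.20],  # no topic reaches 0.25, so it is pruned
    [0.05, 0.05, 0.60, 0.20, 0.10],  # clearly about topic 2
]
toy_texts = ["doc a", "doc b", "doc c"]
toy_categories = ["sci.space", "misc.forsale", "rec.autos"]

# One mask is applied to every parallel list so the indices stay aligned.
mask = keep_mask(doc_topics, threshold=0.25)
kept_texts = apply_mask(toy_texts, mask)
kept_categories = apply_mask(toy_categories, mask)
print(kept_texts)       # ['doc a', 'doc c']
print(kept_categories)  # ['sci.space', 'rec.autos']
```

Because the same mask is reused, position i in kept_texts and position i in kept_categories always describe the same document.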

STEP 4: Inspect and Visualize Topics

Let's list the topics by document count:

In [8]:
tm.print_topics(show_counts=True)
topic:79 | count:3782 | like know does use don just good thanks need want
topic:96 | count:3643 | just don think know like time did going didn people
topic:43 | count:1599 | god people does say believe bible true think evidence religion
topic:42 | count:1246 | people government right think rights law make public fbi don
topic:51 | count:900 | card memory windows board ram bus drivers driver cpu problem
topic:46 | count:782 | game team games year hockey season players player baseball league
topic:92 | count:597 | files file edu ftp available version server data use sun
topic:29 | count:399 | edu university information send new computer research mail internet address
topic:82 | count:371 | price new sale offer sell condition shipping interested asking prices
topic:84 | count:312 | armenian armenians people turkish war said killed children russian turkey
topic:12 | count:296 | space nasa earth data launch surface solar moon mission planet
topic:22 | count:283 | key encryption chip keys clipper phone security use government privacy
topic:75 | count:236 | gun number year guns rate insurance police years new firearms
topic:15 | count:157 | medical health disease cancer patients drug treatment drugs aids study
topic:94 | count:152 | file program error output use section line code command problem
topic:74 | count:146 | jesus church christ god lord holy spirit mary shall heaven
topic:45 | count:123 | period goal pts play chicago pittsburgh buffalo shots new blues
topic:13 | count:104 | israel jews jewish israeli arab peace war arabs palestinian kuwait
topic:77 | count:75 | drive disk hard scsi drives controller floppy ide master transfer
topic:85 | count:58 | dos windows software comp library os/2 version microsoft applications code
topic:21 | count:46 | list mailing service model small large lists radar available major
topic:52 | count:29 | window application manager display button xterm path widget event resources
topic:20 | count:28 | car engine cars miles clutch new ford rear slip road
topic:27 | count:22 | men homosexual homosexuality women gay sexual homosexuals male kinsey pop
topic:19 | count:19 | black red white blue green cross wires lines helmet mask
topic:53 | count:16 | cable win van det bos tor cal nyi chi buf
topic:78 | count:16 | stephanopoulos water gas oil heat energy hot temperature cold nuclear
topic:91 | count:14 | moral objective values morality child defined bank definition wrong different
topic:24 | count:13 | signal input switch connected circuit audio noise output control voltage
topic:60 | count:12 | book books law adl islam islamic iran media bullock muslims
topic:17 | count:12 | graphics zip amiga shareware formats ftp gif program sgi convert
topic:32 | count:12 | says article pain known warning doctor stone bug kidney response
topic:55 | count:12 | color monitor screen mouse video colors resolution vga colour monitors
topic:59 | count:11 | bit fast chip ibm faster mode chips scsi-2 speeds quadra
topic:58 | count:10 | max g9v b8f a86 bhj giz bxn biz qax b4q
topic:70 | count:10 | myers food reaction msg writes loop eat dee effects taste
topic:81 | count:9 | entry rules info define entries year int printf include contest
topic:54 | count:8 | americans baltimore rochester cape springfield moncton providence utica binghamton adirondack
topic:50 | count:8 | president states united american national press april washington america white
topic:9 | count:8 | power station supply options option led light tank plastic wall
topic:34 | count:8 | wolverine comics hulk appearance special liefeld sabretooth incredible hobgoblin x-force
topic:67 | count:7 | mac apple pin ground wire quicktime macs pins connector simms
topic:3 | count:6 | hell humans poster frank reality kent gerard gant eternal bell
topic:25 | count:6 | stuff deleted die posting beware fantastic motives authentic reluctant hope
topic:4 | count:5 | air phd chz kit cbc ups w-s rus w47 mot
topic:64 | count:5 | com object jim app function motorola heterosexual objects pointers encountered
topic:47 | count:5 | speed dod student technician just hits right note giant light
topic:8 | count:4 | fonts font shell converted iii characters slight composite breaks compress
topic:7 | count:4 | gif points scale postscript mirror plane rendering algorithm polygon rayshade
topic:0 | count:3 | tape adam tim case moved bag quote mass marked zionism
topic:33 | count:3 | faq rsa ripem lights yes patent nist management wax cipher
topic:83 | count:2 | issue germany title magazine german cover race generation origin nazi
topic:89 | count:2 | sound steve pro convert ati ultra fahrenheit orchid hercules blaster
topic:65 | count:2 | effective boy projects grow jason ain dump keyboards vastly grants
topic:69 | count:1 | bought dealer cost channel replaced face sony stereo warranty tube
topic:48 | count:1 | sex marriage relationship family married couple depression pregnancy childhood trademark
topic:31 | count:1 | state intelligence militia units army zone georgia sam croats belongs
topic:57 | count:1 | body father son vitamin diet day cells cell form literature
topic:10 | count:1 | transmission rider bmw driver automatic shift gear japanese stick highway
topic:76 | count:1 | rule automatically characteristic wider thumb recommendation inline mr2 halfway width

The topic with the most documents (topic 79) appears to consist of conversational questions, replies, and comments that aren't focused on a particular subject. Other topics are focused on specific domains (e.g., topic 13 with label "israel jews jewish israeli arab peace war arabs palestinian kuwait").

Notice that some topics contain only a few documents (e.g., topic #48 about sex, marriage, and relationships). This is typically an indication that this topic is mentioned within documents that also mention other topics prominently (e.g., topics about government policy vs. individual rights).
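
A toy example makes this effect concrete. The sketch assumes each document is counted under its single most probable topic; that is an assumption about how the counts above are produced, and dominant_topic_counts is a hypothetical helper rather than a ktrain function.

```python
from collections import Counter

def dominant_topic_counts(doc_topics):
    """Count documents by the index of their most probable topic."""
    dominant = [max(range(len(row)), key=row.__getitem__) for row in doc_topics]
    return Counter(dominant)

# Toy matrix: 4 documents, 3 topics.
# Topic 2 appears in every document but is never the strongest topic.
doc_topics = [
    [0.60, 0.10, 0.30],
    [0.55, 0.05, 0.40],
    [0.10, 0.50, 0.40],
    [0.50, 0.05, 0.45],
]
counts = dominant_topic_counts(doc_topics)
for topic_id, count in counts.most_common():
    print(f"topic:{topic_id} | count:{count}")
# topic:0 | count:3
# topic:1 | count:1
```

Topic 2 carries substantial probability mass in all four documents yet receives a count of zero, because another topic is always more prominent.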

Let's visualize the corpus:

In [9]:
tm.visualize_documents(doc_topics=tm.get_doctopics())
reducing to 2 dimensions...[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 15644 samples in 0.048s...
[t-SNE] Computed neighbors for 15644 samples in 36.032s...
[t-SNE] Computed conditional probabilities for sample 1000 / 15644
[t-SNE] Computed conditional probabilities for sample 2000 / 15644
[t-SNE] Computed conditional probabilities for sample 3000 / 15644
[t-SNE] Computed conditional probabilities for sample 4000 / 15644
[t-SNE] Computed conditional probabilities for sample 5000 / 15644
[t-SNE] Computed conditional probabilities for sample 6000 / 15644
[t-SNE] Computed conditional probabilities for sample 7000 / 15644
[t-SNE] Computed conditional probabilities for sample 8000 / 15644
[t-SNE] Computed conditional probabilities for sample 9000 / 15644
[t-SNE] Computed conditional probabilities for sample 10000 / 15644
[t-SNE] Computed conditional probabilities for sample 11000 / 15644
[t-SNE] Computed conditional probabilities for sample 12000 / 15644
[t-SNE] Computed conditional probabilities for sample 13000 / 15644
[t-SNE] Computed conditional probabilities for sample 14000 / 15644
[t-SNE] Computed conditional probabilities for sample 15000 / 15644
[t-SNE] Computed conditional probabilities for sample 15644 / 15644
[t-SNE] Mean sigma: 0.071015
[t-SNE] KL divergence after 250 iterations with early exaggeration: 87.022636
[t-SNE] KL divergence after 1000 iterations: 1.863021
done.
Loading BokehJS ...

Here is the top-ranked document for topic #74, which is about Christianity:

In [10]:
print(tm.get_docs(topic_ids=[74], rank=True)[0]['text'])
For the Lord Himself will descend from Heaven with a shout, with the voice
of an archangel, and with the trumpet of God. And the dead in Christ will
rise first. Then we who are alive and remain will be caught up together
to meet the Lord in the air. And thus we shall always be with the Lord.

Let's visualize the "Christianity" topic (topic_id=74) and the "Medical" topic (topic_id=15):

In [11]:
doc_topics = tm.get_doctopics(topic_ids=[15, 74])
tm.visualize_documents(doc_topics=doc_topics)
reducing to 2 dimensions...[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 303 samples in 0.001s...
[t-SNE] Computed neighbors for 303 samples in 0.014s...
[t-SNE] Computed conditional probabilities for sample 303 / 303
[t-SNE] Mean sigma: 0.116946
[t-SNE] KL divergence after 250 iterations with early exaggeration: 57.464523
[t-SNE] KL divergence after 1000 iterations: 0.429532
done.
Loading BokehJS ...

STEP 5: Predicting the Topics of New Documents

The predict method computes the topic probability distribution for an arbitrary document directly from raw text:

In [12]:
tm.predict(['Elon Musk leads Space Exploration Technologies (SpaceX), where he oversees '  +
            'the development and manufacturing of advanced rockets and spacecraft for missions ' +
            'to and beyond Earth orbit.'])
Out[12]:
array([[0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.65009096, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.06185567, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214]])

As expected, the highest topic probability for this sentence belongs to topic #12 (third row, third column of the output above), which is about space and spaceflight:

In [13]:
tm.topics[ np.argmax(tm.predict(['Elon Musk leads Space Exploration Technologies (SpaceX), where he oversees '  +
            'the development and manufacturing of advanced rockets and spacecraft for missions ' +
            'to and beyond Earth orbit.']))]
Out[13]:
'space nasa earth data launch surface solar moon mission planet'
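
The argmax above returns only the single most probable topic. When a document touches on several topics, it can help to look at the top few instead. The sketch below is a minimal illustration: top_k_topics is a hypothetical helper, not part of ktrain, and the probability vector is rebuilt by hand from the values printed in Out[12].

```python
def top_k_topics(probs, k=3):
    """Return the k (topic_id, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Rebuild the distribution shown in Out[12]: 97 topics, two of which stand out.
probs = [0.00303214] * 97
probs[12] = 0.65009096
probs[53] = 0.06185567

for topic_id, p in top_k_topics(probs, k=2):
    print(topic_id, p)
# 12 0.65009096
# 53 0.06185567
```

The helper accepts any sequence of probabilities, so a single row of the array returned by predict could be passed in directly.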

Saving and Restoring the Topic Model

The topic model can be saved and restored as follows.

Save the Topic Model:

In [14]:
tm.save('/tmp/tm')

Restore the Topic Model and Rebuild the Document-Topic Matrix:

In [15]:
tm = ktrain.text.load_topic_model('/tmp/tm')
done.
In [16]:
tm.build(texts, threshold=0.25)
done.
In [17]:
tm.topics[ np.argmax(tm.predict(['Elon Musk leads Space Exploration Technologies (SpaceX), where he oversees '  +
            'the development and manufacturing of advanced rockets and spacecraft for missions ' +
            'to and beyond Earth orbit.']))]
Out[17]:
'space nasa earth data launch surface solar moon mission planet'