Notebook

In [1]:

%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [2]:

import ktrain

I1016 09:52:24.127507 140653204371264 file_utils.py:39] PyTorch version 1.4.0 available.
I1016 09:52:24.130346 140653204371264 file_utils.py:55] TensorFlow version 2.1.0 available.

Learning from Unlabeled Text Data¶

Unlabeled, unstructured text or document data abound, and it is often necessary to "make sense" of these data for various applications. Examples include:

exploratory analysis of text data: provide rich overviews of the information space to discover relevant information for which one may not have even known to look
building training sets for text classification: identifying positive and negative example documents to train a text classifier in a semi-automated fashion
document similarity: measuring the semantic simlarity between documents or sets of documents
document recommender systems: given a specific document of interest, recommend other documents that are semantically similar to it

Each of these examples involve learning from largely unlabeled text data. In this notebook, we will show you how to accomplish the above with minimal coding using ktrain. The ktrain library is an open-source, augmented ML library built around Keras and scikit-learn. It can be installed with pip3 install ktrain and is available on GitHub.

We will use the well-known 20-newsgroup dataset for this demonstration.

Get Raw Document Data¶

In [3]:

# 20newsgroups
from sklearn.datasets import fetch_20newsgroups

# we only want to keep the body of the documents!
remove = ('headers', 'footers', 'quotes')

# fetch train and test data
newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)

# compile the texts
texts = newsgroups_train.data +  newsgroups_test.data

# let's also store the newsgroup category associated with each document
# we can display this information in visualizations
targets = [target for target in list(newsgroups_train.target) + list(newsgroups_test.target)]
categories = [newsgroups_train.target_names[target] for target in targets]

We are loading the targets (i.e., newsgroup categories), but will not use them for learning a model. Rather, they are simply employed as an example of how to incorporate metadata about documents in visualizations and anlayses.

Train an LDA Topic Model to Discover Topics¶

The get_topic_model function learns a topic model using Latent Dirichlet Allocation (LDA) by default.

To use non-negative matrix factorization(NMF) instead of LDA, you can supply model_type='nmf' to the get_topic_model function.

The n_features argument specifies the size of the vocabulary, and the n_topics argument sets the number of topics (or clusters) to discover.

In [4]:

%%time
tm = ktrain.text.get_topic_model(texts, n_topics=None, n_features=10000)

n_topics automatically set to 97
lang: en
preprocessing texts...
fitting model...
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
done.
CPU times: user 15min 53s, sys: 43min 10s, total: 59min 4s
Wall time: 1min 59s

We can examine the discovered topics using print_topics, get_topics, or topics. Here, we will use print_topics:

In [5]:

tm.print_topics()

topic 0 | tape adam tim case moved bag quote mass marked zionism
topic 1 | image jpeg images format programs tiff files jfif save lossless
topic 2 | alternative movie film static cycles films philips dynamic hou phi
topic 3 | hell humans poster frank reality kent gerard gant eternal bell
topic 4 | air phd chz kit cbc ups w-s rus w47 mot
topic 5 | dog math great figure poster couldn don trying rushdie fatwa
topic 6 | collaboration nazi fact end expression germany philly world certified moore
topic 7 | gif points scale postscript mirror plane rendering algorithm polygon rayshade
topic 8 | fonts font shell converted iii characters slight composite breaks compress
topic 9 | power station supply options option led light tank plastic wall
topic 10 | transmission rider bmw driver automatic shift gear japanese stick highway
topic 11 | tyre ezekiel ruler hernia appeared appointed supreme man land power
topic 12 | space nasa earth data launch surface solar moon mission planet
topic 13 | israel jews jewish israeli arab peace war arabs palestinian kuwait
topic 14 | olvwm xremote animals kinds roughing toolkit close corp glenn imakefile
topic 15 | medical health disease cancer patients drug treatment drugs aids study
topic 16 | biden chip gear like information number automatic mode insurance know
topic 17 | graphics zip amiga shareware formats ftp gif program sgi convert
topic 18 | brilliant mail did god coming christianity people got ideas reading
topic 19 | black red white blue green cross wires lines helmet mask
topic 20 | car engine cars miles clutch new ford rear slip road
topic 21 | list mailing service model small large lists radar available major
topic 22 | key encryption chip keys clipper phone security use government privacy
topic 23 | talking pit nyr stl phi edm mtl wsh hfd cgy
topic 24 | signal input switch connected circuit audio noise output control voltage
topic 25 | stuff deleted die posting beware fantastic motives authentic reluctant hope
topic 26 | adams douglas dc-x garrett ingres tin sdio incremental mcdonnell guide
topic 27 | men homosexual homosexuality women gay sexual homosexuals male kinsey pop
topic 28 | usual leo rs-232 martian reading cooperative unmanned somalia decompress visited
topic 29 | edu university information send new computer research mail internet address
topic 30 | reserve naval marine ret commission one-way irgun prior closure facilities
topic 31 | state intelligence militia units army zone georgia sam croats belongs
topic 32 | says article pain known warning doctor stone bug kidney response
topic 33 | faq rsa ripem lights yes patent nist management wax cipher
topic 34 | wolverine comics hulk appearance special liefeld sabretooth incredible hobgoblin x-force
topic 35 | software ram worth cycles controller available make dram dynamic situation
topic 36 | religion people religious catalog bobby used driven involved long like
topic 37 | intel sites experiment ftp does know family good like mrs
topic 38 | armenian people army russian turkish genocide armenians ottoman turks jews
topic 39 | theft geo available face couldn cover sony people number shop
topic 40 | christianity did exists mail matter mind tool status god reading
topic 41 | propane probe earth orbit orbiter titan cassini space atmosphere gravity
topic 42 | people government right think rights law make public fbi don
topic 43 | god people does say believe bible true think evidence religion
topic 44 | mov phone south key war supply push left just registered
topic 45 | period goal pts play chicago pittsburgh buffalo shots new blues
topic 46 | game team games year hockey season players player baseball league
topic 47 | speed dod student technician just hits right note giant light
topic 48 | sex marriage relationship family married couple depression pregnancy childhood trademark
topic 49 | protects rejecting com4 couple decides taking connect unc nearest richer
topic 50 | president states united american national press april washington america white
topic 51 | card memory windows board ram bus drivers driver cpu problem
topic 52 | window application manager display button xterm path widget event resources
topic 53 | cable win van det bos tor cal nyi chi buf
topic 54 | americans baltimore rochester cape springfield moncton providence utica binghamton adirondack
topic 55 | color monitor screen mouse video colors resolution vga colour monitors
topic 56 | option power ssf flights capability module redesign missions station options
topic 57 | body father son vitamin diet day cells cell form literature
topic 58 | max g9v b8f a86 bhj giz bxn biz qax b4q
topic 59 | bit fast chip ibm faster mode chips scsi-2 speeds quadra
topic 60 | book books law adl islam islamic iran media bullock muslims
topic 61 | armenian russian turkish ottoman people army armenians genocide war turks
topic 62 | oscillator partition tune nun umumiye nezareti mecmuasi muharrerat-i evrak version
topic 63 | tongues seat est didn raise copied lazy schemes adapter leap
topic 64 | com object jim app function motorola heterosexual objects pointers encountered
topic 65 | effective boy projects grow jason ain dump keyboards vastly grants
topic 66 | armenian people russian armenians turks ottoman army turkish genocide muslim
topic 67 | mac apple pin ground wire quicktime macs pins connector simms
topic 68 | bastard turning likes hooks notions turks cited proud pointers chuck
topic 69 | bought dealer cost channel replaced face sony stereo warranty tube
topic 70 | myers food reaction msg writes loop eat dee effects taste
topic 71 | lander contradiction reconcile apparent somebody supplement essential needs produce insulin
topic 72 | re-boost systems virginia voice unix input ken easily summary developing
topic 73 | block tests suck shadow dte screws macedonia sunlight fin message
topic 74 | jesus church christ god lord holy spirit mary shall heaven
topic 75 | gun number year guns rate insurance police years new firearms
topic 76 | rule automatically characteristic wider thumb recommendation inline mr2 halfway width
topic 77 | drive disk hard scsi drives controller floppy ide master transfer
topic 78 | stephanopoulos water gas oil heat energy hot temperature cold nuclear
topic 79 | like know does use don just good thanks need want
topic 80 | starters mlb mov higher signing left accessible argument viola teams
topic 81 | entry rules info define entries year int printf include contest
topic 82 | price new sale offer sell condition shipping interested asking prices
topic 83 | issue germany title magazine german cover race generation origin nazi
topic 84 | armenian armenians people turkish war said killed children russian turkey
topic 85 | dos windows software comp library os/2 version microsoft applications code
topic 86 | probe space launch titan earth cassini orbiter orbit atmosphere mission
topic 87 | housed throws fills daylight occurring activities adjacent presenting punish occuring
topic 88 | statement folk raids thor disarmed anatolia polygon inria arrive smehlik
topic 89 | sound steve pro convert ati ultra fahrenheit orchid hercules blaster
topic 90 | joke tricky wearing golden trickle seen geneva csh course caesar
topic 91 | moral objective values morality child defined bank definition wrong different
topic 92 | files file edu ftp available version server data use sun
topic 93 | catalog tons seal ordering kawasaki tools fax free ultraviolet packages
topic 94 | file program error output use section line code command problem
topic 95 | power ssf module capability option flights redesign missions human station
topic 96 | just don think know like time did going didn people

From the above, we can immediately get a feel for what kinds of subjects are discussed within this dataset. For instsance, Topic #13 appears to be about the Middle East with labels: "israel jews jewish israeli arab peace".

We can examine the word weights for this topic, where the "weight" is a pseudo-count (that can be converted to a probability if normalizing over all words in vocabulary):

In [6]:

tm.get_word_weights(topic_id=13, n_words=25)

Out[6]:

[('israel', 832.6639019298735),
 ('jews', 670.7851381220836),
 ('jewish', 539.6844212023212),
 ('israeli', 426.2434455755054),
 ('arab', 376.1511464659808),
 ('peace', 269.33000902133085),
 ('war', 229.3267450288203),
 ('arabs', 223.59854716988627),
 ('palestinian', 191.29031007300767),
 ('kuwait', 182.83909796210693),
 ('land', 173.23994664154932),
 ('palestinians', 158.31898772111572),
 ('state', 151.5461024601567),
 ('palestine', 121.14271427446522),
 ('west', 113.54334001125812),
 ('iraq', 111.54753922935043),
 ('jew', 106.56410679110718),
 ('attacks', 106.47122225017874),
 ('israelis', 99.7315473662208),
 ('gaza', 98.13080829032971),
 ('killed', 92.8557266909479),
 ('occupied', 90.79070215809577),
 ('country', 89.68676723731086),
 ('policy', 86.85153167299889),
 ('civilians', 86.63190206561713)]

Computing the Document-Topic Matrix¶

We will now pre-compute the document-topic matrix. Each row in this matrix represents a document, and the columns represent the probability distribution over the 97 topics. This allows us to easily see what kinds of topics are covered by any specific document in the original corpus.

When computing the document-topic matrix, we will also filter out documents whose maximum topic probability is less than 0.25 in order to consider the most representative documents for each topic. This may help to improve clarity of visualizations (shown later) by removing "unfocused" documents.

In [7]:

%%time
tm.build(texts, threshold=0.25)

done.
CPU times: user 1min 23s, sys: 3min 18s, total: 4min 42s
Wall time: 12.3 s

Since the build method prunes documents based on threshold, we should prune the original data and any metadata in a similar way for consistency. This can be accomplished with the filter method.

In [8]:

texts = tm.filter(texts)
categories = tm.filter(categories)

This is useful to ensure all data and metadata are aligned with the same array indices in case we want to use them later (e.g., in visualizations, for example).

Having computed the document-topic matrix, we can now easily access the topic probablity distribution for any document in the corpus using get_doctopics. For instance, this document in the corpus is about sports:

In [9]:

print(texts[35])

For the second straight game, California scored a ton of late runs to crush
the Brewhas. It was six runs in the 8th for a 12-5 win Monday and five in
the 8th and six in the 9th for a 12-2 win yesterday. Jamie Navarro pitched
seven strong innings, but Orosco, Austin, Manzanillo and Lloyd all took part
in the mockery of a bullpen yesterday. How's this for numbers? Maldanado has
pitched three scoreless innings and Navarro's ERA is 0.75. The next lowest
on the staff is Wegman at 5.14. Ouch!

It doesn't look much better for the hitters. Hamilton is batting .481, while
Thon is hitting .458 and has seven RBI. The next highest is three. The next
best hitter is Jaha at .267 and then Vaughn, who has the team's only HR, at
.238. Another ouch. Looking at the stats, it's not hard to see why the team
is 2-5. In fact, 2-5 doesn't sound bad when you're averaging three runs/game
and giving up 6.6/game. 

Still, it's early and things will undoubtedly get better. The offense should
come around, but the bullpen is a major worry. Fetters, Plesac and Austin gave
the Brewers great middle relief last year. Lloyd, Maldanado, Manzanillo, 
Fetters, Austin and Orosco will have to pick up the pace for the team to be
successful. Milwaukee won a number of games last year when middle relief either
held small leads or kept small deficits in place. The starters will be okay,
the defense will be alright and the hitting will come around, but the bullpen
is a big question mark.

In other news, Nilsson and Doran were reactivated yesterday, while William
Suero was sent down and Tim McIntosh was picked up by Montreal. Today's game
with California was cancelled.

And, here is the topic probability distribution for this document:

In [10]:

tm.get_doctopics(doc_ids=[35])

Out[10]:

array([[4.64381931e-04, 4.64381908e-04, 4.64381908e-04, 4.64381909e-04,
        4.64381908e-04, 4.64381908e-04, 4.64381908e-04, 4.64381908e-04,
        4.64381908e-04, 4.64381911e-04, 4.64381908e-04, 4.64381908e-04,
        4.64381912e-04, 4.64381914e-04, 4.64381908e-04, 4.64381909e-04,
        4.64381908e-04, 4.64381909e-04, 4.64381908e-04, 4.64381922e-04,
        4.64381909e-04, 8.60305392e-02, 4.64381913e-04, 4.64381908e-04,
        4.64381908e-04, 4.64381908e-04, 4.64381908e-04, 4.64381915e-04,
        4.64381908e-04, 4.64381924e-04, 4.64381908e-04, 4.64381919e-04,
        4.64381947e-04, 4.64381915e-04, 4.64381908e-04, 4.64381908e-04,
        4.64381908e-04, 4.64381908e-04, 4.64381908e-04, 4.64381908e-04,
        4.64381908e-04, 4.64381908e-04, 4.64381926e-04, 4.64381929e-04,
        4.64381908e-04, 4.64381911e-04, 6.93731322e-01, 4.64381908e-04,
        4.64381908e-04, 4.64381908e-04, 1.50902793e-02, 4.64381911e-04,
        4.64381909e-04, 9.05690040e-03, 4.64381908e-04, 4.64381908e-04,
        4.64381908e-04, 4.64381908e-04, 4.64381908e-04, 4.64381908e-04,
        4.64381908e-04, 4.64381908e-04, 4.64381908e-04, 4.64381908e-04,
        4.64381908e-04, 4.64381908e-04, 4.64381908e-04, 4.64381908e-04,
        4.64381908e-04, 4.64381908e-04, 4.64381911e-04, 4.64381908e-04,
        4.64381908e-04, 4.64381908e-04, 1.26873995e-02, 4.64381921e-04,
        4.64381908e-04, 4.64381937e-04, 4.64381908e-04, 1.92280658e-02,
        9.47339092e-03, 4.64381909e-04, 4.64381912e-04, 4.64381908e-04,
        4.64381918e-04, 4.64381908e-04, 4.64381908e-04, 4.64381908e-04,
        4.64381908e-04, 9.47334319e-03, 4.64381908e-04, 4.64381909e-04,
        4.64381910e-04, 4.64381908e-04, 4.64381909e-04, 4.64381908e-04,
        1.04363152e-01]])

As expected, the highest topic probability (69%) is associated with a topic about sports:

In [11]:

tm.topics[ np.argmax(tm.get_doctopics(doc_ids=[35]))]

Out[11]:

'game team games year hockey season players player baseball league'

Predicting the Topics of New Documents¶

The predict method can predict the topic probability distribution for any arbitrary document directly from raw text:

In [12]:

tm.predict(['Elon Musk leads Space Exploration Technologies (SpaceX), where he oversees '  +
            'the development and manufacturing of advanced rockets and spacecraft for missions ' +
            'to and beyond Earth orbit.'])

Out[12]:

array([[0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.65009096, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.06185567, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214, 0.00303214, 0.00303214, 0.00303214,
        0.00303214, 0.00303214]])

As expected, the highest topic probability for this sentence is from topic #12 (third row and third column), which is about space and related things:

In [13]:

tm.topics[ np.argmax(tm.predict(['Elon Musk leads Space Exploration Technologies (SpaceX), where he oversees '  +
            'the development and manufacturing of advanced rockets and spacecraft for missions ' +
            'to and beyond Earth orbit.']))]

Out[13]:

'space nasa earth data launch surface solar moon mission planet'

Visualizing Topics¶

Let's take another look at the list of discovered topics but sorted by document count.

In [14]:

tm.print_topics(show_counts=True)

topic:79 | count:3782 | like know does use don just good thanks need want
topic:96 | count:3643 | just don think know like time did going didn people
topic:43 | count:1599 | god people does say believe bible true think evidence religion
topic:42 | count:1246 | people government right think rights law make public fbi don
topic:51 | count:900 | card memory windows board ram bus drivers driver cpu problem
topic:46 | count:782 | game team games year hockey season players player baseball league
topic:92 | count:597 | files file edu ftp available version server data use sun
topic:29 | count:399 | edu university information send new computer research mail internet address
topic:82 | count:371 | price new sale offer sell condition shipping interested asking prices
topic:84 | count:312 | armenian armenians people turkish war said killed children russian turkey
topic:12 | count:296 | space nasa earth data launch surface solar moon mission planet
topic:22 | count:283 | key encryption chip keys clipper phone security use government privacy
topic:75 | count:236 | gun number year guns rate insurance police years new firearms
topic:15 | count:157 | medical health disease cancer patients drug treatment drugs aids study
topic:94 | count:152 | file program error output use section line code command problem
topic:74 | count:146 | jesus church christ god lord holy spirit mary shall heaven
topic:45 | count:123 | period goal pts play chicago pittsburgh buffalo shots new blues
topic:13 | count:104 | israel jews jewish israeli arab peace war arabs palestinian kuwait
topic:77 | count:75 | drive disk hard scsi drives controller floppy ide master transfer
topic:85 | count:58 | dos windows software comp library os/2 version microsoft applications code
topic:21 | count:46 | list mailing service model small large lists radar available major
topic:52 | count:29 | window application manager display button xterm path widget event resources
topic:20 | count:28 | car engine cars miles clutch new ford rear slip road
topic:27 | count:22 | men homosexual homosexuality women gay sexual homosexuals male kinsey pop
topic:19 | count:19 | black red white blue green cross wires lines helmet mask
topic:53 | count:16 | cable win van det bos tor cal nyi chi buf
topic:78 | count:16 | stephanopoulos water gas oil heat energy hot temperature cold nuclear
topic:91 | count:14 | moral objective values morality child defined bank definition wrong different
topic:24 | count:13 | signal input switch connected circuit audio noise output control voltage
topic:60 | count:12 | book books law adl islam islamic iran media bullock muslims
topic:17 | count:12 | graphics zip amiga shareware formats ftp gif program sgi convert
topic:32 | count:12 | says article pain known warning doctor stone bug kidney response
topic:55 | count:12 | color monitor screen mouse video colors resolution vga colour monitors
topic:59 | count:11 | bit fast chip ibm faster mode chips scsi-2 speeds quadra
topic:58 | count:10 | max g9v b8f a86 bhj giz bxn biz qax b4q
topic:70 | count:10 | myers food reaction msg writes loop eat dee effects taste
topic:81 | count:9 | entry rules info define entries year int printf include contest
topic:54 | count:8 | americans baltimore rochester cape springfield moncton providence utica binghamton adirondack
topic:50 | count:8 | president states united american national press april washington america white
topic:9 | count:8 | power station supply options option led light tank plastic wall
topic:34 | count:8 | wolverine comics hulk appearance special liefeld sabretooth incredible hobgoblin x-force
topic:67 | count:7 | mac apple pin ground wire quicktime macs pins connector simms
topic:3 | count:6 | hell humans poster frank reality kent gerard gant eternal bell
topic:25 | count:6 | stuff deleted die posting beware fantastic motives authentic reluctant hope
topic:4 | count:5 | air phd chz kit cbc ups w-s rus w47 mot
topic:64 | count:5 | com object jim app function motorola heterosexual objects pointers encountered
topic:47 | count:5 | speed dod student technician just hits right note giant light
topic:8 | count:4 | fonts font shell converted iii characters slight composite breaks compress
topic:7 | count:4 | gif points scale postscript mirror plane rendering algorithm polygon rayshade
topic:0 | count:3 | tape adam tim case moved bag quote mass marked zionism
topic:33 | count:3 | faq rsa ripem lights yes patent nist management wax cipher
topic:83 | count:2 | issue germany title magazine german cover race generation origin nazi
topic:89 | count:2 | sound steve pro convert ati ultra fahrenheit orchid hercules blaster
topic:65 | count:2 | effective boy projects grow jason ain dump keyboards vastly grants
topic:69 | count:1 | bought dealer cost channel replaced face sony stereo warranty tube
topic:48 | count:1 | sex marriage relationship family married couple depression pregnancy childhood trademark
topic:31 | count:1 | state intelligence militia units army zone georgia sam croats belongs
topic:57 | count:1 | body father son vitamin diet day cells cell form literature
topic:10 | count:1 | transmission rider bmw driver automatic shift gear japanese stick highway
topic:76 | count:1 | rule automatically characteristic wider thumb recommendation inline mr2 halfway width

The topic with the most documents appears to be conversational questions, replies, and comments that aren't focused on a particular subject. Other topics are focused on specific domains (e.g., topic 15 is about medicine with label "medical health disease cancer patients drug treatment").

We can easily generate an interactive visualization of the documents under consideration using visualize_documents:

In [15]:

tm.visualize_documents(doc_topics=tm.get_doctopics())

reducing to 2 dimensions...[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 15644 samples in 0.174s...
[t-SNE] Computed neighbors for 15644 samples in 41.661s...
[t-SNE] Computed conditional probabilities for sample 1000 / 15644
[t-SNE] Computed conditional probabilities for sample 2000 / 15644
[t-SNE] Computed conditional probabilities for sample 3000 / 15644
[t-SNE] Computed conditional probabilities for sample 4000 / 15644
[t-SNE] Computed conditional probabilities for sample 5000 / 15644
[t-SNE] Computed conditional probabilities for sample 6000 / 15644
[t-SNE] Computed conditional probabilities for sample 7000 / 15644
[t-SNE] Computed conditional probabilities for sample 8000 / 15644
[t-SNE] Computed conditional probabilities for sample 9000 / 15644
[t-SNE] Computed conditional probabilities for sample 10000 / 15644
[t-SNE] Computed conditional probabilities for sample 11000 / 15644
[t-SNE] Computed conditional probabilities for sample 12000 / 15644
[t-SNE] Computed conditional probabilities for sample 13000 / 15644
[t-SNE] Computed conditional probabilities for sample 14000 / 15644
[t-SNE] Computed conditional probabilities for sample 15000 / 15644
[t-SNE] Computed conditional probabilities for sample 15644 / 15644
[t-SNE] Mean sigma: 0.071015
[t-SNE] KL divergence after 250 iterations with early exaggeration: 87.289635
[t-SNE] KL divergence after 1000 iterations: 1.866425
done.

Loading BokehJS ...

The visualization allows you to hover over points to inspect documents. The extra_info argument to visualize_documents allows you to customize what is displayed in the hover pop-up.

Inspecting Topics¶

The get_docs method allows you to retrieve document data by doc_id or topic_id. When rank=True, documents are sorted based on relevance to the topic. This is particularly useful for inspecting the most relevant documents to each topic.

Consider Topic # 51, which appears to be about computer hardware subjects:

In [16]:

tm.topics[51]

Out[16]:

'card memory windows board ram bus drivers driver cpu problem'

Let's examine this the top most relevant document to this topic:

In [17]:

doc = tm.get_docs(topic_ids=[51], rank=True)[0]
print('DOC_ID: %s'  % (doc['doc_id']))
print('TOPIC SCORE: %s '% (doc['topic_proba']))
print('TOPIC_ID: %s' % (doc['topic_id']))
print('TEXT: %s' % (doc['text']))

DOC_ID: 8252
TOPIC SCORE: 0.7938141074892827 
TOPIC_ID: 51
TEXT: Hi there,

With a 16Megs of RAM, is there a need to run/load Smartdrv for
Windows 3.1?  If yes, can I run/load Ramdrive without Smartdrv?
If I need both Ramdrive & Smartdrv, is the following Config.Sys
settings OK:  ...SMARTDRV.SYS 2048 2048
              ...RAMDRIVE.SYS 2048 /E

Thanks in advance for e-mail reply.

Looks right to me. Note that the get_docs method returns a list of dicts with keys:

text: the raw text of the document
doc_id: the index into the array returned by get_doctopics
topic_id: the index of the topic in the range of range(n_topics)
topic_proba: the relevance of this document to the topic represented by topic_id

When rank=True, within each topic_id, the dicts are sorted in desceding order by the topic_proba score. Hence, the first item is the most relevant document to the selected topic (topic_id=51). If rank=False, results are sorted in ascending order by doc_id (i.e,. the same order as texts that was supplied as input to build.

A Note About `get_docs` vs. `get_sorted_docs`¶

When we executed print_topics(show_counts=True) above, you may have noticed that topics towards the bottom of the list only had a few or even one document. For instance, topic_ic=48 is about sex, marriage and relationships and is associated with only one document. This is because these counts are generated by assigning each document to the one topic to which it is most related. Does this mean there is only one document talking about sex, marriage, and relationships? No. Rather, it just means that there is only one document that is most relevant to this topic over other topics. Other documents may pertain to sex, marriage, and relationships, but were determined to pertain most to another topic. That is, although these other documents talk about sex, marriage, and relationships, their primary topic were determined to be something else.

When invoking tm.get_docs(topic_ids[48]), this means only one document will be returned, since only a single document is primarily related to that topic. To see other documents that mention sex, marriage, and relationships (i.e., topic_id=48), we can use the get_sorted_docs method instead. The get_sorted_docs method will return all documents sorted by relevance to the given topic_id. For instance, this is the second most relevant document to sex, marriage, and relationships. Although it pertains to this topic, it was assigned to topic_id=42 (people and governments topic) because it is discussing sex in the context of society and government.

In [18]:

doc = tm.get_sorted_docs(topic_id=48)[1]
print('DOC_ID: %s'  % (doc['doc_id']))
print('TOPIC SCORE: %s '% (doc['topic_proba']))
print('TOPIC_ID: %s' % (doc['topic_id']))
print('TEXT: %s' % (doc['text']).strip())

DOC_ID: 685
TOPIC SCORE: 0.42620079390118243 
TOPIC_ID: 42
TEXT: [...]

Note that I _never_ said that depression and the destruction of the
nuclear family is due _solely_ to extra-marital sex.  I specifically
said that it was "a prime cause" of this, not "the prime cause" or "the
only cause" of this -- I recognize that there are probably other factors
too, but I think that extra-marital sex and subsequent destabilization
of the family is probably a significant factor to the rise in
psychological problems, including depression, in the West in the 20th
century.

Next, let's examine some additional topics that appears related to the larger theme of technology.

Here is a topic which appears to be about Windows software:

In [19]:

tm.topics[85]

Out[19]:

'dos windows software comp library os/2 version microsoft applications code'

In [20]:

tm.get_docs(topic_ids=[85], rank=True)[0]

Out[20]:

{'text': "1.  Software publishing SuperBase 4 windows v.1.3           --->$80\n\n2.  OCR System ReadRight v.3.1 for Windows                  --->$65\n\n3.  OCR System ReadRight  v.2.01 for DOS                    --->$65\n\n4.  Unregistered Zortech 32 bit C++ Compiler v.3.1          --->$ 250\n     with Multiscope windows Debugger,\n     WhiteWater Resource Toolkit, Library Source Code\n\n5.  Glockenspiel/ImageSoft Commonview 2 Windows\n     Applications Framework for Borland C++                 --->$70\n\n6.  Spontaneous Assembly Library With Source Code           --->$50\n\n7.  Microsoft Macro Assembly 6.0                            --->$50\n\n8.  Microsoft Windows v.3.1 SDK Documentation               --->$125\n\n9.  Microsoft FoxPro V.2.0                                  --->$75\n\n10.  WordPerfect 5.0 Developer's Toolkit                    --->$20\n\n11.  Kedwell Software DataBoss v.3.5 C Code Generator       --->$100\n\n12.  Kedwell InstallBoss v.2.0 Installation Generator       --->$35\n\n13.  Liant Software C++/Views v.2.1\n       Windows Application Framework with Source Code       --->$195\n\n14.  IBM OS/2 2.0 & Developer's Toolkit                     --->$95\n\n15.  CBTree DOS/Windows Library with Source Code            --->$120\n\n16.  Symantec TimeLine for Windows                          --->$90\n\n17.  TimeSlip TimeSheet Professional for Windows            --->$30",
 'doc_id': 84,
 'topic_proba': 0.9014064435233053,
 'topic_id': 85}

A topic about Programming:

In [21]:

tm.topics[94]

Out[21]:

'file program error output use section line code command problem'

In [22]:

tm.get_docs(topic_ids=[94], rank=True)[0]

Out[22]:

{'text': 'I am trying to write an image display program that uses\nthe MIT shared memory extension.  The shared memory segment\ngets allocated and attached to the process with no problem.\nBut the program crashes at the first call to XShmPutImage,\nwith the following message:\n\nX Error of failed request:  BadShmSeg (invalid shared segment parameter)\n  Major opcode of failed request:  133 (MIT-SHM)\n  Minor opcode of failed request:  3 (X_ShmPutImage)\n  Segment id in failed request 0x0\n  Serial number of failed request:  741\n  Current serial number in output stream:  742\n\nLike I said, I did error checking on all the calls to shmget\nand shmat that are necessary to create the shared memory\nsegment, as well as checking XShmAttach.  There are no\nproblems.\n\nIf anybody has had the same problem or has used MIT-SHM without\nhaving the same problem, please let me know.\n\nBy the way, I am running OpenWindows 3.0 on a Sun Sparc2.',
 'doc_id': 42,
 'topic_proba': 0.8379850047418861,
 'topic_id': 94}

A topic about cryptography:

In [23]:

tm.topics[22]

Out[23]:

'key encryption chip keys clipper phone security use government privacy'

In [24]:

tm.get_docs(topic_ids=[22], rank=True)[0]

Out[24]:

{'text': 'Here is a revised version of my summary which corrects some errors\nand provides some additional information and explanation.\n\n\n THE CLIPPER CHIP: A TECHNICAL SUMMARY\n\n Dorothy Denning\n\n Revised, April 21, 1993\n\n\nINTRODUCTION\n\nOn April 16, the President announced a new initiative that will bring\ntogether the Federal Government and industry in a voluntary program\nto provide secure communications while meeting the legitimate needs of\nlaw enforcement. At the heart of the plan is a new tamper-proof encryption\nchip called the "Clipper Chip" together with a split-key approach to\nescrowing keys. Two escrow agencies are used, and the key parts from\nboth are needed to reconstruct a key.\n\n\nCHIP CONTENTS\n\nThe Clipper Chip contains a classified single-key 64-bit block\nencryption algorithm called "Skipjack." The algorithm uses 80 bit keys\n(compared with 56 for the DES) and has 32 rounds of scrambling\n(compared with 16 for the DES). It supports all 4 DES modes of\noperation. The algorithm takes 32 clock ticks, and in Electronic\nCodebook (ECB) mode runs at 12 Mbits per second.\n\nEach chip includes the following components:\n\n the Skipjack encryption algorithm\n F, an 80-bit family key that is common to all chips\n N, a 30-bit serial number (this length is subject to change)\n U, an 80-bit secret key that unlocks all messages encrypted with the chip\n\nThe chips are programmed by Mykotronx, Inc., which calls them the\n"MYK-78." The silicon is supplied by VLSI Technology Inc. They are\nimplemented in 1 micron technology and will initially sell for about\n$30 each in quantities of 10,000 or more. The price should drop as the\ntechnology is shrunk to .8 micron.\n\n\nENCRYPTING WITH THE CHIP\n\nTo see how the chip is used, imagine that it is embedded in the AT&T\ntelephone security device (as it will be). Suppose I call someone and\nwe both have such a device. After pushing a button to start a secure\nconversation, my security device will negotiate an 80-bit session key K\nwith the device at the other end. This key negotiation takes place\nwithout the Clipper Chip. In general, any method of key exchange can\nbe used such as the Diffie-Hellman public-key distribution method.\n\nOnce the session key K is established, the Clipper Chip is used to\nencrypt the conversation or message stream M (digitized voice). The\ntelephone security device feeds K and M into the chip to produce two\nvalues:\n\n E[M; K], the encrypted message stream, and \n E[E[K; U] + N; F], a law enforcement field , \n\nwhich are transmitted over the telephone line. The law enforcement\nfield thus contains the session key K encrypted under the unit key U\nconcatenated with the serial number N, all encrypted under the family\nkey F. The law enforcement field is decrypted by law enforcement after\nan authorized wiretap has been installed.\n\nThe ciphertext E[M; K] is decrypted by the receiver\'s device using the\nsession key:\n\n D[E[M; K]; K] = M .\n\n\nCHIP PROGRAMMING AND ESCROW\n\nAll Clipper Chips are programmed inside a SCIF (Secure Compartmented\nInformation Facility), which is essentially a vault. The SCIF contains\na laptop computer and equipment to program the chips. About 300 chips\nare programmed during a single session. The SCIF is located at\nMykotronx.\n\nAt the beginning of a session, a trusted agent from each of the two key\nescrow agencies enters the vault. Agent 1 enters a secret, random\n80-bit value S1 into the laptop and agent 2 enters a secret, random\n80-bit value S2. These random values serve as seeds to generate unit\nkeys for a sequence of serial numbers. Thus, the unit keys are a\nfunction of 160 secret, random bits, where each agent knows only 80.\n \nTo generate the unit key for a serial number N, the 30-bit value N is\nfirst padded with a fixed 34-bit block to produce a 64-bit block N1.\nS1 and S2 are then used as keys to triple-encrypt N1, producing a\n64-bit block R1:\n\n R1 = E[D[E[N1; S1]; S2]; S1] .\n\nSimilarly, N is padded with two other 34-bit blocks to produce N2 and\nN3, and two additional 64-bit blocks R2 and R3 are computed: \n\n R2 = E[D[E[N2; S1]; S2]; S1] \n R3 = E[D[E[N3; S1]; S2]; S1] .\n\nR1, R2, and R3 are then concatenated together, giving 192 bits. The\nfirst 80 bits are assigned to U1 and the second 80 bits to U2. The\nrest are discarded. The unit key U is the XOR of U1 and U2. U1 and U2\nare the key parts that are separately escrowed with the two escrow\nagencies.\n\nAs a sequence of values for U1, U2, and U are generated, they are\nwritten onto three separate floppy disks. The first disk contains a\nfile for each serial number that contains the corresponding key part\nU1. The second disk is similar but contains the U2 values. The third\ndisk contains the unit keys U. Agent 1 takes the first disk and agent\n2 takes the second disk. Thus each agent walks away knowing\nan 80-bit seed and the 80-bit key parts. However, the agent does not\nknow the other 80 bits used to generate the keys or the other 80-bit\nkey parts. \n\nThe third disk is used to program the chips. After the chips are\nprogrammed, all information is discarded from the vault and the agents\nleave. The laptop may be destroyed for additional assurance that no\ninformation is left behind.\n \nThe protocol may be changed slightly so that four people are in the\nroom instead of two. The first two would provide the seeds S1 and S2,\nand the second two (the escrow agents) would take the disks back to\nthe escrow agencies. \n\nThe escrow agencies have as yet to be determined, but they will not\nbe the NSA, CIA, FBI, or any other law enforcement agency. One or\nboth may be independent from the government.\n\n\nLAW ENFORCEMENT USE\n\nWhen law enforcement has been authorized to tap an encrypted line, they\nwill first take the warrant to the service provider in order to get\naccess to the communications line. Let us assume that the tap is in\nplace and that they have determined that the line is encrypted with the\nClipper Chip. The law enforcement field is first decrypted with the\nfamily key F, giving E[K; U] + N. Documentation certifying that a tap\nhas been authorized for the party associated with serial number N is\nthen sent (e.g., via secure FAX) to each of the key escrow agents, who\nreturn (e.g., also via secure FAX) U1 and U2. U1 and U2 are XORed\ntogether to produce the unit key U, and E[K; U] is decrypted to get the\nsession key K. Finally the message stream is decrypted. All this will\nbe accomplished through a special black box decoder.\n\n\nCAPSTONE: THE NEXT GENERATION\n\nA successor to the Clipper Chip, called "Capstone" by the government\nand "MYK-80" by Mykotronx, has already been developed. It will include\nthe Skipjack algorithm, the Digital Signature Standard (DSS), the\nSecure Hash Algorithm (SHA), a method of key exchange, a fast\nexponentiator, and a randomizer. A prototoype will be available for\ntesting on April 22, and the chips are expected to be ready for\ndelivery in June or July.\n',
'doc_id': 5011,
'topic_proba': 0.7120663836493891,
'topic_id': 22}

Compiling a Sample of Interesting Documents¶

Let's combine these technology-related documents into a set of positive examples of technology-focused posts. We can use these documents as seeds to find new documents about technology. To measure semantic similarity among documents, we will represent each document by its topic probability distribution.

In [25]:

tech_topics = [51, 85, 94, 22]
tech_probs = tm.get_doctopics(topic_ids=tech_topics)
doc_ids = [doc['doc_id'] for doc in tm.get_docs(topic_ids=tech_topics)]

Let's visualize these technology-focused documents. We will also compile the original newsgroup categories for each document, so that they can be included in the visualization. (This is why we invoked the filter method earler.)

In [26]:

newsgroup_categories = [categories[doc_id] for doc_id in doc_ids]
tm.visualize_documents(doc_topics=tech_probs, extra_info={'cat': newsgroup_categories, 'doc_id':doc_ids})

reducing to 2 dimensions...[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1393 samples in 0.009s...
[t-SNE] Computed neighbors for 1393 samples in 0.264s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1393
[t-SNE] Computed conditional probabilities for sample 1393 / 1393
[t-SNE] Mean sigma: 0.082664
[t-SNE] KL divergence after 250 iterations with early exaggeration: 65.742271
[t-SNE] KL divergence after 1000 iterations: 0.995216
done.

Loading BokehJS ...

Scoring Documents by Similarity¶

Once you've identified a set of documents that are interesting to your use case, you may want to identify additional documents that are semantically similar to this set. Here, suppose we wanted to identify new documents that are related to computer technology. We can accomplish this with the train_scorer and score methods. The train_scorer method compiles a list of seed documents based on supplied topic_ids or doc_ids. The score method scores new documents based on their similarity to the seed documents. Internally, this is accomplished by training a rudimentary One-Class classifier. While this classifier can be used as is, it is also useful to use the score method to help compile a training set for a traditinal binary classifier.

In [27]:

tm.train_scorer(topic_ids=tech_topics)

We can now invoke the score method to measure the degree to which new documents are similar to our technology-related topics. Note that, although we are applying the scorer to documents within the set corpus used to train the topic model, this is not required. Our scorer can be applied to any arbitrary set of documents.

Let's retrieve the text associated with all documents not associated with our selected technology-focused topics. These documents stored in other_texts are present in the original corpus. You could also construct other_texts to contain an entirely new, unseen corpus of documents.

In [28]:

other_topics = [i for i in range(tm.n_topics) if i not in tech_topics]
other_texts = [d['text'] for d in tm.get_docs(topic_ids=other_topics)]

Let's score these documents and place into a Pandas dataframe.

In [29]:

# score documents based on similarity
other_scores = tm.score(other_texts)

In [30]:

# display results in Pandas dataframe
other_preds = [int(score > 0) for score in other_scores]
data = sorted(list(zip(other_preds, other_scores, other_texts)), key=lambda item:item[1], reverse=True)
print('Top Inliers (or Most Similar to Our Technology-Related Topics')
print('\t\tNumber of Predicted Inliners: %s' % sum(other_preds))
df = pd.DataFrame(data, columns=['Prediction', 'Score', 'Text'])
df.head()

Top Inliers (or Most Similar to Our Technology-Related Topics
		Number of Predicted Inliners: 377

Out[30]:

	Prediction	Score	Text
0	1	0.212587	I'm looking for recommendations for a laser printer. It will\nbe used mostly for text by a single user. It doesn't need to\nbe a postscript printer. Any advice would be appreciated.\n
1	1	0.211690	I get the picture, I just find it humorous that Running Windows 3.1 apps ( 3.0 for 2.0 ) \nis what makes os/2 more credible...
2	1	0.211690	Two-part question:\n\n1) What is Windows NT - a 'real' windows OS?\n\n2) This past weekend, a local 'hacker' radio show metioned a new product\n from Microsoft called 'Chicago' if I recall. Anyone know what this is?\n\nThat is it -\n\nThanks a heap.\n\n- Alan\n
3	1	0.211690	Is there any one know:\n\nWhat is the FTP tool for Windows and where to get the tool ?\n\nThanks for any help !!
4	1	0.205488	Could someone point me toward a source (FTP/BBS/whatever) for development\ntools for the 8051 microprocessor. I specifically am looking for a Macintosh\ncross-assembler/disassembler. Also, is there a mailing-list dedicated to\ndiscussing the 8051? Thanks.\n

As you can see, we see we've found additional technology-related posts in the dataset.

Our scorer assigns a score to each document, where higher scores indicate a higher degree of similarity to technology-related seed docments. The scorer implements a decision function to make binary decisions on similarity such that documents with positive scores are deemed as similar and negative scores are deemed dissimilar. We've used this to create a prediction of 1 for similar and 0 for dissimilar. This identifies 377 documents as similar. The scorer, however, employs a One-Class classifier, which tends to be more strict. That is, there are likely documents with negative scores close to zero that are also similar. Let's look at these.

In [31]:

df[df.Score <=0].head()

Out[31]:

	Score	Text
377	-0.000711	I was at avalon today and found texture maps in some "tex" and "txc"\nformat, something I've never encountered before. These are obviously\nnot tex or LaTeX files.\n\nIF you have a clue how I can convert these to something\nreasonable, please let me know.
378	-0.002478	I need to be able to cause a beep, but without using any interrupt\nroutines, as I cannot use the BIOS. I believe that the PIC might have\nsomething to do with it, but I'm having troubles deciphering the\ninformation I have on it to figure out how to program it!\n\n\tI'm programming all of this in Turbo C, if that makes any\ndiference at all...\n\n\tPlease can anyone help me??!\n\nThanks,
379	-0.003216	\nThe only things you'll be able to salvage from the junior are the floppy drives\nand monitor. The floppies are 360k, and the monitor is CGA, but you will need\nan adaptor cable to use it. The junior does not use standard cards. Unless \nyou're really strapped for cash, you should just junk the thing and buy new \nstuff.\n\nDan\n
380	-0.003326	\nMacintosh II cx with 40 MB HD, 8 MB RAM and 19" monochrome\nmonitor (Ikegami) is for sale.\nAsking $3,000, no reasonable (best) offer will be rejected.\nContact Konrad at (416) 365-0564m Mon-Frii 9-5.\n
381	-0.003883	I edited a few newsgroup from that line (don't like to crosspost THAT\nmuch). I can't compare the two, but I recently got an HP DeskJet 500.\n\nI'm very pleased with the output (remember that I'm used to imagens,\nlaser and postscript printers at school -- looks very good. You have\nto be careful to let it dry before touching it, as it will smudge.\n\nThe deskjet is SLOW. This is in comparison to the other printers I\nmentioned. I have no idea how the bubblejet compares.\n\nThe interface between Win3.1 and the printer is just dandy, I've not\nhad any problems with it.\n\nHope that helps some.\n\n--Cindy\n\n--\nCindy Tittle Moore

As you can see, these documents are also similar and related to technology (albeit slightly different aspects of technology than that of our seed set of documents). Such negatively-scored documents are useful for identifying so-called informative examples. Since documents are sorted by score (descending order), we can start at the beginning of the dataframe containing negatively-scored documents and add documents to the positive class until we start seeing negative documents that are not related to technology. These informative negative examples can, then, be added to a negative class for training a traditional binary classsifier. This process is referred to as active learning.

For instance, in this example, scores below -0.5 start to become unrelated to the themes covered by our technical topics.

In [32]:

df[(df.Score<-0.51)].head()

Out[32]:

	Score	Text
2217	-0.510036	\nDon't forget Chemical Abstracts Service (which is pretty much the international\nclearinghouse for all chemical information), whose former director (Ronald\nWigington) and head of R&D (Nick Farmer) were openly former NSA employees.
2218	-0.510492	From article <1993Apr21.013846.1374@cx5.com>, by tlc@cx5.com:\n\nAccording to my ColoRIX manual .SCF files are 640x480x256\n\n\nYou may try VPIC, I think it handles the 256 color RIX files OK..\n
2219	-0.510556	What about disks? Won't it erase them if you're carrying them in the bag?
2220	-0.510582	was\nYuppies\nstarted\nYep, that's when I noticed it too. I stopped replacing the hood badge \nafter the second or third one (at $12.00 each).\n\n2002 drivers used to flash their headlight at each other in greeting. Try \nflashing your headlights at a 318i driver and see what kind of look you \nget. They usually check their radar detector...they think you're alerting \nthem to a cop.
2221	-0.510692	refrettably you are mistaken. alt.drugs was used to recruit people for the\nworldwide pot religion. I, however hve no problem being in both of them\n\n

Using Keyword Searches to Construct Seed Sets¶

Let's construct a set of seed documents from a keyword search instead of by LDA-discovered topics. Let's search all the documents for the word 'Christ':

In [33]:

results = tm.search('Christ', case_sensitive=False)

done.

There are 313 of them.

In [34]:

len(results)

Out[34]:

Many documents in this set are about Christianity, as expected:

In [35]:

print(results[0]['text'])

Yep, that's pretty much it. I'm not a Jew but I understand that this is the
Jewish way of thinking. However, the Jews believe that the Covenant between
YHWH and the Patriarchs (Abraham and Moses, in this case) establishes a Moral
Code to follow for mankind. Even the Jews could not decide where the boundaries
fall, though.

As I understand it, the Sadducees believed that the Torah was all that was
required, whereas the Pharisees (the ancestors of modern Judaism) believed that
the Torah was available for interpretation to lead to an understanding of
the required Morality in all its nuances (->Talmud).

The essence of all of this is that Biblical Morality is an interface between
Man and YHWH (for a Jew or Christian) and does not necessarily indicate
anything about YHWH outside of that relationship (although one can speculate).


The trouble with all of this is that we don't really know what the "created
in His image" means. I've heard a number of different opinions on this and
have still not come to any conclusion. This rather upsets the Apple Cart if
one wants to base a Life Script on this shaky foundation (to mix metaphors
unashamedly!) As to living by Christ's example, we know very little about
Jesus as a person. We only have his recorded utterances in a set of narratives
by his followers, and some very small references from comtemporary historians.
Revelation aside, one can only "know" Christ second-hand or worse.

This is not an attempt to debunk Christianity (although it may seem that way
initially), the point I`m trying to make is that we only really have the Bible
to interpret, and that interpretation is by humanity. I guess this is where
Faith or Relevation comes in with all its inherent subjectiveness.


No. There may be an absolute moral code. There are undoubtably multiple
moral codes. The multiple moral codes may be founded in the absolute moral
code. As an example, a parent may tell a child never to swear, and the child
may assume that the parent never swears simply because the parent has told
the child that it is "wrong". Now, the parent may swear like a trooper in
the pub or bar (where there are no children). The "wrongness" here is if
the child disobeys the parent. The parent may feel that it is "inappropriate"
to swear in front of children but may be quite happy to swear in front of
animals. The analogy does not quite hold water because the child knows that
he is of the same type as the parent (and may be a parent later in life) but
you get the gist of it? Incidentally, the young child considers the directive
as absolute until he gets older (see Piaget) and learns a morality of his own.

David.

---
On religion:

However, since we compiled the seed set of documents based on the keyword "Christ", some documents in the set may be only loosely related to Christianity (if at all). We will see below how this impacts results.

Let's construct a positive class from these 313 documents and use them to find other religious documents:

In [36]:

# compile doc_ids
doc_ids = [doc['doc_id'] for doc in results]

# train scorer from document IDs returned by keyword search
tm.train_scorer(doc_ids=doc_ids)

# get text and scores of remaining documents
other_texts = [d['text'] for d in tm.get_docs() if d['doc_id'] not in doc_ids]
other_scores = tm.score(other_texts)

# display results in Pandas dataframe
other_preds = [int(score > 0) for score in other_scores]
data = sorted(list(zip(other_preds, other_scores, other_texts)), key=lambda item:item[1], reverse=True)
print('Top Inliers (or Most Similar to Our Technology-Related Topics')
print('\t\tNumber of Predicted Inliners: %s' % sum(other_preds))
df = pd.DataFrame(data, columns=['Prediction', 'Score', 'Text'])
df.head(3)

Top Inliers (or Most Similar to Our Technology-Related Topics
		Number of Predicted Inliners: 4759

Out[36]:

	Prediction	Score	Text
0	1	0.418117	\nAnd does it not say in scripture that no man knows the hour of His coming, not\neven the angels in Heaven but only the Father Himself? DK was trying to play\nGod by breaking the seals himself. DK killed himself and as many of his\nfollowers as he could. BTW, God did save the children. They are in Heaven,\na far better place. How do I know? By faith.\n\nGod be with you,
1	1	0.409673	\n \nFirst of all, the original poster misquoted. The reference is from 2 Tim 3:16.\nThe author was Paul, and his revelations were anything but "(at best) \nsecond-hand".\n\n\t"And is came about that as [Saul] journeyed, he was approaching\n\t Damascus, and suddenly a light from heaven flashed around him; and\n\t he fell to the ground, and heard a voice saying to him, "Saul, Saul,\n\t why are you persecuting Me?" And he said, "Who art Thou, Lord?" And\n\t He said, "I am Jesus whom you are persecuting, . . ."\n\t\t(Acts 9:3-5, NAS)\n\nPaul received revelation directly from the risen Jesus! (Pretty cool, eh?) He\nbecame closely involved with the early church, the leaders of which were \nfollowers of Jesus throughout his ministry on earth.\n\n\nI agree. I don't believe anyone but the Spirit would be able to convince you \nthe Spirit exists. Please don't complain about this being circular. I know\nit is, but really, can anything of the natural world explain the supernatural?\n(This is why revelation is necessary to the authors of the Bible.)\n\n\nThe Spirit is part of God. How much closer to the source can you get?\nThe Greek in 2 Timothy which is sometimes translated as "inspired by God", \nliterally means "God-breathed". In other words, God spoke the actual words \ninto the scriptures. Many theologians and Bible scholars (Dr. James Boice is \none that I can remember off-hand) get quite annoyed by the dryness and \nincompleteness of "inspired by God".\n\n\nThat's what the verse taken from 2 Timothy was all about. The continuity of a \nbook written over a span of 1500 years by more than 40 authors from all walks \nof life is a testimony to the single authorship of God.\n\n\n\nWhat source to you claim to have discovered which has information of superior\nhistoricity to the Bible? Certainly not Josephus' writings, or the writings \nof the Gnostics which were third century, at the earliest.\n\n\nJesus was fully God as well. That's why I'd assert that he is wise.\n\n\nPlease rethink this last paragraph. If there is no God, which seems to be your\ncurrent belief, then Jesus was either a liar or a complete nut because not\nonly did he assert that God exists, but he claimed to be God himself! (regards\nto C.S. Lewis) How then could you have the least bit of respect for Jesus?\n\tIn conclusion, be careful about logically unfounded hypotheses based\non gut feelings about the text and other scholars' unsubstantiated claims. \nThe Bible pleads that we take it in its entirety or throw the whole book out.\n\tAbout your reading of the Bible, not only does the Spirit inspire the\nwriters, but he guides the reader as well. We cannot understand it in the \nleast without the Spirit's guidance:\n\n\t"For to us God revealed them through the Spirit; for the Spirit \n\tsearches all things, even the depths of God." (1 Cor 2:10, NAS)\n \nPeace and may God guide us in wisdom.\n\n+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=\nCarter C. Page \| Of happiness the crown and chiefest part is wisdom,\nA Carpenter's Apprentice \| and to hold God in awe. This is the law that,\ncpage@seas.upenn.edu \| seeing the stricken heart of pride brought down,\n \| we learn when we are old. -Adapted from Sophocles\n+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=+-=-+-=+-=-+=-+-=-+-=-+=-+-=
2	1	0.405350	I differ with our moderator on this. I thought the whole idea of God coming\ndown to earth to live as one of us "subject to sin and death" (as one of\nthe consecration prayers in the Book of Common Prayer (1979) puts it) was\nthat Jesus was tempted, but did not succumb. If sin is not part of the\nbasic definition of humanity, then Jesus "fully human" (Nicea) would not\nbe "subject to sin", but then the Resurrection loses some of its meaning,\nbecause we encounter our humanity most powerfully when we sin. To distinguish\nbetween "human" and "fallen human" makes Jesus less like one of us at the\ntime we need him most.\n\n\nFirst, the Monophysites inherited none of Nestorius's version -- they \nwere on the opposite end of the spectrum from him. Second, the historical\nrecord suggests that the positions attributed to Nestorius were not as\nextreme as his (successful) opponents (who wrote the conventional history)\nclaimed. Mainly Nestorius opposed the term Theotokos for Mary, arguing\n(I think correctly) that a human could not be called Mother of God. I mean,\nin the Athanasian Creed we talk about the Son "uncreate" -- surely even \nArians would concede that Jesus existed long before Mary. Anyway, Nestorius's\nopponents claimed that by saying Mary was not Theotokos, that he claimed\nthat she only gave birth to the human nature of Jesus, which would require\ntwo seperate and distinct natures. The argument fails though, because\nMary simply gave birth to Jesus, who preexisted her either divinely,\nif you accept "Nestorianism" as commonly defined, or both natures intertwined,\na la Chalcedon.\n\nSecond, I am not sure that "Nestorianism" is not a better alternative than\nthe orthodox view. After all, I find it hard to believe that pre-Incarnation\nthat Jesus's human nature was in heaven; likewise post-Ascension. I think\nrather that God came to earth and took our nature upon him. It was a seperate\nnature, capable of being tempted as in Gethsemane (since I believe the divine\nnature could never be tempted) but in its moments of weakness the divine nature\nprevailed.\n\nComments on the above warmly appreciated.\n\nJason Albert\n\n[There may be differences in what we mean by "subject to sin". The\noriginal complaint was from someone who didn't see how we could call\nJesus fully human, because he didn't sin. I completely agree that\nJesus was subject to temptation. I simply object to the idea that by\nnot succumbing, he is thereby not fully human. I believe that you do\nnot have to sin in order to be human.\n\nI again apologize for confusing Nestorianism and monophysitism. I\nagree with you, and have said elsewhere, that there's reason to think\nthat not everyone who is associated with heretical positions was in\nfact heretical. There are scholars who maintain that Nestorius was\nnot Nestorian. I have to confess that the first time I read some of\nthe correspondence between Nestorius and his opponents, I thought he\ngot the better of them.\n\nHowever, most scholars do believe that the work that eventually led to\nChalcedon was an advance, and that Nestorius was at the very least\n"rash and dogmatic" (as the editor of "The Christological Controversy"\nrefers to him) in rejecting all approaches other than his own. As\nregular Usenet readers know, narrowness can be just as much an\nimpediment as being wrong. Furthermore, he did say some things that I\nthink are problematical. He responds to a rather mild letter from\nCyril with a flame worthy of Usenet. In it he says "To attribute also\nto [the Logos], in the name of [the incarnation] the characteristics\nof the flesh that has been conjoined with him ... is, my brother,\neither the work of a mind which truly errs in the fashion of the\nGreeks or that of a mind diseased with the insane heresy of Arius and\nApollinaris and the others. Those who are thus carried away with the\nidea of this association are bound, because of it, to make the divine\nLogos have a part in being fed with milk and participate to some\ndegree in growh and stand in need of angelic assistance because of his\nfearfulness ... These things are taken falsely when they are put off\non the deity and they become the occasion of just condemnation for us\nwho perpetrate the falsehood."\n\nIt's all well and good to maintain a proper distinction between\nhumanity and divinity. But the whole concept of incarnation is based\non exactly the idea that the divine Logos does in fact have "to some\ndegree" a part in being born, growing up, and dying. Of course it\nmust be understood that there's a certain indirectness in the Logos'\nparticipation in these things. But there must be some sort of\nidentification between the divine and human, or we don't have an\nincarnation at all. Nestorius seemed to think in black and white\nterms, and missed the sorts of nuances one needs to deal with this\narea.\n\nYou say "I find it hard to believe that pre-Incarnation that Jesus's\nhuman nature was in heaven." I don't think that's required by\northodox doctrine. It's the divine Logos that is eternal.

Here, we've easily found other documents about religion-focused documents that do not explicitly mention Christ. Note that our One-Class classifier predicted a positive label for 4759 documents here. The reason for this large number is that some of the documents in the seed set containing the word 'Christ' may be only loosely related to religion or Christinity (if at all).

This document, for example, mentions 'Christ' somewhere in the post, but is largely unrelated to Christanity or religion:

In [37]:

print(texts[11311][:1024])


Pardon me? History shows that within the last 170 years, Greeks played 
that game twice: They used Istanbul Patriarch Grigorios in 1822 to 
instigate the Morea rebellion that resulted in the massacres of 
the Muslim people. Again, the Orthodox Patriarch Constantine V 
invited the Russian Czar Nicholas II to invade the Ottoman Empire 
'in the name of Jesus,' and save his flock from Ottoman rule. 

Source: "The 'Past' in Medieval and Modern Greek Culture," in Speros
         Vryonis, ed., 'Byzantina kai Metabyzantina,' Vol I (Malibu,
         Calif., 1978).

p. 161.

In the words of Professor Skiotis, "With savage jubilance, [the Greeks]
sang the words 'Let no Turk remain in the Morea, nor in the whole world.'
The Greeks were determined to achieve to 'Romaiko' in the only way they
knew how: through a war of religious extermination."


  <<The leader of the Ashkenazi community of Corlu complained to the
  president of AIU [Alliance Israelite Universelle] in 1902 about
  persistent Greek attacks against its Jew

When including such documents in the seed set, documents unrelated to Christianity or religion are predicted with a positive label (i.e, false positives). However, higher scoring documents will be related to Christianity, since documents in this dataset containing "Christ" are more likely to be about Christianity and relgion. Thus, as a document similarity scorer, constructing a positive class by pulling in documents based on keywords can still work.

Recommending Similar Documents¶

In the previous section, given a set of seed documents, we scored new documents based on similarity. Here, we will reverse this process. Given a new document, we will find (or recommend) documents that are semantically similar to it from the 20newsgroup corpus.

We must first train the recommender. The train_recommender method trains a Nearest Neighbors model that can be used to perform semantic searches and generate document recommendations on your dataset.

In [38]:

tm.train_recommender()

Now, let's create some text about space exploration and recommend the top newsgroup posts similar to this text.

In [39]:

rawtext = """
            Elon Musk leads Space Exploration Technologies (SpaceX), where he oversees
            the development and manufacturing of advanced rockets and spacecraft for missions
            to and beyond Earth orbit.
            """

Here is the top recommended 20newsgroup post based on semantic similarity to the text above. This can be considered a semantic text search as the recommend method will return documents that are semantically-related to the supplied text.

In [40]:

for i, doc in enumerate(tm.recommend(text=rawtext, n=1)):
    print('RESULT #%s'% (i+1))
    print('TEXT:\n\t%s' % (doc['text']))
    print('NEWSGROUP:\n\t%s'% (categories[doc['doc_id']]))
    print('TOPIC:\n\t%s' % (tm.topics[doc['topic_id']]))
    print()

RESULT #1
TEXT:
	Archive-name: space/new_probes
Last-modified: $Date: 93/04/01 14:39:17 $

UPCOMING PLANETARY PROBES - MISSIONS AND SCHEDULES

    Information on upcoming or currently active missions not mentioned below
    would be welcome. Sources: NASA fact sheets, Cassini Mission Design
    team, ISAS/NASDA launch schedules, press kits.


    ASUKA (ASTRO-D) - ISAS (Japan) X-ray astronomy satellite, launched into
    Earth orbit on 2/20/93. Equipped with large-area wide-wavelength (1-20
    Angstrom) X-ray telescope, X-ray CCD cameras, and imaging gas
    scintillation proportional counters.


    CASSINI - Saturn orbiter and Titan atmosphere probe. Cassini is a joint
    NASA/ESA project designed to accomplish an exploration of the Saturnian
    system with its Cassini Saturn Orbiter and Huygens Titan Probe. Cassini
    is scheduled for launch aboard a Titan IV/Centaur in October of 1997.
    After gravity assists of Venus, Earth and Jupiter in a VVEJGA
    trajectory, the spacecraft will arrive at Saturn in June of 2004. Upon
    arrival, the Cassini spacecraft performs several maneuvers to achieve an
    orbit around Saturn. Near the end of this initial orbit, the Huygens
    Probe separates from the Orbiter and descends through the atmosphere of
    Titan. The Orbiter relays the Probe data to Earth for about 3 hours
    while the Probe enters and traverses the cloudy atmosphere to the
    surface. After the completion of the Probe mission, the Orbiter
    continues touring the Saturnian system for three and a half years. Titan
    synchronous orbit trajectories will allow about 35 flybys of Titan and
    targeted flybys of Iapetus, Dione and Enceladus. The objectives of the
    mission are threefold: conduct detailed studies of Saturn's atmosphere,
    rings and magnetosphere; conduct close-up studies of Saturn's
    satellites, and characterize Titan's atmosphere and surface.

    One of the most intriguing aspects of Titan is the possibility that its
    surface may be covered in part with lakes of liquid hydrocarbons that
    result from photochemical processes in its upper atmosphere. These
    hydrocarbons condense to form a global smog layer and eventually rain
    down onto the surface. The Cassini orbiter will use onboard radar to
    peer through Titan's clouds and determine if there is liquid on the
    surface. Experiments aboard both the orbiter and the entry probe will
    investigate the chemical processes that produce this unique atmosphere.

    The Cassini mission is named for Jean Dominique Cassini (1625-1712), the
    first director of the Paris Observatory, who discovered several of
    Saturn's satellites and the major division in its rings. The Titan
    atmospheric entry probe is named for the Dutch physicist Christiaan
    Huygens (1629-1695), who discovered Titan and first described the true
    nature of Saturn's rings.

	 Key Scheduled Dates for the Cassini Mission (VVEJGA Trajectory)
	 -------------------------------------------------------------
	   10/06/97 - Titan IV/Centaur Launch
	   04/21/98 - Venus 1 Gravity Assist
	   06/20/99 - Venus 2 Gravity Assist
	   08/16/99 - Earth Gravity Assist
	   12/30/00 - Jupiter Gravity Assist
	   06/25/04 - Saturn Arrival
	   01/09/05 - Titan Probe Release
	   01/30/05 - Titan Probe Entry
	   06/25/08 - End of Primary Mission
	    (Schedule last updated 7/22/92)


    GALILEO - Jupiter orbiter and atmosphere probe, in transit. Has returned
    the first resolved images of an asteroid, Gaspra, while in transit to
    Jupiter. Efforts to unfurl the stuck High-Gain Antenna (HGA) have
    essentially been abandoned. JPL has developed a backup plan using data
    compression (JPEG-like for images, lossless compression for data from
    the other instruments) which should allow the mission to achieve
    approximately 70% of its original objectives.

	   Galileo Schedule
	   ----------------
	   10/18/89 - Launch from Space Shuttle
	   02/09/90 - Venus Flyby
	   10/**/90 - Venus Data Playback
	   12/08/90 - 1st Earth Flyby
	   05/01/91 - High Gain Antenna Unfurled
	   07/91 - 06/92 - 1st Asteroid Belt Passage
	   10/29/91 - Asteroid Gaspra Flyby
	   12/08/92 - 2nd Earth Flyby
	   05/93 - 11/93 - 2nd Asteroid Belt Passage
	   08/28/93 - Asteroid Ida Flyby
	   07/02/95 - Probe Separation
	   07/09/95 - Orbiter Deflection Maneuver
	   12/95 - 10/97 - Orbital Tour of Jovian Moons
	   12/07/95 - Jupiter/Io Encounter
	   07/18/96 - Ganymede
	   09/28/96 - Ganymede
	   12/12/96 - Callisto
	   01/23/97 - Europa
	   02/28/97 - Ganymede
	   04/22/97 - Europa
	   05/31/97 - Europa
	   10/05/97 - Jupiter Magnetotail Exploration


    HITEN - Japanese (ISAS) lunar probe launched 1/24/90. Has made
    multiple lunar flybys. Released Hagoromo, a smaller satellite,
    into lunar orbit. This mission made Japan the third nation to
    orbit a satellite around the Moon.


    MAGELLAN - Venus radar mapping mission. Has mapped almost the entire
    surface at high resolution. Currently (4/93) collecting a global gravity
    map.


    MARS OBSERVER - Mars orbiter including 1.5 m/pixel resolution camera.
    Launched 9/25/92 on a Titan III/TOS booster. MO is currently (4/93) in
    transit to Mars, arriving on 8/24/93. Operations will start 11/93 for
    one martian year (687 days).


    TOPEX/Poseidon - Joint US/French Earth observing satellite, launched
    8/10/92 on an Ariane 4 booster. The primary objective of the
    TOPEX/POSEIDON project is to make precise and accurate global
    observations of the sea level for several years, substantially
    increasing understanding of global ocean dynamics. The satellite also
    will increase understanding of how heat is transported in the ocean.


    ULYSSES- European Space Agency probe to study the Sun from an orbit over
    its poles. Launched in late 1990, it carries particles-and-fields
    experiments (such as magnetometer, ion and electron collectors for
    various energy ranges, plasma wave radio receivers, etc.) but no camera.

    Since no human-built rocket is hefty enough to send Ulysses far out of
    the ecliptic plane, it went to Jupiter instead, and stole energy from
    that planet by sliding over Jupiter's north pole in a gravity-assist
    manuver in February 1992. This bent its path into a solar orbit tilted
    about 85 degrees to the ecliptic. It will pass over the Sun's south pole
    in the summer of 1993. Its aphelion is 5.2 AU, and, surprisingly, its
    perihelion is about 1.5 AU-- that's right, a solar-studies spacecraft
    that's always further from the Sun than the Earth is!

    While in Jupiter's neigborhood, Ulysses studied the magnetic and
    radiation environment. For a short summary of these results, see
    *Science*, V. 257, p. 1487-1489 (11 September 1992). For gory technical
    detail, see the many articles in the same issue.


    OTHER SPACE SCIENCE MISSIONS (note: this is based on a posting by Ron
    Baalke in 11/89, with ISAS/NASDA information contributed by Yoshiro
    Yamada (yamada@yscvax.ysc.go.jp). I'm attempting to track changes based
    on updated shuttle manifests; corrections and updates are welcome.

    1993 Missions
	o ALEXIS [spring, Pegasus]
	    ALEXIS (Array of Low-Energy X-ray Imaging Sensors) is to perform
	    a wide-field sky survey in the "soft" (low-energy) X-ray
	    spectrum. It will scan the entire sky every six months to search
	    for variations in soft-X-ray emission from sources such as white
	    dwarfs, cataclysmic variable stars and flare stars. It will also
	    search nearby space for such exotic objects as isolated neutron
	    stars and gamma-ray bursters. ALEXIS is a project of Los Alamos
	    National Laboratory and is primarily a technology development
	    mission that uses astrophysical sources to demonstrate the
	    technology. Contact project investigator Jeffrey J Bloch
	    (jjb@beta.lanl.gov) for more information.

	o Wind [Aug, Delta II rocket]
	    Satellite to measure solar wind input to magnetosphere.

	o Space Radar Lab [Sep, STS-60 SRL-01]
	    Gather radar images of Earth's surface.

	o Total Ozone Mapping Spectrometer [Dec, Pegasus rocket]
	    Study of Stratospheric ozone.

	o SFU (Space Flyer Unit) [ISAS]
	    Conducting space experiments and observations and this can be
	    recovered after it conducts the various scientific and
	    engineering experiments. SFU is to be launched by ISAS and
	    retrieved by the U.S. Space Shuttle on STS-68 in 1994.

    1994
	o Polar Auroral Plasma Physics [May, Delta II rocket]
	    June, measure solar wind and ions and gases surrounding the
	    Earth.

	o IML-2 (STS) [NASDA, Jul 1994 IML-02]
	    International Microgravity Laboratory.

	o ADEOS [NASDA]
	    Advanced Earth Observing Satellite.

	o MUSES-B (Mu Space Engineering Satellite-B) [ISAS]
	    Conducting research on the precise mechanism of space structure
	    and in-space astronomical observations of electromagnetic waves.

    1995
	LUNAR-A [ISAS]
	    Elucidating the crust structure and thermal construction of the
	    moon's interior.


    Proposed Missions:
	o Advanced X-ray Astronomy Facility (AXAF)
	    Possible launch from shuttle in 1995, AXAF is a space
	    observatory with a high resolution telescope. It would orbit for
	    15 years and study the mysteries and fate of the universe.

	o Earth Observing System (EOS)
	    Possible launch in 1997, 1 of 6 US orbiting space platforms to
	    provide long-term data (15 years) of Earth systems science
	    including planetary evolution.

	o Mercury Observer
	    Possible 1997 launch.

	o Lunar Observer
	    Possible 1997 launch, would be sent into a long-term lunar
	    orbit. The Observer, from 60 miles above the moon's poles, would
	    survey characteristics to provide a global context for the
	    results from the Apollo program.

	o Space Infrared Telescope Facility
	    Possible launch by shuttle in 1999, this is the 4th element of
	    the Great Observatories program. A free-flying observatory with
	    a lifetime of 5 to 10 years, it would observe new comets and
	    other primitive bodies in the outer solar system, study cosmic
	    birth formation of galaxies, stars and planets and distant
	    infrared-emitting galaxies

	o Mars Rover Sample Return (MRSR)
	    Robotics rover would return samples of Mars' atmosphere and
	    surface to Earch for analysis. Possible launch dates: 1996 for
	    imaging orbiter, 2001 for rover.

	o Fire and Ice
	    Possible launch in 2001, will use a gravity assist flyby of
	    Earth in 2003, and use a final gravity assist from Jupiter in
	    2005, where the probe will split into its Fire and Ice
	    components: The Fire probe will journey into the Sun, taking
	    measurements of our star's upper atmosphere until it is
	    vaporized by the intense heat. The Ice probe will head out
	    towards Pluto, reaching the tiny world for study by 2016.

NEWSGROUP:
	sci.space
TOPIC:
	space nasa earth data launch surface solar moon mission planet

Saving and Restoring the Topic Model¶

The topic model can be saved and restored as follows.

Save the Topic Model:

In [41]:

tm.save('/tmp/tm')

Restore the Topic Model and Rebuild the Document-Topic Matrix

In [42]:

tm = ktrain.text.load_topic_model('/tmp/tm')

done.

In [43]:

tm.build(texts, threshold=0.25)

done.

Note that the scorer and recommender are not saved, only the LDA topic model is saved. So, the scorer and recommender should be retrained prior to use as follows:

In [44]:

tm.train_recommender()

In [45]:

rawtext = """
            Elon Musk leads Space Exploration Technologies (SpaceX), where he oversees
            the development and manufacturing of advanced rockets and spacecraft for missions
            to and beyond Earth orbit.
            """

In [46]:

print(tm.recommend(text=rawtext, n=1)[0]['text'])

Archive-name: space/new_probes
Last-modified: $Date: 93/04/01 14:39:17 $

UPCOMING PLANETARY PROBES - MISSIONS AND SCHEDULES

    Information on upcoming or currently active missions not mentioned below
    would be welcome. Sources: NASA fact sheets, Cassini Mission Design
    team, ISAS/NASDA launch schedules, press kits.


    ASUKA (ASTRO-D) - ISAS (Japan) X-ray astronomy satellite, launched into
    Earth orbit on 2/20/93. Equipped with large-area wide-wavelength (1-20
    Angstrom) X-ray telescope, X-ray CCD cameras, and imaging gas
    scintillation proportional counters.


    CASSINI - Saturn orbiter and Titan atmosphere probe. Cassini is a joint
    NASA/ESA project designed to accomplish an exploration of the Saturnian
    system with its Cassini Saturn Orbiter and Huygens Titan Probe. Cassini
    is scheduled for launch aboard a Titan IV/Centaur in October of 1997.
    After gravity assists of Venus, Earth and Jupiter in a VVEJGA
    trajectory, the spacecraft will arrive at Saturn in June of 2004. Upon
    arrival, the Cassini spacecraft performs several maneuvers to achieve an
    orbit around Saturn. Near the end of this initial orbit, the Huygens
    Probe separates from the Orbiter and descends through the atmosphere of
    Titan. The Orbiter relays the Probe data to Earth for about 3 hours
    while the Probe enters and traverses the cloudy atmosphere to the
    surface. After the completion of the Probe mission, the Orbiter
    continues touring the Saturnian system for three and a half years. Titan
    synchronous orbit trajectories will allow about 35 flybys of Titan and
    targeted flybys of Iapetus, Dione and Enceladus. The objectives of the
    mission are threefold: conduct detailed studies of Saturn's atmosphere,
    rings and magnetosphere; conduct close-up studies of Saturn's
    satellites, and characterize Titan's atmosphere and surface.

    One of the most intriguing aspects of Titan is the possibility that its
    surface may be covered in part with lakes of liquid hydrocarbons that
    result from photochemical processes in its upper atmosphere. These
    hydrocarbons condense to form a global smog layer and eventually rain
    down onto the surface. The Cassini orbiter will use onboard radar to
    peer through Titan's clouds and determine if there is liquid on the
    surface. Experiments aboard both the orbiter and the entry probe will
    investigate the chemical processes that produce this unique atmosphere.

    The Cassini mission is named for Jean Dominique Cassini (1625-1712), the
    first director of the Paris Observatory, who discovered several of
    Saturn's satellites and the major division in its rings. The Titan
    atmospheric entry probe is named for the Dutch physicist Christiaan
    Huygens (1629-1695), who discovered Titan and first described the true
    nature of Saturn's rings.

	 Key Scheduled Dates for the Cassini Mission (VVEJGA Trajectory)
	 -------------------------------------------------------------
	   10/06/97 - Titan IV/Centaur Launch
	   04/21/98 - Venus 1 Gravity Assist
	   06/20/99 - Venus 2 Gravity Assist
	   08/16/99 - Earth Gravity Assist
	   12/30/00 - Jupiter Gravity Assist
	   06/25/04 - Saturn Arrival
	   01/09/05 - Titan Probe Release
	   01/30/05 - Titan Probe Entry
	   06/25/08 - End of Primary Mission
	    (Schedule last updated 7/22/92)


    GALILEO - Jupiter orbiter and atmosphere probe, in transit. Has returned
    the first resolved images of an asteroid, Gaspra, while in transit to
    Jupiter. Efforts to unfurl the stuck High-Gain Antenna (HGA) have
    essentially been abandoned. JPL has developed a backup plan using data
    compression (JPEG-like for images, lossless compression for data from
    the other instruments) which should allow the mission to achieve
    approximately 70% of its original objectives.

	   Galileo Schedule
	   ----------------
	   10/18/89 - Launch from Space Shuttle
	   02/09/90 - Venus Flyby
	   10/**/90 - Venus Data Playback
	   12/08/90 - 1st Earth Flyby
	   05/01/91 - High Gain Antenna Unfurled
	   07/91 - 06/92 - 1st Asteroid Belt Passage
	   10/29/91 - Asteroid Gaspra Flyby
	   12/08/92 - 2nd Earth Flyby
	   05/93 - 11/93 - 2nd Asteroid Belt Passage
	   08/28/93 - Asteroid Ida Flyby
	   07/02/95 - Probe Separation
	   07/09/95 - Orbiter Deflection Maneuver
	   12/95 - 10/97 - Orbital Tour of Jovian Moons
	   12/07/95 - Jupiter/Io Encounter
	   07/18/96 - Ganymede
	   09/28/96 - Ganymede
	   12/12/96 - Callisto
	   01/23/97 - Europa
	   02/28/97 - Ganymede
	   04/22/97 - Europa
	   05/31/97 - Europa
	   10/05/97 - Jupiter Magnetotail Exploration


    HITEN - Japanese (ISAS) lunar probe launched 1/24/90. Has made
    multiple lunar flybys. Released Hagoromo, a smaller satellite,
    into lunar orbit. This mission made Japan the third nation to
    orbit a satellite around the Moon.


    MAGELLAN - Venus radar mapping mission. Has mapped almost the entire
    surface at high resolution. Currently (4/93) collecting a global gravity
    map.


    MARS OBSERVER - Mars orbiter including 1.5 m/pixel resolution camera.
    Launched 9/25/92 on a Titan III/TOS booster. MO is currently (4/93) in
    transit to Mars, arriving on 8/24/93. Operations will start 11/93 for
    one martian year (687 days).


    TOPEX/Poseidon - Joint US/French Earth observing satellite, launched
    8/10/92 on an Ariane 4 booster. The primary objective of the
    TOPEX/POSEIDON project is to make precise and accurate global
    observations of the sea level for several years, substantially
    increasing understanding of global ocean dynamics. The satellite also
    will increase understanding of how heat is transported in the ocean.


    ULYSSES- European Space Agency probe to study the Sun from an orbit over
    its poles. Launched in late 1990, it carries particles-and-fields
    experiments (such as magnetometer, ion and electron collectors for
    various energy ranges, plasma wave radio receivers, etc.) but no camera.

    Since no human-built rocket is hefty enough to send Ulysses far out of
    the ecliptic plane, it went to Jupiter instead, and stole energy from
    that planet by sliding over Jupiter's north pole in a gravity-assist
    manuver in February 1992. This bent its path into a solar orbit tilted
    about 85 degrees to the ecliptic. It will pass over the Sun's south pole
    in the summer of 1993. Its aphelion is 5.2 AU, and, surprisingly, its
    perihelion is about 1.5 AU-- that's right, a solar-studies spacecraft
    that's always further from the Sun than the Earth is!

    While in Jupiter's neigborhood, Ulysses studied the magnetic and
    radiation environment. For a short summary of these results, see
    *Science*, V. 257, p. 1487-1489 (11 September 1992). For gory technical
    detail, see the many articles in the same issue.


    OTHER SPACE SCIENCE MISSIONS (note: this is based on a posting by Ron
    Baalke in 11/89, with ISAS/NASDA information contributed by Yoshiro
    Yamada (yamada@yscvax.ysc.go.jp). I'm attempting to track changes based
    on updated shuttle manifests; corrections and updates are welcome.

    1993 Missions
	o ALEXIS [spring, Pegasus]
	    ALEXIS (Array of Low-Energy X-ray Imaging Sensors) is to perform
	    a wide-field sky survey in the "soft" (low-energy) X-ray
	    spectrum. It will scan the entire sky every six months to search
	    for variations in soft-X-ray emission from sources such as white
	    dwarfs, cataclysmic variable stars and flare stars. It will also
	    search nearby space for such exotic objects as isolated neutron
	    stars and gamma-ray bursters. ALEXIS is a project of Los Alamos
	    National Laboratory and is primarily a technology development
	    mission that uses astrophysical sources to demonstrate the
	    technology. Contact project investigator Jeffrey J Bloch
	    (jjb@beta.lanl.gov) for more information.

	o Wind [Aug, Delta II rocket]
	    Satellite to measure solar wind input to magnetosphere.

	o Space Radar Lab [Sep, STS-60 SRL-01]
	    Gather radar images of Earth's surface.

	o Total Ozone Mapping Spectrometer [Dec, Pegasus rocket]
	    Study of Stratospheric ozone.

	o SFU (Space Flyer Unit) [ISAS]
	    Conducting space experiments and observations and this can be
	    recovered after it conducts the various scientific and
	    engineering experiments. SFU is to be launched by ISAS and
	    retrieved by the U.S. Space Shuttle on STS-68 in 1994.

    1994
	o Polar Auroral Plasma Physics [May, Delta II rocket]
	    June, measure solar wind and ions and gases surrounding the
	    Earth.

	o IML-2 (STS) [NASDA, Jul 1994 IML-02]
	    International Microgravity Laboratory.

	o ADEOS [NASDA]
	    Advanced Earth Observing Satellite.

	o MUSES-B (Mu Space Engineering Satellite-B) [ISAS]
	    Conducting research on the precise mechanism of space structure
	    and in-space astronomical observations of electromagnetic waves.

    1995
	LUNAR-A [ISAS]
	    Elucidating the crust structure and thermal construction of the
	    moon's interior.


    Proposed Missions:
	o Advanced X-ray Astronomy Facility (AXAF)
	    Possible launch from shuttle in 1995, AXAF is a space
	    observatory with a high resolution telescope. It would orbit for
	    15 years and study the mysteries and fate of the universe.

	o Earth Observing System (EOS)
	    Possible launch in 1997, 1 of 6 US orbiting space platforms to
	    provide long-term data (15 years) of Earth systems science
	    including planetary evolution.

	o Mercury Observer
	    Possible 1997 launch.

	o Lunar Observer
	    Possible 1997 launch, would be sent into a long-term lunar
	    orbit. The Observer, from 60 miles above the moon's poles, would
	    survey characteristics to provide a global context for the
	    results from the Apollo program.

	o Space Infrared Telescope Facility
	    Possible launch by shuttle in 1999, this is the 4th element of
	    the Great Observatories program. A free-flying observatory with
	    a lifetime of 5 to 10 years, it would observe new comets and
	    other primitive bodies in the outer solar system, study cosmic
	    birth formation of galaxies, stars and planets and distant
	    infrared-emitting galaxies

	o Mars Rover Sample Return (MRSR)
	    Robotics rover would return samples of Mars' atmosphere and
	    surface to Earch for analysis. Possible launch dates: 1996 for
	    imaging orbiter, 2001 for rover.

	o Fire and Ice
	    Possible launch in 2001, will use a gravity assist flyby of
	    Earth in 2003, and use a final gravity assist from Jupiter in
	    2005, where the probe will split into its Fire and Ice
	    components: The Fire probe will journey into the Sun, taking
	    measurements of our star's upper atmosphere until it is
	    vaporized by the intense heat. The Ice probe will head out
	    towards Pluto, reaching the tiny world for study by 2016.

In [ ]: