Semantic similarity chatbot (with movie dialog)

By Allison Parrish

I teach programming, arts, and design, and a perennial project idea is to make a chatbot that mimics someone or something: a famous author, a historical figure, or even the student's own e-mails or messaging logs. This notebook and the software described herein are intended to give those students some sample code to work with and a bit of a head start on concepts and architecture. (In particular, this material was inspired by conversations I had with Utsav Chadha and Nouf Aljowaysir during the Spring 2018 semester at ITP.)

In the notebook, I'll show how the chatbot works and build an example chatbot using the Cornell Movie Dialog Corpus. Even if you don't know anything about programming or natural language processing or machine learning or whatever, you can step through the cells in this notebook and play around with the chatbot itself at the very end.

TLDR version: To run the chatbot, just keep hitting shift+enter until you reach the end. (A bunch of stuff needs to download and build, so it'll take a few minutes. Sorry.)

Content warning: The Cornell Movie Dialog Corpus has dialog from many movies, including some with potentially objectionable content. When playing around with this code, you might see text from the dialog of these films, including (in some cases) violent language and slurs directed at marginalized groups. If you make a chatbot with this code and this corpus and make it available to a wide audience, consider including a content warning similar to this one and/or filtering the corpus and output of the bot to exclude words and sentiments like this.

Making a chatbot the easy way

There are lots of ways to author chatbots, but many of them are oriented toward particular use cases (e.g., automating customer service) and require extensive hand-authoring of content or hand-labelling of data. Others (e.g., those that use seq2seq) require you to train a neural network from scratch, which is fine if you're into that kind of thing, but can sometimes feel like a rotten way to spend your money and your afternoon (or weekend, or month, or whatever).

The chatbot in this notebook won't pass a Turing test or push percentage points on any machine learning accuracy evaluations, but it (a) is easy to understand, (b) works with any corpus, (c) doesn't require training a new model, and (d) is uncannily faithful to whatever source material you give it while still being amusingly bizarre. From a technical perspective, you can think of it as a sort of low-rent version of Talk to Books, which (as I understand it) works along similar principles.

So how does this chatbot work? To answer that question we have to think about how conversations work.

Defining the conversation

For the purposes of this chatbot, let's make a very simple "toy" definition of conversation. We'll say that a conversation consists of two people taking turns at making utterances. We'll call any individual utterance a turn. When one participant finishes their turn, the next participant can take their own turn; we'll call this second turn a response to the first. The conversation continues this way, with each turn being a response to the previous turn, until it comes to an end (usually due to a mutual agreement reached by the participants, which in the case of our chatbot, means whenever the human gets sick of chatting and closes the browser tab).

To illustrate, here's a simple conversation I just invented between two participants, A and B. The first column numbers the turns, the second column labels the participant, and the third column gives the text of the turn:

# P Text
1 A Hello.
2 B Good to see you!
3 A I'm reading a tutorial on semantic similarity and chatbots. It's quite interesting.
4 B Thanks for letting me know.
5 A Any time. Well, I gotta go.
6 B Talk to you soon!
7 A Goodbye.

This fascinating conversation has seven turns. Turn 2 is the response to turn 1, turn 3 is the response to turn 2, etc.
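
In code, this toy definition needs very little: a list of turns, plus a lookup from each turn to the turn that responds to it. The sketch below is only an illustration of that structure; the names `turns` and `responses` are placeholders chosen for this example.

```python
# The invented conversation, as (participant, text) pairs in order.
turns = [
    ("A", "Hello."),
    ("B", "Good to see you!"),
    ("A", "I'm reading a tutorial on semantic similarity and chatbots. It's quite interesting."),
    ("B", "Thanks for letting me know."),
    ("A", "Any time. Well, I gotta go."),
    ("B", "Talk to you soon!"),
    ("A", "Goodbye."),
]

# Map each turn number to the number of the turn that responds to it.
# Turns are numbered from 1, as in the table above.
responses = {n: n + 1 for n in range(1, len(turns))}

print(responses[1])                 # 2
print(turns[responses[1] - 1][1])   # Good to see you!
print(7 in responses)               # False
```

The last turn has no entry in `responses`, because nothing was said in reply to it.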

Note: I said this was a "toy" definition for a reason—conversations are actually way more complicated than this. If you're interested in how conversations actually work, check out conversation analysis, a whole subfield of linguistics devoted to this kind of thing.

Taking a turn

At a certain basic level, the job of a chatbot at any moment in a conversation is to produce a conversational turn that seems to plausibly be in response to the turn that preceded it. There are a number of different ways to solve this problem. Our strategy is going to be the following:

  1. Make a database of conversations and the turns that constitute them;
  2. Assign a vector to each turn that corresponds to its meaning (more on this in a second);
  3. When asked to respond to a conversational turn from the user, display the response to the turn in the database most similar in meaning to the user's turn.

For example, take the conversation that I invented earlier. Imagine putting all of these turns into the database and assigning each turn a vector representing its meaning. Our chatbot now has a database of six possible responses (not counting the first turn, since it began the conversation and wasn't in response to any other turn). If the user typed in something like...

> Howdy!

... our chatbot would then search its database for the turn closest in meaning to Howdy! Maybe that turn is turn #1 (Hello.). The chatbot would then display the turn that happened in response to turn #1 (i.e., turn #2, Good to see you!). If the user typed in...

> Thank you for the great conversation!

... our chatbot would find the turn in its database closest in meaning, maybe turn #4 (Thanks for letting me know.), and then print out its associated response (turn #5, Any time. Well, I gotta go.). The final transcript of this imaginary (and admittedly a little contrived) conversation, with the human's turn labelled with H and the bot as B:

H: Howdy!
B: Good to see you!
H: Thank you for the great conversation!
B: Any time. Well, I gotta go.

Perfectly plausible!

So you can think of this semantic similarity chatbot as a kind of search engine. When you type something into the chat, the chatbot searches its database for the most appropriate response.
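
Here is a minimal sketch of that search, using a few turns from the invented conversation. The two-number "meaning" vectors are made up by hand purely for illustration, and `respond` is a hypothetical helper, not part of any library.

```python
import math

# Each entry: (turn text, hand-made "meaning" vector, response text).
# The vectors are invented for this example only.
database = [
    ("Hello.", (1.0, 0.0), "Good to see you!"),
    ("Thanks for letting me know.", (0.0, 1.0), "Any time. Well, I gotta go."),
    ("Any time. Well, I gotta go.", (-1.0, 0.0), "Talk to you soon!"),
]

def respond(user_vector):
    """Return the stored response to the turn whose vector is closest."""
    closest = min(database, key=lambda entry: math.dist(entry[1], user_vector))
    return closest[2]

# Pretend "Howdy!" was assigned a vector near the one for "Hello."
print(respond((0.9, 0.1)))   # Good to see you!
# Pretend "Thank you for the great conversation!" landed near turn 4.
print(respond((0.1, 0.8)))   # Any time. Well, I gotta go.
```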

Word vectors

"This is all well and good," you say. "But how do you make a computer program that knows how similar in meaning two sentences are? How do you even measure similarity in meaning?" Figuring out a way to measure similarity in meaning is one of the classic problems in computational linguistics, and it's still very much an open problem. But there are certain easy-to-use techniques that are "good enough" for our purposes. In particular, we're going to use word vectors.

I've written a more detailed introduction to word vectors here, if you want the whole story. But the short version is this: using machine learning techniques and a lot of data, it's possible to assign each word a sequence of numbers (i.e., a vector) that encodes the word's meaning. (Actually, it's encoding the word's distribution, or all of the other words that the word is usually seen alongside. But it turns out that this is a good substitute for representing a word's meaning.)

A word vector looks a lot like the Cartesian X, Y coordinates you likely studied in school, except that it usually has many more than two dimensions, often hundreds. (More dimensions means more information about the word's distribution.) For example, here's the vector for the word "cheese" using the fifty-dimensional pre-trained vectors from GloVe:

-0.053903 -0.30871 -1.3285 -0.43342 0.31779 1.5224 -0.6965 -0.037086 -0.83784 0.074107 -0.30532 -0.1783 1.2337 0.085473 0.17362 -0.19001 0.36907 0.49454 -0.024311 -1.0535 0.5237 -1.1489 0.95093 1.1538 -0.52286 -0.14931 -0.97614 1.3912 0.79875 -0.72134 1.5411 -0.15928 -0.30472 1.7265 0.13124 -0.054023 -0.74212 1.675 1.9502 -0.53274 1.1359 0.20027 0.02245 -0.39379 1.0609 1.585 0.17889 0.43556 0.68161 0.066202

Experts have made large databases of word vectors available for people to download and use, so that you don't have to train them yourself. (Though you can train them yourself if you want to.)

Sentence vectors

Importantly, two words with similar meanings will also have similar vectors (meaning, more or less, that all of the numbers in the vectors are similar in value). So you can tell if two words are synonymous by checking the similarity between their vectors.
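
One common way to measure how similar two vectors are is cosine similarity, which compares the directions the vectors point in. The sketch below is plain Python with made-up three-dimensional vectors (not real word vectors), just to show the arithmetic; `cosine_similarity` is a name chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented vectors: "cheddar" points nearly the same way as "cheese",
# while "bicycle" points somewhere else entirely.
cheese = [0.9, 0.1, 0.3]
cheddar = [0.8, 0.2, 0.3]
bicycle = [-0.2, 0.9, -0.4]

print(cosine_similarity(cheese, cheddar) > cosine_similarity(cheese, bicycle))  # True
```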

But what about the meaning of entire sentences? This is a little bit more difficult, and there are a number of different and sophisticated solutions (including Google's Universal Sentence Encoder and doc2vec). It turns out, though, that you can get a pretty good vector for a sentence simply by averaging together the vectors for the words in the sentence. We'll call such vectors sentence vectors or summary vectors.

Intuitively, this makes sense: finding the average is a time-tested method in statistics of characterizing a data set. It's apparently no different with word vectors. This method has the additional benefits of being fast and easy to explain.
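
To make the averaging concrete, here is a small sketch in plain Python. The three-dimensional word vectors are invented for the example (real ones have far more dimensions), and `average_vectors` is a hypothetical helper.

```python
def average_vectors(vectors):
    """Average a list of equal-length vectors, one dimension at a time."""
    count = len(vectors)
    return [sum(dimension) / count for dimension in zip(*vectors)]

# Made-up vectors for the three words of a tiny sentence.
toy_vectors = {
    "i":      [1.0, 0.0, 2.0],
    "like":   [0.0, 3.0, 2.0],
    "cheese": [2.0, 3.0, 5.0],
}

sentence = ["i", "like", "cheese"]
sentence_vector = average_vectors([toy_vectors[word] for word in sentence])
print(sentence_vector)  # [1.0, 2.0, 3.0]
```

The result has the same number of dimensions as each word vector, so sentence vectors can be compared with one another in the same way word vectors can.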

Writing the code

With your understanding of these concepts, we can actually start writing some code. For our semantic similarity chatbot, we need:

  • Pre-trained word vectors
  • A corpus of conversations
  • Some code to parse conversations into turns and map each turn to its response
  • Some code that can average the word vectors in some text to produce a sentence vector
  • A database that will allow us to store sentence vectors and look them up by similarity
  • Some code to take an incoming conversational turn, turn it into a sentence vector, and then look up the most similar vector in the database

Let's take these one-by-one.

Pre-trained word vectors

We're going to use spaCy, a wonderful Python library for natural language processing, both to tokenize text (i.e., turn text into a list of words) and for its database of word vectors. spaCy has already been installed in this notebook.

It turns out that spaCy requires a "model" file, which is a bundle of statistical information that allows the library to parse text into words and parts of speech. While spaCy comes with a model when you install it, that model does not include word vectors, so you'll need to download a model that does include them. For English, I recommend en_core_web_lg. This file was already downloaded when this Binder was generated.

The code in the following cell loads spaCy and that model:

In [1]:
import spacy
nlp = spacy.load('en_core_web_lg')

You can look up the word vector for a particular word using spaCy right out of the box like so:

In [2]:
nlp.vocab['cheese'].vector # replace cheese with whatever word you want!
array([-5.5252e-01,  1.8894e-01,  6.8737e-01, -1.9789e-01,  7.0575e-02,
        1.0075e+00,  5.1789e-02, -1.5603e-01,  3.1941e-01,  1.1702e+00,
       -4.7248e-01,  4.2867e-01, -4.2025e-01,  2.4803e-01,  6.8194e-01,
       -6.7488e-01,  9.2401e-02,  1.3089e+00, -3.6278e-02,  2.0098e-01,
        7.6005e-01, -6.6718e-02, -7.7794e-02,  2.3844e-01, -2.4351e-01,
       -5.4164e-01, -3.3540e-01,  2.9805e-01,  3.5269e-01, -8.0594e-01,
       -4.3611e-01,  6.1535e-01,  3.4212e-01, -3.3603e-01,  3.3282e-01,
        3.8065e-01,  5.7427e-02,  9.9918e-02,  1.2525e-01,  1.1039e+00,
        3.6678e-02,  3.0490e-01, -1.4942e-01,  3.2912e-01,  2.3300e-01,
        4.3395e-01,  1.5666e-01,  2.2778e-01, -2.5830e-02,  2.4334e-01,
       -5.8136e-02, -1.3486e-01,  2.4521e-01, -3.3459e-01,  4.2839e-01,
       -4.8181e-01,  1.3403e-01,  2.6049e-01,  8.9933e-02, -9.3770e-02,
        3.7672e-01, -2.9558e-02,  4.3841e-01,  6.1212e-01, -2.5720e-01,
       -7.8506e-01,  2.3880e-01,  1.3399e-01, -7.9315e-02,  7.0582e-01,
        3.9968e-01,  6.7779e-01, -2.0474e-03,  1.9785e-02, -4.2059e-01,
       -5.3858e-01, -5.2155e-02,  1.7252e-01,  2.7547e-01, -4.4482e-01,
        2.3595e-01, -2.3445e-01,  3.0103e-01, -5.5096e-01, -3.1159e-02,
       -3.4433e-01,  1.2386e+00,  1.0317e+00, -2.2728e-01, -9.5207e-03,
       -2.5432e-01, -2.9792e-01,  2.5934e-01, -1.0421e-01, -3.3876e-01,
        4.2470e-01,  5.8335e-04,  1.3093e-01,  2.8786e-01,  2.3474e-01,
        2.5905e-02, -6.4359e-01,  6.1330e-02,  6.3842e-01,  1.4705e-01,
       -6.1594e-01,  2.5097e-01, -4.4872e-01,  8.6825e-01,  9.9555e-02,
       -4.4734e-02, -7.4239e-01, -5.9147e-01, -5.4929e-01,  3.8108e-01,
        5.5177e-02, -1.0487e-01, -1.2838e-01,  6.0521e-03,  2.8743e-01,
        2.1592e-01,  7.2871e-02, -3.1644e-01, -4.3321e-01,  1.8682e-01,
        6.7274e-02,  2.8115e-01, -4.6222e-02, -9.6803e-02,  5.6091e-01,
       -6.7762e-01, -1.6645e-01,  1.5553e-01,  5.2301e-01, -3.0058e-01,
       -3.7291e-01,  8.7895e-02, -1.7963e-01, -4.4193e-01, -4.4607e-01,
       -2.4122e+00,  3.3738e-01,  6.2416e-01,  4.2787e-01, -2.5386e-01,
       -6.1683e-01, -7.0097e-01,  4.9303e-01,  3.6916e-01, -9.7499e-02,
        6.1411e-01, -4.7572e-03,  4.3916e-01, -2.1551e-01, -5.6745e-01,
       -4.0278e-01,  2.9459e-01, -3.0850e-01,  1.0103e-01,  7.9741e-02,
       -6.3811e-01,  2.4781e-01, -4.4546e-01,  1.0828e-01, -2.3624e-01,
       -5.0838e-01, -1.7001e-01, -7.8735e-01,  3.4073e-01, -3.1830e-01,
        4.5286e-01, -9.5118e-02,  2.0772e-01, -8.0183e-02, -3.7982e-01,
       -4.9949e-01,  4.0759e-02, -3.7724e-01, -8.9705e-02, -6.8187e-01,
        2.2106e-01, -3.9931e-01,  3.2329e-01, -3.6180e-01, -7.2093e-01,
       -6.3404e-01,  4.3125e-01, -4.9743e-01, -1.7395e-01, -3.8779e-01,
       -3.2556e-01,  1.4423e-01, -8.3401e-02, -2.2994e-01,  2.7793e-01,
        4.9112e-01,  6.4511e-01, -7.8945e-02,  1.1171e-01,  3.7264e-01,
        1.3070e-01, -6.1607e-02, -4.3501e-01,  2.8999e-02,  5.6224e-01,
        5.8012e-02,  4.7078e-02,  4.2770e-01,  7.3245e-01, -2.1150e-02,
        1.1988e-01,  7.8823e-02, -1.9106e-01,  3.5278e-02, -3.1102e-01,
        1.3209e-01, -2.8606e-01, -1.5649e-01, -6.4339e-01,  4.4599e-01,
       -3.0912e-01,  4.4520e-01, -3.6774e-01,  2.7327e-01,  6.7833e-01,
       -8.3830e-02, -4.5120e-01,  1.0754e-01, -4.5908e-01,  1.5095e-01,
       -4.5856e-01,  3.4465e-01,  7.8013e-02, -2.8319e-01, -2.8149e-02,
        2.4404e-01, -7.1345e-01,  5.2834e-02, -2.8085e-01,  2.5344e-02,
        4.2979e-02,  1.5663e-01, -7.4647e-01, -1.1301e+00,  4.4135e-01,
        3.1444e-01, -1.0018e-01, -5.3526e-01, -9.0601e-01, -6.4954e-01,
        4.2664e-02, -7.9927e-02,  3.2905e-01, -3.0797e-01, -1.9190e-02,
        4.2765e-01,  3.1460e-01,  2.9051e-01, -2.7386e-01,  6.8483e-01,
        1.9395e-02, -3.2884e-01, -4.8239e-01, -1.5747e-01, -1.6036e-01,
        4.9164e-01, -7.0352e-01, -3.5591e-01, -7.4887e-01, -5.2827e-01,
        4.4983e-02,  5.9247e-02,  4.6224e-01,  8.9697e-02, -7.5618e-01,
        6.3682e-01,  9.0680e-02,  6.8830e-02,  1.8296e-01,  1.0754e-01,
        6.7811e-01, -1.4716e-01,  1.7029e-01, -5.2630e-01,  1.9268e-01,
        9.3130e-01,  8.0363e-01,  6.1324e-01, -3.0494e-01,  2.0236e-01,
        5.8520e-01,  2.6484e-01, -4.5863e-01,  2.1035e-03, -5.6990e-01,
       -4.9092e-01,  4.2511e-01, -1.0954e+00,  1.7124e-01,  2.2495e-01],
      dtype=float32)

It might not look like much, but that list of three hundred numbers is spaCy's idea of what "cheese" means.

Parsing a corpus of conversations

So now we need some data for the bot. In particular, we need some conversations: the text of the turns along with information about which turn is in response to which. Fortunately, some researchers at Cornell University have made available a very interesting corpus of conversations: The Cornell Movie Dialog Corpus, containing "220,579 conversational exchanges between 10,292 pairs of movie characters." Very cool. The data is stored in several plain text files, which you can download by running the following cells:

In [5]:
!curl -L -O
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9684k  100 9684k    0     0  5246k      0  0:00:01  0:00:01 --:--:-- 5246k
In [6]:
   creating: cornell movie-dialogs corpus/
  inflating: cornell movie-dialogs corpus/.DS_Store  
   creating: __MACOSX/
   creating: __MACOSX/cornell movie-dialogs corpus/
  inflating: __MACOSX/cornell movie-dialogs corpus/._.DS_Store  
  inflating: cornell movie-dialogs corpus/chameleons.pdf  
  inflating: __MACOSX/cornell movie-dialogs corpus/._chameleons.pdf  
  inflating: cornell movie-dialogs corpus/movie_characters_metadata.txt  
  inflating: cornell movie-dialogs corpus/movie_conversations.txt  
  inflating: cornell movie-dialogs corpus/movie_lines.txt  
  inflating: cornell movie-dialogs corpus/movie_titles_metadata.txt  
  inflating: cornell movie-dialogs corpus/raw_script_urls.txt  
  inflating: cornell movie-dialogs corpus/README.txt  
  inflating: __MACOSX/cornell movie-dialogs corpus/._README.txt  

We'll be working with two files from this corpus. One file (movie_lines.txt) has the movie lines themselves, associated with a short unique identifier; another file (movie_conversations.txt) has lists of which lines occurred together in conversations, in the order in which they occurred. The following two cells parse these two files and create lookup dictionaries that associate unique IDs to lines (movie_lines) and each line to the line that follows it (responses).

In [7]:
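movie_lines = {}
# the corpus files are not UTF-8, so read them as Latin-1
for line in open("./cornell movie-dialogs corpus/movie_lines.txt",
                 encoding="latin1"):
    line = line.strip()
    parts = line.split(" +++$+++ ")
    if len(parts) == 5:
        # first field is the line ID, last field is the text
        movie_lines[parts[0]] = parts[4]
    else:
        # some lines have no text at all; store an empty string
        movie_lines[parts[0]] = ""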
In [8]:
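import json
responses = {}
for line in open("./cornell movie-dialogs corpus/movie_conversations.txt",
                 encoding="latin1"):
    line = line.strip()
    parts = line.split(" +++$+++ ")
    # the last field is a Python-style list of line IDs; swap the quotes
    # so that it can be parsed as JSON
    line_ids = json.loads(parts[3].replace("'", '"'))
    # pair each line with the line that follows it
    for first, second in zip(line_ids[:-1], line_ids[1:]):
        responses[first] = second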
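
To see what these cells do to a single line, here is a sketch that parses one line in each format. The sample lines are invented (the IDs, character name, and text are not from the corpus), and the field layout is inferred from the parsing code above. It also uses `ast.literal_eval` from the standard library as an alternative way to read the Python-style list of IDs, which avoids the quote-replacing step.

```python
import ast

SEPARATOR = " +++$+++ "

# Invented examples in the same shape as the corpus files.
sample_line = "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Hello there."
sample_conversation = "u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2', 'L3']"

# movie_lines.txt: the first field is the line ID, the fifth is the text.
parts = sample_line.strip().split(SEPARATOR)
line_id, text = parts[0], parts[4]
print(line_id, repr(text))  # L1 'Hello there.'

# movie_conversations.txt: the fourth field is a list of line IDs in order.
parts = sample_conversation.strip().split(SEPARATOR)
line_ids = ast.literal_eval(parts[3])

# Pair each line with the one that follows it.
pairs = list(zip(line_ids[:-1], line_ids[1:]))
print(pairs)  # [('L1', 'L2'), ('L2', 'L3')]
```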

Just to make sure everything works, the cell below prints out five random pairs of conversational turns from the corpus:

In [9]:
import random
# random.sample() needs a sequence, so convert the dict items to a list first
for pair in random.sample(list(responses.items()), 5):
    print("A:", movie_lines[pair[0]])
    print("B:", movie_lines[pair[1]])
    print()
A: I've taken pity on you, my angel. I heard your wish.
B: Oh. Well, thank you! How wonderful. Some people get all the luck.

A: I was here!
B:'s not in the file... I swear... I know your file... your first job was Geneva!... I swear to God you never worked here!...

A: Stop that!
B: Don't make too much noise, Miz Lampert --

A: Is it so unrealistic to think Ruiz, who doesn't even want us here, is throwing us to the wolves? As an apology? And I don't even know what we're dropping off or picking up --
B: We're getting ahead of ourselves. We haven't gotten any sleep. Let's just keep our mouthes shut and not make any mistakes. Now hurry up and get your shit on so we're not late and make things worse.

A: Hey, Sheriff.
B: Down the road a piece is the Golden Sunset, the no-tell motel, Socorro's contribution to international relations. The car's just sitting there, no activity. I've had a couple Hispanic officers casing it all day. Want to take a look?

Making a sentence vector

To make the sentence vector for each line of dialog, we're going to use spaCy. The function sentence_mean below takes the spaCy object that we loaded earlier (nlp) and uses it to tokenize the string that you pass into the function (i.e., break it up into words). It then uses numpy's mean() function to find the average of the vectors, producing a new vector. The shape of the resulting vector (i.e., the number of dimensions) should be the same as the shape of the individual word vectors.

(Note: I disabled the tagger and parser parts of spaCy's pipeline to improve performance. We're not using part of speech tags or dependency relations in this chatbot, so there's no reason to spend time calculating them.)

In [10]:
import numpy as np
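def sentence_mean(nlp, s):
    """Return the average of the word vectors for the tokens in s."""
    # an empty string yields no tokens, so substitute a single space
    if s == "":
        s = " "
    # skip the tagger and parser, which this chatbot doesn't need
    doc = nlp(s, disable=['tagger', 'parser'])
    return np.mean(np.array([w.vector for w in doc]), axis=0)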
sentence_mean(nlp, "This... is a test.").shape

Similarity lookups

Now that we have conversational turns and a way to vectorize those turns, we can make our database for semantic similarity lookup! The kind of "database" we'll need to use for this is an approximate nearest neighbors lookup, which allows you to store items along with the vector that represents them, and then do fast searches to find items with similar vectors (even items that weren't in the original dataset).

I made a Python library called Simple Neighbors that makes it easy to build databases like this. It's a lightweight wrapper around Annoy, an industrial-strength library for approximate nearest neighbors lookup. Simple Neighbors has already been installed in this notebook.
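
To get a feel for what such a lookup does, here is a brute-force version in plain Python: it compares the query against every stored vector and keeps the closest few. This is a sketch of the exact search that approximate libraries stand in for, not how Annoy itself works; the item names and vectors are invented, and `nearest` is a hypothetical helper.

```python
import heapq
import math

# Invented items and two-dimensional vectors.
items = {
    "hello": (1.0, 0.1),
    "hi there": (0.9, 0.2),
    "goodbye": (-1.0, 0.0),
    "see you later": (-0.8, 0.3),
}

def nearest(query, n=2):
    """Return the n item names whose vectors are closest to the query."""
    return heapq.nsmallest(n, items, key=lambda name: math.dist(items[name], query))

print(nearest((0.95, 0.1)))   # ['hello', 'hi there']
print(nearest((-0.9, 0.2)))   # ['see you later', 'goodbye']
```

Checking every item is fine for four entries, but the cost grows with the size of the database. An approximate index trades a little accuracy for much faster lookups.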

The cell below makes a new Simple Neighbors object called nns and initializes it with 300 dimensions (the shape of the word vectors in spaCy, and also the shape of our summary vectors). It then samples ten thousand random conversational turns from the Cornell corpus, finds sentence vectors for each of them, and adds them to the database. (The np.any() line just checks to make sure that we don't add any vectors that are all zeroes by accident—this can mess up the nearest-neighbor search.)

Notes on the code below:

  • I decided to just sample ten thousand turns so that the index will build faster. You can change this number to your liking!
  • It only adds turns that have responses to the database (i.e., keys in the responses lookup). Because of the way the bot works, we don't need to keep track of the last turn of a conversation, since it (by definition) will have no replies.
In [13]:
from simpleneighbors import SimpleNeighbors

nns = SimpleNeighbors(300)
for i, line_id in enumerate(random.sample(list(responses.keys()), 10000)):
    # show progress
    if i % 1000 == 0: print(i, line_id, movie_lines[line_id])
    line_text = movie_lines[line_id]
    summary_vector = sentence_mean(nlp, line_text)
    # skip all-zero vectors, which can mess up the nearest-neighbor search
    if np.any(summary_vector):
        nns.add_one(line_id, summary_vector)
# the index has to be built before it can be searched
nns.build()
0 L120032 Hi.
1000 L638222 All right.  Who wasn't in the O.R.?
2000 L253741 Why talk -- it's over -- it's over -- it's finished. You've broken off negotiations. You did it. You're calling them off. You had nothing on your mind all day, but Manchester, -- Manchester -- Manchester.  You don't suppose for one moment that I'm such a fool as not to have something that I could say definitely about Manchester.
3000 L278169 Said he's not an actor.
4000 L550407 Oh, I know, I know that. Well, sharing your story, your ups and downs, and so forth, can I hope, be an illuminating experience.
5000 L483462 Don't do that again.
6000 L389812 Where we headed?
7000 L388275 Sort it all out.
8000 L484219 I was just going to tell you that I love you.  I said it.
9000 L57643 Have we met?
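
One plausible reason all-zero vectors cause trouble (an assumption here, since the exact behaviour depends on the distance measure the index uses) is that angle-based comparisons scale each vector by its length, and a zero vector has length zero. The plain-Python sketch below shows the problem along with the same kind of test that `np.any()` performs; `normalize` and `has_direction` are hypothetical helpers.

```python
import math

def normalize(vector):
    """Scale a vector to length 1, as angle-based comparisons typically do."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]

def has_direction(vector):
    """True if at least one number is non-zero (the same test as np.any())."""
    return any(vector)

print(normalize([3.0, 4.0]))            # [0.6, 0.8]
print(has_direction([0.0, 0.0, 0.0]))   # False

try:
    normalize([0.0, 0.0, 0.0])
except ZeroDivisionError:
    print("a zero vector has no direction to compare")
```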

Let's take it for a spin! The code in the following cell finds the turn most similar to the string in the variable sentence. (You can change this string to whatever you want.) It then uses the Simple Neighbors object to find the turn in the database with the most similar vector, and then uses the responses lookup to find the response to that turn. That response will be our bot's output.

In [14]:
sentence = "I like making bots."
picked = nns.nearest(sentence_mean(nlp, sentence), 5)[0]
response_line_id = responses[picked]

print("Your line:\n\t", sentence)
print("Most similar turn:\n\t", movie_lines[picked])
print("Response to most similar turn:\n\t", movie_lines[response_line_id])
Your line:
	 I like making bots.
Most similar turn:
	 I know. I called them.
Response to most similar turn:
	 Shouldn't we --

Putting it all together

The code above is all you need to make a conversational chatbot based on semantic similarity. But there's a lot of stuff to keep track of! So I wrote a little bit of "glue code" to make it even easier. You can see the source code on GitHub; all the important stuff is in this file. I'm going to use this library to rewrite the code above in just a few lines, and then we'll use the resulting object to make a chatbot you can use in the browser.

We have already installed this library in the notebook, so you don't need to install it at the moment.

To begin, create a chatbot object, passing in the spaCy language object (nlp) and the number of dimensions:

In [16]:
from semanticsimilaritychatbot import SemanticSimilarityChatbot
chatbot = SemanticSimilarityChatbot(nlp, 300)

The .add_pair() method in the object takes two strings: a turn and the response to that turn. We'll get these from the responses and movie_lines lookups, again sampling ten thousand pairs at random. This cell will take a little while to run:

In [17]:
sample_n = 10000
for first_id, second_id in random.sample(list(responses.items()), sample_n):
    chatbot.add_pair(movie_lines[first_id], movie_lines[second_id])

Once you've built the database, the .response_for() method returns a plausible response from the database, based on semantic similarity. Try it out by changing the text between the quotation marks:

In [19]:
print(chatbot.response_for("Hello computer!"))
Shut up, Jesse.

To add variety, the .response_for() method actually selects randomly among several similar turns. You can change the number of turns it chooses from by passing a second parameter (a number) to the method. In general, the higher the number, the greater the chance is that you'll get an unusual result:

In [20]:
my_turn = "The weather's nice today, don't you think?"
for i in range(5, 51, 5):
    print("picking from", i, "possible responses:")
    print(chatbot.response_for(my_turn, i))
    print()
picking from 5 possible responses:
I buy them flowers.

picking from 10 possible responses:
Is it my imagination or are you just going through the motions?

picking from 15 possible responses:
Jake, I mean, come on --

picking from 20 possible responses:
Your Majesty, Herr Mozart -

picking from 25 possible responses:
I promise i'll take good care of these people, they deserve it, they're dead, all they've got left is their looks.

picking from 30 possible responses:
What sign?

picking from 35 possible responses:
We used to fly together. I'm... John.

picking from 40 possible responses:
I don't know...

picking from 45 possible responses:
He has come to help Mister Burns. Somehow I feel responsible.

picking from 50 possible responses:
Tita, I've got to go...
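
The idea of picking randomly among the closest matches can be sketched as follows. This is only a guess at the general approach, not the library's actual code: the candidate list is invented, and `pick_response` is a hypothetical helper.

```python
import random

# Invented candidates, already sorted from most to least similar.
ranked_responses = [
    "Lovely, isn't it?",
    "I suppose so.",
    "It might rain later.",
    "Bring an umbrella.",
    "Who told you that?",
]

def pick_response(ranked, n):
    """Choose at random among the n most similar candidates."""
    return random.choice(ranked[:n])

# With n=1 the best match always wins.
print(pick_response(ranked_responses, 1))  # Lovely, isn't it?
# With a larger n, less similar (and odder) replies can come through.
print(pick_response(ranked_responses, 5))
```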

The Semantic Similarity Chatbot object has a .save() method that saves the pre-built database to disk, using a filename prefix you supply. (It saves three different files: <prefix>.annoy, <prefix>-data.pkl, and <prefix>-chatbot.pkl.)

In [21]:
chatbot.save("movielines-10k-sample")

You can load a previously saved database with the .load() class method, like so. (This means you don't have to build the database again: you can just load it and start calling .response_for().)

In [45]:
chatbot = SemanticSimilarityChatbot.load("movielines-10k-sample", nlp)
In [46]:
print(chatbot.response_for("It belongs in a museum!"))
I can get it.
In [47]:
print(chatbot.response_for("Hello computer!"))
I'm so burnt-out.
In [50]:
print(chatbot.response_for("Why is that?"))
A passionfruit smoothee.

Try having a conversation now: you can rerun the three cells above, or you can create new cells with the phrases you want to try in them.

Some things to try

If you enjoyed following along, here are some things to try:

  • Use the metadata file that comes with the Cornell corpus to make a chatbot that only uses lines from a particular genre of movie. (How is a comedy chatbot different from an action chatbot?)
  • Use a different corpus of conversation altogether. Your own chat logs? Conversational exchanges from a novel? Transcripts of interviews on news programs?
  • Incorporate some context from the conversation when vectorizing the turns. (You might, for example, include the average of not just the given turn but also the turn that preceded it.)
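
For the last idea, one simple way to fold in context is a weighted average of the current turn's vector and the previous turn's vector. The sketch below is one possible approach under that assumption; `with_context` and its `weight` parameter are hypothetical, and the vectors are invented.

```python
def with_context(current, previous, weight=0.75):
    """Blend two equal-length vectors, giving `weight` to the current turn."""
    return [weight * c + (1 - weight) * p for c, p in zip(current, previous)]

current_turn = [4.0, 0.0, 2.0]
previous_turn = [0.0, 4.0, 2.0]

print(with_context(current_turn, previous_turn))        # [3.0, 1.0, 2.0]
print(with_context(current_turn, previous_turn, 0.5))   # [2.0, 2.0, 2.0]
```

A weight of 0.5 is a plain average of the two turns; higher weights keep the bot focused on what was just said.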