#!/usr/bin/env python
# coding: utf-8

# # Semantic similarity chatbot (with movie dialog)
#
# By [Allison Parrish](http://www.decontextualize.com/)
#
# ![bot screenshot](http://static.decontextualize.com/snaps/semantic-similarity-chatbot.png)
#
# I teach [programming, arts and design](https://itp.nyu.edu/) and a perennial project idea is to make a chatbot that mimics someone or something—a famous author, a historical figure, or even the student's own e-mails or messaging logs. This notebook and the software described herein are intended to give those students some sample code to work with and a bit of a head start on concepts and architecture. (In particular, this material was inspired by conversations I had with [Utsav Chadha](https://itp.nyu.edu/thesis2018/#/student/utsav-chadha) and [Nouf Aljowaysir](https://itp.nyu.edu/thesis2018/#/student/nouf-aljowaysir) during the Spring 2018 semester at ITP.)
#
# In this notebook, I'll show how the chatbot works and build an example chatbot using the [Cornell Movie Dialog Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). Even if you don't know anything about programming or natural language processing or machine learning or whatever, you can step through the cells in this notebook and play around with the chatbot itself at the very end.
#
# > **TLDR version**: To run the chatbot, just keep hitting shift+enter until you reach the end. (A bunch of stuff needs to download and build, so it'll take a few minutes. Sorry.)
#
# > **Content warning**: The Cornell Movie Dialog Corpus has dialog from many movies, including some with potentially objectionable content. When playing around with this code, you might see text from the dialog of these films, including (in some cases) violent language and slurs directed at marginalized groups. If you make a chatbot with this code and this corpus and make it available to a wide audience, consider including a content warning similar to this one and/or filtering the corpus and the bot's output to exclude words and sentiments like this.
#
# ## Making a chatbot the easy way
#
# There are [lots](https://www.rivescript.com/) [of](https://rasa.com/) [ways](https://botpress.io/) to author chatbots, but many of them are oriented toward particular use cases (e.g., automating customer service) and require extensive hand-authoring of content or hand-labelling of data. Others (e.g., those that use seq2seq) require you to train a neural network from scratch, which is fine if you're into that kind of thing, but can sometimes feel like a rotten way to spend your money and your afternoon (or weekend, or month, or whatever).
#
# The chatbot in this notebook won't pass a Turing test or push percentage points on any machine learning accuracy evaluations, but it (a) is easy to understand, (b) works with any corpus, (c) doesn't require training a new model, and (d) is uncannily faithful to whatever source material you give it while still being amusingly bizarre. From a technical perspective, you can think of it as a sort of low-rent version of [Talk to Books](https://books.google.com/talktobooks/), which (as I understand it) works along similar principles.
#
# So how does this chatbot work? To answer that question we have to think about how *conversations* work.
#
# ### Defining the conversation
#
# For the purposes of this chatbot, let's make a very simple "toy" definition of conversation.
# We'll say that a conversation consists of *two people taking turns at making utterances.* We'll call any individual utterance a *turn*. When one participant finishes their turn, the next participant can take their own turn; we'll call this second turn a *response* to the first. The conversation continues this way, with each turn being a response to the previous turn, until it comes to an end (usually due to a mutual agreement reached by the participants, which in the case of our chatbot means whenever the human gets sick of chatting and closes the browser tab).
#
# To illustrate, here's a simple conversation I just invented between two participants, A and B. The first column numbers the turns, the second column labels the participant, and the third column gives the text of the turn:
#
# | # | P | Text |
# |-|-|:-|
# | 1 | A | Hello. |
# | 2 | B | Good to see you! |
# | 3 | A | I'm reading a tutorial on semantic similarity and chatbots. It's quite interesting. |
# | 4 | B | Thanks for letting me know. |
# | 5 | A | Any time. Well, I gotta go. |
# | 6 | B | Talk to you soon! |
# | 7 | A | Goodbye. |
#
# This fascinating conversation has seven turns. Turn 2 is the response to turn 1, turn 3 is the response to turn 2, etc.
#
# > *Note:* I said this was a "toy" definition for a reason—conversations are actually *way* more complicated than this. If you're interested in how conversations actually work, check out [conversation analysis](https://en.wikipedia.org/wiki/Conversation_analysis), a whole subfield of linguistics devoted to this kind of thing.
#
# ### Taking a turn
#
# At a certain basic level, the job of a chatbot at any moment in a conversation is to produce a conversational turn that seems to plausibly be in response to the turn that preceded it. There are a number of different ways to solve this problem. Our strategy is going to be the following (there's a sketch of it in code after the example below):
#
# 1. Make a database of conversations and the turns that constitute them;
# 2. Assign a *vector* to each turn that corresponds to its meaning (more on this in a second);
# 3. When asked to respond to a conversational turn from the user, display the *response* to the turn in the database most similar in meaning to the user's turn.
#
# For example, take the conversation that I invented earlier. Imagine putting all of these turns into the database and assigning each turn a vector representing its meaning. Our chatbot now has a database of six possible responses (not counting the first turn, since it began the conversation and wasn't in response to any other turn). If the user typed in something like...
#
# > Howdy!
#
# ... our chatbot would then search its database for the turn closest in meaning to `Howdy!` Maybe that turn is turn #1 (`Hello.`). The chatbot would then display the turn that happened *in response* to turn #1 (i.e., turn #2, `Good to see you!`). If the user typed in...
#
# > Thank you for the great conversation!
#
# ... our chatbot would find the turn in its database closest in meaning, maybe turn #4 (`Thanks for letting me know.`), and then print out its associated response (turn #5, `Any time. Well, I gotta go.`). The final transcript of this imaginary (and admittedly a little contrived) conversation, with the human's turns labelled `H` and the bot's labelled `B`:
#
#     H: Howdy!
#     B: Good to see you!
#     H: Thank you for the great conversation!
#     B: Any time. Well, I gotta go.
#
# Perfectly plausible!
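# If it helps to see that strategy as code before we build the real thing, here's a tiny sketch with a hand-made "database." Everything here is invented for illustration: the two-dimensional vectors are made-up numbers standing in for the real sentence vectors we'll compute later, and `respond()` is a toy helper, not part of the chatbot library.

import numpy as np

# the turns of the invented conversation above, keyed by turn number
turns = {
    1: "Hello.",
    2: "Good to see you!",
    3: "I'm reading a tutorial on semantic similarity and chatbots. It's quite interesting.",
    4: "Thanks for letting me know.",
    5: "Any time. Well, I gotta go.",
    6: "Talk to you soon!",
    7: "Goodbye.",
}
# map each turn to the turn that responds to it (turn 7 has no response)
toy_responses = {i: i + 1 for i in range(1, 7)}
# made-up two-dimensional "meaning" vectors for the searchable turns 1-6
turn_vectors = {
    1: np.array([0.9, 0.1]), 2: np.array([0.7, 0.6]), 3: np.array([0.1, 0.9]),
    4: np.array([0.2, 0.8]), 5: np.array([0.5, 0.5]), 6: np.array([0.6, 0.4]),
}

def cosine(a, b):
    # cosine similarity: closer to 1.0 means the vectors point the same way
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def respond(user_vector):
    # find the stored turn whose vector is most similar to the user's turn...
    closest = max(turn_vectors, key=lambda i: cosine(user_vector, turn_vectors[i]))
    # ...and reply with whatever followed that turn in the original conversation
    return turns[toy_responses[closest]]

# pretend the user's "Howdy!" vectorizes to something close to turn 1 ("Hello.")
print(respond(np.array([0.85, 0.2])))  # prints: Good to see you!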
# So you can think of this semantic similarity chatbot as a kind of search engine: when you type something into the chat, the chatbot *searches its database for the most appropriate response*.
#
# ### Word vectors
#
# "This is all well and good," you say. "But how do you make a computer program that knows how similar in meaning two sentences are? How do you even *measure* similarity in meaning?" Figuring out a way to measure similarity in meaning is one of the classic problems in computational linguistics, and it's still very much an open problem. But there are certain easy-to-use techniques that are "good enough" for our purposes. In particular, we're going to use *word vectors*.
#
# [I've written a more detailed introduction to word vectors here](https://github.com/aparrish/rwet/blob/master/understanding-word-vectors.ipynb), if you want the whole story. But the short version is this: using machine learning techniques and a lot of data, it's possible to assign each word a sequence of numbers (i.e., a vector) that encodes the word's meaning. (Actually, it's encoding the word's *distribution*: all of the other words that the word is usually seen alongside. But it turns out that this is a good substitute for representing a word's meaning.)
#
# A word vector looks a lot like the Cartesian X, Y coordinates you likely studied in school, except that word vectors usually have many hundreds of dimensions, not just two. (More dimensions means more information about the word's distribution.) For example, here's the vector for the word "cheese" using the fifty-dimensional pre-trained vectors from GloVe:
#
#     -0.053903 -0.30871 -1.3285 -0.43342 0.31779 1.5224 -0.6965 -0.037086 -0.83784 0.074107 -0.30532 -0.1783 1.2337 0.085473 0.17362 -0.19001 0.36907 0.49454 -0.024311 -1.0535 0.5237 -1.1489 0.95093 1.1538 -0.52286 -0.14931 -0.97614 1.3912 0.79875 -0.72134 1.5411 -0.15928 -0.30472 1.7265 0.13124 -0.054023 -0.74212 1.675 1.9502 -0.53274 1.1359 0.20027 0.02245 -0.39379 1.0609 1.585 0.17889 0.43556 0.68161 0.066202
#
# Experts have made [large databases of word vectors available for people to download and use](https://nlp.stanford.edu/projects/glove/), so you don't have to train them yourself. (Though [you can train them yourself if you want to](https://radimrehurek.com/gensim/models/word2vec.html).)
#
# ### Sentence vectors
#
# Importantly, two words with similar meanings will also have similar vectors (meaning, more or less, that the numbers in one word's vector are close in value to the numbers in the other's). So you can tell whether two words are synonymous by checking the similarity between their vectors.
#
# But what about the meaning of *entire sentences*? This is a little bit more difficult, and there are a number of sophisticated solutions (including Google's [Universal Sentence Encoder](https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/2) and [doc2vec](https://radimrehurek.com/gensim/models/doc2vec.html)). It turns out, though, that you can get a pretty good vector for a sentence simply by *averaging together the vectors for the words in the sentence*. We'll call such vectors *sentence vectors* or *summary vectors*.
#
# Intuitively, this makes sense: finding the average is a time-tested method in statistics for characterizing a data set, and it's apparently no different with word vectors. This method has the additional benefits of being fast and easy to explain.
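# Here's the averaging trick in miniature. The three-dimensional "word vectors" below are made-up numbers just for illustration (real word vectors have hundreds of dimensions), but the arithmetic is exactly what we'll do later with spaCy's vectors.

import numpy as np

# pretend word vectors, invented for this example
word_vectors = {
    "i":      np.array([0.1, 0.8, 0.2]),
    "like":   np.array([0.6, 0.3, 0.9]),
    "cheese": np.array([0.9, 0.1, 0.4]),
}

# the sentence vector is the element-wise mean of the word vectors
sentence = ["i", "like", "cheese"]
sentence_vector = np.mean([word_vectors[w] for w in sentence], axis=0)
print(sentence_vector)  # -> [0.53333333 0.4        0.5       ]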
# ## Writing the code
#
# With your understanding of these concepts, we can actually start writing some code. For our semantic similarity chatbot, we need:
#
# * Pre-trained word vectors
# * A corpus of conversations
# * Some code to parse conversations into turns and map each turn to its response
# * Some code that can average the word vectors in some text to produce a sentence vector
# * A database that will allow us to store sentence vectors and look them up by similarity
# * Some code to take an incoming conversational turn, turn it into a sentence vector, and then look up the most similar vector in the database
#
# Let's take these one by one.

# ### Pre-trained word vectors
#
# We're going to use [spaCy](https://spacy.io), a wonderful Python library for natural language processing, both to tokenize text (i.e., turn text into a list of words) and for its database of word vectors. spaCy has already been installed in this notebook.

# spaCy requires a "model" file, which is a bundle of statistical information that allows the library to parse text into words and parts of speech. While spaCy comes with a model when you install it, that model does *not* include word vectors, so you'll need to download a model that does include them. For English, I recommend `en_core_web_lg`. This model was already downloaded when this Binder was built.

# The code in the following cell loads `spacy` and the model:

# In[1]:

import spacy
nlp = spacy.load('en_core_web_lg')

# You can look up the word vector for a particular word using spaCy right out of the box, like so:

# In[2]:

nlp.vocab['cheese'].vector  # replace cheese with whatever word you want!

# It might not look like much, but that list of three hundred numbers is spaCy's idea of what "cheese" means.
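# As a quick sanity check, you can verify that words with similar meanings get similar vectors. spaCy's `.similarity()` method computes the cosine similarity between two vectors (values closer to 1.0 mean more similar). The word pairs below are just examples of my choosing; try your own:

print(nlp.vocab['cheese'].similarity(nlp.vocab['cheddar']))      # relatively high
print(nlp.vocab['cheese'].similarity(nlp.vocab['linguistics']))  # much lower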
# ### Parsing a corpus of conversations
#
# So now we need some data for the bot. In particular, we need some conversations: the text of the turns, along with information about which turn is in response to which. Fortunately, some researchers at Cornell University have made available a very interesting corpus of conversations: [The Cornell Movie Dialog Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html), containing "220,579 conversational exchanges between 10,292 pairs of movie characters." Very cool. The data is stored in several plain text files, which you can download and unzip by running the following cells:

# In[5]:

get_ipython().system('curl -L -O http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip')

# In[6]:

get_ipython().system('unzip cornell_movie_dialogs_corpus.zip')

# We'll be working with two files from this corpus. One file (`movie_lines.txt`) has the movie lines themselves, each associated with a short unique identifier; the other (`movie_conversations.txt`) has lists of which lines occurred together in conversations, in the order in which they occurred. The following two cells parse these files and create lookup dictionaries that associate unique IDs with lines (`movie_lines`) and each line with the line that follows it (`responses`).

# In[7]:

movie_lines = {}
for line in open("./cornell movie-dialogs corpus/movie_lines.txt", encoding="latin1"):
    line = line.strip()
    parts = line.split(" +++$+++ ")
    if len(parts) == 5:
        movie_lines[parts[0]] = parts[4]
    else:
        movie_lines[parts[0]] = ""

# In[8]:

import json

responses = {}
for line in open("./cornell movie-dialogs corpus/movie_conversations.txt", encoding="latin1"):
    line = line.strip()
    parts = line.split(" +++$+++ ")
    # the list of line IDs is stored as a single-quoted Python-style list,
    # so swap the quotes to parse it as JSON
    line_ids = json.loads(parts[3].replace("'", '"'))
    # pair each line with the line that immediately follows it
    for first, second in zip(line_ids[:-1], line_ids[1:]):
        responses[first] = second

# Just to make sure everything works, the cell below prints out five random pairs of conversational turns from the corpus:

# In[9]:

import random

for pair in random.sample(list(responses.items()), 5):
    print("A:", movie_lines[pair[0]])
    print("B:", movie_lines[pair[1]])
    print()

# ### Making a sentence vector
#
# To make the sentence vector for each line of dialog, we're going to use spaCy. The function `sentence_mean()` below takes the spaCy object that we loaded earlier (`nlp`) and uses it to tokenize the string that you pass into the function (i.e., break it up into words). It then uses numpy's `mean()` function to find the average of the word vectors, producing a new vector. The shape of the resulting vector (i.e., its number of dimensions) should be the same as the shape of the individual word vectors.
#
# (Note: I disabled the `tagger` and `parser` components of spaCy's pipeline to improve performance. We're not using part-of-speech tags or dependency relations in this chatbot, so there's no reason to spend time calculating them.)

# In[10]:

import numpy as np

def sentence_mean(nlp, s):
    if s == "":
        s = " "  # spaCy can't handle an empty string, so substitute a space
    doc = nlp(s, disable=['tagger', 'parser'])
    return np.mean(np.array([w.vector for w in doc]), axis=0)

sentence_mean(nlp, "This... is a test.").shape
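# Before moving on, here's a quick check that `sentence_mean()` behaves the way we'd hope: sentences with similar meanings should get similar vectors. The `cosine_similarity()` helper and the example sentences below are just for this check; the nearest-neighbor database in the next section will handle the real lookups.

def cosine_similarity(a, b):
    # cosine similarity: closer to 1.0 means more similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = sentence_mean(nlp, "Hi, how are you doing?")
v2 = sentence_mean(nlp, "Hello, how's it going?")
v3 = sentence_mean(nlp, "The mitochondria is the powerhouse of the cell.")
print(cosine_similarity(v1, v2))  # should be relatively high
print(cosine_similarity(v1, v3))  # should be noticeably lower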
# ### Similarity lookups
#
# Now that we have conversational turns and a way to vectorize those turns, we can make our database for semantic similarity lookups! The kind of "database" we'll need for this is an [approximate nearest neighbors](https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximation_methods) index, which allows you to store items along with the vectors that represent them, and then do fast searches to find items with similar vectors (even for vectors that weren't in the original data set).
#
# [I made a Python library called Simple Neighbors](https://pypi.org/project/simpleneighbors/) to make it easy to build databases like this. It's a lightweight wrapper around [Annoy](https://pypi.python.org/pypi/annoy), an industrial-strength approximate nearest neighbors library. Simple Neighbors has already been installed in this notebook.

# The cell below makes a new Simple Neighbors object called `nns` and initializes it with 300 dimensions (the shape of the word vectors in spaCy, and therefore also the shape of our summary vectors). It then samples ten thousand random conversational turns from the Cornell corpus, finds a sentence vector for each of them, and adds them to the database. (The `np.any()` line just checks to make sure that we don't accidentally add any vectors that are all zeroes, which can mess up the nearest-neighbor search.)
#
# Notes on the code below:
#
# * I decided to sample just ten thousand turns so that the index will build faster. You can change this number to your liking!
# * It only adds *turns that have responses* to the database (i.e., keys in the `responses` lookup). Because of the way the bot works, we don't need to keep track of the last turn of a conversation, since it (by definition) has no response.

# In[13]:

from simpleneighbors import SimpleNeighbors

nns = SimpleNeighbors(300)
for i, line_id in enumerate(random.sample(list(responses.keys()), 10000)):
    if i % 1000 == 0:
        print(i, line_id, movie_lines[line_id])  # show progress
    line_text = movie_lines[line_id]
    summary_vector = sentence_mean(nlp, line_text)
    if np.any(summary_vector):  # skip all-zero vectors
        nns.add_one(line_id, summary_vector)
nns.build()

# Let's take it for a spin! The code in the following cell finds the turn in the database whose vector is most similar to the string in the variable `sentence`. (You can change this string to whatever you want.) It then uses the `responses` lookup to find the *response* to that turn. That response will be our bot's output.

# In[14]:

sentence = "I like making bots."
picked = nns.nearest(sentence_mean(nlp, sentence), 5)[0]
response_line_id = responses[picked]
print("Your line:\n\t", sentence)
print("Most similar turn:\n\t", movie_lines[picked])
print("Response to most similar turn:\n\t", movie_lines[response_line_id])

# ## Putting it all together
#
# The code above is all you need to make a conversational chatbot based on semantic similarity. But there's a lot of stuff to keep track of! So I wrote a little bit of "glue code" to make it even easier. You can [see the source code on GitHub](https://github.com/aparrish/semanticsimilaritychatbot/); all the important stuff is [in this file](https://github.com/aparrish/semanticsimilaritychatbot/blob/master/semanticsimilaritychatbot/__init__.py). I'm going to use this library to rewrite the code above in just a few lines, and then we'll use the resulting object to make a chatbot you can play with in the browser.
#
# This library has also already been installed in the notebook. To start, create a chatbot object, passing in the spaCy language object (`nlp`) and the number of dimensions:

# In[16]:

from semanticsimilaritychatbot import SemanticSimilarityChatbot

chatbot = SemanticSimilarityChatbot(nlp, 300)

# The chatbot object's `.add_pair()` method takes two strings: a turn and the response to that turn. We'll get these from the `responses` and `movie_lines` lookups, again sampling ten thousand pairs at random. This cell will take a little while to run:

# In[17]:

sample_n = 10000
for first_id, second_id in random.sample(list(responses.items()), sample_n):
    chatbot.add_pair(movie_lines[first_id], movie_lines[second_id])
chatbot.build()

# Once you've built the database, the `.response_for()` method returns a plausible response from the database, based on semantic similarity. Try it out by changing the text between the quotation marks:

# In[19]:

print(chatbot.response_for("Hello computer!"))

# To add variety, the `.response_for()` method actually selects randomly among several similar turns. You can change the number of turns it chooses from by passing a second parameter (a number) to the method. In general, the higher the number, the greater the chance that you'll get an unusual result:

# In[20]:

my_turn = "The weather's nice today, don't you think?"
for i in range(5, 51, 5):
    print("picking from", i, "possible responses:")
    print(chatbot.response_for(my_turn, i))
    print()
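# If you'd rather chat back and forth than rerun cells, a bare-bones `input()` loop like the one below works right in the notebook. (This little loop is just a convenience sketch, not part of the chatbot library. Enter a blank line to stop.)

while True:
    your_turn = input("you: ")
    if not your_turn.strip():
        break  # blank line ends the conversation
    print("bot:", chatbot.response_for(your_turn))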
# The Semantic Similarity Chatbot object has a `.save()` method that saves the pre-built database to disk, using a filename prefix you supply. (It saves three different files: `.annoy`, `-data.pkl`, and `-chatbot.pkl`.)

# In[21]:

chatbot.save("movielines-10k-sample")

# You can load a previously saved database with the `.load()` class method, like so. (This means you don't have to build the database all over again: you can just load it and start calling `.response_for()`.)

# In[45]:

chatbot = SemanticSimilarityChatbot.load("movielines-10k-sample", nlp)

# In[46]:

print(chatbot.response_for("It belongs in a museum!"))

# In[47]:

print(chatbot.response_for("Hello computer!"))

# In[50]:

print(chatbot.response_for("Why is that?"))

# Try having a conversation now: you can rerun the three cells above, or create new cells with whatever phrases you want to try.

# ## Some things to try
#
# If you enjoyed following along, here are some things to try:
#
# * Use the metadata file that comes with the Cornell corpus to make a chatbot that only uses lines from a particular genre of movie. (How is a comedy chatbot different from an action chatbot? There's a starter sketch after this list.)
# * Use a different corpus of conversation altogether. Your own chat logs? Conversational exchanges from a novel? Transcripts of interviews on news programs?
# * Incorporate some context from the conversation when vectorizing the turns. (You might, for example, average in not just the given turn but also the turn that preceded it.)
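# To get you started on the first suggestion, here's a sketch of how you might build a comedy-only chatbot. It assumes the field layout described in the corpus README: `movie_lines.txt` stores the movie ID in its third field, and `movie_titles_metadata.txt` stores the genre list in its sixth field. Double-check those positions against the README before relying on them.

import random

# line ID -> movie ID (assumed to be field 3 of movie_lines.txt)
line_movie = {}
for line in open("./cornell movie-dialogs corpus/movie_lines.txt", encoding="latin1"):
    parts = line.strip().split(" +++$+++ ")
    if len(parts) == 5:
        line_movie[parts[0]] = parts[2]

# movie IDs whose genre list (assumed to be field 6) mentions comedy
comedy_movies = set()
for line in open("./cornell movie-dialogs corpus/movie_titles_metadata.txt", encoding="latin1"):
    parts = line.strip().split(" +++$+++ ")
    if len(parts) == 6 and 'comedy' in parts[5]:
        comedy_movies.add(parts[0])

# keep only turn/response pairs whose first line comes from a comedy
comedy_pairs = [(first, second) for first, second in responses.items()
                if line_movie.get(first) in comedy_movies]

comedy_bot = SemanticSimilarityChatbot(nlp, 300)
for first_id, second_id in random.sample(comedy_pairs, min(10000, len(comedy_pairs))):
    comedy_bot.add_pair(movie_lines[first_id], movie_lines[second_id])
comedy_bot.build()

print(comedy_bot.response_for("Hello computer!"))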