This notebook is a fast introduction to a few techniques for working with narrative corpora. By "narrative corpora," I mean pre-existing bodies of text that mostly contain the texts of narratives. In particular, we're going to use Mark Riedl's WikiPlots corpus, which has the titles and plot summaries of more than one hundred thousand movies, books, television shows and other media from Wikipedia.
The notebook takes you through using spaCy to extract words, noun chunks, parts of speech and entities from the text and then sew them back together with Tracery. It then shows how to use Markovify to create new narratives from existing narrative text, and how to prepare the narratives for use as a training corpus for a large pre-trained language model like GPT-2.
The code is written in Python, but you don't really need to know Python in order to use the notebook. Everything's pre-written for you, so you can just execute the cells, making small changes to the code as needed. Even if the notebook itself doesn't end up being useful to you, hopefully it spurs a few ideas that you can take with you into your practice as a storyteller and/or programmer.
If you're running this code on Binder, you should be good to go. Just keep on executing the cells below. If you're running this notebook on Google Colab, you'll need to run the following cells to install the necessary libraries and download the data:
!pip install markovify
!pip install tracery
!pip install spacy==2.3.2
!python -m spacy download en_core_web_sm
!curl -L -O https://github.com/aparrish/corpus-driven-narrative-generation/raw/master/romcom_plot_sentences.tsv
The first step is to get the narrative corpus into the program. Because WikiPlots is so big, we're actually going to be working with a smaller subset: only the plot summaries for romantic comedy movies. The subcorpus was made using this notebook on creating a subcorpus of WikiPlots, which you can consult if you want to make your own with a different subset of WikiPlots.
The corpus we're working with takes the form of a TSV file ("tab separated values"), with each line containing the title of the movie, a number indicating where in the plot summary the sentence for this line occurs, the total number of sentences in the summary, and the actual text of the sentence. The following cell loads the data into a list of dictionaries:
sentences = []
for line in open("romcom_plot_sentences.tsv"):
    line = line.strip()
    items = line.split("\t")
    sentences.append(
        {'title': items[0],
         'index': int(items[1]),
         'total': int(items[2]),
         'text': items[3]})
Just to make sure it worked, we'll print out a random sentence:
import random
random.choice(sentences)
{'title': 'It Should Happen to You', 'index': 49, 'total': 72, 'text': "That's what you're doing, isn't it."}
Note: You can make your own corpus that works with the code in this notebook by exporting your data in TSV format with one line per sentence, with columns for the following:

title: the title of the work that the sentence comes from
index: the index of the sentence in the work
total: the total number of sentences in the work
text: the text of the sentence
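For example, here's a minimal sketch of exporting your own texts in that format. (The my_texts variable and the period-based sentence splitting are hypothetical stand-ins for your own data and tooling; spaCy's doc.sents would split sentences more reliably.)

# hypothetical: a list of (title, full_text) pairs of your own
my_texts = [("My Story", "A stranger arrives. Everyone panics. They reconcile.")]
with open("my_corpus.tsv", "w") as outfile:
    for title, text in my_texts:
        # naive sentence split on periods, for illustration only
        sents = [s.strip() for s in text.split(".") if s.strip()]
        for i, sent in enumerate(sents):
            outfile.write("\t".join([title, str(i), str(len(sents)), sent]) + "\n")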
To get an idea of what's happening in the text of the plots, we can do a bit of Natural Language Processing. I cover just the bare essentials in this notebook. Here's a more in-depth tutorial that I wrote.
Most natural language processing is done with the aid of third-party libraries. We're going to use one called spaCy. To use spaCy, you first need to install it (i.e., download the code and put it in a place where Python can find it) and download the language model. (The language model contains statistical information about a particular language that makes it possible for spaCy to do things like parse sentences into their constituent parts.)
Run the following cell to load spaCy's model:
import spacy
nlp = spacy.load('en_core_web_sm')
(This could also take a while: the model is potentially very large, and your computer needs to load it from your hard drive into memory. When you see a [*] next to a cell, that means that your computer is still working on executing the code in the cell.)
Right off the bat, the spaCy library gives us access to a number of interesting units of text:

sentences (doc.sents)
words (the individual tokens you get by iterating over doc)
noun chunks (doc.noun_chunks)
entities (doc.ents)

In the cell below, we extract these into variables so we can play around with them a little bit. (Parsing sentences is hungry work, and the following cell will take a while to execute.)
words = []
noun_chunks = []
entities = []
# only use 1000 sentences sampled at random by default; comment out this `for...`
# and uncomment the `for...` beneath to use every sentence in the corpus.
for i, sent in enumerate(random.sample(sentences, 1000)):
#for i, sent in enumerate(sentences):
    if i % 100 == 0:
        print(i, len(sentences))
    doc = nlp(sent['text'])
    words.extend([w for w in list(doc) if w.is_alpha])
    noun_chunks.extend(list(doc.noun_chunks))
    entities.extend(list(doc.ents))
0 28785
100 28785
200 28785
300 28785
400 28785
500 28785
600 28785
700 28785
800 28785
900 28785
Just to make sure it worked, print out ten random words:
for item in random.sample(words, 10):
print(item.text)
attention
are
no
does
and
realize
to
called
auto
is
Ten random noun chunks:
for item in random.sample(noun_chunks, 10):
print(item.text)
the woman
their parents' home
him
the counter
Lanie
that night's performance
prayers
(Eric Blore
an accident
the wedding
Ten random entities:
for item in random.sample(entities, 10):
print(item.text)
Elizabeth Martha Paul Alison Sam Cathy falls American five Philip Danny
The parser included with spaCy can also give us information about the grammatical roles in the sentence. For example, the .root.dep_ attribute of a noun chunk tells us whether that noun chunk is the subject ("nsubj") or a direct object ("dobj") of the sentence. (See the "Universal Dependency Labels" section of spaCy's annotation specs for more possible roles.) Using this information, we can make a list of sentence subjects and sentence objects:
subjects = [chunk for chunk in noun_chunks if chunk.root.dep_ == 'nsubj']
objects = [chunk for chunk in noun_chunks if chunk.root.dep_ == 'dobj']
random.sample(subjects, 10)
[who, who, Clarisse, Casey, he, she, he, they, she, He]
random.sample(objects, 10)
[Ese, her affair, them, Mary, sex, himself, the next town, his prior dream, the Rose Bowl's committee, entirely different paths]
The spaCy parser allows us to check what part of speech a word belongs to. In the cell below, we create four different lists (nouns, verbs, adjs and advs) that contain only words of the specified parts of speech. Using the .tag_ attribute, we can easily get only particular forms of verbs; in this case, I'm just getting verbs that are in the past tense. (There's a full list of part-of-speech tags here.)
nouns = [w for w in words if w.pos_ == "NOUN"]
verbs = [w for w in words if w.pos_ == "VERB"]
past_tense_verbs = [w for w in words if w.tag_ == 'VBD']
adjs = [w for w in words if w.tag_ == "JJ"]
advs = [w for w in words if w.pos_ == "ADV"]
And now we can print out a random sample of any of these:
for item in random.sample(nouns, 12): # change "nouns" to "verbs" or "adjs" or "advs" to sample from those lists!
print(item.text)
kiss
glitch
veteran
friend
snare
run
game
approval
ship
day
projects
pocket
The parser in spaCy not only identifies "entities" but also assigns them to a particular type. See a full list of entity types here. Using this information, the following cell builds lists of the people, locations, and times mentioned in the text:
people = [e for e in entities if e.label_ == "PERSON"]
locations = [e for e in entities if e.label_ == "LOC"]
times = [e for e in entities if e.label_ == "TIME"]
And then you can print out a random sample:
for item in random.sample(times, 12): # change "times" to "people" or "locations" to sample those lists
print(item.text.strip())
the night
The next morning
One morning
night
night
the night
night
the night
afternoon
Later that night
One night
the next morning
We won't go too deep into text analysis in this tutorial, but it's useful to be able to do the most fundamental task in text analysis: finding the things that are most common. The code to do this task looks like the following, which gives us a way to look up how often any word occurs in the text:
from collections import Counter
word_count = Counter([w.text for w in words])
word_count['Meanwhile']
15
... and also tells us which words are most common:
word_count.most_common(12)
[('the', 896), ('to', 760), ('and', 734), ('a', 588), ('her', 406), ('is', 337), ('of', 321), ('in', 302), ('with', 269), ('his', 267), ('he', 260), ('that', 246)]
You can make a counter for any of the other lists we've worked with using the same syntax. Just make up a unique variable name to the left of the = sign and put the name of the list you want to count in the brackets to the right (replacing words). E.g., to find the most common people:
people_count = Counter([w.text for w in people])
people_count.most_common(12)
[('Tom', 18), ('Joe', 18), ('Mary', 14), ('Adam', 10), ('Michael', 10), ('Charlie', 10), ('Andy', 9), ('Max', 9), ('Peter', 9), ('Paul', 8), ('Elizabeth', 8), ('James', 8)]
The most common past-tense verbs:
vbd_count = Counter([w.text for w in past_tense_verbs])
vbd_count.most_common(12)
[('was', 50), ('had', 26), ('did', 7), ('were', 7), ('left', 4), ('died', 3), ('came', 3), ('saw', 3), ('took', 3), ('called', 2), ('made', 2), ('put', 2)]
The following cell defines a function for writing data from a Counter object to a file. The file is in "tab-separated values" format, which you can open using most spreadsheet programs. Execute it before you continue:
def save_counter_tsv(filename, counter, limit=1000):
    with open(filename, "w") as outfile:
        outfile.write("key\tvalue\n")
        # note: pass `limit` to .most_common() so the parameter actually
        # limits how many rows get written
        for item, count in counter.most_common(limit):
            outfile.write(item.strip() + "\t" + str(count) + "\n")
Now, run the following cell. You'll end up with a file in the same directory as this notebook called 100_common_words.tsv that has two columns, one for the words and one for their associated counts:
save_counter_tsv("100_common_words.tsv", word_count, 100)
Try opening this file in Excel or Google Sheets or Numbers!
If you want to write the data from another Counter object to a file:

Come up with a new filename (keep the .tsv extension)
Replace word_count with the name of any of the other Counter objects we've made in this notebook

Here's another example. Using the times entities, we can make a spreadsheet of how often particular "times" (durations, times of day, etc.) are mentioned in the text.
time_counter = Counter([e.text.lower().strip() for e in times])
save_counter_tsv("time_count.tsv", time_counter, 100)
Do the same thing, but with people:
people_counter = Counter([e.text.lower() for e in people])
save_counter_tsv("people_count.tsv", people_counter, 100)
Once you've isolated entities and parts of speech, you can recombine them in interesting ways. One is to use a Tracery grammar to write sentences that include the isolated parts. Because the parts have been labelled using spaCy, you can be reasonably sure that they'll fit into particular slots in the sentence. (I used a similar technique for my Cheap Space Nine bot.)
import tracery
from tracery.modifiers import base_english
rules = {
    "subject": [w.text for w in subjects],
    "object": [w.text for w in objects],
    "verb": [w.text for w in past_tense_verbs if w.text not in ('was', 'were', 'went')], # exclude common irregular verbs
    "adj": [w.text for w in adjs],
    "people": [w.text for w in people],
    "loc": [w.text for w in locations],
    "time": [w.text for w in times],
    "origin": "#scene#\n\n[charA:#subject#][charB:#subject#][prop:#object#]#sentences#",
    "scene": "SCENE: #loc#, #time.lowercase#",
    "sentences": [
        "#sentence#\n#sentence#",
        "#sentence#\n#sentence#\n#sentence#",
        "#sentence#\n#sentence#\n#sentence#\n#sentence#"
    ],
    "sentence": [
        "#charA.capitalize# #verb# #prop#.",
        "#charB.capitalize# #verb# #prop#.",
        "#prop.capitalize# became #adj#.",
        "#charA.capitalize# and #charB# greeted each other.",
        "'Did you hear about #object.lowercase#?' said #charA#.",
        "'#subject.capitalize# is #adj#,' said #charB#.",
        "#charA.capitalize# and #charB# #verb# #object#.",
        "#charA.capitalize# and #charB# looked at each other.",
        "#sentence#\n#sentence#"
    ]
}
grammar = tracery.Grammar(rules)
grammar.add_modifiers(base_english)
for i in range(3):
    print(grammar.flatten("#origin#"))
    print()
SCENE: Mars, the next morning

'Did you hear about kimmy wallace?' said he.
'Aram is aware,' said The conference.
'Did you hear about her date?' said he.
'Bryce is indispensable,' said The conference.
The conference called him.
He and The conference looked at each other.

SCENE: the Wild West Show, the night

The next town became decent.
He called the next town.
Elizabeth and he looked at each other.
'Irene is routine,' said he.
Elizabeth and he greeted each other.

SCENE: Orient, night

The spirited Sarah and Charlie greeted each other.
Charlie had school rules.
The spirited Sarah had school rules.
Charlie buried school rules.
Another way to produce new narratives from existing narrative text is to find statistical patterns in the text itself and then make the computer create new text that follows those statistical patterns. Markov chain text generation has been a pastime of poets and programmers going back all the way to 1983, so it should be no surprise that there are many implementations of the idea in Python that you can download and install. The one we're going to use is Markovify, a Markov chain text generation library originally developed for BuzzFeed, apparently. Writing code to implement a Markov chain generator on your own is certainly possible, but Markovify comes with a lot of extra niceties that will make our lives easier.
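To make the idea concrete, here's a minimal character-level Markov generator written from scratch. (Just an illustrative sketch: markov_by_hand is my own made-up name, and Markovify layers niceties like sentence splitting and output testing on top of this same core idea.)

import random

def markov_by_hand(text, order=2, length=40):
    # build the model: map each n-gram of `order` characters to the
    # list of characters that follow it in the source text
    model = {}
    for i in range(len(text) - order):
        gram = text[i:i+order]
        model.setdefault(gram, []).append(text[i+order])
    # start from a random n-gram, then repeatedly sample a next character
    current = random.choice(list(model.keys()))
    output = current
    for _ in range(length):
        if current not in model:
            break  # dead end: nothing ever followed this n-gram
        output += random.choice(model[current])
        current = output[-order:]
    return output

markov_by_hand("condescendences")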
To install Markovify on your computer, run the cell below. (You can skip this step if you're using this notebook in Binder.)
!pip install markovify
Requirement already satisfied: markovify in /Users/allison/anaconda/lib/python3.6/site-packages (0.7.1)
Requirement already satisfied: unidecode in /Users/allison/anaconda/lib/python3.6/site-packages (from markovify) (1.0.22)
And then run this cell to make the library available in your notebook:
import markovify
We need a list of strings to train the Markov generator. For now, let's just get all of the sentences from any movie in the corpus:
all_text = [item['text'] for item in sentences]
The code in the following cell creates a new text generator, using the text in the variable specified to build the Markov model, which is then assigned to the variable all_text_gen.
all_text_gen = markovify.Text(all_text)
You can then call the .make_sentence() method to generate a sentence from the model:
print(all_text_gen.make_sentence())
Dodge distracts her by helping shovel coal.
The .make_short_sentence() method allows you to specify a maximum length for the generated sentence:
print(all_text_gen.make_short_sentence(50))
During the date as well.
By default, Markovify tries to generate a sentence that is significantly different from any existing sentence in the input text. As a consequence, sometimes the .make_sentence() or .make_short_sentence() methods will return None, which means that in ten tries the model wasn't able to generate such a sentence. You can work around this by increasing the number of times it tries to generate a sufficiently unique sentence using the tries parameter:
print(all_text_gen.make_short_sentence(40, tries=100))
He goes to visit next month.
Or by disabling the check altogether with test_output=False:
print(all_text_gen.make_short_sentence(40, test_output=False))
She accepts.
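Alternatively, you can keep the quality check and handle the occasional None yourself with a small retry loop (a quick sketch, using the all_text_gen model from above):

# ask for a sentence up to 25 times, keeping the first non-None result
sentence = None
for _ in range(25):
    sentence = all_text_gen.make_short_sentence(50)
    if sentence is not None:
        break
print(sentence)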
When you create the model, you can specify the order of the model using the state_size parameter. It defaults to 2. Let's make two models with different orders and compare:
gen_1 = markovify.Text(all_text, state_size=1)
gen_4 = markovify.Text(all_text, state_size=4)
print("order 1")
print(gen_1.make_sentence(test_output=False))
print()
print("order 4")
print(gen_4.make_sentence(test_output=False))
order 1
He supports the green.

order 4
Amanda then finds Harold at the bar where they first met, drunk and surrounded by giggling women.
In general, the higher the order, the more the sentences will seem "coherent" (i.e., more closely resembling the source text). Lower order models will produce more variation. Deciding on the order is usually a matter of taste and trial-and-error.
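One way to do that trial and error is to build a model at each of several state sizes and eyeball a sample sentence from each. A quick sketch (the higher-order models take longer to build):

# compare sample output across a few different orders
for n in range(1, 5):
    gen = markovify.Text(all_text, state_size=n)
    print("order", n)
    print(gen.make_sentence(test_output=False))
    print()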
Markovify, by default, works with words as the individual unit. It doesn't come out-of-the-box with support for character-level models. The following code defines a new kind of Markovify generator that implements character-level models. Execute it before continuing:
class SentencesByChar(markovify.Text):
    def word_split(self, sentence):
        return list(sentence)

    def word_join(self, words):
        return "".join(words)
Any of the parameters you passed to markovify.Text you can also pass to SentencesByChar. The state_size parameter still controls the order of the model, but now the n-grams are characters, not words.
The following cell implements a character-level Markov text generator for the word "condescendences":
con_model = SentencesByChar("condescendences", state_size=2)
Execute the cell below to see the output—it'll be a lot like what we implemented by hand earlier!
con_model.make_sentence()
'condescendencencendes'
Of course, you can use a character-level model on any text of your choice. So, for example, the following cell creates a character-level order-7 Markov chain text generator from the plot sentences in all_text:
gen_char = SentencesByChar(all_text, state_size=7)
And the cell below prints out a random sentence from this generator. (The .replace() is to get rid of any newline characters in the output.)
print(gen_char.make_sentence(test_output=False).replace("\n", " "))
Elizabeth explains that he denies, however, the kissed.
It's one thing to be able to produce one plausible sentence of a plot summary using Markov chains, but another to create a sense of overall structure between sentences, and generating narratives with these kinds of long-term dependencies is still an open problem in computational creativity. The approach I'm going to suggest below relies on the intuition that sentences in a plot summary share characteristics based on their position in the summary. First sentences will generally introduce characters and present an initial situation; last sentences will generally describe how the situation was resolved; and sentences in between will describe developing action.
Following this intuition, let's create three different Markov chains: one for beginning sentences, one for middle sentences, and one for final sentences. We can use the index of each sentence in our corpus to give us this information.
First, the beginnings are lines whose index is zero (i.e., they're the first sentence for this plot):
beginnings = [line['text'] for line in sentences if line['index'] == 0]
random.sample(beginnings, 5)
['Sutter Keely (Miles Teller) is a high school senior, charming and self-possessed.', 'Wealthy Alice Bond (Rosemary Lane), dissatisfied with her dishwater-dull fiance Marshall Winkler (John Eldredge), throws him over in favor of Michael Stevens (George Reeves).', 'Emily "Jacks" Jackson, who spent her childhood in America, now lives and works in London, at British Vogue, and shares an apartment with her gay friend Peter Simon, a screenwriter.', 'Dave (Vince Vaughn), a dealer for Guitar Hero, and Ronnie (Malin Åkerman), a stay-at-home mom, are a typical couple raising two young children in the suburbs of Chicago.', 'Andy Stitzer (Steve Carell) is a 40-year-old virgin who lives alone, collects action figures, plays video games, and his social life seems to consist of watching Survivor with his elderly neighbors.']
And endings are sentences that come last in the plot (i.e., their index is one less than the total number of sentences):
endings = [line['text'] for line in sentences if line['index'] == line['total'] - 1]
random.sample(endings, 5)
['In the end, Debi and Martin leave Grosse Pointe together.', 'Frank and Meredith, along with the other members of the dance class, continue to find friendship and healing in each other.', 'She says that she knew it all along and decides to marry him.', 'The two reunite when Miss Lily returns to the circus.', 'They talk about what to eat for dinner before thinking about running a furniture business.']
And "middles" are anything in between:
middles = [line['text'] for line in sentences if 0 < line['index'] < line['total'] - 1]
random.sample(middles, 5)
['Brian, in his first relationship and out of his depth, does what he believes is the right thing.', 'Laida calls it a night.', 'Accompanied by Willoughby, she drives Herbie onto the window-cleaning machine of Hawk’s skyscraper to reach his 28th-floor office, where mrs Steinmetz overhears a telephoned conversation with Loostgarten about the deal to demolish the firehouse and activates the window cleaning machine to fill the office with foam and water.', 'She goes over there to pick him up, but he is going on and on about his amnesia.', 'His father calls to inform him that he has the ring.']
The following cell creates the models:
beginning_gen = markovify.Text(beginnings)
middle_gen = markovify.Text(middles)
ending_gen = markovify.Text(endings)
Now you can generate tiny narratives by producing a beginning sentence, a middle sentence, and an ending sentence:
print(beginning_gen.make_short_sentence(100))
print(middle_gen.make_short_sentence(100))
print(ending_gen.make_short_sentence(100))
Joan Howell intends to be bride Sophie.
Eventually Robin cracks under the guise of an important counterattack.
Father thinks it may be yet again pregnant.
The narratives still feel disconnected (and there are often jarring mismatches in pronoun antecedents), but the artifacts produced with this method do feel a bit narrative-like? Maybe?
Markovify has a handy feature that allows you to combine models, creating a new model that draws on probabilities from both of the source models. You can use this to create hybrid output that mixes the style and content of two (or more!) different source texts. To do this, you need to create the models independently, and then call markovify.combine() to combine them.
The code below combines models for beginning sentences, middle sentences, and ending sentences into one model:
combo = markovify.combine([beginning_gen, middle_gen, ending_gen], [10, 1, 10])
The bit of code [10, 1, 10] controls the "weights" of the models, i.e., how much to emphasize the probabilities of each model. You can change this to suit your tastes. (E.g., if you want mostly beginnings, with just a bit of middles and a soupçon of ends, try [10, 2, 1].)
Then you can create sentences using the combined model:
print(combo.make_short_sentence(120))
Afterwards, Tony approaches Amelia in the Whittaker counterfeit ring Agent Rivera who has stolen her heart.
Markov chains are cheap and fun, but they don't do a great job of the one thing we expect from stories: maintaining coherence over a long stretch of text. Accomplishing this is a more difficult task, and requires making use of more sophisticated machine learning models, belonging to the category of large pre-trained neural networks. These models are fundamentally similar to Markov chains, in that they make a prediction about what will come next in a text, given some stretch of context. Unlike a Markov chain, a large pre-trained neural network can predict what will come next in a text, even if the context you give it has never been seen in the training text. It can also work on contexts of arbitrary and variable length. Handy!
These language models are already trained on a large amount of text. Generally, you don't train them from scratch on your own, but instead "fine-tune" them to bring their probabilities more in line with a particular source text.
One such model, OpenAI's GPT-2, does a pretty good job of maintaining long-distance coherence, and it's easy to fine-tune the model with Max Woolf's aitextgen. We'll use the example Colab notebook from the aitextgen repository. The model works best when it's fine-tuned on text in a prose format. It can also learn ad-hoc markup elements that you add to the text. We'll use this feature of the model to make it possible to generate stories from beginning to end, by adding a [BEGIN STORY] marker before each story in the source text, followed by the title of the story.
out = []
last_title = None
for sent in sentences[:10000]:
    if sent['title'] != last_title:
        out.append("")
        out.append("[BEGIN STORY]")
        out.append(sent['title'])
        out.append("")
        last_title = sent['title']
    out.append(sent['text'])
Here's what the data look like:
out[:25]
['', '[BEGIN STORY]', 'Four Weddings and a Funeral', '', 'The film follows the adventures of a group of friends through the eyes of Charles, a good-natured but socially awkward man living in London, who becomes smitten with Carrie, an American whom Charles keeps meeting at four weddings and a funeral.', 'The first wedding is that of Angus and Laura, at which Charles is the best man.', 'Charles and his single friends wonder whether they will ever get married.', 'Charles meets Carrie and spends the night with her.', 'Carrie pretends that, now they have slept together, they will have to get married, to which Charles endeavours to respond before realising she is joking.', 'Carrie observes that they may have missed an opportunity and then returns to America.', 'The second wedding is that of Bernard and Lydia, a couple who became romantically involved at the previous wedding.', 'Charles encounters Carrie again, but she introduces him to her fiancé, Sir Hamish Banks, a wealthy politician.', 'At the reception, Charles finds himself seated with several ex-girlfriends who relate embarrassing stories about his inability to be discreet and afterwards bumps into Henrietta, known among Charles\' friends as "Duckface", with whom he had a particularly difficult relationship.', 'Charles retreats to an empty hotel suite, seeing Carrie and Hamish leave in a taxicab, only to be trapped in a cupboard after the newlyweds stumble into the room to have sex.', 'After Charles awkwardly exits the room, Henrietta confronts him about his habit of "serial monogamy", telling him he is afraid of letting anyone get too close to him.', 'Charles then runs into Carrie, and they end up spending another night together.', "A month later, Charles receives an invitation to Carrie's wedding.", 'While shopping for a present, he coincidentally encounters Carrie and ends up helping her select her wedding dress.', 'Carrie lists her more than thirty sexual partners.', 'Charles later awkwardly tries confessing his love to her and hinting that he would like to have a relationship with her, to no avail.', 'The third wedding is that of Carrie and Hamish.', 'Charles attends, depressed at the prospect of Carrie marrying Hamish.', "At the reception, Gareth instructs his friends to seek potential mates; Fiona's brother, Tom, stumbles through an attempt to connect with a woman until she reveals that she is the minister's wife, while Charles's flatmate, Scarlett, strikes up a conversation with an American named Chester.", 'As Charles watches Carrie and Hamish dance, Fiona deduces his feelings about Carrie.', 'When Charles asks why Fiona is not married, she confesses that she has loved Charles since they first met years earlier.']
The following cell writes this out to a file, which you can then upload to the aitextgen notebook on Google Colab to train the model:
with open("story_training.txt", "w") as fh:
    fh.write("\n".join(out))
In the text generation section of that notebook, try prompting the model with [BEGIN STORY] followed by the title of a story you'd like to generate!
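For reference, here's a rough sketch of what the fine-tuning and generation steps look like in aitextgen itself. Treat the model size and num_steps below as placeholder settings to tune, not the Colab notebook's exact configuration:

from aitextgen import aitextgen

# load a small pre-trained GPT-2 (downloads the weights on first run)
ai = aitextgen(tf_gpt2="124M")

# fine-tune on the file we just wrote; num_steps is a placeholder
ai.train("story_training.txt", num_steps=2000)

# prompt with the [BEGIN STORY] marker plus a made-up title
ai.generate(prompt="[BEGIN STORY]\nThe Wedding Mix-Up\n")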