#!/usr/bin/env python
# coding: utf-8

# # A few simple corpus-driven approaches to narrative analysis and generation
# 
# By [Allison Parrish](http://www.decontextualize.com/)
# 
# This notebook is a fast introduction to a few techniques for working with narrative corpora. By "narrative corpora," I mean pre-existing bodies of text that mostly contain the texts of narratives. In particular, we're going to use Mark Riedl's [WikiPlots corpus](https://github.com/markriedl/WikiPlots), which has the titles and plot summaries of more than one hundred thousand movies, books, television shows and other media from Wikipedia.
# 
# The notebook takes you through using [spaCy](http://spacy.io) to extract words, noun chunks, parts of speech and entities from the text and then sew them back together with [Tracery](http://tracery.io). It then shows how to use [Markovify](https://github.com/jsvine/markovify) to create new narratives from existing narrative text, and how to prepare the narratives for use as a training corpus for a large pre-trained language model like GPT-2.
# 
# The code is written in Python, but you don't really need to know Python in order to use the notebook. Everything's pre-written for you, so you can just execute the cells, making small changes to the code as needed. Even if the notebook itself doesn't end up being useful to you, hopefully it spurs a few ideas that you can take with you into your practice as a storyteller and/or programmer.
# 
# If you're running this code on Binder, you should be good to go. Just keep on executing the cells below. If you're running this notebook on Google Colab, you'll need to run the following cell to install the necessary libraries and download the data:

# In[ ]:

get_ipython().system('pip install markovify')
get_ipython().system('pip install tracery')
get_ipython().system('pip install spacy==2.3.2')
get_ipython().system('python -m spacy download en_core_web_sm')
get_ipython().system('curl -L -O https://github.com/aparrish/corpus-driven-narrative-generation/raw/master/romcom_plot_sentences.tsv')


# ## Loading the corpus
# 
# The first step is to get the narrative corpus into the program. Because WikiPlots is so big, we're actually going to be working with a smaller subset: only the plot summaries for romantic comedy movies. The subcorpus was made using [this notebook on creating a subcorpus of WikiPlots](https://github.com/aparrish/corpus-driven-narrative-generation/blob/master/creating-a-wikiplots-subcorpus.ipynb), which you can consult if you want to make your own with a different subset of WikiPlots.
# 
# The corpus we're working with takes the form of a TSV file ("tab-separated values"), with each line containing the title of the movie, a number indicating where in the plot summary the sentence on that line occurs, the total number of sentences in the summary, and the actual text of the sentence.
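# To make that format concrete, here's a quick sketch that splits one made-up example line into the four fields. (The title and sentence here are invented, purely for illustration; the real file has one such line per sentence of every plot summary.)

# In[ ]:

# A hypothetical row: title, sentence index, total sentences, sentence text,
# separated by tab characters.
example_line = "Example Movie\t0\t12\tAlice meets Bob at a bakery in Brooklyn."
fields = example_line.split("\t")
print("title:", fields[0])
print("index:", int(fields[1]))
print("total:", int(fields[2]))
print("text:", fields[3])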
# The following cell loads the data into a list of dictionaries:

# In[1]:

sentences = []
for line in open("romcom_plot_sentences.tsv"):
    line = line.strip()
    items = line.split("\t")
    sentences.append(
        {'title': items[0],
         'index': int(items[1]),
         'total': int(items[2]),
         'text': items[3]})


# Just to make sure it worked, we'll print out a random sentence:

# In[4]:

import random


# In[5]:

random.choice(sentences)


# Note: You can make your own corpus that works with the code in this notebook by exporting your data in TSV format with one line per sentence, with columns for the following:
# 
# * `title`: the title of the work that the sentence comes from
# * `index`: the index of the sentence in the work
# * `total`: the total number of sentences in the work
# * `text`: the text of the sentence

# ## Natural language processing
# 
# To get an idea of what's happening in the text of the plots, we can do a bit of Natural Language Processing. I cover just the bare essentials in this notebook. [Here's a more in-depth tutorial that I wrote](https://github.com/aparrish/rwet/blob/master/nlp-concepts-with-spacy.ipynb).
# 
# Most natural language processing is done with the aid of third-party libraries. We're going to use one called spaCy. To use spaCy, you first need to install it (i.e., download the code and put it in a place where Python can find it) and download the language model. (The language model contains statistical information about a particular language that makes it possible for spaCy to do things like parse sentences into their constituent parts.)
# 
# Run the following cell to load spaCy's model:

# In[6]:

import spacy
nlp = spacy.load('en_core_web_sm')


# (This could take a while; the model is potentially very large and your computer needs to load it from your hard drive into memory. When you see a `[*]` next to a cell, that means that your computer is still working on executing the code in the cell.)

# Right off the bat, the spaCy library gives us access to a number of interesting units of text:
# 
# * All of the sentences (`doc.sents`)
# * All of the words (`doc`)
# * All of the "named entities," like names of places, people, brands, etc. (`doc.ents`)
# * All of the "noun chunks," i.e., nouns in the text plus surrounding matter like adjectives and articles (`doc.noun_chunks`)
# 
# In the cell below, we extract these into variables so we can play around with them a little bit. (Parsing sentences is hungry work, so the following cell will take a while to execute.)

# In[7]:

words = []
noun_chunks = []
entities = []
# only use 1000 sentences sampled at random by default; comment out this `for...` and
# uncomment the `for...` beneath it to use every sentence in the corpus.
for i, sent in enumerate(random.sample(sentences, 1000)):
#for i, sent in enumerate(sentences):
    if i % 100 == 0:
        print(i, len(sentences))
    doc = nlp(sent['text'])
    words.extend([w for w in list(doc) if w.is_alpha])
    noun_chunks.extend(list(doc.noun_chunks))
    entities.extend(list(doc.ents))


# Just to make sure it worked, print out ten random words:

# In[8]:

for item in random.sample(words, 10):
    print(item.text)


# Ten random noun chunks:

# In[9]:

for item in random.sample(noun_chunks, 10):
    print(item.text)


# Ten random entities:

# In[10]:

for item in random.sample(entities, 10):
    print(item.text)
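# Before moving on, here's a quick sketch of one way to peek at everything spaCy records for a single parsed sentence: pick a random sentence from the corpus and print each token alongside its part of speech, fine-grained tag, and dependency relation (the attributes we'll rely on in the next few sections).

# In[ ]:

# Parse one random sentence and inspect its tokens.
sample_doc = nlp(random.choice(sentences)['text'])
for token in sample_doc:
    # token.pos_ is the coarse part of speech, token.tag_ the fine-grained tag,
    # and token.dep_ the dependency label relative to token.head.
    print(token.text, token.pos_, token.tag_, token.dep_, "->", token.head.text)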
# ### Grammatical roles
# 
# The parser included with spaCy can also give us information about the grammatical roles of words in the sentence. For example, the `.root.dep_` attribute of a noun chunk tells us whether that noun chunk is the subject ("nsubj") or a direct object ("dobj") of the sentence. (See the "Universal Dependency Labels" section of spaCy's [annotation specs](https://spacy.io/api/annotation) for more possible roles.) Using this information, we can make a list of sentence subjects and sentence objects:

# In[11]:

subjects = [chunk for chunk in noun_chunks if chunk.root.dep_ == 'nsubj']
objects = [chunk for chunk in noun_chunks if chunk.root.dep_ == 'dobj']


# In[12]:

random.sample(subjects, 10)


# In[13]:

random.sample(objects, 10)


# ### Parts of speech
# 
# The spaCy parser allows us to check what part of speech a word belongs to. In the cell below, we create four different lists—`nouns`, `verbs`, `adjs` and `advs`—that contain only words of the specified parts of speech. Using the `.tag_` attribute, we can also grab particular forms of verbs; in this case, `past_tense_verbs` contains just verbs in the past tense. ([There's a full list of part-of-speech tags here.](https://spacy.io/docs/usage/pos-tagging#pos-tagging-english))

# In[14]:

nouns = [w for w in words if w.pos_ == "NOUN"]
verbs = [w for w in words if w.pos_ == "VERB"]
past_tense_verbs = [w for w in words if w.tag_ == 'VBD']
adjs = [w for w in words if w.tag_ == "JJ"]
advs = [w for w in words if w.pos_ == "ADV"]


# And now we can print out a random sample of any of these:

# In[15]:

for item in random.sample(nouns, 12): # change "nouns" to "verbs" or "adjs" or "advs" to sample from those lists!
    print(item.text)


# ### Entity types
# 
# The parser in spaCy not only identifies "entities" but also assigns each one a particular type. [See a full list of entity types here.](https://spacy.io/docs/usage/entity-recognition#entity-types) Using this information, the following cell builds lists of the people, locations, and times mentioned in the text:

# In[16]:

people = [e for e in entities if e.label_ == "PERSON"]
locations = [e for e in entities if e.label_ == "LOC"]
times = [e for e in entities if e.label_ == "TIME"]


# And then you can print out a random sample:

# In[17]:

for item in random.sample(times, 12): # change "times" to "people" or "locations" to sample those lists
    print(item.text.strip())


# ### Finding the most common
# 
# We won't go too deep into text analysis in this tutorial, but it's useful to be able to do the most fundamental task in text analysis: finding the things that are most common. The code to do this looks like the following, and it gives us a way to look up how often any word occurs in the text:

# In[18]:

from collections import Counter
word_count = Counter([w.text for w in words])


# In[19]:

word_count['Meanwhile']


# ... and also tells us which words are most common:

# In[20]:

word_count.most_common(12)


# You can make a counter for any of the other lists we've worked with using the same syntax. Just make up a unique variable name on the left of the `=` sign and put the name of the list you want to count to the right (replacing `words`). E.g., to find the most common people:

# In[21]:

people_count = Counter([w.text for w in people])


# In[22]:

people_count.most_common(12)


# The most common past-tense verbs:

# In[23]:

vbd_count = Counter([w.text for w in past_tense_verbs])


# In[24]:

vbd_count.most_common(12)
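# The same recipe works for the grammatical-role lists too. Here's a small sketch that tallies the most common sentence subjects and direct objects, which gives a rough sense of who tends to act and what tends to be acted upon in these plots:

# In[ ]:

# Count subjects and objects by lowercased text so that "She" and "she" group together.
subject_count = Counter([chunk.text.lower() for chunk in subjects])
object_count = Counter([chunk.text.lower() for chunk in objects])
print(subject_count.most_common(10))
print(object_count.most_common(10))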
# ### Writing to a file
# 
# The following cell defines a function for writing data from a `Counter` object to a file. The file is in "tab-separated values" format, which you can open in most spreadsheet programs. Execute it before you continue:

# In[25]:

def save_counter_tsv(filename, counter, limit=1000):
    with open(filename, "w") as outfile:
        outfile.write("key\tvalue\n")
        for item, count in counter.most_common(limit):
            outfile.write(item.strip() + "\t" + str(count) + "\n")


# Now, run the following cell. You'll end up with a file in the same directory as this notebook called `100_common_words.tsv` that has two columns, one for the words and one for their associated counts:

# In[26]:

save_counter_tsv("100_common_words.tsv", word_count, 100)


# Try opening this file in Excel or Google Docs or Numbers!
# 
# If you want to write the data from another `Counter` object to a file:
# 
# * Change the filename to whatever you want (though you should probably keep the `.tsv` extension)
# * Replace `word_count` with the name of any of the other `Counter` objects we've made in this notebook
# * Change the number to the number of rows you want to include in your spreadsheet

# ### When do things happen in this text?
# 
# Here's another example. Using the `times` entities, we can make a spreadsheet of how often particular "times" (durations, times of day, etc.) are mentioned in the text.

# In[27]:

time_counter = Counter([e.text.lower().strip() for e in times])
save_counter_tsv("time_count.tsv", time_counter, 100)


# Do the same thing, but with people:

# In[28]:

people_counter = Counter([e.text.lower() for e in people])
save_counter_tsv("people_count.tsv", people_counter, 100)


# ### Generating stories from a corpus and Tracery grammars
# 
# Once you've isolated entities and parts of speech, you can recombine them in interesting ways. One way is to use a Tracery grammar to write sentences that include the isolated parts. Because the parts have been labelled using spaCy, you can be reasonably sure that they'll fit into particular slots in the sentence. (I used a similar technique for my [Cheap Space Nine](https://twitter.com/cheapspacenine) bot.)

# In[29]:

import tracery
from tracery.modifiers import base_english


# In[40]:

rules = {
    "subject": [w.text for w in subjects],
    "object": [w.text for w in objects],
    "verb": [w.text for w in past_tense_verbs if w.text not in ('was', 'were', 'went')], # exclude a few common irregular verbs
    "adj": [w.text for w in adjs],
    "people": [w.text for w in people],
    "loc": [w.text for w in locations],
    "time": [w.text for w in times],
    "origin": "#scene#\n\n[charA:#subject#][charB:#subject#][prop:#object#]#sentences#",
    "scene": "SCENE: #loc#, #time.lowercase#",
    "sentences": [
        "#sentence#\n#sentence#",
        "#sentence#\n#sentence#\n#sentence#",
        "#sentence#\n#sentence#\n#sentence#\n#sentence#"
    ],
    "sentence": [
        "#charA.capitalize# #verb# #prop#.",
        "#charB.capitalize# #verb# #prop#.",
        "#prop.capitalize# became #adj#.",
        "#charA.capitalize# and #charB# greeted each other.",
        "'Did you hear about #object.lowercase#?' said #charA#.",
        "'#subject.capitalize# is #adj#,' said #charB#.",
        "#charA.capitalize# and #charB# #verb# #object#.",
        "#charA.capitalize# and #charB# looked at each other.",
        "#sentence#\n#sentence#"
    ]
}


# In[41]:

grammar = tracery.Grammar(rules)
grammar.add_modifiers(base_english)


# In[42]:

for i in range(3):
    print(grammar.flatten("#origin#"))
    print()
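# The grammar above defines a `people` rule but never actually uses it, so here's a small sketch of one way to build on it: add a couple of extra `sentence` templates that drop in a named person and a location, then regenerate. (The new templates are just examples; swap in whatever phrasing you like.)

# In[ ]:

# Copy the rules so the original grammar stays untouched, then add templates
# that reference the otherwise-unused #people# and #loc# symbols.
extended_rules = dict(rules)
extended_rules["sentence"] = rules["sentence"] + [
    "#charA.capitalize# told #charB# about #people#.",
    "#charA.capitalize# and #charB# met at #loc#.",
]
extended_grammar = tracery.Grammar(extended_rules)
extended_grammar.add_modifiers(base_english)
print(extended_grammar.flatten("#origin#"))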
# ## Markov chain text generation
# 
# Another way to produce new narratives from existing narrative text is to find statistical patterns in the text itself and then make the computer create new text that follows those statistical patterns. Markov chain text generation has been a pastime of poets and programmers going back [all the way to 1983](https://www.jstor.org/stable/24969024), so it should be no surprise that there are many implementations of the idea in Python that you can download and install. The one we're going to use is [Markovify](https://github.com/jsvine/markovify), a Markov chain text generation library originally developed for BuzzFeed, apparently. Writing [code to implement a Markov chain generator](https://github.com/aparrish/rwet/blob/master/ngrams-and-markov-chains.ipynb) on your own is certainly possible, but Markovify comes with a lot of extra niceties that will make our lives easier.

# To install Markovify on your computer, run the cell below. (You can skip this step if you're using this notebook on Binder.)

# In[121]:

get_ipython().system('pip install markovify')


# And then run this cell to make the library available in your notebook:

# In[43]:

import markovify


# We need a list of strings to train the Markov generator. For now, let's just get all of the sentences from every movie in the corpus:

# In[44]:

all_text = [item['text'] for item in sentences]


# The code in the following cell creates a new text generator, using the text in the variable specified to build the Markov model, which is then assigned to the variable `all_text_gen`:

# In[45]:

all_text_gen = markovify.Text(all_text)


# You can then call the `.make_sentence()` method to generate a sentence from the model:

# In[47]:

print(all_text_gen.make_sentence())


# The `.make_short_sentence()` method allows you to specify a maximum length for the generated sentence:

# In[48]:

print(all_text_gen.make_short_sentence(50))


# By default, Markovify tries to generate a sentence that is significantly different from any existing sentence in the input text. As a consequence, sometimes the `.make_sentence()` or `.make_short_sentence()` methods will return `None`, which means that in ten tries it wasn't able to generate such a sentence. You can work around this by increasing the number of times it tries to generate a sufficiently unique sentence, using the `tries` parameter:

# In[49]:

print(all_text_gen.make_short_sentence(40, tries=100))


# Or by disabling the check altogether with `test_output=False`:

# In[50]:

print(all_text_gen.make_short_sentence(40, test_output=False))


# ### Changing the order
# 
# When you create the model, you can specify the order of the model using the `state_size` parameter. It defaults to 2. Let's make two models with different orders and compare:

# In[51]:

gen_1 = markovify.Text(all_text, state_size=1)
gen_4 = markovify.Text(all_text, state_size=4)


# In[52]:

print("order 1")
print(gen_1.make_sentence(test_output=False))
print()
print("order 4")
print(gen_4.make_sentence(test_output=False))


# In general, the higher the order, the more "coherent" the sentences will seem (i.e., the more closely they'll resemble the source text). Lower-order models will produce more variation. Deciding on the order is usually a matter of taste and trial and error.
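# Since choosing an order is mostly trial and error, here's a little sketch that builds models at a few different state sizes in a loop and prints a sample from each, so you can eyeball the trade-off between coherence and variety in one place:

# In[ ]:

# Try a handful of orders and print one (unchecked) sentence from each.
for size in [1, 2, 3]:
    gen = markovify.Text(all_text, state_size=size)
    print("order", size)
    print(gen.make_short_sentence(120, test_output=False))
    print()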
# ### Changing the level
# 
# Markovify, by default, works with *words* as the individual unit. It doesn't come out-of-the-box with support for character-level models. The following code defines a new kind of Markovify generator that implements character-level models. Execute it before continuing:

# In[53]:

class SentencesByChar(markovify.Text):
    def word_split(self, sentence):
        return list(sentence)
    def word_join(self, words):
        return "".join(words)


# Any of the parameters you passed to `markovify.Text` you can also pass to `SentencesByChar`. The `state_size` parameter still controls the order of the model, but now the n-grams are characters, not words.
# 
# The following cell implements a character-level Markov text generator for the word "condescendences":

# In[54]:

con_model = SentencesByChar("condescendences", state_size=2)


# Execute the cell below to see the output; it'll be new letter sequences stitched together from patterns in the source word:

# In[55]:

con_model.make_sentence()


# Of course, you can use a character-level model on any text of your choice. So, for example, the following cell creates a character-level order-7 Markov chain text generator from the plot sentences in our corpus:

# In[56]:

gen_char = SentencesByChar(all_text, state_size=7)


# And the cell below prints out a random sentence from this generator. (The `.replace()` is to get rid of any newline characters in the output.)

# In[59]:

print(gen_char.make_sentence(test_output=False).replace("\n", " "))


# ### Thinking about structure
# 
# It's one thing to be able to produce one plausible sentence of a plot summary using Markov chains, but another to create a sense of overall structure between sentences, and generating narratives with these kinds of long-term dependencies is still an open problem in computational creativity. The approach I'm going to suggest below relies on the intuition that sentences in a plot summary share characteristics based on their position in the summary. First sentences will generally introduce characters and present an initial situation; last sentences will generally describe how the situation was resolved; and sentences in between will describe developing action.
# 
# Following this intuition, let's create *three different Markov chains*: one for beginning sentences, one for middle sentences, and one for final sentences. We can use the `index` of each sentence in our corpus to give us this information.
# 
# First, the beginnings are lines whose index is zero (i.e., they're the first sentence of their plot):

# In[60]:

beginnings = [line['text'] for line in sentences if line['index'] == 0]


# In[61]:

random.sample(beginnings, 5)


# And endings are sentences that come last in the plot (i.e., their index is one less than the total number of sentences):

# In[62]:

endings = [line['text'] for line in sentences if line['index'] == line['total'] - 1]


# In[63]:

random.sample(endings, 5)


# And "middles" are anything in between:

# In[64]:

middles = [line['text'] for line in sentences if 0 < line['index'] < line['total'] - 1]


# In[65]:

random.sample(middles, 5)


# The following cell creates the models:

# In[66]:

beginning_gen = markovify.Text(beginnings)
middle_gen = markovify.Text(middles)
ending_gen = markovify.Text(endings)


# Now you can generate tiny narratives by producing a beginning sentence, a middle sentence, and an ending sentence:

# In[68]:

print(beginning_gen.make_short_sentence(100))
print(middle_gen.make_short_sentence(100))
print(ending_gen.make_short_sentence(100))


# The narratives still feel disconnected (and there are often jarring mismatches in pronoun antecedents), but the artifacts produced with this method do feel a bit narrative-like? Maybe?
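# If you like these tiny narratives, it's easy to make slightly longer ones. The sketch below wraps the three models in a small (hypothetical) helper function that produces one beginning, a handful of middle sentences, and one ending, skipping any attempt that comes back as `None`:

# In[ ]:

def tiny_plot(n_middles=3, max_len=100):
    # Draw one beginning, several middles, and one ending; drop failed attempts.
    parts = [beginning_gen.make_short_sentence(max_len)]
    for _ in range(n_middles):
        parts.append(middle_gen.make_short_sentence(max_len))
    parts.append(ending_gen.make_short_sentence(max_len))
    return " ".join(p for p in parts if p is not None)

print(tiny_plot())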
# ### Combining models
# 
# Markovify has a handy feature that allows you to *combine* models, creating a new model that draws on probabilities from all of the source models. You can use this to create hybrid output that mixes the style and content of two (or more!) different source texts. To do this, you need to create the models independently, and then call `markovify.combine()` to combine them.
# 
# The code below combines the models for beginning sentences, middle sentences, and ending sentences into one model:

# In[69]:

combo = markovify.combine([beginning_gen, middle_gen, ending_gen], [10, 1, 10])


# The bit of code `[10, 1, 10]` controls the "weights" of the models, i.e., how much to emphasize the probabilities of each model. You can change this to suit your tastes. (E.g., if you want mostly beginnings, a bit of middles, and a *soupçon* of endings, try `[10, 2, 1]`.)
# 
# Then you can create sentences using the combined model:

# In[70]:

print(combo.make_short_sentence(120))


# ## Prepping the corpus for fine-tuning a large language model
# 
# Markov chains are cheap and fun, but they don't do a great job of the one thing we expect from stories: maintaining coherence over a long stretch of text. Accomplishing this is a more difficult task, and requires making use of more sophisticated machine learning models, namely large pre-trained neural networks. These models are fundamentally similar to Markov chains, in that they make a prediction about what will come next in a text, given some stretch of context. Unlike a Markov chain, though, a large pre-trained neural network can predict what will come next in a text even if the context you give it has never been seen in the training text. It can also work on contexts of arbitrary and variable length. Handy!
# 
# These language models are already trained on a large amount of text. Generally, you don't train them from scratch on your own, but instead "fine-tune" them to bring their probabilities more in line with a particular source text.
# 
# One such model, [OpenAI's GPT-2](https://github.com/openai/gpt-2), does a pretty good job of maintaining long-distance coherence, and it's easy to fine-tune with Max Woolf's [aitextgen](https://github.com/minimaxir/aitextgen/). We'll use the [example Colab notebook](https://colab.research.google.com/drive/15qBZx5y9rdaQSyWpsreMDnTiZ5IlN0zD?usp=sharing) from the aitextgen repository. Fine-tuning works best on text in a prose format, and the model can also learn ad-hoc markup elements that you add to the text. We'll use this feature of the model to make it possible to generate stories from beginning to end, by adding a `[BEGIN STORY]` marker before each story in the source text, followed by the title of the story.

# In[74]:

out = []
last_title = None
for sent in sentences[:10000]:
    if sent['title'] != last_title:
        out.append("")
        out.append("[BEGIN STORY]")
        out.append(sent['title'])
        out.append("")
        last_title = sent['title']
    out.append(sent['text'])


# Here's what the data look like:

# In[84]:

out[:25]


# The following cell writes this out to a file, which you can then upload to the aitextgen notebook on Google Colab to train the model:

# In[76]:

with open("story_training.txt", "w") as fh:
    fh.write("\n".join(out))


# In the text generation section of that notebook, try prompting the model with `[BEGIN STORY]` followed by the title of a story you'd like to generate!
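# Before uploading, it can be reassuring to sanity-check the file you just wrote. Here's a small sketch that re-reads `story_training.txt` and reports how many `[BEGIN STORY]` markers and how many lines it contains, so you can confirm the markup came out the way you expect:

# In[ ]:

# Quick sanity check on the training file.
with open("story_training.txt") as fh:
    lines = fh.read().split("\n")
story_count = sum(1 for line in lines if line == "[BEGIN STORY]")
print("stories:", story_count)
print("total lines:", len(lines))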