“Every programmer is an author.”
― Sercan Leylek
The purpose of this chapter is to demonstrate that the Python programming language can be a useful and powerful tool (among others) in the toolset of a scholar doing computational literary analysis. We will introduce Python in general, and a technology called iPython (or Jupyter) Notebooks in particular, we will briefly review some related technologies, and we will show a simple example of using Python to explore character name networks in The Adventures of Sherlock Holmes. Though working in Python won't be everyone's cup of tea, we hope this brief overview will serve to help readers recognize some of its strengths and potential applications.
Python is a programming language: in other words, it's a language (with a vocabulary and syntax) intended to give instructions to a computer. Compared to many other programming languages, Python is also deliberately designed to be more easily read by humans. In principle, a programming language that's easier to read is also easier to learn (though it's never easy to learn a first programming language), as well as easier for other programmers to understand, maintain, and extend.
We won't go into details on the syntax of Python here (there are plenty of good introductory guides and tutorials for Python), but there are a few characteristics that are worth mentioning (we invite you to skip over the exotic-sounding parts that don't make sense). Python
It may be useful to see a concrete bit of code, even if it's a contrived example. Let's imagine that we wanted to list the even numbers between zero and ten (exclusively). We can iterate over a range of numbers and use the modulo operator (%) to determine if a number divided by two has a remainder (if it doesn't then it's an even number).
# loop through each number in the range from 0 up to (but not including) 10
for n in range(0, 10):
    # if n is true (not zero) and there's no remainder, print the number
    if n and not n % 2: print("even:", n)
even: 2
even: 4
even: 6
even: 8
So the syntax in Python could be something like this:
for n in range(0, 10):
    if n and not n % 2: print("even:", n)
Let's compare that to an equivalent bit of code in PHP:
foreach (range(0, 9) as $n) {
    if ($n && !($n % 2)) { echo "even: $n\n"; }
}
Do you find one easier to read than the other? Note some important stylistic differences and characteristics of Python:
Again, readability of code is an important aspect of Python – the easier it is to read, the quicker it is to learn and master. Style is a consideration for Python programmers; there's even the concept of a more pythonic way of doing things, which roughly corresponds to using idioms motivated by the core philosophy of the language. Indeed, one of Python's aphorisms in the Zen of Python is that "Flat is better than nested", which means that our loop above, with nested indentation, might be better expressed using a structure called a list comprehension:
[print("even:", n) for n in range(0, 10) if n and not n % 2]
Another Python aphorism is that "Readability counts", and we're not entirely sure that this more compact version is more readable, which reinforces that programmers are authors who make stylistic choices, in part subject to personal preferences.
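To underline the point about stylistic choices, here is yet another variant of our own (not drawn from the Zen of Python): rather than testing divisibility at all, we can ask range for every second number by using its third (step) argument.

```python
# step through only the even numbers by using range's step argument
for n in range(2, 10, 2):
    print("even:", n)
```

Which of these versions is the most readable is, again, partly a matter of taste.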
In any case, another compelling advantage of Python is the large number of standard (built-in) and external libraries available for working with texts. Chief among these is the Natural Language Toolkit (NLTK), which provides convenient functions for processing a corpus, segmenting words, analyzing and graphing frequencies, concordancing, part-of-speech tagging (identifying determiners, nouns, etc.), and many other features. Moreover, the NLTK library can be used in conjunction with other libraries for advanced statistical analysis, machine learning, and graphing.
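To give a flavour of what frequency analysis looks like, here is a minimal sketch using only the standard library's collections.Counter (NLTK's FreqDist offers a richer version of the same idea); the sentence is our own toy example:

```python
from collections import Counter
import re

# a toy corpus; in practice this might be an entire novel
text = "The quick brown fox jumps over the lazy dog. The dog sleeps."

# a crude word segmentation (NLTK's tokenizers are far more careful)
words = re.findall(r"[a-z]+", text.lower())

# count the words and show the two most frequent
print(Counter(words).most_common(2))  # → [('the', 3), ('dog', 2)]
```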
Python code can be executed in several different ways. A common scenario is to use the command line to invoke the Python interpreter, which parses and executes the contents of a file like this (where python is the command and myscript.py is the name of the file to be executed):
$ python myscript.py
Another model is to use an interactive command line environment. Instead of executing a file at once and then terminating, Python allows you to keep a session "live", which means that functions and variables that you define remain in memory until you leave the interactive environment. Here's an example session, where ">>>" indicates the Python command line:
$ python
>>> msg = "Hello World!"
>>> print(msg)
Hello World!
Jupyter Notebooks are in some ways a web-based version of interactive Python: they allow you to edit and execute bits of code directly in your browser and to keep your work (variables, functions, etc.) in memory for the duration of the session. But Jupyter Notebooks are much more than just interactive Python because they also add a second component: the ability to create and edit blocks of text that can be intertwingled with the blocks of code. This mix of code and text is sometimes called literate programming.
Literate programming was first described by Donald Knuth in 1984; it emphasizes the primacy of thinking through, and explaining to others in a “literate” or human-readable fashion, what a program should do. As Knuth put it (Knuth 1984, p. 1):
Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do. The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.
Although no one would claim that literate programming has revolutionized computer science or become de facto standard practice, we believe that literate programming has enormous potential for the humanities because it privileges argumentation and human expression. Jupyter Notebooks allow students and scholars to develop an argument in prose while also providing direct access to the methods used. Indeed, our enthusiasm for literate programming in Python has led us to write an introductory guide called the Art of Literary Programming (which includes instructions for getting set up with Jupyter Notebooks). This guide to iPython Notebooks, written in iPython Notebooks, is hosted at GitHub, a widely used service for collaborative development (GitHub renders iPython notebooks directly, but we still tend to use nbviewer.ipython.org because of better formatting).
The opening claim in this chapter is that Python is a useful tool, which of course is not a claim that Python is always and necessarily the best tool for all aspects of computational literary analysis. On the contrary, just as no carpenter would embark on a project with a single tool, we believe that it's best to explore the broad range of tools available and to make strategic choices. Factors in choosing technologies strategically include
There are several lists of text analysis tools out there (some of which are more or less up-to-date), but we'll mention two in particular that are maintained by and for digital humanists. First, there's TAPoR (tapor.ca), the Text Analysis Portal for Research (led by co-author Rockwell and others). TAPoR includes hundreds of tools organized by categories and user-defined tags. Each tool entry defines useful characteristics such as "ease of use" and "type of license" and includes user-contributed reviews. Second, there's DiRT (dirtdirectory.org), the Digital Research Tools Directory. DiRT organizes a broad range of tools (not just for literary analysis) by research task and by kinds of data. DiRT also defines several useful characteristics for tool entries, as well as screenshots for many of the tools.
There's a dizzying number of specialized tools that can be used for literary text analysis (we distinguish here between specialized tools with defined functionality and programming languages that are more general and can be used to develop specialized tools). We (the authors) have developed Voyant Tools (voyant-tools.org), a web-based digital text reading and analysis environment. Voyant allows users to upload content in a variety of formats (plain text, MS Word, XML, PDF, etc.) and provides extensive analytic functionality, particularly for studying the frequency and distribution of words in one or more documents. Another specialized tool for literary text analysis is TXM (textometrie.ens-lyon.fr), which is especially good at working with XML and TEI documents. Commercial and tiered-license solutions include RapidMiner (rapidminer.com) and Provalis Research (provalisresearch.com).
A strength of Voyant is that it provides a (relatively) friendly and familiar user interface for working with texts; no programming or advanced technical skills are needed. However, this is also a weakness: the functionality is limited to what the developers have included, and if users want to do something differently, they're stuck. That's one of the strongest arguments for learning how to program: to not be constrained by what others have created.
Just as there's no best tool for literary analysis, there's no best programming language (even though some programmers are committed to their favourite language with almost religious zeal). We have developed functional tools in C, Java, Javascript, Mathematica, Perl, PHP, Python, R, Ruby, Swift, XSLT (and probably others we've forgotten). If anything, learning new languages often makes us better programmers in whatever language we happen to use.
Any text analysis operation that can be programmed in one language can also be programmed in another, but two important factors are how quickly and conveniently things can be done, as well as how generally useful a language is. We like PHP as a programming language because it's widely used for web applications (like WordPress) and because some common operations are very easily accomplished. As mentioned before, Python has the advantage of a large number of useful libraries for natural language processing. We tend to use Java when speed is more important and when projects get bigger.
Below is a simple example of fetching the contents of a URL into a string variable with three different languages: PHP, Python and Java. This operation could be programmed in different ways for each language; the purpose here is to show more generally that there are significant differences in style and that there are always trade-offs between brevity, clarity, convenience, speed, robustness, and other considerations.
// PHP (notice that no additional modules or libraries need to be included)
$holmes = file_get_contents("http://www.gutenberg.org/cache/epub/1661/pg1661.txt");
# Python
import urllib.request
url = "http://www.gutenberg.org/cache/epub/1661/pg1661.txt"
holmes = urllib.request.urlopen(url).read().decode()
// Java (notice that imports from java.net and java.io are needed)
URL holmesUrl = new URL("http://www.gutenberg.org/cache/epub/1661/pg1661.txt");
URLConnection connection = holmesUrl.openConnection();
InputStreamReader is = new InputStreamReader(connection.getInputStream());
BufferedReader in = new BufferedReader(is);
StringBuilder response = new StringBuilder();
String inputLine;
while ((inputLine = in.readLine()) != null)
response.append(inputLine);
in.close();
String holmes = response.toString();
Let's now look at a more developed example of a literate programming notebook in Python.
Who talks to whom in a story? Which characters in a story form into a social network by virtue of being mentioned in close proximity? Are there associations we may not have noticed during a close reading? Are there suggestive structures we can observe in a larger corpus that we have not read closely? These are some of the questions that might motivate literary text analysis with computers (indeed, the example below is derived from a question first posed on Twitter).
What follows is a simple example of a Jupyter iPython Notebook for the rapid analysis of character names in The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle. Our purpose here is to provide a quick demonstration of an iPython Notebook, including the mix of prose and code, as well as the power of Python, and we make no claims about contributing to Holmesian studies. The superficiality of this example can even be considered a virtue: sometimes computational methods allow us to quickly ask questions (that we might not otherwise be able to ask) and to determine if more rigorous analysis is warranted.
Our plan for studying character name networks can be summarized as follows:
Not all texts are created equal when it comes to literary text analysis: some texts are more readily available than others (because of copyright, format, language, condition of the original, etc.). Fortunately, Conan Doyle's works are public domain and available from various sources, including Project Gutenberg. While Project Gutenberg texts may not be the preferred editions for some kinds of scholarship, for our experiment on character names they will work just fine.
We'll begin by fetching the contents of Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle (it's worth opening the contents of this link to see what we'll be getting). Using Python to fetch the contents of a URL is more fully described in Getting Texts.
In the Python code blocks below, the hash (#) symbol indicates a comment (not executed as code), which can be a useful way of differentiating pragmatic programming concerns from the conceptual narration that is expressed in this prosaic (non-code) block of text.
# start by importing a module that helps with fetching contents from URLs
import urllib.request
# define the URL to fetch
url = "http://www.gutenberg.org/cache/epub/1661/pg1661.txt"
# fetch the contents of the URL and convert from bytes to UTF-8
string = urllib.request.urlopen(url).read().decode()
If all goes well, we now have a string variable (called string) that contains the fetched contents. We can double-check the length of the string and have a peek at its contents. It's a good idea to have these sanity checks as we go along since a lot of things can go wrong (like a URL no longer working).
# show the length of the string
print("string contains", len(string), "characters (not to be confused with narrative characters)")
# show the first 75 characters
print("start of string:", string[:75], "…")
string contains 594916 characters (not to be confused with narrative characters)
start of string: Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doy …
We have the full string, but that includes some paratextual content like the Project Gutenberg header and license. By looking at the original file, we can identify the "real" start and end of the text (for our purposes), and then isolate just that section.
# find the character index of the "real" start of the text
startText = string.find("ADVENTURE I. A SCANDAL IN BOHEMIA")
# find the character index of the "real" end of the text
endText = string.find("End of the Project Gutenberg EBook of The Adventures of Sherlock Holmes")
# keep only the "real" text, filtering out the rest and stripping trailing whitespace
filteredText = string[startText:endText].strip()
# show the length of the string
print("filteredText contains", len(filteredText), "characters")
# show the first 30 characters
print("start of filteredText:", filteredText[:30], "…")
filteredText contains 574456 characters
start of filteredText: ADVENTURE I. A SCANDAL IN BOHE …
We now have one long string with the whole text, but what we really want is to look for the co-occurrence of character names by paragraph, so we need a way of segmenting our string into paragraphs. For the sake of simplicity, we can use double newline characters as an indication of a new paragraph. We can see this by looking at the first couple of hundred characters:
print(filteredText[:230])
ADVENTURE I. A SCANDAL IN BOHEMIA I. To Sherlock Holmes she is always THE woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he
Project Gutenberg uses the Windows carriage return (\r) and line feed (\n) sequence, as we can see if we show the first couple of hundred characters in a rawer format:
filteredText[:230]
'ADVENTURE I. A SCANDAL IN BOHEMIA\r\n\r\nI.\r\n\r\nTo Sherlock Holmes she is always THE woman. I have seldom heard\r\nhim mention her under any other name. In his eyes she eclipses\r\nand predominates the whole of her sex. It was not that he '
So, to identify paragraphs, we can separate the text using two or more sequences of \r\n in a row. The code below uses Regular Expressions to split our text into a list of smaller strings based on newlines. It filters out paragraphs that are empty (once leading and trailing whitespace is removed). The code uses a very compact and convenient syntax called list comprehension to manipulate our list.
import re
# create a list of paragraphs from (Windows) double (or more) newlines
# the p.strip() removes any leading or trailing whitespace and the paragraph
# is only kept if there are still characters after stripping whitespace
paragraphs = [p for p in re.split(r'(?:\r\n){2,}', filteredText) if p.strip()]
print("we have", len(paragraphs), "paragraphs")
we have 2537 paragraphs
We now have a list of paragraphs. Next we'll want to look for character names in each paragraph. There are several ways this might be done, including looking for capitalized words (probably too generic), looking for matches from known lists of names (probably too specific), or combining a variety of heuristics. This last approach is essentially what the Natural Language Toolkit part-of-speech tagger does, using what it knows about a multi-domain training corpus (in other words, based on tags that have been manually verified in a variety of known texts, it tries to tag the contents of texts it hasn't seen). The NLTK library is one of the real benefits of working with Python. It's very convenient for a wide range of natural language analysis tasks, and though it's not always as accurate as we might like, it's still very powerful.
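To see why the capitalized-words approach is probably too generic, consider a quick sketch (the sentence below is adapted from the story's opening; the regular expression is our own naive heuristic, not part of NLTK):

```python
import re

sentence = "To Sherlock Holmes she is always THE woman, unlike Irene Adler."

# naively treat any run of capitalized words as a candidate name
candidates = [c.strip() for c in re.findall(r"(?:[A-Z][a-z]+\s?)+", sentence)]
print(candidates)  # → ['To Sherlock Holmes', 'Irene Adler']
```

"Irene Adler" is caught, but the sentence-initial "To" is wrongly swallowed into the first name, which is exactly the kind of noise a trained tagger helps reduce.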
Part of the NLTK workflow is to create a linguistic tree that can combine smaller units. For instance, a tree might represent an entire sentence, a clause, or successive proper nouns that together represent the first and last names of a person. So we'll create a function called findPeopleInTree that takes as an argument a tree, where the tree is either labeled as a person or is another kind of tree that we can further break down into branches. This is called a recursive function, a function that can call itself until it finds what it needs (or can look no further).
# import the Natural Language Toolkit
import nltk

# a function that looks at a parsed NLTK tree (or list of trees) and looks for tagged people
def findPeopleInTree(tree):
    # an empty list of people
    people = []
    # if we have a tree and its label is "PERSON", return the person's name
    if type(tree) is nltk.tree.Tree and tree.label() == "PERSON":
        return " ".join([word for word, pos in tree])
    # otherwise, if we have a tree or a list, try calling this function recursively
    elif (type(tree) is nltk.tree.Tree) or (type(tree) is list):
        for branch in tree:
            person = findPeopleInTree(branch)
            # if we actually get a person (or people), add it to our list
            if person:
                people.append(person)
    # return the list of people found (may be empty)
    return people
That code may be a bit difficult to understand in the abstract, so let's now create some code that will let us demonstrate it. Essentially we want to be able to send a paragraph through a pipeline that accomplishes the following operations:
# a function that parses a paragraph and returns a list of tagged people
def findPeopleInParagraph(paragraph):
    # parse sentences from the paragraph
    sentences = nltk.sent_tokenize(paragraph)
    # word tokenization of the sentences
    tokenizedSentences = [nltk.word_tokenize(sent) for sent in sentences]
    # part-of-speech tagging of the tokenized sentences
    taggedSentences = [nltk.pos_tag(sent) for sent in tokenizedSentences]
    # chunk tagged sentences (allows first and last names to be combined, for instance)
    chunkedSentences = [nltk.ne_chunk(sent) for sent in taggedSentences]
    # get people from our chunked sentences (a list of lists, one per sentence)
    people = findPeopleInTree(chunkedSentences)
    # flatten the list of lists
    people = [person for sublist in people for person in sublist]
    # return a list without duplicates
    return list(set(people))
Now we can show our new functions in action. We'll show the first full paragraph in our corpus as an example and try to find people in it.
print(paragraphs[2])
To Sherlock Holmes she is always THE woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position. He never spoke of the softer passions, save with a gibe and a sneer. They were admirable things for the observer--excellent for drawing the veil from men's motives and actions. But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results. Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more disturbing than a strong emotion in a nature such as his. And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory.
findPeopleInParagraph(paragraphs[2])
['Irene Adler']
We see here some good and bad: the character name "Irene Adler" was correctly identified, but the name "Sherlock Holmes" was not. This may seem surprising in some ways, but "Sherlock" and "Sherlock Holmes" can actually refer to many things, as evidenced by the disambiguation page on Wikipedia. Still, let's press forward, noting that we want to be even more circumspect about the results given these disambiguation challenges and the fact that we may be missing several character names (and possibly including some "people" who aren't actually character names).
Following this demonstration with one paragraph, we're now ready to try to parse the entire corpus.
all_people = [findPeopleInParagraph(paragraph) for paragraph in paragraphs]
Before we go further, we can have a peek at the character names that have been identified by generating a frequency graph.
# make sure to plot graphs inline
%matplotlib inline
# let's have a peek at the results
nltk.FreqDist([person for sublist in all_people for person in sublist]).plot(20)
Now that we've identified (some) people mentioned in each paragraph, we can look for co-occurrences by creating links between any two character names that occur within a span of two paragraphs. As we're doing this, we'll do some clean-up of the names, including the following:
# this is a list of paragraphs that contain multiple people
multi_people = []
# this variable stores people from the previous paragraph
previous_people = []
# loop through each list in our list of all_people
for people in all_people:
    # if we have any people in our list
    if people:
        # normalize Holmes
        people = ["Holmes" if ("Holmes" in person or "Sherlock" in person) else person for person in people]
        # remove titles
        people = [re.sub(r"(Mr\.|Miss|Sir|Lord)", "", person) for person in people]
        # remove names with stopwords
        people = [person for person in people if not any([w for w in ["Street", "Station"] if w in person])]
        # remove people where nothing is left
        people = list(set([person.strip() for person in people if person.strip()]))
        # combine people from this paragraph with those from the previous one
        combined_people = people + previous_people
        # try to remove possible first names only (localized to current and previous paragraphs)
        firstnames = [name.split(" ")[0] for name in combined_people if len(name.split(" ")) > 1]
        combined_people = [person for person in combined_people if person not in firstnames]
        # remove duplicates of people
        combined_people = set(combined_people)
        # we have a winner (more than one person in this paragraph)
        if len(combined_people) > 1:
            multi_people.append(combined_people)
    # no people in this paragraph, so let's ensure an empty list
    else:
        people = []
    # remember the people from the current paragraph to use with the next one
    previous_people = people
Now we'll go through and prepare a set of relationships that can be counted. We'll create a link between every pair of people for each element in our list (which contains people from each paragraph and the one before it). So, if people A, B and C are mentioned, we'll have three relationships:
Then we can count how many times these relationships repeat.
import itertools
from collections import defaultdict

# we want to create a dictionary with edges and counts (with a default of 0)
edgesDictionary = defaultdict(int)
# we go through each set of people
for people in multi_people:
    # go through each combination of people: A, B, C => (A,B), (A,C), (B,C)
    for (a, b) in itertools.combinations(people, 2):
        # increment the count for this combination
        edgesDictionary[(a, b)] += 1
# now count the frequencies
edgesFreqs = nltk.FreqDist(edgesDictionary)
# and have a peek
edgesFreqs.most_common(5)
[(('Watson', 'Holmes'), 18), (('Hosmer Angel', 'Holmes'), 7), (('Windibank', 'Holmes'), 6), (('Lestrade', 'Holmes'), 6), (('Doctor', 'Holmes'), 6)]
It's reassuring to see that Holmes and Watson have the most co-occurrences (within subsequent paragraphs). Of course we also see another example of the limitations of this approach: "Doctor" (in the fourth line) probably refers most often to "Doctor Watson" and so the two could conceivably be collapsed.
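One quick remedy would be an alias table, compiled by hand from a reading of the results, that collapses such names into canonical forms before edges are counted; the mapping below is our own guess for illustration, not something derived from the text:

```python
# a hand-made alias table (hypothetical; built by eyeballing the frequency results)
aliases = {"Doctor": "Watson", "Sherlock": "Holmes"}

def normalize(name):
    # collapse a name into its canonical form, if we have one
    return aliases.get(name, name)

print(tuple(normalize(n) for n in ("Doctor", "Holmes")))  # → ('Watson', 'Holmes')
```

With names normalized this way, the ("Doctor", "Holmes") and ("Watson", "Holmes") pairs would be counted together.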
Now that we have relationships and counts, we can graph the results. The code below provides only superficial explanations of graphing with the NetworkX module, but much more detailed descriptions are available from the Getting Graphical and Topic Modelling notebooks.
import networkx as nx # a network graph module
import matplotlib.pyplot as plt # a general graphing module

# create a new graph and a 10x10 (inches) figure
G = nx.Graph()
plt.figure(figsize=(10,10))
# create graph edges (node pairs) and keep track of edges for each count
edges = defaultdict(list)
for names, count in edgesFreqs.most_common():
    # only include repeating pairs
    if count > 1:
        # isolate each person in the pair and create an edge
        G.add_edge(names[0], names[1], width=count)
        edges[count].append((names[0], names[1]))
# determine the position of nodes using a force-directed spring algorithm
pos = nx.spring_layout(G, k=.9)
# draw labels (nx.draw(G) doesn't really work)
nx.draw_networkx_labels(G, pos)
# draw edges with different widths indicating the frequency
for count, edgelist in edges.items():
    nx.draw_networkx_edges(G, pos, edgelist=edgelist, width=count, alpha=0.05)
# wrap it up!
plt.axis('off')
plt.show()
Tada! This graph may not reveal any deep cosmic mysteries, but it does place Holmes and Watson at the centre and establishes several other associations that might warrant further exploration. Let's reiterate that this was a quick and dirty experiment to try to automatically identify character names and to represent the network of character names that occur in proximity, but as we've seen throughout, there are important limitations to this approach and choices to be made. Still, we've used iPython to demonstrate that with some relatively modest code, we're able to begin exploring some literary analysis questions while narrating the process. For a somewhat less contrived example, see Matthew Wilkens's Literary Attention Lag blog post and accompanying iPython Notebook.
If you want to have a bookshelf, you can get a pre-assembled one, you can get a kit of parts and assemble it, or you can get materials and tools and build one yourself. Much will depend on your needs, your time and your competencies. Programming is a bit like building your own furniture: in some ways it can be messier and more work, but if you have very specific needs (unusual dimensions, particular materials, etc.), it may be the best solution (it can also be very rewarding).
If you're thinking about programming for literary analysis, Python is worth considering: it's a high-level language designed for rapid development and for readability. With iPython (Jupyter) Notebooks, it also supports a literate programming model that combines prosaic expression and code. Because of these aspects, as well as the potential for web-based sharing and adaptation, iPython Notebooks are valuable tools in the toolkit of the digital humanist.
If we've succeeded in conveying some of the potential of Python and Jupyter Notebooks, you may be looking for some further resources. Here are some starting points: