#!/usr/bin/env python
# coding: utf-8

# # Getting Texts
#
# This notebook is focused on an essential component of digital text analysis: preparing a corpus of texts. It's part of the [Art of Literary Text Analysis](ArtOfLiteraryTextAnalysis.ipynb) and assumes that you've already worked through [Getting Setup](GettingSetup.ipynb) and [Getting Started](GettingStarted.ipynb). In this notebook we'll look at:
#
# * [Accessing plain texts online](#Accessing-Plain-Texts-Online)
# * [Working with strings](#Some-Simple-String-Functions)
#     * String length
#     * String sequences (substrings)
#     * Counting strings
#     * Extracting strings
# * [Counting occurrences of a term](#Counting-Occurrences-of-a-String)
# * [Extracting part of a string](#Extracting-Text)
# * [Accessing local plain texts](#Accessing-Local-Plain-Texts)
# * [Listing files in a local directory](#Listing-Files-in-a-Local-Directory)
#
# Note that we're especially interested here in working with plain texts; in later notebooks we'll deal with other formats.
#
# ## Accessing Plain Texts Online
#
# Let's create a new notebook and set its name (by clicking on the _Untitled_ label above the toolbar) to _GettingTexts_.
#
# Now we'll change the editing mode of the first cell to _Markdown_ and copy and paste in the following Markdown-encoded text:
#
# > \# Getting Texts
# >
# > We are first going to experiment with loading a plain text into memory from Project Gutenberg (http://gutenberg.org), an online library with tens of thousands of free texts in different languages and formats. We can Google something like `python3 read from url` to discover pages like https://docs.python.org/3/howto/urllib2.html that explain the basics of reading content.
#
# Hit Shift-Enter to evaluate/format the Markdown cell and create a new code cell.
#
# The Markdown cell explains the essentials of what we want to do: fetch the contents of a plain text document from a URL, assign it to a string variable, and demonstrate various basic operations we can perform on a string. For this program we'll choose the [Works of Edgar Allan Poe, Volume 1](http://www.gutenberg.org/ebooks/2147), available at http://www.gutenberg.org/files/2147/2147-0.txt (you're encouraged to visit this link before continuing).
#
# First, let's fetch our document using [urllib.request](https://docs.python.org/3/library/urllib.request.html) (not all the code below will be explained in detail now, but we'll come back to it).

# In[1]:


import urllib.request

poeUrl = "http://www.gutenberg.org/files/2147/2147-0.txt"
poeString = urllib.request.urlopen(poeUrl).read().decode().strip()
print("This string has", len(poeString), "characters")


# Most of the principles involved have already been covered in the [Getting Started](GettingStarted.ipynb) notebook:
#
# 1. Importing a module (in this case `urllib.request` instead of the `time` module)
# 1. Assigning a string (the URL) to a variable name of our choice, ```poeUrl```
# 1. Making function calls and assigning the result to the variable name ```poeString```
# 1. Printing the last expression (line of code), in this case to show the number of characters in our string
#
# Here [urlopen()](https://docs.python.org/3/library/urllib.request.html?highlight=urlopen#urllib.request.urlopen) is the function, called with our Poe URL as its argument; it returns an HTTP response object on which we can invoke [read()](https://docs.python.org/3/library/http.client.html?highlight=httpresponse.read#http.client.HTTPResponse.read) to get the bytes data at our URL. Next, we call [decode()](https://docs.python.org/3/library/stdtypes.html#bytes.decode) to convert the bytes data to a proper Unicode string (`decode()` assumes UTF-8 by default).
# Finally, we call [strip()](https://docs.python.org/3/library/stdtypes.html#str.strip) to remove any leading and trailing whitespace.
#
# Many things can go wrong during networking calls, but if all goes well, we should now have a variable (`poeString`) containing a string with the same contents as at our [URL](http://www.gutenberg.org/cache/epub/2147/pg2147.txt).
#
# Fetching the contents of a URL is a relatively "expensive" operation (in code-speak this means that it's more computationally or time intensive), so we want to isolate that in its own Jupyter cell so that we don't have to run it more times than necessary. If we want to explore various aspects of the poeString string that we fetched, we should do that in a separate cell so that we're not re-fetching the string each time.
#
# There's an additional motivation for not repeating the fetching operation: Project Gutenberg (and some other sites) monitor how many requests are made from your IP address, and it can temporarily cut you off if it detects what it considers to be too many requests (waiting a while will usually lift this restriction). Multiple requests in a short period of time can also be a problem for shared IP addresses, like in a classroom setting.

# ## Some Simple String Functions
#
# One of the essential concepts of Jupyter (and the underlying IPython) is that once code is executed, any variables remain accessible in memory for subsequent cells that are executed. This is the work of the _kernel_, which interprets and executes code and stores things in memory ("Kernel" is one of the menus in Jupyter's menu bar). Typically we execute cells as we proceed through a notebook (see the options under the _Cell_ menu).
#
# We've already seen above how to show the length of a string using the [len()](https://docs.python.org/3/library/functions.html?highlight=len#len) function. The length is a number but it can be a bit difficult to read because there is no thousands separator.
# Let's improve the output by searching how to [format a number with a thousands separator](https://www.google.ca/search?q=python%203%20format%20number%20thousands), which leads in particular to [this suggestion](http://stackoverflow.com/questions/1823058/how-to-print-number-with-commas-as-thousands-separators#10742904) for using the [format specification mini-language](https://docs.python.org/3/library/string.html#format-specification-mini-language).

# In[2]:


poeStringLen = len(poeString)
poeStringLenFormatted = "{:,}".format(poeStringLen)  # format mini-language
print("This string has", poeStringLenFormatted, "characters")


# This suggests that there are 550,332 characters (because we're in Python 3.x we should be dealing with Unicode, and so this should be a true count of the characters, not just the bytes, since some characters require multiple bytes).
#
# We've shown a longer form of the code above, but we can also nest functions, or have function arguments that contain other functions – this works as well:
#
# ```python
# print("This string has", "{:,}".format(len(poeString)), "characters")
# ```
#
# This version is more succinct, but it can be more difficult to read (and to debug or resolve if there's a problem); programming is about choices!
#
# ### Working with Parts of a String
#
# A string is a sequence of characters and Python has a powerful way of working with sequences. For instance, we can do this to get the first 25 characters of our poeString:

# In[3]:


poeString[:25]


# Our string is a sequence where each character has an index position.
# Python, like many languages, starts its indexing at 0, so we get something like this, where there are 25 characters (including the "P" at index 0):
#
# |P|r|o|j|e|c|t| |G|u|t|e|n|b|e|r|g|'|s| |T|h|e| |W|…|
# |:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
# |0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|…|
#
# Let's see a few more examples of working with string sequences:

# In[4]:


print("First character:", poeString[0])
print("Last character:", poeString[-1])
print("First 25 characters:", poeString[:25])
print("Last 25 characters:", poeString[-25:])
print("Characters 8 to 25:", poeString[8:25])  # index 8 up to, but not including, 25


# Working with character sequences like this is an essential aspect of text analysis and it's well worth becoming familiar with this syntax.
#
# What else can we do with a string? A good place to look is the [string methods documentation](https://docs.python.org/3/library/stdtypes.html#string-methods).
#
# ### Counting Occurrences of a String
#
# For instance, let's count the occurrences of one sequence (*corpse*) within another string, `poeString`:

# In[5]:


print("Occurrences of 'corpse':", poeString.count("corpse"))


# Clearly Poe likes talking about corpses. Is the [count()](https://docs.python.org/3/library/stdtypes.html#string-methods) function case-sensitive? Does it match full words or just strings within strings? Let's see:

# In[6]:


print("Occurrences of 'corpse':", poeString.count("corpse"))
print("Occurrences of 'corps':", poeString.count("corps"))
print("Occurrences of 'Corpse':", poeString.count("Corpse"))


# Apparently `count()` is case-sensitive and is matching strings, not words.
#
# So what if we wanted to be sure to count all occurrences of _corpse_ regardless of case? One solution would be to convert our string to the case we want to use.
# We'd probably do this with [lower()](https://docs.python.org/3/library/stdtypes.html#str.lower), but for the sake of demonstration, let's do it with [upper()](https://docs.python.org/3/library/stdtypes.html#str.upper):

# In[7]:


print("Occurrences of 'CORPSE':", poeString.upper().count("CORPSE"))


# This again demonstrates function chaining: `poeString.upper()` returns a new string and the new string has an available `count()` function. It's important to realize that `poeString.upper()` doesn't modify the variable `poeString`; it returns a new copy of the string.
#
# We converted our poeString to uppercase characters since _corpse_ (lowercase) isn't the same as _Corpse_ (capitalized), though in this case it doesn't make any difference.
#
# What if we wanted to [find](https://docs.python.org/3/library/stdtypes.html?highlight=index#str.find) the index of the first occurrence of _corpse_ and show the surrounding text?

# In[8]:


firstCorpse = poeString.find("corpse")  # the index position of the first occurrence of "corpse"
context = 30  # number of characters to show on either side of the index position
print(poeString[firstCorpse-context : firstCorpse+context])


# ### Extracting Text
#
# Our [Poe text](http://www.gutenberg.org/cache/epub/2147/pg2147.txt) is actually a volume of multiple texts. What if we wanted to isolate only one of the texts, such as "The Gold Bug"?
#
# To isolate "The Gold Bug" in our Poe text, we might do something like the following (sometimes planning a program in natural language, rather than in computer code, can be useful):
#
# 1. Find the index position of the start of the story, i.e. "THE GOLD-BUG"
# 1. Find the index position of the end of the story, or the start of the next story, i.e. "FOUR BEASTS IN ONE"
# 1. Create a new string from the index position of the start of the story (from step 1) to the index position of the end of the story (from step 2)
#
# We know how to do the first two steps, and we've already seen a variant of the third step when we asked for the first few characters of the full Poe text: in the same way, `poeString[8:19]` would isolate "Gutenberg's" from the opening "Project Gutenberg's The". Let's apply this to the story:

# In[9]:


start = poeString.find("THE GOLD-BUG")
end = poeString.find("FOUR BEASTS IN ONE")
goldBugString = poeString[start:end].strip()
# show start and end of goldBugString
print(goldBugString[:50], "[…] ", goldBugString[-50:])


# ## Accessing Local Plain Texts
# Code that relies on URL content is convenient, though not nearly as robust as content that's already been downloaded and stored locally: content can change or disappear from the web, and maybe you want to work on your notebook in a remote location or in an airplane without internet connectivity. Moreover, accessing content from your local machine is typically much faster than interacting with web-based content.
#
# What we'll do in the next section is the following:
#
# 1. create a local directory for data (if necessary)
# 1. open a new file and write our goldBugString to the file
# 1. (re)open the file and read from it
#
# Let's begin by creating a new subdirectory (relative to the current notebook directory), using the [os](https://docs.python.org/3/library/os.html) module.

# In[10]:


import os

directory = "data"
if not os.path.exists(directory):
    os.makedirs(directory)


# This demonstrates a [conditional structure in Python](http://en.wikibooks.org/wiki/Python_Programming/Conditional_Statements) where we test a boolean value (true or false): whether or not the directory [exists](https://docs.python.org/3/library/os.path.html?highlight=exists#os.path.exists).
#
# Python uses a colon and indentation to indicate the parts of the conditional block. If we want to execute a block when a condition evaluates to true (like ```1 < 5```, one _is_ smaller than five):
#

if _condition_:
#     _block_

#
# Or if a condition is not true (like ```1 > 5```, one _is not_ greater than five):
#

if *not* _condition_:
#     _block_
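
# Both forms can be tried out directly. The sketch below is an illustration rather than part of the original notebook: it repeats the directory test inside a throwaway folder created with `tempfile.mkdtemp()` (the `scratch` name is an assumption), so the first `if not` block runs and the second one is skipped.

```python
import os
import tempfile

# hypothetical scratch location, so the notebook's own "data" folder is left alone
scratch = tempfile.mkdtemp()
directory = os.path.join(scratch, "data")

if not os.path.exists(directory):
    # the directory is missing, so "not exists" is true and this block runs
    os.makedirs(directory)
    print("created", directory)

if not os.path.exists(directory):
    # the directory now exists, so "not exists" is false and this block is skipped
    print("this line is not printed")

# makedirs can also perform the test itself: exist_ok=True means
# "do nothing if the directory is already there"
os.makedirs(directory, exist_ok=True)
```

# The `exist_ok=True` form is more compact, while the explicit `if not` form makes the conditional visible, which is the point here.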

#
# If the _data_ directory doesn't exist, we create it using [makedirs()](https://docs.python.org/3/library/os.html?highlight=mkdirs#os.makedirs).
#
# Now that we have a data directory, we need to open a new file in write ("w") mode and write out the string contents of goldBugString. The [with](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement) block syntax we present here takes care of closing the file we've opened once we're done with it (once we're out of the indented block).

# In[11]:


with open("data/goldBug.txt", "w") as f:
    f.write(goldBugString)


# The ```open()``` function returns a file object (that we've named ```f```) to which we can write contents. An alternative, by the way, to reading from a URL to a string and then writing the string to a file is to use the [urlretrieve](https://docs.python.org/3.0/library/urllib.request.html#urllib.request.urlretrieve) function, though our method should work just fine as well.
#
# Assuming things did work out, we can now turn around and open the file in read mode ("r" instead of "w") and read the contents into a new variable that we'll call ```goldBugString2```; the ```with``` block again closes the file for us.

# In[12]:


with open("data/goldBug.txt", "r") as f:
    goldBugString2 = f.read()


# Let's have a peek at the contents in our goldBugString2 variable (read directly from a file), the same way we did before.

# In[13]:


print(goldBugString2[:50], "[…] ", goldBugString2[-50:])


# Looks good!
#
# In fact, as a digression, it's not quite the same string, since the original uses Windows-style line endings (carriage return plus linefeed) that were normalized during the file writing and reading process.

# In[14]:


goldBugString == goldBugString2  # are these two strings the same?


# Notice here that we're using the equality operator with two equal signs (==); with a single equal sign we would be making an assignment, the same way we do when assigning a value to a variable.
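
# The line-ending digression above can be reproduced in isolation. This is a minimal sketch under stated assumptions: the sample string and the temporary file are made up for illustration, and `newline=""` is passed when writing so that the result does not depend on the operating system.

```python
import os
import tempfile

# made-up sample text with Windows-style line endings ("\r\n")
original = "first line\r\nsecond line"

# hypothetical scratch file, so data/goldBug.txt is left alone
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# newline="" writes the characters exactly as they are
with open(path, "w", newline="") as f:
    f.write(original)

# the default text mode translates "\r\n" to "\n" while reading
with open(path, "r") as f:
    translated = f.read()

# newline="" switches that translation off
with open(path, "r", newline="") as f:
    untranslated = f.read()

print(original == translated)    # False: "\r\n" became "\n"
print(original == untranslated)  # True: the line endings were preserved
print(repr(translated))
```

# If an exact round trip matters, passing `newline=""` to `open()` for both writing and reading is one way to keep the original line endings.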
# ## Listing Files in a Local Directory
# As with many things in programming languages like Python, there's more than one way of listing files in a directory. We're going to introduce a way here that also introduces a loop: a process that is repeated for each element in a list or for as long as a condition is true. We'll go a bit quickly here, but we'll come back to these concepts again soon.
#
# But first let's start with the [glob()](https://docs.python.org/3/library/glob.html?highlight=glob#glob.glob) function that allows us to list the files in a directory.

# In[15]:


import glob

textFiles = glob.glob("data/*.txt")
textFiles


# The results are shown as a list (delimited by the square brackets), with each element inside separated by a comma (here we only have one element because we only have one file so far).
#
# We can ask what kind of object our ```textFiles``` variable contains.

# In[16]:


type(textFiles)


# Lists are a type of variable that lend themselves to loops or to iterating over each element. For instance, to show each filename with the number of characters, we could do something like this:

# In[17]:


totalCharacters = 0
for textFile in textFiles:
    f = open(textFile, "r")
    textString = f.read()
    f.close()
    chars = len(textString)
    print(textFile, "has", chars, "characters")
    totalCharacters += chars

print("total characters: ", totalCharacters)


# The code above is of the general form
#

 for _item_ in _list_:
#     _block_
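
# The same general form can be tried without any files at all. In this small sketch the `samples` list and its contents are made up for illustration; each string stands in for the text read from a file.

```python
# made-up sample strings standing in for the contents of text files
samples = ["raven", "gold bug", "corpse"]

totalCharacters = 0
for sample in samples:            # item: sample, list: samples
    chars = len(sample)           # the indented lines form the block
    print(repr(sample), "has", chars, "characters")
    totalCharacters += chars      # add this item's length to the running total

# this line is not indented, so it runs once, after the loop has finished
print("total characters:", totalCharacters)
```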

#
# In other words, for each item in our ```textFiles``` list, we execute the block where ```textFile``` is the local variable holding the item in the list. Just as with the conditionals, the colon and indentation indicate what the loop condition is (as long as more elements exist in the list) and what block to execute for each iteration.
#
# In the code above we're also calculating the total number of characters (tracking them in a variable that we've called ```totalCharacters```). Each time we iterate over the list of files, we add the number of characters in the current file.
#
# > ```python
# > totalCharacters += chars
# > ```
#
# The += operator is a compact way to add a value to an existing variable. It's the equivalent of this:
#
# > ```python
# > totalCharacters = totalCharacters + chars
# > ```
#
# Finally, we're using the ```print()``` function here because it's a simple way of combining a string ("total characters: ") and a number (```totalCharacters```) – in Python you can't simply concatenate a string and a number.

# ## Next Steps
# Here are some tasks to try:
#
# * How would you create a subdirectory called ```Austen``` under the ```data``` directory we've already created?
# * For each of the plain text novels in English of [Jane Austen](http://www.gutenberg.org/ebooks/author/68) in Project Gutenberg
#     * How would you isolate the text content (without the Project Gutenberg header and footer)?
#     * How would you save the text-only content into the ```data/Austen``` directory?
# * How would you loop over the files in the ```data/Austen``` directory and for each one print the file name and a count of "his" and "her"?
# * What is the total number of characters in the Austen corpus?
#
# In the next notebook ([Getting NLTK](GettingNltk.ipynb)) we're going to introduce the Natural Language Toolkit that provides a huge number of useful functions for text analysis.

# ---
# [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/) From [The Art of Literary Text Analysis](ArtOfLiteraryTextAnalysis.ipynb) by [Stéfan Sinclair](http://stefansinclair.name) & [Geoffrey Rockwell](http://geoffreyrockwell.com). Edited and revised by [Melissa Mony](http://melissamony.com).
# Created January 12, 2015 and last modified February 7, 2019 (Jupyter 5.0.0)