3 Working with Text: Strings and string methods¶

Text Mining for Historians (with Python)¶

A Gentle Introduction to Working with Textual Data in Python¶

Created by Kaspar Beelen and Luke Blaxill¶

For the German Historical Institute, London¶

3.1 String variables and methods¶

Variables can contain, or more correctly, refer to strings. You may have noticed how operations (such as addition) allow you to perform simple string manipulations. For example, we can write a program that prints a greeting with a name.

-- Exercise:¶

Change the value of the first_name and last_name variables so that the cell below prints a correct greeting.

In [1]:

first_name = 'First_name' # change this to your first name
last_name = 'Last_name' # change this to your last name
print("Hello"+' '+first_name+' '+last_name) # this combines the variables in a greeting

Hello First_name Last_name

We'd achieve the same results by passing these variables as separate arguments to the print() function.

In [2]:

print("Hello", first_name, last_name)

Hello First_name Last_name
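As a small illustration of the difference between the two cells above: concatenation with + needs the spaces spelled out, while print() inserts a separator between its arguments. The sketch below uses a made-up name (not part of the notebook) and also shows the sep argument of print(), which changes that separator.

```python
# Hypothetical example values, chosen only for illustration
first_name = "Ada"
last_name = "Lovelace"

# Concatenation: we add the spaces ourselves
greeting = "Hello" + " " + first_name + " " + last_name
print(greeting)

# Separate arguments: print() puts a space between them by default
print("Hello", first_name, last_name)

# The sep argument replaces the default space with another separator
print("Hello", first_name, last_name, sep="-")
```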

But Python provides you with many more tools to process and manipulate strings (and, by extension, whole documents).

Below we first inspect the general syntax and discuss a few simple examples.

The Breakout provides more detailed background information.

Let's store (part of) the famous opening sentence of "A Tale of Two Cities" in a variable first_sentence.

In [3]:

first_sentence = "It was the best of times, it was the worst of times."

-- Exercise:¶

Print the content of first_sentence.

In [4]:

# Enter answer here

String variables (and numbers) can be thought of as objects: "things you can do stuff with". In Python, each object has a set of methods/functions attached to it. These are the tools that enable you to manipulate the object.

If objects can be thought of as the nouns of a programming language, then methods/functions serve as the verbs: they are the tools that operate on (do something with) these objects.

In general the methods (or functions) appear in these forms:

function(object)
object.method()

For string objects (str in Python), we can change the general notation to:

function(str)
str.method()

This may look confusing at first, and we can't go into detail here about these syntactic differences, but you will soon become familiar with the syntax, we promise.
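To make the two forms concrete, here is a minimal sketch that applies one function and one method to the same string. The variable names are chosen only for this illustration.

```python
text = "It was the best of times"

# function(object): the string goes between the parentheses
n_chars = len(text)

# object.method(): the string comes before the dot
shouted = text.upper()

print(n_chars)
print(shouted)
```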

Below we discuss a few functions and methods, which will provide you with the tools for working with text data (more technically strings).

`len()`¶

len() takes an object and returns the number of elements, i.e. the length of the object. When given a string, len() counts the number of characters (including spaces and punctuation), not words.
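The difference between counting characters and counting words can be sketched as follows. The example uses str.split(), a method we have not introduced yet, purely as a preview: it breaks a string into pieces wherever it finds whitespace. The variable names are ours.

```python
phrase = "best of times"

# len() on a string counts characters, spaces included
n_characters = len(phrase)

# split() returns a list of pieces; len() then counts the pieces
n_words = len(phrase.split())

print(n_characters)  # 13
print(n_words)       # 3
```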

Applying len() to first_sentence should return 52.

In [5]:

len(first_sentence)

Out[5]:

52

The first_sentence variable is just a toy example. We can easily load the actual content of "A Tale of Two Cities" and print the number of characters it contains. (Please ignore the code in the example; we show it here only to convince you how easily you could scale up from one line of text to a whole book.)

In [6]:

import requests 
book = requests.get('https://www.gutenberg.org/files/98/98-0.txt').content.decode('utf-8') # download book
print(book[:1000]) # print first 1000 characters

The Project Gutenberg eBook of A Tale of Two Cities, by Charles Dickens

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: A Tale of Two Cities
       A Story of the French Revolution

Author: Charles Dickens

Release Date: January, 1994 [eBook #98]
[Most recently updated: December 20, 2020]

Language: English

Character set encoding: UTF-8

Produced by: Judith Boss and David Widger

*** START OF THE PROJECT GUTENBERG EBOOK A TALE OF TWO CITIES ***




A TALE OF TWO CITIES

A STORY OF THE FRENCH REVOLUTION

By Charles Dickens


CONTENTS


     Book the

In [7]:

print(len(book)) # print the number of characters

`str.lower()`¶

Lowercasing is often useful for normalizing texts, i.e. removing distinctions between words we don't really care about when analysing collections at scale. For example, many search engines use lowercasing in the background to provide you with all documents that match your query, i.e. if you search for berlin you will also get results for Berlin etc. Later in this course, when we focus on counting words, lowercasing will also be useful because we want to count "Book" and "book" as the same word.

Converting all capitals to lowercase is common practice in text mining, but of course, whether it's appropriate or not depends on the purposes of your research. For example, if you are interested in Named Entities (such as place names), you had better retain capitals, as these contain useful signals for detecting such entities.
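The effect of lowercasing on counting can be sketched with a short made-up sentence. We use str.count() here, which counts how often a substring occurs; note that it matches substrings rather than whole words, so this is only a rough illustration and not the word-counting approach used later in the course.

```python
# A made-up example sentence containing both "Book" and "book"
text = "Book the First. The book opens in 1775."

# Case-sensitive count: "Book" is not matched
count_exact = text.count("book")

# Lowercase first, then count: both variants are matched
count_normalised = text.lower().count("book")

print(count_exact)       # 1
print(count_normalised)  # 2
```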

However, the most important thing at this point is that you understand the syntax of the statement and what it returns. str.lower() acts on the string (which comes before the dot) and returns a new string object.

Please note that this method works directly on a string or on a variable referring to a string.

In [8]:

print('LOWERCASE ME!'.lower()) # lowercase and print

lowercase me!

In [9]:

lowercase = 'LOWERCASE ME!' # variable assignment
print(lowercase.lower()) # lowercase variable and print

lowercase me!

Both len() and str.lower() are called fruitful functions/methods: they return something (i.e. a number and a string, respectively).

-- Exercise¶

Lowercase the variable first_sentence, store the lowercased version in a new variable and print the length of this variable.

In [10]:

# add answer here

`str.endswith(parameter)`¶

str.endswith(parameter) is another commonly used string method. It differs slightly from str.lower() because it requires an argument between the parentheses. str.endswith(parameter) returns a boolean value (True or False) that indicates whether the string on the left-hand side of the . ends with the string given as an argument. This is commonly used to check the extension of a document, for example:

In [11]:

filename = 'document_1.txt'
filename.endswith('.txt')

Out[11]:

True

In [12]:

filename.endswith('.doc')

Out[12]:

False
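In practice, this check is often applied to many filenames at once. The sketch below uses a made-up list of filenames and a for loop (a construct explained later in the course) to keep only the names ending in .txt.

```python
# Hypothetical filenames, for illustration only
filenames = ["document_1.txt", "notes.doc", "sonnet.txt", "README.md"]

txt_files = []
for name in filenames:
    # endswith() returns True or False, which decides whether we keep the name
    if name.endswith(".txt"):
        txt_files.append(name)

print(txt_files)  # ['document_1.txt', 'sonnet.txt']
```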

We are using some technical terms here, which will be explained in more detail later. However, we hope that you slowly start to pick up and remember some of these terms just by reading through the notebook. Don't worry too much about the explanations. Try to understand how the code works: that's the most important thing at this point!

dir()¶

Of course the Python string toolkit is much larger. Use the dir() function to see all the methods you can apply to a string.

In [13]:

print(dir(str))

['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

In [14]:

print(dir("Hello World."))

['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

dir() returns a list of all the tools that apply to a string. You can ignore the items starting with __, but please look at the elements further down, for example the str.upper() method.
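If the items starting with __ get in the way, one possible way to hide them is to filter the list with str.startswith(). This is only a sketch; the variable name public_methods is ours.

```python
# Keep only the names that do not start with a double underscore
public_methods = [name for name in dir(str) if not name.startswith("__")]

print(public_methods)
```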

To inspect the docstring of a method, which explains its functionality, use help().

In [15]:

help(str.upper)

Help on method_descriptor:

upper(self, /)
    Return a copy of the string converted to uppercase.

Let's see what str.upper() does!

In [16]:

'hello'.upper()

Out[16]:

'HELLO'

-- Exercise¶

Create a few code cells below
Inspect the docstring of the following methods str.strip(), str.isalpha() and str.startswith()
Create a new string variable (whatever text you prefer)
Apply the above methods to the string and print the outcome

`Breakout:`¶

more about string methods

Indexing and slicing¶

Another common type of string manipulation is indexing and slicing. Indexing here means retrieving characters of a string (it could also be another data type) by their position (i.e. obtaining the fifth or last character of a word).

In Python, we start counting from 0: to retrieve the first element, we add [0] to the end of a string (variable). Note the square brackets!

In [17]:

print(first_sentence[0])

To print the second character, we need to access the item at position 1.

In [18]:

print(first_sentence[1])

To access the last character, use [-1].

In [19]:

print(first_sentence[-1])

Slicing is similar to indexing, but it allows you to select a sequence of (multiple) characters. We still use square brackets but add a colon. To the left of the colon stands the position of the first character to include; to the right stands the position at which the slice stops. The character at the stop position itself is not included.

Below we print everything between (and including) the sixth and tenth character.

In [20]:

print(first_sentence[5:10])

s the

Negative indices can also be used for slicing.

In [21]:

print(first_sentence[-6:-1])

times

The first or last character can remain implicit.

In [22]:

print(first_sentence[:5])

It wa

In [23]:

print(first_sentence[-5:])

imes.
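The slices above follow one rule: the start position is included and the stop position is excluded. A minimal sketch of what follows from that rule, using the same first_sentence (the names start, stop and piece are ours):

```python
first_sentence = "It was the best of times, it was the worst of times."

start, stop = 5, 10
piece = first_sentence[start:stop]

print(piece)       # 's the'
print(len(piece))  # 5, i.e. stop - start

# The character at the stop position (a space) is not part of the slice
print(repr(first_sentence[stop]))

# Because the stop is excluded, two adjacent slices rebuild the original string
print(first_sentence[:5] + first_sentence[5:] == first_sentence)  # True
```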

Even though these operations seem pretty abstract, we will use indexing and slicing frequently later in this course. Please consult the breakout for more information.

-- Exercise¶

Assign the sentence (from Jane Austen's "Pride and Prejudice") below to a variable named sentence. (Please remember, double click on any Markdown cell to reveal the actual text)

"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."

Lowercase the sentence and assign it to sentence_lower
Print the first and last words of the lowercased sentence

In [24]:

# Enter code here

`Breakout`:¶

3.2 Reading and Opening Text Files¶

In this section, we transition from experimenting with mock examples to working with more realistic, historical examples. First, we do this on a small scale, but soon we'll be processing thousands of newspaper articles!

To open a file in Python, you first have to explain where it is stored. More technically, you provide a location or path as a string. The Breakout will point you to more information about the path syntax. For now, a simple example of (what is called) a relative path should suffice.

A relative path gives the location of a file relative to your current position in the folder structure of your working environment. In our case, this means relative to where the Notebook (the one in which you are working at the moment) is located.

To see the files in the current folder, run the ls . (list) command in the cell below.

In [25]:

!ls .

10_-_Hypothesis_Testing.ipynb
11_-_Linear_Regression.ipynb
12_-_Generalised_Linear_Models.ipynb
13_-_Supervised_Learning.ipynb
14_-_Topic_Modelling.ipynb
15_-_Word_Vectors.ipynb
1_-_Introduction.ipynb
2_-_Values_and_Variables.ipynb
3_-_Text_and_String_Methods.ipynb
4_-_Processing_texts.ipynb
5_-_Corpus_Selection.ipynb
6_-_Corpus_Exploration.ipynb
7_-_Trends_over_time.ipynb
8_-_Data_Exploration_with_Pandas_I.ipynb
9_-_Data_Exploration_with_Pandas_Part_II.ipynb
LICENSE
README.md
break_out
colab_backup
data
example_data
imgs
lecture_1
lecture_2
postBuild
requirements.txt
utils

Please note that !ls starts with an exclamation mark. ls is a bash command you'd normally use in a terminal. This is not very important at the moment, just remember that lines starting with ! are not Python code.

You see the folder example_data appearing. Now we can list the items in example_data, again using ls.

In [26]:

!ls example_data/

notebook_3

In [27]:

!ls example_data/notebook_3

shakespeare_sonnet_i.txt

The relative path to our file is example_data/notebook_3/shakespeare_sonnet_i.txt. Python requires you to define the path as a string (i.e. enclosed by single or double quotation marks).
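A plain string is all this notebook needs. As an optional aside, Python's standard pathlib module can also build and inspect such a path. The sketch below uses PurePosixPath so that it behaves the same on every operating system; it only manipulates the path and does not touch any file.

```python
from pathlib import PurePosixPath

# Join the folder names and the filename with the / operator
relative = PurePosixPath("example_data") / "notebook_3" / "shakespeare_sonnet_i.txt"

print(str(relative))           # the path as a string
print(relative.is_absolute())  # False: it is a relative path
print(relative.name)           # the filename
print(relative.suffix)         # the extension
```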

Getting the location right is the first part of the puzzle. Next, we need some Python tools to open a file and read its content. It may sound confusing at first (why open and read?), but these are separate steps in Python.

Let's use the open() function to open the sonnet. As you notice, this doesn't return the actual text, but a _io.TextIOWrapper object (you can safely ignore that).

In [28]:

path = "example_data/notebook_3/shakespeare_sonnet_i.txt"
sonnet = open(path)
sonnet

Out[28]:

<_io.TextIOWrapper name='example_data/notebook_3/shakespeare_sonnet_i.txt' mode='r' encoding='UTF-8'>

We need to apply the read() method to the _io.TextIOWrapper object to inspect the content of the file.

In [29]:

sonnet = open(path).read()
sonnet

Out[29]:

"From fairest creatures we desire increase,\nThat thereby beauty's rose might never die,\nBut as the riper should by time decease,\nHis tender heir might bear his memory:\nBut thou, contracted to thine own bright eyes,\nFeed'st thy light's flame with self-substantial fuel,\nMaking a famine where abundance lies,\nThyself thy foe, to thy sweet self too cruel:\nThou that art now the world's fresh ornament,\nAnd only herald to the gaudy spring,\nWithin thine own bud buriest thy content,\nAnd tender churl mak'st waste in niggarding:\nPity the world, or else this glutton be,\nTo eat the world's due, by the grave and thee."

Please note the special characters such as \n (which marks a new line). This becomes apparent when we print the sonnet.

In [30]:

print(sonnet)

From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies,
Thyself thy foe, to thy sweet self too cruel:
Thou that art now the world's fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And tender churl mak'st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.
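The cells above open the file but never close it. A common alternative is the with statement, which closes the file automatically once the indented block ends. The sketch below is not the notebook's own method: so that it runs anywhere, it first writes two lines of the sonnet to a temporary file (the name demo_sonnet.txt is made up) and then reads them back.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file, created only for this demonstration
    demo_path = os.path.join(folder, "demo_sonnet.txt")

    # Write two lines of the sonnet to the temporary file
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("From fairest creatures we desire increase,\n"
                     "That thereby beauty's rose might never die,")

    # Open and read in two separate steps; the file is closed after the block
    with open(demo_path, encoding="utf-8") as handle:
        demo_text = handle.read()

print(handle.closed)  # True: the with statement closed the file for us
print(demo_text)
```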

Since the sonnet variable refers to a string, we can use everything we learned before to analyse and manipulate this string.

In [31]:

len(sonnet)

Out[31]:

In [32]:

sonnet.lower()

Out[32]:

"from fairest creatures we desire increase,\nthat thereby beauty's rose might never die,\nbut as the riper should by time decease,\nhis tender heir might bear his memory:\nbut thou, contracted to thine own bright eyes,\nfeed'st thy light's flame with self-substantial fuel,\nmaking a famine where abundance lies,\nthyself thy foe, to thy sweet self too cruel:\nthou that art now the world's fresh ornament,\nand only herald to the gaudy spring,\nwithin thine own bud buriest thy content,\nand tender churl mak'st waste in niggarding:\npity the world, or else this glutton be,\nto eat the world's due, by the grave and thee."

str.find() allows you to query a string for a substring. It returns the lowest index at which the query substring is found in the string S, or -1 if there is no match.

In [33]:

help(str.find)

Help on method_descriptor:

find(...)
    S.find(sub[, start[, end]]) -> int
    
    Return the lowest index in S where substring sub is found,
    such that sub is contained within S[start:end].  Optional
    arguments start and end are interpreted as in slice notation.
    
    Return -1 on failure.

In [34]:

sonnet.find('riper')

Out[34]:

98

In [35]:

sonnet[98:]

Out[35]:

"riper should by time decease,\nHis tender heir might bear his memory:\nBut thou, contracted to thine own bright eyes,\nFeed'st thy light's flame with self-substantial fuel,\nMaking a famine where abundance lies,\nThyself thy foe, to thy sweet self too cruel:\nThou that art now the world's fresh ornament,\nAnd only herald to the gaudy spring,\nWithin thine own bud buriest thy content,\nAnd tender churl mak'st waste in niggarding:\nPity the world, or else this glutton be,\nTo eat the world's due, by the grave and thee."
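The index returned by str.find() can be combined with slicing to cut out exactly the matched word, and the value -1 can be used to check whether there was a match at all. A minimal sketch on a single line of the sonnet (the variable names are ours):

```python
line = "But as the riper should by time decease,"
query = "riper"

start = line.find(query)       # index of the first match
missing = line.find("summer")  # no match, so find() returns -1

print(start)    # 11
print(missing)  # -1

# Only slice when the query was actually found
if start != -1:
    word = line[start:start + len(query)]
    print(word)  # riper
```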

`Break out`¶

paths and filenames
