Variables can contain, or more correctly, refer to strings. You may have noticed how operations (such as addition) allow you to perform simple string manipulations. For example, we can write a program that prints a greeting with a name.
Change the value of the first_name
and last_name
variables so that the cell below prints a correct greeting.
first_name = 'First_name' # change this to your first name
last_name = 'Last_name' # change this to your last name
print("Hello"+' '+first_name+' '+last_name) # this combines the variables in a greeting
Hello First_name Last_name
We can achieve the same result by passing these variables as separate arguments to the print() function, which then inserts the spaces for us.
print("Hello", first_name, last_name)
Hello First_name Last_name
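The two approaches differ in one practical detail, which the small sketch below tries to show. The names and the year are example values chosen for illustration; they do not come from the course material.

```python
first_name = "Ada"        # example values, chosen for illustration
last_name = "Lovelace"

# With +, every space has to be added by hand
greeting = "Hello" + " " + first_name + " " + last_name
print(greeting)                          # Hello Ada Lovelace

# With separate arguments, print() puts a space between the items
print("Hello", first_name, last_name)    # Hello Ada Lovelace

# + only joins strings, so a number must be converted with str() first
year = 1815
born = first_name + " was born in " + str(year)
print(born)                              # Ada was born in 1815

# print() accepts the number as it is
print(first_name, "was born in", year)   # Ada was born in 1815
```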
But Python provides you with many more tools to process and manipulate strings (and, by extension, whole documents).
Below we first inspect the general syntax and discuss a few simple examples.
The Breakout
provides more detailed background information.
Let's store (a part of) the famous opening sentence of "A Tale of Two Cities" in a variable first_sentence.
first_sentence = "It was the best of times, it was the worst of times."
Print the content of first_sentence
.
# Enter answer here
String variables (and numbers) can be thought of as objects, "things you can do stuff with". In Python, each object has a set of methods/functions attached to it, which are the tools that enable you to manipulate these objects.
If objects can be thought of as the nouns of a programming language, then methods and functions serve as the verbs: they are the tools that operate on (do something with) these objects.
In general the methods (or functions) appear in these forms:
function(object)
object.method()
For string objects (str
in Python), we can change the general notation to:
function(str)
str.method()
This may look confusing at first—and we can't go into detail here about these syntactic differences—but you will get familiar with the syntax pretty soon, we promise.
Below we discuss a few functions and methods, which will provide you with the tools for working with text data (more technically strings).
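As a quick preview of the two notations, here is a minimal sketch that applies one function and one method to the same string. The variable names are made up for this example, and both tools are explained in the sections that follow.

```python
title = "A Tale of Two Cities"   # example string

# function(object): the string goes between the parentheses
number_of_characters = len(title)

# object.method(): the method is attached to the string with a dot
title_lower = title.lower()

print(number_of_characters)   # 20
print(title_lower)            # a tale of two cities
```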
len()
len() takes an object and returns the number of elements, i.e. the length of the object. When given a string, len() counts the number of characters, not words.
Applying len()
to first_sentence
should return 52.
len(first_sentence)
52
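To make the difference between characters and words concrete, the sketch below counts both. It uses str.split(), which has not been introduced yet; treat it as a preview, and note that splitting on whitespace is only a rough way of finding words.

```python
first_sentence = "It was the best of times, it was the worst of times."

# len() on a string counts every character, including spaces and punctuation
number_of_characters = len(first_sentence)
print(number_of_characters)   # 52

# split() cuts the string at whitespace and returns a list of pieces;
# len() on that list counts the pieces, i.e. (roughly) the words
words = first_sentence.split()
print(len(words))             # 12
```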
The first_sentence
variable is just a toy example. We can easily load the actual content of "A Tale of Two Cities" and print the number of characters it contains. (Please ignore the code in the example; we show it here only to convince you how easily you can scale up from one line of text to a whole book.)
import requests
book = requests.get('https://www.gutenberg.org/files/98/98-0.txt').content.decode('utf-8') # download book
print(book[:1000]) # print first 1000 characters
The Project Gutenberg eBook of A Tale of Two Cities, by Charles Dickens This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook. Title: A Tale of Two Cities A Story of the French Revolution Author: Charles Dickens Release Date: January, 1994 [eBook #98] [Most recently updated: December 20, 2020] Language: English Character set encoding: UTF-8 Produced by: Judith Boss and David Widger *** START OF THE PROJECT GUTENBERG EBOOK A TALE OF TWO CITIES *** A TALE OF TWO CITIES A STORY OF THE FRENCH REVOLUTION By Charles Dickens CONTENTS Book the
print(len(book)) # print the number of characters
793331
str.lower()
Lowercasing is often useful for normalizing texts, i.e. removing distinctions between words we don't really care about when analysing collections at scale. For example, many search engines use lowercasing in the background to provide you with all documents that match your query, i.e. if you search for berlin you will also get results for Berlin etc. Later in this course, when we focus on counting words, lowercasing will also be useful because we want to count "Book" and "book" as the same word.
Converting all capitals to lowercase is common practice in text mining, but of course, whether it's appropriate or not depends on the purposes of your research. For example, if you are interested in Named Entities (such as place names), you had better retain capitals, as these contain useful signals for detecting such entities.
However, the most important thing at this point is that you understand the syntax of the statement and what it returns. str.lower() acts on the string (which comes before the dot) and returns a string object.
Please note that this method works directly on a string or on a variable referring to a string.
print('LOWERCASE ME!'.lower()) # lowercase and print
lowercase me!
lowercase = 'LOWERCASE ME!' # variable assignment
print(lowercase.lower()) # lowercase variable and print
lowercase me!
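To connect this back to the search example, here is a small sketch of how lowercasing makes two differently capitalised spellings match. The variable names are made up for illustration.

```python
query = "berlin"          # what a user types
word_in_text = "Berlin"   # how the word appears in a document

# String comparison is case-sensitive, so these are not equal
exact_match = query == word_in_text
print(exact_match)        # False

# Lowercasing both sides removes the distinction
normalised_match = query.lower() == word_in_text.lower()
print(normalised_match)   # True
```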
Both len() and str.lower() are called fruitful functions/methods: they return something (a number and a string, respectively).
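Because the result is returned rather than written back into the original variable, you need to store it if you want to use it later. A minimal sketch with an example string:

```python
shout = "BEST OF TIMES"   # example string

# The method returns a new string; assign it to a variable to keep it
quiet = shout.lower()

print(quiet)   # best of times
print(shout)   # BEST OF TIMES (the original string is unchanged)
```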
Lowercase the variable first_sentence, store the lowercased version in a new variable and print the length of this variable.
# add answer here
str.endswith(parameter)
str.endswith(parameter) is another commonly used string method. It differs slightly from str.lower() because it requires an argument for the parameter between the parentheses. str.endswith(parameter) returns a boolean value: True if the string on the left-hand side of the . ends with the string given as an argument, and False otherwise. This is commonly used to check the extension of a document, for example:
filename = 'document_1.txt'
filename.endswith('.txt')
True
filename.endswith('.doc')
False
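Two details are worth knowing when checking extensions in practice: the comparison is case-sensitive, and the argument may also be a tuple of several endings. The sketch below illustrates both with a made-up file name.

```python
filename = "Document_1.TXT"   # made-up file name

# The check is case-sensitive
is_txt_exact = filename.endswith(".txt")
print(is_txt_exact)       # False

# Lowercasing first makes the check ignore capitalisation
is_txt_any_case = filename.lower().endswith(".txt")
print(is_txt_any_case)    # True

# A tuple of endings returns True if any one of them matches
is_text_like = filename.lower().endswith((".txt", ".md"))
print(is_text_like)       # True
```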
We are using some technical terms here, which will be explained in more detail later. However, we hope that you slowly start to pick up and remember some of these terms just by reading through the notebook. Don't worry too much about the explanations. Try to understand how the code works: that's the most important thing at this point!
Of course the Python string toolkit is much larger. Use the dir()
function to see all the methods you can apply to a string.
print(dir(str))
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
print(dir("Hello World."))
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
dir() returns a list of all the tools that apply to a string. You can ignore the items starting with __, but please look at the elements further down, for example the str.upper() method.
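If the long list is hard to read, you can filter out the names starting with __ yourself. The sketch below is one possible way of doing this; it uses a loop-like construct (a list comprehension) that you can safely ignore for now. The exact number of names depends on your Python version, so no count is shown.

```python
# Keep only the names that do not start with a double underscore
public_methods = [name for name in dir(str) if not name.startswith("__")]

print(public_methods[:5])   # ['capitalize', 'casefold', 'center', 'count', 'encode']
print("upper" in public_methods)   # True
```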
To inspect the docstring of a method, which explains its functionality, use help().
help(str.upper)
Help on method_descriptor:

upper(self, /)
    Return a copy of the string converted to uppercase.
Let's see what str.upper()
does!
'hello'.upper()
'HELLO'
str.strip()
, str.isalpha()
and str.startswith()
Breakout:
Another common type of string manipulation is indexing and slicing. Indexing here means retrieving characters of a string (it could also be another data type) by their position (e.g. obtaining the fifth or last character of a word).
In Python, we start counting from 0
: to retrieve the first element, we add [0]
to the end of a string (variable). Note the square brackets!
print(first_sentence[0])
I
To print the second character, we need to access the item at position 1.
print(first_sentence[1])
t
To access the last character, use [-1]
.
print(first_sentence[-1])
.
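Positive and negative positions are two ways of counting along the same string. The sketch below shows how they relate; keep in mind that, because counting starts at 0, the last valid position is one less than the length, and asking for a position beyond it raises an IndexError.

```python
first_sentence = "It was the best of times, it was the worst of times."

# Counting starts at 0, so the last position is the length minus one
last_position = len(first_sentence) - 1
print(last_position)                   # 51

print(first_sentence[last_position])   # .
print(first_sentence[-1])              # . (the same character)
print(first_sentence[-2])              # s (the second to last character)
```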
Slicing is similar to indexing, but it allows you to select a sequence of (multiple) characters. We still use square brackets but add a colon. To the left of the colon stands the position of the first character to include; to the right stands the position at which the slice stops, which is itself not included.
Below we print everything from the sixth up to and including the tenth character, i.e. positions 5 to 9.
print(first_sentence[5:10])
s the
Negative indices can also be used for slicing.
print(first_sentence[-6:-1])
times
The first or last character can remain implicit.
print(first_sentence[:5])
It wa
print(first_sentence[-5:])
imes.
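The fact that the position after the colon is not included is easy to forget, so here is a small sketch that checks it directly, using the same sentence as above.

```python
first_sentence = "It was the best of times, it was the worst of times."

chunk = first_sentence[5:10]
print(chunk)        # s the
print(len(chunk))   # 5, i.e. 10 minus 5

# Position 10 holds a space, and it is not part of the slice
print(first_sentence[10] == " ")   # True

# Two neighbouring slices put the original string back together
rebuilt = first_sentence[:10] + first_sentence[10:]
print(rebuilt == first_sentence)   # True
```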
Even though these operations seem pretty abstract, we will use indexing and slicing frequently later in this course. Please consult the breakout
for more information.
sentence
. (Please remember, double click on any Markdown cell to reveal the actual text)"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."
sentence_lower
# Enter code here
Breakout:
In this section, we transition from experimenting with mock examples to working with more realistic, historical examples. First, we do this on a small scale, but soon we'll be processing thousands of newspaper articles!
To open a file in Python, you first have to say where it is stored. More technically, you provide a location or path as a string. The Breakout will point you to more information about the path syntax; for now, a simple example of (what is called) a relative path should suffice.
A relative path gives the location of a file relative to your current position in the folder structure of your working environment. In our case, this means relative to where the Notebook (the one in which you are working at the moment) is located.
To see the files in the current folder, run the ls . (list) command in the cell below.
!ls .
10_-_Hypothesis_Testing.ipynb 11_-_Linear_Regression.ipynb 12_-_Generalised_Linear_Models.ipynb 13_-_Supervised_Learning.ipynb 14_-_Topic_Modelling.ipynb 15_-_Word_Vectors.ipynb 1_-_Introduction.ipynb 2_-_Values_and_Variables.ipynb 3_-_Text_and_String_Methods.ipynb 4_-_Processing_texts.ipynb 5_-_Corpus_Selection.ipynb 6_-_Corpus_Exploration.ipynb 7_-_Trends_over_time.ipynb 8_-_Data_Exploration_with_Pandas_I.ipynb 9_-_Data_Exploration_with_Pandas_Part_II.ipynb LICENSE README.md break_out colab_backup data example_data imgs lecture_1 lecture_2 postBuild requirements.txt utils
Please note that !ls
starts with an exclamation mark. ls
is a bash command you'd normally use in a terminal. This is not very important at the moment, just remember that lines starting with !
are not Python code.
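If you prefer to stay within Python, the standard os module offers similar tools. The sketch below is one way of inspecting your current position and its contents; the output depends on your own machine, so none is shown. The folder name example_data is taken from the listing above and does not need to exist for the last line to work.

```python
import os

# The folder that relative paths are resolved against
current_folder = os.getcwd()
print(current_folder)

# The names of the files and folders in the current folder (similar to ls .)
entries = sorted(os.listdir("."))
print(entries)

# Turn a relative path into the corresponding absolute path
absolute_path = os.path.abspath("example_data")
print(absolute_path)
```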
You see the folder example_data appearing in the list. Now we can list the items in example_data, again using ls.
!ls example_data/
notebook_3
!ls example_data/notebook_3
shakespeare_sonnet_i.txt
The relative path to our file is example_data/notebook_3/shakespeare_sonnet_i.txt. Python requires you to define the path as a string (i.e. enclosed by single or double quotation marks).
Getting the location right is the first part of the puzzle. Next, we need some Python tools to open a file and read its content. It may sound confusing at first (why open and read?), but these are separate steps in Python.
Let's use the open() function to open the sonnet. As you will notice, this doesn't return the actual text, but an _io.TextIOWrapper object (you can safely ignore that).
path = "example_data/notebook_3/shakespeare_sonnet_i.txt"
sonnet = open(path)
sonnet
<_io.TextIOWrapper name='example_data/notebook_3/shakespeare_sonnet_i.txt' mode='r' encoding='UTF-8'>
We need to apply the read()
method to the _io.TextIOWrapper
object to inspect the content of the file.
sonnet = open(path).read()
sonnet
"From fairest creatures we desire increase,\nThat thereby beauty's rose might never die,\nBut as the riper should by time decease,\nHis tender heir might bear his memory:\nBut thou, contracted to thine own bright eyes,\nFeed'st thy light's flame with self-substantial fuel,\nMaking a famine where abundance lies,\nThyself thy foe, to thy sweet self too cruel:\nThou that art now the world's fresh ornament,\nAnd only herald to the gaudy spring,\nWithin thine own bud buriest thy content,\nAnd tender churl mak'st waste in niggarding:\nPity the world, or else this glutton be,\nTo eat the world's due, by the grave and thee."
Please note the special characters such as \n
(which marks a new line). This becomes apparent when we print the sonnet.
print(sonnet)
From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies,
Thyself thy foe, to thy sweet self too cruel:
Thou that art now the world's fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And tender churl mak'st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.
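The open-then-read steps and the \n character can be tried out together in a small, self-contained sketch. So that it runs anywhere, it first writes a two-line stand-in file to a temporary folder; demo_path is a hypothetical location chosen for this example, not part of the course data. It also uses a with block, a common Python idiom that closes the file for you once the block ends.

```python
import os
import tempfile

# Write a two-line stand-in file so that this sketch runs anywhere
# (demo_path is a hypothetical location, not part of the course data)
demo_path = os.path.join(tempfile.gettempdir(), "demo_sonnet.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("From fairest creatures we desire increase,\n"
                 "That thereby beauty's rose might never die,")

# Open and read inside one block; the file is closed automatically afterwards
with open(demo_path, encoding="utf-8") as handle:
    text = handle.read()

print(handle.closed)       # True
print(repr(text))          # shows \n written out as two visible characters
print(text)                # shows an actual line break instead
print(text.count("\n"))    # 1
lines = text.splitlines()  # a list with one string per line
print(len(lines))          # 2
```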
Since the sonnet
variable refers to a string, we can use everything we learned before to analyse and manipulate this string.
len(sonnet)
609
sonnet.lower()
"from fairest creatures we desire increase,\nthat thereby beauty's rose might never die,\nbut as the riper should by time decease,\nhis tender heir might bear his memory:\nbut thou, contracted to thine own bright eyes,\nfeed'st thy light's flame with self-substantial fuel,\nmaking a famine where abundance lies,\nthyself thy foe, to thy sweet self too cruel:\nthou that art now the world's fresh ornament,\nand only herald to the gaudy spring,\nwithin thine own bud buriest thy content,\nand tender churl mak'st waste in niggarding:\npity the world, or else this glutton be,\nto eat the world's due, by the grave and thee."
str.find() allows you to query a string for a substring. It returns the lowest index at which the substring occurs in the string S, i.e. the position of the first match, or -1 if there is no match.
help(str.find)
Help on method_descriptor:

find(...)
    S.find(sub[, start[, end]]) -> int

    Return the lowest index in S where substring sub is found,
    such that sub is contained within S[start:end]. Optional
    arguments start and end are interpreted as in slice notation.

    Return -1 on failure.
sonnet.find('riper')
98
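The optional start argument and the -1 return value are easiest to understand on a short string. The sketch below uses a lowercased version of the earlier example sentence rather than the sonnet, so that the positions are easy to verify by hand.

```python
line = "it was the best of times, it was the worst of times."

# The first match
first = line.find("times")
print(first)    # 19

# Searching again from just after the first match finds the second one
second = line.find("times", first + 1)
print(second)   # 46

# No match: find() returns -1 instead of raising an error
missing = line.find("season")
print(missing)  # -1
```

Because -1 is also a valid (negative) index, it is a good idea to check for it before using the result to slice a string.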
sonnet[98:]
"riper should by time decease,\nHis tender heir might bear his memory:\nBut thou, contracted to thine own bright eyes,\nFeed'st thy light's flame with self-substantial fuel,\nMaking a famine where abundance lies,\nThyself thy foe, to thy sweet self too cruel:\nThou that art now the world's fresh ornament,\nAnd only herald to the gaudy spring,\nWithin thine own bud buriest thy content,\nAnd tender churl mak'st waste in niggarding:\nPity the world, or else this glutton be,\nTo eat the world's due, by the grave and thee."