(text-intro)=
This chapter covers how to use code to work with text as data, including opening files that contain text, changing and cleaning text, and vectorised operations on text.
It has benefitted from the Python String Cook Book and Jake VanderPlas' Python Data Science Handbook.
Note that regexes are mentioned a few times in this chapter; you'll find out much more about them in the {ref}`text-regex` chapter.
Before we get to the good stuff, we need to talk about string encodings. Whether you're using code or a text editor (Notepad, Word, Pages, Visual Studio Code), every bit of text that you see on a computer will have an encoding behind the scenes that tells the computer how to display the underlying data. There is no such thing as 'plain' text: all text on computers is the result of an encoding. Oftentimes, a computer programme (email reader, Word, whatever) will guess the encoding and show you what it thinks the text should look like. But it doesn't always know, or get it right: that is what is happening when you get an email or open a file full of weird symbols and question marks. If a computer doesn't know whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), it simply cannot display it correctly and you get gibberish.
When it comes to encodings, there are just two things to remember: i) you should use UTF-8 (the most common encoding of the Unicode standard), because it's the international standard; ii) the Windows operating system tends to use either Latin 1 or Windows 1252 but (and this is good news) is moving to UTF-8.
Unicode is a specification that aims to list every character used by human languages and give each character its own unique code. The Unicode specifications are continually revised and updated to add new languages and symbols.
Take special care when saving CSV files containing text on a Windows machine using Excel; unless you specify otherwise, the text may not be saved in UTF-8. If you (or your computer) get confused enough about encodings and re-save a file with the wrong one, you could lose data.
Hopefully you'll never have to worry about string encodings. But if you do see weird symbols appearing in your text, at least you'll know that there's an encoding problem and will know where to start Googling. You can find a much more in-depth explanation of text encodings here.
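To make this concrete, here's a minimal sketch (using a made-up string) of encoding text to bytes and decoding it back, including the gibberish you get when the wrong encoding is used:

```python
text = "café £20"

# encode the string to raw bytes using UTF-8
raw_bytes = text.encode("utf-8")
print(raw_bytes)

# decoding with the right encoding recovers the original text
print(raw_bytes.decode("utf-8"))

# decoding with the wrong encoding produces 'mojibake'
print(raw_bytes.decode("latin-1"))
```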
Note that there are many built-in functions for working with strings in Python; you can find a comprehensive list here.
Strings are the basic data type for text in Python. They can be of any length. A string can be signalled by single quote marks or double quote marks like so:
'text'
or
"text"
Style guides tend to prefer the latter but some coders (ahem!) have a bad habit of using the former. We can put this into a variable like so:
var = "banana"
Now, if we check the type of the variable:
type(var)
We see that it is `str`, which is short for string.
Strings in Python can be indexed, so we can get certain characters out by using square brackets to say which positions we would like.
var[:3]
The usual slicing tricks that apply to lists work for strings too, i.e. the positions you want to get can be retrieved using the `var[start:stop:step]` syntax. Here's an example of getting every other character from the string, starting from the 2nd position.
var[1::2]
Note that strings, like tuples such as `(1, 2, 3)` but unlike lists such as `[1, 2, 3]`, are immutable. This means commands like `var[0] = "B"` will result in an error. If you want to change a single character, you will have to replace the entire string. In this example, the command to do that would be `var = "Banana"`.
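If you do want a version of the string with just one character changed, a common idiom is to build a new string from slices; here's a quick sketch:

```python
var = "banana"

# strings can't be edited in place, so make a new one from pieces
var = "B" + var[1:]
var
```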
Like lists, you can find the length of a string using `len()`:
len(var)
The `+` operator concatenates two or more strings:
second_word = "panther"
first_word = "black"
print(first_word + " " + second_word)
Note that we added a space so that the phrase made sense. Another way of achieving the same end that scales to many words more efficiently (if you have them in a list) is:
" ".join([first_word, second_word])
Three useful functions to know about are `upper()`, `lower()`, and `title()`. Let's see what they do:
var = "input TEXT"
var_list = [var.upper(), var.lower(), var.title()]
print(var_list)
```{admonition} Exercise
Reverse the string `"gnirts desrever a si sihT"` using indexing operations.
```
While we're using `print()`, it has a few tricks. If we have a list, we can print out entries with a given separator:
print(*var_list, sep="; and \n")
(We'll find out more about what '\n' does shortly.) To turn variables of other kinds into strings, use the `str()` function, for example:
(
"A boolean is either "
+ str(True)
+ " or "
+ str(False)
+ ", there are only "
+ str(2)
+ " options."
)
In this example, two boolean variables and one integer variable were converted to strings. `str()` generally makes an intelligent guess at how you'd like to convert your non-string type variable into a string type. You can pass a variable or a literal value to `str()`.
The example above is quite verbose. Another way of combining strings with variables is via f-strings. A simple f-string looks like this:
variable = 15.32399
print(f"You scored {variable}")
This is similar to calling `str()` on `variable` and using `+` for concatenation, but much shorter to write. You can add expressions to f-strings too:
print(f"You scored {variable**2}")
This also works with functions; after all, `**2` is just a function with its own special syntax.
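For instance, here's a quick sketch of calling a built-in function inside an f-string:

```python
variable = 15.32399

# any expression, including a function call, can go inside the braces
print(f"Your score rounds to {round(variable, 2)}")
```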
In this example, the score number that came out had a lot of (probably) uninteresting decimal places. So how do we polish the printed output? You can pass more information to the f-string to get the output formatted just the way you want. Let's say we wanted two decimal places and a sign (although you always write `+` in the formatting, the sign comes out as + or - depending on the value):
print(f"You scored {variable:+.2f}")
There are a whole range of formatting options for numbers as shown in the following table:
| Number | Format | Output | Description |
|------------|-----------|--------------|-----------------------------------------------|
| 15.32347 | `{:.2f}` | `15.32` | Format float to 2 decimal places |
| 15.32347 | `{:+.2f}` | `+15.32` | Format float to 2 decimal places with sign |
| -1 | `{:+.2f}` | `-1.00` | Format float to 2 decimal places with sign |
| 15.32347 | `{:.0f}` | `15` | Format float with no decimal places |
| 3 | `{:0>2d}` | `03` | Pad number with zeros (left padding, width 2) |
| 3 | `{:*<4d}` | `3***` | Pad number with *'s (right padding, width 4) |
| 13 | `{:*<4d}` | `13**` | Pad number with *'s (right padding, width 4) |
| 1000000 | `{:,}` | `1,000,000` | Number format with comma separator |
| 0.25 | `{:.1%}` | `25.0%` | Format percentage |
| 1000000000 | `{:.2e}` | `1.00e+09` | Exponent notation |
| 12 | `{:10d}` | `        12` | Right aligned (default, width 10) |
| 12 | `{:<10d}` | `12        ` | Left aligned (width 10) |
| 12 | `{:^10d}` | `    12    ` | Centre aligned (width 10) |
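The same format specifications work inside f-strings; here's a quick sketch using a few rows from the table:

```python
# two decimal places with a sign
print(f"{15.32347:+.2f}")
# thousands separator
print(f"{1000000:,}")
# percentage to one decimal place
print(f"{0.25:.1%}")
# pad with zeros to width 2
print(f"{3:0>2d}")
```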
As well as using this page interactively through the Colab and Binder links at the top of the page, or downloading this page and using it on your own computer, you can play around with some of these options over at this link.
Python has a string module that comes with some useful built-in strings and characters. For example
import string
string.punctuation
gives you all of the punctuation,
string.ascii_letters
returns all of the basic letters in the 'ASCII' encoding (with `.ascii_lowercase` and `.ascii_uppercase` variants), and
string.digits
gives you the numbers from 0 to 9. Finally, though less impressive visually, `string.whitespace` gives a string containing all of the different (there is more than one!) types of whitespace.
There are other special characters around; in fact, we already met the most famous of them: "\n" for new line. To actually print "\n" we have to 'escape' the backslash by adding another backslash:
print("Here is a \n new line")
print("Here is an \\n escaped new line ")
The table below shows the most important escape commands:
| Code | Result |
|------|--------------------------------------------------|
| `\'` | Single Quote (useful if using `'` for strings) |
| `\"` | Double Quote (useful if using `"` for strings) |
| `\\` | Backslash |
| `\n` | New Line |
| `\r` | Carriage Return |
| `\t` | Tab |
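Here's a short sketch of a few of these escape codes in action:

```python
# escaped double quotes and a tab
print("She said \"hello\"\tand waved.")
# an escaped single quote and an escaped backslash
print('It\'s just one backslash: \\')
```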
To see the full set of methods that come attached to any string, you can use the `inspect()` function from the rich package:
from rich import inspect
var_of_type_str = "string"
inspect(var_of_type_str, methods=True)
You often want to make changes to the text you're working with. In this section, we'll look at the various options to do this.
A common text task is to replace a substring within a longer string. Let's say you have a string variable `var`. You can use `.replace(old_text, new_text)` to do this.
"Value is objective".replace("objective", "subjective")
As with any variable of a specific type (here, string), this would also work with variables:
text = "Value is objective"
old_substr = "objective"
new_substr = "subjective"
text.replace(old_substr, new_substr)
Note that `.replace()` performs an exact replace and so is case-sensitive.
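A quick sketch of that case sensitivity (note the capital 'O'):

```python
# "Objective" is not an exact match for "objective", so nothing is replaced
"Value is Objective".replace("objective", "subjective")
```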
A character is an individual entry within a string, like the 'l' in 'equilibrium'. You can always count the number of characters in a string variable called `var` by using `len(var)`. A very fast method for replacing individual characters in a string is `str.translate()`.
Replacing characters is extremely useful in certain situations, most commonly when you wish to remove all punctuation prior to doing other text analysis. You can use the built-in `string.punctuation` for this.
Let's see how to use it to remove all of the vowels from some text. With apologies to economist Lisa Cook, we'll use the abstract from {cite:t}`cook2011inventing` as the text we'll modify, and we'll first create a dictionary of translations of vowels to nothing, i.e. `""`.
example_text = "Much recent work has focused on the influence of social capital on innovative outcomes. Little research has been done on disadvantaged groups who were often restricted from participation in social networks that provide information necessary for invention and innovation. Unique new data on African American inventors and patentees between 1843 and 1930 permit an empirical investigation of the relation between social capital and economic outcomes. I find that African Americans used both traditional, i.e., occupation-based, and nontraditional, i.e., civic, networks to maximize inventive output and that laws constraining social-capital formation are most negatively correlated with economically important inventive activity."
vowels = "aeiou"
translation_dict = {x: "" for x in vowels}
translation_dict
Now we turn our dictionary into a string translator and apply it to our text:
translator = example_text.maketrans(translation_dict)
example_text.translate(translator)
```{admonition} Exercise
Use `translate()` to replace all punctuation from the following sentence with spaces: "The well-known story I told at the conferences [about hypocondria] in Boston, New York, Philadelphia,...and Richmond went as follows: It amused people who knew Tommy to hear this; however, it distressed Suzi when Tommy (1982--2019) asked, \"How can I find out who yelled, 'Fire!' in the theater?\" and then didn't wait to hear Missy give the answer---'Dick Tracy.'"
```
Generally, `str.translate()` is very fast at replacing individual characters in strings. But you can also do it using a list comprehension and a `join()` of the resulting list, like so:
"".join(
[
ch
for ch in "Example. string. with- excess_ [punctuation]/,"
if ch not in string.punctuation
]
)
A special case of string cleaning occurs when you are given text with lots of non-standard characters, spaces, and other symbols, and what you want is a clean string suitable for a filename or a column heading in a dataframe. Remember that it's best practice to use filenames that don't have spaces in them. Slugifying is the process of creating the latter from the former, and we can use the slugify package to do it.
Here are some examples of slugifying text:
from slugify import slugify
txt = "the quick brown fox jumps over the lazy dog"
slugify(txt, stopwords=["the"])
In this very simple example, the words listed in the `stopwords=` keyword argument (a list) are removed and spaces are replaced by hyphens. Let's now see a more complicated example:
slugify("当我的信息改变时... àccêntæd tËXT ")
Slugify converts text to Latin characters, while also removing accents and whitespace (of all kinds; the last whitespace in the example above is a tab). There's also a `replacements=` keyword argument that will replace specific strings with other strings using a list-of-lists format, eg `replacements=[['old_text', 'new_text']]`.
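For example, here's a minimal sketch of a replacement being applied before the slug is created (the input text is made up, and we're assuming the python-slugify package, whose `replacements=` argument takes a list of `[old, new]` pairs):

```python
from slugify import slugify

# swap "|" for "or" before slugifying
slugify("profit | loss", replacements=[["|", "or"]])
```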
If you want to split a string at a certain position, there are two quick ways to do it. The first is to use indexing methods, which work well if you know at which position you want to split text, eg
"This is a sentence and we will split it at character 18"[:18]
Next up, we can use the built-in `split()` function, which splits a string wherever a given sub-string occurs and returns a list of the pieces:
"This is a sentence. And another sentence. And a third sentence".split(".")
Note that the character used to split the string is removed from the resulting list of strings. Let's see an example with a string used for splitting instead of a single character:
"This is a sentence. And another sentence. And a third sentence".split("sentence")
A useful extra function to know about is `splitlines()`, which splits a string at line breaks and returns the split parts as a list.
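A quick sketch of `splitlines()`:

```python
# split on line breaks
"first line\nsecond line\nthird line".splitlines()
```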
Let's do some simple counting of words within text using `str.count()`, taking the first verse of Elizabeth Bishop's sestina 'A Miracle for Breakfast' as our text.
text = "At six o'clock we were waiting for coffee, \n waiting for coffee and the charitable crumb \n that was going to be served from a certain balcony \n --like kings of old, or like a miracle. \n It was still dark. One foot of the sun \n steadied itself on a long ripple in the river."
word = "coffee"
print(f'The word "{word}" appears {text.count(word)} times.')
Meanwhile, `find()` returns the position where a particular word or character first occurs.
text.find(word)
We can check this using the number we get and some string indexing:
text[text.find(word) : text.find(word) + len(word)]
But this isn't the only place where the word 'coffee' appears. If we want to find the last occurrence, it's
text.rfind(word)
For this section, it's useful to be familiar with the pandas package, which is covered in the Data Analysis Quickstart and Working with Data sections. This section will closely follow the treatment by Jake VanderPlas.
We've seen how to work with individual strings. But often we want to work with a group of strings, otherwise known as a corpus: a collection of texts. It could be a collection of words, sentences, paragraphs, or some domain-based grouping (eg job descriptions).
Fortunately, many of the methods that we have seen deployed on a single string can be straightforwardly scaled up to hundreds, thousands, or millions of strings using pandas or other tools. This scaling up is achieved via vectorisation, in analogy with going from a single value (a scalar) to multiple values in a list (a vector).
As a very minimal example, here is capitalisation of names vectorised using a list comprehension:
[name.capitalize() for name in ["ada", "adam", "elinor", "grace", "jean"]]
A pandas series can be used in place of a list. Let's create the series first:
import pandas as pd
dfs = pd.Series(
["ada lovelace", "adam smith", "elinor ostrom", "grace hopper", "jean bartik"],
dtype="string",
)
dfs
Now we use the `series.str.function()` syntax to change the text series:
dfs.str.title()
If we had a dataframe and not a series, the syntax would change to refer just to the column of interest like so:
df = pd.DataFrame(dfs, columns=["names"])
df["names"].str.title()
The table below shows a non-exhaustive list of the string methods that are available in pandas.
| Function (preceded by `.str.`) | What it does |
|---|---|
| `len()` | Length of string. |
| `lower()` | Put string in lower case. |
| `upper()` | Put string in upper case. |
| `capitalize()` | Put string in leading upper case. |
| `swapcase()` | Swap cases in a string. |
| `translate()` | Returns a copy of the string in which each character has been mapped through a given translation table. |
| `ljust()` | Left pad a string (default is to pad with spaces). |
| `rjust()` | Right pad a string (default is to pad with spaces). |
| `center()` | Pad such that the string appears in the centre (default is to pad with spaces). |
| `zfill()` | Pad with zeros. |
| `strip()` | Strip out leading and trailing whitespace. |
| `rstrip()` | Strip out trailing whitespace. |
| `lstrip()` | Strip out leading whitespace. |
| `find()` | Return the lowest index in the data where a substring appears. |
| `split()` | Split the string using a passed substring as the delimiter. |
| `isupper()` | Check whether string is upper case. |
| `isdigit()` | Check whether string is composed of digits. |
| `islower()` | Check whether string is lower case. |
| `startswith()` | Check whether string starts with a given sub-string. |
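As a small illustration using the series created above, here's a quick sketch of a couple of these methods in action:

```python
# length of each string in the series
dfs.str.len()

# does each entry start with "a"?
dfs.str.startswith("a")
```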
Regular expressions can also be scaled up with pandas. The below table shows vectorised regular expressions.
| Function | What it does |
|---|---|
| `match()` | Call `re.match()` on each element, returning a boolean. |
| `extract()` | Call `re.match()` on each element, returning matched groups as strings. |
| `findall()` | Call `re.findall()` on each element. |
| `replace()` | Replace occurrences of pattern with some other string. |
| `contains()` | Call `re.search()` on each element, returning a boolean. |
| `count()` | Count occurrences of pattern. |
| `split()` | Equivalent to `str.split()`, but accepts regexes. |
| `rsplit()` | Equivalent to `str.rsplit()`, but accepts regexes. |
Let's see a couple of these in action. First, splitting on a given sub-string:
df["names"].str.split(" ")
It's fairly common that you want to split out strings and save the results to new columns in your dataframe. You can specify a (max) number of splits via the `n=` keyword argument, and you can get the results back as separate dataframe columns using `expand=True`:
df["names"].str.split(" ", n=2, expand=True)
```{admonition} Exercise
Using vectorised operations, create a new column with the index position where the first vowel occurs for each row in the `names` column.
```
Here's an example of using a regex function with pandas:
df["names"].str.extract("(\w+)", expand=False)
There are a few more vectorised string operations that are useful.
| Method | Description |
|---|---|
| `get()` | Index each element |
| `slice()` | Slice each element |
| `slice_replace()` | Replace slice in each element with passed value |
| `cat()` | Concatenate strings |
| `repeat()` | Repeat values |
| `normalize()` | Return Unicode form of string |
| `pad()` | Add whitespace to left, right, or both sides of strings |
| `wrap()` | Split long strings into lines with length less than a given width |
| `join()` | Join strings in each element of the Series with passed separator |
| `get_dummies()` | Extract dummy variables as a dataframe |
The `get()` and `slice()` methods give access to elements of the lists returned by `split()`. Here's an example that combines `split()` and `get()`:
df["names"].str.split().str.get(-1)
We already saw `get_dummies()` in the Regression chapter, but it's worth revisiting it here with strings. If we have a column with tags split by a symbol, we can use this function to split it out. For example, let's create a dataframe with a column that mixes subject and nationality tags:
df = pd.DataFrame(
{
"names": [
"ada lovelace",
"adam smith",
"elinor ostrom",
"grace hopper",
"jean bartik",
],
"tags": ["uk; cs", "uk; econ", "usa; econ", "usa; cs", "usa; cs"],
}
)
df
If we now use `str.get_dummies()` and split on `;`, we can get a dataframe of dummies.
df["tags"].str.get_dummies(";")
If you have just a plain text file, you can read it in like so:
fname = 'book.txt'
with open(fname, encoding='utf-8') as f:
    text_of_book = f.read()
You can also read a text file directly into a pandas dataframe using
df = pd.read_csv('book.txt', delimiter = "\n")
In the above, the delimiter for different rows of the dataframe is set as "\n", which means new line, but you could use whatever delimiter you prefer.
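Depending on your pandas version, `"\n"` may not be accepted as a delimiter; if you hit that, a minimal sketch of an alternative is to read the file yourself and build the dataframe from its lines:

```python
# read the file and create one dataframe row per line of text
with open("book.txt", encoding="utf-8") as f:
    df = pd.DataFrame({"text": f.read().splitlines()})
```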
```{admonition} Exercise
Download the file 'smith_won.txt' from this book's GitHub repository using this [link](https://github.com/aeturrell/coding-for-economists/blob/main/data/smith_won.txt) (use right-click and save as). Then read the text in using **pandas**.
```
CSV files are already split into rows. By far the easiest way to read in CSV files is using pandas:
df = pd.read_csv('book.csv')
Remember that pandas can read many other file types too.