#!/usr/bin/env python
# coding: utf-8
# **Note**: Click on "*Kernel*" > "*Restart Kernel and Clear All Outputs*" in [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) *before* reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it [in the cloud
](https://mybinder.org/v2/gh/webartifex/intro-to-python/develop?urlpath=lab/tree/06_text/01_content.ipynb).
# # Chapter 6: Text & Bytes (continued)
# In this second part of the chapter, we look in more detail at how `str` objects work in memory, in particular how the $0$s and $1$s in the memory translate into characters.
# ## Special Characters
# As previously seen, some characters have a special meaning when following the **escape character** `"\"`. Besides escaping the kind of quote used as the `str` object's delimiter, `'` or `"`, most of these **escape sequences** (i.e., `"\"` with the subsequent character) act as **control characters** that move the "cursor" in the output *without* generating any pixel on the screen. Because of that, we only see the effect of such escape sequences when used with the [print()
](https://docs.python.org/3/library/functions.html#print) function. The [documentation
](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals) lists all available escape sequences, of which we show the most important ones below.
#
# The most common escape sequence is `"\n"` that "prints" a [newline character
](https://en.wikipedia.org/wiki/Newline) that is also called the line feed character or LF for short.
# In[1]:
"This is a sentence\nthat is printed\non three lines."
# In[2]:
print("This is a sentence\nthat is printed\non three lines.")
# `"\b"` is the [backspace character
](https://en.wikipedia.org/wiki/Backspace), or BS for short, that moves the cursor back by one character.
# In[3]:
print("ABC\bX")
# In[4]:
print("ABC\bXY")
# Similarly, `"\r"` is the [carriage return character
](https://en.wikipedia.org/wiki/Carriage_return), or CR for short, that moves the cursor back to the beginning of the line.
# In[5]:
print("ABC\rX")
# In[6]:
print("ABC\rXY")
# While Linux and modern macOS systems use solely `"\n"` to express a new line, Windows systems default to using `"\r\n"`. This may lead to "weird" bugs in software projects where people using both kinds of operating systems collaborate.
# In[7]:
print("This is a sentence\r\nthat is printed\r\non three lines.")
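# As a hedged illustration of the line-ending issue described above, the sketch below compares splitting on `"\n"` with the `str` type's [splitlines()](https://docs.python.org/3/library/stdtypes.html#str.splitlines) method, which treats `"\n"` and `"\r\n"` alike. The two example texts are made up for this purpose.

```python
# Hypothetical texts with Unix-style and Windows-style line endings.
unix_text = "first\nsecond\nthird"
windows_text = "first\r\nsecond\r\nthird"

# Splitting on "\n" alone leaves a stray "\r" at the end of the Windows lines ...
naive_lines = windows_text.split("\n")
print(naive_lines)

# ... while splitlines() recognizes both conventions.
print(unix_text.splitlines() == windows_text.splitlines())
```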
# `"\t"` makes the cursor "jump" to the next of several equidistant tab stops. That may be useful for formatting a program's lengthy, tabular output.
# In[8]:
print("Jump\tfrom\ttab\tstop\tto\ttab\tstop.\nThe\tsecond\tline\tdoes\tso\ttoo.")
# ### Raw Strings
# Sometimes we do *not* want the backslash `"\"` and its subsequent character to be interpreted as an escape sequence. For example, let's print a typical installation path on a Windows system. Obviously, the newline character `"\n"` does *not* make sense here.
# In[9]:
print("C:\Programs\new_application")
# Some string literals even produce a `SyntaxError` because what follows the `"\U"` can *not* be interpreted as a Unicode code point (cf., next section).
# In[10]:
print("C:\Users\Administrator\Desktop\Project")
# A simple solution would be to escape the escape character with a *second* backslash `"\"`.
# In[11]:
print("C:\\Programs\\new_application")
# In[12]:
print("C:\\Users\\Administrator\\Desktop\\Project")
# However, this is tedious to remember and type. For such use cases, Python allows us to prefix any string literal with an `r`. The literal is then interpreted in a "raw" way.
# In[13]:
print(r"C:\Programs\new_application")
# In[14]:
print(r"C:\Users\Administrator\Desktop\Project")
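# To check that both approaches lead to the same `str` object, a minimal sketch compares the doubly escaped literal with the raw one. The variable names are chosen for illustration only.

```python
escaped = "C:\\Programs\\new_application"
raw = r"C:\Programs\new_application"

# Both literals create str objects with identical contents.
print(escaped == raw)

# "\n" is a single newline character, whereas r"\n" consists of
# two characters: a backslash followed by the letter "n".
print(len("\n"), len(r"\n"))
```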
# ## Characters are Numbers with a Convention
# So far, we used the term **character** without any further consideration. In this section, we briefly look into what characters are and how they are modeled in software.
#
# [Chapter 5
](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/05_numbers/00_content.ipynb) gives us an idea of how individual **bits** are used to express all types of numbers, from "simple" `int` objects to "complex" `float` ones. To model characters, another **layer of abstraction** is put on top of whole numbers. So, just as bits are used to express integers, integers themselves are used to express characters.
# ### ASCII
# Many conventions have been developed as to what integer is associated with which character. The most basic one that was also adopted around the world is the so-called [American Standard Code for Information Interchange
](https://en.wikipedia.org/wiki/ASCII), or **ASCII** for short. It uses 7 bits of information to map the unprintable control characters as well as the printable letters of the alphabet, numbers, and common symbols to the numbers `0` through `127`.
#
# A mapping from characters to numbers is referred to by the technical term **encoding**. We may use the built-in [ord()
](https://docs.python.org/3/library/functions.html#ord) function to **encode** any single character. The inverse to that is the built-in [chr()
](https://docs.python.org/3/library/functions.html#chr) function, which **decodes** a number into a character.
# In[15]:
ord("A")
# In[16]:
chr(65)
# Of course, unprintable escape sequences like `"\n"` count as only *one* character.
# In[17]:
ord("\n")
# In[18]:
chr(10)
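# The following sketch illustrates that [ord()](https://docs.python.org/3/library/functions.html#ord) and [chr()](https://docs.python.org/3/library/functions.html#chr) undo each other and that [ord()](https://docs.python.org/3/library/functions.html#ord) accepts only `str` objects of length one. The characters in the loop are arbitrary examples.

```python
# Encoding a character and decoding the result gives back the character.
for character in "A", "a", "5", "\n":
    number = ord(character)
    assert chr(number) == character
    print(repr(character), "->", number)

# More than one character cannot be encoded with ord().
try:
    ord("AB")
except TypeError as error:
    print("TypeError:", error)
```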
# In ASCII, the numbers `0` through `31` (and `127`) are mapped to all kinds of unprintable control characters. The decimal digits are encoded with the numbers `48` through `57`, the upper case letters with `65` through `90`, and the lower case letters with `97` through `122`. While this seems random at first, there is of course a "sophisticated" system behind it. That can immediately be seen when looking at the encoded numbers in their *binary* representations.
#
# For example, the digit `5` is mapped to the number `53` in ASCII. The binary representation of `53` is `0b_11_0101` and the least significant four bits, `0101`, mean $5$. Similarly, the letter `"E"` is the fifth letter in the alphabet. It is encoded with the number `69` in ASCII, which is `0b_100_0101` in binary. And, the least significant bits, `0_0101`, mean $5$. Analogously, `"e"` is encoded with `101` in ASCII, which is `0b_110_0101` in binary. And, the least significant bits, `0_0101`, mean $5$ again. This encoding was chosen mainly because programmers "in the old days" needed to implement these encodings "by hand." Python abstracts that logic away from its users.
#
# This encoding scheme is also the cause for the "weird" sorting in the "*String Comparison*" section in the [first part
](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/06_text/01_content.ipynb#String-Comparison) of this chapter, where `"apple"` comes *after* `"Banana"`. As `"a"` is encoded with `97` and `"B"` with `66`, `"Banana"` must of course be "smaller" than `"apple"` when comparison is done in a pairwise fashion of the individual characters.
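# The bit patterns and the comparison described above can be verified with a few bitwise operations. This is only a sketch: the masks `0b1111` and `0b1_1111` are chosen here to pick out the least significant four and five bits, respectively.

```python
# The lowest four bits of an encoded digit hold the digit's value.
digit_value = ord("5") & 0b1111

# The lowest five bits of an encoded letter hold its position in the alphabet.
upper_position = ord("E") & 0b1_1111
lower_position = ord("e") & 0b1_1111
print(digit_value, upper_position, lower_position)

# Upper and lower case letters differ in exactly one bit.
case_bit = ord("e") ^ ord("E")
print(case_bit, bin(case_bit))

# "B" is encoded with a smaller number than "a", which decides the comparison.
print(ord("B") < ord("a"), "Banana" < "apple")
```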
# In[19]:
for number in range(48, 58):
    print(number, bin(number), "-> ", chr(number))
# In[20]:
for i, number in enumerate(range(65, 91), start=1):
    end = "\n" if i % 3 == 0 else "\t"
    print(number, bin(number), "-> ", chr(number), end=end)
# In[21]:
for i, number in enumerate(range(97, 123), start=1):
    end = "\n" if i % 3 == 0 else "\t"
    print(str(number).rjust(3), bin(number), "-> ", chr(number), end=end)
# The remaining `symbols` encoded in ASCII are encoded with the numbers still unused, which is why they are scattered.
# In[22]:
symbols = (
    list(range(32, 48))
    + list(range(58, 65))
    + list(range(91, 97))
    + list(range(123, 127))
)
# In[23]:
for i, number in enumerate(symbols, start=1):
    end = "\n" if i % 3 == 0 else "\t"
    print(str(number).rjust(3), bin(number).rjust(10), "-> ", chr(number), end=end)
# As the ASCII character set does not work for many languages other than English, various other encodings were developed. Popular examples are [ISO 8859-1
](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) for western European letters or [Windows 1250
](https://en.wikipedia.org/wiki/Windows-1250) for central and eastern European ones. Many of these encodings use 8-bit numbers (i.e., `0` through `255`) to map the multitude of non-English letters (e.g., the German [umlauts
](https://en.wikipedia.org/wiki/Umlaut_%28linguistics%29) `"ä"`, `"ö"`, and `"ü"`, or the `"ß"`).
# ### Unicode
# However, none of these specialized encodings can map *all* characters of *all* languages around the world from *all* times in human history. To achieve that, a truly global standard called **[Unicode
](https://en.wikipedia.org/wiki/Unicode)** was developed and its first version released in 1991. Since then, Unicode has been amended with many other "characters," the most popular among them being [emojis
](https://en.wikipedia.org/wiki/Emoji). Even the fictional [Klingon
](https://en.wikipedia.org/wiki/Klingon_scripts) script (from the science fiction series [Star Trek
](https://en.wikipedia.org/wiki/Star_Trek)) was proposed for inclusion, but it is not part of the official standard. In Unicode, every character is given an identity referred to as the **code point**. Code points are whole numbers, conventionally written in hexadecimal notation, from `0x0000` through `0x10ffff`, written as U+0000 through U+10FFFF outside of Python. Consequently, there exist at most $1,114,112$ code points, of which only about 10% are currently in use, allowing lots of room for new characters to be invented. The first `128` code points are identical to the ASCII encoding for reasons explained in the "*The `bytes` Type*" section further below. There exist plenty of lists of all Unicode characters on the web (e.g., [Wikipedia
](https://en.wikipedia.org/wiki/List_of_Unicode_characters)).
#
# All we need to know to print a character is its code point. Python uses the escape sequence `"\U"` that is followed by eight hexadecimal digits. Underscore separators are unfortunately *not* allowed here.
#
# So, to print a smiley, we just need to look up the corresponding number (e.g., [here
](https://en.wikipedia.org/wiki/Emoji#Unicode_blocks)).
# In[24]:
"\U0001f604"
# Every Unicode character also has a descriptive name that we can use with the escape sequence `"\N"` and within curly braces `{}`.
# In[25]:
"\N{FACE WITH TEARS OF JOY}"
# Whenever the code point can be expressed with just four hexadecimal digits, we may use the escape sequence `"\u"` for brevity.
# In[26]:
"\U00000041" # hex(65) == 0x41
# In[27]:
"\u0041"
# Analogously, if the code point can be expressed with two hexadecimal digits, we may use the escape sequence `"\x"` for even more concise code.
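# The escape sequences `"\U"`, `"\u"`, `"\x"`, and `"\N"` are merely different spellings in source code. As a sketch, the code below checks that they all create the same `str` object and uses the [unicodedata](https://docs.python.org/3/library/unicodedata.html) module in the standard library, which is not covered in this chapter, to look up a character's name.

```python
import unicodedata

# Six ways to spell the letter "A" in source code.
spellings = [
    "A",
    "\x41",
    "\u0041",
    "\U00000041",
    "\N{LATIN CAPITAL LETTER A}",
    chr(0x41),
]
print(all(spelling == "A" for spelling in spellings))

# Going from a character to its code point and its official name.
smiley = "\U0001f604"
print(hex(ord(smiley)), unicodedata.name(smiley))
```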
# In[28]:
"\x41"
# As the `str` type is based on Unicode, a `str` object's behavior is more in line with how humans view text and not how it is expressed in source code.
#
# For example, while it is obvious that `len("A")` evaluates to `1`, ...
# In[29]:
len("A")
# ... what should `len("\N{SNAKE}")` evaluate to? As the idea of a snake is expressed as *one* "character," [len()
](https://docs.python.org/3/library/functions.html#len) also returns `1` here.
# In[30]:
"\N{SNAKE}"
# In[31]:
len("\N{SNAKE}")
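# Strictly speaking, [len()](https://docs.python.org/3/library/functions.html#len) counts code points, which usually but not always coincides with what humans perceive as one character. The sketch below, which goes slightly beyond this chapter, spells the letter `"é"` in two ways and uses [unicodedata.normalize()](https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize) to convert one into the other.

```python
import unicodedata

composed = "\u00e9"  # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# Both look the same when printed but differ in length and compare unequal.
print(len(composed), len(decomposed))
print(composed == decomposed)

# The NFC normalization form merges the two code points into one.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed, len(normalized))
```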
# Many of the built-in `str` methods also consider Unicode. For example, in contrast to [lower()
](https://docs.python.org/3/library/stdtypes.html#str.lower), the [casefold()
](https://docs.python.org/3/library/stdtypes.html#str.casefold) method knows that the German `"ß"` is commonly converted to `"ss"`. So, when searching for matches regardless of case, normalizing text with [casefold()
](https://docs.python.org/3/library/stdtypes.html#str.casefold) may yield better results than with [lower()
](https://docs.python.org/3/library/stdtypes.html#str.lower).
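# A minimal sketch of such a comparison is shown here. The helper function `equal_ignoring_case()` is hypothetical and only serves as an illustration.

```python
def equal_ignoring_case(first, second):
    """Compare two str objects without regard to case."""
    return first.casefold() == second.casefold()


# casefold() converts the "ß" into "ss" ...
print(equal_ignoring_case("Straße", "STRASSE"))

# ... whereas lower() leaves the "ß" untouched.
print("Straße".lower() == "STRASSE".lower())
```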
# In[32]:
"Straße".lower()
# In[33]:
"Straße".casefold()
# Many other methods like [isdecimal()
](https://docs.python.org/3/library/stdtypes.html#str.isdecimal), [isdigit()
](https://docs.python.org/3/library/stdtypes.html#str.isdigit), [isnumeric()
](https://docs.python.org/3/library/stdtypes.html#str.isnumeric), [isprintable()
](https://docs.python.org/3/library/stdtypes.html#str.isprintable), [isidentifier()
](https://docs.python.org/3/library/stdtypes.html#str.isidentifier), and many more may be worthwhile to know for the data science practitioner, especially when it comes to data cleaning.
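# To give an idea of how the first three of these methods differ, the sketch below applies them to a few arbitrarily chosen characters. The superscript two counts as a digit but not as a decimal, and the fraction one half counts only as numeric.

```python
samples = ["7", "\N{SUPERSCRIPT TWO}", "\N{VULGAR FRACTION ONE HALF}", "x"]

results = {}
for sample in samples:
    # Store the answers as a tuple: (isdecimal, isdigit, isnumeric).
    results[sample] = (sample.isdecimal(), sample.isdigit(), sample.isnumeric())
    print(repr(sample), results[sample])
```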
# ## Multi-line Strings
# Sometimes, it is convenient to split text across multiple lines in source code. For example, to make lines fit into the 79-character limit of [PEP 8
](https://www.python.org/dev/peps/pep-0008/) or because the text consists of many lines and typing out `"\n"` is tedious. However, using a single pair of double quotes `"` around multiple lines results in a `SyntaxError`.
# In[34]:
"
Do not break the lines like this
"
# Instead, we may enclose a string literal with either **triple double** quotes `"""` or **triple single** quotes `'''`. Then, newline characters in the source code are converted into `"\n"` characters in the resulting `str` object. Docstrings are precisely that, and, by convention, always written within triple double quotes `"""`.
# In[35]:
multi_line = """
I am a multi-line string
consisting of four lines.
"""
# A caveat is that `"\n"` characters are often inserted at the beginning or end of the text when we try to format the source code nicely.
# In[36]:
multi_line
# In[37]:
print(multi_line)
# Using the [split()
](https://docs.python.org/3/library/stdtypes.html#str.split) method with the optional `sep` argument, we confirm that `multi_line` consists of *four* lines with the first and last line being empty.
# In[38]:
for i, line in enumerate(multi_line.split("\n"), start=1):
    print(i, line)
# To mitigate that, we often see the [strip()
](https://docs.python.org/3/library/stdtypes.html#str.strip) method in source code.
# In[39]:
multi_line = """
I am a multi-line string
consisting of two lines.
""".strip()
# In[40]:
for i, line in enumerate(multi_line.split("\n"), start=1):
    print(i, line)
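# When a multi-line string is defined inside an indented block, the indentation ends up in the `str` object as well. One possible remedy, which is not used elsewhere in this chapter, is the [textwrap.dedent()](https://docs.python.org/3/library/textwrap.html#textwrap.dedent) function in the standard library. The sketch below assumes a hypothetical function `make_text()`.

```python
import textwrap


def make_text():
    # The backslash right after the opening quotes suppresses the first "\n".
    text = """\
        I am a multi-line string
        consisting of two lines.
        """
    # dedent() removes the common leading whitespace from all lines.
    return textwrap.dedent(text).rstrip("\n")


for i, line in enumerate(make_text().split("\n"), start=1):
    print(i, line)
```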
# ## The `bytes` Type
# To end this chapter, we want to briefly look at the `bytes` data type, which conceptually is a sequence of bytes. That data format is probably one of the most generic ways of exchanging data between any two programs or computers (e.g., a web browser obtains its data from a web server in this format).
#
# Let's open a binary file in read-only mode (i.e., `mode="rb"`) and read in all of its contents.
# In[41]:
with open("full_house.bin", mode="rb") as binary_file:
    data = binary_file.read()
# `data` is an object of type `bytes`.
# In[42]:
id(data)
# In[43]:
type(data)
# Its value is displayed in the literal bytes notation with a `b` prefix (cf., the [reference
](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals)). Every byte is expressed in hexadecimal representation with the escape sequence `"\x"`. This representation is commonly chosen as we can *not* tell what kind of information is hidden in the `data` by just looking at the bytes. Instead, we must be told by some other source how to **decode** the raw bytes into information we can interpret.
# In[44]:
data
# `bytes` objects work like `str` objects in many ways. In particular, they are *sequences* as well: The number of bytes is *finite* and we may *iterate* over them in *order*.
# In[45]:
len(data)
# Consisting of 8 bits, a single byte can always be interpreted as a whole number between `0` and `255`. That is exactly what we see when we loop over the `data` ...
# In[46]:
for byte in data:
    print(byte, end=" ")
# ... or index into them.
# In[47]:
data[-1]
# Slicing returns another `bytes` object.
# In[48]:
data[::2]
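# The same behavior can be reproduced without a file. The sketch below uses a small, made-up `bytes` literal, not the contents of "full_house.bin", to show iteration, indexing, slicing, and the construction of a `bytes` object from whole numbers.

```python
sample = b"abc"  # hypothetical example data

# Looping over a bytes object yields int objects ...
numbers = list(sample)
print(numbers)

# ... and so does indexing, while slicing returns a bytes object.
print(sample[0], sample[:1])

# A bytes object can be created from an iterable of numbers in range(256).
print(bytes([97, 98, 99]) == sample)

# The hex() method shows two hexadecimal digits per byte.
print(sample.hex())
```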
# ### Character Encodings
# Luckily, `data` consists of bytes encoded with the [UTF-8
](https://en.wikipedia.org/wiki/UTF-8) encoding. That is the most common way of mapping a Unicode character's code point to a sequence of bytes.
#
# To obtain a `str` object out of a given `bytes` object, we decode it with the `bytes` type's [decode()
](https://docs.python.org/3/library/stdtypes.html#bytes.decode) method.
# In[49]:
cards = data.decode()
# In[50]:
type(cards)
# So, `data` consisted of a [full house
](https://en.wikipedia.org/wiki/List_of_poker_hands#Full_house) hand in a poker game.
# In[51]:
cards
# To go the opposite direction and encode a given `str` object, we use the `str` type's [encode()
](https://docs.python.org/3/library/stdtypes.html#str.encode) method.
# In[52]:
place = "Café Kastanientörtchen"
# In[53]:
place.encode()
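# UTF-8 is a variable-width encoding: depending on the code point, a character is mapped to between one and four bytes. The sketch below makes that visible for a few arbitrarily chosen characters and for a `str` object with two non-ASCII letters.

```python
widths = []
for character in "A", "é", "€", "\N{SNAKE}":
    encoded = character.encode("utf-8")
    widths.append(len(encoded))
    print(repr(character), hex(ord(character)), len(encoded), encoded)

# Each of "é" and "ö" needs two bytes, all other characters one.
place = "Café Kastanientörtchen"
print(len(place), len(place.encode("utf-8")))
```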
# By default, [encode()
](https://docs.python.org/3/library/stdtypes.html#str.encode) and [decode()
](https://docs.python.org/3/library/stdtypes.html#bytes.decode) use an `encoding="utf-8"` argument. We may use another encoding like, for example, `"iso-8859-1"`, which can deal with ASCII and western European letters.
# In[54]:
place.encode("iso-8859-1")
# However, we must use the *same* encoding for the decoding step as for the encoding step. Otherwise, a `UnicodeDecodeError` may be raised or the bytes may silently be decoded into the wrong characters.
# In[55]:
place.encode("iso-8859-1").decode()
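# A mismatch between the two encodings does not always end in an exception. As a sketch with the made-up example word `"Café"`, the code below shows both outcomes: an error in one direction and silently garbled text in the other.

```python
word = "Café"

# ISO 8859-1 bytes are not valid UTF-8 here, so decoding fails loudly.
latin1_bytes = word.encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as error:
    print("UnicodeDecodeError:", error)

# UTF-8 bytes decoded as ISO 8859-1 raise no error: every byte maps to
# some character, but the two bytes of "é" become two wrong characters.
utf8_bytes = word.encode("utf-8")
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)
```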
# Not all encodings map all Unicode code points. For example, `"iso-8859-1"` does not know all Czech letters. Below, [encode()
](https://docs.python.org/3/library/stdtypes.html#str.encode) raises a `UnicodeEncodeError` because of that.
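# The sketch here catches that error and shows two ways around it: the optional `errors` argument of [encode()](https://docs.python.org/3/library/stdtypes.html#str.encode), which replaces unknown characters with a question mark, and an encoding that covers Czech, for which `"iso-8859-2"` is used as an example.

```python
greeting = "Dobrý den, přátelé!"

# The "ř" has no counterpart in ISO 8859-1.
try:
    greeting.encode("iso-8859-1")
except UnicodeEncodeError as error:
    print("UnicodeEncodeError:", error)

# errors="replace" substitutes a "?" and thereby loses information.
replaced = greeting.encode("iso-8859-1", errors="replace")
print(replaced)

# ISO 8859-2 contains all the letters in the greeting.
central_european = greeting.encode("iso-8859-2")
print(central_european.decode("iso-8859-2") == greeting)
```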
# In[56]:
"Dobrý den, přátelé!".encode("iso-8859-1")
# ### Reading Files (continued)
# The [open()
](https://docs.python.org/3/library/functions.html#open) function takes an optional `encoding` argument as well.
# In[57]:
with open("umlauts.txt") as file:
    print("".join(file.readlines()))
# In[58]:
with open("umlauts.txt", encoding="iso-8859-1") as file:
    print("".join(file.readlines()))
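# The effect of the `encoding` argument can be reproduced with a temporary file. The sketch below does not use "umlauts.txt" itself. Instead, it assumes a hypothetical file "example.txt" that is written in ISO 8859-1 and then read back with two different encodings.

```python
import os
import tempfile

text = "ä ö ü ß"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")  # hypothetical file name

    # Write the text as ISO 8859-1 encoded bytes.
    with open(path, mode="w", encoding="iso-8859-1") as file:
        file.write(text)

    # Reading with the same encoding recovers the text ...
    with open(path, encoding="iso-8859-1") as file:
        same = file.read()

    # ... while reading with UTF-8 fails for these bytes.
    try:
        with open(path, encoding="utf-8") as file:
            file.read()
    except UnicodeDecodeError:
        outcome = "UnicodeDecodeError"
    else:
        outcome = "no error"

print(same == text, outcome)
```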
# ### Best Practice: Use UTF-8 explicitly
# A best practice is to *always* specify the `encoding`, especially on computers running Windows (cf., the talk by Łukasz Langa in the [Further Resources
](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/06_text/05_resources.ipynb#Unicode) section at the end of this chapter).
#
# Below is the first example involving [open()
](https://docs.python.org/3/library/functions.html#open) one last time: It shows how *all* the contents of a text file should be read into one `str` object.
# In[59]:
with open("lorem_ipsum.txt", encoding="utf-8") as file:
    content = "".join(file.readlines())
# In[60]:
content
# In[61]:
print(content)
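# As a side note, file objects also provide a [read()](https://docs.python.org/3/library/io.html#io.TextIOBase.read) method that returns all the contents in one `str` object. The sketch below uses a temporary file with made-up contents, not "lorem_ipsum.txt", to check that both ways of reading yield the same result.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")  # hypothetical file name

    with open(path, mode="w", encoding="utf-8") as file:
        file.write("Lorem ipsum\ndolor sit amet\n")

    # Joining the individual lines ...
    with open(path, encoding="utf-8") as file:
        joined = "".join(file.readlines())

    # ... gives the same str object as reading everything at once.
    with open(path, encoding="utf-8") as file:
        whole = file.read()

print(joined == whole)
```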