#!/usr/bin/env python
# coding: utf-8

# **Note**: Click on "*Kernel*" > "*Restart Kernel and Clear All Outputs*" in [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) *before* reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it [in the cloud](https://mybinder.org/v2/gh/webartifex/intro-to-python/develop?urlpath=lab/tree/06_text/01_content.ipynb).

# # Chapter 6: Text & Bytes (continued)

# In this second part of the chapter, we look in more detail at how `str` objects work in memory, in particular how the $0$s and $1$s in the memory translate into characters.

# ## Special Characters

# As previously seen, some characters have a special meaning when following the **escape character** `"\"`. Besides escaping the kind of quote used as the `str` object's delimiter, `'` or `"`, most of these **escape sequences** (i.e., `"\"` with the subsequent character) act as a **control character** that moves the "cursor" in the output *without* generating any pixel on the screen. Because of that, we only see the effect of such escape sequences when used with the [print()](https://docs.python.org/3/library/functions.html#print) function. The [documentation](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals) lists all available escape sequences, of which we show the most important ones below.
#
# The most common escape sequence is `"\n"`, which "prints" a [newline character](https://en.wikipedia.org/wiki/Newline) that is also called the line feed character or LF for short.

# In[1]:

"This is a sentence\nthat is printed\non three lines."

# In[2]:

print("This is a sentence\nthat is printed\non three lines.")

# `"\b"` is the [backspace character](https://en.wikipedia.org/wiki/Backspace), or BS for short, that moves the cursor back by one character.
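# The difference between evaluating a `str` literal and printing it can also be checked in code. The following is a minimal sketch that is not part of the original notebook; the variable names are chosen for illustration only. It suggests that an escape sequence is stored as *one* character and only takes effect once the text is sent to the output.

```python
# Illustrative names; not taken from the notebook.
newline_text = "first\nsecond"
backspace_text = "ABC\bX"

# Each escape sequence counts as exactly one character.
assert len(newline_text) == len("first") + 1 + len("second")
assert len(backspace_text) == 5

# repr() shows the escape sequences instead of applying them.
print(repr(newline_text))
print(repr(backspace_text))

# print() sends the control characters to the output, where they take effect.
print(newline_text)
```

# How a control character like `"\b"` is rendered depends on the terminal or notebook front end, which is why the sketch only asserts on the stored characters and not on the visible output.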
# In[3]:

print("ABC\bX")

# In[4]:

print("ABC\bXY")

# Similarly, `"\r"` is the [carriage return character](https://en.wikipedia.org/wiki/Carriage_return), or CR for short, that moves the cursor back to the beginning of the line.

# In[5]:

print("ABC\rX")

# In[6]:

print("ABC\rXY")

# While Linux and modern macOS systems use solely `"\n"` to express a new line, Windows systems default to using `"\r\n"`. This may lead to "weird" bugs in software projects where people using both kinds of operating systems collaborate.

# In[7]:

print("This is a sentence\r\nthat is printed\r\non three lines.")

# `"\t"` makes the cursor "jump" to the next of several equidistant tab stops. That may be useful for formatting a program's lengthy and tabular results.

# In[8]:

print("Jump\tfrom\ttab\tstop\tto\ttab\tstop.\nThe\tsecond\tline\tdoes\tso\ttoo.")

# ### Raw Strings

# Sometimes we do *not* want the backslash `"\"` and its subsequent character to be interpreted as an escape sequence. For example, let's print a typical installation path on a Windows system. Obviously, the newline character `"\n"` does *not* make sense here.

# In[9]:

print("C:\Programs\new_application")

# Some `str` objects even produce a `SyntaxError` because the `"\U"` can *not* be interpreted as a Unicode code point (cf., next section).

# In[10]:

print("C:\Users\Administrator\Desktop\Project")

# A simple solution would be to escape the escape character with a *second* backslash `"\"`.

# In[11]:

print("C:\\Programs\\new_application")

# In[12]:

print("C:\\Users\\Administrator\\Desktop\\Project")

# However, this is tedious to remember and type. For such use cases, Python allows us to prefix any string literal with an `r`. The literal is then interpreted in a "raw" way.

# In[13]:

print(r"C:\Programs\new_application")

# In[14]:

print(r"C:\Users\Administrator\Desktop\Project")

# ## Characters are Numbers with a Convention

# So far, we used the term **character** without any further consideration.
# In this section, we briefly look into what characters are and how they are modeled in software.
#
# [Chapter 5](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/05_numbers/00_content.ipynb) gives us an idea of how individual **bits** are used to express all types of numbers, from "simple" `int` objects to "complex" `float` ones. To model characters, another **layer of abstraction** is put on top of whole numbers. So, just as bits are used to express integers, integers in turn are used to express characters.

# ### ASCII

# Many conventions have been developed as to what integer is associated with which character. The most basic one, which was also adopted around the world, is the so-called [American Standard Code for Information Interchange](https://en.wikipedia.org/wiki/ASCII), or **ASCII** for short. It uses 7 bits of information to map the unprintable control characters as well as the printable letters of the alphabet, numbers, and common symbols to the numbers `0` through `127`.
#
# A mapping from characters to numbers is referred to by the technical term **encoding**. We may use the built-in [ord()](https://docs.python.org/3/library/functions.html#ord) function to **encode** any single character. The inverse to that is the built-in [chr()](https://docs.python.org/3/library/functions.html#chr) function, which **decodes** a number into a character.

# In[15]:

ord("A")

# In[16]:

chr(65)

# Of course, unprintable escape sequences like `"\n"` count as only *one* character.

# In[17]:

ord("\n")

# In[18]:

chr(10)

# In ASCII, the numbers `0` through `31` (and `127`) are mapped to all kinds of unprintable control characters. The decimal digits are encoded with the numbers `48` through `57`, the upper case letters with `65` through `90`, and the lower case letters with `97` through `122`. While this seems random at first, there is of course a "sophisticated" system behind it.
# That can immediately be seen when looking at the encoded numbers in their *binary* representations.
#
# For example, the digit `5` is mapped to the number `53` in ASCII. The binary representation of `53` is `0b_11_0101` and the least significant four bits, `0101`, mean $5$. Similarly, the letter `"E"` is the fifth letter in the alphabet. It is encoded with the number `69` in ASCII, which is `0b_100_0101` in binary. And, the least significant bits, `0_0101`, mean $5$. Analogously, `"e"` is encoded with `101` in ASCII, which is `0b_110_0101` in binary. And, the least significant bits, `0_0101`, mean $5$ again. This encoding was chosen mainly because programmers "in the old days" needed to implement these encodings "by hand." Python abstracts that logic away from its users.
#
# This encoding scheme is also the cause for the "weird" sorting in the "*String Comparison*" section in the [first part](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/06_text/01_content.ipynb#String-Comparison) of this chapter, where `"apple"` comes *after* `"Banana"`. As `"a"` is encoded with `97` and `"B"` with `66`, `"Banana"` must of course be "smaller" than `"apple"` when comparison is done in a pairwise fashion of the individual characters.

# In[19]:

for number in range(48, 58):
    print(number, bin(number), "-> ", chr(number))

# In[20]:

for i, number in enumerate(range(65, 91), start=1):
    end = "\n" if i % 3 == 0 else "\t"
    print(number, bin(number), "-> ", chr(number), end=end)

# In[21]:

for i, number in enumerate(range(97, 123), start=1):
    end = "\n" if i % 3 == 0 else "\t"
    print(str(number).rjust(3), bin(number), "-> ", chr(number), end=end)

# The remaining `symbols` encoded in ASCII are encoded with the numbers still unused, which is why they are scattered.
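# The scattering of the remaining symbols can also be derived instead of typed in by hand. The following minimal sketch is not part of the original notebook; it assumes that the standard library's `string.punctuation` constant, which holds the printable ASCII symbols *without* the space character (i.e., `32`), is an acceptable stand-in for the symbols discussed in this section. The names `codes` and `runs` are illustrative.

```python
import string

# Code points of the printable ASCII symbols (the space character is not included).
codes = sorted(ord(char) for char in string.punctuation)

# Group consecutive code points into (first, last) runs.
runs = []
for code in codes:
    if runs and code == runs[-1][1] + 1:
        runs[-1] = (runs[-1][0], code)  # extend the current run
    else:
        runs.append((code, code))  # start a new run

print(runs)
```

# The four runs found this way fill the gaps between the control characters, the digits, the upper case letters, and the lower case letters.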
# In[22]:

symbols = (
    list(range(32, 48))
    + list(range(58, 65))
    + list(range(91, 97))
    + list(range(123, 127))
)

# In[23]:

for i, number in enumerate(symbols, start=1):
    end = "\n" if i % 3 == 0 else "\t"
    print(str(number).rjust(3), bin(number).rjust(10), "-> ", chr(number), end=end)

# As the ASCII character set does not work for many languages other than English, various other encodings were developed. Popular examples are [ISO 8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) for western European languages or [Windows 1250](https://en.wikipedia.org/wiki/Windows-1250) for central European languages written in the Latin script. Many of these encodings use 8-bit numbers (i.e., `0` through `255`) to map the multitude of non-English letters (e.g., the German [umlauts](https://en.wikipedia.org/wiki/Umlaut_%28linguistics%29) `"ä"`, `"ö"`, and `"ü"`, or the `"ß"`).

# ### Unicode

# However, none of these specialized encodings can map *all* characters of *all* languages around the world from *all* times in human history. To achieve that, a truly global standard called **[Unicode](https://en.wikipedia.org/wiki/Unicode)** was developed and its first version released in 1991. Since then, Unicode has been amended with many other "characters," the most popular among them being [emojis](https://en.wikipedia.org/wiki/Emoji). Fictional scripts like [Klingon](https://en.wikipedia.org/wiki/Klingon_scripts) (from the science fiction series [Star Trek](https://en.wikipedia.org/wiki/Star_Trek)) have been proposed as well but are not part of the official standard. In Unicode, every character is given an identity referred to as the **code point**. Code points are numbers, conventionally written in hexadecimal, from `0x0000` through `0x10ffff`, written as U+0000 through U+10FFFF outside of Python. Consequently, there exist at most $1,114,112$ code points, of which only about 10% are currently in use, allowing lots of room for new characters to be invented. The first `128` code points (i.e., `0` through `127`) are identical to the ASCII encoding for reasons explained in the "*The `bytes` Type*" section further below.
# There exist plenty of lists of all Unicode characters on the web (e.g., [Wikipedia](https://en.wikipedia.org/wiki/List_of_Unicode_characters)).
#
# All we need to know to print a character is its code point. Python uses the escape sequence `"\U"` that is followed by eight hexadecimal digits. Underscore separators are unfortunately *not* allowed here.
#
# So, to print a smiley, we just need to look up the corresponding number (e.g., [here](https://en.wikipedia.org/wiki/Emoji#Unicode_blocks)).

# In[24]:

"\U0001f604"

# Every Unicode character also has a descriptive name that we can use with the escape sequence `"\N"` and within curly braces `{}`.

# In[25]:

"\N{FACE WITH TEARS OF JOY}"

# Whenever the code point can be expressed with just four hexadecimal digits, we may use the escape sequence `"\u"` for brevity.

# In[26]:

"\U00000041"  # hex(65) == "0x41"

# In[27]:

"\u0041"

# Analogously, if the code point can be expressed with two hexadecimal digits, we may use the escape sequence `"\x"` for even more concise code.

# In[28]:

"\x41"

# As the `str` type is based on Unicode, a `str` object's behavior is more in line with how humans view text and not how it is expressed in source code.
#
# For example, while it is obvious that `len("A")` evaluates to `1`, ...

# In[29]:

len("A")

# ... what should `len("\N{SNAKE}")` evaluate to? As the idea of a snake is expressed as *one* "character," [len()](https://docs.python.org/3/library/functions.html#len) also returns `1` here.

# In[30]:

"\N{SNAKE}"

# In[31]:

len("\N{SNAKE}")

# Many of the built-in `str` methods also consider Unicode. For example, in contrast to [lower()](https://docs.python.org/3/library/stdtypes.html#str.lower), the [casefold()](https://docs.python.org/3/library/stdtypes.html#str.casefold) method knows that the German `"ß"` is commonly converted to `"ss"`.
# So, when searching for exact matches, normalizing text with [casefold()](https://docs.python.org/3/library/stdtypes.html#str.casefold) may yield better results than with [lower()](https://docs.python.org/3/library/stdtypes.html#str.lower).

# In[32]:

"Straße".lower()

# In[33]:

"Straße".casefold()

# Other methods like [isdecimal()](https://docs.python.org/3/library/stdtypes.html#str.isdecimal), [isdigit()](https://docs.python.org/3/library/stdtypes.html#str.isdigit), [isnumeric()](https://docs.python.org/3/library/stdtypes.html#str.isnumeric), [isprintable()](https://docs.python.org/3/library/stdtypes.html#str.isprintable), [isidentifier()](https://docs.python.org/3/library/stdtypes.html#str.isidentifier), and many more may be worth knowing for the data science practitioner, especially when it comes to data cleaning.

# ## Multi-line Strings

# Sometimes, it is convenient to split text across multiple lines in source code, for example, to make lines fit into the 79-character limit of [PEP 8](https://www.python.org/dev/peps/pep-0008/) or because the text consists of many lines and typing out `"\n"` is tedious. However, using just one double quote `"` on either side of text that spans multiple lines results in a `SyntaxError`.

# In[34]:

"
Do not break the lines
like this
"

# Instead, we may enclose a string literal with either **triple double** quotes `"""` or **triple single** quotes `'''`. Then, newline characters in the source code are converted into `"\n"` characters in the resulting `str` object. Docstrings are precisely that, and, by convention, always written within triple double quotes `"""`.

# In[35]:

multi_line = """
I am a multi-line string
consisting of four lines.
"""

# A caveat is that `"\n"` characters are often inserted at the beginning or end of the text when we try to format the source code nicely.
# In[36]:

multi_line

# In[37]:

print(multi_line)

# Using the [split()](https://docs.python.org/3/library/stdtypes.html#str.split) method with the optional `sep` argument, we confirm that `multi_line` consists of *four* lines with the first and last line being empty.

# In[38]:

for i, line in enumerate(multi_line.split("\n"), start=1):
    print(i, line)

# To mitigate that, we often see the [strip()](https://docs.python.org/3/library/stdtypes.html#str.strip) method in source code.

# In[39]:

multi_line = """
I am a multi-line string
consisting of two lines.
""".strip()

# In[40]:

for i, line in enumerate(multi_line.split("\n"), start=1):
    print(i, line)

# ## The `bytes` Type

# To end this chapter, we want to briefly look at the `bytes` data type, which conceptually is a sequence of bytes. That data format is probably one of the most generic ways of exchanging data between any two programs or computers (e.g., a web browser obtains its data from a web server in this format).
#
# Let's open a binary file in read-only mode (i.e., `mode="rb"`) and read in all of its contents.

# In[41]:

with open("full_house.bin", mode="rb") as binary_file:
    data = binary_file.read()

# `data` is an object of type `bytes`.

# In[42]:

id(data)

# In[43]:

type(data)

# Its value is shown in the bytes literal notation with a `b` prefix (cf., the [reference](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals)). Every byte is expressed in hexadecimal representation with the escape sequence `"\x"`. This representation is commonly chosen as we can *not* tell what kind of information is hidden in the `data` by just looking at the bytes. Instead, we must be told by some other source how to **decode** the raw bytes into information we can interpret.

# In[44]:

data

# `bytes` objects work like `str` objects in many ways. In particular, they are *sequences* as well: The number of bytes is *finite* and we may *iterate* over them in *order*.
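# The sequence behavior of `bytes` objects can be tried out without the `full_house.bin` file as well. The following minimal sketch is not part of the original notebook; it uses a hand-written bytes literal, and the name `raw` is chosen for illustration only.

```python
# Printable ASCII bytes appear as letters in a bytes literal;
# all other bytes are written with the "\x" escape sequence.
raw = b"Caf\xc3\xa9"

print(len(raw))    # the number of bytes is finite
print(list(raw))   # iterating yields one int object per byte, in order
print(raw[0])      # indexing yields an int object as well
print(raw[:3])     # slicing yields another bytes object
```

# Note that the five bytes in `raw` do not have to correspond to five characters; how many characters they represent depends on the encoding used to decode them.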
# In[45]:

len(data)

# Consisting of 8 bits, a single byte can always be interpreted as a whole number from `0` through `255`. That is exactly what we see when we loop over the `data` ...

# In[46]:

for byte in data:
    print(byte, end=" ")

# ... or index into them.

# In[47]:

data[-1]

# Slicing returns another `bytes` object.

# In[48]:

data[::2]

# ### Character Encodings

# Luckily, `data` consists of bytes encoded with the [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding. That is the most common way of mapping a Unicode character's code point to a sequence of bytes.
#
# To obtain a `str` object out of a given `bytes` object, we decode it with the `bytes` type's [decode()](https://docs.python.org/3/library/stdtypes.html#bytes.decode) method.

# In[49]:

cards = data.decode()

# In[50]:

type(cards)

# So, `data` consisted of a [full house](https://en.wikipedia.org/wiki/List_of_poker_hands#Full_house) hand in a poker game.

# In[51]:

cards

# To go in the opposite direction and encode a given `str` object, we use the `str` type's [encode()](https://docs.python.org/3/library/stdtypes.html#str.encode) method.

# In[52]:

place = "Café Kastanientörtchen"

# In[53]:

place.encode()

# By default, [encode()](https://docs.python.org/3/library/stdtypes.html#str.encode) and [decode()](https://docs.python.org/3/library/stdtypes.html#bytes.decode) use an `encoding="utf-8"` argument. We may use another encoding like, for example, `"iso-8859-1"`, which can deal with ASCII and western European letters.

# In[54]:

place.encode("iso-8859-1")

# However, we must use the *same* encoding for the decoding step as for the encoding step. Otherwise, a `UnicodeDecodeError` may be raised, as in the next cell, or the bytes may silently be decoded into the wrong characters.

# In[55]:

place.encode("iso-8859-1").decode()

# Not all encodings map all Unicode code points. For example, `"iso-8859-1"` does not cover all Czech letters (e.g., `"ř"`). Below, [encode()](https://docs.python.org/3/library/stdtypes.html#str.encode) raises a `UnicodeEncodeError` because of that.
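# A failing [encode()](https://docs.python.org/3/library/stdtypes.html#str.encode) call like the one described above can also be handled in code. The following minimal sketch is not part of the original notebook; the names `greeting`, `offending`, and `lossy` are illustrative. It uses the `start` and `object` attributes of the built-in `UnicodeEncodeError` as well as the optional `errors` argument of [encode()](https://docs.python.org/3/library/stdtypes.html#str.encode).

```python
greeting = "Dobrý den, přátelé!"

try:
    greeting.encode("iso-8859-1")
except UnicodeEncodeError as error:
    # error.start is the index of the first character that cannot be encoded.
    offending = error.object[error.start]
    print("cannot encode:", offending)

# errors="replace" substitutes a "?" for every character that cannot be encoded.
lossy = greeting.encode("iso-8859-1", errors="replace")
print(lossy)
```

# Replacing characters loses information, so this is only an option when an approximate result is acceptable; otherwise, choosing an encoding that covers all characters (e.g., `"utf-8"`) is the safer route.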
# In[56]:

"Dobrý den, přátelé!".encode("iso-8859-1")

# ### Reading Files (continued)

# The [open()](https://docs.python.org/3/library/functions.html#open) function takes an optional `encoding` argument as well.

# In[57]:

with open("umlauts.txt") as file:
    print("".join(file.readlines()))

# In[58]:

with open("umlauts.txt", encoding="iso-8859-1") as file:
    print("".join(file.readlines()))

# ### Best Practice: Use UTF-8 explicitly

# A best practice is to *always* specify the `encoding`, especially on computers running Windows (cf., the talk by Łukasz Langa in the [Further Resources](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/06_text/05_resources.ipynb#Unicode) section at the end of this chapter).
#
# Below, we revisit the first example involving [open()](https://docs.python.org/3/library/functions.html#open) one last time: It shows how *all* the contents of a text file should be read into one `str` object.

# In[59]:

with open("lorem_ipsum.txt", encoding="utf-8") as file:
    content = "".join(file.readlines())

# In[60]:

content

# In[61]:

print(content)
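# The effect of the `encoding` argument can be reproduced without the notebook's data files. The following minimal sketch is not part of the original notebook; it writes a temporary file with a hypothetical name, `sample.txt`, and reads it back twice. The variable names are illustrative as well.

```python
import tempfile
from pathlib import Path

text = "Käse, Öl, Übung, Straße\n"

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "sample.txt"  # hypothetical file name

    # Write and read with the same explicit encoding.
    with open(path, mode="w", encoding="utf-8") as file:
        file.write(text)
    with open(path, encoding="utf-8") as file:
        content = file.read()  # same result as "".join(file.readlines())

    # Reading the same bytes with a different encoding garbles the text
    # without raising an error.
    with open(path, encoding="iso-8859-1") as file:
        garbled = file.read()

print(content)
print(garbled)
```

# As no exception is raised in the second case, such a mismatch may go unnoticed, which is one more reason to state the `encoding` explicitly when both writing and reading a file.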