#!/usr/bin/env python
# coding: utf-8
# **Note**: Click on "*Kernel*" > "*Restart Kernel and Clear All Outputs*" in [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) *before* reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it [in the cloud](https://mybinder.org/v2/gh/webartifex/intro-to-python/develop?urlpath=lab/tree/05_numbers/01_content.ipynb).
# # Chapter 5: Numbers & Bits (continued)
# In this second part of the chapter, we look at the `float` type in detail. It is probably the most commonly used one in all of data science, even across programming languages.
# ## The `float` Type
# As we have seen before, some assumptions need to be made as to how the $0$s and $1$s in a computer's memory are to be translated into numbers. This process becomes a lot more involved when we go beyond integers and model [real numbers](https://en.wikipedia.org/wiki/Real_number) (i.e., the set $\mathbb{R}$) with possibly infinitely many digits to the right of the period, like $1.23$.
#
# The **[Institute of Electrical and Electronics Engineers](https://en.wikipedia.org/wiki/Institute_of_Electrical_and_Electronics_Engineers)** (IEEE, pronounced "eye-triple-E") is one of the important professional associations when it comes to standardizing all kinds of aspects regarding the implementation of soft- and hardware.
#
# The **[IEEE 754](https://en.wikipedia.org/wiki/IEEE_754)** standard defines the so-called **floating-point arithmetic** that is commonly used today by all major programming languages. The standard not only defines how the $0$s and $1$s are organized in memory but also, for example, how values are to be rounded, what happens in exceptional cases like divisions by zero, or what is a zero value in the first place.
#
# In Python, the simplest way to create a `float` object is to use a literal notation with a dot `.` in it.
# In[1]:
b = 42.0
# In[2]:
id(b)
# In[3]:
type(b)
# In[4]:
b
# As with `int` literals, we may use underscores `_` to make longer `float` literals easier to read.
# In[5]:
0.123_456_789
# In cases where the dot `.` is mathematically unnecessary, we either still need to end the number with it or use the [float()](https://docs.python.org/3/library/functions.html#float) built-in to cast the number explicitly. [float()](https://docs.python.org/3/library/functions.html#float) can process any numeric object or a properly formatted `str` object.
# In[6]:
42.
# In[7]:
float(42)
# In[8]:
float("42")
# Leading and trailing whitespace is ignored ...
# In[9]:
float(" 42.87 ")
# ... but not whitespace in between.
# In[10]:
float("42. 87")
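# As a defensive sketch (not part of the original material), we could catch the `ValueError` that [float()](https://docs.python.org/3/library/functions.html#float) raises on malformed input; the helper name `parse_float` is made up for illustration.

```python
def parse_float(text):
    """Return float(text), or None if text is not a valid number."""
    try:
        return float(text)
    except ValueError:
        return None

parse_float(" 42.87 ")  # -> 42.87: surrounding whitespace is ignored
parse_float("42. 87")   # -> None: inner whitespace is not
```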
# `float` objects are implicitly created as the result of dividing an `int` object by another with the division operator `/`.
# In[11]:
1 / 3
# In general, if we combine `float` and `int` objects in arithmetic operations, we always end up with a `float` type: Python uses the "broader" representation.
# In[12]:
40.0 + 2
# In[13]:
21 * 2.0
# ### Scientific Notation
# `float` objects may also be created with the **scientific literal notation**: We use the symbol `e` to indicate powers of $10$, so $1.23 * 10^0$ translates into `1.23e0`.
# In[14]:
1.23e0
# Syntactically, `e` needs a `float` or `int` literal on its left and an `int` literal on its right, neither separated from it by a space. Otherwise, we get a `SyntaxError`.
# In[15]:
1.23 e0
# In[16]:
1.23e 0
# In[17]:
1.23e0.0
# If we leave out the number to the left, Python raises a `NameError` as it unsuccessfully tries to look up a variable named `e0`.
# In[18]:
e0
# So, to write $10^0$ in Python, we need to think of it as $1*10^0$ and write `1e0`.
# In[19]:
1e0
# To express thousands of something (i.e., $10^3$), we write `1e3`.
# In[20]:
1e3 # = thousands
# Similarly, to express, for example, milliseconds (i.e., $10^{-3} s$), we write `1e-3`.
# In[21]:
1e-3 # = milli
# ## Special Values
# There are also three special values representing "**not a number,**" called `nan`, and positive or negative **infinity**, called `inf` or `-inf`, that are created by passing in the corresponding abbreviation as a `str` object to the [float()](https://docs.python.org/3/library/functions.html#float) built-in. These values could be used, for example, as the result of a mathematically undefined operation like division by zero or to model the value of a mathematical function as it goes to infinity.
# In[22]:
float("nan") # also float("NaN")
# In[23]:
float("+inf") # also float("+infinity") or float("infinity")
# In[24]:
float("inf") # also float("+inf")
# In[25]:
float("-inf")
# `nan` objects *never* compare equal to *anything*, not even to themselves. This happens in accordance with the [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) standard.
# In[26]:
float("nan") == float("nan")
# Another caveat is that any arithmetic involving a `nan` object results in `nan`. In other words, the addition below **fails silently** as no error is raised. As this also happens in accordance with the [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) standard, we *need* to be aware of that and check any data we work with for any `nan` occurrences *before* doing any calculations.
# In[27]:
42 + float("nan")
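# One way to guard against this, sketched here with the [math.isnan()](https://docs.python.org/3/library/math.html#math.isnan) function from the standard library, is to test for `nan` explicitly before (and after) calculating; equality checks cannot detect it.

```python
import math

reading = float("nan")     # e.g., a missing value in a dataset

math.isnan(reading)        # True: caught before any calculation
math.isnan(42 + reading)   # True: the nan has silently propagated
math.isnan(42.0)           # False: an ordinary float is fine
```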
# In contrast, as values go to infinity, finite differences lose their meaning: infinities of the same sign always compare equal.
# In[28]:
float("inf") == float("inf")
# Adding `42` to `inf` makes no difference.
# In[29]:
float("inf") + 42
# In[30]:
float("inf") + 42 == float("inf")
# We observe the same for multiplication ...
# In[31]:
42 * float("inf")
# In[32]:
42 * float("inf") == float("inf")
# ... and even exponentiation!
# In[33]:
float("inf") ** 42
# In[34]:
float("inf") ** 42 == float("inf")
# Although absolute differences become meaningless as we approach infinity, signs are still respected.
# In[35]:
-42 * float("-inf")
# In[36]:
-42 * float("-inf") == float("inf")
# As a caveat, adding infinities of different signs is an *undefined operation* in math and results in a `nan` object. So, if we (accidentally or unknowingly) do this on a real dataset, we do *not* see any error messages, and our program may continue to run with meaningless results! This is another example of a piece of code **failing silently**.
# In[37]:
float("inf") + float("-inf")
# In[38]:
float("inf") - float("inf")
# ## Imprecision
# `float` objects are *inherently* imprecise, and there is *nothing* we can do about it! In particular, arithmetic operations with two `float` objects may result in "weird" rounding "errors" that are strictly deterministic and occur in accordance with the [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) standard.
#
# For example, let's add `1` to `1e15` and `1e16`, respectively. In the latter case, the `1` gets "lost": around $10^{16}$, the gap between two adjacent `float` values is already $2$, so `1e16 + 1` is rounded back to `1e16`.
# In[39]:
1e15 + 1
# In[40]:
1e16 + 1
# Interactions between sufficiently large and small `float` objects are not the only source of imprecision.
# In[41]:
from math import sqrt
# In[42]:
sqrt(2) ** 2
# In[43]:
0.1 + 0.2
# This may become a problem if we rely on equality checks in our programs.
# In[44]:
sqrt(2) ** 2 == 2
# In[45]:
0.1 + 0.2 == 0.3
# A popular workaround is to benchmark the absolute value of the difference between the two numbers to be checked for equality against a pre-defined `threshold` *sufficiently* close to `0`, for example, `1e-15`.
# In[46]:
threshold = 1e-15
# In[47]:
abs((sqrt(2) ** 2) - 2) < threshold
# In[48]:
abs((0.1 + 0.2) - 0.3) < threshold
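# The standard library ships this idea as [math.isclose()](https://docs.python.org/3/library/math.html#math.isclose), which compares against a *relative* tolerance (`1e-09` by default) plus an optional absolute one and is usually preferable to a hand-rolled `threshold`.

```python
from math import isclose, sqrt

isclose(sqrt(2) ** 2, 2)   # True
isclose(0.1 + 0.2, 0.3)    # True
isclose(1e15 + 1, 1e15)    # True: within the relative tolerance
isclose(0.1, 0.2)          # False
```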
# The built-in [format()](https://docs.python.org/3/library/functions.html#format) function allows us to show the **significant digits** of a `float` number as they exist in memory, to arbitrary precision. To illustrate, let's view a couple of `float` objects with `50` digits. This reveals that almost no `float` number is precise! After 14 or 15 significant digits, "weird" things happen. As we see further below, the "random" digits at the end of the `float` numbers do *not* "physically" exist in memory! Rather, they are computed by the [format()](https://docs.python.org/3/library/functions.html#format) function, which is forced to show `50` digits.
#
# The [format()](https://docs.python.org/3/library/functions.html#format) function is different from the [format()](https://docs.python.org/3/library/stdtypes.html#str.format) method on `str` objects introduced in the next chapter (cf., [Chapter 6](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/06_text/00_content.ipynb#format%28%29-Method)): Yet, both work with the so-called [format specification mini-language](https://docs.python.org/3/library/string.html#format-specification-mini-language): `".50f"` is the instruction to show `50` digits of a `float` number.
# In[49]:
format(0.1, ".50f")
# In[50]:
format(0.2, ".50f")
# In[51]:
format(0.3, ".50f")
# In[52]:
format(1 / 3, ".50f")
# The [format()](https://docs.python.org/3/library/functions.html#format) function does *not* round a `float` object in the mathematical sense! It just allows us to show an arbitrary number of the digits as stored in memory, and it also does *not* change these.
#
# In contrast, the built-in [round()](https://docs.python.org/3/library/functions.html#round) function creates a *new* numeric object that is a rounded version of the one passed in as the argument. Note that for values exactly halfway between two alternatives, it rounds to the nearest *even* choice ("banker's rounding") rather than always rounding up.
#
# For example, let's round `1 / 3` to five decimals. The obtained value for `roughly_a_third` is also *imprecise* but different from the "exact" representation of `1 / 3` above.
# In[53]:
roughly_a_third = round(1 / 3, 5)
# In[54]:
roughly_a_third
# In[55]:
format(roughly_a_third, ".50f")
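# The "banker's rounding" of [round()](https://docs.python.org/3/library/functions.html#round) is easy to demonstrate with halfway values (all of which happen to be exactly representable as floats):

```python
round(0.5)   # -> 0, not 1: ties go to the nearest even integer
round(1.5)   # -> 2
round(2.5)   # -> 2, not 3
round(3.5)   # -> 4
```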
# Surprisingly, `0.125` and `0.25` appear to be *precise*, and equality comparison works without the `threshold` workaround: Both are powers of $2$ in disguise.
# In[56]:
format(0.125, ".50f")
# In[57]:
format(0.25, ".50f")
# In[58]:
0.125 + 0.125 == 0.25
# ## Binary Representations
# To understand these subtleties, we need to look at the **[binary representation of floats](https://en.wikipedia.org/wiki/Double-precision_floating-point_format)** and review the basics of the **[IEEE 754](https://en.wikipedia.org/wiki/IEEE_754)** standard. On modern machines, floats are modeled in so-called double precision with $64$ bits that are grouped as in the figure below. The first bit determines the sign ($0$ for plus, $1$ for minus), the next $11$ bits represent an $exponent$ term, and the last $52$ bits represent the actual significant digits, the so-called $fraction$ part. The three groups are put together like so:
# $$float = (-1)^{sign} * 1.fraction * 2^{exponent-1023}$$
# A $1.$ is implicitly prepended as the first digit, and both $fraction$ and $exponent$ are stored in base $2$ representation (i.e., both are interpreted as non-negative integers, just like the `int` objects above). As $exponent$ is consequently non-negative, between $0_{10}$ and $2047_{10}$ to be precise, the $-1023$, called the exponent bias, centers the entire $2^{exponent-1023}$ term around $1$ and allows the period within the $1.fraction$ part to be shifted in either direction by the same amount. Floating-point numbers received their name as the period, formally called the **[radix point](https://en.wikipedia.org/wiki/Radix_point)**, "floats" along the significant digits. As an aside, an $exponent$ of all $0$s or all $1$s is used to model the special values `nan` or `inf`.
#
# As the standard defines the exponent part to come as a power of $2$, we now see why `0.125` is a *precise* float: It can be represented as a power of $2$, i.e., $0.125 = (-1)^0 * 1.0 * 2^{1020-1023} = 2^{-3} = \frac{1}{8}$. In other words, the floating-point representation of $0.125_{10}$ consists of $0_2$, $01111111100_2 = 1020_{10}$, and $0_2$ for the three groups, respectively.
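# We can verify these three bit groups directly; a minimal sketch using the [struct](https://docs.python.org/3/library/struct.html) module from the standard library to reinterpret a float's $64$ bits as an unsigned integer (the helper name `float_bits` is made up for illustration).

```python
import struct

def float_bits(x):
    """Split a float's 64 bits into (sign, exponent, fraction)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # raw 64 bits
    sign = bits >> 63                   # 1 bit
    exponent = (bits >> 52) & 0x7FF     # 11 bits
    fraction = bits & ((1 << 52) - 1)   # 52 bits
    return sign, exponent, fraction

float_bits(0.125)   # -> (0, 1020, 0): (-1)**0 * 1.0 * 2**(1020 - 1023)
```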
#
# The crucial fact for the data science practitioner to understand is that mapping the *infinite* set of the real numbers $\mathbb{R}$ to a *finite* set of bits leads to the imprecisions shown above!
#
# So, floats are usually good approximations of real numbers only in their first $14$ or $15$ significant digits. If more precision is required, we need to resort to other data types, such as a `Decimal` or a `Fraction`, as shown in the next two sections.
#
# This [blog post](http://fabiensanglard.net/floating_point_visually_explained/) offers a neat and *visual* way to think about floats. It also explains why floats become worse approximations of the reals as their absolute values increase.
#
# The Python [documentation](https://docs.python.org/3/tutorial/floatingpoint.html) provides another good discussion of floats and the goodness of their approximations.
#
# If we are interested in the exact bits behind a `float` object, we use the [.hex()](https://docs.python.org/3/library/stdtypes.html#float.hex) method: for a positive, normal float, it returns a `str` object beginning with `"0x1."`, followed by the $fraction$ in hexadecimal notation, and, separated by a `"p"`, the $exponent$ as an integer after subtraction of the $1023$ bias.
# In[59]:
one_eighth = 1 / 8
# In[60]:
one_eighth.hex()
# Also, the [.as_integer_ratio()](https://docs.python.org/3/library/stdtypes.html#float.as_integer_ratio) method returns two integers whose ratio is *exactly* equal to the value stored for a `float` object.
# In[61]:
one_eighth.as_integer_ratio()
# In[62]:
roughly_a_third.hex()
# In[63]:
roughly_a_third.as_integer_ratio()
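# A quick check that the ratio is indeed exact: dividing the two integers reproduces the original `float` object, with no approximation involved.

```python
numerator, denominator = (1 / 3).as_integer_ratio()

numerator / denominator == 1 / 3   # True: the ratio is exact
```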
# `0.0` is represented exactly as well (with all $exponent$ and $fraction$ bits set to $0$) and is thus a *precise* `float` number.
# In[64]:
zero = 0.0
# In[65]:
zero.hex()
# In[66]:
zero.as_integer_ratio()
# As seen in [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/01_elements/00_content.ipynb#%28Data%29-Type-%2F-%22Behavior%22), the [.is_integer()](https://docs.python.org/3/library/stdtypes.html#float.is_integer) method tells us if a `float` can be cast as an `int` object without any loss in precision.
# In[67]:
roughly_a_third.is_integer()
# In[68]:
one = roughly_a_third / roughly_a_third
one.is_integer()
# As the exact implementation of floats may vary across Python installations, we can look up the [.float_info](https://docs.python.org/3/library/sys.html#sys.float_info) attribute in the [sys](https://docs.python.org/3/library/sys.html) module in the [standard library](https://docs.python.org/3/library/index.html) to check the details. Usually, this is not necessary.
# In[69]:
import sys
# In[70]:
sys.float_info