Notebook

Errors, accuracy, and roundoff¶

Binary representation of numbers¶

We are used to base 10 numbers: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
- Because we have 10 fingers
- We have 1's, 10's, 100's, 1000's, etc.
- $101325 = 5\cdot 10^0 + 2\cdot 10^1 + 3\cdot 10^2 + 1\cdot 10^3 + 0\cdot 10^4 + 1\cdot 10^5.$
Computers only have 2 fingers, so they use binary: 0, 1.
- We have 1's, 2's, 4's, 8's, 16's, etc.
- $11010 = 0\cdot 2^0 + 1\cdot 2^1 + 0\cdot 2^2 + 1\cdot 2^3 + 1\cdot 2^4 = 0+2+0+8+16=26.$

Convert decimal to binary¶

Take a decimal number, like 53.
Divide it by 2 and record the remainder. The remainder will either be a 1 or a 0. For example, 53/2 = 26 remainder 1.
Repeat this process until the number you are dividing becomes 0:
Now, line up and reverse the order of the remainder columns.

#	53	26	13	6	3	1	0
r		1	0	1	0	1	1

So, 101011 $\rightarrow$ reverse the order $\rightarrow$ $\mathbf{110101} = 53$

Binary fractions¶

$1.1 = 1\frac{1}{2} = 1.5.$
$1.11 = 1 + 1\cdot\frac{1}{2} + 1\cdot \frac{1}{4} = 1.75$
$1.101 = 1 + 1\cdot \frac{1}{2} + 0\cdot \frac{1}{4} + 1\cdot \frac{1}{8} = 1.625.$
- or, multiply and divide by $2^3\rightarrow\frac{1101}{2^3} = \frac{13}{2^3} = 1.625.$

Numerical representation¶

Finite number of bits per number
Integers are exact within some range
- Arithmetic is interpreted as an integer: 5/2 = 2, not 2.5
Real numbers must fall into bins $\rightarrow$ not exactly represented.
- Arithmetic involves errors (roundoff).
Question: what is the binary representation of the decimal number 0.1?

Floating point representation¶

See Numerical Recipes 3rd Ed. Page 8, section 1.1

Called floating point because the decimal point floats
- 1502.0 = 1.502E3 = 15.020E2
Addition: 1.57E2 + 2.0E0 is really 157.0E0 + 2.0E0
- That is, the power of 10 is made the same, then add.
- (That is, line up the decimals, then add).
Floats and Doubles
- Floats = 4 bytes = 32 bits
- Doubles = 8 bytes = 64 bits

S (1)	E (11)	M (52)
0	00000000000	0000000000000000000000000000000000000000000000000000

Format: $ (-1)^S \times 1.M \times 2^{E-1023}$
- $S$ is a sign bit (1 bit)
- $M$ is the mantissa (52 bits), the number part
  - Numbers between 0 and $2^{52}-1$.
  - $2^{52} = 4.5E15$ $\rightarrow 15+1$ =16 digits of accuracy
  - Note, binary doubles are normalized meaning they are left shifted until the left-most bit is 1. This is assumed, giving a bit more accuracy.
- $E$ is the exponent (11 bits)
  - 11 bits $\rightarrow 2^{11} = 2048 \rightarrow 2^{2048-1}.$
- The 1023 is a bias (shift), allowing negative exponents.
  - So, instead of 0 to 2047, have roughly $10^{-1023}$ to $10^{1023}$.

From Numerical Recipes: $$S\times M\times b^{E-e}$$

S	E	F	Value
any	1-2046	any	(-1)$^S\times$ 2$^{E-1023}$$\times$ 1.F
any	0	nonzero	(-1)$^S\times$ 2$^{E-1022}$$\times$ 0.F
0	0	0	$+0.0$
1	0	0	$-0.0$
0	2047	0	$+\infty$
1	2047	0	$-\infty$
any	2047	nonzero	NaN

\begin{align} &0\;\; 01111111111\;\; 0000000000000000000000000000000000000000000000000000 = +1\times 2^{1023-1023} \times 1.0_2 &=& \;1 \\ &1\;\; 01111111111\;\; 0000000000000000000000000000000000000000000000000000 = -1\times 2^{1023-1023} \times 1.0_2 &=& \;-1 \\ &0\;\; 01111111111\;\; 1000000000000000000000000000000000000000000000000000 = +1\times 2^{1023-1023} \times 1.1_2 &=& \;1.5 \\ &0\;\; 10000000000\;\; 0000000000000000000000000000000000000000000000000000 = +1\times 2^{1024-1023} \times 1.0_2 &=& \;2 \\ &0\;\; 10000000001\;\; 1010000000000000000000000000000000000000000000000000 = +1\times 2^{1025-1023} \times 1.1010_2 &=& \;6.5 \end{align}

Errors¶

Finite representation:

The relative error in the representation (RE):
- $RE\le 2^{-M}$.
- For double precision, $2^{-52}=2\times 10^{-16}\rightarrow$ 16 digits.

Floating point operations: $x \square y$, where $\square$ is one of + - * /
- $fl(x\square y) = round(x\square y)$, that is, have to round.
- $x+y\rightarrow$ need the same exponents $\rightarrow$ lose digits of the smaller number.
  - Suppose we had 4 digits to work with: $1000. + 7.200 \rightarrow 1.000E3 + 0.0072E3 \rightarrow 1.007E3$. So we lose the $2$.
- $x*y\rightarrow$ Add the exponents and multiply the mantissas $\rightarrow$ rounding error, but not as severe.
- $a+b = b+a$, but $(a+b)+c \ne a+(b+c)$.
  - Commutative, but not associative.
  - For example, for $\epsilon = \frac{1}{2}\epsilon_{mach}$, $(1+\epsilon)+\epsilon=1$, but $1+(\epsilon+\epsilon)>1$.

Examples of roundoff error trouble¶

Error Analysis¶

Here, we consider the worst cases.
$fl(x\square y) = (x\square y)(1+\delta),$ where $\delta$ is the relative error. (Also use $\epsilon$ for this.)

Example 1¶

$$a^2 - b^2$$

Algorithm 1¶

Compute as

$$a\cdot a - b\cdot b$$$$(1)\,\, (3)\,\, (2)$$$$fl(a\cdot a - b\cdot b) = [aa(1+\epsilon_1) - bb(1+\epsilon_2)](1+\epsilon_3)$$$$ = a^2(1+\epsilon_1)(1+\epsilon_3) - b^2(1+\epsilon_2)(1+\epsilon_3).$$

Now, $(1+\epsilon_1)(1+\epsilon_2) = 1+\epsilon_1+\epsilon_2 + \epsilon_1\epsilon_3.$
- The last term is small compared to the others, so we have
- $(1+\epsilon_1)(1+\epsilon_2) \approx 1+\delta_1$, where $|\delta_1|\le 2\epsilon_{mach}.$
Hence

$$a\cdot a - b\cdot b \rightarrow a^2(1+\delta_1) - b^2(1+\delta_2).$$

Now, the relative error between this, and the exact answer is

$$RE = \frac{|a^2(1+\delta_1) - b^2(1+\delta_2) - (a^2 - b^2)|}{|a^2-b^2|} = \frac{|a^2\delta_1 - b^2\delta_2|}{|a^2 - b^2|}.$$

The signs of $\delta_1$ and $\delta_2$ are unknown $\rightarrow$ worst case is for $\delta_1>0$ and $\delta_2<0$.

$$RE \le \frac{|\delta_1|a^2 + |\delta_2|b^2}{|a^2-b^2|}\le 2\epsilon_{mach}\frac{a^2+b^2}{|a^2-b^2|}.$$

Hence $RE\rightarrow \infty$ as $a\approx b$.
Subtraction of similar numbers $\rightarrow$ big relative error.
- ex: 1.055 - 1.054 = 0.001, but the roundoff error is in the last decimal $\approx \pm 0.001$.

Algorithm 2¶

Compute as

$$(a-b)(a+b)$$$$(1)\,\,(3)\,\,(2)$$

This gives:

$$fl((a-b)(a+b)) = [(a-b)(1+\epsilon_1)(a+b)(1+\epsilon_2)](1+\epsilon_3).$$$$ = (a-b)(a+b)(1+\epsilon_1)(1-\epsilon_2)(1-\epsilon_3)$$$$ \approx (a-b)(a+b)(1+\delta),$$

where $|\delta|\le 3\epsilon_{mach}.$

The relative error is

$$ RE = \frac{|(a-b)(a+b)(1+\delta) - (a-b)(a+b)|}{|a^2-b^2|} = |\delta|\le 3\epsilon_{mach}.$$

The relative error is independent of $a$ and $b$, and is at most $3\epsilon_{mach}$.
The error is as small as it can be.
So factor if possible.

Example 2¶

$$\prod_{x=1}^n x_i = x_1x_2x_3\ldots$$$$fl\left(\prod_{x=1}^n x_i\right) = \left(\prod x_i\right)\left(\prod(1+\epsilon_i\right) = \left(\prod x_i\right)(1+\delta),$$

where $|\delta|\le(n-1)\epsilon_{mach}$.

The relative error is simply

$$ RE = |\delta|$$

This is good, we can't do any better than having the relative error be $(n-1)\epsilon_{mach}$ when we are doing $n-1$ calculations.
(But be careful of very large $n$.)

Example 3¶

$$\sum_{i=1}^n x_i$$

Only the result is given here:

$$RE\le(n-1)\epsilon_{mach}\frac{\sum|x_i|}{|\sum x_i|}$$

Case 1: all the $x_i$ have the same signs.
- Then $RE\le(n-1)\epsilon_{mach}.$ This is ideal.
Case 2: May get $\sum x_i\approx 0$, and $\sum|x_i|$ is big.
- Then $RE$ blows up. This is bad.
- Reformulate if you can, like regroup terms?

Example 4¶

How to properly evaluate the quadratic equation.
See Numerical Recipes 3rd Ed. Page 227, section 5.6

$$ax^2 + bx+c=0.$$

Solutions¶

$$x= \frac{-b\pm\sqrt{b^2-4ac}}{2a},$$

$$x = \frac{2c}{-b\pm\sqrt{b^2-4ac}}.$$

What can go wrong in evaluating the solution using either of these two solutions?

Correct approach¶

$$q = -\frac{1}{2}\left[b + \mathrm{sgn}(b)\sqrt{b^2-4ac}\right],$$ $$x_1 = \frac{q}{a},$$ $$x_2 = \frac{c}{q}.$$

In [ ]: