sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] fansi_1.0.4     crayon_1.5.2    digest_0.6.31   utf8_1.2.2
 [5] IRdisplay_1.1   repr_1.1.5      lifecycle_1.0.3 jsonlite_1.8.4
 [9] evaluate_0.20   pillar_1.8.1    rlang_1.0.6     cli_3.6.0
[13] uuid_1.1-0      vctrs_0.5.2     IRkernel_1.3.2  tools_4.2.2
[17] glue_1.6.2      fastmap_1.1.0   compiler_4.2.2  base64enc_0.1-3
[21] pbdZMQ_0.3-9    htmltools_0.5.4
# Dig into the internal representation of R objects
library(lobstr) # if not installed, use install.packages('lobstr')
# For unsigned integers
library(uint8) # devtools::install_github('coolbutuseless/uint8')
# For bitstrings
library(pryr)
# For big integers
library(gmp)
# For single precision floating point numbers
library(float)
library(Rcpp)
Attaching package: ‘uint8’

The following objects are masked from ‘package:base’:

    :, order

Attaching package: ‘pryr’

The following objects are masked from ‘package:lobstr’:

    ast, mem_used

Attaching package: ‘gmp’

The following objects are masked from ‘package:base’:

    %*%, apply, crossprod, matrix, tcrossprod
Humans use decimal digits (why?)
Computers use binary digits (why?)
The R function lobstr::obj_size()
shows the amount of memory (in bytes) used by an object. (This is a better version of the built-in utils::object.size().)
x <- 100
lobstr::obj_size(x)
y <- c(20, 30)
lobstr::obj_size(y)
z <- matrix(runif(100 * 100), nrow = 100) # 100 x 100 random matrix
lobstr::obj_size(z)
56 B
64 B
80.22 kB
Print all variables in workspace and their sizes:
sapply(ls(), function(z, env=parent.env(environment())) obj_size(get(z, envir=env)))
Plain text files (.jl, .r, .c, .cpp, .ipynb, .html, .tex, ...) are stored as sequences of characters.
# integers 0, 1, ..., 127 and corresponding ASCII character
sapply(0:127, intToUtf8)
# integers 128, 129, ..., 255 and corresponding extended ascii character
sapply(128:255, intToUtf8)
Unicode encodings (UTF-8, UTF-16, and UTF-32) support many more characters, including non-English characters; the first 128 code points (7 bits) coincide with ASCII.
UTF-8 is currently the dominant character encoding on the internet.
Source: Google Blog
st <- "\uD1B5\uACC4\uACC4\uC0B0" # Korean for "statistical computing"
st
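A quick check of the ASCII-compatibility claim above, as a minimal sketch with base R functions (utf8ToInt, charToRaw, nchar); the byte counts in the comments are what UTF-8 should give.
utf8ToInt("A")              # 65, the same code point as in ASCII
charToRaw("A")              # a single byte (0x41) in UTF-8
nchar(st, type = "chars")   # 4 characters
nchar(st, type = "bytes")   # 12 bytes: each Korean syllable takes 3 bytes in UTF-8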
Fixed-point number system is a computer model for integers $\mathbb{Z}$.
The number $M$ of bits and method of representing negative numbers vary from system to system.
The integer
type in R has $M=32$ (packages such as ‘bit64’ support 64-bit integers).
C has (unsigned) char, int, short, long (and long long), whose sizes depend on the machine.
Fixed-width integer types: (u)int8, (u)int16, (u)int32, (u)int64.
R does not support unsigned integers natively (we will see the uint8 package later). In most other languages:
Type | Min | Max |
---|---|---|
UInt8 | 0 | 255 |
UInt16 | 0 | 65535 |
UInt32 | 0 | 4294967295 |
UInt64 | 0 | 18446744073709551615 |
UInt128 | 0 | 340282366920938463463374607431768211455 |
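As a sanity check on this table, each maximum is $2^M - 1$; a minimal sketch using the gmp package loaded above:
# The maximum of an unsigned M-bit integer is 2^M - 1
gmp::as.bigz(2)^8 - 1      # 255
gmp::as.bigz(2)^64 - 1     # 18446744073709551615
gmp::as.bigz(2)^128 - 1    # 340282366920938463463374607431768211455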
Signed integers: model of $\mathbb{Z}$. Can do subtraction.
First bit ("most significant bit" or MSB) is the sign bit: 0 for nonnegative numbers, 1 for negative numbers.
Two's complement representation for negative numbers: flip (0 -> 1, 1 -> 0) the remaining bits, then add 1 to the result.
class(5L) # postfix `L` means integer in R
pryr::bits(5L)
pryr::bits(-5L)
pryr::bits(2L * 5L) # shift bits of 5 to the left (why?)
pryr::bits(2L * -5L); # shift bits of -5 to left
Source: Signed Binary Numbers, Subtraction and Overflow by Grant Braught
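A minimal sketch of the two's complement recipe above, using base R's bitwNot (which flips all 32 bits of an integer):
x <- 5L
flipped <- bitwNot(x)      # flip the bits: 0 -> 1, 1 -> 0
flipped + 1L               # add 1 to the result: gives -5L
pryr::bits(flipped + 1L)   # same bit pattern as pryr::bits(-5L) above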
.Machine$integer.max # R uses 32-bit integer
In most other languages:
Type | Min | Max |
---|---|---|
Int8 | -128 | 127 |
Int16 | -32768 | 32767 |
Int32 | -2147483648 | 2147483647 |
Int64 | -9223372036854775808 | 9223372036854775807 |
Int128 | -170141183460469231731687303715884105728 | 170141183460469231731687303715884105727 |
R reports NA for integer overflow and underflow.
# The largest integer R can hold
.Machine$integer.max
M <- 32
big <- 2^(M-1) - 1
as.integer(big)
.Machine$integer.max + 1L
Warning message in .Machine$integer.max + 1L: “NAs produced by integer overflow”
uint8
outputs the result according to modular arithmetic, as do C and Julia.
uint8::as.uint8(255) + uint8::as.uint8(1)
uint8::as.uint8(250) + uint8::as.uint8(15)
[1] 0
[1] 9
Package gmp supports big integers with arbitrary precision.
gmp::as.bigz(.Machine$integer.max ) + gmp::as.bigz(1L)
Big Integer ('bigz') : [1] 2147483648
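Big integers also allow exact evaluation of quantities that lose digits in double precision; a small sketch comparing base R's factorial with gmp's factorialZ:
factorial(30)          # double precision: only about 16 significant digits are reliable
gmp::factorialZ(30)    # exact 33-digit big integer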
Floating-point number system is a computer model for the real line $\mathbb{R}$.
For the history, see an interview with William Kahan.
Humans use the base $b=10$ and digits $d_i=0, 1, \dotsc, 9$.
In computers, the base is $b=2$ and the digits $d_i$ are 0 or 1.
For example, the decimal number 18 can be written as $$ +1.001 \times 2^4 \quad (\text{normalized}) $$ or, equivalently, $$ +0.1001 \times 2^5 \quad (\text{denormalized}).$$
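A quick arithmetic check of this example, just evaluating the two expansions in R:
(1 + 0/2 + 0/4 + 1/8) * 2^4        # normalized   +1.001  x 2^4 = 18
(1/2 + 0/4 + 0/8 + 1/16) * 2^5     # denormalized +0.1001 x 2^5 = 18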
R supports double precision floating point numbers (see below) via the double type.
C supports the floating point types float and double, where in most systems float corresponds to single precision while double corresponds to double precision.
Julia provides Float16 (half precision, implemented in software using Float32), Float32 (single precision), Float64 (double precision), and BigFloat (arbitrary precision).
R has no single precision data type. All real numbers are stored in double precision format. The functions as.single and single are identical to as.double and double except they set the attribute Csingle that is used in the .C and .Fortran interface, and they are intended only to be used in that context. (R Documentation)
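The float package loaded above does provide genuine 32-bit storage (separately from the Csingle attribute mentioned in the quote); a small sketch comparing memory use, where roughly half the bytes should be needed:
x <- runif(1e4)
lobstr::obj_size(x)              # about 80 kB in double precision
lobstr::obj_size(float::fl(x))   # about 40 kB in single precision (plus a small S4 overhead)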
Source: https://en.wikipedia.org/wiki/Half-precision_floating-point_format
In Julia, Float16 is the type for half precision numbers.
MSB is the sign bit.
10 significand bits (fraction=mantissa), hence $p=11$ (why?)
5 exponent bits: $e_{\max}=15$, $e_{\min}=-14$; bias $=15=01111_2$ is added to the exponent for encoding.
$e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers.
range of magnitude: $10^{\pm 4}$ in decimal because $\log_{10} (2^{15}) \approx 4$.
Precision: $\log_{10}2^{11} \approx 3.311$ decimal digits.
# This is Julia
println("Half precision:")
@show bitstring(Float16(5)) # 5 in half precision
@show bitstring(Float16(-5)); # -5 in half precision
Half precision:
bitstring(Float16(5)) = "0100010100000000"
bitstring(Float16(-5)) = "1100010100000000"
Single precision (C float)
Source: https://en.wikipedia.org/wiki/Single-precision_floating-point_format
In C, float is the type for single precision numbers for most systems. In Julia, Float32 is the type for single precision numbers.
# Homework: figure out how this C++ code works
Rcpp::cppFunction('int float32bin(double x) {
float flx = (float) x;
unsigned int binx = *((unsigned int*)&flx);
return binx;
}')
MSB is the sign bit.
23 significand bits ($p=24$).
8 exponent bits: $e_{\max}=127$, $e_{\min}=-126$, bias=127.
$e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers.
range of magnitude: $10^{\pm 38}$ in decimal because $\log_{10} (2^{127}) \approx 38$.
precision: $\log_{10}(2^{24}) \approx 7.225$ decimal digits.
message("Single precision:")
pryr::bits(float32bin(5)) # 5 in single precision
pryr::bits(float32bin(-5)) # -5 in single precision
Single precision:
Double precision (C double)
Source: https://en.wikipedia.org/wiki/Double-precision_floating-point_format
Double precision (64 bits = 8 bytes) numbers are the dominant data type in scientific computing.
In C, double is the type for double precision numbers for most systems. It is the default type for numeric values.
In Julia, Float64 is the type for double precision numbers.
In R, double is the type for double precision numbers.
MSB is the sign bit.
52 significand bits ($p=53$).
11 exponent bits: $e_{\max}=1023$, $e_{\min}=-1022$, bias=1023.
$e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers.
range of magnitude: $10^{\pm 308}$ in decimal because $\log_{10} (2^{1023}) \approx 308$.
precision: $\log_{10}(2^{53}) \approx 15.95$ decimal digits.
message("Double precision:")
pryr::bits(5) # 5 in double precision
pryr::bits(-5) # -5 in double precision
Double precision:
pryr::bits(Inf) # Inf in double precision
pryr::bits(-Inf) # -Inf in double precision
Exponent $e_{\max}+1$ plus a nonzero mantissa means NaN. NaN could be produced from 0 / 0, 0 * Inf, ...
In general NaN ≠ NaN bitwise. Test whether a number is NaN with the is.nan function.
pryr::bits(0 / 0) # NaN
pryr::bits(0 * Inf) # NaN
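As noted above, NaN does not compare equal to anything, including itself; a quick check, with is.nan as the reliable test:
NaN == NaN      # NA, not TRUE: comparisons involving NaN cannot be used as tests
is.nan(0 / 0)   # TRUE
is.na(NaN)      # TRUE: NaN is also treated as missing by is.na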
pryr::bits(0.0) # 0 in double precision
2^(-126) # emin=-126
pryr::bits(float32bin(2^(-126)))
2^(-149) # denormalized
pryr::bits(float32bin(2^(-149)))
Rounding is necessary whenever a number has more than $p$ significand bits. Most computer systems use the default IEEE 754 round to nearest, ties to even mode:
Round to nearest: for example, the number 1/10 cannot be represented accurately as a (binary) floating point number:
pryr::bits(float32bin(0.1)) # single precision, 1001 gets rounded to 101(0)
In single precision, the number 1.1001 1001 1001 1001 1001 1001 1001 ... falls between 1.1001 1001 1001 1001 1001 100 and 1.1001 1001 1001 1001 1001 101. The midway point between these two numbers is 1.1001 1001 1001 1001 1001 1001 0000 0... Since the trailing bits of $0.1_{10}$ exceed this midway point, $0.1_{10}$ is closer to 1.1001 1001 1001 1001 1001 101 and is rounded up.
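A quick numerical check that single precision 0.1 indeed rounds up, using the float package loaded above and printing in double precision to show all digits:
print(as.double(float::fl(0.1)), digits = 20)   # 0.100000001490116..., slightly above 0.1
as.double(float::fl(0.1)) > 0.1                 # TRUE: the single precision value rounded up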
Ties to even: if the number falls midway, it is rounded to the nearest value with an even least significant digit.
For example, consider the case where the precision is 4 binary digits:
Exact number | Rounded value | Remainder bits |
---|---|---|
1.000011*2^1 | 1.000*2^1 | 011 -> round down |
1.000110*2^1 | 1.001*2^1 | 110 -> round up |
1.011100*2^1 | 1.100*2^1 | 100 -> round up (*) |
1.010100*2^1 | 1.010*2^1 | 100 -> round down (**) |
In the third example (*), 1.011 100 is precisely midway between 1.011 and 1.100. The ties to even rule chooses 1.100, i.e., the representation whose least significant bit is zero.
In the fourth example (**), 1.010 100 is also precisely midway between 1.010 and 1.011. The ties to even rule now chooses to round down to 1.010.
To summarize, if the remainder bits below the precision are 1000 0..., then the ties to even rule applies.
Rounding (more fundamentally, finite precision) incurs errors in floating point computation. If a real number $x$ is represented by a floating point number $[x]$, then the relative error is $$ \frac{|[x] - x|}{|x|} \quad (\text{if } x \neq 0). $$
Of course, we want to ensure that the error after a computation is small.
Source: What you never wanted to know about floating point but will be forced to find out
Same number of representable numbers in $(2^i, 2^{i+1}]$ as in $(2^{i+1}, 2^{i+2}]$. Within each such interval, they are uniformly distributed.
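A small sketch of this doubling of the spacing: in double precision, integers are exactly representable up to $2^{53}$, after which the spacing becomes 2.
(2^52 + 1) - 2^52    # 1: spacing in (2^52, 2^53] is 1
(2^53 + 1) - 2^53    # 0: 2^53 + 1 rounds back to 2^53 (spacing is now 2)
(2^53 + 2) - 2^53    # 2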
Machine epsilons are the spacings of numbers around 1:
Source: Computational Statistics, James Gentle, Springer, New York, 2009.
Any real number in the interval $\left[1 - \frac{1}{2^{p+1}}, 1 + \frac{1}{2^p}\right]$ is represented by a floating point number $1 = 1.00\dotsc 0_2 \times 2^0$ (recall ties to even).
Adding $\frac{1}{2^p}$ to 1 won't reach the next representable floating point number $1.00\dotsc 01_2 \times 2^0 = 1 + \frac{1}{2^{p-1}}$. Hence $\epsilon_{\max} > \frac{1}{2^p} = 1.00\dotsc 0_2 \times 2^{-p}$.
Adding the floating point number next to $\frac{1}{2^p} = 1.00\dotsc 0_2 \times 2^{-p}$ to 1 WILL result in $1.00\dotsc 01_2 \times 2^0 = 1 + \frac{1}{2^{p-1}}$, hence $\epsilon_{\max} = 1.00\dotsc 01_2 \times 2^{-p} = \frac{1}{2^p} + \frac{1}{2^{p+p-1}}$.
Subtracting $\frac{1}{2^{p+1}}$ from 1 results in $1-\frac{1}{2^{p+1}} = \frac{1}{2} + \frac{1}{2^2} + \dotsb + \frac{1}{2^p} + \frac{1}{2^{p+1}}$, which is represented by the floating point number $1.00\dotsc 0_2 \times 2^{0} = 1$ by the "ties to even" rule. Hence $\epsilon_{\min} > \frac{1}{2^{p+1}}$.
The smallest positive floating point number larger than $\frac{1}{2^{p+1}}$ is $\frac{1}{2^{p+1}} + \frac{1}{2^{2p}}=1.00\dotsc 1_2 \times 2^{-p-1}$. Thus $\epsilon_{\min} = \frac{1}{2^{p+1}} + \frac{1}{2^{2p}}$.
Machine epsilon is often called the machine precision.
If a positive real number $x \in \mathbb{R}$ is represented by $[x]$ in floating point arithmetic, with $2^e \le [x] < 2^{e+1}$, then the spacing of floating point numbers in this range is $\frac{2^e}{2^{p-1}}$, so round to nearest guarantees $|x - [x]| \le \frac{2^e}{2^p}$.
Thus $x - \frac{2^e}{2^p} < [x] \le x + \frac{2^e}{2^p}$, and $$ \begin{split} \frac{| x - [x] |}{|x|} &\le \frac{2^e}{2^p|x|} \le \frac{2^e}{2^p}\frac{1}{[x]-2^e/2^p} \\ &\le \frac{2^e}{2^p}\frac{1}{2^e(1-1/2^p)} \quad (\because [x] \ge 2^e) \\ &\le \frac{2^e}{2^p}\frac{1}{2^e}(1 + \frac{1}{2^{p-1}}) \\ &= \frac{1}{2^p} + \frac{1}{2^{2p-1}} = \epsilon_{\max}. \end{split} $$ That is, the relative error of the floating point representation $[x]$ of real number $x$ is bounded by $\epsilon_{\max}$.
options(digits=20)
print(2^(-53) + 2^(-105)) # epsilon_max for double
print(1.0 + 2^(-53))
print(1.0 + (2^(-53) + 2^(-105)))
print(1.0 + 2^(-53) + 2^(-105)) # why is the result 1? See "Catastrophic cancellation"
print(as.double(float::fl(2^(-24) + 2^(-47)))) # epsilon_max for float
print(as.double(float::fl(1.0) + float::fl(2^(-24))))
print(as.double(float::fl(1.0) + float::fl(2^(-24) + 2^(-47))))
[1] 1.1102230246251567869e-16
[1] 1
[1] 1.000000000000000222
[1] 1
[1] 5.9604651880817982601e-08
[1] 1
[1] 1.0000001192092895508
print(2^(-54) + 2^(-106)) # epsilon_min for double
print(1 - (2^(-54) + 2^(-106)))
pryr::bits(1.0)
pryr::bits(1 - (2^(-54) + 2^(-106)))
[1] 5.5511151231257839347e-17
[1] 0.99999999999999988898
.Machine contains numerical characteristics of the machine. double.neg.eps is our $\epsilon_{\max}$.
.Machine
The IEEE 754 standard guarantees that arithmetic operations on floating point numbers are correctly rounded: the computed result is the floating point representation of the exact arithmetic result of the operands. For example, suppose $x$ and $y$ are two exact real numbers. Let $[x]$ and $[y]$ be their floating point representations, after the appropriate rounding. Then the addition of these two numbers is computed as follows. $$ \begin{split} z &= [x] + [y] \quad \text{(exact arithmetic)} \\ z &\gets [z] \quad \text{(rounding)} \end{split} $$
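A small illustration of this correctly-rounded behavior (a sketch, not from the notes above): 0.1 + 0.2 is the correctly rounded sum of [0.1] and [0.2], which lands one spacing above [0.3].
0.1 + 0.2 == 0.3     # FALSE: the two roundings differ by one unit in the last place
(0.1 + 0.2) - 0.3    # 2^(-54), about 5.55e-17
pryr::bits(0.1 + 0.2)
pryr::bits(0.3)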
For double precision, the range is $\pm 10^{\pm 308}$. In most situations, underflow (magnitude of result is less than $10^{-308}$) is preferred over overflow (magnitude of result is larger than $10^{+308}$). Overflow produces $\pm\infty$. Underflow yields zeros or denormalized numbers.
E.g., the logit link function is $$ p = \frac{\exp(x^T\beta)}{1 + \exp(x^T\beta)} = \frac{1}{1 + \exp(-x^T\beta)}. $$
The former expression can easily lead to Inf / Inf = NaN, while the latter expression leads to gradual underflow.
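A quick numerical comparison of the two algebraically equivalent forms; the value 800 below is just an arbitrary large linear predictor chosen for illustration.
eta <- 800
exp(eta) / (1 + exp(eta))   # NaN: Inf / Inf
1 / (1 + exp(-eta))         # 1: exp(-800) underflows gracefully to 0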
.Machine$double.xmax and .Machine$double.xmin give the largest and smallest positive non-subnormal numbers representable in double precision.
What happens when the computer calculates $a+b$? We get $a+b=a$!
(a <- 2.0^30)
(b <- 2.0^-30)
(a + b == a)
pryr::bits(a)
pryr::bits(a + b)
Analysis: suppose we want to compute $x + y$, $x, y > 0$. Let the relative error of $x$ and $y$ be $$ \delta_x = \frac{[x] - x}{x}, \quad \delta_y = \frac{[y] - y}{y} . $$ What the computer actually calculates is $[x] + [y]$, which in turn is represented by $[ [x] + [y] ]$. The relative error of this representation is $$ \delta_{\text{sum}} = \frac{[[x]+[y]] - ([x]+[y])}{[x]+[y]} . $$ Recall that $|\delta_x|, |\delta_y|, |\delta_{\text{sum}}| \le \epsilon_{\max}$.
We want to find a bound of the relative error of $[[x]+[y]]$ with respect to $x+y$. Since $|[x]+[y]| = |x(1+\delta_x) + y(1+\delta_y)| \le |x+y|(1+\epsilon_{\max})$, we have $$ \begin{split} | [[x]+[y]]-(x+y) | &= | [[x]+[y]] - [x] - [y] + [x] - x + [y] - y | \\ &\le | [[x]+[y]] - [x] - [y] | + | [x] - x | + | [y] - y | \\ &\le |\delta_{\text{sum}}([x]+[y])| + |\delta_x x| + |\delta_y y| \\ &\le \epsilon_{\max}(x+y)(1+\epsilon_{\max}) + \epsilon_{\max}x + \epsilon_{\max}y \\ &\approx 2\epsilon_{\max}(x+y) \end{split} $$ because $\epsilon_{\max}^2 \approx 0$. Thus $$ \frac{| [[x]+[y]]-(x+y) |}{|x+y|} \le 2\epsilon_{\max} $$ approximately.
The result is $1.vvvvu...u$ where $u$ are unassigned digits.
olddigits <- options('digits')$digits
options(digits=20)
a <- float::fl(1.2345678) # rounding
pryr::bits(float32bin(as.double(a))) # rounding
b <- float::fl(1.2345677)
pryr::bits(float32bin(as.double(b)))
print(as.double(a - b)) # correct result should be 1f-7
pryr::bits(float32bin(as.double(a - b))) # must be 1.0000...0 x 2^(-23)
print(1/2^23)
options(digits=olddigits)
[1] 1.1920928955078125e-07
[1] 1.1920928955078125e-07
Analysis: Let $$ [x] = 1 + \sum_{i=1}^{p-2}\frac{d_{i+1}}{2^i} + \frac{1}{2^{p-1}}, \quad [y] = 1 + \sum_{i=1}^{p-2}\frac{d_{i+1}}{2^i} + \frac{0}{2^{p-1}} . $$
$[x]-[y] = \frac{1}{2^{p-1}} = [[x]-[y]]$.
The true difference $x-y$ may lie anywhere in $(0, \frac{1}{2^{p-2}}+\frac{1}{2^{2p}}]$.
Relative error is unbounded -- no guarantee of any significant digit!
Floating-point numbers may violate many algebraic laws we are familiar with, such as the associative and distributive laws. See the example in the Machine Epsilon section and HW1.
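One tiny illustration of the failure of associativity (one of many possible examples):
(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)   # FALSE in double precision
(0.1 + 0.2) + 0.3
0.1 + (0.2 + 0.3)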
Section II.2, Computational Statistics by James Gentle (2009).
Sections 1.5 and 2.2, Applied Numerical Linear Algebra by James W. Demmel (1997).
What every computer scientist should know about floating-point arithmetic by David Goldberg (1991).