sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] fansi_1.0.4     crayon_1.5.2    digest_0.6.31   utf8_1.2.2
 [5] IRdisplay_1.1   repr_1.1.5      lifecycle_1.0.3 jsonlite_1.8.4
 [9] evaluate_0.20   pillar_1.8.1    rlang_1.0.6     cli_3.6.0
[13] uuid_1.1-0      vctrs_0.5.2     IRkernel_1.3.2  tools_4.2.2
[17] glue_1.6.2      fastmap_1.1.0   compiler_4.2.2  base64enc_0.1-3
[21] pbdZMQ_0.3-9    htmltools_0.5.4
# Dig into the internal representation of R objects
library(lobstr) # if not installed, use install.packages('lobstr')
# For unsigned integers
library(uint8) # devtools::install_github('coolbutuseless/uint8')
# For bitstrings
library(pryr)
# For big integers
library(gmp)
# For single precision floating point numbers
library(float)
library(Rcpp)
Attaching package: ‘uint8’

The following objects are masked from ‘package:base’:

    :, order

Attaching package: ‘pryr’

The following objects are masked from ‘package:lobstr’:

    ast, mem_used

Attaching package: ‘gmp’

The following objects are masked from ‘package:base’:

    %*%, apply, crossprod, matrix, tcrossprod
Humans use decimal digits (why?)
Computers use binary digits (why?)
The R function lobstr::obj_size()
shows the amount of memory (in bytes) used by an object. (This is a better version of the built-in utils::object.size().)
x <- 100
lobstr::obj_size(x)
y <- c(20, 30)
lobstr::obj_size(y)
z <- matrix(runif(100 * 100), nrow = 100) # 100 x 100 random matrix
lobstr::obj_size(z)
56 B
64 B
80.22 kB
Print all variables in workspace and their sizes:
sapply(ls(), function(z, env=parent.env(environment())) obj_size(get(z, envir=env)))
Plain text files (.jl, .r, .c, .cpp, .ipynb, .html, .tex, ...) are stored as sequences of characters.
# integers 0, 1, ..., 127 and corresponding ASCII character
sapply(0:127, intToUtf8)
# integers 128, 129, ..., 255 and corresponding extended ascii character
sapply(128:255, intToUtf8)
Unicode encodings (UTF-8, UTF-16, and UTF-32) support many more characters, including non-English characters; the first 128 code points (7 bits) coincide with ASCII.
UTF-8 is currently the dominant character encoding on the internet.
Source: Google Blog
st <- "\uD1B5\uACC4\uACC4\uC0B0" # Korean for "statistical computing"
st
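A quick check of the ASCII-compatibility claim above, as a minimal sketch with base R functions (utf8ToInt, charToRaw, nchar); the byte counts in the comments are what UTF-8 should give.
utf8ToInt("A")              # 65, the same code point as in ASCII
charToRaw("A")              # a single byte (0x41) in UTF-8
nchar(st, type = "chars")   # 4 characters
nchar(st, type = "bytes")   # 12 bytes: each Korean syllable takes 3 bytes in UTF-8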
Fixed-point number system is a computer model for integers $\mathbb{Z}$.
The number $M$ of bits and method of representing negative numbers vary from system to system.
The integer
type in R has $M=32$ (packages such as ‘bit64’ support 64-bit integers).
C has (unsigned) char, int, short, long (and long long), whose sizes depend on the machine.
Fixed-width integer types: (u)int8, (u)int16, (u)int32, (u)int64.
R does not support unsigned integers natively (we will see the uint8 package later). In most other languages:
Type | Min | Max |
---|---|---|
UInt8 | 0 | 255 |
UInt16 | 0 | 65535 |
UInt32 | 0 | 4294967295 |
UInt64 | 0 | 18446744073709551615 |
UInt128 | 0 | 340282366920938463463374607431768211455 |
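As a sanity check on this table, each maximum is $2^M - 1$; a minimal sketch using the gmp package loaded above:
# The maximum of an unsigned M-bit integer is 2^M - 1
gmp::as.bigz(2)^8 - 1      # 255
gmp::as.bigz(2)^64 - 1     # 18446744073709551615
gmp::as.bigz(2)^128 - 1    # 340282366920938463463374607431768211455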
Signed integers: model of $\mathbb{Z}$. Can do subtraction.
First bit ("most significant bit" or MSB) is the sign bit: 0 for nonnegative numbers, 1 for negative numbers.
Two's complement representation for negative numbers: flip (0 -> 1, 1 -> 0) the remaining bits, then add 1 to the result.
class(5L) # postfix `L` means integer in R
pryr::bits(5L)
pryr::bits(-5L)
pryr::bits(2L * 5L) # shift bits of 5 to the left (why?)
pryr::bits(2L * -5L); # shift bits of -5 to left
Source: Signed Binary Numbers, Subtraction and Overflow by Grant Braught
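A minimal sketch of the two's complement recipe above, using base R's bitwNot (which flips all 32 bits of an integer):
x <- 5L
flipped <- bitwNot(x)      # flip the bits: 0 -> 1, 1 -> 0
flipped + 1L               # add 1 to the result: gives -5L
pryr::bits(flipped + 1L)   # same bit pattern as pryr::bits(-5L) above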
.Machine$integer.max # R uses 32-bit integer
In most other languages:
Type | Min | Max |
---|---|---|
Int8 | -128 | 127 |
Int16 | -32768 | 32767 |
Int32 | -2147483648 | 2147483647 |
Int64 | -9223372036854775808 | 9223372036854775807 |
Int128 | -170141183460469231731687303715884105728 | 170141183460469231731687303715884105727 |
R reports NA for integer overflow and underflow.
# The largest integer R can hold
.Machine$integer.max
M <- 32
big <- 2^(M-1) - 1
as.integer(big)
.Machine$integer.max + 1L
Warning message in .Machine$integer.max + 1L: “NAs produced by integer overflow”
uint8
outputs the result according to modular arithmetic, as do C and Julia.
uint8::as.uint8(255) + uint8::as.uint8(1)
uint8::as.uint8(250) + uint8::as.uint8(15)
[1] 0
[1] 9
Package gmp supports big integers with arbitrary precision.
gmp::as.bigz(.Machine$integer.max ) + gmp::as.bigz(1L)
Big Integer ('bigz') : [1] 2147483648
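Big integers also allow exact evaluation of quantities that lose digits in double precision; a small sketch comparing base R's factorial with gmp's factorialZ:
factorial(30)          # double precision: only about 16 significant digits are reliable
gmp::factorialZ(30)    # exact 33-digit big integer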
Floating-point number system is a computer model for the real line $\mathbb{R}$.
For the history, see an interview with William Kahan.
Humans use the base $b=10$ and digits $d_i=0, 1, \dotsc, 9$.
In computers, the base is $b=2$ and the digits $d_i$ are 0 or 1.
For example, the decimal number 18 can be written as $$ +1.001 \times 2^4 \quad (\text{normalized}) $$ or, equivalently, $$ +0.1001 \times 2^5 \quad (\text{denormalized}).$$
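A quick arithmetic check of this example, just evaluating the two expansions in R:
(1 + 0/2 + 0/4 + 1/8) * 2^4        # normalized   +1.001  x 2^4 = 18
(1/2 + 0/4 + 0/8 + 1/16) * 2^5     # denormalized +0.1001 x 2^5 = 18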
R supports double precision floating point numbers (see below) via the double type.
C supports the floating point types float and double, where in most systems float corresponds to single precision while double corresponds to double precision.
Julia provides Float16 (half precision, implemented in software using Float32), Float32 (single precision), Float64 (double precision), and BigFloat (arbitrary precision).
R has no single precision data type. All real numbers are stored in double precision format. The functions as.single and single are identical to as.double and double except they set the attribute Csingle that is used in the .C and .Fortran interface, and they are intended only to be used in that context. (R Documentation)
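The float package loaded above does provide genuine 32-bit storage (separately from the Csingle attribute mentioned in the quote); a small sketch comparing memory use, where roughly half the bytes should be needed:
x <- runif(1e4)
lobstr::obj_size(x)              # about 80 kB in double precision
lobstr::obj_size(float::fl(x))   # about 40 kB in single precision (plus a small S4 overhead)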
Source: https://en.wikipedia.org/wiki/Half-precision_floating-point_format
In Julia, Float16 is the type for half precision numbers.
MSB is the sign bit.
10 significand bits (fraction=mantissa), hence $p=11$ (why?)
5 exponent bits: $e_{\max}=15$, $e_{\min}=-14$; bias $=15=01111_2$ is added to the exponent for encoding.
$e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers.
range of magnitude: $10^{\pm 4}$ in decimal because $\log_{10} (2^{15}) \approx 4$.
Precision: $\log_{10}2^{11} \approx 3.311$ decimal digits.
# This is Julia
println("Half precision:")
@show bitstring(Float16(5)) # 5 in half precision
@show bitstring(Float16(-5)); # -5 in half precision
Half precision:
bitstring(Float16(5)) = "0100010100000000"
bitstring(Float16(-5)) = "1100010100000000"
Single precision (C float)
Source: https://en.wikipedia.org/wiki/Single-precision_floating-point_format
In C, float is the type for single precision numbers for most systems. In Julia, Float32 is the type for single precision numbers.
# Homework: figure out how this C++ code works
Rcpp::cppFunction('int float32bin(double x) {
float flx = (float) x;
unsigned int binx = *((unsigned int*)&flx);
return binx;
}')
MSB is the sign bit.
23 significand bits ($p=24$).
8 exponent bits: $e_{\max}=127$, $e_{\min}=-126$, bias=127.
$e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers.
range of magnitude: $10^{\pm 38}$ in decimal because $\log_{10} (2^{127}) \approx 38$.
precision: $\log_{10}(2^{24}) \approx 7.225$ decimal digits.
message("Single precision:")
pryr::bits(float32bin(5)) # 5 in single precision
pryr::bits(float32bin(-5)) # -5 in single precision
Single precision:
Double precision (C double)
Source: https://en.wikipedia.org/wiki/Double-precision_floating-point_format
Double precision (64 bits = 8 bytes) numbers are the dominant data type in scientific computing.
In C, double is the type for double precision numbers for most systems. It is the default type for numeric values.
In Julia, Float64 is the type for double precision numbers.
In R, double is the type for double precision numbers.
MSB is the sign bit.
52 significand bits ($p=53$).
11 exponent bits: $e_{\max}=1023$, $e_{\min}=-1022$, bias=1023.
$e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers.
range of magnitude: $10^{\pm 308}$ in decimal because $\log_{10} (2^{1023}) \approx 308$.
precision: $\log_{10}(2^{53}) \approx 15.95$ decimal digits.
message("Double precision:")
pryr::bits(5) # 5 in double precision
pryr::bits(-5) # -5 in double precision
Double precision:
pryr::bits(Inf) # Inf in double precision
pryr::bits(-Inf) # -Inf in double precision
Exponent $e_{\max}+1$ plus a nonzero mantissa means NaN. NaN could be produced from 0 / 0, 0 * Inf, ...
In general NaN ≠ NaN bitwise. Test whether a number is NaN with the is.nan function.
pryr::bits(0 / 0) # NaN
pryr::bits(0 * Inf) # NaN
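As noted above, NaN does not compare equal to anything, including itself; a quick check, with is.nan as the reliable test:
NaN == NaN      # NA, not TRUE: comparisons involving NaN cannot be used as tests
is.nan(0 / 0)   # TRUE
is.na(NaN)      # TRUE: NaN is also treated as missing by is.na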
pryr::bits(0.0) # 0 in double precision
2^(-126) # emin=-126
pryr::bits(float32bin(2^(-126)))
2^(-149) # denormalized
pryr::bits(float32bin(2^(-149)))
Rounding is necessary whenever a number has more than $p$ significand bits. Most computer systems use the default IEEE 754 round to nearest, ties to even mode:
Round to nearest: for example, the number 1/10 cannot be represented accurately as a (binary) floating point number:
pryr::bits(float32bin(0.1)) # single precision, 1001 gets rounded to 101(0)
In single precision, the number 1.1001 1001 1001 1001 1001 1001 1001 ... falls between 1.1001 1001 1001 1001 1001 100 and 1.1001 1001 1001 1001 1001 101. The midway point between these two numbers is 1.1001 1001 1001 1001 1001 1001 0000 0... Since the trailing bits of $0.1_{10}$ exceed this midway point, $0.1_{10}$ is closer to 1.1001 1001 1001 1001 1001 101 and is rounded up.
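A quick numerical check that single precision 0.1 indeed rounds up, using the float package loaded above and printing in double precision to show all digits:
print(as.double(float::fl(0.1)), digits = 20)   # 0.100000001490116..., slightly above 0.1
as.double(float::fl(0.1)) > 0.1                 # TRUE: the single precision value rounded up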
Ties to even: if the number falls midway, it is rounded to the nearest value with an even least significant digit.
For example, consider the case where the precision is 4 binary digits:
Exact number | Rounded value | Remainder bits |
---|---|---|
1.000011*2^1 | 1.000*2^1 | 011 -> round down |
1.000110*2^1 | 1.001*2^1 | 110 -> round up |
1.011100*2^1 | 1.100*2^1 | 100 -> round up (*) |
1.010100*2^1 | 1.010*2^1 | 100 -> round down (**) |
In the third example (*), 1.011 100 is precisely midway between 1.011 and 1.100. The ties to even rule chooses 1.100, i.e., the representation whose least significant bit is zero.
In the fourth example (**), 1.010 100 is also precisely midway between 1.010 and 1.011. The ties to even rule now chooses to round down to 1.010.
To summarize, if the remainder bits below the precision are 1000 0..., then the ties to even rule applies.
Rounding (more fundamentally, finite precision) incurs errors in floating point computation. If a real number $x$ is represented by a floating point number $[x]$, then the relative error is $$ \frac{|[x] - x|}{|x|} \quad (\text{if } x \neq 0). $$
Of course, we want to ensure that the error after a computation is small.
Source: What you never wanted to know about floating point but will be forced to find out
Same number of representable numbers in $(2^i, 2^{i+1}]$ as in $(2^{i+1}, 2^{i+2}]$. Within each such interval, they are uniformly distributed.
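A small sketch of this doubling of the spacing: in double precision, integers are exactly representable up to $2^{53}$, after which the spacing becomes 2.
(2^52 + 1) - 2^52    # 1: spacing in (2^52, 2^53] is 1
(2^53 + 1) - 2^53    # 0: 2^53 + 1 rounds back to 2^53 (spacing is now 2)
(2^53 + 2) - 2^53    # 2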
Machine epsilons are the spacings of numbers around 1:
Source: Computational Statistics, James Gentle, Springer, New York, 2009.
Any real number in the interval $\left[1 - \frac{1}{2^{p+1}}, 1 + \frac{1}{2^p}\right]$ is represented by a floating point number $1 = 1.00\dotsc 0_2 \times 2^0$ (recall ties to even).
Adding $\frac{1}{2^p}$ to 1 won't reach the next representable floating point number $1.00\dotsc 01_2 \times 2^0 = 1 + \frac{1}{2^{p-1}}$. Hence $\epsilon_{\max} > \frac{1}{2^p} = 1.00\dotsc 0_2 \times 2^{-p}$.
Adding the floating point number next to $\frac{1}{2^p} = 1.00\dotsc 0_2 \times 2^{-p}$ to 1 WILL result in $1.00\dotsc 01_2 \times 2^0 = 1 + \frac{1}{2^{p-1}}$, hence $\epsilon_{\max} = 1.00\dotsc 01_2 \times 2^{-p} = \frac{1}{2^p} + \frac{1}{2^{p+p-1}}$.
Subtracting $\frac{1}{2^{p+1}}$ from 1 results in $1-\frac{1}{2^{p+1}} = \frac{1}{2} + \frac{1}{2^2} + \dotsb + \frac{1}{2^p} + \frac{1}{2^{p+1}}$, which is represented by the floating point number $1.00\dotsc 0_2 \times 2^{0} = 1$ by the "ties to even" rule. Hence $\epsilon_{\min} > \frac{1}{2^{p+1}}$.
The smallest positive floating point number larger than $\frac{1}{2^{p+1}}$ is $\frac{1}{2^{p+1}} + \frac{1}{2^{2p}}=1.00\dotsc 1_2 \times 2^{-p-1}$. Thus $\epsilon_{\min} = \frac{1}{2^{p+1}} + \frac{1}{2^{2p}}$.
Machine epsilon is often called the machine precision.
If a positive real number $x \in \mathbb{R}$ is represented by $[x]$ in floating point arithmetic, with $2^e \le [x] < 2^{e+1}$, then the spacing of floating point numbers in this range is $\frac{2^e}{2^{p-1}}$, so round to nearest guarantees $|x - [x]| \le \frac{2^e}{2^p}$.
Thus $x - \frac{2^e}{2^p} < [x] \le x + \frac{2^e}{2^p}$, and $$ \begin{split} \frac{| x - [x] |}{|x|} &\le \frac{2^e}{2^p|x|} \le \frac{2^e}{2^p}\frac{1}{[x]-2^e/2^p} \\ &\le \frac{2^e}{2^p}\frac{1}{2^e(1-1/2^p)} \quad (\because [x] \ge 2^e) \\ &\le \frac{2^e}{2^p}\frac{1}{2^e}(1 + \frac{1}{2^{p-1}}) \\ &= \frac{1}{2^p} + \frac{1}{2^{2p-1}} = \epsilon_{\max}. \end{split} $$ That is, the relative error of the floating point representation $[x]$ of real number $x$ is bounded by $\epsilon_{\max}$.
options(digits=20)
print(2^(-53) + 2^(-105)) # epsilon_max for double
print(1.0 + 2^(-53))
print(1.0 + (2^(-53) + 2^(-105)))
print(1.0 + 2^(-53) + 2^(-105)) # why is the result 1? See "Catastrophic cancellation"
print(as.double(float::fl(2^(-24) + 2^(-47)))) # epsilon_max for float
print(as.double(float::fl(1.0) + float::fl(2^(-24))))
print(as.double(float::fl(1.0) + float::fl(2^(-24) + 2^(-47))))
[1] 1.1102230246251567869e-16
[1] 1
[1] 1.000000000000000222
[1] 1
[1] 5.9604651880817982601e-08
[1] 1
[1] 1.0000001192092895508
print(2^(-54) + 2^(-106)) # epsilon_min for double
print(1 - (2^(-54) + 2^(-106)))
pryr::bits(1.0)
pryr::bits(1 - (2^(-54) + 2^(-106)))
[1] 5.5511151231257839347e-17
[1] 0.99999999999999988898
.Machine contains numerical characteristics of the machine. double.neg.eps is our $\epsilon_{\max}$.
.Machine
The IEEE 754 standard guarantees that arithmetic operations on floating point numbers are correctly rounded: the computed result is the floating point representation of the exact arithmetic result of the operands. For example, suppose $x$ and $y$ are two exact real numbers. Let $[x]$ and $[y]$ be their floating point representations, after the appropriate rounding. Then the addition of these two numbers is computed as follows. $$ \begin{split} z &= [x] + [y] \quad \text{(exact arithmetic)} \\ z &\gets [z] \quad \text{(rounding)} \end{split} $$
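A small illustration of this correctly-rounded behavior (a sketch, not from the notes above): 0.1 + 0.2 is the correctly rounded sum of [0.1] and [0.2], which lands one spacing above [0.3].
0.1 + 0.2 == 0.3     # FALSE: the two roundings differ by one unit in the last place
(0.1 + 0.2) - 0.3    # 2^(-54), about 5.55e-17
pryr::bits(0.1 + 0.2)
pryr::bits(0.3)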
For double precision, the range is $\pm 10^{\pm 308}$. In most situations, underflow (magnitude of result is less than $10^{-308}$) is preferred over overflow (magnitude of result is larger than $10^{+308}$). Overflow produces $\pm\infty$. Underflow yields zeros or denormalized numbers.
E.g., the logit link function is $$ p = \frac{\exp(x^T\beta)}{1 + \exp(x^T\beta)} = \frac{1}{1 + \exp(-x^T\beta)}. $$
The former expression can easily lead to Inf / Inf = NaN, while the latter expression leads to gradual underflow.
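A quick numerical comparison of the two algebraically equivalent forms; the value 800 below is just an arbitrary large linear predictor chosen for illustration.
eta <- 800
exp(eta) / (1 + exp(eta))   # NaN: Inf / Inf
1 / (1 + exp(-eta))         # 1: exp(-800) underflows gracefully to 0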
.Machine$double.xmax and .Machine$double.xmin give the largest and smallest positive non-subnormal numbers representable in double precision.
What happens when the computer calculates $a+b$? We get $a+b=a$!
(a <- 2.0^30)
(b <- 2.0^-30)
(a + b == a)
pryr::bits(a)
pryr::bits(a + b)
Analysis: suppose we want to compute $x + y$, $x, y > 0$. Let the relative error of $x$ and $y$ be $$ \delta_x = \frac{[x] - x}{x}, \quad \delta_y = \frac{[y] - y}{y} . $$ What the computer actually calculates is $[x] + [y]$, which in turn is represented by $[ [x] + [y] ]$. The relative error of this representation is $$ \delta_{\text{sum}} = \frac{[[x]+[y]] - ([x]+[y])}{[x]+[y]} . $$ Recall that $|\delta_x|, |\delta_y|, |\delta_{\text{sum}}| \le \epsilon_{\max}$.
We want to find a bound of the relative error of $[[x]+[y]]$ with respect to $x+y$. Since $|[x]+[y]| = |x(1+\delta_x) + y(1+\delta_y)| \le |x+y|(1+\epsilon_{\max})$, we have $$ \begin{split} | [[x]+[y]]-(x+y) | &= | [[x]+[y]] - [x] - [y] + [x] - x + [y] - y | \\ &\le | [[x]+[y]] - [x] - [y] | + | [x] - x | + | [y] - y | \\ &\le |\delta_{\text{sum}}([x]+[y])| + |\delta_x x| + |\delta_y y| \\ &\le \epsilon_{\max}(x+y)(1+\epsilon_{\max}) + \epsilon_{\max}x + \epsilon_{\max}y \\ &\approx 2\epsilon_{\max}(x+y) \end{split} $$ because $\epsilon_{\max}^2 \approx 0$. Thus $$ \frac{| [[x]+[y]]-(x+y) |}{|x+y|} \le 2\epsilon_{\max} $$ approximately.
The result is $1.vvvvu...u$ where $u$ are unassigned digits.
olddigits <- options('digits')$digits
options(digits=20)
a <- float::fl(1.2345678) # rounding
pryr::bits(float32bin(as.double(a))) # rounding
b <- float::fl(1.2345677)
pryr::bits(float32bin(as.double(b)))
print(as.double(a - b)) # correct result should be 1f-7
pryr::bits(float32bin(as.double(a - b))) # must be 1.0000...0 x 2^(-23)
print(1/2^23)
options(digits=olddigits)
[1] 1.1920928955078125e-07
[1] 1.1920928955078125e-07
Analysis: Let $$ [x] = 1 + \sum_{i=1}^{p-2}\frac{d_{i+1}}{2^i} + \frac{1}{2^{p-1}}, \quad [y] = 1 + \sum_{i=1}^{p-2}\frac{d_{i+1}}{2^i} + \frac{0}{2^{p-1}} . $$
$[x]-[y] = \frac{1}{2^{p-1}} = [[x]-[y]]$.
The true difference $x-y$ may lie anywhere in $(0, \frac{1}{2^{p-2}}+\frac{1}{2^{2p}}]$.
Relative error is unbounded -- no guarantee of any significant digit!
Floating-point numbers may violate many algebraic laws we are familiar with, such as the associative and distributive laws. See the example in the Machine Epsilon section and HW1.
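One tiny illustration of the failure of associativity (one of many possible examples):
(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)   # FALSE in double precision
(0.1 + 0.2) + 0.3
0.1 + (0.2 + 0.3)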
Section II.2, Computational Statistics by James Gentle (2009).
Sections 1.5 and 2.2, Applied Numerical Linear Algebra by James W. Demmel (1997).
What every computer scientist should know about floating-point arithmetic by David Goldberg (1991).