versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 2 on 8 virtual cores
using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
  Activating new project at `~/Dropbox/class/M1399.000200/2023/M1300_000200-2023fall/lectures/03-arith`
No Changes to `~/Dropbox/class/M1399.000200/2023/M1300_000200-2023fall/lectures/03-arith/Project.toml`
No Changes to `~/Dropbox/class/M1399.000200/2023/M1300_000200-2023fall/lectures/03-arith/Manifest.toml`
Status `~/Dropbox/class/M1399.000200/2023/M1300_000200-2023fall/lectures/03-arith/Project.toml` (empty project)
The Julia function `Base.summarysize` shows the amount of memory (in bytes) used by an object.
x = rand(100, 100)
Base.summarysize(x)
80040
The `varinfo()` function prints all variables in the workspace and their sizes.
varinfo() # similar to Matlab whos()
| name | size       | summary                 |
|:-----|:-----------|:------------------------|
| Base |            | Module                  |
| Core |            | Module                  |
| Main |            | Module                  |
| x    | 78.164 KiB | 100×100 Matrix{Float64} |
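As an added illustration (not part of the original cells), `Base.summarysize` scales with the element type, and `varinfo` accepts a regular expression to filter variable names:

# added example: a single-precision matrix uses roughly half the memory of x above
y = rand(Float32, 100, 100)
@show Base.summarysize(y)
varinfo(r"^[xy]$")  # list only the variables x and y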
.jl, .r, .c, .cpp, .ipynb, .html, .tex, ...
# integers 0, 1, ..., 127 and corresponding ascii character
[0:127 Char.(0:127)]
128×2 Matrix{Any}: 0 '\0' 1 '\x01' 2 '\x02' 3 '\x03' 4 '\x04' 5 '\x05' 6 '\x06' 7 '\a' 8 '\b' 9 '\t' 10 '\n' 11 '\v' 12 '\f' ⋮ 116 't' 117 'u' 118 'v' 119 'w' 120 'x' 121 'y' 122 'z' 123 '{' 124 '|' 125 '}' 126 '~' 127 '\x7f'
# integers 128, 129, ..., 255 and corresponding extended ascii character
# show(STDOUT, "text/plain", [128:255 Char.(128:255)])
[128:255 Char.(128:255)]
128×2 Matrix{Any}: 128 '\u80' 129 '\u81' 130 '\u82' 131 '\u83' 132 '\u84' 133 '\u85' 134 '\u86' 135 '\u87' 136 '\u88' 137 '\u89' 138 '\u8a' 139 '\u8b' 140 '\u8c' ⋮ 244 'ô' 245 'õ' 246 'ö' 247 '÷' 248 'ø' 249 'ù' 250 'ú' 251 'û' 252 'ü' 253 'ý' 254 'þ' 255 'ÿ'
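A small added example of converting between integers and characters, which makes the tables above easy to reproduce for individual code points:

@show Int('J')    # code point of a character
@show Char(74)    # character of a code point
@show 'a' + 1;    # Char arithmetic: next character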
Unicode encodings UTF-8, UTF-16, and UTF-32 support many more characters, including foreign characters; the first 128 code points (7 bits) coincide with ASCII.
Julia supports the full range of UTF-8 characters. You can type many Unicode math symbols by typing the backslashed LaTeX symbol name followed by tab.
# \beta-<tab>
β = 0.0
# \beta-<tab>-\hat-<tab>
β̂ = 0.0
0.0
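As a quick added sketch of UTF-8 in action, a string mixing ASCII and non-ASCII characters stores different numbers of bytes per character:

s = "β = 2.0"          # a string with one non-ASCII character
@show length(s)        # 7 characters
@show sizeof(s)        # 8 bytes: β takes 2 bytes in UTF-8
@show codeunits(s);    # the raw UTF-8 bytes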
The `integer` type in R has $M=32$ (packages such as `bit64` support 64-bit integers).
C has `(unsigned) char`, `int`, `short`, `long` (and `long long`), whose sizes depend on the machine.
Matlab has `(u)int8`, `(u)int16`, `(u)int32`, `(u)int64`.
`typemin(T)` and `typemax(T)` give the lowest and highest representable numbers of a type `T`, respectively.
for t in [UInt8, UInt16, UInt32, UInt64, UInt128]
    println(t, '\t', typemin(t), '\t', typemax(t))
end
UInt8     0    255
UInt16    0    65535
UInt32    0    4294967295
UInt64    0    18446744073709551615
UInt128   0    340282366920938463463374607431768211455
Model of $\mathbb{Z}$. Can do subtraction.
First bit ("most significant bit" or MSB) is the sign bit.
`0` for nonnegative numbers, `1` for negative numbers.
Two's complement representation for negative numbers: flip (`0` -> `1`, `1` -> `0`) the remaining bits, then add `1` to the result.
@show typeof(5)
@show bitstring(5)
@show bitstring(-5)
@show bitstring(UInt64(Int128(2)^64 - 5)) == bitstring(-5)
@show bitstring(2 * 5) # shift bits of 5 to the left
@show bitstring(2 * -5); # shift bits of -5 to left
typeof(5) = Int64 bitstring(5) = "0000000000000000000000000000000000000000000000000000000000000101" bitstring(-5) = "1111111111111111111111111111111111111111111111111111111111111011" bitstring(UInt64(Int128(2) ^ 64 - 5)) == bitstring(-5) = true bitstring(2 * 5) = "0000000000000000000000000000000000000000000000000000000000001010" bitstring(2 * -5) = "1111111111111111111111111111111111111111111111111111111111110110"
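As an added sanity check, the flip-the-bits-then-add-one recipe can be carried out directly with the bitwise complement operator `~`:

x = Int8(5)
@show bitstring(x)
@show bitstring(~x)               # flip all bits (one's complement)
@show bitstring(~x + Int8(1))     # add 1: two's complement
@show ~x + Int8(1) == Int8(-5);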
typemin(Int64), typemax(Int64)
(-9223372036854775808, 9223372036854775807)
for T in [Int8, Int16, Int32, Int64, Int128]
println(T, '\t', typemin(T), '\t', typemax(T))
end
Int8     -128                                      127
Int16    -32768                                    32767
Int32    -2147483648                               2147483647
Int64    -9223372036854775808                      9223372036854775807
Int128   -170141183460469231731687303715884105728  170141183460469231731687303715884105727
@show typemax(Int128)
@show typemax(Int128) + 1 # overflow!
@show BigInt(typemax(Int128)) + 1;
typemax(Int128) = 170141183460469231731687303715884105727 typemax(Int128) + 1 = -170141183460469231731687303715884105728 BigInt(typemax(Int128)) + 1 = 170141183460469231731687303715884105728
R reports NA
for integer overflow.
Julia outputs the result according to modular arithmetic.
@show typemax(Int32)
@show typemax(Int32) + Int32(1); # modular arithmetic!
typemax(Int32) = 2147483647 typemax(Int32) + Int32(1) = -2147483648
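If silent wrap-around is undesirable, Julia's checked arithmetic (e.g., `Base.checked_add`) throws an error instead; a small added illustration:

try
    Base.checked_add(typemax(Int32), Int32(1))
catch err
    @show err   # OverflowError
end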
using RCall
# The largest integer R can hold
R"""
.Machine$integer.max
"""
RObject{IntSxp} [1] 2147483647
R"""
big <- .Machine$integer.max
as.integer(big + 1)
"""
┌ Warning: RCall.jl: Warning: NAs introduced by coercion to integer range └ @ RCall /Users/jhwon/.julia/packages/RCall/iMDW2/src/io.jl:160
RObject{IntSxp} [1] NA
Floating-point number system is a computer model for $\mathbb{R}$.
For the history, see an interview with William Kahan.
Humans use the base $b=10$ and digits $d_i=0, 1, \dotsc, 9$.
In computers, the base is $b=2$ and the digits $d_i$ are 0 or 1. The exponent $e$ is between $e_{\min}$ and $e_{\max}$.
For example, $18 = +1.001_2 \times 2^4$ (normalized) or, equivalently, $$ +0.1001_2 \times 2^5 \quad (\text{denormalized}).$$
`Float16` (half precision, implemented in software), `Float32` (single precision), `Float64` (double precision), and `BigFloat` (arbitrary precision).
[Figure: IEEE 754 half-precision floating-point format]
::: {style="font-size: 50%;"}
https://en.wikipedia.org/wiki/Half-precision_floating-point_format
:::
In Julia, Float16
is the type for half precision numbers.
MSB is the sign bit.
10 significand bits (fraction=mantissa), hence $p=11$ (why?)
5 exponent bits: $e_{\max}=15$, $e_{\min}=-14$, bias=15 = $01111_2$ for encoding:
$e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers.
range of magnitude: $10^{\pm 4}$ in decimal because $\log_{10} (2^{15}) \approx 4$.
Precision: $\log_{10}2^{11} \approx 3.311$ decimal digits. $$ (value) = (-1)^{b_{15}}\times 2^{(\sum_{j=1}^5 b_{15-j}2^{5-j}) - 15} \times \left( 1 + \sum_{i=1}^{10}\frac{b_{10-i}}{2^i}\right) $$
println("Half precision:")
@show bitstring(Float16(5)) # 5 in half precision
@show bitstring(Float16(-5)); # -5 in half precision
Half precision: bitstring(Float16(5)) = "0100010100000000" bitstring(Float16(-5)) = "1100010100000000"
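To connect the bit pattern with the formula above, here is an added sketch that decodes the half-precision bits of 5 by hand:

b = bitstring(Float16(5))                  # "0100010100000000"
s = parse(Int, b[1:1])                     # sign bit
e = parse(Int, b[2:6]; base = 2)           # biased exponent (bias = 15)
f = parse(Int, b[7:16]; base = 2)          # 10 fraction bits
@show (-1)^s * 2.0^(e - 15) * (1 + f / 2^10);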
[Figure: IEEE 754 single-precision floating-point format]
::: {style="font-size: 50%;"}
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
:::
In Julia, Float32
is the type for single precision numbers.
MSB is the sign bit.
23 significand bits ($p=24$).
8 exponent bits: $e_{\max}=127$, $e_{\min}=-126$, bias=127.
$e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers.
range of magnitude: $10^{\pm 38}$ in decimal because $\log_{10} (2^{127}) \approx 38$.
precision: $\log_{10}(2^{24}) \approx 7.225$ decimal digits.
println("Single precision:")
@show bitstring(Float32(5)) # 5 in single precision
@show bitstring(Float32(-5)); # -5 in single precision
Single precision: bitstring(Float32(5)) = "01000000101000000000000000000000" bitstring(Float32(-5)) = "11000000101000000000000000000000"
[Figure: IEEE 754 double-precision floating-point format]
::: {style="font-size: 50%;"}
https://en.wikipedia.org/wiki/Double-precision_floating-point_format
:::
Double precision (64 bits = 8 bytes) numbers are the dominant data type in scientific computing.
In Julia, Float64
is the type for double precision numbers.
MSB is the sign bit.
52 significand bits ($p=53$).
11 exponent bits: $e_{\max}=1023$, $e_{\min}=-1022$, bias=1023.
$e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers.
range of magnitude: $10^{\pm 308}$ in decimal because $\log_{10} (2^{1023}) \approx 308$.
precision: $\log_{10}(2^{53}) \approx 15.95$ decimal digits.
println("Double precision:")
@show bitstring(Float64(5)) # 5 in double precision
@show bitstring(Float64(-5)); # -5 in double precision
Double precision: bitstring(Float64(5)) = "0100000000010100000000000000000000000000000000000000000000000000" bitstring(Float64(-5)) = "1100000000010100000000000000000000000000000000000000000000000000"
@show bitstring(Inf) # Inf in double precision
@show bitstring(-Inf); # -Inf in double precision
bitstring(Inf) = "0111111111110000000000000000000000000000000000000000000000000000" bitstring(-Inf) = "1111111111110000000000000000000000000000000000000000000000000000"
Exponent $e_{\max}+1$ plus a nonzero mantissa means `NaN`. `NaN` could be produced from `0 / 0`, `0 * Inf`, ...
In general `NaN ≠ NaN` bitwise. Test whether a number is `NaN` by the `isnan` function.
@show bitstring(0 / 0) # NaN
@show bitstring(0 * Inf); # NaN
bitstring(0 / 0) = "1111111111111000000000000000000000000000000000000000000000000000" bitstring(0Inf) = "1111111111111000000000000000000000000000000000000000000000000000"
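A short added illustration of how `NaN` behaves under comparison and how `isnan` is used:

@show NaN == NaN       # false: NaN is not equal to anything, even itself
@show NaN != NaN       # true
@show isnan(0 / 0)     # true
@show isnan(1 / 0);    # false: 1 / 0 is Inf, not NaN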
@show bitstring(0.0); # 0 in double precision
bitstring(0.0) = "0000000000000000000000000000000000000000000000000000000000000000"
@show Float16(2^(-14)) # emin=-14
@show bitstring(Float16(2^(-14)));
@show Float16(2^(-24)) # smaller than 2^(-14) = floatmin(Float16)
@show bitstring(Float16(2^(-24))); # denormalized
Float16(2 ^ -14) = Float16(6.104e-5) bitstring(Float16(2 ^ -14)) = "0000010000000000" Float16(2 ^ -24) = Float16(6.0e-8) bitstring(Float16(2 ^ -24)) = "0000000000000001"
@show nextfloat(Float16(0.0)) # next representable number
@show bitstring(nextfloat(Float16(0.0))); # denormalized
nextfloat(Float16(0.0)) = Float16(6.0e-8) bitstring(nextfloat(Float16(0.0))) = "0000000000000001"
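Julia's `issubnormal` tests directly whether a value is denormalized (added example):

@show issubnormal(nextfloat(0.0))        # smallest positive subnormal Float64
@show issubnormal(floatmin(Float64))     # smallest positive normal Float64
@show issubnormal(Float16(2^(-24)));     # the half-precision example above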
Rounding is necessary whenever a number has more than $p$ significand bits. Most computer systems use the default IEEE 754 round to nearest mode (also called ties to even mode). Julia offers several rounding modes, the default being `RoundNearest`.
"Round to nearest, ties to even" rule: rounds to the nearest value; if the number falls midway, it is rounded to the nearest value with an even least significant digit (i.e., zero; default for IEEE 754 binary floating point numbers)
@show bitstring(0.1); # double precision Float64
@show bitstring(0.1f0); # single precision Float32, 1001 gets rounded to 101(0)
@show bitstring(Float16(0.1)); # half precision Float16, 1001 gets rounded to 101(0)
bitstring(0.1) = "0011111110111001100110011001100110011001100110011001100110011010" bitstring(0.1f0) = "00111101110011001100110011001101" bitstring(Float16(0.1)) = "0010111001100110"
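The same ties-to-even rule shows up when rounding exact halves with `round` (an added illustration; 0.5, 1.5, 2.5, 3.5 are exactly representable, so the ties are real):

@show round(0.5)    # ties to even: 0.0
@show round(1.5)    # 2.0
@show round(2.5)    # ties to even: 2.0
@show round(3.5);   # 4.0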
Rounding (more fundamentally, finite precision) incurs errors in floating point computation. If a real number $x$ is represented by a floating point number $[x]$, then
Absolute error: $| [x] - x |$
Relative error: $\frac{| [x] - x |}{|x|}$
(if $x \neq 0$).
Of course, we want to ensure that the error after a computation is small.
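For instance (an added sketch), the representation error of 0.1 in single precision can be measured against a higher-precision reference:

x  = BigFloat("0.1")              # (nearly) exact reference value of 0.1
fx = Float32(0.1)                 # [x], the single-precision representation
@show abs(BigFloat(fx) - x)       # absolute error
@show abs(BigFloat(fx) - x) / x;  # relative error, well below eps(Float32)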
[Figure]
::: {style="font-size: 50%;"}
Source: What you never wanted to know about floating point but will be forced to find out
:::
[Figure]
::: {style="font-size: 50%;"}
Source: Computational Statistics, James Gentle, Springer, New York, 2009.
Caution: the definition of $\epsilon_{\max}$ and $\epsilon_{\min}$ in this book is different from the lecture note.
:::
Any real number in the interval $\left[1 - \frac{1}{2^{p+1}}, 1 + \frac{1}{2^p}\right]=[1.111\dotsb1|1_2 \times 2^{-1}, 1.000\dotsb 0|1_2 \times 2^0]$ is represented by the floating point number $1 = 1.00\dotsc 0_2 \times 2^0$ (assuming the "ties to even" rule: consider the $p=2$ case).
Adding $\frac{1}{2^p}$ to 1 won't reach the next representable floating point number $1.00\dotsc 01_2 \times 2^0 = 1 + \frac{1}{2^{p-1}}$. Hence $\epsilon_{\max} > \frac{1}{2^p} = 1.00\dotsc 0_2 \times 2^{-p}$.
Adding the floating point number next to $\frac{1}{2^p} = 1.00\dotsc 0_2 \times 2^{-p}$ to 1 WILL result in $1.00\dotsc 01_2 \times 2^0 = 1 + \frac{1}{2^{p-1}}$, hence $\epsilon_{\max} = 1.00\dotsb 01_2 \times 2^{-p} = \frac{1}{2^p} + \frac{1}{2^{p+p-1}}$.
Subtracting $\frac{1}{2^{p+1}}$ from 1 results in $1-\frac{1}{2^{p+1}} = \frac{1}{2} + \frac{1}{2^2} + \dotsb + \frac{1}{2^p} + \frac{1}{2^{p+1}}$, which is represented by the floating point number $1.00\dotsb 0_2 \times 2^{0} = 1$ by the "ties to even" rule. Hence $\epsilon_{\min} > \frac{1}{2^{p+1}}$.
The smallest positive floating point number larger than $\frac{1}{2^{p+1}}$ is $\frac{1}{2^{p+1}} + \frac{1}{2^{2p}}=1.00\dotsc 1_2 \times 2^{-p-1}$. Thus $\epsilon_{\min} = \frac{1}{2^{p+1}} + \frac{1}{2^{2p}}$.
Machine epsilon is often called the machine precision.
If a positive real number $x \in \mathbb{R}$ is represented by $[x]$ in the floating point arithmetic, then $$ [x] = \left( 1 + \sum_{i=1}^{p-1}\frac{d_{i+1}}{2^i}\right) \times 2^e. $$
Thus $x - \frac{2^e}{2^p} \le [x] \le x + \frac{2^e}{2^p}$, and $$ \begin{split} \frac{| x - [x] |}{|x|} &\le \frac{2^e}{2^p|x|} \le \frac{2^e}{2^p}\frac{1}{[x]-2^e/2^p} \\ &\le \frac{2^e}{2^p}\frac{1}{2^e(1-1/2^p)} \quad (\because [x] \ge 2^e) \\ &\le \frac{2^e}{2^p}\frac{1}{2^e}(1 + \frac{1}{2^{p-1}}) \\ &= \frac{1}{2^p} + \frac{1}{2^{2p-1}} = \epsilon_{\max}. \end{split} $$ That is, the relative error of the floating point representation $[x]$ of real number $x$ is bounded by $\epsilon_{\max}$.
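A quick numerical spot check of this bound (added here): for Float32, the relative representation error of randomly drawn positive numbers never exceeds $\epsilon_{\max} = 2^{-24} + 2^{-47}$, provided the magnitudes stay inside the normal Float32 range (an assumption built into the sample below).

ϵmax = 2.0^(-24) + 2.0^(-47)                          # ϵ_max for Float32
xs = (1 .+ rand(10_000)) .* exp10.(rand(-30:30, 10_000))  # magnitudes within Float32 range
relerrs = abs.(big.(Float32.(xs)) .- xs) ./ xs        # relative representation errors
@show maximum(relerrs) ≤ ϵmax;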
@show 2^(-53) + 2^(-105); # epsilon_max for Float64
@show 1.0 + 2^(-53);
@show 1.0 + (2^(-53) + 2^(-105));
@show 1.0 + 2^(-53) + 2^(-105); # why this result? See "Catastrophic cancellation"
@show Float32(2^(-24) + 2^(-47)); # epsilon_max for Float32
@show 1.0f0 + Float32(2^(-24));
@show 1.0f0 + Float32(2^(-24) + 2^(-47));
2 ^ -53 + 2 ^ -105 = 1.1102230246251568e-16 1.0 + 2 ^ -53 = 1.0 1.0 + (2 ^ -53 + 2 ^ -105) = 1.0000000000000002 1.0 + 2 ^ -53 + 2 ^ -105 = 1.0 Float32(2 ^ -24 + 2 ^ -47) = 5.960465f-8 1.0f0 + Float32(2 ^ -24) = 1.0f0 1.0f0 + Float32(2 ^ -24 + 2 ^ -47) = 1.0000001f0
@show 2^(-54) + 2^(-106); # epsilon_min for Float64
@show 1 - (2^(-54) + 2^(-106))
@show bitstring(1.0)
@show bitstring(1 - (2^(-54) + 2^(-106)))
2 ^ -54 + 2 ^ -106 = 5.551115123125784e-17 1 - (2 ^ -54 + 2 ^ -106) = 0.9999999999999999 bitstring(1.0) = "0011111111110000000000000000000000000000000000000000000000000000" bitstring(1 - (2 ^ -54 + 2 ^ -106)) = "0011111111101111111111111111111111111111111111111111111111111111"
"0011111111101111111111111111111111111111111111111111111111111111"
In Julia, eps(x)
gives the distance between consecutive representable floating point values at x
, i.e.,
eps(x) == max(x-prevfloat(x), nextfloat(x)-x)
which is roughly twice the $\epsilon_{\max}$.
@show eps(Float32) # machine epsilon for a floating point type, roughly twice our \epsilon_{\max}
@show eps(Float64) # same as eps()
# eps(x) is the spacing after x
@show eps(100.0)
@show eps(0.0)
# nextfloat(x) and prevfloat(x) give the neighbors of x
@show x = 1.25f0
@show prevfloat(x), x, nextfloat(x)
@show bitstring(prevfloat(x)), bitstring(x), bitstring(nextfloat(x));
eps(Float32) = 1.1920929f-7 eps(Float64) = 2.220446049250313e-16 eps(100.0) = 1.4210854715202004e-14 eps(0.0) = 5.0e-324 x = 1.25f0 = 1.25f0 (prevfloat(x), x, nextfloat(x)) = (1.2499999f0, 1.25f0, 1.2500001f0) (bitstring(prevfloat(x)), bitstring(x), bitstring(nextfloat(x))) = ("00111111100111111111111111111111", "00111111101000000000000000000000", "00111111101000000000000000000001")
In R, `.Machine` contains numerical characteristics of the machine. `double.eps` and `double.neg.eps` are roughly twice our $\epsilon_{\max}$ and $\epsilon_{\min}$, respectively.
R"""
.Machine
"""
RObject{VecSxp} $double.eps [1] 2.220446e-16 $double.neg.eps [1] 1.110223e-16 $double.xmin [1] 2.225074e-308 $double.xmax [1] 1.797693e+308 $double.base [1] 2 $double.digits [1] 53 $double.rounding [1] 5 $double.guard [1] 0 $double.ulp.digits [1] -52 $double.neg.ulp.digits [1] -53 $double.exponent [1] 11 $double.min.exp [1] -1022 $double.max.exp [1] 1024 $integer.max [1] 2147483647 $sizeof.long [1] 8 $sizeof.longlong [1] 8 $sizeof.longdouble [1] 16 $sizeof.pointer [1] 8 $longdouble.eps [1] 1.084202e-19 $longdouble.neg.eps [1] 5.421011e-20 $longdouble.digits [1] 64 $longdouble.rounding [1] 5 $longdouble.guard [1] 0 $longdouble.ulp.digits [1] -63 $longdouble.neg.ulp.digits [1] -64 $longdouble.exponent [1] 15 $longdouble.min.exp [1] -16382 $longdouble.max.exp [1] 16384
For double precision, the range is $\pm 10^{\pm 308}$. In most situations, underflow (magnitude of result is less than $10^{-308}$) is preferred over overflow (magnitude of result is larger than $10^{+308}$). Overflow produces $\pm\infty$. Underflow yields zeros or subnormal numbers.
Example: the logit link function gives $$ p = \frac{\exp(x^T\beta)}{1 + \exp(x^T\beta)} = \frac{1}{1 + \exp(-x^T\beta)} . $$
The former expression can easily lead to `Inf / Inf = NaN`, while the latter expression leads to gradual underflow.
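An added illustration with a hypothetical linear predictor value $\eta = \pm 800$, large enough for $\exp(\eta)$ to overflow in double precision:

η = 800.0                        # hypothetical large linear predictor
@show exp(η) / (1 + exp(η))      # Inf / Inf = NaN
@show 1 / (1 + exp(-η))          # exp(-800) underflows gracefully; result is 1.0
η = -800.0
@show exp(η) / (1 + exp(η))      # gradual underflow: 0.0
@show 1 / (1 + exp(-η));         # 1 / Inf = 0.0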
The `floatmin` and `floatmax` functions give the smallest and largest positive normal (non-subnormal) numbers representable by the given floating point type.
for T in [Float16, Float32, Float64]
    println(T, '\t', floatmin(T), '\t', floatmax(T), '\t', typemin(T),
        '\t', typemax(T), '\t', eps(T))
end
Float16   6.104e-5                 6.55e4                  -Inf   Inf   0.000977
Float32   1.1754944e-38            3.4028235e38            -Inf   Inf   1.1920929e-7
Float64   2.2250738585072014e-308  1.7976931348623157e308  -Inf   Inf   2.220446049250313e-16
`BigFloat` in Julia offers arbitrary precision.
@show precision(BigFloat)
@show floatmin(BigFloat)
@show floatmax(BigFloat);
precision(BigFloat) = 256 floatmin(BigFloat) = 8.50969131174083613912978790962048280567755996982969624908264897850135431080301e-1388255822130839284 floatmax(BigFloat) = 5.875653789111587590936911998878442589938516392745498308333779606469323584389875e+1388255822130839282
@show BigFloat(π); # default precision for BigFloat is 256 bits
# set precision to 1024 bits
setprecision(BigFloat, 1024) do
@show BigFloat(π)
end;
BigFloat(π) = 3.141592653589793238462643383279502884197169399375105820974944592307816406286198 BigFloat(π) = 3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481117450284102701938521105559644622948954930381964428810975665933446128475648233786783165271201909145648566923460348610454326648213393607260249141273724586997
When two numbers of vastly different magnitudes are added, the digits of the smaller number beyond the $p$ significand bits of the result are lost to rounding.
What happens when the computer calculates $a+b$ with $a \gg b$? We can get $a+b=a$!
@show a = 2.0^30
@show b = 2.0^-30
@show a + b == a
@show bitstring(a)
@show bitstring(a + b);
a = 2.0 ^ 30 = 1.073741824e9 b = 2.0 ^ -30 = 9.313225746154785e-10 a + b == a = true bitstring(a) = "0100000111010000000000000000000000000000000000000000000000000000" bitstring(a + b) = "0100000111010000000000000000000000000000000000000000000000000000"
Analysis: suppose we want to compute $x + y$, $x, y > 0$. Let the relative error of $x$ and $y$ be $$ \delta_x = \frac{[x] - x}{x}, \quad \delta_y = \frac{[y] - y}{y} . $$ What the computer actually calculates is $[x] + [y]$, which in turn is represented by $[ [x] + [y] ]$. The relative error of this representation is $$ \delta_{\text{sum}} = \frac{[[x]+[y]] - ([x]+[y])}{[x]+[y]} . $$ Recall that $|\delta_x|, |\delta_y|, |\delta_{\text{sum}}| \le \epsilon_{\max}$.
We want to find a bound of the relative error of $[[x]+[y]]$ with respect to $x+y$. Since $|[x]+[y]| = |x(1+\delta_x) + y(1+\delta_y)| \le |x+y|(1+\epsilon_{\max})$, we have $$ \small \begin{split} | [[x]+[y]]-(x+y) | &= | [[x]+[y]] - [x] - [y] + [x] - x + [y] - y | \\ &\le | [[x]+[y]] - [x] - [y] | + | [x] - x | + | [y] - y | \\ &= |\delta_{\text{sum}}([x]+[y])| + |\delta_x x| + |\delta_y y| \\ &\le \epsilon_{\max}(x+y)(1+\epsilon_{\max}) + \epsilon_{\max}x + \epsilon_{\max}y \\ &\approx 2\epsilon_{\max}(x+y) \end{split} $$ because $\epsilon_{\max}^2 \approx 0$.
Thus $$ \frac{| [[x]+[y]]-(x+y) |}{|x+y|} \le 2\epsilon_{\max} $$ approximately.
Subtraction of two nearly equal numbers cancels the leading significant digits: after renormalization, the result is $1.vvvvu \dotsb u$ where the $u$'s are unassigned digits.
a = 1.2345678f0 # rounding
@show bitstring(a) # rounding
b = 1.2345677f0
@show bitstring(b)
@show a - b # correct result should be 1f-7
@show bitstring(a - b) # must be 1.0000...0 x 2^(-23)
@show Float32(1/2^23);
bitstring(a) = "00111111100111100000011001010001" bitstring(b) = "00111111100111100000011001010000" a - b = 1.1920929f-7 bitstring(a - b) = "00110100000000000000000000000000" Float32(1 / 2 ^ 23) = 1.1920929f-7
Analysis: Let $$ [x] = 1 + \sum_{i=1}^{p-2}\frac{d_{i+1}}{2^i} + \frac{1}{2^{p-1}}, \quad [y] = 1 + \sum_{i=1}^{p-2}\frac{d_{i+1}}{2^i} + \frac{0}{2^{p-1}} . $$
$[x]-[y] = \frac{1}{2^{p-1}} = [[x]-[y]]$.
The true difference $x-y$ may lie anywhere in $(0, \frac{1}{2^{p-2}}]$.
The relative error is unbounded -- no guarantee of any significant digit!
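A classic added example: evaluating $1 - \cos x$ for small $x$ directly suffers catastrophic cancellation, while the mathematically equivalent $2\sin^2(x/2)$ does not.

x = 1.0e-8
@show 1 - cos(x)                   # catastrophic cancellation: all significant digits lost
@show 2 * sin(x / 2)^2             # equivalent formula, no cancellation
@show Float64(1 - cos(big(x)));    # high-precision reference value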
Floating-point numbers may violate many algebraic laws we are familiar with, such as the associative and distributive laws. See the example in the Machine Epsilon section and HW1.
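For example (added), floating point addition is not associative:

@show (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)   # false
@show (0.1 + 0.2) + 0.3
@show 0.1 + (0.2 + 0.3);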
Consider solving a linear system $Ax=b$:
$$ \begin{bmatrix} 1.000 & 0.500 \\ 0.667 & 0.333 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.500 \\ 1.000 \end{bmatrix} , $$
whose solution is $(x_1, x_2) = (1.000, 1.000)$.
A=[1.0 0.5; 0.667 0.333]; b = [1.5, 1.0]; A\b
2-element Vector{Float64}: 0.9999999999998335 1.000000000000333
If we perturb $b$ by 0.001, i.e., solve
$$ \begin{bmatrix} 1.000 & 0.500 \\ 0.667 & 0.333 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.500 \\ 0.999 \end{bmatrix} , $$
then the solution changes to $(x_1, x_2) = (0.000, 3.000)$.
b1 = [1.5, 0.999]; A\b1
2-element Vector{Float64}: -1.6653345369377348e-13 3.000000000000333
If we instead perturb $A$ by 0.001, i.e., solve
$$ \begin{bmatrix} 1.000 & 0.500 \\ 0.667 & 0.334 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.500 \\ 1.000 \end{bmatrix} , $$
then this time the solution changes to $(x_1, x_2) = (2.000, -1.000)$.
A1=[1.0 0.5; 0.667 0.334]; A1\b
2-element Vector{Float64}: 2.0000000000001665 -1.000000000000333
In other words, an input perturbation of order $10^{-3}$ yields an output perturbation of order $10^0$. That's 3 orders of magnitude of relative change!
Floating point representation $[x]$ of a real number $x$ may easily introduce such input perturbations. The sensitivity of the output of a problem to perturbations of the input is called conditioning.
A well-conditioned problem (instance) is one such that all small perturbations of $x$ lead to only small changes in $f(x)$.
An ill-conditioned problem is one such that some small perturbation of $x$ leads to a large change in $f(x)$.
The (relative) condition number $\kappa=\kappa(\theta)$ of a problem is defined by $$ \kappa = \lim_{\delta\to 0}\sup_{\|\delta \theta\|\le \delta}\frac{\|\delta f\|/\|f(\theta)\|}{\|\delta \theta\|/\|\theta\|} , $$
where $\delta f = f(\theta + \delta \theta) - f(\theta)$.
where $\|A\|$ is the matrix norm induced by vector norm $\|\cdot\|$, i.e., $$ \|A\| = \sup_{x\neq 0} \frac{\|Ax\|}{\|x\|}. $$
or $$ \delta x = -A^{-1}(\delta A)x + o(\Vert \delta A \Vert) . $$ That is, $\delta f = f(A + \delta A) - f(A) = - A^{-1}(\delta A)A^{-1}b + o(\Vert \delta A \Vert)$.
Therefore $$ \frac{\Vert \delta f \Vert / \Vert f \Vert}{\Vert \delta A \Vert / \Vert A \Vert} \le \frac{\Vert A^{-1} \Vert \Vert \delta A \Vert \Vert x \Vert / \Vert x \Vert}{\Vert \delta A \Vert / \Vert A \Vert} = \Vert A^{-1} \Vert \Vert A \Vert , $$ hence $\kappa \le \Vert A^{-1} \Vert \Vert A \Vert$.
With respect to the Euclidean norm, $\Vert A^{-1} \Vert \Vert A \Vert$ equals the ratio of the maximum and minimum singular values of $A$.
In the above problem, the condition number of the matrix $A$ (w.r.t. the Euclidean norm) is
using LinearAlgebra
LinearAlgebra.cond(A)
3611.5557231115513
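As an added check, this matches the ratio of the largest and smallest singular values of $A$ (the 2-norm condition number):

σ = svdvals(A)                  # singular values of A (from LinearAlgebra)
@show maximum(σ) / minimum(σ);  # ≈ cond(A)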
Section II.2, Computational Statistics by James Gentle (2009).
Sections 1.5 and 2.2, Applied Numerical Linear Algebra by James W. Demmel (1997).
What every computer scientist should know about floating-point arithmetic by David Goldberg (1991).