versioninfo()
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
using Pkg
Pkg.activate("../..")
Pkg.status()
Activating environment at `~/Dropbox/class/M1399.000200/2021/M1399_000200-2021fall/Project.toml`
Status `~/Dropbox/class/M1399.000200/2021/M1399_000200-2021fall/Project.toml`
  [7d9fca2a] Arpack v0.5.3
  [6e4b80f9] BenchmarkTools v1.1.4
  [1e616198] COSMO v0.8.1
  [f65535da] Convex v0.14.13
  [a93c6f00] DataFrames v1.2.2
  [31c24e10] Distributions v0.24.18
  [e2685f51] ECOS v0.12.3
  [f6369f11] ForwardDiff v0.10.19
  [28b8d3ca] GR v0.58.1
  [c91e804a] Gadfly v1.3.3
  [bd48cda9] GraphRecipes v0.5.7
  [82e4d734] ImageIO v0.5.8
  [6218d12a] ImageMagick v1.2.1
  [916415d5] Images v0.24.1
  [b6b21f68] Ipopt v0.7.0
  [42fd0dbc] IterativeSolvers v0.9.1
  [4076af6c] JuMP v0.21.9
  [b51810bb] MatrixDepot v1.0.4
  [1ec41992] MosekTools v0.9.4
  [76087f3c] NLopt v0.6.3
  [47be7bcc] ORCA v0.5.0
  [a03496cd] PlotlyBase v0.4.3
  [f0f68f2c] PlotlyJS v0.15.0
  [91a5bcdd] Plots v1.21.2
  [438e738f] PyCall v1.92.3
  [d330b81b] PyPlot v2.9.0
  [dca85d43] QuartzImageIO v0.7.3
  [6f49c342] RCall v0.13.12
  [ce6b1742] RDatasets v0.7.5
  [c946c3f1] SCS v0.7.1
  [276daf66] SpecialFunctions v1.6.1
  [2913bbd2] StatsBase v0.33.10
  [b8865327] UnicodePlots v2.0.1
  [0f1e0344] WebIO v0.8.15
  [8f399da3] Libdl
  [2f01184e] SparseArrays
  [10745b16] Statistics
An algorithm is loosely defined as a set of instructions for doing something that terminates in finite time. An algorithm requires input and output.
Donald Knuth lists five properties of an algorithm: (1) finiteness, (2) definiteness, (3) input, (4) output, (5) effectiveness.
A flop (floating point operation) consists of a floating point addition, subtraction, multiplication, division, or comparison, and the usually accompanying fetch and store.
Some books count a multiplication followed by an addition (fused multiply-add, FMA) as one flop. This results in a factor of up to 2 difference in flop counts.
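Beyond flop counting, whether the multiply and add are fused also matters numerically: an FMA rounds once instead of twice. A minimal sketch of this (using Julia's built-in `fma`):

```julia
# A fused multiply-add (one rounding) can give a different, more accurate
# result than a separate multiply and add (two roundings).
a = 1.0 + 2.0^-27
separate = a * a - 1.0      # a*a rounds away the low-order 2^-54 term
fused    = fma(a, a, -1.0)  # computes a*a - 1.0 with a single rounding
println(separate == fused)  # the two results differ in the last bits
```

Here `a^2 = 1 + 2^{-26} + 2^{-54}` exactly; the separate evaluation loses the `2^{-54}` term while the FMA keeps it.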
How to measure the efficiency of an algorithm? Big O notation. If $n$ is the size of a problem, an algorithm has order $O(f(n))$, where the leading term in the number of flops is $c \cdot f(n)$. For example,

- `A * b`, where `A` is $m \times n$ and `b` is $n \times 1$, takes $2mn$ or $O(mn)$ flops;
- `A * B`, where `A` is $m \times n$ and `B` is $n \times p$, takes $2mnp$ or $O(mnp)$ flops.

A hierarchy of computational complexity:
Let $n$ be the problem size.
Classification of data sets by Huber (1994).
| Data Size | Bytes | Storage Mode |
|---|---|---|
| Tiny | $10^2$ | Piece of paper |
| Small | $10^4$ | A few pieces of paper |
| Medium | $10^6$ (megabytes) | A floppy disk |
| Large | $10^8$ | Hard disk |
| Huge | $10^9$ (gigabytes) | Hard disk(s) |
| Massive | $10^{12}$ (terabytes) | RAID storage |
For a massive data set with $n = 10^{12}$, on a machine executing $10^{12}$ flops per second, an $O(n^2)$ algorithm takes about $(10^{12})^2 / 10^{12} = 10^{12}$ seconds, which is approximately 31,710 years!
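The arithmetic can be checked directly:

```julia
# 10^24 flops at 10^12 flops per second, converted to years.
seconds = 1e24 / 1e12
years = seconds / (3600 * 24 * 365)   # 365-day years
println(round(Int, years))            # ≈ 31710
```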
QuickSort and the fast Fourier transform (credited to Cooley and Tukey) are celebrated algorithms that turn $O(n^2)$ operations into $O(n \log n)$. Another example is Strassen's method, which turns $O(n^3)$ matrix multiplication into $O(n^{\log_2 7})$.
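To see the gap in practice, one can pit a textbook $O(n^2)$ insertion sort (a helper written here purely for illustration) against Julia's built-in $O(n \log n)$ sort:

```julia
using Random

# Classic O(n^2) insertion sort, defined here only for the comparison.
function insertionsort!(v::AbstractVector)
    for i in 2:length(v)
        key = v[i]
        j = i - 1
        while j >= 1 && v[j] > key
            v[j + 1] = v[j]
            j -= 1
        end
        v[j + 1] = key
    end
    return v
end

Random.seed!(123)
x = randn(50_000)
t_slow = @elapsed insertionsort!(copy(x))  # O(n^2)
t_fast = @elapsed sort(x)                  # O(n log n)
println("insertion sort: ", t_slow, " s; built-in sort: ", t_fast, " s")
```

On this problem size the built-in sort is typically orders of magnitude faster, and the gap widens quadratically versus $n \log n$ as $n$ grows.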
One goal of this course is to get familiar with the flop counts for some common numerical tasks in statistics.
> The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.
>
> -- James Gentle, *Matrix Algebra*, Springer, New York (2007).
Compare `A * B * x` (evaluated left to right, i.e. `(A * B) * x`) and `A * (B * x)`, where `A` and `B` are matrices and `x` is a vector.

using BenchmarkTools, Random
Random.seed!(123) # seed
n = 1000
A = randn(n, n)
B = randn(n, n)
x = randn(n)
# complexity is n^3 + n^2 = O(n^3)
@benchmark $A * $B * $x
BenchmarkTools.Trial: 301 samples with 1 evaluation.
 Range (min … max):  12.038 ms … 27.045 ms  ┊ GC (min … max): 0.00% … 15.52%
 Time  (median):     15.712 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.599 ms ±  3.089 ms  ┊ GC (mean ± σ):  3.23% ± 6.41%
 Memory estimate: 7.64 MiB, allocs estimate: 3.
# complexity is n^2 + n^2 = O(n^2)
@benchmark $A * ($B * $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  275.552 μs …  4.424 ms  ┊ GC (min … max): 0.00% … 92.43%
 Time  (median):     288.479 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   318.128 μs ± 73.509 μs  ┊ GC (mean ± σ):  0.13% ± 0.92%
 Memory estimate: 15.88 KiB, allocs estimate: 2.
Why is there such a difference?

FLOPS (floating point operations per second) is a measure of computer performance.
For example, my laptop has the Intel i5-8279U (Coffee Lake) CPU with 4 cores running at 2.4 GHz ($2.4 \times 10^9$ cycles per second).
versioninfo()
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Intel Coffee Lake CPUs can do 16 double-precision flops per cycle and 32 single-precision flops per cycle. Then the theoretical throughput of my laptop is $$ 4 \times 2.4 \times 10^9 \times 16 = 153.6 \text{ GFLOPS} $$ in double precision and $$ 4 \times 2.4 \times 10^9 \times 32 = 307.2 \text{ GFLOPS} $$ in single precision.
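The same calculation in code (using the specs quoted above; substitute your own core count, clock rate, and flops per cycle):

```julia
# Theoretical peak throughput = cores × clock (Hz) × flops per cycle.
cores = 4
clock_hz = 2.4e9
flops_per_cycle_dp = 16   # double precision, Coffee Lake
flops_per_cycle_sp = 32   # single precision

peak_dp = cores * clock_hz * flops_per_cycle_dp   # 153.6 GFLOPS
peak_sp = cores * clock_hz * flops_per_cycle_sp   # 307.2 GFLOPS
println(peak_dp / 1e9, " GFLOPS DP, ", peak_sp / 1e9, " GFLOPS SP")
```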
`LinearAlgebra.peakflops()` computes the peak flop rate of the computer by running `gemm!` (double-precision matrix-matrix multiplication).

using LinearAlgebra
LinearAlgebra.peakflops(2^14) # matrix size 2^14
1.4744965532793915e11
which is about 147.4 GFLOPS DP.
For given data $x \in \mathcal{X}$, the true solution of the problem $f$ is $y = f(x) \in \mathcal{Y}$.
- The problem of solving $Ax=b$ for fixed $b$ is $f: A \mapsto A^{-1}b$ with $\mathcal{X}=\{M\in\mathbb{R}^{n\times n}: M \text{ is invertible} \}$ and $\mathcal{Y} = \mathbb{R}^n$.
For given data $x \in \mathcal{X}$, the solution computed by algorithm $\tilde{f}$ is $\hat{y} = \tilde{f}(x) \in \mathcal{Y}$.
- Example 1: solve $Ax=b$ by GEPP (Gaussian elimination with partial pivoting) followed by forward and backward substitutions on a digital computer.
- Example 2: solve $Ax=b$ by Gauss-Seidel (an iterative method to come) on a digital computer.
- In both cases, the solutions (in $\mathcal{Y}$) are not the same as $A^{-1}b$!
+ We'll learn about these algorithms soon.
- Algorithms will be affected by at least rounding errors.
In words, a stable algorithm gives "nearly the right" answer to a "slightly wrong" question.
In words, a backward stable algorithm gives "exactly the right" answer to a "slightly wrong" question.

- Backward stability implies stability, but not the other way around.
To see this, recall the definition of the condition number
$$
\kappa = \lim_{\delta\to 0}\sup_{\|\tilde{x} - x\|\le \delta \Vert x \Vert}\frac{\|f(\tilde{x}) - f(x)\|/\|f(x)\|}{\|\tilde{x} - x\|/\|x\|}
.
$$
Thus for $\tilde{x} \in \mathcal{X}$ such that $\frac{\Vert\tilde{x} - x\Vert}{\Vert x \Vert} = O(\epsilon)$ and $\tilde{f}(x) = f(\tilde{x})$ as $\epsilon \to 0$, we have
$$
\frac{\|\tilde{f}(x) - f(x)\|}{\|f(x)\|} \le ( \kappa + o(1) )\frac{\|\tilde{x} - x\|}{\|x\|}
= O(\kappa \epsilon)
.
$$
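This $O(\kappa\epsilon)$ bound can be observed numerically. The sketch below constructs a matrix with a prescribed condition number ($\approx 10^{10}$, an arbitrary choice for illustration) and solves $Ax = b$ with Julia's backslash (a backward stable GEPP solve via LAPACK); the relative forward error lands near $\kappa \cdot \epsilon \approx 10^{-6}$, far above $\epsilon \approx 10^{-16}$ itself.

```julia
using LinearAlgebra, Random

Random.seed!(123)
n = 100
U = Matrix(qr(randn(n, n)).Q)            # random orthogonal factor
V = Matrix(qr(randn(n, n)).Q)            # another random orthogonal factor
σ = exp10.(range(0, -10; length = n))    # singular values 1 down to 1e-10
A = U * Diagonal(σ) * V'                 # cond(A) ≈ 1e10 by construction

xtrue = randn(n)
b = A * xtrue
xhat = A \ b                             # backward stable GEPP solve
relerr = norm(xhat - xtrue) / norm(xtrue)
println("cond(A) ≈ ", cond(A), ", relative error ≈ ", relerr)
```

The backward stable algorithm cannot do better than the conditioning of the problem allows: the forward error is amplified by $\kappa$.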
Examples
(Backward) Stability is a property of an algorithm, whereas conditioning is a property of a problem.
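A classic illustration of this distinction (not from the original note): the one-pass "textbook" variance formula $\frac{1}{n}\sum x_i^2 - \bar{x}^2$ is unstable for data with a large mean, while the two-pass formula is stable. The problem itself (variance of well-scaled data) is well-conditioned, so the failure is entirely the algorithm's fault.

```julia
using Statistics, Random

Random.seed!(123)
x = 1e8 .+ randn(10_000)   # large mean, true variance ≈ 1

# One-pass "textbook" formula: catastrophic cancellation between two
# huge, nearly equal quantities of size ≈ 1e16.
v_unstable = sum(abs2, x) / length(x) - mean(x)^2

# Two-pass formula: subtract the mean first, then square. Stable.
v_stable = sum(abs2, x .- mean(x)) / length(x)

println("one-pass: ", v_unstable, "   two-pass: ", v_stable)
```

Both formulas are algebraically identical; only the evaluation order differs, echoing Gentle's quote above.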
*What Is Numerical Stability?* by Nick Higham
This lecture note has evolved from Dr. Hua Zhou's 2019 Spring Statistical Computing course notes available at http://hua-zhou.github.io/teaching/biostatm280-2019spring/index.html.