Consider $\mathbf{A} \mathbf{x} = \mathbf{b}$, $\mathbf{A} \in \mathbb{R}^{n \times n}$, or, if you like, computing the matrix inverse. $\mathbf{A}$ can be huge. Keep massive data in mind: the 1000 Genomes Project, Netflix, Google PageRank, finance, spatial statistics, ... We should be alert to the many easy linear systems whose structure can be exploited.
Don't blindly use `A \ b` and `inv` in Julia or the `solve` function in R. Don't waste computing resources through bad choices of algorithms!
Diagonal $\mathbf{A}$: $n$ flops. Use the `Diagonal` type of Julia.
using BenchmarkTools
srand(280)
n = 1000
A = diagm(randn(n)) # a full matrix
b = randn(n)
@which A \ b
# check `istril(A)` and `istriu(A)`, then call `Diagonal(A) \ b`
@benchmark A \ b
BenchmarkTools.Trial: 
  memory estimate:  15.95 KiB
  allocs estimate:  5
  --------------
  minimum time:     729.697 μs (0.00% GC)
  median time:      743.364 μs (0.00% GC)
  mean time:        778.562 μs (0.06% GC)
  maximum time:     2.414 ms (62.58% GC)
  --------------
  samples:          6392
  evals/sample:     1
# O(n) computation, no extra array allocation
@benchmark Diagonal(A) \ b
BenchmarkTools.Trial: 
  memory estimate:  15.95 KiB
  allocs estimate:  5
  --------------
  minimum time:     3.791 μs (0.00% GC)
  median time:      4.701 μs (0.00% GC)
  mean time:        5.935 μs (14.60% GC)
  maximum time:     267.824 μs (95.83% GC)
  --------------
  samples:          10000
  evals/sample:     8
Bidiagonal, tridiagonal, or banded $\mathbf{A}$: band LU, band Cholesky, ... roughly $O(n)$ flops. Use the `Bidiagonal`, `Tridiagonal`, and `SymTridiagonal` types of Julia.

srand(280)
n = 1000
dv = randn(n)
ev = randn(n - 1)
b = randn(n)
# symmetric tridiagonal matrix
A = SymTridiagonal(dv, ev)
1000×1000 SymTridiagonal{Float64}:
  0.126238   0.618244    ⋅         …    ⋅          ⋅          ⋅       
  0.618244  -2.34688    1.10206         ⋅          ⋅          ⋅       
   ⋅         1.10206    1.91661         ⋅          ⋅          ⋅       
   ⋅          ⋅        -0.447244        ⋅          ⋅          ⋅       
  ⋮                                ⋱                                  
   ⋅          ⋅          ⋅             0.542815  -0.363838    ⋅       
   ⋅          ⋅          ⋅            -0.363838  -1.04034    0.321148 
   ⋅          ⋅          ⋅              ⋅         0.321148  -0.276537 
# convert to a full matrix
Afull = full(A)
# LU decomposition (2/3) n^3 flops!
@benchmark Afull \ b
BenchmarkTools.Trial: 
  memory estimate:  7.65 MiB
  allocs estimate:  8
  --------------
  minimum time:     12.552 ms (0.00% GC)
  median time:      13.446 ms (0.00% GC)
  mean time:        13.404 ms (4.16% GC)
  maximum time:     16.647 ms (0.00% GC)
  --------------
  samples:          373
  evals/sample:     1
# specialized algorithm for tridiagonal matrix, O(n) flops
@benchmark A \ b
BenchmarkTools.Trial: 
  memory estimate:  23.89 KiB
  allocs estimate:  6
  --------------
  minimum time:     12.511 μs (0.00% GC)
  median time:      14.714 μs (0.00% GC)
  mean time:        17.326 μs (8.28% GC)
  maximum time:     2.081 ms (97.27% GC)
  --------------
  samples:          10000
  evals/sample:     1
Triangular $\mathbf{A}$: $n^2$ flops to solve linear system.
srand(280)
n = 1000
A = tril(randn(n, n))
b = randn(n)
# check istril() then triangular solve
@benchmark A \ b
BenchmarkTools.Trial: 
  memory estimate:  7.95 KiB
  allocs estimate:  2
  --------------
  minimum time:     762.721 μs (0.00% GC)
  median time:      946.558 μs (0.00% GC)
  mean time:        1.051 ms (0.02% GC)
  maximum time:     7.309 ms (0.00% GC)
  --------------
  samples:          4721
  evals/sample:     1
# triangular solve directly; save the cost of istril()
@benchmark LowerTriangular(A) \ b
BenchmarkTools.Trial: 
  memory estimate:  7.97 KiB
  allocs estimate:  3
  --------------
  minimum time:     320.558 μs (0.00% GC)
  median time:      380.211 μs (0.00% GC)
  mean time:        430.062 μs (0.07% GC)
  maximum time:     4.834 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
Block diagonal $\mathbf{A}$: solve block by block. Use the sparse matrix type of Julia to store and compute with block diagonal matrices.

srand(280)
B = 10 # number of blocks
ni = 100
A = blkdiag([sprandn(ni, ni, 0.01) for b in 1:B]...)
1000×1000 SparseMatrixCSC{Float64,Int64} with 969 stored entries:
  [31  ,    1]  =  2.07834
  [53  ,    1]  =  -1.11883
  [58  ,    1]  =  -0.66448
  [14  ,    4]  =  1.11793
  [96  ,    5]  =  1.22813
  [81  ,    8]  =  -0.919643
  [48  ,    9]  =  1.0185
  [49  ,    9]  =  -0.996332
  [15  ,   10]  =  1.30841
  [28  ,   10]  =  -0.818757
  ⋮
  [956 ,  987]  =  -0.900804
  [967 ,  987]  =  -0.438788
  [971 ,  991]  =  0.176756
  [929 ,  992]  =  -1.17384
  [974 ,  993]  =  1.59235
  [967 ,  994]  =  0.542169
  [994 ,  995]  =  0.627832
  [998 ,  997]  =  0.60382
  [935 ,  998]  =  0.342675
  [947 ,  998]  =  0.482228
  [975 , 1000]  =  0.991598
Kronecker product: use $$ \begin{eqnarray*} (\mathbf{A} \otimes \mathbf{B})^{-1} &=& \mathbf{A}^{-1} \otimes \mathbf{B}^{-1} \\ (\mathbf{C}^T \otimes \mathbf{A}) \text{vec}(\mathbf{B}) &=& \text{vec}(\mathbf{A} \mathbf{B} \mathbf{C}) \end{eqnarray*} $$ to avoid forming and computing with the potentially huge Kronecker product $\mathbf{A} \otimes \mathbf{B}$.
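A quick numerical check of the vec identity (a minimal sketch with tiny, arbitrary dimensions): the left-hand side forms the Kronecker product explicitly, while the right-hand side uses only two small matrix multiplications.

srand(280)
A = randn(3, 4)
B = randn(4, 5)
C = randn(5, 2)
# naive: form the 6×20 Kronecker product, then multiply
lhs = kron(C', A) * vec(B)
# smart: never form the Kronecker product
rhs = vec(A * B * C)
vecnorm(lhs - rhs) # ≈ 0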
Sparsity: sparse matrix decomposition or iterative method. Familiarize yourself with sparse matrix computation tools, e.g., those in Julia, R (`Matrix` package), MKL (sparse BLAS), ..., as much as possible.

srand(280)
n = 1000
# a sparse pd matrix, about 0.5% non-zero entries
A = sprand(n, n, 0.002)
A = A + A' + (2n) * I
b = randn(n)
Afull = full(A)
countnz(A) / length(A) # sparsity
0.005096
# dense matrix-vector multiplication
@benchmark Afull * b
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     100.168 μs (0.00% GC)
  median time:      186.690 μs (0.00% GC)
  mean time:        201.362 μs (0.00% GC)
  maximum time:     2.580 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
# sparse matrix-vector multiplication
@benchmark A * b
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     6.943 μs (0.00% GC)
  median time:      7.599 μs (0.00% GC)
  mean time:        8.241 μs (2.32% GC)
  maximum time:     483.364 μs (95.82% GC)
  --------------
  samples:          10000
  evals/sample:     5
# dense Cholesky decomposition
@benchmark cholfact(Afull)
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  8
  --------------
  minimum time:     9.110 ms (0.00% GC)
  median time:      11.609 ms (0.00% GC)
  mean time:        11.827 ms (8.51% GC)
  maximum time:     68.000 ms (85.06% GC)
  --------------
  samples:          422
  evals/sample:     1
# sparse Cholesky decomposition
@benchmark cholfact(A)
BenchmarkTools.Trial: 
  memory estimate:  1.33 MiB
  allocs estimate:  53
  --------------
  minimum time:     3.326 ms (0.00% GC)
  median time:      3.434 ms (0.00% GC)
  mean time:        3.631 ms (0.99% GC)
  maximum time:     7.742 ms (0.00% GC)
  --------------
  samples:          1375
  evals/sample:     1
# solve via dense Cholesky
xchol = cholfact(Afull) \ b
@benchmark cholfact(Afull) \ b
BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  10
  --------------
  minimum time:     10.132 ms (0.00% GC)
  median time:      11.326 ms (0.00% GC)
  mean time:        11.478 ms (5.48% GC)
  maximum time:     16.560 ms (0.00% GC)
  --------------
  samples:          435
  evals/sample:     1
# solve via sparse Cholesky
xcholsp = cholfact(A) \ b
vecnorm(xchol - xcholsp)
4.020546956752686e-18
@benchmark cholfact(A) \ b
BenchmarkTools.Trial: 
  memory estimate:  1.36 MiB
  allocs estimate:  64
  --------------
  minimum time:     3.825 ms (0.00% GC)
  median time:      3.928 ms (0.00% GC)
  mean time:        4.094 ms (0.92% GC)
  maximum time:     6.442 ms (0.00% GC)
  --------------
  samples:          1220
  evals/sample:     1
# sparse solve via conjugate gradient
using IterativeSolvers
xcg, = cg(A, b)
vecnorm(xcg - xchol)
0.022750177100322466
@benchmark cg(A, b)
BenchmarkTools.Trial: 
  memory estimate:  33.30 KiB
  allocs estimate:  27
  --------------
  minimum time:     31.181 μs (0.00% GC)
  median time:      34.950 μs (0.00% GC)
  mean time:        40.390 μs (8.18% GC)
  maximum time:     2.927 ms (94.92% GC)
  --------------
  samples:          10000
  evals/sample:     1
Easy plus low rank: $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{n \times r}$, $r \ll n$. Woodbury formula $$ (\mathbf{A} + \mathbf{U} \mathbf{V}^T)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{I}_r + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V}^T \mathbf{A}^{-1}. $$
The WoodburyMatrices.jl package can be useful.

using BenchmarkTools, WoodburyMatrices
srand(280)
n = 1000
r = 5
A = Diagonal(rand(n))
B = randn(n, r)
D = Diagonal(rand(r))
b = randn(n)
# W = A + B * D * B'
W = SymWoodbury(A, B, D)
Wfull = Symmetric(full(W))
1000×1000 Symmetric{Float64,Array{Float64,2}}:
  1.8571     0.513107   0.872146  …   0.764278  -0.241331   0.54921
  0.513107   4.57505   -0.636972     -1.86465   -1.92237   -1.72569
  0.872146  -0.636972   4.81387       1.99357    1.99337    3.66327
  ⋮                                ⋱
  0.764278  -1.86465    1.99357       3.21548    0.932773   1.97505
 -0.241331  -1.92237    1.99337       0.932773   3.43991    2.8539
  0.54921   -1.72569    3.66327       1.97505    2.8539     4.85234
# solve via Cholesky
@benchmark cholfact(Wfull) \ b
BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  9
  --------------
  minimum time:     8.913 ms (0.00% GC)
  median time:      9.119 ms (0.00% GC)
  mean time:        9.485 ms (6.82% GC)
  maximum time:     12.387 ms (15.54% GC)
  --------------
  samples:          527
  evals/sample:     1
# solve using Woodbury formula
@benchmark W \ b
BenchmarkTools.Trial: 
  memory estimate:  76.16 KiB
  allocs estimate:  26
  --------------
  minimum time:     19.164 μs (0.00% GC)
  median time:      24.648 μs (0.00% GC)
  mean time:        34.932 μs (24.29% GC)
  maximum time:     4.617 ms (98.44% GC)
  --------------
  samples:          10000
  evals/sample:     1
# multiplication without using Woodbury structure
@benchmark Wfull * b
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     193.856 μs (0.00% GC)
  median time:      203.398 μs (0.00% GC)
  mean time:        205.489 μs (0.00% GC)
  maximum time:     531.061 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
# multiplication using Woodbury structure
@benchmark W * b
BenchmarkTools.Trial: 
  memory estimate:  24.06 KiB
  allocs estimate:  5
  --------------
  minimum time:     4.910 μs (0.00% GC)
  median time:      6.397 μs (0.00% GC)
  mean time:        8.787 μs (20.79% GC)
  maximum time:     374.038 μs (96.19% GC)
  --------------
  samples:          10000
  evals/sample:     7
Easy plus border: For $\mathbf{A}$ pd and $\mathbf{V}$ full row rank, $$ \begin{pmatrix} \mathbf{A} & \mathbf{V}^T \\ \mathbf{V} & \mathbf{0} \end{pmatrix}^{-1} = \begin{pmatrix} \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{V}^T (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \mathbf{V} \mathbf{A}^{-1} & \mathbf{A}^{-1} \mathbf{V}^T (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \\ (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \mathbf{V} \mathbf{A}^{-1} & - (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \end{pmatrix}. $$
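This identity lets us solve the bordered system blockwise, reusing a factorization of $\mathbf{A}$ and factoring only the small matrix $\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T$. A minimal sketch (the sizes and the Cholesky-based implementation are our own illustrative choices):

srand(280)
n, m = 500, 10
A = randn(n, n); A = A' * A + n * I # pd
V = randn(m, n)                     # full row rank
b1, b2 = randn(n), randn(m)
# naive: factor the full (n + m) × (n + m) bordered matrix
M = [A V'; V zeros(m, m)]
xy = M \ [b1; b2]
# blockwise: factor A once, then solve a small m × m system
Achol = cholfact(A)
S = V * (Achol \ V')                # V A⁻¹ Vᵀ, m × m
y = S \ (V * (Achol \ b1) - b2)
x = Achol \ (b1 - V' * y)
vecnorm(xy - [x; y]) # ≈ 0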
Orthogonal $\mathbf{A}$: at most $n^2$ flops, since $\mathbf{A}^{-1} = \mathbf{A}^T$. Permutation matrices, Householder matrices, Jacobi matrices, ... take even less.
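For instance, if $\mathbf{Q}$ is orthogonal, then $\mathbf{Q} \mathbf{x} = \mathbf{b}$ is solved by a single matrix-vector multiplication $\mathbf{x} = \mathbf{Q}^T \mathbf{b}$ (a minimal sketch):

srand(280)
n = 1000
Q, = qr(randn(n, n)) # a random orthogonal matrix
b = randn(n)
x = Q' * b           # about 2n² flops; no factorization needed
vecnorm(Q * x - b)   # ≈ 0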
Toeplitz systems: $$ \mathbf{T} = \begin{pmatrix} r_0 & r_1 & r_2 & r_3 \\ r_{-1} & r_0 & r_1 & r_2 \\ r_{-2} & r_{-1} & r_0 & r_1 \\ r_{-3} & r_{-2} & r_{-1} & r_0 \end{pmatrix}. $$ $\mathbf{T} \mathbf{x} = \mathbf{b}$, where $\mathbf{T}$ is pd and Toeplitz, can be solved in $O(n^2)$ flops. Durbin algorithm (Yule-Walker equation), Levinson algorithm (general $\mathbf{b}$), Trench algorithm (inverse). These matrices occur in auto-regressive models and econometrics.
The ToeplitzMatrices.jl package can be useful.
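To illustrate the $O(n^2)$ Toeplitz solvers, here is a sketch of Durbin's algorithm, following Golub and Van Loan's presentation, for the Yule-Walker system $\mathbf{T} \mathbf{y} = -(r_1, \ldots, r_n)^T$, where $\mathbf{T}$ is pd Toeplitz with first row $(1, r_1, \ldots, r_{n-1})$; the function name and the AR(1)-style test data are our own.

# Durbin's algorithm: O(n²) flops, O(n) storage
function durbin(r::Vector{Float64})
    n = length(r)
    y = zeros(n)
    y[1] = -r[1]
    α, β = -r[1], 1.0
    for k in 1:n-1
        β *= 1 - α^2
        α = -(r[k+1] + dot(r[k:-1:1], y[1:k])) / β
        y[1:k] += α * y[k:-1:1]
        y[k+1] = α
    end
    y
end

srand(280)
n = 200
r = 0.5 .^ (1:n)                              # AR(1)-type autocorrelations
T = [0.5 ^ abs(i - j) for i in 1:n, j in 1:n] # pd Toeplitz matrix, for checking
y = durbin(r)
vecnorm(T * y + r) # ≈ 0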
Circulant systems: a Toeplitz matrix with wraparound $$ C(\mathbf{z}) = \begin{pmatrix} z_0 & z_4 & z_3 & z_2 & z_1 \\ z_1 & z_0 & z_4 & z_3 & z_2 \\ z_2 & z_1 & z_0 & z_4 & z_3 \\ z_3 & z_2 & z_1 & z_0 & z_4 \\ z_4 & z_3 & z_2 & z_1 & z_0 \end{pmatrix}, $$ can be solved by FFT-type algorithms: DCT (discrete cosine transform) and DST (discrete sine transform).
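Since the DFT diagonalizes every circulant matrix, a circulant system can be solved in $O(n \log n)$ flops with one FFT, one inverse FFT, and a pointwise division; a minimal sketch (the dense matrix is formed only to check the answer):

srand(280)
n = 8
z = randn(n)                                      # first column of C(z)
C = [z[mod(i - j, n) + 1] for i in 1:n, j in 1:n] # dense circulant, for checking
b = randn(n)
x = real(ifft(fft(b) ./ fft(z)))                  # eigenvalues of C(z) are fft(z)
vecnorm(C * x - b) # ≈ 0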
Vandermonde matrix: such as in interpolation and approximation problems $$ \mathbf{V}(x_0,\ldots,x_n) = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_0 & x_1 & \cdots & x_n \\ \vdots & \vdots & & \vdots \\ x_0^n & x_1^n & \cdots & x_n^n \end{pmatrix}. $$ $\mathbf{V} \mathbf{x} = \mathbf{b}$ or $\mathbf{V}^T \mathbf{x} = \mathbf{b}$ can be solved in $O(n^2)$ flops.
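As a sketch of the $O(n^2)$ idea (in the spirit of the Björck-Pereyra algorithm): the dual system $\mathbf{V}^T \mathbf{c} = \mathbf{y}$ is polynomial interpolation, solvable by Newton's divided differences followed by conversion to monomial coefficients; the function name is our own.

# solve Vᵀ c = y in O(n²) flops, where V_{ij} = x_j^{i-1}
function vandermonde_dual(x::Vector{Float64}, y::Vector{Float64})
    n = length(x)
    c = copy(y)
    # Newton divided differences
    for k in 1:n-1, i in n:-1:k+1
        c[i] = (c[i] - c[i-1]) / (x[i] - x[i-k])
    end
    # Newton form -> monomial coefficients
    for k in n-1:-1:1, i in k:n-1
        c[i] -= c[i+1] * x[k]
    end
    c # coefficients of 1, x, ..., x^{n-1}
end

srand(280)
x, y = sort(randn(8)), randn(8)
V = [x[j] ^ (i - 1) for i in 1:8, j in 1:8] # Vandermonde as defined above
c = vandermonde_dual(x, y)
vecnorm(V' * c - y) # ≈ 0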
Cauchy-like matrices: $$ \Omega \mathbf{A} - \mathbf{A} \Lambda = \mathbf{R} \mathbf{S}^T, $$ where $\Omega = \text{diag}(\omega_1,\ldots,\omega_n)$ and $\Lambda = \text{diag}(\lambda_1,\ldots, \lambda_n)$. $O(n)$ flops for LU and QR.
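For instance, the plain Cauchy matrix $a_{ij} = 1 / (\omega_i - \lambda_j)$ satisfies this displacement equation with the rank-one right-hand side $\mathbf{R} = \mathbf{S} = \mathbf{1}$; a quick numerical check (the shift by 10 merely keeps $\omega_i - \lambda_j$ away from zero):

srand(280)
n = 5
ω = randn(n)
λ = randn(n) + 10                              # keep ω_i - λ_j away from zero
A = [1 / (ω[i] - λ[j]) for i in 1:n, j in 1:n] # Cauchy matrix
Ω, Λ = Diagonal(ω), Diagonal(λ)
rank(Ω * A - A * Λ)                            # 1: every entry of ΩA - AΛ equals one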
Structured-rank problems: semiseparable matrices (LU and QR take $O(n)$ flops), quasiseparable matrices, ...
versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)