Consider $\mathbf{A} \mathbf{x} = \mathbf{b}$, $\mathbf{A} \in \mathbb{R}^{n \times n}$, or, if you like, computing the matrix inverse. $\mathbf{A}$ can be huge. Keep massive data in mind: the 1000 Genomes Project, Netflix, Google PageRank, finance, spatial statistics, ... We should be alert to the many easy linear systems whose structure can be exploited.
Don't blindly use `A \ b` and `inv` in Julia or the `solve` function in R. Don't waste computing resources through bad choices of algorithms!
Diagonal $\mathbf{A}$: $n$ flops. Use the `Diagonal` type of Julia.
using BenchmarkTools
srand(280)
n = 1000
A = diagm(randn(n)) # a full matrix
b = randn(n)
@which A \ b
# check `istril(A)` and `istriu(A)`, then call `Diagonal(A) \ b`
@benchmark A \ b
BenchmarkTools.Trial: 
  memory estimate:  15.95 KiB
  allocs estimate:  5
  --------------
  minimum time:     729.697 μs (0.00% GC)
  median time:      743.364 μs (0.00% GC)
  mean time:        778.562 μs (0.06% GC)
  maximum time:     2.414 ms (62.58% GC)
  --------------
  samples:          6392
  evals/sample:     1
# O(n) computation, no extra array allocation
@benchmark Diagonal(A) \ b
BenchmarkTools.Trial: 
  memory estimate:  15.95 KiB
  allocs estimate:  5
  --------------
  minimum time:     3.791 μs (0.00% GC)
  median time:      4.701 μs (0.00% GC)
  mean time:        5.935 μs (14.60% GC)
  maximum time:     267.824 μs (95.83% GC)
  --------------
  samples:          10000
  evals/sample:     8
Bidiagonal, tridiagonal, or banded $\mathbf{A}$: band LU, band Cholesky, ... roughly $O(n)$ flops. Use the `Bidiagonal`, `Tridiagonal`, and `SymTridiagonal` types of Julia.

srand(280)
n = 1000
dv = randn(n)
ev = randn(n - 1)
b = randn(n)
# symmetric tridiagonal matrix
A = SymTridiagonal(dv, ev)
1000×1000 SymTridiagonal{Float64}:
  0.126238   0.618244    ⋅         …    ⋅          ⋅          ⋅       
  0.618244  -2.34688    1.10206         ⋅          ⋅          ⋅       
   ⋅         1.10206    1.91661         ⋅          ⋅          ⋅       
   ⋅          ⋅        -0.447244        ⋅          ⋅          ⋅       
  ⋮                                ⋱                                  
   ⋅          ⋅          ⋅             0.542815  -0.363838    ⋅       
   ⋅          ⋅          ⋅            -0.363838  -1.04034    0.321148 
   ⋅          ⋅          ⋅              ⋅         0.321148  -0.276537 
# convert to a full matrix
Afull = full(A)
# LU decomposition (2/3) n^3 flops!
@benchmark Afull \ b
BenchmarkTools.Trial: 
  memory estimate:  7.65 MiB
  allocs estimate:  8
  --------------
  minimum time:     12.552 ms (0.00% GC)
  median time:      13.446 ms (0.00% GC)
  mean time:        13.404 ms (4.16% GC)
  maximum time:     16.647 ms (0.00% GC)
  --------------
  samples:          373
  evals/sample:     1
# specialized algorithm for tridiagonal matrix, O(n) flops
@benchmark A \ b
BenchmarkTools.Trial: 
  memory estimate:  23.89 KiB
  allocs estimate:  6
  --------------
  minimum time:     12.511 μs (0.00% GC)
  median time:      14.714 μs (0.00% GC)
  mean time:        17.326 μs (8.28% GC)
  maximum time:     2.081 ms (97.27% GC)
  --------------
  samples:          10000
  evals/sample:     1
Triangular $\mathbf{A}$: $n^2$ flops to solve linear system.
srand(280)
n = 1000
A = tril(randn(n, n))
b = randn(n)
# check istril() then triangular solve
@benchmark A \ b
BenchmarkTools.Trial: 
  memory estimate:  7.95 KiB
  allocs estimate:  2
  --------------
  minimum time:     762.721 μs (0.00% GC)
  median time:      946.558 μs (0.00% GC)
  mean time:        1.051 ms (0.02% GC)
  maximum time:     7.309 ms (0.00% GC)
  --------------
  samples:          4721
  evals/sample:     1
# triangular solve directly; save the cost of istril()
@benchmark LowerTriangular(A) \ b
BenchmarkTools.Trial: 
  memory estimate:  7.97 KiB
  allocs estimate:  3
  --------------
  minimum time:     320.558 μs (0.00% GC)
  median time:      380.211 μs (0.00% GC)
  mean time:        430.062 μs (0.07% GC)
  maximum time:     4.834 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
Block diagonal $\mathbf{A}$: solve block by block. Use the sparse matrix type of Julia to store and compute with block diagonal matrices.

srand(280)
B = 10 # number of blocks
ni = 100
A = blkdiag([sprandn(ni, ni, 0.01) for b in 1:B]...)
1000×1000 SparseMatrixCSC{Float64,Int64} with 969 stored entries:
  [31  ,    1]  =  2.07834
  [53  ,    1]  =  -1.11883
  [58  ,    1]  =  -0.66448
  [14  ,    4]  =  1.11793
  [96  ,    5]  =  1.22813
  [81  ,    8]  =  -0.919643
  [48  ,    9]  =  1.0185
  [49  ,    9]  =  -0.996332
  [15  ,   10]  =  1.30841
  [28  ,   10]  =  -0.818757
  ⋮
  [956 ,  987]  =  -0.900804
  [967 ,  987]  =  -0.438788
  [971 ,  991]  =  0.176756
  [929 ,  992]  =  -1.17384
  [974 ,  993]  =  1.59235
  [967 ,  994]  =  0.542169
  [994 ,  995]  =  0.627832
  [998 ,  997]  =  0.60382
  [935 ,  998]  =  0.342675
  [947 ,  998]  =  0.482228
  [975 , 1000]  =  0.991598
Kronecker product: use $$ \begin{eqnarray*} (\mathbf{A} \otimes \mathbf{B})^{-1} &=& \mathbf{A}^{-1} \otimes \mathbf{B}^{-1} \\ (\mathbf{C}^T \otimes \mathbf{A}) \text{vec}(\mathbf{B}) &=& \text{vec}(\mathbf{A} \mathbf{B} \mathbf{C}) \end{eqnarray*} $$ to avoid forming and computing with the potentially huge Kronecker product $\mathbf{A} \otimes \mathbf{B}$.
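A quick numerical check of the vec identity (a minimal sketch with tiny, arbitrary dimensions): the left-hand side forms the Kronecker product explicitly, while the right-hand side uses only two small matrix multiplications.

srand(280)
A = randn(3, 4)
B = randn(4, 5)
C = randn(5, 2)
# naive: form the 6×20 Kronecker product, then multiply
lhs = kron(C', A) * vec(B)
# smart: never form the Kronecker product
rhs = vec(A * B * C)
vecnorm(lhs - rhs) # ≈ 0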
Sparsity: sparse matrix decomposition or iterative method. Familiarize yourself with sparse matrix computation tools, e.g., those in Julia, R (`Matrix` package), MKL (sparse BLAS), ..., as much as possible.

srand(280)
n = 1000
# a sparse pd matrix, about 0.5% non-zero entries
A = sprand(n, n, 0.002)
A = A + A' + (2n) * I
b = randn(n)
Afull = full(A)
countnz(A) / length(A) # sparsity
0.005096
# dense matrix-vector multiplication
@benchmark Afull * b
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     100.168 μs (0.00% GC)
  median time:      186.690 μs (0.00% GC)
  mean time:        201.362 μs (0.00% GC)
  maximum time:     2.580 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
# sparse matrix-vector multiplication
@benchmark A * b
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     6.943 μs (0.00% GC)
  median time:      7.599 μs (0.00% GC)
  mean time:        8.241 μs (2.32% GC)
  maximum time:     483.364 μs (95.82% GC)
  --------------
  samples:          10000
  evals/sample:     5
# dense Cholesky decomposition
@benchmark cholfact(Afull)
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  8
  --------------
  minimum time:     9.110 ms (0.00% GC)
  median time:      11.609 ms (0.00% GC)
  mean time:        11.827 ms (8.51% GC)
  maximum time:     68.000 ms (85.06% GC)
  --------------
  samples:          422
  evals/sample:     1
# sparse Cholesky decomposition
@benchmark cholfact(A)
BenchmarkTools.Trial: 
  memory estimate:  1.33 MiB
  allocs estimate:  53
  --------------
  minimum time:     3.326 ms (0.00% GC)
  median time:      3.434 ms (0.00% GC)
  mean time:        3.631 ms (0.99% GC)
  maximum time:     7.742 ms (0.00% GC)
  --------------
  samples:          1375
  evals/sample:     1
# solve via dense Cholesky
xchol = cholfact(Afull) \ b
@benchmark cholfact(Afull) \ b
BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  10
  --------------
  minimum time:     10.132 ms (0.00% GC)
  median time:      11.326 ms (0.00% GC)
  mean time:        11.478 ms (5.48% GC)
  maximum time:     16.560 ms (0.00% GC)
  --------------
  samples:          435
  evals/sample:     1
# solve via sparse Cholesky
xcholsp = cholfact(A) \ b
vecnorm(xchol - xcholsp)
4.020546956752686e-18
@benchmark cholfact(A) \ b
BenchmarkTools.Trial: 
  memory estimate:  1.36 MiB
  allocs estimate:  64
  --------------
  minimum time:     3.825 ms (0.00% GC)
  median time:      3.928 ms (0.00% GC)
  mean time:        4.094 ms (0.92% GC)
  maximum time:     6.442 ms (0.00% GC)
  --------------
  samples:          1220
  evals/sample:     1
# sparse solve via conjugate gradient
using IterativeSolvers
xcg, = cg(A, b)
vecnorm(xcg - xchol)
0.022750177100322466
@benchmark cg(A, b)
BenchmarkTools.Trial: 
  memory estimate:  33.30 KiB
  allocs estimate:  27
  --------------
  minimum time:     31.181 μs (0.00% GC)
  median time:      34.950 μs (0.00% GC)
  mean time:        40.390 μs (8.18% GC)
  maximum time:     2.927 ms (94.92% GC)
  --------------
  samples:          10000
  evals/sample:     1
Easy plus low rank: $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{n \times r}$, $r \ll n$. Woodbury formula $$ (\mathbf{A} + \mathbf{U} \mathbf{V}^T)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{I}_r + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V}^T \mathbf{A}^{-1}. $$
The WoodburyMatrices.jl package can be useful.

using BenchmarkTools, WoodburyMatrices
srand(280)
n = 1000
r = 5
A = Diagonal(rand(n))
B = randn(n, r)
D = Diagonal(rand(r))
b = randn(n)
# W = A + B * D * B'
W = SymWoodbury(A, B, D)
Wfull = Symmetric(full(W))
1000×1000 Symmetric{Float64,Array{Float64,2}}:
  1.8571     0.513107   0.872146  …   0.764278  -0.241331   0.54921
  0.513107   4.57505   -0.636972     -1.86465   -1.92237   -1.72569
  0.872146  -0.636972   4.81387       1.99357    1.99337    3.66327
  ⋮                                ⋱
  0.764278  -1.86465    1.99357       3.21548    0.932773   1.97505
 -0.241331  -1.92237    1.99337       0.932773   3.43991    2.8539
  0.54921   -1.72569    3.66327       1.97505    2.8539     4.85234
# solve via Cholesky
@benchmark cholfact(Wfull) \ b
BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  9
  --------------
  minimum time:     8.913 ms (0.00% GC)
  median time:      9.119 ms (0.00% GC)
  mean time:        9.485 ms (6.82% GC)
  maximum time:     12.387 ms (15.54% GC)
  --------------
  samples:          527
  evals/sample:     1
# solve using Woodbury formula
@benchmark W \ b
BenchmarkTools.Trial: 
  memory estimate:  76.16 KiB
  allocs estimate:  26
  --------------
  minimum time:     19.164 μs (0.00% GC)
  median time:      24.648 μs (0.00% GC)
  mean time:        34.932 μs (24.29% GC)
  maximum time:     4.617 ms (98.44% GC)
  --------------
  samples:          10000
  evals/sample:     1
# multiplication without using Woodbury structure
@benchmark Wfull * b
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     193.856 μs (0.00% GC)
  median time:      203.398 μs (0.00% GC)
  mean time:        205.489 μs (0.00% GC)
  maximum time:     531.061 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
# multiplication using Woodbury structure
@benchmark W * b
BenchmarkTools.Trial: 
  memory estimate:  24.06 KiB
  allocs estimate:  5
  --------------
  minimum time:     4.910 μs (0.00% GC)
  median time:      6.397 μs (0.00% GC)
  mean time:        8.787 μs (20.79% GC)
  maximum time:     374.038 μs (96.19% GC)
  --------------
  samples:          10000
  evals/sample:     7
Easy plus border: For $\mathbf{A}$ pd and $\mathbf{V}$ full row rank, $$ \begin{pmatrix} \mathbf{A} & \mathbf{V}^T \\ \mathbf{V} & \mathbf{0} \end{pmatrix}^{-1} = \begin{pmatrix} \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{V}^T (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \mathbf{V} \mathbf{A}^{-1} & \mathbf{A}^{-1} \mathbf{V}^T (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \\ (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \mathbf{V} \mathbf{A}^{-1} & - (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \end{pmatrix}. $$
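This identity lets us solve the bordered system blockwise, reusing a factorization of $\mathbf{A}$ and factoring only the small matrix $\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T$. A minimal sketch (the sizes and the Cholesky-based implementation are our own illustrative choices):

srand(280)
n, m = 500, 10
A = randn(n, n); A = A' * A + n * I # pd
V = randn(m, n)                     # full row rank
b1, b2 = randn(n), randn(m)
# naive: factor the full (n + m) × (n + m) bordered matrix
M = [A V'; V zeros(m, m)]
xy = M \ [b1; b2]
# blockwise: factor A once, then solve a small m × m system
Achol = cholfact(A)
S = V * (Achol \ V')                # V A⁻¹ Vᵀ, m × m
y = S \ (V * (Achol \ b1) - b2)
x = Achol \ (b1 - V' * y)
vecnorm(xy - [x; y]) # ≈ 0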
Orthogonal $\mathbf{A}$: at most $n^2$ flops, since $\mathbf{A}^{-1} = \mathbf{A}^T$. Permutation matrices, Householder matrices, Jacobi matrices, ... take even less.
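For instance, if $\mathbf{Q}$ is orthogonal, then $\mathbf{Q} \mathbf{x} = \mathbf{b}$ is solved by a single matrix-vector multiplication $\mathbf{x} = \mathbf{Q}^T \mathbf{b}$ (a minimal sketch):

srand(280)
n = 1000
Q, = qr(randn(n, n)) # a random orthogonal matrix
b = randn(n)
x = Q' * b           # about 2n² flops; no factorization needed
vecnorm(Q * x - b)   # ≈ 0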
Toeplitz systems: $$ \mathbf{T} = \begin{pmatrix} r_0 & r_1 & r_2 & r_3 \\ r_{-1} & r_0 & r_1 & r_2 \\ r_{-2} & r_{-1} & r_0 & r_1 \\ r_{-3} & r_{-2} & r_{-1} & r_0 \end{pmatrix}. $$ $\mathbf{T} \mathbf{x} = \mathbf{b}$, where $\mathbf{T}$ is pd and Toeplitz, can be solved in $O(n^2)$ flops. Durbin algorithm (Yule-Walker equation), Levinson algorithm (general $\mathbf{b}$), Trench algorithm (inverse). These matrices occur in auto-regressive models and econometrics.
The ToeplitzMatrices.jl package can be useful.
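To illustrate the $O(n^2)$ Toeplitz solvers, here is a sketch of Durbin's algorithm, following Golub and Van Loan's presentation, for the Yule-Walker system $\mathbf{T} \mathbf{y} = -(r_1, \ldots, r_n)^T$, where $\mathbf{T}$ is pd Toeplitz with first row $(1, r_1, \ldots, r_{n-1})$; the function name and the AR(1)-style test data are our own.

# Durbin's algorithm: O(n²) flops, O(n) storage
function durbin(r::Vector{Float64})
    n = length(r)
    y = zeros(n)
    y[1] = -r[1]
    α, β = -r[1], 1.0
    for k in 1:n-1
        β *= 1 - α^2
        α = -(r[k+1] + dot(r[k:-1:1], y[1:k])) / β
        y[1:k] += α * y[k:-1:1]
        y[k+1] = α
    end
    y
end

srand(280)
n = 200
r = 0.5 .^ (1:n)                              # AR(1)-type autocorrelations
T = [0.5 ^ abs(i - j) for i in 1:n, j in 1:n] # pd Toeplitz matrix, for checking
y = durbin(r)
vecnorm(T * y + r) # ≈ 0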
Circulant systems: a Toeplitz matrix with wraparound $$ C(\mathbf{z}) = \begin{pmatrix} z_0 & z_4 & z_3 & z_2 & z_1 \\ z_1 & z_0 & z_4 & z_3 & z_2 \\ z_2 & z_1 & z_0 & z_4 & z_3 \\ z_3 & z_2 & z_1 & z_0 & z_4 \\ z_4 & z_3 & z_2 & z_1 & z_0 \end{pmatrix}, $$ can be solved by FFT-type algorithms: DCT (discrete cosine transform) and DST (discrete sine transform).
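Since the DFT diagonalizes every circulant matrix, a circulant system can be solved in $O(n \log n)$ flops with one FFT, one inverse FFT, and a pointwise division; a minimal sketch (the dense matrix is formed only to check the answer):

srand(280)
n = 8
z = randn(n)                                      # first column of C(z)
C = [z[mod(i - j, n) + 1] for i in 1:n, j in 1:n] # dense circulant, for checking
b = randn(n)
x = real(ifft(fft(b) ./ fft(z)))                  # eigenvalues of C(z) are fft(z)
vecnorm(C * x - b) # ≈ 0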
Vandermonde matrix: such as in interpolation and approximation problems $$ \mathbf{V}(x_0,\ldots,x_n) = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_0 & x_1 & \cdots & x_n \\ \vdots & \vdots & & \vdots \\ x_0^n & x_1^n & \cdots & x_n^n \end{pmatrix}. $$ $\mathbf{V} \mathbf{x} = \mathbf{b}$ or $\mathbf{V}^T \mathbf{x} = \mathbf{b}$ can be solved in $O(n^2)$ flops.
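As a sketch of the $O(n^2)$ idea (in the spirit of the Björck-Pereyra algorithm): the dual system $\mathbf{V}^T \mathbf{c} = \mathbf{y}$ is polynomial interpolation, solvable by Newton's divided differences followed by conversion to monomial coefficients; the function name is our own.

# solve Vᵀ c = y in O(n²) flops, where V_{ij} = x_j^{i-1}
function vandermonde_dual(x::Vector{Float64}, y::Vector{Float64})
    n = length(x)
    c = copy(y)
    # Newton divided differences
    for k in 1:n-1, i in n:-1:k+1
        c[i] = (c[i] - c[i-1]) / (x[i] - x[i-k])
    end
    # Newton form -> monomial coefficients
    for k in n-1:-1:1, i in k:n-1
        c[i] -= c[i+1] * x[k]
    end
    c # coefficients of 1, x, ..., x^{n-1}
end

srand(280)
x, y = sort(randn(8)), randn(8)
V = [x[j] ^ (i - 1) for i in 1:8, j in 1:8] # Vandermonde as defined above
c = vandermonde_dual(x, y)
vecnorm(V' * c - y) # ≈ 0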
Cauchy-like matrices: $$ \Omega \mathbf{A} - \mathbf{A} \Lambda = \mathbf{R} \mathbf{S}^T, $$ where $\Omega = \text{diag}(\omega_1,\ldots,\omega_n)$ and $\Lambda = \text{diag}(\lambda_1,\ldots, \lambda_n)$. $O(n)$ flops for LU and QR.
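For instance, the plain Cauchy matrix $a_{ij} = 1 / (\omega_i - \lambda_j)$ satisfies this displacement equation with the rank-one right-hand side $\mathbf{R} = \mathbf{S} = \mathbf{1}$; a quick numerical check (the shift by 10 merely keeps $\omega_i - \lambda_j$ away from zero):

srand(280)
n = 5
ω = randn(n)
λ = randn(n) + 10                              # keep ω_i - λ_j away from zero
A = [1 / (ω[i] - λ[j]) for i in 1:n, j in 1:n] # Cauchy matrix
Ω, Λ = Diagonal(ω), Diagonal(λ)
rank(Ω * A - A * Λ)                            # 1: every entry of ΩA - AΛ equals one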
Structured-rank problems: semiseparable matrices (LU and QR take $O(n)$ flops), quasiseparable matrices, ...
versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)