Week 1: Intro week, floating point, vector norms, matrix multiplication
Today we will talk about:
Consider composition of two linear operators:
Then, $z = Ay = A B x = C x$, where $C$ is the matrix-by-matrix product.
A product of an $n \times k$ matrix $A$ and a $k \times m$ matrix $B$ is a $n \times m$ matrix $C$ with the elements
$$
c_{ij} = \sum_{s=1}^k a_{is} b_{sj}, \quad i = 1, \ldots, n, \quad j = 1, \ldots, m
$$
Complexity of a naive algorithm for MM is $\mathcal{O}(n^3)$.
Matrix-by-matrix product is the core for almost all efficient algorithms in linear algebra.
Basically, all the NLA algorithms are reduced to a sequence of matrix-by-matrix products,
so efficient implementation of MM reduces the complexity of numerical algorithms by the same factor.
However, implementing MM is not easy at all!
Is it easy to multiply a matrix by a matrix?
The answer is: no, if you want it as fast as possible,
using the computers that are at hand.
Let us do a short demo and compare a np.dot()
procedure which in my case uses MKL with a hand-written matrix-by-matrix routine in Python and also its Cython version (and also gives a very short introduction to Cython).
import numpy as np
def matmul(a, b):
n = a.shape[0]
k = a.shape[1]
m = b.shape[1] #c[i, :] +=
c = np.zeros((n, m))
#Transpose B here, but it is N^2 operations
for i in range(n): #This is N^3
for j in range(m):
for s in range(k):
c[i, j] += a[i, s] * b[s, j]
%reload_ext cythonmagic
import numpy as np
from numba import jit
@jit
def numba_matmul(a, b):
n = a.shape[0]
k = a.shape[1]
m = b.shape[1]
c = np.zeros((n, m))
for i in range(n):
for j in range(m):
for s in range(k):
c[i, j] += a[i, s] * b[s, j]
return c
Then we just compare computational times.
Guess the answer.
n = 100
a = np.random.randn(n, n)
b = np.random.randn(n, n)
%timeit c = numba_matmul(a, b)
#%timeit cf = cython_matmul(a, b)
%timeit c = np.dot(a, b)
The slowest run took 294.61 times longer than the fastest. This could mean that an intermediate result is being cached 1 loops, best of 3: 952 µs per loop 10000 loops, best of 3: 112 µs per loop
Why it is so?
There are two important issues:
We need to reduce the number of read/write operations!
This is typically achieved in efficient implementations of the BLAS libraries, one of which (Intel MKL) we now use.
Basic linear algebra operations (BLAS) have three levels:
What is the principal differences between them?
The main difference is the number of operations vs. the number of input data!
Remark: a quest for $\mathcal{O}(n^2)$ matrix-by-matrix multiplication algorithm is not yet done.
Strassen gives $\mathcal{O}(n^{2.78...})$
World record $\mathcal{O}(n^{2.37})$ Reference
The constant is unfortunately too big to make it practical!
How we can use memory hierarchy?
Break the matrix into blocks! ($2 \times 2$ is an illustration)
$ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$, $B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$
Then,
$AB$ = $\begin{bmatrix}A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22}\end{bmatrix}.$
If $A_{11}, B_{11}$ and their product fit into the cache memory (which is 1024 Kb for the Haswell Intel Chip), then we load them only once into the memory.
The number of read/writes is reduced by a factor $\sqrt{M}$, where $M$ is the cache size.
The blocking has also deep connection with parallel computations.
Consider adding two vectors:
$$ c = a + b$$
and we have two processors.
How fast can we go?
Of course, not faster then twice.
## This demo requires Anaconda distribution to be installed
import mkl
import numpy as np
n = 1000000
a = np.random.randn(n)
mkl.set_num_threads(1)
%timeit a + a
mkl.set_num_threads(2)
%timeit a + a
100 loops, best of 3: 2.82 ms per loop 100 loops, best of 3: 2.82 ms per loop
## This demo requires Anaconda distribution to be installed
import mkl
n = 500
a = np.random.randn(n, n)
mkl.set_num_threads(1)
%timeit a.dot(a)
mkl.set_num_threads(4)
%timeit a.dot(a)
100 loops, best of 3: 10.4 ms per loop 100 loops, best of 3: 5.73 ms per loop
Typically, two cases are distinguished:
In both cases, the efficiency is governed by a
memory bandwidth:
I.e., for BLAS-1,2 routines (like sum of two vectors) reads/writes take all the time.
For BLAS-3 routines, the speedup can be obtained that is more noticable.
For large-scale clusters (>100 000 cores, see the Top500 list) there is still scaling.
A new direction in NLA is communication-avoiding algorithms (i.e. Hadoop), when you have many computing nodes, but very slow communication with limited communication capabilities.
This requires absolutely different algorithms.
This can be an interesting project (i.e. do NLA in a cloud).
Now we go to matrix norms.
from IPython.core.display import HTML
def css_styling():
styles = open("./styles/custom.css", "r").read()
return HTML(styles)
css_styling()