#!/usr/bin/env python
# coding: utf-8

# # Lecture 3.1: Matrix multiplication part 1

# ## Syllabus
# **Week 1:** Intro week, floating point, vector norms, matrix multiplication

# ## Recap of the previous lecture
# - Concept of floating point
# - Basic vector norms (p-norm, Manhattan distance, 2-norm, Chebyshev norm)
# - A short demo on $L_1$-norm minimization
# - Concept of forward/backward error

# ## Today's lecture
# Today we will talk about:
# - Matrices
# - Matrix multiplication
# - Matrix norms, operator norms
# - Unitary matrices, unitarily invariant norms
# - Concept of block algorithms for NLA: why and how
# - Complexity of matrix multiplication

# ## Matrix-by-matrix product
#
# Consider the composition of two linear operators:
#
# 1. $y = Bx$
# 2. $z = Ay$
#
# Then, $z = Ay = A B x = C x$, where $C = AB$ is the **matrix-by-matrix product**.
#
# The product of an $n \times k$ matrix $A$ and a $k \times m$ matrix $B$ is an $n \times m$ matrix $C$ with the elements
# $$
# c_{ij} = \sum_{s=1}^k a_{is} b_{sj}, \quad i = 1, \ldots, n, \quad j = 1, \ldots, m.
# $$

# ## Complexity of MM
# The complexity of the naive algorithm for MM is $\mathcal{O}(n^3)$.
#
# Matrix-by-matrix product is the **core** of almost all efficient algorithms in linear algebra.
#
# Basically, all NLA algorithms are reduced to a sequence of matrix-by-matrix products,
# so an efficient implementation of MM speeds up these algorithms by the same factor.
#
# However, implementing MM efficiently is not easy at all!

# ## Efficient implementation of MM
# Is it easy to multiply a matrix by a matrix?
#
# The answer is: **no**, if you want it to be as fast as possible
# on the computers that are at hand.

# ## Demo
# Let us do a short demo and compare the `np.dot()` procedure (which in my case uses MKL) with a hand-written matrix-by-matrix routine in Python and with its Numba-compiled version (a Cython version could be written along the same lines; its call is left commented out below).

# In[24]:

import numpy as np

def matmul(a, b):
    n = a.shape[0]
    k = a.shape[1]
    m = b.shape[1]
    c = np.zeros((n, m))
    # Naive triple loop: O(n^3) operations.
    # (Transposing B first would cost only O(n^2) extra operations and
    # improve memory access, but we keep the plain version here.)
    for i in range(n):
        for j in range(m):
            for s in range(k):
                c[i, j] += a[i, s] * b[s, j]
    return c


# In[22]:

get_ipython().run_line_magic('reload_ext', 'cythonmagic')


# In[26]:

import numpy as np
from numba import jit

@jit
def numba_matmul(a, b):
    n = a.shape[0]
    k = a.shape[1]
    m = b.shape[1]
    c = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for s in range(k):
                c[i, j] += a[i, s] * b[s, j]
    return c


# Then we just compare the computational times.
#
# Guess the answer.

# In[27]:

n = 100
a = np.random.randn(n, n)
b = np.random.randn(n, n)
get_ipython().run_line_magic('timeit', 'c = numba_matmul(a, b)')
#%timeit cf = cython_matmul(a, b)
get_ipython().run_line_magic('timeit', 'c = np.dot(a, b)')


# Why is it so?
# There are two important issues:
#
# - Computers are more and more parallel (multicore CPUs, graphics processing units)
# - The memory pyramid: there is a whole hierarchy of memory levels

# ## Memory architecture
#
# Fast memory is small, bigger memory is slow.
# - If the data fits into the fast memory:
#   load all data, compute.
# - If the data does not fit into the fast memory:
#   load data by chunks, compute, load again.
#
# We need to reduce the number of read/write operations!
#
# This is typically achieved in efficient implementations of the BLAS libraries, one of which (Intel MKL) we use here.

# ## BLAS
# Basic Linear Algebra Subprograms (**BLAS**) have three levels:
# 1. BLAS-1: vector operations, like $c = a + b$
# 2. BLAS-2: operations like the matrix-by-vector product
# 3. BLAS-3: the matrix-by-matrix product
#
# What is the principal difference between them?
#
# The main difference is the number of operations versus the number of input data!
#
# 1. BLAS-1: $\mathcal{O}(n)$ data, $\mathcal{O}(n)$ operations
# 2. BLAS-2: $\mathcal{O}(n^2)$ data, $\mathcal{O}(n^2)$ operations
# 3. BLAS-3: $\mathcal{O}(n^2)$ data, $\mathcal{O}(n^3)$ operations

# **Remark**: the quest for an $\mathcal{O}(n^2)$ matrix-by-matrix multiplication algorithm is not over yet.
#
# Strassen's algorithm gives $\mathcal{O}(n^{2.807...})$ (a recursive sketch is shown below).
#
# The world record is $\mathcal{O}(n^{2.37})$, see the [reference](http://arxiv.org/pdf/1401.7714v1.pdf).
#
# The constant is unfortunately too big to make it practical!
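# To see where the exponent $\log_2 7 \approx 2.807$ comes from: split each matrix into a $2\times 2$ block form and replace the 8 block products of the naive scheme by 7 cleverly chosen ones, applied recursively. Below is a minimal sketch, assuming square matrices whose size is a power of two; the function name `strassen_matmul` and the `leaf` cutoff are illustrative choices, not part of any library API.

# In[ ]:

import numpy as np

def strassen_matmul(A, B, leaf=64):
    """Strassen multiplication sketch for square matrices whose size is a
    power of two; falls back to BLAS-3 (np.dot) on small blocks."""
    n = A.shape[0]
    if n <= leaf:
        return A.dot(B)
    k = n // 2
    A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
    B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]
    # 7 recursive block products instead of 8
    M1 = strassen_matmul(A11 + A22, B11 + B22, leaf)
    M2 = strassen_matmul(A21 + A22, B11, leaf)
    M3 = strassen_matmul(A11, B12 - B22, leaf)
    M4 = strassen_matmul(A22, B21 - B11, leaf)
    M5 = strassen_matmul(A11 + A12, B22, leaf)
    M6 = strassen_matmul(A21 - A11, B11 + B12, leaf)
    M7 = strassen_matmul(A12 - A22, B21 + B22, leaf)
    # Recombine the blocks of the result
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

a = np.random.randn(256, 256)
b = np.random.randn(256, 256)
# Relative error with respect to the BLAS-3 product (should be ~1e-15)
print(np.linalg.norm(strassen_matmul(a, b) - a.dot(b)) / np.linalg.norm(a.dot(b)))


# The extra block additions and irregular memory access are exactly why such schemes lose to a well-tuned BLAS-3 routine for moderate sizes.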
# ## Memory hierarchy
# How can we use the memory hierarchy?
#
# Break the matrix into blocks! (The $2 \times 2$ case is just an **illustration**.)
#
# $
# A = \begin{bmatrix}
# A_{11} & A_{12} \\
# A_{21} & A_{22}
# \end{bmatrix}$, $B = \begin{bmatrix}
# B_{11} & B_{12} \\
# B_{21} & B_{22}
# \end{bmatrix}$
#
# Then,
#
# $AB$ = $\begin{bmatrix}A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\
# A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22}\end{bmatrix}.$
#
# If $A_{11}, B_{11}$ and their product fit into the cache memory (which is 1024 KB for the [Haswell Intel Chip](http://en.wikipedia.org/wiki/List_of_Intel_Core_i7_microprocessors#.22Haswell-H.22_.28MCP.2C_quad-core.2C_22_nm.29)), then we load them into fast memory only once.

# ### Key point
# The number of reads/writes is reduced by a factor of $\sqrt{M}$, where $M$ is the cache size.
#
# - We have to do linear algebra in terms of blocks (see the blocked-multiplication sketch below)!
# - So, you cannot even do Gaussian elimination as usual (or you suffer a ~10x performance loss).
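# To make the blocking idea concrete, here is a minimal sketch of a blocked matrix multiplication: the loops run over blocks, and each small update is itself a matrix-by-matrix product that can stay in cache. The name `blocked_matmul` and the block size 64 are illustrative choices.

# In[ ]:

import numpy as np

def blocked_matmul(a, b, block=64):
    """Multiply an (n x k) matrix by a (k x m) matrix block by block."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    c = np.zeros((n, m))
    for i in range(0, n, block):
        for j in range(0, m, block):
            for s in range(0, k, block):
                # Each update is a small BLAS-3 call on blocks that fit in cache
                c[i:i + block, j:j + block] += a[i:i + block, s:s + block].dot(
                    b[s:s + block, j:j + block])
    return c

a = np.random.randn(300, 200)
b = np.random.randn(200, 400)
# Relative error with respect to np.dot (should be ~1e-15)
print(np.linalg.norm(blocked_matmul(a, b) - a.dot(b)) / np.linalg.norm(a.dot(b)))


# Real BLAS implementations pick the block sizes to match each level of the cache hierarchy (and add vectorization and multithreading on top), which is why a tuned `np.dot` is hard to beat.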
# ## Parallelization
# Blocking also has a deep connection with parallel computations.
# Consider adding two vectors:
# $$ c = a + b$$
# and suppose we have two processors.
#
# How fast can we go?
#
# Of course, not faster than twice.

# In[2]:

## This demo requires the Anaconda distribution to be installed
import mkl
import numpy as np
n = 1000000
a = np.random.randn(n)
mkl.set_num_threads(1)
get_ipython().run_line_magic('timeit', 'a + a')
mkl.set_num_threads(2)
get_ipython().run_line_magic('timeit', 'a + a')


# In[3]:

## This demo requires the Anaconda distribution to be installed
import mkl
n = 500
a = np.random.randn(n, n)
mkl.set_num_threads(1)
get_ipython().run_line_magic('timeit', 'a.dot(a)')
mkl.set_num_threads(4)
get_ipython().run_line_magic('timeit', 'a.dot(a)')


# Typically, two cases are distinguished:
# 1. Shared memory (i.e., multicore on every desktop/smartphone)
# 2. Distributed memory (i.e., each processor has its own memory and can send information through a network)

# In both cases, the efficiency is governed by the **memory bandwidth**:
#
# for BLAS-1 and BLAS-2 routines (like the sum of two vectors), reads/writes take all the time.
#
# For BLAS-3 routines, a much more noticeable speedup can be obtained.
#
# For large-scale clusters (>100 000 cores, see the [Top500 list](http://www.top500.org/lists/)) scaling is still observed.

# ## Communication-avoiding algorithms
# A new direction in NLA is **communication-avoiding** algorithms (e.g., in Hadoop-style systems), when you have many computing nodes but communication between them is very slow and limited.
#
# This requires **absolutely different algorithms**.
#
# This can be an interesting **project** (e.g., do NLA in a cloud).

# ## Summary of the MM part
# - MM is the core of NLA. You have to think in block terms if you want high efficiency.
# - This is all about the computer memory hierarchy.
# - The $\mathcal{O}(n^{2 + \epsilon})$ complexity hypothesis is neither proven nor disproven yet.
#
# Now we go to [matrix norms](lecture-3.2.ipynb).

# In[14]:

from IPython.core.display import HTML

def css_styling():
    styles = open("./styles/custom.css", "r").read()
    return HTML(styles)

css_styling()