One of the most basic building blocks of numerical linear algebra is the computation of matrix multiplication: given an $m \times n$ matrix $A$ and an $n \times p$ matrix $B$, compute the $m \times p$ matrix $C = AB$. The entries of $C$ are given by the exact formula: $$ C_{ik} = \sum_{j=1}^n A_{ij} B_{jk} $$ but there are many ways to implement this computation. $\approx 2mnp$ flops (floating-point additions and multiplications) are required, but they can re-ordered arbitrarily, leading to $\sim (mnp)!$ possible orderings.
It turns out that the ordering of the operations in the matrix multiplication has a huge impact on performance, along with low-level details of the inner loops. Basically, three factors make the implementation of efficient matrix multiplication highly nontrivial:
Float64
operations), and to get the full benefit of these operations typically requires hand coding.As a consequence, there is a huge performance gap between the most obvious three-loop matrix-multiplication code and highly optimized code. This gap has become the central factor in the design of dense linear-algebra libraries for several decades, especially the industry-standard free/open-source the LAPACK library: nearly all dense linear algebra is now organized around highly optimized BLAS libraries.
Because Julia benefits from fast compilers, we can illustrate this performance gap fairly with simple Julia code. (In contrast, similar implementation in Matlab or Python would be orders of magnitude slower, and would demonstrate mostly language rather than the algorithmic effects.)
The following is the simplest, most obvious, matrix-multiplication algorithm: just three nested loops, implementing a dot product for each output $C_{ik}$.
The only concessions we have made to performance concerns here are (1) we implement an in-place matmul!
variant that operates on a pre-existing C
array, to avoid benchmarking the memory allocation/deallocation and (2) we use the @inbounds
macro to turn off array bounds-checking in Julia for the inner loop. Together, these make less than a factor of two difference in speed.
# compute C = A * B, using naive matrix-multiplication algorithm,
# with a pre-allocated output array C. ("!" is a Julia convention
# for functions that modify their arguments.)
function matmul!(C, A, B)
m,n = size(A)
n,p = size(B)
size(C) == (m,p) || error("incorrect dimensions ", size(C), " ≠ $m × $p")
for i = 1:m
for k = 1:p
c = zero(eltype(C))
for j = 1:n
@inbounds c += A[i,j] * B[j,k]
end
@inbounds C[i,k] = c
end
end
return C
end
# a wrapper that allocates C of an appropriate type
matmul(A, B) = matmul!(Array(promote_type(eltype(A), eltype(B)),
size(A,1), size(B,2)),
A, B)
matmul (generic function with 1 method)
# correctness check:
A = rand(5,6)
B = rand(6,7)
norm(matmul(A,B) - A * B)
5.874748045952207e-16
matmul
¶Here, we will benchmark our naive matmul
implementation against the highly optimized OpenBLAS library that Julia uses for its built-in matrix multiplication. Like matmul!
, we will call OpenBLAS with pre-allocated output via A_mul_B!(C, A, B)
instead of the simpler A * B
. By default, OpenBLAS uses multiple CPU cores, which gives it an "unfair" parallel speedup, but we can disable this for benchmarking purposes:
# for benchmarking, use only single-threaded BLAS:
blas_set_num_threads(1)
We will benchmark $n \times n$ matrix multiplications for various $n$ from 10 to 1000. Julia's @elapsed ...code...
macro is useful for benchmarking: it times the code and returns the time in seconds. As we go, we will print the ratio of the naive time to the optimized time, to see the slowdown of our naive code.
N = round(Int, logspace(1, 3, 50)) # 50 sizes from 10 to 1000
# alternatively, use N = 10:1000 to see some interesting patterns due to cache associativity etc.
t = Float64[]
t0 = Float64[]
for n in N
A = zeros(n,n)
B = zeros(n,n)
# preallocate output C so that allocation is not included in timing
C = zeros(n,n)
push!(t, @elapsed matmul!(C,A,B))
push!(t0, @elapsed A_mul_B!(C,A,B))
println("finished n = $n: slowdown of ", t[end]/t0[end])
end
finished n = 10: slowdown of 0.1035400987349464 finished n = 11: slowdown of 0.3934385485274015 finished n = 12: slowdown of 0.9483443708609272 finished n = 13: slowdown of 1.1404925312878482 finished n = 15: slowdown of 0.9914450867052024 finished n = 16: slowdown of 1.7304189435336979 finished n = 18: slowdown of 1.6973721390223229 finished n = 19: slowdown of 1.7661992263056094 finished n = 21: slowdown of 2.024729520865533 finished n = 23: slowdown of 0.7399927875946629 finished n = 26: slowdown of 1.5725318693588908 finished n = 28: slowdown of 3.6276960784313728 finished n = 31: slowdown of 2.6615384615384614 finished n = 34: slowdown of 2.2351009486742885 finished n = 37: slowdown of 3.243408824946651 finished n = 41: slowdown of 2.871548079517855 finished n = 45: slowdown of 3.164640800812855 finished n = 49: slowdown of 3.3017767328353322 finished n = 54: slowdown of 3.5265713449150886 finished n = 60: slowdown of 2.552516607051609 finished n = 66: slowdown of 4.412245015309863 finished n = 72: slowdown of 4.754955401387512 finished n = 79: slowdown of 4.614424447851936 finished n = 87: slowdown of 4.547156366956991 finished n = 95: slowdown of 4.661519378135509 finished n = 105: slowdown of 4.739791843510477 finished n = 115: slowdown of 4.591320122629614 finished n = 126: slowdown of 3.8284480445693005 finished n = 139: slowdown of 4.569132796154416 finished n = 153: slowdown of 5.2111327139376415 finished n = 168: slowdown of 5.5258936806264725 finished n = 184: slowdown of 5.528180642861182 finished n = 202: slowdown of 5.458618726240517 finished n = 222: slowdown of 5.297375455998394 finished n = 244: slowdown of 5.915406741871204 finished n = 268: slowdown of 5.530700392499934 finished n = 295: slowdown of 5.771326808665092 finished n = 324: slowdown of 5.687081483012161 finished n = 356: slowdown of 5.811475430653415 finished n = 391: slowdown of 6.131034678424199 finished n = 429: slowdown of 7.083835795704239 finished n = 471: slowdown of 8.439363626176503 finished n = 518: slowdown of 11.267570579349409 finished n = 569: slowdown of 25.16666066964611 finished n = 625: slowdown of 36.5777891682229 finished n = 687: slowdown of 40.679403502796966 finished n = 754: slowdown of 41.24305467789491 finished n = 829: slowdown of 38.67832822433434 finished n = 910: slowdown of 46.54816663609678 finished n = 1000: slowdown of 46.64543925412325
using PyPlot
PyPlot.svg(true) # SVG output for nice plots
true
Now, we will plot the results. Since the number of flops is $2n^3$, we will plot $2n^3 / t$ for time $t$ in microseconds in order to plot the gigaflops rate (billions of flops per second). If you naively think of a CPU as a box that performs floating-point instructions at a fixed rate, with all other instructions being negligible (a picture that may have been true circa 1985), this would be a flat horizontal line independent of $n$, but we will see that reality is quite different.
The OpenBLAS library gets an "unfair" factor of 4 speedup on typical modern Intel processors thanks to hand-coded support for AVX SIMD instructions, which perform 4 double-precision floating-point operations simultaneously, so we will divide the BLAS performance by 4 for comparison purposes.
plot(N, 2N.^3 ./ t * 1e-9, "rs-")
plot(N, 2N.^3 ./ t0 * 1e-9 / 4, "b.-")
ylabel(L"gigaflops $2n^3/t$")
xlabel(L"matrix size $n$")
legend(["naive matmul", "BLAS matmul / 4"], loc="lower left")
PyObject <matplotlib.legend.Legend object at 0x3159a4f50>
You may be suspicious that the problem is simply that Julia is slow. We can check this hypothesis by implementing the same algorithm in C, compiling it, and then calling it by Julia's built-in ccall
instruction that makes it easy to call C from Julia.
# C implementation:
Cmatmul = """
void Cmatmul(int m, int n, int p, double *C, double *A, double *B)
{
int i, j, k;
for (i = 0; i < m; ++i)
for (j = 0; j < p; ++j) {
double c = 0.0;
for (k = 0; k < n; ++k)
c += A[i + m*k] * B[k + n*j];
C[i + m*j] = c;
}
}
"""
# compile to a shared library by piping Cmatmul to gcc:
# (only works if you have gcc installed)
const Clib = tempname()
open(`gcc -fPIC -O3 -xc -shared -o $(Clib * "." * Base.Sys.dlext) -`, "w") do f
print(f, Cmatmul)
end
# define a Julia cmatmul! function that simply calls the C code in the shared library we compiled
function cmatmul!(C, A, B)
m,n = size(A)
n,p = size(B)
size(C) == (m,p) || error("incorrect dimensions ", size(C), " ≠ $m × $p")
ccall(("Cmatmul", Clib), Void, (Cint, Cint, Cint, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}),
m, n, p, C, A, B)
return C
end
cmatmul(A, B) = cmatmul!(Array(promote_type(eltype(A), eltype(B)),
size(A,1), size(B,2)),
A, B)
cmatmul (generic function with 1 method)
# correctness check:
A = rand(5,6)
B = rand(6,7)
norm(cmatmul(A,B) - A * B)
5.438959822042073e-16
Now, let's benchmark it and print out the "speedup" compared to pure Julia. We see that it is not actually much faster (and in fact is occasionally slower). Julia's main claim to fame is that it is a dynamic language (like Python or Matlab) that stays within a factor of 2 (usually better) of C.
tc = Float64[]
for n in N
A = zeros(n,n)
B = zeros(n,n)
# preallocate output C so that allocation is not included in timing
C = zeros(n,n)
push!(tc, @elapsed cmatmul!(C,A,B))
println("finished n = $n: speedup of ", tc[end]/t[length(tc)])
end
finished n = 10: speedup of 0.7577519017625179 finished n = 11: speedup of 0.830385344283007 finished n = 12: speedup of 1.34683426443203 finished n = 13: speedup of 1.239646017699115 finished n = 15: speedup of 1.2045242537313432 finished n = 16: speedup of 1.2166315789473683 finished n = 18: speedup of 1.3301148659896787 finished n = 19: speedup of 1.267898699520876 finished n = 21: speedup of 1.3372955288985822 finished n = 23: speedup of 1.2903671215074723 finished n = 26: speedup of 1.3376331360946745 finished n = 28: speedup of 0.631578947368421 finished n = 31: speedup of 1.3269029876732363 finished n = 34: speedup of 1.3451597105076998 finished n = 37: speedup of 1.3435278137402635 finished n = 41: speedup of 1.3439682987187922 finished n = 45: speedup of 1.3435084548243632 finished n = 49: speedup of 1.3110516211027283 finished n = 54: speedup of 1.3429943334740164 finished n = 60: speedup of 1.3280149741505725 finished n = 66: speedup of 1.2345982295579758 finished n = 72: speedup of 1.2742275024751186 finished n = 79: speedup of 1.3209061088698708 finished n = 87: speedup of 1.4694414692923736 finished n = 95: speedup of 1.33804267739574 finished n = 105: speedup of 1.315616930464479 finished n = 115: speedup of 1.3081477961364825 finished n = 126: speedup of 1.3054582930756842 finished n = 139: speedup of 1.2415774562921682 finished n = 153: speedup of 1.377781104751245 finished n = 168: speedup of 1.3192460080852928 finished n = 184: speedup of 1.3533118456436806 finished n = 202: speedup of 1.3517407806809112 finished n = 222: speedup of 1.3675698268010255 finished n = 244: speedup of 1.2641042130173612 finished n = 268: speedup of 1.3564776094668647 finished n = 295: speedup of 1.3427095443011976 finished n = 324: speedup of 1.3451304058119589 finished n = 356: speedup of 1.314114073772014 finished n = 391: speedup of 1.3614335265990731 finished n = 429: speedup of 1.278161696265901 finished n = 471: speedup of 1.143742654538017 finished n = 518: speedup of 1.1189199233675 finished n = 569: speedup of 0.9689080503729397 finished n = 625: speedup of 0.9809858078939151 finished n = 687: speedup of 1.010350707041232 finished n = 754: speedup of 1.0200991139697084 finished n = 829: speedup of 0.9987222049247495 finished n = 910: speedup of 0.9924198431820828 finished n = 1000: speedup of 1.0399383034324052
As a first step in the right direction, we'll implement a cache-oblivious algorithm for matrix multiplication: divide the matrices into four submatrices which are multiplied recursively until a sufficiently large base case is reached (large enough to amortize the recursion overhead). This strategy erases the steep performance drop-off that occurs for large $n$ where the matrix goes out-of-cache, at the cost of ~25 lines of code rather than ~10 for the naive loops.
(It still doesn't match the OpenBLAS performance because it fails to address the other two problems: unrolling and optimizing the base cases to optimize register utilization, and coding for SIMD instructions.)
function add_matmul_rec!(m,n,p, i0,j0,k0, C,A,B)
if m+n+p <= 64 # base case: naive matmult for sufficiently large matrices
for i = 1:m
for k = 1:p
c = zero(eltype(C))
for j = 1:n
@inbounds c += A[i0+i,j0+j] * B[j0+j,k0+k]
end
@inbounds C[i0+i,k0+k] += c
end
end
else
m2 = m ÷ 2; n2 = n ÷ 2; p2 = p ÷ 2
add_matmul_rec!(m2, n2, p2, i0, j0, k0, C, A, B)
add_matmul_rec!(m-m2, n2, p2, i0+m2, j0, k0, C, A, B)
add_matmul_rec!(m2, n-n2, p2, i0, j0+n2, k0, C, A, B)
add_matmul_rec!(m2, n2, p-p2, i0, j0, k0+p2, C, A, B)
add_matmul_rec!(m-m2, n-n2, p2, i0+m2, j0+n2, k0, C, A, B)
add_matmul_rec!(m2, n-n2, p-p2, i0, j0+n2, k0+p2, C, A, B)
add_matmul_rec!(m-m2, n2, p-p2, i0+m2, j0, k0+p2, C, A, B)
add_matmul_rec!(m-m2, n-n2, p-p2, i0+m2, j0+n2, k0+p2, C, A, B)
end
return C
end
function matmul_rec!(C, A, B)
m,n = size(A)
n,p = size(B)
size(C) == (m,p) || error("incorrect dimensions ", size(C), " ≠ $m × $p")
fill!(C, 0)
return add_matmul_rec!(m,n,p, 0,0,0, C,A,B)
end
matmul_rec(A, B) = matmul_rec!(Array(promote_type(eltype(A), eltype(B)),
size(A,1), size(B,2)),
A, B)
matmul_rec (generic function with 1 method)
# correctness check:
A = rand(50,60)
B = rand(60,70)
norm(matmul_rec(A,B) - A * B)
5.019916642754782e-14
tco = Float64[]
for n in N
A = zeros(n,n)
B = zeros(n,n)
# preallocate output C so that allocation is not included in timing
C = zeros(n,n)
push!(tco, @elapsed matmul_rec!(C,A,B))
println("finished n = $n: slowdown of ", tco[end]/t0[length(tc)])
end
finished n = 10: slowdown of 0.0002640718846038852 finished n = 11: slowdown of 1.5406894559891634e-5 finished n = 12: slowdown of 1.48666362935588e-5 finished n = 13: slowdown of 1.8260729402049778e-5 finished n = 15: slowdown of 2.695570950020644e-5 finished n = 16: slowdown of 3.1900661632052845e-5 finished n = 18: slowdown of 4.0830812975554396e-5 finished n = 19: slowdown of 4.701518113016436e-5 finished n = 21: slowdown of 6.153859452511171e-5 finished n = 23: slowdown of 8.749641522750339e-5 finished n = 26: slowdown of 0.00012073183257637853 finished n = 28: slowdown of 0.0001489460260381603 finished n = 31: slowdown of 0.0001983701234382087 finished n = 34: slowdown of 0.0002577031930407617 finished n = 37: slowdown of 0.000326659215763643 finished n = 41: slowdown of 0.00043945242981473085 finished n = 45: slowdown of 0.0006296868993599274 finished n = 49: slowdown of 0.0007996585059278408 finished n = 54: slowdown of 0.001055448549107824 finished n = 60: slowdown of 0.0014455404413185454 finished n = 66: slowdown of 0.0019704299489691113 finished n = 72: slowdown of 0.0024838723373530242 finished n = 79: slowdown of 0.003170457966127205 finished n = 87: slowdown of 0.004306074485979218 finished n = 95: slowdown of 0.005656472268606749 finished n = 105: slowdown of 0.007404716238561779 finished n = 115: slowdown of 0.009630033681607116 finished n = 126: slowdown of 0.013590921271277276 finished n = 139: slowdown of 0.017406441251426284 finished n = 153: slowdown of 0.023697507235647182 finished n = 168: slowdown of 0.030236615323910893 finished n = 184: slowdown of 0.0447346365803406 finished n = 202: slowdown of 0.05462755993434527 finished n = 222: slowdown of 0.0704428198278089 finished n = 244: slowdown of 0.0954285697215369 finished n = 268: slowdown of 0.12186967368909192 finished n = 295: slowdown of 0.1611034006016316 finished n = 324: slowdown of 0.22232863897629065 finished n = 356: slowdown of 0.30612406897610767 finished n = 391: slowdown of 0.4013126750436783 finished n = 429: slowdown of 0.5399580479922224 finished n = 471: slowdown of 0.6878182184748497 finished n = 518: slowdown of 0.9196936850037526 finished n = 569: slowdown of 1.188057699074366 finished n = 625: slowdown of 1.5717575414969196 finished n = 687: slowdown of 2.162829799069688 finished n = 754: slowdown of 3.012230709856141 finished n = 829: slowdown of 3.9697904420812633 finished n = 910: slowdown of 5.220376279077336 finished n = 1000: slowdown of 6.648449090128808
plot(N, 2N.^3 ./ t * 1e-9, "rs-")
plot(N, 2N.^3 ./ tco * 1e-9, "ko-")
plot(N, 2N.^3 ./ t0 * 1e-9 / 4, "b.-")
ylabel(L"gigaflops $2n^3/t$")
xlabel(L"matrix size $n$")
legend(["naive matmul", "cache-oblivious", "BLAS matmul / 4"], loc="lower left")
PyObject <matplotlib.legend.Legend object at 0x3059c24d0>