Notebook

Benchmarking Matrix Multiplication¶

One of the most basic building blocks of numerical linear algebra is the computation of matrix multiplication: given an $m \times n$ matrix $A$ and an $n \times p$ matrix $B$, compute the $m \times p$ matrix $C = AB$. The entries of $C$ are given by the exact formula: $$ C_{ik} = \sum_{j=1}^n A_{ij} B_{jk} $$ but there are many ways to implement this computation. $\approx 2mnp$ flops (floating-point additions and multiplications) are required, but they can re-ordered arbitrarily, leading to $\sim (mnp)!$ possible orderings.

It turns out that the ordering of the operations in the matrix multiplication has a huge impact on performance, along with low-level details of the inner loops. Basically, three factors make the implementation of efficient matrix multiplication highly nontrivial:

Caches: the matrix accesses must be reordered to obtain temporal locality and hence efficient memory (cache) usage.
Registers: at the lowest level, the CPU registers form a kind of ideal cache. The innermost loops of the matrix multiplication need to be unrolled in order to load many values into registers and perform as much work with them as possible (essentially a small submatrix multiplication). It turns out that a lot of tricks are required to do this well.
SIMD instructions: modern CPUs include special instructions that can perform several arithmetic operations at once (e.g. 2, 4, or even 8 Float64 operations), and to get the full benefit of these operations typically requires hand coding.

As a consequence, there is a huge performance gap between the most obvious three-loop matrix-multiplication code and highly optimized code. This gap has become the central factor in the design of dense linear-algebra libraries for several decades, especially the industry-standard free/open-source the LAPACK library: nearly all dense linear algebra is now organized around highly optimized BLAS libraries.

Because Julia benefits from fast compilers, we can illustrate this performance gap fairly with simple Julia code. (In contrast, similar implementation in Matlab or Python would be orders of magnitude slower, and would demonstrate mostly language rather than the algorithmic effects.)

Naive algorithm¶

The following is the simplest, most obvious, matrix-multiplication algorithm: just three nested loops, implementing a dot product for each output $C_{ik}$.

The only concessions we have made to performance concerns here are (1) we implement an in-place matmul! variant that operates on a pre-existing C array, to avoid benchmarking the memory allocation/deallocation and (2) we use the @inbounds macro to turn off array bounds-checking in Julia for the inner loop. Together, these make less than a factor of two difference in speed.

In [1]:

# compute C = A * B, using naive matrix-multiplication algorithm,
# with a pre-allocated output array C.  ("!" is a Julia convention
# for functions that modify their arguments.)
function matmul!(C, A, B)
    m,n = size(A)
    n,p = size(B)
    size(C) == (m,p) || error("incorrect dimensions ", size(C), " ≠ $m × $p")
    for i = 1:m
        for k = 1:p
            c = zero(eltype(C))
            for j = 1:n
                @inbounds c += A[i,j] * B[j,k]
            end
            @inbounds C[i,k] = c
        end
    end
    return C
end

# a wrapper that allocates C of an appropriate type
matmul(A, B) = matmul!(Array(promote_type(eltype(A), eltype(B)),
                             size(A,1), size(B,2)),
                       A, B)

Out[1]:

matmul (generic function with 1 method)

In [2]:

# correctness check:
A = rand(5,6)
B = rand(6,7)
norm(matmul(A,B) - A * B)

Out[2]:

5.874748045952207e-16

Benchmarking naive `matmul`¶

Here, we will benchmark our naive matmul implementation against the highly optimized OpenBLAS library that Julia uses for its built-in matrix multiplication. Like matmul!, we will call OpenBLAS with pre-allocated output via A_mul_B!(C, A, B) instead of the simpler A * B. By default, OpenBLAS uses multiple CPU cores, which gives it an "unfair" parallel speedup, but we can disable this for benchmarking purposes:

In [3]:

# for benchmarking, use only single-threaded BLAS:
blas_set_num_threads(1)

We will benchmark $n \times n$ matrix multiplications for various $n$ from 10 to 1000. Julia's @elapsed ...code... macro is useful for benchmarking: it times the code and returns the time in seconds. As we go, we will print the ratio of the naive time to the optimized time, to see the slowdown of our naive code.

In [4]:

N = round(Int, logspace(1, 3, 50))  # 50 sizes from 10 to 1000
# alternatively, use N = 10:1000 to see some interesting patterns due to cache associativity etc.
t = Float64[]
t0 = Float64[]
for n in N
    A = zeros(n,n)
    B = zeros(n,n)
    # preallocate output C so that allocation is not included in timing
    C = zeros(n,n)
    push!(t, @elapsed matmul!(C,A,B))
    push!(t0, @elapsed A_mul_B!(C,A,B))
    println("finished n = $n: slowdown of ", t[end]/t0[end])
end

finished n = 10: slowdown of 0.1035400987349464
finished n = 11: slowdown of 0.3934385485274015
finished n = 12: slowdown of 0.9483443708609272
finished n = 13: slowdown of 1.1404925312878482
finished n = 15: slowdown of 0.9914450867052024
finished n = 16: slowdown of 1.7304189435336979
finished n = 18: slowdown of 1.6973721390223229
finished n = 19: slowdown of 1.7661992263056094
finished n = 21: slowdown of 2.024729520865533
finished n = 23: slowdown of 0.7399927875946629
finished n = 26: slowdown of 1.5725318693588908
finished n = 28: slowdown of 3.6276960784313728
finished n = 31: slowdown of 2.6615384615384614
finished n = 34: slowdown of 2.2351009486742885
finished n = 37: slowdown of 3.243408824946651
finished n = 41: slowdown of 2.871548079517855
finished n = 45: slowdown of 3.164640800812855
finished n = 49: slowdown of 3.3017767328353322
finished n = 54: slowdown of 3.5265713449150886
finished n = 60: slowdown of 2.552516607051609
finished n = 66: slowdown of 4.412245015309863
finished n = 72: slowdown of 4.754955401387512
finished n = 79: slowdown of 4.614424447851936
finished n = 87: slowdown of 4.547156366956991
finished n = 95: slowdown of 4.661519378135509
finished n = 105: slowdown of 4.739791843510477
finished n = 115: slowdown of 4.591320122629614
finished n = 126: slowdown of 3.8284480445693005
finished n = 139: slowdown of 4.569132796154416
finished n = 153: slowdown of 5.2111327139376415
finished n = 168: slowdown of 5.5258936806264725
finished n = 184: slowdown of 5.528180642861182
finished n = 202: slowdown of 5.458618726240517
finished n = 222: slowdown of 5.297375455998394
finished n = 244: slowdown of 5.915406741871204
finished n = 268: slowdown of 5.530700392499934
finished n = 295: slowdown of 5.771326808665092
finished n = 324: slowdown of 5.687081483012161
finished n = 356: slowdown of 5.811475430653415
finished n = 391: slowdown of 6.131034678424199
finished n = 429: slowdown of 7.083835795704239
finished n = 471: slowdown of 8.439363626176503
finished n = 518: slowdown of 11.267570579349409
finished n = 569: slowdown of 25.16666066964611
finished n = 625: slowdown of 36.5777891682229
finished n = 687: slowdown of 40.679403502796966
finished n = 754: slowdown of 41.24305467789491
finished n = 829: slowdown of 38.67832822433434
finished n = 910: slowdown of 46.54816663609678
finished n = 1000: slowdown of 46.64543925412325

In [14]:

using PyPlot
PyPlot.svg(true) # SVG output for nice plots

Out[14]:

true

Now, we will plot the results. Since the number of flops is $2n^3$, we will plot $2n^3 / t$ for time $t$ in microseconds in order to plot the gigaflops rate (billions of flops per second). If you naively think of a CPU as a box that performs floating-point instructions at a fixed rate, with all other instructions being negligible (a picture that may have been true circa 1985), this would be a flat horizontal line independent of $n$, but we will see that reality is quite different.

The OpenBLAS library gets an "unfair" factor of 4 speedup on typical modern Intel processors thanks to hand-coded support for AVX SIMD instructions, which perform 4 double-precision floating-point operations simultaneously, so we will divide the BLAS performance by 4 for comparison purposes.

In [6]:

plot(N, 2N.^3 ./ t * 1e-9, "rs-")
plot(N, 2N.^3 ./ t0 * 1e-9 / 4, "b.-")
ylabel(L"gigaflops $2n^3/t$")
xlabel(L"matrix size $n$")
legend(["naive matmul", "BLAS matmul / 4"], loc="lower left")

Out[6]:

PyObject <matplotlib.legend.Legend object at 0x3159a4f50>

Naive matmul in C¶

You may be suspicious that the problem is simply that Julia is slow. We can check this hypothesis by implementing the same algorithm in C, compiling it, and then calling it by Julia's built-in ccall instruction that makes it easy to call C from Julia.

In [7]:

# C implementation:
Cmatmul = """
void Cmatmul(int m, int n, int p, double *C, double *A, double *B)
{
    int i, j, k;
    for (i = 0; i < m; ++i)
        for (j = 0; j < p; ++j) {
            double c = 0.0;
            for (k = 0; k < n; ++k)
                c += A[i + m*k] * B[k + n*j];
            C[i + m*j] = c;
    }
}
"""
# compile to a shared library by piping Cmatmul to gcc:
# (only works if you have gcc installed)
const Clib = tempname()
open(`gcc -fPIC -O3 -xc -shared -o $(Clib * "." * Base.Sys.dlext) -`, "w") do f
    print(f, Cmatmul)
end

# define a Julia cmatmul! function that simply calls the C code in the shared library we compiled
function cmatmul!(C, A, B)
    m,n = size(A)
    n,p = size(B)
    size(C) == (m,p) || error("incorrect dimensions ", size(C), " ≠ $m × $p")
    ccall(("Cmatmul", Clib), Void, (Cint, Cint, Cint, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}),
          m, n, p, C, A, B)
    return C
end
cmatmul(A, B) = cmatmul!(Array(promote_type(eltype(A), eltype(B)),
                               size(A,1), size(B,2)),
                         A, B)

Out[7]:

cmatmul (generic function with 1 method)

In [8]:

# correctness check:
A = rand(5,6)
B = rand(6,7)
norm(cmatmul(A,B) - A * B)

Out[8]:

5.438959822042073e-16

Now, let's benchmark it and print out the "speedup" compared to pure Julia. We see that it is not actually much faster (and in fact is occasionally slower). Julia's main claim to fame is that it is a dynamic language (like Python or Matlab) that stays within a factor of 2 (usually better) of C.

In [9]:

tc = Float64[]
for n in N
    A = zeros(n,n)
    B = zeros(n,n)
    # preallocate output C so that allocation is not included in timing
    C = zeros(n,n)
    push!(tc, @elapsed cmatmul!(C,A,B))
    println("finished n = $n: speedup of ", tc[end]/t[length(tc)])
end

finished n = 10: speedup of 0.7577519017625179
finished n = 11: speedup of 0.830385344283007
finished n = 12: speedup of 1.34683426443203
finished n = 13: speedup of 1.239646017699115
finished n = 15: speedup of 1.2045242537313432
finished n = 16: speedup of 1.2166315789473683
finished n = 18: speedup of 1.3301148659896787
finished n = 19: speedup of 1.267898699520876
finished n = 21: speedup of 1.3372955288985822
finished n = 23: speedup of 1.2903671215074723
finished n = 26: speedup of 1.3376331360946745
finished n = 28: speedup of 0.631578947368421
finished n = 31: speedup of 1.3269029876732363
finished n = 34: speedup of 1.3451597105076998
finished n = 37: speedup of 1.3435278137402635
finished n = 41: speedup of 1.3439682987187922
finished n = 45: speedup of 1.3435084548243632
finished n = 49: speedup of 1.3110516211027283
finished n = 54: speedup of 1.3429943334740164
finished n = 60: speedup of 1.3280149741505725
finished n = 66: speedup of 1.2345982295579758
finished n = 72: speedup of 1.2742275024751186
finished n = 79: speedup of 1.3209061088698708
finished n = 87: speedup of 1.4694414692923736
finished n = 95: speedup of 1.33804267739574
finished n = 105: speedup of 1.315616930464479
finished n = 115: speedup of 1.3081477961364825
finished n = 126: speedup of 1.3054582930756842
finished n = 139: speedup of 1.2415774562921682
finished n = 153: speedup of 1.377781104751245
finished n = 168: speedup of 1.3192460080852928
finished n = 184: speedup of 1.3533118456436806
finished n = 202: speedup of 1.3517407806809112
finished n = 222: speedup of 1.3675698268010255
finished n = 244: speedup of 1.2641042130173612
finished n = 268: speedup of 1.3564776094668647
finished n = 295: speedup of 1.3427095443011976
finished n = 324: speedup of 1.3451304058119589
finished n = 356: speedup of 1.314114073772014
finished n = 391: speedup of 1.3614335265990731
finished n = 429: speedup of 1.278161696265901
finished n = 471: speedup of 1.143742654538017
finished n = 518: speedup of 1.1189199233675
finished n = 569: speedup of 0.9689080503729397
finished n = 625: speedup of 0.9809858078939151
finished n = 687: speedup of 1.010350707041232
finished n = 754: speedup of 1.0200991139697084
finished n = 829: speedup of 0.9987222049247495
finished n = 910: speedup of 0.9924198431820828
finished n = 1000: speedup of 1.0399383034324052

Cache-oblivious matrix-multiplication¶

As a first step in the right direction, we'll implement a cache-oblivious algorithm for matrix multiplication: divide the matrices into four submatrices which are multiplied recursively until a sufficiently large base case is reached (large enough to amortize the recursion overhead). This strategy erases the steep performance drop-off that occurs for large $n$ where the matrix goes out-of-cache, at the cost of ~25 lines of code rather than ~10 for the naive loops.

(It still doesn't match the OpenBLAS performance because it fails to address the other two problems: unrolling and optimizing the base cases to optimize register utilization, and coding for SIMD instructions.)

In [10]:

function add_matmul_rec!(m,n,p, i0,j0,k0, C,A,B)
    if m+n+p <= 64   # base case: naive matmult for sufficiently large matrices
        for i = 1:m
            for k = 1:p
                c = zero(eltype(C))
                for j = 1:n
                    @inbounds c += A[i0+i,j0+j] * B[j0+j,k0+k]
                end
                @inbounds C[i0+i,k0+k] += c
            end
        end
    else
        m2 = m ÷ 2; n2 = n ÷ 2; p2 = p ÷ 2
        add_matmul_rec!(m2, n2, p2, i0, j0, k0, C, A, B)
        
        add_matmul_rec!(m-m2, n2, p2, i0+m2, j0, k0, C, A, B)
        add_matmul_rec!(m2, n-n2, p2, i0, j0+n2, k0, C, A, B)
        add_matmul_rec!(m2, n2, p-p2, i0, j0, k0+p2, C, A, B)
        
        add_matmul_rec!(m-m2, n-n2, p2, i0+m2, j0+n2, k0, C, A, B)
        add_matmul_rec!(m2, n-n2, p-p2, i0, j0+n2, k0+p2, C, A, B)
        add_matmul_rec!(m-m2, n2, p-p2, i0+m2, j0, k0+p2, C, A, B)
        
        add_matmul_rec!(m-m2, n-n2, p-p2, i0+m2, j0+n2, k0+p2, C, A, B)
    end
    return C
end

function matmul_rec!(C, A, B)
    m,n = size(A)
    n,p = size(B)
    size(C) == (m,p) || error("incorrect dimensions ", size(C), " ≠ $m × $p")
    fill!(C, 0)
    return add_matmul_rec!(m,n,p, 0,0,0, C,A,B)
end


matmul_rec(A, B) = matmul_rec!(Array(promote_type(eltype(A), eltype(B)),
                                     size(A,1), size(B,2)),
                               A, B)

Out[10]:

matmul_rec (generic function with 1 method)

In [11]:

# correctness check:
A = rand(50,60)
B = rand(60,70)
norm(matmul_rec(A,B) - A * B)

Out[11]:

5.019916642754782e-14

In [12]:

tco = Float64[]
for n in N
    A = zeros(n,n)
    B = zeros(n,n)
    # preallocate output C so that allocation is not included in timing
    C = zeros(n,n)
    push!(tco, @elapsed matmul_rec!(C,A,B))
    println("finished n = $n: slowdown of ", tco[end]/t0[length(tc)])
end

finished n = 10: slowdown of 0.0002640718846038852
finished n = 11: slowdown of 1.5406894559891634e-5
finished n = 12: slowdown of 1.48666362935588e-5
finished n = 13: slowdown of 1.8260729402049778e-5
finished n = 15: slowdown of 2.695570950020644e-5
finished n = 16: slowdown of 3.1900661632052845e-5
finished n = 18: slowdown of 4.0830812975554396e-5
finished n = 19: slowdown of 4.701518113016436e-5
finished n = 21: slowdown of 6.153859452511171e-5
finished n = 23: slowdown of 8.749641522750339e-5
finished n = 26: slowdown of 0.00012073183257637853
finished n = 28: slowdown of 0.0001489460260381603
finished n = 31: slowdown of 0.0001983701234382087
finished n = 34: slowdown of 0.0002577031930407617
finished n = 37: slowdown of 0.000326659215763643
finished n = 41: slowdown of 0.00043945242981473085
finished n = 45: slowdown of 0.0006296868993599274
finished n = 49: slowdown of 0.0007996585059278408
finished n = 54: slowdown of 0.001055448549107824
finished n = 60: slowdown of 0.0014455404413185454
finished n = 66: slowdown of 0.0019704299489691113
finished n = 72: slowdown of 0.0024838723373530242
finished n = 79: slowdown of 0.003170457966127205
finished n = 87: slowdown of 0.004306074485979218
finished n = 95: slowdown of 0.005656472268606749
finished n = 105: slowdown of 0.007404716238561779
finished n = 115: slowdown of 0.009630033681607116
finished n = 126: slowdown of 0.013590921271277276
finished n = 139: slowdown of 0.017406441251426284
finished n = 153: slowdown of 0.023697507235647182
finished n = 168: slowdown of 0.030236615323910893
finished n = 184: slowdown of 0.0447346365803406
finished n = 202: slowdown of 0.05462755993434527
finished n = 222: slowdown of 0.0704428198278089
finished n = 244: slowdown of 0.0954285697215369
finished n = 268: slowdown of 0.12186967368909192
finished n = 295: slowdown of 0.1611034006016316
finished n = 324: slowdown of 0.22232863897629065
finished n = 356: slowdown of 0.30612406897610767
finished n = 391: slowdown of 0.4013126750436783
finished n = 429: slowdown of 0.5399580479922224
finished n = 471: slowdown of 0.6878182184748497
finished n = 518: slowdown of 0.9196936850037526
finished n = 569: slowdown of 1.188057699074366
finished n = 625: slowdown of 1.5717575414969196
finished n = 687: slowdown of 2.162829799069688
finished n = 754: slowdown of 3.012230709856141
finished n = 829: slowdown of 3.9697904420812633
finished n = 910: slowdown of 5.220376279077336
finished n = 1000: slowdown of 6.648449090128808

In [13]:

plot(N, 2N.^3 ./ t * 1e-9, "rs-")
plot(N, 2N.^3 ./ tco * 1e-9, "ko-")
plot(N, 2N.^3 ./ t0 * 1e-9 / 4, "b.-")
ylabel(L"gigaflops $2n^3/t$")
xlabel(L"matrix size $n$")
legend(["naive matmul", "cache-oblivious", "BLAS matmul / 4"], loc="lower left")

Out[13]:

PyObject <matplotlib.legend.Legend object at 0x3059c24d0>

Benchmarking Matrix Multiplication¶

Naive algorithm¶

Benchmarking naive matmul¶

Naive matmul in C¶

Cache-oblivious matrix-multiplication¶

Benchmarking naive `matmul`¶