In many problems, especially those that access a lot of data while performing relatively simple computations on each datum, the performance bottleneck is memory access rather than arithmetic. Because memory is arranged into a memory hierarchy of larger/slower and smaller/faster memories, changing the order of memory accesses can have a huge impact on performance.
In this notebook, we'll explore these performance issues with a few typical matrix algorithms, implemented in Julia.
using BenchmarkTools # a useful package of benchmarking utilities
versioninfo() # a useful function to print out information about the machine
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_CXX_RTTI = 1
One of the most basic building blocks of numerical linear algebra is matrix multiplication: given an $m \times n$ matrix $A$ and an $n \times p$ matrix $B$, compute the $m \times p$ matrix $C = AB$. The entries of $C$ are given by the exact formula: $$ C_{ik} = \sum_{j=1}^n A_{ij} B_{jk} $$ but there are many ways to implement this computation. Since each of the $mp$ entries requires $n$ multiplications and about $n$ additions, $\approx 2mnp$ flops (floating-point additions and multiplications) are required, but they can be re-ordered almost arbitrarily, leading to $\sim (mnp)!$ possible orderings.
It turns out that the ordering of the operations in the matrix multiplication has a huge impact on performance, along with low-level details of the inner loops. Basically, three factors make the implementation of efficient matrix multiplication highly nontrivial:
* Memory is organized into a hierarchy of caches, so fast code must re-use data while it is in cache (temporal locality) and access consecutive memory locations (spatial locality); the order of the loops and of the memory accesses therefore matters enormously.
* Modern CPUs have SIMD instructions that perform several Float64 operations at once, and to get the full benefit of these operations typically requires hand coding.
* There are also fused multiply-add instructions that combine a multiplication and an addition into a single operation, which the compiler will generally not exploit unless it is allowed to re-order floating-point operations.

As a consequence, there is a huge performance gap between the most obvious three-loop matrix-multiplication code and highly optimized code. This gap has been a central factor in the design of dense linear-algebra libraries for several decades, especially the industry-standard free/open-source LAPACK library: nearly all dense linear algebra is now organized around highly optimized BLAS libraries.
Because Julia compiles to fast native code, we can illustrate this performance gap fairly using simple Julia code. (In contrast, a similar implementation in Matlab or Python would be orders of magnitude slower, and would mostly demonstrate language overhead rather than algorithmic effects.)
The following is the simplest, most obvious, matrix-multiplication algorithm: just three nested loops, implementing a dot product for each output $C_{ik}$.
The only concessions we have made to performance concerns here are (1) we implement an in-place matmul!
variant that operates on a pre-existing C
array, to avoid benchmarking memory allocation/deallocation, and (2) we use the @inbounds
macro to turn off array bounds-checking in Julia for the inner loop. Together, these make less than a factor of two difference in speed.
# compute C = A * B, using naive matrix-multiplication algorithm,
# with a pre-allocated output array C. ("!" is a Julia convention
# for functions that modify their arguments.)
function matmul!(C, A, B)
    m,n = size(A)
    n,p = size(B)
    size(C) == (m,p) || error("incorrect dimensions ", size(C), " ≠ $m × $p")
    for i = 1:m, k = 1:p
        c = zero(eltype(C))
        for j = 1:n
            @inbounds c += A[i,j] * B[j,k]
        end
        @inbounds C[i,k] = c
    end
    return C
end
# a wrapper that allocates C of an appropriate type
matmul(A, B) = matmul!(Array{promote_type(eltype(A), eltype(B))}(undef,
size(A,1), size(B,2)),
A, B)
matmul (generic function with 1 method)
# correctness check:
A = rand(5,6)
B = rand(6,7)
using LinearAlgebra
norm(matmul(A,B) - A * B)
8.599750569898517e-16
All modern CPUs have special SIMD instructions that perform multiple floating-point operations simultaneously. This gives almost a factor-of-8 speedup to matrix multiplication on typical modern Intel processors, thanks to hand-coded support for AVX-512 SIMD instructions.
Not only that, but there are also "fused multiply-add" (FMA) instructions that compute x + αy
in a single instruction, which can give another factor of 2 in computing dot products and matrix multiplications.
If we don't use these instructions, we lose up to a factor of 16 right off the bat, and it will be hard to compete with optimized BLAS libraries. Fortunately, for a problem as simple as matrix multiplication, the use of SIMD can be automated. You typically have to tell the compiler to do this explicitly, however, because using SIMD requires performing the operations in a different order, which (slightly) changes the floating-point result. (Compilers don't like to change your answers unless you give them permission!)
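To see why the compiler needs permission, here is a small illustration (not part of the original benchmarks): adding the same numbers in a different order gives a slightly different floating-point result, because the rounding errors accumulate differently.
# summing the same data in two different orders (illustrative sketch only):
x = rand(10^6)
s1 = foldl(+, x)      # strict left-to-right accumulation
s2 = sum(reverse(x))  # same numbers, opposite order
s1 == s2              # typically false
abs(s1 - s2)          # tiny, but nonzero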
In particular, there is a nice Julia package called LoopVectorization.jl that provides an @avx
macro to rewrite simple loops to use AVX and FMA instructions. We'll use this in a matmul2
variant of the code above:
using LoopVectorization
# compute C = A * B, using naive matrix-multiplication algorithm,
# with a pre-allocated output array C. ("!" is a Julia convention
# for functions that modify their arguments.)
function matmul2!(C, A, B)
    m,n = size(A)
    n,p = size(B)
    size(C) == (m,p) || error("incorrect dimensions ", size(C), " ≠ $m × $p")
    @avx for i = 1:m, k = 1:p
        c = zero(eltype(C))
        for j = 1:n
            @inbounds c += A[i,j] * B[j,k]
        end
        @inbounds C[i,k] = c
    end
    return C
end
# a wrapper that allocates C of an appropriate type
matmul2(A, B) = matmul2!(Array{promote_type(eltype(A), eltype(B))}(undef,
size(A,1), size(B,2)),
A, B)
matmul2 (generic function with 1 method)
A = rand(50,60); B = rand(60,70); C = zeros(50,70);
@btime matmul!($C, $A, $B);
@btime matmul2!($C, $A, $B);
183.287 μs (0 allocations: 0 bytes)
10.739 μs (0 allocations: 0 bytes)
Here, we will benchmark our two naive matmul
implementations against the highly optimized OpenBLAS library that Julia uses for its built-in matrix multiplication. Like matmul!
, we will call OpenBLAS with pre-allocated output via mul!(C, A, B)
instead of the simpler A * B
. By default, OpenBLAS uses multiple CPU cores, which gives it an "unfair" parallel speedup, but we can disable this for benchmarking purposes:
# for benchmarking, use only single-threaded BLAS:
BLAS.set_num_threads(1)
We will benchmark $n \times n$ matrix multiplications for various $n$ from 10 to 3000. Julia's @elapsed ...code...
macro times code and returns the elapsed time in seconds; here we use the more careful @belapsed macro from BenchmarkTools, which runs the code several times and returns the minimum time in seconds. As we go, we will print the ratio of the naive time to the optimized time, to see the slowdown of our naive code.
N = round.(Int, 10 .^ range(1, log10(3000), length=60)) # 60 sizes from 10 to 3000
# alternatively, use N = 10:1000 to see some interesting patterns due to cache associativity etc.
t = Float64[]
tv = Float64[]
t0 = Float64[]
for n in N
    A = zeros(n,n)
    B = zeros(n,n)
    # preallocate output C so that allocation is not included in timing
    C = zeros(n,n)
    push!(t, @belapsed matmul!($C,$A,$B))
    push!(tv, @belapsed matmul2!($C,$A,$B))
    push!(t0, @belapsed mul!($C,$A,$B))
    println("finished n = $n: slowdown of ", t[end]/t0[end])
end
finished n = 10: slowdown of 2.2571504835754426 finished n = 11: slowdown of 2.4950691233198357 finished n = 12: slowdown of 3.5707673377490377 finished n = 13: slowdown of 3.4865546942291124 finished n = 15: slowdown of 3.599620732978466 finished n = 16: slowdown of 5.630373700477662 finished n = 18: slowdown of 5.243865125760819 finished n = 20: slowdown of 6.874664698176374 finished n = 22: slowdown of 6.387227592987055 finished n = 24: slowdown of 7.530390738060781 finished n = 26: slowdown of 7.139406668416908 finished n = 29: slowdown of 8.86996197718631 finished n = 32: slowdown of 8.063699530516432 finished n = 35: slowdown of 8.306155473952822 finished n = 39: slowdown of 7.8994731502994195 finished n = 43: slowdown of 8.78684030157642 finished n = 47: slowdown of 8.872432655588812 finished n = 52: slowdown of 11.35894960417069 finished n = 57: slowdown of 11.229594409624667 finished n = 63: slowdown of 10.679177119919718 finished n = 69: slowdown of 12.7211454513707 finished n = 76: slowdown of 15.040056012086819 finished n = 84: slowdown of 15.753567049381342 finished n = 92: slowdown of 17.092524771384277 finished n = 102: slowdown of 16.621883914175864 finished n = 112: slowdown of 18.920114327477748 finished n = 123: slowdown of 18.19056822213508 finished n = 136: slowdown of 20.336674367002036 finished n = 150: slowdown of 20.31741376511874 finished n = 165: slowdown of 20.2904499894427 finished n = 182: slowdown of 21.965074406415226 finished n = 200: slowdown of 24.728083316259607 finished n = 221: slowdown of 23.298093092435405 finished n = 243: slowdown of 23.615522898556552 finished n = 268: slowdown of 25.363386568368117 finished n = 295: slowdown of 24.449838923997785 finished n = 325: slowdown of 25.827970639495494 finished n = 358: slowdown of 25.60151899390615 finished n = 394: slowdown of 26.386463599559594 finished n = 434: slowdown of 27.653257847554208 finished n = 478: slowdown of 27.73446719730181 finished n = 526: slowdown of 27.442908250601207 finished n = 580: slowdown of 28.24196057903414 finished n = 639: slowdown of 27.894851485757723 finished n = 704: slowdown of 32.19274609232377 finished n = 775: slowdown of 28.101950897993888 finished n = 854: slowdown of 28.704945212934334 finished n = 940: slowdown of 33.59099217844437 finished n = 1036: slowdown of 31.388923438146342 finished n = 1141: slowdown of 38.542516868347455 finished n = 1257: slowdown of 32.326363845664744 finished n = 1384: slowdown of 48.44881463157586 finished n = 1525: slowdown of 56.25239043644647 finished n = 1680: slowdown of 87.92004480586539 finished n = 1850: slowdown of 102.2170752969168 finished n = 2038: slowdown of 114.34477319737667 finished n = 2245: slowdown of 115.49633126747378 finished n = 2473: slowdown of 128.1797712118958 finished n = 2724: slowdown of 126.06789160710578 finished n = 3000: slowdown of 142.05141996766926
using PyPlot # a plotting library based on Python's Matplotlib
Now, we will plot the results. Since the number of flops is $2n^3$, we will plot $2n^3/t$ for the time $t$ in seconds, scaled by $10^{-9}$ to give the gigaflops rate (billions of flops per second). If you naively think of a CPU as a box that performs floating-point instructions at a fixed rate, with all other instructions being negligible (a picture that may have been true circa 1985), this would be a flat horizontal line independent of $n$, but we will see that reality is quite different.
plot(N, 2N.^3 ./ t * 1e-9, "m.--")
plot(N, 2N.^3 ./ tv * 1e-9, "rs-")
plot(N, 2N.^3 ./ t0 * 1e-9, "b.-")
ylabel(L"gigaflops $2n^3/t$")
xlabel(L"matrix size $n$")
legend(["naive matmul", "naive + SIMD", "BLAS matmul"], loc="center right")
PyObject <matplotlib.legend.Legend object at 0x7f8d184ff510>
You may be suspicious that the problem is simply that Julia is slow. We can check this hypothesis by implementing the same algorithm in C, compiling it, and then calling it via Julia's built-in ccall
syntax, which makes it easy to call C from Julia.
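(As a tiny warm-up, here is an illustrative ccall example that is not part of the benchmarks; it assumes a Unix-like system where the C library is already loaded into the Julia process, and calls libc's strlen on a Julia string.)
# call the C library function strlen directly from Julia (illustrative only):
ccall(:strlen, Csize_t, (Cstring,), "hello") # returns 0x0000000000000005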
# C implementation:
Cmatmul = """
void Cmatmul(int m, int n, int p, double *C, double *A, double *B)
{
    int i, j, k;
    for (i = 0; i < m; ++i)
        for (j = 0; j < p; ++j) {
            double c = 0.0;
            for (k = 0; k < n; ++k)
                c += A[i + m*k] * B[k + n*j];
            C[i + m*j] = c;
        }
}
"""
# compile to a shared library by piping Cmatmul to gcc:
# (only works if you have gcc installed)
const Clib = tempname()
import Libdl
open(`gcc -fPIC -O3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, Cmatmul)
end
# define a Julia cmatmul! function that simply calls the C code in the shared library we compiled
function cmatmul!(C, A, B)
    m,n = size(A)
    n,p = size(B)
    size(C) == (m,p) || error("incorrect dimensions ", size(C), " ≠ $m × $p")
    ccall(("Cmatmul", Clib), Cvoid, (Cint, Cint, Cint, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}),
          m, n, p, C, A, B)
    return C
end
cmatmul(A, B) = cmatmul!(Array{promote_type(eltype(A), eltype(B))}(undef,
size(A,1), size(B,2)),
A, B)
cmatmul (generic function with 1 method)
# correctness check:
A = rand(5,6)
B = rand(6,7)
norm(cmatmul(A,B) - A * B)
1.0175362097255202e-15
Now, let's benchmark it and print out the "speedup" compared to pure Julia (the ratio of the C time to the Julia time). We see that it is about the same speed: Julia's main claim to fame is that it is a dynamic language (like Python or Matlab) whose performance nevertheless stays within a factor of 2 (usually better) of C.
tc = Float64[]
for n in N[N .<= 1000]
    A = zeros(n,n)
    B = zeros(n,n)
    # preallocate output C so that allocation is not included in timing
    C = zeros(n,n)
    push!(tc, @belapsed cmatmul!($C,$A,$B))
    println("finished n = $n: speedup of ", tc[end]/t[length(tc)])
end
title("C vs. Julia")
xlabel(L"matrix size $n$")
ylabel("C time / Julia time")
plot(N[N .<= 1000], tc ./ t[N .<= 1000], "k*-")
finished n = 10: speedup of 0.8096392132348016 finished n = 11: speedup of 0.787933800068331 finished n = 12: speedup of 0.8429995468962392 finished n = 13: speedup of 0.8449945527175886 finished n = 15: speedup of 0.9459532739843644 finished n = 16: speedup of 0.9247698181001573 finished n = 18: speedup of 0.9302127439843024 finished n = 20: speedup of 0.8912962271199104 finished n = 22: speedup of 0.8948917748917748 finished n = 24: speedup of 0.8670125876813682 finished n = 26: speedup of 1.005001103184526 finished n = 29: speedup of 0.8989197530864198 finished n = 32: speedup of 0.9371017624920818 finished n = 35: speedup of 0.8755310778914239 finished n = 39: speedup of 0.9625932062966033 finished n = 43: speedup of 0.9251170046801873 finished n = 47: speedup of 0.9471063805036934 finished n = 52: speedup of 0.9394676004623648 finished n = 57: speedup of 0.9359114179037452 finished n = 63: speedup of 0.9298708865041629 finished n = 69: speedup of 0.9078362203859774 finished n = 76: speedup of 0.8845695860754345 finished n = 84: speedup of 0.8774865071882058 finished n = 92: speedup of 0.8900395274582411 finished n = 102: speedup of 0.8975104879973288 finished n = 112: speedup of 0.9356617279334923 finished n = 123: speedup of 0.9556412476601625 finished n = 136: speedup of 0.9537445468897421 finished n = 150: speedup of 0.9844177502939319 finished n = 165: speedup of 0.9551775324504405 finished n = 182: speedup of 0.9860263536709534 finished n = 200: speedup of 0.9441919276969843 finished n = 221: speedup of 0.9907513876310845 finished n = 243: speedup of 0.9990125898590008 finished n = 268: speedup of 0.9916502820202407 finished n = 295: speedup of 0.9939553564949799 finished n = 325: speedup of 0.9905882483291066 finished n = 358: speedup of 0.981835574410494 finished n = 394: speedup of 0.9852785395384825 finished n = 434: speedup of 0.9908817162394533 finished n = 478: speedup of 0.9917092456362243 finished n = 526: speedup of 0.9934823924981853 finished n = 580: speedup of 0.991162753360323 finished n = 639: speedup of 0.9951056952239036 finished n = 704: speedup of 0.9786153629846499 finished n = 775: speedup of 0.984708599313264 finished n = 854: speedup of 0.9972466098521331 finished n = 940: speedup of
0.9930787512487544
1-element Array{PyCall.PyObject,1}: PyObject <matplotlib.lines.Line2D object at 0x7f8d18834210>
As a first step in the right direction, we'll implement a cache-oblivious algorithm for matrix multiplication: recursively divide each matrix into four submatrices and multiply the blocks recursively, stopping at a base case that is large enough to amortize the recursion overhead. This strategy erases the steep performance drop-off that occurs for large $n$, when the matrices no longer fit in cache, at the cost of ~25 lines of code rather than ~10 for the naive loops.
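Concretely, each level of recursion partitions $A$, $B$, and $C$ into $2 \times 2$ blocks, $$ \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}, $$ so that each block of $C$ is the sum of two smaller products, e.g. $C_{11} = A_{11} B_{11} + A_{12} B_{21}$: eight recursive sub-multiplications in all (the eight recursive calls in the code below), on blocks that are roughly half the size in each dimension and that eventually fit in cache no matter what the cache size is (hence "cache-oblivious").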
(The "last factor of 2" in performance takes a lot more effort. Our code still doesn't match the OpenBLAS performance because the cache-oblivious algorithm is only optimal up to constant factors, and moreover one really needs to unroll and optimize the base cases to optimize register utilization. And then there is the effort required to parallelize the whole thing, which is not considered here.)
In contrast, OpenBLAS requires ≈1000× as much code (≈30,000 lines just for matrix multiplication). A recent pure-Julia library called Octavian.jl can match, and sometimes even exceed, the multi-threaded OpenBLAS performance with "only" ≈100× as much code (a few thousand lines). High-level languages + fast compilation is a convenient combination for writing high-performance code!
function add_matmul_rec!(m,n,p, i0,j0,k0, C,A,B)
    if m+n+p <= 256   # base case: naive matmul once the blocks are sufficiently small
        @avx for i = 1:m, k = 1:p
            c = zero(eltype(C))
            for j = 1:n
                @inbounds c += A[i0+i,j0+j] * B[j0+j,k0+k]
            end
            @inbounds C[i0+i,k0+k] += c
        end
    else
        m2 = m ÷ 2; n2 = n ÷ 2; p2 = p ÷ 2
        add_matmul_rec!(m2, n2, p2, i0, j0, k0, C, A, B)
        add_matmul_rec!(m-m2, n2, p2, i0+m2, j0, k0, C, A, B)
        add_matmul_rec!(m2, n-n2, p2, i0, j0+n2, k0, C, A, B)
        add_matmul_rec!(m2, n2, p-p2, i0, j0, k0+p2, C, A, B)
        add_matmul_rec!(m-m2, n-n2, p2, i0+m2, j0+n2, k0, C, A, B)
        add_matmul_rec!(m2, n-n2, p-p2, i0, j0+n2, k0+p2, C, A, B)
        add_matmul_rec!(m-m2, n2, p-p2, i0+m2, j0, k0+p2, C, A, B)
        add_matmul_rec!(m-m2, n-n2, p-p2, i0+m2, j0+n2, k0+p2, C, A, B)
    end
    return C
end
function matmul_rec!(C, A, B)
    m,n = size(A)
    n,p = size(B)
    size(C) == (m,p) || error("incorrect dimensions ", size(C), " ≠ $m × $p")
    fill!(C, 0)
    return add_matmul_rec!(m,n,p, 0,0,0, C,A,B)
end
matmul_rec(A, B) = matmul_rec!(Array{promote_type(eltype(A), eltype(B))}(undef,
size(A,1), size(B,2)),
A, B)
matmul_rec (generic function with 1 method)
# correctness check:
A = rand(50,60)
B = rand(60,70)
norm(matmul_rec(A,B) - A * B)
0.0
tco = Float64[]
for n in N
    A = zeros(n,n)
    B = zeros(n,n)
    # preallocate output C so that allocation is not included in timing
    C = zeros(n,n)
    push!(tco, @belapsed matmul_rec!($C,$A,$B))
    println("finished n = $n: slowdown of ", tco[end]/t0[length(tco)])
end
finished n = 10: slowdown of 0.42001787059144474 finished n = 11: slowdown of 0.3744457778506396 finished n = 12: slowdown of 0.5041604323980581 finished n = 13: slowdown of 0.5199141574397629 finished n = 15: slowdown of 0.4199135366588606 finished n = 16: slowdown of 0.5527778677231711 finished n = 18: slowdown of 0.5997538447503661 finished n = 20: slowdown of 0.6598303216818115 finished n = 22: slowdown of 0.6105471661213682 finished n = 24: slowdown of 0.6054457506492477 finished n = 26: slowdown of 0.6547650301916514 finished n = 29: slowdown of 0.667514258555133 finished n = 32: slowdown of 0.5813784037558685 finished n = 35: slowdown of 0.587218107508143 finished n = 39: slowdown of 0.5800459766353611 finished n = 43: slowdown of 0.6574409521633646 finished n = 47: slowdown of 0.6198369284755909 finished n = 52: slowdown of 0.7644332882795906 finished n = 57: slowdown of 0.8070023773503351 finished n = 63: slowdown of 0.7167586552935273 finished n = 69: slowdown of 0.859088528025144 finished n = 76: slowdown of 0.8939823856726978 finished n = 84: slowdown of 0.9027979043584885 finished n = 92: slowdown of 1.0306243961624015 finished n = 102: slowdown of 1.0338273595435943 finished n = 112: slowdown of 1.013494818720472 finished n = 123: slowdown of 1.0922077194554927 finished n = 136: slowdown of 1.0610428888416061 finished n = 150: slowdown of 1.1496622245605002 finished n = 165: slowdown of 1.108019822885745 finished n = 182: slowdown of 1.2082268595068328 finished n = 200: slowdown of 1.3487085607243876 finished n = 221: slowdown of 1.2927168814548335 finished n = 243: slowdown of 1.281230228017142 finished n = 268: slowdown of 1.34465028457611 finished n = 295: slowdown of 1.2720082744576946 finished n = 325: slowdown of 1.44003504602943 finished n = 358: slowdown of 1.3618780961782526 finished n = 394: slowdown of 1.4870545074866357 finished n = 434: slowdown of 1.4920417112593873 finished n = 478: slowdown of 1.575467493331027 finished n = 526: slowdown of 1.5873598921013061 finished n = 580: slowdown of 1.7110408616573909 finished n = 639: slowdown of 1.59850854225871 finished n = 704: slowdown of 1.441796289755857 finished n = 775: slowdown of 1.502407218795922 finished n = 854: slowdown of 1.7202498542978835 finished n = 940: slowdown of 1.6664969962115703 finished n = 1036: slowdown of 1.754418686970654 finished n = 1141: slowdown of 1.7199930651660251 finished n = 1257: slowdown of 1.827259833448249 finished n = 1384: slowdown of 1.7536936660976454 finished n = 1525: slowdown of 1.5631772405220175 finished n = 1680: slowdown of 1.7056254048980006 finished n = 1850: slowdown of 1.871265472364321 finished n = 2038: slowdown of 1.7970598957555743 finished n = 2245: slowdown of 2.1909956398756925 finished n = 2473: slowdown of 2.0075170584159063 finished n = 2724: slowdown of 1.9658048765581884 finished n = 3000: slowdown of 1.6747624520873492
plot(N, 2N.^3 ./ tv * 1e-9, "rs-")
plot(N, 2N.^3 ./ tco * 1e-9, "ko-")
plot(N, 2N.^3 ./ t0 * 1e-9, "b.-")
ylim(0,50)
ylabel(L"gigaflops $2n^3/t$")
xlabel(L"matrix size $n$")
legend(["naive+SIMD", "cache-oblivious+SIMD", "BLAS matmul (single-threaded)"], loc="upper right")
PyObject <matplotlib.legend.Legend object at 0x7f8d18a02dd0>
Matrix addition is an interesting case because it has no data re-use, so there is no possible temporal locality, but depending on what order you use for the loops and how matrices are stored in memory, you may or may not get spatial locality that takes advantage of cache lines.
Here let's implement matrix addition in two different ways. As above, we'll use a pre-allocated output array so that our benchmark does not include the time for memory allocation:
function matadd1!(C, A, B)
    size(C) == size(A) == size(B) || throw(DimensionMismatch())
    m,n = size(A)
    for i = 1:m
        @simd for j = 1:n
            @inbounds C[i,j] = A[i,j] + B[i,j]
        end
    end
    return C
end
matadd1(A, B) = matadd1!(similar(A, promote_type(eltype(A), eltype(B))), A, B)
function matadd2!(C, A, B)
    size(C) == size(A) == size(B) || throw(DimensionMismatch())
    m,n = size(A)
    for j = 1:n
        @simd for i = 1:m
            @inbounds C[i,j] = A[i,j] + B[i,j]
        end
    end
    return C
end
matadd2(A, B) = matadd2!(similar(A, promote_type(eltype(A), eltype(B))), A, B)
A = rand(5,6)
B = rand(5,6)
A + B ≈ matadd1(A,B) ≈ matadd2(A,B)
true
Na = round.(Int, 10 .^ range(1, log10(3000), length=60)) # 60 sizes from 10 to 3000
# alternatively, use N = 10:1000 to see some interesting patterns due to cache associativity etc.
t1 = Float64[]
t2 = Float64[]
for n in Na
    A = zeros(n,n)
    B = zeros(n,n)
    # preallocate output C so that allocation is not included in timing
    C = zeros(n,n)
    matadd1!(C,A,B) # add once just to make sure we are in cache if A and B are small
    push!(t1, @belapsed matadd1!($C,$A,$B))
    push!(t2, @belapsed matadd2!($C,$A,$B))
    println("finished n = $n: ratio t1/t2 of ", t1[end]/t2[end])
end
finished n = 10: ratio t1/t2 of 1.0230625080301594 finished n = 11: ratio t1/t2 of 1.0686913725305058 finished n = 12: ratio t1/t2 of 1.0949888352129231 finished n = 13: ratio t1/t2 of 1.0825560142269264 finished n = 15: ratio t1/t2 of 1.0440187828083989 finished n = 16: ratio t1/t2 of 2.0463619261597166 finished n = 18: ratio t1/t2 of 1.8997204570255033 finished n = 20: ratio t1/t2 of 1.7882358742287783 finished n = 22: ratio t1/t2 of 1.749614054182348 finished n = 24: ratio t1/t2 of 1.7210357565446446 finished n = 26: ratio t1/t2 of 1.6955612908163085 finished n = 29: ratio t1/t2 of 1.6268502571988732 finished n = 32: ratio t1/t2 of 3.7985615003507647 finished n = 35: ratio t1/t2 of 2.184259278495316 finished n = 39: ratio t1/t2 of 2.636343399866439 finished n = 43: ratio t1/t2 of 2.5014325977133742 finished n = 47: ratio t1/t2 of 2.3754143845488613 finished n = 52: ratio t1/t2 of 3.9109653916211293 finished n = 57: ratio t1/t2 of 5.485032635606571 finished n = 63: ratio t1/t2 of 2.548234882908074 finished n = 69: ratio t1/t2 of 2.9762531493250104 finished n = 76: ratio t1/t2 of 3.5468661336084604 finished n = 84: ratio t1/t2 of 4.781761983884512 finished n = 92: ratio t1/t2 of 4.397587419417854 finished n = 102: ratio t1/t2 of 5.6833489769100805 finished n = 112: ratio t1/t2 of 5.605194971370078 finished n = 123: ratio t1/t2 of 4.057858711093892 finished n = 136: ratio t1/t2 of 3.060295964949906 finished n = 150: ratio t1/t2 of 3.9087834375320285 finished n = 165: ratio t1/t2 of 4.787215753692272 finished n = 182: ratio t1/t2 of 4.675062801276393 finished n = 200: ratio t1/t2 of 3.746411757951544 finished n = 221: ratio t1/t2 of 4.444597701149425 finished n = 243: ratio t1/t2 of 4.264818796068796 finished n = 268: ratio t1/t2 of 4.108974948640112 finished n = 295: ratio t1/t2 of 4.2012526096033405 finished n = 325: ratio t1/t2 of 4.356924508240298 finished n = 358: ratio t1/t2 of 4.330135862782156 finished n = 394: ratio t1/t2 of 4.577073299659345 finished n = 434: ratio t1/t2 of 4.616964397251718 finished n = 478: ratio t1/t2 of 3.883952142030104 finished n = 526: ratio t1/t2 of 10.140144336101446 finished n = 580: ratio t1/t2 of 14.569555318807371 finished n = 639: ratio t1/t2 of 17.416578748647154 finished n = 704: ratio t1/t2 of 16.68947246927238 finished n = 775: ratio t1/t2 of 17.026629484555645 finished n = 854: ratio t1/t2 of 16.683473218211113 finished n = 940: ratio t1/t2 of 14.982423636374792 finished n = 1036: ratio t1/t2 of 15.476104697294462 finished n = 1141: ratio t1/t2 of 17.092937001283893 finished n = 1257: ratio t1/t2 of 15.49919176720316 finished n = 1384: ratio t1/t2 of 14.013528372534543 finished n = 1525: ratio t1/t2 of 16.4677420528883 finished n = 1680: ratio t1/t2 of 15.175868218928526 finished n = 1850: ratio t1/t2 of 13.771946491596166 finished n = 2038: ratio t1/t2 of 16.69229308028548 finished n = 2245: ratio t1/t2 of 14.411234783257054 finished n = 2473: ratio t1/t2 of 14.735477683088696 finished n = 2724: ratio t1/t2 of 17.709709652956878 finished n = 3000: ratio t1/t2 of 16.002173288119597
plot(Na, Na.^2 ./ t1 * 1e-9, "rs-")
plot(Na, Na.^2 ./ t2 * 1e-9, "bo-")
xlabel(L"matrix size $n$")
ylabel(L"gigaflops $n^2/t$")
legend(["by row", "by column"])
PyObject <matplotlib.legend.Legend object at 0x7f8ced4db7d0>
plot(Na, t1 ./ t2, "rs-")
xlabel(L"matrix size $n$")
ylabel("by row time / by column time")
title("ratio of matrix-addition algorithms")
PyObject Text(0.5, 1.0, 'ratio of matrix-addition algorithms')
We can see that addition is almost *20 times slower* if we add by rows rather than by columns.
The reason for this is that Julia stores matrices in column-major order: the entries of each column are contiguous in memory, so looping down a column accesses consecutive memory locations (and hence whole cache lines), whereas looping across a row jumps through memory with a stride of $m$ elements.
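A quick way to see this layout (a small illustrative check, not part of the benchmarks above) is vec(A), which returns the matrix entries in the order they are stored in memory:
A = [1 2 3
     4 5 6]   # a 2×3 matrix
vec(A)        # 6-element vector [1, 4, 2, 5, 3, 6]: each column is contiguous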
Another example of the importance of spatial locality (and of the cost of temporary arrays) can be seen in "vectorized" operations, which are discussed at more length in the Julia blog post on dot-call syntax and loop fusion.
Consider the following function, which computes $f(x) = 2x^2 - 3x + 4$ element-wise for an array of $x$ values:
f_vec(x) = 2 * x.^2 - 3 * x .+ 4
f_devec(X) = f_vec.(X)
X = rand(10^6);
@btime f_vec($X);
6.198 ms (10 allocations: 38.15 MiB)
@btime f_devec($X);
1.020 ms (2 allocations: 7.63 MiB)
The f_devec
function is roughly six times faster than the vectorized version (per the timings above), and allocates about 1/5 the memory. That is because f_vec
is effectively equivalent to:
function f_vec_really(x)
    tmp1 = x.^2
    tmp2 = 2 * tmp1
    tmp3 = 3 * x
    tmp4 = tmp2 .- tmp3
    tmp5 = tmp4 .+ 4
    return tmp5
end
f_vec_really (generic function with 1 method)
That is, each "vectorized" operation, like x.^2, while individually fast, does a separate loop over its data and allocates a temporary array for its result. This is slow for three reasons:

* Each temporary array must be allocated (and eventually garbage-collected), which has a cost of its own.
* The data is swept through memory once per operation rather than once in total, so for large arrays each pass falls out of cache and spatial/temporal locality is lost.
* The overhead of each loop (loop setup, reading x[i], etcetera) is incurred multiple times rather than once.

In Julia, there is a "dot call" syntax 2 .* x.^2 .- 3 .* x .+ 4
, or equivalently @. 2x^2 - 3x + 4
in which the compiler is guaranteed to "fuse" the operations into a single loop, allocating no temporary arrays.
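For completeness, here is a sketch of the fused version (not benchmarked in the original notebook; the names f_fused and f_fused! are ours), including an in-place variant that writes into a pre-allocated output and therefore allocates nothing:
# fused broadcast versions of the same polynomial (illustrative sketch):
f_fused(x) = @. 2x^2 - 3x + 4          # one fused loop, allocates only the output array
f_fused!(y, x) = @. y = 2x^2 - 3x + 4  # fused and in-place: zero allocations
Y = similar(X)
@btime f_fused($X);
@btime f_fused!($Y, $X);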