This IJulia/Jupyter notebook presents some performance experiments with vectorization in Julia, to accompany the blog post on syntactic loop fusion in Julia 0.6.
We use the following example problem: evaluating `f(2x^2 + 6x^3 - sqrt(x))`, for the function `f(x) = 3x^2 + 5x + 2`, elementwise for `x` in an array `X`, storing the results in-place in `X`. We implement this in three different styles: "traditional" vectorized style `vec!(X)` à la Julia 0.4 or Matlab/Numpy, the devectorized style (explicit loops) `devec!(X)`, and new-style vectorization `newvec!(X)` with syntactic loop fusion:
f(x) = 3x.^2 + 5x + 2   # the .^ lets f work both elementwise on whole arrays and on scalars
# traditional-style vectorization:
vec!(X) = X .= f(2X.^2 + 6X.^3 - sqrt.(X))
# new-style vectorization (dot operations = syntactic loop fusion):
newvec!(X) = X .= f.(2 .* X.^2 .+ 6 .* X.^3 .- sqrt.(X))
# devectorized (explicit loops):
function devec!(X)
    for i in eachindex(X)
        x = X[i]
        X[i] = f(2x^2 + 6x^3 - sqrt(x))
    end
    return X
end
devec! (generic function with 1 method)
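The key to `newvec!` is that in Julia 0.6, nested "dot calls" fuse at the syntax level into a single `broadcast` call, and the `.=` assignment makes that broadcast operate in-place. Roughly speaking (a sketch of the idea, not the literal lowered code, and the name `fused!` is ours), `newvec!` behaves like:

```julia
# A single in-place broadcast of one fused scalar kernel: each element
# of X is read, pushed through the whole expression, and written back,
# with no temporary arrays allocated.
fused!(X) = broadcast!(x -> f(2x^2 + 6x^3 - sqrt(x)), X, X)
```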
Let's run a simple benchmark, comparing the performance of the three functions for a vector of $10^6$ `Float64` values. We will use the BenchmarkTools package to collect timing statistics for us.
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 10 # use 10s benchmarks to reduce timing noise
10
X = zeros(10^6)
t_vec = @benchmark vec!($X)
BenchmarkTools.Trial:
  memory estimate:  91.55 mb
  allocs estimate:  24
  --------------
  minimum time:     39.018 ms (30.62% GC)
  median time:      46.892 ms (41.53% GC)
  mean time:        77.637 ms (63.53% GC)
  maximum time:     126.879 ms (75.92% GC)
  --------------
  samples:          129
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
t_devec = @benchmark devec!($X)
BenchmarkTools.Trial:
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.325 ms (0.00% GC)
  median time:      3.577 ms (0.00% GC)
  mean time:        3.728 ms (0.00% GC)
  maximum time:     6.484 ms (0.00% GC)
  --------------
  samples:          2680
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
t_newvec = @benchmark newvec!($X)
BenchmarkTools.Trial:
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.615 ms (0.00% GC)
  median time:      3.902 ms (0.00% GC)
  mean time:        4.104 ms (0.00% GC)
  maximum time:     9.670 ms (0.00% GC)
  --------------
  samples:          2434
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
println("traditional vectorization slowdown = ", time(minimum(t_vec)) / time(minimum(t_devec)), "×")
println("new-style vectorization slowdown = ", time(minimum(t_newvec)) / time(minimum(t_devec)), "×")
println("\ntraditional vectorization memory overhead = ", memory(minimum(t_vec)) / sizeof(X), "×")
traditional vectorization slowdown = 11.73377323838746×
new-style vectorization slowdown = 1.0871549107544276×

traditional vectorization memory overhead = 12.00012×
As can be seen from the preceding timing ratios, the traditional vectorized version is 12× slower than the devectorized version, but the new-style "dot-vectorized" version is only 10% slower. Also, the traditional vectorized version wastes a factor of 12 in memory compared to the size of `X` itself, due to the 12 temporary arrays that are allocated while evaluating `vec!`, whereas `newvec!` allocates nothing.
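(As a quick spot check on those allocation numbers, Julia's built-in `@allocated` macro measures the bytes allocated by a single call; run each function once beforehand so that compilation overhead isn't counted.)

```julia
vec!(X); newvec!(X)              # warm up: exclude compilation from the count
println(@allocated vec!(X))      # tens of megabytes of temporaries
println(@allocated newvec!(X))   # 0 bytes: everything fused into one loop
```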
To help us understand the source of the 12× slowdown of the traditional vectorized version, we want to separate two potential performance problems: the cost of allocating all of the temporary arrays vs. the cost of multiple loops where only one loop is required. To do this, I'll write a version of `vec!` in which all of the 12 temporary arrays are pre-allocated:
function vec_prealloc!(X, TWELVE_TEMP_ARRAYS)
    T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12 = TWELVE_TEMP_ARRAYS

    # evaluate T7 = 2X.^2 + 6X.^3 - sqrt.(X):
    T1 .= X.^2
    T2 .= 2 .* T1
    T3 .= X.^3
    T4 .= 6 .* T3
    T5 .= T2 .+ T4
    T6 .= sqrt.(X)
    T7 .= T5 .- T6

    # evaluate T12 = f(T7):
    T8 .= T7.^2
    T9 .= 3 .* T8
    T10 .= 5 .* T7
    T11 .= T9 .+ T10
    T12 .= T11 .+ 2

    # store result in X:
    X .= T12
    return X
end

vec_prealloc!(X) = vec_prealloc!(X, ntuple(i -> similar(X), 12))
vec_prealloc! (generic function with 2 methods)
Before we do anything else, let's make sure that all of these functions are computing the same thing, by verifying that they give the same answer on a random vector `Y`:
Y = rand(100)
vec!(copy(Y)) == vec_prealloc!(copy(Y)) == devec!(copy(Y)) == newvec!(copy(Y))
true
Great, it works! Now, let's try our quick benchmark from above:
t_vec_prealloc = @benchmark vec_prealloc!($X, $(ntuple(i -> similar(X), 12)))
BenchmarkTools.Trial:
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     13.342 ms (0.00% GC)
  median time:      13.821 ms (0.00% GC)
  mean time:        14.224 ms (0.00% GC)
  maximum time:     17.571 ms (0.00% GC)
  --------------
  samples:          703
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
println("preallocated traditional vectorization slowdown = ", time(minimum(t_vec_prealloc)) / time(minimum(t_devec)), "×")
preallocated traditional vectorization slowdown = 4.012209861295182×
Even with pre-allocation, executing 12 separate loops (as `vec!` does implicitly) is a factor-of-four slowdown.
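This is largely a memory-traffic effect: each non-fused vectorized statement is its own loop over the whole array, so the data makes 12 round trips to memory instead of one (plus 12× the loop-setup overhead for small arrays). A minimal sketch of the distinction, with hypothetical helper names `twopass!` and `onepass!`:

```julia
# Both compute Y .= 2 .* (Y .+ 1), but twopass! sweeps over Y twice
# (two separate fused loops) while onepass! sweeps over it only once.
twopass!(Y) = (Y .= Y .+ 1; Y .= 2 .* Y; Y)
onepass!(Y) = (Y .= 2 .* (Y .+ 1); Y)
```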
To get further insight, let's benchmark as a function of the length of the array.
n = [1,5,10,20,50,100,1000,5000,10^4,10^5,10^6,10^7] # array sizes to benchmark
12-element Array{Int64,1}:
        1
        5
       10
       20
       50
      100
     1000
     5000
    10000
   100000
  1000000
 10000000
t = Array{Float64}(length(n), 4)
for i = 1:length(n)
    X = zeros(n[i])
    println("benchmarking n = ", n[i])
    t[i,1] = time(minimum(@benchmark devec!($X)))
    t[i,2] = time(minimum(@benchmark vec!($X)))
    t[i,3] = time(minimum(@benchmark vec_prealloc!($X, $(ntuple(i -> similar(X), 12)))))
    t[i,4] = time(minimum(@benchmark newvec!($X)))
end
benchmarking n = 1
benchmarking n = 5
benchmarking n = 10
benchmarking n = 20
benchmarking n = 50
benchmarking n = 100
benchmarking n = 1000
benchmarking n = 5000
benchmarking n = 10000
benchmarking n = 100000
benchmarking n = 1000000
benchmarking n = 10000000
["n" "devec" "vec" "vec_pre" "newvec"; n (t ./ t[:,1]) ]
13×5 Array{Any,2}:
       "n"    "devec"   "vec"      "vec_pre"  "newvec"
      1.0    1.0       33.0302    15.369      1.46505
      5.0    1.0       23.6401    11.1489     1.68799
     10.0    1.0       12.907      6.66253    1.63485
     20.0    1.0        8.28721    4.24866    1.35354
     50.0    1.0        4.89418    2.3063     1.16262
    100.0    1.0        3.30924    1.88955    1.1252
   1000.0    1.0        2.78468    1.65965    1.09363
   5000.0    1.0        2.6733     2.30583    1.08739
  10000.0    1.0        2.69954    2.39681    1.08704
 100000.0    1.0        3.30592    3.23626    1.08734
      1.0e6  1.0        8.13306    4.01938    1.08778
      1.0e7  1.0       22.818      5.96294    1.08505
The above table shows the execution times normalized by the time for the devectorized version. We can see the following:

* The traditionally vectorized code (`vec`) has a huge overhead for tiny arrays (where the relative cost of allocation is much bigger than the arithmetic) and for very large arrays (where it spends a lot of time in the garbage collector). At best, for intermediate-sized arrays, it is a factor of 2–3× slower than the devectorized loop.

* Even if you eliminate the cost of allocation and garbage collection, as in the pre-allocated version of the traditionally vectorized code (`vec_pre`), it is still much slower than the devectorized code. This is especially true for very small vectors (due to the cost of setting up all of the little loops) and for very large vectors (which don't fit into cache, so the cache misses induced by the non-fused loops are expensive).

* The new-style "dot-vectorized" code (`newvec`) is mostly within 10% of the cost of the devectorized loop. The worst case is that of length-1 arrays, where the overhead of the `broadcast` bookkeeping is noticeable, but even then it is only about a 50% penalty.
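(If you're curious, that length-1 worst case is easy to probe directly; `x1` here is just an illustrative name.)

```julia
x1 = [0.5]
@benchmark newvec!($x1)   # compare with @benchmark devec!($x1):
                          # the gap is the broadcast bookkeeping
```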
Experienced Julia programmers know that the `devec!` code above should have performance comparable to code in a low-level language like C, but it is good to check this. The following code implements the same function in C. To run it, we invoke the C compiler, link it into a shared library, load it from Julia, and call it with the `ccall` syntax from Julia. (This won't work unless you have a C compiler installed, obviously.)
C_code = """
#include <stddef.h>
#include <math.h>
void devec(size_t n, double *X) {
    for (size_t i = 0; i < n; ++i) {
        double x = X[i];
        double x2 = x*x;
        double y = 2*x2 + 6*x*x2 - sqrt(x);
        X[i] = 3*y*y + 5*y + 2;
    }
}
"""
# compile to a shared library by piping C_code to gcc:
# (only works if you have gcc installed)
const Clib = tempname()
open(`gcc -fPIC -O3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code)
end

function c_devec!(X::Array{Float64})
    ccall(("devec", Clib), Void, (Csize_t, Ptr{Float64}), length(X), X)
    return X
end
c_devec! (generic function with 1 method)
Let's check that this gives the correct answer. (We use `≈` rather than `==` because there are some slight differences in floating-point rounding errors between the C and Julia code.)
devec!(copy(Y)) ≈ c_devec!(copy(Y))
true
Hooray, it works! Now let's benchmark the C code and compare it to the Julia devec!
:
tc = Array{Float64}(length(n))
for i = 1:length(n)
    X = zeros(n[i])
    println("benchmarking n = ", n[i])
    tc[i] = time(minimum(@benchmark c_devec!($X)))
end
println("\nJulia devec time / C time: ", t[:,1] ./ tc)
benchmarking n = 1
benchmarking n = 5
benchmarking n = 10
benchmarking n = 20
benchmarking n = 50
benchmarking n = 100
benchmarking n = 1000
benchmarking n = 5000
benchmarking n = 10000
benchmarking n = 100000
benchmarking n = 1000000
benchmarking n = 10000000

Julia devec time / C time: [1.12107,1.16494,1.17804,1.05198,1.05288,1.02803,1.00759,1.00788,1.0071,1.00482,1.00442,0.997434]
As expected, the Julia `devec!` code is within a few percent of the speed of the C code (sometimes even faster, although that's probably timing noise). Unlike the C code, which works only for `Float64` (C `double`), however, the Julia `devec!` code is type-generic: it works for any type where `+`, `*`, `^`, and `sqrt` are defined.
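For example (a quick sketch of our own, with an illustrative array `Xf`), the very same `devec!` runs on single-precision data with no conversion to `Float64`:

```julia
Xf = rand(Float32, 10)   # single-precision input
devec!(Xf)               # same generic code, applied elementwise
eltype(Xf)               # still Float32: no promotion to double occurred
```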