This IJulia/Jupyter notebook presents some performance experiments with vectorization in Julia, to accompany the blog post on syntactic loop fusion in Julia 0.6.
We use the following example problem: evaluating `f(2x^2 + 6x^3 - sqrt(x))`, for the function `f(x) = 3x^2 + 5x + 2`, elementwise for `x` in an array `X`, storing the results in-place in `X`. We implement this in three different styles: "traditional" vectorized style `vec!(X)` à la Julia 0.4 or Matlab/Numpy, the devectorized style (explicit loops) `devec!(X)`, and new-style vectorization `newvec!(X)` with syntactic loop fusion:
f(x) = 3x.^2 + 5x + 2   # the .^ lets f work both elementwise on whole arrays and on scalars
# traditional-style vectorization:
vec!(X) = X .= f(2X.^2 + 6X.^3 - sqrt.(X))
# new-style vectorization (dot operations = syntactic loop fusion):
newvec!(X) = X .= f.(2 .* X.^2 .+ 6 .* X.^3 .- sqrt.(X))
# devectorized (explicit loops):
function devec!(X)
    for i in eachindex(X)
        x = X[i]
        X[i] = f(2x^2 + 6x^3 - sqrt(x))
    end
    return X
end
devec! (generic function with 1 method)
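The key to `newvec!` is that in Julia 0.6, nested "dot calls" fuse at the syntax level into a single `broadcast` call, and the `.=` assignment makes that broadcast operate in-place. Roughly speaking (a sketch of the idea, not the literal lowered code, and the name `fused!` is ours), `newvec!` behaves like:

```julia
# A single in-place broadcast of one fused scalar kernel: each element
# of X is read, pushed through the whole expression, and written back,
# with no temporary arrays allocated.
fused!(X) = broadcast!(x -> f(2x^2 + 6x^3 - sqrt(x)), X, X)
```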
Let's run a simple benchmark, comparing the performance of the three functions for a vector of $10^6$ `Float64` values. We will use the BenchmarkTools package to collect timing statistics for us.
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 10 # use 10s benchmarks to reduce timing noise
10
X = zeros(10^6)
t_vec = @benchmark vec!($X)
BenchmarkTools.Trial:
  memory estimate:  91.55 mb
  allocs estimate:  24
  --------------
  minimum time:     39.018 ms (30.62% GC)
  median time:      46.892 ms (41.53% GC)
  mean time:        77.637 ms (63.53% GC)
  maximum time:     126.879 ms (75.92% GC)
  --------------
  samples:          129
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
t_devec = @benchmark devec!($X)
BenchmarkTools.Trial:
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.325 ms (0.00% GC)
  median time:      3.577 ms (0.00% GC)
  mean time:        3.728 ms (0.00% GC)
  maximum time:     6.484 ms (0.00% GC)
  --------------
  samples:          2680
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
t_newvec = @benchmark newvec!($X)
BenchmarkTools.Trial:
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.615 ms (0.00% GC)
  median time:      3.902 ms (0.00% GC)
  mean time:        4.104 ms (0.00% GC)
  maximum time:     9.670 ms (0.00% GC)
  --------------
  samples:          2434
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
println("traditional vectorization slowdown = ", time(minimum(t_vec)) / time(minimum(t_devec)), "×")
println("new-style vectorization slowdown = ", time(minimum(t_newvec)) / time(minimum(t_devec)), "×")
println("\ntraditional vectorization memory overhead = ", memory(minimum(t_vec)) / sizeof(X), "×")
traditional vectorization slowdown = 11.73377323838746×
new-style vectorization slowdown = 1.0871549107544276×

traditional vectorization memory overhead = 12.00012×
As can be seen from the preceding timing ratios, the traditional vectorized version is 12× slower than the devectorized version, but the new-style "dot-vectorized" version is only 10% slower. Also, the traditional vectorized version wastes a factor of 12 in memory compared to the size of `X` itself, due to the 12 temporary arrays that are allocated while evaluating `vec!`, whereas `newvec!` allocates nothing.
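(As a quick spot check on those allocation numbers, Julia's built-in `@allocated` macro measures the bytes allocated by a single call; run each function once beforehand so that compilation overhead isn't counted.)

```julia
vec!(X); newvec!(X)              # warm up: exclude compilation from the count
println(@allocated vec!(X))      # tens of megabytes of temporaries
println(@allocated newvec!(X))   # 0 bytes: everything fused into one loop
```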
To help us understand the source of the 12× slowdown of the traditional vectorized version, we want to separate two potential performance problems: the cost of allocating all of the temporary arrays vs. the cost of multiple loops where only one loop is required. To do this, I'll write a version of `vec!` in which all of the 12 temporary arrays are pre-allocated:
function vec_prealloc!(X, TWELVE_TEMP_ARRAYS)
    T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12 = TWELVE_TEMP_ARRAYS

    # evaluate T7 = 2X.^2 + 6X.^3 - sqrt.(X):
    T1 .= X.^2
    T2 .= 2 .* T1
    T3 .= X.^3
    T4 .= 6 .* T3
    T5 .= T2 .+ T4
    T6 .= sqrt.(X)
    T7 .= T5 .- T6

    # evaluate T12 = f(T7):
    T8 .= T7.^2
    T9 .= 3 .* T8
    T10 .= 5 .* T7
    T11 .= T9 .+ T10
    T12 .= T11 .+ 2

    # store result in X:
    X .= T12
    return X
end

vec_prealloc!(X) = vec_prealloc!(X, ntuple(i -> similar(X), 12))
vec_prealloc! (generic function with 2 methods)
Before we do anything else, let's make sure that all of these functions are computing the same thing, by verifying that they give the same answer on a random vector `Y`:
Y = rand(100)
vec!(copy(Y)) == vec_prealloc!(copy(Y)) == devec!(copy(Y)) == newvec!(copy(Y))
true
Great, it works! Now, let's try our quick benchmark from above:
t_vec_prealloc = @benchmark vec_prealloc!($X, $(ntuple(i -> similar(X), 12)))
BenchmarkTools.Trial:
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     13.342 ms (0.00% GC)
  median time:      13.821 ms (0.00% GC)
  mean time:        14.224 ms (0.00% GC)
  maximum time:     17.571 ms (0.00% GC)
  --------------
  samples:          703
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
println("preallocated traditional vectorization slowdown = ", time(minimum(t_vec_prealloc)) / time(minimum(t_devec)), "×")
preallocated traditional vectorization slowdown = 4.012209861295182×
Even with pre-allocation, executing 12 separate loops (as `vec!` does implicitly) is a factor-of-four slowdown.
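This is largely a memory-traffic effect: each non-fused vectorized statement is its own loop over the whole array, so the data makes 12 round trips to memory instead of one (plus 12× the loop-setup overhead for small arrays). A minimal sketch of the distinction, with hypothetical helper names `twopass!` and `onepass!`:

```julia
# Both compute Y .= 2 .* (Y .+ 1), but twopass! sweeps over Y twice
# (two separate fused loops) while onepass! sweeps over it only once.
twopass!(Y) = (Y .= Y .+ 1; Y .= 2 .* Y; Y)
onepass!(Y) = (Y .= 2 .* (Y .+ 1); Y)
```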
To get further insight, let's benchmark as a function of the length of the array.
n = [1,5,10,20,50,100,1000,5000,10^4,10^5,10^6,10^7] # array sizes to benchmark
12-element Array{Int64,1}:
        1
        5
       10
       20
       50
      100
     1000
     5000
    10000
   100000
  1000000
 10000000
t = Array{Float64}(length(n), 4)
for i = 1:length(n)
    X = zeros(n[i])
    println("benchmarking n = ", n[i])
    t[i,1] = time(minimum(@benchmark devec!($X)))
    t[i,2] = time(minimum(@benchmark vec!($X)))
    t[i,3] = time(minimum(@benchmark vec_prealloc!($X, $(ntuple(i -> similar(X), 12)))))
    t[i,4] = time(minimum(@benchmark newvec!($X)))
end
benchmarking n = 1
benchmarking n = 5
benchmarking n = 10
benchmarking n = 20
benchmarking n = 50
benchmarking n = 100
benchmarking n = 1000
benchmarking n = 5000
benchmarking n = 10000
benchmarking n = 100000
benchmarking n = 1000000
benchmarking n = 10000000
["n" "devec" "vec" "vec_pre" "newvec"; n (t ./ t[:,1]) ]
13×5 Array{Any,2}:
       "n"    "devec"   "vec"      "vec_pre"  "newvec"
      1.0    1.0       33.0302    15.369      1.46505
      5.0    1.0       23.6401    11.1489     1.68799
     10.0    1.0       12.907      6.66253    1.63485
     20.0    1.0        8.28721    4.24866    1.35354
     50.0    1.0        4.89418    2.3063     1.16262
    100.0    1.0        3.30924    1.88955    1.1252
   1000.0    1.0        2.78468    1.65965    1.09363
   5000.0    1.0        2.6733     2.30583    1.08739
  10000.0    1.0        2.69954    2.39681    1.08704
 100000.0    1.0        3.30592    3.23626    1.08734
      1.0e6  1.0        8.13306    4.01938    1.08778
      1.0e7  1.0       22.818      5.96294    1.08505
The above table shows the execution times normalized by the time for the devectorized version. We can see the following:

* The traditionally vectorized code (`vec`) has a huge overhead for tiny arrays (where the relative cost of allocation is much bigger than the arithmetic) and for very large arrays (where it spends a lot of time in the garbage collector). At best, for intermediate-sized arrays, it is a factor of 2–3× slower than the devectorized loop.

* Even if you eliminate the cost of allocation and garbage collection, as in the pre-allocated version of the traditionally vectorized code (`vec_pre`), it is still much slower than the devectorized code. This is especially true for very small vectors (due to the cost of setting up all of the little loops) and for very large vectors (which don't fit into cache, so the cache misses induced by the non-fused loops are expensive).

* The new-style "dot-vectorized" code (`newvec`) is mostly within 10% of the cost of the devectorized loop. The worst case is that of length-1 arrays, where the overhead of the `broadcast` bookkeeping is noticeable, but even then it is only about a 50% penalty.
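(If you're curious, that length-1 worst case is easy to probe directly; `x1` here is just an illustrative name.)

```julia
x1 = [0.5]
@benchmark newvec!($x1)   # compare with @benchmark devec!($x1):
                          # the gap is the broadcast bookkeeping
```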
Experienced Julia programmers know that the `devec!` code above should have performance comparable to code in a low-level language like C, but it is good to check this. The following code implements the same function in C. To run it, we invoke the C compiler, link it into a shared library, load it from Julia, and call it with the `ccall` syntax from Julia. (This won't work unless you have a C compiler installed, obviously.)
C_code = """
#include <stddef.h>
#include <math.h>
void devec(size_t n, double *X) {
    for (size_t i = 0; i < n; ++i) {
        double x = X[i];
        double x2 = x*x;
        double y = 2*x2 + 6*x*x2 - sqrt(x);
        X[i] = 3*y*y + 5*y + 2;
    }
}
"""
# compile to a shared library by piping C_code to gcc:
# (only works if you have gcc installed)
const Clib = tempname()
open(`gcc -fPIC -O3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code)
end

function c_devec!(X::Array{Float64})
    ccall(("devec", Clib), Void, (Csize_t, Ptr{Float64}), length(X), X)
    return X
end
c_devec! (generic function with 1 method)
Let's check that this gives the correct answer. (We use `≈` rather than `==` because there are some slight differences in floating-point rounding errors between the C and Julia code.)
devec!(copy(Y)) ≈ c_devec!(copy(Y))
true
Hooray, it works! Now let's benchmark the C code and compare it to the Julia devec!
:
tc = Array{Float64}(length(n))
for i = 1:length(n)
    X = zeros(n[i])
    println("benchmarking n = ", n[i])
    tc[i] = time(minimum(@benchmark c_devec!($X)))
end
println("\nJulia devec time / C time: ", t[:,1] ./ tc)
benchmarking n = 1
benchmarking n = 5
benchmarking n = 10
benchmarking n = 20
benchmarking n = 50
benchmarking n = 100
benchmarking n = 1000
benchmarking n = 5000
benchmarking n = 10000
benchmarking n = 100000
benchmarking n = 1000000
benchmarking n = 10000000

Julia devec time / C time: [1.12107,1.16494,1.17804,1.05198,1.05288,1.02803,1.00759,1.00788,1.0071,1.00482,1.00442,0.997434]
As expected, the Julia `devec!` code is within a few percent of the speed of the C code (sometimes even faster, although that's probably timing noise). Unlike the C code, which works only for `Float64` (C `double`), however, the Julia `devec!` code is type-generic: it works for any type where `+`, `*`, `^`, and `sqrt` are defined.
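For example (a quick sketch of our own, with an illustrative array `Xf`), the very same `devec!` runs on single-precision data with no conversion to `Float64`:

```julia
Xf = rand(Float32, 10)   # single-precision input
devec!(Xf)               # same generic code, applied elementwise
eltype(Xf)               # still Float32: no promotion to double occurred
```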