Julia isn't fast per se.
One can write terribly slow code in any language, including Julia.
So let's ask a different question.
(modified from Steve's Julia intro)
Vandermonde matrix: \begin{align}V=\begin{bmatrix}1&\alpha _{1}&\alpha _{1}^{2}&\dots &\alpha _{1}^{n-1}\\1&\alpha _{2}&\alpha _{2}^{2}&\dots &\alpha _{2}^{n-1}\\1&\alpha _{3}&\alpha _{3}^{2}&\dots &\alpha _{3}^{n-1}\\\vdots &\vdots &\vdots &\ddots &\vdots \\1&\alpha _{m}&\alpha _{m}^{2}&\dots &\alpha _{m}^{n-1}\end{bmatrix}\end{align}
using PyCall
np = pyimport("numpy")
np.vander(1:5, increasing=true)
The source code for this function is here. It calls np.multiply.accumulate, which is implemented in C here. However, that C code doesn't actually perform the computation; it mostly does type checking and dispatch. The actual kernel that gets called is here. It isn't even plain C code but a template for C code that is used to generate type-specific kernels.
Overall, this setup only supports a limited set of types, like Float64, Float32, and so forth.
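To see this restriction in action, we can try to hand np.vander something non-numeric; the call below is expected to fail, in contrast to the generic Julia implementation further down.
# NumPy's vander only handles its numeric dtypes, so passing strings is
# expected to raise a PyError (assumes the PyCall setup from above):
np.vander(["this", "is", "a", "test"])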
Here is a simple Julia implementation:
function vander(x::AbstractVector{T}, n=length(x)) where T
    m = length(x)
    V = Matrix{T}(undef, m, n)
    for j = 1:m
        V[j, 1] = one(x[j])             # first column: multiplicative identity
    end
    for i = 2:n
        for j = 1:m
            V[j, i] = x[j] * V[j, i-1]  # build up the powers column by column
        end
    end
    return V
end
vander(1:5)
using BenchmarkTools, Plots
ns = exp10.(range(1, 4, length=30));
tnp = Float64[]
tjl = Float64[]
for n in ns
    x = 1:n |> collect
    push!(tnp, @belapsed np.vander($x) samples=3 evals=1)
    push!(tjl, @belapsed vander($x) samples=3 evals=1)
end
plot(ns, tnp./tjl, m=:circle, xscale=:log10, xlab="matrix size", ylab="NumPy time / Julia time", legend=false)
Note that the clean and concise Julia implementation beats NumPy's C implementation for small matrices and is on par for large matrix sizes.
At the same time, the Julia code is generic and works for arbitrary types!
vander(Int32[4, 8, 16, 32])
It even works for non-numerical types. The only requirement is that the type has a one (identity element) and a multiplication operation defined.
vander(["this", "is", "a", "test"])
Here, one(String) == "" since the empty string is the identity element under multiplication (string concatenation).
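We can check both ingredients directly:
one(String)              # "" is the multiplicative identity for strings
"Julia is " * "generic"  # `*` concatenates strings in Julia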
Julia specializes on the types of function arguments.
When a function is called for the first time, Julia compiles efficient machine code for the given input types.
If it is called again, the already existing machine code is reused, until we call the function with different input types.
func(x,y) = x^2 + y
@time func(1,2)
@time func(1,2)
First call: compilation + running the code
Second call: running the code
@time func(1,2)
If the input types change, Julia compiles a new specialization of the function.
@time func(1.3,4.8)
@time func(1.3,4.8)
We now have two efficient codes in the cache: one for all Int64 inputs and another one for all Float64 arguments.
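Calling the function with yet another input type, say Float32, should trigger one more compilation and add a third specialization to the cache:
@time func(1.3f0, 4.8f0) # first Float32 call: compiles a new specialization
@time func(1.3f0, 4.8f0) # second call: reuses the cached machine code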
We can inspect the code at all transformation stages with a bunch of macros:
@macroexpand
@code_typed (and @code_warntype)
@code_lowered
@code_llvm
@code_native
@code_typed func(1,2)
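The @code_warntype variant shows essentially the same typed code, but highlights type instabilities in the output (there are none for this fully inferred call):
@code_warntype func(1,2)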
@code_lowered func(1,2)
@code_llvm func(1,2)
We can remove the comments (lines starting with ;) using debuginfo=:none.
@code_llvm debuginfo=:none func(1,2)
@code_native func(1,2)
Let's compare this to Float64 input.
@code_native func(1.2,2.9)
Let's try to estimate the performance gain by specialization.
We wrap our numbers into a custom type which internally stores them as Any to prevent specialization.
(This is qualitatively comparable to what Python does.)
struct Anything
    value::Any
end
operation(x::Number) = x^2 + sqrt(x)
operation(x::Anything) = x.value^2 + sqrt(x.value)
using BenchmarkTools
@btime operation(2);
@btime operation(2.0);
x = Anything(2.0)
@btime operation($x);
That's about a 40-fold slowdown!
@code_native operation(2.0)
@code_native operation(x)
In scientific computing, we typically run a piece of code over and over again. Think of a Monte Carlo simulation, for example, where we perform the update and the Metropolis check millions of times.
Therefore, we want our run-time to be as short as possible.
On the other hand, for a given set of input arguments, Julia compiles the piece of code only once, as we have seen above. The time it takes to compile our code is almost always negligible compared to the duration of the full computation.
A general strategy is therefore to move parts of the computation to compile-time.
Since Julia specializes on types, at compile-time only type information is available to the compiler.
f1(x::Int) = x + 1
f2(x::Int) = x + 2
function f_slow(x::Int, p::Bool)
    if p # check depends on the value of p
        return f1(x)
    else
        return f2(x)
    end
end
@code_llvm debuginfo=:none f_slow(1, true)
We can eliminate the if branch by moving the condition check to the type domain. This way, it will only be evaluated once at compile-time.
abstract type Boolean end
struct True <: Boolean end # type domain true
struct False <: Boolean end # type domain false
function f_fast(x::Int, p::Boolean)
    if typeof(p) == True # check solely based on the type of p
        return f1(x)
    else
        return f2(x)
    end
end
@code_llvm debuginfo=:none f_fast(1, True())
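As a side note, the same compile-time branch elimination is often written with multiple dispatch instead of an explicit typeof check; a small sketch (the name f_fast2 is just for illustration):
# Sketch: dispatch on the type-domain Booleans directly.
f_fast2(x::Int, ::True)  = f1(x)
f_fast2(x::Int, ::False) = f2(x)

@code_llvm debuginfo=:none f_fast2(1, True())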
Note that Julia's type inference is powerful. Specifying types is not necessary for best performance!
function my_function(x)
    y = rand()
    z = rand()
    x + y + z
end
function my_function_typed(x::Int)::Float64
    y::Float64 = rand()
    z::Float64 = rand()
    x + y + z
end
@btime my_function(10);
@btime my_function_typed(10);
However, annotating types explicitly can serve a purpose, for example when tracking down type instabilities with @code_warntype.
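A small, made-up example of the kind of problem @code_warntype reveals: reading an untyped global makes the return type uninferable, and the macro marks the corresponding variables as Any.
unstable_global = 1.0        # untyped global (hypothetical example)
g(x) = x + unstable_global   # return type cannot be inferred
@code_warntype g(10)         # flags the ::Any results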