# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))
from formats import load_style
load_style(plot_style = False)

os.chdir(path)

# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
# 3. to use cython
%load_ext watermark
%load_ext autoreload
%autoreload 2
%load_ext cython

import numpy as np
from numba import jit, njit, prange
%watermark -a 'Ethen' -d -t -v -p numpy,cython,numba
Ethen 2017-08-25 10:26:10

CPython 3.5.2
IPython 5.4.1

numpy 1.13.1
cython 0.25.2
numba 0.34.0
Cython is a superset of the Python programming language, designed to give C-like performance with code that is mostly written in Python. In short, it aims to combine the simplicity of Python with the efficiency of C. If you'd like some additional motivation to try it out, consider listening to a 20-minute-ish talk from PyCon. Youtube: Cython as a Game Changer for Efficiency
We can write Cython in notebooks by loading the cython magic.
%%cython
def hello_snippet():
    """
    after loading the cython magic, we can run the cython code
    (this code isn't different from normal python code)
    """
    print('hello cython')
Or write it as a .pyx file and use
setup.py to compile it:
# helloworld.pyx
# cython hello world
def hello():
    print('Hello, World!')
# setup.py
# compiling the .pyx module
from distutils.core import setup
from Cython.Build import cythonize

# key-value pairs that tell distutils the name
# of the application and which extensions it
# needs to build. for the cython modules, we
# use glob patterns, e.g. *.pyx, or simply pass in
# the filename.pyx
setup(
    name = 'Hello',
    ext_modules = cythonize('*.pyx')
)

# after that run
# python setup.py build_ext --inplace
# in the command line, and we can import it like
# normal python modules
from helloworld import hello
hello()
Cython extends the Python language with static type declarations. This increases speed by not needing to do type-checks when running the program. The way we do this in Cython is with the cdef keyword.
We'll write a simple program that increments j by 1 a thousand times and compare the speed difference with and without the type declaration.
%%cython
def example():
    """simply increment j by 1 for 1000 times"""
    # declare the integer type before using it
    cdef int i, j = 0
    for i in range(1000):
        j += 1
    return j

%timeit example()
def example_py():
    j = 0
    for i in range(1000):
        j += 1
    return j

%timeit example_py()
The slowest run took 28.37 times longer than the fastest. This could mean that an intermediate result is being cached. 10000000 loops, best of 3: 47.7 ns per loop
10000 loops, best of 3: 49.4 µs per loop
Notice the runtime difference (look at the units): the typed Cython version is roughly a thousand times faster.
To declare functions we use the cpdef keyword:
%%cython
cpdef int compute_sum(int a, int b):
    return a + b
Notice that apart from declaring the function with the cpdef keyword, we also specify the return type to be an integer and the two input arguments to be integers.
There's still an overhead to calling functions, so if a function is small and sits inside a computationally expensive for loop, we can add the inline keyword to its declaration. This replaces the function call with the function body itself, removing the cost of calling the function multiple times.
%%cython
cpdef inline int compute_sum(int a, int b):
    return a + b
Typed memoryviews allow even more efficient numpy manipulation since, again, they do not incur the python overhead.
%%cython
import numpy as np

# declare a memoryview by using : inside the square brackets
cdef double[:, :] b = np.zeros((3, 3), dtype = 'float64')
b[:, :] = 1

# it now prints memoryview instead of the original numpy array
print(b)
<MemoryView of 'ndarray' object>
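To build intuition for what a memoryview shares with its source array, we can sketch the analogous behavior in plain Python, since numpy arrays expose the same buffer protocol (this is an illustration added here, not Cython code):

```python
import numpy as np

arr = np.zeros((3, 3), dtype = 'float64')
mv = memoryview(arr)   # numpy arrays support the buffer protocol
back = np.asarray(mv)  # wrap the memoryview back into an ndarray

# both objects share the same underlying buffer,
# so writing through one is visible through the other
back[0, 0] = 1.0
print(arr[0, 0])  # 1.0
```

The same `np.asarray` trick is handy in Cython to hand a typed memoryview back to callers as a regular numpy array.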
We'll start by implementing a pure python version of a pairwise distance function, which will give us a good benchmark for comparison with the Cython alternatives below.
import numpy as np

def euclidean_distance(x1, x2):
    dist = np.sqrt(np.sum((x1 - x2) ** 2))
    return dist

def pairwise_python(X, metric = 'euclidean'):
    if metric == 'euclidean':
        dist_func = euclidean_distance
    else:
        raise ValueError("unrecognized metric")

    n_samples = X.shape[0]
    D = np.zeros((n_samples, n_samples))

    # we could exploit symmetry to reduce the number of computations required,
    # i.e. distance D[i, j] = D[j, i],
    # by only looping over its upper triangle
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            dist = dist_func(X[i], X[j])
            D[i, j] = dist
            D[j, i] = dist

    return D
X = np.random.random((1000, 3))
%timeit pairwise_python(X)
1 loop, best of 3: 3.53 s per loop
We'll try re-writing this in Cython using typed memoryviews. The key with Cython is to avoid Python objects and function calls as much as possible, including vectorized operations on numpy arrays. This usually means writing out all of the loops by hand and operating on single array elements at a time.
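The actual pairwise1.pyx lives in the github folder; as a rough sketch of the idea (an illustration written for this post, not the exact file), a typed-memoryview version might look like:

```cython
%%cython
import numpy as np
from libc.math cimport sqrt

cpdef pairwise1_sketch(double[:, :] X):
    cdef int i, j, k
    cdef int n_samples = X.shape[0]
    cdef int n_dim = X.shape[1]
    cdef double tmp, d
    cdef double[:, :] D = np.zeros((n_samples, n_samples), dtype = 'float64')

    # hand-written loops over single elements, no python-level calls inside
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            d = 0.0
            for k in range(n_dim):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            d = sqrt(d)
            D[i, j] = d
            D[j, i] = d

    # hand the memoryview back to the caller as a regular numpy array
    return np.asarray(D)
```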
All the commented .pyx code can be found in the github folder. You can simply run python setup.py install to install it.
# pairwise1.pyx
from pairwise1 import pairwise1

# test optimized code on a larger matrix
X = np.random.random((5000, 3))
%timeit pairwise1(X)
1 loop, best of 3: 607 ms per loop
We can see the huge speedup over the pure python version! It turns out, though, that we can do even better. If we look at the code, the slicing operations X[i] and X[j] each generate a new numpy array object on every call. So this time, we will index into the X array directly instead of creating a new array each time.
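We can see this object churn from plain numpy (a small illustration added here, not from the original notebook): each basic slice returns a fresh ndarray wrapper object, even though it shares memory with X, and creating those wrappers in a tight loop adds up:

```python
import numpy as np

X = np.random.random((5, 3))

row_a = X[0]
row_b = X[0]

print(row_a is row_b)              # False: a new ndarray object per slice
print(np.shares_memory(row_a, X))  # True: it's a view, not a copy of the data
```

Indexing single elements like X[i, k] inside the Cython loop sidesteps this wrapper allocation entirely.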
# pairwise2.pyx
from pairwise2 import pairwise2
%timeit pairwise2(X)
1 loop, best of 3: 231 ms per loop
We now try to utilize Cython's parallel functionality. For some reason, the parallel version wouldn't compile when following Cython's documentation on building code that utilizes OpenMP (a multithreading API); we'll come back to this in the future. Instead, we had to take a different route and install it as if it were a package. You can simply run python setup_parallel.py install to install it.
# pairwise3.pyx
from pairwise3 import pairwise3
%timeit pairwise3(X)
10 loops, best of 3: 88.9 ms per loop
We've touched upon an example of utilizing Cython to speed up our CPU-intensive numerical operations. To get the full advantage out of Cython, though, it's still good to know some C/C++ programming (things like the void type, pointers, and the standard library).
Numba is an LLVM-based compiler for python code, which allows code written in Python to be converted to highly efficient compiled code at runtime. To use it, we simply add a @jit (just in time compilation) decorator to our function. We can add arguments to the decorator to specify the input types, but it is recommended not to, and simply let Numba decide when and how to optimize.
@jit
def pairwise_numba1(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.zeros((M, M), dtype = np.float64)
    for i in range(M):
        for j in range(i + 1, M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            dist = np.sqrt(d)
            D[i, j] = dist
            D[j, i] = dist
    return D
# a nice speedup from the raw python code given the
# little amount of effort that we had to put in
# (just adding the jit decorator)
%timeit pairwise_numba1(X)
1 loop, best of 3: 236 ms per loop
The @jit decorator tells Numba to compile the function. The argument types will be inferred by Numba when the function is called. If Numba can't infer the types, it will fall back to python objects; when this happens, we probably won't see any significant speedup. The numba documentation lists the python and numpy features that are supported.
A number of keyword-only arguments can be passed to the @jit decorator, e.g. nopython. Numba has two compilation modes: nopython mode and object mode. The former produces much faster code, but has limitations that can force Numba to fall back to the latter. To prevent Numba from falling back, and instead raise an error, pass nopython = True to the decorator, so it becomes @jit(nopython = True). Or we can be even lazier and simply use the @njit decorator, which is an alias for @jit(nopython = True).
The latest version, 0.34.0 (released around mid July 2017), also allows us to write parallel code by passing the parallel = True argument to the decorator and changing range to prange to perform explicit parallel loops. Note that we must ensure the loop does not have cross-iteration dependencies.
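For intuition on cross-iteration dependencies, here is a plain-Python illustration (added here, not Numba code). The first loop is safe to parallelize because each iteration touches only its own elements, like the pairwise loop below; the second is not, because each iteration reads a value written by the previous one:

```python
import numpy as np

x = np.arange(1, 6)

# safe for prange: iteration i only reads x[i] and writes out[i]
out = np.empty_like(x)
for i in range(len(x)):
    out[i] = x[i] * 2

# NOT safe for prange: iteration i reads acc[i - 1],
# which iteration i - 1 must have written first
acc = np.empty_like(x)
acc[0] = x[0]
for i in range(1, len(x)):
    acc[i] = acc[i - 1] + x[i]

print(acc)  # running sum: [ 1  3  6 10 15]
```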
@njit(parallel = True)
def pairwise_numba2(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.zeros((M, M), dtype = np.float64)
    for i in prange(M):
        for j in range(i + 1, M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            dist = np.sqrt(d)
            D[i, j] = dist
            D[j, i] = dist
    return D

%timeit pairwise_numba2(X)
The slowest run took 5.27 times longer than the fastest. This could mean that an intermediate result is being cached. 1 loop, best of 3: 104 ms per loop
Note that by adding the @njit decorator, we are marking the function for optimization by Numba's JIT (just in time) compiler, meaning the python code is compiled on the fly into optimized machine code the first time we invoke the function. In other words, we see an additional speed boost the next time we call the function, since we no longer pay the initial compilation overhead.
10 loops, best of 3: 105 ms per loop
A small change to the code yields a significant gain in speed! This functionality allows us to prototype algorithms with numpy without losing the speed of C++ code. If you've read up to this point, consider giving the numba project a star.
For more information, there is a pretty good PyData talk that illustrates the potential of numba. Youtube: Numba: Flexible analytics written in Python.