Is it a bird?
Is it a plane?
Accelerating Python with numba

Juan Luis Cano Rodríguez <hello@juanlu.space>
2020-04-09 @ PyAmsterdam #StayAtHome

Outline¶

"Python is slow"
What is numba?
Some examples
Limitations and workarounds
Conclusions

Who is this guy?¶

Aerospace Engineer with a passion for orbits 🛰
Former chair of the Python España non profit and former co-organizer of PyCon Spain 🐍
Mission Planning & Execution Engineer at Satellogic 🌍
Free Software advocate and Python enthusiast 🕮
Hard Rock lover 🎸

Follow me! https://github.com/astrojuanlu/

Me!

"Python is slow"¶

Data structures¶

array vs list

(From https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/)

Dynamic and interpreted (rather than static and compiled)¶

Four nested loops

(From https://gist.github.com/Juanlu001/cf19b1c16caf618860fb)

Introspection¶

"Why is checking isinstance(something, Mapping) so slow?"

(From https://stackoverflow.com/q/42378726/554319)

What to do?¶

Vectorization¶

Rewriting some code leveraging high level NumPy functions can make it way faster
However, this works best for array manipulation - some other algorithms cannot easily be vectorized
And even if you can, vectorized code can be impossible to read

Too smart

Cython¶

Mature, widely used, effective, gradual - a great project!
Some personal problems with it:
- I don't know any C, so it's more difficult for me
- I wanted poliastro to be super easy to install by avoiding the "two language" problem (this includes Windows)
- The native debugger is broken https://github.com/cython/cython/issues/1717
- I really don't want to worry about some gore details

I don't have lots of experience with it, so I don't have solid arguments against it.

PyPy¶

PyPy is a super interesting alternative Python implementation https://pypy.org/
I really really want to use it more, but there are some obstacles:
- The documentation is a bit poor, even the changelogs
- Lacks interest from the mainstream community (including snarky comments by Guido about "nobody using it in production")
- Difficult to install libraries with binary extensions, although conda-forge is helping! https://conda-forge.org/status/
- PyPy has several incompatibilities with manylinux1 wheels https://bitbucket.org/pypy/pypy/issues/2617/ although manylinux2010 is taking off!
- At the moment, requires rewriting NumPy-based code to use plain Python lists... impractical

And a lot others...¶

https://github.com/pfalcon/awesome-python-compilers/

The number of projects keeps growing

All of a sudden, in 2012...¶

numba 0.1 is released

https://twitter.com/teoliphant/status/235789560678858752

What is numba?¶

numba

Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code.

Latest stable version (at the time of writing) 0.48, (0.49.0rc1 tagged 13 days ago)
Documentation https://numba.pydata.org/numba-doc/latest/index.html
BSD-2 License
Easy to install:

$ pip install numba
$ conda install numba [--channel conda-forge]

Example 1: Monte Carlo for $\pi$¶

Monte Carlo pi

(From https://commons.wikimedia.org/wiki/File:Pi_30K.gif)

In [1]:

import random

def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [2]:

monte_carlo_pi(100)

Out[2]:

3.56

In [3]:

monte_carlo_pi(100_000)

Out[3]:

3.14984

In [4]:

%timeit monte_carlo_pi(10_000_000)  # Slow!

5.38 s ± 708 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]:

from numba import jit

In [6]:

monte_carlo_pi_fast = jit(monte_carlo_pi)

In [7]:

%timeit -n1 -r1 monte_carlo_pi_fast(100)

272 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [8]:

%timeit monte_carlo_pi_fast(10_000_000)  # 40x faster!

152 ms ± 8.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Example 2: Plate deflection¶

$$ w(\xi, \eta) = \frac{4 P_c}{\pi^4 D L_x L_y} \sum_{m=1}^\infty \sum_{n=1}^\infty \frac{\sin{\frac{m \pi \xi}{L_x}} \sin \frac{n \pi \eta}{L_y}}{\left(\left(\frac{m}{L_x}\right)^2 + \left(\frac{n}{L_y}\right)^2\right)^2} $$

(From http://www.efunda.com/formulae/solid_mechanics/plates/calculators/SSSS_PPoint.cfm)

In [9]:

import numpy as np
from numpy import pi, sin

In [10]:

@jit
def a_mn_point(P, a, b, xi, eta, mm, nn):
    """Navier series coefficient for concentrated load."""
    return 4 * P * sin(mm * pi * xi / a) * sin(nn * pi * eta / b) / (a * b)

In [11]:

@jit
def plate_displacement(xx, yy, ww, a, b, P, xi, eta, D, max_m, max_n):
    max_i, max_j = ww.shape
    for mm in range(1, max_m):
        for nn in range(1, max_n):
            for ii in range(max_i):
                for jj in range(max_j):
                    a_mn = a_mn_point(P, a, b, xi, eta, mm, nn)
                    ww[ii, jj] += (
                        a_mn
                        / (mm ** 2 / a ** 2 + nn ** 2 / b ** 2) ** 2
                        * sin(mm * pi * xx[ii, jj] / a)
                        * sin(nn * pi * yy[ii, jj] / b)
                        / (pi ** 4 * D)
                    )

In [12]:

# Plate geometry
a = 1.0  # m
b = 1.0  # m
h = 50e-3  # m

# Material properties
E = 69e9  # Pa
nu = 0.35

# Series terms
max_m = 16
max_n = 16

# Computation points
# NOTE: With an odd number of points the center of the place is included in
# the grid
NUM_POINTS = 101

# Load
P = -10e3  # N
xi = a / 2
eta = a / 2

# Flexural rigidity
D = h**3 * E / (12 * (1 - nu**2))

# Set up domain
x = np.linspace(0, a, num=NUM_POINTS)
y = np.linspace(0, b, num=NUM_POINTS)
xx, yy = np.meshgrid(x, y)

In [13]:

# Compute displacement field
ww = np.zeros_like(xx)
%timeit -n1 -r1 plate_displacement.py_func(xx, yy, ww, a, b, P, xi, eta, D, max_m, max_n)

23.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [14]:

%timeit plate_displacement(xx, yy, ww, a, b, P, xi, eta, D, max_m, max_n)  # 100 times faster!

132 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]:

# https://github.com/plotly/plotly.py/issues/1664#issuecomment-511773518
import plotly.graph_objects as go
import plotly.io as pio

# Set default renderer
pio.renderers.default = "notebook_connected"

# Set default template
pio.templates["slides"] = go.layout.Template(layout=dict(width=800, height=550))
pio.templates.default = "plotly+slides"

In [16]:

fig = go.Figure()
fig.add_surface(z=ww)

Example 3: A shameless self-plug personal achievement¶

"poliastro: An Astrodynamics library written in Python with Fortran performance" at the 6th International Conference on Astrodynamics Tools and Techniques

poliastro benchmark

https://indico.esa.int/event/111/contributions/393/

Example 4: C extensions with CFFI!¶

We can call C functions exported using CFFI:

https://web.archive.org/web/20160611082327/https://www.continuum.io/blog/developer-blog/calling-c-libraries-numba-using-cffi

from numba import njit, cffi_support

# See https://www.pybonacci.org/2016/02/07/como-crear-extensiones-en-c-para-python-usando-cffi-y-numba/
# (Spanish, sorry!)
import _hyper
cffi_support.register_module(_hyper)

_hyp2f1 = _hyper.lib.hyp2f1x  # See https://github.com/numba/numba/issues/1688


@njit
def hyp2f1(a, b, c, x):
    return _hyp2f1(a, b, c, x)

Caveats and limitations ☹️¶

python-sgp4 fail

The "nopython mode" is the only way¶

Two modes: "object mode" and "nopython mode", only the latter is truly optimized
Functions JITted in nopython mode can only call other functions in nopython mode
Avoid "object mode"! In the process of being deprecated, in numba 0.44 raises warnings

It's nopython all the way down

In [17]:

@jit
def range10():
    l = []
    for x in range(10):
        l.append(x)
    return l

range10()

Out[17]:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [18]:

@jit
def reversed_range10():
    l = []
    for x in range(10):
        l.append(x)

    return reversed(l)  # innocuous change, but no reversed support in nopython mode

reversed_range10()

<ipython-input-18-2e1fa93510fc>:1: NumbaWarning:


Compilation is falling back to object mode WITH looplifting enabled because Function "reversed_range10" failed type inference due to: Untyped global name 'reversed': cannot determine Numba type of <class 'type'>

File "<ipython-input-18-2e1fa93510fc>", line 7:
def reversed_range10():
    <source elided>

    return reversed(l)  # innocuous change, but no reversed support in nopython mode
    ^


<ipython-input-18-2e1fa93510fc>:1: NumbaWarning:


Compilation is falling back to object mode WITHOUT looplifting enabled because Function "reversed_range10" failed type inference due to: cannot determine Numba type of <class 'numba.dispatcher.LiftedLoop'>

File "<ipython-input-18-2e1fa93510fc>", line 4:
def reversed_range10():
    <source elided>
    l = []
    for x in range(10):
    ^


/home/juanlu/.pyenv/versions/3.8.0/envs/numba38/lib/python3.8/site-packages/numba/object_mode_passes.py:177: NumbaWarning:

Function "reversed_range10" was compiled in object mode without forceobj=True, but has lifted loops.

File "<ipython-input-18-2e1fa93510fc>", line 3:
def reversed_range10():
    l = []
    ^


/home/juanlu/.pyenv/versions/3.8.0/envs/numba38/lib/python3.8/site-packages/numba/object_mode_passes.py:187: NumbaDeprecationWarning:


Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.

For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit

File "<ipython-input-18-2e1fa93510fc>", line 3:
def reversed_range10():
    l = []
    ^


<ipython-input-18-2e1fa93510fc>:1: NumbaWarning:


Compilation is falling back to object mode WITHOUT looplifting enabled because Function "reversed_range10" failed type inference due to: non-precise type pyobject
[1] During: typing of argument at <ipython-input-18-2e1fa93510fc> (4)

File "<ipython-input-18-2e1fa93510fc>", line 4:
def reversed_range10():
    <source elided>
    l = []
    for x in range(10):
    ^


/home/juanlu/.pyenv/versions/3.8.0/envs/numba38/lib/python3.8/site-packages/numba/object_mode_passes.py:177: NumbaWarning:

Function "reversed_range10" was compiled in object mode without forceobj=True.

File "<ipython-input-18-2e1fa93510fc>", line 4:
def reversed_range10():
    <source elided>
    l = []
    for x in range(10):
    ^


/home/juanlu/.pyenv/versions/3.8.0/envs/numba38/lib/python3.8/site-packages/numba/object_mode_passes.py:187: NumbaDeprecationWarning:


Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.

For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit

File "<ipython-input-18-2e1fa93510fc>", line 4:
def reversed_range10():
    <source elided>
    l = []
    for x in range(10):
    ^

Out[18]:

<list_reverseiterator at 0x7f00643809a0>

In [19]:

@jit(nopython=True)
def reversed_range10():
    l = []
    for x in range(10):
        l.append(x)

    return l[::-1]  # innocuous change, but no reversed support in nopython mode

reversed_range10()

Out[19]:

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

Passing functions as arguments is slow¶

Since numba 0.38 the user can pass JITted functions as arguments, but it's even slower than not JITting them https://github.com/numba/numba/issues/2952
Arguably the most important blocker to write reusable numba code

In [21]:

from numba import njit

In [22]:

@njit
def func(x):
    return x**3 - 1

@njit
def fprime(x):
    return 3 * x**2

In [23]:

@njit
def njit_newton(func, x0, fprime):
    for _ in range(50):
        fder = fprime(x0)
        fval = func(x0)
        newton_step = fval / fder
        x = x0 - newton_step
        if abs(x - x0) < 1.48e-8:
            return x
        x0 = x

In [24]:

%timeit njit_newton(func, 1.5, fprime)
%timeit njit_newton.py_func(func, 1.5, fprime=fprime)

56.5 µs ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.93 µs ± 749 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

With a smart combination of closures and caching we can implement a workaround:

In [25]:

from functools import lru_cache

In [26]:

@lru_cache()
def newton_generator(func, fprime):
    @njit
    def njit_newton_final(x0):
        for _ in range(50):
            fder = fprime(x0)
            fval = func(x0)
            newton_step = fval / fder
            x = x0 - newton_step
            if abs(x - x0) < 1.48e-8:
                return x
            x0 = x

    return njit_newton_final

In [27]:

%timeit -n1 -r1 newton_generator(func, fprime)

528 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [28]:

%timeit -n1 -r1 newton_generator(func, fprime)

2.24 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [29]:

newton_func = newton_generator(func, fprime)
%timeit -n1 -r1 newton_func(1.5)

102 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [30]:

%timeit newton_func(1.5)

307 ns ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [31]:

%timeit newton_generator(func, fprime)

211 ns ± 7.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [32]:

def newton(func, x0, fprime):
    return newton_generator(func, fprime)(x0)

%timeit newton(func, 1.5, fprime)

584 ns ± 26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

NumPy arrays and nothing else¶

Two layers

High level API:

Supports complex data structures (e.g. astropy.units or pint, NumPy extensions for physical units)
Convert the code to normalized, simple structure that numba understands

Dangerous™ algorithms:

Fast (easy to accelerate with numba.njit)
Only cares about numbers, makes assumptions

Tips and tricks¶

If anything fails, export NUMBA_DISABLE_JIT=1
Be careful with dtypes! https://github.com/numba/numba/issues/3993#issuecomment-485029668
Try to split the function in smaller chunks until you find out what triggers object (slow) mode
You might have to rewrite some stuff, make it less dynamic
Keep an eye on https://numba.pydata.org/numba-doc/dev/reference/pysupported.html and https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html

Not covered in this talk¶

Ahead-of-time (AOT) compilation
Interface with GPUs (CUDA, ROCm)
Releasing the GIL

Check out the documentation! https://numba.pydata.org/numba-doc/latest/

Conclusions¶

Numba is ✨ awesome ✨ when you make it work!
Still requires a bit of code rewrites, but the code is still mostly pythonic Python
For non-numerical code, you will probably have to find something else

Per Python ad astra! 🚀¶

Slides https://github.com/astrojuanlu/talk-numba
My email hello@juanlu.space
My Twitter https://twitter.com/poliastro_py

Vega launch

Backup slides¶

Comparison of solutions¶

Project	Pros	Cons
NumPy	Powerful, omnipresent	Vectorized code is sometimes difficult to read¹, if you can't vectorize you are out of luck
Cython	Gradual, effective, widely used, mature	Tricky if you don't know any C, couldn't make the native debugger work²
PyPy	General purpose	C extensions still very slow, no wheels on PyPI
Numba	Simplest, very effective	Only numerical code, needs special care

And many others: Pythran, Nuitka, mypyc...

¹Check out "Integration with the vernacular", by James Powell https://pyvideo.org/pydata-london-2015/integration-with-the-vernacular.html

²https://github.com/cython/cython/issues/2699

³See https://github.com/antocuni/pypy-wheels for a half-baked effort. Perhaps the future will be brighter with the new manylinux2010 specification? https://bitbucket.org/pypy/pypy/issues/2617/pypy-binary-is-linked-to-too-much-stuff, https://github.com/pypa/manylinux/issues/179