%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'
This is a fun problem, for several reasons.
Let's import some libraries. Note that to see plots automatically, you may sometimes have to run `%matplotlib inline` (or `%matplotlib notebook`, or `%matplotlib widget`); with the recommended setup, you should be fine without these.
import matplotlib.pyplot as plt
# Extra performance libraries for later
# import numexpr
import numba
import numpy as np
You can generate a Mandelbrot fractal by repeatedly applying the transform

$$ z_{n+1} = z_{n}^{2} + c $$

to a regular matrix of complex numbers $c$, recording the iteration number at which the value $|z|$ surpasses some bound, usually 2. You start at $z_0 = c$.
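As a quick illustration before we vectorize anything, here is the recurrence applied to a single point. (This is my own sketch; `escape_time` is a hypothetical helper, not something defined elsewhere in this notebook.)

```python
def escape_time(c, maxiterations=50):
    z = c  # start at z_0 = c
    for i in range(1, maxiterations + 1):
        z = z**2 + c
        if abs(z) > 2:  # |z| surpassed the bound: the point diverges
            return i
    return maxiterations  # never diverged within the iteration limit


print(escape_time(1 + 1j))  # diverges on the very first iteration
print(escape_time(0 + 0j))  # 0 is in the Mandelbrot set: never diverges
```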
Let's set up some initial parameters and a helper matrix:
1j + 1  # Python's literal syntax for complex numbers
maxiterations = 50
# 300 x 400 matrix of complex numbers from [-1.5j, 1.5j] x [-2, 2]
c = np.sum(np.broadcast_arrays(*np.ogrid[-1.5j:1.5j:300j, -2:2:400j]), axis=0)
Note that in the interest of absolute brevity, I've taken advantage of the fact that `ogrid` works with complex numbers; `mgrid`, however, does not. `ogrid` is faster anyway.
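To see what that one-liner is doing, here's a minimal sketch on a tiny 3 × 5 grid instead of 300 × 400 (my own illustration). With `ogrid`, an imaginary "step" acts as a point count, like `np.linspace`, and the two open grids broadcast against each other:

```python
import numpy as np

# Imaginary step = number of points, so these behave like linspace
ys, xs = np.ogrid[-1.5j:1.5j:3j, -2:2:5j]
print(ys.shape, xs.shape)  # open (broadcastable) grids: (3, 1) and (1, 5)
print(xs)                  # [[-2. -1.  0.  1.  2.]]

# Broadcasting fills in the full 3 x 5 complex grid; this is equivalent
# to the np.sum(np.broadcast_arrays(...)) form used above
c_small = ys + xs
print(c_small.shape)       # (3, 5)
```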
Let's make sure we have the correct `c`:
fig, axs = plt.subplots(1, 2, figsize=(6, 3))
axs[0].pcolormesh(c.real, c.imag, c.real)
axs[1].pcolormesh(c.real, c.imag, c.imag);
Okay, let's make the fractal as simply as possible in numpy:
fractal = np.zeros_like(c, dtype=np.int32)
z = c.copy()
for i in range(1, maxiterations + 1):
    z = z**2 + c          # Compute z
    diverge = abs(z) > 2  # Divergence criterion
    z[diverge] = 2        # Keep the numbers from growing too large
    fractal[~diverge] = i  # Record the iteration number for non-diverged points
fig, ax = plt.subplots(figsize=(4, 3))
ax.pcolormesh(c.real, c.imag, fractal);
In a notebook, you often start with raw code (like the above) for easy debugging, but once it works, you put it in a function, like the one below:
def fractal_numpy(c, maxiterations):
    f = np.zeros_like(c, dtype=np.int32)
    z = c.copy()
    for i in range(1, maxiterations + 1):
        z = z * z + c                  # Compute z
        diverge = np.abs(z**2) > 2**2  # Divergence criterion (equivalent to |z| > 2)
        z[diverge] = 2                 # Keep the numbers from growing too large
        f[~diverge] = i                # Record the iteration number for non-diverged points
    return f
%%timeit
fractal_numpy(c, maxiterations)
While we wouldn't normally do this, expecting it to be much slower, here is the pure Python version:
def fractal_pure(c, maxiterations):
    fractal = np.zeros_like(c, dtype=np.int32)
    for yi in range(c.shape[0]):
        for xi in range(c.shape[1]):
            z = cxy = c[yi, xi]
            for i in range(1, maxiterations + 1):
                z = z**2 + cxy
                if abs(z) > 2:
                    break
            fractal[yi, xi] = i
    return fractal
%%timeit
fractal_pure(c, maxiterations)
I don't know about you, but that was much faster than I would have naively expected. Why? What is different about the algorithm?
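One way to investigate (my own sketch, using the same grid as above): track how many points are still "alive" as the iterations proceed. The pure Python version can `break` out of the inner loop as soon as a point diverges, while the NumPy version does a full-array pass on every one of the 50 iterations, so if most points diverge early, NumPy wastes most of its work:

```python
import numpy as np

c = np.sum(np.broadcast_arrays(*np.ogrid[-1.5j:1.5j:300j, -2:2:400j]), axis=0)
z = c.copy()
alive = np.ones(c.shape, dtype=bool)  # points that have not diverged yet
for i in range(1, 51):
    z[alive] = z[alive] ** 2 + c[alive]  # only update the live points
    alive &= np.abs(z) <= 2              # once dead, a point stays dead
    if i in (1, 5, 50):
        print(f"after {i:2d} iterations: {alive.mean():6.1%} still not diverged")
```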
For later use, and for better design, let's break the above function into two pieces: the "on each" part, and the part that applies it to each element (vectorization).
def on_each_python(cxy, maxiterations):
    z = cxy
    for i in range(1, maxiterations + 1):
        z = z * z + cxy
        if abs(z) > 2:
            return i
    return i

def fractal_python(c, maxiterations):
    fractal = np.zeros_like(c, dtype=np.int32)
    for yi in range(c.shape[0]):
        for xi in range(c.shape[1]):
            fractal[yi, xi] = on_each_python(c[yi, xi], maxiterations)
    return fractal
%%timeit
fractal_python(c, maxiterations)
While we'll look at something much more interesting soon, here is NumPy's `vectorize`. It is not supposed to do much except replace the outer function we had to define manually (though I've actually found it to be noticeably faster).
fractal_npvectorize = np.vectorize(on_each_python)
%%timeit
fractal_npvectorize(c, maxiterations)
We can plot any of these to make sure they work:
fractal = fractal_npvectorize(c, maxiterations)
fig, ax = plt.subplots(figsize=(4, 3))
ax.pcolormesh(c.real, c.imag, fractal);
Never optimize until you have profiled! If code becomes uglier/harder to maintain, you must have a solid reason for doing so.
Let's look at the `line_profiler` package, which has fairly nice IPython magic support. First, let's load the magic:
%load_ext line_profiler
Now, we run the magic with `-f function_to_profile` and the command to profile. Only the lines of the function we specify will show up:
%lprun -f fractal_numpy fractal_numpy(c, maxiterations)
If you don't have external packages available, the built-in `cProfile` is also usable, though not as pretty.

Note

Most standard library modules with paired names like `something`/`cSomething` were merged in Python 3, with the faster compiled implementation being selected automatically. This one was not, since `cProfile` and `profile` are not quite identical; `profile` is much slower.
import cProfile
cProfile.run("fractal_numpy(c, maxiterations)", sort=2)
(Note: Numba takes full advantage of the instruction set on your system, since it does not expect to be compiled on one machine and run on another; thus Numba will often be faster than conventionally compiled code.)
How can we make NumPy faster? Expressions are slow in NumPy because they usually create lots of temporary arrays, and memory allocations are costly. To avoid this, you could manually reuse memory, but that would require lots of ugly rewriting, such as taking `a + b + c` and writing `t = a + b; t += c`.
Starting with NumPy 1.13, some simple expressions like the one above avoid making memory copies (generally on Unix only).
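A minimal sketch of the manual-reuse idea, using in-place operators (my own illustration; I use `c2` for the third array to avoid clobbering the complex grid `c` from earlier):

```python
import numpy as np

a, b, c2 = np.random.random_sample(size=(3, 100_000))

# Naive form: a + b allocates a temporary, then + c2 allocates the result
naive = a + b + c2

# Manual reuse: one allocation for the intermediate, then accumulate in place
t = np.add(a, b)  # t = a + b (single allocation)
t += c2           # reuses t's buffer instead of allocating again

assert np.allclose(naive, t)
```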
There's a second issue; even with avoiding unneeded temporaries, you still have to run multiple kernels (computation functions) - it would be nicer if you could just do the full calculation on each input and produce one output, with no in-between steps.
One way to do this is with numexpr. This is an odd little library that can compile small expressions just-in-time (JIT). Here's what it looks like in practice:
import numexpr
a, b = np.random.random_sample(size=(2, 100_000))
print("classic", 2 * a + 3 * b)
print("numexpr", numexpr.evaluate("2 * a + 3 * b"))
%%timeit
c = 2 * a + 3 * b # Try 2 * a**2 + 3 * b**3
%%timeit
c = numexpr.evaluate("2 * a + 3 * b")
However, numexpr is very limited. It has a small set of data types, a small collection of supported operators and basic functions, and it works one line at a time. You can make it less magical by feeding in explicit variable dictionaries if you want.
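For instance, `numexpr.evaluate` accepts a `local_dict` argument, so variables come from an explicit dictionary rather than being pulled out of the calling frame (a small sketch; the variable name `v` is my own choice):

```python
import numexpr
import numpy as np

x = np.arange(5.0)

# Explicitly feed the variables instead of relying on frame inspection
r = numexpr.evaluate("2 * v + 1", local_dict={"v": x})
print(r)  # [1. 3. 5. 7. 9.]
```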
If that managed to whet your appetite, let's look at Numba, a Python-to-LLVM JIT compiler! We'll see it again, but for now, here's a little demo:
@numba.vectorize
def f(a, b):
    return 2 * a + 3 * b
f
%%time
c = f(a, b)
%%timeit
c = f(a, b)
It took the function we defined, pulled it apart, turned it into LLVM (Low Level Virtual Machine) code, and compiled it. No special strings or special syntax; it is just a (large) subset of Python and NumPy, and users and libraries can extend it too.
It is almost always as fast as or faster than any other compiled solution (minus the JIT time). A couple of years ago it became much easier to install (via pip, with llvmlite's lightweight and independent LLVM build).
Try some of the following:

- Use `numexpr` to replace parts of the above calculation. Why is this not very effective?
- `@numba.njit`
- `@numba.vectorize`
Further reading: