**WARNING**: Due to the CPU restrictions on notebook execution on Binder, the benefits of multithreading are going to be erratic. The code in this notebook will run on Binder, but for reasonable benchmarks, you should download and run this notebook on your own system.
Numba supports several approaches to multithreading:
The first two options make use of the ParallelAccelerator optimization pass (contributed by Intel) in Numba. ParallelAccelerator is only supported on 64-bit platforms, and is not available for Python 2.7 on Windows. It is also only effective when compiling in nopython mode.
import numpy as np import numba from numba import jit
NumPy array expressions have a significant amount of implied parallelism, as operations are broadcast independently over the input elements. ParallelAccelerator can identify this parallelism and automatically distribute it over several threads. All we need to do is enable the parallelization pass with
SQRT_2PI = np.sqrt(2 * np.pi) @jit(nopython=True, parallel=True) def gaussians(x, means, widths): '''Return the value of gaussian kernels. x - location of evaluation means - array of kernel means widths - array of kernel widths ''' n = means.shape result = np.exp( -0.5 * ((x - means) / widths)**2 ) / widths return result / SQRT_2PI / n
means = np.random.uniform(-1, 1, size=1000000) widths = np.random.uniform(0.1, 0.3, size=1000000) gaussians(0.4, means, widths)
To see the effect of multiple CPUs, we can compare to the case where ParallelAccelerator disabled. Noting that decorators are functions that transform other functions, we can call
jit as a function:
gaussians_nothread = jit(nopython=True)(gaussians.py_func) %timeit gaussians_nothread(0.4, means, widths) %timeit gaussians(0.4, means, widths)
We can also compare the performance to the uncompiled NumPy array evaluation using the
.py_func attribute to get the original Python function:
%timeit gaussians.py_func(0.4, means, widths) # compare to pure NumPy
The performance ratio depends on the number of CPUs in your system, but the multithreaded version is definitely faster than the single threaded version.
ParallelAccelerator can also handle reductions:
@jit(nopython=True, parallel=True) def kde(x, means, widths): '''Return the value of gaussian kernels. x - location of evaluation means - array of kernel means widths - array of kernel widths ''' n = means.shape result = np.exp( -0.5 * ((x - means) / widths)**2 ) / widths return result.mean() / SQRT_2PI kde_nothread = jit(nopython=True)(kde.py_func)
%timeit kde_nothread(0.4, means, widths) %timeit kde(0.4, means, widths)
There are other situations where you would like multithreading, but do not have a straightforward array expression. In those cases, using
prange() in a for loop indicates to ParallelAccelerator that this is a loop where each iteration is independent of the other and can be executed in parallel.
For example, we might want to run many Monte Carlo trials in a row:
import random # Serial version @jit(nopython=True) def monte_carlo_pi_serial(nsamples): acc = 0 for i in range(nsamples): x = random.random() y = random.random() if (x**2 + y**2) < 1.0: acc += 1 return 4.0 * acc / nsamples # Parallel version @jit(nopython=True, parallel=True) def monte_carlo_pi_parallel(nsamples): acc = 0 # Only change is here for i in numba.prange(nsamples): x = random.random() y = random.random() if (x**2 + y**2) < 1.0: acc += 1 return 4.0 * acc / nsamples
prange() is automatically handling the reduction variable
acc in a thread-safe way for you. We are also relying on Numba to automatically initialize the random number generator in each thread independently.
You can also have each thread in a
prange() modify a separate element in an output array, but more general race conditions are not automatically resolved by ParallelAccelerator. Be careful!
Let's see how fast these two implementations are:
%time monte_carlo_pi_serial(int(4e8)) %time monte_carlo_pi_parallel(int(4e8))
The parallel version saturates all the CPUs once the initial compilation finishes.
Sometimes your threading system is external to Numba entirely. You might be using
concurrent.futures to run functions in multiple threads, or a parallel framework like Dask. For these situations, you do not want to use ParallelAccelerator, but do want to allow the Numba-compiled function to run concurrently in different threads.
To do this, you want the Numba function to release the Global Interpreter Lock (GIL) during execution. This can be done using the
nogil=True option to
Let's do our Monte Carlo example again, but with Dask. Note that Numba will still handle initializing separate random number generator seeds on each thread, as it did with ParallelAccelerator.
import dask import dask.delayed @jit(nopython=True, nogil=True) def monte_carlo_pi(nsamples): acc = 0 for i in range(nsamples): x = random.random() y = random.random() if (x**2 + y**2) < 1.0: acc += 1 return 4.0 * acc / nsamples print(monte_carlo_pi(int(1e6))) delayed_monte_carlo_pi = dask.delayed(monte_carlo_pi)
%%time futures = [delayed_monte_carlo_pi(int(4e8)) for i in range(4)] results = dask.compute(futures) print(sum(results)/4) # average resuts
%%time futures = [delayed_monte_carlo_pi(int(4e8)) for i in range(4)] results = dask.compute(futures, num_workers=1) print(sum(results)/4) # average resuts