Numba-dppy is a standalone extension to the Numba JIT compiler that adds SYCL programming capabilities to Numba. Numba-dppy uses dpctl to support SYCL features. Currently Intel’s DPC++ is the only SYCL runtime supported by Numba-dppy.
Numba-dppy provides two ways to express SYCL parallelism:

An automatic offload mode for NumPy data-parallel expressions and Numba parallel loops written with @numba.jit. This approach extends Numba's existing auto-parallelizer to generate SYCL kernels from data-parallel code regions. It requires only minimal changes to existing code: a programmer can try to offload an existing @numba.jit-decorated function to a SYCL device simply by invoking the function inside a dpctl.device_context scope.

An explicit kernel programming mode using the @numba_dppy.kernel decorator. The explicit kernel approach is similar to Numba's other GPU backends, such as numba.cuda. The @numba_dppy.kernel decorator is provided by the numba-dppy package and exposes several advanced SYCL features, such as indexing, synchronization, fences, and atomics. Using the decorator, a relatively low-level SYCL kernel can thus be written directly in Python. This feature is intended for programmers who already have SYCL and GPU programming experience.
import numpy as np
import numba_dppy as dppy # numba-dppy package should be installed for the examples below.
import dpctl
from numba import njit
import math
The automatic offload feature in numba-dppy is triggered when a @numba.jit function is invoked inside a dpctl.device_context scope. The following example demonstrates numba-dppy's automatic offload functionality. Note that the example is identical to a normal Numba parallel example; the only difference is that the function is called inside a dpctl.device_context.
@njit
def f1(a, b):
    c = a + b
    return c

N = 64 * 32
a = np.ones(N, dtype=np.float32)
b = np.ones(N, dtype=np.float32)

# Use the environment variable SYCL_DEVICE_FILTER to change the default device.
# See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter.
device = dpctl.select_default_device()
print("Using device ...")
print(device)

with dpctl.device_context(device):
    c = f1(a, b)

print(c)
Using device ...
<dpctl.SyclDevice [backend_type.level_zero, device_type.gpu, Intel(R) Graphics [0x5917]] at 0x18db438cfb0>
[2. 2. 2. ... 2. 2. 2.]
Controllable fallback behavior during automatic offload
By default, if a section of code cannot be offloaded to the GPU, it is automatically executed on the CPU and a warning is printed. This behavior applies only to JIT functions, i.e. auto-offloading of NumPy calls, array expressions, and prange loops. To disable the fallback and force the code to run on the GPU, set the environment variable NUMBA_DPPY_FALLBACK_OPTION to false (for example, export NUMBA_DPPY_FALLBACK_OPTION=0). In that case the code is not automatically re-executed on the CPU; instead, any offload failure raises an error.
Diagnostic reporting for automatic offload
Export NUMBA_DPPY_OFFLOAD_DIAGNOSTICS=1:

Setting the debug environment variable NUMBA_DPPY_OFFLOAD_DIAGNOSTICS enables emission of parallel and offload diagnostic information based on the parallel transforms that were applied. The level of detail is controlled by the integer value, between 1 and 4, assigned to the environment variable (higher is more detailed). The "Auto-offloading" section of the report shows the device (by device name) to which each kernel was offloaded.
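Both variables must be set in the environment before numba-dppy is imported. A minimal sketch of a launcher script that configures them (pure Python; the specific values chosen here are just examples):

```python
import os

# Set numba-dppy configuration knobs before the package is imported:
# NUMBA_DPPY_FALLBACK_OPTION=0 disables the automatic CPU fallback;
# NUMBA_DPPY_OFFLOAD_DIAGNOSTICS=4 requests the most detailed report.
os.environ["NUMBA_DPPY_FALLBACK_OPTION"] = "0"
os.environ["NUMBA_DPPY_OFFLOAD_DIAGNOSTICS"] = "4"

# import numba_dppy as dppy  # import only after the variables are set

print(os.environ["NUMBA_DPPY_FALLBACK_OPTION"])
print(os.environ["NUMBA_DPPY_OFFLOAD_DIAGNOSTICS"])
```

Setting the variables in the shell (export ...) before launching Python achieves the same effect.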
Writing a SYCL kernel using the @numba_dppy.kernel decorator has a syntax similar to writing OpenCL kernels. The numba-dppy module provides indexing and other functions similar to OpenCL's. Some of the indexing functions supported inside a numba_dppy.kernel are:

numba_dppy.get_global_id: Gets the global ID of the item
numba_dppy.get_local_id: Gets the local ID of the item
numba_dppy.get_local_size: Gets the local work-group size of the device
numba_dppy.get_group_id: Gets the group ID of the item
numba_dppy.get_num_groups: Gets the number of work groups

Refer to https://intelpython.github.io/numba-dppy/latest/user_guides/kernel_programming_guide/index.html for more details.
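These queries are related by OpenCL's NDRange arithmetic: for a 1-D range, a flat global ID decomposes into a group ID and a local ID. A pure-Python sketch of that relationship (the function name below is illustrative, not part of numba-dppy):

```python
def decompose(global_id, local_size):
    """Split a flat global work-item ID into (group_id, local_id),
    mirroring get_group_id / get_local_id for dimension 0."""
    return global_id // local_size, global_id % local_size

# With a local (work-group) size of 8, work-item 19 is item 3 of group 2.
group_id, local_id = decompose(19, 8)
print(group_id, local_id)  # -> 2 3
```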
@dppy.kernel
def sum(a, b, c):
    i = dppy.get_global_id(0)
    c[i] = a[i] + b[i]

a = np.array(np.random.random(20), dtype=np.float32)
b = np.array(np.random.random(20), dtype=np.float32)
c = np.ones_like(a)

device = dpctl.select_default_device()
with dpctl.device_context(device):
    sum[20, dppy.DEFAULT_LOCAL_SIZE](a, b, c)

print(a + b)
print(c)
[1.1647326 0.5044042 1.0928384 1.6194623 0.64363265 0.923868 0.9901773 0.16170211 0.6585165 0.37717268 1.7218891 0.7935294 1.1921285 1.0631248 0.97428465 1.2411709 0.12518258 1.3276634 0.50359565 1.2648091 ]
[1.1647326 0.5044042 1.0928384 1.6194623 0.64363265 0.923868 0.9901773 0.16170211 0.6585165 0.37717268 1.7218891 0.7935294 1.1921285 1.0631248 0.97428465 1.2411709 0.12518258 1.3276634 0.50359565 1.2648091 ]
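Conceptually, the launch sum[20, dppy.DEFAULT_LOCAL_SIZE](a, b, c) runs the kernel body once per work-item, with get_global_id(0) ranging over 0..19. A pure-Python emulation of that execution model (illustrative only; no SYCL device involved, and launch/sum_kernel are made-up names):

```python
def launch(kernel, global_size, *args):
    # Sequentially emulate a 1-D kernel launch: one call per work-item,
    # passing the work-item's global ID explicitly.
    for gid in range(global_size):
        kernel(gid, *args)

def sum_kernel(gid, a, b, c):
    # The body mirrors the @dppy.kernel above, with gid playing the
    # role of dppy.get_global_id(0).
    c[gid] = a[gid] + b[gid]

a = [1.0] * 20
b = [2.0] * 20
c = [0.0] * 20
launch(sum_kernel, 20, a, b, c)
print(c[:3])  # -> [3.0, 3.0, 3.0]
```

On a real device the work-items execute in parallel rather than in a Python loop, which is why kernels index by ID instead of iterating themselves.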
Numba-dppy supports several of the atomic operations available in DPC++.

class numba_dppy.ocl.stubs.atomic
    atomic namespace

    add(ary, idx, val)
        Perform atomic ary[idx] += val. Returns the old value at the index location as if it is loaded atomically.

    sub(ary, idx, val)
        Perform atomic ary[idx] -= val. Returns the old value at the index location as if it is loaded atomically.
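The fetch-and-add semantics (the old value is returned) can be emulated on the host with a lock. This is a sketch of the semantics only, not numba-dppy's implementation:

```python
import threading

_lock = threading.Lock()

def atomic_add(ary, idx, val):
    """Host-side emulation of numba_dppy.atomic.add: atomically perform
    ary[idx] += val and return the value that was there before."""
    with _lock:
        old = ary[idx]
        ary[idx] = old + val
        return old

# 100 concurrent increments of a[0]; with atomicity, none are lost.
a = [0.0]
threads = [threading.Thread(target=atomic_add, args=(a, 0, 1.0)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(a[0])  # -> 100.0
```

Without the lock, concurrent read-modify-write sequences could interleave and lose updates, which is exactly the hazard atomics prevent in the GPU example below.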
The following example demonstrates the use of numba-dppy's atomic.add operation on a SYCL device. dpctl.select_gpu_device is equivalent to sycl::gpu_selector and returns a sycl::device of type GPU. For more information see https://github.com/IntelPython/numba-dppy/blob/0.16.0/numba_dppy/examples/atomic_op.py. As noted in that example, certain environment variables must be set to use the native floating-point atomics; without them, numba-dppy uses an alternative implementation.
@dppy.kernel
def atomic_add(a):
    dppy.atomic.add(a, 0, 1)

global_size = 100
a = np.array([0], dtype=np.float32)

device = dpctl.select_default_device()
with dppy.offload_to_sycl_device(device):
    atomic_add[global_size, dppy.DEFAULT_LOCAL_SIZE](a)

print(a)
print("Done...")
[100.]
Done...

The expected result is 100, because global_size = 100.
OpenCL and SYCL do not directly have a notion of device-only functions, i.e. functions that can be invoked only from a kernel and not from a host function. However, numba-dppy provides a special decorator, numba_dppy.func, specifically for implementing device functions.
@dppy.func
def a_device_function(a):
    """
    A ``func`` is a device callable function that can be invoked from
    ``kernel`` and other ``func`` functions.
    """
    return a + 1

@dppy.func
def another_device_function(a):
    return a_device_function(a)

@dppy.kernel
def a_kernel_function(a, b):
    i = dppy.get_global_id(0)
    b[i] = another_device_function(a[i])

N = 10
a = np.ones(N)
b = np.ones(N)

device = dpctl.select_default_device()
with dppy.offload_to_sycl_device(device):
    a_kernel_function[N, dppy.DEFAULT_LOCAL_SIZE](a, b)

print("Done...")
Done...
This example demonstrates a summation reduction on a one-dimensional array.
In this example, to reduce the array we invoke the kernel multiple times.
@dppy.kernel
def sum_reduction_kernel(A, R, stride):
    i = dppy.get_global_id(0)
    # sum two elements
    R[i] = A[i] + A[i + stride]
    # store the sum to be used in the next iteration
    A[i] = R[i]

def sum_reduce(A):
    """Size of A should be a power of two."""
    total = len(A)
    # at most half the size of A is needed to store the partial sums
    R = np.array(np.random.random(math.ceil(total / 2)), dtype=A.dtype)

    device = dpctl.select_default_device()
    with dppy.offload_to_sycl_device(device):
        while total > 1:
            global_size = total // 2
            sum_reduction_kernel[global_size, dppy.DEFAULT_LOCAL_SIZE](
                A, R, global_size
            )
            total = total // 2

    return R[0]
# This test will only work when size is a power of two
N = 2048
assert N != 0 and (N & (N - 1)) == 0  # N must be a power of two

A = np.array(np.random.random(N), dtype=np.float32)
A_copy = A.copy()

actual = sum_reduce(A)
expected = A_copy.sum()

print("Actual:  ", actual)
print("Expected:", expected)
assert abs(expected - actual) < 1e-2
Actual:   1035.3582
Expected: 1035.3582
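The loop above halves the active range on every pass, so a power-of-two input of size N needs log2(N) kernel launches. A host-side emulation of the same pairwise tree-reduction scheme (pure Python, illustrative; sum_reduce_host is a made-up name):

```python
def sum_reduce_host(A):
    """Pairwise tree reduction: each pass adds element i and i + stride,
    halving the active size, as the kernel-based loop above does."""
    A = list(A)
    total = len(A)
    passes = 0
    while total > 1:
        stride = total // 2
        for i in range(stride):
            A[i] = A[i] + A[i + stride]
        total = stride
        passes += 1
    return A[0], passes

value, passes = sum_reduce_host([1.0] * 2048)
print(value, passes)  # -> 2048.0 11 (log2(2048) = 11 passes)
```

On the device, each pass is one kernel launch with global_size work-items running in parallel, so the total work is O(N) spread over O(log N) launches.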