Numba-dppy is a standalone extension to the Numba JIT compiler that adds SYCL programming capabilities to Numba. Numba-dppy uses dpctl to support SYCL features. Currently Intel’s DPC++ is the only SYCL runtime supported by Numba-dppy.
Numba-dppy provides two ways to express SYCL parallelism:

An automatic offload mode for NumPy data-parallel expressions and Numba parallel loops written with @numba.jit. This approach extends Numba's existing auto-parallelizer to generate SYCL kernels from data-parallel code regions. It requires only minimal changes to existing code: a programmer can try to offload an existing @numba.jit-decorated function to a SYCL device simply by invoking the function inside a dpctl.device_context scope.

An explicit kernel programming mode using the @numba_dppy.kernel decorator. The explicit kernel approach is similar to Numba's other GPU backends, such as numba.cuda. The @numba_dppy.kernel decorator is provided by the numba-dppy package and exposes several advanced SYCL features, such as indexing, synchronization, fences, and atomics. Using the decorator, a relatively low-level SYCL kernel can thus be written directly in Python. This feature is intended for programmers who already have SYCL and GPU programming experience.
import numpy as np
import numba_dppy as dppy # numba-dppy package should be installed for the examples below.
import dpctl
from numba import njit
import math
The automatic offload feature in numba-dppy is triggered when a @numba.jit function is invoked inside a dpctl.device_context scope. The following example demonstrates numba-dppy's automatic offload functionality. Note that the example is identical to a normal Numba parallel example; the only difference is that the function is called inside a dpctl.device_context.
@njit
def f1(a, b):
    c = a + b
    return c

N = 64 * 32
a = np.ones(N, dtype=np.float32)
b = np.ones(N, dtype=np.float32)

# Use the environment variable SYCL_DEVICE_FILTER to change the default device.
# See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter.
device = dpctl.select_default_device()
print("Using device ...")
print(device)

with dpctl.device_context(device):
    c = f1(a, b)

print(c)
Using device ...
<dpctl.SyclDevice [backend_type.level_zero, device_type.gpu, Intel(R) Graphics [0x5917]] at 0x18db438cfb0>
[2. 2. 2. ... 2. 2. 2.]
Controllable fallback behavior during automatic offload
By default, if a section of code cannot be offloaded to the GPU, it is automatically executed on the CPU and a warning is printed. This behavior applies only to JIT functions, i.e. auto-offloading of NumPy calls, array expressions, and prange loops. To disable the fallback and force the code to run on the GPU, set the environment variable NUMBA_DPPY_FALLBACK_OPTION to false (for example, export NUMBA_DPPY_FALLBACK_OPTION=0). In that case the code is not automatically re-executed on the CPU; instead, any offload failure raises an error.
Diagnostic reporting for automatic offload
Export NUMBA_DPPY_OFFLOAD_DIAGNOSTICS=1:

Setting the debug environment variable NUMBA_DPPY_OFFLOAD_DIAGNOSTICS enables emission of parallel and offload diagnostic information based on the parallel transforms that were applied. The level of detail is controlled by the integer value, between 1 and 4, assigned to the environment variable (higher is more detailed). The "Auto-offloading" section of the report shows the device (by device name) to which each kernel was offloaded.
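Both variables must be set in the environment before numba-dppy is imported. A minimal sketch of a launcher script that configures them (pure Python; the specific values chosen here are just examples):

```python
import os

# Set numba-dppy configuration knobs before the package is imported:
# NUMBA_DPPY_FALLBACK_OPTION=0 disables the automatic CPU fallback;
# NUMBA_DPPY_OFFLOAD_DIAGNOSTICS=4 requests the most detailed report.
os.environ["NUMBA_DPPY_FALLBACK_OPTION"] = "0"
os.environ["NUMBA_DPPY_OFFLOAD_DIAGNOSTICS"] = "4"

# import numba_dppy as dppy  # import only after the variables are set

print(os.environ["NUMBA_DPPY_FALLBACK_OPTION"])
print(os.environ["NUMBA_DPPY_OFFLOAD_DIAGNOSTICS"])
```

Setting the variables in the shell (export ...) before launching Python achieves the same effect.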
Writing a SYCL kernel using the @numba_dppy.kernel decorator has a syntax similar to writing OpenCL kernels. The numba-dppy module provides indexing and other functions similar to OpenCL's. Some of the indexing functions supported inside a numba_dppy.kernel are:

numba_dppy.get_global_id: Gets the global ID of the item
numba_dppy.get_local_id: Gets the local ID of the item
numba_dppy.get_local_size: Gets the local work-group size of the device
numba_dppy.get_group_id: Gets the group ID of the item
numba_dppy.get_num_groups: Gets the number of work groups

Refer to https://intelpython.github.io/numba-dppy/latest/user_guides/kernel_programming_guide/index.html for more details.
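These queries are related by OpenCL's NDRange arithmetic: for a 1-D range, a flat global ID decomposes into a group ID and a local ID. A pure-Python sketch of that relationship (the function name below is illustrative, not part of numba-dppy):

```python
def decompose(global_id, local_size):
    """Split a flat global work-item ID into (group_id, local_id),
    mirroring get_group_id / get_local_id for dimension 0."""
    return global_id // local_size, global_id % local_size

# With a local (work-group) size of 8, work-item 19 is item 3 of group 2.
group_id, local_id = decompose(19, 8)
print(group_id, local_id)  # -> 2 3
```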
@dppy.kernel
def sum(a, b, c):
    i = dppy.get_global_id(0)
    c[i] = a[i] + b[i]

a = np.array(np.random.random(20), dtype=np.float32)
b = np.array(np.random.random(20), dtype=np.float32)
c = np.ones_like(a)

device = dpctl.select_default_device()
with dpctl.device_context(device):
    sum[20, dppy.DEFAULT_LOCAL_SIZE](a, b, c)

print(a + b)
print(c)
[1.1647326 0.5044042 1.0928384 1.6194623 0.64363265 0.923868 0.9901773 0.16170211 0.6585165 0.37717268 1.7218891 0.7935294 1.1921285 1.0631248 0.97428465 1.2411709 0.12518258 1.3276634 0.50359565 1.2648091 ]
[1.1647326 0.5044042 1.0928384 1.6194623 0.64363265 0.923868 0.9901773 0.16170211 0.6585165 0.37717268 1.7218891 0.7935294 1.1921285 1.0631248 0.97428465 1.2411709 0.12518258 1.3276634 0.50359565 1.2648091 ]
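Conceptually, the launch sum[20, dppy.DEFAULT_LOCAL_SIZE](a, b, c) runs the kernel body once per work-item, with get_global_id(0) ranging over 0..19. A pure-Python emulation of that execution model (illustrative only; no SYCL device involved, and launch/sum_kernel are made-up names):

```python
def launch(kernel, global_size, *args):
    # Sequentially emulate a 1-D kernel launch: one call per work-item,
    # passing the work-item's global ID explicitly.
    for gid in range(global_size):
        kernel(gid, *args)

def sum_kernel(gid, a, b, c):
    # The body mirrors the @dppy.kernel above, with gid playing the
    # role of dppy.get_global_id(0).
    c[gid] = a[gid] + b[gid]

a = [1.0] * 20
b = [2.0] * 20
c = [0.0] * 20
launch(sum_kernel, 20, a, b, c)
print(c[:3])  # -> [3.0, 3.0, 3.0]
```

On a real device the work-items execute in parallel rather than in a Python loop, which is why kernels index by ID instead of iterating themselves.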
Numba-dppy supports several of the atomic operations available in DPC++.

class numba_dppy.ocl.stubs.atomic
    atomic namespace

    add(ary, idx, val)
        Perform atomic ary[idx] += val. Returns the old value at the index location as if it is loaded atomically.

    sub(ary, idx, val)
        Perform atomic ary[idx] -= val. Returns the old value at the index location as if it is loaded atomically.
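The fetch-and-add semantics (the old value is returned) can be emulated on the host with a lock. This is a sketch of the semantics only, not numba-dppy's implementation:

```python
import threading

_lock = threading.Lock()

def atomic_add(ary, idx, val):
    """Host-side emulation of numba_dppy.atomic.add: atomically perform
    ary[idx] += val and return the value that was there before."""
    with _lock:
        old = ary[idx]
        ary[idx] = old + val
        return old

# 100 concurrent increments of a[0]; with atomicity, none are lost.
a = [0.0]
threads = [threading.Thread(target=atomic_add, args=(a, 0, 1.0)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(a[0])  # -> 100.0
```

Without the lock, concurrent read-modify-write sequences could interleave and lose updates, which is exactly the hazard atomics prevent in the GPU example below.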
The following example demonstrates the use of numba-dppy's atomic.add operation on a SYCL device. dpctl.select_gpu_device is equivalent to sycl::gpu_selector and returns a sycl::device of type GPU. For more information see https://github.com/IntelPython/numba-dppy/blob/0.16.0/numba_dppy/examples/atomic_op.py. As noted in that example, certain environment variables must be set to use the native floating-point atomics; without them, numba-dppy uses an alternative implementation.
@dppy.kernel
def atomic_add(a):
    dppy.atomic.add(a, 0, 1)

global_size = 100
a = np.array([0], dtype=np.float32)

device = dpctl.select_default_device()
with dppy.offload_to_sycl_device(device):
    atomic_add[global_size, dppy.DEFAULT_LOCAL_SIZE](a)

print(a)
print("Done...")
[100.]
Done...

The expected result is 100, because global_size = 100.
OpenCL and SYCL do not directly have a notion of device-only functions, i.e. functions that can be invoked only from a kernel and not from a host function. However, numba-dppy provides a special decorator, numba_dppy.func, specifically for implementing device functions.
@dppy.func
def a_device_function(a):
    """
    A ``func`` is a device callable function that can be invoked from
    ``kernel`` and other ``func`` functions.
    """
    return a + 1

@dppy.func
def another_device_function(a):
    return a_device_function(a)

@dppy.kernel
def a_kernel_function(a, b):
    i = dppy.get_global_id(0)
    b[i] = another_device_function(a[i])

N = 10
a = np.ones(N)
b = np.ones(N)

device = dpctl.select_default_device()
with dppy.offload_to_sycl_device(device):
    a_kernel_function[N, dppy.DEFAULT_LOCAL_SIZE](a, b)

print("Done...")
Done...
This example demonstrates a summation reduction on a one-dimensional array.
In this example, to reduce the array we invoke the kernel multiple times.
@dppy.kernel
def sum_reduction_kernel(A, R, stride):
    i = dppy.get_global_id(0)
    # sum two elements
    R[i] = A[i] + A[i + stride]
    # store the sum to be used in the next iteration
    A[i] = R[i]

def sum_reduce(A):
    """Size of A should be a power of two."""
    total = len(A)
    # at most half the size of A is needed to store the partial sums
    R = np.array(np.random.random(math.ceil(total / 2)), dtype=A.dtype)

    device = dpctl.select_default_device()
    with dppy.offload_to_sycl_device(device):
        while total > 1:
            global_size = total // 2
            sum_reduction_kernel[global_size, dppy.DEFAULT_LOCAL_SIZE](
                A, R, global_size
            )
            total = total // 2

    return R[0]
# This test will only work when size is a power of two
N = 2048
assert N != 0 and (N & (N - 1)) == 0  # N must be a power of two

A = np.array(np.random.random(N), dtype=np.float32)
A_copy = A.copy()

actual = sum_reduce(A)
expected = A_copy.sum()

print("Actual:  ", actual)
print("Expected:", expected)
assert abs(expected - actual) < 1e-2
Actual:   1035.3582
Expected: 1035.3582
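The loop above halves the active range on every pass, so a power-of-two input of size N needs log2(N) kernel launches. A host-side emulation of the same pairwise tree-reduction scheme (pure Python, illustrative; sum_reduce_host is a made-up name):

```python
def sum_reduce_host(A):
    """Pairwise tree reduction: each pass adds element i and i + stride,
    halving the active size, as the kernel-based loop above does."""
    A = list(A)
    total = len(A)
    passes = 0
    while total > 1:
        stride = total // 2
        for i in range(stride):
            A[i] = A[i] + A[i + stride]
        total = stride
        passes += 1
    return A[0], passes

value, passes = sum_reduce_host([1.0] * 2048)
print(value, passes)  # -> 2048.0 11 (log2(2048) = 11 passes)
```

On the device, each pass is one kernel launch with global_size work-items running in parallel, so the total work is O(N) spread over O(log N) launches.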