Key changes to the CUDA target for Release 0.53 demonstrated in this notebook:
math.log2
and math.remainder
(Guilherme Leobas).Other key changes, not demonstrated in this notebook, include:
from collections import namedtuple
from numba import cuda, void, int32
import math
import numpy as np
The following operations are now implemented, with the signatures cuda.atomic.<op>(ary, idx, val)
. These execute ary[idx] = ary[idx] <op> val
atomically, and return the old value at ary[idx]
for int32
, uint32
, int64
, and uint64
operands:
cuda.atomic.and_(ary, idx, val)
.cuda.atomic.or_(ary, idx, val)
.cuda.atomic.xor(ary, idx, val)
.Note the underscore suffix for the AND and OR operations, which is needed to prevent a collision with the and
and or
keywords.
A quick demo of AND:
@cuda.jit
def and_demo(x, old, val):
old[0] = cuda.atomic.and_(x, 0, val)
x = np.array([4], dtype=np.uint32)
old = np.zeros_like(x)
and_demo[1, 1](x, old, 3)
We expect that 4 & 3 = 0
:
print(x[0])
0
The old value should be 4
:
print(old[0])
4
We can demonstrate OR similarly:
@cuda.jit
def or_demo(x, old, val):
old[0] = cuda.atomic.or_(x, 0, val)
x = np.array([4], dtype=np.uint32)
old = np.zeros_like(x)
or_demo[1, 1](x, old, 3)
Hopefully 4 | 3 = 7
:
print(x[0])
7
And the old value should still be 4:
print(old[0])
4
XOR is left as an exercise for the reader.
Increment and decrement are also supported for uint32
and uint64
operands:
cuda.atomic.inc(ary, idx, val)
, increments ary[idx]
by 1
up to val
, then resets to 0
.cuda.atomic.dec(ary, idx, val)
, decrements ary[idx]
by 1
down to 0
, resetting to val
beyond 0
or if ary[idx]
is greater than val
initially.A simple example using INC:
@cuda.jit
def inc_demo(x, old, val):
old[0] = cuda.atomic.inc(x, 0, val)
First we'll try incrementing 10, with a maximum of 20:
x = np.array([10], dtype=np.uint32)
old = np.zeros_like(x)
inc_demo[1, 1](x, old, 20)
This should yield 11, with an old value of 10:
print(x[0])
print(old[0])
11 10
Now if we increment 10 this time, but with a maximum of 10:
x = np.array([10], dtype=np.uint32)
old = np.zeros_like(x)
inc_demo[1, 1](x, old, 10)
We should see that the counter has reset to 0:
print(x[0])
print(old[0])
0 10
Atomic compare-and-swap is now extended to support 64-bit operands (int64
, uint64
) operands in addition to 32-bit (int32
, uint32
). Atomic exchange, which swaps values atomically with no comparison, is added:
cuda.atomic.compare_and_swap(ary, old, val)
performs ary[0] = val if ary[0] == old
, returning the original ary[0]
value.cuda.atomic.exch(ary, idx, val)
performs ary[idx] = val
, returning the old value of ary[idx]
.A short example exchanging values:
@cuda.jit
def exch_demo(x, old, val):
old[0] = cuda.atomic.exch(x, 0, val)
x = np.array([10], dtype=np.uint32)
old = np.zeros_like(x)
exch_demo[1, 1](x, old, 15)
We should see that x[0]
now contains 15:
print(x[0])
15
And the old value, 10
, was returned by the exch
function:
print(old[0])
10
The math.log2
and math.remainder
functions can now be used in kernels.
@cuda.jit
def log2_remainder_demo(y, x):
y[0] = math.log2(x[0])
y[1] = math.remainder(x[0], 3)
x = np.array([4], dtype=np.float32)
y = np.zeros(2, dtype=np.float32)
log2_remainder_demo[1, 1](y, x)
log2(4)
is 2
:
print(y[0])
2.0
4 remainder 3 is 1:
print(y[1])
1.0
Grid Synchronization provides a way for different blocks to synchronize across the entire grids. It is implemented by instantiating a grid group with this_grid()
and calling its sync()
method.
The following example kernel uses the whole grid to write to rows of a matrix in sequence - each row of the matrix must be completed before any thread can move on to the next row:
@cuda.jit(void(int32[:,::1]))
def sequential_rows(M):
col = cuda.grid(1)
g = cuda.cg.this_grid()
rows = M.shape[0]
cols = M.shape[1]
for row in range(1, rows):
opposite = cols - col - 1
# Each row's elements are one greater than the previous row
M[row, col] = M[row - 1, opposite] + 1
# Wait until all threads have written their column element,
# and that the write is visible to all other threads
g.sync()
We'll create an empty matrix for the kernel to work on, and determine an appropriate block size:
# Empty input data
A = np.zeros((1024, 1024), dtype=np.int32)
# A somewhat arbitrary choice (one warp), but generally smaller block sizes
# allow more blocks to be launched (noting that other limitations on
# occupancy apply such as shared memory size)
blockdim = 32
griddim = A.shape[1] // blockdim
Now we can launch the kernel and print the result:
# Kernel launch - this is implicitly a cooperative launch
sequential_rows[griddim, blockdim](A)
# What do the results look like?
print(A)
[[ 0 0 0 ... 0 0 0] [ 1 1 1 ... 1 1 1] [ 2 2 2 ... 2 2 2] ... [1021 1021 1021 ... 1021 1021 1021] [1022 1022 1022 ... 1022 1022 1022] [1023 1023 1023 ... 1023 1023 1023]]
There is a more strict limit on the grid size for launching a kernel using a Grid Group, as all blocks must be able to be resident on the GPU concurrently - unlike a regular launch, it is not possible to wait for one block to fully execute before another block launches, as this would create the conditions for a deadlock.
The maximum grid size for a cooperative kernel can be enquired with max_cooperative_grid_blocks()
. This varies between GPU models, depending on the number of SMs and their available resources. It also varyies for different overloads, because each overload is compiled for a different set of arguments that affects the generated code.
Here we determine the maximum grid size for our example:
# Grab the first overload - there's only one because we've compiled with one signature
overload = next(iter(sequential_rows.overloads.values()))
max_blocks = overload.max_cooperative_grid_blocks(blockdim)
print(max_blocks)
1152
Numba 0.53 updates the CUDA kernel dispatcher mechanism to bring it more into line with the way the CPU dispatcher works
For bencharking, we can use a small test launching empty kernels with varying numbers of arguments with and without signatures:
print("Eagerly compiled with signatures:")
@cuda.jit('void()')
def sig_kernel_1():
return
@cuda.jit('void(float32[:])')
def sig_kernel_2(arr1):
return
@cuda.jit('void(float32[:],float32[:])')
def sig_kernel_3(arr1,arr2):
return
@cuda.jit('void(float32[:],float32[:],float32[:])')
def sig_kernel_4(arr1,arr2,arr3):
return
@cuda.jit('void(float32[:],float32[:],float32[:],float32[:])')
def sig_kernel_5(arr1,arr2,arr3,arr4):
return
arr = cuda.device_array(10000, dtype=np.float32)
%timeit sig_kernel_1[1, 1]()
%timeit sig_kernel_2[1, 1](arr)
%timeit sig_kernel_3[1, 1](arr,arr)
%timeit sig_kernel_4[1, 1](arr,arr,arr)
%timeit sig_kernel_5[1, 1](arr,arr,arr,arr)
print("Without signatures:")
@cuda.jit
def nosig_kernel_1():
return
@cuda.jit
def nosig_kernel_2(arr1):
return
@cuda.jit
def nosig_kernel_3(arr1,arr2):
return
@cuda.jit
def nosig_kernel_4(arr1,arr2,arr3):
return
@cuda.jit
def nosig_kernel_5(arr1,arr2,arr3,arr4):
return
arr = cuda.device_array(10000, dtype=np.float32)
%timeit nosig_kernel_1[1, 1]()
%timeit nosig_kernel_2[1, 1](arr)
%timeit nosig_kernel_3[1, 1](arr,arr)
%timeit nosig_kernel_4[1, 1](arr,arr,arr)
%timeit nosig_kernel_5[1, 1](arr,arr,arr,arr)
Eagerly compiled with signatures: 20 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 34.2 µs ± 32.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 44.6 µs ± 92.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 53.6 µs ± 795 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 62.4 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) Without signatures: 19.8 µs ± 50.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 35.1 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 44.9 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 54.1 µs ± 70.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 62.4 µs ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Comparing some results from the above benchmark between Numba 0.52 and Numba 0.53 yields the following on a Linux system with a Quadro RTX 8000 and CUDA 11.1 (11.2 was not used for the comparison as it is only supported by Numba 0.53):
Without signatures:
# Args | 0.52 | 0.53 | % Delta |
---|---|---|---|
0 | 26.8 | 18.4 | -31.3% |
1 | 60.0 | 33.8 | -43.7% |
2 | 83.0 | 43.4 | -47.7% |
3 | 104.0 | 52.3 | -49.7% |
4 | 124.0 | 61.0 | -50.8% |
With signatures:
# Args | 0.52 | 0.53 | % Delta |
---|---|---|---|
0 | 18.6 | 18.7 | +0.5% |
1 | 31.1 | 33.4 | +6.7% |
2 | 39.8 | 42.7 | +7.3% |
3 | 47.8 | 51.8 | +8.4% |
4 | 55.2 | 60.2 | +9.1% |
Tuples and namedtuples can now be passed to kernels.
Let's create a kernel and pass a tuple to it.
First we'll define a kernel that extracts values from the heterogeneously-typed tuple x
and stores them in r1
and r2
:
@cuda.jit
def extract_tuple(r1, r2, x):
r1[0] = x[0]
r1[1] = x[1]
r1[2] = x[2]
r2[0] = x[3]
r2[1] = x[4]
r2[2] = x[5]
A tuple of integers and floating point values:
x = (1, 2, 3, 4.5, 5.5, 6.5)
Some space for our results:
r1 = np.zeros(len(x) // 2, dtype=np.int64)
r2 = np.zeros(len(x) // 2, dtype=np.float64)
We can now launch the kernel:
extract_tuple[1, 1](r1, r2, x)
Next we can verify that the values have been extracted from the tuple as expected:
print(r1)
print(r2)
[1 2 3] [4.5 5.5 6.5]
We'll create a named tuple to represent a point:
Point = namedtuple('Point', ('x', 'y'))
We'll create a kernel that extracts values from a Point
, in a similar fashion to the tuple example above:
@cuda.jit
def extract_point(r, p):
r[0] = p.x
r[1] = p.y
x = Point(1, 2)
r = np.zeros(len(x), dtype=np.int64)
extract_point[1, 1](r, x)
Our extracted data:
print(r)
[1 2]
We can nest tuples and named tuples arbitrarily. Using a tuple of a tuple and a scalar:
@cuda.jit
def extract_nested(r, x):
r[0] = len(x)
r[1] = len(x[0])
r[2] = x[0][0]
r[3] = x[0][1]
r[4] = x[0][2]
r[5] = x[1]
x = ((6, 5, 4), 7)
r = np.ones(6, dtype=np.int64)
extract_nested[1, 1](r, x)
Our output contains (len(x), len(x[0]), *x[0], x[1])
:
print(r)
[2 3 6 5 4 7]
Arrays can also be tuple members, allowing us to group together arrays into a single parameter. For example, we might write a kernel that accepts data bundled up in a struct-of-arrays-type (SoA) form:
@cuda.jit
def vector_magnitudes_soa(r, v):
i = cuda.grid(1)
if i >= len(r):
return
r[i] = math.sqrt(v[0][i] ** 2 + v[1][i] ** 2)
Number of elements, and space for the results:
N = 32
r = np.zeros(N)
An SoA vector structure:
np.random.seed(1) # For reproducibility between notebook runs
vx = np.random.rand(N)
vy = np.random.rand(N)
v = (vx, vy)
We can pass v
to the kernel rather than needing to pass vx
and vy
individually:
vector_magnitudes_soa[1, N](r, v)
Our result:
print(r)
[1.04472949 0.89617665 0.69187712 0.43698409 0.70201198 0.83971806 0.18715589 0.82591084 1.06549082 0.92199529 0.50435392 1.04522133 0.22903347 0.98574786 0.90900818 0.73193985 0.50691019 0.57362161 0.14171652 0.70715054 0.82823808 1.00401469 0.58299133 0.6943761 1.04769698 0.90655963 0.59541039 0.70084737 0.19827937 0.97086385 0.70132994 0.59065734]
A new instrinsic, cuda.cbrt
provides an efficient way to compute a cube root for float32
and float64
values. Optimized functions from NVIDIA's libdevice library (__nv_cbrt
and __nv_cbrtf
) underlie its implementation.
For comparison, we'll define a kernel that implements cube root using a standard mathematical form, and another one using the intrinsic:
@cuda.jit
def cube_root(r, x):
for i in range(cuda.grid(1), cuda.gridsize(1), len(r)):
r[i] = x[i] ** (1.0 / 3.0)
@cuda.jit
def intrinsic_cube_root(r, x):
for i in range(cuda.grid(1), cuda.gridsize(1), len(r)):
r[i] = cuda.cbrt(x[i])
Then we'll set up some data to test with:
# Array size and data type
N = 2 ** 16
dtype = np.float32
np.random.seed(1)
x = np.random.rand(N).astype(dtype) * 100
r = np.zeros_like(x)
n_threads = 256
n_blocks = N // n_threads
# Copy data to device so we can time the kernel execution only
d_r_normal = cuda.to_device(r)
d_r_fast = cuda.to_device(r)
d_x = cuda.to_device(x)
Now we'll define functions to time for benchmarking. We need to synchronize with the device after the kernel launch to ensure we capture the execution time of the kernel and not just the time spent by the CPU launching the kernel:
def run_normal():
cube_root[n_blocks, n_threads](d_r_normal, d_x)
cuda.synchronize()
def run_fast():
intrinsic_cube_root[n_blocks, n_threads](d_r_fast, d_x)
cuda.synchronize()
Now let's time both versions:
%timeit run_normal()
%timeit run_fast()
76.2 µs ± 322 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 55.5 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The results agree to six decimal places with float32
data (this is likely to be a reasonable expectation for most use cases):
rtol=1.0e-6
h_r_normal = d_r_normal.copy_to_host()
h_r_fast = d_r_fast.copy_to_host()
np.testing.assert_allclose(h_r_normal, h_r_fast, rtol=rtol)
print(f"Results are in agreement to a relative tolerance of {rtol}.")
Results are in agreement to a relative tolerance of 1e-06.