Key changes to the CUDA target for Release 0.54 demonstrated in this notebook:

- activemask() and lanemask_lt() are now supported.
- cuda.ffs now gives correct answers (its behavior mirrors that of __ffs() in CUDA C).

Other key changes, not demonstrated in this notebook, include:

- cuda-gdb can now find the source location at the beginning of a kernel (e.g. when breaking on the first instruction of kernels with set cuda break_on_launch application).
- Support for @overload (Stuart Archibald and Graham Markall).
- nanosleep() is added - this can be used to implement exponential backoff when atomic operations have high contention (Graham Markall).
- Support for fastmath=True (Michael Collison).

from numba import cuda, guvectorize, vectorize, void, int32, float64, uint32
import math
import numpy as np
np.random.seed(1)
Getting the best performance out of the CUDA target needs care to be taken with the grid dimensions and also with data transfers. There are two common errors that can hinder getting the best performance out of the CUDA target, especially amongst beginners:

- launching kernels with a grid too small to occupy all of the GPU's streaming multiprocessors (SMs), and
- implicitly transferring host arrays to and from the device on every kernel launch.

Numba 0.54 helps beginners avoid these mistakes by emitting warnings when it can.
Let's define a simple linear algebra kernel:
@cuda.jit
def axpy(r, a, x, y):
    i = cuda.grid(1)
    if i < len(r):
        r[i] = a * x[i] + y[i]
We'll also write a function that creates some input data of a given size, N, and launches the kernel on the data with a grid of sufficient size to cover all the elements of the input data:
def create_and_add_vectors(N):
    # Create input data and transfer to GPU
    x = np.random.random(N)
    y = np.random.random(N)
    d_x = cuda.to_device(x)
    d_y = cuda.to_device(y)
    d_r = cuda.device_array_like(d_x)
    a = 4.5

    # Compute grid dimensions
    # An arbitrary reasonable choice of block size
    block_dim = 256
    # Enough blocks to cover the input
    grid_dim = math.ceil(len(d_x) / block_dim)

    # Launch the kernel
    axpy[grid_dim, block_dim](d_r, a, d_x, d_y)

    # Return the result
    return d_r.copy_to_host()
Now if we use the kernel for vectors of length 32, we should end up with a very small grid, which results in a warning that the grid is too small for efficiency:
create_and_add_vectors(32)
/home/gmarkall/numbadev/numba/numba/cuda/compiler.py:872: NumbaPerformanceWarning: Grid size (1) < 2 * SM count (144) will likely result in GPU under utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
array([2.83448855, 3.77462551, 0.6923918 , 1.67601221, 1.34690244, 1.25014935, 0.85645923, 2.30516759, 2.77431472, 3.17284096, 2.16681931, 3.87276708, 1.02326113, 4.39942199, 1.03183967, 3.31071794, 2.16564695, 2.6441328 , 0.65110818, 1.57029223, 3.81497868, 4.62272375, 1.90198196, 3.16881432, 4.51786879, 4.17245856, 0.97200449, 0.87550488, 0.86657132, 4.36569725, 1.13696091, 2.30916358])
In general, the grid should be at least twice the size of the number of SMs on the device. For example, a Quadro RTX 8000 has 72 SMs, so 144 blocks as a minimum would be required. Note that 2 * SMs is not necessarily ideal - sometimes further oversubscription of 3-4 times is better for hiding latency.
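This sizing rule can be captured in a small helper. The sketch below is illustrative, not Numba API - choose_grid_dim and the oversub parameter are hypothetical names; on a real device the SM count would come from cuda.get_current_device().MULTIPROCESSOR_COUNT.

```python
import math

def choose_grid_dim(n_elements, block_dim, sm_count, oversub=2):
    """Hypothetical helper: pick a grid size that both covers the input
    and keeps the GPU busy by requesting at least `oversub` blocks per SM."""
    blocks_to_cover = math.ceil(n_elements / block_dim)
    min_blocks = oversub * sm_count
    return max(blocks_to_cover, min_blocks)

# 32 elements on a 72-SM device: coverage alone gives 1 block,
# but we ask for at least 2 * 72 = 144 to avoid under-utilization.
print(choose_grid_dim(32, 256, 72))       # 144
print(choose_grid_dim(2 ** 16, 256, 72))  # 256 - coverage already exceeds 144
```

Note that launching more blocks than are needed to cover the input is safe only because the kernel bounds-checks its index, as axpy does above.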
For most GPUs, using input data with 2 ** 16 elements should result in a large enough grid that the device is utilized more effectively:
create_and_add_vectors(2 ** 16)
array([0.79326341, 2.86602196, 3.74815365, ..., 1.23942119, 3.11112051, 4.51046983])
When using @vectorize, the grid is implicitly created based on the size of the input - this is useful because it saves the user thinking about the grid, but it can also hide the fact that a small grid gets created for small data. To catch this, the warning also activates for @vectorize-decorated functions.

Let's redo the example with @vectorize:
@vectorize([float64(float64, float64, float64)], target='cuda')
def axpy_vectorize(a, x, y):
    return a * x + y
def vectorize_add_vectors(N):
    x = np.random.random(N)
    y = np.random.random(N)
    d_x = cuda.to_device(x)
    d_y = cuda.to_device(y)
    d_r = cuda.device_array_like(d_x)
    a = 4.5
    axpy_vectorize(a, d_x, d_y, out=d_r)
    return d_r.copy_to_host()
/home/gmarkall/numbadev/numba/numba/cuda/decorators.py:118: NumbaDeprecationWarning: Eager compilation of device functions is deprecated (this occurs when a signature is provided)
warn(NumbaDeprecationWarning(msg))
(Note that an unrelated warning about deprecated behavior is produced here by the internal implementation of @vectorize - it will be removed in 0.55, and has no effect on the use of @vectorize.)
Now if we again call with small data, a small grid will be produced and warned about:
vectorize_add_vectors(32)
/home/gmarkall/numbadev/numba/numba/cuda/compiler.py:872: NumbaPerformanceWarning: Grid size (1) < 2 * SM count (144) will likely result in GPU under utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
array([0.9076929 , 0.84091496, 4.51281098, 1.60343059, 2.39170646, 3.57143795, 1.9939654 , 0.66315516, 1.01864565, 1.65200497, 1.4520746 , 3.12323577, 3.13834654, 3.20744399, 0.8351249 , 0.61896314, 5.05435786, 2.13761441, 4.8652938 , 3.19708948, 4.05406256, 1.17937198, 0.27872906, 3.49203118, 3.56165816, 1.04358889, 3.96496876, 1.80941232, 0.94801354, 4.8046405 , 1.84938901, 3.86637987])
Now let's try again with larger data. @vectorize parameterizes the grid differently from the above example, so we need to use larger input data than before to ensure the warning doesn't trigger on the larger GPUs:
vectorize_add_vectors(2 ** 18)
array([4.56478069, 1.52891639, 3.09231419, ..., 3.26977228, 1.42833257, 2.86626516])
@guvectorize can similarly create a small grid, so the warning is also enabled for it. Sometimes the combination of the shape of the input data and the shapes of the input variables can obscure the fact that few parallel invocations of the inner function are created, so this warning can be helpful.
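The number of parallel invocations a gufunc makes is the product of the loop dimensions, i.e. the input shape with the core dimensions from the signature stripped off. A small sketch (the helper name is made up) shows why the two calls demonstrated below behave so differently:

```python
from math import prod

def gufunc_invocations(shape, core_ndim):
    """Hypothetical helper: number of inner-function calls a gufunc makes
    for an input of `shape`, given `core_ndim` core dimensions - e.g. the
    single dimension (n) in the signature '(n),(n)->(n)'."""
    loop_shape = shape[:len(shape) - core_ndim]
    return prod(loop_shape)  # prod of an empty tuple is 1

# One invocation for a single large vector - only one thread does the work
print(gufunc_invocations((1024 * 32,), 1))  # 1

# Many invocations on short 2-vectors - plenty of parallelism
print(gufunc_invocations((524288, 2), 1))   # 524288
```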
As an example, let's define a gufunc that computes the elementwise product of pairs of vectors:
@guvectorize(['void(float32[:], float32[:], float32[:])'],
             '(n),(n)->(n)', target='cuda')
def numba_dist_cuda(a, b, dist):
    n = a.shape[0]
    for i in range(n):
        dist[i] = a[i] * b[i]
/home/gmarkall/numbadev/numba/numba/cuda/decorators.py:118: NumbaDeprecationWarning: Eager compilation of device functions is deprecated (this occurs when a signature is provided)
warn(NumbaDeprecationWarning(msg))
If we construct inputs such that the function is invoked once for a single large input, therefore using only one thread, we see the warning:
a = np.random.rand(1024 * 32).astype('float32')
b = np.random.rand(1024 * 32).astype('float32')
dist = np.zeros(a.shape[0]).astype('float32')
numba_dist_cuda(a, b, dist)
/home/gmarkall/numbadev/numba/numba/cuda/compiler.py:872: NumbaPerformanceWarning: Grid size (1) < 2 * SM count (144) will likely result in GPU under utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
array([0.25242987, 0.15926176, 0.12577234, ..., 0.27449533, 0.00076392, 0.13365015], dtype=float32)
Now we construct an input that will result in many invocations of the function: 524288 instances, each operating on a 2-vector. There will be enough blocks to fill most devices, and the warning should be elided:
a = np.random.rand(524288 * 2).astype('float32').reshape((524288, 2))
b = np.random.rand(524288 * 2).astype('float32').reshape((524288, 2))
dist = np.zeros_like(a)
numba_dist_cuda(a, b, dist)
array([[0.27024668, 0.4367813 ], [0.02799038, 0.25452447], [0.03211394, 0.31761077], ..., [0.2193208 , 0.08123732], [0.15066122, 0.23769546], [0.00870252, 0.06313875]], dtype=float32)
Numba makes it easy to pass NumPy arrays on the host to CUDA kernels by automatically transferring them to and from the device before and after a kernel call. Whilst this is convenient in early development, a better strategy for performance is to manually control data movement to and from the device.
To help ensure that a data movement strategy is properly implemented, Numba now warns when implicit copies are made when launching kernels.
For example, let's call our axpy kernel again, on NumPy arrays this time:
N = 2 ** 16
x = np.random.random(N)
y = np.random.random(N)
r = np.empty_like(x)
a = 5
block_dim = 256
grid_dim = math.ceil(len(x) / block_dim)
axpy[grid_dim, block_dim](r, a, x, y)
/home/gmarkall/numbadev/numba/numba/cuda/cudadrv/devicearray.py:790: NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
warn(NumbaPerformanceWarning(msg))
The emitted warning makes it clear that we're inducing some overhead here.
The added functions activemask() and lanemask_lt(), as well as the corrected implementation of ffs(), can be combined to implement warp-aggregated atomic operations, as described in CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics. The following example code is a translation of the examples from that post, where the code filters input by a predicate: it takes each of the input values, and places those that match a predicate in an output array.
First we define a device function, atomicAggInc, where:

- the threads of each warp elect a leader (the lowest-numbered active lane),
- the leader atomically increments a counter of the number of matching elements (ctr), storing the old value in warp_res,
- warp_res is broadcast from the leader to every active lane, and
- each thread returns a unique index into the output (its warp_res + rank value).

@cuda.jit(device=True)
def atomicAggInc(ctr):
    active = cuda.activemask()
    leader = uint32(cuda.ffs(active) - uint32(1))
    change = cuda.popc(active)
    rank = cuda.popc(uint32(active & cuda.lanemask_lt()))
    warp_res = 0
    if rank == 0:
        warp_res = cuda.atomic.add(ctr, 0, change)
    warp_res = cuda.shfl_sync(active, warp_res, leader)
    return warp_res + rank
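The bit manipulation in atomicAggInc can be checked on the CPU with plain Python integers: ffs(mask) - 1 selects the lowest active lane as the leader, and popping the bits of mask & lanemask_lt counts the active lanes below a given lane, giving each active thread a dense rank. A sketch (the helper names are made up for illustration):

```python
def ffs(mask):
    """1-based position of the least significant set bit, as in __ffs()."""
    return (mask & -mask).bit_length()

def lanemask_lt(lane):
    """Mask of all lanes numbered below `lane`."""
    return (1 << lane) - 1

def rank(mask, lane):
    """Dense rank of `lane` among the active lanes in `mask`."""
    return bin(mask & lanemask_lt(lane)).count('1')

# A warp in which only lanes 3, 5 and 30 are active:
mask = (1 << 3) | (1 << 5) | (1 << 30)
print(ffs(mask) - 1)   # 3 - lane 3 is the leader
print(rank(mask, 3))   # 0
print(rank(mask, 5))   # 1
print(rank(mask, 30))  # 2
```

Adding the leader's warp_res to these ranks gives each active thread a distinct slot in the output array.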
Next we can use the warp-aggregated function to implement our filtering function. The filtering function takes values that are greater than zero and stores them in the output array:
@cuda.jit
def filter_k(dst, nres, src):
    i = cuda.grid(1)
    if i >= len(src):
        return

    if src[i] > 0:
        dst[atomicAggInc(nres)] = src[i]
We set up the run by creating some input data and printing a short summary of it:
# Parameters for the run
N = 2 ** 24
zero_factor = 0.25
print(f'Running with {N} elements, of which approximately {zero_factor * 100}%'
      ' are zero\n')
# Seed the RNG for repeatability
np.random.seed(1)
# Create input data
inputs = np.random.random(N)
zeros = np.zeros(N)
factors = np.random.random(N)
values = np.where(factors > zero_factor, inputs, zeros)
# Quick summary of the data
value_count = np.sum(values > 0)
print(f'There are {value_count} nonzeroes in:')
print(values)
Running with 16777216 elements, of which approximately 25.0% are zero There are 12584753 nonzeroes in: [0.417022 0.72032449 0. ... 0.20570723 0.36716537 0.0979951 ]
Next we create outputs for the kernel and copy the values to the device:
d_nres = cuda.to_device(np.zeros(1, dtype=np.uint32))
d_result = cuda.to_device(np.zeros_like(values))
d_values = cuda.to_device(values)
Then we can launch the kernel:
n_threads = 128
# N (2 ** 24) is a multiple of n_threads, so floor division covers all elements
n_blocks = N // n_threads
filter_k[n_blocks, n_threads](d_result, d_nres, d_values)
Then we can summarize the kernel's output:
cuda_n = d_nres.copy_to_host()[0]
print(f'The kernel found {cuda_n} elements, resulting in the array:')
result = d_result.copy_to_host()
print(result)
The kernel found 12584753 elements, resulting in the array: [0.09805002 0.52309677 0.44381494 ... 0. 0. 0. ]
Finally, some sanity checking:
# Did we filter the expected number of values?
np.testing.assert_equal(value_count, cuda_n)

# Are the first cuda_n elements all nonzero?
np.testing.assert_equal(np.ones(value_count, dtype=np.bool_),
                        result[:cuda_n] > 0)

# Were elements after the cuda_nth element left as zero?
np.testing.assert_equal(np.zeros(N - value_count), result[cuda_n:])

print('Sanity checks passed!')
Sanity checks passed!
Since NumPy 1.12, a relaxed strides checking algorithm for determining the contiguity of arrays has been the default (as opposed to strict strides checking). Numba originally implemented the strict variant for checking the contiguity of arrays, which could prevent it from being used in certain cases - for example, using Dask to chunk an array could result in an array that was contiguous but was determined to be disjoint, preventing it from being transferred to the device.
Numba now uses the relaxed algorithm, enabling this use case and other similar ones. The following example provides a demonstration of a case where relaxed strides checking is required - a Dask array is chunked and a CUDA gufunc is invoked on a block of the chunked array:
import dask.array as da
import numpy as np
from numba import guvectorize
x = da.arange(10000, dtype=np.float64).reshape(100, 100).rechunk((1, 10))
f = np.ascontiguousarray(x.blocks[0, 0])
g = np.ascontiguousarray(x.blocks[0, 0])
@guvectorize(
    ["void(float64[:], float64[:], float64[:], float64[:])"],
    "(n),(n),(p)->(p)",
    nopython=True,
    target="cuda",
)
def reduce_func(x, y, _, out) -> None:
    sum_ = 0.0
    for i in range(x.shape[0]):
        sum_ += x[i] + y[i]
    out[:] = sum_

result = reduce_func(f, g, np.empty(1, dtype=x.dtype))
print(result)
[[90.]]
/home/gmarkall/numbadev/numba/numba/cuda/decorators.py:118: NumbaDeprecationWarning: Eager compilation of device functions is deprecated (this occurs when a signature is provided) warn(NumbaDeprecationWarning(msg)) /home/gmarkall/numbadev/numba/numba/cuda/compiler.py:872: NumbaPerformanceWarning: Grid size (1) < 2 * SM count (144) will likely result in GPU under utilization due to low occupancy. warn(NumbaPerformanceWarning(msg))
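The difference between the two algorithms can be seen with plain NumPy, no GPU required: a single-row slice of a larger array inherits the parent's row stride, which strict checking would reject, yet under relaxed checking the array is still flagged as contiguous because the stride of a length-1 dimension is irrelevant:

```python
import numpy as np

parent = np.arange(100, dtype=np.float64).reshape(10, 10)
row = parent[:1]  # shape (1, 10), row stride inherited from the parent

print(row.shape, row.strides)      # (1, 10) (80, 8)
print(row.flags['C_CONTIGUOUS'])   # True under relaxed strides checking
```

Numba now applies the same rule when deciding whether an array can be transferred to the device.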
It is now possible to pass tuples to CUDA ufuncs, in line with NumPy and CPU-targeted Numba ufuncs:
from numba import vectorize, float64
import math
@vectorize([float64(float64)], target='cuda')
def cuda_cos(x):
    return math.cos(x)
print(cuda_cos((1.0, 2.0, 3.0)))
[ 0.54030231 -0.41614684 -0.9899925 ]
/home/gmarkall/numbadev/numba/numba/cuda/decorators.py:118: NumbaDeprecationWarning: Eager compilation of device functions is deprecated (this occurs when a signature is provided) warn(NumbaDeprecationWarning(msg)) /home/gmarkall/numbadev/numba/numba/cuda/compiler.py:872: NumbaPerformanceWarning: Grid size (1) < 2 * SM count (144) will likely result in GPU under utilization due to low occupancy. warn(NumbaPerformanceWarning(msg))
The UUID of a device can now be accessed through its uuid property. Whilst it is not yet possible to select devices by UUID, this can help ensure the intended GPU is in use, which is helpful on a system using Multi-Instance GPU (MIG) or when the CUDA_VISIBLE_DEVICES environment variable is being used to control which devices can be seen by the application.
An example getting the current device's UUID:
ctx = cuda.current_context()
print(ctx.device.uuid)
GPU-e6489c45-5b68-3b03-bab7-0e7c8e809643