$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} \newcommand{\Tr}[0]{^\top} \newcommand{\softmax}[1]{\mathrm{softmax}\left({#1}\right)} $$

CS236781: Deep Learning

Tutorial 11: CUDA Kernels


In this tutorial, we will cover:

  • The CUDA programming model
  • Accelerating numerical python code with numba
  • Implementing CUDA kernels in python
  • Thread synchronization
  • Shared memory
In [1]:
# Setup
%matplotlib inline
import os
import sys
import math
import time
import tqdm
import torch
import numpy as np
import matplotlib.pyplot as plt
In [2]:
plt.rcParams['font.size'] = 20
data_dir = os.path.expanduser('~/.pytorch-datasets')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

if device.type != "cuda":
    raise RuntimeError("This tutorial requires a GPU!")

print(f"Using {device=!s}")
Using device=cuda

The CUDA programming model

CUDA is a parallel programming model and software environment that leverages the computational resources of NVIDIA GPU's for general-purpose numeric computation.

It provides compilers, programming-language extensions, optimized software libraries and developer tools.

  • CUDA defines a programming model and a memory model
  • CUDA programs run 1000's of threads on on 100's of physical cores
  • Defines extensions to C language to write GPU code (But here we'll use Python :)
  • Allows heterogeneous computation:
    • CPU runs sequential operations and invokes GPU
    • GPU runs massively-parallel work
    • Both can run concurrently

Device: The GPU
Host: The machine controlling the GPU

Heterogeneous computing Host-device communication

CUDA Kernels

  • A Kernel is a function that is called from host and executes on device
  • Generally, one kernel executes at a time on the entire device
    • Actually, kernels can be queued into "streams"
    • Kernels from different streams can overlap
  • A Kernel runs on many concurrent threads
  • Each thread executes the same code

Kernel "Geometry"

  • A kernel launches as a 1d or 2d-grid which contains multiple thread blocks.
  • Each block contains multiple threads "arranged" in a 1d, 2d or 3d configuration
  • Threads within a block can synchronize (barrier) and share memory
  • Each thread has a unique id that is mostly used for
    • Selecting in/out data (computing memory access locations)
    • Control-flow decisions

Note that multi-dimensional grids and blocks are just for the convenience of the programmer.

  • Helps implement algorithms for 2d and 3d data
  • Nothing actually changes in the hardware execution
    • however, memory-sharing within blocks can be exploited

How is a kernel implemented and launched?

  • The CUDA C-extensions allow the programmer to define which code is compiled for CPU or GPU.
  • A special syntax (<<< >>>) allows the definition of kernel geometry when launching it.
__global__ void MyKernel() {}      // call from host, execute on GPU
__device__ float MyDeviceFunc() {} // call from GPU, execute on GPU
__host__ int HostFunc() {}         // call from host, execute on host

dim3 dimGrid(100, 50);  // 5000 thread blocks in the grid, in a 2D layout
dim3 dimBlock(4, 8, 8); // 256 threads per block, in a 3D layout
MyKernel <<< dimGrid, dimBlock >>> (...); // Launch kernel

GPU threads

  • Practically zero creation and switching overhead
  • Can launch kernels with thousands of threads, many more than physical cores ("oversubscribed")
    • When a thread is blocked due to memory latency, it's instantly swapped out with another waiting thread
    • Instant thread switching hides memory latency
  • Even very simple kernels can generate performance benefit with massive parallelization
  • Scheduled together in "warps": groups of (usually 32) threads performing the same instruction (SIMT)

Determining Thread IDs

The CUDA runtime provides special variables for determining the geometry of the currently executing kernel:

  • gridDim: Dimensions (size) of the grid, in blocks. Can be 1d or 2d.
  • blockDim: Dimension (size) of each block, in threads. Can be 1d, 2d, or 3d.

The CUDA runtime provides special variables for calculating the unique thread id:

  • blockIdx: Index of current block, within the grid. Can be 1d or 2d.
  • threadIdx: Index of current thread, within the block. Can be 1d, 2d, or 3d.

Example: How can we use the above variables to obtain the unique thread id?

A unique thread id for a 1d kernel geometry can be obtained with
blockIdx.x * blockDim.x + threadIdx.x.

Key idea of CUDA

  • Write a single-threaded program with the thread id as a parameter.
  • Use thread id to select a subset of data to process.
  • Launch many threads, so that together they cover the entire range of input data.
  • Code automatically scales to all available physical processors.


A key feature of CUDA is that a Kernel transparently scales to device with a different number of physical processors.

  • A thread is executed by a single CUDA core.
  • A thread block is executed within a one "streaming multiprocessor" (SM), containing multiple CUDA cores.
  • The entire kernel grid is executed on a device, which contains many SMs.

The hardware automatically schedules thread blocks on any available multiprocessor.

Source code defining kernel "geometry" stays the same regardless of hardware.

For example, the same Kernel configuration can be launched on devices with a different number of multiprocessors:

Memory Hierarchy

Different types of memory are available to device threads.

The most important ones are:


  • Per-thread access
  • On chip $\rightarrow$ extremely fast
  • Persisted until thread terminates

Thread-local memory

  • Stores per-thread local variables that cannot fit in the register memory
  • Located in DRAM $\rightarrow$ slow
  • Persisted until thread terminates

Shared memory

  • Shared between threads in the same thread block
  • Used for collaboration between threads in the same block
  • On chip $\rightarrow$ very fast
  • Persisted until end of block

Global memory

  • Can be accessed by any thread in any thread block
  • Used to copy to/from host
  • Located in DRAM $\rightarrow$ slow
  • Persisted for the life of the application

Heuristics for Kernel sizes

How many blocks?

  • Should occupy every SM $\rightarrow$ At least one block per SM
  • Should have something to run on SM if current block is waiting (e.g. sync) $\rightarrow$ At least two blocks per SM
  • Should scale with same code if we upgrade hardware $\rightarrow$ Many blocks per SM!

How many threads?

  • Many threads $\rightarrow$ hides global memory latency
  • Too many threads $\rightarrow$ exhaust registers and shared memory
  • Should be a multiple of warp size
  • Typical selection: 64 to 512 per block

Implementing CUDA Kernels with numba

What is numba?

Numba is a just-in-time (JIT) function compiler, focused on numerical python. It can be used to accelerate python code by generating efficient, type-specialized machine code.

Numba supports all major OSes and a wide range of hardware (Intel x86/64, NVIDIA CUDA, ARM). It's developed and actively maintained by Anaconda Inc., and considered production ready.

Let's explain the terms we used above:

Just-in-time: Functions are compiled the first time they're called. The compiler therefore knows the argument types.

Bonus: This also allows Numba to be used interactively in a Jupyter notebook :)

Function compiler: Numba compiles Python functions, not entire applications.

Numba does not replace the Python interpreter, it effectively transforms a function into a usually faster function.

Numerical python: Numba supports only a subset of the python language. It works well with numerical types such as int, float, and complex, functions from the math and cmath modules and with numpy arrays.

Type-specialized: Numba speeds up your function by generating a specialized implementation for the specific data types you are using.

First steps with numba on the CPU

In [3]:
import numpy as np
import numba

Let's implement a "Hello World" style example: A trivial function that increments an array by 1.

In [4]:
def inc_cpu(a: np.ndarray):
    for i in range(len(a)):
        a[i] += 1

We use the numba.jit decorator to wrap our code in a numba object that will JIT-compile and cache it when called.

In [5]:
# The inc_cpu variable no longer points to a regular python function, but a callable wrapper.
CPUDispatcher(<function inc_cpu at 0x7f9cdc181550>)

What's the nopython option?

  • If nopython=True, numba will try to compile the entire function so that it can be run completely without the Python interpreter. This is usually what you want.
  • Otherwise, numba will try to compile the entire function, but if there are unsupported operations or types it will try to only extract loops and compile them as separate functions.

Let's create a million-element array and see how fast the code runs without numba.

In [6]:
a = np.zeros((10**6,), dtype=np.float32)
array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)
In [7]:
# Run as regular python code (interpreted)
%timeit inc_cpu.py_func(a)
2 s ± 21.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
array([8., 8., 8., ..., 8., 8., 8.], dtype=float32)

We had to call the function with .py_func to get the original function (before wrapping with the jitter).

Now let's call it though the wrapper to time the compiled version:

In [8]:
# Run as jit-compiled machine code
%timeit inc_cpu(a)
191 µs ± 51.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
array([16., 16., 16., ..., 16., 16., 16.], dtype=float32)

That's about 5 orders of magnitude faster! Not bad for just adding a decorator function...

Now lets also compare this to numpy:

In [9]:
# Run using numpy add(), this is like a + 1 but without allocating output array
%timeit np.add(a, 1, out=a)
169 µs ± 3.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Nice, we get results similar to numpy's optimized C code.

Important note about benchmarking:

The first time we called inc_cpu we paid a overhead price for the compilation. However, the %timeit magic returns the best result from multiple runs, so our results do not show this overhead.

First steps with numba on the GPU

In [10]:
from numba import cuda

# Show GPUs on the machine
Found 1 CUDA devices
id 0    b'GeForce RTX 2080 Ti'                              [SUPPORTED]
                      compute capability: 7.5
                           pci device id: 0
                              pci bus id: 29
	1/1 devices are supported

Let's rewrite our "Hello World" example as a CUDA kernel.

In [11]:
def inc_gpu(a: np.ndarray):
    # Compute unique thread id
    idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    # Notice:
    # 1. No loop: every thread will operate on a single element
    # 2. We assume more threads than array elements
    if idx < a.shape[0]:
        a[idx] += 1

# Look at the type after decoration
<numba.cuda.compiler.Dispatcher at 0x7f9cd1be12b0>

We defined a Kernel. How do we specify it's grid/block geometry?

In [12]:
gridsize = (1024, 1024)
blocksize = (32, 32, 32)

# Index the wrapped function to get a new Kernel, with a configured geometry
inc_gpu[gridsize, blocksize]
<numba.cuda.compiler._KernelConfiguration at 0x7f9cd1c86160>

Now lets invoke this kernel with a specific geometry containing more threads than array elements (over 1M threads!)

In [13]:
blocksize = 256
gridsize = math.ceil(a.shape[0] / blocksize)

print(f'#blocks={gridsize}, #threads per block={blocksize}, #total_threads={blocksize*gridsize}')
#blocks=3907, #threads per block=256, #total_threads=1000192
In [14]:
# Copy data to GPU memory
d_a = cuda.to_device(a)

# Run as a kernel on GPU
# Note that we must synchronize to benchmark properly
%timeit inc_gpu[gridsize, blocksize](d_a); cuda.synchronize()

# Copying data back from device will also synchronize, i.e. wait for kernel to complete
a = d_a.copy_to_host()
d_a=<numba.cuda.cudadrv.devicearray.DeviceNDArray object at 0x7f9cd1be1eb0>

79.4 µs ± 928 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
array([162238., 162238., 162238., ..., 162238., 162238., 162238.],

What's the cuda.synchronize() about? Why do we need it for the benchmark?

Launching a kernel is non-blocking: A CUDA kernel executes concurrently with the host code.

The host can execute multiple kernels which will be serialized (as a "stream") and call synchronize() to block until their completion.

However, note that copying memory to/from the host is a blocking operation. If we do not manually copy, numba will do this for us when the input is not on the device:

In [15]:
# Invoke the kernel on data in HOST memory
# Numba with automatically copy, sync and copy back
%timeit inc_gpu[gridsize, blocksize](a)
2.47 ms ± 62.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Why so slow?

We now included two memory copies in each benchmark iteration! So this is not a correct comparison.

Input size (in)dependence

OK, so we got great performance with our simple CUDA kernel, but there's a major limitation with the above kernel.

Our grid size depends on the input size - we assumed the grid contains a thread for every single input element.

This is inconvenient: we'll need to launch the kernel differently for every input size.

In [16]:
a = np.zeros((10**6,), dtype=np.float32)

# Not enough threads for all elements
inc_gpu[100, 64](a)
array([1., 1., 1., ..., 0., 0., 0.], dtype=float32)

How can we fix this?

Common pattern: Every thread processes multiple elements, spaced apart by a "stride" which jumps over all the threads in the grid.

Let's demonstrate the pattern with a kernel that multiplies two arrays elementwise.

In [17]:
def mult_kernel(a: np.ndarray, b: np.ndarray, out: np.ndarray):
    threads_per_block = cuda.blockDim.x
    num_blocks = cuda.gridDim.x
    thread_idx_in_block = cuda.threadIdx.x
    block_idx = cuda.blockIdx.x
    # Get thread id in the 1d grid, as usual
    thread_idx_unique = thread_idx_in_block + block_idx * threads_per_block
    # Calculate range of elements this thread will process
    start = thread_idx_unique 
    end = len(a)
    stride = threads_per_block * num_blocks # jump over all threads, in case we have more data than threads
    for i in range(start, end, stride):
        out[i] = a[i] * b[i]
In [18]:
d_a = cuda.to_device(np.ones((10**6,), dtype=np.float32) * 2)
d_b = cuda.to_device(np.ones((10**6,), dtype=np.float32) * 3)
In [19]:
d_out = cuda.to_device(np.zeros_like(a))

# Kernel with a thread for each element
%timeit mult_kernel[1024, 1024](d_a, d_b, d_out); cuda.synchronize()

146 µs ± 670 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
array([6., 6., 6., ..., 6., 6., 6.], dtype=float32)
In [20]:
d_out = cuda.to_device(np.zeros_like(a))

# Less threads than elements
%timeit mult_kernel[32, 256](d_a, d_b, d_out); cuda.synchronize()

181 µs ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
array([6., 6., 6., ..., 6., 6., 6.], dtype=float32)

Race Conditions

Like in any multithreaded environment, a race condition occurs when a memory location might be modified by multiple independent threads. For example,

  • Read-After-Write (RAW): One thread is reading a memory location at the same time another thread might be writing to it.
  • Write-After-Write (WAW): Two threads are writing to the same memory location, and only one write will be visible when the kernel is complete.

So far, in our examples, each thread had exclusive responsibility for a unique subsets of output array elements.

But what if different threads need to combine results?

Let's consider an example, where multiple threads need to increment a global counter:

  1. Read the current value of a counter in global memory.
  2. Compute counter + 1.
  3. Write that value back to the counter.

CUDA provides "atomic operations" which will read, modify and update a memory location in one "atomic" operation.

Here are simple kernels incrementing a global counter:

In [21]:
def counter_kernel(global_counter):
    # Race condition
    global_counter[0] += 1
def atomic_counter_kernel(global_counter):
    # Add 1 to offset 0 in global_counter array, as an atomic operation
    cuda.atomic.add(global_counter, 0, 1)

Let's see what happens with multiple threads running these kernels:

In [22]:
n_threads = 1024
blocksize = n_threads // 32
gridsize = n_threads // blocksize
print(f'gridsize={gridsize}, blocksize={blocksize}')
gridsize=32, blocksize=32
In [23]:
counter = cuda.to_device(np.array([0], dtype=np.int32))

# Race condition
counter_kernel[gridsize, blocksize](counter)

print(f'counter={counter.copy_to_host()[0]}, expected={n_threads}')
counter=1, expected=1024
In [24]:
counter = cuda.to_device(np.array([0], dtype=np.int32))

# No race condition
atomic_counter_kernel[gridsize, blocksize](counter)

print(f'counter={counter.copy_to_host()[0]}, expected={n_threads}')
counter=1024, expected=1024

Let's look at a slightly more useful example: Computing a histogram from a an array.

In addition, we'll show two convenience functions provided by numba for calculating the thread id and grid size.

In [25]:
def hist_kernel(x: np.ndarray, xmin: float, xmax: float, histogram_out: np.ndarray):
    # Number of bins determined by histogram elements
    nbins = histogram_out.shape[0]
    bin_width = (xmax - xmin) / nbins
    # Use numba's helper functions for obtaining unique (absolute) thread ids
    # grid(1) provides unique thread id for a 1d grid
    # gridsize(1) provides total size of grid (in threads), for a 1d grid
    start = cuda.grid(1)
    end = x.shape[0]
    stride = cuda.gridsize(1)
    for i in range(start, end, stride):
        bin_number = math.floor((x[i] - xmin)/bin_width)
        if bin_number >= 0 and bin_number < nbins:
            cuda.atomic.add(histogram_out, bin_number, 1)

Let's try it on 1M samples:

In [26]:
# Create normally-distributed input
x = np.random.normal(size=10**6).astype(np.float32)
xmin = np.float32(-4.0)
xmax = np.float32(4.0)

With CUDA:

In [27]:
# Create output array
histogram_out = np.zeros(shape=150, dtype=np.int32)

# Run the kernel
hist_kernel[64,64](x, xmin, xmax, histogram_out)
In [28]:
# Plot, just to see the histogram looks correct
nbins = len(histogram_out)
bin_edges = np.linspace(xmin, xmax, num=nbins, endpoint=False)
bin_width = (xmax - xmin) / nbins
plt.bar(bin_edges, histogram_out, width=bin_width, align='edge');

With numpy:

In [29]:
np.histogram(x, bins=nbins, range=(xmin, xmax))[0]
array([    9,    12,    10,    14,    18,    11,    31,    31,    50,
          37,    63,    68,    89,   107,   128,   157,   154,   174,
         208,   263,   345,   369,   440,   463,   537,   660,   702,
         881,   907,  1069,  1269,  1475,  1642,  1869,  2132,  2338,
        2544,  2798,  3228,  3502,  3913,  4322,  4753,  5251,  5653,
        6113,  6721,  7397,  7930,  8516,  9114,  9685, 10319, 10993,
       11652, 12367, 13017, 13773, 14448, 15185, 15768, 16498, 17290,
       17631, 18032, 18502, 19098, 19539, 19834, 20332, 20900, 21109,
       21076, 21316, 21375, 21175, 21244, 21042, 20729, 20550, 20217,
       20027, 19620, 19205, 18653, 18378, 17942, 17030, 16338, 15971,
       15155, 14353, 13901, 13147, 12292, 11571, 11053, 10402,  9579,
        9043,  8582,  7883,  7325,  6685,  6218,  5589,  5125,  4680,
        4267,  3905,  3659,  3188,  2883,  2625,  2312,  2122,  1743,
        1640,  1361,  1247,  1067,  1039,   821,   710,   642,   567,
         472,   414,   382,   308,   267,   238,   186,   181,   129,
         102,    82,    87,    73,    62,    42,    33,    31,    30,
          22,    22,    14,    13,     8,     7])

Now let's compare timings, this time also measuring with and without memory copy overhead.

In [30]:
%timeit np.histogram(x, bins=nbins, range=(xmin, xmax))[0]

print('\nCUDA, host arrays:')
%timeit hist_kernel[64,64](x, xmin, xmax, histogram_out)

print('\nCUDA, device arrays:')
d_x = cuda.to_device(x)
d_histogram_out = cuda.to_device(histogram_out)
%timeit hist_kernel[64,64](d_x, xmin, xmax, d_histogram_out); cuda.synchronize()
8.08 ms ± 83.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

CUDA, host arrays:
3 ms ± 69.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

CUDA, device arrays:
349 µs ± 689 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Shared Memory

Sometimes it's necessary for threads to cooperate by working on the same data. However, cooperating though the global memory is extremely slow due to memory latencies.

As we saw, CUDA provides fast shared memory only between threads that are in the same block.

No CUDA tutorial can be complete without a matrix multiplication example, so let's start there. First, we'll implement a straightforward matrix multiplication that does not take advantage of shared memory.

In our initial implementation, each thread will read one row of matrix A and one column of matrix B and compute one corresponding element of the output C.

In [31]:
def matmul_kernel(a, b, out):
    # Unique thread id on a 2d-grid
    i, j = cuda.grid(2)           # location of current thread (i, j)
    imax, jmax = cuda.gridsize(2) # total number of threads (I, J)

    # Each thread calculates one output element: out[i,j]
    if i < out.shape[0] and j < out.shape[1]:
        for k in range(b.shape[0]):
            out[i, j] += a[i,k] * b[k,j]

This time, we'll be working with 2D kernel geometry.

In [32]:
# Base matrix dim
N = 256

# Input data 
a = np.ones((N, 2*N), dtype=np.float32) * 2
b = np.ones((2*N, N), dtype=np.float32) * 3
expected_out = np.matmul(a, b)

# Kernel geometry: cover the entire matrix with the grid
blocksize = cuda.get_current_device().WARP_SIZE
gridsize = (2*N + blocksize-1)//blocksize

block_dim = (blocksize, blocksize)
grid_dim = (gridsize, gridsize)

print(f'block_dim={block_dim}, grid_dim={grid_dim}')
print(f'total_threads={(blocksize**2) * (gridsize**2)}')
block_dim=(32, 32), grid_dim=(16, 16)

Let's try our simple kernel:

In [33]:
out = np.zeros((N, N), dtype=np.float32)
matmul_kernel[grid_dim, block_dim](a, b, out)

# Make sure result is correct
assert(np.allclose(out, expected_out))
array([[3072., 3072., 3072., ..., 3072., 3072., 3072.],
       [3072., 3072., 3072., ..., 3072., 3072., 3072.],
       [3072., 3072., 3072., ..., 3072., 3072., 3072.],
       [3072., 3072., 3072., ..., 3072., 3072., 3072.],
       [3072., 3072., 3072., ..., 3072., 3072., 3072.],
       [3072., 3072., 3072., ..., 3072., 3072., 3072.]], dtype=float32)

We used lots of threads and a large grid!

But, why is this implementation still inefficient?

Both A and B will be read many times from the slow global memory:

  • A will be read B.shape[1] times
  • B will be read A.shape[0] times

Now we can move on to a more efficient version which takes take advantage of shared memory to reduce global memory bandwidth.

Recall that shared memory is on-chip memory available on each streaming multiprocessor. It is shared only between threads of the same block (even if other blocks are running on the same SM).

Shared memory is scarce hardware resource, in many cases limited to 48kB per block. It should be used sparingly, as a way to reduce latency of global memory.

We will implement as follows:

  • Each thread block is responsible for computing one square sub-matrix C_sub of the output C, of shape (blocksize,blocksize).
  • Each thread within the block is responsible for computing one element of C_sub.
  • C_sub is the product of two rectangular matrices: A_sub of shape (block_size, A.shape[1]) which has the same row indices as C_sub, and B_sub of shape (B.shape[0], blocksize) which has the same column indices as Csub.
  • These two rectangular matrices are divided into as many square matrices of shape (blocksize, blocksize) as necessary.
  • C_sub is computed as the sum of the products of these square matrices. To compute this product:
    • First we load two corresponding square matrices from global memory to shared memory with one thread loading one element of each matrix.
    • Then each thread computes one element of the product using the shared memory.
    • Each thread accumulates the result of each of these square products into a register and once done writes the result to global memory.
In [34]:
def fast_matmul_kernel(a, b, out):
    # Define arrays in the shared memory
    # These will hold a square block of A_sub and B_sub
    # The size and type of the arrays must be known at compile time
    a_sub = cuda.shared.array(shape=(blocksize, blocksize), dtype=numba.float32)
    b_sub = cuda.shared.array(shape=(blocksize, blocksize), dtype=numba.float32)

    # Global id of current thread in a 2D threadblock: defines output location
    x, y = cuda.grid(2)
    # Bounds check
    if x >= out.shape[0] or y >= out.shape[1]:
    # Index of thread within it's own block
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of (blocksize,)-shaped vectors.
    tmp = 0.
    for i in range(bpg):
        # i is the index of the current inner square block
        # Get row and col within A_sub and B_sub and do bounds check
        b_sub_row = tx + i * blocksize
        a_sub_col = ty + i * blocksize
        if not (b_sub_row < b.shape[0] and a_sub_col < a.shape[1]):
        # Preload one element from A_sub and B_sub into shared memory
        a_sub[tx, ty] = a[x, a_sub_col]
        b_sub[tx, ty] = b[b_sub_row, y]

        # Wait for all threads in current block

        # Compute inner product between vectors, read from the shared memory
        for j in range(blocksize):
            tmp += a_sub[tx, j] * b_sub[j, ty]


    # Write to global memory
    out[x, y] = tmp

What does the cuda.syncthreads() call do?

Why do we need the first call? And why do we need to second?

This call allows us to use a synchronization mechanism called a barrier, between threads within the same threadblock.

A barrier blocks each thread until all threads reach it, at which point all threads become unblocked.

  • The first syncthreads() call is needed in order to wait for the entire a_sub and b_sub matrices to fill, since each thread loads only one element.
  • The second syncthreads() is necessary so that a thread will not advance to the next square sub-block. This will cause it to fetch new data into a_sub and b_sub while another thread might still need the old data.

Let's test our faster matrix multiplication:

In [35]:
out = np.zeros((N, N), dtype=np.float32)
fast_matmul_kernel[grid_dim, block_dim](a, b, out)

# Make sure result is correct
assert(np.allclose(out, expected_out))
array([[3072., 3072., 3072., ..., 3072., 3072., 3072.],
       [3072., 3072., 3072., ..., 3072., 3072., 3072.],
       [3072., 3072., 3072., ..., 3072., 3072., 3072.],
       [3072., 3072., 3072., ..., 3072., 3072., 3072.],
       [3072., 3072., 3072., ..., 3072., 3072., 3072.],
       [3072., 3072., 3072., ..., 3072., 3072., 3072.]], dtype=float32)

Now we can benchmark the performance:

In [36]:
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.to_device(np.zeros((N, N), dtype=np.float32))

print('numpy matmul:')
%timeit np.matmul(a, b, out=out)

print('\nCUDA, naive implementation:')
%timeit matmul_kernel[grid_dim, block_dim](d_a, d_b, d_out); cuda.synchronize()

print('\nCUDA, with shared memory:')
%timeit fast_matmul_kernel[grid_dim, block_dim](d_a, d_b, d_out); cuda.synchronize()
numpy matmul:
496 µs ± 16.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

CUDA, naive implementation:
1.6 ms ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

CUDA, with shared memory:
784 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


  • CUDA provides a very powerful framework for easily writing highly scalable multithreaded code.
  • Once we have the right mental model about how it works, we can leverage the power of GPUs for performing arbitrary computation.
  • Using numba, we can do this directly in Python, and even iterate implementing our GPU code interactively within a jupyter notebook.
  • As a bonus, we learned how to accelerate any numerical python function with numba, and squeeze out extra performance gains even without a GPU.

You should experiment with these tools to speed up your pre-and post training tasks such as data loading and preprocessing, augmentation, statistical analysis of model outputs and so on.


Image credits

Some images in this tutorial were taken and/or adapted from: