# A couple of useful imports
from numba import cuda
import numpy as np
Adding a callback to a stream allows a function on the host to be called when all the items presently enqueued on the stream have completed. This can be useful for notifying the host code that a certain sequence of events has completed.
The CUDA Toolkit Samples include an example of using stream callbacks in a multithreaded application in the 0_Simple/simpleCallback directory; in this notebook, we demonstrate the CUDA Python API only.
First let's define a kernel and a callback function:
# A simple kernel to add two arrays
@cuda.jit
def add_kernel(r, x, y):
    i = cuda.grid(1)
    if i < len(r):
        r[i] = x[i] + y[i]
# Python function for callback
# `data` is the data value set when the callback was added (see below)
def print_callback(stream, status, data):
    print(f'Callback function reached! Data is {data}')
Now we'll create a stream then use it to transfer data and launch a kernel before enqueueing a callback:
# Create a stream
s1 = cuda.stream()
# Transfer all data to the device on the stream
n_elements = 256
x = cuda.to_device(np.random.random(n_elements), stream=s1)
y = cuda.to_device(np.random.random(n_elements), stream=s1)
r = cuda.device_array_like(x, stream=s1)
# Launch the kernel on the stream
add_kernel[1, n_elements, s1](r, x, y)
# Add a callback that will be called on the host once all work enqueued so far
# on the stream (the transfers and the kernel launch) is complete.
# The first parameter is the function to call.
# The second parameter is passed into the `data` (3rd) argument of the callback.
s1.add_callback(print_callback, 'HELLO WORLD')
Callback function reached! Data is HELLO WORLD
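The ordering guarantee described above can be pictured without a GPU. The following is a minimal pure-Python sketch, not Numba's implementation: FakeStream is a hypothetical stand-in that runs enqueued items in FIFO order on a worker thread, so a callback enqueued after two work items only runs once both have finished. The callback signature (stream, status, data) mirrors the one used in this notebook; the status value 0 is an assumption standing for success.

```python
import queue
import threading


class FakeStream:
    """Hypothetical stand-in for a stream: runs enqueued items in FIFO order."""

    def __init__(self):
        self._items = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        # Process items strictly in the order they were enqueued
        while True:
            fn, args = self._items.get()
            fn(*args)
            self._items.task_done()

    def enqueue(self, fn, *args):
        self._items.put((fn, args))

    def add_callback(self, callback, data=None):
        # The callback is just another item in the queue, so it cannot
        # run before the items enqueued ahead of it.
        self._items.put((callback, (self, 0, data)))

    def synchronize(self):
        # Block until every enqueued item has been processed
        self._items.join()


log = []
fake = FakeStream()
fake.enqueue(log.append, "copy")
fake.enqueue(log.append, "kernel")
fake.add_callback(lambda stream, status, data: log.append(f"callback:{data}"),
                  "HELLO WORLD")
fake.synchronize()
print(log)  # ['copy', 'kernel', 'callback:HELLO WORLD']
```

Work enqueued after the callback would run after it, which is why the callback only signals completion of the items that were already on the stream when it was added.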
Callbacks can also be used to notify Event objects. For example:
import threading
# Create a stream and an event that will be notified
s2 = cuda.stream()
callback_event = threading.Event()
# Define a callback function to notify the event
def notify_callback(stream, status, event):
    event.set()
# Add the callback to the stream:
s2.add_callback(notify_callback, callback_event)
# Wait for the event
print("Waiting for event")
ret = callback_event.wait(1.0)
print(f"Waiting returned: {ret}")
Waiting for event
Waiting returned: True
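The value returned by wait tells the host whether the callback fired in time. As a small GPU-free sketch of that behaviour, the snippet below uses threading.Timer as a stand-in for a callback that fires shortly after being enqueued (an assumption for illustration only), and contrasts it with an event that is never set.

```python
import threading

# An event that is set before the timeout expires: wait() returns True
done = threading.Event()
threading.Timer(0.05, done.set).start()  # stand-in for a callback firing soon
in_time = done.wait(1.0)

# An event that nobody sets: wait() gives up after the timeout and returns False
never = threading.Event()
timed_out = never.wait(0.1)

print(in_time, timed_out)  # True False
```

Checking the return value is therefore a simple way to detect that the work on a stream did not finish within the expected time.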
Stream.async_done() returns an awaitable that resolves once all preceding stream operations are complete:
import asyncio
# Define a new stream
s3 = cuda.stream()
# Directly awaiting on async_done works in the notebook as it already
# has a running event loop
await s3.async_done()
An example creating multiple tasks on multiple streams and gathering:
async def async_cuda_fn(value_in):
    stream = cuda.stream()
    h_src, h_dst = cuda.pinned_array(8), cuda.pinned_array(8)
    h_src[:] = value_in
    d_ary = cuda.to_device(h_src, stream=stream)
    d_ary.copy_to_host(h_dst, stream=stream)
    await stream.async_done()
    return h_dst.mean()
values_in = [1, 2, 3, 4]
tasks = [asyncio.create_task(async_cuda_fn(v)) for v in values_in]
values_out = await asyncio.gather(*tasks)
values_out
[1.0, 2.0, 3.0, 4.0]
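One plausible way to connect a stream callback to an awaitable is to resolve an asyncio future from the callback. The sketch below illustrates that idea with the standard library only and should not be read as Numba's actual implementation: fake_add_callback and fake_async_done are hypothetical names, and a timer thread stands in for the stream. Because the callback runs on a different thread, the result is handed to the event loop with call_soon_threadsafe.

```python
import asyncio
import threading


def fake_add_callback(callback, data, delay):
    """Hypothetical stand-in for a stream callback: fires on another thread."""
    threading.Timer(delay, callback, args=(None, 0, data)).start()


def fake_async_done(delay):
    """Return a future resolved once the stand-in callback has fired."""
    loop = asyncio.get_running_loop()
    future = loop.create_future()

    def resolve(status):
        # Runs on the event loop thread
        if not future.done():
            future.set_result(status)

    def callback(stream, status, fut):
        # Runs on the timer thread, so hand over to the loop thread safely
        loop.call_soon_threadsafe(resolve, status)

    fake_add_callback(callback, future, delay)
    return future


async def work(value, delay):
    await fake_async_done(delay)
    return float(value)


async def main():
    values = [1, 2, 3, 4]
    delays = [0.04, 0.03, 0.02, 0.01]  # later tasks finish first
    return await asyncio.gather(*(work(v, d) for v, d in zip(values, delays)))


results = asyncio.run(main())
print(results)  # [1.0, 2.0, 3.0, 4.0]
```

Note that gather returns results in the order the awaitables were passed in, not in completion order, which is why the output order matches the input order even though the last task finishes first. Outside a notebook there is no running event loop, so the sketch uses asyncio.run instead of a top-level await.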
Let's define a simple kernel, and compile it eagerly:
@cuda.jit('void(float32[::1], float32[::1], float32[::1])')
def add(r, x, y):
    i = cuda.grid(1)
    if i < len(r):
        r[i] = x[i] + y[i]
Now we can print the SASS code:
print(add.inspect_sass())
.headerflags @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM75 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM75)" .elftype @"ET_EXEC" //--------------------- .debug_frame -------------------------- .section .debug_frame,"",@progbits .debug_frame: /*0000*/ .byte 0xff, 0xff, 0xff, 0xff, 0x28, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff /*0010*/ .byte 0xff, 0xff, 0xff, 0xff, 0x03, 0x00, 0x04, 0x7c, 0xff, 0xff, 0xff, 0xff, 0x0f, 0x0c, 0x81, 0x80 /*0020*/ .byte 0x80, 0x28, 0x00, 0x08, 0xff, 0x81, 0x80, 0x28, 0x08, 0x81, 0x80, 0x80, 0x28, 0x00, 0x00, 0x00 /*0030*/ .byte 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0x30, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 /*0040*/ .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 /*0048*/ .dword _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE /*0050*/ .byte 0x70, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x04, 0x02, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00 /*0060*/ .byte 0x00, 0x00, 0x0c, 0x81, 0x80, 0x80, 0x28, 0x00, 0x04, 0x7a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 //--------------------- .nv.info -------------------------- .section .nv.info,"",@"SHT_CUDA_INFO" .align 4 //----- nvinfo : EIATTR_FRAME_SIZE .align 4 /*0000*/ .byte 0x04, 0x11 /*0002*/ .short (.L_11 - .L_10) .align 4 .L_10: /*0004*/ .word index@(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE) /*0008*/ .word 0x00000000 //----- nvinfo : EIATTR_REGCOUNT .align 4 .L_11: /*000c*/ .byte 0x04, 0x2f /*000e*/ .short (.L_13 - .L_12) .align 4 .L_12: /*0010*/ .word index@(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE) /*0014*/ .word 0x0000000d //----- nvinfo : EIATTR_MIN_STACK_SIZE .align 4 .L_13: /*0018*/ .byte 0x04, 0x12 /*001a*/ .short (.L_15 - .L_14) .align 4 .L_14: /*001c*/ .word 
index@(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE) /*0020*/ .word 0x00000000 .L_15: //--------------------- .nv.info._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE -------------------------- .section .nv.info._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,"",@"SHT_CUDA_INFO" .align 4 //----- nvinfo : EIATTR_EXIT_INSTR_OFFSETS .align 4 /*0000*/ .byte 0x04, 0x1c /*0002*/ .short (.L_17 - .L_16) // ....[0].... .L_16: /*0004*/ .word 0x00000070 // ....[1].... /*0008*/ .word 0x000001f0 //----- nvinfo : EIATTR_MAXREG_COUNT .align 4 .L_17: /*000c*/ .byte 0x03, 0x1b /*000e*/ .short 0x00ff //----- nvinfo : EIATTR_KPARAM_INFO .align 4 /*0010*/ .byte 0x04, 0x17 /*0012*/ .short (.L_19 - .L_18) .L_18: /*0014*/ .word 0x00000000 /*0018*/ .short 0x0000 /*001a*/ .short 0x0000 /*001c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_19: /*0020*/ .byte 0x04, 0x17 /*0022*/ .short (.L_21 - .L_20) .L_20: /*0024*/ .word 0x00000000 /*0028*/ .short 0x0001 /*002a*/ .short 0x0008 /*002c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_21: /*0030*/ .byte 0x04, 0x17 /*0032*/ .short (.L_23 - .L_22) .L_22: /*0034*/ .word 0x00000000 /*0038*/ .short 0x0002 /*003a*/ .short 0x0010 /*003c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_23: /*0040*/ .byte 0x04, 0x17 /*0042*/ .short (.L_25 - .L_24) .L_24: /*0044*/ .word 0x00000000 /*0048*/ .short 0x0003 /*004a*/ .short 0x0018 /*004c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_25: /*0050*/ .byte 0x04, 0x17 /*0052*/ .short (.L_27 - .L_26) .L_26: /*0054*/ .word 0x00000000 /*0058*/ .short 0x0004 /*005a*/ .short 0x0020 /*005c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_27: 
/*0060*/ .byte 0x04, 0x17 /*0062*/ .short (.L_29 - .L_28) .L_28: /*0064*/ .word 0x00000000 /*0068*/ .short 0x0005 /*006a*/ .short 0x0028 /*006c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_29: /*0070*/ .byte 0x04, 0x17 /*0072*/ .short (.L_31 - .L_30) .L_30: /*0074*/ .word 0x00000000 /*0078*/ .short 0x0006 /*007a*/ .short 0x0030 /*007c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_31: /*0080*/ .byte 0x04, 0x17 /*0082*/ .short (.L_33 - .L_32) .L_32: /*0084*/ .word 0x00000000 /*0088*/ .short 0x0007 /*008a*/ .short 0x0038 /*008c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_33: /*0090*/ .byte 0x04, 0x17 /*0092*/ .short (.L_35 - .L_34) .L_34: /*0094*/ .word 0x00000000 /*0098*/ .short 0x0008 /*009a*/ .short 0x0040 /*009c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_35: /*00a0*/ .byte 0x04, 0x17 /*00a2*/ .short (.L_37 - .L_36) .L_36: /*00a4*/ .word 0x00000000 /*00a8*/ .short 0x0009 /*00aa*/ .short 0x0048 /*00ac*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_37: /*00b0*/ .byte 0x04, 0x17 /*00b2*/ .short (.L_39 - .L_38) .L_38: /*00b4*/ .word 0x00000000 /*00b8*/ .short 0x000a /*00ba*/ .short 0x0050 /*00bc*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_39: /*00c0*/ .byte 0x04, 0x17 /*00c2*/ .short (.L_41 - .L_40) .L_40: /*00c4*/ .word 0x00000000 /*00c8*/ .short 0x000b /*00ca*/ .short 0x0058 /*00cc*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_41: /*00d0*/ .byte 0x04, 0x17 /*00d2*/ .short (.L_43 - .L_42) .L_42: /*00d4*/ .word 0x00000000 /*00d8*/ .short 0x000c /*00da*/ .short 0x0060 /*00dc*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_43: /*00e0*/ .byte 0x04, 0x17 /*00e2*/ .short (.L_45 - .L_44) .L_44: /*00e4*/ .word 0x00000000 /*00e8*/ .short 0x000d /*00ea*/ .short 0x0068 /*00ec*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- 
nvinfo : EIATTR_KPARAM_INFO .align 4 .L_45: /*00f0*/ .byte 0x04, 0x17 /*00f2*/ .short (.L_47 - .L_46) .L_46: /*00f4*/ .word 0x00000000 /*00f8*/ .short 0x000e /*00fa*/ .short 0x0070 /*00fc*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_47: /*0100*/ .byte 0x04, 0x17 /*0102*/ .short (.L_49 - .L_48) .L_48: /*0104*/ .word 0x00000000 /*0108*/ .short 0x000f /*010a*/ .short 0x0078 /*010c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_49: /*0110*/ .byte 0x04, 0x17 /*0112*/ .short (.L_51 - .L_50) .L_50: /*0114*/ .word 0x00000000 /*0118*/ .short 0x0010 /*011a*/ .short 0x0080 /*011c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_51: /*0120*/ .byte 0x04, 0x17 /*0122*/ .short (.L_53 - .L_52) .L_52: /*0124*/ .word 0x00000000 /*0128*/ .short 0x0011 /*012a*/ .short 0x0088 /*012c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_53: /*0130*/ .byte 0x04, 0x17 /*0132*/ .short (.L_55 - .L_54) .L_54: /*0134*/ .word 0x00000000 /*0138*/ .short 0x0012 /*013a*/ .short 0x0090 /*013c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_55: /*0140*/ .byte 0x04, 0x17 /*0142*/ .short (.L_57 - .L_56) .L_56: /*0144*/ .word 0x00000000 /*0148*/ .short 0x0013 /*014a*/ .short 0x0098 /*014c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_KPARAM_INFO .align 4 .L_57: /*0150*/ .byte 0x04, 0x17 /*0152*/ .short (.L_59 - .L_58) .L_58: /*0154*/ .word 0x00000000 /*0158*/ .short 0x0014 /*015a*/ .short 0x00a0 /*015c*/ .byte 0x00, 0xf0, 0x21, 0x00 //----- nvinfo : EIATTR_CBANK_PARAM_SIZE .align 4 .L_59: /*0160*/ .byte 0x03, 0x19 /*0162*/ .short 0x00a8 //----- nvinfo : EIATTR_PARAM_CBANK .align 4 /*0164*/ .byte 0x04, 0x0a /*0166*/ .short (.L_61 - .L_60) .align 4 .L_60: /*0168*/ .word index@(.nv.constant0._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE) /*016c*/ .short 0x0160 /*016e*/ 
.short 0x00a8 .L_61: //--------------------- .nv.constant0._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE -------------------------- .section .nv.constant0._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,"a",@progbits .align 4 .nv.constant0._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE: .zero 520 //--------------------- .text._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE -------------------------- .section .text._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,"ax",@progbits .sectioninfo @"SHI_REGISTERS=13" .align 128 .global _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE .type _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,@function .size _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,(.L_62 - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE) .other _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,@"STO_CUDA_ENTRY STV_DEFAULT" _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE: .text._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE: /*0000*/ IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ; /*0010*/ S2R R0, SR_TID.X ; /*0020*/ S2R R3, SR_CTAID.X ; /*0030*/ IMAD R0, R3, c[0x0][0x0], R0 ; /*0040*/ 
ISETP.GE.U32.AND P0, PT, R0, c[0x0][0x188], PT ; /*0050*/ SHF.R.S32.HI R6, RZ, 0x1f, R0 ; /*0060*/ ISETP.GE.AND.EX P0, PT, R6, c[0x0][0x18c], PT, P0 ; /*0070*/ @P0 EXIT ; /*0080*/ ISETP.GE.AND P0, PT, R0, RZ, PT ; /*0090*/ SEL R7, RZ, c[0x0][0x1c0], P0 ; /*00a0*/ SEL R3, RZ, c[0x0][0x1f8], P0 ; /*00b0*/ SEL R9, RZ, c[0x0][0x1c4], P0 ; /*00c0*/ SEL R5, RZ, c[0x0][0x1fc], P0 ; /*00d0*/ IADD3 R7, P1, R0.reuse, R7, RZ ; /*00e0*/ IADD3 R3, P2, R0, R3, RZ ; /*00f0*/ IMAD.X R10, R6.reuse, 0x1, R9, P1 ; /*0100*/ LEA R4, P1, R7, c[0x0][0x1b8], 0x2 ; /*0110*/ IMAD.X R8, R6, 0x1, R5, P2 ; /*0120*/ LEA R2, P2, R3, c[0x0][0x1f0], 0x2 ; /*0130*/ LEA.HI.X R5, R7, c[0x0][0x1bc], R10, 0x2, P1 ; /*0140*/ LEA.HI.X R3, R3, c[0x0][0x1f4], R8, 0x2, P2 ; /*0150*/ LDG.E.SYS R5, [R4] ; /*0160*/ LDG.E.SYS R2, [R2] ; /*0170*/ SEL R7, RZ, c[0x0][0x188], P0 ; /*0180*/ IADD3 R0, P1, R0, R7, RZ ; /*0190*/ SEL R7, RZ, c[0x0][0x18c], P0 ; /*01a0*/ IMAD.X R7, R6, 0x1, R7, P1 ; /*01b0*/ LEA R6, P0, R0, c[0x0][0x180], 0x2 ; /*01c0*/ LEA.HI.X R7, R0, c[0x0][0x184], R7, 0x2, P0 ; /*01d0*/ FADD R9, R2, R5 ; /*01e0*/ STG.E.SYS [R6], R9 ; /*01f0*/ EXIT ; .L_9: /*0200*/ BRA `(.L_9); .L_62: //--------------------- .nv.global -------------------------- .section .nv.global,"aw",@nobits .align 8 .global _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidx__ .type _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidx__,@object .size _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidx__,(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidz__ - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidx__) 
_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidx__: .nv.global: .zero 4 .global _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidz__ .type _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidz__,@object .size _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidz__,(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__errcode__ - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidz__) _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidz__: .zero 4 .global _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__errcode__ .type _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__errcode__,@object .size _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__errcode__,(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidy__ - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__errcode__) _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__errcode__: .zero 4 .global _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidy__ .type 
_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidy__,@object .size _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidy__,(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidx__ - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidy__) _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidy__: .zero 4 .global _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidx__ .type _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidx__,@object .size _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidx__,(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidz__ - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidx__) _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidx__: .zero 4 .global _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidz__ .type _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidz__,@object .size 
_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidz__,(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidy__ - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidz__) _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidz__: .zero 4 .global _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidy__ .type _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidy__,@object .size _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidy__,(.L_4 - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidy__) _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidy__: .zero 4 .L_4: .zero 4 .global _ZN08NumbaEnv8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE .type _ZN08NumbaEnv8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,@object .size _ZN08NumbaEnv8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,(.L_8 - _ZN08NumbaEnv8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE) _ZN08NumbaEnv8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE: .zero 8 .L_8:
Note that SASS inspection uses nvdisasm from the CUDA Binary Utilities, so the full CUDA toolkit must be installed; the conda cudatoolkit package does not include nvdisasm.
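Before calling inspect_sass, it can be convenient to check whether the disassembler is present. The following is a small sketch that assumes nvdisasm is located through the PATH environment variable; the helper name nvdisasm_available is hypothetical and not part of Numba.

```python
import shutil


def nvdisasm_available():
    """Return True if an nvdisasm executable can be found on PATH."""
    return shutil.which("nvdisasm") is not None


if nvdisasm_available():
    print("nvdisasm found; inspect_sass() should be usable")
else:
    print("nvdisasm not found; install the full CUDA toolkit to use inspect_sass()")
```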
nanmin and nanmax
Numba 0.50 changed the semantics of cuda.atomic.min and cuda.atomic.max to more closely match those of Python's built-in min and max, which return their first argument if either operand is a NaN. The old behaviour, treating NaN values as missing data, is now implemented as cuda.atomic.nanmin and cuda.atomic.nanmax (cf. nanmin and nanmax in NumPy).
To demonstrate the difference, we'll define two kernels, one using max and the other using nanmax:
@cuda.jit
def atomic_max_array(val, arr):
    i = cuda.grid(1)
    cuda.atomic.max(arr, i, val[0])
@cuda.jit
def atomic_nanmax_array(val, arr):
    i = cuda.grid(1)
    cuda.atomic.nanmax(arr, i, val[0])
Now we define input arrays - we'll set the value of one element to NaN:
n_elements = 8
x_max = np.arange(n_elements, dtype=np.float32)
x_max[5] = np.nan
# A copy of the input for the other kernel
x_nanmax = x_max.copy()
# Print an array to show the NaN in position:
x_max
array([ 0., 1., 2., 3., 4., nan, 6., 7.], dtype=float32)
Now we'll launch the kernels on our arrays:
# A value to compare the array values with
val = np.ones(1, dtype=np.float32) + 2
# Launch kernels
atomic_max_array[1, n_elements](val, x_max)
atomic_nanmax_array[1, n_elements](val, x_nanmax)
Looking at the output from cuda.atomic.max, we see that it didn't treat the NaN value as missing data, but instead returned it:
x_max
array([ 3., 3., 3., 3., 4., nan, 6., 7.], dtype=float32)
Whereas cuda.atomic.nanmax treated the NaN as missing data and replaced it with the comparison value:
x_nanmax
array([3., 3., 3., 3., 4., 3., 6., 7.], dtype=float32)
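The two results can be cross-checked on the host. The sketch below is a pure-Python reference, not device code: it assumes that, for the inputs used here, cuda.atomic.max behaves like the built-in max applied to (old value, new value), and it uses a hypothetical helper nanmax_ref to model the "NaN as missing data" rule.

```python
import math

nan = float("nan")
arr = [0.0, 1.0, 2.0, 3.0, 4.0, nan, 6.0, 7.0]
val = 3.0


def nanmax_ref(old, new):
    """Treat NaN as missing data: return the other operand if one is NaN."""
    if math.isnan(old):
        return new
    if math.isnan(new):
        return old
    return max(old, new)


# Built-in max keeps its first argument when either operand is NaN,
# because every ordering comparison involving NaN is False.
max_result = [max(old, val) for old in arr]
nanmax_result = [nanmax_ref(old, val) for old in arr]

print(max_result)     # [3.0, 3.0, 3.0, 3.0, 4.0, nan, 6.0, 7.0]
print(nanmax_result)  # [3.0, 3.0, 3.0, 3.0, 4.0, 3.0, 6.0, 7.0]
```

Both lists agree with the arrays printed above: the NaN at index 5 survives under the max rule and is replaced by the comparison value under the nanmax rule.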