CUDA Kernel API¶
Kernel declaration¶
The @cuda.jit
decorator is used to create a CUDA dispatcher object that can
be configured and launched:

numba.cuda.
jit
(func_or_sig=None, device=False, inline=False, link=[], debug=None, opt=True, **kws)¶ JIT compile a python function conforming to the CUDA Python specification. If a signature is supplied, then a function is returned that takes a function to compile.
 Parameters
func_or_sig –
A function to JIT compile, or a signature of a function to compile. If a function is supplied, then a
numba.cuda.compiler.AutoJitCUDAKernel
is returned. If a signature is supplied, then a function is returned. The returned function accepts another function, which it will compile and then return anumba.cuda.compiler.AutoJitCUDAKernel
.Note
A kernel cannot have any return value.
device (bool) – Indicates whether this is a device function.
link (list) – A list of files containing PTX source to link with the function
debug – If True, check for exceptions thrown when executing the kernel. Since this degrades performance, this should only be used for debugging purposes. Defaults to False. (The default value can be overridden by setting environment variable
NUMBA_CUDA_DEBUGINFO=1
.)fastmath – If true, enables flushtozero and fusedmultiplyadd, disables precise division and square root. This parameter has no effect on device function, whose fastmath setting depends on the kernel function from which they are called.
max_registers – Request that the kernel is limited to using at most this number of registers per thread. The limit may not be respected if the ABI requires a greater number of registers than that requested. Useful for increasing occupancy.
opt (bool) – Whether to compile from LLVM IR to PTX with optimization enabled. When
True
,opt=3
is passed to NVVM. WhenFalse
,opt=0
is passed to NVVM. Defaults toTrue
.
Dispatcher objects¶
The usual syntax for configuring a Dispatcher with a launch configuration uses subscripting, with the arguments being as in the following:
# func is some function decorated with @cuda.jit
func[griddim, blockdim, stream, sharedmem]
The griddim
and blockdim
arguments specify the size of the grid and
thread blocks, and may be either integers or tuples of length up to 3. The
stream
parameter is an optional stream on which the kernel will be launched,
and the sharedmem
parameter specifies the size of dynamic shared memory in
bytes.
Subscripting the Dispatcher returns a configuration object that can be called with the kernel arguments:
configured = func[griddim, blockdim, stream, sharedmem]
configured(x, y, z)
However, it is more idiomatic to configure and call the kernel within a single statement:
func[griddim, blockdim, stream, sharedmem](x, y, z)
This is similar to launch configuration in CUDA C/C++:
func<<<griddim, blockdim, sharedmem, stream>>>(x, y, z)
Note
The order of stream
and sharedmem
are reversed in Numba
compared to in CUDA C/C++.
Dispatcher objects also provide several utility methods for inspection and creating a specialized instance:

class
numba.cuda.compiler.
Dispatcher
(py_func, sigs, targetoptions)¶ CUDA Dispatcher object. When configured and called, the dispatcher will specialize itself for the given arguments (if no suitable specialized version already exists) & compute capability, and launch on the device associated with the current context.
Dispatcher objects are not to be constructed by the user, but instead are created using the
numba.cuda.jit()
decorator.
property
extensions
¶ A list of objects that must have a prepare_args function. When a specialized kernel is called, each argument will be passed through to the prepare_args (from the last object in this list to the first). The arguments to prepare_args are:
ty the numba type of the argument
val the argument value itself
stream the CUDA stream used for the current call to the kernel
retr a list of zeroarg functions that you may want to append postcall cleanup work to.
The prepare_args function must return a tuple (ty, val), which will be passed in turn to the next rightmost extension. After all the extensions have been called, the resulting (ty, val) will be passed into Numba’s default argument marshalling logic.

forall
(ntasks, tpb=0, stream=0, sharedmem=0)¶ Returns a 1Dconfigured kernel for a given number of tasks.
This assumes that:
the kernel maps the Global Thread ID
cuda.grid(1)
to tasks on a 11 basis.the kernel checks that the Global Thread ID is upperbounded by
ntasks
, and does nothing if it is not.
 Parameters
ntasks – The number of tasks.
tpb – The size of a block. An appropriate value is chosen if this parameter is not supplied.
stream – The stream on which the configured kernel will be launched.
sharedmem – The number of bytes of dynamic shared memory required by the kernel.
 Returns
A configured kernel, ready to launch on a set of arguments.

get_regs_per_thread
(signature=None)¶ Returns the number of registers used by each thread in this kernel for the device in the current context.
 Parameters
signature – The signature of the compiled kernel to get register usage for. This may be omitted for a specialized kernel.
 Returns
The number of registers used by the compiled variant of the kernel for the given signature and current device.

inspect_asm
(signature=None, compute_capability=None)¶ Return this kernel’s PTX assembly code for for the device in the current context.
 Parameters
signature – A tuple of argument types.
compute_capability – Deprecated: accepted but ignored, provided only for backwards compatibility.
 Returns
The PTX code for the given signature, or a dict of PTX codes for all previouslyencountered signatures. If the dispatcher is specialized, the PTX code for the single specialization is returned even if no signature was provided.

inspect_llvm
(signature=None, compute_capability=None)¶ Return the LLVM IR for this kernel.
 Parameters
signature – A tuple of argument types.
compute_capability – Deprecated: accepted but ignored, provided only for backwards compatibility.
 Returns
The LLVM IR for the given signature, or a dict of LLVM IR for all previouslyencountered signatures. If the dispatcher is specialized, the IR for the single specialization is returned even if no signature was provided.

inspect_sass
(signature=None, compute_capability=None)¶ Return this kernel’s SASS assembly code for for the device in the current context.
 Parameters
signature – A tuple of argument types.
compute_capability – Deprecated: accepted but ignored, provided only for backwards compatibility.
 Returns
The SASS code for the given signature, or a dict of SASS codes for all previouslyencountered signatures. If the dispatcher is specialized, the SASS code for the single specialization is returned even if no signature was provided.
SASS for the device in the current context is returned.
Requires nvdisasm to be available on the PATH.

inspect_types
(file=None)¶ Produce a dump of the Python source of this function annotated with the corresponding Numba IR and type information. The dump is written to file, or sys.stdout if file is None.

specialize
(*args)¶ Create a new instance of this dispatcher specialized for the given args.

property
specialized
¶ True if the Dispatcher has been specialized.

property
Intrinsic Attributes and Functions¶
The remainder of the attributes and functions in this section may only be called from within a CUDA Kernel.
Thread Indexing¶

numba.cuda.
threadIdx
¶ The thread indices in the current thread block, accessed through the attributes
x
,y
, andz
. Each index is an integer spanning the range from 0 inclusive to the corresponding value of the attribute innumba.cuda.blockDim
exclusive.

numba.cuda.
blockIdx
¶ The block indices in the grid of thread blocks, accessed through the attributes
x
,y
, andz
. Each index is an integer spanning the range from 0 inclusive to the corresponding value of the attribute innumba.cuda.gridDim
exclusive.

numba.cuda.
blockDim
¶ The shape of a block of threads, as declared when instantiating the kernel. This value is the same for all threads in a given kernel, even if they belong to different blocks (i.e. each block is “full”).

numba.cuda.
gridDim
¶ The shape of the grid of blocks, accessed through the attributes
x
,y
, andz
.

numba.cuda.
laneid
¶ The thread index in the current warp, as an integer spanning the range from 0 inclusive to the
numba.cuda.warpsize
exclusive.

numba.cuda.
warpsize
¶ The size in threads of a warp on the GPU. Currently this is always 32.

numba.cuda.
grid
(ndim)¶ Return the absolute position of the current thread in the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel. If ndim is 1, a single integer is returned. If ndim is 2 or 3, a tuple of the given number of integers is returned.
Computation of the first integer is as follows:
cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
and is similar for the other two indices, but using the
y
andz
attributes.

numba.cuda.
gridsize
(ndim)¶ Return the absolute size (or shape) in threads of the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel.
Computation of the first integer is as follows:
cuda.blockDim.x * cuda.gridDim.x
and is similar for the other two indices, but using the
y
andz
attributes.
Memory Management¶
Creates an array in the local memory space of the CUDA kernel with the given
shape
anddtype
.Returns an array with its content uninitialized.
Note
All threads in the same thread block sees the same array.

numba.cuda.local.
array
(shape, dtype)¶ Creates an array in the local memory space of the CUDA kernel with the given
shape
anddtype
.Returns an array with its content uninitialized.
Note
Each thread sees a unique array.

numba.cuda.const.
array_like
(ary)¶ Copies the
ary
into constant memory space on the CUDA kernel at compile time.Returns an array like the
ary
argument.Note
All threads and blocks see the same array.
Synchronization and Atomic Operations¶

numba.cuda.atomic.
add
(array, idx, value)¶ Perform
array[idx] += value
. Support int32, int64, float32 and float64 only. Theidx
argument can be an integer or a tuple of integer indices for indexing into multiple dimensional arrays. The number of element inidx
must match the number of dimension ofarray
.Returns the value of
array[idx]
before the storing the new value. Behaves like an atomic load.

numba.cuda.atomic.
sub
(array, idx, value)¶ Perform
array[idx] = value
. Supports int32, int64, float32 and float64 only. Theidx
argument can be an integer or a tuple of integer indices for indexing into multidimensional arrays. The number of elements inidx
must match the number of dimensions ofarray
.Returns the value of
array[idx]
before the storing the new value. Behaves like an atomic load.

numba.cuda.atomic.
and_
(array, idx, value)¶ Perform
array[idx] &= value
. Supports int32, uint32, int64, and uint64 only. Theidx
argument can be an integer or a tuple of integer indices for indexing into multidimensional arrays. The number of elements inidx
must match the number of dimensions ofarray
.Returns the value of
array[idx]
before the storing the new value. Behaves like an atomic load.

numba.cuda.atomic.
or_
(array, idx, value)¶ Perform
array[idx] = value
. Supports int32, uint32, int64, and uint64 only. Theidx
argument can be an integer or a tuple of integer indices for indexing into multidimensional arrays. The number of elements inidx
must match the number of dimensions ofarray
.Returns the value of
array[idx]
before the storing the new value. Behaves like an atomic load.

numba.cuda.atomic.
xor
(array, idx, value)¶ Perform
array[idx] ^= value
. Supports int32, uint32, int64, and uint64 only. Theidx
argument can be an integer or a tuple of integer indices for indexing into multidimensional arrays. The number of elements inidx
must match the number of dimensions ofarray
.Returns the value of
array[idx]
before the storing the new value. Behaves like an atomic load.

numba.cuda.atomic.
exch
(array, idx, value)¶ Perform
array[idx] = value
. Supports int32, uint32, int64, and uint64 only. Theidx
argument can be an integer or a tuple of integer indices for indexing into multidimensional arrays. The number of elements inidx
must match the number of dimensions ofarray
.Returns the value of
array[idx]
before the storing the new value. Behaves like an atomic load.

numba.cuda.atomic.
inc
(array, idx, value)¶ Perform
array[idx] = (0 if array[idx] >= value else array[idx] + 1)
. Supports uint32, and uint64 only. Theidx
argument can be an integer or a tuple of integer indices for indexing into multidimensional arrays. The number of elements inidx
must match the number of dimensions ofarray
.Returns the value of
array[idx]
before the storing the new value. Behaves like an atomic load.

numba.cuda.atomic.
dec
(array, idx, value)¶ Perform
array[idx] = (value if (array[idx] == 0) or (array[idx] > value) else array[idx]  1)
. Supports uint32, and uint64 only. Theidx
argument can be an integer or a tuple of integer indices for indexing into multidimensional arrays. The number of elements inidx
must match the number of dimensions ofarray
.Returns the value of
array[idx]
before the storing the new value. Behaves like an atomic load.

numba.cuda.atomic.
max
(array, idx, value)¶ Perform
array[idx] = max(array[idx], value)
. Support int32, int64, float32 and float64 only. Theidx
argument can be an integer or a tuple of integer indices for indexing into multiple dimensional arrays. The number of element inidx
must match the number of dimension ofarray
.Returns the value of
array[idx]
before the storing the new value. Behaves like an atomic load.

numba.cuda.
syncthreads
()¶ Synchronize all threads in the same thread block. This function implements the same pattern as barriers in traditional multithreaded programming: this function waits until all threads in the block call it, at which point it returns control to all its callers.

numba.cuda.
syncthreads_count
(predicate)¶ An extension to
numba.cuda.syncthreads
where the return value is a count of the threads wherepredicate
is true.

numba.cuda.
syncthreads_and
(predicate)¶ An extension to
numba.cuda.syncthreads
where 1 is returned ifpredicate
is true for all threads or 0 otherwise.

numba.cuda.
syncthreads_or
(predicate)¶ An extension to
numba.cuda.syncthreads
where 1 is returned ifpredicate
is true for any thread or 0 otherwise.Warning
All syncthreads functions must be called by every thread in the threadblock. Falling to do so may result in undefined behavior.
Cooperative Groups¶

numba.cuda.cg.
this_grid
()¶ Get the current grid group.
 Returns
The current grid group
 Return type

class
numba.cuda.cg.
GridGroup
¶ A grid group. Users should not construct a GridGroup directly  instead, get the current grid group using
cg.this_grid()
.
sync
()¶ Synchronize the current grid group.

Memory Fences¶
The memory fences are used to guarantee the effect of memory operations are visible by other threads within the same threadblock, the same GPU device, and the same system (across GPUs on global memory). Memory loads and stores are guaranteed to not move across the memory fences by optimization passes.
Warning
The memory fences are considered to be advanced API and most
usercases should use the thread barrier (e.g. syncthreads()
).

numba.cuda.
threadfence
()¶ A memory fence at device level (within the GPU).

numba.cuda.
threadfence_block
()¶ A memory fence at thread block level.

numba.cuda.
threadfence_system
()¶ A memory fence at system level (across GPUs).
Warp Intrinsics¶
The argument membermask
is a 32 bit integer mask with each bit
corresponding to a thread in the warp, with 1 meaning the thread is in the
subset of threads within the function call. The membermask
must be all 1 if
the GPU compute capability is below 7.x.

numba.cuda.
syncwarp
(membermask)¶ Synchronize a masked subset of the threads in a warp.

numba.cuda.
all_sync
(membermask, predicate)¶ If the
predicate
is true for all threads in the masked warp, then a nonzero value is returned, otherwise 0 is returned.

numba.cuda.
any_sync
(membermask, predicate)¶ If the
predicate
is true for any thread in the masked warp, then a nonzero value is returned, otherwise 0 is returned.

numba.cuda.
eq_sync
(membermask, predicate)¶ If the boolean
predicate
is the same for all threads in the masked warp, then a nonzero value is returned, otherwise 0 is returned.

numba.cuda.
ballot_sync
(membermask, predicate)¶ Returns a mask of all threads in the warp whose
predicate
is true, and are within the given mask.

numba.cuda.
shfl_sync
(membermask, value, src_lane)¶ Shuffles
value
across the masked warp and returns thevalue
fromsrc_lane
. If this is outside the warp, then the givenvalue
is returned.

numba.cuda.
shfl_up_sync
(membermask, value, delta)¶ Shuffles
value
across the masked warp and returns thevalue
fromlaneid  delta
. If this is outside the warp, then the givenvalue
is returned.

numba.cuda.
shfl_down_sync
(membermask, value, delta)¶ Shuffles
value
across the masked warp and returns thevalue
fromlaneid + delta
. If this is outside the warp, then the givenvalue
is returned.

numba.cuda.
shfl_xor_sync
(membermask, value, lane_mask)¶ Shuffles
value
across the masked warp and returns thevalue
fromlaneid ^ lane_mask
.

numba.cuda.
match_any_sync
(membermask, value, lane_mask)¶ Returns a mask of threads that have same
value
as the givenvalue
from within the masked warp.

numba.cuda.
match_all_sync
(membermask, value, lane_mask)¶ Returns a tuple of (mask, pred), where mask is a mask of threads that have same
value
as the givenvalue
from within the masked warp, if they all have the same value, otherwise it is 0. And pred is a boolean of whether or not all threads in the mask warp have the same warp.
Integer Intrinsics¶
A subset of the CUDA Math API’s integer intrinsics are available. For further documentation, including semantics, please refer to the CUDA Toolkit documentation.

numba.cuda.
popc
()¶ Returns the number of set bits in the given value.

numba.cuda.
brev
()¶ Reverses the bit pattern of an integer value, for example 0b10110110 becomes 0b01101101.

numba.cuda.
clz
()¶ Counts the number of leading zeros in a value.

numba.cuda.
ffs
()¶ Find the position of the least significant bit set to 1 in an integer.
Floating Point Intrinsics¶
A subset of the CUDA Math API’s floating point intrinsics are available. For further documentation, including semantics, please refer to the single and double precision parts of the CUDA Toolkit documentation.

numba.cuda.
fma
()¶ Perform the fused multiplyadd operation. Named after the
fma
andfmaf
in the C api, but maps to thefma.rn.f32
andfma.rn.f64
(roundtonearesteven) PTX instructions.

numba.cuda.
cbrt
(x)¶ Perform the cube root operation, x ** (1/3). Named after the functions
cbrt
andcbrtf
in the C api. Supports float32, and float64 arguments only.
Control Flow Instructions¶
A subset of the CUDA’s control flow instructions are directly available as
intrinsics. Avoiding branches is a key way to improve CUDA performance, and
using these intrinsics mean you don’t have to rely on the nvcc
optimizer
identifying and removing branches. For further documentation, including
semantics, please refer to the relevant CUDA Toolkit documentation.

numba.cuda.
selp
()¶ Select between two expressions, depending on the value of the first argument. Similar to LLVM’s
select
instruction.