Numba’s Cooperative Groups support presently provides grid groups and grid synchronization, along with cooperative kernel launches.
Cooperative Groups are supported on Linux, and on Windows for devices in TCC mode.
Cooperative Groups also require the CUDA Device Runtime library, cudadevrt,
to be available - for conda default channel-installed CUDA toolkit packages, it
is only available in versions 10.2 onwards. System-installed toolkits (e.g. from
NVIDIA distribution packages or runfiles) all include cudadevrt.
Using Grid Groups
To get the current grid group, use the cg.this_grid() function:
g = cuda.cg.this_grid()
Synchronizing the grid is done with the sync() method of the grid group:
g.sync()
Unlike the CUDA C/C++ API, a cooperative launch is invoked using the same syntax as a normal kernel launch - Numba automatically determines whether a cooperative launch is required based on whether a grid group is synchronized in the kernel.
The grid size limit for a cooperative launch is more restrictive than for a
normal launch - the grid must be no larger than the maximum number of active
blocks on the device on which it is launched. To get the maximum grid size for a
cooperative launch of a kernel with a given block size and dynamic shared
memory requirement, use the max_cooperative_grid_blocks() method of kernel
overloads:
_Kernel.max_cooperative_grid_blocks(blockdim, dynsmemsize=0)
Calculates the maximum number of blocks that can be launched for this kernel in a cooperative grid in the current context, for the given block and dynamic shared memory sizes.
Parameters:
- blockdim – Block dimensions, either as a scalar for a 1D block, or a tuple for 2D or 3D blocks.
- dynsmemsize – Dynamic shared memory size in bytes.
Returns: The maximum number of blocks in the grid.
This can be used to ensure that the kernel is launched with no more than the
maximum number of blocks. Exceeding the maximum number of blocks for the
cooperative launch will result in a CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE error.
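One way to stay within the limit is to clamp the requested grid size before launching. The sketch below assumes a 1D launch; cooperative_griddim is a hypothetical helper, not part of Numba, and the max_blocks argument is assumed to be the value returned by max_cooperative_grid_blocks() for the compiled kernel.

```python
def cooperative_griddim(n_items, blockdim, max_blocks):
    """Hypothetical helper (not part of Numba): choose a 1D grid size that
    covers n_items where possible, but never exceeds the cooperative limit."""
    if n_items <= 0 or blockdim <= 0 or max_blocks <= 0:
        raise ValueError("n_items, blockdim and max_blocks must be positive")
    # Blocks needed for one thread per item (ceiling division)
    wanted = (n_items + blockdim - 1) // blockdim
    return min(wanted, max_blocks)


# Illustrative numbers only; in practice max_blocks would come from
# overload.max_cooperative_grid_blocks(blockdim).
print(cooperative_griddim(1024, 32, 1152))       # 32: fits within the limit
print(cooperative_griddim(1_000_000, 32, 1152))  # 1152: clamped to the limit
```

When the grid is clamped, there are fewer threads than items, so the kernel has to loop over its share of the data (for example with a grid-stride loop) rather than assume one thread per element.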
Applications and Example
Grid group synchronization can be used to implement a global barrier across all threads in the grid - applications of this include a global reduction to a single value, or looping over rows of a large matrix sequentially using the entire grid to operate on column elements in parallel.
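The global reduction mentioned above might look something like the following sketch, which sums an array and then divides every element by that sum. It is illustrative rather than tuned: the function names, the use of cuda.atomic.add for the accumulation, and the small default grid size are all assumptions made for this example. The GPU function is only defined here, not called, since it needs a device that supports cooperative launches.

```python
def normalize_reference(values):
    """Pure-Python reference: divide every element by the global sum."""
    total = sum(values)
    return [v / total for v in values]


def normalize_on_gpu(values, blockdim=32, griddim=4):
    """Sketch of the same computation using a grid-wide barrier.

    Requires Numba and a CUDA device that supports cooperative launches.
    griddim is kept small on the assumption that it stays below the
    cooperative limit; query the limit to be sure.
    """
    import numpy as np
    from numba import cuda, float64

    @cuda.jit((float64[::1], float64[::1], float64[::1]))
    def normalize(x, total, out):
        g = cuda.cg.this_grid()
        start = cuda.grid(1)
        stride = cuda.gridsize(1)

        # Phase 1: every thread adds its elements to the global total
        for i in range(start, x.shape[0], stride):
            cuda.atomic.add(total, 0, x[i])

        # Global barrier: every thread calls sync(), including threads
        # that had no elements to process in the loop above
        g.sync()

        # Phase 2: the total is now complete and visible to all threads
        for i in range(start, x.shape[0], stride):
            out[i] = x[i] / total[0]

    x = np.asarray(values, dtype=np.float64)
    total = np.zeros(1, dtype=np.float64)  # accumulator must start at zero
    out = np.zeros_like(x)
    normalize[griddim, blockdim](x, total, out)
    return out


print(normalize_reference([1.0, 1.0, 2.0]))  # [0.25, 0.25, 0.5]
```

Note that no thread returns early before g.sync(): bounds are handled by the loop ranges, so every thread in the grid reaches the barrier. On a machine with a suitable GPU, the result of normalize_on_gpu can be compared against normalize_reference.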
In the following example, rows are written sequentially by the grid. Each thread in the grid reads a value from the previous row written by its opposite thread. A grid sync is needed to ensure that threads in the grid don’t run ahead of threads in other blocks, or fail to see updates from their opposite thread.
First we’ll define our kernel:
from numba import cuda, int32
import numpy as np

sig = (int32[:,::1],)

@cuda.jit(sig)
def sequential_rows(M):
    col = cuda.grid(1)
    g = cuda.cg.this_grid()

    rows = M.shape[0]
    cols = M.shape[1]

    for row in range(1, rows):
        opposite = cols - col - 1
        # Each row's elements are one greater than the previous row
        M[row, col] = M[row - 1, opposite] + 1
        # Wait until all threads have written their column element,
        # and that the write is visible to all other threads
        g.sync()
Then create some empty input data and determine the grid and block sizes:
# Empty input data
A = np.zeros((1024, 1024), dtype=np.int32)
# A somewhat arbitrary choice (one warp), but generally smaller block sizes
# allow more blocks to be launched (noting that other limitations on
# occupancy apply such as shared memory size)
blockdim = 32
griddim = A.shape[1] // blockdim
Finally we launch the kernel and print the result:
# Kernel launch - this is implicitly a cooperative launch
sequential_rows[griddim, blockdim](A)

# What do the results look like?
# print(A)
#
# [[   0    0    0 ...    0    0    0]
#  [   1    1    1 ...    1    1    1]
#  [   2    2    2 ...    2    2    2]
#  ...
#  [1021 1021 1021 ... 1021 1021 1021]
#  [1022 1022 1022 ... 1022 1022 1022]
#  [1023 1023 1023 ... 1023 1023 1023]]
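To see why every element of row r ends up equal to r, the update rule can be modelled on the CPU. The sketch below is a pure-Python stand-in for the kernel, with the function name chosen for illustration; because it processes rows strictly in order, it behaves as if a grid sync followed each row.

```python
def sequential_rows_reference(rows, cols):
    """Pure-Python model of the kernel's update rule, one row at a time."""
    M = [[0] * cols for _ in range(rows)]
    for row in range(1, rows):
        for col in range(cols):
            opposite = cols - col - 1
            # Same rule as the kernel: previous row, opposite column, plus one
            M[row][col] = M[row - 1][opposite] + 1
    return M


for line in sequential_rows_reference(4, 4):
    print(line)
# [0, 0, 0, 0]
# [1, 1, 1, 1]
# [2, 2, 2, 2]
# [3, 3, 3, 3]
```

A reference like this can also serve as a check on the GPU result, for instance by comparing A against np.array(sequential_rows_reference(1024, 1024)) with np.array_equal.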
The maximum grid size for sequential_rows can be enquired using:
overload = sequential_rows.overloads[(int32[:,::1],)]
max_blocks = overload.max_cooperative_grid_blocks(blockdim)
print(max_blocks)
# 1152 (e.g. on Quadro RTX 8000 with Numba 0.52.1 and CUDA 11.0)