Supported Python features in CUDA Python

This page lists the Python features supported in CUDA Python. This includes all kernel and device functions compiled with @cuda.jit and other higher-level Numba decorators that target the CUDA GPU.

Language

Execution Model

CUDA Python maps directly to the single-instruction multiple-thread (SIMT) execution model of CUDA. Each instruction is implicitly executed by multiple threads in parallel. With this execution model, array expressions are less useful because we don’t want multiple threads to perform the same task. Instead, we want threads to perform a task in a cooperative fashion.
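
For example, instead of writing an array expression over a whole array, each thread typically computes one element. A minimal sketch (kernel and variable names are illustrative):

from numba import cuda
import numpy as np

@cuda.jit
def vector_add(out, a, b):
    i = cuda.grid(1)          # absolute position of this thread in the grid
    if i < out.size:          # guard threads that fall beyond the array bounds
        out[i] = a[i] + b[i]  # each thread handles exactly one element

a = np.arange(1024, dtype=np.float32)
b = np.ones_like(a)
out = np.zeros_like(a)
vector_add[4, 256](out, a, b)  # launch 4 blocks of 256 threads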

For details, please consult the CUDA Programming Guide.

Floating Point Error Model

By default, CUDA Python kernels execute with the NumPy error model. In this model, division by zero raises no exception and instead produces a result of inf, -inf or nan. This differs from the normal Python error model, in which division by zero raises a ZeroDivisionError.

When debug is enabled (by passing debug=True to the @cuda.jit decorator), the Python error model is used. This allows division-by-zero errors during kernel execution to be identified.
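
For example, a division by zero behaves as follows under the default model (a minimal sketch; names are illustrative):

from numba import cuda
import numpy as np

@cuda.jit
def divide(out, a, b):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a[i] / b[i]  # under the NumPy error model, 1.0 / 0.0 gives inf

out = np.zeros(4)
divide[1, 4](out, np.ones(4), np.zeros(4))
print(out)  # [inf inf inf inf] - no exception is raised

Compiling the same kernel with debug=True would instead surface the division by zero as a ZeroDivisionError.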

Constructs

The following Python constructs are not supported:

  • Exception handling (try .. except, try .. finally)

  • Context management (the with statement)

  • Comprehensions (either list, dict, set or generator comprehensions)

  • Generators (any yield statements)

The raise and assert statements are supported, with the following constraints:

  • They can only be used in kernels, not in device functions.

  • They only have an effect when debug=True is passed to the @cuda.jit decorator. This is similar to the behavior of the assert keyword in CUDA C/C++, which is ignored unless compiling with device debug turned on.
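
For example, a device-side assertion only takes effect when the kernel is compiled in debug mode. A minimal sketch (the kernel name is illustrative; recent Numba versions also expect opt=False to be passed alongside debug=True):

from numba import cuda
import numpy as np

@cuda.jit(debug=True, opt=False)  # without debug=True the assert is ignored
def check_positive(x):
    i = cuda.grid(1)
    if i < x.size:
        assert x[i] > 0, "expected positive values"

check_positive[1, 4](np.array([1.0, 2.0, -3.0, 4.0]))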

Printing of strings, integers, and floats is supported, but printing is an asynchronous operation - in order to ensure that all output is printed after a kernel launch, it is necessary to call numba.cuda.synchronize(). Eliding the call to synchronize is acceptable, but output from a kernel may appear during other later driver operations (e.g. subsequent kernel launches, memory transfers, etc.), or fail to appear before the program execution completes. Up to 32 arguments may be passed to the print function - if more are passed then a format string will be emitted instead and a warning will be produced. This is due to a general limitation in CUDA printing, as outlined in the section on limitations in printing in the CUDA C++ Programming Guide.
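
For example (a minimal sketch):

from numba import cuda

@cuda.jit
def greet():
    print("hello from thread", cuda.grid(1))

greet[1, 4]()
cuda.synchronize()  # ensure all kernel output is flushed before the program continues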

Recursion

Self-recursive device functions are supported, with the constraint that recursive calls must have the same argument types as the initial call to the function. For example, the following form of recursion is supported:

@cuda.jit("int64(int64)", device=True)
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

(the fib function always has an int64 argument), whereas the following is unsupported:

# Called with x := int64, y := float64
@cuda.jit
def type_change_self(x, y):
    if x > 1 and y > 0:
        return x + type_change_self(x - y, y)
    else:
        return y

The outer call to type_change_self provides (int64, float64) arguments, but the inner call uses (float64, float64) arguments (because x - y, an int64 minus a float64, results in a float64 type). Therefore, this function is unsupported.

Mutual recursion between functions (e.g. where a function func1() calls func2() which again calls func1()) is unsupported.
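
For illustration, the following shape of mutual recursion (function names are hypothetical) is rejected:

# Unsupported: func1 and func2 call each other
@cuda.jit(device=True)
def func1(x):
    return x if x <= 0 else func2(x - 1)

@cuda.jit(device=True)
def func2(x):
    return x if x <= 0 else func1(x - 1)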

Note

The call stack in CUDA is typically quite limited in size, so it is easier to overflow it with recursive calls on CUDA devices than it is on CPUs.

Stack overflow will result in an Unspecified Launch Failure (ULF) during kernel execution. In order to identify whether a ULF is due to stack overflow, programs can be run under Compute Sanitizer, which explicitly states when stack overflow has occurred.

Built-in types

Support for the following built-in types is inherited from CPU nopython mode:

  • int

  • float

  • complex

  • bool

  • None

  • tuple

  • Enum, IntEnum

See nopython built-in types.
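
For instance, Enum and IntEnum members can be used inside kernels. A minimal sketch (names are illustrative):

from enum import IntEnum

import numpy as np
from numba import cuda

class Colour(IntEnum):
    RED = 0
    GREEN = 1

@cuda.jit
def mark_red(out, values):
    i = cuda.grid(1)
    if i < values.size:
        # .value gives the underlying integer of the enum member
        out[i] = values[i] == Colour.RED.value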

There is also some very limited support for character sequences (bytes and unicode strings) used in NumPy arrays. Note that this support can only be used with CUDA 11.2 onwards.

Built-in functions

The following built-in functions are supported:

Standard library modules

NumPy support

Due to the CUDA programming model, dynamic memory allocation inside a kernel is inefficient and is often not needed. Numba disallows any memory-allocating features. This disables a large number of NumPy APIs. For best performance, users should write code such that each thread deals with a single element at a time.

Supported NumPy features:

  • accessing ndarray attributes .shape, .strides, .ndim, .size, etc.

  • indexing and slicing work.

  • A subset of ufuncs is supported, but the output array must be passed in as a positional argument (see Calling a NumPy UFunc); a sketch of such a call follows this list. Note that ufuncs execute sequentially in each thread - there is no automatic parallelisation of ufuncs across threads over the elements of an input array.

    The following ufuncs are supported:

    • numpy.sin()

    • numpy.cos()

    • numpy.tan()

    • numpy.arcsin()

    • numpy.arccos()

    • numpy.arctan()

    • numpy.arctan2()

    • numpy.hypot()

    • numpy.sinh()

    • numpy.cosh()

    • numpy.tanh()

    • numpy.arcsinh()

    • numpy.arccosh()

    • numpy.arctanh()

    • numpy.deg2rad()

    • numpy.radians()

    • numpy.rad2deg()

    • numpy.degrees()
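
As referenced in the list above, a hedged sketch of calling a ufunc with an explicit output array (array names are illustrative; the single launching thread computes every element sequentially):

import numpy as np
from numba import cuda

@cuda.jit
def apply_sin(r, x):
    np.sin(x, r)  # the output array r must be passed positionally

x = np.arange(8, dtype=np.float32)
r = np.zeros_like(x)
apply_sin[1, 1](r, x)  # one thread applies the ufunc across the whole array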

Unsupported NumPy features:

  • array creation APIs.

  • array methods.

  • functions that return a new array.

CFFI support

The from_buffer() method of cffi.FFI objects is supported. This is useful for obtaining a pointer that can be passed to external C / C++ / PTX functions (see the CUDA FFI documentation).
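
A hedged sketch of this pattern, assuming a foreign device function sum_reduce implemented in a hypothetical functions.cu (see the CUDA FFI documentation for the full workflow):

from cffi import FFI

import numpy as np
from numba import cuda, types

ffi = FFI()

# Declare the foreign device function; its implementation lives in functions.cu
sig = types.float32(types.CPointer(types.float32), types.int32)
sum_reduce = cuda.declare_device('sum_reduce', sig)

@cuda.jit(link=['functions.cu'])  # link the external source into the kernel
def reduction_caller(result, array):
    array_ptr = ffi.from_buffer(array)  # pointer to the array's data
    result[0] = sum_reduce(array_ptr, len(array))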