Memory management

Data transfer

Even though Numba can automatically transfer NumPy arrays to the device, it can only do so conservatively by always transferring device memory back to the host when a kernel finishes. To avoid the unnecessary transfer for read-only arrays, you can use the following APIs to manually control the transfer:

numba.cuda.device_array(shape, dtype=np.float64, strides=None, order='C', stream=0): Allocate an empty device ndarray. Similar to numpy.empty().

numba.cuda.device_array_like(ary, stream=0): Call device_array() with information from the array.

numba.cuda.to_device(obj, stream=0, copy=True, to=None)

Allocate and transfer a NumPy ndarray or structured scalar to the device.

To copy a NumPy array from host to device:

ary = np.arange(10)
d_ary = cuda.to_device(ary)

To enqueue the transfer to a stream:

stream = cuda.stream()
d_ary = cuda.to_device(ary, stream=stream)

The resulting d_ary is a DeviceNDArray.

To copy device->host:

hary = d_ary.copy_to_host()

To copy device->host to an existing array:

ary = np.empty(shape=d_ary.shape, dtype=d_ary.dtype)
d_ary.copy_to_host(ary)

To enqueue the transfer to a stream:

hary = d_ary.copy_to_host(stream=stream)

In addition to device arrays, Numba can consume any object that implements the CUDA Array Interface. These objects can also be converted manually into a Numba device array by creating a view of the GPU buffer with the following APIs:

numba.cuda.as_cuda_array(obj, sync=True)

Create a DeviceNDArray from any object that implements the CUDA Array Interface.

A view of the underlying GPU buffer is created. No copying of the data is done. The resulting DeviceNDArray will acquire a reference from obj.

If sync is True, then the imported stream (if present) will be synchronized.

numba.cuda.is_cuda_array(obj)

Test if the object has defined the __cuda_array_interface__ attribute.

Does not verify the validity of the interface.

Device arrays

Device array references have the following methods. These methods are to be called in host code, not within CUDA-jitted functions.

class numba.cuda.cudadrv.devicearray.DeviceNDArray(shape, strides, dtype, stream=0, gpu_data=None)

An on-GPU array type

copy_to_host(ary=None, stream=0)

Copy self to ary or create a new NumPy ndarray if ary is None.

If a CUDA stream is given, then the transfer will be made asynchronously as part of the given stream. Otherwise, the transfer is synchronous: the function returns after the copy is finished.

Always returns the host array.

Example:

import numpy as np
from numba import cuda

arr = np.arange(1000)
d_arr = cuda.to_device(arr)

my_kernel[100, 100](d_arr)

result_array = d_arr.copy_to_host()

is_c_contiguous(): Return true if the array is C-contiguous.

is_f_contiguous(): Return true if the array is Fortran-contiguous.

ravel(order='C', stream=0): Flattens a contiguous array without changing its contents, similar to numpy.ndarray.ravel(). If the array is not contiguous, raises an exception.

reshape(*newshape, **kws)

Reshape the array without changing its contents, similar to numpy.ndarray.reshape(). Example:

d_arr = d_arr.reshape(20, 50, order='F')

Note

DeviceNDArray defines the CUDA Array Interface.

Pinned memory

numba.cuda.pinned(*arylist): A context manager for temporarily pinning a sequence of host ndarrays.

numba.cuda.pinned_array(shape, dtype=np.float64, strides=None, order='C'): Allocate an ndarray with a buffer that is pinned (pagelocked). Similar to np.empty().

numba.cuda.pinned_array_like(ary): Call pinned_array() with the information from the array.

Mapped memory

numba.cuda.mapped(*arylist, **kws): A context manager for temporarily mapping a sequence of host ndarrays.

numba.cuda.mapped_array(shape, dtype=np.float64, strides=None, order='C', stream=0, portable=False, wc=False)

Allocate a mapped ndarray with a buffer that is pinned and mapped on to the device. Similar to np.empty()

Parameters

portable – a boolean flag to allow the allocated device memory to be usable in multiple devices.
wc – a boolean flag to enable write-combined allocation, which is faster for the host to write and for the device to read, but slower for the host to read and slower for the device to write.

numba.cuda.mapped_array_like(ary, stream=0, portable=False, wc=False): Call mapped_array() with the information from the array.

Managed memory

numba.cuda.managed_array(shape, dtype=np.float64, strides=None, order='C', stream=0, attach_global=True)

Allocate a np.ndarray with a buffer that is managed. Similar to np.empty().

Managed memory is supported on Linux / x86 and PowerPC, and is considered experimental on Windows and Linux / AArch64.

Parameters: attach_global – A flag indicating whether to attach globally. Global attachment implies that the memory is accessible from any stream on any device. If False, attachment is host, and memory is only accessible by devices with Compute Capability 6.0 and later.

Streams

Streams can be passed to functions that accept them (e.g. copies between the host and device) and into kernel launch configurations so that the operations are executed asynchronously.

numba.cuda.stream(): Create a CUDA stream that represents a command queue for the device.

numba.cuda.default_stream(): Get the default CUDA stream. CUDA semantics in general are that the default stream is either the legacy default stream or the per-thread default stream depending on which CUDA APIs are in use. In Numba, the APIs for the legacy default stream are always the ones in use, but an option to use APIs for the per-thread default stream may be provided in future.

numba.cuda.legacy_default_stream(): Get the legacy default CUDA stream.

numba.cuda.per_thread_default_stream(): Get the per-thread default CUDA stream.

numba.cuda.external_stream(ptr)

Create a Numba stream object for a stream allocated outside Numba.

Parameters: ptr (int) – Pointer to the external stream to wrap in a Numba Stream

CUDA streams have the following methods:

class numba.cuda.cudadrv.driver.Stream(context, handle, finalizer, external=False)

auto_synchronize(): A context manager that waits for all commands in this stream to execute and commits any pending memory transfers upon exiting the context.

synchronize(): Wait for all commands in this stream to execute. This will commit any pending memory transfers.

Shared memory and thread synchronization

A limited amount of shared memory can be allocated on the device to speed up access to data, when necessary. That memory will be shared (i.e. both readable and writable) amongst all threads belonging to a given block and has faster access times than regular device memory. It also allows threads to cooperate on a given solution. You can think of it as a manually-managed data cache.

The memory is allocated once for the duration of the kernel, unlike traditional dynamic memory management.

numba.cuda.shared.array(shape, type)

Allocate a shared array of the given shape and type on the device. This function must be called on the device (i.e. from a kernel or device function). shape is either an integer or a tuple of integers representing the array’s dimensions and must be a simple constant expression. A “simple constant expression” includes, but is not limited to:

A literal (e.g. 10)

A local variable whose right-hand side is a literal or a simple constant expression (e.g. shape, where shape is defined earlier in the function as shape = 10)

A global variable that is defined in the jitted function’s globals by the time of compilation (e.g. shape, where shape is defined using any expression at global scope).

The definition must result in a Python int (i.e. not a NumPy scalar or other scalar / integer-like type). type is a Numba type of the elements needing to be stored in the array. The returned array-like object can be read and written to like any normal device array (e.g. through indexing).

A common pattern is to have each thread populate one element in the shared array and then wait for all threads to finish using syncthreads().

numba.cuda.syncthreads(): Synchronize all threads in the same thread block. This function implements the same pattern as barriers in traditional multi-threaded programming: this function waits until all threads in the block call it, at which point it returns control to all its callers.

Dynamic Shared Memory

To use dynamic shared memory in kernel code, declare a shared array of size 0:

@cuda.jit
def kernel_func(x):
   dyn_arr = cuda.shared.array(0, dtype=np.float32)
   ...

and specify the size of dynamic shared memory in bytes during kernel invocation:

kernel_func[32, 32, 0, 128](x)

In the above code the kernel launch is configured with 4 parameters:

kernel_func[grid_dim, block_dim, stream, dyn_shared_mem_size]

Note: all dynamic shared memory arrays alias, so if you want to have multiple dynamic shared arrays, you need to take disjoint views of the arrays. For example, consider:

from numba import cuda
import numpy as np

@cuda.jit
def f():
   f32_arr = cuda.shared.array(0, dtype=np.float32)
   i32_arr = cuda.shared.array(0, dtype=np.int32)
   f32_arr[0] = 3.14
   print(f32_arr[0])
   print(i32_arr[0])

f[1, 1, 0, 4]()
cuda.synchronize()

This allocates 4 bytes of shared memory (large enough for one int32 or one float32) and declares dynamic shared memory arrays of type int32 and of type float32. When f32_arr[0] is set, this also sets the value of i32_arr[0], because they’re pointing at the same memory. So we see as output:

3.140000
1078523331

because 1078523331 is the int32 represented by the bits of the float32 value 3.14.
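The printed integer can be checked on the host without a GPU. This short sketch uses only Python's standard struct module to reinterpret the four bytes of the float32 value 3.14 as a signed 32-bit integer:

```python
import struct

# Pack 3.14 as an IEEE 754 single-precision float, then reinterpret the same
# four bytes as a signed 32-bit integer.
raw = struct.pack("<f", 3.14)
as_int32 = struct.unpack("<i", raw)[0]

print(as_int32)       # 1078523331
print(hex(as_int32))  # 0x4048f5c3
```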

If we take disjoint views of the dynamic shared memory:

from numba import cuda
import numpy as np

@cuda.jit
def f_with_view():
   f32_arr = cuda.shared.array(0, dtype=np.float32)
   i32_arr = cuda.shared.array(0, dtype=np.int32)[1:] # 1 int32 = 4 bytes
   f32_arr[0] = 3.14
   i32_arr[0] = 1
   print(f32_arr[0])
   print(i32_arr[0])

f_with_view[1, 1, 0, 8]()
cuda.synchronize()

This time we declare 8 dynamic shared memory bytes, using the first 4 for a float32 value and the next 4 for an int32 value. Now we can set both the int32 and float32 value without them aliasing:

3.140000
1
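Note that the launch configuration takes a size in bytes, while the slice offset is counted in elements of the viewing dtype. With more than two arrays the bookkeeping is easy to get wrong, so it may help to compute it on the host. The helper below, shared_layout, is a hypothetical name and not part of Numba; it is a sketch that packs segments back to back and rounds each start up to a multiple of the element size.

```python
import numpy as np


def shared_layout(segments):
    """Return ({name: (start_element, count)}, total_bytes) for segments
    packed back to back in one dynamic shared memory buffer.

    segments is a list of (name, dtype, count) tuples.
    """
    offsets = {}
    byte_pos = 0
    for name, dtype, count in segments:
        itemsize = np.dtype(dtype).itemsize
        # Round up so the segment starts on a multiple of its element size.
        byte_pos = -(-byte_pos // itemsize) * itemsize
        # The start is expressed in elements of this segment's dtype, which
        # is the unit used when slicing cuda.shared.array(0, dtype=...).
        offsets[name] = (byte_pos // itemsize, count)
        byte_pos += count * itemsize
    return offsets, byte_pos


offsets, nbytes = shared_layout([("f32", np.float32, 1), ("i32", np.int32, 1)])
print(offsets)  # {'f32': (0, 1), 'i32': (1, 1)}
print(nbytes)   # 8
```

For the example above this reproduces the [1:] slice for the int32 view and the 8 bytes passed in the launch configuration.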

Local memory

Local memory is an area of memory private to each thread. Using local memory helps allocate some scratchpad area when scalar local variables are not enough. The memory is allocated once for the duration of the kernel, unlike traditional dynamic memory management.

numba.cuda.local.array(shape, type)

Allocate a local array of the given shape and type on the device. shape is either an integer or a tuple of integers representing the array’s dimensions and must be a simple constant expression. A “simple constant expression” includes, but is not limited to:

A literal (e.g. 10)

A local variable whose right-hand side is a literal or a simple constant expression (e.g. shape, where shape is defined earlier in the function as shape = 10)

A global variable that is defined in the jitted function’s globals by the time of compilation (e.g. shape, where shape is defined using any expression at global scope).

The definition must result in a Python int (i.e. not a NumPy scalar or other scalar / integer-like type). type is a Numba type of the elements needing to be stored in the array. The array is private to the current thread. An array-like object is returned which can be read and written to like any standard array (e.g. through indexing).

Constant memory

Constant memory is an area of memory that is read-only, cached, and off-chip. It is accessible by all threads and is allocated by the host. One way to create an array in constant memory is:

numba.cuda.const.array_like(arr): Allocate and make accessible an array in constant memory based on array-like arr.

Deallocation Behavior

This section describes the deallocation behaviour of Numba’s internal memory management. If an External Memory Management Plugin is in use (see External Memory Management (EMM) Plugin interface), then deallocation behaviour may differ; you may refer to the documentation for the EMM Plugin to understand its deallocation behaviour.

Deallocation of all CUDA resources is tracked on a per-context basis. When the last reference to a device memory allocation is dropped, the underlying memory is scheduled for deallocation. The deallocation does not occur immediately; it is added to a queue of pending deallocations. This design has two benefits:

Resource deallocation APIs may cause the device to synchronize, breaking any asynchronous execution. Deferring the deallocation avoids that latency in performance-critical code sections.
Some deallocation errors may cause all the remaining deallocations to fail. Continued deallocation errors can cause critical errors at the CUDA driver level. In some cases, this could mean a segmentation fault in the CUDA driver. In the worst case, it could cause the system GUI to freeze, recoverable only with a system reset. When an error occurs during a deallocation, the remaining pending deallocations are cancelled. Any deallocation error is reported. When the process terminates, the CUDA driver is able to release all resources allocated by the terminated process.

The deallocation queue is flushed automatically as soon as any of the following events occurs:

An allocation fails due to an out-of-memory error. The allocation is retried after flushing all deallocations.
The deallocation queue reaches its maximum size, which defaults to 10. Users can override this by setting the environment variable NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT. For example, NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=20 increases the limit to 20.
The maximum accumulated byte size of resources pending deallocation is reached. This defaults to 20% of the device memory capacity. Users can override this by setting the environment variable NUMBA_CUDA_MAX_PENDING_DEALLOCS_RATIO. For example, NUMBA_CUDA_MAX_PENDING_DEALLOCS_RATIO=0.5 sets the limit to 50% of the capacity.
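The count and byte-size triggers listed above can be modelled with a small host-side toy. This is an illustrative sketch only and not Numba's implementation: the class PendingDeallocs and its attributes are hypothetical, the exact comparison used at the thresholds (>= here) is an assumption, and the out-of-memory trigger is not modelled.

```python
class PendingDeallocs:
    """Toy model of a deferred-deallocation queue (not Numba's code)."""

    def __init__(self, device_capacity, max_count=10, max_ratio=0.2):
        self.max_count = max_count
        self.max_bytes = int(device_capacity * max_ratio)
        self.queue = []         # (name, nbytes) entries awaiting release
        self.pending_bytes = 0
        self.flushes = 0

    def add(self, name, nbytes):
        # Dropping the last reference only enqueues the deallocation.
        self.queue.append((name, nbytes))
        self.pending_bytes += nbytes
        # Assumption: flush as soon as either limit is reached.
        if (len(self.queue) >= self.max_count
                or self.pending_bytes >= self.max_bytes):
            self.flush()

    def flush(self):
        # A real implementation would release each resource here.
        self.queue.clear()
        self.pending_bytes = 0
        self.flushes += 1


pending = PendingDeallocs(device_capacity=1000)  # byte limit: 200

for n in range(10):
    pending.add(f"small-{n}", 10)  # the tenth entry hits the count limit
pending.add("big", 250)            # a single entry exceeds the byte limit
pending.add("tail", 10)            # stays queued

print(pending.flushes)        # 2
print(len(pending.queue))     # 1
print(pending.pending_bytes)  # 10
```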

Sometimes, it is desired to defer resource deallocation until a code section ends. Most often, users want to avoid any implicit synchronization due to deallocation. This can be done by using the following context manager:

numba.cuda.defer_cleanup()

Temporarily disable memory deallocation. Use this to prevent resource deallocation breaking asynchronous execution.

For example:

from numba import cuda

with cuda.defer_cleanup():
    # all cleanup is deferred in here
    do_speed_critical_code()
# cleanup can occur here

Note: this context manager can be nested.