External Memory Management (EMM) Plugin interface

The CUDA Array Interface enables sharing of data between different Python libraries that access CUDA devices. However, each library manages its own memory distinctly from the others. For example:

By default, Numba allocates memory on CUDA devices by interacting with the CUDA driver API to call functions such as cuMemAlloc and cuMemFree, which is suitable for many use cases.
The RAPIDS libraries (cuDF, cuML, etc.) use the RAPIDS Memory Manager (RMM) for allocating device memory.
CuPy includes a memory pool implementation for both device and pinned memory.

When multiple CUDA-aware libraries are used together, it may be preferable for Numba to defer to another library for memory management. The EMM Plugin interface facilitates this, by enabling Numba to use another CUDA-aware library for all allocations and deallocations.

An EMM Plugin is used to facilitate the use of an external library for memory management. An EMM Plugin can be a part of an external library, or could be implemented as a separate library.

Overview of External Memory Management

When an EMM Plugin is in use (see Setting the EMM Plugin), Numba will make memory allocations and deallocations through the Plugin. It will never directly call functions such as cuMemAlloc, cuMemFree, etc.

EMM Plugins always take responsibility for the management of device memory. However, not all CUDA-aware libraries also support managing host memory, so a facility for Numba to continue the management of host memory whilst ceding control of device memory to the EMM is provided (see The Host-Only CUDA Memory Manager).

Effects on Deallocation Strategies

Numba’s internal Deallocation Behavior is designed to increase efficiency by deferring deallocations until a significant quantity are pending. It also provides a mechanism for preventing deallocations entirely during critical sections, using the defer_cleanup() context manager.

When an EMM Plugin is in use, the deallocation strategy is implemented by the EMM, and Numba’s internal deallocation mechanism is not used. The EMM Plugin could implement:

A similar strategy to the Numba deallocation behaviour, or
Something more appropriate to the plugin - for example, deallocated memory might immediately be returned to a memory pool.

The defer_cleanup context manager may behave differently with an EMM Plugin - an EMM Plugin should be accompanied by documentation of the behaviour of the defer_cleanup context manager when it is in use. For example, a pool allocator could always immediately return memory to a pool even when the context manager is in use, but could choose not to free empty pools until defer_cleanup is not in use.
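The pool-allocator behaviour described above can be sketched in plain Python. This is a toy model only: ToyPoolAllocator and its methods are hypothetical names, it tracks byte counts rather than real device memory, and it does not use Numba.

```python
from contextlib import contextmanager


class ToyPoolAllocator:
    """Hypothetical pool allocator; tracks byte counts only, no real memory."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.pools = []          # each pool is a dict holding its free byte count
        self._defer_depth = 0    # greater than 0 while defer_cleanup is active

    def allocate(self, size):
        if size > self.pool_size:
            raise ValueError("allocation larger than pool size")
        # Reuse the first pool with enough free space, else create a new one.
        for pool in self.pools:
            if pool["free"] >= size:
                pool["free"] -= size
                return pool, size
        pool = {"free": self.pool_size - size}
        self.pools.append(pool)
        return pool, size

    def deallocate(self, block):
        pool, size = block
        pool["free"] += size     # memory always goes back to the pool at once
        if self._defer_depth == 0:
            self._release_empty_pools()

    def _release_empty_pools(self):
        # Drop pools that are completely free (the "expensive" cleanup step).
        self.pools = [p for p in self.pools if p["free"] < self.pool_size]

    @contextmanager
    def defer_cleanup(self):
        self._defer_depth += 1
        try:
            yield
        finally:
            self._defer_depth -= 1
            if self._defer_depth == 0:
                self._release_empty_pools()
```

With this sketch, a block deallocated inside `with allocator.defer_cleanup():` is returned to its pool straight away, but the empty pool itself is only released once the outermost context manager exits.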

Management of other objects

In addition to memory, Numba manages the allocation and deallocation of events, streams, and modules (a module is a compiled object, which is generated from @cuda.jit-ted functions). The management of events, streams, and modules is unchanged by the use of an EMM Plugin.

Asynchronous allocation and deallocation

The present EMM Plugin interface does not provide support for asynchronous allocation and deallocation. This may be added to a future version of the interface.

Implementing an EMM Plugin

An EMM Plugin is implemented by deriving from BaseCUDAMemoryManager. A summary of considerations for the implementation follows:

Numba instantiates one instance of the EMM Plugin class per context. The context that owns an EMM Plugin object is accessible through self.context, if required.
The EMM Plugin is transparent to any code that uses Numba - all its methods are invoked by Numba, and never need to be called by code that uses Numba.
The allocation methods memalloc, memhostalloc, and mempin should use the underlying library to allocate and/or pin device or host memory, and construct an instance of a memory pointer representing the memory to return to Numba. These methods are always called when the current CUDA context is the context that owns the EMM Plugin instance.
The initialize method is called by Numba prior to the first use of the EMM Plugin object for a context. This method should do anything required to prepare the underlying library for allocations in the current context. This method may be called multiple times, and must not invalidate previous state when it is called.
The reset method is called when all allocations in the context are to be cleaned up. It may be called even prior to initialize, and an EMM Plugin implementation needs to guard against this.
To support inter-process communication (IPC), the get_ipc_handle method should provide an IpcHandle for a given MemoryPointer instance. This method is part of the EMM interface (rather than being handled within Numba) because the base address of the allocation is only known by the underlying library. Closing an IPC handle is handled internally within Numba.
It is optional to provide memory info from the get_memory_info method, which provides a count of the total and free memory on the device for the context. It is preferable to implement the method, but this may not be practical for all allocators. If memory info is not provided, this method should raise a RuntimeError.
The defer_cleanup method should return a context manager that ensures that expensive cleanup operations are avoided whilst it is active. The nuances of this will vary between plugins, so the plugin documentation should include an explanation of how deferring cleanup affects deallocations, and performance in general.
The interface_version property is used to ensure that the plugin version matches the interface provided by the version of Numba. At present, this should always be 1.
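The guards that initialize and reset need, as described in the list above, can be sketched as follows. LifecycleSketch is a hypothetical stand-in written so that it runs without Numba or a GPU; a real plugin would derive from BaseCUDAMemoryManager and implement the remaining methods as well.

```python
class LifecycleSketch:
    """Hypothetical stand-in showing only the initialize/reset guards.

    It deliberately does not derive from numba.cuda.BaseCUDAMemoryManager
    and does not allocate anything.
    """

    def __init__(self):
        self._initialized = False
        # address -> size; stands in for the underlying library's state.
        self.allocations = {}

    def initialize(self):
        # May be called several times: set up state on the first call only,
        # so that existing allocations are never invalidated.
        if self._initialized:
            return
        self._initialized = True

    def reset(self):
        # May be called before initialize: nothing to clean up in that case.
        if not self._initialized:
            return
        self.allocations.clear()

    @property
    def interface_version(self):
        # The only interface version described in this document.
        return 1
```

Calling reset() on a fresh instance is a no-op, and a second call to initialize() leaves any recorded allocations untouched.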

Full documentation for the base class follows:

class numba.cuda.BaseCUDAMemoryManager(*args, **kwargs)

Abstract base class for External Memory Management (EMM) Plugins.

abstract memalloc(size)

Allocate on-device memory in the current context.

Parameters: size (int) – Size of allocation in bytes
Returns: A memory pointer instance that owns the allocated memory
Return type: MemoryPointer

abstract memhostalloc(size, mapped, portable, wc)

Allocate pinned host memory.

Parameters

size (int) – Size of the allocation in bytes
mapped (bool) – Whether the allocated memory should be mapped into the CUDA address space.
portable (bool) – Whether the memory will be considered pinned by all contexts, and not just the calling context.
wc (bool) – Whether to allocate the memory as write-combined.

Returns

A memory pointer instance that owns the allocated memory. The return type depends on whether the region was mapped into device memory.

Return type

MappedMemory or PinnedMemory

abstract mempin(owner, pointer, size, mapped)

Pin a region of host memory that is already allocated.

Parameters

owner – The object that owns the memory.
pointer (int) – The pointer to the beginning of the region to pin.
size (int) – The size of the region in bytes.
mapped (bool) – Whether the region should also be mapped into device memory.

Returns

A memory pointer instance that refers to the allocated memory.

Return type

MappedMemory or PinnedMemory

abstract initialize()

Perform any initialization required for the EMM plugin instance to be ready to use.

Returns: None

abstract get_ipc_handle(memory)

Return an IPC handle from a GPU allocation.

Parameters: memory (MemoryPointer) – Memory for which the IPC handle should be created.
Returns: IPC handle for the allocation
Return type: IpcHandle

abstract get_memory_info()

Returns (free, total) memory in bytes in the context. May raise NotImplementedError, if returning such information is not practical (e.g. for a pool allocator).

Returns: Memory info
Return type: MemoryInfo

abstract reset()

Clears up all memory allocated in this context.

Returns: None

abstract defer_cleanup()

Returns a context manager that ensures the implementation of deferred cleanup whilst it is active.

Returns: Context manager

abstract property interface_version: Returns an integer specifying the version of the EMM Plugin interface supported by the plugin implementation. Should always return 1 for implementations of this version of the specification.

The Host-Only CUDA Memory Manager

Some external memory managers will support management of on-device memory but not host memory. For implementing EMM Plugins using one of these memory managers, a partial implementation of a plugin that implements host-side allocation and pinning is provided. To use it, derive from HostOnlyCUDAMemoryManager instead of BaseCUDAMemoryManager. Guidelines for using this class are:

The host-only memory manager implements memhostalloc and mempin - the EMM Plugin should still implement memalloc.
If reset is overridden, it must also call super().reset() to allow the host allocations to be cleaned up.
If defer_cleanup is overridden, it must hold an active context manager from super().defer_cleanup() to ensure that host-side cleanup is also deferred.
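The two overriding rules above can be illustrated with a small sketch. HostSideStandIn is a hypothetical stand-in for HostOnlyCUDAMemoryManager (it is not Numba's code, and the attribute names are invented), so that the pattern can be run without a GPU.

```python
from contextlib import contextmanager


class HostSideStandIn:
    """Hypothetical stand-in for HostOnlyCUDAMemoryManager."""

    def __init__(self):
        self.host_cleanup_deferred = False
        self.host_allocations = ["pinned-buffer"]

    def reset(self):
        self.host_allocations.clear()

    @contextmanager
    def defer_cleanup(self):
        self.host_cleanup_deferred = True
        try:
            yield
        finally:
            self.host_cleanup_deferred = False


class DevicePlugin(HostSideStandIn):
    """Hypothetical plugin that adds device-side state to the base class."""

    def __init__(self):
        super().__init__()
        self.device_cleanup_deferred = False
        self.device_allocations = ["device-buffer"]

    def reset(self):
        # Give the base class the chance to clean up host allocations.
        super().reset()
        self.device_allocations.clear()

    @contextmanager
    def defer_cleanup(self):
        # Hold the base class's context manager open for as long as ours is,
        # so that host-side cleanup is deferred too.
        with super().defer_cleanup():
            self.device_cleanup_deferred = True
            try:
                yield
            finally:
                self.device_cleanup_deferred = False
```

Inside `with plugin.defer_cleanup():` both the host-side and device-side flags are set, and both are cleared again on exit.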

Documentation for the methods of HostOnlyCUDAMemoryManager follows:

class numba.cuda.HostOnlyCUDAMemoryManager(*args, **kwargs)

Base class for External Memory Management (EMM) Plugins that only implement on-device allocation. A subclass need not implement the memhostalloc and mempin methods.

This class also implements reset and defer_cleanup (see numba.cuda.BaseCUDAMemoryManager) for its own internal state management. If an EMM Plugin based on this class also implements these methods, then its implementations of these must also call the method from super() to give HostOnlyCUDAMemoryManager an opportunity to do the necessary work for the host allocations it is managing.

This class does not implement interface_version, as it will always be consistent with the version of Numba in which it is implemented. An EMM Plugin subclassing this class should implement interface_version instead.

memhostalloc(size, mapped=False, portable=False, wc=False)

Implements the allocation of pinned host memory.

It is recommended that EMM Plugin implementations do not override this method. A plugin that needs its own implementation should derive from BaseCUDAMemoryManager instead.

mempin(owner, pointer, size, mapped=False)

Implements the pinning of host memory.

It is recommended that EMM Plugin implementations do not override this method. A plugin that needs its own implementation should derive from BaseCUDAMemoryManager instead.

reset()

Clears up all host memory (mapped and/or pinned) in the current context.

EMM Plugins that override this method must call super().reset() to ensure that host allocations are also cleaned up.

defer_cleanup()

Returns a context manager that disables cleanup of mapped or pinned host memory in the current context whilst it is active.

EMM Plugins that override this method must obtain the context manager from this method before yielding to ensure that cleanup of host allocations is also deferred.

The IPC Handle Mixin

An implementation of the get_ipc_handle() function is provided in the GetIpcHandleMixin class. This uses the driver API to determine the base address of an allocation for opening an IPC handle. If this implementation is appropriate for an EMM plugin, it can be added by mixing in the GetIpcHandleMixin class:

class numba.cuda.GetIpcHandleMixin

A class that provides a default implementation of get_ipc_handle().

get_ipc_handle(memory): Open an IPC memory handle by using cuMemGetAddressRange to determine the base pointer of the allocation. An IPC handle of type cu_ipc_mem_handle is constructed and initialized with cuIpcGetMemHandle. A numba.cuda.IpcHandle is returned, populated with the underlying ipc_mem_handle.

Classes and structures of returned objects

This section provides an overview of the classes and structures that need to be constructed by an EMM Plugin.

Memory Pointers

EMM Plugins should construct memory pointer instances that represent their allocations, for return to Numba. The appropriate memory pointer class to use in each method is:

MemoryPointer: returned from memalloc
MappedMemory: returned from memhostalloc or mempin when the host memory is mapped into the device memory space.
PinnedMemory: returned from memhostalloc or mempin when the host memory is not mapped into the device memory space.

Memory pointers can take a finalizer, which is a function that is called when the buffer is no longer needed. Usually the finalizer will make a call to the memory management library (either internal to Numba, or external if allocated by an EMM Plugin) to inform it that the memory is no longer required, and that it could potentially be freed and/or unpinned. The memory manager may choose to defer actually cleaning up the memory to any later time after the finalizer runs - it is not required to free the buffer immediately.
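As an illustration of a finalizer that only informs the memory manager, leaving the actual free for later, consider the following sketch. ToyAllocator, make_finalizer and flush are hypothetical names, and the addresses are plain integers rather than device pointers.

```python
class ToyAllocator:
    """Hypothetical allocator that defers the real free until flush()."""

    def __init__(self):
        self.live = {}        # address -> size of allocations not yet freed
        self.pending = []     # addresses whose finalizer has already run
        self._next_address = 0x1000

    def allocate(self, size):
        address = self._next_address
        self._next_address += size
        self.live[address] = size
        return address

    def make_finalizer(self, address):
        # The closure captures the allocator and the address only, not the
        # memory pointer object, so it does not keep that object alive.
        def finalizer():
            self.pending.append(address)
        return finalizer

    def flush(self):
        # The deferred step: actually release everything that is pending.
        for address in self.pending:
            del self.live[address]
        self.pending.clear()
```

A finalizer built this way would be passed as the finalizer argument when constructing the memory pointer. After it runs, the allocation is still recorded in `live` until the allocator chooses to call `flush()`.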

Documentation for the memory pointer classes follows.

class numba.cuda.MemoryPointer(context, pointer, size, owner=None, finalizer=None)

A memory pointer that owns a buffer, with an optional finalizer. Memory pointers provide reference counting, and instances are initialized with a reference count of 1.

The base MemoryPointer class does not use the reference count for managing the buffer lifetime. Instead, the buffer lifetime is tied to the memory pointer instance’s lifetime:

When the instance is deleted, the finalizer will be called.
When the reference count drops to 0, no action is taken.

Subclasses of MemoryPointer may modify these semantics, for example to tie the buffer lifetime to the reference count, so that the buffer is freed when there are no more references.

Parameters

context (Context) – The context in which the pointer was allocated.
pointer (ctypes.c_void_p) – The address of the buffer.
size (int) – The size of the allocation in bytes.
owner (NoneType) – The owner is sometimes set by the internals of this class, or used for Numba’s internal memory management. It should not be provided by an external user of the MemoryPointer class (e.g. from within an EMM Plugin); the default of None should always suffice.
finalizer (function) – A function that is called when the buffer is to be freed.

The AutoFreePointer class need not be used directly, but is documented here as it is subclassed by numba.cuda.MappedMemory:

class numba.cuda.cudadrv.driver.AutoFreePointer(*args, **kwargs)

Modifies the ownership semantic of the MemoryPointer so that the instance lifetime is directly tied to the number of references.

When the reference count reaches zero, the finalizer is invoked.

Constructor arguments are the same as for MemoryPointer.
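The difference between the two ownership semantics can be seen in a toy model. ToyPointer and ToyAutoFreePointer are hypothetical classes that mimic only the behaviour described above; they are not Numba's implementation, and the acquire/release method names are invented for the sketch. The demonstration at the end relies on CPython destroying an object as soon as its last reference is dropped.

```python
import weakref


class ToyPointer:
    """Toy model: the finalizer is tied to the lifetime of the instance."""

    def __init__(self, finalizer):
        self.refct = 1
        # weakref.finalize calls the finalizer when this instance is collected.
        self._finalize = weakref.finalize(self, finalizer)

    def acquire(self):
        self.refct += 1

    def release(self):
        self.refct -= 1   # reaching 0 triggers nothing in this class


class ToyAutoFreePointer(ToyPointer):
    """Toy model: the finalizer runs when the reference count reaches zero."""

    def release(self):
        super().release()
        if self.refct == 0:
            # A weakref.finalize object runs its callback at most once.
            self._finalize()


freed = []

p = ToyPointer(lambda: freed.append("base"))
p.release()        # reference count is now 0, but nothing is finalized
del p              # deleting the instance runs the finalizer

q = ToyAutoFreePointer(lambda: freed.append("auto"))
q.release()        # reference count reaches 0, so the finalizer runs
```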

class numba.cuda.MappedMemory(context, pointer, size, owner=None, finalizer=None)

A memory pointer that refers to a buffer on the host that is mapped into device memory.

Parameters

context (Context) – The context in which the pointer was mapped.
pointer (ctypes.c_void_p) – The address of the buffer.
size (int) – The size of the buffer in bytes.
owner (NoneType) – The owner is sometimes set by the internals of this class, or used for Numba’s internal memory management. It should not be provided by an external user of the MappedMemory class (e.g. from within an EMM Plugin); the default of None should always suffice.
finalizer (function) – A function that is called when the buffer is to be freed.

class numba.cuda.PinnedMemory(context, pointer, size, owner=None, finalizer=None)

A pointer to a pinned buffer on the host.

Parameters

context (Context) – The context in which the pointer was mapped.
pointer (ctypes.c_void_p) – The address of the buffer.
size (int) – The size of the buffer in bytes.
owner – An object owning the buffer that has been pinned. For EMM plugin implementation, the default of None suffices for memory allocated in memhostalloc - for mempin, it should be the owner passed in to the mempin method.
finalizer (function) – A function that is called when the buffer is to be freed.

Memory Info

If an implementation of get_memory_info() is to provide a result, then it should return an instance of the MemoryInfo named tuple:

class numba.cuda.MemoryInfo(free, total)

Free and total memory for a device.

free: Free device memory in bytes.

total: Total device memory in bytes.

IPC

An instance of IpcHandle is required to be returned from an implementation of get_ipc_handle():

class numba.cuda.IpcHandle(base, handle, size, source_info=None, offset=0)

CUDA IPC handle. Serialization of the CUDA IPC handle object is implemented here.

Parameters

base (MemoryPointer) – A reference to the original allocation to keep it alive
handle – The CUDA IPC handle, as a ctypes array of bytes.
size (int) – Size of the original allocation
source_info (dict) – The identity of the device on which the IPC handle was opened.
offset (int) – The offset into the underlying allocation of the memory referred to by this IPC handle.

Guidance for constructing an IPC handle in the context of implementing an EMM Plugin:

The memory parameter passed to the get_ipc_handle method of an EMM Plugin can be passed as the base parameter.
A suitable type for the handle can be constructed as ctypes.c_byte * 64. The data for handle must be populated using a method for obtaining a CUDA IPC handle appropriate to the underlying library.
size should match the size of the original allocation, which can be obtained with memory.size in get_ipc_handle.
An appropriate value for source_info can be created by calling self.context.device.get_device_identity().
If the underlying memory does not point to the base of an allocation returned by the CUDA driver or runtime API (e.g. if a pool allocator is in use) then the offset from the base must be provided.
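Two of the points above, the handle type and the offset, can be sketched with ctypes alone. The helper names make_handle_buffer and offset_from_base are hypothetical, and the raw handle bytes are assumed to come from whatever IPC facility the underlying library provides; this sketch does not call the CUDA driver.

```python
import ctypes

# The handle type suggested in the guidance above: an array of 64 bytes.
IpcHandleBytes = ctypes.c_byte * 64


def make_handle_buffer(raw):
    """Copy 64 raw bytes obtained from the underlying library into a ctypes array."""
    if len(raw) != ctypes.sizeof(IpcHandleBytes):
        raise ValueError("expected exactly 64 bytes")
    return IpcHandleBytes.from_buffer_copy(raw)


def offset_from_base(base_address, pointer_address):
    """Offset of a sub-allocation from the base of the driver allocation."""
    offset = pointer_address - base_address
    if offset < 0:
        raise ValueError("pointer lies before the allocation base")
    return offset
```

For a pool allocator that hands out a sub-allocation 256 bytes into a pool, `offset_from_base(pool_base, pointer)` gives the value to pass as the offset parameter of IpcHandle.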

Setting the EMM Plugin

By default, Numba uses its internal memory management - if an EMM Plugin is to be used, it must be configured. There are two mechanisms for configuring the use of an EMM Plugin: an environment variable, and a function.

Environment variable

A module name can be provided in the environment variable NUMBA_CUDA_MEMORY_MANAGER. If this environment variable is set, Numba will attempt to import the module and use its _numba_memory_manager global variable as the memory manager class. This is primarily useful for running the Numba test suite with an EMM Plugin, e.g.:

$ NUMBA_CUDA_MEMORY_MANAGER=rmm python -m numba.runtests numba.cuda.tests
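The shape of a module that can be named in the environment variable can be sketched as below. The module name my_emm_plugin, the class MyMemoryManager and the function resolve_memory_manager are all hypothetical. The lookup function only illustrates the idea of mapping a module name to its _numba_memory_manager global and is not Numba's own code.

```python
import importlib
import sys
import types

# Build a throwaway module in memory so the sketch runs without extra files.
plugin_module = types.ModuleType("my_emm_plugin")


class MyMemoryManager:
    """Placeholder; a real plugin class would derive from BaseCUDAMemoryManager."""


# The global variable that is looked up in the named module.
plugin_module._numba_memory_manager = MyMemoryManager
sys.modules["my_emm_plugin"] = plugin_module


def resolve_memory_manager(environ):
    """Illustrative lookup: module name -> its _numba_memory_manager global."""
    name = environ.get("NUMBA_CUDA_MEMORY_MANAGER")
    if name is None:
        return None
    module = importlib.import_module(name)
    return module._numba_memory_manager
```

Passing `{"NUMBA_CUDA_MEMORY_MANAGER": "my_emm_plugin"}` to the function returns the MyMemoryManager class.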

Function

The set_memory_manager() function can be used to set the memory manager at runtime. This should be called prior to the initialization of any contexts, as EMM Plugin instances are instantiated along with contexts.

numba.cuda.set_memory_manager(mm_plugin)

Configure Numba to use an External Memory Management (EMM) Plugin. If the EMM Plugin version does not match one supported by this version of Numba, a RuntimeError will be raised.

Parameters: mm_plugin (BaseCUDAMemoryManager) – The class implementing the EMM Plugin.
Returns: None

Resetting the memory manager

It is recommended that the memory manager is set once prior to using any CUDA functionality, and left unchanged for the remainder of execution. It is possible to set the memory manager multiple times, noting the following:

At the time of their creation, contexts are bound to an instance of a memory manager for their lifetime.
Changing the memory manager will have no effect on existing contexts - only contexts created after the memory manager was updated will use instances of the new memory manager.
numba.cuda.close() can be used to destroy contexts after setting the memory manager so that they get re-created with the new memory manager.
- This will invalidate any arrays, streams, events, and modules owned by the context.
- Attempting to use invalid arrays, streams, or events will likely fail with an exception being raised due to a CUDA_ERROR_INVALID_CONTEXT or CUDA_ERROR_CONTEXT_IS_DESTROYED return code from a Driver API function.
- Attempting to use an invalid module will result in a similar error, or in some cases a segmentation fault / access violation.

Note

The invalidation of modules means that all functions compiled with @cuda.jit prior to context destruction will need to be redefined, as the code underlying them will also have been unloaded from the GPU.