NBEP 7: CUDA External Memory Management Plugins
- Author
Graham Markall, NVIDIA
- Contributors
Thomson Comer, Peter Entschev, Leo Fang, John Kirkham, Keith Kraus
- Date
March 2020
- Status
Final
Background and goals
The CUDA Array Interface enables sharing of data between different Python libraries that access CUDA devices. However, each library manages its own memory distinctly from the others. For example:
Numba internally manages memory for the creation of device and mapped host arrays.
The RAPIDS libraries (cuDF, cuML, etc.) use the Rapids Memory Manager for allocating device memory.
CuPy includes a memory pool implementation for both device and pinned memory.
The goal of this NBEP is to describe a plugin interface that enables Numba’s internal memory management to be replaced with an external memory manager by the user. When the plugin interface is in use, Numba no longer directly allocates or frees any memory when creating arrays, but instead requests allocations and frees through the external manager.
Requirements
Provide an External Memory Manager (EMM) interface in Numba.
When the EMM is in use, Numba will make all memory allocations using the EMM. It will never directly call functions such as cuMemAlloc, cuMemFree, etc.
When not using an External Memory Manager (EMM), Numba’s present behaviour is unchanged (at the time of writing, the current version is the 0.48 release).
If an EMM is to be used, it will entirely replace Numba’s internal memory management for the duration of program execution. An interface for setting the memory manager will be provided.
Device vs. Host memory
An EMM will always take responsibility for the management of device memory. However, not all CUDA memory management libraries also support managing host memory, so a facility for Numba to continue the management of host memory whilst ceding control of device memory to the EMM will be provided.
Deallocation strategies
Numba’s internal memory management uses a deallocation strategy designed to increase efficiency by deferring deallocations until a significant quantity are pending. It also provides a mechanism for preventing deallocations entirely during critical sections, using the defer_cleanup() context manager.
When the EMM is not in use, the deallocation strategy and operation of defer_cleanup remain unchanged.
When the EMM is in use, the deallocation strategy is implemented by the EMM, and Numba’s internal deallocation mechanism is not used. For example:
A similar strategy to Numba’s could be implemented by the EMM, or
Deallocated memory might immediately be returned to a memory pool.
The defer_cleanup context manager may behave differently with an EMM - an EMM should be accompanied by documentation of the behaviour of the defer_cleanup context manager when it is in use.
For example, a pool allocator could always immediately return memory to a pool even when the context manager is in use, but could choose not to free empty pools until defer_cleanup is not in use.
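The pool-allocator behaviour described above can be sketched in plain Python. This is a minimal illustration only: ToyPoolManager and all of its attributes are hypothetical names, no CUDA memory is involved, and "releasing the pool" merely stands in for an expensive driver-level free.

```python
from contextlib import contextmanager


class ToyPoolManager:
    """Hypothetical pool-backed manager; not part of Numba or any library."""

    def __init__(self):
        self.live = set()        # blocks currently handed out
        self.pool = []           # blocks returned to the pool, ready for reuse
        self.released_pools = 0  # how often an empty pool has been released
        self._defer_depth = 0    # nesting depth of active defer_cleanup()
        self._next_id = 0

    def alloc(self):
        # Reuse a pooled block when one is available, else create a new one.
        if self.pool:
            block = self.pool.pop()
        else:
            block = self._next_id
            self._next_id += 1
        self.live.add(block)
        return block

    def free(self, block):
        # Blocks always go straight back to the pool, deferred or not.
        self.live.remove(block)
        self.pool.append(block)
        self._maybe_release_pool()

    def _maybe_release_pool(self):
        # Only release the pool when nothing is live and cleanup is not
        # currently deferred.
        if self._defer_depth == 0 and not self.live and self.pool:
            self.pool.clear()
            self.released_pools += 1

    @contextmanager
    def defer_cleanup(self):
        self._defer_depth += 1
        try:
            yield
        finally:
            self._defer_depth -= 1
            self._maybe_release_pool()


mgr = ToyPoolManager()
with mgr.defer_cleanup():
    block = mgr.alloc()
    mgr.free(block)  # returned to the pool immediately
    pooled_during_defer = list(mgr.pool)
    released_during_defer = mgr.released_pools
```

Inside the with block the freed block is already back in the pool, but the pool itself is only released once the context manager exits.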
Management of other objects
In addition to memory, Numba manages the allocation and deallocation of events, streams, and modules (a module is a compiled object, which is generated from @cuda.jit-ted functions). The management of streams, events, and modules should be unchanged by the presence or absence of an EMM.
Asynchronous allocation / deallocation
An asynchronous memory manager might provide the facility for an allocation or free to take a CUDA stream and execute asynchronously. For freeing, this is unlikely to cause issues since it operates at a layer beneath Python, but for allocations this could be problematic if the user tries to then launch a kernel on the default stream from this asynchronous memory allocation.
The interface described in this proposal will not be required to support asynchronous allocation and deallocation, and as such these use cases will not be considered further. However, nothing in this proposal should preclude the straightforward addition of asynchronous operations in future versions of the interface.
Non-requirements
In order to minimise complexity and constrain this proposal to a reasonable scope, the following will not be supported:
Using different memory manager implementations for different contexts. All contexts will use the same memory manager implementation - either the Numba internal implementation or an external implementation.
Changing the memory manager once execution has begun. It is not practical to change the memory manager and retain all allocations. Cleaning up the entire state and then changing to a different memory allocator (rather than starting a new process) appears to be a rather niche use case.
Any changes to the __cuda_array_interface__ to further define its semantics, e.g. for acquiring / releasing memory as discussed in Numba Issue #4886 - these are independent, and can be addressed as part of separate proposals.
Managed memory / UVM is not supported. At present Numba does not support UVM - see Numba Issue #4362 for discussion of support.
Interface for Plugin developers
New classes and functions will be added to numba.cuda.cudadrv.driver:
BaseCUDAMemoryManager and HostOnlyCUDAMemoryManager: base classes for EMM plugin implementations.
set_memory_manager: a function for registering an external memory manager with Numba.
These will be exposed through the public API, in the numba.cuda module.
Additionally, some classes that are already part of the driver module will be exposed as part of the public API:
MemoryPointer: used to encapsulate information about a pointer to device memory.
MappedMemory: used to hold information about host memory that is mapped into the device address space (a subclass of MemoryPointer).
PinnedMemory: used to hold information about host memory that is pinned (a subclass of mviewbuf.MemAlloc, a class internal to Numba).
As an alternative to calling the set_memory_manager function, an environment variable can be used to set the memory manager. The value of the environment variable should be the name of a module that defines the memory manager class in its global scope under the name _numba_memory_manager:
export NUMBA_CUDA_MEMORY_MANAGER="<module>"
When this variable is set, Numba will automatically use the memory manager from the specified module. Calls to set_memory_manager will issue a warning, but otherwise be ignored.
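The lookup described above can be illustrated with a small standalone sketch. It is not Numba's implementation: resolve_manager, toy_set_memory_manager, and the toy_emm_plugin module are hypothetical, and the environment is passed in as a plain dictionary so that the example runs without touching the real process environment.

```python
import importlib
import sys
import types
import warnings

ENV_VAR = "NUMBA_CUDA_MEMORY_MANAGER"


def resolve_manager(environ):
    """Return the class named by the environment mapping, or None if unset."""
    module_name = environ.get(ENV_VAR)
    if not module_name:
        return None
    module = importlib.import_module(module_name)
    # The plugin module is expected to expose the class under this name.
    return getattr(module, "_numba_memory_manager")


def toy_set_memory_manager(current, requested, environ):
    """Return the manager class in effect after a set_memory_manager-style call."""
    if environ.get(ENV_VAR):
        warnings.warn("memory manager was set through the environment; "
                      "ignoring this call")
        return current
    return requested


class ToyManager:
    pass


class OtherManager:
    pass


# Register a fake plugin module so the example runs without a real plugin.
toy_plugin = types.ModuleType("toy_emm_plugin")
toy_plugin._numba_memory_manager = ToyManager
sys.modules["toy_emm_plugin"] = toy_plugin

env = {ENV_VAR: "toy_emm_plugin"}
selected = resolve_manager(env)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    after_call = toy_set_memory_manager(selected, OtherManager, env)
```

Here the class found through the environment stays in effect, and the later explicit call only produces a warning.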
Plugin Base Classes
An EMM plugin is implemented by inheriting from the BaseCUDAMemoryManager class, which is defined as:
from abc import ABCMeta, abstractmethod


class BaseCUDAMemoryManager(object, metaclass=ABCMeta):
    @abstractmethod
    def memalloc(self, size):
        """
        Allocate on-device memory in the current context. Arguments:
        - `size`: Size of allocation in bytes
        Returns: a `MemoryPointer` to the allocated memory.
        """

    @abstractmethod
    def memhostalloc(self, size, mapped, portable, wc):
        """
        Allocate pinned host memory. Arguments:
        - `size`: Size of the allocation in bytes
        - `mapped`: Whether the allocated memory should be mapped into the CUDA
                    address space.
        - `portable`: Whether the memory will be considered pinned by all
                      contexts, and not just the calling context.
        - `wc`: Whether to allocate the memory as write-combined.
        Returns a `MappedMemory` or `PinnedMemory` instance that owns the
        allocated memory, depending on whether the region was mapped into
        device memory.
        """

    @abstractmethod
    def mempin(self, owner, pointer, size, mapped):
        """
        Pin a region of host memory that is already allocated. Arguments:
        - `owner`: An object owning the memory - e.g. a `DeviceNDArray`.
        - `pointer`: The pointer to the beginning of the region to pin.
        - `size`: The size of the region to pin.
        - `mapped`: Whether the region should also be mapped into device memory.
        Returns a `MappedMemory` or `PinnedMemory` instance that refers to the
        allocated memory, depending on whether the region was mapped into device
        memory.
        """

    @abstractmethod
    def initialize(self):
        """
        Perform any initialization required for the EMM plugin to be ready to
        use.
        """

    @abstractmethod
    def get_memory_info(self):
        """
        Returns (free, total) memory in bytes in the context
        """

    @abstractmethod
    def get_ipc_handle(self, memory):
        """
        Return an `IpcHandle` from a GPU allocation. Arguments:
        - `memory`: A `MemoryPointer` for which the IPC handle should be created.
        """

    @abstractmethod
    def reset(self):
        """
        Clear up all memory allocated in this context.
        """

    @abstractmethod
    def defer_cleanup(self):
        """
        Returns a context manager that ensures the implementation of deferred
        cleanup whilst it is active.
        """

    @property
    @abstractmethod
    def interface_version(self):
        """
        Returns an integer specifying the version of the EMM Plugin interface
        supported by the plugin implementation. Should always return 1 for
        implementations described in this proposal.
        """
All of the methods of an EMM plugin are called from within Numba - they never need to be invoked directly by a Numba user.
The initialize method is called by Numba prior to any memory allocations being requested. This gives the EMM an opportunity to initialize any data structures, etc., that it needs for its normal operations. The method may be called multiple times during the lifetime of the program - subsequent calls should not invalidate or reset the state of the EMM.
The memalloc, memhostalloc, and mempin methods are called when Numba requires an allocation of device or host memory, or pinning of host memory. Device memory should always be allocated in the current context.
get_ipc_handle is called when an IPC handle for an array is required. Note that there is no method for closing an IPC handle - this is because the IpcHandle object constructed by get_ipc_handle contains a close() method as part of its definition in Numba, which closes the handle by calling cuIpcCloseMemHandle. It is expected that this is sufficient for general use cases, so no facility for customising the closing of IPC handles is provided by the EMM Plugin interface.
get_memory_info may be called at any time after initialize.
reset is called as part of resetting a context. Numba does not normally call reset spontaneously, but it may be called at the behest of the user. Calls to reset may even occur before initialize is called, so the plugin should be robust against this occurrence.
defer_cleanup is called when the numba.cuda.defer_cleanup context manager is used from user code.
interface_version is called by Numba when the memory manager is set, to ensure that the version of the interface implemented by the plugin is compatible with the version of Numba in use.
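Two of the calling conventions above are easy to get wrong: initialize may be called repeatedly, and reset may arrive before initialize. A minimal sketch of one way to satisfy both follows. ToyManager is a hypothetical stand-in that deliberately does not derive from the Numba base class, so that it runs without CUDA, and the _allocations dictionary is an assumed bookkeeping structure.

```python
class ToyManager:
    """Hypothetical plugin skeleton showing only initialize() and reset()."""

    def __init__(self):
        self._initialized = False
        self._allocations = {}  # address -> size, assumed bookkeeping

    def initialize(self):
        # May be called many times: later calls must keep existing state.
        if self._initialized:
            return
        self._allocations = {}
        self._initialized = True

    def reset(self):
        # May be called before initialize(): nothing to clear in that case.
        if not self._initialized:
            return
        self._allocations.clear()


manager = ToyManager()
manager.reset()                      # tolerated before initialize()
manager.initialize()
manager._allocations[0x1000] = 80    # pretend an allocation was recorded
manager.initialize()                 # a repeated call keeps the record
kept_after_reinit = dict(manager._allocations)
manager.reset()
```

The guard flag is one possible design. A plugin whose underlying library is itself idempotent could omit it.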
Representing pointers
Device Memory
The MemoryPointer class is used to represent a pointer to memory. Whilst there are various details of its implementation, the only aspect relevant to EMM plugin development is its initialization. The __init__ method has the following interface:
class MemoryPointer:
    def __init__(self, context, pointer, size, owner=None, finalizer=None):
context: The context in which the pointer was allocated.
pointer: A ctypes pointer (e.g. ctypes.c_uint64) holding the address of the memory.
size: The size of the allocation in bytes.
owner: The owner is sometimes set by the internals of the class, or used for Numba’s internal memory management, but need not be provided by the writer of an EMM plugin - the default of None should always suffice.
finalizer: A method that is called when the last reference to the MemoryPointer object is released. Usually this will make a call to the external memory management library to inform it that the memory is no longer required, and that it could potentially be freed (though the EMM is not required to free it immediately).
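The finalizer contract can be illustrated without CUDA. In the sketch below, ToyPointer is a hypothetical stand-in that accepts the same constructor arguments as MemoryPointer and uses weakref.finalize to run the finalizer. How Numba's own class schedules the call is not shown here, and the freed list stands in for a call into an external allocator.

```python
import ctypes
import gc
import weakref

freed = []  # addresses handed back to the (imaginary) external allocator


def make_finalizer(address):
    # Bind the address now; the finalizer itself takes no arguments.
    def finalizer():
        freed.append(address)
    return finalizer


class ToyPointer:
    """Stand-in for MemoryPointer: same constructor arguments, no CUDA."""

    def __init__(self, context, pointer, size, owner=None, finalizer=None):
        self.context = context
        self.device_pointer = pointer
        self.size = size
        self.owner = owner
        if finalizer is not None:
            # Run the finalizer once this object is no longer referenced.
            weakref.finalize(self, finalizer)


ptr = ToyPointer(context=None, pointer=ctypes.c_uint64(0x7F00), size=80,
                 finalizer=make_finalizer(0x7F00))
alias = ptr
del ptr                               # one reference remains: nothing freed
freed_while_alias_alive = list(freed)
del alias                             # last reference released
gc.collect()
```

The finalizer must not hold a reference to the pointer object itself, otherwise the object would never become unreachable. Binding only the raw address, as above, avoids that.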
Host Memory
Memory mapped into the CUDA address space (which is created when the memhostalloc or mempin methods are called with mapped=True) is managed using the MappedMemory class:
class MappedMemory(AutoFreePointer):
    def __init__(self, context, pointer, size, owner, finalizer=None):
context: The context in which the pointer was allocated.
pointer: A ctypes pointer (e.g. ctypes.c_void_p) holding the address of the allocated memory.
size: The size of the allocated memory in bytes.
owner: A Python object that owns the memory, e.g. a DeviceNDArray instance.
finalizer: A method that is called when the last reference to the MappedMemory object is released. For example, this method could call cuMemFreeHost on the pointer to deallocate the memory immediately.
Note that the inheritance from AutoFreePointer is an implementation detail and need not concern the developer of an EMM plugin - MemoryPointer is higher in the MRO of MappedMemory.
Memory that is only in the host address space and has been pinned is represented with the PinnedMemory class:
class PinnedMemory(mviewbuf.MemAlloc):
    def __init__(self, context, pointer, size, owner, finalizer=None):
context: The context in which the pointer was allocated.
pointer: A ctypes pointer (e.g. ctypes.c_void_p) holding the address of the pinned memory.
size: The size of the pinned region in bytes.
owner: A Python object that owns the memory, e.g. a DeviceNDArray instance.
finalizer: A method that is called when the last reference to the PinnedMemory object is released. This method could e.g. call cuMemHostUnregister on the pointer to unpin the memory immediately.
Providing device memory management only
Some external memory managers will support management of on-device memory but not host memory. To make it easy to implement an EMM plugin using one of these managers, Numba will provide a memory manager class with implementations of the memhostalloc and mempin methods. An abridged definition of this class follows:
class HostOnlyCUDAMemoryManager(BaseCUDAMemoryManager):
    # Unimplemented methods:
    #
    # - memalloc
    # - get_memory_info

    def memhostalloc(self, size, mapped, portable, wc):
        # Implemented.

    def mempin(self, owner, pointer, size, mapped):
        # Implemented.

    def initialize(self):
        # Implemented.
        #
        # Must be called by any subclass when its initialize() method is
        # called.

    def reset(self):
        # Implemented.
        #
        # Must be called by any subclass when its reset() method is
        # called.

    def defer_cleanup(self):
        # Implemented.
        #
        # Must be called by any subclass when its defer_cleanup() method is
        # called.
A class can subclass the HostOnlyCUDAMemoryManager and then it only needs to add implementations of methods for on-device memory. Any subclass must observe the following rules:
If the subclass implements __init__, then it must also call HostOnlyCUDAMemoryManager.__init__, as this is used to initialize some of its data structures (self.allocations and self.deallocations).
The subclass must implement memalloc and get_memory_info.
The initialize and reset methods perform initialisation of structures used by the HostOnlyCUDAMemoryManager.
If the subclass has nothing to do on initialisation (possibly) or reset (unlikely) then it need not implement these methods.
However, if it does implement these methods then it must also call the methods from HostOnlyCUDAMemoryManager in its own implementations.
Similarly if defer_cleanup is implemented, it should enter the context provided by HostOnlyCUDAMemoryManager.defer_cleanup() prior to yielding (or in the __enter__ method) and release it prior to exiting (or in the __exit__ method).
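The subclassing rules above can be illustrated with stand-in classes that run without CUDA. ToyHostOnlyManager below is a hypothetical substitute for HostOnlyCUDAMemoryManager that only records which of its methods were called. ToyDevicePlugin shows a subclass chaining to it in the required places.

```python
from contextlib import contextmanager


class ToyHostOnlyManager:
    """Stand-in for HostOnlyCUDAMemoryManager; records calls, no CUDA."""

    def __init__(self):
        self.allocations = {}
        self.deallocations = []
        self.calls = []

    def initialize(self):
        self.calls.append("base.initialize")

    def reset(self):
        self.allocations.clear()
        self.calls.append("base.reset")

    @contextmanager
    def defer_cleanup(self):
        self.calls.append("base.defer:enter")
        try:
            yield
        finally:
            self.calls.append("base.defer:exit")


class ToyDevicePlugin(ToyHostOnlyManager):
    def __init__(self, pool_size):
        super().__init__()  # the base __init__ creates its data structures
        self.pool_size = pool_size

    def memalloc(self, size):
        # A real plugin would return a MemoryPointer here.
        return ("device-block", size)

    def get_memory_info(self):
        return (self.pool_size, self.pool_size)

    def initialize(self):
        super().initialize()  # chain to the base class
        self.calls.append("plugin.initialize")

    def reset(self):
        super().reset()  # chain to the base class
        self.calls.append("plugin.reset")

    @contextmanager
    def defer_cleanup(self):
        # Enter the base context before yielding; it is released on exit.
        with super().defer_cleanup():
            self.calls.append("plugin.defer:enter")
            yield


plugin = ToyDevicePlugin(pool_size=1024)
plugin.initialize()
with plugin.defer_cleanup():
    pass
plugin.reset()
```

After this runs, plugin.calls lists the base-class hook ahead of the plugin's own work for each of the three methods.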
Import order
The order in which Numba and the library implementing an EMM Plugin are imported should not matter. For example, if rmm were to implement and register an EMM Plugin, then:
from numba import cuda
import rmm
and
import rmm
from numba import cuda
are equivalent - this is because Numba does not initialize CUDA or allocate any memory until the first call to a CUDA function - neither instantiating and registering an EMM plugin, nor importing numba.cuda causes a call to a CUDA function.
Numba as a Dependency
Adding the implementation of an EMM Plugin to a library naturally makes Numba a dependency of the library where it may not have been previously. In order to make the dependency optional, if this is desired, one might conditionally instantiate and register the EMM Plugin like:
try:
    from numba import cuda
    from mylib.numba_utils import MyNumbaMemoryManager
    cuda.set_memory_manager(MyNumbaMemoryManager)
except ImportError:
    print("Numba not importable - not registering EMM Plugin")
so that mylib.numba_utils, which contains the implementation of the EMM Plugin, is only imported if Numba is already present. If Numba is not available, then mylib.numba_utils (which necessarily imports numba) will never be imported.
It is recommended that any library with an EMM Plugin includes at least some environments with Numba for testing with the EMM Plugin in use, as well as some environments without Numba, to avoid introducing an accidental Numba dependency.
Example implementation - A RAPIDS Memory Manager (RMM) Plugin
An implementation of an EMM plugin within the Rapids Memory Manager (RMM) is sketched out in this section. This is intended to show an overview of the implementation in order to support the descriptions above and to illustrate how the plugin interface can be used - different choices may be made for a production-ready implementation.
The plugin implementation consists of additions to python/rmm/rmm.py:
# New imports:
import ctypes
from contextlib import contextmanager

# RMM already has Numba as a dependency, so these imports need not be guarded
# by a check for the presence of numba.
from numba import cuda
from numba.cuda import (HostOnlyCUDAMemoryManager, MemoryPointer, IpcHandle,
                        set_memory_manager)


# New class implementing the EMM Plugin:
class RMMNumbaManager(HostOnlyCUDAMemoryManager):
    def memalloc(self, size):
        # Allocates device memory using RMM functions. The finalizer for the
        # allocated memory calls back to RMM to free the memory.
        stream = 0  # allocate and free on the default stream
        addr = librmm.rmm_alloc(size, stream)
        ctx = cuda.current_context()
        ptr = ctypes.c_uint64(int(addr))
        finalizer = _make_finalizer(addr, stream)
        return MemoryPointer(ctx, ptr, size, finalizer=finalizer)

    def get_ipc_handle(self, memory):
        """
        Get an IPC handle for the memory with offset modified by the RMM memory
        pool.
        """
        # This implementation provides a functional implementation and illustrates
        # what get_ipc_handle needs to do, but it is not a very "clean"
        # implementation, and it relies on borrowing bits of Numba internals to
        # initialise ipchandle.
        #
        # A more polished implementation might make use of additional functions in
        # the RMM C++ layer for initialising IPC handles, and not use any Numba
        # internals.
        ipchandle = (ctypes.c_byte * 64)()  # IPC handle is 64 bytes
        cuda.cudadrv.memory.driver_funcs.cuIpcGetMemHandle(
            ctypes.byref(ipchandle),
            memory.owner.handle,
        )
        source_info = cuda.current_context().device.get_device_identity()
        ptr = memory.device_ctypes_pointer.value
        offset = librmm.rmm_getallocationoffset(ptr, 0)
        return IpcHandle(memory, ipchandle, memory.size, source_info,
                         offset=offset)

    def get_memory_info(self):
        # Returns a tuple of (free, total) using RMM functionality.
        return get_info()  # Function defined in rmm.py

    def initialize(self):
        # Nothing required to initialize RMM here, but this method is added
        # to illustrate that the super() method should also be called.
        super().initialize()

    @contextmanager
    def defer_cleanup(self):
        # Does nothing to defer cleanup - a full implementation may choose to
        # implement a different policy.
        with super().defer_cleanup():
            yield

    @property
    def interface_version(self):
        # As required by the specification
        return 1


# The existing _make_finalizer function is used by RMMNumbaManager:
def _make_finalizer(handle, stream):
    """
    Factory to make the finalizer function.
    We need to bind *handle* and *stream* into the actual finalizer, which
    takes no args.
    """
    def finalizer():
        """
        Invoked when the MemoryPointer is freed
        """
        librmm.rmm_free(handle, stream)

    return finalizer


# Utility function to register `RMMNumbaManager` as an EMM:
def use_rmm_for_numba():
    set_memory_manager(RMMNumbaManager)


# To support `NUMBA_CUDA_MEMORY_MANAGER=rmm`:
_numba_memory_manager = RMMNumbaManager
Example usage
A simple example that configures Numba to use RMM for memory management and creates a device array is as follows:
# example.py
import rmm
import numpy as np
from numba import cuda
rmm.use_rmm_for_numba()
a = np.zeros(10)
d_a = cuda.to_device(a)
del(d_a)
print(rmm.csv_log())
Running this should result in output similar to the following:
Event Type,Device ID,Address,Stream,Size (bytes),Free Memory,Total Memory,Current Allocs,Start,End,Elapsed,Location
Alloc,0,0x7fae06600000,0,80,0,0,1,1.10549,1.1074,0.00191666,<path>/numba/numba/cuda/cudadrv/driver.py:683
Free,0,0x7fae06600000,0,0,0,0,0,1.10798,1.10921,0.00122238,<path>/numba/numba/utils.py:678
Note that there is some scope for improvement in RMM for detecting the line number at which the allocation / free occurred, but this is outside the scope of the example in this proposal.
Setting the memory manager through the environment
Rather than calling rmm.use_rmm_for_numba() in the example above, the memory manager could also be set to use RMM globally with an environment variable. The variable names the module that defines _numba_memory_manager, so the Python interpreter is invoked to run the example as:
NUMBA_CUDA_MEMORY_MANAGER="rmm" python example.py
Numba internal changes
This section is intended primarily for Numba developers - those with an interest in the external interface for implementing EMM plugins may choose to skip over this section.
Current model / implementation
At present, memory management is implemented in the Context class. It maintains lists of allocations and deallocations:
allocations is a numba.core.utils.UniqueDict, created at context creation time.
deallocations is an instance of the _PendingDeallocs class, and is created when Context.prepare_for_use() is called.
These are used to track allocations and deallocations of:
Device memory
Pinned memory
Mapped memory
Streams
Events
Modules
The _PendingDeallocs class implements the deferred deallocation strategy - cleanup functions (such as cuMemFree) for the items above are added to its list of pending deallocations by the finalizers of objects representing allocations. These finalizers are run when the objects owning them are garbage-collected by the Python interpreter. When the addition of a new cleanup function to the deallocation list causes the number or size of pending deallocations to exceed a configured ratio, the _PendingDeallocs object runs deallocators for all items it knows about and then clears its internal pending list.
See Deallocation Behavior for more details of this implementation.
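The deferred deallocation strategy described above can be sketched as a small queue that flushes once a count or size threshold is crossed. This is an illustration, not the _PendingDeallocs source: the class name, the add_item and clear method names, and the threshold parameters (max_count, max_ratio) are all assumptions chosen for the example.

```python
from collections import deque


class ToyPendingDeallocs:
    """Hypothetical sketch of a deferred-deallocation queue."""

    def __init__(self, capacity, max_count=3, max_ratio=0.2):
        self._queue = deque()
        self._size = 0
        self._max_count = max_count
        # Flush once pending bytes exceed this fraction of the capacity.
        self._max_bytes = int(capacity * max_ratio)

    def add_item(self, dtor, handle, size=0):
        # Called from an object's finalizer instead of freeing directly.
        self._queue.append((dtor, handle, size))
        self._size += size
        if len(self._queue) > self._max_count or self._size > self._max_bytes:
            self.clear()

    def clear(self):
        # Run every pending deallocator, then forget them all.
        while self._queue:
            dtor, handle, _ = self._queue.popleft()
            dtor(handle)
        self._size = 0


freed = []
pending = ToyPendingDeallocs(capacity=1000, max_count=3, max_ratio=0.2)

for handle in (1, 2, 3):
    pending.add_item(freed.append, handle, size=10)
freed_after_three = list(freed)   # below both thresholds: nothing freed yet

pending.add_item(freed.append, 4, size=10)    # count threshold exceeded
freed_after_four = list(freed)

pending.add_item(freed.append, 5, size=500)   # size threshold exceeded
```

Batching in this way trades prompt release of memory for fewer calls into the driver.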
Proposed changes
This section outlines the major changes that will be made to support the EMM plugin interface - there will be various small changes to other parts of Numba that will be required in order to adapt to these changes; an exhaustive list of these is not provided.
Context changes
The numba.cuda.cudadrv.driver.Context class will no longer directly allocate and free memory. Instead, the context will hold a reference to a memory manager instance, and its memory allocation methods will call into the memory manager, e.g.:
def memalloc(self, size):
    return self.memory_manager.memalloc(size)

def memhostalloc(self, size, mapped=False, portable=False, wc=False):
    return self.memory_manager.memhostalloc(size, mapped, portable, wc)

def mempin(self, owner, pointer, size, mapped=False):
    if mapped and not self.device.CAN_MAP_HOST_MEMORY:
        raise CudaDriverError("%s cannot map host memory" % self.device)
    return self.memory_manager.mempin(owner, pointer, size, mapped)

def prepare_for_use(self):
    self.memory_manager.initialize()

def get_memory_info(self):
    return self.memory_manager.get_memory_info()

def get_ipc_handle(self, memory):
    return self.memory_manager.get_ipc_handle(memory)

def reset(self):
    # ... Already-extant reset logic, plus:
    self.memory_manager.reset()
The memory_manager member is initialised when the context is created.
The memunpin method (not shown above, but currently present in the Context class) has never been implemented - it presently raises a NotImplementedError. This method is arguably unneeded - pinned memory is immediately unpinned by its finalizer, and unpinning before a finalizer runs would invalidate the state of PinnedMemory objects for which references are still held. It is proposed that this is removed when making the other changes to the Context class.
The Context class will still instantiate self.allocations and self.deallocations as before - these will still be used by the context to manage the allocations and deallocations of events, streams, and modules, which are not handled by the EMM plugin.
New components of the driver module
BaseCUDAMemoryManager: An abstract class, as defined in the plugin interface above.
HostOnlyCUDAMemoryManager: A subclass of BaseCUDAMemoryManager, with the logic from Context.memhostalloc and Context.mempin moved into it. This class will also create its own allocations and deallocations members, similarly to how the Context class creates them. These are used to manage the allocations and deallocations of pinned and mapped host memory.
NumbaCUDAMemoryManager: A subclass of HostOnlyCUDAMemoryManager, which also contains an implementation of memalloc based on that presently existing in the Context class. This is the default memory manager, and its use preserves the behaviour of Numba prior to the addition of the EMM plugin interface - that is, all memory allocation and deallocation for Numba arrays is handled within Numba.
This class shares the allocations and deallocations members with its parent class HostOnlyCUDAMemoryManager, and it uses these for the management of device memory that it allocates.
The set_memory_manager function, which sets a global pointing to the memory manager class. This global initially holds NumbaCUDAMemoryManager (the default).
Staged IPC
Staged IPC should not take ownership of the memory that it allocates. When the default internal memory manager is in use, the memory allocated for the staging array is already owned. When an EMM plugin is in use, it is not legitimate to take ownership of the memory.
This change can be made by applying the following small patch, which has been tested to have no effect on the CUDA test suite:
diff --git a/numba/cuda/cudadrv/driver.py b/numba/cuda/cudadrv/driver.py
index 7832955..f2c1352 100644
--- a/numba/cuda/cudadrv/driver.py
+++ b/numba/cuda/cudadrv/driver.py
@@ -922,7 +922,11 @@ class _StagedIpcImpl(object):
with cuda.gpus[srcdev.id]:
impl.close()
- return newmem.own()
+ return newmem
Testing
Alongside the addition of appropriate tests for new functionality, there will be some refactoring of existing tests required, but these changes are not substantial. Tests of the deallocation strategy (e.g. TestDeallocation, TestDeferCleanup) will need to be modified to ensure that they are examining the correct set of deallocations. When an EMM plugin is in use, they will need to be skipped.
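One way such skipping could be arranged is with a decorator built on unittest.skipIf. The sketch below is hypothetical: skip_if_emm_in_use is an invented name, it only detects an EMM selected through the environment variable, and a dictionary stands in for the real environment so that the example is deterministic.

```python
import os
import unittest


def skip_if_emm_in_use(reason, environ=None):
    """Hypothetical decorator: skip when an EMM is selected via the environment."""
    environ = os.environ if environ is None else environ
    using_emm = bool(environ.get("NUMBA_CUDA_MEMORY_MANAGER"))
    return unittest.skipIf(using_emm, reason)


fake_env = {"NUMBA_CUDA_MEMORY_MANAGER": "rmm"}  # simulated environment


@skip_if_emm_in_use("tests Numba's internal deallocation strategy", fake_env)
class TestDeallocationSketch(unittest.TestCase):
    def test_pending_list(self):
        self.fail("would only run with the internal memory manager")


@skip_if_emm_in_use("tests Numba's internal deallocation strategy", {})
class TestWithoutEMMSketch(unittest.TestCase):
    def test_runs(self):
        self.assertTrue(True)


loader = unittest.defaultTestLoader
suite = unittest.TestSuite()
suite.addTests(loader.loadTestsFromTestCase(TestDeallocationSketch))
suite.addTests(loader.loadTestsFromTestCase(TestWithoutEMMSketch))
result = unittest.TestResult()
suite.run(result)
```

The first class is reported as skipped because the simulated environment names an EMM. The second runs normally.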
Prototyping / experimental implementation
Some prototype / experimental implementations have been produced to guide the designs presented in this document. The current implementations can be found in:
Numba branch: https://github.com/gmarkall/numba/tree/grm-numba-nbep-7.
RMM branch: https://github.com/gmarkall/rmm/tree/grm-numba-nbep-7.
CuPy implementation: https://github.com/gmarkall/nbep-7/blob/master/nbep7/cupy_mempool.py - uses an unmodified CuPy.
Current implementation status
RMM Plugin
For a minimal example, a simple allocation and free using RMM works as expected. For the example code (similar to the RMM example above):
import rmm
import numpy as np
from numba import cuda
rmm.use_rmm_for_numba()
a = np.zeros(10)
d_a = cuda.to_device(a)
del(d_a)
print(rmm.csv_log())
We see the following output:
Event Type,Device ID,Address,Stream,Size (bytes),Free Memory,Total Memory,Current Allocs,Start,End,Elapsed,Location
Alloc,0,0x7f96c7400000,0,80,0,0,1,1.13396,1.13576,0.00180059,<path>/numba/numba/cuda/cudadrv/driver.py:686
Free,0,0x7f96c7400000,0,0,0,0,0,1.13628,1.13723,0.000956004,<path>/numba/numba/utils.py:678
This output is similar to the expected output from the example usage presented above (though note that the pointer addresses and timestamps vary compared to the example), and provides some validation of the example use case.
CuPy Plugin
from nbep7.cupy_mempool import use_cupy_mm_for_numba
import numpy as np
from numba import cuda
use_cupy_mm_for_numba()
a = np.zeros(10)
d_a = cuda.to_device(a)
del(d_a)
The prototype CuPy plugin has somewhat primitive logging, so we see the output:
Allocated 80 bytes at 7f004d400000
Freeing 80 bytes at 7f004d400000
Numba CUDA Unit tests
As well as providing correct execution of a simple example, all relevant Numba CUDA unit tests also pass with the prototype branch, for both the internal memory manager and the RMM EMM Plugin.
RMM
The unit test suite can be run with the RMM EMM Plugin with:
NUMBA_CUDA_MEMORY_MANAGER=rmm python -m numba.runtests numba.cuda.tests
A summary of the unit test suite output is:
Ran 564 tests in 142.211s
OK (skipped=11)
When running with the built-in Numba memory management, the output is:
Ran 564 tests in 133.396s
OK (skipped=5)
i.e. the changes for using an external memory manager do not break the built-in Numba memory management. With the RMM EMM Plugin there are an additional 6 skipped tests, from:
TestDeallocation: skipped as it specifically tests Numba’s internal deallocation strategy.
TestDeferCleanup: skipped as it specifically tests Numba’s implementation of deferred cleanup.
TestCudaArrayInterface.test_ownership: skipped as Numba does not own memory when an EMM Plugin is used, but ownership is assumed by this test case.
CuPy
The test suite can be run with the CuPy plugin using:
NUMBA_CUDA_MEMORY_MANAGER=nbep7.cupy_mempool python -m numba.runtests numba.cuda.tests
This plugin implementation is presently more primitive than the RMM implementation, and results in some errors with the unit test suite:
Ran 564 tests in 111.699s
FAILED (errors=8, skipped=11)
The 8 errors are due to a lack of implementation of get_ipc_handle in the CuPy EMM Plugin implementation. It is expected that this implementation will be re-visited and completed so that CuPy can be used stably as an allocator for Numba in the future.