Achieving optimal performance in GPU-centric workflows frequently requires customizing how host and
device memory are allocated. Examples include using “pinned” host memory for asynchronous
host <-> device memory transfers and using a device memory pool sub-allocator to reduce the cost of
dynamic device memory allocation.
The goal of the RAPIDS Memory Manager (RMM) is to provide:
A common interface that allows customizing device and
host memory allocation
To install RMM from source, ensure the dependencies are met and follow the steps below:
Clone the repository and submodules
$ git clone --recurse-submodules https://github.com/rapidsai/rmm.git
$ cd rmm
Create the conda development environment rmm_dev
# create the conda environment (assuming in base `rmm` directory)
$ conda env create --name rmm_dev --file conda/environments/all_cuda-118_arch-x86_64.yaml
# activate the environment
$ conda activate rmm_dev
Build and install librmm using cmake & make. CMake depends on the nvcc executable being on
your path or defined in the CUDACXX environment variable.
$ mkdir build # make a build directory
$ cd build # enter the build directory
$ cmake .. -DCMAKE_INSTALL_PREFIX=/install/path # configure cmake ... use $CONDA_PREFIX if you're using Anaconda
$ make -j # compile the library librmm.so ... '-j' with no number places no limit on the number of parallel jobs; pass a number (e.g. '-j8') to cap it
$ make install # install the library librmm.so to '/install/path'
Build and install librmm and rmm using build.sh. build.sh creates the build directory at the root of
the git repository. build.sh depends on the nvcc executable being on your path or defined in the
CUDACXX environment variable.
$ ./build.sh -h # Display help and exit
$ ./build.sh -n librmm # Build librmm without installing
$ ./build.sh -n rmm # Build rmm without installing
$ ./build.sh -n librmm rmm # Build librmm and rmm without installing
$ ./build.sh librmm rmm # Build and install librmm and rmm
$ ./build.sh -m librmm tests # Build and install librmm and build tests on MGPU
Set the MACA_PATH environment variable before building on MGPU. Refer to env.sh.
To run tests (Optional):
$ cd build # if you are not already in the build directory
$ make test
Build, install, and test the rmm python package, in the python folder:
Done! You are ready to develop for the RMM OSS project.
Caching third-party dependencies
RMM uses CPM.cmake to
handle third-party dependencies like spdlog, Thrust, GoogleTest, and
GoogleBenchmark. In general you won’t have to worry about it. If CMake
finds an appropriate version on your system, it uses it (you can
help it along by setting CMAKE_PREFIX_PATH to point to the
installed location). Otherwise those dependencies will be downloaded as
part of the build.
If you frequently start new builds from scratch, consider setting the
environment variable CPM_SOURCE_CACHE to an external download
directory to avoid repeated downloads of the third-party dependencies.
Using RMM in a downstream CMake project
The installed RMM library provides a set of config files that makes it easy to
integrate RMM into your own CMake project. In your CMakeLists.txt, call find_package(rmm) and
then link your target against rmm::rmm with target_link_libraries().
Since RMM is a header-only library, this does not actually link RMM,
but it makes the headers available and pulls in transitive dependencies.
If RMM is not installed in a default location, use
CMAKE_PREFIX_PATH or rmm_ROOT to point to its location.
One of RMM’s dependencies is the Thrust library, so the above
automatically pulls in Thrust by means of a dependency on the
rmm::Thrust target. By default it uses the standard configuration of
Thrust. If you want to customize it, you can set the variables
THRUST_HOST_SYSTEM and THRUST_DEVICE_SYSTEM; see
Thrust’s CMake documentation.
Using RMM in C++
The first goal of RMM is to provide a common interface for device and host memory allocation.
This allows both users and implementers of custom allocation logic to program to a single
interface.
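To this end, RMM defines two abstract interface classes:
rmm::mr::device_memory_resource for device memory allocation
rmm::mr::host_memory_resource for host memory allocation
These classes are based on the std::pmr::memory_resource interface class introduced in C++17 for
polymorphic memory allocation.
device_memory_resource
rmm::mr::device_memory_resource is the base class that defines the interface for allocating and
freeing device memory.
It has two key functions:
void* device_memory_resource::allocate(std::size_t bytes, cuda_stream_view s)
Returns a pointer to an allocation of at least bytes bytes.
void device_memory_resource::deallocate(void* p, std::size_t bytes, cuda_stream_view s)
Reclaims a previous allocation of size bytes pointed to by p.
p must have been returned by a previous call to allocate(bytes), otherwise behavior is
undefined.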
It is up to a derived class to provide implementations of these functions. See
available resources for example device_memory_resource derived classes.
Unlike std::pmr::memory_resource, rmm::mr::device_memory_resource does not allow specifying an
alignment argument. All allocations are required to be aligned to at least 256B. Furthermore,
device_memory_resource adds an additional cuda_stream_view argument to allow specifying the stream
on which to perform the (de)allocation.
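The shape of this interface can be sketched without a GPU. The following is a host-only toy, not RMM's implementation: `stream_tag`, `toy_memory_resource`, `host_backed_resource`, and `allocation_is_aligned` are hypothetical names introduced for illustration, and host memory from aligned `operator new` stands in for device memory. It shows a derived class supplying the two functions and honoring the 256B minimum alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical stand-in for rmm::cuda_stream_view; a real view wraps a cudaStream_t.
struct stream_tag { int id; };

constexpr std::size_t kAlignment = 256;  // minimum alignment stated in the text

// Toy interface with the same two-function shape as device_memory_resource.
class toy_memory_resource {
 public:
  virtual ~toy_memory_resource() = default;
  void* allocate(std::size_t bytes, stream_tag s) { return do_allocate(bytes, s); }
  void deallocate(void* p, std::size_t bytes, stream_tag s) { do_deallocate(p, bytes, s); }

 private:
  virtual void* do_allocate(std::size_t bytes, stream_tag s) = 0;
  virtual void do_deallocate(void* p, std::size_t bytes, stream_tag s) = 0;
};

// Derived class backed by host memory so the sketch runs without a GPU.
// The stream argument is accepted but ignored here.
class host_backed_resource final : public toy_memory_resource {
  void* do_allocate(std::size_t bytes, stream_tag) override {
    return ::operator new(bytes, std::align_val_t{kAlignment});
  }
  void do_deallocate(void* p, std::size_t, stream_tag) override {
    ::operator delete(p, std::align_val_t{kAlignment});
  }
};

// Allocates `bytes` from the toy resource and reports whether the pointer is 256B aligned.
bool allocation_is_aligned(std::size_t bytes) {
  host_backed_resource mr;
  stream_tag s{0};
  void* p = mr.allocate(bytes, s);
  assert(p != nullptr);
  bool const aligned = reinterpret_cast<std::uintptr_t>(p) % kAlignment == 0;
  mr.deallocate(p, bytes, s);
  return aligned;
}
```

Callers pass only a size and a stream; the alignment guarantee is a property of the resource rather than a per-call argument, which is the difference from std::pmr::memory_resource described above.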
cuda_stream_view and cuda_stream
rmm::cuda_stream_view is a simple non-owning wrapper around a CUDA cudaStream_t. This wrapper’s
purpose is to provide strong type safety for stream types. (cudaStream_t is an alias for a pointer,
which can lead to ambiguity in APIs when it is assigned 0.) All RMM stream-ordered APIs take a
rmm::cuda_stream_view argument.
rmm::cuda_stream is a simple owning wrapper around a CUDA cudaStream_t. This class provides
RAII semantics (constructor creates the CUDA stream, destructor destroys it). An rmm::cuda_stream
can never represent the CUDA default stream or per-thread default stream; it only ever represents
a single non-default stream. rmm::cuda_stream cannot be copied, but can be moved.
cuda_stream_pool
rmm::cuda_stream_pool provides fast access to a pool of CUDA streams. This class can be used to
create a set of cuda_stream objects whose lifetime is equal to the cuda_stream_pool. Using the
stream pool can be faster than creating the streams on the fly. The size of the pool is configurable.
Depending on this size, multiple calls to cuda_stream_pool::get_stream() may return instances of
rmm::cuda_stream_view that represent identical CUDA streams.
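Why repeated calls can return identical streams is easiest to see with a small model. The sketch below is a toy, not RMM's code: `toy_stream_pool` is a hypothetical class, integer ids stand in for CUDA streams, and round-robin selection is an assumption made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy pool that hands out integer "stream ids" in round-robin order.
class toy_stream_pool {
 public:
  explicit toy_stream_pool(std::size_t pool_size) : streams_(pool_size) {
    assert(pool_size > 0);
    for (std::size_t i = 0; i < pool_size; ++i) { streams_[i] = static_cast<int>(i) + 1; }
  }

  // Returns the next id; wraps around once every stream has been handed out.
  int get_stream() const { return streams_[next_.fetch_add(1) % streams_.size()]; }

  std::size_t get_pool_size() const { return streams_.size(); }

 private:
  std::vector<int> streams_;
  mutable std::atomic<std::size_t> next_{0};
};
```

With a pool of size 2 in this model, the first and third calls to `get_stream()` yield the same id, so work submitted through them is ordered on one stream.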
Thread Safety
All current device memory resources are thread safe unless documented otherwise. More specifically,
calls to memory resource allocate() and deallocate() methods are safe with respect to calls to
either of these functions from other threads. They are not thread safe with respect to
construction and destruction of the memory resource object.
Note that a class thread_safe_resource_adaptor is provided which can be used to adapt a memory
resource that is not thread safe to be thread safe (as described above). This adaptor is not needed
with any current RMM device memory resources.
Stream-ordered Memory Allocation
rmm::mr::device_memory_resource is a base class that provides stream-ordered memory allocation.
This allows optimizations such as re-using memory deallocated on the same stream without the
overhead of synchronization.
A call to device_memory_resource::allocate(bytes, stream_a) returns a pointer that is valid to use
on stream_a. Using the memory on a different stream (say stream_b) is Undefined Behavior unless
the two streams are first synchronized, for example by using cudaStreamSynchronize(stream_a) or by
recording a CUDA event on stream_a and then calling cudaStreamWaitEvent(stream_b, event).
The stream specified to device_memory_resource::deallocate should be a stream on which it is valid
to use the deallocated memory immediately for another allocation. Typically this is the stream
on which the allocation was last used before the call to deallocate. The passed stream may be
used internally by a device_memory_resource for managing available memory with minimal
synchronization, and it may also be synchronized at a later time, for example using a call to
cudaStreamSynchronize().
For this reason, it is Undefined Behavior to destroy a CUDA stream that is passed to
device_memory_resource::deallocate. If the stream on which the allocation was last used has been
destroyed before calling deallocate or it is known that it will be destroyed, it is likely better
to synchronize the stream (before destroying it) and then pass a different stream to deallocate
(e.g. the default stream).
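Note that device memory data structures such as rmm::device_buffer and rmm::device_uvector
follow these stream-ordered memory allocation semantics and rules.
For further information about stream-ordered memory allocation semantics, read Using the NVIDIA
CUDA Stream-Ordered Memory Allocator on the NVIDIA Developer Blog.
Available Resources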
RMM provides several device_memory_resource derived classes to satisfy various user requirements.
For more detailed information about these resources, see their respective documentation.
cuda_memory_resource
Allocates and frees device memory using cudaMalloc and cudaFree.
managed_memory_resource
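Allocates and frees device memory using cudaMallocManaged and cudaFree.
Note that managed_memory_resource cannot be used with NVIDIA Virtual GPU Software (vGPU, for use
with virtual machines or hypervisors) because NVIDIA CUDA Unified Memory is not supported by
NVIDIA vGPU.
pool_memory_resource
A coalescing, best-fit pool sub-allocator.
fixed_size_memory_resource
A memory resource that can only allocate a single fixed size. Average allocation and deallocation
cost is constant.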
binning_memory_resource
Configurable to use multiple upstream memory resources for allocations that fall within different
bin sizes. Often configured with multiple bins backed by fixed_size_memory_resources and a single
pool_memory_resource for allocations larger than the largest bin size.
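The routing idea behind binning can be sketched in a few lines. This is a toy model, not RMM's implementation: `toy_binning_router` and the resource names are hypothetical, and each request is assumed to go to the smallest bin that can hold it, with a fallback for anything larger than the largest bin.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Toy router: maps a request size to the name of the resource that would serve it.
class toy_binning_router {
 public:
  explicit toy_binning_router(std::string fallback) : fallback_(std::move(fallback)) {}

  // Registers a bin that serves requests of up to `max_bytes`.
  void add_bin(std::size_t max_bytes, std::string resource_name) {
    assert(max_bytes > 0);
    bins_[max_bytes] = std::move(resource_name);
  }

  // Picks the smallest bin whose size is >= bytes, else the fallback resource.
  const std::string& route(std::size_t bytes) const {
    auto it = bins_.lower_bound(bytes);
    return it != bins_.end() ? it->second : fallback_;
  }

 private:
  std::map<std::size_t, std::string> bins_;  // ordered by bin size
  std::string fallback_;
};
```

In this model, a router with 256-byte and 1024-byte bins and a "pool" fallback sends a 100-byte request to the 256-byte bin and a 4096-byte request to the pool.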
Default Resources and Per-device Resources
RMM users commonly need to configure a device_memory_resource object to use for all allocations
where another resource has not explicitly been provided. A common example is configuring a
pool_memory_resource to use for all allocations to get fast dynamic allocation.
To enable this use case, RMM provides the concept of a “default” device_memory_resource. This
resource is used when another is not explicitly provided.
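Accessing and modifying the default resource is done through two functions:
device_memory_resource* get_current_device_resource()
Returns a pointer to the default resource for the current CUDA device.
The initial default memory resource is an instance of cuda_memory_resource.
This function is thread safe with respect to concurrent calls to it and
set_current_device_resource().
For more explicit control, you can use get_per_device_resource(), which takes a device ID.
device_memory_resource* set_current_device_resource(device_memory_resource* new_mr)
Updates the default memory resource pointer for the current CUDA device to new_mr.
Returns the previous default resource pointer.
If new_mr is nullptr, then resets the default resource to cuda_memory_resource.
This function is thread safe with respect to concurrent calls to it and
get_current_device_resource().
For more explicit control, you can use set_per_device_resource(), which takes a device ID.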
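The get/set semantics described above (the setter returns the previous pointer, and nullptr resets to the initial resource) can be modeled with a single atomic pointer. The sketch below is a simplified toy for one device only, not RMM's implementation; `toy_resource`, `initial_resource`, `get_current_resource`, and `set_current_resource` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a memory resource; only its identity matters here.
struct toy_resource { const char* name; };

// The initial default, playing the role of cuda_memory_resource.
toy_resource* initial_resource() {
  static toy_resource r{"cuda_memory_resource"};
  return &r;
}

// Single slot holding the current default (a real library keeps one per device).
std::atomic<toy_resource*>& current_slot() {
  static std::atomic<toy_resource*> slot{initial_resource()};
  return slot;
}

toy_resource* get_current_resource() {
  toy_resource* mr = current_slot().load();
  assert(mr != nullptr);  // the slot is never left empty
  return mr;
}

// Installs new_mr and returns the previous default; nullptr resets to the initial one.
toy_resource* set_current_resource(toy_resource* new_mr) {
  if (new_mr == nullptr) { new_mr = initial_resource(); }
  return current_slot().exchange(new_mr);
}
```

Returning the previous pointer lets a caller restore the old default later. As in the text, the caller remains responsible for keeping the installed resource alive while it is the default.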
Example
rmm::mr::cuda_memory_resource cuda_mr;
// Construct a resource that uses a coalescing best-fit pool allocator
rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{&cuda_mr};
rmm::mr::set_current_device_resource(&pool_mr); // Updates the current device resource pointer to `pool_mr`
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource(); // Points to `pool_mr`
Multiple Devices
A device_memory_resource should only be used when the active CUDA device is the same device
that was active when the device_memory_resource was created. Otherwise behavior is undefined.
If a device_memory_resource is used with a stream associated with a different CUDA device than the
device for which the memory resource was created, behavior is undefined.
Creating a device_memory_resource for each device requires care to set the current device before
creating each resource, and to maintain the lifetime of the resources as long as they are set as
per-device resources. Here is an example loop that creates unique_ptrs to pool_memory_resource
objects for each device and sets them as the per-device resource for that device.
std::vector<std::unique_ptr<pool_memory_resource>> per_device_pools;
for (int i = 0; i < N; ++i) {
cudaSetDevice(i); // set device i before creating MR
// Use a vector of unique_ptr to maintain the lifetime of the MRs
// (pool_memory_resource template and constructor arguments omitted for brevity)
per_device_pools.push_back(std::make_unique<pool_memory_resource>());
// Set the per-device resource for device i; pass the raw pointer, not the unique_ptr's address
set_per_device_resource(cuda_device_id{i}, per_device_pools.back().get());
}
Allocators
C++ interfaces commonly allow customizable memory allocation through an Allocator object.
RMM provides several Allocator and Allocator-like classes.
polymorphic_allocator
A stream-ordered allocator similar to std::pmr::polymorphic_allocator.
Unlike the standard C++ Allocator interface, the allocate and deallocate functions take a cuda_stream_view indicating the stream on which the (de)allocation occurs.
stream_allocator_adaptor
stream_allocator_adaptor can be used to adapt a stream-ordered allocator to present a standard Allocator interface to consumers that may not be designed to work with a stream-ordered interface.
Example:
rmm::cuda_stream stream;
rmm::mr::polymorphic_allocator<int> stream_alloc;
// Constructs an adaptor that forwards all (de)allocations to `stream_alloc` on `stream`.
auto adapted = rmm::mr::make_stream_allocator_adaptor(stream_alloc, stream);
// Allocates storage for 100 `int`s using `stream_alloc` on `stream`
auto p = adapted.allocate(100);
...
// Deallocates using `stream_alloc` on `stream`
adapted.deallocate(p, 100);
thrust_allocator
thrust_allocator is a device memory allocator that uses the strongly typed thrust::device_ptr, making it usable with containers like thrust::device_vector.
See below for more information on using RMM with Thrust.
Device Data Structures
device_buffer
An untyped, uninitialized RAII class for stream ordered device memory allocation.
Example
cuda_stream_view s{...};
// Allocates at least 100 bytes on stream `s` using the *default* resource
rmm::device_buffer b{100,s};
void* p = b.data(); // Raw, untyped pointer to underlying device memory
kernel<<<..., s.value()>>>(b.data()); // `b` is only safe to use on `s`
rmm::mr::device_memory_resource * mr = new my_custom_resource{...};
// Allocates at least 100 bytes on stream `s` using the resource `mr`
rmm::device_buffer b2{100, s, mr};
device_uvector<T>
A typed, uninitialized RAII class for allocation of a contiguous set of elements in device memory.
Similar to a thrust::device_vector, but as an optimization, does not default initialize the
contained elements. This optimization restricts the types T to trivially copyable types.
Example
cuda_stream_view s{...};
// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the
// default resource
rmm::device_uvector<int32_t> v(100, s);
// Initializes the elements to 0
thrust::uninitialized_fill(thrust::cuda::par.on(s.value()), v.begin(), v.end(), int32_t{0});
rmm::mr::device_memory_resource * mr = new my_custom_resource{...};
// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the resource `mr`
rmm::device_uvector<int32_t> v2{100, s, mr};
device_scalar
A typed, RAII class for allocation of a single element in device memory.
This is similar to a device_uvector with a single element, but provides convenience functions like
modifying the value in device memory from the host, or retrieving the value from device to host.
Example
cuda_stream_view s{...};
// Allocates uninitialized storage for a single `int32_t` in device memory
rmm::device_scalar<int32_t> a{s};
a.set_value(42, s); // Updates the value in device memory to `42` on stream `s`
kernel<<<...,s.value()>>>(a.data()); // Pass raw pointer to underlying element in device memory
int32_t v = a.value(s); // Retrieves the value from device to host on stream `s`
host_memory_resource
rmm::mr::host_memory_resource is the base class that defines the interface for allocating and
freeing host memory.
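Similar to device_memory_resource, it has two key functions for (de)allocation:
void* host_memory_resource::allocate(std::size_t bytes, std::size_t alignment)
Returns a pointer to an allocation of at least bytes bytes aligned to the specified alignment.
void host_memory_resource::deallocate(void* p, std::size_t bytes, std::size_t alignment)
Reclaims a previous allocation of size bytes pointed to by p.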
Unlike device_memory_resource, the host_memory_resource interface and behavior is identical to
std::pmr::memory_resource.
Available Resources
new_delete_resource
Uses the global operator new and operator delete to allocate host memory.
pinned_memory_resource
Allocates “pinned” host memory using cuda(Malloc/Free)Host.
Host Data Structures
RMM does not currently provide any data structures that interface with host_memory_resource.
In the future, RMM will provide a host-side structure similar to device_buffer and an allocator
that can be used with STL containers.
Using RMM with Thrust
RAPIDS and other CUDA libraries make heavy use of Thrust. Thrust uses CUDA device memory in two
situations:
As the backing store for thrust::device_vector, and
As temporary storage inside some algorithms, such as thrust::sort.
RMM provides rmm::mr::thrust_allocator as a conforming Thrust allocator that uses
device_memory_resources.
Thrust Algorithms
To instruct a Thrust algorithm to use rmm::mr::thrust_allocator to allocate temporary storage, you
can use the custom Thrust CUDA device execution policy: rmm::exec_policy(stream).
thrust::sort(rmm::exec_policy(stream), ...);
The stream passed to rmm::exec_policy is used both by rmm::mr::thrust_allocator for temporary
allocations and to execute the Thrust algorithm.
Logging
RMM includes two forms of logging: memory event logging and debug logging.
Memory Event Logging and logging_resource_adaptor
Memory event logging writes details of every allocation or deallocation to a CSV (comma-separated
value) file. In C++, Memory Event Logging is enabled by using the logging_resource_adaptor as a
wrapper around any other device_memory_resource object.
Each row in the log represents either an allocation or a deallocation. The columns of the file are
“Thread, Time, Action, Pointer, Size, Stream”.
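The adaptor pattern and the row layout can be illustrated with a host-only toy. This sketch is not RMM's logging_resource_adaptor: `toy_logging_adaptor` and `log_alloc_free` are hypothetical, `malloc`/`free` stand in for the upstream resource, an integer stands in for the stream, and the header row and the action labels "allocate" and "free" are assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>

// Toy adaptor: forwards to malloc/free and appends one CSV row per event.
class toy_logging_adaptor {
 public:
  explicit toy_logging_adaptor(std::ostream& out) : out_(out) {
    out_ << "Thread,Time,Action,Pointer,Size,Stream\n";
  }

  void* allocate(std::size_t bytes, int stream) {
    void* p = std::malloc(bytes);
    assert(p != nullptr);
    log("allocate", p, bytes, stream);
    return p;
  }

  void deallocate(void* p, std::size_t bytes, int stream) {
    log("free", p, bytes, stream);
    std::free(p);
  }

 private:
  // Writes one row in the column order listed in the text.
  void log(const char* action, void* p, std::size_t bytes, int stream) {
    auto const now = std::chrono::steady_clock::now().time_since_epoch().count();
    out_ << std::this_thread::get_id() << ',' << now << ',' << action << ',' << p << ','
         << bytes << ',' << stream << '\n';
  }

  std::ostream& out_;
};

// Performs one allocation and one deallocation and returns the resulting CSV text.
std::string log_alloc_free() {
  std::ostringstream csv;
  toy_logging_adaptor mr{csv};
  void* p = mr.allocate(64, 0);
  mr.deallocate(p, 64, 0);
  return csv.str();
}
```

Because the adaptor only wraps calls, the logged program's allocation behavior is unchanged, which is what makes a log of this kind suitable for later replay.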
The CSV output files of the logging_resource_adaptor can be used as input to REPLAY_BENCHMARK,
which is available when building RMM from source, in the gbenchmarks folder in the build directory.
This log replayer can be useful for profiling and debugging allocator issues.
In C++, a logging version of a cuda_memory_resource that outputs the log to the file
“logs/test1.csv” is created by constructing a logging_resource_adaptor from a pointer to the
cuda_memory_resource and that file name.
If a file name is not specified, the environment variable RMM_LOG_FILE is queried for the file
name. If RMM_LOG_FILE is not set, then an exception is thrown by the logging_resource_adaptor
constructor.
In Python, memory event logging is enabled when the logging parameter of rmm.reinitialize() is
set to True. The log file name can be set using the log_file_name parameter. See
help(rmm.reinitialize) for full details.
Debug Logging
RMM includes a debug logger which can be enabled to log trace and debug information to a file. This
information can show when errors occur, when additional memory is allocated from upstream resources,
etc. The default log file is rmm_log.txt in the current working directory, but the environment
variable RMM_DEBUG_LOG_FILE can be set to specify the path and file name.
There is a CMake configuration variable RMM_LOGGING_LEVEL, which can be set to enable compilation
of more detailed logging. The default is INFO. Available levels are TRACE, DEBUG, INFO,
WARN, ERROR, CRITICAL and OFF.
Note that to see logging below the INFO level, the C++ application must also call
rmm::logger().set_level(), e.g. to enable all levels of logging down to TRACE, call
rmm::logger().set_level(spdlog::level::trace) (and compile with -DRMM_LOGGING_LEVEL=TRACE).
Note that debug logging is different from the CSV memory allocation logging provided by
rmm::mr::logging_resource_adaptor. The latter is for logging a history of allocation /
deallocation actions which can be useful for replay with RMM’s replay benchmark.
RMM and CUDA Memory Bounds Checking
Memory allocations taken from a memory resource that allocates a pool of memory (such as
pool_memory_resource and arena_memory_resource) are part of the same low-level CUDA memory
allocation. Therefore, out-of-bounds or misaligned accesses to these allocations are not likely to
be detected by CUDA tools such as
CUDA Compute Sanitizer memcheck.
Exceptions to this are cuda_memory_resource, which wraps cudaMalloc, and
cuda_async_memory_resource, which uses cudaMallocAsync with CUDA’s built-in memory pool
functionality (CUDA 11.2 or later required). Illegal memory accesses to memory allocated by these
resources are detectable with Compute Sanitizer Memcheck.
It may be possible in the future to add support for memory bounds checking with other memory
resources using NVTX APIs.
Using RMM in Python Code
There are two ways to use RMM in Python code:
Using the rmm.DeviceBuffer API to explicitly create and manage
device memory allocations
Transparently via external libraries such as CuPy and Numba
RMM provides a MemoryResource abstraction to control how device
memory is allocated in both the above uses.
DeviceBuffers
A DeviceBuffer represents an untyped, uninitialized device memory
allocation. DeviceBuffers can be created by providing the
size of the allocation in bytes, for example rmm.DeviceBuffer(size=100).
MemoryResource objects are used to configure how device memory allocations are made by
RMM.
By default if a MemoryResource is not set explicitly, RMM uses the CudaMemoryResource, which
uses cudaMalloc for allocating device memory.
rmm.reinitialize() provides an easy way to initialize RMM with specific memory resource options
across multiple devices. See help(rmm.reinitialize) for full details.
For lower-level control, the rmm.mr.set_current_device_resource() function can be
used to set a different MemoryResource for the current CUDA device. For
example, calling rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource()) tells RMM to
use cudaMallocManaged instead of cudaMalloc for allocating memory.
The default resource must be set for any device before
allocating any device memory on that device. Setting or changing the
resource after device allocations have been made can lead to unexpected
behavior or crashes. See Multiple Devices.
As another example, PoolMemoryResource allows you to allocate a
large “pool” of device memory up-front. Subsequent allocations will
draw from this pool of already allocated memory. For instance, a
PoolMemoryResource with an initial size of 1 GiB and a maximum size of 4 GiB
can be constructed by passing rmm.mr.CudaMemoryResource() as its underlying
(“upstream”) memory resource together with initial_pool_size=2**30 and
maximum_pool_size=2**32.
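Using RMM with CuPy
You can configure CuPy to use RMM for memory allocations by setting the CuPy CUDA allocator to
rmm_cupy_allocator.
Note: This only configures CuPy to use the current RMM resource for allocations.
It does not initialize nor change the current resource, e.g., enabling a memory pool.
See the MemoryResource discussion above for more information on changing the current memory resource.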
Using RMM with Numba
You can configure Numba to use RMM for memory allocations using the
Numba EMM Plugin.
This can be done in two ways:
Setting the environment variable NUMBA_CUDA_MEMORY_MANAGER:
$ NUMBA_CUDA_MEMORY_MANAGER=rmm python (args)
Using the set_memory_manager() function provided by Numba:
>>> from numba import cuda
>>> import rmm
>>> cuda.set_memory_manager(rmm.RMMNumbaManager)
Note: This only configures Numba to use the current RMM resource for allocations.
It does not initialize nor change the current resource, e.g., enabling a memory pool.
See the MemoryResource discussion above for more information on changing the current memory resource.
Using RMM with PyTorch
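PyTorch can use RMM
for memory allocation. For example, to configure PyTorch to use an
RMM-managed pool, call rmm.reinitialize(pool_allocator=True) and then
torch.cuda.change_current_allocator(rmm.rmm_torch_allocator).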
PyTorch and RMM will now share the same memory pool.
You can, of course, use a custom memory resource with PyTorch as well:
import rmm
import torch
# note that you can configure PyTorch to use RMM either before or
# after changing RMM's memory resource. PyTorch will use whatever
# memory resource is configured to be the "current" memory resource at
# the time of allocation.
torch.cuda.change_current_allocator(rmm.rmm_torch_allocator)
# configure RMM to use a managed memory resource, wrapped with a
# statistics resource adaptor that can report information about the
# amount of memory allocated:
mr = rmm.mr.StatisticsResourceAdaptor(rmm.mr.ManagedMemoryResource())
rmm.mr.set_current_device_resource(mr)
x = torch.tensor([1, 2]).cuda()
# the memory resource reports information about PyTorch allocations:
mr.allocation_counts
Out[6]:
{'current_bytes': 16,
'current_count': 1,
'peak_bytes': 16,
'peak_count': 1,
'total_bytes': 16,
'total_count': 1}
NOTE: For the latest stable README.md ensure you are on the main branch.
For information on the interface RMM provides and how to use RMM in your C++ code, see Using RMM in C++.
For a walkthrough about the design of the RAPIDS Memory Manager, read Fast, Flexible Allocation for NVIDIA CUDA with RAPIDS Memory Manager on the NVIDIA Developer Blog.
Installation
Conda
RMM can be installed with Conda (miniconda, or the full Anaconda distribution) from the
rapidsai channel.
We also provide nightly Conda packages built from the HEAD of our latest development branch.
Note: RMM is supported only on Linux, and only tested with Python versions 3.8 and 3.10.
Note: The RMM package from Conda requires building with GCC 9 or later. Otherwise, your application may fail to build.
See the Get RAPIDS version picker for more OS and version info.
Building from Source
Get RMM Dependencies
Compiler requirements:
gcc version 9.3+
nvcc version 11.2+
cmake version 3.23.1+
CUDA/GPU requirements:
You can obtain CUDA from https://developer.nvidia.com/cuda-downloads
Python requirements:
scikit-build
cuda-python
cython
For more details, see pyproject.toml
rmm::mr::device_memory_resourcefor device memory allocationrmm::mr::host_memory_resourcefor host memory allocationThese classes are based on the
std::pmr::memory_resourceinterface class introduced in C++17 for polymorphic memory allocation.device_memory_resourcermm::mr::device_memory_resourceis the base class that defines the interface for allocating and freeing device memory.It has two key functions:
void* device_memory_resource::allocate(std::size_t bytes, cuda_stream_view s)bytesbytes.void device_memory_resource::deallocate(void* p, std::size_t bytes, cuda_stream_view s)bytespointed to byp.pmust have been returned by a previous call toallocate(bytes), otherwise behavior is undefinedIt is up to a derived class to provide implementations of these functions. See available resources for example
device_memory_resourcederived classes.Unlike
std::pmr::memory_resource,rmm::mr::device_memory_resourcedoes not allow specifying an alignment argument. All allocations are required to be aligned to at least 256B. Furthermore,device_memory_resourceadds an additionalcuda_stream_viewargument to allow specifying the stream on which to perform the (de)allocation.cuda_stream_viewandcuda_streamrmm::cuda_stream_viewis a simple non-owning wrapper around a CUDAcudaStream_t. This wrapper’s purpose is to provide strong type safety for stream types. (cudaStream_tis an alias for a pointer, which can lead to ambiguity in APIs when it is assigned0.) All RMM stream-ordered APIs take armm::cuda_stream_viewargument.rmm::cuda_streamis a simple owning wrapper around a CUDAcudaStream_t. This class provides RAII semantics (constructor creates the CUDA stream, destructor destroys it). Anrmm::cuda_streamcan never represent the CUDA default stream or per-thread default stream; it only ever represents a single non-default stream.rmm::cuda_streamcannot be copied, but can be moved.cuda_stream_poolrmm::cuda_stream_poolprovides fast access to a pool of CUDA streams. This class can be used to create a set ofcuda_streamobjects whose lifetime is equal to thecuda_stream_pool. Using the stream pool can be faster than creating the streams on the fly. The size of the pool is configurable. Depending on this size, multiple calls tocuda_stream_pool::get_stream()may return instances ofrmm::cuda_stream_viewthat represent identical CUDA streams.Thread Safety
All current device memory resources are thread safe unless documented otherwise. More specifically, calls to memory resource
allocate()anddeallocate()methods are safe with respect to calls to either of these functions from other threads. They are not thread safe with respect to construction and destruction of the memory resource object.Note that a class
thread_safe_resource_adapteris provided which can be used to adapt a memory resource that is not thread safe to be thread safe (as described above). This adapter is not needed with any current RMM device memory resources.Stream-ordered Memory Allocation
rmm::mr::device_memory_resourceis a base class that provides stream-ordered memory allocation. This allows optimizations such as re-using memory deallocated on the same stream without the overhead of synchronization.A call to
device_memory_resource::allocate(bytes, stream_a)returns a pointer that is valid to use onstream_a. Using the memory on a different stream (saystream_b) is Undefined Behavior unless the two streams are first synchronized, for example by usingcudaStreamSynchronize(stream_a)or by recording a CUDA event onstream_aand then callingcudaStreamWaitEvent(stream_b, event).The stream specified to
device_memory_resource::deallocateshould be a stream on which it is valid to use the deallocated memory immediately for another allocation. Typically this is the stream on which the allocation was last used before the call todeallocate. The passed stream may be used internally by adevice_memory_resourcefor managing available memory with minimal synchronization, and it may also be synchronized at a later time, for example using a call tocudaStreamSynchronize().For this reason, it is Undefined Behavior to destroy a CUDA stream that is passed to
device_memory_resource::deallocate. If the stream on which the allocation was last used has been destroyed before callingdeallocateor it is known that it will be destroyed, it is likely better to synchronize the stream (before destroying it) and then pass a different stream todeallocate(e.g. the default stream).Note that device memory data structures such as
rmm::device_bufferandrmm::device_uvectorfollow these stream-ordered memory allocation semantics and rules.For further information about stream-ordered memory allocation semantics, read Using the NVIDIA CUDA Stream-Ordered Memory Allocator on the NVIDIA Developer Blog.
Available Resources
RMM provides several device_memory_resource derived classes to satisfy various user requirements. For more detailed information about these resources, see their respective documentation.
cuda_memory_resource
Allocates and frees device memory using cudaMalloc and cudaFree.
managed_memory_resource
Allocates and frees device memory using cudaMallocManaged and cudaFree.
Note that managed_memory_resource cannot be used with NVIDIA Virtual GPU Software (vGPU, for use with virtual machines or hypervisors) because NVIDIA CUDA Unified Memory is not supported by NVIDIA vGPU.
pool_memory_resource
A coalescing, best-fit pool sub-allocator.
fixed_size_memory_resource
A memory resource that can only allocate a single fixed size. Average allocation and deallocation cost is constant.
binning_memory_resource
Configurable to use multiple upstream memory resources for allocations that fall within different bin sizes. Often configured with multiple bins backed by fixed_size_memory_resources and a single pool_memory_resource for allocations larger than the largest bin size.
Default Resources and Per-device Resources
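RMM users commonly need to configure a device_memory_resource object to use for all allocations where another resource has not explicitly been provided. A common example is configuring a pool_memory_resource to use for all allocations to get fast dynamic allocation.
To enable this use case, RMM provides the concept of a “default” device_memory_resource. This resource is used when another is not explicitly provided.
Accessing and modifying the default resource is done through two functions:
device_memory_resource* get_current_device_resource()
Returns a pointer to the default resource for the current CUDA device. The initial default memory resource is an instance of cuda_memory_resource. This function is thread safe with respect to concurrent calls to it and set_current_device_resource(). For more explicit control, you can use get_per_device_resource(), which takes a device ID.
device_memory_resource* set_current_device_resource(device_memory_resource* new_mr)
Updates the default memory resource pointer for the current CUDA device to new_mr and returns the previous default resource pointer. If new_mr is nullptr, then resets the default resource to cuda_memory_resource. This function is thread safe with respect to concurrent calls to it and get_current_device_resource(). For more explicit control, you can use set_per_device_resource(), which takes a device ID.
Example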
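The get/set behavior of the default resource can be modeled in a few lines of Python. This is a hedged sketch, not the RMM API: ToyResource, the module-level registry, and the _current_device variable are hypothetical stand-ins, None plays the role of nullptr, and the assumption that the setter hands back the previously set resource is inferred from the function's pointer return type.

```python
class ToyResource:
    """Hypothetical stand-in for a device_memory_resource."""

    def __init__(self, name):
        self.name = name


_INITIAL = ToyResource("cuda_memory_resource")  # initial default for every device
_per_device = {}                                # device ID -> resource set by the user
_current_device = 0                             # stand-in for the active CUDA device


def get_per_device_resource(device_id):
    # Devices with no explicit setting fall back to the initial default.
    return _per_device.get(device_id, _INITIAL)


def set_per_device_resource(device_id, new_mr):
    previous = get_per_device_resource(device_id)
    # Passing None (modeling nullptr) resets the device to the initial default.
    _per_device[device_id] = _INITIAL if new_mr is None else new_mr
    return previous


def get_current_device_resource():
    return get_per_device_resource(_current_device)


def set_current_device_resource(new_mr):
    return set_per_device_resource(_current_device, new_mr)


pool_mr = ToyResource("pool_memory_resource")
previous = set_current_device_resource(pool_mr)
print(previous.name, get_current_device_resource().name)
```

Note that setting the resource for the current device leaves other devices untouched, which is why the per-device variants take an explicit device ID.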
Multiple Devices
A device_memory_resource should only be used when the active CUDA device is the same device that was active when the device_memory_resource was created. Otherwise behavior is undefined.
If a device_memory_resource is used with a stream associated with a different CUDA device than the device for which the memory resource was created, behavior is undefined.
Creating a device_memory_resource for each device requires care to set the current device before creating each resource, and to maintain the lifetime of the resources as long as they are set as per-device resources. Here is an example loop that creates unique_ptrs to pool_memory_resource objects for each device and sets them as the per-device resource for that device.
Allocators
C++ interfaces commonly allow customizable memory allocation through an Allocator object. RMM provides several Allocator and Allocator-like classes.
polymorphic_allocator
A stream-ordered allocator similar to std::pmr::polymorphic_allocator. Unlike the standard C++ Allocator interface, the allocate and deallocate functions take a cuda_stream_view indicating the stream on which the (de)allocation occurs.
stream_allocator_adaptor
stream_allocator_adaptor can be used to adapt a stream-ordered allocator to present a standard Allocator interface to consumers that may not be designed to work with a stream-ordered interface.
Example:
thrust_allocator
thrust_allocator is a device memory allocator that uses the strongly typed thrust::device_ptr, making it usable with containers like thrust::device_vector.
See below for more information on using RMM with Thrust.
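Returning to the adaptation that stream_allocator_adaptor performs: one way to picture it is as fixing the stream once, so that consumers can call allocate and deallocate without naming a stream. The Python sketch below illustrates that idea only. Both classes are hypothetical, host bytearrays stand in for device memory, and the assumption that the stream is bound when the adaptor is created is not confirmed by the text above.

```python
class ToyStreamOrderedAllocator:
    """Stream-ordered interface: every call names the stream it is ordered on."""

    def __init__(self):
        self.calls = []  # record of (operation, size, stream) for inspection

    def allocate(self, n, stream):
        self.calls.append(("allocate", n, stream))
        return bytearray(n)  # host bytes standing in for device memory

    def deallocate(self, block, n, stream):
        self.calls.append(("deallocate", n, stream))


class ToyStreamAdaptor:
    """Standard-style interface: the stream is fixed when the adaptor is created."""

    def __init__(self, allocator, stream):
        self._allocator = allocator
        self._stream = stream

    def allocate(self, n):
        # Forward to the stream-ordered allocator using the bound stream.
        return self._allocator.allocate(n, self._stream)

    def deallocate(self, block, n):
        self._allocator.deallocate(block, n, self._stream)


base = ToyStreamOrderedAllocator()
adapted = ToyStreamAdaptor(base, stream="stream_a")
block = adapted.allocate(32)
adapted.deallocate(block, 32)
print(base.calls)
```

A consumer written against the two-argument-free interface never sees the stream, yet every call still reaches the underlying allocator in stream order.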
Device Data Structures
device_buffer
An untyped, uninitialized RAII class for stream-ordered device memory allocation.
Example
device_uvector<T>
A typed, uninitialized RAII class for allocation of a contiguous set of elements in device memory. Similar to a thrust::device_vector, but as an optimization, does not default initialize the contained elements. This optimization restricts the types T to trivially copyable types.
Example
device_scalar
A typed, RAII class for allocation of a single element in device memory. This is similar to a device_uvector with a single element, but provides convenience functions like modifying the value in device memory from the host, or retrieving the value from device to host.
Example
host_memory_resource
rmm::mr::host_memory_resource is the base class that defines the interface for allocating and freeing host memory.
Similar to device_memory_resource, it has two key functions for (de)allocation:
void* host_memory_resource::allocate(std::size_t bytes, std::size_t alignment)
Allocates memory of size bytes, aligned to the specified alignment.
void host_memory_resource::deallocate(void* p, std::size_t bytes, std::size_t alignment)
Reclaims a previous allocation of size bytes pointed to by p.
Unlike device_memory_resource, the host_memory_resource interface and behavior are identical to std::pmr::memory_resource.
Available Resources
new_delete_resource
Uses the global operator new and operator delete to allocate host memory.
pinned_memory_resource
Allocates “pinned” host memory using cuda(Malloc/Free)Host.
Host Data Structures
RMM does not currently provide any data structures that interface with host_memory_resource. In the future, RMM will provide a host-side structure similar to device_buffer and an allocator that can be used with STL containers.
Using RMM with Thrust
RAPIDS and other CUDA libraries make heavy use of Thrust. Thrust uses CUDA device memory in two situations: as the backing store for containers like thrust::device_vector, and as temporary storage inside some algorithms, such as thrust::sort.
RMM provides rmm::mr::thrust_allocator as a conforming Thrust allocator that uses device_memory_resources.
Thrust Algorithms
To instruct a Thrust algorithm to use rmm::mr::thrust_allocator to allocate temporary storage, you can use the custom Thrust CUDA device execution policy: rmm::exec_policy(stream).
The first stream argument is the stream to use for rmm::mr::thrust_allocator. The second stream argument is what should be used to execute the Thrust algorithm. These two arguments must be identical.
Logging
RMM includes two forms of logging: memory event logging and debug logging.
Memory Event Logging and logging_resource_adaptor
Memory event logging writes details of every allocation or deallocation to a CSV (comma-separated value) file. In C++, memory event logging is enabled by using the logging_resource_adaptor as a wrapper around any other device_memory_resource object.
Each row in the log represents either an allocation or a deallocation. The columns of the file are “Thread, Time, Action, Pointer, Size, Stream”.
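Because the log is plain CSV with the columns listed above, it can be post-processed with standard tools. The following Python sketch computes the peak number of outstanding bytes and any allocations never freed. It rests on assumptions not stated above: that the file starts with a header row naming the columns, that the Action values are "allocate" and "free", and that the sample rows (timestamps, pointers, stream values) resemble real output. The summarize function and SAMPLE_LOG are hypothetical.

```python
import csv
import io

# Hypothetical sample in the documented column order; real values may look different.
SAMPLE_LOG = """Thread,Time,Action,Pointer,Size,Stream
1,10:00:00.000001,allocate,0x7f0000000000,1024,0x0
1,10:00:00.000002,allocate,0x7f0000000400,2048,0x0
2,10:00:00.000003,free,0x7f0000000000,1024,0x0
1,10:00:00.000004,allocate,0x7f0000000c00,512,0x0
1,10:00:00.000005,free,0x7f0000000400,2048,0x0
"""


def summarize(log_text):
    """Replay the log in order and track outstanding bytes."""
    live = {}            # pointer -> size of allocations not yet freed
    current = peak = 0
    for row in csv.DictReader(io.StringIO(log_text)):
        if row["Action"] == "allocate":
            size = int(row["Size"])
            live[row["Pointer"]] = size
            current += size
        elif row["Action"] == "free":
            # Use the recorded allocation size; ignore frees of unknown pointers.
            current -= live.pop(row["Pointer"], 0)
        peak = max(peak, current)
    return {
        "peak_bytes": peak,
        "leaked_bytes": sum(live.values()),
        "live_pointers": sorted(live),
    }


summary = summarize(SAMPLE_LOG)
print(summary)
```

To analyze a real log, the same function could be fed the contents of the CSV file after checking that its header and Action values match these assumptions.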
The CSV output files of the logging_resource_adaptor can be used as input to REPLAY_BENCHMARK, which is available when building RMM from source, in the gbenchmarks folder in the build directory. This log replayer can be useful for profiling and debugging allocator issues.
The following C++ example creates a logging version of a cuda_memory_resource that outputs the log to the file “logs/test1.csv”.
If a file name is not specified, the environment variable RMM_LOG_FILE is queried for the file name. If RMM_LOG_FILE is not set, then an exception is thrown by the logging_resource_adaptor constructor.
In Python, memory event logging is enabled when the logging parameter of rmm.reinitialize() is set to True. The log file name can be set using the log_file_name parameter. See help(rmm.reinitialize) for full details.
Debug Logging
RMM includes a debug logger which can be enabled to log trace and debug information to a file. This information can show when errors occur, when additional memory is allocated from upstream resources, etc. The default log file is rmm_log.txt in the current working directory, but the environment variable RMM_DEBUG_LOG_FILE can be set to specify the path and file name.
There is a CMake configuration variable RMM_LOGGING_LEVEL, which can be set to enable compilation of more detailed logging. The default is INFO. Available levels are TRACE, DEBUG, INFO, WARN, ERROR, CRITICAL and OFF.
The log relies on the spdlog library.
Note that to see logging below the INFO level, the C++ application must also call rmm::logger().set_level(). For example, to enable all levels of logging down to TRACE, call rmm::logger().set_level(spdlog::level::trace) (and compile with -DRMM_LOGGING_LEVEL=TRACE).
Note that debug logging is different from the CSV memory allocation logging provided by rmm::mr::logging_resource_adaptor. The latter is for logging a history of allocation / deallocation actions which can be useful for replay with RMM’s replay benchmark.
RMM and CUDA Memory Bounds Checking
Memory allocations taken from a memory resource that allocates a pool of memory (such as pool_memory_resource and arena_memory_resource) are part of the same low-level CUDA memory allocation. Therefore, out-of-bounds or misaligned accesses to these allocations are not likely to be detected by CUDA tools such as CUDA Compute Sanitizer memcheck.
Exceptions to this are cuda_memory_resource, which wraps cudaMalloc, and cuda_async_memory_resource, which uses cudaMallocAsync with CUDA’s built-in memory pool functionality (CUDA 11.2 or later required). Illegal memory accesses to memory allocated by these resources are detectable with Compute Sanitizer Memcheck.
It may be possible in the future to add support for memory bounds checking with other memory resources using NVTX APIs.
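Why pooled allocations hide out-of-bounds accesses can be shown with a host-side analogy in Python. This is a sketch only: ToyPool is hypothetical, a bytearray stands in for the single low-level allocation, and the bounds check in write() merely imitates a tool that knows about the upstream allocation but not about the sub-allocations carved out of it.

```python
class ToyPool:
    """Bump sub-allocator carving offsets out of one upstream allocation."""

    def __init__(self, size):
        self.storage = bytearray(size)  # stands in for a single large upstream allocation
        self._next = 0

    def allocate(self, nbytes):
        if self._next + nbytes > len(self.storage):
            raise MemoryError("toy pool exhausted")
        offset = self._next
        self._next += nbytes
        return offset

    def write(self, offset, data):
        # Validates against the bounds of the whole pool, not of the sub-allocation.
        if offset < 0 or offset + len(data) > len(self.storage):
            raise IndexError("access outside the upstream allocation")
        self.storage[offset:offset + len(data)] = data


pool = ToyPool(64)
a = pool.allocate(16)
b = pool.allocate(16)

pool.write(a, b"\xff" * 17)      # one byte past the end of `a`: not detected
neighbor_corrupted = pool.storage[b] == 0xFF

try:
    pool.write(60, b"\xff" * 8)  # past the end of the pool itself: detected
    outside_detected = False
except IndexError:
    outside_detected = True

print(neighbor_corrupted, outside_detected)
```

The overflow from the first sub-allocation silently lands in its neighbor because both live inside the same upstream allocation; only an access that leaves the pool entirely is caught.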
Using RMM in Python Code
There are two ways to use RMM in Python code:
Using the rmm.DeviceBuffer API to explicitly create and manage device memory allocations
Transparently, via external libraries such as CuPy and Numba
RMM provides a MemoryResource abstraction to control how device memory is allocated in both the above uses.
DeviceBuffers
A DeviceBuffer represents an untyped, uninitialized device memory allocation. DeviceBuffers can be created by providing the size of the allocation in bytes:
The size of the allocation and the memory address associated with it can be accessed via the .size and .ptr attributes respectively:
DeviceBuffers can also be created by copying data from host memory:
Conversely, the data underlying a DeviceBuffer can be copied to the host:
MemoryResource objects
MemoryResource objects are used to configure how device memory allocations are made by RMM.
By default, if a MemoryResource is not set explicitly, RMM uses the CudaMemoryResource, which uses cudaMalloc for allocating device memory.
rmm.reinitialize() provides an easy way to initialize RMM with specific memory resource options across multiple devices. See help(rmm.reinitialize) for full details.
For lower-level control, the rmm.mr.set_current_device_resource() function can be used to set a different MemoryResource for the current CUDA device. For example, enabling the ManagedMemoryResource tells RMM to use cudaMallocManaged instead of cudaMalloc for allocating memory:
As another example, PoolMemoryResource allows you to allocate a large “pool” of device memory up-front. Subsequent allocations will draw from this pool of already allocated memory. The example below shows how to construct a PoolMemoryResource with an initial size of 1 GiB and a maximum size of 4 GiB. The pool uses CudaMemoryResource as its underlying (“upstream”) memory resource:
Other MemoryResources include:
FixedSizeMemoryResource for allocating fixed blocks of memory
BinningMemoryResource for allocating blocks within specified “bin” sizes from different memory resources
MemoryResources are highly configurable and can be composed together in different ways. See help(rmm.mr) for more information.
Using RMM with CuPy
You can configure CuPy to use RMM for memory allocations by setting the CuPy CUDA allocator to rmm_cupy_allocator:
Note: This only configures CuPy to use the current RMM resource for allocations. It does not initialize or change the current resource (for example, by enabling a memory pool). See the “MemoryResource objects” section above for more information on changing the current memory resource.
Using RMM with Numba
You can configure Numba to use RMM for memory allocations using the Numba EMM Plugin.
This can be done in two ways:
Setting the environment variable NUMBA_CUDA_MEMORY_MANAGER:
Using the set_memory_manager() function provided by Numba:
Note: This only configures Numba to use the current RMM resource for allocations. It does not initialize or change the current resource (for example, by enabling a memory pool). See the “MemoryResource objects” section above for more information on changing the current memory resource.
Using RMM with PyTorch
PyTorch can use RMM for memory allocation. For example, to configure PyTorch to use an RMM-managed pool:
PyTorch and RMM will now share the same memory pool.
You can, of course, use a custom memory resource with PyTorch as well: