Fast, Flexible Allocation for NVIDIA CUDA with RAPIDS Memory Manager

Originally published at: https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/

When I joined the RAPIDS team in 2018, NVIDIA CUDA device memory allocation was a performance problem. RAPIDS cuDF allocates and deallocates memory at high frequency, because its APIs generally create new Series and DataFrames rather than modifying them in place. The overhead of cudaMalloc and synchronization of cudaFree was holding RAPIDS back. My first…

Dear readers: RAPIDS Memory Manager is not just for RAPIDS – we designed it to be a general, flexible framework for efficient memory management in CUDA applications. One of my favorite things about the project is the way it has been adapted for use by other CUDA-accelerated libraries and applications, such as Numba and CuPy, as well as all of the RAPIDS libraries.

We look forward to hearing your questions and comments on this post and on RMM!

Thanks,
Mark

Very interesting and informative article. Many thanks!!

A couple of questions, if I may.

  • Do you have statistics for the use of RMM with pinned memory? Performance/utilization/NUMA/preferred configuration?
  • What about big allocations (256 MB to 2 GB)? What would you recommend using?

thanks a lot
Eyal

Hi Eyal,

I’m not sure I perfectly follow your questions, but:

  • I don’t have any statistics with pinned memory. Currently we only have a pinned host_memory_resource, not a device_memory_resource (see the sketch after this list). This has been requested, though.
  • What is your specific question about large allocations? I recommend you use RMM for these as well. :)
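
For what it’s worth, here is a rough sketch of what using that pinned host_memory_resource looks like. This is my own minimal example, not an official snippet; the header path and the 64-byte alignment are illustrative.

```cpp
// Minimal sketch: allocate page-locked (pinned) host memory through RMM's
// host-side resource. The header path and 64-byte alignment are illustrative.
#include <rmm/mr/host/pinned_memory_resource.hpp>

#include <cstddef>

int main()
{
  rmm::mr::pinned_memory_resource pinned_mr;

  std::size_t const bytes = 1 << 20;        // 1 MiB of pinned host memory
  void* p = pinned_mr.allocate(bytes, 64);  // (bytes, alignment)

  // ... use the buffer, e.g. as the host side of cudaMemcpyAsync transfers ...

  pinned_mr.deallocate(p, bytes, 64);
  return 0;
}
```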

Hi @Mark_Harris,
Thanks a lot for the prompt answer.
In a previous project I was involved in, we used pinned memory instead of regular malloc’d host memory.
On an IBM machine with NVLink, we saw very nice speedups, up to about 50 GB/s from pinned host memory to device, if I recall correctly. Intel machines showed nice speedups as well.

We used CUB’s allocator with some changes to make use of pinned memory. We had a lot of obstacles, mainly because we had to deal with many different scenarios: the number of allocations, the variance between them (from 1 KB to 3 GB), NUMA issues, fragmentation, etc.

Since we were based on CUB’s implementation, allocation used a binning mechanism. However, we too reached the conclusion that a hybrid solution might work best.

What I asked was whether you have tested RMM with pinned memory and have insights into how well RMM’s pools/approach would work with pinned memory.

As for the large allocations: say we had to make 10 distinct 1 GB allocations. That takes time, causes fragmentation, NUMA issues, etc. But then the work that required those ten 1 GB buffers ended, and another workload came along that required 20 buffers of 512 MB, or 5 of 2 GB. What then? Do you split the bin? Split the allocation? Free and re-allocate (which would take tons of time)? And above all, NUMA was a pain in the …

Thanks
Eyal

Let me clarify terms first. By “pinned” you mean “pinned host” memory. (There is such a thing as pinned device memory – that’s what cudaMalloc allocates, and it is different from pageable memory as allocated using cudaMallocManaged.)
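
For concreteness, here is a tiny sketch (my own, just to illustrate the terminology) of the three allocation flavors involved:

```cpp
// Sketch of the three CUDA allocation flavors being distinguished above.
#include <cuda_runtime.h>
#include <cstddef>

int main()
{
  std::size_t const bytes = 1 << 20;
  void *device_p, *pinned_host_p, *managed_p;

  cudaMalloc(&device_p, bytes);           // device memory: resident on the GPU, not migratable
  cudaMallocHost(&pinned_host_p, bytes);  // pinned (page-locked) host memory
  cudaMallocManaged(&managed_p, bytes);   // managed memory: can migrate between host and device

  cudaFree(device_p);
  cudaFreeHost(pinned_host_p);
  cudaFree(managed_p);
  return 0;
}
```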

In any case, we have not yet provided a pool for pinned host memory because we didn’t want to have to rewrite our pool machinery. However, with the current interface design, we cannot reuse pool_memory_resource for host memory because it is a device_memory_resource.
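
To make the constraint concrete, here is a heavily simplified sketch (not the real RMM headers; details differ) of why the pool can’t wrap a host resource today: the two base interfaces don’t line up.

```cpp
// Heavily simplified sketch of the two base interfaces; the real RMM classes
// differ in detail, but the shape of the mismatch is the point.
#include <cuda_runtime.h>
#include <cstddef>

// Device-side resources allocate in stream order.
class device_memory_resource {
 public:
  virtual void* allocate(std::size_t bytes, cudaStream_t stream) = 0;
  virtual void deallocate(void* p, std::size_t bytes, cudaStream_t stream) = 0;
  virtual ~device_memory_resource() = default;
};

// Host-side resources allocate with an alignment; no stream is involved.
class host_memory_resource {
 public:
  virtual void* allocate(std::size_t bytes, std::size_t alignment) = 0;
  virtual void deallocate(void* p, std::size_t bytes, std::size_t alignment) = 0;
  virtual ~host_memory_resource() = default;
};

// The pool is written against the device interface, so a host_memory_resource
// cannot be plugged in as its Upstream without a shared base to build on.
template <typename Upstream>
class pool_memory_resource : public device_memory_resource { /* ... */ };
```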

But I have good news. We are, as I type this, beginning a redesign of our base MR interfaces to enable reuse of allocator machinery for different kinds of memory. Stay tuned.

Mark

Thanks @Mark_Harris. Yes I was referring to “pinned host” memory.
Looking forward to the new interfaces then :)

thanks
Eyal

Thanks for the nice post.
I have some questions.

  1. For rmm::device_vector or rmm::device_uvector, what is the best way to retrieve the data to host memory? It seems the value() function of rmm::device_scalar supports this, but the other two do not have it. Would cudaMemcpy be the only way?

  2. Can a vector of vectors (or a matrix-like data structure) be realized with RMM? I am afraid not, though…

Great questions!

  1. To retrieve the whole vector or part of it, you could use cudaMemcpy or cudaMemcpyAsync, or thrust::copy. For the latter, I think you’ll need to be careful to use the appropriate pointer types so that Thrust can figure out the memory kinds for the source and destination; see the sketch after this list. Copying to/from host is something I ran into recently, and I agree we could add better support for it.

  2. Like std::vector, RMM vector classes are not really designed to represent matrices. Technically rmm::device_vector is just a thrust::device_vector with an RMM allocator. And device_uvector only supports trivially copyable types, so it’s probably not possible to have a uvector of uvectors!
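
Here is a quick sketch of both copy-back options from answer 1. It is my own illustration rather than an official snippet; the stream handling and the thrust::device_pointer_cast are the parts to get right.

```cpp
// Two ways to copy an rmm::device_uvector back to a std::vector on the host.
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

#include <thrust/copy.h>
#include <thrust/device_ptr.h>

#include <cuda_runtime.h>

#include <vector>

// Option 1: raw async copy on the vector's stream, then synchronize.
std::vector<int> to_host_memcpy(rmm::device_uvector<int> const& d_vec,
                                rmm::cuda_stream_view stream)
{
  std::vector<int> h_vec(d_vec.size());
  cudaMemcpyAsync(h_vec.data(), d_vec.data(), d_vec.size() * sizeof(int),
                  cudaMemcpyDeviceToHost, stream.value());
  stream.synchronize();  // make sure the copy is done before h_vec is used
  return h_vec;
}

// Option 2: thrust::copy; wrapping the raw pointer in thrust::device_ptr tells
// Thrust that the source is device memory.
std::vector<int> to_host_thrust(rmm::device_uvector<int> const& d_vec)
{
  std::vector<int> h_vec(d_vec.size());
  auto first = thrust::device_pointer_cast(d_vec.data());
  thrust::copy(first, first + d_vec.size(), h_vec.begin());
  return h_vec;
}
```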

Thank you so much!