Fast, Flexible Allocation for NVIDIA CUDA with RAPIDS Memory Manager

Originally published at: https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/

When I joined the RAPIDS team in 2018, NVIDIA CUDA device memory allocation was a performance problem. RAPIDS cuDF allocates and deallocates memory at high frequency, because its APIs generally create new Series and DataFrames rather than modifying them in place. The overhead of cudaMalloc and synchronization of cudaFree was holding RAPIDS back. My first…

Dear readers: RAPIDS Memory Manager is not just for RAPIDS – we designed it to be a general, flexible framework for efficient memory management in CUDA applications. One of my favorite things about the project is the way it has been adapted for use by other CUDA-accelerated libraries and applications, such as Numba and CuPy, as well as all of the RAPIDS libraries.

We look forward to hearing your questions and comments on this post and on RMM!

Thanks,
Mark

Very interesting and informative article. Many thanks!!

A couple of questions, if I may.

  • Do you have statistics for the use of RMM with pinned memory? Performance/utilization/NUMA/preferred configuration?
  • What about big allocations (256 MB to 2 GB)? What would you recommend using?

thanks a lot
Eyal

Hi Eyal,

I’m not sure I perfectly follow your questions, but:

  • I don’t have any statistics with pinned memory. Currently we only have a pinned host_memory_resource, not a device_memory_resource (see the sketch after this list). This has been requested, though.
  • What is your specific question about large allocations? I recommend you use RMM for these as well. :)
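
For what it’s worth, here is a rough sketch of what using that pinned host_memory_resource looks like. This is my own minimal example, not an official snippet; the header path and the 64-byte alignment are illustrative.

```cpp
// Minimal sketch: allocate page-locked (pinned) host memory through RMM's
// host-side resource. The header path and 64-byte alignment are illustrative.
#include <rmm/mr/host/pinned_memory_resource.hpp>

#include <cstddef>

int main()
{
  rmm::mr::pinned_memory_resource pinned_mr;

  std::size_t const bytes = 1 << 20;        // 1 MiB of pinned host memory
  void* p = pinned_mr.allocate(bytes, 64);  // (bytes, alignment)

  // ... use the buffer, e.g. as the host side of cudaMemcpyAsync transfers ...

  pinned_mr.deallocate(p, bytes, 64);
  return 0;
}
```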

Hi @Mark_Harris,
Thanks a lot for the prompt answer.
In a previous project I was involved in, we used pinned memory instead of regular malloc’d host memory.
On an IBM machine with NVLink, we saw very nice speedups, up to about 50 GB/s from pinned host memory to device, if I recall correctly. Intel machines showed nice speedups as well.

We used CUB’s allocator with some changes to make use of pinned memory. We had a lot of obstacles, mainly because we had to deal with many different scenarios: the number of allocations, the variance between them (from 1 KB to 3 GB), NUMA issues, fragmentation, etc.

Since we were based on CUB’s implementation, allocation used a binning mechanism. However, we too reached the conclusion that a hybrid solution might work best.

What I asked was whether you have tested RMM with pinned memory and have insights into how well RMM’s pools/approach would work with pinned memory.

As for the large allocations: say we had to make 10 distinct 1 GB allocations. That takes time, causes fragmentation, NUMA issues, etc. But then the work that required those ten 1 GB buffers ended, and another workload came along that required 20 buffers of 512 MB, or 5 of 2 GB. What then? Do you split the bin? Split the allocation? Free and re-allocate (which would take tons of time)? And above all, NUMA was a pain in the …

Thanks
Eyal

Let me clarify terms first. By “pinned” you mean “pinned host” memory. (There is such a thing as pinned device memory – that’s what cudaMalloc allocates, and it is different from pageable memory as allocated using cudaMallocManaged.)
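
For concreteness, here is a tiny sketch (my own, just to illustrate the terminology) of the three allocation flavors involved:

```cpp
// Sketch of the three CUDA allocation flavors being distinguished above.
#include <cuda_runtime.h>
#include <cstddef>

int main()
{
  std::size_t const bytes = 1 << 20;
  void *device_p, *pinned_host_p, *managed_p;

  cudaMalloc(&device_p, bytes);           // device memory: resident on the GPU, not migratable
  cudaMallocHost(&pinned_host_p, bytes);  // pinned (page-locked) host memory
  cudaMallocManaged(&managed_p, bytes);   // managed memory: can migrate between host and device

  cudaFree(device_p);
  cudaFreeHost(pinned_host_p);
  cudaFree(managed_p);
  return 0;
}
```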

In any case, we have not yet provided a pool for pinned host memory because we didn’t want to have to rewrite our pool machinery. However, with the current interface design, we cannot reuse pool_memory_resource for host memory because it is a device_memory_resource.
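
To make the constraint concrete, here is a heavily simplified sketch (not the real RMM headers; details differ) of why the pool can’t wrap a host resource today: the two base interfaces don’t line up.

```cpp
// Heavily simplified sketch of the two base interfaces; the real RMM classes
// differ in detail, but the shape of the mismatch is the point.
#include <cuda_runtime.h>
#include <cstddef>

// Device-side resources allocate in stream order.
class device_memory_resource {
 public:
  virtual void* allocate(std::size_t bytes, cudaStream_t stream) = 0;
  virtual void deallocate(void* p, std::size_t bytes, cudaStream_t stream) = 0;
  virtual ~device_memory_resource() = default;
};

// Host-side resources allocate with an alignment; no stream is involved.
class host_memory_resource {
 public:
  virtual void* allocate(std::size_t bytes, std::size_t alignment) = 0;
  virtual void deallocate(void* p, std::size_t bytes, std::size_t alignment) = 0;
  virtual ~host_memory_resource() = default;
};

// The pool is written against the device interface, so a host_memory_resource
// cannot be plugged in as its Upstream without a shared base to build on.
template <typename Upstream>
class pool_memory_resource : public device_memory_resource { /* ... */ };
```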

But I have good news. We are, as I type this, beginning a redesign of our base MR interfaces to enable reuse of allocator machinery for different kinds of memory. Stay tuned.

Mark

Thanks @Mark_Harris. Yes I was referring to “pinned host” memory.
Looking forward to the new interfaces then :)

thanks
Eyal

Thanks for the nice post.
I have some questions.

  1. For rmm::device_vector or rmm::device_uvector, what is the best way to retrieve the data to host memory? It seems the value() function of rmm::device_scalar supports this, but the other two do not have it. Would cudaMemcpy be the only way?

  2. Can a vector of vectors (or a matrix-like data structure) be realized with RMM? I am afraid not, though…

Great questions!

  1. To retrieve the whole vector or part of it, you could use cudaMemcpy or cudaMemcpyAsync, or thrust::copy. For the latter, I think you’ll need to be careful to use the appropriate pointer types so that Thrust can figure out the memory kinds for the source and destination; see the sketch after this list. Copying to/from host is something I ran into recently, and I agree we could add better support for it.

  2. Like std::vector, RMM vector classes are not really designed to represent matrices. Technically rmm::device_vector is just a thrust::device_vector with an RMM allocator. And device_uvector only supports trivially copyable types, so it’s probably not possible to have a uvector of uvectors!
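
Here is a quick sketch of both copy-back options from answer 1. It is my own illustration rather than an official snippet; the stream handling and the thrust::device_pointer_cast are the parts to get right.

```cpp
// Two ways to copy an rmm::device_uvector back to a std::vector on the host.
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

#include <thrust/copy.h>
#include <thrust/device_ptr.h>

#include <cuda_runtime.h>

#include <vector>

// Option 1: raw async copy on the vector's stream, then synchronize.
std::vector<int> to_host_memcpy(rmm::device_uvector<int> const& d_vec,
                                rmm::cuda_stream_view stream)
{
  std::vector<int> h_vec(d_vec.size());
  cudaMemcpyAsync(h_vec.data(), d_vec.data(), d_vec.size() * sizeof(int),
                  cudaMemcpyDeviceToHost, stream.value());
  stream.synchronize();  // make sure the copy is done before h_vec is used
  return h_vec;
}

// Option 2: thrust::copy; wrapping the raw pointer in thrust::device_ptr tells
// Thrust that the source is device memory.
std::vector<int> to_host_thrust(rmm::device_uvector<int> const& d_vec)
{
  std::vector<int> h_vec(d_vec.size());
  auto first = thrust::device_pointer_cast(d_vec.data());
  thrust::copy(first, first + d_vec.size(), h_vec.begin());
  return h_vec;
}
```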

Thank you so much!