Introducing Low-Level GPU Virtual Memory Management

Hi, we are developing a Windows application using the Virtual Memory Management APIs along with the CU_MEM_HANDLE_TYPE_WIN32 shareable handle. We have followed the memMapIpcDrv code sample and have successfully accessed the same CUDA memory from separate processes.

The issue we are experiencing is that after any shared handles have been imported on the child process, the physical memory is not released until the child process exits.

Do you happen to know if this is a known issue or limitation?

We are making sure to unmap/release/close all device pointers, allocation handles, and shareable handles in both processes. All function calls return without errors. Please see the following summary of the sequence of API calls (error handling and minor details omitted for clarity); a fleshed-out sketch of the parent-side setup follows the summary.

Any help would be greatly appreciated. Thanks!

Parent process:

cuMemCreate(allocHandle)
cuMemExportToShareableHandle(shareableHandle, allocHandle)
cuMemAddressReserve(devicePtr)
cuMemMap(devicePtr)
cuMemSetAccess(devicePtr)
cuMemRelease(allocHandle)
DuplicateHandle(shareableHandle, childShareableHandle)

Child process:

cuMemImportFromShareableHandle(allocHandle, childShareableHandle)
cuMemAddressReserve(devicePtr)
cuMemMap(devicePtr)
cuMemSetAccess(devicePtr)
cuMemRelease(allocHandle)
...
cuMemUnmap(devicePtr)
cuMemAddressFree(devicePtr)
CloseHandle(childShareableHandle)

Parent process:

cuMemUnmap(devicePtr)
cuMemAddressFree(devicePtr)
CloseHandle(shareableHandle)
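
For completeness, here is roughly what our parent-side setup looks like in code. This is an illustrative sketch only: it assumes a single device, a 2 MiB allocation, a hypothetical CHECK macro for error handling, and a secAttr variable holding the SECURITY_ATTRIBUTES pointer prepared the way the memMapIpcDrv sample does.

#define CHECK(x) do { CUresult r = (x); if (r != CUDA_SUCCESS) { /* handle error */ } } while (0)

CUmemAllocationProp prop = {};
prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
prop.location.id = 0;                       // device ordinal (assumption: device 0)
prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_WIN32;
prop.win32HandleMetaData = secAttr;         // SECURITY_ATTRIBUTES*, set up as in memMapIpcDrv

size_t gran = 0;
CHECK(cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM));
size_t size = ((2ULL << 20) + gran - 1) / gran * gran;  // round 2 MiB up to granularity

CUmemGenericAllocationHandle allocHandle;
CHECK(cuMemCreate(&allocHandle, size, &prop, 0));

HANDLE shareableHandle = NULL;
CHECK(cuMemExportToShareableHandle(&shareableHandle, allocHandle, CU_MEM_HANDLE_TYPE_WIN32, 0));

CUdeviceptr devicePtr;
CHECK(cuMemAddressReserve(&devicePtr, size, 0, 0, 0));
CHECK(cuMemMap(devicePtr, size, 0, allocHandle, 0));

CUmemAccessDesc access = {};
access.location = prop.location;
access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
CHECK(cuMemSetAccess(devicePtr, size, &access, 1));

// The allocation handle can be released once mapped; the mapping keeps
// the physical memory alive until cuMemUnmap.
CHECK(cuMemRelease(allocHandle));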

Hi, I have been trying out some features of cuMemSetAccess. I find that if one device cannot peer-access another, cuMemSetAccess raises a CUDA error when I set access for that device. Is there more detailed information about the limitations of the driver API cuMemSetAccess?

Hi jearnshaw, welcome back to the forums!

It looks like everything you’re doing is correct. Could you post a reproducer of the problem? Do you see the same issue with the sample code, or just your application? I believe there used to be a bug in older drivers (I don’t remember which versions, sorry) that did have a leak in this path, but this should be fixed in newer drivers IIRC. Can you post which driver version you’re using?

Hi 182yzh,

I’m sorry, I’m not quite following your question. I believe you’re asking whether cuMemSetAccess can be used to enable access to memory physically located on a different device when peer access has not been enabled. As described in the blog post above, this is a primary use case for this API, but it still requires that the two communicating devices be peer-capable (i.e. cuDeviceCanAccessPeer, or the runtime equivalent cudaDeviceCanAccessPeer, returns true for the two devices). Memory physically present on one device cannot be accessed by another device without peer capability between the two, whatever the underlying backend may be (PCIe, NVLink, NVSwitch, etc.). Hope that answers your question; let me know if it doesn’t! Good luck!

Thanks for your quick reply, @Cory.Perry. Yes, I can reproduce the issue with the sample code as well.

I’ve just modified memMapIpcDrv in a forked cuda-samples here to demonstrate the issue.

The modifications increase the memory allocated to 2 GB and add some delays so the GPU memory usage can be inspected with nvidia-smi and Task Manager. Please see the following screenshots of the sample running. Our application reallocates memory, and the physical memory usage continues to increase until the child process exits.

The test machine is running CUDA 11.8 with nvidia driver 527.37 on Windows 10 Pro 22H2.

Thanks for your reply. From it I figured out that cuMemSetAccess requires peer capability between the devices (in other words, cuDeviceCanAccessPeer must return true). Thanks a lot!

@jearnshaw please file a support ticket following the instructions on our support site and we can look into diagnosing the issue as this is likely a bug. Here’s a link to the nvidia support site for your convenience:
NVIDIA Support Site


Hi @Cory.Perry, was the cause of @jearnshaw’s problem ever resolved? I am experiencing a similar issue with a straightforward POC in which I continuously allocate (cuMemCreate) and free memory (cuMemUnmap/cuMemRelease/cuMemAddressFree), and I see that the memory usage keeps growing indefinitely, as if the memory is never freed.

PS: I just updated my drivers to 535.86.05.
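
In rough pseudocode, the POC loop looks like this (simplified; size, prop, and access are set up as usual and error handling is omitted):

for (int i = 0; i < iterations; ++i) {
    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);
    cuMemMap(ptr, size, 0, handle, 0);
    cuMemSetAccess(ptr, size, &access, 1);

    // ... touch the memory ...

    cuMemUnmap(ptr, size);
    cuMemRelease(handle);
    cuMemAddressFree(ptr, size);
}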

@moises.jimenez Unfortunately I don’t have visibility into the exact support ticket that was filed. If you’re seeing the same issue with the latest release drivers available, I would file your own support ticket. It never hurts to have duplicate tickets; in fact, they help us better prioritize issues such as this so we can build a better product. Hope this helps!

Hello, @Cory.Perry
I have some questions about the cuMemAddressReserve API.

A detailed description of the question can be found in the post linked below.

As I worked through your sample code, I found that in cuvector.cpp line 267 you assume there is a chance that cuMemAddressReserve will not reserve the exact virtual address you request.

Can cuMemAddressReserve not guarantee the reservation of a specific virtual address?

Thx.

Hi @woosungkang, this is an excellent question!

The addr argument passed to cuMemAddressReserve is a hint, similar to the addr hint argument of POSIX mmap (without MAP_FIXED). If the specified address is already reserved by the driver or otherwise cannot be reserved for some reason, cuMemAddressReserve still tries to service the request by finding another suitable reservation rather than returning an error. In the case of the sample code, cuMemAddressReserve tries to extend the reservation by reserving the address range right after the previous cuMemAddressReserve’s range, but another allocation or reservation may already overlap the requested range. In that case a new reservation at an arbitrary address is returned, which isn’t what we want, so the sample code frees it and creates a new, larger reservation covering both the previous and the new memory allocations for the buffer.

As to why the address returned by cuMemAlloc is not reservable after calling cuMemFree, there can be a large number of reasons, most of which an application simply cannot control. For example, the address range returned by cuMemAlloc could internally have been part of a larger reservation managed by the default memory pool, which would not be accessible to cuMemAddressReserve. Either way, the fixed-address argument to cuMemAddressReserve is not guaranteed to be honored.
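
In other words, the grow pattern the sample uses looks roughly like this (illustrative names only: basePtr/reservedSize describe the current reservation, growSize the amount being added, and CHECK is shorthand for error handling):

CUdeviceptr newPtr = 0;
CUdeviceptr hint = basePtr + reservedSize;       // address just past the current range
CUresult status = cuMemAddressReserve(&newPtr, growSize, 0, hint, 0);

if (status != CUDA_SUCCESS || newPtr != hint) {
    // The hint was not honored: free whatever we got, then make one new
    // reservation big enough for the old and new allocations and remap
    // every allocation handle into it, as the Vector sample does.
    if (status == CUDA_SUCCESS)
        CHECK(cuMemAddressFree(newPtr, growSize));
    CHECK(cuMemAddressReserve(&newPtr, reservedSize + growSize, 0, 0, 0));
    // ... cuMemUnmap the old range, then cuMemMap each handle into newPtr ...
}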

Hope this helps!

Thanks for the fast reply!!
This makes it much clearer!

Thanks again.

Hello, @Cory.Perry
I have a problem when using GPUDirect RDMA. The details are as follows:

  1. I used cuMemCreate/cuMemAddressReserve/cuMemMap to get a device memory allocation.
  2. I then used ibv_reg_mr_iova2 to register the device memory allocated in step 1, and the error “bad address” occurred.

So I want to know whether device memory allocated by the low-level virtual memory management APIs supports GPUDirect RDMA. If it does, how should the memory be registered for GPUDirect RDMA?
Thanks!

Hi @2007303105,

Yes, the low-level virtual memory APIs do support GPUDirect RDMA, but unlike cuMemAlloc you must specifically request this feature in the CUmemAllocationProp structure. You can set the CUmemAllocationProp::allocFlags::gpuDirectRDMACapable flag, but please make sure to first check the device attributes CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED and CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED before attempting to do so. If these attributes return zero, specifying the flag will cause cuMemCreate to return CUDA_ERROR_INVALID_VALUE.
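
A minimal sketch of those checks, assuming dev holds the target device, size a properly rounded allocation size, and CHECK a shorthand for error handling:

int rdma = 0, rdmaVmm = 0;
CHECK(cuDeviceGetAttribute(&rdma, CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED, dev));
CHECK(cuDeviceGetAttribute(&rdmaVmm, CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED, dev));

CUmemAllocationProp prop = {};
prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
prop.location.id = dev;
if (rdma && rdmaVmm)
    prop.allocFlags.gpuDirectRDMACapable = 1;    // only request RDMA-capable memory when supported

CUmemGenericAllocationHandle handle;
CHECK(cuMemCreate(&handle, size, &prop, 0));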

Hope this helps!

Thanks for the reply. The GPUDirect RDMA issue is solved based on your suggestion.
Thanks again.

Hi Perry, are the low-level virtual memory APIs supported in WSL2?

@CodyYao So sorry for the delay in replying, I completely missed this message during the holidays! Yes, the virtual memory APIs are supported under WSL2! Please check the device attributes under WSL2 for the different features you may be interested in.
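
For example, a couple of the attributes worth checking (under WSL2 or anywhere else; dev is the device in question and CHECK is shorthand for error handling):

int vmm = 0, posixFd = 0;
CHECK(cuDeviceGetAttribute(&vmm, CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED, dev));
CHECK(cuDeviceGetAttribute(&posixFd, CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR_SUPPORTED, dev));
// Only use cuMemCreate/cuMemMap and file-descriptor-based shareable handles
// when the corresponding attribute is non-zero.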

Hi Perry,

Quick questions: Can I allocate CPU memory via cuMemCreate? If I can, will mapping the CPU memory into the GPU virtual address space work like mapped memory?

Follow-up questions:
I tried to allocate memory on the CPU side via cuMemCreate but wasn’t successful. I attempted to set the location parameter in the CUmemAllocationProp to CU_MEM_LOCATION_TYPE_HOST, HOST_NUMA, and MAX, but setting HOST and HOST_NUMA gave me compile errors as those are not defined.

When I set the enum manually to 0x2 (which I believe is the HOST value), cuMemCreate returns the CUDA_ERROR_INVALID_DEVICE error. FYI, the other parameter values give me the same error.

Am I doing something wrong? Or is this approach prohibited?

@woosungkang Hi and welcome to the developer forums!

Good news: cuMemCreate recently gained support for host memory allocation, I believe in CUDA version 12.0! As you guessed, the location type should be one of the following:

  • CU_MEM_LOCATION_TYPE_HOST
  • CU_MEM_LOCATION_TYPE_HOST_NUMA
  • CU_MEM_LOCATION_TYPE_HOST_NUMA_CURRENT

But with recent drivers, only CU_MEM_LOCATION_TYPE_HOST_NUMA is currently available for allocation. Your system must be NUMA-capable (i.e. have libnuma installed) for the allocation to work properly, and you need to specify the NUMA node you wish to allocate your memory on via the id field of the location structure. You can retrieve the best/closest NUMA node to a target device via the CU_DEVICE_ATTRIBUTE_HOST_NUMA_ID device attribute and pass it directly to the id field of the location structure in the property structure passed to cuMemCreate.

For mapping, when you eventually call the cuMemSetAccess API, you can specify CU_MEM_LOCATION_TYPE_HOST_NUMA (the id is ignored), which allows access to the underlying memory through the associated address from the host. To allow access through the associated address from a device, use the same method as for device-allocated memory (specify the location type as DEVICE and set the id field to the device index you wish to enable access for).
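
Putting that together, a minimal sketch of a host-NUMA allocation (assuming a CUDA 12.x toolkit and driver, a NUMA-capable system, dev as the device whose closest NUMA node we want, and CHECK as shorthand for error handling):

int numaId = 0;
CHECK(cuDeviceGetAttribute(&numaId, CU_DEVICE_ATTRIBUTE_HOST_NUMA_ID, dev));

CUmemAllocationProp prop = {};
prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA;
prop.location.id = numaId;                       // target NUMA node

size_t gran = 0;
CHECK(cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM));

CUmemGenericAllocationHandle handle;
CHECK(cuMemCreate(&handle, gran, &prop, 0));     // one granule, for illustration

CUdeviceptr ptr;
CHECK(cuMemAddressReserve(&ptr, gran, 0, 0, 0));
CHECK(cuMemMap(ptr, gran, 0, handle, 0));

// Grant access from the host and from device dev.
CUmemAccessDesc access[2] = {};
access[0].location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA;  // id is ignored here
access[0].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
access[1].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
access[1].location.id = dev;
access[1].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
CHECK(cuMemSetAccess(ptr, gran, access, 2));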

but setting HOST and HOST_NUMA gave me compile errors as those are not defined.

Please make sure that you are using a recent version (I believe 12.0+) of the CUDA Toolkit SDK with a recent version of cuda.h that should define these enumeration values, avoiding these compilation errors.

Hope this helps, let us know if you have any other questions!