Hi, We are developing a Windows application using the Virtual Memory Management APIs along with the CU_MEM_HANDLE_TYPE_WIN32 shareable handle. We have followed the memMapIpcDrv code sample and have successfully accessed the same CUDA memory on separate processes.
The issue we are experiencing is that after any shared handles have been imported on the child process, the physical memory is not released until the child process exits.
Do you happen to know if this is a known issue or limitation?
We are making sure to unmap/release/close all device-pointers/allocation-handles/shareable-handles on both processes. All function calls exit without errors. Please see the following summary for the sequence of the api calls (error handling and minor details omitted for clarity).
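In outline, the sequence looks roughly like the sketch below (illustrative only, based on the memMapIpcDrv sample rather than our exact code; the IPC transport of the handle, context setup, granularity padding, and error checks are omitted):

```cpp
#include <cuda.h>
#include <windows.h>

// Parent: create a physical allocation and export it as a Win32 shareable handle.
// secAttr is an LPSECURITYATTRIBUTES prepared elsewhere (required for WIN32 handles);
// size is assumed to be a multiple of the allocation granularity.
HANDLE exportAllocation(CUmemGenericAllocationHandle *allocHandle, size_t size,
                        int device, void *secAttr) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_WIN32;
    prop.win32HandleMetaData = secAttr;

    cuMemCreate(allocHandle, size, &prop, 0);

    HANDLE shareable = NULL;
    cuMemExportToShareableHandle(&shareable, *allocHandle,
                                 CU_MEM_HANDLE_TYPE_WIN32, 0);
    return shareable;  // sent to the child process via IPC
}

// Child: import, map, use, then tear everything down.
void importUseAndFree(HANDLE shareable, size_t size, int device) {
    CUmemGenericAllocationHandle imported;
    cuMemImportFromShareableHandle(&imported, (void *)shareable,
                                   CU_MEM_HANDLE_TYPE_WIN32);

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);
    cuMemMap(ptr, size, 0, imported, 0);

    CUmemAccessDesc access = {};
    access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access.location.id = device;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &access, 1);

    // ... kernels use ptr here ...

    // Teardown: all of these return CUDA_SUCCESS, yet the physical memory
    // appears to stay allocated until the child process exits.
    cuMemUnmap(ptr, size);
    cuMemRelease(imported);
    cuMemAddressFree(ptr, size);
    CloseHandle(shareable);
}
```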
Hi, I have been trying out cuMemSetAccess, and I find that if one device cannot peer-access another, cuMemSetAccess raises a CUDA error when I try to grant access to that device. Is there more detailed information about the limitations of the cuMemSetAccess driver API?
It looks like everything you're doing is correct… could you post a reproducer of the problem? Do you see the same issue with the sample code or just your application? I believe there used to be a bug in older drivers (I don't remember which versions, sorry) that did have a leak in this path, but this should be fixed in newer drivers IIRC. Can you post which driver version you're using?
I'm sorry, I'm not quite following your question. I believe you're asking whether cuMemSetAccess can be used to enable access to memory physically located on a different device that does not have peer access enabled. As described in the blog post above, this is a primary use case for this API, but it still requires that the two communicating devices be peer-capable (i.e. cuDeviceCanAccessPeer, or the runtime equivalent cudaDeviceCanAccessPeer, returns true for the two devices). Memory physically present on one device cannot be accessed by another device without peer capability between the two, whatever the underlying backend may be (PCI-E, NVLINK, NVSWITCH, etc.). Hope that answers your question; let me know if it doesn't! Good luck!
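For example (a minimal illustrative sketch, not taken from the blog post or samples), the check would look something like this before granting another device access:

```cpp
#include <cuda.h>

// Grant `peer` read/write access to memory physically resident on `owner`,
// but only if the two devices are peer-capable.
void grantPeerAccess(CUdeviceptr ptr, size_t size, CUdevice owner, CUdevice peer) {
    int canAccess = 0;
    cuDeviceCanAccessPeer(&canAccess, peer, owner);
    if (!canAccess) {
        // Not peer-capable: cuMemSetAccess for this device would fail.
        return;
    }
    CUmemAccessDesc desc = {};
    desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc.location.id   = peer;  // device index being granted access
    desc.flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &desc, 1);
}
```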
Thanks for your quick reply @Cory.Perry. Yes, I can reproduce the issue with the sample code as well.
I've just modified memMapIpcDrv in a forked cuda-samples repo here to demonstrate the issue.
The modifications increase the memory allocated to 2 GB and add delays so the GPU memory usage can be inspected with nvidia-smi and Task Manager. Please see the following screenshots of the sample running. Our application reallocates memory, and the physical memory usage continues to increase until the child process exits.
Thanks for your reply. It helped me realize that cuMemSetAccess requires peer capability between the devices (in other words, cuDeviceCanAccessPeer must return true). Thanks a lot!
@jearnshaw please file a support ticket following the instructions on our support site and we can look into diagnosing the issue, as this is likely a bug. Here's a link to the NVIDIA support site for your convenience: NVIDIA Support Site
Hi @Cory.Perry, was the cause of @jearnshaw's problem ever resolved? I am experiencing a similar issue with a straightforward POC in which I continuously allocate (cuMemCreate) and free memory (cuMemUnmap/cuMemRelease/cuMemAddressFree), and I see that the memory usage keeps growing indefinitely, as if the memory is never freed.
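For reference, the POC essentially does the following in a loop (simplified sketch, error handling omitted):

```cpp
#include <cuda.h>

// Repeatedly create, map, unmap and release a physical allocation.
// We expect reported GPU memory usage to return to its baseline after
// each iteration, but it keeps growing instead.
void allocFreeLoop(int device, size_t size, int iterations) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;

    // Round the requested size up to the allocation granularity.
    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t padded = ((size + gran - 1) / gran) * gran;

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

    for (int i = 0; i < iterations; ++i) {
        CUmemGenericAllocationHandle h;
        CUdeviceptr ptr;
        cuMemCreate(&h, padded, &prop, 0);
        cuMemAddressReserve(&ptr, padded, 0, 0, 0);
        cuMemMap(ptr, padded, 0, h, 0);
        cuMemSetAccess(ptr, padded, &access, 1);
        // ... touch the memory ...
        cuMemUnmap(ptr, padded);
        cuMemRelease(h);
        cuMemAddressFree(ptr, padded);
    }
}
```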
@moises.jimenez Unfortunately I don't have visibility into the exact support ticket that was filed. If you're seeing the same issue with the latest release drivers available, I would file your own support ticket. It never hurts to have duplicate tickets, and in fact it helps us better prioritize issues such as this in order to build a better product. Hope this helps!
Hello, @Cory.Perry
I have some questions about the cuMemAddressReserve API.
A detailed description of the question can be found in the post below.
As I worked through your sample code,
I found that in cuvector.cpp line 267
you assume there is a chance that cuMemAddressReserve will not reserve the exact virtual address you request.
Does cuMemAddressReserve not guarantee reservation of a specific virtual address?
The addr argument passed to cuMemAddressReserve is a hint, similar to POSIX mmap's addr argument. If the address specified is already reserved by the driver or otherwise cannot be reserved for some reason, cuMemAddressReserve still tries to service the request by finding another suitable reservation rather than returning an error. In the case of the sample code, cuMemAddressReserve tries to extend the reservation by reserving the address range right after the last cuMemAddressReserve's range, but it could be that another allocation or reservation overlaps with the request. In that case, a new reservation at an arbitrary address is returned, which isn't what we want, which is why the sample code frees it and then creates a new, larger reservation to cover both the previous and new memory allocations for the buffer.
As to why the address returned by cuMemAlloc is not reservable after calling cuMemFree, there can be a large number of reasons, most of which an application simply cannot control. For example, the address range returned by cuMemAlloc could internally have been part of a larger reservation managed by the default memory pool, which would not be accessible to cuMemAddressReserve. Either way, the fixed address argument to cuMemAddressReserve is not guaranteed to be honored.
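As an illustrative sketch (simplified from what the sample's cuvector.cpp does, not the exact code), the grow-in-place pattern looks roughly like this:

```cpp
#include <cuda.h>

// Try to grow a virtual address reservation in place by asking for the range
// immediately after the current one; if the driver cannot honor the hint,
// fall back to a new, larger reservation.
int growReservation(CUdeviceptr *base, size_t oldSize, size_t newSize) {
    CUdeviceptr extra = 0;
    // Hint: reserve the range that starts right where the old one ends.
    CUresult st = cuMemAddressReserve(&extra, newSize - oldSize, 0,
                                      *base + oldSize, 0);
    if (st == CUDA_SUCCESS && extra == *base + oldSize) {
        return 0;  // grew contiguously; existing mappings stay valid
    }
    if (st == CUDA_SUCCESS) {
        // The hint was not honored; discard the unrelated reservation.
        cuMemAddressFree(extra, newSize - oldSize);
    }
    // Fall back: one new reservation large enough for everything. The caller
    // must remap its allocation handles into this range and free the old one.
    return (cuMemAddressReserve(base, newSize, 0, 0, 0) == CUDA_SUCCESS) ? 0 : -1;
}
```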
Hello, @Cory.Perry
I have a problem when using GPUDirect RDMA. The details are as follows:
I used cuMemCreate/cuMemMap to get device memory.
Then I used ibv_reg_mr_iova2 to register the device memory allocated in step 1, and the error "bad address" occurred.
So I want to know whether device memory allocated by the low-level virtual memory management APIs supports GPUDirect RDMA or not. If it does, how do I register the memory for GPUDirect RDMA?
Thanks!
Yes, the low-level virtual memory APIs do support GPUDirect RDMA, but unlike cuMemAlloc you must specifically request this feature in the CUmemAllocationProp structure. You can set the CUmemAllocationProp::allocFlags::gpuDirectRDMACapable flag, but please make sure to first check the device attributes CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED and CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED before attempting to do so. If these attributes return zero, specifying the flag will cause cuMemCreate to return CUDA_ERROR_INVALID_VALUE.
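A minimal sketch of what that looks like (illustrative, with error handling trimmed):

```cpp
#include <cuda.h>

// Check the device attributes first, then request a GPUDirect RDMA capable
// allocation from cuMemCreate.
CUresult createRdmaCapable(CUmemGenericAllocationHandle *handle, size_t size,
                           CUdevice dev) {
    int rdma = 0, rdmaWithVmm = 0;
    cuDeviceGetAttribute(&rdma, CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED, dev);
    cuDeviceGetAttribute(&rdmaWithVmm,
                         CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED,
                         dev);
    if (!rdma || !rdmaWithVmm) {
        return CUDA_ERROR_NOT_SUPPORTED;  // the flag below would be rejected
    }

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.allocFlags.gpuDirectRDMACapable = 1;  // needed before ibv_reg_mr* registration
    // size must be a multiple of the allocation granularity
    // (see cuMemGetAllocationGranularity).
    return cuMemCreate(handle, size, &prop, 0);
}
```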
@CodyYao So sorry for the delay in a reply, I completely missed this message during the holidays! Yes, the virtual memory APIs are supported under WSL2! Please check the device attributes under WSL2 for the different features you may be interested in.
Follow up questions,
I tried to allocate memory on the CPU side via cuMemCreate but wasn't successful. I attempted to set the location parameter in the CUmemAllocationProp to CU_MEM_LOCATION_TYPE_HOST, HOST_NUMA, and MAX, but setting HOST and HOST_NUMA gave me compile errors as those are not defined.
When I set the enum manually to 0x2 (which I believe is the HOST enum), cuMemCreate returns the CUDA_ERROR_INVALID_DEVICE error. FYI, the other values give me the same error.
Am I doing something wrong? Or is this approach prohibited?
@woosungkang Hi and welcome to the developer forums!
Good news: cuMemCreate recently added support for host memory allocation, in CUDA 12.0 I believe! As you have guessed, the location type should be one of the following:
CU_MEM_LOCATION_TYPE_HOST
CU_MEM_LOCATION_TYPE_HOST_NUMA
CU_MEM_LOCATION_TYPE_HOST_NUMA_CURRENT
However, with recent drivers, only CU_MEM_LOCATION_TYPE_HOST_NUMA is currently available for allocation. Your system must be NUMA-capable (i.e. have libnuma installed) for the allocation to work properly, and you need to specify the NUMA node you wish to allocate your memory on via the id field of the property structure. You can retrieve the best/closest NUMA node for a target device via the CU_DEVICE_ATTRIBUTE_HOST_NUMA_ID device attribute and pass it directly to the id field of the location structure in the property structure passed to cuMemCreate.
For mapping, when you eventually call the cuMemSetAccess API, you can specify CU_MEM_LOCATION_TYPE_HOST_NUMA (the id is ignored), which allows access to the underlying memory via the associated address from the host. To allow access via the associated address from a device, use the same method as for device-allocated memory (specify the location type as DEVICE and set the id field to the device index you wish to enable access for).
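Putting those pieces together, a rough sketch (illustrative only; assumes CUDA 12.x headers and a NUMA-capable system) might look like:

```cpp
#include <cuda.h>

// Allocate host memory with cuMemCreate on the NUMA node closest to `dev`,
// then grant access from both the host and that device.
CUdeviceptr allocHostNuma(size_t size, CUdevice dev) {
    int numaId = 0;
    cuDeviceGetAttribute(&numaId, CU_DEVICE_ATTRIBUTE_HOST_NUMA_ID, dev);

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA;
    prop.location.id = numaId;                 // NUMA node closest to dev

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t padded = ((size + gran - 1) / gran) * gran;

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, padded, &prop, 0);

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, padded, 0, 0, 0);
    cuMemMap(ptr, padded, 0, handle, 0);

    CUmemAccessDesc access[2] = {};
    access[0].location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA;  // id ignored for host access
    access[0].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    access[1].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access[1].location.id = dev;
    access[1].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, padded, access, 2);

    // The handle can be released now; the mapping keeps the physical memory
    // alive until it is unmapped and the address range freed.
    cuMemRelease(handle);
    return ptr;
}
```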
but setting HOST and HOST_NUMA gave me compile errors as those are not defined.
Please make sure that you are using a recent version (I believe 12.0+) of the CUDA Toolkit SDK with a recent version of cuda.h that should define these enumeration values, avoiding these compilation errors.
Hope this helps, let us know if you have any other questions!