Calling nvidia_p2p_get_pages on another process's device pointer

I’ve been working on some P2P DMA code, and it works fine when everything runs in a single process. However, I need another process to coordinate the DMAs. I can successfully pass my device pointer to that process using cudaIpcGetMemHandle/cudaIpcOpenMemHandle, but when I hand the imported pointer to my kernel module to map the memory, nvidia_p2p_get_pages fails with -EINVAL.

The documentation seems to hint that you need to set the (deprecated) p2p_token and va_space parameters when dealing with another process’s memory, but the p2pToken is always zero when I query it in the source process. Any suggestions? I don’t want to call get_pages in the source process, because I don’t want to trust raw page addresses that come from the client process.
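
For reference, the handle exchange itself is the standard CUDA IPC pattern; here is a minimal sketch (error handling and the transport that carries the handle between processes are omitted):

```c
#include <cuda_runtime.h>

/* Producer process: allocate device memory and export an IPC handle. */
static void export_buffer(size_t size, cudaIpcMemHandle_t *handle_out,
                          void **dptr_out)
{
    cudaMalloc(dptr_out, size);
    cudaIpcGetMemHandle(handle_out, *dptr_out);
    /* ... send *handle_out to the orchestrator over a socket/pipe ... */
}

/* Orchestrator process: import the handle. The returned pointer is
 * valid in *this* process's CUDA address space. */
static void *import_buffer(cudaIpcMemHandle_t handle)
{
    void *peer_ptr = NULL;
    cudaIpcOpenMemHandle(&peer_ptr, handle, cudaIpcMemLazyEnablePeerAccess);
    return peer_ptr;
}
```

It is that imported pointer, passed down to the kernel module, that nvidia_p2p_get_pages rejects with -EINVAL.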

Maybe I am misunderstanding the desired functionality, but it seems to me that, implemented as described, this would create a security hole that malware authors would be delighted to use as another infection vector.

The intent is that passing a virtual address is less risky, because the CUDA/NVIDIA kernel module can validate that the virtual address represents an actual allocation. This should be much less risky than accepting physical page addresses from the client process.

Obviously there will also be some authentication between the client process and the orchestrator process to help secure it.

At any rate, the question stands: how do we call nvidia_p2p_get_pages on a pointer returned by another process’s cudaMalloc()?

If my initial assessment of a security risk is correct, one could reasonably conclude that the behavior you observe is the rational consequence of NVIDIA consciously eliminating exposure to that risk.

Looking forward to an authoritative response by an NVIDIA engineer.

Has there been any update on this? Were you able to find a solution?

I’m trying to do the same thing, with identical results.

Any information you can provide would be extremely helpful!

Thank you

I ended up storing a token in my kernel driver and passing that token between the processes to fetch the physical pages. It was a bit painful to implement.
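
In case it helps anyone who lands here, the shape of it is roughly the following. This is a stripped-down sketch, and the names are my own: struct pin_entry, pin_and_tokenize, and generate_token_and_store are hypothetical driver plumbing; only the nvidia_p2p_* calls are the real NVIDIA kernel API from nv-p2p.h.

```c
#include <linux/types.h>
#include <linux/slab.h>
#include "nv-p2p.h"

struct pin_entry {
    u64 token;                          /* opaque id handed to user space  */
    struct nvidia_p2p_page_table *pt;   /* filled by nvidia_p2p_get_pages  */
};

/* Hypothetical: insert the entry into a driver-private lookup table
 * and return a fresh random token for it. */
static u64 generate_token_and_store(struct pin_entry *e);

/* Called by the NVIDIA driver if the allocation is torn down. */
static void free_callback(void *data)
{
    struct pin_entry *e = data;
    nvidia_p2p_free_page_table(e->pt);
}

/* ioctl handler invoked by the *owning* process, so `va` is valid in
 * the current address space and both tokens can be zero. The driver,
 * not the client, discovers the physical pages. */
static int pin_and_tokenize(u64 va, u64 len, u64 *token_out)
{
    struct pin_entry *e = kzalloc(sizeof(*e), GFP_KERNEL);
    int ret;

    if (!e)
        return -ENOMEM;

    ret = nvidia_p2p_get_pages(0, 0, va, len, &e->pt, free_callback, e);
    if (ret) {
        kfree(e);
        return ret;
    }

    *token_out = e->token = generate_token_and_store(e);
    return 0;
}
```

A second ioctl, issued by the orchestrator, resolves the token back to the pin_entry and programs the DMA engine from pt->pages[i]->physical_address, so raw physical addresses never cross the user/kernel boundary.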

Thanks for the update! Much appreciated.
I’m sorry to hear there was no real solution.
For now, I’ve implemented a similar workaround.

If any NVIDIA dev sees this, do you know if pinning buffers allocated by other processes will ever be supported? Or perhaps we’ve been doing it wrong?

If not, updating the documentation could be very helpful.
Specifically, the last sentence of section 2.6 of the GPUDirect RDMA Documentation, “Tokens Usage” (GPUDirect RDMA :: CUDA Toolkit Documentation), states:
“When no tokens are used, the NVIDIA driver limits the Kernel API to the process which owns the memory allocation.”

This seems to imply that the p2p and vaSpace tokens are meant to be used for pinning pages from other processes.

This same section starts by warning that the tokens are deprecated, but still supported. Could this be clarified at all? If they are deprecated, will they be replaced by another mechanism in the future? If they are still supported, is there some other requirement for pinning pages from other processes?
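
For completeness, the deprecated token query that “Tokens Usage” refers to looks like this on the user-space side; as the original poster observed, p2pToken comes back as zero here:

```c
#include <cuda.h>

/* Deprecated token query from the "Tokens Usage" section. On success,
 * tokens->p2pToken and tokens->vaSpaceToken are what would be passed
 * as the first two arguments of nvidia_p2p_get_pages(). */
static CUresult query_p2p_tokens(CUdeviceptr dptr,
                                 CUDA_POINTER_ATTRIBUTE_P2P_TOKENS *tokens)
{
    return cuPointerGetAttribute(tokens,
                                 CU_POINTER_ATTRIBUTE_P2P_TOKENS,
                                 dptr);
}
```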

Additionally, section 1.4, “Changes in CUDA 6.0” (GPUDirect RDMA :: CUDA Toolkit Documentation), states:
“CUDA peer-to-peer tokens are no longer mandatory. For memory buffers owned by the calling process (which is typical) tokens can be replaced by zero (0) in the kernel-mode function nvidia_p2p_get_pages(). This new feature is meant to make it easier for existing third party software stacks to adopt RDMA for GPUDirect.”

Unfortunately, the documentation of nvidia_p2p_get_pages() in the Kernel API section offers no real clarity here. Neither does the example in section 3.2, “Pinning GPU memory”, which covers only the single-process case.
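
Condensed, that single-process pattern amounts to roughly the following (a sketch, not the verbatim docs example); note the 64 KiB GPU page alignment and the zero tokens, which per the quote above are only accepted for memory owned by the calling process:

```c
#include <linux/types.h>
#include "nv-p2p.h"

/* GPU BAR mappings use 64 KiB pages; requests must be aligned. */
#define GPU_PAGE_SHIFT 16
#define GPU_PAGE_SIZE  (1ULL << GPU_PAGE_SHIFT)
#define GPU_PAGE_MASK  (~(GPU_PAGE_SIZE - 1))

struct gpu_region {
    struct nvidia_p2p_page_table *page_table;
};

static void free_callback(void *data)
{
    struct gpu_region *r = data;
    nvidia_p2p_free_page_table(r->page_table);
}

/* `va` must belong to the process on whose behalf this ioctl runs;
 * with both tokens zero, that is the only case the driver accepts. */
static int pin_gpu_memory(u64 va, u64 len, struct gpu_region *r)
{
    u64 start = va & GPU_PAGE_MASK;
    u64 end   = (va + len + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;

    return nvidia_p2p_get_pages(0, 0, start, end - start,
                                &r->page_table, free_callback, r);
}
```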

Granted, this may be a niche use case, but at the moment it’s a significant setback for me.

Has anyone out there been able to get these tokens to work as described? Or, through any other means, pin a CUDA buffer from a process that did not allocate it?

Thank you