Async transfers to non-CUDA host memory (page-locked memory not allocated by CUDA)


I’m having an issue and I’m hoping someone can explain it to me, since I am not an expert on the differences between page-locked versus pageable memory once it has been virtually mapped to user space.

Basically, I’ve developed software that manipulates video using CUDA. All host-to-device and device-to-host memory moves are async and tied to streams. I have no issue if I use host memory that is allocated by CUDA; everything is happy.

The issue I’m experiencing is when I attempt an async memory transfer with host memory that was not allocated by CUDA.

Here’s an example. I manipulate a frame of video in CUDA and need to push it back to host memory so I can route it to a broadcast-quality SDI card. The drivers for this card provide me with page-locked physical memory that is virtually mapped to user space. Since this memory is page-locked, I’m making the assumption that CUDA should be able to DMA directly to it. I’m wrong: cudaMemcpyAsync will not transfer to this memory; only cudaMemcpy will.

Will cudaMemcpyAsync only transfer to host memory that CUDA allocates? If so, why? Isn’t all page-locked memory the same? What am I missing? Can CUDA not translate the virtual address to a physical one? Will CUDA async transfers only work with addresses in its own virtual address range? Please help!
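For concreteness, here is a minimal sketch of the two cases (`GetDriverFrameBuffer()` is a hypothetical stand-in for whatever mapping call the SDI card’s SDK actually provides; error checking trimmed):

```cpp
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 1920 * 1080 * 4;   // one 8-bit RGBA HD frame
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    void *d_frame;
    cudaMalloc(&d_frame, bytes);

    // Case 1: host buffer allocated by CUDA -- page-locked AND known to the
    // CUDA driver, so the async copy can be a true DMA tied to the stream.
    void *h_cuda;
    cudaMallocHost(&h_cuda, bytes);
    cudaMemcpyAsync(h_cuda, d_frame, bytes, cudaMemcpyDeviceToHost, stream);

    // Case 2: page-locked buffer mapped in by the SDI card's driver.
    // void *h_sdi = GetDriverFrameBuffer();           // hypothetical SDK call
    // cudaMemcpyAsync(h_sdi, d_frame, bytes,
    //                 cudaMemcpyDeviceToHost, stream); // does not work async,
    // cudaMemcpy(h_sdi, d_frame, bytes,
    //            cudaMemcpyDeviceToHost);              // only this works,
    // apparently because CUDA has no record of those physical pages.

    cudaStreamSynchronize(stream);
    cudaFreeHost(h_cuda);
    cudaFree(d_frame);
    cudaStreamDestroy(stream);
    return 0;
}
```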

Thanks in advance.

I don’t know the complete answer to your question, but cudaMallocHost does not just allocate pinned memory on the host. It does work on the GPU side to prepare for DMA transfers.

That’s my understanding as well: there’s more going on inside cudaMallocHost() beyond just the allocation; something is being done on the GPU side too. One of the NVIDIA engineers commented on this detail in the forum at some point in the past, I believe in the context of some of us asking why cudaMallocHost() was quite a bit slower than a normal malloc().

I’m starting to wonder whether it might be a good idea for CUDA to add a new API to pin/unpin memory regions (plus whatever related back-end GPU magic is also being done) independently of how the memory was originally allocated. Similarly, it might be useful to have a routine that does only the back-end GPU magic, for the case where the user has already pinned the memory.

While I don’t propose bloating the CUDA API with a bunch of difficult-to-use functions, there do seem to be inherent problems in mixing CUDA with other libraries that also DMA to/from pinned memory regions, with all the related cache-control manipulation, etc. By splitting some of the functionality currently bundled into cudaMallocHost() out into separate routines, it might become easier to compose CUDA with other libraries that do this stuff. Since I have no idea what cudaMallocHost() is doing on the GPU side of things, I can only guess whether splitting the allocation, pinning, and GPU setup into multiple routines would solve this issue. Perhaps one of the NVIDIA people can comment on whether that split would do much for this problem.
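A sketch of what such a decoupled interface might look like. These names are hypothetical, not part of the CUDA API described in this thread (later CUDA releases did ship something along these lines as cudaHostRegister()/cudaHostUnregister()):

```cpp
// Hypothetical decoupled interface, per the proposal above:
cudaError_t cudaHostPin(void *ptr, size_t bytes);   // page-lock + GPU-side setup
cudaError_t cudaHostUnpin(void *ptr);               // undo both

// Intended usage: the application (or another library, e.g. the SDI
// card's driver) owns the allocation; CUDA only teaches the GPU about
// the buffer's pages so async DMA becomes possible.
//
//   void *frame = GetDriverFrameBuffer();           // hypothetical SDK call
//   cudaHostPin(frame, bytes);
//   cudaMemcpyAsync(frame, d_frame, bytes, cudaMemcpyDeviceToHost, stream);
//   ...
//   cudaHostUnpin(frame);
```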

John Stone

No; any peripheral that can do scatter/gather DMA must also be programmed with the locations of the physical pages to operate on. This characteristic is not unique to GPUs.

Can the broadcast quality SDI card use a buffer allocated by cuMemAllocHost()? (That is almost a rhetorical question. It would surprise me if it could.)

There are two issues in play.

One is that allocating the memory and making it suitable for DMA are tied together into the same operation. This prevents at least one important usage scenario: being able to DMA a given memory range to different GPUs. Decoupling allocation from mapping for the GPU would enable that scenario even if CUDA were still required to perform the allocation.

The other issue is making existing virtual address ranges DMA’able. Across CUDA’s target operating systems, support for this operation is uneven. Windows has very solid kernel-mode support for this; Linux has a user-mode API to lock the pages, but Linux’s process cleanup does not easily guarantee that the GPU’s mappings will be removed before the pages referenced by the GPU are recycled to the OS. (I am no Linux expert, but I am told that in this area there is variation from one Linux kernel to the next.)

These problems are all tractable but will require time to address.

MediaFrame: which operating system are you targeting? (edit: I know this is the Windows XP forum, but there are still four Windows OS’s: xp32, xp64, vista32, vista64)

When you refer to scatter/gather, are you referring to host memory, device memory, or both? I was under the impression that page-locked host memory was contiguous.

Currently developing for xp32. Now, with all the 64-bit support for XP, I am looking to migrate this software to xp64 in the next few months.

Q: Is it possible to obtain the physical address of the memory allocated by cudaMallocHost? I’m in a situation similar to MediaFrame’s: I’m acquiring data from an external device into system memory, and I want to process that data on the GPU.

I’m comfortable with CUDA allocating the non-pageable memory; the only thing is that I must supply the physical address of the memory buffer, but cudaMallocHost gives the virtual address. Is there a user-callable Windows function that maps virtual addresses back to physical addresses?