I understand the benefits of using pinned host memory for copies to and from the device. My problem is that I am capturing an image, copying it to the GPU for processing, then copying the result back to the host to be viewed. Is the only way to get the benefits of pinned memory to allocate a pinned buffer and copy the source image into it first? In that case, the memcpy into the pinned buffer would take longer than the whole unpinned transfer to the GPU. The same applies to reading the data back from the GPU: would I have to read it into pinned memory and then copy it to a pageable memory space to be read by my viewer program?
If the image cannot be captured to pinned memory allocated with cuMemAllocHost, then yes, you have to copy to a staging area in order to perform the transfer.
If that is the case, it might be faster to just call CUDA to do the copy for you. It will use its private staging areas to do the same thing. CUDA's pageable memcpy code uses multiple buffers for better CPU/GPU overlap than you would get from copying a whole buffer and then invoking a memcpy of pinned memory.
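To make the two options concrete, here is a minimal sketch of both paths, assuming the frame arrives in a pageable buffer; all names are illustrative and error checking is omitted:

```cpp
// Sketch: manual pinned staging copy vs. letting CUDA copy from pageable memory.
#include <cuda_runtime.h>
#include <cstring>

void copy_via_staging(const void* srcPageable, void* dDst, size_t bytes,
                      void* pinnedStaging, cudaStream_t stream) {
    // Extra CPU memcpy into a page-locked staging buffer you allocated...
    std::memcpy(pinnedStaging, srcPageable, bytes);
    // ...then a fast, truly asynchronous host-to-device transfer.
    cudaMemcpyAsync(dDst, pinnedStaging, bytes, cudaMemcpyHostToDevice, stream);
}

void copy_directly(const void* srcPageable, void* dDst, size_t bytes) {
    // CUDA does the staging internally, chunked across its own pinned
    // buffers so the CPU copy and the DMA can overlap.
    cudaMemcpy(dDst, srcPageable, bytes, cudaMemcpyHostToDevice);
}
```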
Yeah, the only disadvantage of cudaMemcpy from pageable memory is that it works synchronously, isn't it? Or only as far as the copy to the staging area is concerned?
I'm not sure if I understand your question.
As I see it you want to know if you can use memory allocated with cudaMallocHost() the same way as memory allocated with malloc() or something else CUDA-unrelated.
I think the answer is yes. You can use it as if it were just… memory. You could, for example, allocate memory with cudaMallocHost() and then load a file from the hard disk into it. So an extra copy from memory allocated with malloc() to memory allocated with cudaMallocHost() is not necessary if your design allows it.
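For instance, a minimal sketch of reading a file straight into page-locked memory, with no extra staging copy; the path and size handling are illustrative:

```cpp
// Sketch: fill pinned memory directly from disk, like any other buffer.
#include <cuda_runtime.h>
#include <cstdio>

unsigned char* load_frame_pinned(const char* path, size_t bytes) {
    unsigned char* hPinned = nullptr;
    cudaMallocHost(reinterpret_cast<void**>(&hPinned), bytes);  // page-locked allocation
    FILE* f = std::fopen(path, "rb");
    if (f) {
        std::fread(hPinned, 1, bytes, f);  // write into pinned memory directly
        std::fclose(f);
    }
    return hPinned;  // later: cudaMemcpyAsync(..., hPinned, ...); cudaFreeHost(hPinned);
}
```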
However, one should use cudaMallocHost() carefully, as too much page-locked memory can decrease your overall system performance.
Also note that cudaMallocHost() has to be called in the same device context as the cudaMemcpy() calls in order for CUDA to be able to use the fast memcpy path. This is important if you use multiple devices or have a multithreaded application with one thread dedicated to controlling the GPU.
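A minimal sketch of that discipline, assuming one worker thread per device; the function and device index are illustrative:

```cpp
// Sketch: allocate the pinned buffer from the thread/context that issues the copies.
#include <cuda_runtime.h>

void gpu_worker_thread(int device, size_t bytes) {
    cudaSetDevice(device);  // make this device current for this thread
    float* hPinned = nullptr;
    cudaMallocHost(reinterpret_cast<void**>(&hPinned), bytes);  // pinned in this context
    float* dBuf = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&dBuf), bytes);
    cudaMemcpy(dBuf, hPinned, bytes, cudaMemcpyHostToDevice);   // fast pinned path
    cudaFree(dBuf);
    cudaFreeHost(hPinned);
}
```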
I actually experimented with using CUDA-allocated pinned memory elsewhere in my application and there is absolutely no problem. A pointer to pinned memory can be used anywhere a normal pointer can be used. I ran my own functions on it, saved it to disk, passed it to the IPP library, etc.
There are no problems and no difference in performance (I hoped it would be faster…).
The correct way to do it is to allocate pinned memory and "bind" it to existing CPU classes (instead of the classes allocating their own memory). A really neat way to do it is if your classes support custom allocators.
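A minimal sketch of such an allocator, assuming C++11-style allocator_traits; names are illustrative, not a definitive implementation:

```cpp
// Sketch: a custom allocator backed by cudaMallocHost/cudaFreeHost,
// so standard containers keep their data in pinned memory.
#include <cuda_runtime.h>
#include <cstddef>
#include <new>
#include <vector>

template <typename T>
struct PinnedAllocator {
    using value_type = T;
    PinnedAllocator() = default;
    template <typename U> PinnedAllocator(const PinnedAllocator<U>&) {}

    T* allocate(std::size_t n) {
        void* p = nullptr;
        if (cudaMallocHost(&p, n * sizeof(T)) != cudaSuccess) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { cudaFreeHost(p); }
};
template <typename T, typename U>
bool operator==(const PinnedAllocator<T>&, const PinnedAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const PinnedAllocator<T>&, const PinnedAllocator<U>&) { return false; }

// Usage: std::vector<unsigned char, PinnedAllocator<unsigned char>> frame(width * height);
```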
I have (I believe) the same problem as the original poster. To clarify, I am integrating into another application that performs its own memory allocations for its own purposes. My data, when it arrives, arrives in memory allocated by that application, which is not pinned, and I have no ability to force that application to use cudaMallocHost().
At that point, if I want to get the data onto the GPU, I can either (with the current API) (a) copy the data from its location in memory into pinned memory and then copy to the device, or (b) just copy to the device out of pageable memory.
What would be really nice is if there were a way to pass cudaMallocHost() an existing memory pointer, so that it page-locks the memory and sets up the DMA transfer structures; then I could copy the data directly out of its initial location to the GPU and enjoy the speed benefits of pinned memory.
Does anyone know if there is a way to do this? I saw a post in another thread that indicated NVIDIA may be considering a new API call that would essentially enable this feature. Alternatively, does anyone know if NVIDIA has released the source to cudaMallocHost? The changes wouldn't be all that complex.
On Windows, VirtualLock() may be the function you're looking for. However, I'm not quite sure whether CUDA will recognize locked memory as "pinned". You need to try =)
You cannot page-lock currently allocated memory in any way, ever. This would amount to some very complex operations at both libc and the virtual memory level:
Make sure no other heap allocations overlap the pages of this memory block; if they do, give up
Remove the memory from consideration of the heap
Make sure all the pages are active (in memory)
Make sure the data is laid out consecutively in physical memory (not fragmented)
Tell the operating system that a certain block of memory is special
This will probably be a lot slower than just copying to pinned memory yourself. Another possibility might be scatter/gather DMA as used by some network cards and hard-disk controllers; you can achieve zero copying with that, but NVIDIA would probably have to make big changes to their driver.
I have the same question as the OP, although his question wasn't entirely clear about what he is asking. I'm assuming this is the answer, because the answer to my question follows directly from what you responded.
Good to know that this is possible; what we are essentially doing is DMA. Nice that NVIDIA provided this functionality all the way back in 2008.
Does NVIDIA provide the feature you were asking for?
I know you made this response all the way back in 2008.
However, I assume that it is possible today and just wanted to confirm whether you have any more experience with this. Information on GPU DMA is extremely rare to find, even in 2025 (which is unfortunate).
So is it possible today to tell the CUDA GPU to pin memory in host RAM, and then have an external C++ application write data directly into that pinned memory for the CUDA GPU to copy from directly?
This would avoid multiple buffer copies: instead of the external C++ app copying from its heap and then writing into the pinned memory, it would just write directly to the pinned memory location. I wonder if this is possible today.
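For what it's worth, later CUDA releases added cudaHostRegister(), which page-locks an existing host allocation so that transfers from it take the fast pinned/DMA path. A minimal sketch under that assumption; the buffer and stream names are illustrative:

```cpp
// Sketch: pin memory that some other component already allocated,
// use it for fast asynchronous transfers, then unregister it.
#include <cuda_runtime.h>

void process_external_buffer(void* externalBuf, size_t bytes,
                             void* dDst, cudaStream_t stream) {
    // Page-lock the externally owned allocation.
    cudaHostRegister(externalBuf, bytes, cudaHostRegisterDefault);

    // Transfers from it now behave like transfers from cudaMallocHost memory.
    cudaMemcpyAsync(dDst, externalBuf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // Undo the page-locking when done. Registration is not free, so for a
    // reused buffer do it once up front rather than once per frame.
    cudaHostUnregister(externalBuf);
}
```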
So are you implying that pinning will not make sense for the following situation:
I am working on a setup where a live camera continuously dumps its frames into a fixed memory buffer. This buffer is pinned so the GPU can directly access it for processing. After the GPU processes a frame, the camera overwrites the pinned buffer with the next frame, and this cycle repeats.
Given that the data in the pinned memory is constantly changing (new frames overwriting old ones), does using pinned memory improve performance and latency in this scenario, or is it unnecessary?
The data in the pinned buffer is not static and will constantly change; however, since the memory address is fixed (pinned), I assume it might still improve performance because the buffer allocation isn't moving around. I'd like to hear your thoughts on this.
Pinning is only related to the memory addresses, not to the data stored there. You can read and write from the CPU and the GPU as often as you want. Use asynchronous memory copies and stream synchronization.
If you have a circular ring buffer, you can pin the whole memory buffer (or each of the frames separately).
As pinned memory cannot be swapped, you need enough physical system memory.
To slightly optimize performance further, set the write-combining flag for memory that is used only for host-to-device transfers.
You can write a routine that checks whether the buffer address from the camera was pinned.
If you read each frame only once on the GPU, you can also use zero-copy access to the pinned memory. It tends to have slightly better latency and sometimes slightly lower bandwidth than asynchronous memory copies, and it can slow down your GPU computation a little.
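Putting those pieces together, here is a minimal sketch of the camera scenario: a ring of pinned (write-combined) frame buffers, asynchronous copies on per-slot streams, and synchronization before the camera overwrites a slot. The sizes, kernel, and camera call are illustrative placeholders, not a definitive implementation:

```cpp
// Sketch: pinned ring buffer fed by a camera, processed with async copies and streams.
#include <cuda_runtime.h>
#include <cstring>

constexpr int    kSlots      = 4;
constexpr size_t kFrameBytes = 1920 * 1080 * 3;

__global__ void processFrame(unsigned char* frame) { /* ... GPU processing ... */ }

// Placeholder for the real camera capture call.
void camera_capture_into(unsigned char* dst, size_t bytes) { std::memset(dst, 0, bytes); }

int main() {
    unsigned char* hFrames[kSlots];
    unsigned char* dFrames[kSlots];
    cudaStream_t   streams[kSlots];

    for (int i = 0; i < kSlots; ++i) {
        // Write-combined pinned memory: fast host-to-device DMA, but slow for the
        // CPU to read back, so use it only if the CPU just writes frames into it.
        cudaHostAlloc(reinterpret_cast<void**>(&hFrames[i]), kFrameBytes,
                      cudaHostAllocWriteCombined);
        cudaMalloc(reinterpret_cast<void**>(&dFrames[i]), kFrameBytes);
        cudaStreamCreate(&streams[i]);
    }

    for (int frame = 0; frame < 1000; ++frame) {
        int slot = frame % kSlots;
        // Make sure the previous use of this slot is finished before the
        // camera overwrites the pinned buffer.
        cudaStreamSynchronize(streams[slot]);
        camera_capture_into(hFrames[slot], kFrameBytes);

        cudaMemcpyAsync(dFrames[slot], hFrames[slot], kFrameBytes,
                        cudaMemcpyHostToDevice, streams[slot]);
        processFrame<<<256, 256, 0, streams[slot]>>>(dFrames[slot]);
    }

    for (int i = 0; i < kSlots; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(dFrames[i]);
        cudaFreeHost(hFrames[i]);
    }
    return 0;
}
```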