I understand the benefits of using pinned host memory for copies to and from the device. My problem is that I am capturing an image, copying it to the GPU for processing, then copying the result back to the host to be viewed. Is the only way to get the benefits of pinned memory to allocate a pinned buffer and copy the source image into it first? In that case, the memcpy into the pinned buffer would take longer than the whole unpinned transfer to the GPU. The same applies to reading the data back from the GPU: would I have to read it into pinned memory and then copy it to a pageable memory space to be read by my viewer program?
If the image cannot be captured to pinned memory allocated with cuMemAllocHost, then yes, you have to copy to a staging area in order to perform the transfer.
If that is the case, it might be faster to just call CUDA to do the copy for you. It will use its private staging areas to do the same thing. CUDA's pageable memcpy code uses multiple buffers for better CPU/GPU overlap than you would get from copying a whole buffer and then invoking a memcpy of pinned memory.
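To make the two options concrete, here is a minimal sketch of both paths, assuming the frame arrives in a pageable buffer; all names are illustrative and error checking is omitted:

```cpp
// Sketch: manual pinned staging copy vs. letting CUDA copy from pageable memory.
#include <cuda_runtime.h>
#include <cstring>

void copy_via_staging(const void* srcPageable, void* dDst, size_t bytes,
                      void* pinnedStaging, cudaStream_t stream) {
    // Extra CPU memcpy into a page-locked staging buffer you allocated...
    std::memcpy(pinnedStaging, srcPageable, bytes);
    // ...then a fast, truly asynchronous host-to-device transfer.
    cudaMemcpyAsync(dDst, pinnedStaging, bytes, cudaMemcpyHostToDevice, stream);
}

void copy_directly(const void* srcPageable, void* dDst, size_t bytes) {
    // CUDA does the staging internally, chunked across its own pinned
    // buffers so the CPU copy and the DMA can overlap.
    cudaMemcpy(dDst, srcPageable, bytes, cudaMemcpyHostToDevice);
}
```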
Yeah, the only disadvantage of cudaMemcpy from pageable memory is that it works synchronously, isn't it? Or only as far as the copy to the staging area is concerned?
I'm not sure if I understand your question.
As I see it you want to know if you can use memory allocated with cudaMallocHost() the same way as memory allocated with malloc() or something else CUDA-unrelated.
I think the answer is yes. You can use it as if it were just… memory. You could, for example, allocate memory with cudaMallocHost() and then load a file from the hard disk into it. So an extra copy from memory allocated with malloc() to memory allocated with cudaMallocHost() is not necessary if your design allows it.
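For instance, a minimal sketch of reading a file straight into page-locked memory, with no extra staging copy; the path and size handling are illustrative:

```cpp
// Sketch: fill pinned memory directly from disk, like any other buffer.
#include <cuda_runtime.h>
#include <cstdio>

unsigned char* load_frame_pinned(const char* path, size_t bytes) {
    unsigned char* hPinned = nullptr;
    cudaMallocHost(reinterpret_cast<void**>(&hPinned), bytes);  // page-locked allocation
    FILE* f = std::fopen(path, "rb");
    if (f) {
        std::fread(hPinned, 1, bytes, f);  // write into pinned memory directly
        std::fclose(f);
    }
    return hPinned;  // later: cudaMemcpyAsync(..., hPinned, ...); cudaFreeHost(hPinned);
}
```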
However, one should use cudaMallocHost() carefully, as too much page-locked memory can decrease your overall system performance.
Also note that cudaMallocHost() has to be called in the same device context as the cudaMemcpy() calls in order for CUDA to be able to use the fast memcpy path. This is important if you use multiple devices or have a multithreaded application with one thread dedicated to controlling the GPU.
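A minimal sketch of that discipline, assuming one worker thread per device; the function and device index are illustrative:

```cpp
// Sketch: allocate the pinned buffer from the thread/context that issues the copies.
#include <cuda_runtime.h>

void gpu_worker_thread(int device, size_t bytes) {
    cudaSetDevice(device);  // make this device current for this thread
    float* hPinned = nullptr;
    cudaMallocHost(reinterpret_cast<void**>(&hPinned), bytes);  // pinned in this context
    float* dBuf = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&dBuf), bytes);
    cudaMemcpy(dBuf, hPinned, bytes, cudaMemcpyHostToDevice);   // fast pinned path
    cudaFree(dBuf);
    cudaFreeHost(hPinned);
}
```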
I actually experimented with using CUDA-allocated pinned memory elsewhere in my application and there is absolutely no problem. A pointer to pinned memory can be used anywhere a normal pointer can be used. I ran my own functions on it, saved it to disk, passed it to the IPP library, etc.
There are no problems and no difference in performance (I hoped it would be faster…).
The correct way to do it is to allocate pinned memory and "bind" it to existing CPU classes (instead of the classes allocating their own memory). A really neat way to do it is if your classes support custom allocators.
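A minimal sketch of such an allocator, assuming C++11-style allocator_traits; names are illustrative, not a definitive implementation:

```cpp
// Sketch: a custom allocator backed by cudaMallocHost/cudaFreeHost,
// so standard containers keep their data in pinned memory.
#include <cuda_runtime.h>
#include <cstddef>
#include <new>
#include <vector>

template <typename T>
struct PinnedAllocator {
    using value_type = T;
    PinnedAllocator() = default;
    template <typename U> PinnedAllocator(const PinnedAllocator<U>&) {}

    T* allocate(std::size_t n) {
        void* p = nullptr;
        if (cudaMallocHost(&p, n * sizeof(T)) != cudaSuccess) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { cudaFreeHost(p); }
};
template <typename T, typename U>
bool operator==(const PinnedAllocator<T>&, const PinnedAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const PinnedAllocator<T>&, const PinnedAllocator<U>&) { return false; }

// Usage: std::vector<unsigned char, PinnedAllocator<unsigned char>> frame(width * height);
```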
I have (I believe) the same problem as the original poster. To clarify, I am integrating into another application that performs its own memory allocations for its own purposes. My data, when it arrives, arrives in memory allocated by that application, which is not pinned, and I have no ability to force that application to use cudaMallocHost().
At that point, if I want to get the data onto the GPU, I can either (with the current API) (a) copy the data from its location in memory into pinned memory and then copy to the device, or (b) just copy to the device out of pageable memory.
What would be really nice is if there were a way to pass cudaMallocHost() an existing memory pointer, so that it page-locks the memory and sets up the DMA transfer structures; then I could copy the data directly out of its initial location to the GPU and enjoy the speed benefits of pinned memory.
Does anyone know if there is a way to do this? I saw a post in another thread that indicated NVIDIA may be considering a new API call that would essentially enable this feature. Alternatively, does anyone know if NVIDIA has released the source to cudaMallocHost? The changes wouldn't be all that complex.
On Windows, VirtualLock() may be the function you're looking for. However, I'm not quite sure whether CUDA will recognize locked memory as "pinned". You need to try =)
You cannot page-lock currently allocated memory in any way, ever. This would amount to some very complex operations at both libc and the virtual memory level:
Make sure no other heap allocations overlap the pages of this memory block; if they do, give up
Remove the memory from consideration of the heap
Make sure all the pages are active (in memory)
Make sure the data is laid out consecutively in physical memory (not fragmented)
Tell the operating system that a certain block of memory is special
This will probably be a lot slower than just copying to pinned memory yourself. Another possibility might be scatter/gather DMA as used by some network cards and hard-disk controllers; you can achieve zero copying with that, but NVIDIA would probably have to make big changes to their driver.
I have the same question as the OP, although his question wasn't entirely clear about what he is asking. I'm assuming this is the answer, because the answer to my question follows directly from what you responded.
Good to know that this is possible; what we are essentially doing is DMA. Nice that NVIDIA provided this functionality all the way back in 2008.
Does NVIDIA provide the feature you were asking for?
I know you made this response all the way back in 2008.
However, I assume that it is possible today and just wanted to confirm whether you have any more experience with this. Information on GPU DMA is extremely rare to find, even in 2025 (which is unfortunate).
So is it possible today to tell the CUDA GPU to pin memory in host RAM, and then have an external C++ application write data directly into that pinned memory for the CUDA GPU to copy from directly?
This would avoid multiple buffer copies: instead of the external C++ app copying from its heap and then writing into the pinned memory, it would just write directly to the pinned memory location. I wonder if this is possible today.
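For what it's worth, later CUDA releases added cudaHostRegister(), which page-locks an existing host allocation so that transfers from it take the fast pinned/DMA path. A minimal sketch under that assumption; the buffer and stream names are illustrative:

```cpp
// Sketch: pin memory that some other component already allocated,
// use it for fast asynchronous transfers, then unregister it.
#include <cuda_runtime.h>

void process_external_buffer(void* externalBuf, size_t bytes,
                             void* dDst, cudaStream_t stream) {
    // Page-lock the externally owned allocation.
    cudaHostRegister(externalBuf, bytes, cudaHostRegisterDefault);

    // Transfers from it now behave like transfers from cudaMallocHost memory.
    cudaMemcpyAsync(dDst, externalBuf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // Undo the page-locking when done. Registration is not free, so for a
    // reused buffer do it once up front rather than once per frame.
    cudaHostUnregister(externalBuf);
}
```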
So are you implying that pinning will not make sense for the following situation:
I am working on a setup where a live camera continuously dumps its frames into a fixed memory buffer. This buffer is pinned so the GPU can directly access it for processing. After the GPU processes a frame, the camera overwrites the pinned buffer with the next frame, and this cycle repeats.
Given that the data in the pinned memory is constantly changing (new frames overwriting old ones), does using pinned memory improve performance and latency in this scenario, or is it unnecessary?
The data in the pinned buffer is not static and will constantly change; however, since the memory address is fixed (pinned), I assume it might still improve performance because the buffer allocation isn't moving around. I'd like to hear your thoughts on this.
Pinning is only related to the memory addresses, not to the data stored there. You can read and write from the CPU and the GPU as often as you want. Use asynchronous memory copies and stream synchronization.
If you have a circular ring buffer, you can pin the whole memory buffer (or each of the frames separately).
As pinned memory cannot be swapped, you need enough physical system memory.
To slightly optimize performance further, set the write-combining flag for memory that is used only for host-to-device transfers.
You can write a routine that checks whether the buffer address from the camera was pinned.
If you read each frame only once on the GPU, you can also use zero-copy access to the pinned memory. It tends to have slightly better latency and sometimes slightly lower bandwidth than asynchronous memory copies, and it can slow down your GPU computation a little.
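Putting those pieces together, here is a minimal sketch of the camera scenario: a ring of pinned (write-combined) frame buffers, asynchronous copies on per-slot streams, and synchronization before the camera overwrites a slot. The sizes, kernel, and camera call are illustrative placeholders, not a definitive implementation:

```cpp
// Sketch: pinned ring buffer fed by a camera, processed with async copies and streams.
#include <cuda_runtime.h>
#include <cstring>

constexpr int    kSlots      = 4;
constexpr size_t kFrameBytes = 1920 * 1080 * 3;

__global__ void processFrame(unsigned char* frame) { /* ... GPU processing ... */ }

// Placeholder for the real camera capture call.
void camera_capture_into(unsigned char* dst, size_t bytes) { std::memset(dst, 0, bytes); }

int main() {
    unsigned char* hFrames[kSlots];
    unsigned char* dFrames[kSlots];
    cudaStream_t   streams[kSlots];

    for (int i = 0; i < kSlots; ++i) {
        // Write-combined pinned memory: fast host-to-device DMA, but slow for the
        // CPU to read back, so use it only if the CPU just writes frames into it.
        cudaHostAlloc(reinterpret_cast<void**>(&hFrames[i]), kFrameBytes,
                      cudaHostAllocWriteCombined);
        cudaMalloc(reinterpret_cast<void**>(&dFrames[i]), kFrameBytes);
        cudaStreamCreate(&streams[i]);
    }

    for (int frame = 0; frame < 1000; ++frame) {
        int slot = frame % kSlots;
        // Make sure the previous use of this slot is finished before the
        // camera overwrites the pinned buffer.
        cudaStreamSynchronize(streams[slot]);
        camera_capture_into(hFrames[slot], kFrameBytes);

        cudaMemcpyAsync(dFrames[slot], hFrames[slot], kFrameBytes,
                        cudaMemcpyHostToDevice, streams[slot]);
        processFrame<<<256, 256, 0, streams[slot]>>>(dFrames[slot]);
    }

    for (int i = 0; i < kSlots; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(dFrames[i]);
        cudaFreeHost(hFrames[i]);
    }
    return 0;
}
```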