Is it possible to use pinned memory outside of CUDA?

I understand the benefits of using pinned host memory for copies to and from the device. My problem is that I am capturing an image, copying it to the GPU for processing, then copying it back to the host to be viewed. Is the only way to get the benefits of pinned memory to allocate a pinned buffer for the source image and copy the data into that pinned space first? In that case the memcpy into the pinned space could take longer than the whole unpinned transfer to the GPU. The same applies when reading the data back from the GPU: would I have to read it into pinned memory and then copy it into pageable memory to be read by my viewing program?

I hope what I’m asking makes sense :huh:

If the image cannot be captured to pinned memory allocated with cuMemAllocHost, then yes, you have to copy to a staging area in order to perform the transfer.

If that is the case, it might be faster to just let CUDA do the copy for you. It will use its private staging areas to do the same thing. CUDA’s pageable-memcpy code uses multiple staging buffers to get better CPU/GPU overlap than you would get by copying the whole buffer into pinned memory yourself and then issuing a single pinned memcpy.
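Roughly, the internal scheme looks something like this sketch (the chunk size, buffer count, and the name stagedCopyToDevice are made up for illustration; it ping-pongs between two pinned staging buffers so the CPU memcpy of one chunk overlaps the DMA of the previous one):

#include <cuda_runtime.h>
#include <cstring>
#include <cstddef>

// Copy 'bytes' from pageable host memory to the device through two
// pinned staging buffers, overlapping CPU copies with GPU DMA.
void stagedCopyToDevice(void* dDst, const void* hSrc, size_t bytes)
{
    const size_t CHUNK = 1 << 20;  // 1 MB staging chunks (arbitrary choice)
    char* staging[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMallocHost((void**)&staging[i], CHUNK);  // pinned staging buffer
        cudaStreamCreate(&stream[i]);
    }

    const char* src = (const char*)hSrc;
    char* dst = (char*)dDst;
    int buf = 0;
    for (size_t off = 0; off < bytes; off += CHUNK, buf ^= 1) {
        size_t n = (bytes - off < CHUNK) ? (bytes - off) : CHUNK;
        cudaStreamSynchronize(stream[buf]);          // wait until this buffer is free
        memcpy(staging[buf], src + off, n);          // CPU copy into pinned staging
        cudaMemcpyAsync(dst + off, staging[buf], n,  // DMA overlaps the next memcpy
                        cudaMemcpyHostToDevice, stream[buf]);
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(staging[i]);
    }
}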

Yeah, the only disadvantage of cudaMemcpy from pageable memory is that it works synchronously, isn’t it? Or is that only true as far as the copy into the staging area is concerned?

I’m not sure if I understand your question.
As I see it you want to know if you can use memory allocated with cudaMallocHost() the same way as memory allocated with malloc() or something else CUDA-unrelated.

I think the answer is yes. You can use it as if it were just… memory. You could, for example, allocate memory with cudaMallocHost() and then load a file from the hard disk straight into it. So an extra copy from memory allocated with malloc() to memory allocated with cudaMallocHost() is not necessary if your design allows it.
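For example (a minimal sketch; the file name and size are placeholders):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 1024 * 1024;       // placeholder size
    unsigned char* h_img = NULL;
    cudaMallocHost((void**)&h_img, bytes);  // pinned, but still an ordinary pointer

    FILE* f = fopen("image.raw", "rb");     // read the file straight into pinned memory
    if (f) {
        fread(h_img, 1, bytes, f);
        fclose(f);
    }

    unsigned char* d_img = NULL;
    cudaMalloc((void**)&d_img, bytes);
    cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);  // fast path: source is pinned

    cudaFree(d_img);
    cudaFreeHost(h_img);
    return 0;
}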

However, one should use cudaMallocHost() carefully, as too much page-locked memory can degrade overall system performance.
Also note that cudaMallocHost() has to be called in the same device context as the cudaMemcpy() calls for CUDA to be able to use the fast memcpy path. This is important if you use multiple devices or have a multithreaded application with one thread dedicated to controlling the GPU.
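A minimal sketch of that design, assuming one worker thread owns the GPU (both the pinned allocation and the copy are issued from the same thread, so they share one context):

#include <cuda_runtime.h>
#include <cstddef>

// Runs entirely in the dedicated GPU thread.
void gpuWorkerThread(size_t bytes)
{
    void *h_staging, *d_buf;
    cudaMallocHost(&h_staging, bytes);  // allocated in this thread's context
    cudaMalloc(&d_buf, bytes);

    // ... fill h_staging with the current frame ...

    // Same thread and context as the allocation, so the fast path applies.
    cudaMemcpy(d_buf, h_staging, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_staging);
}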

I actually experimented with using CUDA-allocated pinned memory elsewhere in my application and there is absolutely no problem. A pointer to pinned memory can be used anywhere a normal pointer can. I ran my own functions on it, saved it to disk, passed it to the IPP library, etc.
There were no problems and no difference in performance (I had hoped it would be faster…).

The correct way to do it is to allocate pinned memory and “bind” it to your existing CPU classes (instead of letting the classes allocate their own memory). A really neat way to do this is if your classes support custom allocators.
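For example, with a standard-library container (a sketch that relies on C++11’s minimal allocator requirements; error handling omitted):

#include <cuda_runtime.h>
#include <vector>
#include <cstddef>

// STL-compatible allocator backed by cudaMallocHost().
template <typename T>
struct PinnedAllocator {
    typedef T value_type;
    PinnedAllocator() {}
    template <typename U> PinnedAllocator(const PinnedAllocator<U>&) {}

    T* allocate(std::size_t n) {
        void* p = NULL;
        cudaMallocHost(&p, n * sizeof(T));  // page-locked allocation
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { cudaFreeHost(p); }
};
template <typename T, typename U>
bool operator==(const PinnedAllocator<T>&, const PinnedAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const PinnedAllocator<T>&, const PinnedAllocator<U>&) { return false; }

// The vector's storage is now pinned, so a cudaMemcpy from &pixels[0]
// to the device takes the fast path.
std::vector<unsigned char, PinnedAllocator<unsigned char> > pixels(1920 * 1080);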

I have (I believe) the same problem as the original poster. To clarify, I am integrating with another application that performs its own memory allocations for its own purposes. My data, when it arrives, arrives in memory allocated by that application, which is not pinned, and I have no way to force the application to use cudaMallocHost().

At that point, if I want to get the data onto the GPU, I can either (with the current API) (a) copy the data from its location in memory into pinned memory and then start the copy to the device, or (b) just copy to the device straight out of pageable memory.

What would be really nice is if there were a way to pass cudaMallocHost() an existing pointer, so that it page-locks that memory and sets up the DMA transfer structures; then I could copy the data directly from its initial location to the GPU and enjoy the speed benefits of pinned memory.
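In other words, something along these lines (cudaPinExisting() is a name I just made up to illustrate the call I wish existed; it is NOT in the current API):

void*  appBuffer = the_app_gives_me_this();  // allocated by the host application
size_t bytes     = the_app_tells_me_this();  // (both placeholders)

cudaPinExisting(appBuffer, bytes);           // wished-for: page-lock in place
cudaMemcpy(d_img, appBuffer, bytes, cudaMemcpyHostToDevice);  // then enjoy the fast path
cudaUnpinExisting(appBuffer);                // and presumably an inverse call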

Does anyone know if there is a way to do this? I saw a post in another thread indicating that NVIDIA may be considering a new API call that would essentially enable this feature. Alternatively, does anyone know if NVIDIA has released the source to cudaMallocHost()? The changes wouldn’t be all that complex.

Thanks in Advance!

On Windows, VirtualLock() may be the function you’re looking for. However, I’m not quite sure whether CUDA will recognize memory locked that way as ‘pinned’. You’ll need to try =)
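Something like this (untested with CUDA; VirtualLock() keeps the pages resident, but CUDA may well still classify the pointer as pageable, since it only tracks its own cudaMallocHost() allocations):

#include <windows.h>

int main(void)
{
    SIZE_T bytes = 256 << 10;  // 256 KB placeholder; larger sizes may need
                               // SetProcessWorkingSetSize() first
    void* buf = VirtualAlloc(NULL, bytes, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (buf != NULL && VirtualLock(buf, bytes)) {
        // Pages are locked in physical RAM here; time a cudaMemcpy from this
        // pointer against one from cudaMallocHost() memory and compare.
        VirtualUnlock(buf, bytes);
    }
    if (buf != NULL)
        VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}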

You cannot page-lock already-allocated memory in any way, ever. This would amount to some very complex operations at both the libc and the virtual-memory level:

  • Make sure no other heap allocations overlap the pages of this memory block; if any do, give up

  • Remove the memory block from the heap’s consideration

  • Make sure all of its pages are resident in memory

  • Make sure the data is laid out consecutively in physical memory (not fragmented)

  • Tell the operating system that this block of memory is special

This would probably be a lot slower than just copying to pinned memory yourself. Another possibility might be scatter/gather DMA, as used by some network cards and hard-disk controllers; you can achieve zero-copy with that, but NVIDIA would probably have to make big changes to their driver.