CUDA 2.2 pinned memory white paper

Just wanted folks to take note of the white paper on CUDA 2.2’s new pinned memory APIs. In the SDK it is kinda buried with the simpleZeroCopy sample, but it covers all aspects of the new APIs, including portable and mapped pinned memory and write-combining.
CUDA2.2PinnedMemoryAPIs.pdf (249 KB)
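For anyone who wants a quick look before opening the PDF, here is a minimal sketch (sizes and error handling are kept trivial; this is just an illustration, not code from the paper) of the three allocation flags it covers: portable, mapped, and write-combined pinned memory.

```
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // Must be set before any CUDA work if we want mapped (zero-copy) pinned memory.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    void *portable, *mapped, *wc;
    size_t bytes = 1 << 20;   // 1 MiB per buffer, purely for illustration

    // Portable: pinned for every CUDA context, not just the allocating one.
    cudaHostAlloc(&portable, bytes, cudaHostAllocPortable);

    // Mapped: the GPU can read/write this host buffer directly.
    cudaHostAlloc(&mapped, bytes, cudaHostAllocMapped);

    // Write-combined: faster for the GPU to read over PCIe, slow for CPU reads.
    cudaHostAlloc(&wc, bytes, cudaHostAllocWriteCombined);

    // For a mapped buffer, fetch the device-side pointer to pass to kernels.
    void *dMapped;
    cudaHostGetDevicePointer(&dMapped, mapped, 0);
    printf("device pointer for mapped buffer: %p\n", dMapped);

    cudaFreeHost(portable);
    cudaFreeHost(mapped);
    cudaFreeHost(wc);
    return 0;
}
```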

Many thanks… that’s a very detailed explanation of something I thought was a bit under-detailed in the programming guide.

I think it would be helpful if it were somehow incorporated into the programming guide.

thanks a lot

eyal

thanks indeed. This whitepaper clarifies quite a few of my assumptions that previously were, well, exactly that: assumptions :)

Thanks for the paper. It brings up some comments and questions.

§1.1: The equivalent paragraph in the programming guide (3.2.5.1) is confusing, as it doesn’t mention the very useful multi-GPU support feature at all.

§3.2: How does it affect stream- and event-based synchronization? Are per-type or per-call requests planned? From the reference manual (3.4.2.6 & 3.20.2.2) I assume it applies to all calls, but only cudaThreadSynchronize is mentioned in this paragraph.

§4.5: Now, that’s the one that brings up questions. It implies that any cudaMallocHost or cudaHostAlloc allocation will be mapped into CUDA’s 32-bit linear address space when the proper flag is set. I assume this is the memory space directly addressed by the GPU itself, which means:

  1. One can’t really use the flag on a C1060, M1060 or S1070, because the onboard 4 GiB of memory requires the entire space by itself.

  2. This space is NOT used for the traditional cudaMallocHost, so another mechanism (I assume a DMA engine) is involved, because I know from experience that one can use more than 3 GiB on the card and more than 3 GiB of pinned memory.

  3. Whatever the mechanism I surmise in 2), it is not 32-bit limited, since §3.3 says that a 64-bit host can now allocate more than 4 GiB of pinned memory. I assume this will fail if the host-mapping flag is active.

Thanks for any extra clarifications.

Cordially,

Apologies for the delay. I should learn to check in on threads I’ve replied to more often!

cudaSetDeviceFlags() is the CUDA runtime equivalent of the flags passed to cuCtxCreate(). The context-wide “auto,” “yield,” and “spin” settings only affect the context’s behavior during pageable memcpy (since there is no way to steer this behavior in the memcpy API). The context-wide “blocking sync” setting only affects the behavior of cuCtxSynchronize()/cudaThreadSynchronize(), since finer-grained blocking waits can be done with CUDA events created by calling cudaEventCreateWithFlags() with the cudaEventBlockingSync flag (the driver API equivalent is to specify the CU_EVENT_BLOCKING_SYNC flag to cuEventCreate()).
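To make that concrete, here is a minimal sketch of the finer-grained blocking wait (the stream and event names are made up for illustration):

```
#include <cuda_runtime.h>

int main(void)
{
    cudaStream_t stream;
    cudaEvent_t  done;

    cudaStreamCreate(&stream);

    // cudaEventBlockingSync makes cudaEventSynchronize() yield the CPU
    // instead of spin-waiting; the driver API analogue is CU_EVENT_BLOCKING_SYNC.
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

    // ... enqueue kernels / async memcpys into 'stream' here ...

    cudaEventRecord(done, stream);   // mark the point we want to wait for
    cudaEventSynchronize(done);      // CPU sleeps rather than spins until it is reached

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    return 0;
}
```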

Yes, this is a good summary. Pinned memory pollutes the 32-bit CUDA address space only if you have called cudaSetDeviceFlags() with cudaDeviceMapHost. And if that flag is set, then all pinned allocations consume address space, even ones not marked as cudaHostAllocMapped.
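To put that in code: a minimal sketch, assuming a device that supports mapped pinned memory, of the mapped path under discussion, i.e. an allocation that ends up consuming GPU address space and is dereferenced directly by a (hypothetical) kernel:

```
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;   // reads/writes go over PCIe to host memory
}

int main(void)
{
    const int n = 1 << 20;

    // Once this flag is set, every pinned allocation consumes GPU address
    // space, mapped or not (per the reply above).
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *hBuf, *dBuf;
    cudaHostAlloc((void **)&hBuf, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dBuf, hBuf, 0);

    for (int i = 0; i < n; ++i)
        hBuf[i] = (float)i;

    // The kernel dereferences host memory directly; no cudaMemcpy needed.
    scale<<<(n + 255) / 256, 256>>>(dBuf, 2.0f, n);
    cudaThreadSynchronize();   // CUDA 2.2-era synchronization call

    cudaFreeHost(hBuf);
    return 0;
}
```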

Thanks

This API should allow us to allocate a non-paged, non-cacheable host buffer that can receive an external DMA transfer, yes?

I’m looking for the most efficient way to move data from an external PCI device into the GPU.

Has an efficient approach been found yet? If so, please let me know; I need to do something very similar.

I need to transfer data from one PCI device’s memory to the GPU as efficiently as possible, ideally bypassing CPU memory.

Thanks
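For what it’s worth, here is a minimal sketch of the CUDA-visible half of that setup. Buffer names and sizes are made up, and the crucial step, handing the pinned buffer’s bus address to the external device’s DMA engine, is OS/driver work that the CUDA API does not expose, so it is only marked as a comment:

```
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 4 << 20;          // 4 MiB staging buffer (arbitrary)
    unsigned char *staging, *dDst;
    cudaStream_t stream;

    cudaStreamCreate(&stream);

    // Pinned + write-combined: never paged out, not cached by the CPU,
    // which suits a buffer the CPU only forwards and never reads.
    cudaHostAlloc((void **)&staging, bytes, cudaHostAllocWriteCombined);
    cudaMalloc((void **)&dDst, bytes);

    // ... the external PCI device DMAs its data into 'staging' here,
    //     using a bus address obtained through its own driver ...

    // Forward the filled buffer to the GPU without an extra host copy.
    cudaMemcpyAsync(dDst, staging, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFree(dDst);
    cudaFreeHost(staging);
    cudaStreamDestroy(stream);
    return 0;
}
```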
