CUDA 2.2 pinned memory white paper

Just wanted folks to take note of the white paper on CUDA 2.2’s new pinned memory APIs. In the SDK it is kinda buried with the simpleZeroCopy sample, but it covers all aspects of the new APIs, including portable and mapped pinned memory and write-combining.
CUDA2.2PinnedMemoryAPIs.pdf (249 KB)
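For anyone who wants a quick look before opening the PDF, here is a minimal sketch (sizes and error handling are kept trivial; this is just an illustration, not code from the paper) of the three allocation flags it covers: portable, mapped, and write-combined pinned memory.

```
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // Must be set before any CUDA work if we want mapped (zero-copy) pinned memory.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    void *portable, *mapped, *wc;
    size_t bytes = 1 << 20;   // 1 MiB per buffer, purely for illustration

    // Portable: pinned for every CUDA context, not just the allocating one.
    cudaHostAlloc(&portable, bytes, cudaHostAllocPortable);

    // Mapped: the GPU can read/write this host buffer directly.
    cudaHostAlloc(&mapped, bytes, cudaHostAllocMapped);

    // Write-combined: faster for the GPU to read over PCIe, slow for CPU reads.
    cudaHostAlloc(&wc, bytes, cudaHostAllocWriteCombined);

    // For a mapped buffer, fetch the device-side pointer to pass to kernels.
    void *dMapped;
    cudaHostGetDevicePointer(&dMapped, mapped, 0);
    printf("device pointer for mapped buffer: %p\n", dMapped);

    cudaFreeHost(portable);
    cudaFreeHost(mapped);
    cudaFreeHost(wc);
    return 0;
}
```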

Many thanks… that’s a very detailed explanation of something I thought was a bit under-detailed in the programming guide.

I think it would be helpful if it were somehow incorporated into the programming guide.

thanks a lot

eyal

thanks indeed. This whitepaper clarifies quite a few of my assumptions that previously were, well, exactly that: assumptions :)

Thanks for the paper. It brings up some comments and questions.

§1.1: The equivalent paragraph in the programming guide (3.2.5.1) is confusing, as it doesn’t mention the very useful multi-GPU support feature at all.

§3.2: How does it affect stream- and event-based synchronization? Are per-type or per-call requests planned? From the reference manual (3.4.2.6 & 3.20.2.2) I assume it applies to all calls, but only cudaThreadSynchronize is mentioned in this paragraph.

§4.5: Now, that’s the one that brings up questions. It implies that any cudaMallocHost or cudaHostAlloc allocation will be mapped into CUDA’s 32-bit linear address space when the proper flag is set. I assume this is the memory space directly addressed by the GPU itself, which means:

  1. One can’t really use the flag on a C1060, M1060 or S1070, because the onboard 4 GiB of memory requires the entire space by itself.

  2. This space is NOT used for the traditional cudaMallocHost, so another mechanism (I assume a DMA engine) is involved, because I know from experience that one can use more than 3 GiB on the card and more than 3 GiB of pinned memory.

  3. Whatever the mechanism I surmise in 2), it is not 32-bit limited, since §3.3 says that a 64-bit host can now allocate more than 4 GiB of pinned memory. I assume this will fail if the host-mapping flag is active.

Thanks for any extra clarifications.

Cordially,

Apologies for the delay. I should learn to check in on threads I’ve replied to more often!

cudaSetDeviceFlags() is the CUDA runtime equivalent of the flags passed to cuCtxCreate(). The context-wide “auto,” “yield,” and “spin” settings only affect the context’s behavior during pageable memcpy (since there is no way to steer this behavior in the memcpy API). The context-wide “blocking sync” setting only affects the behavior of cuCtxSynchronize()/cudaThreadSynchronize(), since finer-grained blocking waits can be done with CUDA events created by calling cudaEventCreateWithFlags() with the cudaEventBlockingSync flag (the driver API equivalent is to specify the CU_EVENT_BLOCKING_SYNC flag to cuEventCreate()).
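To make that concrete, here is a minimal sketch of the finer-grained blocking wait (the stream and event names are made up for illustration):

```
#include <cuda_runtime.h>

int main(void)
{
    cudaStream_t stream;
    cudaEvent_t  done;

    cudaStreamCreate(&stream);

    // cudaEventBlockingSync makes cudaEventSynchronize() yield the CPU
    // instead of spin-waiting; the driver API analogue is CU_EVENT_BLOCKING_SYNC.
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

    // ... enqueue kernels / async memcpys into 'stream' here ...

    cudaEventRecord(done, stream);   // mark the point we want to wait for
    cudaEventSynchronize(done);      // CPU sleeps rather than spins until it is reached

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    return 0;
}
```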

Yes, this is a good summary. Pinned memory pollutes the 32-bit CUDA address space only if you have called cudaSetDeviceFlags() with cudaDeviceMapHost. And if that flag is set, then all pinned allocations consume address space, even ones not marked as cudaHostAllocMapped.
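To put that in code: a minimal sketch, assuming a device that supports mapped pinned memory, of the mapped path under discussion, i.e. an allocation that ends up consuming GPU address space and is dereferenced directly by a (hypothetical) kernel:

```
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;   // reads/writes go over PCIe to host memory
}

int main(void)
{
    const int n = 1 << 20;

    // Once this flag is set, every pinned allocation consumes GPU address
    // space, mapped or not (per the reply above).
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *hBuf, *dBuf;
    cudaHostAlloc((void **)&hBuf, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dBuf, hBuf, 0);

    for (int i = 0; i < n; ++i)
        hBuf[i] = (float)i;

    // The kernel dereferences host memory directly; no cudaMemcpy needed.
    scale<<<(n + 255) / 256, 256>>>(dBuf, 2.0f, n);
    cudaThreadSynchronize();   // CUDA 2.2-era synchronization call

    cudaFreeHost(hBuf);
    return 0;
}
```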

Thanks

This API should allow us to allocate a non-paged, non-cacheable host buffer that can receive an external DMA transfer, yes?

I’m looking for the most efficient way to move data from an external PCI device into the GPU.

Has an efficient approach been found yet? If so, please let me know; I need to do something very similar.

I need to transfer data from one PCI device’s memory to the GPU as efficiently as possible, ideally bypassing CPU memory.

Thanks
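For what it’s worth, here is a minimal sketch of the CUDA-visible half of that setup. Buffer names and sizes are made up, and the crucial step, handing the pinned buffer’s bus address to the external device’s DMA engine, is OS/driver work that the CUDA API does not expose, so it is only marked as a comment:

```
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 4 << 20;          // 4 MiB staging buffer (arbitrary)
    unsigned char *staging, *dDst;
    cudaStream_t stream;

    cudaStreamCreate(&stream);

    // Pinned + write-combined: never paged out, not cached by the CPU,
    // which suits a buffer the CPU only forwards and never reads.
    cudaHostAlloc((void **)&staging, bytes, cudaHostAllocWriteCombined);
    cudaMalloc((void **)&dDst, bytes);

    // ... the external PCI device DMAs its data into 'staging' here,
    //     using a bus address obtained through its own driver ...

    // Forward the filled buffer to the GPU without an extra host copy.
    cudaMemcpyAsync(dDst, staging, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFree(dDst);
    cudaFreeHost(staging);
    cudaStreamDestroy(stream);
    return 0;
}
```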
