There’s a lot of info spread across posts on the forum; I’ll highlight a couple…
1 and 2: http://forums.nvidia.com/index.php?s=&…st&p=286686
The GPU must always DMA from pinned memory. If you use malloc() for your host data, then it is in pageable (non-pinned) memory. When you call cudaMemcpy(), the CUDA driver first has to memcpy the data from your non-pinned pointer into an internal pinned buffer, and only then can the host->GPU DMA be invoked.
If you allocate your host memory with cudaMallocHost and initialize the data there directly, then the driver doesn’t have to memcpy from pageable to pinned memory before DMAing – it can DMA directly.
That is why it is faster.
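To make the two paths concrete, here is a minimal sketch of the pattern the posts describe, using the standard CUDA runtime calls (cudaMallocHost, cudaMemcpy, cudaFreeHost). The buffer size is arbitrary, timing and error checking are omitted, and this needs a CUDA-capable GPU and nvcc to actually run:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 24;                  // arbitrary size: ~16M floats (~64 MB)
    const size_t bytes = N * sizeof(float);

    float *pageable = (float *)malloc(bytes);  // ordinary pageable host memory
    float *pinned = nullptr;
    cudaMallocHost(&pinned, bytes);            // page-locked (pinned) host memory

    float *dev = nullptr;
    cudaMalloc(&dev, bytes);

    // Pageable path: the driver first memcpys into an internal pinned
    // staging buffer, then DMAs that buffer to the GPU.
    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);

    // Pinned path: the driver can DMA straight from our buffer,
    // skipping the extra host-side memcpy.
    cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);

    cudaFree(dev);
    cudaFreeHost(pinned);                      // pinned memory is freed with cudaFreeHost, not free()
    free(pageable);
    return 0;
}
```

If you wrap the two cudaMemcpy calls with cudaEvent timers you can measure the difference on your own hardware; the pinned copy is typically noticeably faster for large buffers.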
3: http://forums.nvidia.com/index.php?s=&…st&p=497317
From what I gather, pinned memory is great if you copy data back and forth between the CPU and GPU quite often, but it may not be that beneficial if you’re only doing a few transfers (and over-allocating it reduces the memory available to the rest of the system)…
Check out the topics above and other topics regarding pinned memory for more info