Pinned Memory slower than pageable memory

Hello all,

I am learning the ins and outs of CUDA. As a next step up the ladder, I simply replaced the host malloc calls with cudaHostAlloc using default settings (no streams, no async copies). I expected a speedup from that change alone, or at least no change, but instead I got a significant 20% performance hit. This does not make sense to me. The book “CUDA by Example” claims that I should see a twofold improvement just by making that change, but I’m not seeing it. On the contrary, using malloc is much faster. Is there something else I should be doing?

Thanks in advance
NW

CPU reads from pinned memory can be TERRIBLY slow when the buffer is allocated write-combined.

Pinned memory is BEST only when you WRITE from the CPU to pinned memory and then DMA it to the card.
Remember to specify the “WRITE COMBINED” attribute (cudaHostAllocWriteCombined) for the CPU writes to be effective. (It is effective for writes in continuously increasing address order.)

If you are reading from write-combined pinned memory (un-cached memory) on the CPU, you will lose performance.
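
To make the write-only pattern concrete, here is a minimal sketch, not taken from the original poster's code. The function name upload_and_read_last and the CHECK macro are made up for illustration; the CUDA runtime calls are the standard ones. The CPU only writes to the write-combined buffer, in increasing address order, and the result is read back through ordinary malloc memory.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical example function: fill a write-combined staging buffer on the
// CPU, copy it to the device, copy it back into pageable memory, and return
// the last element so the round trip can be checked.
float upload_and_read_last(size_t n) {
    const size_t bytes = n * sizeof(float);

    float *d_buf = nullptr;
    CHECK(cudaMalloc((void **)&d_buf, bytes));

    // Pinned + write-combined: meant for CPU writes only.
    float *staging = nullptr;
    CHECK(cudaHostAlloc((void **)&staging, bytes, cudaHostAllocWriteCombined));

    // Sequential writes in increasing address order; the CPU never reads staging[].
    for (size_t i = 0; i < n; ++i) {
        staging[i] = static_cast<float>(i);
    }
    CHECK(cudaMemcpy(d_buf, staging, bytes, cudaMemcpyHostToDevice));

    // Read the data back into ordinary pageable memory, which the CPU caches normally.
    float *h_check = static_cast<float *>(std::malloc(bytes));
    if (h_check == nullptr) {
        std::fprintf(stderr, "malloc failed\n");
        std::exit(EXIT_FAILURE);
    }
    CHECK(cudaMemcpy(h_check, d_buf, bytes, cudaMemcpyDeviceToHost));
    const float last = h_check[n - 1];

    std::free(h_check);
    CHECK(cudaFreeHost(staging));
    CHECK(cudaFree(d_buf));
    return last;
}

int main() {
    const size_t n = 1 << 20;
    std::printf("last element: %.1f\n", upload_and_read_last(n));
    return 0;
}
```

If the CPU also has to read the buffer, plain cudaHostAlloc with cudaHostAllocDefault (or malloc) is probably the safer choice.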

If you’re blindly replacing every malloc with a cudaHostAlloc, then I’m not too surprised the wall time goes up. cudaHostAlloc is much slower than malloc, since it has to allocate physical pages up front and lock them in place. Furthermore, since those pages are pinned into physical RAM, the OS has less RAM available for virtual memory, possibly causing the OS to swap, and almost certainly causing fragmentation of physical RAM. Pinned memory makes PCIe transfers faster, but if those aren’t the bottleneck, it’s not necessary.
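
One way to see where the time goes is to measure allocation and transfer separately. The following is only a rough sketch: the buffer size, repetition count, the helper name time_h2d_ms and the CHECK macro are arbitrary choices for illustration, and the numbers will vary by machine.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Average time in milliseconds of one host-to-device copy over `reps` copies,
// measured with CUDA events.
float time_h2d_ms(void *d_dst, const void *h_src, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i) {
        CHECK(cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice));
    }
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms / reps;
}

int main() {
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;
    const size_t bytes = 64u << 20;  // 64 MiB, an arbitrary test size
    const int reps = 20;

    CHECK(cudaFree(0));  // create the CUDA context before timing anything
    void *d_buf = nullptr;
    CHECK(cudaMalloc(&d_buf, bytes));

    // Allocation cost, including a first touch of every page.
    auto t0 = Clock::now();
    void *h_pageable = std::malloc(bytes);
    if (h_pageable == nullptr) return EXIT_FAILURE;
    std::memset(h_pageable, 0, bytes);
    auto t1 = Clock::now();
    void *h_pinned = nullptr;
    CHECK(cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault));
    std::memset(h_pinned, 0, bytes);
    auto t2 = Clock::now();

    std::printf("alloc + touch: malloc %.2f ms, cudaHostAlloc %.2f ms\n",
                Ms(t1 - t0).count(), Ms(t2 - t1).count());
    std::printf("H2D copy:      pageable %.2f ms, pinned %.2f ms\n",
                time_h2d_ms(d_buf, h_pageable, bytes, reps),
                time_h2d_ms(d_buf, h_pinned, bytes, reps));

    CHECK(cudaFreeHost(h_pinned));
    std::free(h_pageable);
    CHECK(cudaFree(d_buf));
    return 0;
}
```

If the allocation column dominates, allocating the pinned buffers once at startup and reusing them, and keeping malloc for buffers that never cross the PCIe bus, is the usual way out.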
