Pinned Memory slower than pageable memory

NeedWisdom · September 16, 2010, 12:43pm

Hello all,

I am learning the ins and outs of CUDA. As a next step up the ladder, I simply replaced the host “malloc” calls with cudaHostAlloc calls using default settings (no streams, no async copies). I expected that change alone to speed up my app, or at least to make no difference, but instead I got a significant 20% performance hit. This does not make sense to me. The book “CUDA by Example” claims that I should see a twofold improvement just by making that change, but I’m not seeing it. On the contrary, using malloc is much better. Is there something else I should be doing?

Thanks in advance
NW

Sarnath · September 16, 2010, 3:15pm

CPU reads from pinned memory locations can be TERRIBLY slow.

Pinned memory is BEST only when you WRITE from the CPU to pinned memory and then DMA it to the card.
Remember to specify the “WRITE COMBINED” attribute (the cudaHostAllocWriteCombined flag) for the CPU writes to be effective. (It is effective when you write to continuously increasing addresses.)

If you are reading from pinned memory (un-cached memory), you will lose performance.
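
A minimal sketch of the write-then-DMA pattern described above, assuming the CUDA runtime API. The function name `stage_through_write_combined` and the `CHECK` macro are made up for illustration and are not from the original post. Per the CUDA documentation, memory allocated with `cudaHostAllocWriteCombined` is not cached by the CPU, so this code only writes to it, in increasing address order, and never reads it back on the host.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Fill a write-combined pinned staging buffer with `value`, copy it to the
// device, and return the last element as read back from the device.
// Assumes n > 0.
float stage_through_write_combined(size_t n, float value) {
    const size_t bytes = n * sizeof(float);

    // Pinned, write-combined host buffer: good for CPU writes, bad for CPU reads.
    float *h_staging = nullptr;
    CHECK(cudaHostAlloc((void **)&h_staging, bytes, cudaHostAllocWriteCombined));

    // Write front to back so the write-combining buffers can merge the stores.
    for (size_t i = 0; i < n; ++i) {
        h_staging[i] = value;
    }

    float *d_buf = nullptr;
    CHECK(cudaMalloc((void **)&d_buf, bytes));
    CHECK(cudaMemcpy(d_buf, h_staging, bytes, cudaMemcpyHostToDevice));

    // Read the result back into an ordinary host variable,
    // not into the write-combined buffer.
    float back = 0.0f;
    CHECK(cudaMemcpy(&back, d_buf + (n - 1), sizeof(float),
                     cudaMemcpyDeviceToHost));

    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_staging));
    return back;
}
```

Calling `stage_through_write_combined(1 << 20, 3.0f)` should return `3.0f` if the round trip worked. Any buffer that the CPU also needs to read is probably better left in pageable memory or in pinned memory allocated with `cudaHostAllocDefault`.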


YDD · September 16, 2010, 3:17pm

If you’re blindly replacing every malloc with a cudaHostAlloc, then I’m not too surprised the wall time goes up. cudaHostAlloc is very slow, since it needs to allocate contiguous physical pages. Furthermore, since those pages are pinned into physical RAM, the OS has less RAM available for virtual memory, possibly causing the OS to swap, and almost certainly causing fragmentation of physical RAM. Pinned memory makes PCIe transfers faster, but if those aren’t a problem, it’s not necessary.
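
One way to check which effect dominates is to time the allocation and the transfer separately. The sketch below assumes the CUDA runtime API; `Timings`, `compare_pageable_and_pinned`, `h2d_copy_ms`, `now_ms` and `CHECK` are hypothetical names used only for illustration. Both host buffers are touched with `memset` so that the pageable one is actually backed by physical pages before the copy is timed.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

struct Timings {
    double pageable_alloc_ms;  // malloc + first touch
    double pinned_alloc_ms;    // cudaHostAlloc + first touch
    float pageable_copy_ms;    // host-to-device copy from pageable memory
    float pinned_copy_ms;      // host-to-device copy from pinned memory
};

// Wall-clock time in milliseconds from a monotonic clock.
static double now_ms() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double, std::milli>(
               clock::now().time_since_epoch()).count();
}

// Time one synchronous host-to-device copy with CUDA events.
static float h2d_copy_ms(void *d_dst, const void *h_src, size_t bytes) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

Timings compare_pageable_and_pinned(size_t bytes) {
    Timings t{};

    // Allocate the device buffer first so CUDA context creation
    // is not charged to the pinned allocation below.
    void *d_buf = nullptr;
    CHECK(cudaMalloc(&d_buf, bytes));

    double t0 = now_ms();
    void *h_pageable = std::malloc(bytes);
    if (h_pageable == nullptr) {
        std::fprintf(stderr, "malloc failed\n");
        std::exit(EXIT_FAILURE);
    }
    std::memset(h_pageable, 1, bytes);
    t.pageable_alloc_ms = now_ms() - t0;

    t0 = now_ms();
    void *h_pinned = nullptr;
    CHECK(cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault));
    std::memset(h_pinned, 1, bytes);
    t.pinned_alloc_ms = now_ms() - t0;

    t.pageable_copy_ms = h2d_copy_ms(d_buf, h_pageable, bytes);
    t.pinned_copy_ms = h2d_copy_ms(d_buf, h_pinned, bytes);

    CHECK(cudaFreeHost(h_pinned));
    std::free(h_pageable);
    CHECK(cudaFree(d_buf));
    return t;
}
```

Printing the four fields for a buffer size typical of the application, for example `compare_pageable_and_pinned(64u << 20)`, shows whether the time saved per copy outweighs the extra allocation cost. If the allocation cost dominates, allocating the pinned buffers once at startup and reusing them for every transfer is the usual remedy, and buffers that never cross the PCIe bus can stay on plain malloc.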

