I have a question about the distinction between “page-locked” and “pinned” for a buffering scheme I’m working up for 32-bit code under WinXP x64 (long story).
Transferring host to device from pageable memory, I get a bandwidth of 2151 MB/s.
Transferring host to device from pinned memory, I get a bandwidth of 5715 MB/s.
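For reference, the two numbers come from timing the same host-to-device copy with only the host allocation changed. A minimal sketch of that measurement (my own test harness, not the stock bandwidthTest; error checking trimmed):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Time one host-to-device cudaMemcpy and return the rate in MB/s.
static float copyRateMBs(void *host, void *dev, size_t bytes)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return (bytes / (1024.0f * 1024.0f)) / (ms / 1000.0f);
}

int main()
{
    const size_t bytes = 32u << 20;   // 32 MB test buffer
    void *dev, *pageable, *pinned;
    cudaMalloc(&dev, bytes);
    pageable = malloc(bytes);         // ordinary pageable host memory
    cudaMallocHost(&pinned, bytes);   // page-locked memory from the CUDA driver

    printf("pageable: %.0f MB/s\n", copyRateMBs(pageable, dev, bytes));
    printf("pinned:   %.0f MB/s\n", copyRateMBs(pinned, dev, bytes));

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dev);
    return 0;
}
```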
AWE (Address Windowing Extensions) allows me to allocate physical RAM and then map virtual page ranges to it. The key system calls are:
AllocateUserPhysicalPages – to get the physical RAM
VirtualAlloc – to get a reserved page range
MapUserPhysicalPages – to associate the two
Note that I need the “Lock pages in memory” privilege to even be allowed to allocate physical pages from a user program, so it’s a good bet that the pages are locked. Also note that cudaMallocHost doesn’t need this privilege so the page allocation must be occurring somewhere inside the driver stack.
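To be concrete, here is roughly how I'm doing the AWE allocation (a trimmed sketch with error handling mostly removed; the region size is arbitrary, and the process must already hold the "Lock pages in memory" privilege for AllocateUserPhysicalPages to succeed):

```c
#include <windows.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    SIZE_T bytes = 64u * 1024 * 1024;          /* 64 MB region (arbitrary) */
    ULONG_PTR nPages = bytes / si.dwPageSize;

    /* 1. Allocate physical RAM (needs SeLockMemoryPrivilege). */
    ULONG_PTR *pfns = (ULONG_PTR *)HeapAlloc(GetProcessHeap(), 0,
                                             nPages * sizeof(ULONG_PTR));
    if (!AllocateUserPhysicalPages(GetCurrentProcess(), &nPages, pfns))
        return 1;

    /* 2. Reserve a virtual page range to serve as the window. */
    void *va = VirtualAlloc(NULL, bytes, MEM_RESERVE | MEM_PHYSICAL,
                            PAGE_READWRITE);
    if (!va)
        return 1;

    /* 3. Associate the two: map the physical pages into the range. */
    if (!MapUserPhysicalPages(va, nPages, pfns))
        return 1;

    /* va is now backed by locked physical pages; this is the pointer
       I hand to cudaMemcpy as the host-side buffer. */
    memset(va, 0, bytes);
    printf("mapped %lu pages at %p\n", (unsigned long)nPages, va);

    /* Teardown: unmap, free physical pages, release the reservation. */
    MapUserPhysicalPages(va, nPages, NULL);
    FreeUserPhysicalPages(GetCurrentProcess(), &nPages, pfns);
    VirtualFree(va, 0, MEM_RELEASE);
    HeapFree(GetProcessHeap(), 0, pfns);
    return 0;
}
```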
Transferring host to device from AWE memory, I get a bandwidth of 2148 MB/s — the pageable rate, even though these pages are locked. I would have expected something closer to 5715 MB/s.
So my question has two parts:
Does anyone know what distinction CUDA makes between memory allocated with cudaMallocHost and any other memory? It doesn't appear to be based solely on whether the pages are locked.
Has anyone run into this particular issue, and if so, any idea how to get closer to the higher transfer rate?
I know I can copy between the two types of RAM but I’d rather not if there’s a less clunky solution.