Transfer Speed For AWE-Allocated Memory

I have a question about the distinction between “page-locked” and “pinned” for a buffering scheme I’m working up for 32-bit code under WinXP x64 (long story).

Transferring host to device from pageable memory, I get a bandwidth of 2151 MB/s.

Transferring host to device from pinned memory, I get a bandwidth of 5715 MB/s.

AWE (Address Windowing Extensions) allows me to allocate physical RAM and then map virtual page ranges to it. The key system calls are:

  • AllocateUserPhysicalPages – to get the physical RAM
  • VirtualAlloc – to reserve a virtual page range
  • MapUserPhysicalPages – to associate the two
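For concreteness, here is a sketch of that three-call sequence (Windows-only; it assumes the "Lock pages in memory" privilege is already enabled for the process, and the 64 MB window size is just an illustrative choice):

```c
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    SIZE_T bytes = 64 * 1024 * 1024;                 /* illustrative 64 MB window */
    ULONG_PTR nPages = bytes / si.dwPageSize;
    ULONG_PTR *pfns = malloc(nPages * sizeof(ULONG_PTR));

    /* 1. Get the physical RAM (requires SeLockMemoryPrivilege). */
    if (!AllocateUserPhysicalPages(GetCurrentProcess(), &nPages, pfns)) {
        fprintf(stderr, "AllocateUserPhysicalPages failed: %lu\n", GetLastError());
        return 1;
    }

    /* 2. Reserve a virtual range that can receive physical mappings. */
    void *va = VirtualAlloc(NULL, bytes, MEM_RESERVE | MEM_PHYSICAL,
                            PAGE_READWRITE);

    /* 3. Associate the two. */
    if (!MapUserPhysicalPages(va, nPages, pfns)) {
        fprintf(stderr, "MapUserPhysicalPages failed: %lu\n", GetLastError());
        return 1;
    }

    /* ... use va like ordinary memory; remap other physical pages as needed ... */

    MapUserPhysicalPages(va, nPages, NULL);          /* passing NULL unmaps */
    FreeUserPhysicalPages(GetCurrentProcess(), &nPages, pfns);
    VirtualFree(va, 0, MEM_RELEASE);
    free(pfns);
    return 0;
}
```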

Note that I need the “Lock pages in memory” privilege even to be allowed to allocate physical pages from a user program, so it’s a good bet that the pages are locked. Also note that cudaMallocHost doesn’t need this privilege, so the page-locking must be happening somewhere inside the driver stack.

Transferring host to device from AWE memory, I get a bandwidth of 2148 MB/s. I would have expected something closer to 5715 MB/s.

So my question has two parts:

  1. Does anyone know what distinction CUDA makes between memory allocated with cudaMallocHost and any other memory? Because it doesn’t appear to be solely based on page-locked-ness.

  2. Has anyone run into this particular issue and have any idea how to get closer to the higher transfer rate?

I know I can copy between the two types of RAM but I’d rather not if there’s a less clunky solution.

The driver also has to register those memory ranges with the driver context / GPU so they will be recognized for DMA transfers. There is no way to do this registration yourself in CUDA 2.1.

Be patient and wait for CUDA 2.2. It has a big overhaul in how the CUDA driver deals with pinned memory. I cannot say for certain, but given hints that tmurray has dropped on the forums, it may be possible to do what you are trying to do.

There are two memcpy paths (well, there are more for some special cases, but you really only have to worry about two): one for pinned memory and one for pageable memory. Pageable memory gets memcpy’d on the host into a pinned staging area, and the GPU then performs a DMA transfer from that pinned region to global memory. Pinned memory is page-locked and its addresses are made known to the card, so the GPU can DMA from it directly at a later point (which is why there are asynchronous memcpys from pinned memory and not from pageable).
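A minimal sketch contrasting the two paths from the host side (buffer size is an arbitrary choice; error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const size_t bytes = 64 << 20;
    void *d_buf, *pageable, *pinned;
    cudaMalloc(&d_buf, bytes);

    /* Pageable path: the driver stages the data through an internal pinned
       buffer, so the copy is synchronous with respect to the host thread. */
    pageable = malloc(bytes);
    cudaMemcpy(d_buf, pageable, bytes, cudaMemcpyHostToDevice);

    /* Pinned path: the allocation is page-locked and known to the driver,
       so the GPU can DMA directly, and the copy can be asynchronous. */
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMallocHost(&pinned, bytes);
    cudaMemcpyAsync(d_buf, pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d_buf);
    return 0;
}
```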

2.2 adds several things relating to pinned memory, but we still don’t allow you to map regions that you’ve page-locked yourself to the GPU’s address space. Two reasons for this (there are probably more):

  • if a region is page-locked, mapped, then unlocked, a copy to that region will (I’m pretty sure) do very bad things.
  • what I’ve described is true on OSes with simple driver models (WinXP and Linux). On OSX and Vista/Server08/Win7, things are significantly more complicated (we have to do a lot of fiddling behind the scenes), so exposing a mapping call is not really ideal.

Is there some reason why you can’t use cudaMallocHost?

Thanks for looking at this. (And thanks to MisterAnderson42 for the hint.) The reason for not preferring cudaMallocHost (although it’s beginning to sound like my best bet) is that my 32-bit app does a lot of I/O and needs to have more than 2–3 GB in play. It would be nice (but not absolutely essential) to double-buffer the I/O and materialize data into a locked range via the AWE API just in time for the device to use it.

I would prefer 64-bit but there’s a lot of code that hasn’t migrated yet.

I realize there are driver issues here that I don’t fully understand. Nonetheless, maybe it’s worth thinking about the following calls in a future API:

ThisBozoClaimsToHaveLockedTheFollowingRange(…)

ThisBozoNowClaimsThatTheFollowingRangeIsUnlocked(…)

with page references in between checked for being locked?

There’s an implicit race condition here, I think, so enabling such calls would be somewhat dangerous…

Wow. This four-year-old question is exactly my own! I also tried DMA from AWE memory regions to the GPUs, expecting the 11 GB/s DMA rate I get when using cudaHostAlloc. Nope: it’s always the slower 2 GB/s, even though the memory is page-locked.

I am using four Kepler K10 cards. I just finished upgrading my project and all supporting DLLs from 32-bit to x64. I plan to use cudaHostAlloc instead of the AWE scheme to access the other 60 GB of host memory and get the 11 GB/s DMA speed to and from the GPUs.

The previous answers and explanations confirmed my approach. Thanks.

(My first post)
Mitch

The problem with pulling up old posts is that the information is stale. Recent versions of CUDA let you register existing host memory regions for DMA. Look up cudaHostRegister in the reference manual for a current version of CUDA.
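A sketch of what that looks like: cudaHostRegister (added in CUDA 4.0) page-locks an existing allocation so subsequent copies take the pinned path. Whether it accepts an AWE-backed range is an open question I haven't verified; the example below just uses an ordinary malloc'd buffer.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const size_t bytes = 64 << 20;
    void *host = malloc(bytes);
    void *dev;
    cudaMalloc(&dev, bytes);

    /* Register the existing range: the driver page-locks it and records its
       physical addresses, so copies from it can use direct DMA. */
    cudaHostRegister(host, bytes, cudaHostRegisterDefault);

    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, 0);
    cudaDeviceSynchronize();

    cudaHostUnregister(host);
    cudaFree(dev);
    free(host);
    return 0;
}
```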