Unable to get over 512MB of page-locked memory with cudaHostRegister or cudaMallocHost...

In some recent CUDA development work I have run into trouble gaining access to more than 512MB of page-locked host memory to use for asynchronous copies to/from a CUDA device (GTX 690). This seems to be an obstacle whether I allocate the RAM myself with malloc() and then call cudaHostRegister(), or call cudaMallocHost() to get the page-locked memory directly. FYI, cudaHostRegister() is being passed the flag cudaHostRegisterMapped. When cudaMallocHost() is called and the cumulative page-locked allocations cross about 512MB, the call fails with the error “out of memory”.

I am working with a Windows 7 64-bit OS on a computer with 32GB of RAM so the amount of physical memory is not an issue. I am using Microsoft Visual Studio 2010 and CUDA 4.2.

In trying to work through this problem, I have read through the Microsoft documentation on process working set sizes. As an investigation, independent of CUDA, I wrote a C++ test that set the process working set size to over 3GB, allocated 3GB of memory with malloc() (in 1GB chunks) and successfully locked the three chunks with VirtualLock(). Therefore, I know my program has permission, and the system has enough resources, to supply an adequate amount of page-locked memory for my problem. Note that 3GB is not a limit; it is just all I asked the system to allocate and lock.

Does anyone know if there is a limitation in CUDA 4.2 on the amount of page-locked memory that either cudaHostRegister() or cudaMallocHost() can work with? Additionally, does anyone know if there is a way to register pre-allocated page-locked host memory (say, malloced and locked with VirtualLock()) with CUDA so the asynchronous copy functions can work with it? I assume, but do not know, that this registration is necessary to use the asynchronous copy functions.

I would like to avoid two workarounds: falling back to synchronous copies, or copying smaller amounts of data at a time by looping, staging each piece into page-locked memory and then issuing an asynchronous copy. I have large amounts of data to copy that I am already overlapping with kernel execution, so either of these workarounds will cost me cycles.
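To make the second workaround concrete, here is a rough sketch of the chunking loop I have in mind, written as plain C++ so it is independent of CUDA. The function name staged_copy and the two callables submit and wait are hypothetical hooks of my own, not part of any CUDA API. It assumes two small staging buffers that would be page-locked in the real program.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Copies `total` bytes from pageable `src` through two fixed-size staging
// buffers, alternating between them (simple double buffering).
//   submit(staging, offset, n, slot): starts the transfer of one chunk.
//   wait(slot): blocks until the last transfer issued from that slot is done.
// Both callables are hypothetical hooks supplied by the caller.
// Returns the number of chunks issued.
template <typename Submit, typename Wait>
std::size_t staged_copy(const char* src, std::size_t total,
                        char* staging[2], std::size_t chunk,
                        Submit submit, Wait wait)
{
    if (chunk == 0) return 0;  // nothing sensible to do with empty buffers

    std::size_t chunks = 0;
    bool in_flight[2] = {false, false};

    for (std::size_t offset = 0; offset < total; offset += chunk, ++chunks) {
        const int slot = static_cast<int>(chunks % 2);
        const std::size_t n = std::min(chunk, total - offset);

        // Never overwrite a staging buffer that a transfer may still be reading.
        if (in_flight[slot]) wait(slot);

        std::memcpy(staging[slot], src + offset, n);  // pageable -> staging
        submit(staging[slot], offset, n, slot);       // staging -> destination
        in_flight[slot] = true;
    }

    // Drain whatever is still outstanding before returning.
    for (int slot = 0; slot < 2; ++slot)
        if (in_flight[slot]) wait(slot);

    return chunks;
}
```

In the CUDA version I would expect the staging buffers to come from cudaMallocHost(), submit to call cudaMemcpyAsync() on a stream followed by cudaEventRecord() on a per-slot event, and wait to call cudaEventSynchronize() on that event. The extra memcpy() into the staging buffer on every chunk is exactly the cost I would like to avoid.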

Thanks in advance for any help. :^)

According to a recent post by tmurray (http://forums.nvidia.com/index.php?showtopic=231616&view=findpost&p=1420933) it seems to me you should be able to pin chunks of up to 2GB. He also hints that this 2GB limit may be relaxed in CUDA 5.0, so if this is a possibility I would suggest trying the CUDA 5.0 preview available to registered developers. If the issue with the 512MB limit persists even with CUDA 5.0, I would suggest filing a bug and attaching a self-contained repro case. A link to the bug reporting form can be found on the registered developer website.

Thanks @njuffa for your quick feedback and recognizing the problem similarity with what @twerdster experienced. I must have missed his discussion thread as I searched for issues with cudaMallocHost() and cudaHostRegister() instead of cudaHostAlloc(). I updated to cudatoolkit_5.0.7 and driver 302.59 and the problem has been resolved. Cheers.

Thanks for closing the loop. It’s good to hear the latest software fixed the issue.