In some recent CUDA development work I have run into trouble getting access to more than 512MB of page-locked host memory to use for asynchronous copies to/from a CUDA device (GTX 690). This seems to be an obstacle whether I allocate the RAM myself with malloc() and then call cudaHostRegister(), or call cudaMallocHost() to get the page-locked memory directly. FYI: cudaHostRegister() is being passed the flag cudaHostRegisterMapped. When cudaMallocHost() is called and the cumulative page-locked allocations cross about 512MB, the call fails with the error “out of memory”.
I am working with a Windows 7 64-bit OS on a computer with 32GB of RAM, so the amount of physical memory is not an issue. I am using Microsoft Visual Studio 2010 and CUDA 4.2.
In trying to work through this problem, I have read through the Microsoft documentation on process working set sizes. As an investigation, independent of CUDA, I wrote a C++ test that raised the process working set size (SetProcessWorkingSetSize) to over 3GB, allocated 3GB of memory with malloc() (in 1GB chunks), and successfully locked the three chunks with VirtualLock(). Therefore, I know my program has permission, and the system has enough resources, to supply an adequate amount of page-locked memory for my problem. Note that 3GB is not a limit; it is just all I asked the system to allocate and lock.
Does anyone know if there is a limitation in CUDA 4.2 on the amount of page-locked memory that either cudaHostRegister() or cudaMallocHost() can work with? Additionally, does anyone know if there is a way to register pre-allocated page-locked host memory (say, allocated with malloc() and locked with VirtualLock()) with CUDA so the asynchronous copy functions can work with it? I assume, but do not know for certain, that this registration is necessary to use the asynchronous copy functions.
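In case it helps anyone reproduce this, here is a minimal sketch of how the cumulative limit could be measured: keep allocating fixed-size chunks until the allocator refuses. The helper name probeCumulativeLimit and its parameters are hypothetical, and the allocator is passed in as a callable so the logic can be checked without a GPU. The CUDA wiring appears only in comments and is an assumption about how one would plug it in, not something taken from my actual program.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: requests chunkBytes at a time through `alloc` until it
// returns nullptr or another chunk would exceed maxBytes, then hands every
// chunk back through `release`. Returns the total bytes successfully allocated.
//
// With CUDA (assumed wiring, shown only as a comment):
//   alloc:   void* p = nullptr;
//            return cudaMallocHost(&p, n) == cudaSuccess ? p : nullptr;
//   release: cudaFreeHost(p);
std::size_t probeCumulativeLimit(
    const std::function<void*(std::size_t)>& alloc,
    const std::function<void(void*)>& release,
    std::size_t chunkBytes,
    std::size_t maxBytes)
{
    assert(chunkBytes > 0);
    std::vector<void*> held;   // keep chunks alive so allocations accumulate
    std::size_t total = 0;
    while (total + chunkBytes <= maxBytes) {
        void* p = alloc(chunkBytes);
        if (p == nullptr) {
            break;             // allocator refused: cumulative limit reached
        }
        held.push_back(p);
        total += chunkBytes;
    }
    for (void* p : held) {
        release(p);
    }
    return total;
}
```

Running this with a small chunk size (for example 16MB) and a maxBytes of a few GB should show where the cumulative allocations stop succeeding; on my system I would expect the result to land near the 512MB mark.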
I would like to avoid falling back to synchronous copies, or to copying smaller amounts of data at a time by looping: transferring each piece into page-locked memory and then issuing an asynchronous copy. I have large amounts of data to copy, and I am already overlapping the copies with kernel execution, so either of these workarounds will cost me cycles.
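To be concrete about the chunked workaround I would like to avoid, here is a rough sketch of the bookkeeping it would need: a large transfer split into pieces that fit a small pinned staging buffer, cycling through a few such buffers. StagedChunk and planStagedCopy are hypothetical names, and the CUDA steps are listed only as comments describing the intended sequence, so treat this as an outline under those assumptions rather than working transfer code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One piece of a large host-to-device transfer (hypothetical type).
struct StagedChunk {
    std::size_t offset;   // byte offset into the large pageable source buffer
    std::size_t bytes;    // size of this piece in bytes
    int stagingIndex;     // which pinned staging buffer (and stream) it uses
};

// Splits totalBytes into pieces of at most stagingBytes each, assigning the
// pieces round-robin to numStaging pinned staging buffers.
std::vector<StagedChunk> planStagedCopy(std::size_t totalBytes,
                                        std::size_t stagingBytes,
                                        int numStaging)
{
    assert(stagingBytes > 0 && numStaging > 0);
    std::vector<StagedChunk> plan;
    int slot = 0;
    for (std::size_t offset = 0; offset < totalBytes; offset += stagingBytes) {
        const std::size_t bytes = std::min(stagingBytes, totalBytes - offset);
        plan.push_back({offset, bytes, slot});
        slot = (slot + 1) % numStaging;
    }
    return plan;
}

// Intended use of each chunk c (assumed sequence, not compiled here):
//   1. wait until staging buffer c.stagingIndex is no longer in flight
//   2. memcpy(staging[c.stagingIndex], src + c.offset, c.bytes);
//   3. cudaMemcpyAsync(dst + c.offset, staging[c.stagingIndex], c.bytes,
//                      cudaMemcpyHostToDevice, stream[c.stagingIndex]);
```

The extra host-side memcpy in step 2 is exactly the cost I am hoping not to pay, since every byte would be touched twice on the CPU before it reaches the device.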
Thanks in advance for any help. :^)