I am using CUDA to accelerate a simple double-precision floating-point matrix algorithm that performs some mathematical calculations on each element of the matrix. The code works well with small and medium-sized matrices; however, performance degrades when I use very large matrices that take up nearly all of the device memory.

I profiled the code using Nsight and found that system (host) memory, in addition to device memory, was used for some of the buffers allocated by cudaMalloc(). For example, when I tried to allocate 1.7 GB of device memory with cudaMalloc() calls, the actual allocation was 850 MB of system memory and 850 MB of device memory. This is confusing to me, since I expected cudaMalloc() to allocate all of the buffers in device memory. It also hurts performance significantly. I am using CUDA 5.5 on Windows 8.1 64-bit. My card is a GeForce GTX 680.

Has anyone had a similar experience? I would like to know why this happens and, if possible, how to avoid it. I would greatly appreciate any help!