Maximum device memory: can't transfer more than ~300 MB to device

I have a problem where I need to transfer a large matrix from host to device. This works fine when the array is smaller than about 8e7 floats, but above that, running the kernel usually produces a CUTIL CUDA error (usually "unknown error", but I have also seen others, like "the launch timed out and was terminated"). This is only a transfer of about 300 MB. Any idea what is causing this? Shouldn't I be able to transfer much more than that to GPU memory? I have 6 GB of system RAM, am running everything in 64-bit, and have 896 MB of dedicated video memory.

I am pretty sure the problem is not in the kernel algorithm itself, because it works fine on smaller problems and only fails for larger matrices. The number of blocks, threads, and registers, and the amount of shared memory, do not depend on the size of this matrix, so I don't think it is breaking because of those. A larger matrix only makes a loop inside the kernel execute more times, and the threads are synchronized before each iteration of that loop.
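In case it helps with diagnosis: a minimal sketch of host-side error checking (the `CHECK_CUDA` macro and `myKernel` placeholder are my own, not from the original code) that checks immediately after the copy and then again after the launch. This separates a transfer that actually failed from a kernel that was killed later, e.g. by the display driver's watchdog:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file/line on any CUDA error.
#define CHECK_CUDA(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                    __FILE__, __LINE__, cudaGetErrorString(err));  \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

int main() {
    const size_t n = 80 * 1000 * 1000;  // ~8e7 floats, roughly 320 MB
    float *h_data = (float *)calloc(n, sizeof(float));
    float *d_data = NULL;

    // If allocation or the transfer itself fails, these report it directly.
    CHECK_CUDA(cudaMalloc((void **)&d_data, n * sizeof(float)));
    CHECK_CUDA(cudaMemcpy(d_data, h_data, n * sizeof(float),
                          cudaMemcpyHostToDevice));

    // myKernel<<<blocks, threads>>>(d_data, n);  // launch goes here
    CHECK_CUDA(cudaGetLastError());        // launch-configuration errors
    CHECK_CUDA(cudaDeviceSynchronize());   // async errors, e.g. a timeout

    CHECK_CUDA(cudaFree(d_data));
    free(h_data);
    return 0;
}
```

If the `cudaMemcpy` line passes but `cudaDeviceSynchronize` reports "the launch timed out", the transfer is fine and it is the long-running kernel that is being terminated.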

Any thoughts on why I cannot use larger arrays?


Disable TDR (Windows' Timeout Detection and Recovery, which terminates any kernel that keeps the display GPU busy for more than a couple of seconds): …dm_timeout.mspx
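For anyone hitting this later: the watchdog is controlled by documented registry values under the `GraphicsDrivers` key (a reboot is required; the values below are examples, not a recommendation):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
; TdrLevel = 0 disables timeout detection entirely (the default is 3)
"TdrLevel"=dword:00000000
; Alternatively, keep TDR enabled but raise the timeout in seconds
; (0x3c = 60 seconds; the default is 2)
"TdrDelay"=dword:0000003c
```

Note that with TDR disabled, a hung kernel on the display GPU can freeze the whole desktop; raising `TdrDelay` is the safer option if you only need longer-running kernels.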

Thanks. This fixed the problem. Now I only seem to be limited by the total RAM of the system.