I have a problem where I need to transfer a large matrix from host to device. This works fine when the array is smaller than about 8e7 floats, but above that, when I run the kernel I usually get a CUTIL CUDA error (usually “unknown error”, though I have also seen others such as “the launch timed out and was terminated”). This is only a transfer of about 300 MB. Any ideas what is causing this problem? Shouldn’t I be able to transfer much more than this to GPU memory? I have 6 GB of RAM on my system and am running everything in 64-bit. I have 896 MB of dedicated video memory.
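For reference, here is a minimal sketch of how I do the allocation and transfer, with error checking added after each CUDA call (simplified; the identifiers and the doubling “work” are placeholders, not my exact code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Check every CUDA call so a failure is reported where it actually
// happens, rather than at the next CUTIL macro.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err));                      \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

int main() {
    const size_t n = 80 * 1000 * 1000;      // ~8e7 floats
    const size_t bytes = n * sizeof(float); // = 320,000,000 B ≈ 305 MB

    float *h_mat = (float *)malloc(bytes);
    float *d_mat = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_mat, bytes));
    CUDA_CHECK(cudaMemcpy(d_mat, h_mat, bytes, cudaMemcpyHostToDevice));

    // ... kernel launch goes here ...
    CUDA_CHECK(cudaGetLastError());       // catches launch failures
    CUDA_CHECK(cudaThreadSynchronize());  // catches errors during execution

    CUDA_CHECK(cudaFree(d_mat));
    free(h_mat);
    return 0;
}
```

The `cudaMalloc` and `cudaMemcpy` calls themselves return `cudaSuccess`; the error only shows up around the kernel launch.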
I am pretty sure the problem is not in the kernel algorithm itself, because it works fine for smaller problems and only fails on larger matrices. The number of blocks, threads, registers, and the amount of shared memory do not depend on the size of this matrix, so I don’t think it is breaking because of those. The larger matrix only causes a loop to execute more times within the kernel. The threads are synced before each repetition of the loop.
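The kernel follows this general pattern (a simplified sketch; the real per-element work is more involved than the placeholder shown here):

```cuda
__global__ void process(const float *mat, float *out, int rows, int cols)
{
    // Block and thread counts are fixed regardless of matrix size;
    // a larger matrix only means this loop runs more iterations.
    for (int r = 0; r < rows; ++r) {
        __syncthreads();  // threads synced before each repetition

        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c < cols)
            out[r * cols + c] = mat[r * cols + c] * 2.0f;  // placeholder work
    }
}
```

So the total runtime of a single launch grows roughly linearly with the number of rows, even though the launch configuration stays the same.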
Any thoughts on why I cannot use larger arrays?