I’m calling my kernel with one block and a user-selectable number of threads. Any thread count from 1 to 192 works as it should, but if I give it 193, I get back an error of CUDA_ERROR_INVALID_IMAGE. I haven’t been able to find in the documentation or by googling what that error actually indicates. What’s the general interpretation of that code?
The kernel call and error-code check are as follows:
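(Roughly, anyway -- the kernel name and arguments below are placeholders, only probSize is from my actual code.)

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel -- my real one takes different arguments, this is just the shape of it
__global__ void myKernel(float *data, int n)
{
    int i = threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

void launchKernel(float *d_data, int probSize)
{
    // One block, probSize threads -- probSize is whatever the user asked for at run time
    myKernel<<<1, probSize>>>(d_data, probSize);

    // Launch-configuration problems (like asking for too many threads) show up here
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel launch failed with error code %d\n", (int)err);
}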
where probSize is set by the user at run time. This is running on a Tesla cluster node, and it only copies a small amount of data, so I’m skeptical that I’m running out of memory.
Actually, I just realized I was looking at the driver API error enumeration by accident. It’s actually cudaErrorLaunchOutOfResources. I can see that directly if I change the error check to something like this:
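(Again just the shape of it, not my exact code -- this swaps the printf in the check above for the runtime API’s error string.)

cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));   // this is the check that reports cudaErrorLaunchOutOfResources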
So, my kernel creates a rather large number of local variables (around 30 or so) for each thread. Could this be the cause of running out of resources?
Ok, that makes a little more sense now. I just reduced my threads per block to 32, and now it seems to be working… at least for up to 256 threads. :D
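For anyone who hits the same thing: as far as I can tell, cudaErrorLaunchOutOfResources here means registers per thread times threads per block exceeded what the multiprocessor can provide. You can check a kernel’s register usage at run time with cudaFuncGetAttributes -- a sketch below, using the same placeholder kernel name as above (compiling with nvcc -Xptxas -v reports the same register count at build time):

// Query per-thread register usage and the largest block size this kernel can launch with
cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, myKernel);
printf("registers per thread: %d\n", attr.numRegs);
printf("max threads per block for this kernel: %d\n", attr.maxThreadsPerBlock);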