What does CUDA_ERROR_INVALID_IMAGE mean? This should be a quick one...

I’m calling my kernel with one block and a user-selectable number of threads. With any thread count from 1 to 192 everything works as it should, but with 193 I get back CUDA_ERROR_INVALID_IMAGE. I haven’t been able to find in the documentation, or by googling, what that error actually indicates. What’s the general interpretation of that code?

The kernel call and error-checking code are as follows:

Sanders<<<1, probSize>>>(fex, _Dneq, _Dy, _Dt, _Dtout, _Ditol, _Drtol, _Datol, _Ditask, _Distate, _Diopt, _Drwork, _Dlrw, _Diwork, _Dliw, jex, _Djt, _Dcommon, _Derr, probSize);
error = cudaGetLastError();
error2 = cudaThreadSynchronize();

where probSize is set by the user at run time. This is running on a Tesla cluster node, and the kernel only copies a small amount of data, so I’m skeptical that I’m running out of memory.
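For reference, cudaGetErrorString will turn these codes into readable names, which makes this kind of debugging quicker. A minimal sketch (the checkCuda wrapper is just an illustration, not from the original code):

#include <cstdio>
#include <cuda_runtime.h>

// Print a readable name for any non-success CUDA error code.
static void checkCuda(cudaError_t err, const char *label)
{
    if (err != cudaSuccess)
        fprintf(stderr, "%s: %s\n", label, cudaGetErrorString(err));
}

// Usage after the launch:
//   checkCuda(cudaGetLastError(), "launch");
//   checkCuda(cudaThreadSynchronize(), "sync");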

Thanks,

Paul

Actually, I just realized I was looking at the driver API error enumeration by accident. The error is actually cudaErrorLaunchOutOfResources. If I change the code to:

int threadsPerBlock = 192;
int blocksPerGrid = (probSize + threadsPerBlock - 1) / threadsPerBlock;  // round up so every element gets a thread
cudaEventRecord(start, 0);
Sanders<<<blocksPerGrid, threadsPerBlock>>>(fex, _Dneq, _Dy, _Dt, _Dtout, _Ditol, _Drtol, _Datol, _Ditask, _Distate, _Diopt, _Drwork, _Dlrw, _Diwork, _Dliw, jex, _Djt, _Dcommon, _Derr, probSize);
error = cudaGetLastError();
error2 = cudaThreadSynchronize();

I now get cudaErrorLaunchFailure from error2 with cudaSuccess from error, rather than cudaErrorLaunchOutOfResources from error and cudaSuccess from error2…
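One thing worth checking with the rounded-up grid: for probSize = 193, 2 blocks × 192 threads launches 384 threads, and the surplus 191 threads will index past the end of the arrays unless the kernel guards against it, which typically shows up as exactly this kind of launch failure. A hedged sketch of the guard (the kernel body and index math are assumptions, since the real kernel isn’t shown in the post):

__global__ void Sanders(/* ... actual parameters ... */ int probSize)
{
    // Hypothetical guard: compute this thread's global element index.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= probSize)
        return;  // surplus threads in the last block do nothing

    // ... per-thread work on element idx ...
}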

So, my kernel creates a rather large number of local variables (around 30 or so) per thread. Could that be the cause of running out of resources?

Compiling with the ptxas verbose option (-Xptxas -v) gives:

ptxas info : Used 83 registers, 216+0 bytes lmem, 156+16 bytes smem, 144 bytes cmem[0], 440 bytes cmem[1]

So… it looks like I’m running out of registers? It’s running on a C1060, which has a maximum of 16384 registers per block, and 193 * 83 = 16,019. Does that seem right?

Registers are allocated per warp of threads, so 193 threads rounds up to 7 warps (224 threads), and the actual usage would be 224 * 83 = 18,592 registers.
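To make the arithmetic concrete, here is a back-of-the-envelope host-side check (purely illustrative; the constants come from the ptxas output and the C1060 limit quoted above):

#include <cstdio>

int main()
{
    const int regsPerThread = 83;     // from the ptxas -v output
    const int regsPerBlock  = 16384;  // C1060 (compute capability 1.3) limit
    const int warpSize      = 32;

    // Registers are handed out to whole warps at a time.
    int maxWarps   = regsPerBlock / (regsPerThread * warpSize);  // 16384 / 2656 = 6
    int maxThreads = maxWarps * warpSize;                        // 192
    printf("max threads per block at %d regs/thread: %d\n",
           regsPerThread, maxThreads);
    return 0;
}

That gives 192, which matches the observed cutoff exactly.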

OK, that makes a little more sense then. I just reduced my block size to 32 threads, and now it seems to be working… at least for up to 256 threads. :D
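For future reference, an alternative to shrinking the block is to cap register use per thread so that larger blocks fit; a hedged sketch (the 256-thread bound is just an example, not tuned for this kernel):

// Option 1: file-wide cap via the nvcc flag, value chosen as an example:
//   nvcc -maxrregcount=64 ...
// Option 2: per-kernel bound; the compiler limits register use (spilling the
// excess to local memory) so blocks of up to 256 threads can launch:
__global__ void __launch_bounds__(256)
Sanders(/* ... actual parameters ... */ int probSize)
{
    // kernel body unchanged
}

The trade-off is more lmem spill traffic, so it’s worth re-checking the ptxas -v output afterwards.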

Thanks,

Paul