Can kernels return error messages

I know this may be a newbie question, but is there any way to force kernels to return error messages if something goes wrong? I’m asking because I have a kernel which I launched with approx. 100 million threads, and instead of telling me that I can’t have more than X threads, my program just skips the execution as a whole; even cudaThreadSynchronize() didn’t catch this error…

Check the return code of cudaThreadSynchronize().

Yup that’s what I’m doing, but it doesn’t report any errors, just keeps on skipping it :(

Are there any CUDA function calls between the kernel launch and the cudaThreadSynchronize() call? If there is, one of the intermediate calls may have returned the error, since it will be detected quickly.
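To illustrate: an invalid launch configuration is reported synchronously, so calling cudaGetLastError() immediately after the launch catches it before any later call can consume it. A minimal sketch of that pattern — the kernel name and allocation size here are made up, and the grid is deliberately oversized:

__global__ void myKernel(float *out) { /* hypothetical kernel body */ }

int main()
{
    float *devPtr = 0;
    cudaMalloc(&devPtr, 256 * sizeof(float));

    dim3 dimGrid(1600, 31, 2000);   // oversized on purpose
    dim3 dimBlock(1, 1, 1);
    myKernel<<<dimGrid, dimBlock>>>(devPtr);

    // Launch-configuration errors surface synchronously, right here:
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // Errors during kernel execution only surface at the next sync point:
    err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(devPtr);
    return 0;
}

cudaGetErrorString() also turns the bare error code into a readable message, which is friendlier than printing the raw enum value.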

Nope, nothing:

dim3 dimGridAccu(1600, 31, 2000);
dim3 dimBlockAccu(1, 1, 1);

accumulatePixels_CU<<<dimGridAccu, dimBlockAccu>>>(..args..);

cudaErr = cudaThreadSynchronize();
if (cudaErr != cudaSuccess) {
    cout << "problem13 " << cudaErr << endl;
}

This runs 1600x31 times, but completely ignores anything that has to do with the z dimension. If I swap the grid and block arguments, it doesn’t execute at all. I know I can’t run this many threads at once, but errors would have been nice…

BTW: atomicAdd() is inter-block safe right?

Thanks for the replies!

EDIT:

Ok, I seem to be a bit confused about the grid/block restrictions. Running the CUDA Device Query that came with the SDK, I am told that my hardware (GTX 580) supports grids of dimension 65535x65535x65535, but any kernel I run that has a third dimension just ignores it. Why is this? Or is that a maximum of 65535 blocks in total?

Actually, I don’t see why this configuration should fail. Appendix F.1 says that the restriction on grid size is 65535 in each dimension, but doesn’t list a limit on the total number of blocks. (Note this is different from the block size, which is limited in each dimension as well as in the product.) The only restriction is that a grid with a z dimension can only be run on compute capability 2.0 devices. What GPU are you using, and are you compiling with -arch sm_20? Perhaps the CUDA runtime just ignores the z axis if you are running on a device that doesn’t support it? (That would be a bug in my mind, but at least it would explain something.)

Yes.

Are you running CUDA 4.0? It is needed to support grids with a z component different from 1.

Try:

cudaDeviceSynchronize();
err = cudaGetLastError();
if (err != cudaSuccess)
    ....

I’ve always had trouble getting consistent results from the advertised error return code of cudaThreadSynchronize() (now cudaDeviceSynchronize() in 4.0).

And regarding 3D grids, CUDA 4.0 can execute 3D grids - but only on compute 2.0 hardware, I think.
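For reference, once the toolkit and hardware support it, a 3D launch just reads the third coordinate from blockIdx.z inside the kernel. A sketch, with a made-up kernel name and the grid dimensions from the earlier snippet:

// Hypothetical 3D-grid kernel: needs CUDA 4.0 and -arch sm_20 or higher.
__global__ void accumulate3D(float *data, int nx, int ny, int nz)
{
    int x = blockIdx.x;   // 0 .. gridDim.x - 1
    int y = blockIdx.y;   // 0 .. gridDim.y - 1
    int z = blockIdx.z;   // 0 .. gridDim.z - 1; stays 0 without 3D-grid support
    if (x < nx && y < ny && z < nz)
        data[(z * ny + y) * nx + x] += 1.0f;
}

// Host side:
dim3 grid(1600, 31, 2000);
dim3 block(1, 1, 1);
accumulate3D<<<grid, block>>>(devData, 1600, 31, 2000);

On older setups that launch should at least fail loudly via cudaGetLastError() rather than silently flattening the grid.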

Thanks for the replies guys! I’m using a GTX 580 (compute capability 2.0), so that won’t be the problem. I only have version 3.2 installed, so that’s probably the reason. I’ll give it a try and see.

Update: Problem solved (upgraded to CUDA 4.0)!!