Why does my kernel launch?


Admittedly, the subject line is a somewhat silly question; it just means I'm missing something in the CUDA API.

I launch my kernels as follows:

const dim3 grid(numBlocks, 1);
const dim3 threads(numThreads);
const unsigned sharedMemRequested = 123;

myKernel<<<grid, threads, sharedMemRequested>>>(param1, param2, param3);

if (cudaSuccess != cudaThreadSynchronize()) {
    // report error somehow.
}

I am using a GTX 280 card.

To my surprise, the kernel will “kind of launch” even if the number of threads (specified by the variable numThreads) is very large, e.g. 10*1024. Moreover, the kernel will also “kind of launch” if the number of threads is reasonable (say, 512) but the number of registers per thread, as reported by nvcc when the --ptxas-options=-v flag is supplied, multiplied by the number of threads per block exceeds 16K. As far as I understand, 16K registers per block is the hardware limit for my card.

In the above paragraph, “kind of launch” means that cudaThreadSynchronize() does NOT report an error, but the kernel produces a wrong result, different from the one obtained when a large but smaller number of threads is used.

Hence my question: is there an easy way to detect, at execution time, that the number of registers available on the card is not sufficient for launching the requested number of threads?

Thank you in advance for your explanation!

I know the manual says that cudaThreadSynchronize() returns the last error, but I’ve had issues with that (back in CUDA 0.8/0.9; I have no idea what the state is now). To be paranoid, I always do this:


cudaError_t error = cudaGetLastError();

if (error != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(error));
}

In your code, with num_threads > 16384/num_regs you should get the error “too many resources requested for launch”.

The other possibility is that you have too old a version of the CUDA driver installed. If you try to use CUDA 2.1 with, say, a 170.x-series driver, some really strange things will happen.
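If you want to detect the problem before launching, the runtime can report a kernel's per-thread register usage, which you can compare against the device's per-block register budget. A sketch along those lines, using cudaFuncGetAttributes() (I'm not certain which toolkit version first exposed it, so check your headers; myKernel and its signature are stand-ins from your post):

```cpp
#include <cstdio>

// Declared elsewhere; stand-in for the kernel in the original post.
extern __global__ void myKernel(int, int, int);

// Returns true if a block of numThreads threads fits in the register
// file of device 0 for this kernel.
bool blockFits(unsigned numThreads)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, myKernel) != cudaSuccess)
        return false;  // could not query the kernel

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return false;

    // prop.regsPerBlock is 16384 on a GTX 280
    return attr.numRegs * numThreads <= (unsigned)prop.regsPerBlock;
}
```

You could call blockFits(numThreads) before the <<<...>>> launch and fall back to a smaller block size if it returns false.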

Mr Anderson, your code fixed my issue, and I now see the error message you suggested I would. I also see another error message, which makes total sense if I attempt to launch a ridiculously high number of threads. Thank you!

One minor question: do you think it makes more sense to invoke cudaGetLastError() BEFORE or AFTER cudaThreadSynchronize()?

Thanks again!

Also being from the CUDA 0.9 generation, I too had to do as MrAnderson does to catch errors:


cudaError_t err; /* error value for cuda calls */

/* kernel call here */

/* check last error message from kernel */
err = cudaGetLastError();

if (err != cudaSuccess) {
    fprintf(stderr, "cuda kernel failed - %s\n", cudaGetErrorString(err));
}


Huh. I guess they never fixed that issue. Strange that it hasn’t come up more on the forums. Maybe I should submit a bug… the manual clearly says

There really isn’t any room for interpretation there.

Most definitely after. Since kernel calls are asynchronous, calling cudaGetLastError() before the synchronize will only check for the last error set before the kernel launched.

Also note that you probably only want to enable these error checks if a debug flag is set (the SDK, for example, only does so in Debug builds). Calling cudaThreadSynchronize() after every kernel call can hurt performance, especially if you are expecting to overlap CPU computations with kernels on the GPU.
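For reference, such a debug-only check can be wrapped in a macro along these lines (the macro name is my own; cudaThreadSynchronize() matches the API of this era, though later toolkits deprecate it in favour of cudaDeviceSynchronize()):

```cpp
#include <cstdio>

// Compiled to a no-op unless DEBUG is defined, so release builds pay
// nothing for the synchronize.
#ifdef DEBUG
#define CUDA_CHECK_KERNEL()                                          \
    do {                                                             \
        cudaError_t e = cudaThreadSynchronize(); /* wait for kernel */ \
        if (e == cudaSuccess)                                        \
            e = cudaGetLastError();  /* pick up any sticky error */  \
        if (e != cudaSuccess)                                        \
            fprintf(stderr, "%s:%d: CUDA error: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(e));      \
    } while (0)
#else
#define CUDA_CHECK_KERNEL() do { } while (0)
#endif
```

Usage would be simply a CUDA_CHECK_KERNEL(); line right after each <<<...>>> launch.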

Mr Anderson, your and my interpretations are in agreement here. However, having given this more thought, I can now imagine interpreting it as follows: the function returns only after all the asynchronous tasks have completed, but the error status is returned as it was at the time when cudaThreadSynchronize() was invoked, i.e., possibly BEFORE the asynchronous tasks have completed. I can’t see who would find that kind of functionality useful, though.

I would imagine that if a kernel fails to launch at all, then the call into the kernel is NOT asynchronous, and hence a failure-to-launch error should be reported by cudaGetLastError() whether or not cudaThreadSynchronize() was invoked after the launch.
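In other words, one could check at both points, something like this (reusing the launch from my original post; whether launch-configuration errors are always recorded synchronously is exactly the assumption being discussed here):

```cpp
#include <cstdio>

myKernel<<<grid, threads, sharedMemRequested>>>(param1, param2, param3);

// Launch-configuration errors (too many threads, too many registers, ...)
// would be visible immediately if the launch itself is synchronous:
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launchErr));

// Errors that occur while the kernel runs only surface after waiting:
cudaError_t execErr = cudaThreadSynchronize();
if (execErr != cudaSuccess)
    fprintf(stderr, "execution failed: %s\n", cudaGetErrorString(execErr));
```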

That’s a good point. I check for such errors within ASSERT-like macros.

BTW, if you choose to file a bug report on this, you might want to note that I observed this behavior with the most recent CUDA driver (updated yesterday) and SDK 2.1.

It would not be very easy for me to confirm this, but I seem to have noticed that if the number of threads is reasonable but the block won’t fit into the register space, then the kernel still gets launched and behaves as if some registers step on each other, so the device executes the kernel code incorrectly. Before I put the error check you suggested into my host code, I saw an error message generated by my kernel code under exactly such circumstances.

Thank you for your thoughts!