Why does my kernel launch?


Admittedly, the subject line is a somewhat silly question; it just means I'm missing something in the CUDA API.

I launch my kernels as follows:

const dim3 grid(numBlocks, 1);
const dim3 threads(numThreads);
const unsigned sharedMemRequested = 123;

myKernel<<<grid, threads, sharedMemRequested>>>(param1, param2, param3);

if (cudaSuccess != cudaThreadSynchronize()) {
    // report error somehow.
}

I am using a GTX 280 card.

To my surprise, the kernel will “kind of launch” even if the number of threads (specified by the variable numThreads) is very large, e.g. 10*1024. Moreover, the kernel will also “kind of launch” if the number of threads is reasonable (say, 512) but the number of registers per thread, as reported by nvcc when the --ptxas-options=-v flag is supplied, multiplied by the number of threads per block exceeds 16K. As far as I understand, 16K registers per block is the hardware limit for my card.

In the above paragraph, “kind of launch” means that cudaThreadSynchronize() does NOT report an error, but the kernel produces a wrong result, different from the one obtained when a large but smaller number of threads is used.

Hence my question: is there an easy way to detect, at execution time, that the number of registers available on the card is not sufficient for launching the requested number of threads?

Thank you in advance for your explanation!

I know the manual says that cudaThreadSynchronize() returns the last error, but I’ve had issues with that (back in CUDA 0.8/0.9; I have no idea what the state is now). To be paranoid, I always do this:


cudaError_t error = cudaGetLastError();

if (error != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(error));
}

In your code, with num_threads > 16384/num_regs you should get the error “too many resources requested for launch”.

The other possibility is that you have too old a version of the CUDA driver installed. If you try to use CUDA 2.1 with, say, a 170.x-series driver, some really strange things will happen.
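If you want to detect the problem before launching, the runtime can report a kernel's per-thread register usage, which you can compare against the device's per-block register budget. A sketch along those lines, using cudaFuncGetAttributes() (I'm not certain which toolkit version first exposed it, so check your headers; myKernel and its signature are stand-ins from your post):

```cpp
#include <cstdio>

// Declared elsewhere; stand-in for the kernel in the original post.
extern __global__ void myKernel(int, int, int);

// Returns true if a block of numThreads threads fits in the register
// file of device 0 for this kernel.
bool blockFits(unsigned numThreads)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, myKernel) != cudaSuccess)
        return false;  // could not query the kernel

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return false;

    // prop.regsPerBlock is 16384 on a GTX 280
    return attr.numRegs * numThreads <= (unsigned)prop.regsPerBlock;
}
```

You could call blockFits(numThreads) before the <<<...>>> launch and fall back to a smaller block size if it returns false.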

Mr Anderson, your code fixed my issue, and I now see the error message you suggested I would. I also see another error message, which makes total sense if I attempt to launch a ridiculously high number of threads. Thank you!

One minor question: do you think it makes more sense to invoke cudaGetLastError() BEFORE or AFTER cudaThreadSynchronize()?

Thanks again!

Also being from the CUDA 0.9 generation, I too had to do as MrAnderson does to catch errors:


cudaError_t err; /* error value for cuda calls */

/* kernel call here */

/* check last error message from kernel */
err = cudaGetLastError();

if (err != cudaSuccess) {
    fprintf(stderr, "cuda kernel failed - %s\n", cudaGetErrorString(err));
}


Huh. I guess they never fixed that issue. Strange that it hasn’t come up more on the forums. Maybe I should submit a bug… the manual clearly says

There really isn’t any room for interpretation there.

Most definitely after. Since kernel calls are asynchronous, calling cudaGetLastError() before the synchronize will only check for the last error set before the kernel launched.

Also note that you probably only want to enable these error checks if a debug flag is set (the SDK, for example, only does so in Debug builds). Calling cudaThreadSynchronize() after every kernel call can hurt performance, especially if you are expecting to overlap CPU computations with kernels on the GPU.
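For reference, such a debug-only check can be wrapped in a macro along these lines (the macro name is my own; cudaThreadSynchronize() matches the API of this era, though later toolkits deprecate it in favour of cudaDeviceSynchronize()):

```cpp
#include <cstdio>

// Compiled to a no-op unless DEBUG is defined, so release builds pay
// nothing for the synchronize.
#ifdef DEBUG
#define CUDA_CHECK_KERNEL()                                          \
    do {                                                             \
        cudaError_t e = cudaThreadSynchronize(); /* wait for kernel */ \
        if (e == cudaSuccess)                                        \
            e = cudaGetLastError();  /* pick up any sticky error */  \
        if (e != cudaSuccess)                                        \
            fprintf(stderr, "%s:%d: CUDA error: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(e));      \
    } while (0)
#else
#define CUDA_CHECK_KERNEL() do { } while (0)
#endif
```

Usage would be simply a CUDA_CHECK_KERNEL(); line right after each <<<...>>> launch.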

Mr Anderson, your and my interpretations are in agreement here. However, having given this more thought, I can now imagine interpreting it as follows: the function returns only after all the asynchronous tasks have completed, but the error status is returned as it was at the time when cudaThreadSynchronize() was invoked, i.e., possibly BEFORE the asynchronous tasks have completed. I can’t see who would find that kind of functionality useful, though.

I would imagine that if a kernel fails to launch at all, then the call into the kernel is NOT asynchronous, and hence a failure-to-launch error should be reported by cudaGetLastError() whether or not cudaThreadSynchronize() was invoked after the launch.
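In other words, one could check at both points, something like this (reusing the launch from my original post; whether launch-configuration errors are always recorded synchronously is exactly the assumption being discussed here):

```cpp
#include <cstdio>

myKernel<<<grid, threads, sharedMemRequested>>>(param1, param2, param3);

// Launch-configuration errors (too many threads, too many registers, ...)
// would be visible immediately if the launch itself is synchronous:
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launchErr));

// Errors that occur while the kernel runs only surface after waiting:
cudaError_t execErr = cudaThreadSynchronize();
if (execErr != cudaSuccess)
    fprintf(stderr, "execution failed: %s\n", cudaGetErrorString(execErr));
```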

That’s a good point. I check for such errors within ASSERT-like macros.

BTW, if you choose to file a bug report on this, you might want to note that I observed this behavior with the most recent CUDA driver (updated yesterday) and SDK 2.1.

It would not be very easy for me to confirm this, but I seem to have noticed that if the number of threads is reasonable but the block won’t fit into the register space, then the kernel still gets launched and behaves as if some registers step on each other, so the device executes the kernel code incorrectly. Before I put the error check you suggested into my host code, I saw an error message generated by my kernel code under exactly such circumstances.

Thank you for your thoughts!