Silent kernel failure

I wrote a cuda program which behaves very strangely: When I run it with more than X threads, I get a silent kernel failure. I run cudaDeviceSynchronize() after the kernel, and then I check its return value for cudaSuccess – and it matches, but no code of the kernel is executed!

I check my memory usage with nvidia-smi, and I still have around 20% free, so that is not the issue.

Does anyone have a clue what could be going wrong and how I can programmatically check for these errors?

The kernel is not huge, but still too big to post here. When I reduce the size of the kernel, I don’t run into these issues.


You might be running too many threads(per block or per SM,) or there may be insufficient registers for them, etc. There are limits for these things, per compute capability(hardware architecture) listed in the Programming Guide.

You must check cudaGetLastError() right after <<<,>>> (assuming that’s how you’re running your kernel.) Checking after cudaDeviceSynchronize() isn’t enough.

There is a limit on how many threads can be per block. For Fermi cards is 1024 threads per block. Also you have limits on the number of blocks in one direction. You must not exceed 65000 in each direction. The register are 32k registers pe SM. If one thread uses lets say p registers, you will run max 32k/p (but not more than 1024) threads per block. The command cudaGetLastError() from the last post will tell why is it crashing.

It also happened to me to have a mistake in the code, like outof bounds access in th array, and the kernel would crash only for larger systems.

There are two kind of errors associated with kernel launches:

(1) Pre-launch errors occur synchronously, when the driver detects lack of resources, invalid kernel images, and the like, prior to launching the kernel on the device. To catch these, call cudaGetLastError() directly after the kernel invocation, as pointed out by the previous posters.

(2) Kernel execution errors occur asynchronously, and are reported via the status of the next synchronizing operation. A common idiom is to use cudaThreadSynchronize() to catch these. A common kernel execution error is the ULF (unspecified launch failure) which is the GPU equivalent of a segfault on the host, i.e. typically caused by an out-of-bounds access. Another typical kernel execution error is a timeout error when a GPU is also used as a display device and the watchdog timer killed a running kernel that exceeded the timeout to make sure graphics doesn’t get blocked indefinitely.

As the previous posters have pointed out, the issue in this case is probably an unchecked pre-launch error resulting from an out-of-resources condition of some kind.

Thanks for the many replies! I did forget to check for errors after the kernel launch.

Now I see an error 7: too many resources requested for launch.

I know the thread limits of the card (1024) and I am below that, so I’m guessing the code compiled using too many registers. Is there a way to check what resource exactly exceeded the card capacity?


First step, find out how many registers and shared memory are used by the kernels ( compile with -Xptxas=-v ).
A possible alternative is to use launch bounds ( look at the documentation)

Just lost hours of debugging to find out this fucking silent error, just because I miswapped “grid” and “block” in the kernel parameters launch…

Thank you NVIDIA to silent an error and make user feel like everything is fine !

Errors are being reported. It is the responsibility of your code to look at the report and decide how to react. If you do not know how to do that, Google for proper CUDA error checking and take the first link.

As njuffa said, if there is an actual runtime-detectable error, it will be reported.

It is possible of course that no error is reported in certain situations. For example, if you intend to launch 256 blocks of 512 threads, and you swap and instead launch 512 blocks of 256 threads, there wouldn’t be an error reported. The GPU has no way of knowing your intent in this case. Many GPU codes wouldn’t actually behave any differently with that sort of swap, but some may.

In that kind of situation, the runtime cannot detect that any actual error has occurred.

But if you had a runtime detectable error in launch configuration (which would typically run afoul of the threads per block parameter) then the runtime will report that, if the errors are properly captured.

Admittedly it is a bit awkward though that almost all errors will be reported asynchronously by later CUDA calls but kernel launch errors require their own an extra call to cudaPeekAtLastError() or cudaGetLastError() to be detected. On re-checking the Programming Guide this is reasonably documented now, but it hasn’t been communicated well in the past.

I can only recommend using cuda-memcheck, which does detect launch errors even if the code gets this check (or any other) wrong.

Time for a signature overhaul.

By the way, what would be the rationale for calling cudaPeekAtLastError() right after a kernel launch as recommended in the first link brought up by a Google search for “canonical CUDA error checking” often cited here, if the error code subsequently gets absorbed by the CUDA runtime anyway? By calling cudaGetLastError() instead it is at least somewhat obvious that the the error will not get reported again later even if one does not remember this particular quirk of the API.

No idea, I have never used cudaPeekAtLastError(). But the existence of the function suggests that it is useful for some use cases, because generally speaking APIs in CUDA and associated libraries are driven by user demand (these may be users internal to NVIDIA or the general population of CUDA programmers).

Ad-hoc hypothesizing suggests it may be used for situations where error detection and error handling is performed in two separate steps. If at detection stage a status other that “success” is found with cudaPeekAtLastError(), control is handed off to a centralized handler for error handling, which then calls cudaGetLastError() to extract the error status and clear it.

The way cudaGetLastError() works is roughly modeled on OpenGL error handling, as I recall.

I don’t know what rationale the original poster of the link you linked had.

cudaPeekAtLastError does not clear the error code.

It can be reported again by the CUDA runtime API, either via another cudaPeekAtLastError call, or another cudaGetLastError call.

Your previous statement was not how I would have worded it anyway, but did not think it was worthwhile to say anything.

" kernel launch errors require their own call to cudaPeekAtLastError() or cudaGetLastError() to be detected"

I wouldn’t say they require “their own” call. They require a call, somewhere, even if it be at the end of the application. The CUDA runtime API does not “absorb” such an error. But I understand what you mean.

The logic of the CUDA runtime error reporting is :

  1. If an asynchronous error has been detected, report it. To my knowledge these are all sticky.
  2. If a synchronous error results from the current requested operation, report it. To my knowledge these are all non-sticky. Reporting of such error if by cudaGetLastError clears it (PeekAt does not, see below)
  3. Return cudaSuccess.

You’ll note the 1-2-3 sequence does not provide a mechanism to report a previous synchronous error, however it may have occurred. PeekAt/GetLastError are special. By definition they do not generate their own synchronous error in step 2, therefore they report a previous (uncleared) synchronous error. PeekAt does not clear such. Get does.

I believe this is a sound definition. Perhaps someone will prove me wrong.

I am not questioning the existence of cudaPeekAtLastError(). I can see there are use cases and vaguely remember having used it myself a long time ago - it might have been checking for launch errors in asynchronous library code where I still wanted to let the callers receive any other errors themselves.
I am only thinking that cudaGetLastError() may be a slightly better recommendation when checking for launch errors in your own code.

EDIT: This is directed at njuffa - I hadn’t refreshed the browser in a while when I posted this.

Robert: I believe your description is both accurate and clear. I wish the Programming Guide were as clear, and not just accurate. Thank you very much for this post!

For a long time somewhere between initially getting familiar in the CUDA 1.x timeframe and relatively recently when I revisited the subject for a training course it had slipped my mind that just checking at the next synchronizing CUDA call is not sufficient, because not all CUDA errors are sticky.
If this happens to me who is regularly involved with CUDA it seems plausible for a less frequent CUDA user to get it wrong, as we regularly observe here and on StackOverflow.

My post #10 was intentionally lax in language to be a bit closer to the perspective of a CUDA learner. (Not that I would get it as accurate as you regularly do, even if I tried to). That realization after the course preparation has left some marks. I think communication could have been better there instead of just blaming users for getting it wrong, and I feel guilty of that myself too.

Edited #10 to (somewhat) account for your criticism.

Already I had to edit mine also. I don’t disagree that CUDA error handling is problematic. This can be confirmed simply by observing the amount of churn created by those who don’t use it, for whatever reason. The forums are absolutely littered with examples. And it’s tedious for everybody involved.

And people get angry at me if I tell them to use it in a strong/forceful way. Just have to keep banging the drum, gently.

Well, Nvidia would be perfectly placed to amend the C++ API to throw on errors by default, with some way of switching back to reinstating the old behavior. (Yes I know C++ exceptions are a can of worms.)

I am sure Nvidia will have discussed this to death internally already. Not sure what particular reasons are holding back such a change, but I would like to think they could be overcome with sufficient will.

A simple C API is useful for interfacing with myriad other languages that are not C++, and its existence has served CUDA quite well considering the huge number of bindings available for it today.

It is certainly true that some of the design features of CUDA that feel a bit painful today and look dated after 14 years of existence are due to trade-offs that were made to ensure its rapid and wide adoption.

Ok, maybe switching the behavior is a bad way of phrasing it. Wrapping the existing (mostly C) API and directing new unsuspecting C++ users to the wrapper I think sounds more doable.

Expecting a plug by epk2 here (disclaimer: I’ve never actually looked at his C++ wrapper). But the entire point would be to have this offered by Nvidia with the SDK, and to be the default where a C++ programmer new to CUDA ends up.