cudaGetLastError returns a strange error


in my code, I have a similar sequence:

// do some stuff, launch kernels, etc

res = cudaDeviceSynchronize();
// check res

res = cudaGetLastError();
// check res

All calculations are done on the default stream and one thread.

The cudaDeviceSynchronize returns cudaSuccess, but the cudaGetLastError call returns an invalid device function error.

Should this be possible according the CUDA API specification?

I mean the sync call should wait until the device is finished, so no errors should be emitted between those two code lines (once again, assuming a single threaded app).

How can this happen?

Yes, its possible.

cudaDeviceSynchronize() returns the error code/result from the actual synchronization process, as well as any previous asynchronous errors. The invalid device function error is not an asynchronous error. It is an error that is discoverable/reportable at the moment the kernel launch is issued, not an error that results from kernel execution. It is also a non-sticky error, i.e. an error that does not “corrupt” the CUDA context, therefore it is not reported via ordinary API activity, but is reported via cudaGetLastError.

If you used proper CUDA error checking, you would discover this error before getting to the cudaDeviceSynchronize function.

Thank you!

“It is also a non-sticky error”
“an asynchronous error”

Is there a document that goes deeper into CUDA error handling? I don’t remember these terms from the CUDA programming guide and also the API documentation does not seem to mention them.

“cudaDeviceSynchronize() returns an error if one of the preceding tasks has failed.” (

From the above, I’ve got the impression that cudaDeviceSynchronize() should ‘catch’ all previous errors.

I can’t point you to a single concise reference that covers these topics. There are various questions/answers on Stack Overflow which cover responses to these types of questions (and perhaps probably here on Here is one such example on SO:

cudaGetLastError() or cudaPeekAtLastError(), which is/are referred to in many treatments of “proper CUDA error checking”, should catch any previous error, whether synchronous or asynchronous, sticky or non-sticky. The same cannot be said for cudaDeviceSynchronize(). This is easy to prove with a simple test case.