CUDA errors: determine "sticky-ness"

I’m trying to figure out how to determine whether a CUDA error is sticky or not, specifically because I need to know whether the host process should be terminated.

So, I ran some tests to see how sticky errors can be investigated, and found the following:

// this kernel will access 'input' or 'output' out of bounds, causing a 'cudaErrorIllegalAddress' (code 700)
a_kernel_with_error<<<1,1>>>(input, n_input, output, n_output);

// call 'cudaGetLastError' to verify whether the kernel launch failed
cudaError_t first_get(::cudaGetLastError());
// 'first_get' is 'cudaSuccess': the launch itself succeeded, and the out-of-bounds error is asynchronous

// run some host code ...

// get to a synchronization point, e.g. copy 'output' to cpu
std::vector<int> output_copy(n_output);
cudaError_t sync_point(::cudaMemcpy(output_copy.data(), output, n_output * sizeof(int), cudaMemcpyDeviceToHost));
// 'sync_point' stores 'cudaErrorIllegalAddress', since the CPU and GPU now synchronize

// now, there is a CUDA error, and I need to determine whether to kill the host process or not
// so I start to query the GPU for errors:
cudaError_t new_get(::cudaGetLastError());
// 'new_get' stores 'cudaErrorIllegalAddress', as expected since the GPU got a sticky error,
// but if I call 'cudaGetLastError' again, the error has been reset to 'cudaSuccess'
cudaError_t a_second_get(::cudaGetLastError()); // cudaSuccess

// every subsequent call to 'cudaPeekAtLastError' or 'cudaGetLastError' now returns 'cudaSuccess'

// only if I synchronize again do I get 'cudaErrorIllegalAddress', and it can again be reset with 'cudaGetLastError'
cudaError_t device_sync(::cudaDeviceSynchronize()); // cudaErrorIllegalAddress
cudaError_t another_get(::cudaGetLastError());      // cudaErrorIllegalAddress
cudaError_t last_get(::cudaGetLastError());         // cudaSuccess

So, it looks like that, to determine whether an error is sticky, the following lines of code should suffice:

cudaError_t first_sync(::cudaDeviceSynchronize());
cudaError_t try_to_reset(::cudaGetLastError());
cudaError_t second_sync(::cudaDeviceSynchronize());
if (first_sync != cudaSuccess && first_sync == second_sync)
   // the error is sticky: it survived the reset attempt

Can anyone confirm that I can go with synchronize-reset-synchronize in order to determine whether an error is sticky?

As a side note, to be honest I don’t understand why ‘cudaGetLastError’ resets the error even for sticky errors; this creates some confusion.

I don’t think it used to be that way, but that is neither here nor there.

It seems sufficient simply to call cudaDeviceSynchronize() twice (or even just once). If the same error (or any error) is indicated, it’s a pretty good indicator, I think.

cudaDeviceSynchronize() can return previously occurring asynchronous errors. Pretty much all asynchronous errors (these are errors that occur as a result of device code execution) are sticky. I’m not aware of any synchronous errors that cudaDeviceSynchronize() could/would report that would cloud this understanding.
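Sketched out (untested, just to illustrate the check; is_error_sticky is a name I am using only for this example), the synchronize-reset-synchronize idea could look like this:

// Untested sketch: classify the current error state with two synchronizations and a
// reset attempt in between. A sticky error survives the reset and is reported again.
bool is_error_sticky()
{
    cudaError_t first(::cudaDeviceSynchronize());  // surfaces pending asynchronous errors
    if (first == cudaSuccess)
        return false;                              // no error at all
    ::cudaGetLastError();                          // attempt to clear a non-sticky error
    cudaError_t second(::cudaDeviceSynchronize());
    return second != cudaSuccess;                  // sticky: the error is reported again
}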

The current description of cudaGetLastError() says:

Returns the last error that has been produced by any of the runtime calls in the same instance of the CUDA Runtime library in the host thread and resets it to cudaSuccess.

To the best of my knowledge and recollection, this has always been the defined behavior of this function. If I recall correctly, this behavior was modeled on the corresponding functionality in OpenGL. Generally speaking, the early design of CUDA was strongly influenced by a desire to keep the learning curve as shallow as possible by defining CUDA semantics in ways that were familiar to users of C, Cg, and OpenGL.

For example, providing device function intrinsics for some standard math functions was directly driven by the desire to offer math functions of performance identical to what Cg provided (reduced accuracy, but really fast).

Can we say that all and only asynchronous errors are sticky and require the application to exit? My goal is to determine whether or not to restart the executable; otherwise I will restart it for every CUDA-related error, which would be a pity.

I chose to call cudaDeviceSynchronize() twice, trying to reset the error in between; if it cannot be reset, then it must be sticky.

I am familiar with the documentation of cudaGetLastError(), but while looking on the internet for a better understanding of sticky errors, I found these two explanations:

  • Lei Mao says

when the kernel tries to access invalid memory address during kernel execution, it will result in a sticky error which will be captured and returned by all the subsequent CUDA runtime API calls

in Proper-CUDA-Error-Checking

  • Robert Crovella says

This type of error cannot be reset with the CUDA Runtime API cudaGetLastError() function

in StackOverflow

So, honestly, this is a little bit confusing; maybe I can suggest creating a specific API call to check whether the CUDA context is irremediably broken.

@giorgiore27 I cited the documentation for cudaGetLastError() as a point of reference, not because I was making assumptions about whether you are familiar with the documentation. Before we talk about something, let’s first define what it is we are talking about, before reasoning about it, exploring its history, or the rationale behind it.

Frankly, this discussion about sticky versus non-sticky errors is foreign to me. What I have always understood the expected usage of cudaGetLastError() to be is this: It tells a programmer whether an error was recorded by the CUDA runtime since they last looked using this function.

The programmer then gets to make a decision at that point what to do next if the status is not cudaSuccess. The CUDA runtime error status is automatically reset to cudaSuccess by the call to cudaGetLastError(), so the next call to cudaGetLastError() again indicates whether an error occurred in the meantime.

If an error condition (such as ULF) causes the destruction of the current CUDA context, all CUDA runtime API functions become inoperable, as they pertain to a context that no longer exists. This also affects cudaGetLastError(): It cannot reset the error status to cudaSuccess, because that error status is part of the CUDA context, which no longer exists.

This is exactly the aim of my question; let me rephrase it. Looking at the internet (again, this is not something I found in the documentation), you need to kill the host process when there is a sticky error (the “owning” process must also terminate). In my workflow, killing the host process is a bit of a pain, so I am willing to do that only when it is strictly needed. This is why I am asking how to tell whether I am dealing with a sticky error and need to restart the application.

My point about cudaGetLastError() was just a side note about (in my opinion) a misleading behaviour of the API, since it looks like the error is gone but the GPU is in an error state that cannot be reset.

My suggestion would be to stick, first and foremost, to the official CUDA documentation. If the official documentation is unclear, it may make sense to consult secondary sources to seek clarification. However, this carries the risk that the authors of the secondary sources experienced the same kind of confusion you came up against.

CUDA does not prescribe any specific course of action after a programmer finds out that their CUDA context has been destroyed as a consequence of an unrecoverable error. That is why you will not find anything about that in CUDA documentation, with the possible exception of the Best Practices Guide.

A hypothetical example: Your CUDA-accelerated app also includes a CPU-only legacy code path because not all of your customers have a GPU of sufficient compute capability yet. Your application also uses an all-or-nothing model of execution: either an app-level operation completes successfully or otherwise the app behaves as if that operation had never been issued. After you find that the CUDA context used by your app has been destroyed as the result of an unrecoverable error, you switch execution to the unaccelerated legacy path. The app remains functional, albeit at reduced performance.
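A rough sketch of what such a dispatch might look like (Work, run_on_gpu, run_on_cpu, gpu_usable, and is_error_sticky are purely hypothetical names for application-level pieces, not part of any CUDA API):

// Purely hypothetical sketch of an all-or-nothing dispatch with a CPU fallback.
struct Work;
cudaError_t run_on_gpu(const Work&);   // accelerated path, returns a CUDA status
void        run_on_cpu(const Work&);   // legacy CPU-only path
bool        is_error_sticky();         // e.g. the synchronize-reset-synchronize check above

bool gpu_usable = true;

void process(const Work& work)
{
    if (gpu_usable && run_on_gpu(work) == cudaSuccess)
        return;                        // GPU path completed successfully
    if (gpu_usable && is_error_sticky())
        gpu_usable = false;            // context destroyed: stop issuing GPU work
    run_on_cpu(work);                  // all-or-nothing: redo the operation on the CPU path
}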

I do not think this is a correct characterization of the behavior of cudaGetLastError(). You will, however, need to issue a call to cudaDeviceSynchronize() to ensure that errors that occurred asynchronously (and may cause destruction of the CUDA context) are correctly reflected in the error status.

Sticky errors are those errors that corrupt the CUDA context. In the CUDA runtime API, a corrupted CUDA context is non-recoverable. The only method to recover from it is to allow the owning process to terminate.

Yes, all asynchronous errors are sticky. Only asynchronous errors are sticky.

Asynchronous errors are those errors that occur as a result of device code execution itself.
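For contrast, here is a small illustrative sketch (noop_kernel is a placeholder defined only for this example) of a synchronous, non-sticky error next to the asynchronous case discussed above:

// Illustrative sketch: a synchronous (non-sticky) error versus an asynchronous (sticky) one.
__global__ void noop_kernel() {}

void contrast_example()
{
    // Requesting far more threads per block than any device supports fails at launch time:
    noop_kernel<<<1, 65536>>>();
    cudaError_t launch_status(::cudaGetLastError()); // typically cudaErrorInvalidConfiguration (synchronous)
    // This error is cleared by cudaGetLastError() and the context remains usable.
    // An out-of-bounds access inside a kernel, by contrast, is only detected while device
    // code runs, surfaces at a synchronization point, and corrupts the context (sticky).
}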

It is true in some sense that a call to cudaGetLastError() “resets” the error. However in practice, it is self-evident that it does not reset the sticky error in any practical way. The same error will reoccur (i.e. be reported again) if any non-trivial usage of the CUDA runtime API is attempted. For all practical purposes the context is corrupted by the sticky error and is useless.
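For example (a sketch of what one would typically observe after the illegal-address error above, not an exhaustive rule):

// After the sticky error, ordinary runtime calls keep failing, no matter how often
// cudaGetLastError() has "reset" the per-thread error status.
int* p = nullptr;
cudaError_t alloc_status(::cudaMalloc(&p, sizeof(int)));
// 'alloc_status' is expected to be cudaErrorIllegalAddress (or another failure code),
// because the allocation is attempted against the corrupted context.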

