Difference in error handling between driver API and runtime API

Coming from How to clear cuda errors?, but that conversation is locked, so I opened a new one.

I checked the resource https://www.olcf.ornl.gov/wp-content/uploads/2021/06/cuda_training_series_cuda_debugging.pdf, but found that it does not cover the driver API.

It seems the driver API error handling (CUDA Driver API :: CUDA Toolkit Documentation) only has cuGetErrorName and cuGetErrorString, which are clearly stateless functions that just implement a look-up table.

It seems we only have the concept of “error clearing” in the runtime API, via cudaGetLastError?
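For concreteness, here is a minimal sketch of what I mean by “stateless look-up”: it just feeds a fixed error code into the two driver API helpers; nothing here touches any context or stored error state.

#include <cstdio>
#include <cuda.h>   // driver API

int main() {
    const char *name = nullptr;
    const char *desc = nullptr;
    // Pure look-up: translate a given CUresult value into strings.
    cuGetErrorName(CUDA_ERROR_ILLEGAL_ADDRESS, &name);
    cuGetErrorString(CUDA_ERROR_ILLEGAL_ADDRESS, &desc);
    printf("%s: %s\n", name, desc);   // no context, no per-thread error state involved
    return 0;
}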

My mental model is:

Each CUDA context has a flag to track whether the context is corrupted. When a kernel runs into issues (illegal memory access, illegal instruction, etc.), that flag is set and the context cannot be used anymore.

For the driver API: if that flag is set, return the error; otherwise, just return the execution result of the driver API call.

For the runtime API (including kernel launches): it additionally tracks a flag for persistent (i.e., persistent across runtime API calls) but clearable errors, notably kernel launch errors such as an invalid shared memory size. If either flag is set, return the error; otherwise, return the execution result of the runtime API call.

Only certain runtime APIs set the persistent error flag; simple calls like cudaMalloc will not set it. So a cudaMalloc failure will not affect a following kernel launch, but a failed kernel launch will affect a following cudaMalloc. Of course, an illegal memory access inside the kernel will make both of them fail.

Is this correct?
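To probe this myself, this is the kind of test program I have in mind (the kernel name and the oversized shared-memory request are just placeholders to provoke a launch error; I am not asserting what it prints, only what my model predicts in the comments):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    // Provoke a (presumably non-sticky) launch error by requesting far more
    // dynamic shared memory than any device supports.
    noop<<<1, 1, (size_t)1 << 30>>>();
    printf("after bad launch: %s\n",
           cudaGetErrorString(cudaGetLastError()));  // my model: a launch-configuration error

    // Check whether an unrelated runtime call is affected afterwards.
    void *p = nullptr;
    cudaError_t err = cudaMalloc(&p, 1 << 20);
    printf("cudaMalloc after bad launch: %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(p);
    return 0;
}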


After digging for a while, I think we can treat a CUDA driver API call as follows:

CUresult some_driver_api(some_args) {
    // check if context is corrupted
    if (context_is_corrupted) {
        return corresponding_error_code;
    }
    // execute the corresponding driver API implementation
    return some_driver_api_implementation(some_args);
}

And treat a CUDA runtime API call as follows:

cudaError_t some_runtime_api(some_args) {
    // check if context is corrupted
    if (context_is_corrupted) {
        return corresponding_error_code;
    }
    // execute the corresponding runtime API implementation
    // (which ultimately calls into the driver API)
    cudaError_t value = some_runtime_api_implementation(some_args);
    // if the call was not successful, update the global last-error variable
    if (value != cudaSuccess) {
        last_error_code = value;
    }
    // return the call result
    return value;
}

The difference is whether a failed API call affects a global last_error_code. If we never call cudaGetLastError, then they behave the same. However, since a lot of code explicitly calls cudaGetLastError to check for errors, the difference matters in practice.

So the code in How to clear cuda errors? - #3 by njuffa is actually problematic: although it can allocate memory successfully, the global error state is left polluted. We need to call cudaGetLastError to clear the error for it to be useful.
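In other words, something along these lines (the failing call is just an assumed example; the point is that a single cudaGetLastError reads the recorded error and resets the state to cudaSuccess):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Assumed-to-fail call: an invalid device ordinal.
    cudaError_t err = cudaSetDevice(-1);
    printf("cudaSetDevice: %s\n", cudaGetErrorString(err));

    // The first call reports the recorded error and clears it ...
    printf("1st cudaGetLastError: %s\n", cudaGetErrorString(cudaGetLastError()));
    // ... so the second call should report cudaSuccess again.
    printf("2nd cudaGetLastError: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}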

cc @Robert_Crovella ?

Do you have a specific question?
Yes, error handling is not purely a 1:1 correspondence between Driver API and Runtime API. The Driver API is in general not an exact 1:1 analog of the Runtime API. I’ve spent some time looking at Runtime API error handling, and very little at Driver API error handling. My general impression is that yes, the runtime API seems to have a stateful error memory that the driver API does not seem to have.

In general, the “status of the context” holds the sticky-error-ness/state for both the driver API and the runtime API. If the context is corrupted (in the case of the runtime API, the primary context) then it will be unusable in either runtime or driver API case. In the runtime API case I refer to this as a sticky error. The non-sticky-error-memory that I have previously referred to seems to be primarily an artifact of the runtime API.

In any event, for anyone who is concerned about API level errors, my general advice would be to capture the error status at every opportunity (rigorous error checking) and do something sensible whenever an error is reported. This is sensible for either the runtime or driver API, from my perspective.
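One common pattern for that kind of rigorous checking (just a sketch, not the only way to do it; the exit-on-error policy is up to the application) is a pair of check macros, one per API:

#include <cstdio>
#include <cstdlib>
#include <cuda.h>
#include <cuda_runtime.h>

// Runtime API: check the returned cudaError_t at every call site.
#define CHECK_RT(call)                                                    \
    do {                                                                  \
        cudaError_t e_ = (call);                                          \
        if (e_ != cudaSuccess) {                                          \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,            \
                    cudaGetErrorString(e_));                              \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Driver API: same idea with CUresult and cuGetErrorString.
#define CHECK_DRV(call)                                                   \
    do {                                                                  \
        CUresult r_ = (call);                                             \
        if (r_ != CUDA_SUCCESS) {                                         \
            const char *msg_ = nullptr;                                   \
            cuGetErrorString(r_, &msg_);                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,            \
                    msg_ ? msg_ : "unknown error");                       \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

int main() {
    CHECK_DRV(cuInit(0));
    void *p = nullptr;
    CHECK_RT(cudaMalloc(&p, 1 << 20));
    CHECK_RT(cudaFree(p));
    return 0;
}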

The non-sticky-error-memory in the runtime API is only reported via peek at last error or get last error calls. It is not reported on any other runtime API call except the actual call that triggered the error.
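For example (assuming the bad launch below produces such a non-sticky error), the only way to see it is to ask for it explicitly; cudaPeekAtLastError reads it without resetting, cudaGetLastError reads and resets:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    noop<<<1, 1, (size_t)1 << 30>>>();   // assumed to fail: oversized shared memory request

    // Peek reports the recorded error but leaves it in place ...
    printf("peek: %s\n", cudaGetErrorString(cudaPeekAtLastError()));
    // ... get reports it and resets the state to cudaSuccess.
    printf("get : %s\n", cudaGetErrorString(cudaGetLastError()));
    printf("get again: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}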

I believe this is consistent with and essentially the point of njuffa’s demonstrator.

If I could control all of the code I run, then I would definitely do that. However, in reality we often run third-party code, and it might not follow that practice, so I need to understand the error-handling behavior when something goes wrong.

For example, many people just write kernel<<<...>>>(...) (example here: pytorch/aten/src/ATen/native/cuda/SortStable.cu at 5e320eea665f773b78f6d3bfdbb1898b8e09e051 · pytorch/pytorch · GitHub) without error checking. A failure there pollutes the error state of the runtime API, but now I know it will not affect the driver API.
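A way I could verify that for a given toolkit version (using cuCtxGetDevice merely as an arbitrary driver API call; the sketch only prints what each API reports, it does not assert the outcome):

#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    cudaFree(0);                          // force runtime / primary-context initialization
    cuInit(0);

    noop<<<1, 1, (size_t)1 << 30>>>();    // unchecked launch, assumed to fail (oversized shared memory)

    // What does the driver API report afterwards?
    CUdevice dev;
    CUresult r = cuCtxGetDevice(&dev);
    const char *name = nullptr;
    cuGetErrorName(r, &name);
    printf("cuCtxGetDevice after unchecked launch: %s\n", name);

    // And what does the runtime's last-error state say?
    printf("cudaGetLastError: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}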
