Difference in error handling between driver API and runtime API

Coming from How to clear cuda errors?, but that conversation is locked, so I opened a new one.

I checked the resource https://www.olcf.ornl.gov/wp-content/uploads/2021/06/cuda_training_series_cuda_debugging.pdf, but found that it does not cover the driver API.

It seems the driver API error handling (CUDA Driver API :: CUDA Toolkit Documentation) only has cuGetErrorName and cuGetErrorString, which are clearly stateless functions that just implement a look-up table.

It seems we only have the concept of “error clearing” in the runtime API, via cudaGetLastError?
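For concreteness, here is a minimal sketch of what I mean by “stateless look-up”: it just feeds a fixed error code into the two driver API helpers; nothing here touches any context or stored error state.

#include <cstdio>
#include <cuda.h>   // driver API

int main() {
    const char *name = nullptr;
    const char *desc = nullptr;
    // Pure look-up: translate a given CUresult value into strings.
    cuGetErrorName(CUDA_ERROR_ILLEGAL_ADDRESS, &name);
    cuGetErrorString(CUDA_ERROR_ILLEGAL_ADDRESS, &desc);
    printf("%s: %s\n", name, desc);   // no context, no per-thread error state involved
    return 0;
}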

My mental model is:

Each CUDA context has a flag to track whether the context is corrupted. When a kernel runs into issues (illegal memory access, illegal instruction, etc.), that flag is set and the context cannot be used anymore.

For the driver API: if that flag is set, return the error; otherwise, just return the execution result of the driver API call.

For the runtime API (including kernel launches): it additionally tracks a flag for persistent (i.e., persistent across runtime API calls) but clearable errors, notably kernel launch errors such as an invalid shared memory size. If either flag is set, return the error; otherwise, return the execution result of the runtime API call.

Only certain runtime APIs set the persistent error flag; simple calls like cudaMalloc will not set it. So a cudaMalloc failure will not affect a following kernel launch, but a failed kernel launch will affect a following cudaMalloc. Of course, an illegal memory access inside the kernel will make both of them fail.

Is this correct?
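To probe this myself, this is the kind of test program I have in mind (the kernel name and the oversized shared-memory request are just placeholders to provoke a launch error; I am not asserting what it prints, only what my model predicts in the comments):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    // Provoke a (presumably non-sticky) launch error by requesting far more
    // dynamic shared memory than any device supports.
    noop<<<1, 1, (size_t)1 << 30>>>();
    printf("after bad launch: %s\n",
           cudaGetErrorString(cudaGetLastError()));  // my model: a launch-configuration error

    // Check whether an unrelated runtime call is affected afterwards.
    void *p = nullptr;
    cudaError_t err = cudaMalloc(&p, 1 << 20);
    printf("cudaMalloc after bad launch: %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(p);
    return 0;
}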


After digging for a while, I think we can treat a CUDA driver API call as follows:

CUresult some_driver_api(some_args) {
    // check if context is corrupted
    if (context_is_corrupted) {
        return corresponding_error_code;
    }
    // execute the corresponding driver API implementation
    return some_driver_api_implementation(some_args);
}

And treat a CUDA runtime API call as follows:

cudaError_t some_runtime_api(some_args) {
    // check if context is corrupted
    if (context_is_corrupted) {
        return corresponding_error_code;
    }
    // execute the corresponding runtime API implementation
    // (which ultimately calls into the driver API)
    cudaError_t value = some_runtime_api_implementation(some_args);
    // if the call was not successful, update the global last-error variable
    if (value != cudaSuccess) {
        last_error_code = value;
    }
    // return the call result
    return value;
}

The difference is whether a failed API call affects a global last_error_code. If we never call cudaGetLastError, then they behave the same. However, since a lot of code explicitly calls cudaGetLastError to check for errors, the difference matters in practice.

So the code in How to clear cuda errors? - #3 by njuffa is actually problematic: although it can allocate memory successfully, the global error state is left polluted. We need to call cudaGetLastError to clear the error for it to be useful.
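In other words, something along these lines (the failing call is just an assumed example; the point is that a single cudaGetLastError reads the recorded error and resets the state to cudaSuccess):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Assumed-to-fail call: an invalid device ordinal.
    cudaError_t err = cudaSetDevice(-1);
    printf("cudaSetDevice: %s\n", cudaGetErrorString(err));

    // The first call reports the recorded error and clears it ...
    printf("1st cudaGetLastError: %s\n", cudaGetErrorString(cudaGetLastError()));
    // ... so the second call should report cudaSuccess again.
    printf("2nd cudaGetLastError: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}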

cc @Robert_Crovella ?

Do you have a specific question?
Yes, error handling is not purely a 1:1 correspondence between Driver API and Runtime API. The Driver API is in general not an exact 1:1 analog of the Runtime API. I’ve spent some time looking at Runtime API error handling, and very little at Driver API error handling. My general impression is that yes, the runtime API seems to have a stateful error memory that the driver API does not seem to have.

In general, the “status of the context” holds the sticky-error-ness/state for both the driver API and the runtime API. If the context is corrupted (in the case of the runtime API, the primary context) then it will be unusable in either runtime or driver API case. In the runtime API case I refer to this as a sticky error. The non-sticky-error-memory that I have previously referred to seems to be primarily an artifact of the runtime API.

In any event, for anyone who is concerned about API level errors, my general advice would be to capture the error status at every opportunity (rigorous error checking) and do something sensible whenever an error is reported. This is sensible for either the runtime or driver API, from my perspective.
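One common pattern for that kind of rigorous checking (just a sketch, not the only way to do it; the exit-on-error policy is up to the application) is a pair of check macros, one per API:

#include <cstdio>
#include <cstdlib>
#include <cuda.h>
#include <cuda_runtime.h>

// Runtime API: check the returned cudaError_t at every call site.
#define CHECK_RT(call)                                                    \
    do {                                                                  \
        cudaError_t e_ = (call);                                          \
        if (e_ != cudaSuccess) {                                          \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,            \
                    cudaGetErrorString(e_));                              \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Driver API: same idea with CUresult and cuGetErrorString.
#define CHECK_DRV(call)                                                   \
    do {                                                                  \
        CUresult r_ = (call);                                             \
        if (r_ != CUDA_SUCCESS) {                                         \
            const char *msg_ = nullptr;                                   \
            cuGetErrorString(r_, &msg_);                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,            \
                    msg_ ? msg_ : "unknown error");                       \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

int main() {
    CHECK_DRV(cuInit(0));
    void *p = nullptr;
    CHECK_RT(cudaMalloc(&p, 1 << 20));
    CHECK_RT(cudaFree(p));
    return 0;
}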

The non-sticky-error-memory in the runtime API is only reported via peek at last error or get last error calls. It is not reported on any other runtime API call except the actual call that triggered the error.
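For example (assuming the bad launch below produces such a non-sticky error), the only way to see it is to ask for it explicitly; cudaPeekAtLastError reads it without resetting, cudaGetLastError reads and resets:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    noop<<<1, 1, (size_t)1 << 30>>>();   // assumed to fail: oversized shared memory request

    // Peek reports the recorded error but leaves it in place ...
    printf("peek: %s\n", cudaGetErrorString(cudaPeekAtLastError()));
    // ... get reports it and resets the state to cudaSuccess.
    printf("get : %s\n", cudaGetErrorString(cudaGetLastError()));
    printf("get again: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}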

I believe this is consistent with and essentially the point of njuffa’s demonstrator.

If I could control all of the code I run, then I would definitely do that. However, in reality we often run third-party code, and it might not follow that practice, so I need to understand the error-handling behavior when something goes wrong.

For example, many people just write kernel<<<...>>>(...) (example here: pytorch/aten/src/ATen/native/cuda/SortStable.cu at 5e320eea665f773b78f6d3bfdbb1898b8e09e051 · pytorch/pytorch · GitHub) without error checking. A failure there pollutes the error state of the runtime API, but now I know it will not affect the driver API.
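A way I could verify that for a given toolkit version (using cuCtxGetDevice merely as an arbitrary driver API call; the sketch only prints what each API reports, it does not assert the outcome):

#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    cudaFree(0);                          // force runtime / primary-context initialization
    cuInit(0);

    noop<<<1, 1, (size_t)1 << 30>>>();    // unchecked launch, assumed to fail (oversized shared memory)

    // What does the driver API report afterwards?
    CUdevice dev;
    CUresult r = cuCtxGetDevice(&dev);
    const char *name = nullptr;
    cuGetErrorName(r, &name);
    printf("cuCtxGetDevice after unchecked launch: %s\n", name);

    // And what does the runtime's last-error state say?
    printf("cudaGetLastError: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}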
