Binary compatibility between similar NVIDIA hardware


I have code that runs fine on my GeForce RTX 2080 SUPER, and also on a friend’s GeForce RTX 2070S OC 8G, but it fails (it crashes upon start) on a GeForce RTX 2080.

The GeForce RTX 2080 SUPER and GeForce RTX 2080 are very similar, so I looked into the differences between the two. I could not find anything that would explain why the code runs on the former and crashes on the latter.

E.g. GeForce 20 series - Wiki and more …

Is there a common mistake I might have made, or a well-known pitfall that struck?

“Fail” or “crash” is too vague a description. Before getting into the details of what actually happens, please confirm that the application in question performs proper CUDA error checking: it checks the return status of every call to the CUDA API or a CUDA-associated library, and of every kernel launch.

Does “crash” mean that one of these checks did not return cudaSuccess (or the equivalent for a CUDA-associated library)? If so, what was the API called and what was the return status? If the “crash” is not a failing status check, what is the exact nature of it? A segmentation fault?

A common reason for abnormal program termination is a bug in one’s code: uninitialized data, out-of-bounds accesses to allocated data, race conditions, or a lack of checks for error conditions near their point of origin (for example, failing to check that a dynamic memory allocation succeeded). Have you used appropriate tools (e.g. valgrind) to look for these? When you run the app under control of cuda-memcheck, does it report any issues?

I use CUDA error checking for the API calls (all return cudaSuccess), but not for the kernel calls. How do you check kernel calls?

By “crash” (sorry for the vague phrasing), I mean that the binary simply stops/returns seconds after being started, without any message, just as if (for example) the number of threads per block were too high.

cuda-memcheck returns no error either.

However valgrind returned the following:

$ valgrind --leak-check=yes ./etsi
==96537== Warning: noted but unhandled ioctl 0x30000001 with no size/direction hints.
==96537==    This could cause spurious value errors to appear.
==96537==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==96537== Warning: noted but unhandled ioctl 0x27 with no size/direction hints.
==96537==    This could cause spurious value errors to appear.
==96537==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.

I found that cudaMallocManaged() was the reason?!

int *hitsn;
HANDLE_ERROR( cudaMallocManaged((void **) &hitsn, sizeof(int)) );

Considering that the hardware is virtually identical, I’m running out of ideas.

An invalid configuration for a kernel launch does not generally result in program termination. The issue of abnormal program termination may be unrelated to the GPU, considering that cuda-memcheck reports no issues (you ran the app with cuda-memcheck on the system with the RTX 2080, correct?).

I am afraid we won’t make progress in remote failure diagnosis without a minimal reproducer. A good guess for a difference between substantially identical systems is usually that a different amount of available memory causes a memory allocation to fail, because that’s not just a function of system configuration but also of current system state. If allocation failures are not handled gracefully, that could lead to sudden program termination.

You might also want to look at the kernel execution time (either by using the profiler or timing it yourself). If it is near the operating system’s watchdog timer limit for a GPU connected to a display (usually around 2 seconds), the kernel might timeout on slower GPUs, causing the CUDA context to be torn down, which causes all subsequent CUDA API calls to fail.
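One way to time it yourself is with CUDA events; a sketch, where myKernel, grid, and block stand in for your actual launch:

```cuda
// Sketch: measure kernel execution time with CUDA events.
// myKernel, grid, and block are placeholders for your own launch.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(/* args */);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);  // compare against the ~2 s watchdog limit

cudaEventDestroy(start);
cudaEventDestroy(stop);
```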

A simplistic way to check kernel launches in CUDA code is to use a macro like this:

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
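The macro is meant to be invoked immediately after each kernel launch, e.g. (kernel name and arguments hypothetical):

```cuda
myKernel<<<grid, block>>>(d_data, n);  // hypothetical launch
CHECK_LAUNCH_ERROR();                  // catches launch and execution errors
```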

Side question: The RTX 2080 machine (which fails to run my code) has no CUDA SDK etc. installed. Does the compiled binary require any NVIDIA prerequisites to execute?

It might. It certainly requires a properly installed GPU driver that is sufficiently new to meet the requirements of however the application was compiled (against which CUDA version it was compiled). Beyond that, other requirements might be specific to how the application was built. The usual requirements here would be any expected dynamically linked libraries. However, an application that has a requirement on a dynamically linked library will normally give a fairly explicit message at runtime if it cannot locate that library. So this is probably not the case here. An application might have other requirements that have nothing to do with CUDA.

Getting out a host-code debugger and tracing the host code execution up to the point where the app exits might be instructive. You could also sprinkle printf calls (or similar) through the code and rebuild, to narrow down where the termination occurs.


Update: the RTX 2080 machine had (probably, I am not sure) driver 466.11 installed. After updating to 496.13, the code ran! I’m surprised the driver was the culprit.

The initial error (when the code didn’t run on that machine) was:
the provided PTX was compiled with an unsupported toolchain. in ...

The code was compiled on a system with driver 472.12 and CUDA 11.4.100.

That is very confusing, since you previously said the program terminates without any error message and that all API status checks return cudaSuccess. If you had actually reported the “provided PTX was compiled with an unsupported toolchain” error to begin with, we could have immediately directed you to update your driver.

Not sure why this would be surprising since generally in a software stack, higher layers require certain properties of lower layers. When things seem out of whack, a common heuristic for resolving the issue is to install the latest available driver package suitable for one’s hardware.