CUDA version affects inference results during batching

We are running an application in an Ubuntu Jammy Docker image on an NVIDIA Quadro P2200, with driver version 515.65.01 and CUDA 11.7. We are doing classification using linear regression, and are experiencing instabilities in the inference results depending on the CUDA version we are running.

We perform inference in batches, processing 8 lines of the image at a time. The issue only occurs when we perform inference with batching enabled. Deactivating batching fixes the problem, but we depend on batching due to processing constraints. The inference model is an ONNX model loaded with OpenCV and exported from TensorFlow. Running inference in Python with onnxruntime yields correct results, which rules out the model itself as the source of the problem.
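For reference, here is a simplified sketch of the kind of loading and batched inference code involved (the model path, preprocessing, and the explicit CUDA backend selection shown here are placeholders and simplifications, not our exact code):

```cpp
#include <opencv2/dnn.hpp>
#include <vector>

// Simplified sketch of the batched inference path; model path, scaling,
// and backend/target selection are placeholders.
cv::Mat runBatch(cv::dnn::Net& net, const std::vector<cv::Mat>& lines)
{
    // 'lines' holds the 8 image lines that make up one batch.
    cv::Mat blob = cv::dnn::blobFromImages(lines, 1.0 / 255.0);
    net.setInput(blob);
    return net.forward();   // per-class scores for the whole batch
}

int main()
{
    cv::dnn::Net net = cv::dnn::readNetFromONNX("model.onnx"); // exported from TensorFlow
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);       // run inference on the GPU
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);
    // ... build batches of 8 lines and call runBatch() for each ...
    return 0;
}
```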

The issue manifests as the images sent through inference never yielding updated results, i.e., the annotated pixels are not updated by inference. Instead, all pixels of the resulting image are assigned to a single class (defined as foreign) and never change, no matter how many images we annotate and send through inference.

We have been encountering this issue for a while and have tested the application on multiple computers with varying CUDA versions. The interesting thing is that most of the computers obtained correct results with CUDA 10.2, and some also with CUDA 11.4. All other versions trigger the inference issue during batching.

We get no warnings or error messages in our logs, and there is no indication of a mismatch between any packages or extensions.

Has anyone here encountered the same issue before, or does anyone have insight that might help debug this problem?

I don’t think there is enough information here to conclude that the root cause of this behavior lies inside CUDA (although that is certainly one possibility).

Consider some hypothetical scenarios:

(1) An uninitialized piece of data is used somewhere in the software stack. Depending on the CUDA version, that uninitialized data could have a different value.

(2) A CUDA API call fails, but the return status is not checked, or is checked incorrectly, by the software one level up in the stack. An error (such as a failed allocation) could arise with one CUDA version but not another. (A sketch of basic status checking follows this list.)

(3) CUDA leaves the order of operations undetermined in certain circumstances (for example, the order of summation when atomics are used). Different results could be returned depending on CUDA version, as the generated code executed by the GPU could differ. Some math functions may also return slightly different results depending on CUDA version (many of them have non-zero error bounds). Other places in the code may be making assumptions that do not hold under these conditions.

(4) Some code may invoke undefined behavior (one customer case I remember involved converting negative floating-point data into unsigned integer data), and since the behavior is undefined, it may well differ between CUDA versions.
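To illustrate scenario (2): a minimal status-checking pattern for CUDA runtime API calls could look like the sketch below (the wrapper macro is just a common convention, not something mandated by CUDA). If a check like this is missing somewhere in the stack, a failed allocation or copy can go unnoticed, and downstream code happily consumes stale or garbage data.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Check the return status of every CUDA runtime API call; report and abort
// on failure instead of silently continuing with invalid state.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",        \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                \
        }                                                           \
    } while (0)

int main()
{
    float* d_buf = nullptr;
    CUDA_CHECK(cudaMalloc(&d_buf, 8 * 1024 * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_buf, 0, 8 * 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```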

The first line of defense is to compile all software with -Wall -Werror as a static check, and to carefully review and address any issues reported. The next line of defense is to use run-time checkers: for device work this means compute-sanitizer, for host code valgrind. These will catch (some) instances of out-of-bounds accesses, uninitialized data, and race conditions.
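For a sense of what the run-time checkers catch, here is a minimal, purely hypothetical example of an out-of-bounds access that compute-sanitizer flags even though the program otherwise appears to run fine:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel with a classic off-by-one bug: the bounds check uses
// <= instead of <, so one thread writes past the end of the allocation.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n) {          // BUG: should be i < n
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1000;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));          // status checks omitted for brevity
    cudaMemset(d_data, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
// Running this under "compute-sanitizer --tool memcheck ./a.out" reports the
// invalid global write; without the tool the program may appear to work or
// may silently corrupt adjacent data.
```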

If that doesn’t yield anything, it’s off to garden-variety debugging activities. First, find the smallest possible configuration (problem size, feature set) that reproduces the issue. Next, start instrumenting the code. This may seem primitive and old-fashioned, but I have had good success with inserting printf calls that dump into log file(s), and have used this to find bugs others had failed to locate.
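What I mean by that is nothing more sophisticated than a sketch like the following (the log file name and format are arbitrary; the point is to leave a persistent trail of the run):

```cpp
#include <cstdio>

// Hypothetical printf-style tracing into a log file.
static std::FILE* trace_file()
{
    static std::FILE* f = std::fopen("trace.log", "w");
    return f;
}

#define TRACE(...)                                                   \
    do {                                                             \
        std::fprintf(trace_file(), "%s:%d: ", __FILE__, __LINE__);   \
        std::fprintf(trace_file(), __VA_ARGS__);                     \
        std::fprintf(trace_file(), "\n");                            \
        std::fflush(trace_file());                                   \
    } while (0)

int main()
{
    TRACE("starting run, batch size = %d", 8);
    // ... application work ...
    TRACE("finished run");
    return 0;
}
```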

Start by logging only a few key data items (dimensions, operation codes, etc.) passed into API calls at every level of the software stack, as well as key data items returned by those calls. Keep adding detail over time. Compare the logs from a known-good run and a faulty run. At some point you will come across differences, and they will indicate where to keep digging in more detail.
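Concretely, at one layer of the stack such instrumentation might look like the sketch below; run_inference() and its parameters are made-up names standing in for whatever call the real software stack makes at that level:

```cpp
#include <cstdio>

// Same idea as the TRACE macro above, written compactly; it prints to stderr
// here only to keep the sketch short.
#define TRACE(...) do { std::fprintf(stderr, __VA_ARGS__); std::fprintf(stderr, "\n"); } while (0)

// run_inference() is a stand-in for the real lower-level API.
static int run_inference(const float* input, int batch, int rows, int cols, float* output)
{
    (void)input; (void)batch; (void)rows; (void)cols;
    output[0] = 0.0f;  // stub body so the sketch compiles and runs
    return 0;
}

// Instrumented wrapper: log the key inputs on the way in, and the status
// plus a representative output value on the way out.
static int run_inference_logged(const float* input, int batch, int rows, int cols, float* output)
{
    TRACE("run_inference in : batch=%d rows=%d cols=%d", batch, rows, cols);
    const int status = run_inference(input, batch, rows, cols, output);
    TRACE("run_inference out: status=%d output[0]=%g", status, output[0]);
    return status;
}

int main()
{
    float input[8 * 64] = {0.0f};
    float output[8]     = {0.0f};
    return run_inference_logged(input, 8, 8, 64, output);
}
```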

It sounds like a fairly extensive software stack with many moving parts may be involved. This means one needs to mentally prepare for the possibility that it could take a week or two of full-time work to find out with certainty what is going on. That’s about the time I spent resolving the most challenging bugs of my career. As I recall, the one that took me a week to pinpoint the root cause then took one minute to fix (a one-line change to add a missing synchronization call to CUDA code at the bottom of the software stack).