Random program behavior on A100 GPUs


I am working on an MPI program using the following toolchain:

  • CUDA-aware, UCX-enabled (v1.12.x) OpenMPI (v4.1.x)
  • Clang compiler (v14.0.x) to enable OpenMP offload code (all kernels are written using OpenMP)
  • cuFFT, cuBLAS (I’m running CUDA 11.7)

I have a fairly big test suite I need to run to validate said program and I’m doing so on Slurm-enabled clusters. However, I’m getting weird results according to where I run it:

  • On the first cluster, I request a single node with 4x A100 GPUs. When I run the code there, random tests fail: two successive runs with the exact same settings fail at different points.
  • I tried running the same test suite on another, smaller cluster. My first test was on a node with 4x RTX3090 GPUs, where all tests ran successfully multiple times. I observed the same results when running on 2x RTX2080Ti - no failure.
  • The same small cluster also has A100 nodes, and there I got the same kind of random failure behavior as on the first cluster.

To be honest, I’m kind of stumped as to what might be the source of this seemingly random program behavior on A100 GPUs - does anyone have an idea what might be the cause of this?

Thanks in advance for any help !

Hm, random behavior of software. The possibilities are endless. With high likelihood there is a bug (or bugs) in your application or your test suite. Check host code, device code, and MPI communication for the following:

(1) out-of-bounds accesses or uninitialized data (e.g. off-by-one errors, invalid pointers)
(2) race conditions (e.g. missing synchronization)
(3) invoking undefined behavior (in C++, CUDA APIs, MPI calls, etc)
(4) unchecked error status (API calls of any kind; but look in particular at memory allocation, bulk copies, and CUDA stream management first)

Hardware-related failures (you do have ECC enabled on the A100 nodes, correct?) are possible but highly unlikely, especially since you are seeing the problem on two different clusters.

If anything in this application is driven by a PRNG, make sure it is configured such that it returns the exact same sequence every time the app is invoked.
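
A minimal sketch of the PRNG advice, assuming the test inputs are generated on the host with the C++ standard library. The helper names `make_test_input` and `check_reproducible` and the seed value are hypothetical; nothing here is taken from the application discussed in this thread. The sketch uses the raw output of `std::mt19937`, because the engine's sequence is fixed by the C++ standard, whereas the output of the standard distributions may differ between library implementations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: fills a test input from a fixed seed so that every
// invocation of the test suite sees exactly the same values.
std::vector<std::uint32_t> make_test_input(std::size_t n, std::uint32_t seed = 12345u) {
    std::mt19937 gen(seed);  // fixed seed; do not seed from std::random_device or the clock
    std::vector<std::uint32_t> data(n);
    for (auto& x : data) {
        x = static_cast<std::uint32_t>(gen());  // raw engine output, portable across platforms
    }
    return data;
}

// Cheap sanity check that could run once at test-suite start-up.
void check_reproducible() {
    assert(make_test_input(16) == make_test_input(16));
}
```

If the application draws random numbers on the device instead, the same idea applies: the seed (and any per-rank offset) should be a fixed, logged value rather than something derived from time or process ID.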

Use standard debugging techniques (such as logging of intermediate data; inputs and returns from API calls) to narrow down where in the software differences occur. As you noted, there may be multiple such points. Trace back through the code from each of them. Hopefully they all converge on a single root cause. Good luck.
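
One way to log intermediate data compactly is to print a fingerprint of each buffer after every stage and diff the logs of two runs; the first line that differs points at the stage where the runs diverge. The following is a sketch only: `fingerprint` and `log_stage` are hypothetical names, the buffer is assumed to have been copied back to host memory already, and the comparison assumes the computation is expected to be bitwise reproducible (floating-point reductions whose summation order varies between runs can differ legitimately).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// 64-bit FNV-1a hash over the raw bytes of a host buffer.
std::uint64_t fingerprint(const void* data, std::size_t bytes) {
    assert(data != nullptr || bytes == 0);
    const unsigned char* p = static_cast<const unsigned char*>(data);
    std::uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (std::size_t i = 0; i < bytes; ++i) {
        h ^= p[i];
        h *= 1099511628211ull;                  // FNV prime
    }
    return h;
}

// Hypothetical logging helper: one line per stage, easy to diff between runs.
void log_stage(const char* stage, const std::vector<double>& buf) {
    std::fprintf(stderr, "[trace] %-24s n=%zu fp=%016llx\n", stage, buf.size(),
                 static_cast<unsigned long long>(
                     fingerprint(buf.data(), buf.size() * sizeof(double))));
}
```

A call such as `log_stage("after forward FFT", host_copy)` after each major step keeps the log small enough to compare two failing runs side by side.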

Thanks for the reply ! I agree with all you said; possibilities are indeed endless.
However, the one thing that really puzzles me is that I’m getting different behavior when running on non-A100 GPUs (as in everything behaves correctly/as expected, in a non-random fashion, when I’m running on RTX3090 GPUs) - hence my question on what kind of differences there are between both GPUs that might explain the behavior I’m observing.

Based on my considerable experience with debugging weird failures (and on more than one occasion succeeding in finding the root cause(s) where others had failed), that is not something I would focus on.

Have you tried using compute-sanitizer, valgrind, and other such tools to find semi-obvious issues?

Okay, thanks for the input !
I have actually used valgrind, yes - however, I’m having issues that prevent me from using compute-sanitizer sadly, as per my post here.

Well, that’s too bad, but not all is lost. I have successfully debugged large-ish codes with nothing more than printfs for logging and a remote console.

From that other thread it seems like your application is quite complex. If this were my code and build process, I would first try to reduce the complexity (while keeping failures observable) as much as possible before doing a deep dive and back trace on any observed random differences.

You are describing a large test framework. I assume that is for testing at application level. Is there also good unit test coverage for the constituent modules of this application? Knowledge about poorly tested modules might lead to hypotheses about root causes. The nature of the random failures themselves might also contain hints (such as when you get a bunch of NaN outputs). Whether to use such hints to shortcut debugging from first principles is a judgement call. I have been bitten in the behind by using such shortcuts as many times as I have used them successfully.

Unfortunately, I’m all too familiar with printf debugging as well ! :D

It’s indeed a relatively complex application, but I’m trying and somewhat managing to zoom in on potentially responsible code blocks. I definitely suspect this is some kind of asynchronicity issue in the kernel launches or some misuse of cuFFT/cuBLAS APIs on my part. (I’m going to add error status checking of CUDA APIs as a first step…)
Fortunately, I’m also fairly confident MPI isn’t the issue here, as I’m getting the same behavior whether I’m using one or several ranks.

As I said, there are strategies for hypothesizing about root causes and all of them tend to be quite problematic in the case of complex applications. Another such technique is to look at the change history in the versioning system, identify the last working change list and the first broken one, diff them and divine a root cause from that.

The problem with that approach is that any change may simply uncover a slumbering latent bug. And given the difference in observed behavior between the RTX 3090 and the A100, that (i.e. a latent bug) is what I would suspect here most of all.

Status checking CUDA API (and CUBLAS API, and CUFFT API, etc. etc.) calls should be a mandatory practice. It is certainly a best practice. The point of this is that if something fails, we would want to know that at the earliest possible time.
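
A sketch of what such status checking can look like. To keep it compilable without the CUDA toolkit, it uses a stand-in enum `FakeStatus` and a stand-in function `fake_api_call`; both are hypothetical, as are the names `check_status` and `CHECK_STATUS`. The checker itself is generic over the status type, so the same macro could wrap calls returning `cudaError_t`, `cufftResult`, or `cublasStatus_t`.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Generic checker: throws as soon as a call returns anything other than `ok`,
// reporting the failing expression and its source location.
template <typename Status>
void check_status(Status got, Status ok, const char* expr, const char* file, int line) {
    assert(expr != nullptr && file != nullptr);
    if (got != ok) {
        std::ostringstream msg;
        msg << file << ':' << line << ": " << expr
            << " returned status " << static_cast<long long>(got);
        throw std::runtime_error(msg.str());
    }
}

// Wraps a call so that the expression text, file, and line are captured.
#define CHECK_STATUS(call, ok) check_status((call), (ok), #call, __FILE__, __LINE__)

// Stand-ins for a library status type and API call (hypothetical).
enum FakeStatus { FAKE_SUCCESS = 0, FAKE_ALLOC_FAILED = 2 };
inline FakeStatus fake_api_call(bool fail) { return fail ? FAKE_ALLOC_FAILED : FAKE_SUCCESS; }
```

With the real libraries, call sites would look like `CHECK_STATUS(cudaMalloc(&ptr, bytes), cudaSuccess)` or `CHECK_STATUS(cufftExecZ2Z(plan, in, out, CUFFT_FORWARD), CUFFT_SUCCESS)`. Throwing is one design choice; printing the message and aborting works just as well in an MPI job where unwinding is not useful.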

Isn’t it so that running the application through cuda-memcheck would report any CUDA API faults as an added bonus?

While that is so (not sure whether the coverage is 100%), OP reported that they are unable to use compute-sanitizer. See thread linked above. I guess cuda-memcheck could be sufficiently different in its internal workings that a quick check whether it can be invoked successfully is worthwhile.

Actually, I tried using cuda-memcheck as well, and it doesn’t work either - but with a different error than compute-sanitizer.

However, I managed to track down the bug to cuFFT and fix it (I’m still a bit weirded out about why it was an issue in the first place, but I guess that’s life :D) - thanks everyone for your help ! :-)