I have a heisenbug; how should I approach it?

I’m going to start this without posting any code. I have a new code base, so far 240k lines of original C++ and CUDA. The behavior I am seeing appears in the C++ layer of some of my newest classes and functions (I am writing unit tests as I go). When I compile with CUDA enabled (via CMake), one of the modes of the new object, fixed-precision accumulation, shows a heisenbug. The accumulation may be in fixed-precision integers, but I am only doing this in a single-threaded process for the sake of the test. The results are all over the place, and the more I try to debug by performing other operations on the arrays in question, the harder the bug gets to see: it becomes less and less likely to trigger, sometimes requiring hundreds of iterations of the same test. The bug is also much less likely to occur when debugging flags are engaged in the compilation, though it does still affect results from time to time. When I take away the various debugging lines and recompile with optimizations, in particular removing sums over the array in question that check its properties (they read the array but do not modify it), the bug resumes its appearances the majority of the time.

I have noticed that this does not occur on either of my laptops. Rather than the different compilers (GNU g++ 9.4.0 versus Clang), I wonder if the actual issue is whether CUDA is involved in the compilation: on the affected machine I am building with CMake 3.18, GNU g++ 9.4.0, and CUDA 11.4. If I turn off the CUDA CMake directive, the bug goes away entirely, even when the compiler is given the optimization flags that otherwise really seemed to make the bug come out. I am still building the C++ code with g++ and then linking it to the CUDA units compiled by nvcc. There is a great deal of templating going on in the code, but there isn’t any compiler complaint to that effect no matter how I compile.

It may be significant that, when CUDA is not engaged, the arrays in question are allocated by new[], and when CUDA is engaged in the compilation, they are allocated by cudaHostAlloc(). However, I have performed accumulations like this in many, many ways on arrays created with either new[] or cudaHostAlloc(), depending on whether the code is compiled to run on the CPU alone or on the CPU with extensions that run on the GPU, and I have never had this sort of trouble. I have also tried changing the data types under which the arrays are allocated, as well as the exact methods used to accumulate numbers into them, all leading to various forms of the heisenbug.
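To illustrate the split, the allocation path looks roughly like this (the HAVE_CUDA guard and the function name are just stand-ins here, not the actual code):

```cpp
// Rough stand-in for the two allocation paths described above.
#include <cstddef>
#ifdef HAVE_CUDA
#include <cuda_runtime.h>
#endif

double* allocateHostArray(std::size_t n) {
#ifdef HAVE_CUDA
  // Pinned (page-locked) host allocation when the build includes CUDA.
  double* ptr = nullptr;
  cudaHostAlloc(reinterpret_cast<void**>(&ptr), n * sizeof(double),
                cudaHostAllocDefault);
  return ptr;
#else
  // Plain heap allocation in a CPU-only build.
  return new double[n];
#endif
}
```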

If I hadn’t already checked further, I’d say it looks like an initialization issue: when I do manage to include lots of debugging code and still have enough luck to get the bug to come out, I have been seeing that the array which should hold the interpolated weights of a series of particles seems to fail to add contributions from whole particles, as if it started back over from zero and then resumed accumulation. However, when I run with valgrind I have been unable to detect a memory or initialization error.

Any help with this would be appreciated. I can send the code, in whole or in part, to a credentialed expert who would be interested in taking a look. The package is planned for open-source release once it is usable by the molecular simulations community.

Some general suggestions:

  • Perform proper CUDA API error checking on each and every call and kernel launch (a minimal checking macro is sketched after this list).
  • Run your code with compute-sanitizer.
  • Draw a workflow diagram of the different CUDA streams to see if you are missing synchronization between streams or between the CPU and the GPU.
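Something along these lines, wrapped around every runtime call, is usually enough; the macro name is just an example:

```cpp
// Minimal per-call CUDA error checking (macro name is illustrative).
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                      \
  do {                                                                        \
    cudaError_t err__ = (call);                                               \
    if (err__ != cudaSuccess) {                                               \
      std::fprintf(stderr, "CUDA error %s at %s:%d\n",                        \
                   cudaGetErrorString(err__), __FILE__, __LINE__);            \
      std::exit(EXIT_FAILURE);                                                \
    }                                                                         \
  } while (0)

// Kernel launches do not return an error code, so check both the launch
// and (optionally) the completion of the kernel:
//   myKernel<<<grid, block>>>(args);
//   CUDA_CHECK(cudaGetLastError());        // launch / configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // asynchronous execution errors
```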

That is a lot of code. Did you write it yourself over multiple years? Is it an open-source project?

If the former, did you put proper unit tests in place, and did you do careful integration testing as the pieces came together? Is there daily or continuous automated regression testing? If the latter, what do the authors have to say about this problem? Has anybody ever made similar observations before? If so, what root causes did they identify?

The root cause could be on the CUDA side or it could be in host code. Have you run all available compile-time and run-time checks, e.g. -Wall clean, valgrind clean, compute-sanitizer clean? Have you tried reducing compiler optimizations? While it doesn’t prove anything (bugs could be masked rather than disappear), this may allow for a “golden run” from which you can create a detailed debug log to be compared with a debug log from a failing run. If the code doesn’t have built-in debug log creation that is normally compiled out, you would need to add ad-hoc instrumentation.

Are any parts of the code taking in random numbers? You would want to clamp the seeds of PRNGs to known values to make sure runs are repeatable, and turn off any true random number generation. Any use of floating-point atomic operations in the CUDA portion of the code that could trigger differences between runs with identical input data? If so, can you replace them with a (potentially slow, but) deterministic mechanism?
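For the PRNG part, even something as small as this makes runs repeatable (illustrative only, not code from the project):

```cpp
// Illustrative only: seed the generator with a fixed constant so every run
// sees the same random stream, instead of std::random_device or the clock.
#include <cstdint>
#include <random>

std::mt19937_64 makeRepeatableGenerator() {
  const std::uint64_t kFixedSeed = 715517ULL;   // any constant will do
  return std::mt19937_64(kFixedSeed);
}
```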

How often does the bug show up? Once in a hundred runs, or virtually every run (different results on every run)? What kind of turn-around are you getting per run? Have you identified the smallest data set that still triggers the issue with reasonable probability?


Thanks for the many suggestions. I will give compute-sanitizer a try. I have already checked that the code runs valgrind-clean, and there is proper CUDA error checking for each cudaMalloc(), cudaHostAlloc(), cudaMemcpy(), and other such calls.

I am writing the code base myself, and it has been going for two years and two months now. There is proper integration testing, with over 3500 unit tests spanning most objects and parts of the code (I haven’t engaged gcovr recently, but line coverage is pretty good), and the latest object should be but another iteration on the same stuff I’ve been doing over and over for the past two years. As I alluded to in my earlier post, there was one thing I was doing slightly differently in my newest object which was not common in the code beforehand: I was reinterpret_cast<>'ing arrays of double to arrays of long long int in order to re-use the space for fixed-precision accumulation, one array position at a time, and the class would then have a method to read one element at a time and convert once the accumulation was done. I tried undoing that entirely, allocating a separate array of long long int and never recasting anything, but the same problem persisted, so I went back to the original design.
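The pattern, boiled down (with made-up function names, not the real class methods), is roughly:

```cpp
// Simplified sketch of the re-use pattern: the double array's storage holds
// long long int slots during fixed-precision accumulation, then the values
// are converted back element by element once accumulation is done.
#include <cmath>
#include <cstddef>

static_assert(sizeof(long long int) == sizeof(double),
              "the re-use trick relies on both types being 8 bytes");

void accumulateFixed(double* storage, std::size_t idx, double value,
                     int fp_bits) {
  long long int* acc = reinterpret_cast<long long int*>(storage);
  acc[idx] += std::llround(value * static_cast<double>(1LL << fp_bits));
}

double readFixed(const double* storage, std::size_t idx, int fp_bits) {
  const long long int* acc = reinterpret_cast<const long long int*>(storage);
  return static_cast<double>(acc[idx]) / static_cast<double>(1LL << fp_bits);
}
```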

The object can operate in “single” or “double” precision mode, and it will allocate one of two arrays (it’s not a matter of templating) in either case: there is an array of float and an array of double, and only one of them gets allocated depending on how the class object’s “mode” is set. There is also a “fixed-precision bits” setting, which controls the precision of fixed-precision accumulation. If this is set to anything greater than zero, the float array is re-interpreted as an array of int and the double array as an array of long long int, PLUS an additional array of int gets allocated, the same length as either of the others, for “overflow” bits. The fixed-precision accumulation is thus “split” between two accumulators: two 32-bit integers in “single” precision mode, or 64- and 32-bit integers in “double” precision mode. I have used this accumulation and the associated methods for more than a year, since the code base was less than 100,000 lines, and it is involved in probably 1000 different unit tests.

As I said earlier, the problem only shows up in certain situations. When I do real-numbered accumulation (that is, summing things in the float or double floating-point numbers), the problem never appears. When I switch to fixed-precision accumulation, the problem shows up very frequently if I have no internal checking or debugging flags turned on in the compilation (-O3), less frequently if I tone down to -O0 -g, and hardly ever (I have to run it many times) if I go in and, say, read the values in the critical arrays to sum them up (not changing the arrays, just summing their contents to see that the contributions from each step are properly registered). If I do the summation at one point in the code, just before the next contribution is to happen, I have caught the arrays acting as if the prior contributions all worked but the next contribution completely blows them away (this seems to happen about one in 50 tries). If I do that check a second time, after the next contribution has taken effect, to compare the before and after sums, I don’t know whether I ever see the bug or whether it’s just a one-in-2500 chance that I haven’t been able to catch.
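To put the “split” idea in schematic terms (simplified names and conventions here, not the actual methods):

```cpp
// Boiled-down sketch of the "split" accumulation in "double" precision mode:
// a 64-bit primary accumulator per element plus a 32-bit overflow counter
// that records how many times the primary value wrapped around.
#include <cstddef>

struct SplitAccumulators {
  long long int* primary;   // the re-interpreted double array
  int* overflow;            // the extra "overflow bits" array
};

void splitAccumulate(SplitAccumulators acc, std::size_t idx,
                     long long int contribution) {
  long long int after;
  // __builtin_add_overflow (GCC / Clang) reports signed wraparound; each
  // wrap corresponds to one unit of 2^64 recorded in the overflow counter.
  if (__builtin_add_overflow(acc.primary[idx], contribution, &after)) {
    acc.overflow[idx] += (contribution > 0) ? 1 : -1;
  }
  acc.primary[idx] = after;
}
```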

Is this a GPU race condition? Sounds a lot like such things, but all of this is being tested on a serial, single-threaded CPU in host memory. There is no GPU kernel being HOLY SHIT I think I just realized what’s happening.

There is no GPU intended to be used, but that doesn’t mean the GPU isn’t actually firing off. My “initialize” function, which only fires off for fixed-precision accumulation, will launch a GPU kernel (with the available GPU’s specs) to set all values of the integer-cast arrays to zero. This would be fine if everything were running in the CUDA default stream, as the next kernel would be queued behind the initialization. But my kernel is being fed an abstract that points to arrays on the host (oops), and it’s firing off anyway because there is a GPU present, even though I intended to do this on a serial process. I forgot a break; at the end of the case in my switch statement that selects whether to initialize memory on the host or on the device. So the CPU initializes the host memory, then falls through to the part of the switch that operates on device memory, fires off the kernel with an abstract of data from the host, and then gets to work accumulating the array contributions in its own host memory. But at the same time, the kernel is issuing commands to “set that to zero,” erasing the math that the single CPU thread had done up to that point.
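In schematic form (reconstructed with invented names, not the actual code), the bug amounts to this:

```cpp
// The missing break lets the HOST case fall through into the DEVICE case,
// which launches a zeroing kernel on the very same pinned host array the
// CPU thread is about to accumulate into.
#include <cstddef>
#include <cuda_runtime.h>

__global__ void kZeroArray(long long int* acc, std::size_t n) {
  for (std::size_t i = threadIdx.x + blockIdx.x * blockDim.x; i < n;
       i += gridDim.x * blockDim.x) {
    acc[i] = 0LL;
  }
}

enum class MemoryTier { HOST, DEVICE };

void initializeAccumulators(long long int* acc, std::size_t n,
                            MemoryTier tier) {
  switch (tier) {
  case MemoryTier::HOST:
    for (std::size_t i = 0; i < n; i++) {
      acc[i] = 0LL;
    }
    // break;   <-- the missing break: execution falls through to DEVICE
  case MemoryTier::DEVICE:
    // Asynchronous launch: the call returns immediately, and the kernel
    // keeps writing zeros while the CPU goes on accumulating into the same
    // array (reachable from the device because it came from cudaHostAlloc).
    kZeroArray<<<256, 256>>>(acc, n);
    break;
  }
}
```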

Now this should all work. Oh, the perils of switch statements, but such is the case with new, green code.

I will contact the forum again if it does not, and if anyone is interested in a next-generation, well-engineered molecular simulations code base that can run tens of thousands of small-molecule calculations or, when this and a few other things are working as intended, a few dozen condensed-phase molecular dynamics trajectories in a single runtime process, hit me up.

Indeed, fixing the break; in the switch case statement solved the problem. The code now runs as expected in whatever compilation mode. It also makes a lot of sense: having the CPU do more work to investigate its progress took extra time, which gave the GPU kernel more time to complete. Similarly, when compiling with -O0 -g, the CPU didn’t get into its work fast enough for the GPU to still be showering the array with zeros (“death from above!”).

I think we just witnessed a prime example of this:

In software engineering, rubber duck debugging (or rubberducking) is a method of debugging code by articulating a problem in spoken or written natural language.

