I’m going to start this without posting any code. I have a new code base, so far 240k lines of original C++ and CUDA. The behavior I am seeing appears in the C++ layer of some of my newest classes and functions (I am writing unit tests as I go). When I compile (using CMake) with CUDA on, one of the modes of the new object shows a Heisenbug in its fixed-precision accumulation. The accumulation is done in fixed-precision integers, but I am only running it in a single-threaded process for the sake of the test. The results are all over the place, and the more I try to debug by performing other operations on the arrays in question, the harder the bug becomes to see (it triggers less and less often, sometimes requiring hundreds of iterations of the same test). The bug is also much less likely to occur when debugging flags are engaged in the compilation, though it still affects results from time to time. When I take away the various debugging lines and recompile with optimizations, in particular removing the sums over the array in question that check its properties (they read the array but do not modify it), the bug resumes its appearances the majority of the time.
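For context, the accumulation mode in question boils down to something like the sketch below. The names, the scale factor, and the int64_t accumulator type are placeholders of mine rather than the actual code (the real class is templated), but the pattern is the same: real-valued contributions are scaled, rounded, and summed as integers.

    #include <cstdint>
    #include <cmath>
    #include <vector>

    // Simplified sketch of the fixed-precision accumulation (placeholder names
    // and scaling; the real class is templated and more involved).
    constexpr double kFixedPrecisionScale = 16777216.0;  // hypothetical 2^24 scaling

    void accumulateContributions(const std::vector<double> &contributions,
                                 const std::vector<int> &grid_indices,
                                 int64_t *accumulator) {
      for (size_t i = 0; i < contributions.size(); i++) {
        // Convert each real-valued contribution to a fixed-precision integer and
        // add it into the accumulator array at its mapped index.
        accumulator[grid_indices[i]] +=
            static_cast<int64_t>(std::llround(contributions[i] * kFixedPrecisionScale));
      }
    }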
I have noticed that the bug does not occur on either of my laptops, but rather than the different compilers (GNU g++ 9.4.0 versus Clang), I suspect the actual issue is whether CUDA (version 11.4) is involved in the compilation, which I do via CMake 3.18 with g++ 9.4.0 and CUDA 11.4. If I turn off the CUDA CMake directive, the bug goes away entirely, even when the compiler is given the optimization flags that most reliably brought it out. The C++ code is still built with g++ and then linked against CUDA units compiled with NVCC. There is a great deal of templating going on in the code, but there is no compiler complaint about it no matter how I compile.
It may be significant that, when CUDA is not engaged, the arrays in question are allocated by new[], and when CUDA is engaged in the compilation they are allocated by cudaHostAlloc(). However, I have performed accumulations like this in many, many ways on arrays created with either new[] or cudaHostAlloc(), depending on whether the code is compiled to run on the CPU alone or on the CPU with extensions that run on the GPU, and I have never had this sort of trouble. I have also tried changing the data types under which the arrays are allocated and the exact methods used to accumulate numbers into them, all leading to various forms of the Heisenbug.
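To be concrete about the allocation difference, the two builds go through something along these lines. This is only a sketch; the guard macro and the wrapper function are placeholders for what the templated containers actually do.

    #include <cstdint>
    #include <cstddef>
    #ifdef PROJECT_HAS_CUDA   // stand-in for whatever the CMake CUDA directive defines
    #include <cuda_runtime.h>
    #endif

    // Sketch of the two allocation paths: plain new[] in the CPU-only build,
    // pinned (page-locked) host memory from cudaHostAlloc() when CUDA is engaged.
    int64_t* allocateAccumulator(std::size_t n) {
    #ifdef PROJECT_HAS_CUDA
      int64_t *ptr = nullptr;
      cudaHostAlloc(reinterpret_cast<void**>(&ptr), n * sizeof(int64_t),
                    cudaHostAllocDefault);
      return ptr;
    #else
      return new int64_t[n];
    #endif
    }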
If I hadn’t already checked further, I’d say it looks like an initialization issue: when I do manage to include lots of debugging code and still have enough luck to get the bug to come out, I see that the array which should hold the interpolated weights of a series of particles seems to fail to add the contributions from whole particles, as if it started back over from zero and then resumed accumulation. However, when I run under valgrind I have been unable to detect any memory or initialization error.
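For reference, the read-only diagnostic whose presence seems to suppress the bug is nothing more elaborate than a pass over the array, along these lines (again a simplified stand-in for the actual debugging lines):

    #include <cstdint>
    #include <cstddef>
    #include <cstdio>

    // Read-only check: sums the accumulator to report its state without
    // modifying it, yet including calls like this makes the bug hide.
    void reportAccumulatorSum(const int64_t *accumulator, std::size_t n,
                              const char *tag) {
      int64_t total = 0;
      for (std::size_t i = 0; i < n; i++) {
        total += accumulator[i];
      }
      std::printf("[%s] accumulator sum = %lld\n", tag,
                  static_cast<long long>(total));
    }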
Any help with this would be appreciated. I can send the code, in whole or in part, to a credentialed expert who would be interested in taking a look. The package is planned for open-source release once it is usable by the molecular simulations community.