I am currently writing a 3D wave equation simulator for electromagnetic waves. A few days ago, my main iteration kernel started behaving unpredictably, generating different results on every run of the simulation with the same initial conditions. It is hard to provide a minimal example that reproduces the error, since the error is different every time the program is executed. Sometimes it fails on the first execution of the kernel, sometimes it performs hundreds of steps flawlessly and then fails all of a sudden.
To illustrate what “failure” means in this context, consider this image, or this one. The streaks coming from the bottom upwards in the Ey, Bx and Bz fields are obviously errors. The plots were generated with matplotlib in Python; the kernel itself is written in C. Looking at the data dumps behind the plots that exhibit the error shows that the wrong values are -nan(ind). I can be sure that the error occurs somewhere during the execution of the kernel, because the wrong values propagate outwards, i.e. they must be in device memory and are used for the next iteration. The iteration kernel is the only kernel acting on the data in device memory.
- The initial false values are always in the lower half of the plots. The simulation is run in 3D at a resolution of 256x256x256, stored in a 1D array indexed as x + 256*y + 256*256*z, where x, y, z are the integer coordinates of the voxels.
- When the error occurs (i.e. after how many steps) is random. Sometimes it runs fine for hundreds of iterations, while the exact same initial conditions (i.e. restarting the program without any alterations) can lead to the error occurring immediately, after the first kernel execution.
- The pattern of the error does not always look the same. Viewed in x-y slices, the errors always start as lines along the y-axis.
- On the iteration step before the one that fails, the maxima of the fields drop by multiple orders of magnitude.
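Since the bad values are NaNs, one cheap way to narrow things down is to copy a field back to the host after each step and decode the flat index of the first non-finite value back to (x, y, z); if the first bad voxel repeatedly sits at a boundary of the kernel's launch configuration, that hints at an out-of-bounds/indexing problem rather than instability. A minimal sketch in C (the helper name and the host-side copy are illustrative, not taken from the actual simulator):

```c
#include <math.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define N 256

/* Scan a flattened 256^3 field (host copy) for the first non-finite
 * value and decode its 1D index i = x + N*y + N*N*z back to voxel
 * coordinates. Returns 1 if a bad value was found, 0 otherwise. */
static int find_first_bad(const float *field, int *bx, int *by, int *bz) {
    const size_t total = (size_t)N * N * N;
    for (size_t i = 0; i < total; ++i) {
        if (!isfinite(field[i])) {
            *bx = (int)(i % N);               /* x is the fastest index */
            *by = (int)((i / N) % N);
            *bz = (int)(i / ((size_t)N * N)); /* z is the slowest index */
            return 1;
        }
    }
    return 0;
}
```

Logging the decoded coordinates over several failing runs would also show whether the "always in the lower half" observation corresponds to a fixed z-range.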
If necessary, I can provide the full code of the kernel. The purpose of this post is to ask what such behaviour might be related to, since I would expect to get the same wrong plot every time the code is re-run if it were a systematic error. My guess is that it is memory-related: looking at the PTX code generated by NVCC shows that the kernel uses 702 registers, more than half of which are 64-bit. I am running the code on two GTX 1080 Tis, and the error occurs on both of them; I have also asked a friend to run it on his GTX 1080, and he got the same faulty results as I did, with the same random behaviour. Otherwise, it would be possible that there is some sort of numerical instability, but as far as I know that would yield consistently wrong results. Is NVIDIA GPU arithmetic deterministic? I am not using any randomized input in the simulation.
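For this kind of run-to-run nondeterminism, a standard first check is NVIDIA's memory checker, which reports out-of-bounds and misaligned accesses at the faulting thread instead of after they have silently corrupted a field. A sketch of the invocation (the binary name is a placeholder; Pascal-era toolkits ship cuda-memcheck, CUDA 11+ ships compute-sanitizer):

```shell
# Catch out-of-bounds / misaligned device memory accesses:
cuda-memcheck --tool memcheck ./wave_sim
# On newer toolkits the equivalent is:
compute-sanitizer --tool memcheck ./wave_sim
# Check for data races between threads of a block (shared memory):
cuda-memcheck --tool racecheck ./wave_sim
```

Both tools slow the kernel down considerably, so it helps to reduce the iteration count when running under them.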
How does one debug code that does not exhibit consistent behaviour?