Wave Simulation giving random, wrong output

I am currently writing a 3D wave equation simulator for electromagnetic waves. A few days ago, my main iteration kernel started behaving unpredictably, generating different results on every run with the same initial conditions. It is hard to provide a minimal example that reproduces the error, since the error is different every time the program is executed. Sometimes it fails on the first execution of the kernel; sometimes it performs hundreds of steps flawlessly and then fails all of a sudden.

To illustrate what “failure” means in this context, consider this image, or this one. The streaks running upwards from the bottom in the Ey, Bx and Bz fields are obviously errors. The plots were generated with matplotlib in Python; the kernel itself is written in C. Looking at the data dumps of the plots that exhibit the error shows that the wrong values are -nan(ind). I can be sure that the error occurs somewhere during the execution of the kernel, because the wrong values propagate outwards, i.e. they must be in device memory and are used for the next iteration. The iteration kernel is the only kernel acting on the data in device memory.
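
To pin down the first failing step, I could copy a field back to the host after every iteration and scan it for NaNs, roughly like this (just a sketch; `Ey_d`, the float type and the 256³ grid size stand in for my actual buffers):

```
// Debug-only sketch: copy one field back after each kernel launch and
// report the first NaN found. The field pointer and the float type are
// placeholders for the actual device buffer.
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int field_has_nan(const float *field_d, size_t n)
{
    float *host = (float *)malloc(n * sizeof(float));
    cudaMemcpy(host, field_d, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (size_t i = 0; i < n; ++i) {
        if (isnan(host[i])) {
            // Decode the linear index back into voxel coordinates.
            printf("NaN at index %zu (x=%zu, y=%zu, z=%zu)\n",
                   i, i % 256, (i / 256) % 256, i / (256 * 256));
            free(host);
            return 1;
        }
    }
    free(host);
    return 0;
}
```

Calling this after every iteration is slow, but as a debugging aid it would at least tell me the first step and the first voxel that goes bad.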

Further observations:

  • The initial false values are always in the lower half of the plots. The simulation is run in 3D at a resolution of 256x256x256, stored in a 1D array indexed as x + 256*y + 256*256*z, where x, y, z are the integer voxel coordinates (see the indexing sketch after this list).
  • When the error occurs (i.e. after how many steps) is random. Sometimes the simulation runs fine for hundreds of iterations, and the exact same initial conditions (i.e. restarting the program without any alterations) can lead to the error occurring immediately, after the first kernel execution.
  • The pattern of the error does not always look the same. Viewed as x-y slices, the errors always start as lines along the y-axis.
  • On the iteration step before the one that fails, the maxima of the fields drop by multiple orders of magnitude.

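For context, the index layout and the kind of neighbour access in the update step look roughly like this (a simplified stand-in, not the actual kernel; N, the field names and the stencil are placeholders):

```
// Simplified stand-in for the indexing scheme, not the real update kernel.
// N, the field names and the stencil are placeholders.
#define N 256
#define IDX(x, y, z) ((x) + N * (y) + N * N * (z))

__global__ void sketch_update(const float *Ey_old, float *Ey_new)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= N || y >= N || z >= N) return;   // guard against oversized launch grids

    // Neighbour reads such as y+1 or z+1 need their own guard, otherwise
    // the last planes index past the end of the 256^3 buffer.
    float up = (y + 1 < N) ? Ey_old[IDX(x, y + 1, z)] : 0.0f;
    Ey_new[IDX(x, y, z)] = 0.5f * (Ey_old[IDX(x, y, z)] + up);
}
```

The real kernel is of course much larger, but the addressing follows this x + 256*y + 256*256*z pattern.
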
If necessary, I can provide the full code of the kernel. The purpose of this post is to ask what such behaviour might be related to, as I would expect to get the same wrong plot every time the code is re-run if it were a systematic error. My guess is that it is memory-related: the PTX code generated by NVCC shows that the kernel uses 702 registers, more than half of which are 64-bit. I am running the code on two GTX 1080 Tis and the error occurs on both of them; I have also asked a friend to run it on his GTX 1080, and he got the same faulty results as I did, with the same random behaviour. Otherwise, it could be some sort of numerical instability, but as far as I know that would yield consistently wrong results. Is NVIDIA GPU arithmetic deterministic? I am not using any randomized input in the simulation.
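
In case it helps, this is roughly how I would wrap the launches in explicit error checks, so that a failed launch (for example from requesting too many resources) is reported instead of silently leaving garbage behind (sketch only; iterate_kernel, the launch configuration and the single field pointer are placeholders for my actual setup):

```
// Sketch: check the result of every launch. iterate_kernel and the launch
// configuration are placeholders, not the real simulation code.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

__global__ void iterate_kernel(float *field) { /* stand-in for the real kernel */ }

int main(void)
{
    float *field_d;
    CUDA_CHECK(cudaMalloc((void **)&field_d, 256 * 256 * 256 * sizeof(float)));

    for (int step = 0; step < 1000; ++step) {
        iterate_kernel<<<dim3(32, 32, 32), dim3(8, 8, 8)>>>(field_d);
        CUDA_CHECK(cudaGetLastError());        // launch-time errors
        CUDA_CHECK(cudaDeviceSynchronize());   // errors raised during execution
    }

    CUDA_CHECK(cudaFree(field_d));
    return 0;
}
```

Synchronizing after every launch is slow, but it would at least distinguish a failed launch from a purely numerical problem.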

How does one debug code that does not exhibit consistent behaviour?

Hi @slotboom.n

I know this is perhaps a bit late, but my gut feeling is similar to yours… this looks like a memory problem, a driver-related issue, or perhaps an issue with the communication between the two GPUs you are running on. If you're running the Docker image, you could try a bare-metal install. We have not encountered an error like this when testing on V100/A100 GPUs.

There is some driver information in our user guide about what our Docker image is built with; I would verify that your versions match.

Results should be deterministic with the same code, assuming you set all random seeds (torch, numpy, etc.).