I’m able to compile and run successfully on two single-node clusters with 1 and 2 RTX Quadro 5000 GPUs (1 MPI rank per GPU) using the latest version of the HPC SDK (2022.3) and CUDA 11.6.
However, when I try to run on a slightly larger single-node cluster (4 NV 100 GPUs), the code compiles and runs successfully but I get wrong results. The HPC SDK version on this latter cluster is 2021, with CUDA 11.3.
Are there any known bugs that have been resolved between these two versions? I tried many things (e.g. removing almost all of the accelerated kernels), but the problem is still there. What I see in the output fields is some data corruption near the borders between the different MPI domains.
Some other details: I use unified memory and managed memory.
How wrong are the results? Slightly or completely different?
Some numerical differences are expected when running massively parallel code. Rounding error due to the order of operations can affect results, especially if using reductions or atomics.
Also, the GPU will use fused multiply-add (FMA) operations by default (disable via the -Mnofma flag). FMA reduces rounding error since it’s a single operation rather than a two-step multiply followed by an add. Hence, if comparing against results from a CPU that doesn’t use FMA, the results can be different.
Large differences are often the result of memory not being sync’d between the GPU and CPU. Since you’re using Unified Memory, it’s unlikely to be a sync issue, but given that UM is only available for dynamically allocated memory, if the code uses static memory, then it could be a problem.
Another possibility is that you have a race condition in the code, for example if a shared variable should be private, or if a shared variable is being updated from multiple threads without an atomic directive.
You can certainly update your compiler, but the problem is unlikely to be a compiler issue. That said, if you can provide a reproducing example, I can investigate.
Thanks for the detailed explanation.
The difference is rather big from the start (7/8 times larger than the rounding error; I’m using double precision). Then, as the code runs, the difference between the expected results and the actual results becomes larger and the code crashes.
The weird thing is that sometimes this error is big, sometimes it is smaller and the code runs longer.
It is an FFT of a dynamically allocated vector (performed using cuFFT), so there shouldn’t be any race conditions.
Since this is an HPC cluster, I cannot use every hpc-sdk version, but they let me try hpc-sdk 2022, and the code now runs smoothly and gives the same results as on the other machines.
The code is rather big, but if you want, I can provide you a reproducing example of the part that was giving me problems.
Interesting. No idea why it now works with the 22.x releases. I don’t see any related bug reports, but something changed.
“The code is rather big, but if you want, I can provide you a reproducing example of the part that was giving me problems.”
If it’s not too much trouble, this would be great. I don’t like when things get fixed by magic (i.e. without explanation). Plus if it does turn out to be a compiler issue, then I can add a regression test so the error does not get reintroduced.
Yes, I was also very surprised.
One thing that I noticed is that with hpc-sdk-2021 the -Minfo=accel output is “Generating Tesla code”, while with hpc-sdk-2022 the -Minfo=accel output is “Generating Nvidia GPU code”. Could this be relevant here, or is it only a cosmetic difference between the versions?
I will prepare the test case in the next few days. Where can I send this code?
For others, the problem with Keroro’s code was that it was being linked with a local installation of cuFFT, which caused a mismatch with the CUDA version being used with the 21.9 compilers. Setting “-cudalib=cufft” on the link line instead, which has the compilers link with the cuFFT version that matches the CUDA version, fixed the issue.