Hpc-sdk version and different results

keroro · April 22, 2022, 10:30am

I’m developing a code based on MPI + OpenACC.

I’m able to compile and run successfully on two single-node clusters with 1 and 2 Quadro RTX 5000 GPUs (1 MPI rank per GPU) using the latest version of the hpc-sdk (2022.3) and CUDA 11.6.
However, when I run on a slightly larger single-node cluster (4 NV 100 GPUs), the code compiles and runs without errors but I get the wrong results. The hpc-sdk version on this latter cluster is 2021, with CUDA 11.3.

Are there any known bugs that have been resolved between these two versions? I tried many things (e.g. removing almost all the acceleration kernels) but the problem is still there. What I see in the output fields is some data corruption near the borders of the different MPI domains.

Some other details: I use unified memory and managed memory.

MatColgrove · April 22, 2022, 6:06pm

Hi keroro,

How wrong are the results? Slightly or completely different?

Some numerical differences are expected when running massively parallel code. Rounding error due to the order of operations can affect results, especially if using reductions or atomics.

Also, the GPU will use fused multiply-add (FMA) operations by default (disable via the -Mnofma flag). FMA reduces rounding error since it’s a single operation rather than a two-step multiply followed by an add. Hence, if comparing against results from a CPU that doesn’t use FMA, the results can be different.

Large differences are often the result of memory not being sync’d between the GPU and CPU. Since you’re using Unified Memory, it’s unlikely to be a sync issue, but given that UM is only available for dynamically allocated memory, it could be a problem if the code uses static memory.

Another possibility is that you have a race condition in the code, for example if a shared variable should be private, or if a shared variable is being updated from multiple threads without an atomic directive.

You can certainly update your compiler, but the issue is unlikely to be a compiler issue. Though, if you can provide a reproducing example, I can investigate.

-Mat

keroro · April 23, 2022, 6:22am

Hi Mat,

Thanks for the detailed explanation.
The difference is rather big from the start (7/8 times larger than the rounding error; I’m using double precision). Then, as the code runs, the difference between the expected results and the actual results grows and the code crashes.
The weird thing is that sometimes this error is big, and sometimes it is smaller and the code runs longer.

It is an FFT of a dynamically allocated vector (performed using cuFFT), so there shouldn’t be any race conditions.

Since this is an HPC cluster, I cannot choose among all the hpc-sdk versions, but they let me try hpc-sdk 2022; the code now runs smoothly and I get the same results as on the other machines.

The code is rather big, but if you want, I can provide you with a reproducing example of the part that was giving me problems.

MatColgrove · April 25, 2022, 4:59pm

Interesting. No idea why it now works with the 22.x releases. I don’t see any related bug reports, but something changed.

keroro wrote: “The code is rather big, but if you want, I can provide you with a reproducing example of the part that was giving me problems.”

If it’s not too much trouble, this would be great. I don’t like when things get fixed by magic (i.e. without explanation). Plus if it does turn out to be a compiler issue, then I can add a regression test so the error does not get reintroduced.

keroro · April 27, 2022, 8:24am

Yes, I was also very surprised.
One thing I noticed is that with hpc-sdk-2021 the -Minfo=accel output is “Generating Tesla code”, while with hpc-sdk-2022 it is “Generating Nvidia GPU code”. Could this be relevant to the case, or is it only a cosmetic difference between the versions?

I will prepare the test case in the next few days. Where can I send this code?

MatColgrove · April 27, 2022, 2:19pm

Cosmetic. Tesla is a specific product line while “Nvidia” is more general.

keroro wrote: “I will prepare the test case in the next few days. Where can I send this code?”

You can upload in a post (look for the icon with an up arrow over a computer).

If you don’t want the code available publicly, send me a direct message with the uploaded file.

keroro · May 26, 2022, 2:16pm

Dear MatColgrove,

I’ve just sent you a message with the code.

MatColgrove · May 27, 2022, 1:52pm

For others: the problem with keroro’s code was that it was being linked against a local installation of cuFFT, which caused a mismatch with the CUDA version used by the 21.9 compilers. Setting “-cudalib=cufft” on the link line instead, which has the compilers link with the cuFFT version that matches the CUDA version in use, fixed the issue.

-Mat

