PGI 18.5 generates incorrect results for O1 up to O3

Joseph_A · March 28, 2019, 6:14am

Dear OpenACC community,

we have a large code that was developed using the PGI compiler 17.10 and OpenACC on a P100 cluster. With this setup and the -O3 option it generated correct results.

On a different cluster we are now using the PGI compiler 18.5 and the V100.
However, the program no longer generates correct results.

So far we have tracked it down to the optimization flag: if we compile and link with -O0, we also get correct results using the PGI compiler 18.5. But from -O1 up to -O3 the results are incorrect.
At some point the algorithm also starts to print out “NaN”.

We compile and link with these flags:

PGIOPTS=-Mcuda=9.0,ptxinfo
PGIOPTS+=-Mpreprocess
PGIOPTS+=-Mlarge_arrays -mcmodel=medium
PGIOPTS+=-ta=tesla:cc70 
PGIOPTS+=-O0  
PGIOPTS+=-mp 
PGIOPTS+=-acc -Minfo=accel -Minfo

We also tried the -fast option for compiling and linking but it still generates wrong results.

Is there a way to debug it and find out what happens?

Thank you for your help

brentl · March 28, 2019, 5:53pm

Does the different cluster have a different processor? You might try setting the processor type to something “older”, like Nehalem for x86, for instance.

But it could be, maybe likely, an optimization bug. The best way to go about it is to create a working version, compiled with -O0, and create a failing version with -O3. Then combine the two sets of objects. For instance, take [a-l].o from one set and [m-z].o from another. Do a binary search to find the failing file or function. Once you have that, we can zero in on where the actual problem lies.
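As a rough sketch of that bisection (not a PGI tool, just an illustration), the search could be scripted like this. It assumes exactly one object file is responsible and that the failure is reproducible. The names `find_bad_object` and `passes` are hypothetical: `passes` stands for whatever script links the given objects from the -O3 build with all remaining objects from the -O0 build, runs the program, and reports whether the results are correct.

```python
from typing import Callable, Sequence, Set


def find_bad_object(names: Sequence[str],
                    passes: Callable[[Set[str]], bool]) -> str:
    """Bisect for the one object file whose optimized build breaks the results.

    `passes(optimized)` is a hypothetical callback supplied by the user.
    It should link the objects in `optimized` from the -O3 build and every
    other object from the -O0 build, run the program, and return True when
    the results are correct.

    Assumption: exactly one object file is the culprit and the failure is
    deterministic. With several interacting culprits this search is not valid.
    """
    candidates = list(names)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if passes(set(half)):
            # Optimizing only this half is harmless, so the culprit
            # must be among the remaining candidates.
            candidates = candidates[len(half):]
        else:
            # The failure reproduces with only this half optimized.
            candidates = half
    return candidates[0]
```

Each call to `passes` costs one relink and one run, so for N object files the failing file is found after roughly log2(N) runs. The same idea can then be repeated at function level inside that file, for example by splitting the source file.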

Not in 18.5, but later compilers have a new feature called PCAST that would help in situations like this. It provides compiler-assisted debugging by comparing a gold version against a test version.

Joseph_A · March 29, 2019, 4:52am

Thank you for your quick reply!
Thank you for the detailed description on how to debug such an error.

We have now tested the code on a DGX2 cluster with the PGI compiler 19.1 and it seems to run correctly.

The P100 cluster has a Broadwell CPU and both other V100 clusters have a Skylake CPU.
We will also install the newest PGI 19.X compiler on the other V100 cluster and then check again whether it runs correctly. Hopefully it works and we don’t need to debug it.

Thank you for your help
