OpenACC: -O2 and above gave wrong results

Hi, here’s an example that fails with -O2 and above. -O1 and below give correct results.

The example is so short and easy that I don’t think extra explanation is necessary.

Tested on cc70 and cc60 cards. The results are the same.

I’ve already found a workaround (function g in the example), but this workaround isn’t practical in my project. I need the original function f to work too.

Thanks in advance.

// ubuntu 18.04
// pgc++ 19.10-0 LLVM 64-bit target on x86-64 Linux -tp haswell
// PGI Compilers and Tools

// pgc++ -acc -ta=tesla:cc70 -Minfo=accel acc.cpp -O2; ./a.out

#include <cstdio>

#pragma acc routine seq
inline void f1(float& a) { a = -1; }

#pragma acc routine seq
inline void f2(float& a) { a = -2; }

#pragma acc routine seq
inline void f(float& a, int flag) {
   if (flag == 1) f1(a);
   else f2(a);
}

#pragma acc routine seq
inline void g(float& a, int flag) {
   if (flag == 1) a = -1;
   else a = -2;
}

int main() {
   int n = 1;
   float outf, outg;
   #pragma acc parallel loop
   for (int i = 0; i < n; ++i) {
      float cf = 10, cg = 10;
      f(cf, 1); // wrong
      g(cg, 1); // right
      outf = cf;
      outg = cg;
   }
   printf("outf %lf\n", outf); // prints 10.0 with -O2 and above
   printf("outg %lf\n", outg); // always prints -1.0

   return 0;
}

Thanks stw. I can recreate the error here and have added an issue report (TPR #28202).

This looks like a regression between the 19.5 and 19.7 compilers. After inlining, the compiler applies an optimization that essentially eliminates the if checks (since flag is always 1) and sets the out values directly to the end result. However, in the case with multiple levels of if statements, it somehow uses the initial value instead of the result value.

As a workaround, you can disable inlining (-Mnoautoinline).

% pgc++ -ta=tesla -O2 test.cpp; a.out
outf 10.000000
outg -1.000000
% pgc++ -ta=tesla -O2 test.cpp -Mnoautoinline; a.out
outf -1.000000
outg -1.000000

-Mat

Hi Mat, thank you for your explanation.

My follow-up question is: if this acc routine is called inside a compute-intensive kernel, will this fix cause more overhead than the correctly inlined version, or will the overhead be eliminated by later compilation passes?

I understand the strictly correct answer might be tricky so I was just expecting a general reply. Thanks.

will this fix cause more overhead than the correctly inlined version, or will the overhead be eliminated by later compilation passes?

Most likely it will cause some performance loss. Probably not too bad, but you’d need to run it through a profiler to see the effect. Though as you note, comparing performance against code that gives incorrect answers is not very useful.

-Mat

Fixed in HPC SDK 20.5, which is currently in early access (EA): https://developer.nvidia.com/hpc-sdk