I was porting an in-house Fortran code to OpenACC with the NVIDIA HPC SDK and got a strange result with -O3 optimization. The result can be reproduced with the following piece of code:
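integer i, na
real, allocatable :: w(:), ww(:)
na = 8
allocate(w(na), ww(na))
w = 1.
ww = -1.
! swap w and ww element by element; a is an implicitly typed real temporary
!$acc kernels
do i = 1, na
  a = w(i)
  w(i) = ww(i)
  ww(i) = a
end do
!$acc end kernels
write(*, *) w
write(*, *) ww
end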
I need to swap the elements of the two arrays. I compile the code with:
nvfortran -acc -Minfo -r8 -O3 bug_test.f90
The program output shows that every element of both w and ww is -1, whereas after the swap w should be all -1 and ww all 1.
However, the program works correctly when compiled with -O2 or -O.
The program even gives the right output with -O3 when I target -acc=multicore or -acc=host. It seems that the compiler applies a strange optimization strategy to the GPU code. For now I have to avoid -O3 optimization in my code.
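To compare these cases side by side, something like the following sketch could be used. The flag combinations are the ones mentioned above. The loop, the `bug_test_N` executable names, and the skip message are only illustrative assumptions, not part of my actual build.

```shell
# Build and run bug_test.f90 once per flag combination mentioned in the post.
# Executable names (bug_test_1, ...) are hypothetical, chosen for illustration.
src=bug_test.f90
n=0
for flags in "-acc -O3" "-acc -O2" "-acc=multicore -O3" "-acc=host -O3"; do
  n=$((n + 1))
  echo "== nvfortran $flags -Minfo -r8 =="
  if command -v nvfortran >/dev/null 2>&1 && [ -f "$src" ]; then
    # $flags is left unquoted on purpose so it splits into separate options
    nvfortran $flags -Minfo -r8 -o "bug_test_$n" "$src" && "./bug_test_$n"
  else
    echo "skipped: nvfortran or $src not found"
  fi
done
```

In my runs only the first combination (GPU target with -O3) printed -1 for both arrays.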
My OS is Ubuntu 18.04.2 LTS, the HPC SDK version is 21.3, and the CUDA version is 11.0.