Wrong results when using vector clause in parallel loop with array syntax

Hello,
I work in the CINECA User Support Team, and one of our users reported a code that produces wrong numerical results. Please note that we found how to change the code so that it produces correct results, but we wonder whether the original code should work as well or, if not, how to explain why it is wrong.
Below you will find the original Fortran OpenACC code posted by our user, together with the commands he uses to compile it. It shows that:

  • when the vector clause is used on the parallel loop directive, and
  • the loop body operates directly on the array returned by a function call,
  • all threads produce the same value (the one corresponding to i=1); see the output array a.

The code also reports the (correct) output for the array b, obtained by saving the return value of get_arr in a local (array) variable c, and then operating on c.

Note that:

  • without the vector clause it works
  • with acc kernels instead of parallel loop it works
  • with the vector clause and scalar variables (replacing the array get_arr) it works

Is this expected behaviour?

Many thanks in advance for any suggestion you may have,
best
Isabella

! compile with:
! nvfortran -c -o test.o test.F90 -cuda -acc -gpu=cc70 -Minfo=accel -g -r8 -traceback -Mnoinline
! nvfortran -o test test.o -cuda -acc -gpu=cc70 -Minfo=accel -g -r8 -traceback -Mnoinline

module simple
contains
  function get_arr(a)
  !$acc routine seq
    integer, dimension(2) :: get_arr
    integer, intent(in) :: a
    get_arr(1) = a
    get_arr(2) = a
  end function get_arr
end module simple


program testprogram
  use simple
  implicit none
  integer, parameter :: n = 16
  integer, dimension(n) :: a, b 
  integer, dimension(2) :: c
  integer :: i

  write (*,*) "test start"

  !$acc parallel loop gang worker vector &
  !$acc           private(c, i) copyout(a, b)
  do i = 1, n
    c = get_arr(i) * 2
    a(i) = c(1)
    c = get_arr(i)
    b(i) = c(1) * 2
  end do

  write (*,*) "result"
  write (*,*) "a"
  write (*,*)  a
  write (*,*) "b"
  write (*,*)  b
end program testprogram

Hi i.baccarelli,

I suspect the key to what’s going on is in the compiler feedback messages (-Minfo=accel):

% nvfortran test.F90 -acc -Minfo=accel -g ; a.out
get_arr:
      3, Generating acc routine seq
         Generating NVIDIA GPU code
testprogram:
     26, Generating copyout(a(:)) [if not already present]
         Generating NVIDIA GPU code
         29, !$acc loop gang, worker(4), vector(32) ! blockidx%x threadidx%y threadidx%x
         30, !$acc loop seq
     26, Local memory used for c
         Generating implicit copy(get_arr1(:)) [if not already present]
         Generating copyout(b(:)) [if not already present]
     30, Loop is parallelizable

It looks like the fixed-size local array “get_arr” is getting hoisted, causing the compiler to generate an implicit copy of it (the “get_arr1(:)” line in the feedback above). Since it is not also being implicitly privatized, all threads share the same copy, which creates a potential race condition.
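
To make the suspected pattern concrete, here is a hand-written analogue of what I think the generated code amounts to. This is only a sketch: the array "tmp" and the program name "hoistsketch" are names I made up to stand in for the compiler temporary, and the loop is deliberately racy because "tmp" is shared instead of private.

```fortran
program hoistsketch
  implicit none
  integer, parameter :: n = 16
  integer, dimension(n) :: a
  integer, dimension(2) :: tmp   ! hypothetical stand-in for the compiler temporary
  integer :: i

  tmp = 0

  ! tmp is deliberately NOT listed in a private clause, so a single copy
  ! is shared by every gang, worker and vector lane.
  !$acc parallel loop gang worker vector copy(tmp) copyout(a)
  do i = 1, n
    tmp(1) = i            ! every iteration writes the same two elements
    tmp(2) = i
    a(i) = tmp(1) * 2     ! may read a value stored by another thread
  end do

  write (*,*) a
end program hoistsketch
```

Built without -acc, the loop runs serially and each a(i) is 2*i. Built for the GPU, the values depend on thread timing, which would be consistent with results that change with the compile flags.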

The wrong answers seem to appear only when “-g” is used (so another workaround is to remove -g), but if I’m correct and it is a race condition, the successful cases may just be due to luck in the timing of when get_arr is used.
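
If the function-result temporary really is the culprit, one more workaround that may be worth trying is to avoid the array-valued function altogether and have the caller pass in its own private array. The sketch below assumes that this removes the need for a temporary; "simple_sub", "fill_arr" and "testworkaround" are hypothetical names, and I have not verified it against the compiler version in question.

```fortran
module simple_sub
contains
  ! Hypothetical subroutine variant of get_arr: the caller supplies the
  ! result array, so no function-result temporary is involved.
  subroutine fill_arr(a, res)
  !$acc routine seq
    integer, intent(in) :: a
    integer, dimension(2), intent(out) :: res
    res(1) = a
    res(2) = a
  end subroutine fill_arr
end module simple_sub

program testworkaround
  use simple_sub
  implicit none
  integer, parameter :: n = 16
  integer, dimension(n) :: a
  integer, dimension(2) :: c
  integer :: i

  !$acc parallel loop gang worker vector private(c) copyout(a)
  do i = 1, n
    call fill_arr(i, c)   ! writes directly into the thread-private c
    a(i) = c(1) * 2       ! intended value: 2*i
  end do

  write (*,*) a
end program testworkaround
```

This has the same effect as the "b" variant in the original code (store into the private c first, then operate on c), just without relying on how the compiler handles the function result.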

I’ll need a compiler engineer to dig into the details to confirm if I’m correct, or if something else is going on. Hence, I added a problem report, TPR #31360.

Thanks for the report,
Mat