Wrong results when using vector clause in parallel loop with array syntax

I work in the CINECA User Support Team, and one of our users sent us a report on a code which produces wrong numerical results. Please note that we found how to change the code so that it produces correct results, but we wonder whether the original code should work as well or, better, how to explain why it is wrong.
Below you find the original Fortran OpenACC code posted by our user, and how he compiles it. It shows that:

  • when using the vector clause for the parallel loop directive
  • if the loop instructions involve some operation on the array returned by a function call
  • as a result all threads produce the same value (the one corresponding to i=1), see output array a.

The code also reports the (correct) output for the array b, obtained by saving the return value of get_arr in a local (array) variable c, and then operating on c.

Note that:

  • without the vector clause it works
  • with acc kernels instead of parallel loop it works
  • with the vector clause and scalar variables (replacing the array get_arr) it works

Is this an expected behaviour?

Many thanks in advance for any suggestion you may have,

! compile with:
! nvfortran -c -o test.o test.F90 -cuda -acc -gpu=cc70 -Minfo=accel -g -r8 -traceback -Mnoinline
! nvfortran -o test test.o -cuda -acc -gpu=cc70 -Minfo=accel -g -r8 -traceback -Mnoinline

module simple
contains
  function get_arr(a)
  !$acc routine seq
    integer, dimension(2) :: get_arr
    integer, intent(in) :: a
    get_arr(1) = a
    get_arr(2) = a
  end function get_arr
end module simple

program testprogram
  use simple
  implicit none
  integer, parameter :: n = 16
  integer, dimension(n) :: a, b 
  integer, dimension(2) :: c
  integer :: i

  write (*,*) "test start"

  !$acc parallel loop gang worker vector &
  !$acc           private(c, i) copyout(a, b)
  do i = 1, n
    c = get_arr(i) * 2
    a(i) = c(1)
    c = get_arr(i)
    b(i) = c(1) * 2
  end do

  write (*,*) "result"
  write (*,*) "a"
  write (*,*)  a
  write (*,*) "b"
  write (*,*)  b
end program testprogram

Hi i.baccarelli,

I suspect the key as to what’s going on is found in the compiler feedback messages (-Minfo=accel):

% nvfortran test.F90 -acc -Minfo=accel -g ; a.out
      3, Generating acc routine seq
         Generating NVIDIA GPU code
     26, Generating copyout(a(:)) [if not already present]
         Generating NVIDIA GPU code
         29, !$acc loop gang, worker(4), vector(32) ! blockidx%x threadidx%y threadidx%x
         30, !$acc loop seq
     26, Local memory used for c
         Generating implicit copy(get_arr1(:)) [if not already present]
         Generating copyout(b(:)) [if not already present]
     30, Loop is parallelizable

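It looks like the fixed-size function result array “get_arr” is getting hoisted out of the loop, causing the compiler to generate an implicit copy of it to and from the device (the “get_arr1” line in the feedback). Since it’s not also being implicitly privatized, all threads share that one copy, which creates a potential race condition.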
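
If that hypothesis holds, one possible way to sidestep the shared temporary is to return the array through a subroutine argument into an explicitly private array, so that no function result array is involved at all. The following is only a sketch under that assumption: the names simple_sub, fill_arr and test_private are hypothetical, and it has not been checked against the affected compiler version. The intended result is a(i) = 2*i.

```fortran
! Hypothetical variant of the reproducer. The names simple_sub, fill_arr
! and test_private are illustrative and not part of the original report.
module simple_sub
  implicit none
contains
  subroutine fill_arr(a, arr)
    !$acc routine seq
    integer, intent(in) :: a
    integer, dimension(2), intent(out) :: arr
    arr(1) = a
    arr(2) = a
  end subroutine fill_arr
end module simple_sub

program test_private
  use simple_sub
  implicit none
  integer, parameter :: n = 16
  integer, dimension(n) :: a
  integer, dimension(2) :: c
  integer :: i

  ! c is the only array temporary in the loop and is explicitly private,
  ! so each thread is meant to work on its own copy.
  !$acc parallel loop gang worker vector private(c) copyout(a)
  do i = 1, n
    call fill_arr(i, c)
    a(i) = c(1) * 2
  end do

  write (*,*) a
end program test_private
```

This is close to the user's own workaround for array b, where the function result is first stored in the private array c. The difference is that the compiler no longer has to create a temporary for the function result.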

The wrong answers do seem to only appear when “-g” is used (another workaround is to remove -g), but if I’m correct and it is a race condition, the successful cases may just be due to luck in the timing of when get_arr is used.

I’ll need a compiler engineer to dig into the details to confirm if I’m correct, or if something else is going on. Hence, I added a problem report, TPR #31360.

Thanks for the report,

Dear Mat,
do you have any news on your hypothesis or on TPR #31360?
thank you for all your help,

Hi Isabella,

Engineering did take a look and came to the same conclusion: “get_arr” isn’t getting implicitly privatized as it should. However, they gave the task a lower priority and haven’t yet assigned anyone to fix it. Let me talk to management and see if I can get it bumped higher.