Order of operations within do concurrent on GPU

I am attempting to run the MOM6 ocean model on a GPU by migrating its loops to do concurrent.

A requirement of this model is to guarantee bit reproducibility of its floating-point calculations.

I have a question related to the order of operations inside of do concurrent loops. I have a double-precision four-element array:

x = [0., 2.**(-52), 2.**(-52), 2.]

If I sum this in order, then the two 2**-52 terms combine to 2**-51 and I should get 2 + 2**-51. If I sum it in any other order, say (a+b)+(c+d), then the small terms are each less than the LSB of 2. and are dropped.

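For reference, here is a minimal standalone check of the two orderings (the program and variable names are only illustrative, not taken from the model):

program order_check
  implicit none
  real(8) :: a, b, c, d, ordered, pairwise
  a = 0.d0
  b = 2.d0**(-52)
  c = 2.d0**(-52)
  d = 2.d0
  ! Left to right: the two 2**-52 terms first combine to 2**-51, which is one
  ! ulp of 2 and therefore survives the final addition.
  ordered = ((a + b) + c) + d
  ! Pairwise: 2**-52 + 2 is a halfway case that rounds to even (2.0) under the
  ! default rounding mode, so both small terms are lost.
  pairwise = (a + b) + (c + d)
  write(*,'(F20.16)') ordered    ! expect 2.0000000000000004
  write(*,'(F20.16)') pairwise   ! expect 2.0000000000000000
end program order_check

Because the groupings are parenthesized, a conforming compiler should produce these two results at any optimization level.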

Using a standard Fortran loop with no optimization:

s = 0.
do i = 1, L
  s = s + x(i)
end do

I get the expected result:

s (cpu): 2.0000000000000004
s (hex): 4000000000000001

If I enable -O2, then the reduction is vectorized:

s (cpu): 2.0000000000000000
s (hex): 4000000000000000

presumably using the (a+b)+(c+d) grouping above. (My assembly reading is not great, but I believe that’s what I am seeing.)


Now, consider the do-concurrent examples below.

If I nest this loop inside of a single do-concurrent iteration:

s = 0.
do concurrent (n=1:1) local(i)
  do i = 1, L
    s = s + x(i)
  end do
end do

then I seem to get the optimized result:

s (gp1): 2.0000000000000000
s (hex): 4000000000000000

An explicit reduction gives a similar result:

s = 0.
do concurrent (i=1:L) reduce(+:s)
  s = s + x(i)
end do

s (gp2): 2.0000000000000000
s (hex): 4000000000000000

In general, there’s no issue here with using the optimized expression. The issue for me is that we seem to have no control over it.

AFAIK, the Fortran standard requires that the order of arithmetic be preserved, and any involuntary reordering should be considered an extension or feature. I am unsure how that extends to do-concurrent loops, but I would hope it applies to the contents of each loop iteration, as in the second example.

I’m not thrilled about the level of control in the do-loop example. Ideally, I’d like to vectorize non-reducing loops and leave the summations untouched, perhaps vectorizing reductions only when they are expressed through sum() and similar intrinsics. But at least there is some set of flags that executes the loop as it is written.
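For example (the names y, a, and b here are illustrative only), an element-wise loop like the one below has no ordering concern and can be vectorized or parallelized freely, in contrast to the running sum above:

do concurrent (i = 1:L)
  y(i) = a*x(i) + b   ! independent elements: any execution order is bit-identical
end do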

As far as I can tell, we don’t have any control over the do-loop inside the do-concurrent.

Is there some way to control, or at least know, the order of operations inside of the do-concurrent loops?

To clarify: the primary issue is with do-concurrent loops on the GPUs.

For most optimization levels, the nvfortran CPU behavior appears to be ordered and produces the expected result.

If I enable -fast on nvfortran, then the standard do-loop produces the optimized result. The do-concurrent loops appear unaffected by -fast.

Overall, I am happy with these results (though the lack of do-concurrent CPU optimization is odd).


Regarding other compilers on CPU:

GNU produces the ordered result for a very wide range of optimization flags. Enabling -ffast-math produces the optimized result.

Intel always produced the ordered result, even with -fast. (However, Intel runs microbenchmarks to determine its optimizations, so it may simply not have chosen to use vectorized instructions.)

Results were more or less identical for do and do-concurrent.

Hi Marshall,

I attempted to recreate this issue but was unable to with “-fast”. However, if I use “-Ofast” instead, then I see the behavior.

“-Ofast” includes the flag “-Mfprelaxed”, which is similar to gfortran’s “-ffast-math” in that relaxed precision arithmetic is used. Adding “-Mnofprelaxed” will give the expected results.

While I don’t think it’s the issue here, running a loop in parallel can cause the order of operations for a reduction to become non-deterministic, resulting in differences in rounding error that can cause slight differences in accuracy.

Here’s the example I created:

% cat test.F90
program test
  implicit none

  real(8), dimension(:), allocatable :: x
  integer :: i, j, L, n
  real(8) :: s
  L = 4
  allocate (x(L))
  x(1) = 0.0
  x(2) = 2.**(-52)
  x(3) = 2.**(-52)
  x(4) = 2.0

  s=0.0
  do i = 1, L
      s = s+x(i)
  end do
  write(*,*) "Simple loop"
  write(*,'(F20.16)') s
  write(*,'(Z)') s

  s=0.0
  do concurrent (n=1:1) local(i)
    do i = 1, L
      s = s+x(i)
    end do
  end do
  write(*,*) "outer do concurrent loop"
  write(*,'(F20.16)') s
  write(*,'(Z)') s

  s=0.0
  do concurrent (n=1:L) reduce(+:s)
      s = s+x(n)
  end do
  write(*,*) "do concurrent w/ reduce loop"
  write(*,'(F20.16)') s
  write(*,'(Z)') s

end program test

-fast with and without -Mfprelaxed

% nvfortran test.F90 -fast -V23.5 ; a.out
 Simple loop
  2.0000000000000004
       4000000000000001
 outer do concurrent loop
  2.0000000000000004
       4000000000000001
 do concurrent w/ reduce loop
  2.0000000000000004
       4000000000000001
% nvfortran test.F90 -fast -Mfprelaxed -V23.5 ; a.out
 Simple loop
  2.0000000000000004
       4000000000000001
 outer do concurrent loop
  2.0000000000000000
       4000000000000000
 do concurrent w/ reduce loop
  2.0000000000000000
       4000000000000000

-Ofast with and without -Mnofprelaxed

% nvfortran test.F90 -Ofast -V23.5 ; a.out
 Simple loop
  2.0000000000000004
       4000000000000001
 outer do concurrent loop
  2.0000000000000000
       4000000000000000
 do concurrent w/ reduce loop
  2.0000000000000000
       4000000000000000
% nvfortran test.F90 -Ofast -Mnofprelaxed -V23.5 ; a.out
 Simple loop
  2.0000000000000004
       4000000000000001
 outer do concurrent loop
  2.0000000000000004
       4000000000000001
 do concurrent w/ reduce loop
  2.0000000000000004
       4000000000000001

Please try adding “-Mnofprelaxed” or removing “-Mfprelaxed” to see if it works around your issue. Though if I haven’t accurately replicated the issue, please provide a reproducing example and I’ll investigate further.

-Mat

Hi Mat, thanks very much for getting back to me. Your example program replicates my own, so we can use that as a reference.

I should have clarified in the original message that this is with -stdpar=gpu, i.e. with GPUs enabled.

Also, I should mention that I am using V23.1 on Linux, where -Ofast does not appear to be a valid flag.


Although I am more interested in the GPU output, here is the CPU output for comparison. (I also include -O, i.e. -O2 without SIMD.)

$ nvfortran -O sample.f90 && ./a.out 
 Simple loop
  2.0000000000000004
       4000000000000001
 outer do concurrent loop
  2.0000000000000004
       4000000000000001
 do concurrent w/ reduce loop
  2.0000000000000004
       4000000000000001
$ nvfortran -fast sample.f90 && ./a.out 
 Simple loop
  2.0000000000000000
       4000000000000000
 outer do concurrent loop
  2.0000000000000004
       4000000000000001
 do concurrent w/ reduce loop
  2.0000000000000004
       4000000000000001
$ nvfortran -fast -Mfprelaxed sample.f90 && ./a.out 
 Simple loop
  2.0000000000000000
       4000000000000000
 outer do concurrent loop
  2.0000000000000000
       4000000000000000
 do concurrent w/ reduce loop
  2.0000000000000000
       4000000000000000

Although our simple loop outputs differ, the do-concurrent behavior is consistent with what I am seeing.


When I compile for the GPU, I get different results:

$ nvfortran -stdpar=gpu sample.f90 && ./a.out 
 Simple loop
  2.0000000000000004
       4000000000000001
 outer do concurrent loop
  2.0000000000000000
       4000000000000000
 do concurrent w/ reduce loop
  2.0000000000000000
       4000000000000000
$ nvfortran -stdpar=gpu -Mnofprelaxed sample.f90 && ./a.out 
 Simple loop
  2.0000000000000004
       4000000000000001
 outer do concurrent loop
  2.0000000000000000
       4000000000000000
 do concurrent w/ reduce loop
  2.0000000000000000
       4000000000000000

Here, it seems that the GPU always does a SIMD-like evaluation, regardless of the compiler flag settings.


Which brings me to my main question: is there any way to control this behavior on the GPU? Is there an equivalent of -Mnofprelaxed for the generated GPU code?

(Ignoring this issue, -Mfprelaxed looks like a very useful flag for our purposes; thanks very much for bringing it to my attention.)

There’s the “-gpu=fastmath” flag, but that’s not on by default, so it’s not the cause of this issue.

While I’m not 100% certain, I believe the issue here is rounding error with these very small numbers when performing the parallel reduction.

If you don’t mind adding some OpenACC directives (our DO CONCURRENT is built on top of our OpenACC implementation), you can switch to using atomics instead of reductions.

% cat test2.f90
program test
  implicit none

  real(8), dimension(:), allocatable :: x
  integer :: i, j, L, n
  real(8) :: s
  L = 4
  allocate (x(L))
  x(1) = 0.0
  x(2) = 2.**(-52)
  x(3) = 2.**(-52)
  x(4) = 2.0

  s=0.0
  do i = 1, L
      s = s+x(i)
  end do
  write(*,*) "Simple loop"
  write(*,'(F20.16)') s
  write(*,'(Z)') s

  s=0.0
  do concurrent (n=1:1) local(i)
    do i = 1, L
      !$acc atomic update
      s = s+x(i)
    end do
  end do
  write(*,*) "outer do concurrent loop"
  write(*,'(F20.16)') s
  write(*,'(Z)') s

  s=0.0
  do concurrent (n=1:L)
!$acc atomic update
      s = s+x(n)
  end do
  write(*,*) "do concurrent w/ reduce loop"
  write(*,'(F20.16)') s
  write(*,'(Z)') s

end program test
% nvfortran -fast -stdpar=gpu test2.f90
% a.out
 Simple loop
  2.0000000000000004
       4000000000000001
 outer do concurrent loop
  2.0000000000000004
       4000000000000001
 do concurrent w/ reduce loop
  2.0000000000000004
       4000000000000001

Alternatively, for the case with the inner loop, you can add the “loop seq” directive so the inner loop isn’t parallelized and therefore no implicit reduction is performed. Granted, the code would then run serially on the device and be quite slow.

  s=0.0
  do concurrent (n=1:1) local(i)
    !$acc loop seq
    do i = 1, L
      s = s+x(i)
    end do
  end do

Thank you for following up on this; it’s been hard to find time to reply.

I tested out the !$acc loop seq directive and it seems to do what we need. -Minfo indicates this explicitly:

     23, Generating NVIDIA GPU code
         21, Loop parallelized across CUDA thread blocks, CUDA threads(32) blockidx%x threadidx%x
             Generating implicit reduction(+:s)
         23, Loop run sequentially

We have been hoping to avoid the insertion of platform-specific directives, but perhaps in this case there is no alternative.

However, I do wonder if reordering a loop inside a do-concurrent loop breaks compliance with the language standard. Flags like -Ofast are welcome and have their place, but IMO there should be some way to retain in-order sums, even on a GPU. If we didn’t care about order, then we would use sum().
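To put it concretely (a sketch, not code from the model): writing

s = sum(x)

already tells the compiler that we accept a processor-dependent summation order, whereas spelling out the loop is how we express that we do not.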


(I seem unable to compile with !$acc atomic update in the final loop; the compiler appears to hang and never finish. But I think it’s fair to expect a do-concurrent to operate in an unknown order, and this doesn’t worry me.)

I think for now !$acc loop seq is your best option. At least it is a directive and can be ignored depending on the compiler or options used. We are looking into our do concurrent support in general, with a goal to address areas where it is deficient wrt OpenACC.


Thank you Brent, I think we’ll go with this for now. If there’s some way in the future to control the order on the GPU, that would be appreciated.
