I am attempting to run the MOM6 ocean model on a GPU using the do concurrent
migration.
A requirement of this model is to guarantee the bit reproducibility floating-point calculations.
I have a question related to the order of operations inside of do concurrent loops. I have a double-precision four-element array:
x = [0., 2.**-52, 2.**-52, 2,]
If I sum this in order, then I should get 2 + 2**-51
. If I sum it in any other order, say (a+b)+(c+d)
then the other terms are less than the LSB of 2.
and are dropped.
Using a standard Fortran loop with no optimization:
s = 0.
do i = 1, L
s = s + x(i)
end do
I get the expected result:
s (cpu): 2.0000000000000004
s (hex): 4000000000000001
If I enable -O2
, then the reduction is vectorized:
s (cpu): 2.0000000000000000
s (hex): 4000000000000000
presumably using the (a+b)+(c+d)
expression above. (My assembly is not great, but I believe that’s what I am seeing.)
Now, consider the do-concurrent examples below.
If I nest this loop inside of a single do-concurrent iteration:
s = 0.
do concurrent (n=1:1) local(i)
do i = 1, L
s = s + x(i)
end do
end do
then I seem to get the optimized result:
s (gp1): 2.0000000000000000
s (hex): 4000000000000000
An explicit reduction gives a similar result:
s = 0.
do concurrent (i=1:L) reduce(+:s)
s = s + x(i)
enddo
s (gp2): 2.0000000000000000
s (hex): 4000000000000000
In general, there’s no issue here with using the optimized expression. The issue for me is that we seem to have no control over it.
AFAIK, the Fortran standard requires that the order of arithmetic ought to be preserved, and any involuntary reordering should be considered an extension or feature. I am unsure how that extends to do-concurrent loops, but I would hope it applies to the contents of each loop iteration, as in the second example.
I’m not thrilled about the level of control over the do-loop example. Ideally, I’d like to vectorize non-reducing loops, and leave the summations untouched, perhaps only vectorizing operations to sum()
and similar intrinsics. But at least there is some set of flags which execute the loop as it is written.
As far as I can tell, we don’t have any control over the do-loop inside the do-concurrent.
Is there some way to control, or at least know, the order of operations inside of the do-concurrent loops?