Reproducibility of atomic operations

Hi,

If I have the following code:

!$acc kernels default(present) present(a,sum0,sum1)
!$acc loop independent
        do k=2,npm1
!$acc loop independent
          do i=1,nrm1
!$acc atomic
            sum0(i)=sum0(i)+a%r(i,2,k)*dph(k)*pl_i*two
          enddo
        enddo
!$acc end kernels

Should I expect the same answer each time I run it, or is there a chance the atomics are done in a different order each time so my results will vary due to order-of-sum floating point variations?

I ask because the code this is part of returns answers that differ at the level of floating-point round-off every time I run it on the same GPU, with the same code, etc.

If so, is there some kind of environment variable I can set to make the compiler/run-time compute the atomics in a reproducible way (even if slower)?

Thanks,

  • Ron

Hi Ron,

or is there a chance the atomics are done in a different order each time so my results

The order in which CUDA threads are run is non-deterministic, hence the atomic updates can be executed in a different order each time the code is run.
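To illustrate why the order matters, here is a minimal sketch (the program name and the values are made up for illustration, they are not from your code). Floating-point addition is not associative, so adding the same three numbers in two different groupings can give different results. Assuming IEEE single precision and no value-unsafe optimizations such as fast-math, the two sums below should not compare equal:

```fortran
program sum_order_demo
  ! Hypothetical demo: same three values, two different summation orders.
  use, intrinsic :: iso_fortran_env, only: real32
  implicit none
  real(real32) :: a, b, c, left, right

  a =  1.0e8_real32
  b = -1.0e8_real32
  c =  1.0_real32

  ! a and b cancel exactly first, so c survives.
  left  = (a + b) + c
  ! c is smaller than the spacing between representable values near 1e8,
  ! so it is rounded away when added to b before a cancels b.
  right = a + (b + c)

  print *, 'left  = ', left
  print *, 'right = ', right
  print *, 'equal?  ', left == right
end program sum_order_demo
```

With atomics, which of these groupings you effectively get for each sum0(i) depends on the order the threads happen to reach the update, which is why the last few digits can change from run to run.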

Is this the full loop? If so, you may want to interchange the loops and then only run “i” in parallel. Each thread will then sum one element of sum0 in a fixed order, which should help with reproducibility. You can also then try doing a vector reduction across “k”, but you may encounter a similar issue, though with fewer threads per reduction the rounding differences may not be as bad.

!$acc kernels loop default(present) present(a,sum0,sum1)
    do i=1,nrm1
        sum = 0
! Optionally use a vector reduction (change "!acc" to "!$acc" below to enable it)
!acc loop vector reduction(+:sum)
        do k=2,npm1
            sum=sum+a%r(i,2,k)*dph(k)*pl_i*two
        enddo
        sum0(i) = sum0(i) + sum
    enddo

Hope this helps,
Mat

Hi,

Thanks!

So there is no ENV I can set to force the threads to be deterministic for the atomics for testing?

I have previously tried inverting the loops but the performance went down because the single loop dimension is too small to parallelize well across the GPU.

I have not tried the new array-reduction support yet because I typically only add a new feature to the code when the feature is supported in the latest community edition.
Are array-reductions supported in 19.4?

Thanks!

  • Ron

Hi Ron,

So there is no ENV I can set to force the threads to be deterministic for the atomics for testing?

Not that I’m aware of. Scheduling is done by the CUDA driver and I highly doubt there’s a way to force the order in which the threads read/write to memory.

Are array-reductions supported in 19.4?

No, we’re still working on adding support for this.

-Mat