Nvfortran atomicAdd incorrect work

I use nvfortran 23.1-0 64-bit target on x86-64 Linux -tp skylake-avx512

try to use AtomicAdd to caluclate sum of arrays’ elements. Set print from kernel and it seems like don’t work properly.

            1    1.000000        0.000000    
            2    2.000000        0.000000    
            3    3.000000        0.000000    
            4    4.000000        0.000000    
            5    5.000000        0.000000    
Sum of array(5) =   0.0
            1    1.000000        0.000000    
Sum of array(1) =   0.0

the source code is the following:

module a
  use cudafor
  implicit none 

  contains

  attributes(global) &
  subroutine sum_kernel_atomic(a, n, res)
    implicit none
    
    real :: a(*)          !< input devica array
    integer, value :: n   !< input len of array
    real :: res           !< output sum of array
    integer :: idx
    idx = threadIdx%x
    if (idx <= n) then 
      !
      res = atomicadd(res, a(idx))
      ! print the values 
      write(*,*) idx, a(idx), res 
    end if 
    return 
  end subroutine sum_kernel_atomic 

end module a

program atomic_test
  use a 
  use cudafor 
  implicit none
  integer :: n, m
  real , allocatable :: arr(:)
  real , allocatable, device :: darr(:)

  real :: res 
  real, device :: dres 

  n = 5
  allocate (arr(n))
  arr(:) = [(m, m=1,n)]
  allocate (darr(n))
  darr(:) = arr(:)


  dres = 0.0
  call sum_kernel_atomic<<<1,n>>>(darr, n, dres) 
  res = dres 
  write(*, '("Sum of array(",i1,") = ",f5.1)') n, res 

  m = 1
  dres = 0.0
  call sum_kernel_atomic<<<1,1>>>(darr, m, dres) 
  res = dres 
  write(*, '("Sum of array(",i1,") = ",f5.1)') m, res 

  deallocate(arr)
  deallocate(darr)
end program atomic_test

Hi medvsn,

“atomicAdd” returns the value of the variable before it was updated. Hence by using:

res = atomicadd(res, a(idx))

the code is overwriting “res” with the old value. You’ll want to use a different variable to capture the return value.

-Mat

I replace original line res = atomicadd(res, a(idx)) to the following two operations:

tmp = res 
tmp = atomicadd(res, a(idx))

and get target result.

          idx     a(idx)             res             tmp
            1    1.000000        15.00000        0.000000    
            2    2.000000        15.00000        1.000000    
            3    3.000000        15.00000        3.000000    
            4    4.000000        15.00000        6.000000    
            5    5.000000        15.00000        10.00000    
Sum of array(5) =  15.0
            1    1.000000        1.000000        0.000000    
Sum of array(1) =   1.0

I think it’s a bit weird, it means, that res = tmp + a(idx), where tmp must be output. or not?

P.S.: Firstly, I tried to use res = atomicadd(tmp, a(idx)) , but I got zeros in res values.

Sorry, I’m not clear on what you’re asking.

“tmp” in this context is the value of “res” prior to the add, so looks correct.

“res” is correct as well at 15. All the threads in a warp (i.e. groups of 32 threads) execute the same instruction at the same time, so by the time the code gets to the print, all threads have executed the atomicAdd and value is the same. If there were a larger amount of threads (i.e. > 32), then you’re likely to see different values appear. Though the value printed will be non-deterministic and depend on the order the threads executed the atomicAdd. In general it’s not good practice to try an use the intermediary values from a global variable used in an atomic.

If you do need to print the value of res after the atomicAdd, you’d want to do something like:

tmp = atomicadd(res, a(idx))
tmp = tmp + a(idx)
write(*,*) idx, a(idx), tmp

Thank you.
To clarify, if I need to calculate sum of array and store it to variable res. To organize it I use temporary variable tmp for old value and use function atomicAdd in the following way:

tmp = atomicAdd(res, a(idx))  ! tmp -- old value of res
                              ! res -- new value after operation inside the function atomicaAdd