Consecutive atomic function used in program

Hi guys, I’m working on the CUDA Fortran programming. When I try to use consecutive ATOMIC functions in a device code (as the following code, I use 4 atomicadd), it turns out to be wrong.

1. attributes(global) subroutine d_calDensity(nlen,x,y,tag,rho,mx,my)
2.        implicit none
3.        ! declare subroutine interface
4.        integer(kind=4),value :: nlen,mx,my
5.        integer(kind=4),dimension (nlen,2) :: tag
6.        real(kind=4),dimension (mx,my)  :: rho
7.        real(kind=4),dimension (nlen,2) :: x,y
8.        ! define temporary parameters
9.        integer(kind=4) :: i,j,idx,idy
10.       real(kind=4) :: dx,dy,w11,w12,w22,w21,istat
11.       ! define the 2D thread index for particle index
12.       i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
13.       j = blockIdx%y;
14.       ! calculate the number density of Harris type using atomic process
15.       if(i <= nlen .and. j == 1)then
16.             if(tag(i,1).gt.0)then
17.                    idx = int(x(i,1));      idy = int(y(i,1))
18.                    dx = x(i,1) - real(idx,4)
19.                    dy = y(i,1) - real(idy,4)
20.                    w11 = (1.0 - dx)*(1.0 - dy)
21.                    w12 = (1.0 - dx)*dy
22.                    w22 = dx*dy
23.                    w21 = dx*(1.0 - dy)
24.                    istat = atomicadd(rho(idx,idy),w11)
25.                    istat = atomicadd(rho(idx+1,idy),w21)
26.                    istat = atomicadd(rho(idx+1,idy+1),w22)
27.                    istat = atomicadd(rho(idx,idy+1),w12)
28.             end if
29.       end if
30.       return
31. end subroutine d_calDensity

But if I only use ONE atomic operation command, or either one of four (Line 24-27), it works and I can get the right result (has been compared with the CPU version result). Is this means that the atomic operation can not be used one after another? Or I just do not write this in a proper form? Thanks a lot for helping me!

You can use atomics one after the other. There is nothing wrong or illegal about that. It’s not possible for me to look at your code snippet and make any other comments about whatever you may be observing. I would suggest debugging your code, in the failing case.

Generally speaking, this is not a valid way of assessing the correctness of the results of floating-point computation. “A person with one watch always knows what time it is, but a person with two watches is never sure”. Use of a reference implementation that utilizes a higher-precision floating-point format may represent a suitable point of comparison.

Floating-point arithmetic is (as contrasted with the underlying mathematical operations) not associative. In other words, there can be no guarantee that (a+b)+c == (b+c)+a == (a+c)+b when evaluated in finite precision floating-point arithmetic. Use of multiple atomic additions causes the summation of the operands to occur in indeterminate order, and because of non-associativity, the order matters numerically. The actual order of additions may vary between runs of the same code and/or between GPUs.

Note that this code contains constructs that are numerically questionable because they may exhibit symptoms of subtractive cancellation. For example, it will generally be advantageous to write fma (-dx, dy, dy) instead of (1.0 - dx) * dy. FMA is available in Fortran 2018 as a built-in IEEE_FMA, from what I understand. I do not know what CUDA Fortran supports. If there is no support for explicitly coded FMA yet, you may want to file an enhancement request with NVIDIA.