OpenACC atomic update of array elements

Hi, I have this code where I have to increment q (initially q(:,:,:) = 0).

do n=1,NP
	call getIJK(x(n),y(n),z(n),i,j,k) ! x,y,z input, i,j,k output
	q(i,j,k) = q(i,j,k) + m(n)/v(i,j,k)
enddo

I have to parallelize it with OpenACC, and I tried:

!$acc parallel loop gang vector num_gangs(nblocks) vector_length(nthreads)
do n=1,NP
	call getIJK(x(n),y(n),z(n),i,j,k) ! x,y,z input, i,j,k output
	!$acc atomic update
	q(i,j,k) = q(i,j,k) + m(n)/v(i,j,k)
enddo
!$acc end parallel loop

It compiles but the results are wrong. How can I fix it?

Hi utente731,

Difficult to say for sure without a full reproducing example. Though I’m assuming you’re passing i, j, and k by reference? If so, they won’t be private by default, so the problem could be there.

What do the compiler feedback messages show? (i.e. add “-Minfo=accel” to your compilation).

Does adding “private(i,j,k)” to the parallel compute construct fix the issue?
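
For example, applied to the loop you posted, it would look something like this (just a sketch showing where the clause goes):

!$acc parallel loop gang vector num_gangs(nblocks) vector_length(nthreads) private(i,j,k)
do n=1,NP
   call getIJK(x(n),y(n),z(n),i,j,k) ! x,y,z input, i,j,k output
   !$acc atomic update
   q(i,j,k) = q(i,j,k) + m(n)/v(i,j,k)
enddo
!$acc end parallel loop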

If not, please put together a minimal reproducing example.

Thanks,
Mat

Note: what I have shown is a representative example; the real code is at the end of this reply.

Though I’m assuming you’re passing i, j, and k by reference?

The compiler doesn’t give any information about these variables.

What do the compiler feedback messages show? (i.e. add “-Minfo=accel” to your compilation).

With -Minfo=all:

802, Generating NVIDIA GPU code
        803, !$acc loop gang(maxnblocks), vector(nthreads) ! blockidx%x threadidx%x

Does adding “private(i,j,k)” to the parallel compute construct fix the issue?

No, it doesn’t fix the issue.

Regarding the reproducing example, I should say that in the same program there is another, similarly structured routine that seems to work, but the compiler says:

35, Generating NVIDIA GPU code
         36, !$acc loop gang(maxnblocks) ! blockidx%x
         38, !$acc loop seq
         39, !$acc loop seq
     38, Loop carried dependence of zp,yp prevents parallelization
         Loop carried backward dependence of zp,yp prevents vectorization
         Loop carried dependence of xp prevents parallelization
         Loop carried backward dependence of xp prevents vectorization
     39, Loop carried dependence of zp,yp prevents parallelization
         Loop carried backward dependence of zp,yp prevents vectorization
         Loop carried dependence of xp prevents parallelization
         Loop carried backward dependence of xp prevents vectorization
         Loop carried dependence of conc prevents parallelization
         Loop carried backward dependence of conc prevents vectorization
     41, Reference argument passing prevents parallelization: ierrel
         Reference argument passing prevents parallelization: restos
         Reference argument passing prevents parallelization: restoy
         Reference argument passing prevents parallelization: restox
         Reference argument passing prevents parallelization: k
         Reference argument passing prevents parallelization: j
         Reference argument passing prevents parallelization: i
         Reference argument passing prevents parallelization: snorm
         Reference argument passing prevents parallelization: ynorm
         Reference argument passing prevents parallelization: xnorm
         Reference argument passing prevents parallelization: saves
         Reference argument passing prevents parallelization: zint

And the routine is the following:

!$acc data present(xp, yp, zp, istatu, partyp, amass, volijk2, cgridp, &
!$acc& emip, outp, terain2, xrel2, yrel2, sgridh2, &
!$acc& ipter, nmed_calcon, ndputi,conc,maxnblocks,nthreads)

!$acc parallel loop gang vector num_gangs(maxnblocks) vector_length(1)
    do p = 1,ipter
        if (istatu(p).eq.1.or.istatu(p).eq.-1.or.istatu(p).eq.-2) THEN
            do mat = 1,outp%nummat
            do n = 1,outp%numsorg(mat)
                if (partyp(p).eq.outp%vetsou(mat,n)) then
                    call relo3d &
                        (xp(p) ,yp(p) ,zp(p) ,sgridh2,cgridp%nliv,cgridp%nliv , &
                        terain2,cgridp%nx ,cgridp%ny ,cgridp%nx ,cgridp%ny ,cgridp%top  , &
                        xrel2  ,yrel2  ,cgridp%dx    ,cgridp%dy    , &
                        ZINT  ,SAVES ,XNORM ,YNORM ,SNORM , &
                        I     ,J     ,K     , &
                        RESTOX,RESTOY,RESTOS, &
                        IERREL)
                    if (ierrel.eq.0.or.ierrel.eq.4) then 
                        !$acc atomic update
                        conc(i,j,k,mat) = conc(i,j,k,mat) + &
                        amass(p,outp%specie(mat))/(volijk2(i,j,k)*nmed_calcon)  
                    endif   
                endif   
            enddo    
            enddo
        endif     
    enddo
!$acc end parallel loop 
!$acc end data

I don’t have a perfectly reproducing example, but the real code is the following:


!$acc data present(velm(1:ipter),dtempm(1:ipter))

		!$acc data copy(velc,dtempc,velcu,dtempcu,numpar) &
		!$acc& present(xp,yp,zp,istatu,prisep,volijkpr,zgridpr,sgridh,terain,xrel,yrel, &
		!$acc& sgridhm,xrelm,yrelm,terainm,zgrid1)
		!$acc parallel loop gang vector num_gangs(maxnblocks) vector_length(nthreads) 
		do n=1,ipter 
			if (istatu(n).eq.1.or.istatu(n).eq.-1.or.istatu(n).eq.-2) then
				call relo3d(xp(n) ,yp(n) ,zp(n) ,sgridh,prisep%nliv,prisep%nliv , &
		            terain,prisep%nx,prisep%ny,prisep%nx,prisep%ny,prisep%top, &
		            xrel,yrel ,prisep%dx,prisep%dy, &
		            adummy,bdummy,cdummy,ddummy,edummy , &
		            ig, jg, kg, &
		            fdummy,gdummy,hdummy , &
		            ierrel)
				! outputs of relo3d are ig,jg,kg,ierrel, other variables are the inputs
				if (ierrel .eq. 0 .or. ierrel .eq. 4) then
					!$acc atomic update
					dtempc(ig,jg,kg) = dtempc(ig,jg,kg) + dtempm(n)/volijkpr(ig,jg,kg)
					!$acc end atomic
					!$acc atomic update
					velc(ig,jg,kg) = velc(ig,jg,kg) + velm(n)/volijkpr(ig,jg,kg)
					!$acc end atomic
					!$acc atomic update
					numpar(ig,jg,kg) = numpar(ig,jg,kg) + 1
					!$acc end atomic
				endif
			endif
		enddo

!$acc end data

The results are not repeatable from run to run, probably because of a race condition.

Many thanks,
Massimiliano

Hi Massimiliano,

Does the first snippet fail, or was it only meant to show the basic structure, with the “real” code (snippet #3) being what you’re actually evaluating? I tried to write a reproducing example (see below) by mixing #1 and #3, but the code seems to run fine.

I agree that what you describe does seem to be due to some type of race condition, such as a variable not being privatized. While in OpenACC scalars are firstprivate by default, if a scalar is passed by reference (the default in Fortran), the compiler may need to assume that its address will be taken, which in effect causes it to be shared by default. Hence why I asked you to try explicitly adding these variables to a “private” clause. It’s not always needed (as is the case with my example below), and typically the compiler messages will give you an indication when you need to privatize these variables. Explicitly adding the scalars to a private clause may also help with those “Reference argument passing prevents parallelization” messages.
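
For instance, on your #2 snippet that would mean listing the scalars from those messages directly on the compute construct, something like this (just a sketch; I’m assuming none of these values are needed after the loop):

!$acc parallel loop gang vector num_gangs(maxnblocks) vector_length(1) &
!$acc& private(i,j,k,ierrel,restox,restoy,restos,xnorm,ynorm,snorm,saves,zint)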

My feeling is that it’s unlikely to be a problem with the atomic update itself, but rather a scalar not being private, or possibly a data synchronization issue where some device value isn’t getting updated. Of course, I’m just guessing here, and without a reproducing example it’s very difficult to say for sure.
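
On the data synchronization side, the thing to check is whether the host copies of any arrays in your “present” clauses are modified between the creation of the data region and this kernel. If so, they would need an explicit update before the loop, for example (purely illustrative, using the arrays from your snippet):

!$acc update device(velm(1:ipter),dtempm(1:ipter))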

One main difference I see between #2 (working) and #3 (failing) is that you set #2’s vector length to 1 and #3’s to “nthreads”. Granted, I don’t know what the value of nthreads is, but I presume it’s greater than 1.
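
If so, a quick experiment would be to force #3 to run with a vector length of 1, the same as #2, and see whether the results stabilize:

!$acc parallel loop gang vector num_gangs(maxnblocks) vector_length(1)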

Not sure it will help, but here’s the example program I used when trying (and failing) to reproduce the error.

% cat test.f90

module foo

contains

subroutine getIJK (x,y,z,i,j,k,ierrel)
!$acc routine seq
   integer, intent(in)  :: x,y,z
   integer, intent(out) :: i,j,k,ierrel
   i=x
   j=y
   k=z
   ierrel=x
end subroutine getIJK

end module foo

program main
   use foo
   integer :: n,i,j,k,ierrel
   integer, dimension(:), allocatable :: x,y,z
   real, dimension(:,:,:), allocatable :: q,v
   real, dimension(:), allocatable :: m
   real :: r
   integer :: NP, NN,maxnblocks,nthreads
   NP=1024
   NN=64
   maxnblocks=NN
   nthreads=32
   allocate(x(NP))
   allocate(y(NP))
   allocate(z(NP))
   allocate(m(NP))
   allocate(q(NN,NN,NN))
   allocate(v(NN,NN,NN))
   q=0.0
   v=1.0
   do n=1,NP
     call random_number(r)
     x(n) = int(r*NN)+1
     call random_number(r)
     y(n) = int(r*NN)+1
     call random_number(r)
     z(n) = int(r*NN)+1
     m(n) = n
   enddo

!$acc data copy(q(:,:,:)) copyin(x(:),y(:),z(:),m(:),v(:,:,:))

!$acc parallel loop gang vector num_gangs(maxnblocks) vector_length(nthreads)
do n=1,NP
        call getIJK(x(n),y(n),z(n),i,j,k,ierrel) ! x,y,z input, i,j,k output
        if (ierrel .eq. 1 .or. ierrel .eq. 4) then
          !$acc atomic update
           q(i,j,k) = q(i,j,k) + m(n)/v(i,j,k)
        endif
enddo
!$acc end parallel loop

!$acc end data

 do i=1,NN
 do k=1,NN
 do j=1,NN
   if (q(i,j,k) .gt. 0) then
      print *, i,j,k,q(i,j,k)
   endif
enddo
enddo
enddo

end program main
% nvfortran test.f90 -acc -Minfo=accel -V22.9 ; a.out > log.acc
getijk:
      7, Generating acc routine seq
         Generating NVIDIA GPU code
main:
     49, Generating copyin(m(:),v(:,:,:),x(:),y(:),z(:)) [if not already present]
         Generating copy(q(:,:,:)) [if not already present]
     51, Generating NVIDIA GPU code
         52, !$acc loop gang(64), vector(nthreads) ! blockidx%x threadidx%x
% nvfortran test.f90  -V22.9 ; a.out > log.cpu
% diff log.acc log.cpu
%

-Mat

Many thanks, Mat, for your valuable help.
Your example works, which means that the atomic update directive is actually working.
In our case the program still doesn’t work: with the same settings, two runs give slightly different results, and those results are not correct. So each run gives slightly different results: does that mean there is some race condition?
I should add that I have isolated the atomic update of numpar with (maxnblocks,nthreads) (in parallel) and it works. So the problem could be somewhere in the code related to the array variables velm and dtempm.
Thanks.
Massimiliano

With atomics from multiple GPU threads and blocks, the updates (adds) will not occur in the same order from run to run, so you will get slight run-to-run variation due to the different order of operations in floating-point arithmetic.
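
Here’s a tiny standalone illustration (not from your code) of why the summation order matters in single precision:

program fp_order
   ! Floating-point addition is not associative: the same three values
   ! summed in a different order can give different results.
   implicit none
   real :: a, b, c
   a = 1.0e20
   b = -1.0e20
   c = 1.0
   print *, (a + b) + c   ! prints 1.0
   print *, a + (b + c)   ! prints 0.0 (c is lost when added to b)
end program fp_order

The atomic directive guarantees that each add is applied without corruption, but not the order in which adds from different threads are applied, so small run-to-run differences like this are expected and don’t by themselves indicate a bug.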