OpenACC atomic update of array elements

Hi, I have this code where I have to increment q (initially q(:,:,:) = 0).

do n=1,NP
	call getIJK(x(n),y(n),z(n),i,j,k) ! x,y,z input, i,j,k output
	q(i,j,k) = q(i,j,k) + m(n)/v(i,j,k)
enddo

I have to parallelize it with OpenACC, and I tried:

!$acc parallel loop gang vector num_gangs(nblocks) vector_length(nthreads)
do n=1,NP
	call getIJK(x(n),y(n),z(n),i,j,k) ! x,y,z input, i,j,k output
	!$acc atomic update
	q(i,j,k) = q(i,j,k) + m(n)/v(i,j,k)
enddo
!$acc end parallel loop

It compiles but the results are wrong. How can I fix it?

Hi utente731,

Difficult to say for sure without a full reproducing example. Though I’m assuming you’re passing i, j, and k by reference? If so, they won’t be private by default, so the problem could be there.

What do the compiler feedback messages show? (i.e. add “-Minfo=accel” to your compilation).

Does adding “private(i,j,k)” to the parallel compute construct fix the issue?
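
For example, applied to the loop you posted, it would look something like this (just a sketch showing where the clause goes):

!$acc parallel loop gang vector num_gangs(nblocks) vector_length(nthreads) private(i,j,k)
do n=1,NP
   call getIJK(x(n),y(n),z(n),i,j,k) ! x,y,z input, i,j,k output
   !$acc atomic update
   q(i,j,k) = q(i,j,k) + m(n)/v(i,j,k)
enddo
!$acc end parallel loop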

If not, please put together a minimal reproducing example.

Thanks,
Mat

Note: what I have shown is a representative example; the real code is at the end of this reply.

Though I’m assuming you’re passing i, j, and k by reference?

The compiler doesn’t give any information about these variables.

What do the compiler feedback messages show? (i.e. add “-Minfo=accel” to your compilation).

With -Minfo=all:

802, Generating NVIDIA GPU code
        803, !$acc loop gang(maxnblocks), vector(nthreads) ! blockidx%x threadidx%x

Does adding “private(i,j,k)” to the parallel compute construct fix the issue?

No, it doesn’t fix the issue.

Regarding the reproducing example, I should say that in the same program there is another, similarly structured routine that seems to work, but the compiler says:

35, Generating NVIDIA GPU code
         36, !$acc loop gang(maxnblocks) ! blockidx%x
         38, !$acc loop seq
         39, !$acc loop seq
     38, Loop carried dependence of zp,yp prevents parallelization
         Loop carried backward dependence of zp,yp prevents vectorization
         Loop carried dependence of xp prevents parallelization
         Loop carried backward dependence of xp prevents vectorization
     39, Loop carried dependence of zp,yp prevents parallelization
         Loop carried backward dependence of zp,yp prevents vectorization
         Loop carried dependence of xp prevents parallelization
         Loop carried backward dependence of xp prevents vectorization
         Loop carried dependence of conc prevents parallelization
         Loop carried backward dependence of conc prevents vectorization
     41, Reference argument passing prevents parallelization: ierrel
         Reference argument passing prevents parallelization: restos
         Reference argument passing prevents parallelization: restoy
         Reference argument passing prevents parallelization: restox
         Reference argument passing prevents parallelization: k
         Reference argument passing prevents parallelization: j
         Reference argument passing prevents parallelization: i
         Reference argument passing prevents parallelization: snorm
         Reference argument passing prevents parallelization: ynorm
         Reference argument passing prevents parallelization: xnorm
         Reference argument passing prevents parallelization: saves
         Reference argument passing prevents parallelization: zint

And the routine is the following:

!$acc data present(xp, yp, zp, istatu, partyp, amass, volijk2, cgridp, &
!$acc& emip, outp, terain2, xrel2, yrel2, sgridh2, &
!$acc& ipter, nmed_calcon, ndputi,conc,maxnblocks,nthreads)

!$acc parallel loop gang vector num_gangs(maxnblocks) vector_length(1)
    do p = 1,ipter
        if (istatu(p).eq.1.or.istatu(p).eq.-1.or.istatu(p).eq.-2) THEN
            do mat = 1,outp%nummat
            do n = 1,outp%numsorg(mat)
                if (partyp(p).eq.outp%vetsou(mat,n)) then
                    call relo3d &
                        (xp(p) ,yp(p) ,zp(p) ,sgridh2,cgridp%nliv,cgridp%nliv , &
                        terain2,cgridp%nx ,cgridp%ny ,cgridp%nx ,cgridp%ny ,cgridp%top  , &
                        xrel2  ,yrel2  ,cgridp%dx    ,cgridp%dy    , &
                        ZINT  ,SAVES ,XNORM ,YNORM ,SNORM , &
                        I     ,J     ,K     , &
                        RESTOX,RESTOY,RESTOS, &
                        IERREL)
                    if (ierrel.eq.0.or.ierrel.eq.4) then 
                        !$acc atomic update
                        conc(i,j,k,mat) = conc(i,j,k,mat) + &
                        amass(p,outp%specie(mat))/(volijk2(i,j,k)*nmed_calcon)  
                    endif   
                endif   
            enddo    
            enddo
        endif     
    enddo
!$acc end parallel loop 
!$acc end data

I don’t have a perfectly reproducing example, but the real code is the following:


!$acc data present(velm(1:ipter),dtempm(1:ipter))

		!$acc data copy(velc,dtempc,velcu,dtempcu,numpar) &
		!$acc& present(xp,yp,zp,istatu,prisep,volijkpr,zgridpr,sgridh,terain,xrel,yrel, &
		!$acc& sgridhm,xrelm,yrelm,terainm,zgrid1)
		!$acc parallel loop gang vector num_gangs(maxnblocks) vector_length(nthreads) 
		do n=1,ipter 
			if (istatu(n).eq.1.or.istatu(n).eq.-1.or.istatu(n).eq.-2) then
				call relo3d(xp(n) ,yp(n) ,zp(n) ,sgridh,prisep%nliv,prisep%nliv , &
		            terain,prisep%nx,prisep%ny,prisep%nx,prisep%ny,prisep%top, &
		            xrel,yrel ,prisep%dx,prisep%dy, &
		            adummy,bdummy,cdummy,ddummy,edummy , &
		            ig, jg, kg, &
		            fdummy,gdummy,hdummy , &
		            ierrel)
				! outputs of relo3d are ig,jg,kg,ierrel, other variables are the inputs
				if (ierrel .eq. 0 .or. ierrel .eq. 4) then
					!$acc atomic update
					dtempc(ig,jg,kg) = dtempc(ig,jg,kg) + dtempm(n)/volijkpr(ig,jg,kg)
					!$acc end atomic
					!$acc atomic update
					velc(ig,jg,kg) = velc(ig,jg,kg) + velm(n)/volijkpr(ig,jg,kg)
					!$acc end atomic
					!$acc atomic update
					numpar(ig,jg,kg) = numpar(ig,jg,kg) + 1
					!$acc end atomic
				endif
			endif
		enddo

!$acc end data

The results are not repeatable from run to run, probably because of a race condition.

Many thanks,
Massimiliano

Hi Massimiliano,

Does the first snippet fail, or was it only meant to show the basic structure, with the “real” code (snippet #3) being what you’re actually evaluating? I tried to write a reproducing example (see below) by mixing #1 and #3, but the code seems to run fine.

I agree that what you describe does seem to be due to some type of race condition, such as a variable not being privatized. While in OpenACC scalars are firstprivate by default, if a scalar is passed by reference (the default in Fortran), the compiler may need to assume that its address will be taken, which in effect causes it to be shared by default. Hence why I asked you to try explicitly adding these variables to a “private” clause. It’s not always needed (as is the case with my example below), and typically the compiler messages will give you an indication when you need to privatize these variables. Explicitly adding the scalars to a private clause may also help with those “Reference argument passing prevents parallelization” messages.
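
For instance, on your #2 snippet that would mean listing the scalars from those messages directly on the compute construct, something like this (just a sketch; I’m assuming none of these values are needed after the loop):

!$acc parallel loop gang vector num_gangs(maxnblocks) vector_length(1) &
!$acc& private(i,j,k,ierrel,restox,restoy,restos,xnorm,ynorm,snorm,saves,zint)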

My feeling is that it’s unlikely to be a problem with the atomic update itself, but rather a scalar not being private, or possibly a data synchronization issue where some device value isn’t getting updated. Of course, I’m just guessing here, and without a reproducing example it’s very difficult to say for sure.
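
On the data synchronization side, the thing to check is whether the host copies of any arrays in your “present” clauses are modified between the creation of the data region and this kernel. If so, they would need an explicit update before the loop, for example (purely illustrative, using the arrays from your snippet):

!$acc update device(velm(1:ipter),dtempm(1:ipter))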

One main difference I see between #2 (working) and #3 (failing) is that you set #2’s vector length to 1 and #3’s to “nthreads”. Granted, I don’t know what the value of nthreads is, but I presume it’s greater than 1.
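
If so, a quick experiment would be to force #3 to run with a vector length of 1, the same as #2, and see whether the results stabilize:

!$acc parallel loop gang vector num_gangs(maxnblocks) vector_length(1)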

Not sure it will help, but here’s the example program I used when trying (and failing) to reproduce the error.

% cat test.f90

module foo

contains

subroutine getIJK (x,y,z,i,j,k,ierrel)
!$acc routine seq
   integer, intent(in)  :: x,y,z
   integer, intent(out) :: i,j,k,ierrel
   i=x
   j=y
   k=z
   ierrel=x
end subroutine getIJK

end module foo

program main
   use foo
   integer :: n,i,j,k,ierrel
   integer, dimension(:), allocatable :: x,y,z
   real, dimension(:,:,:), allocatable :: q,v
   real, dimension(:), allocatable :: m
   real :: r
   integer :: NP, NN,maxnblocks,nthreads
   NP=1024
   NN=64
   maxnblocks=NN
   nthreads=32
   allocate(x(NP))
   allocate(y(NP))
   allocate(z(NP))
   allocate(m(NP))
   allocate(q(NN,NN,NN))
   allocate(v(NN,NN,NN))
   q=0.0
   v=1.0
   do n=1,NP
     call random_number(r)
     x(n) = int(r*NN)+1
     call random_number(r)
     y(n) = int(r*NN)+1
     call random_number(r)
     z(n) = int(r*NN)+1
     m(n) = n
   enddo

!$acc data copy(q(:,:,:)) copyin(x(:),y(:),z(:),m(:),v(:,:,:))

!$acc parallel loop gang vector num_gangs(maxnblocks) vector_length(nthreads)
do n=1,NP
        call getIJK(x(n),y(n),z(n),i,j,k,ierrel) ! x,y,z input, i,j,k output
        if (ierrel .eq. 1 .or. ierrel .eq. 4) then
          !$acc atomic update
           q(i,j,k) = q(i,j,k) + m(n)/v(i,j,k)
        endif
enddo
!$acc end parallel loop

!$acc end data

 do i=1,NN
 do k=1,NN
 do j=1,NN
   if (q(i,j,k) .gt. 0) then
      print *, i,j,k,q(i,j,k)
   endif
enddo
enddo
enddo

end program main
% nvfortran test.f90 -acc -Minfo=accel -V22.9 ; a.out > log.acc
getijk:
      7, Generating acc routine seq
         Generating NVIDIA GPU code
main:
     49, Generating copyin(m(:),v(:,:,:),x(:),y(:),z(:)) [if not already present]
         Generating copy(q(:,:,:)) [if not already present]
     51, Generating NVIDIA GPU code
         52, !$acc loop gang(64), vector(nthreads) ! blockidx%x threadidx%x
% nvfortran test.f90  -V22.9 ; a.out > log.cpu
% diff log.acc log.cpu
%

-Mat

Many thanks, Mat, for your valuable help.
Your example works, which means that the atomic update directive is actually working.
In our case the program still doesn’t work: with the same settings, two runs give slightly different results, and those results are not correct. So each run gives slightly different results: does that mean there is some race condition?
I should add that I have isolated the atomic update of numpar with (maxnblocks,nthreads) (in parallel) and it works. So the problem could be somewhere in the code related to the array variables velm and dtempm.
Thanks.
Massimiliano

With atomics from multiple GPU threads and blocks, the updates (adds) will not occur in the same order from run to run, so you will get slight run-to-run variation due to the different order of operations in floating-point arithmetic.
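
Here’s a tiny standalone illustration (not from your code) of why the summation order matters in single precision:

program fp_order
   ! Floating-point addition is not associative: the same three values
   ! summed in a different order can give different results.
   implicit none
   real :: a, b, c
   a = 1.0e20
   b = -1.0e20
   c = 1.0
   print *, (a + b) + c   ! prints 1.0
   print *, a + (b + c)   ! prints 0.0 (c is lost when added to b)
end program fp_order

The atomic directive guarantees that each add is applied without corruption, but not the order in which adds from different threads are applied, so small run-to-run differences like this are expected and don’t by themselves indicate a bug.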