!$acc atomic update problems


My program has a subroutine with the following structure:


========================== start accelerated part
!$acc data fi,sigma
!$acc parallel
!$acc loop


! part in which gdp is computed: a function of (i,j), but only one
! number per (i,j) iteration

!$acc atomic update
end do
end do

!$acc end parallel
!$acc end data
=========================== end accelerated part

end do
end do

nbod: 1-15
idir: 1-25

inf(ib), inf(jb): 1000-30000

sigma(jb,j,kb,kmode): available; input to the subroutine
fi: generated in this subroutine, exported for use elsewhere

fi, sigma and gdp are single-precision complex numbers.
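Filling in the elided pieces, the accelerated part presumably looks something like the sketch below. The data clauses, the loop indices, and the fi update statement are guesses based on the description above, not the actual code:

```fortran
! Sketch only: index names, data clauses and the update statement are
! assumptions; the computation of gdp is elided as in the original post.
!$acc data copyin(sigma) copy(fi)
!$acc parallel
!$acc loop collapse(2)
do i = 1, inf(ib)            ! 1000-30000 iterations
   do j = 1, inf(jb)         ! 1000-30000 iterations
      ! ... compute gdp: one single-precision complex value per (i,j) ...
      !$acc atomic update
      fi(kb,kmode) = fi(kb,kmode) + gdp   ! hypothetical indices into fi
   end do
end do
!$acc end parallel
!$acc end data
```

If many (i,j) iterations update the same element of fi, a `reduction` clause on the loop (where the update pattern allows it) tends to perform much better than a per-iteration atomic, on both multicore and GPU targets.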

The GPU: GTX 1080 Ti (11 GB)

Problem:

When compiled (version 17.10) for multicore CPU, adding !$acc atomic
update leads to higher computation time.

When compiled for GPU there are no compilation problems, but for higher
values of the unknowns (inf(ib) > approx. 2000) the computation crashes
on calling this subroutine. For lower numbers the results are OK.
If I leave out the atomic directive, the answers for fi are incorrect.

What can be done to:

  1. stop the crashing for higher values of the unknowns in the GPU version.
  2. improve the speed of the GPU version.

Or: is there a workaround?

By the way, the error when the GPU version crashes is:

call to cuMemcpyDtoHAsync returned error 999: Unknown
call to cuMemFreeHost returned error 999: Unknown

Thanks for your time,


Hi Jo,

  1. stop the crashing for higher values of the unknowns in the GPU version.

What’s the size of “fi”? When I see a program that fails as the size of the bounds increases, my first thought is that an array is growing over 2GB. Try adding “-Mlarge_arrays” if “fi” is allocatable, or “-mcmodel=medium” if it’s static. These flags allow for arrays greater than 2GB.
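For reference, these flags go on the compile/link line. Something like the following (the source file name and the -ta target are illustrative, not taken from the thread):

```shell
# Allocatable arrays larger than 2GB:
pgfortran -acc -ta=tesla -Minfo=accel -Mlarge_arrays sub.f90 -o prog

# Static arrays larger than 2GB (medium memory model):
pgfortran -acc -ta=tesla -Minfo=accel -mcmodel=medium sub.f90 -o prog
```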

If that’s not it, then we’ll need a reproducing example in order to investigate since error “999” is generic.

  2. improve the speed of the GPU version.

Have you run the code through a profiler to see where the time is being spent? Again, I’d need a reproducer to offer specific suggestions, but things to look for are: the loop schedule being used; whether you’ve parallelized all potential loops; whether your arrays’ stride-1 dimension is accessed by the “vector” loop index; how many registers are in use per thread; whether data movement has been optimized; etc.
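A quick first look can be had with the timing built into the PGI OpenACC runtime, or with the pgprof profiler (both shipped with 17.10); the executable name here is a placeholder:

```shell
# Per-region timing printed by the OpenACC runtime at exit:
export PGI_ACC_TIME=1
./prog

# Kernel-level GPU profile:
pgprof ./prog
```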


Hi Mat,

Thank you for the quick response!
Unfortunately, -Mlarge_arrays did not do the job. I am now putting together a shortened version of the code which will hopefully show the problem as well. I will probably have to send you the code with some input files.
The input files generate all the data necessary to run the offending part of the code, which is actually only a few lines, as you saw above.
The accompanying files will be several GB in size for the bigger problems. How do I get these to you?



Send the code to PGI Customer Service (trs@pgroup.com) and ask Dave to forward the code to me.