There’s not enough information here to give you a definitive answer; wrong results can come from a number of causes. It’s possible that data isn’t being copied correctly between the host and device, or it could be a race condition in your code. If you could post a full reproducing example, that would be helpful.
I do notice that your atomic capture is in an incorrect form. Try capturing the incremented value of the array element into a local variable, then use that local variable as the index into ibox.
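Here’s a minimal, stand-alone sketch of the capture form I mean. The array names nc and ibox follow your snippet, but the dimensions, the cell computation, and the driver program are made up purely for illustration, so adapt it to your actual code:

      program atomic_capture_demo
      implicit none
      integer, parameter :: ncell = 4, nslot = 8, np = 16
      integer :: nc(ncell), ibox(ncell,nslot)
      integer :: k, ii, idx

      nc   = 0
      ibox = 0

!$acc parallel loop copy(nc,ibox) private(ii,idx)
      do k = 1, np
         ! Stand-in for the real cell computation.
         ii = mod(k-1,ncell) + 1

         ! Atomically increment the per-cell counter and capture the
         ! new value, so this iteration gets its own unique slot idx.
!$acc atomic capture
         nc(ii) = nc(ii) + 1
         idx = nc(ii)
!$acc end atomic

         ! idx is unique to this iteration, so no atomic is needed here.
         ibox(ii,idx) = k
      enddo
!$acc end parallel loop

      print *, 'counts per cell:', nc
      end program atomic_capture_demo

Note that the order in which indices land in each row of ibox is nondeterministic when the loop runs in parallel; only the set of entries per cell is reproducible.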
Thanks for your answer. It’s still not right.
I think the problem is with the OpenACC directives. Once I add the OpenACC directives the result is wrong; without them the result is right. My code is part of a large project that contains more than 30 .F files of subroutines. Should I upload the entire project, or only the subroutine code?
As I don’t know how to upload files, I will post some code below.
      subroutine step
      include 'common.2D'

      call ini_divide(2)
      call divide(nbp1,npt,2)

      end

      subroutine ini_divide(kind_p)
      include 'common.2D'
!$acc declare present(nc(:,:),ibox(:,:,:),nct)

!$acc update device(nct)
!$acc kernels copyin(nplink_max)
      do i=1,nct
         nc(i,kind_p) = 0
         ibox(i,kind_p,1:nplink_max) = 0
      enddo
!$acc end kernels
!$acc update host(nc(:,:),ibox(:,:,:))

      return
      end

      subroutine divide(n_start,n_end,kind_p)
      include 'common.2D'

!$acc parallel loop vector
      do k=n_start,n_end
         if (iflag(k).ne.0) then
            dx = xp(k) - xmin
            dz = zp(k) - zmin
            icell = int( dx * one_over_3h ) + 1
            kcell = int( dz * one_over_3h ) + 1
            ii = icell + (kcell - 1)*ncx
!$acc atomic capture
            nc(ii,kind_p) = nc(ii,kind_p)+1
            idx = nc(ii,kind_p)
!$acc end atomic
!$acc atomic write
            ibox(ii,kind_p,idx)=k
         endif
      enddo
!$acc end parallel
      return
      end
Thank you very much for your advice. I’ve found the problem: it was a data transfer issue, and I have fixed it.
But the result is still slightly wrong. The result is correct when using the ‘seq’ clause in the parallel region, but when using the ‘vector’ clause the result has a small error.
Since I have incomplete information, it’s difficult for me to offer specific advice here. From the information given, it seems likely that this loop is not safely parallelizable and you should run it serially. One possibility is that the atomic capture changes the order in which particles are placed into each cell of ibox, so any downstream code that depends on that ordering (a floating-point summation, for example) can produce slightly different results. Perhaps there’s an alternative algorithm you can use?
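If you do run it serially, you can still keep the loop on the device so the arrays stay resident. Below is an untested sketch using the serial construct; it assumes the same variables from your common.2D include and the same data management as your current version, and it drops the atomics since only one device thread executes the loop:

      subroutine divide(n_start,n_end,kind_p)
      include 'common.2D'

      ! One gang, one worker, one vector lane: the fill order of ibox
      ! matches the original serial code.
!$acc serial loop
      do k = n_start, n_end
         if (iflag(k).ne.0) then
            dx = xp(k) - xmin
            dz = zp(k) - zmin
            icell = int( dx * one_over_3h ) + 1
            kcell = int( dz * one_over_3h ) + 1
            ii = icell + (kcell - 1)*ncx
            ! No atomics needed: a single thread updates nc and ibox.
            nc(ii,kind_p) = nc(ii,kind_p) + 1
            ibox(ii,kind_p,nc(ii,kind_p)) = k
         endif
      enddo
!$acc end serial loop

      return
      end

Whether this is fast enough depends on how large n_end - n_start is; if this step dominates the runtime, a different, parallel-friendly binning algorithm would be worth considering.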