Hi all,
I was wondering whether the following code is a good way to offload computations to GPUs using -stdpar while running them asynchronously (here all on queue 1), or whether there is an alternative, more explicit way:
$ cat test.f90
program main
implicit none
integer, parameter :: n = 100
integer :: nn(3)
real(8) :: arr(n,n,n), tot
integer :: i,j,k
nn(:) = n
!$acc enter data create(arr) copyin(nn) async(1)
!$acc kernels default(present) async(1)
arr(:,:,:) = 1.d0
!$acc end kernels
tot = 0.d0
!$acc kernels default(present) async(1)
do concurrent(k=1:nn(3),j=1:nn(2),i=1:nn(1)) shared(arr) reduce(+:tot)
arr(i,j,k) = i*j*k
tot = tot + 1
end do
!$acc end kernels
!$acc exit data copyout(arr) async(1)
!$acc wait(1)
print*,arr(10,10,10)
end program main
Compiling with nvfortran and -Minfo=accel seems to produce the output I'd expect:
nvfortran -stdpar -Minfo=accel test.f90
main:
9, Generating enter data create(arr(:,:,:))
Generating enter data copyin(nn(:))
10, Generating default present(arr(:,:,:))
11, Loop is parallelizable
Generating NVIDIA GPU code
11, ! blockidx%x threadidx%x auto-collapsed
!$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
14, Generating default present(arr(:,:,:))
15, Loop is parallelizable
Generating NVIDIA GPU code
15, ! blockidx%x threadidx%x auto-collapsed
!$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
Generating reduction(+:tot)
20, Generating exit data copyout(arr(:,:,:))
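For comparison, here is the directive-free variant I considered, which relies purely on -stdpar to offload the do concurrent loop. My understanding (please correct me if wrong) is that this gives up the explicit async queue and relies on the compiler's implicit data management, typically managed/unified memory:

```fortran
program main_stdpar_only
  implicit none
  integer, parameter :: n = 100
  real(8) :: arr(n,n,n), tot
  integer :: i, j, k

  tot = 0.d0
  ! With -stdpar=gpu, nvfortran offloads this loop with no directives;
  ! host/device data movement is handled implicitly by the compiler.
  do concurrent(k=1:n, j=1:n, i=1:n) reduce(+:tot)
    arr(i,j,k) = i*j*k
    tot = tot + 1
  end do

  print *, arr(10,10,10)
end program main_stdpar_only
```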
Also, I have a large OpenACC codebase that explicitly handles all host/device data movement and offloads nested 3D loops with !$acc parallel loop collapse(3) default(present) [...] async(1). If I change it to combine !$acc kernels default(present) async(1) with do concurrent as above, should I be concerned about side effects, e.g., unwanted use of managed memory when compiling only with -stdpar? Or does default(present) help prevent such potential issues?
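For reference, a minimal sketch of the explicit OpenACC pattern I would be replacing (same arr/nn/tot as in the test program above; the clause placement is from memory, so treat it as illustrative rather than exact):

```fortran
! Existing style: explicit data management plus an explicit parallel loop,
! everything queued on async stream 1.
!$acc enter data create(arr) copyin(nn) async(1)

!$acc parallel loop collapse(3) default(present) reduction(+:tot) async(1)
do k = 1, nn(3)
  do j = 1, nn(2)
    do i = 1, nn(1)
      arr(i,j,k) = i*j*k
      tot = tot + 1
    end do
  end do
end do

!$acc exit data copyout(arr) async(1)
!$acc wait(1)
```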