Combining stdpar with OpenACC async

Hi all,

I was wondering if the following code is a good way to offload computations to GPUs using -stdpar while having them run asynchronously (here all in queue 1), or if there is an alternative, more explicit way:

! cat test_stdpar async
program main
  implicit none
  integer, parameter :: n = 100
  integer :: nn(3)
  real(8) :: arr(n,n,n), tot
  integer :: i,j,k
  nn(:) = n
  !$acc enter data create(arr) copyin(nn) async(1)
  !$acc kernels default(present) async(1)
  arr(:,:,:) = 1.d0
  !$acc end kernels
  tot = 0.d0
  !$acc kernels default(present) async(1)
  do concurrent(k=1:nn(3),j=1:nn(2),i=1:nn(1)) shared(arr) reduce(+:tot)
    arr(i,j,k) = i*j*k
    tot = tot + 1
  end do
  !$acc end kernels
  !$acc exit data copyout(arr) async(1)
  !$acc wait(1)
  print*,arr(10,10,10)
end program main

Compiling with nvfortran and -Minfo=accel seems to produce the output I’d expect:

nvfortran -stdpar -Minfo=accel test.f90
main:
      9, Generating enter data create(arr(:,:,:))
         Generating enter data copyin(nn(:))
     10, Generating default present(arr(:,:,:))
     11, Loop is parallelizable
         Generating NVIDIA GPU code
         11,   ! blockidx%x threadidx%x auto-collapsed
             !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
     14, Generating default present(arr(:,:,:))
     15, Loop is parallelizable
         Generating NVIDIA GPU code
         15,   ! blockidx%x threadidx%x auto-collapsed
             !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
             Generating reduction(+:tot)
     20, Generating exit data copyout(arr(:,:,:))

Also: I have a large OpenACC codebase that explicitly handles all host/device data movement and offloads nested 3D loops with !$acc parallel loop collapse(3) default(present) [...] async(1). If I change it to combine !$acc kernels default(present) async(1) with do concurrent as above, should I be concerned about side effects, e.g., unwanted use of managed memory when compiling only with -stdpar? Or does default(present) help prevent such potential issues?
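For concreteness, the two forms being compared look like this (a minimal sketch, assuming arr is already present on the device and queue 1 is used throughout):

```fortran
! (a) Explicit OpenACC loop, as in the existing codebase:
!$acc parallel loop collapse(3) default(present) async(1)
do k = 1, n
  do j = 1, n
    do i = 1, n
      arr(i,j,k) = i*j*k
    end do
  end do
end do

! (b) do concurrent inside a kernels region, as in the test program:
!$acc kernels default(present) async(1)
do concurrent(k=1:n, j=1:n, i=1:n)
  arr(i,j,k) = i*j*k
end do
!$acc end kernels
```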

Hi Pedro,

In this case, you’re not actually using STDPAR to offload the code; rather, OpenACC is auto-parallelizing the do concurrent loop. Fortran standard language parallelism doesn’t have functionality similar to async. So I think it’s fine to use this method if you need async, but keep in mind you’re relying on OpenACC to do it.

Note that in this code the compute region will block while waiting for the reduction variable to be copied back to the host. To make it non-blocking, put “tot” into an outer data region so only the device copy of “tot” needs updating.
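A minimal sketch of that change, assuming the same queue 1 as in the original program (untested; directive placement may need adjusting for your actual code):

```fortran
  ! Keep "tot" resident on the device so the reduction updates the
  ! device copy without an implicit, blocking copy-back at the end
  ! of the kernels region.
  tot = 0.d0
  !$acc enter data copyin(tot) async(1)
  !$acc kernels default(present) async(1)
  do concurrent(k=1:nn(3),j=1:nn(2),i=1:nn(1)) shared(arr) reduce(+:tot)
    arr(i,j,k) = i*j*k
    tot = tot + 1
  end do
  !$acc end kernels
  ! Fetch the result only when it is actually needed on the host:
  !$acc update self(tot) async(1)
  !$acc exit data delete(tot) async(1)
  !$acc wait(1)
```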

should I be concerned about some side effects, e.g., unwanted use of managed memory when compiling only with -stdpar?

“-stdpar” implies “-acc -gpu=managed”, so yes managed memory will be enabled. You can disable it via “-gpu=nomanaged” or just use “-acc” without “-stdpar” (unless STDPAR is used elsewhere outside of a kernels region).

Granted, managed memory only applies to allocated (heap) memory, and here you’re using static arrays, so it doesn’t matter in this case.

-Mat
