OpenACC accelerated routines and WAIT directives

Does the WAIT directive work within accelerated routines? If not, is there a way to specify that an accelerated routine’s loop is to finish before its execution can continue?

For illustration purposes only:

module Vars
integer, allocatable :: A(:),B(:)
!$acc declare create(A,B)
end module Vars

subroutine Calc()
!$acc routine worker
use Vars

!$acc loop
do index=1,1000
A(index) = Value(index)
end do
!-- WAIT-1:
!$acc wait

!$acc loop
do index=1,1000
A(index) = A(index) + Calc2(index)
end do
!-- WAIT-2:
!$acc wait
end subroutine Calc

integer function Calc2(max)
!$acc routine worker
use Vars
integer max

Calc2=0
!$acc loop
do index=1,max
Calc2 = Calc2+B(index)
end do
!-- WAIT-3:
!$acc wait
end function Calc2

allocate(A(0:10000))
allocate(B(0:10000))
do index=1,10000
B(index)=index
end do

!$acc update device(A,B)

!$acc parallel
call Calc
!$acc end parallel
!-- WAIT-4:
!$acc wait
!$acc update self(A)

Will WAIT-1 in Calc prevent its second loop from executing until the first one is done? What about Calc2’s WAIT-3? Is it needed or is there an implicit barrier at the end of routines?

And lastly, is WAIT-4 needed before the update or is it implicit in the barrier for the compute construct?

Hi Richard,

The “wait” directive can only be used in host code, not on the device. OpenACC explicitly does not include a barrier operation within device code (there may be implicit barriers added by the compiler, though those would be target and implementation dependent).

“wait” is used to set a host barrier to wait for the various “async” queues that can be created. Note that “async” is only allowed on compute constructs (parallel, kernels), unstructured data regions, and update directives.
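As a rough sketch of that host-side usage (the program name, the arrays X and Y, and the queue ids 1 and 2 are made up for illustration and are not from the question), two independent loops can be launched on separate queues and the host then waits for both:

```fortran
program AsyncWait
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  integer :: X(n), Y(n)          ! hypothetical arrays, for illustration only

  !$acc data create(X, Y)

  ! Launched on queue 1; the host does not block here.
  !$acc parallel loop gang async(1)
  do i = 1, n
     X(i) = i
  end do

  ! Independent work on queue 2; may overlap with queue 1.
  !$acc parallel loop gang async(2)
  do i = 1, n
     Y(i) = 2 * i
  end do

  ! Work placed on the same queue runs in order, so each update
  ! follows the loop that was launched on that queue.
  !$acc update self(X) async(1)
  !$acc update self(Y) async(2)

  ! Host barrier: block until all queues have drained.
  !$acc wait
  !$acc end data

  print *, X(n), Y(n)
end program AsyncWait
```

Without an OpenACC compiler the directives are treated as comments and the program simply runs serially on the host.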

Since you’re using worker routines, I’m assuming your outer loops are gang loops and that you have an outer parallel region. In that case, there is no implied barrier between the gang loops. Instead, you may want to make each of the gang loops its own compute region with “parallel loop gang”. By default, the host blocks at the end of a parallel region before proceeding, so the second loop will not start until the first has finished. In cases where you don’t have a dependency, you can then apply “async(n)”, where “n” is a queue id, to make the parallel region non-blocking.
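A minimal sketch of that restructuring, assuming the two loops from Calc are moved into host code as separate compute regions. The program name, the bounds, the argument name "upper", and the "A(i) = i" placeholder standing in for Value(index) are assumptions for illustration only:

```fortran
module Vars
  implicit none
  integer, allocatable :: A(:), B(:)
  !$acc declare create(A, B)
contains
  integer function Calc2(upper)
    !$acc routine worker
    integer, intent(in) :: upper
    integer :: j, total
    total = 0
    ! No wait is used here; the reduction is complete when the loop ends.
    !$acc loop worker reduction(+:total)
    do j = 1, upper
       total = total + B(j)
    end do
    Calc2 = total
  end function Calc2
end module Vars

program SplitRegions
  use Vars
  implicit none
  integer, parameter :: n = 1000
  integer :: i

  allocate(A(n), B(n))
  do i = 1, n
     B(i) = i
  end do
  !$acc update device(B)

  ! First compute region: the host blocks until all gangs have finished.
  !$acc parallel loop gang
  do i = 1, n
     A(i) = i                   ! placeholder for Value(index)
  end do

  ! Second compute region: launched only after the first has completed,
  ! so no wait directive is needed between the two.
  !$acc parallel loop gang
  do i = 1, n
     A(i) = A(i) + Calc2(i)
  end do

  ! The region above is blocking, so A is ready to copy back here.
  !$acc update self(A)
  print *, A(1), A(n)
end program SplitRegions
```

Because neither region uses “async”, no “wait” is needed before the final update. It would only be needed if the regions were launched asynchronously.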

-Mat