Async wait in OpenACC

I am currently writing some code where I want to launch multiple kernels while continuing execution on the CPU. Below is a short snippet of the OpenACC part of the code. It has two nested loops which should be launched as two different kernels (in the full code these loops are inside an “acc routine gang” function). Index (i1,i2,i3,iv) of the second loop only depends on index (i1,i2,i3,iv) of the first loop.

Currently I have solved this dependency by adding “!$acc wait(number) async(number)”. The idea is that the wait itself is seen as a “kernel” which is launched on the GPU, so the CPU is not stuck waiting. However, I haven’t been able to find other examples where “wait async” is used.

I have tried to solve the dependency by making sure the same GPU thread handles index (i1,i2,i3,iv) in both loops. However, as far as I can understand, OpenACC provides no guarantee of which thread handles which index?

I was wondering if there is another way to implement this that avoids the wait statements, as they seem to have an impact on performance when I try to scale the code.

The data movement itself is also async, and I’m unsure whether I need the first wait statement before the first loop.

```fortran
!$acc data copyin(self) copy(self%mem) create(self%prim, &
!$acc      self%left, self%rght, self%grad, self%flux, self%ff) &
!$acc      async(self%task_number)

!$acc wait(self%task_number) async(self%task_number)
!$acc parallel loop collapse(4) default(none) async(self%task_number)
do iv=1,5
  do i3=self%lb(3),self%ub(3)
    do i2=self%lb(2),self%ub(2)
      do i1=self%lb(1),self%ub(1)
        self%mem(i1,i2,i3,iv,self%new,1) = self%mem(i1,i2,i3,iv,self%it,1) ! + 10 !TODO remove
        if (iv == 1) then
          self%prim(i1,i2,i3,1)  = log(self%mem(i1,i2,i3, 1,self%it,1))
          self%prim(i1,i2,i3,iv) = self%mem(i1,i2,i3,iv,self%it,1) &
                                 / self%mem(i1,i2,i3, 1,self%it,1)
        end if
      end do
    end do
  end do
end do
!$acc end parallel

!$acc wait(self%task_number) async(self%task_number)
!$acc parallel loop collapse(4) default(none) async(self%task_number)
do iv=1,5
  do i3=self%lb_1(3),self%ub_1(3)
    do i2=self%lb_1(2),self%ub_1(2)
      do i1=self%lb_1(1),self%ub_1(1)
        self%grad(i1,i2,i3,iv,1) = 0.5*(self%prim(i1+1,i2,i3,iv) - self%prim(i1-1,i2,i3,iv))
        self%grad(i1,i2,i3,iv,2) = 0.5*(self%prim(i1,i2+1,i3,iv) - self%prim(i1,i2-1,i3,iv))
      end do
    end do
  end do
end do
!$acc end parallel

!$acc wait(self%task_number) async(self%task_number)
!$acc end data
```

Thanks in advance

For this, you wouldn’t use a “wait async”. Rather, compute constructs (parallel, kernels) with the same async queue number are implicitly ordered. Hence in your case, the first parallel region will be launched asynchronously on the device, the CPU will continue, launch the second parallel region, and then continue on until encountering a “wait”. However, since the second parallel region is on the same queue, it will wait for the first parallel region to complete before running on the device.
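As a minimal sketch of that same-queue behavior (the arrays and the queue number 1 are placeholders, not from the original code):

```fortran
! Both kernels go on queue 1: the second implicitly waits for the first
! on the device, while the host runs past both launches immediately.
!$acc parallel loop async(1)
do i = 1, n
  a(i) = a(i) + 1.0
end do

!$acc parallel loop async(1)   ! starts only after the loop above completes
do i = 1, n
  b(i) = 2.0 * a(i)
end do

! ... host work here overlaps with the GPU ...

!$acc wait(1)                  ! host blocks only here
```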

“wait async” is used to create dependencies between different async queues. So if your parallel regions had different queue numbers, you would add “wait(Q1) async(Q2)”, meaning queue Q2 should wait for Q1 to finish before proceeding. The CPU doesn’t block on a “wait async”; for that you’d use “wait” by itself. “!$acc wait” waits for all async queues, while “!$acc wait(Qnumber)” waits for a particular async queue to complete. “wait(Q1) async(Q1)”, as you have it, is extraneous.
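A hedched sketch of that cross-queue pattern (queues 1 and 2 and the arrays are again placeholders):

```fortran
!$acc parallel loop async(1)   ! kernel A on queue 1
do i = 1, n
  a(i) = a(i) + 1.0
end do

! Enqueue "wait for queue 1" onto queue 2; the host does NOT block here.
!$acc wait(1) async(2)

!$acc parallel loop async(2)   ! kernel B on queue 2, guaranteed to see A's results
do i = 1, n
  b(i) = 2.0 * a(i)
end do

!$acc wait                     ! host blocks here for all queues
```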

Your issue is most likely the data region. “async” isn’t a valid clause on a structured data region (i.e. !$acc data / !$acc end data); it can only be used on unstructured data directives (i.e. !$acc enter data / !$acc exit data). Hence the code is most likely blocking there.
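For instance, the structured region in your snippet could be rewritten along these lines (a sketch only; the clause grouping is my assumption, and I’ve kept your self%task_number queue variable):

```fortran
!$acc enter data copyin(self, self%mem) &
!$acc      create(self%prim, self%left, self%rght, self%grad, self%flux, self%ff) &
!$acc      async(self%task_number)

! ... the two parallel loops, each with async(self%task_number) ...

!$acc exit data copyout(self%mem) &
!$acc      delete(self%prim, self%left, self%rght, self%grad, self%flux, self%ff, self) &
!$acc      async(self%task_number)
```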

For a tutorial on using async, here’s a set of slides by Jeff Larkin, derived from his chapter in the Parallel Programming with OpenACC book.

The accompanying examples can be found at:

The caveat being that, functionally, the host will still block when copying data back to the host (i.e. “update self(arr) async(Qid)”), so it’s better not to schedule copies back to the host until after all GPU computation is complete.

Thank you for the fast reply

I did not know that async cannot be used on a structured data region, and you are right, that is most likely my problem.
The reason I use async is that I do not want the CPU thread to spin-wait for the data.

In the full code, I launch many OpenMP tasks, each of which executes the code shown previously but on different data. Therefore I would not want my CPU thread to wait for the data, but rather continue with its work.

I did try an unstructured data region, but it did not seem to work properly when accessing arrays from a class (self%mem). Could this be caused by a missing compiler flag, or is this a known issue with the current OpenACC version?

Regards, Michael

There shouldn’t be any reason why this wouldn’t work, but without an example it’s difficult for me to know what the problem is. Any details you can provide as to what the actual error is (runtime? compilation? verification?) would be helpful.

You might try using our Deep Copy feature (-gpu=deepcopy) which will implicitly do a deep copy of the class so you only need:

!$acc enter data copyin(self)

But I have no idea if this will fix the problem you’re seeing.