I am currently writing some code in which I want to launch multiple kernels while continuing execution on the CPU. Below is a short snippet of the OpenACC part of the code. It has two nested loops that should be launched as two different kernels (in the full code these loops are inside an “acc routine gang” function). Index (i1,i2,i3,iv) of the second loop only depends on index (i1,i2,i3,iv) of the first loop.
Currently I have solved this dependency by adding “!$acc wait(number) async(number)”. The idea is that the wait itself is seen as a “kernel” which is launched on the GPU, so the CPU is not stuck waiting. However, I haven’t been able to find other examples where “wait async” is used.
I have tried to solve the dependency by making sure the same GPU thread handles index (i1,i2,i3,iv) in both loops. However, as far as I understand, OpenACC provides no guarantee about which thread handles which index?
I was wondering if there is another way to implement this that avoids the wait statements, as they seem to have an impact on performance when I try to scale the code.
The data movement itself is also async, and I’m unsure if I need the first wait-statement before the first loop.
!$acc data copyin(self) copy(self%mem) create(self%prim, &
!$acc      self%left, self%rght, self%grad, self%flux, self%ff) &
!$acc      async(self%task_number)
!$acc wait(self%task_number) async(self%task_number)
!$acc parallel loop collapse(4) default(none) async(self%task_number)
do iv=1,5
  do i3=self%lb(3),self%ub(3)
    do i2=self%lb(2),self%ub(2)
      do i1=self%lb(1),self%ub(1)
        self%mem(i1,i2,i3,iv,self%new,1) = self%mem(i1,i2,i3,iv,self%it,1) ! + 10 !TODO remove
        if (iv == 1) then
          self%prim(i1,i2,i3,1) = log(self%mem(i1,i2,i3, 1,self%it,1))
        else
          self%prim(i1,i2,i3,iv) = self%mem(i1,i2,i3,iv,self%it,1) &
                                 / self%mem(i1,i2,i3, 1,self%it,1)
        end if
      end do
    end do
  end do
end do
!$acc end parallel loop
!$acc wait(self%task_number) async(self%task_number)
!$acc parallel loop collapse(4) default(none) async(self%task_number)
do iv=1,5
  do i3=self%lb_1(3),self%ub_1(3)
    do i2=self%lb_1(2),self%ub_1(2)
      do i1=self%lb_1(1),self%ub_1(1)
        self%grad(i1,i2,i3,iv,1) = 0.5*(self%prim(i1+1,i2,i3,iv) - self%prim(i1-1,i2,i3,iv))
        self%grad(i1,i2,i3,iv,2) = 0.5*(self%prim(i1,i2+1,i3,iv) - self%prim(i1,i2-1,i3,iv))
      end do
    end do
  end do
end do
!$acc end parallel loop
!acc wait(self%task_number) async(self%task_number)
!$acc end data
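For anyone who wants to reproduce this, the same pattern can be reduced to a small stand-alone program. This is only a sketch under assumptions, not the actual code: the names a, b, c, n and q are hypothetical, the arrays are 1-D, and the data region is kept synchronous to keep it short. It shows two dependent loops placed on one async queue with a “wait async” between them, and a blocking wait on the host only at the very end.

```fortran
! Minimal sketch with hypothetical names (a, b, c, n, q); not the original code.
program wait_async_sketch
  implicit none
  integer, parameter :: n = 1024   ! hypothetical problem size
  integer, parameter :: q = 1      ! hypothetical async queue id
  real :: a(n), b(n), c(n)
  integer :: i

  ! Host-side initialisation.
  do i = 1, n
    a(i) = real(i)
  end do
  c = 0.0

  ! Synchronous data region (assumption: simpler than the async one above).
  !$acc data copyin(a) create(b) copy(c)

  ! First kernel: enqueued on queue q, the host continues immediately.
  !$acc parallel loop present(a, b) async(q)
  do i = 1, n
    b(i) = 2.0 * a(i)
  end do
  !$acc end parallel loop

  ! Enqueue a wait on queue q into queue q itself; the host does not block.
  !$acc wait(q) async(q)

  ! Second kernel: reads neighbouring elements of b written by the first.
  !$acc parallel loop present(b, c) async(q)
  do i = 2, n - 1
    c(i) = 0.5 * (b(i+1) - b(i-1))
  end do
  !$acc end parallel loop

  ! The host blocks only here, before the data region copies c back.
  !$acc wait(q)
  !$acc end data

  ! With b(i) = 2*i the central difference is 2.0 for every interior point.
  print *, 'c(2) =', c(2)
end program wait_async_sketch
```

Without OpenACC enabled the directives are plain comments and the program runs serially, which gives a reference result to compare against. As far as I read the OpenACC specification, operations enqueued on the same async queue execute in the order they were enqueued, so the intermediate wait is kept here only to mirror the snippet above.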
Thanks in advance.