How to overlap loops inside a kernels region

I have a Fortran kernel like the following:

     1  !$acc kernels
     2  do i = 1, times

    10  !$acc loop
    11  do j = 1, pver

    20  enddo
    30  !$acc loop
    31  do k = 1, levels

    40  enddo

    50  end do ! loop i

    60  !$acc end kernels

The loop on line 2 cannot be parallelized because of a dependence.
But the loops on line 11 and line 31 can be parallelized.
Is there a way to overlap the loops on line 11 and line 31?

I know that if the loop on line 2 were not there, I could use async for loops 11 and 31.
But async is not allowed inside a kernels region.

Any suggestions?
Thanks,

Hi shan,

Since the code is incomplete, I can only give a best guess at your options. A full example would be helpful.

Given this, I’d say you probably want to do something like this:

     1  ! Add an OpenACC data region
     2  do i = 1, times
        ...
    10  !$acc kernels loop async(1)
    11  do j = 1, pver
        ...
    20  enddo
        ...
    30  !$acc kernels loop async(2)
    31  do k = 1, levels
        ...
    40  enddo
    41  !$acc wait  ! use wait here if there is a dependency within the times loop
        ...
    50  end do ! loop i
    51  !$acc wait  ! use wait here if there isn't a dependency within the times loop
        ...
    60  ! End the OpenACC data region

Since the “times” loop carries a dependency, you probably don't want to offload it to the device. An outer sequential loop inside the compute region would run in “gang-redundant” mode, where every gang executes the same code. You would then either have to run a single gang, which limits performance, or run multiple gangs that each execute the inner loops redundantly.

With the “times” loop kept on the host, you can put each inner loop in its own compute region and give the two regions different async queue numbers so that the loops can execute concurrently on the device. The placement of the “wait” directive depends on what else is going on in the times loop and whether it needs any of the data brought back from the device.
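
As a rough, self-contained sketch of the pattern above: the arrays a and b, their sizes, and the loop bodies are made up purely for illustration and are not taken from your code. The sketch assumes the two inner loops touch different arrays. Without OpenACC enabled (for example, compiling without -acc under nvfortran) the directives are treated as comments and the program simply runs serially.

```fortran
program overlap_sketch
  implicit none
  ! Hypothetical sizes and arrays, chosen only for illustration
  integer, parameter :: times = 4, pver = 1000, levels = 2000
  real :: a(pver), b(levels)
  integer :: i, j, k

  a = 0.0
  b = 0.0

  ! Keep both arrays resident on the device across all iterations
  !$acc data copy(a, b)
  do i = 1, times              ! dependent loop stays sequential on the host
     ! Each inner loop gets its own compute region and its own async queue
     !$acc kernels loop async(1)
     do j = 1, pver
        a(j) = a(j) + 1.0
     end do

     !$acc kernels loop async(2)
     do k = 1, levels
        b(k) = b(k) + 2.0
     end do

     ! Wait inside the loop if the next iteration reads results across queues;
     ! otherwise a single wait after "end do" is enough
     !$acc wait
  end do
  !$acc end data

  ! Simple self-check: each element was incremented once per iteration
  if (any(a /= real(times)) .or. any(b /= 2.0 * real(times))) error stop "unexpected result"
  print *, "ok"
end program overlap_sketch
```

Whether the two loops actually overlap depends on the device and on how much of it each loop occupies; a profiler timeline is the usual way to confirm it.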

Hope this helps,
Mat