Triply nested loop using implicit OpenACC

_Sayan · September 5, 2012, 5:34pm

Greetings,

I would like to know how the PGI compiler handles nested loops in an implicit model. My code:

!$ACC KERNELS &
!$ACC PRESENT(p0,p1)
!$ACC LOOP INDEPENDENT
do k=k0,k1
  !$ACC LOOP INDEPENDENT
  do j=j0,j1
    !$ACC LOOP INDEPENDENT
    do i=i0,i1

I am assuming that the outer two loops would be distributed across blocks in the Y and X dimensions. Would the innermost loop be vectorized in this case?

Thank you,
Sayan

MatColgrove · September 5, 2012, 5:57pm

Hi Sayan,

The most likely schedule is a 2-D grid of blocks (gangs) built from the strip-mined k and j loops, and a 3-D thread block (vector) built from the k, j, and i loops. However, this depends heavily on what the body of the loop looks like and how the data is accessed.
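To make "strip mined" concrete, here is a minimal plain-Fortran sketch of splitting the k and j loops into an outer loop over tiles (loosely analogous to the gang/block index) and an inner loop within each tile (loosely analogous to the vector/thread index). The bounds and the tile size of 4 are hypothetical values chosen for illustration; the schedule the compiler actually generates may differ.

```fortran
program strip_mine_sketch
  implicit none
  ! Hypothetical bounds and tile size, chosen only for illustration.
  integer, parameter :: k0 = 1, k1 = 10, j0 = 1, j1 = 7
  integer, parameter :: tile = 4
  integer :: visits(k0:k1, j0:j1)
  integer :: kb, jb, k, j

  visits = 0

  ! Outer "strip" loops: one iteration per tile, loosely analogous
  ! to a 2-D gang (block) index.
  do kb = k0, k1, tile
    do jb = j0, j1, tile
      ! Inner loops: positions within a tile, loosely analogous
      ! to vector (thread) indices. min() handles partial tiles.
      do k = kb, min(kb + tile - 1, k1)
        do j = jb, min(jb + tile - 1, j1)
          visits(k, j) = visits(k, j) + 1
        end do
      end do
    end do
  end do

  ! Every (k, j) pair should have been visited exactly once.
  if (all(visits == 1)) then
    print *, 'strip-mined loops cover the iteration space exactly once'
  else
    error stop 'coverage mismatch'
  end if
end program strip_mine_sketch
```

The strip-mined loops visit the same (k, j) pairs as the original pair of loops, only grouped into tiles, which is what lets the tile index and the within-tile index be mapped to different levels of parallelism.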

Hope this helps,
Mat

_Sayan · September 5, 2012, 7:09pm

Thanks, Mat. Can you comment in general on whether this is a good way to use OpenACC in terms of performance? We observe different performance when we build this code block with different compilers. So I wanted to ask if I should explicitly use gangs and vector clauses in order to tune my code.

MatColgrove · September 5, 2012, 7:58pm

"So I wanted to ask if I should explicitly use gangs and vector clauses in order to tune my code."

Personally, I don’t find explicit schedule tuning to help much. In my experience the PGI compiler finds a good schedule in the vast majority of cases, and I’d rather not tie my program to a particular schedule, since it may not be optimal for other devices.

However, since you’re tuning for the compiler, not the device, it may be worth it to you to set the schedule yourself. Granted, there’s more to performance than the schedule, so fixing the schedule may still yield varying performance. Worth a try, though.
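For illustration, one way to set the schedule explicitly is to add gang, worker, and vector clauses to the loop directives. This is only a sketch under assumptions: the loop body, the array bounds, the vector length of 64, and the copyin/copyout clauses (used here instead of PRESENT so the example stands alone) are all hypothetical and not taken from the original code.

```fortran
program explicit_schedule_sketch
  implicit none
  ! Hypothetical bounds; the thread does not give the real ones.
  integer, parameter :: i0 = 1, i1 = 64, j0 = 1, j1 = 16, k0 = 1, k1 = 8
  real :: p0(i0:i1, j0:j1, k0:k1), p1(i0:i1, j0:j1, k0:k1)
  integer :: i, j, k

  p0 = 1.0
  p1 = 0.0

  ! copyin/copyout replace PRESENT so no enclosing data region is needed.
  !$acc kernels copyin(p0) copyout(p1)
  !$acc loop independent gang
  do k = k0, k1
    !$acc loop independent worker
    do j = j0, j1
      !$acc loop independent vector(64)
      do i = i0, i1
        ! Hypothetical loop body, for illustration only.
        p1(i, j, k) = 2.0 * p0(i, j, k)
      end do
    end do
  end do
  !$acc end kernels

  ! The result is the same with or without OpenACC enabled.
  if (all(p1 == 2.0)) then
    print *, 'explicit schedule sketch produced the expected result'
  else
    error stop 'unexpected result'
  end if
end program explicit_schedule_sketch
```

With the PGI compilers, building with -acc -Minfo=accel prints the schedule the compiler actually chose for each loop, which makes it possible to compare an explicit schedule like this one against the compiler's default.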

Mat