Specified loop mapping schedule not applied (PGI Acc)

sWienke · January 19, 2012, 8:34am

Hi,
I have a nested loop and want to apply a parallel-schedule to the outer loop and a vector-schedule to the inner loop. However, the compiler feedback says that it only parallelized the outer loop with parallel,vector. How can I tell the compiler not to do that?

My only guess as to why the compiler does not apply the schedule I specified is that the inner loop needs a reduction, which the compiler might not recognize.

Pseudo code:

#pragma acc for parallel // reduction over sum needed
for i<n {
  // do stuff
#pragma acc for vector(BLOCK_SIZE) // reduction over tmp needed
     for j < m {
          tmp += // something
     }
     // use tmp
     sum += // something
}

So, the compiler recognizes the reduction for “sum” (outer loop). However, instead of using another reduction for the inner loop and applying my given loop schedule, the compiler moves the parallelism entirely to the outer loop (and thereby needs no reduction for the inner loop).

Regards, Sandra

MatColgrove · January 19, 2012, 6:27pm

Hi Sandra,

The problem here is that the inner loop can’t be parallelized since it contains a dependency (tmp). Right now the compiler takes the view from the thread level, where every thread would need to have its own private copy of tmp.

What we are investigating is how to parallelize these types of loops so that the parallelization at both the block and thread levels is taken into account. So your code would turn into something more like what you’re thinking:

#pragma acc for parallel // reduction over sum needed
// Each block works on a single "i"
for i<n {

// This section would be performed by a single thread
  // do stuff

// Now perform tmp's reductions using all the threads in a block
#pragma acc for vector(BLOCK_SIZE) // reduction over tmp needed
     for j < m {
          tmp += // something
     }

// Now back to using a single thread per block
     // use tmp

// create a partial sum per block, then launch
// a separate kernel to perform the final sum reduction
     sum += // something

}
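
To make the scheme above concrete, here is a minimal serial C sketch that only emulates it on the host. It is not compiler output and uses no accelerator directives. The function name block_level_sum, the constants BLOCK_SIZE and MAX_BLOCKS, and the expressions (i + j) and 2.0 * tmp are all hypothetical stand-ins for the elided "something" parts.

```c
#include <assert.h>

#define BLOCK_SIZE 4   /* hypothetical; a power of two so the tree step works */
#define MAX_BLOCKS 64  /* hypothetical upper bound on n for this sketch */

/* Serial emulation: one "block" per i, BLOCK_SIZE "threads" per block. */
double block_level_sum(int n, int m)
{
    assert(n >= 0 && n <= MAX_BLOCKS && m >= 0);
    double block_partial[MAX_BLOCKS];

    for (int i = 0; i < n; ++i) {              /* each block works on a single i */
        double lane[BLOCK_SIZE];

        /* every "thread" t accumulates its own strided slice of j */
        for (int t = 0; t < BLOCK_SIZE; ++t) {
            lane[t] = 0.0;
            for (int j = t; j < m; j += BLOCK_SIZE)
                lane[t] += (double)(i + j);    /* stand-in for "something" */
        }

        /* tree reduction across the threads of the block gives tmp */
        for (int stride = BLOCK_SIZE / 2; stride > 0; stride /= 2)
            for (int t = 0; t < stride; ++t)
                lane[t] += lane[t + stride];
        double tmp = lane[0];

        /* back to a single thread: use tmp, keep one partial sum per block */
        block_partial[i] = 2.0 * tmp;          /* stand-in for "use tmp" */
    }

    /* separate final pass, standing in for the second reduction kernel */
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += block_partial[i];
    return sum;
}
```

Calling block_level_sum(4, 3) walks four blocks of three inner iterations each. On a real device the lanes of one block would run concurrently and the tree step would need synchronization between strides, which this serial sketch does not model.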

Right now you need to break these up into multiple loops and manually privatize tmp (i.e. make tmp an array). Something like:

#pragma acc region
{
for i<n {
  // do stuff
}


for i<n {
     for j < m {
          tmp[i] += // something
     }
}

for i<n {
     // use tmp[i]
     sum += // something
}
}  // end the acc region
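
As a rough illustration of that workaround, the plain C sketch below puts the fused form next to the split form with tmp turned into an array, so the two can be checked against each other on the host. The names fused_sum, split_sum and inner_term are hypothetical, as are the expressions standing in for the elided "something" parts. The accelerator directives are left out on purpose; in the thread's setting the split loops would sit inside a single acc region.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the elided inner-loop expression. */
static double inner_term(int i, int j) { return (double)(i + j); }

/* Original shape: scalar tmp carried through the inner loop. */
double fused_sum(int n, int m)
{
    assert(n >= 0 && m >= 0);
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        double tmp = 0.0;
        for (int j = 0; j < m; ++j)
            tmp += inner_term(i, j);
        sum += 2.0 * tmp;                 /* stand-in for "use tmp" */
    }
    return sum;
}

/* Split shape: tmp privatized as tmp[i], one loop nest per phase.
 * Returns -1.0 if the temporary array cannot be allocated. */
double split_sum(int n, int m)
{
    assert(n >= 0 && m >= 0);
    double *tmp = calloc((size_t)n + 1, sizeof *tmp);  /* zero-initialized */
    if (tmp == NULL)
        return -1.0;

    /* phase 1 ("do stuff") has no stand-in here and is omitted */

    /* phase 2: each i accumulates into its own tmp[i] */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            tmp[i] += inner_term(i, j);

    /* phase 3: use tmp[i] and reduce into sum */
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += 2.0 * tmp[i];

    free(tmp);
    return sum;
}
```

Both functions return the same value for the same n and m. The split form pays for an extra array of n elements and separate passes over i.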

Sorry, no timeline on when such support would be available.

Best Regards,
Mat

sWienke · January 23, 2012, 12:00pm

Hi Mat,
I did try your suggestion. But splitting the code into multiple loops did not make it faster than a version based on my initial code but with the parallelism always on the outer loop (so inner loops are executed serially). I think, since we have the same kind of parallelism (totally on outer loop) in both versions, but your suggestion uses more arrays and more synchronizations (between loops), it makes sense that your suggestion slows the runtime down. Or am I missing anything? So no benefit?
