Hi,
I have a nested loop and want to apply a parallel-schedule to the outer loop and a vector-schedule to the inner loop. However, the compiler feedback says that it only parallelized the outer loop with parallel,vector. How can I tell the compiler not to do that?
My only guess as to why the compiler ignores the schedule I specified is that the inner loop needs a reduction, which the compiler might not recognize.
Pseudo code:
#pragma acc for parallel // reduction over sum needed
for i<n {
    // do stuff
    #pragma acc for vector(BLOCK_SIZE) // reduction over tmp needed
    for j < m {
        tmp += // something
    }
    // use tmp
    sum += // something
}
So, the reduction for “sum” (outer loop) is recognized by the compiler. However, instead of using another reduction for the inner loop and applying my given loop schedule, the compiler moves the parallelism entirely to the outer loop (and thereby needs no reduction for the inner loop).
The problem here is that the inner loop can’t be parallelized since it contains a dependency (tmp). Right now the compiler takes the view from the thread level, where every thread would need to have its own private copy of tmp.
What we are investigating is how to parallelize these types of loops so that the parallelization at both the block and thread level is taken into account. So your code would turn into something more like what you’re thinking:
#pragma acc for parallel // reduction over sum needed
// Each block works on a single "i"
for i<n {
    // This section would be performed by a single thread
    // do stuff
    // Now perform tmp's reduction using all the threads in a block
    #pragma acc for vector(BLOCK_SIZE) // reduction over tmp needed
    for j < m {
        tmp += // something
    }
    // Now back to using a single thread per block
    // use tmp
    // Create a partial sum per block, then launch
    // a separate kernel to perform the final sum reduction
    sum += // something
}
Right now you need to break these up into multiple loops and manually privatize tmp (i.e. make tmp an array). Something like:
#pragma acc region
{
    for i<n {
        // do stuff
    }
    for i<n {
        for j < m {
            tmp[i] += // something
        }
    }
    for i<n {
        // use tmp[i]
        sum += // something
    }
} // end the acc region
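To make the split concrete, here is a minimal sketch in plain C. The actual "do stuff" and "something" steps are not given in this thread, so the per-row weight `scale[i]`, the row-major input `a`, and the function name `split_loops_sum` are all hypothetical stand-ins. The accelerator directive is left as a comment so the sketch compiles as ordinary serial C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: weighted sum of the row sums of an n-by-m,
 * row-major matrix a. scale and tmp are caller-provided scratch arrays
 * of length n; tmp is the manually privatized version of the scalar. */
double split_loops_sum(const double *a, size_t n, size_t m,
                       double *scale, double *tmp)
{
    double sum = 0.0;
    assert(a != NULL && scale != NULL && tmp != NULL);

    /* The "#pragma acc region {" from the post would open here. */

    /* Loop 1: stand-in for "do stuff"; also zero the privatized tmp. */
    for (size_t i = 0; i < n; ++i) {
        scale[i] = (double)(i + 1);
        tmp[i] = 0.0;
    }

    /* Loop 2: each i accumulates into its own tmp[i], so the outer
     * loop carries no dependency on a shared scalar. */
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < m; ++j) {
            tmp[i] += a[i * m + j];
        }
    }

    /* Loop 3: stand-in for "use tmp"; sum is a plain reduction over i. */
    for (size_t i = 0; i < n; ++i) {
        sum += scale[i] * tmp[i];
    }

    /* ...and the region would close here. */
    return sum;
}
```

Note that tmp[i] has to be zeroed before the accumulation loop, which the pseudo code above leaves implicit.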
Sorry, no timeline on when such support would be available.
Hi Mat,
I did try your suggestion. But splitting the code into multiple loops did not make it faster than a version based on my initial code but with the parallelism always on the outer loop (so inner loops are executed serially). I think, since we have the same kind of parallelism (totally on outer loop) in both versions, but your suggestion uses more arrays and more synchronizations (between loops), it makes sense that your suggestion slows the runtime down. Or am I missing anything? So no benefit?
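For comparison, the version with all parallelism on the outer loop can be sketched as follows in plain C. The real loop bodies are not shown in this thread, so the weight `scale`, the row-major input `a`, and the name `fused_loop_sum` are hypothetical stand-ins. The point of the sketch is only that tmp stays a scalar local to each outer iteration, so no extra arrays or passes over i are needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: weighted sum of the row sums of an n-by-m,
 * row-major matrix a, written as a single outer loop. */
double fused_loop_sum(const double *a, size_t n, size_t m)
{
    double sum = 0.0;
    assert(a != NULL);

    /* Only this outer loop would be parallelized; sum is the single
     * reduction variable. */
    for (size_t i = 0; i < n; ++i) {
        double scale = (double)(i + 1); /* stand-in for "do stuff" */
        double tmp = 0.0;               /* private to this iteration */

        /* Inner loop runs serially within one outer iteration. */
        for (size_t j = 0; j < m; ++j) {
            tmp += a[i * m + j];
        }

        sum += scale * tmp;             /* stand-in for "use tmp" */
    }
    return sum;
}
```

Compared with splitting the work into three loops, this form makes one pass over i and keeps tmp out of memory, which would be consistent with the split version not running any faster.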