Nested Directives in OpenACC

I’m trying to use the nested parallelism feature of OpenACC to activate the dynamic parallelism capability of my GPU. I have a Tesla K40c, and my OpenACC compiler is PGI version 15.7.

My code is quite simple. When I try to compile the following code, the compiler returns these messages:

PGCC-S-0155-Illegal context for pragma: acc  parallel loop (test.cpp: 158)
PGCC/x86 Linux 15.7-0: compilation completed with severe errors
My code structure:



#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
  // << computation >>

  int ss = A[tid].start;
  int ee = A[tid].end;

  #pragma acc parallel loop
  for( j = ss; j < (ee + ss); j++ )
  {
    // << computation >>
  }
}

I’ve also tried changing my code to use the routine directive, but again it wouldn’t compile:

#pragma acc routine worker
foo(...)
{
  #pragma acc parallel loop
  for( j = ss; j < (ee + ss); j++ )
  {
    // << computation >>
  }
}

#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
  // << computation >>

  int ss = A[tid].start;
  int ee = A[tid].end;

  foo(...);

}

How am I supposed to use dynamic parallelism in OpenACC?

Hi grynet0029,

There are still a few OpenACC 2.0 features that we don’t support as of yet: nested parallelism and “declare link”.

Are you just experimenting with dynamic parallelism or do you have an application that could benefit from this feature? If it’s the latter, we’d be interested in learning more. We lowered the priority of these items since we haven’t found compelling use cases.
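
In the meantime, the usual way to write a nest like yours is with loop-level directives inside a single compute construct instead of nested parallel regions. Here’s a rough sketch of that pattern (the struct, function and array names, the sizes, and the data clauses are placeholders I made up; I’m also assuming your tid corresponds to the outer index):

typedef struct { int start, end; } range_t;

void compute(range_t *A, float *data, int N, int M)
{
  // Outer loop is spread across gangs; the inner loop is vectorized
  // within each gang. There are no nested "parallel" constructs, so
  // this compiles with the current compiler.
  #pragma acc parallel loop gang copyin(A[0:N]) copy(data[0:M])
  for( int i = 0; i < N; i++ )
  {
    int ss = A[i].start;
    int ee = A[i].end;

    #pragma acc loop vector
    for( int j = ss; j < (ee + ss); j++ )
    {
      // << computation >>
      data[j] += 1.0f;
    }
  }
}

The compiler picks the vector length here; you can pin it with a vector_length() clause on the parallel construct if you want it to match the warp size.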

Thanks,
Mat

From my experience, Dynamic Parallelism:

  • Doesn’t improve performance (it’s just another way of launching kernels, so most of the kernel-launch overhead is still there)
  • Has more compilation issues and makes code less maintainable
  • Is harder to profile precisely
  • Has less hardware support
  • Doesn’t help improve algorithm design

Thanks for the input.

I don’t agree with that generalization; it depends on how you are using dynamic parallelism.
If I don’t follow GPGPU programming rules when using a GPU, the results might be worse than on a CPU. Should I then say that the GPU is a bad computing device?

I gave those conclusions assuming all guidelines for DP are followed.

Every algorithm has its inherent granularity, and some algorithms have different granularities at different stages (calculating a Jacobian matrix is a good example). This is the primary reason DP was introduced: to handle mixed-granularity algorithms more gracefully. However, DP in its current form does not achieve this original goal.

The primary reason it does not work is simple: there is no direct data coherency between the master grid’s local data (shared memory) and the child grid’s local data; everything has to go through global memory. Without this local data coherency, all DP does is launch kernels from the GPU.

Even DP’s kernel-launching capability faces a tough dilemma. If each master thread spawns a large number of child threads (enough to fill at least one warp), the resulting child grids become overwhelmingly large, creating extra launch overhead. On the other hand, if each master thread spawns only a handful of child threads, the child grids have many idle threads in their warps (it takes at least 32 threads to fill a warp), severely reducing overall performance. It would simply be better to launch the child grid separately from the CPU and let the grid figure out the mapping.
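
To put rough, purely illustrative numbers on that dilemma: with a master grid of 10,000 threads, having each thread spawn 1,024 children means the device must juggle roughly ten million child threads and 10,000 device-side launches; having each spawn only 8 children means every child grid runs as a warp with 24 of its 32 lanes idle.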

Hence, unless the master grid is extremely small, DP will only reduce performance. Using DP for level scheduling is one such example, but even there I personally found no observable performance improvement.

Simply put: there is always a way to achieve the same or better performance than DP by not using DP.
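
To make that concrete for the loop nest in the original question: instead of letting each outer iteration launch a child kernel, you can enumerate the (i, j) work items up front and make a single flat launch. A sketch, under the assumption that the pairs can be enumerated on the host; compute_flat, seg, pos, data, and M are made-up names:

#include <stdlib.h>

typedef struct { int start, end; } range_t;

void compute_flat(range_t *A, float *data, int N, int M)
{
  // Count the total number of inner iterations ("end" is the segment
  // length in the original code), then record every (i, j) pair.
  long total = 0;
  for( int i = 0; i < N; i++ )
    total += A[i].end;

  int *seg = (int *)malloc(total * sizeof(int));  // outer index i of each work item
  int *pos = (int *)malloc(total * sizeof(int));  // inner index j of each work item
  long w = 0;
  for( int i = 0; i < N; i++ )
    for( int j = A[i].start; j < A[i].start + A[i].end; j++ )
    {
      seg[w] = i;
      pos[w] = j;
      w++;
    }

  // One flat launch: every thread gets real work, with no partially
  // filled child warps and no device-side launch overhead.
  #pragma acc parallel loop copyin(seg[0:total], pos[0:total]) copy(data[0:M])
  for( long k = 0; k < total; k++ )
  {
    int i = seg[k];
    int j = pos[k];
    // << computation on (i, j) >>
    data[j] += (float)i;
  }

  free(seg);
  free(pos);
}

Building the index arrays costs O(total) time and memory on the host; if that matters, the same mapping can instead be derived on the device with a prefix sum over the segment lengths.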