Nested Directives in OpenACC

I’m trying to use the nested parallelism feature of OpenACC to activate the dynamic parallelism capability of my GPU. I have a Tesla K40c, and my OpenACC compiler is PGI version 15.7.

My code is quite simple. When I try to compile the following code, the compiler returns these messages:

PGCC-S-0155-Illegal context for pragma: acc  parallel loop (test.cpp: 158)
PGCC/x86 Linux 15.7-0: compilation completed with severe errors
My code structure:



#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
  // << computation >>

  int ss = A[tid].start;
  int ee = A[tid].end;

  #pragma acc parallel loop
  for( j = ss; j < (ee + ss); j++ )
  {
    // << computation >>
  }
}

I’ve also tried changing my code to use the routine directive, but again it wouldn’t compile:

#pragma acc routine worker
foo(...)
{
  #pragma acc parallel loop
  for( j = ss; j < (ee + ss); j++ )
  {
    // << computation >>
  }
}

#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
  // << computation >>

  int ss = A[tid].start;
  int ee = A[tid].end;

  foo(...);

}

How am I supposed to use dynamic parallelism in OpenACC?

Hi grynet0029,

There are still a few OpenACC 2.0 features that we don’t support as of yet: nested parallelism and “declare link”.

Are you just experimenting with dynamic parallelism or do you have an application that could benefit from this feature? If it’s the latter, we’d be interested in learning more. We lowered the priority of these items since we haven’t found compelling use cases.
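
In the meantime, the usual way to write a nest like yours is with loop-level directives inside a single compute construct instead of nested parallel regions. Here’s a rough sketch of that pattern (the struct, function and array names, the sizes, and the data clauses are placeholders I made up; I’m also assuming your tid corresponds to the outer index):

typedef struct { int start, end; } range_t;

void compute(range_t *A, float *data, int N, int M)
{
  // Outer loop is spread across gangs; the inner loop is vectorized
  // within each gang. There are no nested "parallel" constructs, so
  // this compiles with the current compiler.
  #pragma acc parallel loop gang copyin(A[0:N]) copy(data[0:M])
  for( int i = 0; i < N; i++ )
  {
    int ss = A[i].start;
    int ee = A[i].end;

    #pragma acc loop vector
    for( int j = ss; j < (ee + ss); j++ )
    {
      // << computation >>
      data[j] += 1.0f;
    }
  }
}

The compiler picks the vector length here; you can pin it with a vector_length() clause on the parallel construct if you want it to match the warp size.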

Thanks,
Mat

From my experience, Dynamic Parallelism:

  • Doesn’t improve performance (it’s just another way of launching kernels, so most of the kernel-launch overhead is still there)
  • Has more compilation issues and makes code less maintainable
  • Is harder to profile precisely
  • Has less hardware support
  • Doesn’t help improve algorithm design

Thanks for the input.

I don’t agree with that generalization; it depends on how you are using dynamic parallelism.
If I don’t follow GPGPU programming rules when using a GPU, the results might be worse than on a CPU. Should I then say that the GPU is a bad computing device?

I gave those conclusions assuming all guidelines for DP are followed.

Every algorithm has its inherent granularity, and some algorithms have different granularities at different stages (calculating a Jacobian matrix is a good example). This is the primary reason DP was introduced: to handle mixed-granularity algorithms more gracefully. However, DP in its current form does not achieve this original goal.

The primary reason it does not work is simple: there is no direct data coherency between the master grid’s local data (shared memory) and the child grid’s local data; everything has to go through global memory. Without this local data coherency, all DP does is launch kernels from the GPU.

Even DP’s kernel-launching capability faces a tough dilemma. If each master thread spawns a large number of child threads (enough to fill at least one warp), the resulting child grids become overwhelmingly large, creating extra launch overhead. On the other hand, if each master thread spawns only a handful of child threads, the child grids have many idle threads in their warps (it takes at least 32 threads to fill a warp), severely reducing overall performance. It would simply be better to launch the child grid separately from the CPU and let the grid figure out the mapping.
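
To put rough, purely illustrative numbers on that dilemma: with a master grid of 10,000 threads, having each thread spawn 1,024 children means the device must juggle roughly ten million child threads and 10,000 device-side launches; having each spawn only 8 children means every child grid runs as a warp with 24 of its 32 lanes idle.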

Hence, unless the master grid is extremely small, DP will only reduce performance. Using DP for level scheduling is one such example, but even there I personally found no observable performance improvement.

Simply put: there is always a way to achieve the same or better performance than DP by not using DP.
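
To make that concrete for the loop nest in the original question: instead of letting each outer iteration launch a child kernel, you can enumerate the (i, j) work items up front and make a single flat launch. A sketch, under the assumption that the pairs can be enumerated on the host; compute_flat, seg, pos, data, and M are made-up names:

#include <stdlib.h>

typedef struct { int start, end; } range_t;

void compute_flat(range_t *A, float *data, int N, int M)
{
  // Count the total number of inner iterations ("end" is the segment
  // length in the original code), then record every (i, j) pair.
  long total = 0;
  for( int i = 0; i < N; i++ )
    total += A[i].end;

  int *seg = (int *)malloc(total * sizeof(int));  // outer index i of each work item
  int *pos = (int *)malloc(total * sizeof(int));  // inner index j of each work item
  long w = 0;
  for( int i = 0; i < N; i++ )
    for( int j = A[i].start; j < A[i].start + A[i].end; j++ )
    {
      seg[w] = i;
      pos[w] = j;
      w++;
    }

  // One flat launch: every thread gets real work, with no partially
  // filled child warps and no device-side launch overhead.
  #pragma acc parallel loop copyin(seg[0:total], pos[0:total]) copy(data[0:M])
  for( long k = 0; k < total; k++ )
  {
    int i = seg[k];
    int j = pos[k];
    // << computation on (i, j) >>
    data[j] += (float)i;
  }

  free(seg);
  free(pos);
}

Building the index arrays costs O(total) time and memory on the host; if that matters, the same mapping can instead be derived on the device with a prefix sum over the segment lengths.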