#pragma acc kernels loop Versus #pragma acc parallel loop

Hi,

The OpenACC standard does not seem to allow “loop” after the “kernels” directive, but the PGI compiler still accepts:

#pragma acc kernels loop

In this case will the “kernels” directive act the same as the “parallel” directive?

i.e.

#pragma acc kernels loop

is actually equal to

#pragma acc parallel loop

Thanks a lot for your help!

Feng

Hi Feng,

The OpenACC standard does allow “kernels loop”. Here’s section 2.9 of the OpenACC 2.0 standard:

2.9 Combined Directives

Summary
The combined OpenACC parallel loop and kernels loop directives are shortcuts for specifying a loop directive nested immediately inside a parallel or kernels construct. The meaning is identical to explicitly specifying a parallel or kernels directive containing a loop directive. Any clause that is allowed on a parallel or loop directive is allowed on the parallel loop directive, and any clause allowed on a kernels or loop directive are allowed on a kernels loop directive.



In this case will the “kernels” directive act the same as the “parallel” directive?

With “parallel”, the user needs to define which loops to parallelize, while with “kernels” the compiler determines it. (See: http://www.pgroup.com/lit/articles/insider/v4n2a1.htm)

However, when “kernels loop” is used, the user is specifying which loop to parallelize, which is similar to “parallel loop”.
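To make the equivalence from section 2.9 concrete, here is a minimal sketch of the combined form next to the explicit nested form. The function names `scale_combined` and `scale_nested`, the array-scaling loop, and the `copy` clause are assumptions chosen for illustration; they are not from the question.

```c
/* Combined form: "kernels loop" written as a single directive. */
void scale_combined(float *restrict a, int n, float s) {
    #pragma acc kernels loop copy(a[0:n])
    for (int i = 0; i < n; i++) {
        a[i] *= s;
    }
}

/* Explicit form: a "loop" directive nested immediately inside a
   "kernels" construct. Per section 2.9 the meaning is identical. */
void scale_nested(float *restrict a, int n, float s) {
    #pragma acc kernels copy(a[0:n])
    {
        #pragma acc loop
        for (int i = 0; i < n; i++) {
            a[i] *= s;
        }
    }
}
```

Without an OpenACC-enabled compiler the pragmas are ignored and both functions simply run serially on the host, so the results can be compared directly.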

- Mat

Hi, Mat:

Thanks for your reply. I just want to make sure about the loop below:

#pragma acc data copyin(step) copyout(sum)
#pragma acc parallel loop private(i,x) reduction(+:sum)
      for (i = 0; i < n; i++) {
        x = (i+0.5)*step;
        sum +=  4.0/(1.0+x*x);
      }
    pi = step * sum;

If I replace the “parallel” with “kernels” directive:

#pragma acc kernels loop private(i,x) reduction(+:sum)

the PGI compiler still compiles it fine. In this case, will the compiler tend to do it the “parallel” way or the “kernels” way? I.e., will the compiler just ignore my private and reduction clauses?

PS: Complete code below:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
  long long int i, n=10000000000;
  double start_time, end_time;
  double x, pi;
  double sum = 0.0;
  double step = 1.0/(double) n;

#pragma acc data copyin(step) copyout(sum)
  {
    start_time = omp_get_wtime();
#pragma acc kernels loop private(i,x) reduction(+:sum)
      for (i = 0; i < n; i++) {
        x = (i+0.5)*step;
        sum +=  4.0/(1.0+x*x);
      }
    pi = step * sum;
    end_time = omp_get_wtime();
  }
  printf("pi = %17.15f\n",pi);
  printf("time to compute = %g seconds\n", (double)(end_time - start_time));
  return 0;
}

Thanks,

Hi Feng,

the PGI compiler still compiles it fine. In this case, will the compiler tend to do it the “parallel” way or the “kernels” way? I.e., will the compiler just ignore my private and reduction clauses?

Apologies, but I’m not sure I understand the question. One main difference between “parallel loop” and “kernels loop” is that “parallel” is implicitly “independent”, since you’re telling the compiler which loops are parallelizable. With “kernels loop”, the compiler needs to analyze the loop for dependencies. However, they both treat the “private” and “reduction” clauses the same, given that these are “loop” clauses.

The PGI compiler is able to discover reductions so while “reduction” isn’t ignored, it’s not needed here.

Scalars are private by default, so your “private” clause is unnecessary. Loop index variables like “i” are treated specially, so “i” would be ignored when listed in a “private” clause. However, by putting “x” in a “private” clause, you are creating an array of “x”, one element for each thread, that is accessed in global memory. This has the potential of slowing your program down. If you let the compiler privatize “x”, it will most likely be declared locally within the kernel and has the potential of being put in a register.

The rule of thumb is to not put scalars in “private” clauses unless needed, for example when the scalar has global storage or is passed by reference to a routine.
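As a rough sketch of that rule of thumb, the loop can be written with no “private” clause at all. Declaring “x” inside the loop body is one way to keep it local to each iteration; the function name `pi_midpoint` is a hypothetical wrapper added for illustration, and where the compiler actually places “x” (register or otherwise) is up to the implementation.

```c
/* Midpoint-rule estimate of pi, assuming n > 0.
   No private clause: "i" is the loop index and "x" is declared in the
   loop body, so each iteration gets its own copy without help. */
double pi_midpoint(long long n) {
    double step = 1.0 / (double)n;
    double sum = 0.0;

    #pragma acc parallel loop reduction(+:sum)
    for (long long i = 0; i < n; i++) {
        double x = ((double)i + 0.5) * step;  /* local to this iteration */
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

The “reduction” clause is kept here because “parallel loop” does not rely on compiler analysis; with “kernels loop” the PGI compiler could detect the reduction itself, as noted above.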

Your code does have an error. The device “sum” is copied out of the data region after you use it on the host. Hence “pi” is using the host “sum”, which is zero. Either update “sum” or move the computation of “pi” after your data region.
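Both fixes can be sketched roughly as follows. The function names `pi_after_region` and `pi_with_update` are hypothetical, and the timing calls are left out. I have used `copy(sum)` rather than `copyout(sum)` on the assumption that the device copy of “sum” should start from the host value of zero; treat that as a cautious choice, not a requirement stated in this thread.

```c
/* Fix 1: compute pi after the data region, once "sum" has been
   copied back to the host at the closing brace. */
double pi_after_region(long long n) {
    double step = 1.0 / (double)n;
    double sum = 0.0;

    #pragma acc data copy(sum)
    {
        #pragma acc kernels loop reduction(+:sum)
        for (long long i = 0; i < n; i++) {
            double x = ((double)i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
    }   /* device "sum" is copied to the host here */
    return step * sum;
}

/* Fix 2: stay inside the data region, but refresh the host copy of
   "sum" with an update directive before using it. */
double pi_with_update(long long n) {
    double step = 1.0 / (double)n;
    double sum = 0.0;
    double pi = 0.0;

    #pragma acc data copy(sum)
    {
        #pragma acc kernels loop reduction(+:sum)
        for (long long i = 0; i < n; i++) {
            double x = ((double)i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
        #pragma acc update host(sum)   /* bring the device value back now */
        pi = step * sum;
    }
    return pi;
}
```

Either way, the host only reads “sum” after the device value has been transferred, which is the ordering the original code was missing.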

Hope this helps,
Mat