#pragma acc kernels loop Versus #pragma acc parallel loop

fchen14 · May 28, 2015, 7:22pm

Hi,

The OpenACC standard does not seem to allow a “loop” clause after the “kernels” directive; however, the PGI compiler still accepts:

#pragma acc kernels loop

In this case will the “kernels” directive act the same as the “parallel” directive?

i.e.

#pragma acc kernels loop

is actually equal to

#pragma acc parallel loop

Thanks a lot for your help!

Feng

MatColgrove · May 28, 2015, 9:02pm

Hi Feng,

The OpenACC standard does allow “kernels loop”. Here’s section 2.9 of the OpenACC 2.0 standard:

2.9 Combined Directives

Summary
The combined OpenACC parallel loop and kernels loop directives are shortcuts for specifying a loop directive nested immediately inside a parallel or kernels construct. The meaning is identical to explicitly specifying a parallel or kernels directive containing a loop directive. Any clause that is allowed on a parallel or loop directive is allowed on the parallel loop directive, and any clause allowed on a kernels or loop directive are allowed on a kernels loop directive.

In this case will the “kernels” directive act the same as the “parallel” directive?

With “parallel”, the user needs to define which loops to parallelize, while with “kernels” the compiler determines it.

However, when “kernels loop” is used, the user is specifying which loop to parallelize which is similar to “parallel loop”.
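As a minimal sketch of what section 2.9 describes, the two functions below should mean the same thing: the first uses the combined directive, the second spells out a kernels region with a loop directive nested immediately inside it. The function names, arguments, and the scaling operation are made up for illustration. Without an OpenACC compiler the pragmas are ignored and both functions simply run serially.

```c
/* Combined form: "kernels loop" placed directly on the for loop. */
void scale_combined(float *restrict y, const float *restrict x, float a, int n)
{
    #pragma acc kernels loop copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i];
    }
}

/* Explicit form: a kernels region with a loop directive nested inside. */
void scale_nested(float *restrict y, const float *restrict x, float a, int n)
{
    #pragma acc kernels copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc loop
        for (int i = 0; i < n; i++) {
            y[i] = a * x[i];
        }
    }
}
```

Swapping “kernels” for “parallel” in either function gives the corresponding “parallel loop” forms.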

Mat

fchen14 · May 29, 2015, 10:14pm

Hi, Mat:

Thanks for your reply, just want to make sure, for the below loop:

#pragma acc data copyin(step) copyout(sum)
#pragma acc parallel loop private(i,x) reduction(+:sum)
      for (i = 0; i < n; i++) {
        x = (i+0.5)*step;
        sum +=  4.0/(1.0+x*x);
      }
    pi = step * sum;

If I replace the “parallel” with “kernels” directive:

#pragma acc kernels loop private(i,x) reduction(+:sum)

the PGI compiler still compiles fine, in this case, will the compiler tend to do the “parallel” way or “kernels” way? i.e. will the compiler just ignore my private and reduction keywords?

PS: Complete code below:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
  long long int i, n=10000000000;
  double start_time, end_time;
  double x, pi;
  double sum = 0.0;
  double step = 1.0/(double) n;

#pragma acc data copyin(step) copyout(sum)
  {
    start_time = omp_get_wtime();
#pragma acc kernels loop private(i,x) reduction(+:sum)
      for (i = 0; i < n; i++) {
        x = (i+0.5)*step;
        sum +=  4.0/(1.0+x*x);
      }
    pi = step * sum;
    end_time = omp_get_wtime();
  }
  printf("pi = %17.15f\n",pi);
  printf("time to compute = %g seconds\n", (double)(end_time - start_time));
  return 0;
}

Thanks,

MatColgrove · June 1, 2015, 3:56pm

Hi Feng,

the PGI compiler still compiles fine, in this case, will the compiler tend to do the “parallel” way or “kernels” way? i.e. will the compiler just ignore my private and reduction keywords?

Apologies, but I’m not understanding the question. One main difference between “parallel loop” and “kernels loop” is that “parallel” is implicitly “independent”, since you’re telling the compiler which loops are parallelizable. With “kernels loop”, the compiler needs to analyze the loop for dependencies. However, both treat “private” and “reduction” clauses the same, given that they are “loop” clauses.
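To illustrate the difference, here is a small sketch where the compiler may not be able to prove the loop is dependency-free on its own, because the two plain pointers could alias. With “kernels loop”, adding the “independent” loop clause asserts what “parallel loop” already assumes. The function names and arguments are hypothetical, and the caller is assumed to pass non-overlapping arrays.

```c
/* kernels loop: "independent" tells the compiler the iterations
   do not depend on each other, so it can skip its own analysis. */
void add_kernels(float *a, const float *b, int n)
{
    #pragma acc kernels loop independent copy(a[0:n]) copyin(b[0:n])
    for (int i = 0; i < n; i++) {
        a[i] += b[i];
    }
}

/* parallel loop: the programmer has already asserted the loop is
   parallelizable, so no "independent" clause is needed. */
void add_parallel(float *a, const float *b, int n)
{
    #pragma acc parallel loop copy(a[0:n]) copyin(b[0:n])
    for (int i = 0; i < n; i++) {
        a[i] += b[i];
    }
}
```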

The PGI compiler is able to discover reductions so while “reduction” isn’t ignored, it’s not needed here.

Scalars are private by default, so your “private” clause is unnecessary. Loop index variables like “i” are treated specially, so they are ignored when listed in a “private” clause. However, by putting “x” in a “private” clause, you are creating an array of “x”, one element for each thread, that is accessed in global memory. This has the potential to slow your program down. If you let the compiler privatize “x”, it will most likely be declared locally within the kernel and can potentially be kept in a register.

The rule of thumb is to not put scalars in “private” clauses unless needed, for example when the scalar has global storage or is passed by reference to a routine.

Your code does have an error. The device “sum” is copied out at the end of the data region, which is after you use it on the host. Hence “pi” is computed from the host “sum”, which is still zero. Either update the host copy of “sum” before using it, or move the computation of “pi” after your data region.
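Putting these points together, one possible corrected version might look like the sketch below. The function name compute_pi and its parameter are illustrative, not from your code. It drops the “private” clause, declares “x” inside the loop body, and uses “sum” only after the data region has closed. It also uses copy(sum) instead of copyout(sum) so that the device copy starts from the host’s 0.0; that is a cautious choice in this sketch rather than something the compiler requires.

```c
/* Midpoint-rule estimate of pi with n intervals. */
double compute_pi(long long n)
{
    double sum = 0.0;
    double step = 1.0 / (double)n;

    /* copy(sum): the initial 0.0 goes to the device and the result
       comes back when the data region ends (assumption of this sketch). */
    #pragma acc data copyin(step) copy(sum)
    {
        /* No private clause: x is local to the loop body, so each
           iteration gets its own copy. */
        #pragma acc kernels loop reduction(+:sum)
        for (long long i = 0; i < n; i++) {
            double x = ((double)i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
    }

    /* The host "sum" is only valid here, after the data region. */
    return step * sum;
}
```

The alternative is to keep the computation of “pi” inside the data region and add an update directive for the host copy of “sum” before using it.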

Hope this helps,
Mat
