acc kernels / acc parallel question

Hello,
I tried to accelerate the following code fragment:

for (i = step; i < n - step; i += step) {
	k = f[i - 1] / d[i - stride];
	m = e[i] / d[i + stride];
	b[i] = b[i] - k*b[i - stride] - m*b[i + stride];
	d[i] = d[i] - k*e[i - stride] - m*f[i + stride - 1];
	e[i] = -m*e[i + stride];
	f[i - 1] = -k*f[i - stride - 1];
}

It is taken from a well-tested multicore application, and the loop iterations are independent. OpenACC needs help with the arrays (explicit data clauses) and can't tell on its own that the iterations are independent. That's okay.
Prefixing the loop with

#pragma acc kernels copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#pragma acc loop private(k,m)

and compiling with

pgcc -fast -O3 -acc -ta=tesla -Minfo=accel

(I use PGI 17.4 Community Edition) produces:

24, Generating copy(b[:n],d[:n],f[:n-1],e[:n-1])
26, Complex loop carried dependence of f->,d->,e->,b-> prevents parallelization
Loop carried dependence of b-> prevents parallelization
Loop carried backward dependence of b-> prevents vectorization
Loop carried dependence of d-> prevents parallelization
Loop carried backward dependence of d-> prevents vectorization
Loop carried dependence of f->,e-> prevents parallelization
Loop carried backward dependence of f->,e-> prevents vectorization
Accelerator scalar kernel generated
Accelerator kernel generated
Generating Tesla code
26, #pragma acc loop seq

and the code crashes at runtime with:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Prefixing the loop instead with

#pragma acc kernels copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#pragma acc loop seq private(k,m)

produces:

24, Generating copy(b[:n],d[:n],e[:n-1],f[:n-1])
26, Complex loop carried dependence of f->,d->,e->,b-> prevents parallelization
Loop carried dependence of b-> prevents parallelization
Loop carried backward dependence of b-> prevents vectorization
Loop carried dependence of d-> prevents parallelization
Loop carried backward dependence of d-> prevents vectorization
Loop carried dependence of f->,e-> prevents parallelization
Loop carried backward dependence of f->,e-> prevents vectorization
Accelerator kernel generated
Generating Tesla code
26, #pragma acc loop seq

and everything works fine.
Finally, prefixing the loop with

#pragma acc parallel copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#pragma acc loop seq private(k,m)

produces:

24, Generating copy(b[:n],d[:n],f[:n-1],e[:n-1])
Accelerator kernel generated
Generating Tesla code
26, #pragma acc loop seq
26, Complex loop carried dependence of f->,d->,e->,b-> prevents parallelization
Loop carried dependence of b-> prevents parallelization
Loop carried backward dependence of b-> prevents vectorization
Loop carried dependence of d-> prevents parallelization
Loop carried backward dependence of d-> prevents vectorization
Loop carried dependence of f->,e-> prevents parallelization
Loop carried backward dependence of f->,e-> prevents vectorization

and the code runs, but the results are incorrect.

Shouldn't all three variants behave roughly the same? Could someone help me understand what I am missing? In particular, why does parallel produce incorrect results?

Thanks a lot

Hi phi-psi,

The main difference between “kernels” and “parallel” is that with “kernels” you’re asking the compiler to discover the parallelism, so it must first prove that there are no dependencies before it can safely parallelize the loop. With “parallel”, you’re telling the compiler to parallelize the loop regardless of any dependencies it might discover. For “kernels”, you can add the “independent” clause to the loop directive to assert to the compiler that there are no dependencies. For “parallel”, “independent” is on by default.
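To make that concrete, here is a sketch of the three directive spellings side by side, using your data clauses; each one would be placed immediately before your for loop, and only one would be used at a time (they mirror the CASE1/CASE3/default variants in the test program below):

/* kernels: the compiler must prove the iterations are independent before it parallelizes */
#pragma acc kernels loop private(k,m) copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])

/* kernels + independent: you assert independence, so the compiler parallelizes the loop */
#pragma acc kernels loop independent private(k,m) copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])

/* parallel: independence is implied, so the loop is parallelized regardless of what the compiler sees */
#pragma acc parallel loop private(k,m) copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])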

I went ahead and tried to reproduce your errors with the code below. I was able to reproduce the kernels crash (CASE1). It’s not surprising that the compiler can’t automatically parallelize the loop given the potential dependencies it detects, but the sequential kernel it generates shouldn’t crash on the device. Hence, I added a problem report (TPR#24723) and sent it to engineering for further investigation.

However, I was not able to reproduce the wrong answers. Can you either modify my toy example to better reflect your code, post a reproducing example, or send the full code to PGI Customer Service (trs@pgroup.com)?

% cat test1.c

#include <stdlib.h>
#include <stdio.h>

int main ( ) {

int i, n, step, stride;
double k,m;
double *b, *d, *e, *f;


n = 1000;
step = 5;
stride = 2;

b = malloc(sizeof(double)*n);
d = malloc(sizeof(double)*n);
e = malloc(sizeof(double)*n);
f = malloc(sizeof(double)*n);

for (i=0;i<n;++i) {
  b[i] = 1.0+i;
  d[i] = 2.0+i;
  e[i] = 3.0+i;
  f[i] = 4.0+i;
}

#if defined(CASE1)
#pragma acc kernels loop copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#elif defined(CASE2)
#pragma acc parallel loop seq copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#elif defined(CASE3)
#pragma acc kernels loop independent copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#else
#pragma acc parallel loop copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#endif
for (i = step; i < n - step; i += step) {
   k = f[i - 1] / d[i - stride];
   m = e[i] / d[i + stride];
   b[i] = b[i] - k*b[i - stride] - m*b[i + stride];
   d[i] = d[i] - k*e[i - stride] - m*f[i + stride - 1];
   e[i] = -m*e[i + stride];
   f[i - 1] = -k*f[i - stride - 1];
 }

printf("B");
for (i=step;i<50;i+=step) {
  printf(":%f",b[i]);
}
printf("\nD");
for (i=step;i<50;i+=step) {
  printf(":%f",d[i]);
}
printf("\nE");
for (i=step;i<50;i+=step) {
  printf(":%f",e[i]);
}
printf("\nF");
for (i=step;i<50;i+=step) {
  printf(":%f",f[i]);
}
printf("\n");
free(b);
free(d);
free(e);
free(f);

}



% pgcc test1.c -O3 -V17.4 ; a.out
B:-7.511111:-12.771429:-17.852632:-22.891667:-27.914483:-32.929412:-37.939927:-42.947727:-47.953741
D:-11.488889:-16.228571:-21.147368:-26.108333:-31.085517:-36.070588:-41.060073:-46.052273:-51.046259
E:-8.888889:-13.928571:-18.947368:-23.958333:-28.965517:-33.970588:-38.974359:-43.977273:-48.979592
F:9.000000:14.000000:19.000000:24.000000:29.000000:34.000000:39.000000:44.000000:49.000000
% pgcc test1.c -acc -Minfo=accel -O3 -V17.4 -DCASE2 ; a.out
main:
     30, Generating copy(b[:n],d[:n],f[:n-1],e[:n-1])
         Accelerator kernel generated
         Generating Tesla code
         36, #pragma acc loop seq
     36, Complex loop carried dependence of f->,d->,e->,b-> prevents parallelization
         Loop carried dependence of b-> prevents parallelization
         Loop carried backward dependence of b-> prevents vectorization
         Loop carried dependence of d-> prevents parallelization
         Loop carried backward dependence of d-> prevents vectorization
         Loop carried dependence of e-> prevents parallelization
         Loop carried backward dependence of e-> prevents vectorization
         Loop carried dependence of f-> prevents parallelization
         Loop carried backward dependence of f-> prevents vectorization
B:-7.511111:-12.771429:-17.852632:-22.891667:-27.914483:-32.929412:-37.939927:-42.947727:-47.953741
D:-11.488889:-16.228571:-21.147368:-26.108333:-31.085517:-36.070588:-41.060073:-46.052273:-51.046259
E:-8.888889:-13.928571:-18.947368:-23.958333:-28.965517:-33.970588:-38.974359:-43.977273:-48.979592
F:9.000000:14.000000:19.000000:24.000000:29.000000:34.000000:39.000000:44.000000:49.000000
% pgcc test1.c -acc -Minfo=accel -O3 -V17.4 -DCASE3 ; a.out
main:
     32, Generating copy(b[:n],d[:n],f[:n-1],e[:n-1])
     36, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         36, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
B:-7.511111:-12.771429:-17.852632:-22.891667:-27.914483:-32.929412:-37.939927:-42.947727:-47.953741
D:-11.488889:-16.228571:-21.147368:-26.108333:-31.085517:-36.070588:-41.060073:-46.052273:-51.046259
E:-8.888889:-13.928571:-18.947368:-23.958333:-28.965517:-33.970588:-38.974359:-43.977273:-48.979592
F:9.000000:14.000000:19.000000:24.000000:29.000000:34.000000:39.000000:44.000000:49.000000
% pgcc test1.c -acc -Minfo=accel -O3 -V17.4 ; a.out
main:
     34, Generating copy(b[:n],d[:n],f[:n-1],e[:n-1])
         Accelerator kernel generated
         Generating Tesla code
         36, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
B:-7.511111:-12.771429:-17.852632:-22.891667:-27.914483:-32.929412:-37.939927:-42.947727:-47.953741
D:-11.488889:-16.228571:-21.147368:-26.108333:-31.085517:-36.070588:-41.060073:-46.052273:-51.046259
E:-8.888889:-13.928571:-18.947368:-23.958333:-28.965517:-33.970588:-38.974359:-43.977273:-48.979592
F:9.000000:14.000000:19.000000:24.000000:29.000000:34.000000:39.000000:44.000000:49.000000
-Mat

Many thanks for the fast reply, Mat. I sent a test project which reproduces the issues to trs@pgroup.com.