acc kernels / acc parallel question

Hello,
I tried to accelerate the following code fragment:

for (i = step; i < n - step; i += step) {
	k = f[i - 1] / d[i - stride];
	m = e[i] / d[i + stride];
	b[i] = b[i] - k*b[i - stride] - m*b[i + stride];
	d[i] = d[i] - k*e[i - stride] - m*f[i + stride - 1];
	e[i] = -m*e[i + stride];
	f[i - 1] = -k*f[i - stride - 1];
}

It is taken from a well-tested multicore application, and the loop iterations are independent. OpenACC needs help with the arrays (explicit data clauses) and can't tell on its own that the iterations are independent. That's okay.
Prefixing the loop with

#pragma acc kernels copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#pragma acc loop private(k,m)

and compiling with

pgcc -fast -O3 -acc -ta=tesla -Minfo=accel

(I use PGI 17.4 Community Edition) produces:

24, Generating copy(b[:n],d[:n],f[:n-1],e[:n-1])
26, Complex loop carried dependence of f->,d->,e->,b-> prevents parallelization
Loop carried dependence of b-> prevents parallelization
Loop carried backward dependence of b-> prevents vectorization
Loop carried dependence of d-> prevents parallelization
Loop carried backward dependence of d-> prevents vectorization
Loop carried dependence of f->,e-> prevents parallelization
Loop carried backward dependence of f->,e-> prevents vectorization
Accelerator scalar kernel generated
Accelerator kernel generated
Generating Tesla code
26, #pragma acc loop seq

and the code crashes at runtime with:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Prefixing the loop instead with

#pragma acc kernels copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#pragma acc loop seq private(k,m)

produces:

24, Generating copy(b[:n],d[:n],e[:n-1],f[:n-1])
26, Complex loop carried dependence of f->,d->,e->,b-> prevents parallelization
Loop carried dependence of b-> prevents parallelization
Loop carried backward dependence of b-> prevents vectorization
Loop carried dependence of d-> prevents parallelization
Loop carried backward dependence of d-> prevents vectorization
Loop carried dependence of f->,e-> prevents parallelization
Loop carried backward dependence of f->,e-> prevents vectorization
Accelerator kernel generated
Generating Tesla code
26, #pragma acc loop seq

and everything works fine.
Finally, prefixing the loop with

#pragma acc parallel copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#pragma acc loop seq private(k,m)

produces:

24, Generating copy(b[:n],d[:n],f[:n-1],e[:n-1])
Accelerator kernel generated
Generating Tesla code
26, #pragma acc loop seq
26, Complex loop carried dependence of f->,d->,e->,b-> prevents parallelization
Loop carried dependence of b-> prevents parallelization
Loop carried backward dependence of b-> prevents vectorization
Loop carried dependence of d-> prevents parallelization
Loop carried backward dependence of d-> prevents vectorization
Loop carried dependence of f->,e-> prevents parallelization
Loop carried backward dependence of f->,e-> prevents vectorization

and the code runs, but the results are incorrect.

Shouldn't all three variants behave roughly the same? Could someone help me understand what I am missing? In particular, why does parallel produce incorrect results?

Thanks a lot

Hi phi-psi,

The main difference between “kernels” and “parallel” is that with “kernels” you’re asking the compiler to discover the parallelism, so it must first prove that there are no dependencies before it can safely parallelize the loop. With “parallel”, you’re telling the compiler to parallelize the loop regardless of any dependencies it might discover. For “kernels”, you can add the “independent” clause to the loop directive to assert to the compiler that there are no dependencies. For “parallel”, “independent” is on by default.
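To make that concrete, here is a sketch of the three directive spellings side by side, using your data clauses; each one would be placed immediately before your for loop, and only one would be used at a time (they mirror the CASE1/CASE3/default variants in the test program below):

/* kernels: the compiler must prove the iterations are independent before it parallelizes */
#pragma acc kernels loop private(k,m) copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])

/* kernels + independent: you assert independence, so the compiler parallelizes the loop */
#pragma acc kernels loop independent private(k,m) copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])

/* parallel: independence is implied, so the loop is parallelized regardless of what the compiler sees */
#pragma acc parallel loop private(k,m) copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])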

I went ahead and tried to reproduce your errors with the code below. I was able to reproduce the kernels crash (CASE1). It’s not surprising that the compiler can’t automatically parallelize the loop given the potential dependencies it detects, but the sequential kernel it generates shouldn’t crash on the device. Hence, I added a problem report (TPR#24723) and sent it to engineering for further investigation.

However, I was not able to reproduce the wrong answers. Can you either modify my toy example to better reflect your code, post a reproducing example, or send the full code to PGI Customer Service (trs@pgroup.com)?

% cat test1.c

#include <stdlib.h>
#include <stdio.h>

int main ( ) {

int i, n, step, stride;
double k,m;
double *b, *d, *e, *f;


n = 1000;
step = 5;
stride = 2;

b = malloc(sizeof(double)*n);
d = malloc(sizeof(double)*n);
e = malloc(sizeof(double)*n);
f = malloc(sizeof(double)*n);

for (i=0;i<n;++i) {
  b[i] = 1.0+i;
  d[i] = 2.0+i;
  e[i] = 3.0+i;
  f[i] = 4.0+i;
}

#if defined(CASE1)
#pragma acc kernels loop copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#elif defined(CASE2)
#pragma acc parallel loop seq copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#elif defined(CASE3)
#pragma acc kernels loop independent copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#else
#pragma acc parallel loop copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#endif
for (i = step; i < n - step; i += step) {
   k = f[i - 1] / d[i - stride];
   m = e[i] / d[i + stride];
   b[i] = b[i] - k*b[i - stride] - m*b[i + stride];
   d[i] = d[i] - k*e[i - stride] - m*f[i + stride - 1];
   e[i] = -m*e[i + stride];
   f[i - 1] = -k*f[i - stride - 1];
 }

printf("B");
for (i=step;i<50;i+=step) {
  printf(":%f",b[i]);
}
printf("\nD");
for (i=step;i<50;i+=step) {
  printf(":%f",d[i]);
}
printf("\nE");
for (i=step;i<50;i+=step) {
  printf(":%f",e[i]);
}
printf("\nF");
for (i=step;i<50;i+=step) {
  printf(":%f",f[i]);
}
printf("\n");
free(b);
free(d);
free(e);
free(f);

}



% pgcc test1.c -O3 -V17.4 ; a.out
B:-7.511111:-12.771429:-17.852632:-22.891667:-27.914483:-32.929412:-37.939927:-42.947727:-47.953741
D:-11.488889:-16.228571:-21.147368:-26.108333:-31.085517:-36.070588:-41.060073:-46.052273:-51.046259
E:-8.888889:-13.928571:-18.947368:-23.958333:-28.965517:-33.970588:-38.974359:-43.977273:-48.979592
F:9.000000:14.000000:19.000000:24.000000:29.000000:34.000000:39.000000:44.000000:49.000000
% pgcc test1.c -acc -Minfo=accel -O3 -V17.4 -DCASE2 ; a.out
main:
     30, Generating copy(b[:n],d[:n],f[:n-1],e[:n-1])
         Accelerator kernel generated
         Generating Tesla code
         36, #pragma acc loop seq
     36, Complex loop carried dependence of f->,d->,e->,b-> prevents parallelization
         Loop carried dependence of b-> prevents parallelization
         Loop carried backward dependence of b-> prevents vectorization
         Loop carried dependence of d-> prevents parallelization
         Loop carried backward dependence of d-> prevents vectorization
         Loop carried dependence of e-> prevents parallelization
         Loop carried backward dependence of e-> prevents vectorization
         Loop carried dependence of f-> prevents parallelization
         Loop carried backward dependence of f-> prevents vectorization
B:-7.511111:-12.771429:-17.852632:-22.891667:-27.914483:-32.929412:-37.939927:-42.947727:-47.953741
D:-11.488889:-16.228571:-21.147368:-26.108333:-31.085517:-36.070588:-41.060073:-46.052273:-51.046259
E:-8.888889:-13.928571:-18.947368:-23.958333:-28.965517:-33.970588:-38.974359:-43.977273:-48.979592
F:9.000000:14.000000:19.000000:24.000000:29.000000:34.000000:39.000000:44.000000:49.000000
% pgcc test1.c -acc -Minfo=accel -O3 -V17.4 -DCASE3 ; a.out
main:
     32, Generating copy(b[:n],d[:n],f[:n-1],e[:n-1])
     36, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         36, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
B:-7.511111:-12.771429:-17.852632:-22.891667:-27.914483:-32.929412:-37.939927:-42.947727:-47.953741
D:-11.488889:-16.228571:-21.147368:-26.108333:-31.085517:-36.070588:-41.060073:-46.052273:-51.046259
E:-8.888889:-13.928571:-18.947368:-23.958333:-28.965517:-33.970588:-38.974359:-43.977273:-48.979592
F:9.000000:14.000000:19.000000:24.000000:29.000000:34.000000:39.000000:44.000000:49.000000
% pgcc test1.c -acc -Minfo=accel -O3 -V17.4 ; a.out
main:
     34, Generating copy(b[:n],d[:n],f[:n-1],e[:n-1])
         Accelerator kernel generated
         Generating Tesla code
         36, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
B:-7.511111:-12.771429:-17.852632:-22.891667:-27.914483:-32.929412:-37.939927:-42.947727:-47.953741
D:-11.488889:-16.228571:-21.147368:-26.108333:-31.085517:-36.070588:-41.060073:-46.052273:-51.046259
E:-8.888889:-13.928571:-18.947368:-23.958333:-28.965517:-33.970588:-38.974359:-43.977273:-48.979592
F:9.000000:14.000000:19.000000:24.000000:29.000000:34.000000:39.000000:44.000000:49.000000
-Mat

Many thanks for the fast reply, Mat. I sent a test project which reproduces the issues to trs@pgroup.com.