Hello,
I tried to accelerate the following code fragment:
for (i = step; i < n - step; i += step) {
    k = f[i - 1] / d[i - stride];
    m = e[i] / d[i + stride];
    b[i] = b[i] - k*b[i - stride] - m*b[i + stride];
    d[i] = d[i] - k*e[i - stride] - m*f[i + stride - 1];
    e[i] = -m*e[i + stride];
    f[i - 1] = -k*f[i - stride - 1];
}
It is taken from a well-tested multicore application, and the loop iterations are independent. OpenACC needs help with the array bounds and cannot prove on its own that the iterations are independent. That's okay.
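If I read the OpenACC spec correctly, the usual way to assert this independence to the compiler would be something like the following (just a sketch of the clause, not something I have verified on this code yet):

```c
#pragma acc kernels copy(d[0:n], e[0:n-1], f[0:n-1], b[0:n])
#pragma acc loop independent private(k, m)
```

But before going down that road, I would like to understand the behavior I am actually seeing below.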
Putting the following directives in front of the loop
#pragma acc kernels copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#pragma acc loop private(k,m)
and compiling with
pgcc -fast -O3 -acc -ta=tesla -Minfo=accel
(I use PGI 17.4 Community Edition) produces:
24, Generating copy(b[:n],d[:n],f[:n-1],e[:n-1])
26, Complex loop carried dependence of f->,d->,e->,b-> prevents parallelization
Loop carried dependence of b-> prevents parallelization
Loop carried backward dependence of b-> prevents vectorization
Loop carried dependence of d-> prevents parallelization
Loop carried backward dependence of d-> prevents vectorization
Loop carried dependence of f->,e-> prevents parallelization
Loop carried backward dependence of f->,e-> prevents vectorization
Accelerator scalar kernel generated
Accelerator kernel generated
Generating Tesla code
26, #pragma acc loop seq
and a code that crashes at run time:
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution
Putting the following directives in front of the loop
#pragma acc kernels copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#pragma acc loop seq private(k,m)
produces:
24, Generating copy(b[:n],d[:n],e[:n-1],f[:n-1])
26, Complex loop carried dependence of f->,d->,e->,b-> prevents parallelization
Loop carried dependence of b-> prevents parallelization
Loop carried backward dependence of b-> prevents vectorization
Loop carried dependence of d-> prevents parallelization
Loop carried backward dependence of d-> prevents vectorization
Loop carried dependence of f->,e-> prevents parallelization
Loop carried backward dependence of f->,e-> prevents vectorization
Accelerator kernel generated
Generating Tesla code
26, #pragma acc loop seq
and everything works fine.
Putting the following directives in front of the loop
#pragma acc parallel copy(d[0:n],e[0:n-1],f[0:n-1],b[0:n])
#pragma acc loop seq private(k,m)
produces:
24, Generating copy(b[:n],d[:n],f[:n-1],e[:n-1])
Accelerator kernel generated
Generating Tesla code
26, #pragma acc loop seq
26, Complex loop carried dependence of f->,d->,e->,b-> prevents parallelization
Loop carried dependence of b-> prevents parallelization
Loop carried backward dependence of b-> prevents vectorization
Loop carried dependence of d-> prevents parallelization
Loop carried backward dependence of d-> prevents vectorization
Loop carried dependence of f->,e-> prevents parallelization
Loop carried backward dependence of f->,e-> prevents vectorization
and a code that runs, but produces incorrect results.
Shouldn't all three variants behave roughly the same? Could someone help me understand what I am missing? In particular, why does parallel produce incorrect results?
Thanks a lot