I’m really having a hard time getting OpenACC to work. I stumbled on this tutorial set a while back:
I tried to solve an assignment problem numerically using OpenMP (works great) and OpenACC (does not work at all). It’s a simple electrostatics boundary value problem (BVP) solvable by the relaxation method (Jackson 2.13). For whatever reason, pgc++ will not replicate the behaviour shown in the tutorial. I even rewrote the exact tutorial code in a C++ environment, and I can’t see what the difference would be, other than the fact that pgcc is used in the tutorial (I prefer to live in the post-1982 era).
You can clone my code from here (the latest commit is fine):
https://bitbucket.org/stevenovakov/openacc-intro
“relax.cc” is my relaxation method implementation for Jackson 2.13, and “p4allrework.cc” is the rewrite of the NVIDIA dev blog code. If you run “make clean all” in the root, the single-threaded and OpenMP-accelerated executables (the latter works with all 12 threads on my CPU) turn out OK, but the OpenACC versions do not. I’ve tried all sorts of combinations of pragma statements, with reference to
http://www.openacc.org/sites/default/files/OpenACC.1.0_0.pdf
and nothing seems to work. “relax.cc” has problems with a mystery scalar variable:
main:
146, Generating copyin(yy[:],xx[:])
Generating copy(_T42193656,error)
Generating copy(phi[:])
149, Accelerator restriction: scalar variable live-out from loop: _T42193656
Accelerator scalar kernel generated
154, Accelerator restriction: scalar variable live-out from loop: _T42193656
std::abs(float):
38, include “cmath”
21, include “cmath”
88, Generating implicit acc routine seq
const T1 & std::max(const T1 &, const T1 &):
34, include “iostream”
39, include “ostream”
38, include “ios”
40, include “char_traits.h”
39, include “stl_algobase.h”
217, Generating implicit acc routine seq
38, include “cmath”
“p4allrework.cc” has some sort of strange access problem (I did use the restrict declaration for all of my shared containers, as suggested in various places):
main:
113, Generating copy(A[:],Anew[:])
115, Generating copy(error,_T36976592)
119, Complex loop carried dependence of Anew-> prevents parallelization
Complex loop carried dependence of prevents parallelization
Scalar last value needed after loop for error at line 142
Accelerator restriction: scalar variable live-out from loop: error,_T36976592
Parallelization would require privatization of array Anew[:]
Accelerator kernel generated
Generating Tesla code
121, Complex loop carried dependence of Anew-> prevents parallelization
Complex loop carried dependence of prevents parallelization
Scalar last value needed after loop for error at line 142
Accelerator restriction: scalar variable live-out from loop: error,_T36976592
133, Loop carried dependence of A-> prevents parallelization
Loop carried backward dependence of A-> prevents vectorization
135, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
135, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
std::abs(float):
28, include “cmath”
21, include “cmath”
88, Generating implicit acc routine seq
const T1 & std::max(const T1 &, const T1 &):
27, include “iostream”
39, include “ostream”
38, include “ios”
40, include “char_traits.h”
39, include “stl_algobase.h”
217, Generating implicit acc routine seq
I’ve bashed my head against this for a few hours and I’m not sure what to do, so here I am! Every “rework” I try ends up spitting out one of the above errors. Only the second loop in “p4allrework.cc” (a simple array copy) seems to accelerate correctly. It seems very strange to me that code nearly identical to the official developer blog’s does not work as is, when the post in question advertised a ~4x speedup over single-threaded performance.
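For concreteness, here is a minimal sketch of the kind of loop both files boil down to: one Jacobi relaxation sweep that also tracks the maximum change. It is not the exact code from the repository. The function name jacobi_sweep, the flat row-major layout, and the choice of clauses are all hypothetical. It spells out the max reduction on error explicitly and uses std::fmax on plain float values instead of std::max, on the unconfirmed guess that the _T… temporaries in the feedback above come from std::max taking its arguments by reference.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One Jacobi relaxation sweep over the interior of an n x m row-major grid.
// Reads from a_vec, writes the interior of anew_vec, and returns the largest
// absolute change seen. Hypothetical helper, not taken from the repository.
float jacobi_sweep(const std::vector<float>& a_vec,
                   std::vector<float>& anew_vec, int n, int m)
{
    assert(a_vec.size() == static_cast<std::size_t>(n) * m);
    assert(anew_vec.size() == a_vec.size());

    // Raw pointers so the data clauses can describe the arrays explicitly.
    const float* a = a_vec.data();
    float* anew = anew_vec.data();
    float error = 0.0f;

    // Assumption: an explicit max reduction on the scalar, so that it is
    // not treated as an ordinary live-out variable. Without an OpenACC
    // compiler the pragma is ignored and the loop runs serially.
    #pragma acc parallel loop collapse(2) reduction(max:error) \
        copyin(a[0:n*m]) copy(anew[0:n*m])
    for (int j = 1; j < n - 1; ++j) {
        for (int i = 1; i < m - 1; ++i) {
            // Average of the four nearest neighbours.
            float v = 0.25f * (a[j * m + i + 1] + a[j * m + i - 1] +
                               a[(j - 1) * m + i] + a[(j + 1) * m + i]);
            anew[j * m + i] = v;
            // By-value float max; no reference temporaries involved.
            error = std::fmax(error, std::fabs(v - a[j * m + i]));
        }
    }
    return error;
}
```

Calling this in a loop, swapping the two buffers after each sweep until the returned value drops below a tolerance, gives the full relaxation. Whether pgc++ actually parallelizes it this way is something I have not verified.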
Any help is greatly appreciated. Thank you.