OpenACC - Basic Relaxation Method Not Being Accelerated

I’m really having a hard time getting OpenACC to work. I stumbled on this tutorial set a while back:

http://devblogs.nvidia.com/parallelforall/openacc-example-part-1/

I tried to solve an assignment problem numerically using OpenMP (works great) and OpenACC (doesn't work at all). It's a simple electrostatics BVP solvable via the relaxation method (Jackson 2.13). pgc++ will not, for whatever reason, replicate the behaviour in the tutorial. I even rewrote the exact code from the tutorial in a C++ environment, and I can't see what the difference would be other than the fact that pgcc is used in the tutorial (I prefer to live in the post-1982 era).
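For reference, the loop I'm trying to accelerate looks roughly like this — a stripped-down sketch with made-up names, not the literal relax.cc code, which uses coordinate arrays and different bookkeeping:

```cpp
#include <algorithm>
#include <cmath>

// Sketch of one Jacobi relaxation sweep over an n x n grid.
// Each interior point becomes the average of its four neighbours;
// the return value is the largest pointwise change, which I use
// as the convergence criterion.
float jacobi_sweep(const float* phi, float* phi_new, int n)
{
  float error = 0.0f;
  #pragma acc parallel loop reduction(max:error) \
      copyin(phi[0:n*n]) copy(phi_new[0:n*n])
  for (int j = 1; j < n - 1; ++j) {
    #pragma acc loop reduction(max:error)
    for (int i = 1; i < n - 1; ++i) {
      phi_new[j*n + i] = 0.25f * (phi[j*n + i - 1] + phi[j*n + i + 1]
                                + phi[(j-1)*n + i] + phi[(j+1)*n + i]);
      // This std::max call is where the mystery scalar shows up below.
      error = std::max(error, std::fabs(phi_new[j*n + i] - phi[j*n + i]));
    }
  }
  return error;
}
```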

You can clone my code from here (the latest commit is fine):

https://bitbucket.org/stevenovakov/openacc-intro

“relax.cc” is my relaxation method implementation for Jackson 2.13, and “p4allrework.cc” is the rewrite of the NVIDIA dev blog code. If you just run “make clean all” in the root, the single-thread and OpenMP-accelerated executables (the latter works with all 12 threads on my CPU) turn out OK, but the OpenACC versions do not. I’ve tried all sorts of combinations of various pragma statements, using this reference:

http://www.openacc.org/sites/default/files/OpenACC.1.0_0.pdf

and nothing seems to work. “relax.cc” has problems with a mystery scalar variable:

main:
146, Generating copyin(yy[:],xx[:])
Generating copy(_T42193656,error)
Generating copy(phi[:])
149, Accelerator restriction: scalar variable live-out from loop: _T42193656
Accelerator scalar kernel generated
154, Accelerator restriction: scalar variable live-out from loop: _T42193656
std::abs(float):
38, include “cmath”
21, include “cmath”
88, Generating implicit acc routine seq
const T1 & std::max(const T1 &, const T1 &):
34, include “iostream”
39, include “ostream”
38, include “ios”
40, include “char_traits.h”
39, include “stl_algobase.h”
217, Generating implicit acc routine seq
38, include “cmath”

“p4allrework.cc” has some sort of strange access problem (I did use the restrict qualifier for all of my shared containers, as suggested in various places):

main:
113, Generating copy(A[:],Anew[:])
115, Generating copy(error,_T36976592)
119, Complex loop carried dependence of Anew-> prevents parallelization
Complex loop carried dependence of prevents parallelization
Scalar last value needed after loop for error at line 142
Accelerator restriction: scalar variable live-out from loop: error,_T36976592
Parallelization would require privatization of array Anew[:]
Accelerator kernel generated
Generating Tesla code
121, Complex loop carried dependence of Anew-> prevents parallelization
Complex loop carried dependence of prevents parallelization
Scalar last value needed after loop for error at line 142
Accelerator restriction: scalar variable live-out from loop: error,_T36976592
133, Loop carried dependence of A-> prevents parallelization
Loop carried backward dependence of A-> prevents vectorization
135, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
135, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
std::abs(float):
28, include “cmath”
21, include “cmath”
88, Generating implicit acc routine seq
const T1 & std::max(const T1 &, const T1 &):
27, include “iostream”
39, include “ostream”
38, include “ios”
40, include “char_traits.h”
39, include “stl_algobase.h”
217, Generating implicit acc routine seq

I’ve bashed my head against this for a few hours and I’m not sure what to do, so here I am! Every “rework” I try ends up spitting out one of the above errors. Only the second loop in “p4allrework.cc” (a simple array copy) seems to accelerate correctly. It seems very strange to me that code nearly identical to the official developer blog’s does not work as is, when that post advertised a ~4x speedup over single-threaded performance.

Any help is greatly appreciated. Thank you.

Hi Steve,

At least for relax.cc, the issue seems to be “std::max”. There’s probably a global variable buried in there.

The simple workaround is to add your own comparison:

          diff = std::abs(phi[iwrite + ibase] - phi[iread + ibase]);
          if (error < diff) {
             error = diff;
          }



% pgc++ relax.cc --c++11 -acc -Minfo=accel -DOACC -ta=tesla -V15.10 -fast
main:
    147, Loop is parallelizable
    152, Accelerator restriction: size of the GPU copy of phi,xx,yy is unknown
         Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        147, #pragma acc loop gang /* blockIdx.y */
        152, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
        181, Max reduction generated for error
    152, Generating copyin(yy[:],xx[:])
         Generating copy(phi[:])
std::abs(float):
     38, include "cmath"
          21, include "cmath"
               88, Generating implicit acc routine seq
% a.out
Initialization Complete. Starting iteration...

0, 0.0500000007
Iteration Complete. Total Iterations: 9, Time Elapsed: 0.460000 (s)
Writing to "output csvs" ...


Accelerator Kernel Timing data
./relax.cc
  main  NVIDIA  devicenum=0
    time(us): 2,257
    144: compute region reached 10 times
        152: data copyin transfers: 10
             device time(us): total=65 max=9 min=6 avg=6
        152: kernel launched 10 times
            grid: [2x198]  block: [128]
             device time(us): total=1,614 max=167 min=153 avg=161
            elapsed time(us): total=2,227 max=568 min=175 avg=222
        152: reduction kernel launched 10 times
            grid: [1]  block: [256]
             device time(us): total=71 max=8 min=7 avg=7
            elapsed time(us): total=282 max=32 min=27 avg=28
        152: data copyout transfers: 10
             device time(us): total=177 max=28 min=16 avg=17
    152: data region reached 10 times
        152: data copyin transfers: 30
             device time(us): total=241 max=26 min=6 avg=8
    186: data region reached 10 times
        186: data copyout transfers: 10
             device time(us): total=89 max=17 min=8 avg=8

Hope this helps,
Mat

Hello Mat,

Thanks for the response, I’ll give that a shot in a few minutes here. I’m just curious why it would be treated as a global variable when I specified “reduction(max:error)” at the first “#pragma acc loop independent” call? Doesn’t that (like in OpenMP) tell the compiler to keep local copies of everything per (OpenMP: thread, OpenACC: work group???) and then reduce upon completion?
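To spell out what I mean: in OpenMP the equivalent reduction gives each thread a private copy of the variable and merges the partial results at the end, e.g. (a trivial sketch, not from my actual code):

```cpp
#include <cmath>
#include <cstddef>

// Max reduction as I understand it from OpenMP: each thread keeps
// a private 'error', and the per-thread maxima are combined when
// the loop ends. I assumed reduction(max:error) in OpenACC behaved
// analogously per gang/worker.
float max_abs(const float* v, std::size_t n)
{
  float error = 0.0f;
  #pragma omp parallel for reduction(max:error)
  for (std::size_t i = 0; i < n; ++i) {
    float d = std::fabs(v[i]);
    if (d > error) error = d;
  }
  return error;
}
```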

Also, about the port of the Parallel Forall blog series code: the original author claimed a 4x speedup, so something worked well…

Also, your compiler flags are interesting: -V15.10? Is that mandatory? Does it not compile with the latest version automatically?

Thank you!

Also, thank you, you saved my life — it works like a beauty:

Accelerator Kernel Timing data
/media/CANOPY/Projects/Assorted/OpenACC Intro/relax.cc
  main  NVIDIA  devicenum=0
    time(us): 856,089
    146: compute region reached 9357 times
        154: kernel launched 9357 times
            grid: [2x254]  block: [128]
             device time(us): total=823,035 max=327 min=81 avg=87
            elapsed time(us): total=945,207 max=964 min=93 avg=101
        154: reduction kernel launched 9357 times
            grid: [1]  block: [256]
             device time(us): total=33,054 max=5 min=3 avg=3
            elapsed time(us): total=143,680 max=225 min=14 avg=15
    146: data region reached 9357 times
    191: data region reached 9357 times

Just to give you an idea of the speedup I’m experiencing now (for a 1024 x 1024 grid):

single thread: 1064 sec
omp 12 thread (with/without reduction(max:error)): 266.677/474.060 sec
oacc: 38.55 sec

OMP is a ~4x speedup.
OACC is a ~$$$ 28x $$$ speedup, and all it takes is literally 4 lines of code and a compiler install. Love it.

Hi Steve,

I’d need to look into how “std::max” is implemented, but my assumption is that there’s an errno or other type of global variable implied when using this routine. Since it’s a function call, the entire function needs to be brought over to the device, including any underlying data structures it may use. The issue has nothing to do with the reduction clause, the reduction variable, or how parallel reductions are implemented.
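For what it’s worth, another option beyond the inline comparison (a sketch, assuming your compiler version supports “acc routine” on user functions) is to define your own max and mark it device-callable, so there’s no hidden state to drag over:

```cpp
// A self-contained max that can be called from device code:
// "acc routine seq" asks the compiler to generate a sequential
// device version of the function. Unlike std::max, there is no
// hidden global state for the compiler to worry about.
#pragma acc routine seq
inline float dev_max(float a, float b)
{
  return a < b ? b : a;
}
```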

-V15.10? Is that mandatory? Does it not compile the latest version automatically?

No, the version flag is not necessary; the code should work fine with the version you have installed. I typically include the version flag in posts only to make it clear which version I was testing with.

OACC is ~$$$ 28x $$$ speedup, and all it takes is literally 4 lines of code and compiler install.

Excellent result. This is a normal speed-up for a single compute-intensive loop such as this. I’ve even seen as high as 70x with one directive (an MRI code). Once you start porting larger applications, which include more data movement and less compute-intensive parts of your code, the speed-up of a single Tesla K80 over a 32-core Haswell system is more likely to be in the 2-4x range. Still quite impressive though.

  • Mat

Update, I’ve finished my tutorial for this stuff, the article is up on my site:

http://www.stevenovakov.com/blog/2015-10-30/#s6.3

After passing flags for compiler optimization, the speedup from single-threaded is more like:

openmp/st : ~3-4x
openacc/st : ~11-12x

Which is not quite 28x, but is still significant. It offers a ~3-4x improvement over my 12-thread Haswell-E, so that’s about what you were hinting at in your last reply.