Can I just apply OpenACC directives like this to individual loops?
Yes. That's one of the benefits of OpenACC: it can be added incrementally, loop by loop. The only issue is that you may see sub-optimal performance due to excessive data movement until you add higher-level data management.
2669, !$acc loop gang ! blockidx%x
2673, !$acc loop vector(128) ! threadidx%x
2674, Sum reduction generated for d_diag
2676, Sum reduction generated for d_off_diag
2670, Complex loop carried dependence of nedgesoncell,config_len_disp,kdiff,edgesoncell,u,defc_a,v,defc_b prevents parallelization
From this output we see that the compiler has parallelized the “iCell” and “iEdge” loops but runs the “k” loop sequentially due to the reported loop-carried dependencies.
What I’d like you to try is changing your directive to:
!$acc loop gang vector collapse(2) independent
do iCell = 1, nCells
  do k = 1, nVertLevels
    d_diag = 0.
    d_off_diag = 0.
    !$acc loop seq
    do iEdge = 1, nEdgesOnCell(iCell)
The “seq” clause is probably unnecessary, but I added it for illustration.
By default, the compiler attempts to parallelize all loops and will vectorize the innermost loop. However, given your algorithm and the fact that the value of “nCells” is large, I think it’s better to parallelize only the outer two loops and run the inner loop sequentially. Reductions can be expensive, particularly for small inner loops, and it’s sometimes better to run these sequentially.
I added “independent” to assert to the compiler that the “k” loop can be parallelized. I do find it odd that the compiler complained about dependencies, unless you’re using pointers in which case the compiler has to assume that kDiff could point to the same memory as another pointer.
I collapse the “iCell” and “k” loops because I want “vector” to apply to the “k” loop, but “k” alone is shorter than a good vector length (41 < 128). Collapsing the two loops together is therefore better than giving “iCell” a “gang” schedule and “k” solely a “vector” schedule. “vector” loops should correspond to the stride-1 dimension to allow for coalesced memory access.
The results are wrong/different from the ones without OpenACC.
The wrong answers could be due to the inner loop reductions being performed in parallel since parallel reductions can lead to divergent answers. But it could be bad code generation by the compiler as well. Without a reproducer, it’s difficult for me to tell. Though hopefully the adjusted schedule above will solve the issue. If not, I may ask for a reproducing example and/or additional information.
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
Typically this is caused by an out-of-bounds access, a host address being used on the device, too large a dynamic allocation from the device exhausting the heap, etc.
My best guess in this case is that it’s a compiler code-generation issue. However, if it persists after you apply the explicit schedule listed above, consider checking for out-of-bounds errors in your host code (I like to use Valgrind to find these).