Thanks brentl,
Yes, the problem is most definitely a race condition, and I assume it has to do with which variables the compiler treats as global, shared, gang-private, and loop-private in the body of the outer loop, and how that affects the inner k loop. I’m trying to get to grips with it, but honestly I’m no longer sure what lives where. I have tried various ways to force minJacNew, integral and grad_G to be global arrays and to use only atomic operations on them, without success so far.
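To make the pattern I’m aiming for concrete, here is a minimal sketch — the names minJacNew/integral and the flat per-element layout are placeholders of my own, not the real Nektar++ code: reduce the min and the sum into gang-local scalars over the k loop and write each result once per element, instead of letting every vector lane read-modify-write a shared slot.

```c
#include <float.h>

/* Minimal sketch -- names and layout are placeholders, not the real
 * Nektar++ code.  Each element el owns one slot of minJacNew/integral,
 * so the min and sum can be reduced into local scalars and written
 * exactly once, avoiding the shared read-modify-write entirely. */
void accumulate(const double *jac, int nElmt, int nq,
                double *minJacNew, double *integral)
{
    #pragma acc parallel loop gang \
        copyin(jac[0:nElmt*nq]) copyout(minJacNew[0:nElmt], integral[0:nElmt])
    for (int el = 0; el < nElmt; ++el)
    {
        double mn  = DBL_MAX;
        double sum = 0.0;
        /* reductions keep each vector lane in a private partial */
        #pragma acc loop vector reduction(min:mn) reduction(+:sum)
        for (int k = 0; k < nq; ++k)
        {
            double j = jac[el * nq + k];
            if (j < mn)
                mn = j;
            sum += j;
        }
        minJacNew[el] = mn;  /* single write per element: no race */
        integral[el]  = sum;
    }
}
```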
The full code is rather large (around 300 MB) and takes a good 10+ minutes to compile, so I’ll try to avoid asking you to build it! I could, however, send you the full source for this problematic section, and I’m attaching the Minfo output here.
Minfo doesn’t seem to give any useful information about the loop inside the #pragma acc routine, however, so it’s hard to know what the compiler is doing there. The Minfo output is identical whether I use private(minJacNew[:1]…) or create temporary variables inside the el loop and add reduction(min:minJac_temp) reduction(+:integral_temp), etc. The results, however, are not identical, and both versions still suffer from the race condition.
I have tried putting the k loop inside a vector loop and inside a worker loop, with private and no atomics, without private and with/without atomics, making the routines worker or vector, and putting the el loop at worker level and the k loop at vector level with private on the el loop and atomics in the k loop — and I get a different problem each time. Anyway, here is the Minfo output for the configuration where the GetFunctional routine is a vector routine, the k loop is a vector loop, and atomic updates are used. Hope it helps!
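For reference, the configuration the Minfo below corresponds to looks roughly like this — a heavily stripped-down sketch with placeholder names, not the actual GetFunctional signature:

```c
/* Stripped-down sketch of the configuration the Minfo output was
 * generated for: a vector routine containing the vector k loop, with
 * atomic updates on the shared accumulator.  Names are placeholders,
 * not the real GetFunctional interface. */
#pragma acc routine vector
static void get_functional(const double *w, int nq, double *integral)
{
    #pragma acc loop vector
    for (int k = 0; k < nq; ++k)
    {
        /* every vector lane updates the same slot, so the
         * read-modify-write must be atomic */
        #pragma acc atomic update
        *integral += w[k];
    }
}

void optimise(const double *w, int nElmt, int nq, double *integral)
{
    /* el loop at gang level; each element calls the vector routine */
    #pragma acc parallel loop gang \
        copyin(w[0:nElmt*nq]) copy(integral[0:nElmt])
    for (int el = 0; el < nElmt; ++el)
        get_functional(&w[el * nq], nq, &integral[el]);
}
```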
Nektar::Utilities::ProcessVarOpti::Optimise3D3D(Nektar::Utilities::DerivUtilGPU &, Nektar::Utilities::NodesGPU &, Nektar::Utilities::ElUtilGPU &, Nektar::Utilities::Residual &, int, Nektar::Utilities::optimiser):
45, include "Optimise.hxx"
408, Generating present(elUtil[:],derivUtil[:],nodes[:],res[:])
Accelerator kernel generated
Generating Tesla code
423, #pragma acc loop gang /* blockIdx.x */
434, #pragma acc loop seq
465, #pragma acc loop seq
508, #pragma acc loop seq
408, CUDA shared memory used for h_Xc,minJacNew,sk,..inline,G,grad_G,eval,h_Xn,integral
Generating implicit copyin(this[:])
508, Loop carried reuse of h_Xn prevents parallelization
Loop carried scalar dependence for alpha at line 219,510,511,541,512
Nektar::Utilities::ProcessVarOpti::c1():
40, include "NodeOpti.h"
42, include "ProcessVarOpti.h"
219, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
45, include "Optimise.hxx"
Nektar::Utilities::ProcessVarOpti::gradTol():
40, include "NodeOpti.h"
42, include "ProcessVarOpti.h"
220, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
45, include "Optimise.hxx"
Nektar::Utilities::ProcessVarOpti::alphaTol():
40, include "NodeOpti.h"
42, include "ProcessVarOpti.h"
221, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
45, include "Optimise.hxx"
void Nektar::Utilities::ProcessVarOpti::CalcEValues<(int)3>(const double (&)[(T1*T1)], double (&)[T1]):
45, include "Optimise.hxx"
39, include "Hessian.hxx"
241, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
void Nektar::Utilities::ProcessVarOpti::GetNodeCoordGPU<(int)2>(double (&)[T1], const Nektar::Utilities::NodesGPU &, int, int):
45, include "Optimise.hxx"
78, Generating acc routine seq
Generating Tesla code
void Nektar::Utilities::ProcessVarOpti::GetNodeCoordGPU<(int)3>(double (&)[T1], const Nektar::Utilities::NodesGPU &, int, int):
45, include "Optimise.hxx"
88, Generating acc routine seq
Generating Tesla code
void Nektar::Utilities::ProcessVarOpti::SetNodeCoordGPU<(int)2>(const double (&)[T1], const Nektar::Utilities::NodesGPU &, int, int, int):
45, include "Optimise.hxx"
105, Generating acc routine seq
Generating Tesla code
void Nektar::Utilities::ProcessVarOpti::SetNodeCoordGPU<(int)3>(const double (&)[T1], const Nektar::Utilities::NodesGPU &, int, int, int):
45, include "Optimise.hxx"
118, Generating acc routine seq
Generating Tesla code
void Nektar::Utilities::ProcessVarOpti::GetFunctional<(int)3, (bool)1, (Nektar::Utilities::optimiser)3>(const Nektar::Utilities::DerivUtilGPU &, const Nektar::Utilities::NodesGPU &, const Nektar::Utilities::ElUtilGPU &, double *, double *, double *, int, int, int, double):
45, include "Optimise.hxx"
136, Generating Tesla code
149, #pragma acc loop seq
157, #pragma acc loop vector /* threadIdx.x */
173, #pragma acc loop seq
202, #pragma acc loop seq
204, #pragma acc loop seq
211, #pragma acc loop seq
213, #pragma acc loop seq
216, #pragma acc loop seq
281, #pragma acc loop seq
288, #pragma acc loop seq
291, #pragma acc loop seq
298, #pragma acc loop seq
301, #pragma acc loop seq
310, #pragma acc loop seq
321, #pragma acc loop seq
323, #pragma acc loop seq
157, Loop is parallelizable
void Nektar::Utilities::ProcessVarOpti::GetFunctional<(int)3, (bool)1, (Nektar::Utilities::optimiser)0>(const Nektar::Utilities::DerivUtilGPU &, const Nektar::Utilities::NodesGPU &, const Nektar::Utilities::ElUtilGPU &, double *, double *, double *, int, int, int, double):
45, include "Optimise.hxx"
136, Generating Tesla code
149, #pragma acc loop seq
157, #pragma acc loop vector /* threadIdx.x */
173, #pragma acc loop seq
202, #pragma acc loop seq
204, #pragma acc loop seq
211, #pragma acc loop seq
213, #pragma acc loop seq
216, #pragma acc loop seq
281, #pragma acc loop seq
288, #pragma acc loop seq
291, #pragma acc loop seq
298, #pragma acc loop seq
301, #pragma acc loop seq
157, Loop is parallelizable
void Nektar::Utilities::ProcessVarOpti::GetFunctional<(int)3, (bool)0, (Nektar::Utilities::optimiser)3>(const Nektar::Utilities::DerivUtilGPU &, const Nektar::Utilities::NodesGPU &, const Nektar::Utilities::ElUtilGPU &, double *, double *, double *, int, int, int, double):
45, include "Optimise.hxx"
136, Generating Tesla code
149, #pragma acc loop seq
157, #pragma acc loop vector /* threadIdx.x */
173, #pragma acc loop seq
202, #pragma acc loop seq
204, #pragma acc loop seq
211, #pragma acc loop seq
213, #pragma acc loop seq
216, #pragma acc loop seq
157, Loop is parallelizable
void Nektar::Utilities::ProcessVarOpti::GetFunctional<(int)3, (bool)0, (Nektar::Utilities::optimiser)0>(const Nektar::Utilities::DerivUtilGPU &, const Nektar::Utilities::NodesGPU &, const Nektar::Utilities::ElUtilGPU &, double *, double *, double *, int, int, int, double):
45, include "Optimise.hxx"
136, Generating Tesla code
149, #pragma acc loop seq
157, #pragma acc loop vector /* threadIdx.x */
173, #pragma acc loop seq
202, #pragma acc loop seq
204, #pragma acc loop seq
211, #pragma acc loop seq
213, #pragma acc loop seq
216, #pragma acc loop seq
157, Loop is parallelizable
The offender at line 508,

" 508, Loop carried reuse of h_Xn prevents parallelization
       Loop carried scalar dependence for alpha at line 219,510,511,541,512 "

is the while loop, which is meant to run serially anyway.
The line that looks suspicious to me is the one just above it,

" 408, CUDA shared memory used for h_Xc,minJacNew,sk,…inline,G,grad_G,eval,h_Xn,integral
       Generating implicit copyin(this[:]) "

but I’m not sure whether it is actually an issue.
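My reading of that line — and this is just my interpretation — is that minJacNew, grad_G, integral etc. get one copy per gang in CUDA shared memory, shared by all vector lanes of that gang, so any unguarded read-modify-write from the vector k loop races. A tiny sketch of the guarded form (placeholder names and layout again, not the real code):

```c
/* Sketch of the suspected hazard: grad_G exists once per gang (CUDA
 * shared memory), and every vector lane of the k loop accumulates into
 * it, so each update must be atomic.  Names/layout are placeholders. */
void accumulate_grad(const double *dW, int nq, int ndim, double *grad_G)
{
    #pragma acc parallel num_gangs(1) \
        copyin(dW[0:nq*ndim]) copy(grad_G[0:ndim])
    {
        #pragma acc loop vector
        for (int k = 0; k < nq; ++k)
            for (int d = 0; d < ndim; ++d)
            {
                /* an unguarded grad_G[d] += ... would race across lanes */
                #pragma acc atomic update
                grad_G[d] += dW[k * ndim + d];
            }
    }
}
```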
If it is possible to send you the code, or portions of it, that would be great too; just let me know where to send it.
Really appreciate any help I can get,
Mashy.