Data locality problems with nested parallel loops

Hi,

I’m dealing with quite a complicated code which has previously been parallelised for CPU only using Boost and pthreads, and for GPU + CPU using Kokkos. We are now trying to consolidate the code with other, unrelated parts that already use OpenACC but are not as deeply nested or as complicated as this one.

The part of the code I’m having problems with has between 3 and 5 nested loops. The outer loop is in parallel, and the main inner loop (in an inline routine), which contains accumulations and a min (done with atomics in the Kokkos-accelerated version), should also be in parallel.

The code works in serial, with only the outer loop in parallel, and (with some extra #pragma acc update device/self in place) with only the inner loop in parallel. However, when I try to use nested parallel loops, I cannot get the accumulated or minimum values to come out as expected.

As the code is quite long, I have tried to provide an outline of the main structure below. The issues I’m facing are with integral, minJacNew and grad_G. I’ve declared minJacNew and integral as double[1] so as to make them shared rather than private, since they are declared inside a parallel region. I have also attempted to use them as scalar doubles outside the initial loop and copy them to the GPU (#pragma acc declare create etc.), but faced the same problems. I’ve also tried using temporary values for minJacNew and integral inside GetFunctional, making them private and using reductions (which worked when parallelising only the inner loop), but I cannot do that with grad_G as it is an array.

#pragma acc parallel loop gang present(derivUtil, nodes, elUtil, res)  //NEEDS TO BE IN PARALLEL
for(int node=0; node<coloursetSize; node++)
{ 
     double minJacNew[1], integral[1], grad_G[9];
     <initialise and obtain incidences, constants and variables for GetFunctional>

     GetFunctional<3,true>(derivUtil, nodes, elUtil, integral, 
                                                        minJacNew, grad_G, nElmt, node, cs, ep);

     <do stuff with minJacNew[1], integral[1], grad_G[9] to update nodes>

    while (alpha > alphaTol)
    {
        GetFunctional<3,false>(derivUtil, nodes, elUtil, integral,  
                                                             minJacNew, grad_G, nElmt, node, cs, ep);    
    
        if (condition dependent on new value of integral)
        {
              break;
        }
        alpha /= 2;
    }

    <do stuff with minJacNew[1], integral[1] to update nodes>
}

#pragma acc routine //worker 
template<const int DIM, const bool gradient> inline void
ProcessVarOpti::GetFunctional(const DerivUtilGPU &derivUtil, const NodesGPU &nodes,
const ElUtilGPU &elUtil, double integral[1], double minJacNew[1],
double grad_G[9], const int nElmt, int node, int cs, double ep)
{
     minJacNew[0] = DBL_MAX;
     integral[0] = 0.0;

     <obtain incidences and set constants>
     #pragma acc loop seq
     for(int el=0; el<nElmt; ++el)
     {
         <obtain incidences>
         #pragma acc loop seq //#pragma acc loop independent  //NEEDS TO BE IN PARALLEL
         for(int k=0; k<ptsHigh; ++k)
         {
              <do lots  of stuff to calculate jacDet and inc>
              //#pragma acc atomic
              minJacNew[0] = fmin(minJacNew[0], jacDet);
              //#pragma acc atomic
              integral[0] += inc;

              if(gradient)
              {                  
                  #pragma acc loop seq
                  for(int m=0; m<9; ++m)
                  {
                         <do stuff to calculate inc2>
                         //#pragma acc atomic
                         grad_G[m] += inc2;
                   }
              }
         }
     }
}

We might have to see the entire code, or at least the Minfo output you get when you compile. There is likely a race condition in the code when it is run in parallel; for instance,

minJacNew[0] = DBL_MAX;
integral[0] = 0.0;

could be executed by a parallel thread scheduled much later than the threads doing the atomic updates on those locations.

I would put the gang parallel loop at the outermost level, as you have done. Then try putting a vector loop on the k loop. Pay attention to which data is global (which is where I think you want the atomic targets to be), which is gang-private, and which is thread-private.
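To make that concrete, here is roughly how I would try GetFunctional first, reusing the names from your outline. Take it as a sketch only: it assumes your compiler accepts loop reductions inside an acc routine, and the min goes through a reduction rather than an atomic because acc atomic update does not accept fmin.

#pragma acc routine vector
template<const int DIM, const bool gradient> inline void
ProcessVarOpti::GetFunctional(const DerivUtilGPU &derivUtil, const NodesGPU &nodes,
                              const ElUtilGPU &elUtil, double integral[1],
                              double minJacNew[1], double grad_G[9],
                              const int nElmt, int node, int cs, double ep)
{
     // local scalars hold the reduction results and are written back once at
     // the end, so no vector lane ever re-initialises the arrays mid-flight
     double minJac = DBL_MAX;
     double intg   = 0.0;

     <obtain incidences and set constants>

     #pragma acc loop seq
     for(int el=0; el<nElmt; ++el)
     {
         <obtain incidences>

         #pragma acc loop vector reduction(min:minJac) reduction(+:intg)
         for(int k=0; k<ptsHigh; ++k)
         {
              <do lots of stuff to calculate jacDet and inc; locals declared
               here are private to each vector lane>
              minJac = fmin(minJac, jacDet);
              intg  += inc;

              if(gradient)
              {
                  #pragma acc loop seq
                  for(int m=0; m<9; ++m)
                  {
                       <do stuff to calculate inc2>
                       // grad_G is shared by all vector lanes working on this
                       // node, so it must be updated atomically
                       #pragma acc atomic update
                       grad_G[m] += inc2;
                  }
              }
         }
     }

     minJacNew[0] = minJac;
     integral[0]  = intg;
}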

Thanks brentl,

Yes, the problem is most definitely a race condition, and I assume it has to do with the global, shared, gang-private and loop-private classifications the compiler applies to the variables in the body of the outer loop, which then affect the inner k loop. I’m trying to get to grips with it, but I’m honestly not sure what ends up where anymore. I have tried various ways to force minJacNew, integral and grad_G to be global arrays and just use atomic operations, without success so far.

The full code is rather huge (around 300 MB) and takes a good 10+ minutes to compile, so I’ll try to avoid asking you to do that! I could, however, send you the full source for this problematic section of the code, and I’m attaching the Minfo output here.

Minfo doesn’t seem to give any useful information about the loop inside the #pragma acc routine, however, so it’s hard to know what it is doing there. The output is identical whether I use private(minJacNew[:1]…) or create temporary variables inside the el loop and add reduction(min:minJac_temp) reduction(+:integral_temp) etc. The results, however, are not identical, and both versions still suffer from the race condition.
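For reference, the temporary-variable variant inside GetFunctional looks roughly like this (a simplified sketch of just the el and k loops):

     #pragma acc loop seq
     for(int el=0; el<nElmt; ++el)
     {
         // per-element temporaries used as reduction targets
         double minJac_temp   = DBL_MAX;
         double integral_temp = 0.0;

         #pragma acc loop vector reduction(min:minJac_temp) reduction(+:integral_temp)
         for(int k=0; k<ptsHigh; ++k)
         {
              <do lots of stuff to calculate jacDet and inc>
              minJac_temp    = fmin(minJac_temp, jacDet);
              integral_temp += inc;
         }

         // fold the per-element results back into the arrays passed in
         minJacNew[0] = fmin(minJacNew[0], minJac_temp);
         integral[0] += integral_temp;
     }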

I have tried making the k loop a vector loop or a worker loop, with private and no atomics, without private and with or without atomics, declaring the routines as worker or vector, and putting the el loop in worker and the k loop in vector with private on the el loop and atomics in the k loop, and I get different problems each time. Anyway, here is the Minfo output for the version where GetFunctional is a vector routine, the k loop is a vector loop, and atomic updates are used. Hope it helps!

Nektar::Utilities::ProcessVarOpti::Optimise3D3D(Nektar::Utilities::DerivUtilGPU &, Nektar::Utilities::NodesGPU &, Nektar::Utilities::ElUtilGPU &, Nektar::Utilities::Residual &, int, Nektar::Utilities::optimiser):
     45, include "Optimise.hxx"
         408, Generating present(elUtil[:],derivUtil[:],nodes[:],res[:])
              Accelerator kernel generated
              Generating Tesla code
             423, #pragma acc loop gang /* blockIdx.x */
             434, #pragma acc loop seq
             465, #pragma acc loop seq
             508, #pragma acc loop seq
         408, CUDA shared memory used for h_Xc,minJacNew,sk,..inline,G,grad_G,eval,h_Xn,integral
              Generating implicit copyin(this[:])
         508, Loop carried reuse of h_Xn prevents parallelization
              Loop carried scalar dependence for alpha at line 219,510,511,541,512
Nektar::Utilities::ProcessVarOpti::c1():
     40, include "NodeOpti.h"
          42, include "ProcessVarOpti.h"
              219, Generating implicit acc routine seq
                   Generating acc routine seq
                   Generating Tesla code
     45, include "Optimise.hxx"
Nektar::Utilities::ProcessVarOpti::gradTol():
     40, include "NodeOpti.h"
          42, include "ProcessVarOpti.h"
              220, Generating implicit acc routine seq
                   Generating acc routine seq
                   Generating Tesla code
     45, include "Optimise.hxx"
Nektar::Utilities::ProcessVarOpti::alphaTol():
     40, include "NodeOpti.h"
          42, include "ProcessVarOpti.h"
              221, Generating implicit acc routine seq
                   Generating acc routine seq
                   Generating Tesla code
     45, include "Optimise.hxx"
void Nektar::Utilities::ProcessVarOpti::CalcEValues<(int)3>(const double (&)[(T1*T1)], double (&)[T1]):
     45, include "Optimise.hxx"
          39, include "Hessian.hxx"
              241, Generating implicit acc routine seq
                   Generating acc routine seq
                   Generating Tesla code
void Nektar::Utilities::ProcessVarOpti::GetNodeCoordGPU<(int)2>(double (&)[T1], const Nektar::Utilities::NodesGPU &, int, int):
     45, include "Optimise.hxx"
          78, Generating acc routine seq
              Generating Tesla code
void Nektar::Utilities::ProcessVarOpti::GetNodeCoordGPU<(int)3>(double (&)[T1], const Nektar::Utilities::NodesGPU &, int, int):
     45, include "Optimise.hxx"
          88, Generating acc routine seq
              Generating Tesla code
void Nektar::Utilities::ProcessVarOpti::SetNodeCoordGPU<(int)2>(const double (&)[T1], const Nektar::Utilities::NodesGPU &, int, int, int):
     45, include "Optimise.hxx"
         105, Generating acc routine seq
              Generating Tesla code
void Nektar::Utilities::ProcessVarOpti::SetNodeCoordGPU<(int)3>(const double (&)[T1], const Nektar::Utilities::NodesGPU &, int, int, int):
     45, include "Optimise.hxx"
         118, Generating acc routine seq
              Generating Tesla code
void Nektar::Utilities::ProcessVarOpti::GetFunctional<(int)3, (bool)1, (Nektar::Utilities::optimiser)3>(const Nektar::Utilities::DerivUtilGPU &, const Nektar::Utilities::NodesGPU &, const Nektar::Utilities::ElUtilGPU &, double *, double *, double *, int, int, int, double):
     45, include "Optimise.hxx"
         136, Generating Tesla code
             149, #pragma acc loop seq
             157, #pragma acc loop vector /* threadIdx.x */
             173, #pragma acc loop seq
             202, #pragma acc loop seq
             204, #pragma acc loop seq
             211, #pragma acc loop seq
             213, #pragma acc loop seq
             216, #pragma acc loop seq
             281, #pragma acc loop seq
             288, #pragma acc loop seq
             291, #pragma acc loop seq
             298, #pragma acc loop seq
             301, #pragma acc loop seq
             310, #pragma acc loop seq
             321, #pragma acc loop seq
             323, #pragma acc loop seq
         157, Loop is parallelizable
void Nektar::Utilities::ProcessVarOpti::GetFunctional<(int)3, (bool)1, (Nektar::Utilities::optimiser)0>(const Nektar::Utilities::DerivUtilGPU &, const Nektar::Utilities::NodesGPU &, const Nektar::Utilities::ElUtilGPU &, double *, double *, double *, int, int, int, double):
     45, include "Optimise.hxx"
         136, Generating Tesla code
             149, #pragma acc loop seq
             157, #pragma acc loop vector /* threadIdx.x */
             173, #pragma acc loop seq
             202, #pragma acc loop seq
             204, #pragma acc loop seq
             211, #pragma acc loop seq
             213, #pragma acc loop seq
             216, #pragma acc loop seq
             281, #pragma acc loop seq
             288, #pragma acc loop seq
             291, #pragma acc loop seq
             298, #pragma acc loop seq
             301, #pragma acc loop seq
         157, Loop is parallelizable
void Nektar::Utilities::ProcessVarOpti::GetFunctional<(int)3, (bool)0, (Nektar::Utilities::optimiser)3>(const Nektar::Utilities::DerivUtilGPU &, const Nektar::Utilities::NodesGPU &, const Nektar::Utilities::ElUtilGPU &, double *, double *, double *, int, int, int, double):
     45, include "Optimise.hxx"
         136, Generating Tesla code
             149, #pragma acc loop seq
             157, #pragma acc loop vector /* threadIdx.x */
             173, #pragma acc loop seq
             202, #pragma acc loop seq
             204, #pragma acc loop seq
             211, #pragma acc loop seq
             213, #pragma acc loop seq
             216, #pragma acc loop seq
         157, Loop is parallelizable
void Nektar::Utilities::ProcessVarOpti::GetFunctional<(int)3, (bool)0, (Nektar::Utilities::optimiser)0>(const Nektar::Utilities::DerivUtilGPU &, const Nektar::Utilities::NodesGPU &, const Nektar::Utilities::ElUtilGPU &, double *, double *, double *, int, int, int, double):
     45, include "Optimise.hxx"
         136, Generating Tesla code
             149, #pragma acc loop seq
             157, #pragma acc loop vector /* threadIdx.x */
             173, #pragma acc loop seq
             202, #pragma acc loop seq
             204, #pragma acc loop seq
             211, #pragma acc loop seq
             213, #pragma acc loop seq
             216, #pragma acc loop seq
         157, Loop is parallelizable

The offender at line 508,

" 508, Loop carried reuse of h_Xn prevents parallelization
       Loop carried scalar dependence for alpha at line 219,510,511,541,512 "

is the while loop, which should be serial anyway.

The line that seems suspicious to me is the one just above it,

" 408, CUDA shared memory used for h_Xc,minJacNew,sk,…inline,G,grad_G,eval,h_Xn,integral
       Generating implicit copyin(this[:]) "

but I’m not sure whether it is actually an issue or not.

If it is possible to send you the code, or portions of it, that would be great too; just let me know where to send it.

Really appreciate any help I can get,

Mashy.

If there is any variable in the list of things it puts into shared memory that you think should be private to each thread, then that certainly is an error. Shared memory variables are shared between all threads (vector lanes) in a gang.
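As a rough illustration of the difference, using your names (a sketch, not your actual code):

#pragma acc parallel loop gang
for(int node=0; node<coloursetSize; node++)
{
     // declared in the gang loop body: one copy per gang, shared by every
     // vector lane of that gang (this is what can land in CUDA shared memory)
     double grad_G[9];
     <initialise grad_G>

     #pragma acc loop vector
     for(int k=0; k<ptsHigh; ++k)
     {
          // declared in the vector loop body: private to each vector lane
          double inc2;
          <calculate inc2>

          // lanes of the same gang all touch the shared grad_G, so the
          // update has to be atomic
          #pragma acc atomic update
          grad_G[0] += inc2;
     }
}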