Using classes in OpenACC

Hello everyone,
The program I’m trying to edit contains several classes. My problem is that in class A I have this loop that I would like to run on the GPU:

 for (iPoint = 0; iPoint < nPoint; iPoint++) {
    Density = nodes->GetDensity(iPoint);

    Velocity2 = 0.0;
    for (iDim = 0; iDim < nDim; iDim++)
      Velocity2 += pow(nodes->GetSolution(iPoint,iDim+1)/Density,2);

    StaticEnergy= nodes->GetEnergy(iPoint) - 0.5*Velocity2;

    GetFluidModel()->SetTDState_rhoe(Density, StaticEnergy);
    Pressure= GetFluidModel()->GetPressure();
    Temperature= GetFluidModel()->GetTemperature();

    /*--- Use the values at the infinity ---*/

    su2double Solution[MAXNVAR] = {0.0};
    if ((Pressure < 0.0) || (Density < 0.0) || (Temperature < 0.0)) {
      Solution[0] = Density_Inf;
      for (iDim = 0; iDim < nDim; iDim++)
        Solution[iDim+1] = Velocity_Inf[iDim]*Density_Inf;
      Solution[nDim+1] = Energy_Inf*Density_Inf;
      nodes->SetSolution(iPoint,Solution);
      nodes->SetSolution_Old(iPoint,Solution);
      counter_local++;
    }

  }

In the code, nodes-> is a pointer to a class C, but it is defined in class B:

In class B.hpp

#include "C.hpp"

C* nodes = nullptr;

When I try to add OpenACC directives:

#pragma acc kernels
  {
  for (iPoint = 0; iPoint < nPoint; iPoint++) {

    su2double TDVariables[2] = {0.0};

    Density = nodes->GetDensity(iPoint);
    TDVariables[0]=Density;

    Velocity2 = 0.0;

    for (iDim = 0; iDim < nDim; iDim++)
      Velocity2 += pow(nodes->GetSolution(iPoint,iDim+1)/Density,2);

    StaticEnergy= nodes->GetEnergy(iPoint) - 0.5*Velocity2;
    TDVariables[1]=StaticEnergy;
    GetFluidModel()->SetTDVariables(iPoint,nPoint,TDVariables);

  }
  }

The compiler reports:

CEulerSolver::CEulerSolver(CGeometry *, CConfig *, unsigned short, bool):
     74, Generating enter data copyin(this[:1])
         Generating enter data create(Velocity2,StaticEnergy,Density,Temperature,Pressure)
    288, Generating implicit private(iPoint)
         Generating implicit copyin(this[:]) [if not already present]
    290, Complex loop carried dependence of TDVariables,__b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes,__b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes-> prevents parallelization
         Loop carried dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes-> prevents parallelization
         Loop carried backward dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes-> prevents vectorization
         Scalar last value needed after loop for Density at line 318
         Complex loop carried dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.__b_7CSolver.nDim prevents parallelization
         Loop carried dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes prevents parallelization
         Loop carried backward dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes prevents vectorization
         Complex loop carried dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes-> prevents parallelization
         Loop carried dependence of this-> prevents parallelization
         Loop carried backward dependence of this-> prevents vectorization
         Complex loop carried dependence of this-> prevents parallelization
         Loop carried dependence of ->__vptr prevents parallelization
         Loop carried backward dependence of ->__vptr prevents vectorization
         Loop carried dependence of ->__vptr-> prevents parallelization
         Loop carried backward dependence of ->__vptr-> prevents vectorization
         Loop carried dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.__b_7CSolver.nPoint prevents parallelization
         Loop carried backward dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.__b_7CSolver.nPoint prevents vectorization
         Complex loop carried dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes->__b_13CFlowVariable.__b_9CVariable prevents parallelization
         Loop carried dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes->__b_13CFlowVariable.__b_9CVariable prevents parallelization
         Loop carried backward dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes->__b_13CFlowVariable.__b_9CVariable prevents vectorization
         Conditional loop will be executed in scalar mode
         Accelerator serial kernel generated
         CUDA shared memory used for TDVariables
         Generating implicit private(iPoint,StaticEnergy,iDim,Velocity2)
         Generating NVIDIA GPU code
        290, #pragma acc loop seq
        299, #pragma acc loop seq
    290, CUDA shared memory used for TDVariables
         Complex loop carried dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes,__b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes->,__b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes->__b_13CFlowVariable.__b_9CVariable prevents parallelization
         Loop carried dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.__b_7CSolver.nPoint prevents parallelization
         Complex loop carried dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes->__b_13CFlowVariable.__b_9CVariable prevents parallelization
         Conditional loop will be executed in scalar mode
    299, Complex loop carried dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes->__b_13CFlowVariable.__b_9CVariable,__b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes prevents parallelization
         Loop carried scalar dependence for iDim at line 300
         Generating implicit private(iDim)
         Complex loop carried dependence of __b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes->__b_13CFlowVariable.__b_9CVariable,__b_18CFVMFlowSolverBaseI14CEulerVariableL11ENUM_REGIME0EE.nodes prevents parallelization
         Loop carried scalar dependence for iDim at line 300

I’ve done several tests but none seem to work, so my question is: how can I correctly treat the nodes-> pointer so that it can be used by the GPU through OpenACC?

The compiler is offloading the code; it’s just not able to implicitly parallelize the loop due to the potential dependencies.

Due to pointer aliasing, the pointers may be pointing to the same object. If they were, this would cause a dependency. Since the compiler can’t tell at compile time, it must presume they are aliased and therefore not parallelize the code.
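
As a contrived illustration (not from SU2), this is the kind of case where the compiler has to stay conservative:

void add(double* a, const double* b, unsigned long n) {
  for (unsigned long i = 0; i < n; ++i)
    a[i] += b[i];   // if a and b overlap, iteration i may read a value written
                    // by an earlier iteration, so without extra information the
                    // compiler cannot prove the loop is safe to parallelize
}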

A couple ways to fix it:

  1. Use the “independent” clause to assert to the compiler that there are no dependencies:
 #pragma acc kernels
  {
#pragma acc loop independent    
  for (iPoint = 0; iPoint < nPoint; iPoint++) {
  2. Use “parallel” instead of “kernels”. The main difference between the two constructs is that with “kernels” the compiler must discover the parallelism and ensure it’s safe to parallelize. With “parallel”, you’re telling the compiler where to parallelize, and it’s up to the programmer to ensure safety:
 #pragma acc parallel
  {
  #pragma acc loop   
  for (iPoint = 0; iPoint < nPoint; iPoint++) {

or combined:

#pragma acc parallel loop 
  for (iPoint = 0; iPoint < nPoint; iPoint++) {
  3. Add the flag “-Msafeptr”.

This asserts to the compiler that there is no aliasing. It can be problematic if the code actually does have aliasing, so use it with care. Better to use option #1 or #2.

-Mat

Thanks for the reply, Mat. I tried the second method and it works. But now another problem arises.

When, inside the compute region, the kernel calls:

Density = nodes->GetDensity(iPoint);

where the function GetDensity(iPoint) is :

inline su2double GetDensity(unsigned long iPoint) const final { return Solution(iPoint,0); }

and Solution is

  using MatrixType = C2DContainer<unsigned long, su2double, StorageType::RowMajor,    64, DynamicSize, DynamicSize>;

  MatrixType Solution;       /*!< \brief Solution of the problem. */

C2DContainer is a class that manages a std::vector internally. When I run the code I get the following error:

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Searching the internet, I found that OpenACC doesn’t particularly like vectors, but what could be a solution if I don’t want to use managed memory?
How can I copy the 2D vector to the GPU through OpenACC?

It’s possible, but very tricky and you’ll be much better off using managed memory.

A vector is a type with three pointers, which means that to manually manage the data you must perform a deep copy.

For 1D, this means creating the device copy of the vector, getting the base pointer, creating the array on the device, and then attaching the device array to the base pointer inside the device copy of the vector. To update or copy data, you can’t use the vector itself since that would only copy the vector’s pointers, not the data. Instead, you need to grab the base pointer and use it in the update directive or copy clauses.
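
To make those steps concrete, here is a minimal sketch of the create/attach pattern. It uses a hypothetical hand-written container with a raw data pointer, since std::vector keeps its pointer in a private member that can’t be named in a directive; none of these names come from SU2:

struct DeviceVec {
  double* data;      // base pointer to the storage
  unsigned long n;   // number of elements
};

void create_on_device(DeviceVec& v) {
  // 1. Shallow-copy the container object itself to the device.
  #pragma acc enter data copyin(v)
  // 2. Copy the underlying array; because v is already present, the runtime
  //    attaches the device copy of v.data to the newly created device array.
  #pragma acc enter data copyin(v.data[0:v.n])
}

void update_on_device(DeviceVec& v) {
  // Updates must go through the base pointer, not the container object,
  // otherwise only the pointer and size members would be transferred.
  #pragma acc update device(v.data[0:v.n])
}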

For 2D, you now have a pointer to an array of vectors, each of which has a pointer to its own array. So you have to repeat the 1D method for each one, traversing the array of vectors and ensuring each is attached properly. To copy, again you’d need to traverse the array of vectors, grab each base pointer, and then copy.

Possible? Yes. Easy? No. Good code? Definitely not.

If you really can’t use managed memory, you’ll be much better off switching to using C-style arrays. For a 1D vector, this can mean just using the base pointer of the vector rather than the vector itself.
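
For example, here is a minimal sketch of that base-pointer approach for a 1D vector (the scale function is hypothetical, not from SU2):

#include <vector>

void scale(std::vector<double>& v, double factor) {
  double* p = v.data();              // raw pointer to the vector's storage
  const unsigned long n = v.size();

  // Only the underlying array is moved to the device; the vector object
  // itself never appears inside the compute region.
  #pragma acc parallel loop copy(p[0:n])
  for (unsigned long i = 0; i < n; ++i)
    p[i] *= factor;
}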

Also keep in mind that vectors are not thread safe. It’s fine if all the code does is access the vectors, but a push, pop, insert, etc. is problematic if done in parallel.

OK, thanks so much, so using managed memory is the right choice.
Sometimes when I try to add more parallelism to the code I get this error (using the CUDA debugger):

Thread 1 "SU2_CFD" received signal SIGSEGV, Segmentation fault.
__pgi_uacc_cuda_enter (rversion=0x7fffffffa948, objinfo=0x2954bf0 <.Y0012>, dindex=1) at ../../src/cuda_enter.c:78
78	../../src/cuda_enter.c: File o directory non esistente.

What kind of error is it?

It happens when I try to parallelize this piece of code:

unsigned long CEulerSolver::SetPrimitive_Variables(CSolver **solver_container, const CConfig *config) {

  /*--- Number of non-physical points, local to the thread, needs
   *    further reduction if function is called in parallel ---*/
  unsigned long nonPhysicalPoints = 0;

  AD::StartNoSharedReading();

  SU2_OMP_FOR_STAT(omp_chunk_size)
#pragma acc parallel loop
  for (unsigned long iPoint = 0; iPoint < nPoint; iPoint ++) {

    /*--- Compressible flow, primitive variables nDim+9, (T, vx, vy, vz, P, rho, h, c, lamMu, eddyMu, ThCond, Cp) ---*/

    bool physical = nodes->SetPrimVar(iPoint, GetFluidModel());
    nodes->SetSecondaryVar(iPoint, GetFluidModel());

    /* Check for non-realizable states for reporting. */

    if (!physical) nonPhysicalPoints++;
  }
  END_SU2_OMP_FOR

  AD::EndNoSharedReading();

  return nonPhysicalPoints;
}

where SetPrimVar is:

#pragma acc routine
bool CEulerVariable::SetPrimVar(unsigned long iPoint, CFluidModel *FluidModel) {

  bool RightVol = true;

  SetVelocity(iPoint);   // Computes velocity and velocity^2
  su2double density      = GetDensity(iPoint);
  su2double staticEnergy = GetEnergy(iPoint)-0.5*Velocity2(iPoint);

  /*--- Check will be moved inside fluid model plus error description strings ---*/

  FluidModel->SetTDState_rhoe(density, staticEnergy);

  bool check_dens  = SetDensity(iPoint);
  bool check_press = SetPressure(iPoint, FluidModel->GetPressure());
  bool check_sos   = SetSoundSpeed(iPoint, FluidModel->GetSoundSpeed2());
  bool check_temp  = SetTemperature(iPoint, FluidModel->GetTemperature());

  /*--- Check that the solution has a physical meaning ---*/

  if (check_dens || check_press || check_sos || check_temp) {

    /*--- Copy the old solution ---*/

    for (unsigned long iVar = 0; iVar < nVar; iVar++)
      Solution(iPoint, iVar) = Solution_Old(iPoint, iVar);

    /*--- Recompute the primitive variables ---*/

    SetVelocity(iPoint);   // Computes velocity and velocity^2
    su2double density = GetDensity(iPoint);
    su2double staticEnergy = GetEnergy(iPoint)-0.5*Velocity2(iPoint);
    /* check will be moved inside fluid model plus error description strings*/
    FluidModel->SetTDState_rhoe(density, staticEnergy);

    SetDensity(iPoint);
    SetPressure(iPoint, FluidModel->GetPressure());
    SetSoundSpeed(iPoint, FluidModel->GetSoundSpeed2());
    SetTemperature(iPoint, FluidModel->GetTemperature());

    RightVol = false;

  }

  SetEnthalpy(iPoint); // Requires pressure computation.

  return RightVol;
}

This is a host-side segmentation violation when entering the runtime code that launches the compute kernel.

Try adding “default(present)” so the compiler doesn’t add any implicit data movement and instead ensures the data is present on the device.

#pragma acc parallel loop default(present)

Not sure this will fix it, but let’s give it a try first.

It doesn’t work; when I run the program the same error comes out.

Ok, worth a try.

I’m looking at the source for cuda_enter.c and line 78 is just printing an error message, so it doesn’t quite make sense why it would segv there. I’m going to assume the line number is off.

Is it possible to get a reproducing example? Ideally, if you can create a small stand-alone example, that would be great. But since this is a larger project, I understand if that’s not easy to do. The full source is fine as well, but I’ll need instructions on how you build and run it.

If a reproducer is not possible, I’ll think of some ideas on how to track down what’s happening.

I’ll try to create a reproducible example.
In the meantime, however, I realized that during compilation, even though it completes, I get this error:

container_details::AccessorImpl<unsigned long, double, (StorageType)0, (unsigned long)64, (unsigned long)0, (unsigned long)0>::operator ()(unsigned long, unsigned long):
     28, include "CEulerSolver.hpp"
          30, include "CFVMFlowSolverBase.hpp"
               31, include "CSolver.hpp"
                    43, include "CFluidModel.hpp"
                         33, include "CConfig.hpp"
                              47, include "container_decorators.hpp"
                                   31, include "C2DContainer.hpp"
                                       291, Generating implicit acc routine seq
                                            Generating acc routine seq
                                            Generating NVIDIA GPU code
__gnu_cxx::__promote_2<T1, T2, __gnu_cxx::__promote<T1, std::__is_integer<T1>::__value>::__type, __gnu_cxx::__promote<T2, std::__is_integer<T2>::__value>::__type>::__type std::pow<double, int>(T1, T2):
     28, include "CEulerSolver.hpp"
          30, include "CFVMFlowSolverBase.hpp"
               30, include "geometry_toolbox.hpp"
                    29, include "cmath"
                         15, include "cmath"
                             416, Generating implicit acc routine seq
                                  Generating acc routine seq
                                  Generating NVIDIA GPU code
ptxas fatal   : Unresolved extern function '_ZN14CEulerVariable10SetPrimVarEmP11CFluidModel'
NVC++/x86-64 Linux 22.9-0: compilation completed with warnings
[6/7] Installing files.

SetPrimVar is the function inside the OpenACC compute region that I think is causing the problem.
I checked the code and it is defined correctly in both the source file and the header file.

Are you compiling with “-gpu=nordc”? And if so, is SetPrimVar defined in a separate file?

One of the limitations of using nordc is that you can’t call device routines in separate files. They have to be in the same file so the device routine can be inlined. Calling into a separate file requires a device link step, which nordc disables.
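
For illustration, a minimal sketch of a layout that works under nordc: the device routine is defined in a header so it is visible in the caller’s translation unit (the square routine and file names are hypothetical, not SU2 code):

// square.hpp
#pragma acc routine seq
inline double square(double x) { return x * x; }

// caller.cpp
#include "square.hpp"

void apply_square(double* p, unsigned long n) {
  #pragma acc parallel loop copy(p[0:n])
  for (unsigned long i = 0; i < n; ++i)
    p[i] = square(p[i]);   // resolved in this translation unit, so no device link is needed
}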

The area of code where the segv occurs is where the runtime finds the correct device binary or PTX for a particular target device. Given there is no binary or PTX, that is likely the core issue. Our bug here is that the compiler should be erroring out on the ptxas fatal, not just issuing a warning.

Once you’re able to get a reproducer together, I can file a report.

Yes, I’m compiling with “-gpu=nordc”, and SetPrimVar is defined as a virtual function in another class.
The reason I am using “-gpu=nordc” is that without that flag I get this error:

Current file:     /home/marco/CFD/SU2_CPRP_CUDA/build/../SU2_CFD/src/solvers/CEulerSolver.cpp
        function: _ZN12CEulerSolver22SetPrimitive_VariablesEPP7CSolverPK7CConfig
        line:     1651
This file was compiled: -acc=gpu -gpu=cc80 -gpu=cc86 -acc=host or -acc=multicore
Rebuild this file with -gpu=cc86 to use NVIDIA Tesla GPU 0

and my current flags for compiling OpenACC are:

if get_option('enable-gpu')

  add_global_arguments('-std=c++11','-acc','-gpu=managed','-Minfo=accel', language: 'cpp')
  add_global_arguments('-std=c99','-acc','-gpu=managed','-Minfo=accel', language: 'c')
  add_global_link_arguments('-std=c++11','-acc','-gpu=managed','-Minfo=accel','-Wl,--as-needed','-Wl,--no-undefined','-Wl,-O1','-Wl,--start-group', language: 'cpp')
  add_global_link_arguments('-std=c99','-acc','-gpu=managed','-Minfo=accel','-Wl,--as-needed','-Wl,--no-undefined','-Wl,-O1','-Wl,--start-group', language: 'c')

endif

Presumably you’re creating a shared object? In this case, correct. With C++, SOs need to be compiled and linked with -gpu=nordc. We were able to get RDC working with Fortran and C SOs, just not C++. I have an open RFE for this, but technical issues have prevented engineering from adding this support.

SetPrimVar is defined as a virtual function in another class.

This is problematic for two reasons.

  1. To call a device routine in another source file requires a device link, which is only enabled with RDC.
  2. Function pointers, which are how virtual functions are implemented, are not yet supported on the device. This requires a dynamic linker for the device, which until very recently did not exist. Engineering is working on adding this support, but I’m not sure of its status.