Inlining with pragmas

Hey there,

Is it possible to inline functions that also have pragmas declared? It seems I can inline functions if the function itself is small and doesn’t contain any loops.

In the case where the function contains loops and I try to inline it into a function that is already wrapped with pragmas, I get yelled at with “Complex loop carried dependence of…” and “Loop carried reuse of…”.

If I add pragmas to the function I want to inline, I get “Accelerator restriction: function/procedure calls are not supported”.

The function calls exist in the same file, and I’m simply adding -Minline:fund to the command line when I compile.

Thanks for your time.

Have you tried using the OpenACC “Routine” pragma? It will allow you to call the function from the compute region so you don’t need to inline anything.

For the dependency messages, try adding “-Msafeptr”. It may be that the compiler can’t tell if your pointers are aliased or not. This flag asserts that all pointers are disjoint.

Finally, without the “routine” directive, no you can’t put pragma in the function to be inlined.

  • Mat

Thanks Mat,

I went ahead and just removed the pragmas from the inlined function for now. However, I’m getting a strange error that I’m not sure how to interpret. My input code (not stand-alone, but the only part that contains accelerator syntax)

void rhs(double r[][3], double v[][3], double ro[][3], double vo[][3],double * mass, double * eta, int N)
{
    // register int m;
    double rr[N][N][3],rp[3],temp,temp2,temp3;

    #pragma acc region copyin(N, mass[:N])
    {
        #pragma acc loop //for private(rp[:3])
        for (int j = 0; j < N; j++) {   
            #pragma acc loop
            for (int i = 0; i < N; i++) {
                if (i >= j) continue;
                rij(r,mass,eta,i,j,rp);
                temp = norm3(rp);

                // rp[0]*temp^3
                rr[i][j][0] = (1/temp)*rp[0];
                rr[i][j][1] = (1/temp)*rp[1];
                rr[i][j][2] = (1/temp)*rp[2];
            }
        }
    //} 


    A3(ro[0],v[0]);
    vo[0][0] = 0; vo[0][1] = 0; vo[0][2] = 0;

    //pragma acc kernels copyin(mass[:N])
    //{
        #pragma acc loop
            ro[i][0] = v[i][0];//A3(ro[i],v[i]);
            vo[i][0] = (temp*mass[0])*rr[0][i][0];//A3(vo[i],(eta[i]/eta[i-1]*mass[0])*rr[0][i]);
            ro[i][1] = v[i][1];//A3(ro[i],v[i]);
            vo[i][1] = (temp*mass[0])*rr[0][i][1];
            ro[i][2] = v[i][2];//A3(ro[i],v[i]);
            vo[i][2] = (temp*mass[0])*rr[0][i][2];

            #pragma acc loop 
            for (int j = 1; j < N; j++) {
                if (j > i-1) continue;
                vo[i][0] += (temp*mass[j])*rr[j][i][0];//A3P(vo[i],temp*mass[j]*rr[j][i]);
                vo[i][1] += (temp*mass[j])*rr[j][i][1];
                vo[i][2] += (temp*mass[j])*rr[j][i][2];
            }
            #pragma acc loop
            for (int j=i+1; j<=N-1; j++) {
                vo[i][0] += -(mass[j])*(rr[i][j][0]); //A3P(vo[i],-mass[j]*rr[i][j]);
                vo[i][1] += -(mass[j])*(rr[i][j][1]);
                vo[i][2] += -(mass[j])*(rr[i][j][2]);
            }
            #pragma acc loop
            for (int j=0; j <= i-1; j++) {
                #pragma acc loop
                for (int k = i+1; k <= N-1; k++) {
                    vo[i][0] += (mass[k]*mass[j]*temp3)*(rr[j][k][0]);//A3P(vo[i],mass[j]*mass[k]*temp*rr[j][k]);
                    vo[i][1] += (mass[k]*mass[j]*temp3)*(rr[j][k][1]);
                    vo[i][2] += (mass[k]*mass[j]*temp3)*(rr[j][k][2]);
                }
            }
        }
    }
}

And I’m getting this error

PGCC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected flow graph (source/n_body.cpp: 129)
rhs(double (*)[3], double (*)[3], double (*)[3], double (*)[3], double *, double *, int):
    129, Generating copyin(N)
         Generating copyin(mass[:N])
    131, Loop is parallelizable
    133, Loop is parallelizable
          84, Complex loop carried dependence of 'rp' prevents parallelization
              Loop carried reuse of 'rp' prevents parallelization
              Inner sequential loop scheduled on accelerator
         105, Complex loop carried dependence of 'rp' prevents parallelization
              Loop carried reuse of 'rp' prevents parallelization
              Inner sequential loop scheduled on accelerator
    147, Loop is parallelizable
         Accelerator scalar kernel generated
    153, Loop is parallelizable
    165, Loop is parallelizable
    172, Accelerator restriction: size of the GPU copy of 'rr' is unknown
         Loop is parallelizable
    178, Loop is parallelizable
    180, Accelerator restriction: size of the GPU copy of 'rr' is unknown
         Loop is parallelizable

Any ideas?

Thanks for your time.

Hi nmearl,

The “Unexpected flow graph” is a compiler error. Your example doesn’t compile as is but after a few changes, I was able to reproduce the error and I’ve filed a problem report (TPR#20361). However, the code itself is bad and this is causing the unexpected compiler behavior.

I took the liberty of rewriting the code to what I think you wanted. It appears to me that you are missing a second “i” loop since you’re accessing ro_[0] when there’s no “i” in scope. Second, reductions are only allowed on scalars so I replaced the intermediary vo accumulations with scalars. This will also help cut down on the number of memory access. I also added a data region and split the kernels into two separate regions. Keeping it as it was would mean you would have one outer gang region with several inner vector loops because of the scalar assignments to “vo” in the middle and the call to A3. Note that I did have to comment out all the calls.

Granted, I have no idea if these changes are correct or what you intended so please modify as needed.

Here’s my modified version:_

% cat test_fixed.c
void rhs(double r[][3], double v[][3], double ro[][3], double vo[][3],double * mass, double * eta, int N)
 {
     // register int m;
     double rr[N][N][3],rp[3],temp,temp2,temp3;
     double vo0,vo1,vo2;

     #pragma acc data copyin(N, mass[:N])
{
     #pragma acc kernels
     {
         #pragma acc loop //for private(rp[:3])
         for (int j = 0; j < N; j++) {
             #pragma acc loop
             for (int i = 0; i < N; i++) {
                 if (i >= j) continue;
               ///  rij(r,mass,eta,i,j,rp);
               ///  temp = norm3(rp);

                 // rp[0]*temp^3
                 rr[i][j][0] = (1/temp)*rp[0];
                 rr[i][j][1] = (1/temp)*rp[1];
                 rr[i][j][2] = (1/temp)*rp[2];
             }
         }
     }


///     A3(ro[0],v[0]);
     vo[0][0] = 0; vo[0][1] = 0; vo[0][2] = 0;
     //#pragma acc kernels copyin(mass[:N])
     //{
      #pragma acc kernels loop copy(vo[0:N][0:3], ro[0:N][0:3])
      for (int i = 0; i < N; i++) {
             ro[i][0] = v[i][0];//A3(ro[i],v[i]);
             vo0 = (temp*mass[0])*rr[0][i][0];//A3(vo[i],(eta[i]/eta[i-1]*mass[0])*rr[0][i]);
             ro[i][1] = v[i][1];//A3(ro[i],v[i]);
             vo1 = (temp*mass[0])*rr[0][i][1];
             ro[i][2] = v[i][2];//A3(ro[i],v[i]);
             vo2 = (temp*mass[0])*rr[0][i][2];

             #pragma acc loop reduction(+:vo0,vo1,vo2)
             for (int j = 1; j < N; j++) {
                 if (j > i-1) continue;
                 vo0 += (temp*mass[j])*rr[j][i][0];//A3P(vo[i],temp*mass[j]*rr[j][i]);
                 vo1 += (temp*mass[j])*rr[j][i][1];
                 vo2 += (temp*mass[j])*rr[j][i][2];
             }
             #pragma acc loop reduction(+:vo0,vo1,vo2)
             for (int j=i+1; j<=N-1; j++) {
                 vo0 += -(mass[j])*(rr[i][j][0]); //A3P(vo[i],-mass[j]*rr[i][j]);
                 vo1 += -(mass[j])*(rr[i][j][1]);
                 vo2 += -(mass[j])*(rr[i][j][2]);
             }
             #pragma acc loop reduction(+:vo0,vo1,vo2)
             for (int j=0; j <= i-1; j++) {
             #pragma acc loop reduction(+:vo0,vo1,vo2)
                 for (int k = i+1; k <= N-1; k++) {
                     vo0 += (mass[k]*mass[j]*temp3)*(rr[j][k][0]);//A3P(vo[i],mass[j]*mass[k]*temp*rr[j][k]);
                     vo1 += (mass[k]*mass[j]*temp3)*(rr[j][k][1]);
                     vo2 += (mass[k]*mass[j]*temp3)*(rr[j][k][2]);
                 }
             }
             vo[i][0] +=  vo0;
             vo[i][1] +=  vo1;
             vo[i][2] +=  vo2;
        }
    }
 }


% pgcc -c test_fixed.c -acc -Minfo=accel -Msafeptr -V14.3
rhs:
      7, Generating copyin(N)
         Generating copyin(mass[:N])
      9, Generating present_or_copyin(rr[:N][:N][:])
         Generating present_or_copyin(rp[:])
         Generating NVIDIA code
     12, Loop is parallelizable
     14, Loop is parallelizable
         Accelerator kernel generated
         12, #pragma acc loop gang /* blockIdx.y */
         14, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     32, Generating copy(vo[:N][:])
         Generating copy(ro[:N][:])
         Generating present_or_copyin(rr[:][:][:])
         Generating present_or_copyin(v[:N][:])
         Generating NVIDIA code
     33, Loop is parallelizable
         Accelerator kernel generated
         33, #pragma acc loop gang /* blockIdx.x */
         42, #pragma acc loop vector(128) /* threadIdx.x */
             Sum reduction generated for vo0
             Sum reduction generated for vo1
             Sum reduction generated for vo2
         49, #pragma acc loop vector(128) /* threadIdx.x */
             Sum reduction generated for vo0
             Sum reduction generated for vo1
             Sum reduction generated for vo2
         55, Sum reduction generated for vo0
             Sum reduction generated for vo1
             Sum reduction generated for vo2
         57, #pragma acc loop vector(128) /* threadIdx.x */
             Sum reduction generated for vo0
             Sum reduction generated for vo1
             Sum reduction generated for vo2
     42, Loop is parallelizable
     49, Loop is parallelizable
     55, Loop is parallelizable
     57, Loop is parallelizable

Here’s the test case I submitted to reproduce the flow graph error:

% cat test.c
void rhs(double r[][3], double v[][3], double ro[][3], double vo[][3],double * mass, double * eta, int N)
 {
     // register int m;
     double rr[N][N][3],rp[3],temp,temp2,temp3;
     #pragma acc kernels copyin(N, mass[:N])
     {
         #pragma acc loop //for private(rp[:3])
         for (int j = 0; j < N; j++) {
             #pragma acc loop
             for (int i = 0; i < N; i++) {
                 if (i >= j) continue;
               //  rij(r,mass,eta,i,j,rp);
               //  temp = norm3(rp);

                 // rp[0]*temp^3
                 rr[i][j][0] = (1/temp)*rp[0];
                 rr[i][j][1] = (1/temp)*rp[1];
                 rr[i][j][2] = (1/temp)*rp[2];
             }
         }
     //}


//     A3(ro[0],v[0]);
     vo[0][0] = 0; vo[0][1] = 0; vo[0][2] = 0;
     //#pragma acc kernels copyin(mass[:N])
     //{
     // #pragma acc loop
      int i = 0;
             ro[i][0] = v[i][0];//A3(ro[i],v[i]);
             vo[i][0] = (temp*mass[0])*rr[0][i][0];//A3(vo[i],(eta[i]/eta[i-1]*mass[0])*rr[0][i]);
             ro[i][1] = v[i][1];//A3(ro[i],v[i]);
             vo[i][1] = (temp*mass[0])*rr[0][i][1];
             ro[i][2] = v[i][2];//A3(ro[i],v[i]);
             vo[i][2] = (temp*mass[0])*rr[0][i][2];

             #pragma acc loop
             for (int j = 1; j < N; j++) {
                 if (j > i-1) continue;
                 vo[i][0] += (temp*mass[j])*rr[j][i][0];//A3P(vo[i],temp*mass[j]*rr[j][i]);
                 vo[i][1] += (temp*mass[j])*rr[j][i][1];
                 vo[i][2] += (temp*mass[j])*rr[j][i][2];
             }
             #pragma acc loop
             for (int j=i+1; j<=N-1; j++) {
                 vo[i][0] += -(mass[j])*(rr[i][j][0]); //A3P(vo[i],-mass[j]*rr[i][j]);
                 vo[i][1] += -(mass[j])*(rr[i][j][1]);
                 vo[i][2] += -(mass[j])*(rr[i][j][2]);
             }
             #pragma acc loop
             for (int j=0; j <= i-1; j++) {
                 #pragma acc loop
                 for (int k = i+1; k <= N-1; k++) {
                     vo[i][0] += (mass[k]*mass[j]*temp3)*(rr[j][k][0]);//A3P(vo[i],mass[j]*mass[k]*temp*rr[j][k]);
                     vo[i][1] += (mass[k]*mass[j]*temp3)*(rr[j][k][1]);
                     vo[i][2] += (mass[k]*mass[j]*temp3)*(rr[j][k][2]);
                 }
             }
        }
//     }
 }


% pgcc -c test.c -acc -Minfo=accel -Msafeptr -V14.3
PGC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected flow graph (test.c: 5)
rhs:
      5, Generating copyin(N)
         Generating copyin(mass[:N])
      8, Loop is parallelizable
     10, Loop is parallelizable
     20, Accelerator scalar kernel generated
     38, Complex loop carried dependence of '*(vo)' prevents parallelization
         Parallelization requires privatization of '*(vo)' as well as last value
     45, Complex loop carried dependence of '*(vo)' prevents parallelization
         Parallelization requires privatization of '*(vo)' as well as last value
     51, Parallelization would require privatization of array 'vo[:i1+1][:]'
     53, Complex loop carried dependence of '*(vo)' prevents parallelization
         Loop carried dependence due to exposed use of 'vo[:i1+1][:]' prevents parallelization
PGC/x86-64 Linux 14.3-0: compilation completed with severe errors

Thanks,
Mat

Thanks Mat! You rock!

I’m not sure how that i-loop got misplaced, but there is one in the actual file right where you put it. Sorry for that oversight.

Should I be concerned if when I compile it, I don’t get the same messages? I.e. I don’t get notified of the reductions in the loops

rhs(double (*)[3], double (*)[3], double (*)[3], double (*)[3], double *, double *, int):
    131, Generating copyin(N)
         Generating copyin(mass[:N])
    133, Generating present_or_copyin(rr[:N])
         Generating NVIDIA code
    135, Loop is parallelizable
         Accelerator kernel generated
        135, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    137, Loop carried reuse of 'rp' prevents parallelization
         Complex loop carried dependence of 'rp' prevents parallelization
         Complex loop carried dependence of '*(rr)' prevents parallelization
         Loop carried reuse of '*(rr)' prevents parallelization
          84, Complex loop carried dependence of 'rp' prevents parallelization
              Loop carried reuse of 'rp' prevents parallelization
              Inner sequential loop scheduled on accelerator
         105, Complex loop carried dependence of 'rp' prevents parallelization
              Loop carried reuse of 'rp' prevents parallelization
              Inner sequential loop scheduled on accelerator
    151, Generating copy(vo[:N][:])
         Generating copy(ro[:N][:])
         Generating present_or_copyin(rr[:N])
         Generating present_or_copyin(v[:N][:])
         Generating NVIDIA code
    155, Loop is parallelizable
         Accelerator kernel generated
        155, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    167, Loop is parallelizable
    173, Accelerator restriction: size of the GPU copy of 'rr' is unknown
         Loop is parallelizable
    179, Loop is parallelizable
    181, Accelerator restriction: size of the GPU copy of 'rr' is unknown
         Loop is parallelizable

Also, that inner part with 84 and 105 is the reference to another function. I went ahead and tried to implement the acc routine, but the compiler tells me that

"source/n_body.cpp", line 65: warning: invalid text in pragma
  #pragma acc routine
              ^

My compiler is

pgc++ 14.3-0 64-bit target on x86-64 Linux -tp piledriver 
The Portland Group - PGI Compilers and Tools
Copyright (c) 2014, NVIDIA CORPORATION.  All rights reserved.

Thanks again for your help,
Nick

Should I be concerned if when I compile it, I don’t get the same messages? I.e. I don’t get notified of the reductions in the loops

The loop at line 155 is getting a “gang vector” and the inner loops aren’t getting scheduled. Hence no reduction needed. The question is why the compiler would be ignoring the schedule you provided? It’s because of the dependency “Accelerator restriction: size of the GPU copy of ‘rr’ is unknown”.

The main difference between what you have and what I did, is I was using the C compiler but you’re using C++. C++ doesn’t have true VLAs so “rr[N][N][3]” gets turned into a pointer to a pointer to a fixed size array while in C this becomes a fixed size array. Fixed size arrays have a known size while pointers don’t. To fix, either use C or linearize rr (see below).

source/n_body.cpp", line 65: warning: invalid text in pragma
#pragma acc routine

Sorry, I didn’t realize you were using C++. C++ support for “routine” will be added in 14.3 along with several other C++ improvements (like accessing class member variables in compute regions and data clauses, and single dimension VLAs).

Here’s the linearized “rr” version of the example:

% cat test_fixed.cpp

#define IDX3D(i,j,k)  ((i*N*3)+(j*3)+k)
void rhs(double r[][3], double v[][3], double ro[][3], double vo[][3],double * mass, double * eta, int N)
 {
     // register int m;
     double vo0,vo1,vo2;
#ifdef PGI_144
     double rr[N*N*3],rp[3],temp,temp2,temp3;
#else
     double *rr,rp[3],temp,temp2,temp3;
     rr = new double[N*N*3];
#endif
     #pragma acc data copyin(N, mass[:N]) create(rr[0:N*N*3])
{
     #pragma acc kernels
     {
         #pragma acc loop independent //for private(rp[:3])
         for (int j = 0; j < N; j++) {
             #pragma acc loop independent
             for (int i = 0; i < N; i++) {
                 if (i >= j) continue;
               ///  rij(r,mass,eta,i,j,rp);
               ///  temp = norm3(rp);

                 // rp[0]*temp^3
                 rr[IDX3D(i,j,0)] = (1/temp)*rp[0];
                 rr[IDX3D(i,j,0)] = (1/temp)*rp[1];
                 rr[IDX3D(i,j,0)] = (1/temp)*rp[2];
             }
         }
     }


///     A3(ro[0],v[0]);
     vo[0][0] = 0; vo[0][1] = 0; vo[0][2] = 0;
     //#pragma acc kernels copyin(mass[:N])
     //{
      #pragma acc kernels loop independent copy(vo[0:N][0:3], ro[0:N][0:3])
      for (int i = 0; i < N; i++) {
             ro[i][0] = v[i][0];//A3(ro[i],v[i]);
             vo0 = (temp*mass[0])*rr[IDX3D(0,i,0)];//A3(vo[i],(eta[i]/eta[i-1]*mass[0])*rr[0][i]);
             ro[i][1] = v[i][1];//A3(ro[i],v[i]);
             vo1 = (temp*mass[0])*rr[IDX3D(0,i,1)];
             ro[i][2] = v[i][2];//A3(ro[i],v[i]);
             vo2 = (temp*mass[0])*rr[IDX3D(0,i,2)];

             #pragma acc loop reduction(+:vo0,vo1,vo2) independent
             for (int j = 1; j < N; j++) {
                 if (j > i-1) continue;
                 vo0 += (temp*mass[j])*rr[IDX3D(j,i,0)];//A3P(vo[i],temp*mass[j]*rr[j][i]);
                 vo1 += (temp*mass[j])*rr[IDX3D(j,i,1)];
                 vo2 += (temp*mass[j])*rr[IDX3D(j,i,2)];
             }
             #pragma acc loop reduction(+:vo0,vo1,vo2) independent
             for (int j=i+1; j<=N-1; j++) {
                 vo0 += -(mass[j])*(rr[IDX3D(j,i,0)]); //A3P(vo[i],-mass[j]*rr[i][j]);
                 vo1 += -(mass[j])*(rr[IDX3D(j,i,1)]);
                 vo2 += -(mass[j])*(rr[IDX3D(j,i,2)]);
             }
             #pragma acc loop reduction(+:vo0,vo1,vo2) independent
             for (int j=0; j <= i-1; j++) {
             #pragma acc loop reduction(+:vo0,vo1,vo2) independent
                 for (int k = i+1; k <= N-1; k++) {
                     vo0 += (mass[k]*mass[j]*temp3)*(rr[IDX3D(j,i,0)]);//A3P(vo[i],mass[j]*mass[k]*temp*rr[j][k]);
                     vo1 += (mass[k]*mass[j]*temp3)*(rr[IDX3D(j,i,1)]);
                     vo2 += (mass[k]*mass[j]*temp3)*(rr[IDX3D(j,i,2)]);
                 }
             }
             vo[i][0] +=  vo0;
             vo[i][1] +=  vo1;
             vo[i][2] +=  vo2;
        }
    }
#ifndef PGI_144
   delete rr;
#endif

 }


% pgcpp -acc -Minfo -V14.3 test_fixed.cpp -Msafeptr -c -w
rhs(double (*)[3], double (*)[3], double (*)[3], double (*)[3], double *, double *, int):
     14, Generating copyin(N)
         Generating copyin(mass[:N])
         Generating create(rr[:(N*N)*3])
     16, Generating present_or_copyin(rp[:])
         Generating NVIDIA code
     18, Loop is parallelizable
     20, Loop is parallelizable
         Accelerator kernel generated
         18, #pragma acc loop gang /* blockIdx.y */
         20, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     35, Generating copy(vo[:N][:])
         Generating copy(ro[:N][:])
         Generating present_or_copyin(v[:N][:])
         Generating NVIDIA code
     39, Loop is parallelizable
         Accelerator kernel generated
         39, #pragma acc loop gang /* blockIdx.x */
         48, #pragma acc loop vector(128) /* threadIdx.x */
             Sum reduction generated for vo0
             Sum reduction generated for vo1
             Sum reduction generated for vo2
         55, #pragma acc loop vector(128) /* threadIdx.x */
             Sum reduction generated for vo0
             Sum reduction generated for vo1
             Sum reduction generated for vo2
         61, Sum reduction generated for vo0
             Sum reduction generated for vo1
             Sum reduction generated for vo2
         63, #pragma acc loop vector(128) /* threadIdx.x */
             Sum reduction generated for vo0
             Sum reduction generated for vo1
             Sum reduction generated for vo2
     48, Loop is parallelizable
     55, Loop is parallelizable
     61, Loop is parallelizable
     63, Loop is parallelizable