Kernel with global scope variables

istvanreguly · April 11, 2014, 2:41pm

Hello,

I would like to use global scope constants in my OpenACC region, somehow as it is outlined below, however I do not seem to be able to get “calculate” to inline (I do use -Minline). As I understand it may be necessary to use #pragma acc declare create (stride) but I am not quite sure where to put it, I tried putting a data region around the conditional and the loop, which did report that these creates were generated, but the function still did not inline. As soon as I removed any reference to stride, it did inline, but of course I got wrong results.
Unfortunately the way the codebase is laid out, there is no way to change it so that it would pass stride to calculate.
Could anyone help me with this?

int stride;
void calculate(const double *in, double *out) {
  out[0] = in[0]+in[0+stride];
}
void wrapper(...) {
  double *in=...;
  double *out=...;
  if (first_time) {
    stride = 15; //not compile time known
  }
  #pragma acc kernel deviceptr(in,out)
  for (int i = 0; i < 100; i++) {
    for (int j = 0; j < 100; j++) {
      calculate(in+i+j*stride,out+i+j*stride);
    }
  }
}

Thank you,
Istvan

MatColgrove · April 11, 2014, 3:04pm

Hi Istvan,

I tried to recreate your issue with the following code. There is an problem in that the compiler can’t tell your pointers are not aliased so you need to add the “loop independent” clause, but no problem with inlining or global scoped variables. If I’m not recreating the problem, please post an example as well as the error you’re seeing, the compiler version and which OS you’re using.

Thanks,
Mat

% cat test.c
int stride;
static int first_time=1;
 void calculate(const double * in, double * out) {
   out[0] = in[0]+in[0+stride];
 }
 void wrapper(double * in, double * out) {
   int strd;
   if (first_time) {
     first_time=0;
     stride = 15; //not compile time known
   }
   #pragma acc kernels deviceptr(in,out)
   #pragma acc loop independent
   for (int i = 0; i < 100; i++) {
   #pragma acc loop independent
     for (int j = 0; j < 100; j++) {
       calculate(in+i+j*stride,out+i+j*stride);
     }
   }
 }
% pgcc -c test.c -Minline -Minfo -acc -V14.3
wrapper:
     12, Generating NVIDIA code
     14, Loop is parallelizable
     16, Loop is parallelizable
         Accelerator kernel generated
         14, #pragma acc loop gang /* blockIdx.y */
         16, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     17, calculate inlined, size=4, file test.c (3)

istvanreguly · May 2, 2014, 8:59am

Thank you, this seems to have done the trick for most loops. However, I am still having trouble with reductions:

#define MIN(a,b) (a<b)?(a):(b)
void ops_par_loop_calc_dt_kernel_min(int x_size, int y_size, double *p_a0, double *red) {
  double p_a1 = 1e21;
  #pragma acc parallel copyin(p_a0[0:x_size*y_size])
  #pragma acc loop reduction(min:p_a1)
  for ( int n_y=0; n_y<y_size; n_y++ ){
    #pragma acc loop reduction(min:p_a1)
    for ( int n_x=0; n_x<x_size; n_x++ ){
      p_a1 = MIN(p_a1,p_a0[n_x+n_y*x_size]);
    }
 }
 *red = p_a1;
}

when compiled with 14.2:

pgcpp -acc -ta=tesla:cc3x,keepgpu -O2 -Minline -Kieee -Minform=inform -Minfo=all calc_min.cpp -c -o calc_min.o

gives me the following report:

     16, Generating copyin(p_a0[0:x_size*y_size])
         Accelerator kernel generated
         19, #pragma acc loop gang /* blockIdx.x */
             Sum reduction generated for p_a1
         21, #pragma acc loop vector(256) /* threadIdx.x */
             Sum reduction generated for p_a1
     16, Generating NVIDIA code
     21, Loop is parallelizable

Why is a sum reduction being generated instead of a min reduction? I have looked into the generated calc_min.n001.gpu file, which, unless I am interpreting it wrong, does a threadblock-level sum reduction in shared memory. What am I doing wrong? I have tried using kernels instead of parallel and putting the reduction clause in all sorts of combinations but this still seems to come up.

MatColgrove · May 2, 2014, 5:12pm

It’s C++ compiler error that’s been fixed in 14.4.

% pgcpp -acc -c -Minfo=accel test.cpp -V14.2
ops_par_loop_calc_dt_kernel_min(int, int, double *, double *):
      3, Generating copyin(p_a0[0:y_size*x_size])
         Accelerator kernel generated
          6, #pragma acc loop gang /* blockIdx.x */
             Sum reduction generated for p_a1
          8, #pragma acc loop vector(256) /* threadIdx.x */
             Sum reduction generated for p_a1
      3, Generating NVIDIA code
      8, Loop is parallelizable

% pgcpp -acc -c -Minfo=accel test.cpp -V14.4
ops_par_loop_calc_dt_kernel_min(int, int, double *, double *):
      3, Generating copyin(p_a0[:y_size*x_size])
         Accelerator kernel generated
          6, #pragma acc loop gang /* blockIdx.x */
             Min reduction generated for p_a1
          8, #pragma acc loop vector(256) /* threadIdx.x */
             Min reduction generated for p_a1
      3, Generating Tesla code
      8, Loop is parallelizable

Mat