OpenACC, Procedures called in a compute region must have acc routine information

Hey everybody,

I have some minimal code snippet, which I cant’t get to run. I want to accelerate a kernel which applies the same algorithm to thousands of small-sized vectors. Usually, I’d like to save the data in a standard vector of standard vectors, but I couldn’t figure out how to move the data to device then. That’s why I came up with this unhandy struct-version… However, the actual problem is the error “Procedures called in a compute region must have acc routine information”, fully below:

dosomething(std::vector<double, std::allocator>, int):
11, Generating Tesla code
13, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
13, Loop is parallelizable
PGCC-S-0155-Procedures called in a compute region must have acc routine information: std::vector<double, std::allocator>::vector(const std::vector<double, std::allocator>&) (test.cpp: 39)
PGCC-S-0155-Accelerator region ignored; see -Minfo messages (test.cpp: 34)
main:
30, Generating copy(sizes[:],vecOfPointers[:numberOfInstances])
34, Generating copy(vecOfPointers->data[:])
Accelerator region ignored
39, Accelerator restriction: call to ‘std::vector<double, std::allocator>::vector(const std::vector<double, std::allocator>&)’ with no acc routine information
std::vector<double, std::allocator>::operator (unsigned long):
1, include “vector”
57, include “vector”
7, include “stl_vector.h”
780, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
PGCC/x86 Linux 18.10-1: compilation completed with severe errors

I compiled with PGC version 18.1:

pgc++ -std=c++11 -fPIC -ta=host,tesla:managed,cc35,cc60,time,cuda10.0 -acc -Minfo test.cpp -o test.xc

The code test.cpp is here:

#include <vector>

using myType = std::vector<double>;

struct myType_p{
  myType * data;
};

#pragma acc routine worker
void dosomething(myType vec, int N)
{
  #pragma acc loop worker vector
  for(int i = 0; i<N; ++i) vec[i]=1;
}

int main(int argc, char* argv[]){

int numberOfInstances = 4;     // usually about 10k to 100k 
int sizes[]={108,119,103,101}; // representative values in some range [a,b]

myType_p * vecOfPointers = new myType_p[numberOfInstances];
  
  for(int ii = 0; ii < numberOfInstances; ++ii)
  {
    vecOfPointers[ii].data = new myType[1];
    vecOfPointers[ii].data->resize(sizes[ii]);
  }
  
  #pragma acc data copy(sizes,vecOfPointers[0:numberOfInstances])
  {
    for (int ii = 0; ii<numberOfInstances; ++ii)
    {
      #pragma acc data copy(vecOfPointers[ii].data)
    }
    
    // the work in this loop shall first be distributed in the grid on gang level :
    #pragma acc parallel loop
    for (int ii = 0; ii<numberOfInstances; ++ii)
    { // this acc routine then is (planned) to be executed on worker and vector level
      dosomething(vecOfPointers[ii].data[0],sizes[ii]); 
    }
  }
  
  for(int ii = 0; ii < numberOfInstances; ++ii)
  {
    delete vecOfPointers[ii].data;
  }
  
  return 0;
}

whereas the routine is declared and used as described in https://devblogs.nvidia.com/7-powerful-new-features-openacc-2-0/
What am I doing wrong? I’d be glad if I could get some help here. Thanks in advance!

If there is a better way to copy dynamically allocated (nested) data to devices, please let me know. (I know about deep copy as in What’s new in OpenACC 2.6? | OpenACC. However, I didn’t find a solution to do this without an extra ‘copy’-for loop.)

Hi Aaron K.,

Since you’re passing “vec” by value, this causes the vector’s copy constructor to be invoked. If you change the code to pass “vec” by reference, the copy constructor wont be needed on the device and you can avoid this error.

Note using vectors on the device can be a bit tricky. Besides not being thread-safe, they are comprised of three pointers which are difficult to manually manage the data movement. Hence, I highly suggest you use CUDA Unified Memory (-ta=tesla:managed). You code as is (after fixing the pass by reference issue and giving “data” a size in the copy clause), will get an illegal address error since you’re not copying the data in correctly. Using “managed” will help simplify your code will allow the program to run successfully.

Hope this helps,
Mat

Hi Mat,
thank you for the reply!

I changed the following:

  1. Added member ‘int n’ to struct myType_p
  2. Now calling ‘vec’ by reference in ‘dosomething()’
  3. added size to ‘data’ in copy construct: ‘copy(vecOfPointers[ii].data[0:vecOfPointers[ii].n])’

It compiles and runs without errors:

$ pgc++ -std=c++11 -fPIC -ta=host,tesla:managed,cc35,cc60,time,cuda10.0 -acc -Minfo test.cpp -o test.xc
dosomething(std::vector<double, std::allocator> &, int):
12, Generating Tesla code
14, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x /
14, Loop is parallelizable
main:
31, Generating copy(sizes[:],vecOfPointers[:numberOfInstances])
35, Generating copy(vecOfPointers->data[:vecOfPointers->n])
Accelerator kernel generated
Generating Tesla code
38, #pragma acc loop gang /
blockIdx.x */
std::vector<double, std::allocator>::operator (unsigned long):
1, include “vector”
57, include “vector”
7, include “stl_vector.h”
780, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code

$ ./test.xc

Accelerator Kernel Timing data
/usr/local/home/kraemer/offload/rhsOffloading/test.cpp
main NVIDIA devicenum=0
time(us): 293
31: data region reached 2 times
31: data copyin transfers: 1
device time(us): total=17 max=17 min=17 avg=17
42: data copyout transfers: 1
device time(us): total=43 max=43 min=43 avg=43
35: compute region reached 1 time
35: kernel launched 1 time
grid: [4] block: [32x8]
device time(us): total=233 max=233 min=233 avg=233
elapsed time(us): total=545 max=545 min=545 avg=545
35: data region reached 8 times

I have used CUDA UM already before. I assume, that formally I am still not handling the data movement correctly. However, since I use UM, the compiler probably ignores the data clauses and does the job on its own (correctly), does it? If I remove all (the two) ‘data copy’ directives, then it runs as well. The only difference is, that during compiling there is only a
Generating implicit copyin(vecOfPointers[:4],sizes[:])
without the second copy as it was previously (above) under ‘main:\n 35,’. Does this mean that the data IS copied (or managed by CUDA Unified Memory) or that it is NOT?

Just in case it might help, here is again the whole code (data constructs commented out):

#include <vector>

using myType = std::vector<double>;

struct myType_p{
  int n;
  myType * data;
};

#pragma acc routine worker
void dosomething(myType & vec, int N)
{
  #pragma acc loop worker vector
  for(int i = 0; i<N; ++i) vec[i]=1.0;
}

int main(int argc, char* argv[]){

  int numberOfInstances = 4;     // usually about 10k to 100k 
  int sizes[]={108,119,103,101}; // representative values in some range [a,b]

  myType_p * vecOfPointers = new myType_p[numberOfInstances];
  
  for(int ii = 0; ii < numberOfInstances; ++ii)
  {
    vecOfPointers[ii].data = new myType[1];
    vecOfPointers[ii].data->resize(sizes[ii]);
  }
  
//   #pragma acc data copy(sizes,vecOfPointers[0:numberOfInstances])
//   {
//    for (int ii = 0; ii<numberOfInstances; ++ii)
//    {
//      #pragma acc data copy(vecOfPointers[ii].data[0:vecOfPointers[ii].n])
//    }
  
    #pragma acc parallel loop
    for (int ii = 0; ii<numberOfInstances; ++ii)
    {
      dosomething(vecOfPointers[ii].data[0],sizes[ii]);
    }
//   }
  
  for(int ii = 0; ii < numberOfInstances; ++ii)
  {
    delete vecOfPointers[ii].data;
  }
  
  return 0;
}

Thanks again!
Aaron

Hi Aaron,

However, since I use UM, the compiler probably ignores the data clauses and does the job on its own (correctly), does it?

It’s actually the CUDA driver that handles the UM data movement, not the compiler runtime. Also, it’s not that the compiler runtime ignores data clauses, but rather tests if the data is managed (not all data is managed). If so then it passes the unified address to the generated kernel.

The only difference is, that during compiling there is only a
Generating implicit copyin(vecOfPointers[:4],sizes[:])

In the absence of an explicit data region, the compiler will need to implicitly add a data region. At runtime, the data is checked if the data is managed since just because you compile with “-ta=tesla:managed”, the compiler can’t assume that this particular variable is actually managed. Really all “-ta=tesla:manged” does is replace allocation (malloc/new) with cudaMallocManaged, the underlying data management system remains the same.

-Mat