Hey everybody,
I have a minimal code snippet that I can’t get to run. I want to accelerate a kernel that applies the same algorithm to thousands of small vectors. Ideally, I’d store the data in a standard vector of standard vectors, but I couldn’t figure out how to move that data to the device. That’s why I came up with this unwieldy struct version. However, the actual problem is the error “Procedures called in a compute region must have acc routine information”; the full compiler output is below:
dosomething(std::vector<double, std::allocator>, int):
11, Generating Tesla code
13, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
13, Loop is parallelizable
PGCC-S-0155-Procedures called in a compute region must have acc routine information: std::vector<double, std::allocator>::vector(const std::vector<double, std::allocator>&) (test.cpp: 39)
PGCC-S-0155-Accelerator region ignored; see -Minfo messages (test.cpp: 34)
main:
30, Generating copy(sizes[:],vecOfPointers[:numberOfInstances])
34, Generating copy(vecOfPointers->data[:])
Accelerator region ignored
39, Accelerator restriction: call to ‘std::vector<double, std::allocator>::vector(const std::vector<double, std::allocator>&)’ with no acc routine information
std::vector<double, std::allocator>::operator [](unsigned long):
1, include “vector”
57, include “vector”
7, include “stl_vector.h”
780, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
PGCC/x86 Linux 18.10-1: compilation completed with severe errors
I compiled with PGI version 18.10:
pgc++ -std=c++11 -fPIC -ta=host,tesla:managed,cc35,cc60,time,cuda10.0 -acc -Minfo test.cpp -o test.xc
The code test.cpp is here:
#include <vector>

using myType = std::vector<double>;

struct myType_p {
    myType * data;
};

#pragma acc routine worker
void dosomething(myType vec, int N)
{
    #pragma acc loop worker vector
    for (int i = 0; i < N; ++i) vec[i] = 1;
}

int main(int argc, char* argv[])
{
    int numberOfInstances = 4;          // usually about 10k to 100k
    int sizes[] = {108, 119, 103, 101}; // representative values in some range [a,b]
    myType_p * vecOfPointers = new myType_p[numberOfInstances];
    for (int ii = 0; ii < numberOfInstances; ++ii)
    {
        vecOfPointers[ii].data = new myType[1];
        vecOfPointers[ii].data->resize(sizes[ii]);
    }
    #pragma acc data copy(sizes, vecOfPointers[0:numberOfInstances])
    {
        for (int ii = 0; ii < numberOfInstances; ++ii)
        {
            #pragma acc data copy(vecOfPointers[ii].data)
        }
        // the work in this loop shall first be distributed in the grid on gang level:
        #pragma acc parallel loop
        for (int ii = 0; ii < numberOfInstances; ++ii)
        { // this acc routine is then (planned) to be executed on worker and vector level
            dosomething(vecOfPointers[ii].data[0], sizes[ii]);
        }
    }
    for (int ii = 0; ii < numberOfInstances; ++ii)
    {
        delete[] vecOfPointers[ii].data; // allocated with new[], so delete[]
    }
    delete[] vecOfPointers;
    return 0;
}
The routine is declared and used as described in https://devblogs.nvidia.com/7-powerful-new-features-openacc-2-0/.
What am I doing wrong? I’d be glad if I could get some help here. Thanks in advance!
If there is a better way to copy dynamically allocated (nested) data to the device, please let me know. (I know about manual deep copy as described in “What’s new in OpenACC 2.6?” on the OpenACC site; however, I didn’t find a way to do this without an extra ‘copy’ for loop.)