How to use std::vector within a loop in OpenACC?

I have the following code

int main(int argc, char** argv )
{
	std::vector< std::vector< std::vector<double> > > vec { 
		{{1,2},{3,4}, {5,6},{7,8}}, 
        {{9,10}, {11,12}}, 
        {{13,14}, {15,16}, {17,18}} };

	#pragma acc parallel loop
	for (int k = 0; k <3; k++) {

		std::vector<std::vector<double>>& vec2d = vec[k];
		int L = vec2d.size();

		//std::vector<int>dVec{67,51,1,0,50};
		std::vector<double>dVec(L, 0.0);

		for (int i = 0; i < L; i++)
		{
			dVec[i] = vec2d[i][1] - vec2d[i][0];
		}

		for (int j=0; j<2; j++) {
			printf("k: %d j: %d vec0: %f, vec1: %f\n", k, j, vec2d[j][0], vec2d[j][1]);
		}
	}
	std::cout<<"finished\n";

    return 0;
}

and I compile with pgc++ -fast -ta=tesla:cuda9.2,managed -o runEx runEx.cpp -std=c++17 && ./runEx

If I comment out the #pragma acc parallel loop, then it works. But if I leave it in, I get the error:

PGCC-S-0155-Procedures called in a compute region must have acc routine information: operator delete (void *) (runEx.cpp: 425)
PGCC-S-0155-Accelerator region ignored; see -Minfo messages  (runEx.cpp: 6)
PGCC/x86-64 Linux 19.10-0: compilation completed with severe errors

Also, if I comment out the std::vector<int>dVec and the for loop containing it, then the code works even with the #pragma acc parallel loop.

However, if I change the loop so it becomes just:

#pragma acc parallel loop
for (int k = 0; k <3; k++) {
	std::vector<int>dVec{67,51,1,0,50};
}

then I get the same error.

Why is this?

In order to run on the device, every routine and method called from a compute region must have a device-callable version available. The compiler will try to create these routines implicitly, provided that the routine’s definition is visible. Where the definition is not visible, the user must use the OpenACC “routine” directive to create the device routine.

Here you have a std::vector whose constructor and destructor are implicitly called. My assumption is that the “delete” operator is coming from the vector’s destructor. Since the compiler can’t find a definition for this operator, it can’t implicitly create a device version, and I doubt you’ll be able to manually add a “routine” directive to the vector.

In general, support for using vectors (as well as other STD container types) is very limited on the device. Vectors are not thread safe, so I generally recommend setting up the vector on the host and then only using its access operator on the device.

Secondly, while there is limited support, allocation from within device code is not recommended. Besides being serialized, which adversely impacts performance, the device-side heap is quite small (default 32MB), so it’s easy to crash programs due to heap overflows. While the heap can be increased, it’s still not advisable to perform device-side allocation.

While this may or may not work for your algorithm, I would make dVec a “double *”, allocate it before the “k” loop to the max size of vec2d, then put dVec in a “private” clause on the parallel loop. Alternatively, you can declare dVec as a fixed size double array within the body of the loop, though you would need to know the max size at compile time in this case.