Vector push_back in Thrust device code and strange behavior of nvc++/stdpar

I am trying to develop a code that assigns a GPU thread to each seed point as a starting point for a Depth-First Search (DFS) on each thread. There are two types of problem I am developing this code for: one is where there are (only) few thousand seed points (‘outlet nodes’), each incurring a long-range DFS (throughout the whole domain), while the other pertains to cases with ten to hundred thousands of seed points (internal nodes) but with relatively short-range searches (they finish as soon as they reach a cluster previously labeled as ‘connected’ to the outlet). My thought is that second type is more suitable for utilizing GPU threads (?) – I appreciate any insight into the logic here.

I started with an initial implementation on CPU threads using OpenMP, which required creating a temporary ‘stack’ C++ vector private to each thread, which mainly entails push_back and pop_back operations. I converted this to a new array-based implementation that would work on OpenACC, in which arrays of size N (with N as the entire size of the network) are predefined on the host (unlike before when stack vector size didn’t need to be defined) and then brought into parallel region by

#pragma acc data create(stack[0:stack_length])

#pragma acc kernels parallel loop gang(1024) vector(1) independent private (stack[0:stack_length])

The problem is that defining the stack ‘array’ size as N is an overkill since the length of the stack ‘vector’ would actually never become that large – especially in the 2nd type of the problem with short-range searches – and having many GPU threads each carrying an array as big is clearly problematic. Further, there seems to be no way of knowing a priori any safe, smaller bounds to this array size (?)
I have tried as much as I could to change the algorithm to one that doesn’t need such stack objects (e.g., using global/shared parent-child tree-like structure that keep track of the trajectory, but it fails due to race condition if run on multiple threads even after using atomic operations).

One current possibility I am investigating now is using Thrust library because, on the first glimpse, it promises to resemble C++ standard library and it evidently does support push_back on both thrust::host_vector and thrust::device_vector. However, I believe I learned it after a while – thanks to @MatColgrove and his insights – that this sort of push_back on device_vector is still a host/CPU code (containers) and only works on the host side in the eye of the compiler, and NOT necessarily in device code as I was hoping for.

Having said that, I am still puzzled by the strange, dual behavior of the code written below, where a regular STD vector, unlike either thrust::host_vector or thrust::device_vector can be declared – to my surprise – inside a device code (thrust::for_each, which should be of the same nature as the prohibited CUDA Kernel, right?) BUT the push_back only works if there are 6 of them at most (!?!) or if the size of the inner loop is no bigger than 3!!! Otherwise, there will be a compilation error that is outputted as comments! In no way can I make sense of this. On the one hand, something that’s not supposed to work works, but on the other hand it only works up to a certain point! I will appreciate if I can be helped here with this problem specifically and with my entire project if possible. Thanks.

.cpp code (please see the comments therein):

#include <stdio.h>
#include <thrust/iterator/counting_iterator.h>
#include <algorithm>
#include <execution>
#include <vector>
#include <unistd.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/for_each.h>
#include <thrust/device_vector.h>

struct Dfs_Data {
public:

 	Dfs_Data() {m_test = 12;}

 	~Dfs_Data(){}

	int test{15};
	int get_test() {return m_test;}

private:

 	int m_test;
};




int main(){

 Dfs_Data *data;
 data = new Dfs_Data();
 Dfs_Data& data_ref = *data;
 auto r = thrust::counting_iterator<int>(0);
 size_t limit = 0;

 cudaDeviceGetLimit(&limit, cudaLimitStackSize);
 printf("cudaLimitStackSize: %u\n", (unsigned)limit);
 cudaDeviceGetLimit(&limit, cudaLimitPrintfFifoSize);
 printf("cudaLimitPrintfFifoSize: %u\n", (unsigned)limit);
 cudaDeviceGetLimit(&limit, cudaLimitMallocHeapSize);
 printf("cudaLimitMallocHeapSize: %u\n", (unsigned)limit);

 std::cout << "default settings of cuda context" << std::endl;
    
 limit = 128*1024*1024;

 cudaDeviceSetLimit(cudaLimitStackSize, limit);
 cudaDeviceSetLimit(cudaLimitPrintfFifoSize, limit);
 cudaDeviceSetLimit(cudaLimitMallocHeapSize, limit);
    
 std::cout << "set limit to 128 MB for all settings" << std::endl;

 limit = 0;

 cudaDeviceGetLimit(&limit, cudaLimitStackSize);
 printf("New cudaLimitStackSize: %u\n", (unsigned)limit);
 cudaDeviceGetLimit(&limit, cudaLimitPrintfFifoSize);
 printf("New cudaLimitPrintfFifoSize: %u\n", (unsigned)limit);
 cudaDeviceGetLimit(&limit, cudaLimitMallocHeapSize);
 printf("New cudaLimitMallocHeapSize: %u\n", (unsigned)limit);
    
    
 
 thrust::for_each(thrust::device, r, r+1,
  [=, &data_ref](int it){
            

	//thrust::device_vector<size_t> stack; // If uncommented, the compilatior error: NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unsupported procedure
        
	std::vector<int> stack;
                               
	//stack.reserve(2); // If uncommented and have the 6 push_backs then reserving more than 2 results in the compilation error; \
	if the following push_backs get commented then this can be set to a very large value with no problem!!\
	Note: using stack.resize(1e9) will only work if heap memory size is set to 10*1024*1024*1024 at least...
        stack.push_back(1);
        stack.push_back(1);
        stack.push_back(1);
        stack.push_back(1);
        stack.push_back(1);
        stack.push_back(1);
        //stack.push_back(1); // if uncommented (and the next foor loop commented), I get compilattion error:\
        nvlink error   : Undefined reference to '_ZSt20__throw_length_errorPKc' in '/tmp/nvc++J8b3mPn4m_-Yd.o'\
        pgacclnk: child process exit status 2: /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/tools/nvdd                                                 
                               
	//for (int i=0; i<3; ++i) // if changed to i<4 and all the above push_backs are commented, then the same compilation error!!
	//{
	// stack.push_back(i);
	//}
                            
 });


// Same outcome if using nvc++ -stdpar and the following code 
 /*  
 std::for_each(std::execution::par_unseq, r, r+10,
                        [=, &data_ref](int it){
            
	std::vector<size_t> stack;
	stack.push_back(1);
	stack.push_back(1);
	stack.push_back(1);
	stack.push_back(1);
	stack.push_back(1);
	stack.push_back(1);
	//stack.push_back(1); // if uncommented, I get compilattion error:\
	nvlink error   : Undefined reference to '_ZSt20__throw_length_errorPKc' in '/tmp/nvc++BR43mrxPB2Yyf.o'\
	pgacclnk: child process exit status 2: /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/tools/nvdd   
                 
        });
        */
 return 0;
}

It can be compiled with nvc++ -stdpar or nvc++ -cuda
I am using nvc++ (HPC-SDK) 21.9 with cuda 11.4.

ps1: can this be related to the limited cudaLimitStackSize which seems to not change beyond 1024 byte (hardware limit?). I assume defining the vector as such pertains to only heap memory which seems I can plummet it to ~18 gigabytes on an A100-80Gb GPU.
ps2: using a struct was an exercise here for my effort in passing a struct of global/shared vectors that I can then unravel and work with inside the thrust::for_each in my main program. I think this is the way to work with thrust if one has more several global vectors that each thread needs access to in order to execute their part…