Vector push_back in Thrust device code and strange behavior of nvc++/stdpar

aminamooie · January 18, 2022, 11:45pm

I am trying to develop a code that assigns a GPU thread to each seed point as a starting point for a Depth-First Search (DFS) on each thread. There are two types of problem I am developing this code for: one is where there are (only) few thousand seed points (‘outlet nodes’), each incurring a long-range DFS (throughout the whole domain), while the other pertains to cases with ten to hundred thousands of seed points (internal nodes) but with relatively short-range searches (they finish as soon as they reach a cluster previously labeled as ‘connected’ to the outlet). My thought is that second type is more suitable for utilizing GPU threads (?) – I appreciate any insight into the logic here.

I started with an initial implementation on CPU threads using OpenMP, which required creating a temporary ‘stack’ C++ vector private to each thread, which mainly entails push_back and pop_back operations. I converted this to a new array-based implementation that would work on OpenACC, in which arrays of size N (with N as the entire size of the network) are predefined on the host (unlike before when stack vector size didn’t need to be defined) and then brought into parallel region by

#pragma acc data create(stack[0:stack_length])

#pragma acc kernels parallel loop gang(1024) vector(1) independent private (stack[0:stack_length])

The problem is that defining the stack ‘array’ size as N is an overkill since the length of the stack ‘vector’ would actually never become that large – especially in the 2nd type of the problem with short-range searches – and having many GPU threads each carrying an array as big is clearly problematic. Further, there seems to be no way of knowing a priori any safe, smaller bounds to this array size (?)
I have tried as much as I could to change the algorithm to one that doesn’t need such stack objects (e.g., using global/shared parent-child tree-like structure that keep track of the trajectory, but it fails due to race condition if run on multiple threads even after using atomic operations).

One current possibility I am investigating now is using Thrust library because, on the first glimpse, it promises to resemble C++ standard library and it evidently does support push_back on both thrust::host_vector and thrust::device_vector. However, I believe I learned it after a while – thanks to @MatColgrove and his insights – that this sort of push_back on device_vector is still a host/CPU code (containers) and only works on the host side in the eye of the compiler, and NOT necessarily in device code as I was hoping for.

Having said that, I am still puzzled by the strange, dual behavior of the code written below, where a regular STD vector, unlike either thrust::host_vector or thrust::device_vector can be declared – to my surprise – inside a device code (thrust::for_each, which should be of the same nature as the prohibited CUDA Kernel, right?) BUT the push_back only works if there are 6 of them at most (!?!) or if the size of the inner loop is no bigger than 3!!! Otherwise, there will be a compilation error that is outputted as comments! In no way can I make sense of this. On the one hand, something that’s not supposed to work works, but on the other hand it only works up to a certain point! I will appreciate if I can be helped here with this problem specifically and with my entire project if possible. Thanks.

.cpp code (please see the comments therein):

#include <stdio.h>
#include <thrust/iterator/counting_iterator.h>
#include <algorithm>
#include <execution>
#include <vector>
#include <unistd.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/for_each.h>
#include <thrust/device_vector.h>

struct Dfs_Data {
public:

 	Dfs_Data() {m_test = 12;}

 	~Dfs_Data(){}

	int test{15};
	int get_test() {return m_test;}

private:

 	int m_test;
};




int main(){

 Dfs_Data *data;
 data = new Dfs_Data();
 Dfs_Data& data_ref = *data;
 auto r = thrust::counting_iterator<int>(0);
 size_t limit = 0;

 cudaDeviceGetLimit(&limit, cudaLimitStackSize);
 printf("cudaLimitStackSize: %u\n", (unsigned)limit);
 cudaDeviceGetLimit(&limit, cudaLimitPrintfFifoSize);
 printf("cudaLimitPrintfFifoSize: %u\n", (unsigned)limit);
 cudaDeviceGetLimit(&limit, cudaLimitMallocHeapSize);
 printf("cudaLimitMallocHeapSize: %u\n", (unsigned)limit);

 std::cout << "default settings of cuda context" << std::endl;
    
 limit = 128*1024*1024;

 cudaDeviceSetLimit(cudaLimitStackSize, limit);
 cudaDeviceSetLimit(cudaLimitPrintfFifoSize, limit);
 cudaDeviceSetLimit(cudaLimitMallocHeapSize, limit);
    
 std::cout << "set limit to 128 MB for all settings" << std::endl;

 limit = 0;

 cudaDeviceGetLimit(&limit, cudaLimitStackSize);
 printf("New cudaLimitStackSize: %u\n", (unsigned)limit);
 cudaDeviceGetLimit(&limit, cudaLimitPrintfFifoSize);
 printf("New cudaLimitPrintfFifoSize: %u\n", (unsigned)limit);
 cudaDeviceGetLimit(&limit, cudaLimitMallocHeapSize);
 printf("New cudaLimitMallocHeapSize: %u\n", (unsigned)limit);
    
    
 
 thrust::for_each(thrust::device, r, r+1,
  [=, &data_ref](int it){
            

	//thrust::device_vector<size_t> stack; // If uncommented, the compilatior error: NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unsupported procedure
        
	std::vector<int> stack;
                               
	//stack.reserve(2); // If uncommented and have the 6 push_backs then reserving more than 2 results in the compilation error; \
	if the following push_backs get commented then this can be set to a very large value with no problem!!\
	Note: using stack.resize(1e9) will only work if heap memory size is set to 10*1024*1024*1024 at least...
        stack.push_back(1);
        stack.push_back(1);
        stack.push_back(1);
        stack.push_back(1);
        stack.push_back(1);
        stack.push_back(1);
        //stack.push_back(1); // if uncommented (and the next foor loop commented), I get compilattion error:\
        nvlink error   : Undefined reference to '_ZSt20__throw_length_errorPKc' in '/tmp/nvc++J8b3mPn4m_-Yd.o'\
        pgacclnk: child process exit status 2: /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/tools/nvdd                                                 
                               
	//for (int i=0; i<3; ++i) // if changed to i<4 and all the above push_backs are commented, then the same compilation error!!
	//{
	// stack.push_back(i);
	//}
                            
 });


// Same outcome if using nvc++ -stdpar and the following code 
 /*  
 std::for_each(std::execution::par_unseq, r, r+10,
                        [=, &data_ref](int it){
            
	std::vector<size_t> stack;
	stack.push_back(1);
	stack.push_back(1);
	stack.push_back(1);
	stack.push_back(1);
	stack.push_back(1);
	stack.push_back(1);
	//stack.push_back(1); // if uncommented, I get compilattion error:\
	nvlink error   : Undefined reference to '_ZSt20__throw_length_errorPKc' in '/tmp/nvc++BR43mrxPB2Yyf.o'\
	pgacclnk: child process exit status 2: /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/tools/nvdd   
                 
        });
        */
 return 0;
}

It can be compiled with nvc++ -stdpar or nvc++ -cuda
I am using nvc++ (HPC-SDK) 21.9 with cuda 11.4.

ps1: can this be related to the limited cudaLimitStackSize which seems to not change beyond 1024 byte (hardware limit?). I assume defining the vector as such pertains to only heap memory which seems I can plummet it to ~18 gigabytes on an A100-80Gb GPU.
ps2: using a struct was an exercise here for my effort in passing a struct of global/shared vectors that I can then unravel and work with inside the thrust::for_each in my main program. I think this is the way to work with thrust if one has more several global vectors that each thread needs access to in order to execute their part…

Topic		Replies	Views
Nvc++: undefined __kmpc_for_static_init_16 and Unexpected branch type nvc, nvc++ and nvfortran	7	345	April 2, 2024
Nvc++ & external CUDA-thrust conflicts for -stdpar offload nvc, nvc++ and nvfortran	5	447	December 12, 2022
Thrust functor return array; code optimization CUDA Programming and Performance	3	1108	December 27, 2019
CUDA thrust compilation error CUDA Programming and Performance	84	292	March 21, 2025
Cuda application crashes works fine for small data and crashes for big data CUDA Programming and Performance	3	412	October 12, 2021
Problem with OpenAcc and CPP STL nvc, nvc++ and nvfortran cuda	17	716	January 26, 2024
Cuda application crashes works fine for small data and crashes for big data CUDA Developer Tools	0	361	December 8, 2020
Cuda code performance CUDA Programming and Performance	14	3133	December 16, 2014
Problem with vector comparison CUDA Programming and Performance cuda	7	522	October 27, 2022
Reading from float4 corrupts memory CUDA Programming and Performance cuda	18	47	October 3, 2024

Vector push_back in Thrust device code and strange behavior of nvc++/stdpar

Related topics