Problem with OpenACC and C++ STL

I am working on converting a large codebase written in C++ to run on the GPU with OpenACC. My code uses std::vector, so I am relying on unified (managed) memory to handle the data transfers automatically. However, I keep getting errors such as the following:

NVC++-S-1061-Procedures called in a compute region must have acc routine information - std::__throw_bad_alloc() (/PATH_TO_CPP_FILE/main.cpp: 102)
std::_Vector_base<double, std::allocator<double>>::_M_allocate(unsigned long):
         102, Accelerator restriction: call to 'std::__throw_bad_alloc()' with no acc routine information
NVC++/x86-64 Linux 23.11-0: compilation completed with severe errors
make[2]: *** [CMakeFiles/PROJECT.dir/main.cpp.o] Error 2
make[1]: *** [CMakeFiles/PROJECT.dir/all] Error 2
make: *** [all] Error 2

Another issue is that, as you can see in the error message, the reported location is main.cpp: 102, but there is no code on line 102 of that file, so the location is not correct. For security reasons I cannot share the code, but the main issues are:

  1. the locations reported in the error messages do not seem correct
  2. OpenACC doesn’t seem to work with std::vector, even though I am using unified/managed memory

Update: I think I know what the problem is. I made a sample code that just uses std::vector in an OpenACC parallel loop, and it seems the push_back() call is what triggers this error.

Is there any workaround for this? Also, how do I fix the nvc++ error messages, since they do not show the correct line number for the issue? Here is an example:

// @file      saxypy.cpp
// @author    Ignacio Vizzo     [ivizzo@uni-bonn.de]
//
// Copyright (c) 2020 Ignacio Vizzo, all rights reserved
#include <cassert>
#include <iostream>
#include <vector>
#include <accelmath.h>

#include "device_vector.h"
#include "timer.h"

device_vector<float> saxpy(const device_vector<float> &x,
						   const device_vector<float> &y,
						   float a)
{
	assert(x.size() == y.size());
	device_vector<float> z(x.size());
#pragma acc parallel loop
	for (int i = 0; i < y.size(); ++i)
	{
		z[i] = a * x[i] + y[i];
	}
	return z;
}

int main()
{
	Timer t;
	const int N = 1 << 20;
	const float a = 1.0F;
	const device_vector<float> x(N, 1.0F);
	const device_vector<float> y(N, 1.0F);

	auto z = saxpy(x, y, a);
	float sum = 0;

	std::vector<double> darr;

#pragma acc parallel loop reduction(+ : sum) copy(sum)
	for (int i = 0; i < z.size(); ++i)
	{
		sum += pow(z[i], .5);
		darr.push_back(4.5); // push_back in the compute region is what triggers the error below
	}

	std::cout << "Final result = " << sum << std::endl;
	return 0;
}

Error message:

NVC++-S-1061-Procedures called in a compute region must have acc routine information - std::__throw_bad_alloc() (/home/016950414/test_openacc/main.cpp: 102)
std::_Vector_base<double, std::allocator<double>>::_M_allocate(unsigned long):
         102, Accelerator restriction: call to 'std::__throw_bad_alloc()' with no acc routine information
NVC++/x86-64 Linux 23.11-0: compilation completed with severe errors
make[2]: *** [CMakeFiles/a.out.dir/main.cpp.o] Error 2
make[1]: *** [CMakeFiles/a.out.dir/all] Error 2
make: *** [all] Error 2

As you can see, the message says the issue is on main.cpp: 102, but main.cpp only has 49 lines of code.

I have also attached the other header files I am using (the forum does not allow uploading .h files, so I added a .txt extension to them; just remove the .txt extension to run the code):
timer.h.txt (2.9 KB)
device_vector.h.txt (1.2 KB)

Using vector is problematic, and you definitely don’t want to do pushes or pops on the device. Under the hood, these can cause reallocations which are not thread-safe and would cause the host and device data structures to diverge. Device-side allocation is permissible, but not recommended, as it only lasts for the single kernel and wouldn’t be reflected back to the host. Allocation should really only be done from the host.

So while you can use vectors in device code, you’re restricted to just the access operator, i.e. operator[].

The actual error is due to the hidden exception handler in the allocator called from push_back. Exception handling is not supported in device code.
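
Something along these lines is the safe pattern (an untested sketch; the scale function and its names are just placeholders, assuming compilation with -acc -gpu=managed):

#include <cstddef>
#include <vector>

void scale(std::vector<double> &v, double a)
{
    // Allocation and resizing happen on the host only.
    std::vector<double> out(v.size());

    // Inside the compute region, only operator[] is used: no push_back,
    // no resize, nothing that can reallocate or throw.
    #pragma acc parallel loop
    for (std::size_t i = 0; i < v.size(); ++i)
    {
        out[i] = a * v[i];
    }

    // Back on the host, any vector operation is fine again.
    v.swap(out);
}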

-Mat

I believe I found the source of the error: I am trying to find the median of an array of data. This is the code:

double Model::getMedian(void) const
{
	vector<double> temp;
	for (int i = 0; i<m_Size; i++)
	{
		temp.push_back(m_data[i]);
	}
	std::sort(temp.begin(), temp.end());
	
	return(temp[(int)temp.size() / 2]);
}

Is there any way to do this inside an OpenACC parallel loop directive?

My first thought would be to size “temp” initially to the same size as “m_data”, then use assignment “temp[i] = m_data[i]” to do the copy. I’d also use the C++17 parallel sort so that the sort is done on the GPU as well and the full “temp” doesn’t get copied back to the host.

I tried this, but I am getting a whole list of errors when I run the code using std::execution::par.

To use this, I just have to compile with C++17 and add #include <execution>, right? I have done this and it still doesn’t work.

I have done this and it still doesn’t work

What errors are you getting? Compile errors, runtime, verification?

If you haven’t already, to enable STDPAR you also need to add the “-stdpar” flag, which also implies “-gpu=managed”; you’ll need that as well.

Code:

double Model::getMedian(void) const
{
	vector<double> temp(m_Size);
	for (int i = 0; i<m_Size; i++)
	{
		temp[i] = m_data[i];
	}
    std::sort(std::execution::par, temp.begin(), temp.end());
	return temp[m_Size/ 2];
}

I also put the routine pragma above the function declaration in the class header file:

    #pragma acc routine seq
	double getMedian(void) const;

Errors I am getting:

NVC++-S-1062-Support procedure called within a compute region - llvm.expect (/home/016950414/<PROJECT>/<FILENAME>.cpp: 240)
std::basic_string<char, std::char_traits<char>, std::allocator<char>>::~basic_string():
         240, Accelerator restriction: unsupported call to support routine 'llvm.expect'
NVC++/x86-64 Linux 23.11-0: compilation completed with severe errors
make[2]: *** [CMakeFiles/<PROJECT>.dir/<FILENAME>.cpp.o] Error 2
make[1]: *** [CMakeFiles/<PROJECT>.dir/all] Error 2
make: *** [all] Error 2

This error seems to be related to strings, but somehow it only shows up when I try to add this getMedian() function.

Compiling using these flags in CMake:

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -w -fast -stdpar -acc -gpu=managed -Minfo=accel")

@MatColgrove Just checking again, is there anything I am doing wrong here?

I tried to recreate the error, but it seems to compile fine for me, so there’s likely something else going on. Can you please post a reproducing example?

Here’s my attempt to recreate the error:

% cat test2.cpp
#include <vector>
#include <algorithm>
#include <execution>

using namespace std;

class Median
{
public:
    double getMedian(void);
    vector<double> m_data;
    int m_Size;

};

double Median::getMedian(void)
{
        vector<double> temp(m_Size);
        for (int i = 0; i<m_Size; i++)
        {
                temp[i] = m_data[i];
        }
        std::sort(std::execution::par, temp.begin(), temp.end());
        return temp[m_Size/ 2];
}
% nvc++ -c -fast -stdpar -acc -gpu=managed --std=c++17 test2.cpp
%

Here is a smaller example that gives me the same error:

main.cpp

#include <vector>
#include <algorithm>
#include <execution>
#include <cstdio>
#include <cstdlib>

using namespace std;

class Test
{
public:
	Test(size_t size) : m_data(size), m_Size(size)
	{
		for (size_t i = 0; i < m_Size; ++i)
		{
			m_data[i] = rand() % m_Size;
		}
	}

	double getMedian()
	{
		vector<double> temp(m_Size);
		for (int i = 0; i < m_Size; i++)
		{
			temp[i] = m_data[i];
		}
		std::sort(std::execution::par, temp.begin(), temp.end());
		return temp[m_Size / 2];
	}

	void print()
	{
		for (int i = 0; i < m_Size; i++)
		{
			printf("%d, ", m_data[i]);
		}
		printf("\nMedian = %lf\n", getMedian());
	}

private:
	vector<double> m_data;
	size_t m_Size;
};

int main()
{
	Test t(100'000);

	#pragma acc parallel loop
	for (int i = 0; i < 1'000'000; ++i)
	{
		if (i % 1'000 == 0)
		{
			t.print();
		}
	}

	return EXIT_SUCCESS;
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.10)

project (
    TestOpenAcc
    VERSION 1.0.0
    LANGUAGES CXX CUDA
)

# Set build type
set(CMAKE_BUILD_TYPE Release)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
set(CMAKE_CXX_STANDARD 17)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -w -fast -stdpar -acc -gpu=managed -Minfo=accel")

# Add the CPP files to the executable
add_executable(a.out 
                ${CMAKE_SOURCE_DIR}/main.cpp
)

output

[ 50%] Building CXX object CMakeFiles/a.out.dir/main.cpp.o
main:
     47, Generating NVIDIA GPU code
         33, #pragma acc loop seq
         50, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     47, Generating implicit copy(t) [if not already present]
Test::getMedian():
     21, Generating implicit acc routine seq
Test::print():
     32, Generating implicit acc routine seq
         Generating acc routine seq
         Generating NVIDIA GPU code
void std::_Destroy_aux<true>::__destroy<decltype(std::allocator<double>::pointer std::allocator_traits<std::allocator<double>>::_S_pointer_helper<std::allocator<double>>(std::allocator<double>*)((std::allocator<double>*)0))>(decltype(std::allocator<double>::pointer std::allocator_traits<std::allocator<double>>::_S_pointer_helper<std::allocator<double>>(std::allocator<double>*)((std::allocator<double>*)0)), decltype(std::allocator<double>::pointer std::allocator_traits<std::allocator<double>>::_S_pointer_helper<std::allocator<double>>(std::allocator<double>*)((std::allocator<double>*)0))):
      1, include "vector"
          62, include "stl_construct.h"
              112, Generating implicit acc routine seq
                   Generating acc routine seq
                   Generating NVIDIA GPU code
void std::__uninitialized_default_n_1<true>::__uninit_default_n<decltype(std::allocator<double>::pointer std::allocator_traits<std::allocator<double>>::_S_pointer_helper<std::allocator<double>>(std::allocator<double>*)((std::allocator<double>*)0)), unsigned long>(decltype(std::allocator<double>::pointer std::allocator_traits<std::allocator<double>>::_S_pointer_helper<std::allocator<double>>(std::allocator<double>*)((std::allocator<double>*)0)), unsigned long):
      1, include "vector"
          63, include "stl_uninitialized.h"
              511, Generating implicit acc routine seq
                   Generating acc routine seq
                   Generating NVIDIA GPU code
std::string::_M_data() const:
      2, include "algorithm"
          10, include "algorithm"
               62, include "stl_algo.h"
                    65, include "random"
                         41, include "string"
                              52, include "basic_string.h"
                                  293, Generating implicit acc routine seq
                                       Generating acc routine seq
                                       Generating NVIDIA GPU code
std::string::_M_rep() const:
      2, include "algorithm"
          10, include "algorithm"
               62, include "stl_algo.h"
                    65, include "random"
                         41, include "string"
                              52, include "basic_string.h"
                                  301, Generating implicit acc routine seq
                                       Generating acc routine seq
                                       Generating NVIDIA GPU code
NVC++-S-1062-Support procedure called within a compute region - llvm.expect (/home/016950414/test_openacc/main.cpp: 240)
std::basic_string<char, std::char_traits<char>, std::allocator<char>>::~basic_string():
         240, Accelerator restriction: unsupported call to support routine 'llvm.expect'
NVC++/x86-64 Linux 23.11-0: compilation completed with severe errors
make[2]: *** [CMakeFiles/a.out.dir/main.cpp.o] Error 2
make[1]: *** [CMakeFiles/a.out.dir/all] Error 2
make: *** [all] Error 2

Using the following HPC and CMake:

module load cmake-3.25.1-gcc-10.2.0-sgz3pso
module load nvhpc/23.11

It’s the “t.print()” in the OpenACC parallel region. I/O on the device is very limited and does not support iostream-style formatted printing or strings. You’re limited to using “printf”.

Remove the “parallel loop” in the main routine and it will compile and run.

Oh, I see that “print” is a method in the class, not part of the vector, and does use printf.

Though it looks like, as part of the vector code, a string type is getting included. I think I may have mentioned this earlier: you need to build the vector on the host and can only use the access operator on the device. Here “temp” is getting created on the device.

Also, you can’t call a STDPAR routine from device code, so you will still want to remove the “parallel loop” in main.

Though, you can add it to the “i” loop in getMedian to offload this section:

        double getMedian()
        {
                vector<double> temp(m_Size);
                #pragma acc parallel loop
                for (int i = 0; i < m_Size; i++)
                {
                        temp[i] = m_data[i];
                }
                std::sort(std::execution::par, temp.begin(), temp.end());
                return temp[m_Size / 2];
        }

@MatColgrove So, I can’t use printf with string formatting inside parallel regions?

Also, if the stdpar sort needs to happen outside the parallel region, won’t there be data movement back and forth between the GPU and CPU (if I am doing this many times, such as in a timestep loop)?

Ideally, what I have is some class Model which has two main methods: update() and getMedian(). I need to run this Model class for a grid of points which are independent. For each point, though, the update method is called sequentially because it relies on time series data. So I can’t parallelize along the time axis, but I should be able to run the grid of models in parallel, right? For that, I would need to run the update() and getMedian() functions inside the parallel region for each grid point.
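
Roughly, the structure I’m after looks something like this (just a sketch; runGrid, nCells, nSteps, models, and medians are placeholders, and the real update() takes the time series input for each step):

void runGrid(Model *models, double *medians, int nCells, int nSteps)
{
    #pragma acc parallel loop
    for (int c = 0; c < nCells; ++c)            // independent grid points in parallel
    {
        for (int t = 0; t < nSteps; ++t)        // time stepping has to stay sequential
            models[c].update();                 // placeholder call; the real signature differs
        medians[c] = models[c].getMedian();     // this is where the median/sort is needed
    }
}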

@MatColgrove Just checking in on my previous question. Is there a way to sort a list of numbers inside a parallel region? The list of numbers is relatively short (probably below 100 elements), but that is just for one grid cell; I have multiple cells in a grid that I want to run in parallel.

Sorry I missed your follow-up post. Yes, you can use printf. It was my mistake to assume “t.print” was the issue; as I followed up, it’s actually the construction of the vector that’s the problem.

To parallelize the outer loop, you’d need to hoist the declaration of “temp” outside of the parallel region and then privatize it. However, the compiler can’t privatize aggregate types with dynamic data members: it can only privatize the vector itself (i.e., the three pointers in the vector), not the data the pointers point to. Hence you might need to convert this to a regular C-style array. You can then put it in a “private” clause or manually privatize the array by adding another dimension.
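
A rough sketch of the manual-privatization idea (untested; processGrid, nCells, maxLen, and data are placeholder names, not from your code), assuming -gpu=managed handles the data movement:

void processGrid(const double *data, int nCells, int maxLen)
{
    // One scratch row per grid cell, allocated once on the host, so nothing
    // needs to be constructed or allocated inside the compute region.
    double *scratch = new double[(long)nCells * maxLen];

    #pragma acc parallel loop
    for (int c = 0; c < nCells; ++c)
    {
        double *temp = &scratch[c * maxLen];    // this cell's private slice
        for (int i = 0; i < maxLen; ++i)
            temp[i] = data[c * maxLen + i];
        // ... sort temp[0..maxLen) here with a simple device-callable sort ...
    }

    delete[] scratch;
}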

For “std::sort”, I’ve not tried using it in a parallel region. Often the STL routines have extra code in them, such as exception handling, that prevents them from being used in device code. So maybe std::sort (the base version, not the stdpar version) would work, but I doubt it. And if you convert temp to a C array, it wouldn’t work there either. Assuming you’re sorting the private array per thread, you may need to write a simple sort.
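
If it comes to writing one, something like an insertion sort marked as a sequential routine would be one option for lists this short (again, just an untested sketch):

// Simple per-thread sort that can be called from inside a compute region.
#pragma acc routine seq
inline void insertionSort(double *a, int n)
{
    for (int i = 1; i < n; ++i)
    {
        double key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key)   // shift larger elements one slot right
        {
            a[j + 1] = a[j];
            --j;
        }
        a[j + 1] = key;
    }
}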

By regular C-style array, do you mean one that is allocated on the stack, or one allocated on the heap with the new keyword? If it is allocated with new, isn’t that dynamic data too? Can that still be privatized in the loop?

For sorting, what about sorting done in Thrust? I also saw another post on the forum where you mentioned that -stdpar=gpu should be used. I didn’t use this; I just used -stdpar. Does plain -stdpar not enable sorting on the device? If so, my last option would be to write a simple iterative sort such as insertion or selection sort.

I’m meaning a pointer to malloc’d memory:

double *A;
A = (double *) malloc(size * sizeof(double));

For sorting, what about sorting done in Thrust?

Our implementation of the STDPAR version of std::sort is really the Thrust sort, so the same limitations apply, i.e. it can only be called from the host.