Use std::vector with OpenACC

Hi. I’m trying to compute, on the GPU using OpenACC, the sum of two std::vector<std::vector<float>> containers. As compiler I’m using GCC+NVPTX with OpenACC support, but when I compile the code with these flags:

g++ -fopenacc -foffload=nvptx-none -fopt-info-optimized-omp -g -std=c++17

I get: "array1 does not have pointer or array type" and "array2 does not have pointer or array type". Is there any way to use std::vector<std::vector<float>> with OpenACC?

This is a minimal reproducible example:

#include <vector>

int main(int argc, char **argv) {
        std::vector<std::vector<float>> array1, array2;
        float result[1000]={0.0};

        for(int i=0; i<1000; i++){
                std::vector<float> accumulator1, accumulator2;
                for (int j=0; j<1000; j++){
                        accumulator1.push_back(99.99);
                        accumulator2.push_back(66.66);
                }
                array1.push_back(accumulator1);
                array2.push_back(accumulator2);
        }

#pragma acc data copyin(array1[:1000][:1000],array2[:1000][:1000])
#pragma acc data copy(result[:1000])
#pragma acc parallel loop
        for(int i=0; i<1000; i++){
                for (int j=0; j<1000; j++){
                        result[i] += array1[i][j] + array2[i][j];
                }
        }
        return 0;
}

Compiling with GCC+NVPTX is mandatory for me, but trying to compile it with nvc++ also gives:

main:
18, Generating copyin(array1,array2) [if not already present]
Generating copy(result[:]) [if not already present]
Generating NVIDIA GPU code
23, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
24, #pragma acc loop seq
24, Complex loop carried dependence of prevents parallelization
Loop carried dependence of result prevents parallelization
Loop carried backward dependence of result prevents vectorization
std::vector<std::vector<float, std::allocator<float>>, std::allocator<std::vector<float, std::allocator<float>>>>::operator [](unsigned long):
3, include "vector"
64, include "stl_vector.h"
771, Generating implicit acc routine seq
Generating acc routine seq
Generating NVIDIA GPU code
std::vector<float, std::allocator<float>>::operator [](unsigned long):
3, include "vector"
64, include "stl_vector.h"
771, Generating implicit acc routine seq
Generating acc routine seq
Generating NVIDIA GPU code

But launching the application still gives back:

Failing in Thread:0
call to cuInit returned error 999: Unknown
Any advice? Thanks

Hi carbonara,

Yes, a std::vector can be used within an OpenACC offload region. However, it can be challenging if you’re not using CUDA Unified Memory (UM).

With nvc++, just add the flag “-gpu=managed” so the underlying allocators use UM (see below). However, I don’t know if g++ has a similar feature.

The challenge with vectors is that they are a class with three pointers as data members. Since data directives only perform shallow copies, you instead need to perform a manual deep copy.

A deep copy typically starts by copying the class itself, then copying each dynamically allocated data member, and finally "attaching" (i.e. setting the device pointer of each data member inside the device copy of the class). For more detailed information on manual deep copy, please see this article: Deep Copy Support in OpenACC | PGI
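To illustrate the pattern, here is a minimal sketch using a hypothetical Grid struct with accessible members (not a std::vector), assuming an OpenACC 2.6 compiler that attaches the member pointer when its section is copied:

#include <cstdio>

// Hypothetical aggregate with a dynamically allocated member, used only to
// illustrate the copy-then-attach pattern described above.
struct Grid {
    float *data;
    int    n;
};

int main() {
    Grid g;
    g.n    = 1000;
    g.data = new float[g.n];

#pragma acc enter data copyin(g)              // shallow copy of the struct itself
#pragma acc enter data copyin(g.data[0:g.n])  // copy the dynamic member and attach it to the device copy of g

#pragma acc parallel loop present(g)
    for (int i = 0; i < g.n; i++) {
        g.data[i] = 2.0f * i;
    }

#pragma acc exit data copyout(g.data[0:g.n])  // copy the member back and detach
#pragma acc exit data delete(g)

    printf("%f\n", g.data[10]);
    delete [] g.data;
    return 0;
}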

However, since the vector’s data members are private, they can’t be attached, so vectors can’t be used in a manual deep copy.

What folks end up having to do with vectors when not using UM is to grab a pointer to the underlying data, i.e. "float *aptr = array1.data();", and then use this pointer within the offload region. Not elegant, but it works.
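As a minimal sketch of that idea for a single, non-nested vector (the nested case is example 2 below):

#include <vector>
#include <cstdio>

int main() {
    std::vector<float> v(1000, 1.0f);
    float *vptr = v.data();            // raw pointer to the vector's contiguous storage
    int n = (int) v.size();

#pragma acc parallel loop copy(vptr[0:n])
    for (int i = 0; i < n; i++) {
        vptr[i] *= 2.0f;
    }

    printf("%f\n", v[0]);              // prints 2.0; vptr aliases the vector's storage
    return 0;
}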

However, you’ve made things even more challenging by using a vector within a vector, meaning you now need to create an array of pointers to each inner vector’s data.

Examples:

  1. Using nvc++ with vectors and UM enabled:
% cat vect1.cpp
#include <iostream>
#include <vector>

int main(int argc, char **argv) {
        std::vector<std::vector<float>> array1,array2;
        float result[1000]={0.0};

        for(int i=0; i<1000; i++){
                std::vector<float> accumulator1, accumulator2;
                for (int j=0; j<1000; j++){
                        accumulator1.push_back(99.99);
                        accumulator2.push_back(66.66);
                }
                array1.push_back(accumulator1);
                array2.push_back(accumulator2);
        }

#pragma acc parallel loop
        for(int i=0; i<1000; i++){
                for (int j=0; j<1000; j++){
                        result[i] += array1[i][j] + array2[i][j];
                }
        }

        for(int i=0; i<10; i++){
                std::cout << result[i] << std::endl;
        }

        return 0;
}
% nvc++ -acc -gpu=managed vect1.cpp; a.out
166650
166650
166650
166650
166650
166650
166650
166650
166650
166650
  2. Using the underlying vector data on the device rather than the vector itself:
% cat vect2.cpp
#include <iostream>
#include <vector>

int main(int argc, char **argv) {
        std::vector<std::vector<float>> array1,array2;
        float result[1000]={0.0};

        for(int i=0; i<1000; i++){
                std::vector<float> accumulator1, accumulator2;
                for (int j=0; j<1000; j++){
                        accumulator1.push_back(99.99);
                        accumulator2.push_back(66.66);
                }
                array1.push_back(accumulator1);
                array2.push_back(accumulator2);
        }

        auto data1 =  array1.data();
        auto data2 =  array2.data();
        float ** Arr1 = new float*[1000];
        float ** Arr2 = new float*[1000];

        for(int i=0; i<1000; i++){
             Arr1[i] = data1[i].data();
             Arr2[i] = data2[i].data();
        }

        #pragma acc parallel loop copy(result[:1000], Arr1[:1000][:1000],Arr2[:1000][:1000])
        for(int i=0; i<1000; i++){
                for (int j=0; j<1000; j++){
                        result[i] += Arr1[i][j] + Arr2[i][j];
                }
        }

        for(int i=0; i<10; i++){
                std::cout << result[i] << std::endl;
        }
        delete [] Arr1;
        delete [] Arr2;
        return 0;
}
% nvc++ -acc vect2.cpp; a.out
166650
166650
166650
166650
166650
166650
166650
166650
166650
166650

Hope this helps,
Mat

Thanks for your reply! I have tried compiling this example with nvc++ and the -gpu=managed flag and it works. Unfortunately, however, this code is part of a much bigger project, and trying to compile everything with nvc++ returns the error "catastrophic error: cannot open source file 'shared_mutex'", even when including the compilation flags for C++17. Compiling it with GCC+NVPTX (before adding this part), on the other hand, works perfectly. The second example you propose, however, when compiled with GCC, returns "error: array section is not contiguous in ‘map’ clause" and therefore cannot be used. Any clue why? Do you know if there is a compile flag like "-gpu=managed" that can be used with GCC+NVPTX to overcome this problem? Thanks again.

Are you including the <shared_mutex> header in your source?

If so: we use the system-installed g++ header files in order to support interoperability. By default, the NVHPC installer calls a script called "makelocalrc" to create a local system compiler configuration file, "localrc", found in the bin directory of the compiler. This contains a setting pointing to where the g++ headers are located.

If you’re using a g++ other than what’s installed with the OS, what could be happening is that the system g++ is old and doesn’t support C++14. Let me know if this is the case and I’ll walk you through how to update the localrc to use the same headers as the g++ you’re using rather than the system g++.

if compiled with GCC, returns “error: array section is not contiguous in ‘map’ clause”

I believe at one point they didn’t support multi-dimensional arrays in data directives, though this may have been fixed in later releases. Per their wiki (https://gcc.gnu.org/wiki/OpenACC) they are mostly OpenACC 2.6 compliant as of the 12.0 release, but you may need to build using the OG12 branch. I don’t know the specific details or whether they’ve corrected this issue.

You may consider joining the OpenACC Slack Channel (see https://www.openacc.org/community) and asking there.

If they haven’t fixed this, then the workaround would be to linearize the array into a single contiguous block, as sketched below.
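For example, a sketch of that workaround for your code: copy the nested vectors into one flat buffer and index it manually, so the data clauses only need contiguous 1-D sections, which should avoid the "not contiguous" error:

#include <iostream>
#include <vector>

int main() {
    const int N = 1000, M = 1000;
    std::vector<std::vector<float>> array1(N, std::vector<float>(M, 99.99f));
    std::vector<std::vector<float>> array2(N, std::vector<float>(M, 66.66f));
    float result[1000] = {0.0};

    // Flatten each vector-of-vectors into a single contiguous block.
    std::vector<float> flat1(N * M), flat2(N * M);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++) {
            flat1[i * M + j] = array1[i][j];
            flat2[i * M + j] = array2[i][j];
        }

    float *p1 = flat1.data();
    float *p2 = flat2.data();

#pragma acc parallel loop copyin(p1[0:N*M], p2[0:N*M]) copy(result[0:N])
    for (int i = 0; i < N; i++) {
        float sum = 0.0f;
        for (int j = 0; j < M; j++)
            sum += p1[i * M + j] + p2[i * M + j];
        result[i] += sum;
    }

    for (int i = 0; i < 10; i++)
        std::cout << result[i] << std::endl;
    return 0;
}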

Do you know if there is a compile flag like “-gpu =managed” that can be used with GCC+NVPTX to overcome this problem?

Not that I’m aware of. It’s a very useful feature, so if they haven’t added it, hopefully they will in the future. You can call "cudaMallocManaged" directly to perform your allocations, but that might be a bit cumbersome with vectors.
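For reference, a sketch of that approach (assuming you can include the CUDA runtime header and link against it, e.g. with -lcudart) is a custom allocator so the vector’s element storage lives in managed memory. Note this only covers the element buffer, not the vector object itself, so nested vectors remain awkward:

#include <cuda_runtime.h>
#include <cstddef>
#include <new>
#include <vector>

// Sketch of an allocator that places std::vector element storage in CUDA
// managed (unified) memory via cudaMallocManaged.
template <typename T>
struct managed_allocator {
    using value_type = T;

    managed_allocator() = default;
    template <typename U>
    managed_allocator(const managed_allocator<U>&) {}

    T* allocate(std::size_t n) {
        void *p = nullptr;
        if (cudaMallocManaged(&p, n * sizeof(T)) != cudaSuccess)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T *p, std::size_t) noexcept { cudaFree(p); }
};

template <typename T, typename U>
bool operator==(const managed_allocator<T>&, const managed_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const managed_allocator<T>&, const managed_allocator<U>&) { return false; }

// Usage: the element buffer is now reachable from both host and device code.
// std::vector<float, managed_allocator<float>> v(1000);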

Keep in mind that nvc++ is interoperable with g++, so you can mix nvc++-compiled OpenACC code with a larger project compiled with g++. The caveat is that, if using "-gpu=managed", the files that allocate data used on the device need to be compiled with nvc++ so we can make the implicit replacement to use cudaMallocManaged calls.

Also, we use Relocatable Device Code (RDC) by default, which requires a device link step; hence it’s preferred that the binary is linked with nvc++. If this is not possible, you’ll need to disable RDC via the "-gpu=nordc" flag. The limitation is that without RDC you can’t use OpenACC features such as calling device functions defined in a separate source file or using global device data (via the "declare" directive) found in a separate source. Both features require device linking.

You can also encapsulate the nvc++ OpenACC code in a shared object, making it easier to add to your larger project. While we support RDC in C and Fortran shared objects, we don’t yet have this support for C++, so you’d still need to use the "-gpu=nordc" flag.

Thank you very much for your answer and for your availability. I’m joining the Slack channel, which I didn’t know existed.
And yes, I’m including that header in my source, and I’m also using a different g++ than what’s installed in the OS. Could you explain how to update the localrc to use the same headers as the newest version? That way I can try to overcome nvc++’s "catastrophic error: cannot open source file 'shared_mutex'".
Thanks in advance.

Could you explain how to update the localrc to use the same headers as the newest version?

There are two methods.

In recent versions, we added a flag, "--gcc-toolchain=/path/to/gcc/bin", which you can add to your compilation line.

Though if you don’t want to add this flag, the second method is to regenerate the localrc file.

Steps to create a localrc file:

  1. run “makelocalrc -d . -x -gcc=/full/path/to/bin/gcc -gpp=/full/path/to/bin/g++ -g77=/full/path/to/bin/gfortran”
  2. This creates the “localrc” in the current directory.
  3. Either
    a) copy this “localrc” file into the nvhpc compiler bin directory overwriting the original, or
    b) rename it to something else, like "localrc.gnu12", and then copy it to the bin directory (it can actually be placed in any directory). Then set the environment variable "NVLOCALRC=/full/path/to/localrc.gnu12". You can also create different localrc files for multiple configurations.

The second method, using different localrc files and NVLOCALRC, is preferred, especially if you ever need to use the system gcc or if there are multiple users with different settings.

Note that during installation you had the choice of a single-system or network installation. For a network installation, the creation of the localrc is delayed until the first invocation of the compilers on a particular system, and the file is then named "localrc.<hostname>". In that case, adjust the names of the localrc files you create accordingly.

Note that in some installations, admins don’t give users write access to the compiler bin directory. In that case, put your localrc file somewhere you do have access, like your home directory, and set NVLOCALRC accordingly.
