OpenACC regions with C++ structs

I’m struggling to get OpenACC working on an object-oriented C++ code base that requires manual data management (it crashes when trying to use CUDA unified memory with -ta=tesla:managed), and I have been unable to find any extensive examples online, so I wanted to break my problem down and ask a series of smaller questions which may help me solve the larger issue:

If I have a struct, for example as follows:

struct ExampleObject {
    double x;
    double y;

    ExampleObject() {}
    ExampleObject(double x, double y): x(x), y(y) {}
    ~ExampleObject() {}
};

and the struct is either:
a) Passed in as an argument of a #pragma acc routine seq function
b) Instantiated inside a #pragma acc routine seq function
c) Used as a return type for a #pragma acc routine seq function
d) Instantiated inside a #pragma acc data region
e) Instantiated inside a #pragma acc parallel loop region
should its constructors and destructor have #pragma acc routine seq above them, or should I instead have #pragma acc enter data copyin(this) inside the constructors and #pragma acc exit data delete(this) inside the destructor?

As an additional question, are there parts of the C++ STL that cannot be used in OpenACC #pragma acc data, #pragma acc parallel loop or #pragma acc routine seq regions? In particular, I use a lot of basic math functions such as std::abs(), std::sqrt(), std::atan2(), std::cos(), std::sin() etc. The reason I ask is because if I try the following:

#pragma acc routine seq
double wrap_to_pi(const double angle) {
    return std::fmod(angle + M_PI, 2.0*M_PI) - M_PI;
}

and compile with:
pgc++ -g -acc -ta=tesla -Minfo=accel
I receive a compilation error with the message:
NVC++-S-1000-Call in OpenACC region to procedure 'fmod' which has no acc routine information

No. First, data regions can only be used from host code, so you can’t do this for a, b, c, or e.

Second, the data members are static (fixed size) so an object of this struct can be added directly to a copy clause. You only need to add the enter/exit data region when you need to perform a deep copy, i.e. when the aggregate type contains dynamic data members.
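
For example, here’s a minimal sketch of adding a fixed-size struct object directly to a data clause (the loop body and the names obj, sum, and the trip count are just illustrative, not from your code):

ExampleObject obj(1.0, 2.0);
double sum = 0.0;
// obj has only fixed-size members, so copyin(obj) is sufficient; no deep copy needed
#pragma acc parallel loop copyin(obj) reduction(+:sum)
for (int i = 0; i < 1024; ++i) {
    sum += obj.x + obj.y;
}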

For the “fmod” issue, add the include file “accelmath.h”. This will get the proper device prototypes for these math routines.

% cat test.cpp
#include <cmath>
#ifdef _OPENACC
#include <accelmath.h>
#endif

#pragma acc routine seq
double wrap_to_pi(const double angle) {
    return std::fmod(angle + M_PI, 2.0*M_PI) - M_PI;
}
% nvc++ -acc -Minfo=accel -c test.cpp
wrap_to_pi(double):
      7, Generating acc routine seq
         Generating Tesla code

Hope this helps,
Mat

Thanks Mat, using accelmath.h solved the std::fmod() issue.

Regarding the first question, I’m still a bit confused. Are you saying that if I have a function as follows:

#pragma acc routine seq
ExampleObject acc_func(const ExampleObject &arg1) {
    // Do some calculations to compute x and y
    ...
    return ExampleObject(x,y);
}

and this function is called inside a #pragma acc parallel loop region, for example:

#pragma acc parallel loop copy(example_object_output) copyin(example_object_input)
for(size_t i = 0; i < num_particles_; ++i) {
    example_object_output = acc_func(example_object_input);
}

which covers cases (a), (b) and (c), then I don’t need to use any #pragma acc ... directives in the ExampleObject struct (i.e. it looks like a normal C++ struct)? Similarly, for case (e), if I have the following:

#pragma acc parallel loop copyin(example_object_input)
for(size_t i = 0; i < num_particles_; ++i) {
    ExampleObject example_object_output = example_object_processor(example_object_input);
    // Do some calculations with example_object_output here
    ...
}

then I also don’t need to include any #pragma acc ... directives in the ExampleObject struct?

However, regarding case (d), if I have, for example:

#pragma acc data {
    double x = 1.0;
    double y = 2.0;
    ExampleObject example_object(x,y);
    // Do some calculations with example_object here
    ...
}

then I do need #pragma acc routine seq above the constructors and destructors, but don’t need #pragma acc enter data copyin(this)/#pragma acc exit data delete(this) inside the constructors/destructor? As an extension to this for case (d), if I did have a dynamic member variable in the ExampleObject struct, would I then need both #pragma acc routine seq above the constructors and destructors, and #pragma acc enter data copyin(this)/#pragma acc exit data delete(this) inside the constructors/destructor?

Apologies for such laborious questions, but I’m trying to wrap my head around a few confusing errors I’m currently getting in my code.

The question was regarding whether you needed to include the enter/exit data directives in the constructor/destructor of the struct. For all cases, the answer is no. For the device-side questions, data regions can only be used within host code, so they can’t be used on the device. For the host-side question, it’s unnecessary since the struct is fixed size, and therefore a struct object can be put directly in a data clause.

#pragma acc parallel loop copy(example_object_output) copyin(example_object_input)
for(size_t i = 0; i < num_particles_; ++i) {
    example_object_output = acc_func(example_object_input);
}

No issue here with the copy/copyin of the example object. The compiler knows the size of the objects, so it can copy them without a problem. The problem occurs when the object contains dynamic data members, whose size is not known until runtime, in which case you need to do a deep copy.

The problem with this code is a race condition. By copying “example_object_output”, you’ve made it shared, so the values it actually holds will depend upon whichever loop iteration runs last. It should instead be an array of structs, or be privatized so each vector lane has its own copy.
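
For example, a rough sketch of the array-of-structs approach (here “example_object_outputs” is a hypothetical array of length num_particles_, allocated however fits your code):

ExampleObject *example_object_outputs = new ExampleObject[num_particles_];
#pragma acc parallel loop copyin(example_object_input) copyout(example_object_outputs[0:num_particles_])
for (size_t i = 0; i < num_particles_; ++i) {
    // each iteration writes its own element, so nothing is shared across iterations
    example_object_outputs[i] = acc_func(example_object_input);
}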

I should note that we have had bugs in the past when performing struct-to-struct copies on the device. Hopefully you won’t encounter any of these, but if you do, let me know and I can report the issue.

#pragma acc parallel loop copyin(example_object_input)
for(size_t i = 0; i < num_particles_; ++i) {
    ExampleObject example_object_output = example_object_processor(example_object_input);
    // Do some calculations with example_object_output here
    ...
}

This one may not work as expected. I do have an open bug with something similar where the compiler hoists the declaration of the object out of the loop, thus making it shared. Though given that you’re also initializing it via a call, the compiler may not do the hoisting, so it may be fine. Give it a try, and if you do see an issue, let me know. The workaround would be something like:

ExampleObject example_object_output;
#pragma acc parallel loop copyin(example_object_input) private(example_object_output)
for(size_t i = 0; i < num_particles_; ++i) {
    example_object_output = example_object_processor(example_object_input);
    // Do some calculations with example_object_output here
    ...
}

Not clear on this next one since the syntax is wrong.

#pragma acc data {
    double x = 1.0;
    double y = 2.0;
    ExampleObject example_object(x,y);
    // Do some calculations with example_object here
    ...
}

Though, something like this would be fine:

double x = 1.0;
double y = 2.0;
ExampleObject example_object(x,y);
// Do some calculations with example_object here
#pragma acc data copyin(example_object)
{
    #pragma acc parallel loop
    for (size_t i = 0; i < num_particles_; ++i) {
        // use example_object in a compute construct
        ...
    }
}

As an extension to this for case (d), if I did have a dynamic member variable in the ExampleObject struct, would I then need both #pragma acc routine seq above the constructors and destructors, and #pragma acc enter data copyin(this)/#pragma acc exit data delete(this) inside the constructors/destructor?

You can use data regions within the constructor/destructor; it’s just not necessary. What changes is where the data region lives and who controls it: either the struct/class itself or the host program.

When simply adding a dynamic data member to the struct, you’d want to do a manual deep copy. While not yet adopted by the OpenACC standard, there are proposed “shape” and “policy” directives which could be used as well, but I’d rather not confuse things further at this point.

Let’s walk through some examples. First, the manual deep copy method, where the data management is controlled by the host program:

% cat deepcpy1.cpp

#include <iostream>
#include <cstdlib>

struct ExampleObject {
    ExampleObject() : x(0.0), y(0.0), size(0),data(nullptr)  {}
    ExampleObject(int, double, double);
    ~ExampleObject();
    double x;
    double y;
    int size;
    double * data;
};
ExampleObject::ExampleObject(int _size, double _x, double _y) {
    x=_x;
    y=_y;
    size=_size;
    data = new double[size];
}
ExampleObject::~ExampleObject() {
    delete [] data;
    data = nullptr;
    size = 0;
}

int main () {

    int size = 1024;
    double x = 5.0;
    double y = 10.0;
    ExampleObject exo(size,x,y);

#pragma acc enter data copyin(exo,exo.data[0:exo.size])
#pragma acc parallel loop present(exo,exo.data)
    for (int i=0; i < exo.size; ++i) {
        double factor = (double) i / (double) exo.size;
        exo.data[i] = (exo.x + exo.y) * factor;
    }
#pragma acc exit data copyout(exo.data[0:exo.size]) delete(exo)

    std::cout << "Some results " << std::endl;
    for (int i=10; i < 20; ++i) {
        std::cout << i << ": " << exo.data[i] << std::endl;
    }
}
% nvc++ deepcpy1.cpp -acc -Minfo=accel ; a.out
main:
 31, Generating enter data copyin(exo,exo.data[:exo.size])
     Generating present(exo,exo.data[:])
     Generating Tesla code
     35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
 41, Generating exit data copyout(exo.data[:exo.size])
     Generating exit data delete(exo)
Some results
10: 0.146484
11: 0.161133
12: 0.175781
13: 0.19043
14: 0.205078
15: 0.219727
16: 0.234375
17: 0.249023
18: 0.263672
19: 0.27832

Next, let’s move the data region inside the constructor/destructor. The main difference now is that the scope and lifetime of the device copy of the object is the same as that of the host copy. In the first example, the lifetime was only between the enter and exit data regions.

% cat deepcpy2.cpp

#include <iostream>
#include <cstdlib>

struct ExampleObject {
    ExampleObject() : x(0.0), y(0.0), size(0),data(nullptr)  {}
    ExampleObject(int, double, double);
    ~ExampleObject();
    double x;
    double y;
    int size;
    double * data;
};
ExampleObject::ExampleObject(int _size, double _x, double _y) {
    x=_x;
    y=_y;
    size=_size;
    data = new double[size];
#pragma acc enter data copyin(this, data[0:size])
}
ExampleObject::~ExampleObject() {
#pragma acc exit data delete(data,this)
    delete [] data;
    data = nullptr;
    size = 0;
}

int main () {

    int size = 1024;
    double x = 5.0;
    double y = 10.0;
    ExampleObject exo(size,x,y);

#pragma acc parallel loop present(exo,exo.data)
    for (int i=0; i < exo.size; ++i) {
        double factor = (double) i / (double) exo.size;
        exo.data[i] = (exo.x + exo.y) * factor;
    }
#pragma acc update self(exo.data[0:exo.size])

    std::cout << "Some results " << std::endl;
    for (int i=10; i < 20; ++i) {
        std::cout << i << ": " << exo.data[i] << std::endl;
    }
}

% nvc++ deepcpy2.cpp -acc -Minfo=accel ; a.out
main:
     33, Generating present(exo,exo.data[:])
         Generating Tesla code
         36, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     42, Generating update self(exo.data[:exo.size])
ExampleObject::ExampleObject(int, double, double):
     20, Generating enter data copyin(this[:1],data[:size])
ExampleObject::~ExampleObject():
     23, Generating exit data delete(this[:1],data[:1])
Some results
10: 0.146484
11: 0.161133
12: 0.175781
13: 0.19043
14: 0.205078
15: 0.219727
16: 0.234375
17: 0.249023
18: 0.263672
19: 0.27832

Finally, let’s fully encapsulate all the data management and processing within the struct itself:

% cat deepcpy3.cpp

#include <iostream>
#include <cstdlib>

struct ExampleObject {
    ExampleObject() : x(0.0), y(0.0), size(0),data(nullptr)  {}
    ExampleObject(int, double, double);
    ~ExampleObject();
    void process_data();
    void print_data();
#ifdef _OPENACC
    void acc_update_device();
    void acc_update_self();
#endif

private:
    double x;
    double y;
    int size;
    double * data;
};
ExampleObject::ExampleObject(int _size, double _x, double _y) {
    x=_x;
    y=_y;
    size=_size;
    data = new double[size];
#pragma acc enter data copyin(this, data[0:size])
}
ExampleObject::~ExampleObject() {
#pragma acc exit data delete(data,this)
    delete [] data;
    data = nullptr;
    size = 0;
}

void ExampleObject::process_data() {
#pragma acc parallel loop present(this,data)
    for (int i=0; i < size; ++i) {
        double factor = (double) i / (double) size;
        data[i] = (x + y) * factor;
    }
}

void ExampleObject::print_data() {
    std::cout << "Some results " << std::endl;
    for (int i=10; i < 20; ++i) {
        std::cout << i << ": " << data[i] << std::endl;
    }
}

#ifdef _OPENACC
void ExampleObject::acc_update_self() {
#pragma acc update self(data[:size])
}
void ExampleObject::acc_update_device() {
#pragma acc update device(data[:size])
}
#endif

int main () {

    int size = 1024;
    double x = 5.0;
    double y = 10.0;
    ExampleObject exo(size,x,y);
    exo.process_data();
#ifdef _OPENACC
    exo.acc_update_self();
#endif
    exo.print_data();
}

% nvc++ deepcpy3.cpp -acc -Minfo=accel ; a.out
ExampleObject::ExampleObject(int, double, double):
     28, Generating enter data copyin(this[:1],data[:size])
ExampleObject::~ExampleObject():
     31, Generating exit data delete(this[:1],data[:1])
ExampleObject::process_data():
     36, Generating present(this[:1],data[:1])
         Generating Tesla code
         38, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
ExampleObject::acc_update_device():
     57, Generating update device(data[:size])
ExampleObject::acc_update_self():
     54, Generating update self(data[:size])
Some results
10: 0.146484
11: 0.161133
12: 0.175781
13: 0.19043
14: 0.205078
15: 0.219727
16: 0.234375
17: 0.249023
18: 0.263672
19: 0.27832

Hopefully this clarifies things, but if not, please let me know!

-Mat