The question was regarding if you needed to include the enter/exit data regions in the constructor/destructor of the struct. For all cases, the answer is no. For the device side questions, data regions can only be used within host code so can’t be used on the device. For the host side question, it’s unnecessary since the struct is fixed size therefor a struct object can be put directly with a data clause.
#pragma acc parallel loop copy(example_object_output) copyin(example_object_input)
for(size_t i = 0; i < num_particles_; ++i) {
example_object_output = acc_func(example_object_input)
}
No issue here with the copy/copyin of the example object. The compiler knows the size of the objects so can copy them without issue. The problem occurs when the object contains dynamic data members, whose size is not known until runtime, in which case you need to do a deep copy.
The problem with this code is a race condition. By copying “example_object_output”, you’ve made it shared so the actual values returned will depend upon whichever loop iteration was do last. This should be an array of structs or privatized so each vector has it’s own copy.
I should note that we have had bugs in the past when performing struct to struct copies on the device. Hopefully you wont encounter any of these, but if you do, let me know an I can report the issue.
#pragma acc parallel loop copyin(example_object_input)
for(size_t i = 0; i < num_particles_; ++i) {
ExampleObject example_object_output = example_object_processor(example_object_input)
// Do some calculations with example_object_output here
...
}
This one may not work as expected. I do have an open bug with something similar where the compiler is hoisting the declaration of the object out of the loop, thus making it shared. Though given you’re also initializing it via a call, the compiler may not do the hoisting, so may be fine. Give it try and if you do see an issue, let me know. The work around would be something like:
ExampleObject example_object_output
#pragma acc parallel loop copyin(example_object_input) private(example_object_output)
for(size_t i = 0; i < num_particles_; ++i) {
example_object_output = example_object_processor(example_object_input)
// Do some calculations with example_object_output here
...
}
Not clear on this next one since the syntax is wrong.
#pragma acc data {
double x = 1.0;
double y = 2.0;
ExampleObject example_object(x,y);
// Do some calculations with example_object here
...
}
Though, something like this would be fine:
double x = 1.0;
double y = 2.0;
ExampleObject example_object(x,y);
// Do some calculations with example_object here
#pragma acc data copyin(example_object)
{
#pragma acc parallel loop
for ....
// use example_object in a compute construct
As an extension to this for case (d)
, if I did have a dynamic member variable in the ExampleObject
struct, would I then need both #pragma acc routine seq
above the constructors and destructors, and #pragma acc enter data copyin(this)
/ #pragma acc exit data delete(this)
inside the constructors/destructor?
You can use the data regions within the constructor/destructor, it’s just not necessary. What changes is where the data region is and who controls it, either the struct/class itself or the host program.
When simply adding a dynamic data member to the struct, you’d want to do a manual deep copy. While not adopted by the OpenACC standard yet, there is a proposed “shape” and “policy” directive which can be used as well but I’d rather not confuse things further at this point.
Let’s walk through some examples. First the manual deep copy method where the data management is controled by the host program:
% cat deepcpy1.cpp
#include <iostream>
#include <cstdlib>
struct ExampleObject {
ExampleObject() : x(0.0), y(0.0), size(0),data(nullptr) {}
ExampleObject(int, double, double);
~ExampleObject();
double x;
double y;
int size;
double * data;
};
ExampleObject::ExampleObject(int _size, double _x, double _y) {
x=_x;
y=_y;
size=_size;
data = new double[size];
}
ExampleObject::~ExampleObject() {
delete [] data;
data = nullptr;
size = 0;
}
int main () {
int size = 1024;
double x = 5.0;
double y = 10.0;
ExampleObject exo(size,x,y);
#pragma acc enter data copyin(exo,exo.data[0:exo.size])
#pragma acc parallel loop present(exo,exo.data)
for (int i=0; i < exo.size; ++i) {
double factor = (double) i / (double) exo.size;
exo.data[i] = (exo.x + exo.y) * factor;
}
#pragma acc exit data copyout(exo.data[0:exo.size]) delete(exo)
std::cout << "Some results " << std::endl;
for (int i=10; i < 20; ++i) {
std::cout << i << ": " << exo.data[i] << std::endl;
}
}
% nvc++ deepcpy1.cpp -acc -Minfo=accel ; a.out
main:
31, Generating enter data copyin(exo,exo.data[:exo.size])
Generating present(exo,exo.data[:])
Generating Tesla code
35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
41, Generating exit data copyout(exo.data[:exo.size])
Generating exit data delete(exo)
Some results
10: 0.146484
11: 0.161133
12: 0.175781
13: 0.19043
14: 0.205078
15: 0.219727
16: 0.234375
17: 0.249023
18: 0.263672
19: 0.27832
Next, let’s move the data region inside the constructor/destructor. The main difference now is that the scope and lifetime of the device copy of the object is the same as that of the host copy. In the first example, the lifetime was only between the enter and exit data regions.
% cat deepcpy2.cpp
#include <iostream>
#include <cstdlib>
struct ExampleObject {
ExampleObject() : x(0.0), y(0.0), size(0),data(nullptr) {}
ExampleObject(int, double, double);
~ExampleObject();
double x;
double y;
int size;
double * data;
};
ExampleObject::ExampleObject(int _size, double _x, double _y) {
x=_x;
y=_y;
size=_size;
data = new double[size];
#pragma acc enter data copyin(this, data[0:size])
}
ExampleObject::~ExampleObject() {
#pragma acc exit data delete(data,this)
delete [] data;
data = nullptr;
size = 0;
}
int main () {
int size = 1024;
double x = 5.0;
double y = 10.0;
ExampleObject exo(size,x,y);
#pragma acc parallel loop present(exo,exo.data)
for (int i=0; i < exo.size; ++i) {
double factor = (double) i / (double) exo.size;
exo.data[i] = (exo.x + exo.y) * factor;
}
#pragma acc update self(exo.data[0:exo.size])
std::cout << "Some results " << std::endl;
for (int i=10; i < 20; ++i) {
std::cout << i << ": " << exo.data[i] << std::endl;
}
}
% nvc++ deepcpy2.cpp -acc -Minfo=accel ; a.out
main:
33, Generating present(exo,exo.data[:])
Generating Tesla code
36, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
42, Generating update self(exo.data[:exo.size])
ExampleObject::ExampleObject(int, double, double):
20, Generating enter data copyin(this[:1],data[:size])
ExampleObject::~ExampleObject():
23, Generating exit data delete(this[:1],data[:1])
Some results
10: 0.146484
11: 0.161133
12: 0.175781
13: 0.19043
14: 0.205078
15: 0.219727
16: 0.234375
17: 0.249023
18: 0.263672
19: 0.27832
Finally, let’s full encapsulate all the data management and processing within the struct itself:
% cat deepcpy3.cpp
#include <iostream>
#include <cstdlib>
struct ExampleObject {
ExampleObject() : x(0.0), y(0.0), size(0),data(nullptr) {}
ExampleObject(int, double, double);
~ExampleObject();
void process_data();
void print_data();
#ifdef _OPENACC
void acc_update_device();
void acc_update_self();
#endif
private:
double x;
double y;
int size;
double * data;
};
ExampleObject::ExampleObject(int _size, double _x, double _y) {
x=_x;
y=_y;
size=_size;
data = new double[size];
#pragma acc enter data copyin(this, data[0:size])
}
ExampleObject::~ExampleObject() {
#pragma acc exit data delete(data,this)
delete [] data;
data = nullptr;
size = 0;
}
void ExampleObject::process_data() {
#pragma acc parallel loop present(this,data)
for (int i=0; i < size; ++i) {
double factor = (double) i / (double) size;
data[i] = (x + y) * factor;
}
}
void ExampleObject::print_data() {
std::cout << "Some results " << std::endl;
for (int i=10; i < 20; ++i) {
std::cout << i << ": " << data[i] << std::endl;
}
}
#ifdef _OPENACC
void ExampleObject::acc_update_self() {
#pragma acc update self(data[:size])
}
void ExampleObject::acc_update_device() {
#pragma acc update device(data[:size])
}
#endif
int main () {
int size = 1024;
double x = 5.0;
double y = 10.0;
ExampleObject exo(size,x,y);
exo.process_data();
#ifdef _OPENACC
exo.acc_update_self();
#endif
exo.print_data();
}
% nvc++ deepcpy3.cpp -acc -Minfo=accel ; a.out
ExampleObject::ExampleObject(int, double, double):
28, Generating enter data copyin(this[:1],data[:size])
ExampleObject::~ExampleObject():
31, Generating exit data delete(this[:1],data[:1])
ExampleObject::process_data():
36, Generating present(this[:1],data[:1])
Generating Tesla code
38, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
ExampleObject::acc_update_device():
57, Generating update device(data[:size])
ExampleObject::acc_update_self():
54, Generating update self(data[:size])
Some results
10: 0.146484
11: 0.161133
12: 0.175781
13: 0.19043
14: 0.205078
15: 0.219727
16: 0.234375
17: 0.249023
18: 0.263672
19: 0.27832
Hopefully this clarifies things, but if not, please let me know!
-Mat