A nice tip for copying C++ structs in OpenACC

Hi there, as I wrote elsewhere, the device mapper of OpenACC had problems in my case: a C++ struct with several constructors, three arrays (the data, the extents, and the strides), and member variables for the lengths of these arrays, whether the data is row-major, and so on.

I think I can map all of these in place with #pragma acc directives, but then, every time one offloads a loop, one has to write dozens of mapping clauses.

Unfortunately, I was not able to map these with the acc runtime functions and keep the results in pointers. Only the macros worked. At some point, the nvc++ compiler even threw an exception when I tried it with pointers and tried to associate the device member variables with the pointers allocated on the device…

I asked ChatGPT, which came up with nice macros, similar to these:

#define STRINGIFY(x) #x

// Input struct: upload the struct itself, then its three arrays.
#define CREATE_IN_STRUCT(dA) \
    enter data copyin(dA) \
    copyin(dA.pdata[0:dA.pdatalength]) \
    copyin(dA.pextents[0:dA.prank]) \
    copyin(dA.pstrides[0:dA.prank])

// Output struct: the data array is only allocated on the device, not uploaded.
#define CREATE_OUT_STRUCT(dL) \
    enter data copyin(dL) \
    create(dL.pdata[0:dL.pdatalength]) \
    copyin(dL.pextents[0:dL.prank]) \
    copyin(dL.pstrides[0:dL.prank])

// Release the arrays first, then the struct.
#define EXIT_STRUCT(dL) \
    exit data delete(dL.pdata[0:dL.pdatalength]) \
    delete(dL.pextents[0:dL.prank]) \
    delete(dL.pstrides[0:dL.prank]) \
    delete(dL)

#define UPDATE_HOST(dL) update self(dL.pdata[0:dL.pdatalength])

#define UPDATE_DEVICE(dA) update device(dA.pdata[0:dA.pdatalength])

They can be used like this:

mycomplex_data_structure s;
#pragma acc CREATE_IN_STRUCT(s)

And the preprocessor will then replace these with the mapping commands.

But I still do not know why I get device segfaults when I try to use the acc functions like acc_memcpy_to_device. That would still be a cleaner way.

(It may have to do with the struct's member variables needing to point to the arrays: simply copying the struct up is not enough, and the member variables have to be linked to the uploaded arrays once they are on the device?)

I hope my routines at least run on the device.

The function acc_on_device(acc_device_nvidia) seems to say so. And if I zero the result array beforehand and do not copy it back from the device to the host, I get zeros; if I copy it back, I see the results.

So it seems that the copy-in and copy-back operations with the macros work. They save some typing, and one has less code repetition.
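
The on-device check mentioned above could look roughly like this. It is only a sketch under assumptions: the helper name fill_and_report is made up, and it queries the standard acc_device_not_host enumerator instead of acc_device_nvidia so that it stays vendor-neutral. Without OpenACC the pragma is ignored and the loop simply runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#ifdef _OPENACC
#include <openacc.h>
#endif

// Hypothetical helper: fill `out` in an offloaded loop and report whether
// the loop body ran on an accelerator (true) or on the host (false).
bool fill_and_report(double* out, std::size_t n) {
    assert(out != nullptr);
    int on_device = 0;

    // copyout returns the results to the host; copy returns the flag.
    #pragma acc parallel loop copyout(out[0:n]) copy(on_device)
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = 2.0 * static_cast<double>(i);
#ifdef _OPENACC
        // Only one iteration writes the flag, so there is no race.
        if (i == 0) on_device = acc_on_device(acc_device_not_host);
#endif
    }
    return on_device != 0;
}
```

If the function returns true and the values in out are correct, both the kernel and the copy back to the host worked.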

You can improve the macros further by adding another level using the “_Pragma” operator. This will allow you to switch more easily between OpenACC and OpenMP, especially for the data model, as semantically they are basically the same.
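
A minimal sketch of what such a _Pragma layer might look like. The macro names (DO_PRAGMA, DEVICE_COPYIN, and so on), the Buffer struct, and scale_on_device are all made up for illustration; only _Pragma itself and the directives are standard. If neither OpenACC nor OpenMP is enabled, the macros expand to nothing and the code runs on the host.

```cpp
#include <cassert>
#include <cstddef>

// Stringify a whole directive so that it can be emitted via _Pragma.
#define DO_PRAGMA(x) _Pragma(#x)

#if defined(_OPENACC)
  #define DEVICE_COPYIN(x)      DO_PRAGMA(acc enter data copyin(x))
  #define DEVICE_DELETE(x)      DO_PRAGMA(acc exit data delete(x))
  #define DEVICE_UPDATE_HOST(x) DO_PRAGMA(acc update self(x))
  #define DEVICE_LOOP(x)        DO_PRAGMA(acc parallel loop present(x))
#elif defined(_OPENMP)
  #define DEVICE_COPYIN(x)      DO_PRAGMA(omp target enter data map(to: x))
  #define DEVICE_DELETE(x)      DO_PRAGMA(omp target exit data map(delete: x))
  #define DEVICE_UPDATE_HOST(x) DO_PRAGMA(omp target update from(x))
  #define DEVICE_LOOP(x)        DO_PRAGMA(omp target teams distribute parallel for)
#else
  // Host-only build: every macro expands to nothing.
  #define DEVICE_COPYIN(x)
  #define DEVICE_DELETE(x)
  #define DEVICE_UPDATE_HOST(x)
  #define DEVICE_LOOP(x)
#endif

// Hypothetical struct with one dynamic member, standing in for the real one.
struct Buffer {
    double*     pdata;
    std::size_t pdatalength;
};

// Multiply every element by `factor`, on the device if one is configured.
void scale_on_device(Buffer& b, double factor) {
    assert(b.pdata != nullptr);
    DEVICE_COPYIN(b)                          // parent struct first
    DEVICE_COPYIN(b.pdata[0:b.pdatalength])   // then the member array
    DEVICE_LOOP(b)
    for (std::size_t i = 0; i < b.pdatalength; ++i) {
        b.pdata[i] *= factor;
    }
    DEVICE_UPDATE_HOST(b.pdata[0:b.pdatalength])
    DEVICE_DELETE(b.pdata[0:b.pdatalength])
    DEVICE_DELETE(b)
}
```

Because the directive name now lives inside the macro, the call sites no longer need a leading #pragma acc, and switching the programming model only touches the macro definitions.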

“But I still do not know why I get device segfaults when I try to use the acc functions like acc_memcpy_to_device. That would still be a cleaner way.”

Dynamic data members are pointers holding a host address. If you simply do a straight copy, the device will dereference the host address, hence the illegal address error. The step you’re missing is the “attach”, which writes the device address of the member’s device data into the device copy of the struct. The “attach” is done implicitly when doing the deep copy via data regions, provided the parent is already on the device.

For a full explanation of deep-copy, please see the article “Deep Copy in OpenACC: Support for Dynamically Nested Data Structures”.
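
To make the role of the “attach” concrete, here is a host-only model of it. This is not OpenACC code: the “device” is just a second heap buffer, and Matrix, FakeDevice, shallow_copy, and attach are names invented for this illustration. It only shows why a byte-wise copy of the struct leaves a host address behind and what the attach step changes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical struct with one dynamic member.
struct Matrix {
    double*     pdata;
    std::size_t pdatalength;
};

// Stand-in for device memory: a separate buffer plus a copy of the struct.
struct FakeDevice {
    std::vector<double> data;    // "device" copy of the array
    Matrix              parent;  // "device" copy of the struct
};

// Straight copy: the array and the struct are both copied, but the struct
// is copied byte for byte, so parent.pdata still holds the HOST address.
FakeDevice shallow_copy(const Matrix& host) {
    assert(host.pdata != nullptr);
    FakeDevice dev;
    dev.data.assign(host.pdata, host.pdata + host.pdatalength);
    std::memcpy(&dev.parent, &host, sizeof(Matrix));
    return dev;
}

// "Attach": overwrite the pointer inside the device copy of the struct
// with the address of the device copy of the array.
void attach(FakeDevice& dev) {
    dev.parent.pdata = dev.data.data();
}
```

After shallow_copy, dev.parent.pdata still equals the host pointer, which a real GPU cannot dereference. After attach, it points into the device buffer.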

Hi, thanks for that tip about the _Pragma operator.

Hm, I personally think that allocation functions would be better than these pragmas for mapping, since a function gives me back a pointer that I can pass along and point to.

Also, malloc-style functions can return a nullptr to make clear that the operation has failed, i.e. one can check whether the GPU is out of memory. So I have now created the following functions:

#include <openacc.h>  // OpenACC runtime API (acc_copyin, acc_attach, ...)

// Refresh the device copy of the data array from the host.
template <typename T>
inline void update_device(datastruct<T>& dL) {
    acc_update_device((void*)dL.pdata, sizeof(T) * dL.pdatalength);
}

// Refresh the host copy of the data array from the device.
template <typename T>
inline void update_host(datastruct<T>& dL) {
    acc_update_host((void*)dL.pdata, sizeof(T) * dL.pdatalength);
}

// Input struct: upload the arrays and the struct, then attach the pointers.
template <typename T>
inline void create_in_struct(datastruct<T>& dA) {
    acc_copyin((void*)dA.pdata, sizeof(T) * dA.pdatalength);
    acc_copyin((void*)dA.pextents, sizeof(size_t) * dA.prank);
    acc_copyin((void*)dA.pstrides, sizeof(size_t) * dA.prank);
    acc_copyin((void*)&dA, sizeof(dA));
    acc_attach((void**)&dA.pdata);
    acc_attach((void**)&dA.pextents);
    acc_attach((void**)&dA.pstrides);
}

// Output struct: as above, but the data array is only allocated on the
// device (acc_create), mirroring the CREATE_OUT_STRUCT macro.
template <typename T>
inline void create_out_struct(datastruct<T>& dA) {
    acc_create((void*)dA.pdata, sizeof(T) * dA.pdatalength);
    acc_copyin((void*)dA.pextents, sizeof(size_t) * dA.prank);
    acc_copyin((void*)dA.pstrides, sizeof(size_t) * dA.prank);
    acc_copyin((void*)&dA, sizeof(dA));
    acc_attach((void**)&dA.pdata);
    acc_attach((void**)&dA.pextents);
    acc_attach((void**)&dA.pstrides);
}

// Detach the pointers, then release the arrays and finally the struct.
template <typename T>
inline void exit_struct(datastruct<T>& dA) {
    acc_detach((void**)&dA.pdata);
    acc_detach((void**)&dA.pextents);
    acc_detach((void**)&dA.pstrides);
    acc_delete((void*)dA.pstrides, sizeof(size_t) * dA.prank);
    acc_delete((void*)dA.pextents, sizeof(size_t) * dA.prank);
    acc_delete((void*)dA.pdata, sizeof(T) * dA.pdatalength);
    acc_delete((void*)&dA, sizeof(dA));
}

acc_copyin, like acc_malloc, also returns a device pointer.

So if one wants to know whether it failed, one may just check whether it returned a nullptr.

For OpenMP, I think we would really need an attach method. I ran into problems with my code because I found none. Still, OpenMP has some things that make it interesting.

For example, omp_alloc can be called from target regions. So a function on the target can create its own temporary data caches.

What is more, when called from a target region, omp_alloc wants a memory allocator:
https://www.openmp.org/spec-html/5.0/openmpsu53.html

This memory allocator can specify, for example, whether something should be placed in large-capacity memory or in low-latency memory.

So if one has a matrix, the function on the device can put the strides and extents for its temporary matrix into low-latency memory, and the data into large-capacity memory (as long as the hardware supports such a designation).
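
As a rough host-side sketch of that idea, using the predefined OpenMP 5.0 allocators omp_low_lat_mem_alloc and omp_large_cap_mem_alloc: the TempMatrix struct and all helper names are hypothetical, and the sketch assumes a compiler whose OpenMP runtime provides omp_alloc. Without OpenMP it falls back to std::malloc. Calling this from inside a target region has further requirements that are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical temporary matrix, laid out like the struct in this thread.
struct TempMatrix {
    double*      pdata;
    std::size_t* pextents;
    std::size_t* pstrides;
    std::size_t  prank;
};

// Small metadata: request low-latency memory where OpenMP is available.
inline void* alloc_small(std::size_t bytes) {
#ifdef _OPENMP
    return omp_alloc(bytes, omp_low_lat_mem_alloc);
#else
    return std::malloc(bytes);
#endif
}

// Bulk data: request large-capacity memory where OpenMP is available.
inline void* alloc_large(std::size_t bytes) {
#ifdef _OPENMP
    return omp_alloc(bytes, omp_large_cap_mem_alloc);
#else
    return std::malloc(bytes);
#endif
}

inline void free_small(void* p) {
#ifdef _OPENMP
    omp_free(p, omp_low_lat_mem_alloc);
#else
    std::free(p);
#endif
}

inline void free_large(void* p) {
#ifdef _OPENMP
    omp_free(p, omp_large_cap_mem_alloc);
#else
    std::free(p);
#endif
}

// Build a row-major rows x cols matrix with uninitialized data.
TempMatrix make_temp_matrix(std::size_t rows, std::size_t cols) {
    TempMatrix m{nullptr, nullptr, nullptr, 2};
    m.pextents = static_cast<std::size_t*>(alloc_small(2 * sizeof(std::size_t)));
    m.pstrides = static_cast<std::size_t*>(alloc_small(2 * sizeof(std::size_t)));
    m.pdata    = static_cast<double*>(alloc_large(rows * cols * sizeof(double)));
    assert(m.pextents && m.pstrides && m.pdata);  // allocation may fail
    m.pextents[0] = rows;  m.pextents[1] = cols;
    m.pstrides[0] = cols;  m.pstrides[1] = 1;
    return m;
}

void free_temp_matrix(TempMatrix& m) {
    free_large(m.pdata);
    free_small(m.pstrides);
    free_small(m.pextents);
    m = TempMatrix{nullptr, nullptr, nullptr, 0};
}
```

Whether the two allocators really end up in different kinds of memory depends on the implementation and the hardware.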

Also, from an OpenMP loop one can always call another function that contains an OpenMP loop.

But for matrices and arrays, I guess we need an attach clause, and thus OpenACC is the better option for now…

Interesting are the directives #pragma omp simd and #pragma omp parallel for simd.

The latter distributes a parallel loop across threads and vectorizes each thread's share of the iterations. With the inner loop as a SIMD operation, on the CPU one is almost as fast in C++ as NumPy (the difference is microseconds, and often OpenMP indeed wins), even when doing a matrix multiplication with rather slow STL vectors (which are not really optimized for numerical calculations). Something with a sometimes faster operator[] (according to my tests) is std::valarray. But sadly it has no data() access to its pointer… so the best is to use raw C arrays with pointers, or Fortran instead of C, which is sometimes faster than C…
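
A small sketch of such a matrix multiplication on raw pointers, with a threaded outer loop and a SIMD inner loop. The function name matmul_rowmajor and the i-p-j loop order are my own choices for illustration, not taken from the library discussed here. Without -fopenmp (or the equivalent flag) the pragmas are ignored and the code runs serially.

```cpp
#include <cassert>
#include <cstddef>

// C (n x m) = A (n x k) * B (k x m), all row-major and contiguous.
void matmul_rowmajor(const double* A, const double* B, double* C,
                     std::size_t n, std::size_t k, std::size_t m) {
    assert(A != nullptr && B != nullptr && C != nullptr);

    // Each thread handles a block of rows of C.
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            C[i * m + j] = 0.0;
        }
        for (std::size_t p = 0; p < k; ++p) {
            const double a = A[i * k + p];
            // The innermost loop walks contiguous memory in both B and C,
            // which makes it a good candidate for vectorization.
            #pragma omp simd
            for (std::size_t j = 0; j < m; ++j) {
                C[i * m + j] += a * B[p * m + j];
            }
        }
    }
}
```

Putting the j loop innermost keeps the accesses to B and C at unit stride, which tends to matter more for speed than the choice of container.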

As it stands, OpenMP support for target offload in Clang and GCC is still somewhat under development.

I see from the compiler output that nvc++ does a good job optimizing my library with OpenACC.

Even with no optimizations, it does what I would otherwise do by hand in CUDA… (although in my case, with the higher optimization levels switched on, nvc++ seems to over-optimize some things that it should not touch, I guess…)