How to parallelize C++ code containing templates with OpenACC on GPU?

I need to port some C++ code to the GPU using the PGI compiler pgc++. In simpler programs I used the following construction (Particle is a structure with 16 fields):

 const int GL = 1000000;
 Particle particles[GL] __attribute__((aligned(64)));
 ...
 int main(int argc, char **argv)
 {
   ...
   #pragma acc data create(particles, ...)
   {
     while (NumberOfAliveParticles > 0)
     {
       fun1();
       fun2();
       ...
     }
   }
 }

In fun1() and fun2() I created OpenACC compute regions using #pragma acc directives. So the “particles” array was on the GPU (after #pragma acc data create(particles…)) and I worked with it on the GPU.
In the current code, which I need to port to the GPU, “Particle” is a template structure and “particles” is a member of class A of type std::vector<Particle<T>> (T is the template parameter).
Could you be so kind as to answer my questions:
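Roughly, the data structure looks like this (simplified; the field names here are just placeholders):

 #include <vector>

 template <typename T>
 struct Particle
 {
     T x, y, z, vx, vy, vz;   // ... 16 fields in total
 };

 template <typename T>
 class A
 {
 public:
     std::vector<Particle<T>> particles;
     // ...
 };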

  1. Can OpenACC be used with C++ templates (can I use a C++ template class object inside an OpenACC parallel compute region, or does it not matter whether it is a template)?
  2. The “particles” array should be on the GPU. May I use std::vector on the GPU? If not, may I use the std::array container for the “particles” array on the GPU? What container would you recommend, or would it be better to make the “particles” array global?
  3. How do I write #pragma acc data create() for such a “particles” vector (or array, or whatever container OpenACC allows) that is a member of class A?

Thank you very much for your answer.

Hi @and,

  1. Can OpenACC be used with C++ templates (can I use a C++ template class object inside an OpenACC parallel compute region, or does it not matter whether it is a template)?

Yes, templates should be fine since they are processed by the compiler before OpenACC code generation. Also, the compiler will implicitly add the “routine” directive to calls made in the template, so device-callable versions will be generated.

Not all code can be offloaded to GPUs, though, so if you have things like I/O you may see errors. This is a general issue, not specific to templates.
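For example, here is a minimal sketch of a template used in a compute region (the names are just illustrative, not taken from your code):

 #include <cstdio>

 template <typename T>
 struct Particle { T x, vx; };

 template <typename T>
 void push(Particle<T> *p, int n, T dt)
 {
     // the loop is compiled for the device for each instantiation of T
     #pragma acc parallel loop present(p[0:n])
     for (int i = 0; i < n; ++i)
         p[i].x += p[i].vx * dt;
 }

 int main()
 {
     const int n = 1000;
     Particle<float> p[n];
     for (int i = 0; i < n; ++i) { p[i].x = 0.f; p[i].vx = 1.f; }

     #pragma acc data copy(p)
     {
         push(p, n, 0.1f);
     }
     printf("%f\n", p[0].x);
     return 0;
 }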

  2. The “particles” array should be on the GPU. May I use std::vector on the GPU? If not, may I use the std::array container for the “particles” array on the GPU? What container would you recommend, or would it be better to make the “particles” array global?

While you can use vectors, they can be tricky.

First, vectors are not thread-safe, so in general you need to be cautious about putting them in parallel code. Be sure not to resize, push, or pop; only use the access operator.

Second, a vector is essentially an opaque class holding three pointers. Since OpenACC data and update directives perform shallow copies, putting a vector in a data clause copies only the pointers, not the data they point to. You need either to perform a deep copy of the data, or to compile with the flag “-ta=tesla:managed” so the allocations are placed in CUDA Unified Memory, accessible by both the host and the device.
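For example, one common way around this (a sketch, assuming the elements themselves contain no pointers or other dynamic members) is to put the vector’s underlying buffer, rather than the vector object, in the data clauses:

 #include <vector>

 struct Particle { float x, vx; /* ... */ };

 void advance(std::vector<Particle> &particles, float dt)
 {
     Particle *p = particles.data();          // raw pointer to the contiguous element buffer
     long n = (long)particles.size();

     #pragma acc enter data copyin(p[0:n])    // deep copy of the element data
     #pragma acc parallel loop present(p[0:n])
     for (long i = 0; i < n; ++i)
         p[i].x += p[i].vx * dt;
     #pragma acc exit data copyout(p[0:n])
 }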

  3. How do I write #pragma acc data create() for such a “particles” vector (or array, or whatever container OpenACC allows) that is a member of class A?

Try using CUDA Unified Memory first since it will make things easier.

Otherwise, you need to perform a deep copy, and depending on what ‘particles’ is (does it have its own dynamic data members?), it may be fairly straightforward or complex. If you want to try a deep copy, please post a small example that shows the data structure and a simple operation to perform on it, basically a small version of your full application.
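In case it helps while you put that example together, here is a rough sketch of what a manual deep copy of a member vector can look like when the elements are plain structs with no dynamic members (the class and field names are made up, not taken from your code):

 #include <vector>

 template <typename T>
 struct Particle { T x, vx; /* ... */ };

 template <typename T>
 class A
 {
 public:
     std::vector<Particle<T>> particles;

     void to_device() {                        // deep copy host -> device
         Particle<T> *p = particles.data();
         long n = (long)particles.size();
         #pragma acc enter data copyin(p[0:n])
     }
     void from_device() {                      // copy results back and free the device data
         Particle<T> *p = particles.data();
         long n = (long)particles.size();
         #pragma acc exit data copyout(p[0:n])
     }
     void step(T dt) {
         Particle<T> *p = particles.data();    // present-table lookup is by host address
         long n = (long)particles.size();
         #pragma acc parallel loop present(p[0:n])
         for (long i = 0; i < n; ++i)
             p[i].x += p[i].vx * dt;
     }
 };

Once the data is on the device this way, the vector itself is only touched on the host; just make sure not to resize it (which would reallocate the buffer) while the device copy is live.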

-Mat

Thank you very much for your answer. I need some time to prepare a small version of the application.