Either add a “present” clause or another “copy” clause on the parallel loop within “rotation”.
#pragma acc parallel loop present(T)
#pragma acc parallel loop copy(T[:N])
Otherwise, the compiler would have to add an implicit copy of “T”, but it can’t because the bounds are unknown.
At runtime, “present” will look up “T” in the present table to do the host-to-device address mapping. If “T” is not present, the code will error.
For “copy”, “present_or” semantics will be applied. If “T” is present, then the device address will be used and no copy is performed. If “T” is not present, then the data will be copied. This is useful if “rotation” may or may not be called from within a higher-level data region. The caveat is that you’d need to compute or pass in the value of “N”.
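As a minimal sketch of the two options, assuming “T” is an array of doubles and that “N” is passed in: the loop bodies and the names “rotation_present”, “rotation_copy”, and “driver” are placeholders made up for illustration, since the real body of “rotation” isn’t shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for "rotation" using "present".
   T must already be on the device (e.g. via an outer data region),
   otherwise the lookup in the present table fails at runtime. */
void rotation_present(double *T, int N)
{
    assert(T != NULL && N >= 0);
    #pragma acc parallel loop present(T)
    for (int i = 0; i < N; ++i)
        T[i] = 2.0 * T[i];   /* placeholder work */
}

/* Hypothetical stand-in for "rotation" using "copy".
   If T is already present, the device copy is reused and nothing is moved.
   If not, T[0..N-1] is copied in before the loop and out after it. */
void rotation_copy(double *T, int N)
{
    assert(T != NULL && N >= 0);
    #pragma acc parallel loop copy(T[:N])
    for (int i = 0; i < N; ++i)
        T[i] = T[i] + 1.0;   /* placeholder work */
}

/* Outer data region: T is copied to the device once on entry and
   back to the host once on exit; both calls reuse the device copy. */
void driver(double *T, int N)
{
    #pragma acc data copy(T[:N])
    {
        rotation_present(T, N);
        rotation_copy(T, N);
    }
}
```

Calling “rotation_copy” on its own, outside of “driver”, would still work because the data gets copied by the loop’s own clause, while calling “rotation_present” on its own would be expected to fail at runtime on the device.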
Thanks for the suggestions and ideas.
I guess #pragma acc parallel loop present(T) would be a good option, since the vector (array) is already on the GPU. As far as I understand, copy would transfer the data from the host to the device; since copy is already used at the beginning of the code, we do not need to use copy again.
As you know, I can also use -ta=tesla:managed. Using unified memory versus explicit copying would change the overall performance. Would you suggest using -ta=tesla:managed or not?
Correct. I prefer using “present” in these cases, but “copy” does have its place, especially if “rotation” is not always called from within an outer data region.
Would you suggest using -ta=tesla:managed or not?
Using Unified Memory (i.e. managed) does greatly simplify programming, since data management is often the most difficult part of offloading to a GPU, especially for large complex data structures. So long as the program isn’t touching data back and forth between the host and device, the performance is often the same as manually managing the data.
The caveats to using UM are that it’s currently only available for dynamically allocated data, is an extension so may not be portable to other compilers and platforms, can’t be used with “async”, and is currently not performant when using CUDA Aware MPI.
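To illustrate, here is a minimal sketch of what the UM version can look like, assuming the array is allocated with malloc and the code is built with -ta=tesla:managed. The function names and the scaling loop are hypothetical, made up only for this example. Note that there are no data clauses at all.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: scales a heap-allocated array in place.
   No data clauses are used; the assumption is that the build uses
   -ta=tesla:managed so that the malloc'd data lives in unified memory. */
void scale_managed(double *T, int N, double factor)
{
    assert(T != NULL && N >= 0);
    #pragma acc parallel loop
    for (int i = 0; i < N; ++i)
        T[i] *= factor;
}

/* Allocates dynamically (UM only covers dynamic data), fills the array
   on the host, scales it in the compute region, then sums it on the host. */
double sum_after_scale(int N, double factor)
{
    double *T = malloc((size_t)N * sizeof *T);
    if (T == NULL)
        return -1.0;   /* allocation failed */
    for (int i = 0; i < N; ++i)
        T[i] = 1.0;
    scale_managed(T, N, factor);
    double sum = 0.0;
    for (int i = 0; i < N; ++i)
        sum += T[i];
    free(T);
    return sum;
}
```

A static or stack array in place of the malloc call would not be covered by UM, so it would still need data directives.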
In practice, I tend to manually manage data. In part it’s because that’s what I’m used to doing, but also I’m primarily porting MPI+OpenACC codes, so CUDA Aware MPI performance is a requirement. Though once the performance issues are fixed and static data can be managed as well, I’ll start using UM more. I also find that if I start with UM, it’s difficult to go back and add data directives later.
So which should you use? If productivity is your primary goal and the caveats aren’t an issue, then use it. If the goal is to highly optimize the data movement, then data directives are the better path.