Ok, I should have given more detail on this aspect.
When using unstructured data regions (enter/exit data), the scoping doesn’t allow the compiler to “know” at compile time that a variable will be present on the device, so it adds an implicit copy. However, OpenACC semantics state that at runtime each array is first checked for presence; if it is already present, no copy is performed. Here most of these arrays will already be present, so the copies are skipped.
What I usually do is add a “present(var1,var2,…,varN)” clause or “default(present)”, which implicitly puts all arrays in a present clause. At runtime, if a variable is not present, an error is given.
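For example, either of these forms asserts that the arrays are already on the device and gives a runtime error if they aren’t (the variable names here are just placeholders):

!$acc parallel loop present(var1,var2,varN)
!$acc parallel loop default(present)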
There’s also “default(none)”, where the compiler doesn’t use implicit copies but instead checks that the arrays appear in a data clause within scope. The benefit is that this check happens at compilation, but it does mean adding all the arrays to a data clause (like “present”) either on each loop or within a structured data region encompassing multiple parallel loops within the same subroutine:
!$acc data present(var1,var2, ... varN) create(Fflux_iph)
!$acc parallel loop default(none)   ! inherits the data scoping from the outer data region since it's in the same scope
...
!$acc parallel loop
...
!$acc end data
Personally, I’ll typically use “default(present)”, but I know others prefer “default(none)”.
When adding “default(present)” the compiler feedback shows:
481, Generating default present(nk(1:nblocks),qp_iphr(0:nimax,1:njmax,1:nkmax,1:nblocks,1:5),qp_iphl(0:nimax,1:njmax,1:nkmax,1:nblocks,1:5),iy_iph(0:nimax,1:njmax,1:nkmax,1:nblocks),ix_iph(0:nimax,1:njmax,1:nkmax,1:nblocks),iz_iph(0:nimax,1:njmax,1:nkmax,1:nblocks),jac_iph(0:nimax,1:njmax,1:nkmax,1:nblocks),ni(1:nblocks),nj(1:nblocks))
552, Generating Tesla code
553, !$acc loop gang, vector(128) collapse(5) ! blockidx%x threadidx%x
554, ! blockidx%x threadidx%x collapsed
555, ! blockidx%x threadidx%x collapsed
556, ! blockidx%x threadidx%x collapsed
557, ! blockidx%x threadidx%x collapsed
Then at runtime, I see this error:
FATAL ERROR: data in PRESENT clause was not found on device 1: name=fflux_iph host:0x93b7220
So yes, there are extra data copies happening with the automatic arrays “Fflux_iph” and “Gflux_jph”. To fix this, add a data region so the arrays are only created on the device and reused by each compute region within the subroutine:
!$acc data create(Gflux_jph)
!$acc parallel loop gang vector collapse(4) private(F_L, F_R) default(present)
DO nbl = 1,nblocks
  DO k = 1,NKmax
    DO j = 0,NJmax
      DO i = 1,NImax
        ...
      END DO
    END DO
  END DO
END DO
!$acc end data
On my V100, the reported time goes from 7.2 to 6.9 seconds.
Looking at the profile (i.e., setting the environment variable NV_ACC_TIME=1), the time spent computing on the device is only about 0.13 seconds, with most of the remaining time in the device updates at lines 46 and 48.
So once you’re able to offload the loops in “MP5_FACE_INTERPOLATION_WITH_PRIMS_CHARS”, you’ll be able to remove these updates and see very good performance.
In looking at “MP5_FACE_INTERPOLATION_WITH_PRIMS_CHARS”, I see that you have some “matmul” calls. We’re just now (in 21.7) starting to add support for device side calling of matmul, but we’re not catching your use case. I’ll send the example to engineering to see if we can get it added in a future release. In the meantime, you may need to manually write out the matrix multiply instead of using the intrinsic.
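As a rough sketch of what writing it out by hand could look like (assuming a matrix–vector product; “Linv”, “qp”, “qp_char”, and “nvars” are placeholder names, not your actual variables):

! replaces a call like: qp_char = matmul(Linv, qp)
do ii = 1, nvars
   qp_char(ii) = 0.0d0
   do jj = 1, nvars
      qp_char(ii) = qp_char(ii) + Linv(ii,jj)*qp(jj)
   end do
end do

(with “ii” and “jj” declared as local integers, or added to a private clause if needed). Since this sits inside the collapsed gang/vector loops, the small inner loops will just run sequentially per thread, which is generally fine for small systems like these.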
I also see the following error which is preventing parallelization:
292, Accelerator restriction: induction variable live-out from loop: n_prim
309, Accelerator restriction: induction variable live-out from loop: n_prim
The problem is that “n_prim” is a module variable, which gives it global storage. To fix, either change these loops to use a local variable for the loop index, or add it to your private clauses.
Finally, you’re missing “Gma” in the private clause.
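For example, extending the private clause on the enclosing compute construct would cover both issues (a sketch based on the clause shown earlier; adjust it to whichever region actually contains those loops):

!$acc parallel loop gang vector collapse(4) private(F_L, F_R, n_prim, Gma) default(present)

Alternatively, declare a local integer in the subroutine and use it as the loop index at lines 292 and 309 instead of the module variable “n_prim”.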
After commenting out your matmul calls (I know this is cheating, but hey, I have to leave you some work), removing the updates, and fixing the private issues, the time goes down to 0.28 seconds. Obviously this will go back up once the matmuls are back in, but it just shows how much those updates are dominating the time.