OpenACC: Best way to manage data transfer between host and device


What is the best way to manage data in a code? Is using the ‘-ta=tesla:managed’ flag beneficial?
Or is it better practice to explicitly put data directives around each parallel loop section that will be executed multiple times during the whole run?

To explain my problem in more detail:

I have nested loops in my code like the ones I showed in my previous post: OpenACC: Best way to parallelize nested DO loops with data dependency between loops?. The code in that post is just a toy code I experimented with, and it did give good performance with the suggestions you provided. I have similar kinds of loops in my original code, and I implemented the same strategy you advised, which is to increase the amount of parallelism available to the GPU by increasing the number of blocks. All I did was add the following directives to my loops.

DO nbl = 1,nblocks
   queue = nbl
!$acc parallel loop gang collapse(3) async(queue)
   DO n_prim = 1,nprims
      DO k = 1, NK(nbl)
         DO j = 1, NJ(nbl)
!$acc loop vector
            DO i = 1, NI(nbl)

               <some operations>

            END DO
         END DO
      END DO
   END DO
END DO

I expected a similar speedup to what I saw in the toy code. But unfortunately I didn't see a speedup; in fact, the whole code is running a little slower. The more of these directives I add, the slower the code gets. I have multiple sections like these in my code.

I am using nblocks = 1, NI(1) = 400, NJ(1) = 400, NK(1) = 1.

My large code is not the same as the toy code I posted previously, but the loop structures are similar. What might be the probable reason for the slowdown?


Using CUDA Unified Memory (i.e. ‘managed’) is very simple, and if the program isn’t frequently copying data back and forth, it often gives the same performance as explicit data directives. The caveats are that UM only works with allocated data and does not work with CUDA-aware MPI.

I personally almost always use data directives, but mostly because I’m used to them and work a lot with hybrid MPI+OpenACC codes. In this case, I take a top-down approach where I add ‘enter data create’ directives just after I allocate my arrays (and ‘exit data delete’ just before the deallocates), then use ‘update’ directives to manage the data movement. As I incrementally add the compute regions via a bottom-up approach, I put updates before and after them. Then, as I add more compute regions, I move the updates outwards until all the compute and initialization is offloaded, using as few updates as possible.
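As a minimal sketch of this pattern (the array and loop names here are illustrative, not taken from the code above):

    REAL, ALLOCATABLE :: qc(:,:,:)

    ALLOCATE(qc(ni,nj,nk))
    !$acc enter data create(qc)     ! device copy exists for the array's whole lifetime

    !$acc update device(qc)         ! push initialized host data once, before the time loop
    DO step = 1, nsteps
       ! ... offloaded compute regions that reference qc as present ...
    END DO
    !$acc update self(qc)           ! pull results back once, after the time loop

    !$acc exit data delete(qc)      ! remove the device copy just before deallocation
    DEALLOCATE(qc)

The goal is that the only ‘update’ directives left are the ones bracketing the whole offloaded section, not ones inside the time loop.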

What might be the probable reason for slowing down?

It could be any number of things; only the compiler feedback and a profiler can tell you for sure. Some things to look for:

Data movement:
Is the compiler needing to add implicit data copies to your loops (as seen in the compiler feedback, i.e. -Minfo=accel)?
Is there excessive data movement between the host and device? To tell, either use the compiler’s runtime profile (i.e. set the env var NV_ACC_TIME=1) or the Nsight Systems profiler. (When using UM, you’ll need to use Nsight Systems.)

Kernel Performance:
Is there enough compute to feed a GPU?
Are the loop schedules optimal and ‘vector’ applied to the stride-1 dimension? (-Minfo=accel will tell you the schedule being used).
If using atomics or reductions, can they be removed by revising the algorithm?

Using the Nsight-Compute profiler:
What’s the CUDA Occupancy? Low occupancy is often the result of using too many registers (which store local variables) when using large kernels.
Is there a lot of register spilling?
Is the kernel able to effectively utilize the L1 and L2 caches?
Are the warps stalled? If so, why? Waiting for memory (i.e. long scoreboard), contention on the FP unit?

There are more things to look at as well, but that’s enough to start with.


Yes. Can you please explain what implicit data movement means? I could not find its meaning in the manuals… Thanks!

Should I add that as the compiler flag?

This is defined in various Sections, but mostly in Section 2.6.2 of the OpenACC Spec

Basically, in order to compute something on a device with discrete memory, the data needs to be copied over. Hence, when the user does not explicitly use data directives and clauses, the compiler must implicitly handle the data movement for you. The performance problem is that this has to occur every time the compute region is launched, so it can cause a lot of extraneous copies.
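A sketch of the difference, with hypothetical array names: without a data region, the compiler must conservatively copy ‘a’ and ‘b’ at every launch; wrapping the loop in a data region copies them once for all iterations.

    ! Implicit: a and b are copied to/from the device on every outer iteration
    DO step = 1, nsteps
    !$acc parallel loop
       DO i = 1, n
          a(i) = a(i) + b(i)
       END DO
    END DO

    ! Explicit: one copy in, one copy out, covering all nsteps iterations
    !$acc data copy(a) copyin(b)
    DO step = 1, nsteps
    !$acc parallel loop
       DO i = 1, n
          a(i) = a(i) + b(i)
       END DO
    END DO
    !$acc end data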

No, “NV_ACC_TIME” is an environment variable. You can set and unset it without needing to recompile. I’m not sure what shell you’re using, but in bash it’s “export NV_ACC_TIME=1”, and in csh it’s “setenv NV_ACC_TIME 1”. Either unset the variable or set it to 0 to disable it.


The problem seems to be the data movement in my case. I have removed the ‘-ta=tesla:managed’ compiler flag and created explicit data directives all over. I tried to run almost everything in the time loop on the GPU, eliminating (mostly) the need to transfer data back to the host throughout the main time-loop process. The exception: to carry out one operation on the host, I used an ‘update self’ directive and then ‘update device’ to update the device copy of that array afterwards. After all these changes I am STILL seeing implicit data copies of arrays in the ‘-Minfo=accel’ output, as shown below. As a result, the program runs slower on the GPU (CPU: 8.1 secs, GPU: 10.2 secs). Please kindly point out the mistake I am probably making.

 71, !$acc loop gang, vector(128) collapse(5) ! blockidx%x threadidx%x
         72,   ! blockidx%x threadidx%x collapsed
         73,   ! blockidx%x threadidx%x collapsed
         74,   ! blockidx%x threadidx%x collapsed
         75,   ! blockidx%x threadidx%x collapsed
     70, Generating implicit copyin(nj(1:nblocks),ni(1:nblocks),jac(1:nimax,1:njmax,1:nkmax,1:nblocks),residual_rhs(1:nimax,1:njmax,1:nkmax,1:nblocks,1:nconserv)) [if not already present]
         Generating implicit copy(qc(1:nimax,1:njmax,1:nkmax,1:nblocks,1:nconserv)) [if not already present]
         Generating implicit copyin(nk(1:nblocks),rk_factor1(step),rk_factor3(step),rk_factor2(step),qc_initial(1:nimax,1:njmax,1:nkmax,1:nblocks,1:nconserv)) [if not already present]
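For reference, the host-side exception I described above looks roughly like this (the array and routine names here are placeholders, not my actual code):

    !$acc update self(qc)            ! bring the device data to the host
    CALL host_only_operation(qc)     ! the one operation that must run on the host
    !$acc update device(qc)          ! push the modified array back to the device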


Ok, I should have given more detail on this aspect.

When using unstructured data regions (enter/exit data), the scoping doesn’t allow the compiler to “know” at compile time that the variable will be present, so it generates the implicit copy. However, OpenACC semantics state that at runtime the array is checked for presence; if it is already present, no copy is performed. Here, most of these arrays will be present, so the copies are not actually done.

What I usually do is add a “present(var1,var2,…,varN)” clause, or “default(present)”, where all arrays are implicitly put in a present clause. At runtime, if a variable is not present, an error is given.

There’s also “default(none)”, where the compiler doesn’t use implicit copies but instead checks that the arrays are in a data clause within scope. The benefit is that this is done during compilation, but it does mean adding all the arrays to a data clause (like “present”), either on each loop or within a structured data region encompassing multiple parallel loops in the same subroutine:

!$acc data present(var1, var2, ... varN) create(Fflux_iph)
!$acc parallel loop default(none)  ! inherits the data scoping from the outer data region since it's in the same scope
!$acc parallel loop
!$acc end data

Personally, I typically use “default(present)”, but I know others prefer “default(none)”.

When adding “default(present)” the compiler feedback shows:

    481, Generating default present(nk(1:nblocks),qp_iphr(0:nimax,1:njmax,1:nkmax,1:nblocks,1:5),qp_iphl(0:nimax,1:njmax,1:nkmax,1:nblocks,1:5),iy_iph(0:nimax,1:njmax,1:nkmax,1:nblocks),ix_iph(0:nimax,1:njmax,1:nkmax,1:nblocks),iz_iph(0:nimax,1:njmax,1:nkmax,1:nblocks),jac_iph(0:nimax,1:njmax,1:nkmax,1:nblocks),ni(1:nblocks),nj(1:nblocks))
    552, Generating Tesla code
        553, !$acc loop gang, vector(128) collapse(5) ! blockidx%x threadidx%x
        554,   ! blockidx%x threadidx%x collapsed
        555,   ! blockidx%x threadidx%x collapsed
        556,   ! blockidx%x threadidx%x collapsed
        557,   ! blockidx%x threadidx%x collapsed

Then at runtime, I see this error:

FATAL ERROR: data in PRESENT clause was not found on device 1: name=fflux_iph host:0x93b7220

So yes, there are extra data copies happening with the automatic arrays “Fflux_iph” and “Gflux_jph”. To fix this, add a data region so the arrays are only created on the device and used in each compute region within the subroutine:

!$acc data create(Gflux_jph)
!$acc parallel loop gang vector collapse(4) private(F_L, F_R) default(present)
DO nbl = 1,nblocks
   DO k = 1,NKmax
      DO j = 0,NJmax
         DO i = 1,NImax

            <loop body>

         END DO
      END DO
   END DO
END DO
!$acc end data

On my V100, the reported time goes from 7.2 to 6.9 seconds.

Looking at the profile (i.e. setting NV_ACC_TIME=1), the time spent computing on the device is only about 0.13 seconds, with most of the remaining time in the device updates at lines 46 and 48.

So once you’re able to offload the loops in “MP5_FACE_INTERPOLATION_WITH_PRIMS_CHARS”, you’ll be able to remove these updates and see very good performance.

In looking at “MP5_FACE_INTERPOLATION_WITH_PRIMS_CHARS”, I see that you have some “matmul” calls. We’re just now (in 21.7) starting to add support for device-side calling of matmul, but we’re not catching your use case. I’ll send the example to engineering to see if we can get it added in a future release. In the meantime, you may need to manually write out the matrix multiply instead of using the intrinsic.
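A manual replacement for a matmul call might look like the following sketch (the names and the matrix-vector shape here are illustrative; it assumes something like q = matmul(A, b) for an n x n matrix A):

    ! Hand-written equivalent of q = matmul(A, b),
    ! which can be inlined inside a device compute region
    DO row = 1, n
       tmp = 0.0
       DO col = 1, n
          tmp = tmp + A(row, col) * b(col)
       END DO
       q(row) = tmp
    END DO

Note that ‘tmp’, ‘row’, and ‘col’ would need to be local scalars (or listed in a private clause) when this sits inside a parallel loop.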

I also see the following error which is preventing parallelization:

292, Accelerator restriction: induction variable live-out from loop: n_prim
309, Accelerator restriction: induction variable live-out from loop: n_prim

The problem is that “n_prim” is a module variable, which gives it global storage. To fix this, either change these loops to use a local variable for the loop index, or add “n_prim” to your private clauses.

Finally, you’re missing “Gma” in the private clause.
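Sketching both fixes on a loop like the one earlier in the thread (the directive shown is illustrative):

    ! Fix 1: use a locally declared loop index, which is implicitly private
    INTEGER :: np                       ! local index (hypothetical name)
    !$acc parallel loop gang collapse(3)
    DO np = 1, nprims

    ! Fix 2: keep the module variable, but declare it private explicitly,
    ! along with the other missing scalar
    !$acc parallel loop gang collapse(3) private(n_prim, Gma)
    DO n_prim = 1, nprims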

After commenting out your matmul calls (I know this is cheating, but hey, I have to leave you some work), removing the update, and fixing the private issues, the time goes to 0.28 seconds. Obviously this will go back up once the matmuls are back in, but it shows just how much those updates are dominating the time.


Thanks for pointing this out. I have corrected it. Now those loops are working fine.

Alright, great, thanks!

You are right, once offloading this subroutine, I started seeing the performance.

Haha, yeah. I wrote manual matrix multiplication code.

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.