How to force synchronous send using OpenACC data construct

My code keeps sending data asynchronously when transferring from host to device. The NVIDIA profiler shows the data being sent in 1MB chunks.

My data clauses are shown below, just before entering the time loop:
#pragma acc data copyin(U2[0:p.domain_size],U3[0:p.domain_size],source[0:nt], coef[0:five]), copy(U1[0:p.domain_size])
for (i=0; i<nt; i+=2) { // time loop

My code uses arrays U1, U2, and U3 for the computation, and the output result is in U1 after the time loop.

The copyin is done asynchronously, while the copyout is synchronous. How do I make my code send the data synchronously in one chunk so that data movement is efficient?

Hi King!

In 12.x, the OpenACC runtime copies the user data into a 1MB pinned buffer and then transfers that buffer asynchronously. In 13.1-13.3, the OpenACC runtime pins the user memory itself, so it should send the data in a single contiguous chunk. However, we have recently found cases where pinning the memory is more costly than copying it in chunks. Hence, we may need to go back to using the pinned buffer, at least for some cases.
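
To make the 12.x behavior a bit more concrete, here is a rough host-only sketch of what staging through a fixed-size buffer looks like. It is not the runtime's actual code: `staged_copy` and `STAGE_BYTES` are hypothetical names, an ordinary static array stands in for the pinned buffer, and `memcpy` stands in for the asynchronous host-to-device copy. It only illustrates why a large `copyin` can show up in the profiler as a series of 1MB transfers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical staging-buffer size, chosen to match the 1MB chunks
   seen in the profiler. */
#define STAGE_BYTES ((size_t)1 << 20)

/* Host-only stand-in for a chunked transfer: copies n bytes from src to
   dst through a fixed-size staging buffer and returns the number of
   chunks used. On a real device, each chunk would be one asynchronous
   host-to-device copy; here memcpy plays that role. */
size_t staged_copy(void *dst, const void *src, size_t n)
{
    static unsigned char stage[STAGE_BYTES]; /* stands in for the pinned buffer */
    size_t chunks = 0;
    size_t off = 0;

    assert(n == 0 || (dst != NULL && src != NULL));

    while (off < n) {
        size_t len = n - off;
        if (len > STAGE_BYTES)
            len = STAGE_BYTES;
        /* user memory -> staging buffer */
        memcpy(stage, (const unsigned char *)src + off, len);
        /* staging buffer -> destination (the "device" in this sketch) */
        memcpy((unsigned char *)dst + off, stage, len);
        off += len;
        chunks++;
    }
    return chunks;
}
```

With this sketch, an array of 2MB plus a few hundred bytes goes through in three chunks, which is the same pattern as the many 1MB transfers in the profiler output. Pinning the user memory directly, as in 13.1-13.3, removes the first `memcpy` and the chunking, at the cost of the pinning call itself.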

  • Mat