My code keeps transferring data asynchronously from host to device. According to the NVIDIA profiler, the data is sent in 1 MB chunks.
My data clause, placed just before the time loop, is shown below:
#pragma acc data copyin(U2[0:p.domain_size],U3[0:p.domain_size],source[0:nt], coef[0:five]), copy(U1[0:p.domain_size])
for (i=0; i<nt; i+=2) { // time loop
…
My code uses arrays U1, U2, and U3 for the computation, and the result is in U1 after the time loop.
The copyin transfers are done asynchronously, while the copyout is synchronous. How do I make my code send the data synchronously in one chunk so that data movement is more efficient?
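
For reference, a minimal self-contained sketch of the pattern described above. Only the shape of the data region and the array names U1, U2, U3, source, coef, and nt come from the post; the function name run_time_loop, the parameters n and ncoef, and the 1-D update rule are hypothetical placeholders, not the real kernels. Without an OpenACC compiler the pragmas are ignored and the loop simply runs on the host.

```c
#include <stddef.h>

/* Hypothetical stand-in for the real solver: the update rule below is a
   placeholder chosen only so the example compiles and runs. */
void run_time_loop(float *U1, float *U2, float *U3,
                   const float *source, const float *coef,
                   int n, int nt, int ncoef)
{
    /* One data region around the whole time loop: U2, U3, source, and coef
       are copied to the device once at entry; U1 is copied in at entry and
       copied back to the host at exit. */
    #pragma acc data copyin(U2[0:n], U3[0:n], source[0:nt], coef[0:ncoef]) copy(U1[0:n])
    {
        for (int t = 0; t < nt; t += 2) { /* time loop */
            /* Arrays are already on the device, so no transfer happens here. */
            #pragma acc parallel loop present(U1, U2, U3, source, coef)
            for (int j = 0; j < n; j++) {
                U1[j] += coef[0] * (U2[j] + U3[j]) + source[t];
            }
        }
    }
}
```

With this structure, host-to-device traffic should occur only at entry to the data region and device-to-host traffic only at its exit, which is the behavior the profiler output can be checked against.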