Number of data copyin and copyout operations unexpected


I’m porting an existing CFD code to OpenACC and am trying to optimize the data transfers a bit.

I noticed that with PGI_ACC_TIME=1 enabled, I get the following breakdown:

27: data region reached 11 times
27: data copyin reached 440 times
device time(us): total=1,216,222 max=2,815 min=1,992 avg=2,764
84: data copyout reached 77 times
device time(us): total=186,841 max=2,701 min=1,467 avg=2,426

My data region statement is:

*$acc data pcopyin(x, u) pcopyout(xmu)

where x, u, and xmu are multidimensional (5-d) arrays in Fortran. I expected the copyin to occur 22 times (perhaps 33 to allow for the alloc) and the copyout to occur 11 times, since I’m calling the routine 11 times for the benchmark.

Any hints as to why the number of copies doesn’t match this expectation?

I also tried replacing the data region with explicit acc_copyin(), acc_create(), acc_delete() and acc_copyout() calls (along with a present clause on the kernel), but that’s not working properly even though the number of transfers is 3. That’s for another post though.

Thanks for any guidance here.

Hi cps,

In order to maximize the performance of large data transfers, the runtime breaks each data transfer into multiple smaller transfers. While one pinned memory buffer gets sent to the device, other buffers get filled from virtual memory. The copy from virtual to pinned memory can take time, so this overlap helps speed things up. The counts you see are therefore the number of these smaller transfers, not the number of arrays copied.
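As a rough sanity check, the counts in the posted profile can be divided by the 11 calls to see how many chunked transfers each call produces. This is only arithmetic on the numbers in the question; the variable names are made up for illustration.

```python
# Counts taken from the PGI_ACC_TIME output posted in the question.
calls = 11            # "data region reached 11 times"
copyin_events = 440   # "data copyin reached 440 times"
copyout_events = 77   # "data copyout reached 77 times"

# Transfers per call: copyin covers x and u together, copyout covers xmu.
copyin_per_call = copyin_events // calls
copyout_per_call = copyout_events // calls

print(copyin_per_call, copyout_per_call)  # prints: 40 7
```

Both counts divide evenly by 11, which fits the picture of each call moving the same arrays in the same number of pieces.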

We’ve debated internally what the correct number to print here is. It was decided that it should match what you would see if you had performed a CUDA profile using nvprof or nvvp. However, we acknowledge that this can be confusing. I’ll bring it up again.

Note, you can change the buffer size and thus reduce the number of transfers by setting the environment variable “PGI_ACC_BUFFERSIZE”.
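To see why a larger buffer means fewer reported transfers, here is a minimal sketch that assumes each array is simply split into buffer-sized chunks. That model, the helper name `transfers_needed`, and the 100 MiB array size are all assumptions for illustration; the runtime's actual chunking may differ, and the units expected by PGI_ACC_BUFFERSIZE should be checked in the compiler documentation.

```python
import math

def transfers_needed(array_bytes, buffer_bytes):
    """Chunks needed to move array_bytes through a buffer of buffer_bytes.

    Assumed model: the array is cut into buffer-sized pieces.
    """
    return math.ceil(array_bytes / buffer_bytes)

MIB = 2**20
array_bytes = 100 * MIB  # hypothetical array size, not from the post

for buffer_mib in (8, 16, 64):
    # Larger buffers give fewer, larger transfers for the same array.
    print(buffer_mib, transfers_needed(array_bytes, buffer_mib * MIB))
```

Under this assumed model, the hypothetical array needs 13, 7, and 2 transfers for the three buffer sizes.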

  • Mat