efficiency of copying a strided array

When copying a section of an array from device to host, as in

     real :: A(100,100,100)
     copyout(A(1:4,1:4,1:4))

how does the OpenACC compiler handle the strided array section? If the section of memory to be copied is not stored contiguously, does it create a temporary buffer to pack the array section contiguously, copy the data out to the CPU in one go, and then fill in the CPU side of the array section? Fortran compilers can create temporary arrays for array sections as an optimization.

What worries me is that, when I turn on PGI_ACC_NOTIFY, I see small chunks of the array being copied. Is it really as inefficient as it looks, or can the OpenACC compiler do optimizations behind the scenes?

Daniel

DMA transfers can only be done on contiguous blocks of memory, so when transferring non-contiguous sub-arrays, the compiler must split the transfer into small blocks.
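To get a feel for how many blocks that is, here is a small sketch in plain Fortran (no OpenACC needed) that counts the contiguous runs in a section of A(100,100,100) by walking it in array element order. The program and function names are made up for illustration, and it only models the memory layout; it does not show what the runtime actually does. For the section from the question it reports 16 runs (of 4 elements each), while a section that takes the first two dimensions in full is a single run.

```fortran
program count_runs
  implicit none
  integer, parameter :: n = 100   ! extent of each dimension of A(n,n,n)

  print '(a,i0)', 'runs in A(1:4,1:4,1:4): ', runs(1, 4, 1, 4, 1, 4)
  print '(a,i0)', 'runs in A(1:4,:,:):     ', runs(1, 4, 1, n, 1, n)
  print '(a,i0)', 'runs in A(:,:,1:4):     ', runs(1, n, 1, n, 1, 4)

contains

  ! Count the contiguous runs in A(i1:i2,j1:j2,k1:k2) by walking the section
  ! in array element order and counting jumps in the column-major offset.
  integer function runs(i1, i2, j1, j2, k1, k2)
    integer, intent(in) :: i1, i2, j1, j2, k1, k2
    integer :: i, j, k, offset, prev
    runs = 0
    prev = -2                      ! sentinel: no previous element yet
    do k = k1, k2
      do j = j1, j2
        do i = i1, i2
          offset = (i-1) + n*(j-1) + n*n*(k-1)
          if (offset /= prev + 1) runs = runs + 1
          prev = offset
        end do
      end do
    end do
  end function runs
end program count_runs
```

Because Fortran stores arrays in column-major order, restricting only the first index still leaves one run per (j,k) pair, which is why the second line reports 10000 runs.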

While the compiler can’t optimize this, you can, by copying only contiguous blocks. Fortran arrays are column-major, so that means taking the leading dimensions in full, i.e. “A(:,:,1:4)”. Yes, you’ll be copying more data, but it’s often faster than transferring many small blocks. The compiler cannot do this for you since it doesn’t know whether it’s safe to copy the entire first and second dimensions.
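A minimal sketch of what that could look like, assuming the kernel only writes A(1:4,1:4,1:4). In Fortran's column-major layout the contiguous slab that covers it is A(:,:,1:4). The program name and the values written are made up for illustration. Note the choice of copy instead of copyout: that is an assumption of this sketch, made so that the host elements of the slab outside the 4x4x4 corner are sent to the device and come back unchanged instead of being overwritten with uninitialized device data.

```fortran
program copy_slab
  implicit none
  integer, parameter :: n = 100
  real :: A(n,n,n)
  integer :: i, j, k

  A = 0.0

  ! One contiguous slab is transferred in each direction instead of
  ! 16 separate 4-element pieces.
  !$acc parallel loop collapse(3) copy(A(:,:,1:4))
  do k = 1, 4
    do j = 1, 4
      do i = 1, 4
        A(i,j,k) = real(i + j + k)
      end do
    end do
  end do

  ! The corner was updated; the rest of the slab kept its host values.
  if (A(4,4,4) /= 12.0 .or. A(5,5,1) /= 0.0) error stop 'unexpected result'
  print *, 'A(4,4,4) =', A(4,4,4), '  A(5,5,1) =', A(5,5,1)
end program copy_slab
```

Built without OpenACC the directive is just a comment and the program runs on the host; built with OpenACC enabled (for example -acc with the PGI compilers), running it with PGI_ACC_NOTIFY set should let you compare the transfers against the strided version.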

-Mat