Optimizing array copy

Hello. I’m trying to optimize an array copy in my program. I assumed that a sequential memory copy would be faster than an element-by-element copy in a loop. However, my measurements show the opposite.

definitions:

KOL_MOMENT_MAX=2000
NN1=514,NN2=280
real, allocatable :: E0_NAKOP(:,:,:),C0_NAKOP(:,:,:),E0(:,:),C0(:,:)
allocate(E0_NAKOP(KOL_MOMENT_MAX,NN1,NN2))
allocate(C0_NAKOP(KOL_MOMENT_MAX,NN1,NN2))
allocate(E0(NN1,NN2))
allocate(C0(NN1,NN2))

first variant

do i=1,NN2
  do j=1,NN1
    E0_NAKOP(KOL_NAKOP,j,i)=E0(j,i)
    C0_NAKOP(KOL_NAKOP,j,i)=C0(j,i)
  enddo
enddo

second variant

E0_NAKOP(KOL_NAKOP,:,:)=E0(:,:)
C0_NAKOP(KOL_NAKOP,:,:)=C0(:,:)

third variant

E0_NAKOP(KOL_NAKOP,:,:)=E0
C0_NAKOP(KOL_NAKOP,:,:)=C0

The run time of the function that performs this copy changes as follows for the three variants:

  1. 334,7
  2. 418,3
  3. 538,1

As I understand it, the compiler generates some odd code, so variants 2 and 3 are not actually a sequential copy but a version of the same loop with even more overhead.
How can I force the compiler to use something like C memcpy, which is the most efficient way to copy contiguous arrays?

I suspect my problem is that the subscripts that are sequential in memory are the leftmost ones, not the rightmost ones. Am I right?
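For reference, Fortran stores arrays in column-major order, so the leftmost subscript varies fastest in memory. The following is a minimal sketch of the offset arithmetic, not code from the program above: the extents `n1` and `n2` and the helper `offset` are hypothetical names chosen for illustration.

```fortran
program column_major_offsets
  implicit none
  ! Hypothetical small extents of the first two dimensions (illustration only)
  integer, parameter :: n1 = 4, n2 = 3
  integer :: j

  ! First subscript fixed: stepping the second subscript jumps n1 elements
  do j = 1, 3
    print '(a,i0,a,i0)', 'a(2,', j, ',1) -> offset ', offset(2, j, 1)
  end do

  ! Last subscript fixed: stepping the first subscript moves one element
  do j = 1, 3
    print '(a,i0,a,i0)', 'a(', j, ',1,2) -> offset ', offset(j, 1, 2)
  end do

contains

  ! Zero-based element offset of a(i1,i2,i3) in column-major storage
  pure integer function offset(i1, i2, i3)
    integer, intent(in) :: i1, i2, i3
    offset = (i1 - 1) + n1*(i2 - 1) + n1*n2*(i3 - 1)
  end function offset

end program column_major_offsets
```

With these extents, the first loop steps through memory with a stride of 4 elements, while the second steps with a stride of 1.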

I have also tried

      E0_NAKOP(:,:,KOL_NAKOP)=E0(:,:)
      C0_NAKOP(:,:,KOL_NAKOP)=C0(:,:)

but this also takes more time than the first variant.
I looked at the generated assembly and saw that there is a lot of overhead code for each array, even when a fast sequential copy is possible. That explains why the first variant is the fastest. But I still want to know: is it possible to make it as fast as C memcpy? memcpy is said to be faster for contiguous data than a simple element-by-element loop.

Hi Senya,

What optimization flags are you using?

We do perform idiom recognition and will replace array assignment with mcopy or mset where the shapes of the arrays are the same and the data is contiguous.

In your first examples, where KOL_NAKOP is in the first dimension, the data is not contiguous. However, the second and third variants should be faster if you use “-fast”.

In the second example, where KOL_NAKOP is in the third dimension, the data is contiguous, so memcopy will be used (again, with “-fast”).
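The contiguity difference between the two layouts can be checked directly. This is a minimal sketch, assuming a compiler that supports the Fortran 2008 `is_contiguous` intrinsic (older compilers may not); the array names `a_first` and `a_last` and the small extents are hypothetical.

```fortran
program contiguity_check
  implicit none
  ! Hypothetical small extents, chosen only for illustration
  integer, parameter :: kmax = 5, nn1 = 4, nn2 = 3
  real, allocatable :: a_first(:,:,:), a_last(:,:,:)
  integer :: k

  allocate(a_first(kmax, nn1, nn2))   ! slice index in the first dimension
  allocate(a_last(nn1, nn2, kmax))    ! slice index in the last dimension
  k = 2

  ! Section with the first subscript fixed: elements are kmax apart in memory
  print *, 'a_first(k,:,:) contiguous: ', is_contiguous(a_first(k,:,:))

  ! Section with the last subscript fixed: one solid block of nn1*nn2 elements
  print *, 'a_last(:,:,k)  contiguous: ', is_contiguous(a_last(:,:,k))
end program contiguity_check
```

The first section should be reported as not contiguous and the second as contiguous, which matches the condition under which the copy idiom can be applied.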

I wrote up examples for each. In the “-Minfo” output, I see memcopy being used in the second example:


! first example E0_NAKOP(KOL_NAKOP,:,:)=E0(:,:) 
% pgf90 senya.f90 -Minfo -fast -V13.10
foo:
     16, Memory set idiom, array assignment replaced by call to pgf90_mset4
     17, Loop not fused: function call before adjacent loop
     18, Generated vector sse code for the loop
     22, Memory copy idiom, array assignment replaced by call to pgf90_mcopy4
     24, Loop interchange produces reordered loop nest: 25,26,24
         Generated an alternate version of the loop
         Generated vector sse code for the loop
     25, Loop not fused: function call before adjacent loop
         5 loops fused
     26, Loop not fused: dependence chain to sibling loop
     33, Loop distributed: 2 new loops
         Loop interchange produces reordered loop nest: 34,34,33
         Loop interchange produces reordered loop nest: 35,35,33
         2 loops fused
         Generated an alternate version of the loop
         Generated vector sse code for the loop
     34, Loop not fused: dependence chain to sibling loop
         2 loops fused
     38, Loop distributed: 2 new loops
         Loop interchange produces reordered loop nest: 39,39,38
         Loop interchange produces reordered loop nest: 40,40,38
         2 loops fused
         Generated an alternate version of the loop
         Generated vector sse code for the loop
     39, 2 loops fused

! second example E0_NAKOP(:,:,KOL_NAKOP)=E0(:,:)
% pgf90 senya2.f90 -Minfo -fast -V13.10
foo:
     16, Memory set idiom, array assignment replaced by call to pgf90_mset4
     17, Loop not fused: function call before adjacent loop
     18, Generated vector sse code for the loop
     22, Memory copy idiom, array assignment replaced by call to pgf90_mcopy4
     24, Loop interchange produces reordered loop nest: 25,24,26
     26, Generated an alternate version of the loop
         Generated vector sse code for the loop
         Generated 2 prefetch instructions for the loop
     34, Memory copy idiom, loop replaced by call to __c_mcopy4
     35, Memory copy idiom, loop replaced by call to __c_mcopy4
     39, Memory copy idiom, loop replaced by call to __c_mcopy4
     40, Memory copy idiom, loop replaced by call to __c_mcopy4
  • Mat

OK, thank you. That was what I needed.
Just to share my thoughts:
The only thing I dislike is that -fast implies loop vectorization, which breaks the ability to debug. So if you want to debug, you have to use suboptimal loops instead of idiom replacement. It would be good to be able either to enable idiom replacement separately or to tell the compiler explicitly to use it in particular cases.
However, I think idiom replacement works well enough to be the default compiler behavior (even with optimization turned off), with an option to turn it off. That is how I expected the compiler to behave.

“…is that -fast implies loop vectorization, which breaks the ability to debug”

FYI, you can disable vectorization using “-Mnovect”. Also, when debugging with optimization enabled, use “-gopt” instead of “-g”, since “-g” will inhibit some optimizations to make the code more readable.

  • Mat