Unordered array access is faster than ordered access?

Hi!
I am a bit confused about the following issue:
I am porting an unstructured grid application. In order to get coalesced memory access, I generated a new vector (Q_GPU_kc) that is ordered with respect to the cells (kc).
The original vector (Q_GPU) is ordered with respect to the nodes.
PCELL_GPU is a connectivity array: PCELL_GPU(kc,i) holds the index of the i-th node belonging to cell kc.

Q_GPU_kc, the vector ordered with respect to the cells, was created as follows:

!$acc region
      DO i=1,8
         DO kc=1,kcend
            Q_GPU_kc(kc,i,1) = Q_GPU(pcell_GPU(kc,i),1)
            Q_GPU_kc(kc,i,2) = Q_GPU(pcell_GPU(kc,i),2)
            Q_GPU_kc(kc,i,3) = Q_GPU(pcell_GPU(kc,i),3)
            Q_GPU_kc(kc,i,4) = Q_GPU(pcell_GPU(kc,i),4)
            Q_GPU_kc(kc,i,5) = Q_GPU(pcell_GPU(kc,i),5)
            Q_GPU_kc(kc,i,6) = Q_GPU(pcell_GPU(kc,i),6)
         END DO
      END DO
!$acc end region

What I thought was that using this ordered vector inside a loop indexing over kc would improve my performance. However, the opposite happened…

Is it because a kernel has problems dealing with 3-D arrays? The memory access in the loop reading Q_GPU_kc should be very efficient, no?
I should also say that the grid I am using is not fully unstructured. However, in my opinion I should still see better performance using the Q_GPU_kc vector.
Do you have any explanation for this?

The code with the ordered vector is the following:

!$acc region 
      DO kc =1, kcend 
         DELTA_Q_T(KC,1) =                               & 
                Q_GPU_kc(KC,1,2)*S_X_GPU(kc,1)           & 
               +Q_GPU_kc(KC,2,2)*S_X_GPU(kc,2)           & 
               +Q_GPU_kc(KC,3,2)*S_X_GPU(kc,3)           & 
               +Q_GPU_kc(KC,4,2)*S_X_GPU(kc,4)           & 
               +Q_GPU_kc(KC,5,2)*S_X_GPU(kc,5)           & 
               +Q_GPU_kc(KC,6,2)*S_X_GPU(kc,6)           & 
               +Q_GPU_kc(KC,7,2)*S_X_GPU(kc,7)           & 
               +Q_GPU_kc(KC,8,2)*S_X_GPU(kc,8)           & 
               +Q_GPU_kc(KC,1,3)*S_Y_GPU(kc,1)           & 
               +Q_GPU_kc(KC,2,3)*S_Y_GPU(kc,2)           & 
               +Q_GPU_kc(KC,3,3)*S_Y_GPU(kc,3)           & 
               +Q_GPU_kc(KC,4,3)*S_Y_GPU(kc,4)           & 
               +Q_GPU_kc(KC,5,3)*S_Y_GPU(kc,5)           & 
               +Q_GPU_kc(KC,6,3)*S_Y_GPU(kc,6)           & 
               +Q_GPU_kc(KC,7,3)*S_Y_GPU(kc,7)           & 
               +Q_GPU_kc(KC,8,3)*S_Y_GPU(kc,8)           & 
               +Q_GPU_kc(KC,1,4)*S_Z_GPU(kc,1)           & 
               +Q_GPU_kc(KC,2,4)*S_Z_GPU(kc,2)           & 
               +Q_GPU_kc(KC,3,4)*S_Z_GPU(kc,3)           & 
               +Q_GPU_kc(KC,4,4)*S_Z_GPU(kc,4)           & 
               +Q_GPU_kc(KC,5,4)*S_Z_GPU(kc,5)           & 
               +Q_GPU_kc(KC,6,4)*S_Z_GPU(kc,6)           & 
               +Q_GPU_kc(KC,7,4)*S_Z_GPU(kc,7)           & 
               +Q_GPU_kc(KC,8,4)*S_Z_GPU(kc,8) 

      END DO 
!$acc end region


     47, Generating compute capability 2.0 binary
     49, Loop is parallelizable
         Accelerator kernel generated
         49, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 2.0 : 63 registers; 4 shared, 208 constant, 0 local memory bytes; 33% occupancy


    47: region entered 20 times
        time(us): total=77350 init=5 region=77345
                  kernels=75924 data=0
        w/o init: total=77345 max=3945 min=3847 avg=3867
        49: kernel launched 20 times
            grid: [3063]  block: [256]
            time(us): total=75924 max=3807 min=3778 avg=3796

The code with the “unordered” original vector is the following:

!$acc region 
      DO kc =1, kcend   
         DELTA_Q_T(KC,1) =                                       & 
                Q_GPU(PCELL_GPU(KC,1),2)*S_X_GPU(kc,1)           & 
               +Q_GPU(PCELL_GPU(KC,2),2)*S_X_GPU(kc,2)           & 
               +Q_GPU(PCELL_GPU(KC,3),2)*S_X_GPU(kc,3)           & 
               +Q_GPU(PCELL_GPU(KC,4),2)*S_X_GPU(kc,4)           & 
               +Q_GPU(PCELL_GPU(KC,5),2)*S_X_GPU(kc,5)           & 
               +Q_GPU(PCELL_GPU(KC,6),2)*S_X_GPU(kc,6)           & 
               +Q_GPU(PCELL_GPU(KC,7),2)*S_X_GPU(kc,7)           & 
               +Q_GPU(PCELL_GPU(KC,8),2)*S_X_GPU(kc,8)           & 
               +Q_GPU(PCELL_GPU(KC,1),3)*S_Y_GPU(kc,1)           & 
               +Q_GPU(PCELL_GPU(KC,2),3)*S_Y_GPU(kc,2)           & 
               +Q_GPU(PCELL_GPU(KC,3),3)*S_Y_GPU(kc,3)           & 
               +Q_GPU(PCELL_GPU(KC,4),3)*S_Y_GPU(kc,4)           & 
               +Q_GPU(PCELL_GPU(KC,5),3)*S_Y_GPU(kc,5)           & 
               +Q_GPU(PCELL_GPU(KC,6),3)*S_Y_GPU(kc,6)           & 
               +Q_GPU(PCELL_GPU(KC,7),3)*S_Y_GPU(kc,7)           & 
               +Q_GPU(PCELL_GPU(KC,8),3)*S_Y_GPU(kc,8)           & 
               +Q_GPU(PCELL_GPU(KC,1),4)*S_Z_GPU(kc,1)           & 
               +Q_GPU(PCELL_GPU(KC,2),4)*S_Z_GPU(kc,2)           & 
               +Q_GPU(PCELL_GPU(KC,3),4)*S_Z_GPU(kc,3)           & 
               +Q_GPU(PCELL_GPU(KC,4),4)*S_Z_GPU(kc,4)           & 
               +Q_GPU(PCELL_GPU(KC,5),4)*S_Z_GPU(kc,5)           & 
               +Q_GPU(PCELL_GPU(KC,6),4)*S_Z_GPU(kc,6)           & 
               +Q_GPU(PCELL_GPU(KC,7),4)*S_Z_GPU(kc,7)           & 
               +Q_GPU(PCELL_GPU(KC,8),4)*S_Z_GPU(kc,8) 
      END DO 
!$acc end region


     49, Loop is parallelizable
         Accelerator kernel generated
         49, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 2.0 : 54 registers; 4 shared, 232 constant, 0 local memory bytes; 33% occupancy


    47: region entered 20 times
        time(us): total=59540 init=3 region=59537
                  kernels=58114 data=0
        w/o init: total=59537 max=3034 min=2950 avg=2976
        49: kernel launched 20 times
            grid: [3063]  block: [256]
            time(us): total=58114 max=2930 min=2876 avg=2905

Thank you very much!

Hi Mark,

I sent a response to the email you sent PGI Customer Support but haven’t seen a response from you yet. In case you missed the mail, I’ve posted it below:

Can you send me both versions of the code? I can guess at what the problem is, but if I have the code I can profile it and look at the generated GPU code to give you a better answer.

As for my guesses: the initialization code has a very small outer loop (8 iterations), so the problem there is most likely due to scheduling more than anything else. I’d try inverting the i and kc loops and making the i loop sequential. Also, I’d cache the index fetched from pcell_GPU. Often the compiler will do this optimization, but it’s not guaranteed, so I like to make sure.

For example:

!$acc region
!$acc do parallel, vector(256), kernel
      DO kc=1,kcend
         DO i=1,8
            ! Fetch the node index once and reuse it for all six components
            idx = pcell_GPU(kc,i)
            Q_GPU_kc(kc,i,1) = Q_GPU(idx,1)
            Q_GPU_kc(kc,i,2) = Q_GPU(idx,2)
            Q_GPU_kc(kc,i,3) = Q_GPU(idx,3)
            Q_GPU_kc(kc,i,4) = Q_GPU(idx,4)
            Q_GPU_kc(kc,i,5) = Q_GPU(idx,5)
            Q_GPU_kc(kc,i,6) = Q_GPU(idx,6)
         END DO
      END DO
!$acc end region

In C, there are no true multi-dimensional arrays, so all multi-dimensional Fortran arrays need to be linearized. I’d like to see the generated kernel code (-ta=nvidia,keepgpu) to look at the indexing. I think the data should be accessed contiguously, but the kernel probably needs to do more calculations to determine each index. Notice that the first example uses 63 registers versus 54 in the second; most likely this is due to the increased number of index calculations.
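To illustrate what I mean (this is just a sketch of the address arithmetic, not the compiler’s actual output, and it assumes Q_GPU_kc is declared as Q_GPU_kc(kcend,8,6)), every reference Q_GPU_kc(kc,i,n) in the kernel amounts to something like:

      ! Sketch only: Fortran arrays are column-major and 1-based, so for
      ! an array declared Q_GPU_kc(kcend,8,6) the element (kc,i,n) sits
      ! at this linear offset in the flattened array.
      INTEGER FUNCTION flat_index(kc, i, n, kcend)
         INTEGER, INTENT(IN) :: kc, i, n, kcend
         ! kc is the fastest-varying index, so consecutive threads
         ! (consecutive kc) read consecutive addresses and coalesce; but
         ! each of the 24 terms in the big sum pays for its own
         ! multiply/add chain, which shows up as register pressure.
         flat_index = kc + (i-1)*kcend + (n-1)*8*kcend
      END FUNCTION flat_index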

I think I’d try re-rolling the second example by adding an “i” loop. Like above, make the i loop sequential and cache the value fetched from pcell_GPU. Hopefully this will reduce the number of registers required and increase your occupancy.
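For example, something like this (an untested sketch just to show the idea; tmp and idx are local scalars):

!$acc region
!$acc do parallel, vector(256), kernel
      DO kc = 1, kcend
         tmp = 0.0
         ! The i loop runs sequentially inside each thread’s kernel body
         DO i = 1, 8
            idx = pcell_GPU(kc,i)   ! fetch the node index once per i
            tmp = tmp + Q_GPU(idx,2)*S_X_GPU(kc,i)    &
                      + Q_GPU(idx,3)*S_Y_GPU(kc,i)    &
                      + Q_GPU(idx,4)*S_Z_GPU(kc,i)
         END DO
         DELTA_Q_T(kc,1) = tmp
      END DO
!$acc end region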

Again, these are just guesses, but worth trying.

  • Mat