Hi!
I am a little bit confused about a certain issue:
I am porting an unstructured grid application. In order to have coalesced memory access, I generated a new vector (Q_GPU_kc). It is ordered with respect to the cells (kc).
The original vector (Q_GPU) is ordered with respect to the nodes.
PCELL_GPU is an array that contains the information about which nodes belong to a cell.
Q_GPU_kc, which is ordered with respect to the cells, was created as follows:
!$acc region
DO i=1,8
DO kc=1,kcend
Q_GPU_kc(kc,i,1) = Q_GPU(pcell_GPU(kc,i),1)
Q_GPU_kc(kc,i,2) = Q_GPU(pcell_GPU(kc,i),2)
Q_GPU_kc(kc,i,3) = Q_GPU(pcell_GPU(kc,i),3)
Q_GPU_kc(kc,i,4) = Q_GPU(pcell_GPU(kc,i),4)
Q_GPU_kc(kc,i,5) = Q_GPU(pcell_GPU(kc,i),5)
Q_GPU_kc(kc,i,6) = Q_GPU(pcell_GPU(kc,i),6)
END DO
END DO
!$acc end region
I thought that using this ordered vector inside a loop over kc would improve performance. However, the opposite happened: the kernel averages about 3796 us per launch with Q_GPU_kc, versus about 2905 us with the original Q_GPU.
Is it because the kernel has trouble dealing with 3-dimensional arrays? The memory access in the loop that reads Q_GPU_kc should be very efficient, shouldn't it?
I should also mention that the grid I am using is not fully unstructured. However, in my opinion I should still see better performance with the Q_GPU_kc vector.
Do you have any explanation for this?
The code with the ordered vector is the following:
!$acc region
DO kc =1, kcend
DELTA_Q_T(KC,1) = &
Q_GPU_kc(KC,1,2)*S_X_GPU(kc,1) &
+Q_GPU_kc(KC,2,2)*S_X_GPU(kc,2) &
+Q_GPU_kc(KC,3,2)*S_X_GPU(kc,3) &
+Q_GPU_kc(KC,4,2)*S_X_GPU(kc,4) &
+Q_GPU_kc(KC,5,2)*S_X_GPU(kc,5) &
+Q_GPU_kc(KC,6,2)*S_X_GPU(kc,6) &
+Q_GPU_kc(KC,7,2)*S_X_GPU(kc,7) &
+Q_GPU_kc(KC,8,2)*S_X_GPU(kc,8) &
+Q_GPU_kc(KC,1,3)*S_Y_GPU(kc,1) &
+Q_GPU_kc(KC,2,3)*S_Y_GPU(kc,2) &
+Q_GPU_kc(KC,3,3)*S_Y_GPU(kc,3) &
+Q_GPU_kc(KC,4,3)*S_Y_GPU(kc,4) &
+Q_GPU_kc(KC,5,3)*S_Y_GPU(kc,5) &
+Q_GPU_kc(KC,6,3)*S_Y_GPU(kc,6) &
+Q_GPU_kc(KC,7,3)*S_Y_GPU(kc,7) &
+Q_GPU_kc(KC,8,3)*S_Y_GPU(kc,8) &
+Q_GPU_kc(KC,1,4)*S_Z_GPU(kc,1) &
+Q_GPU_kc(KC,2,4)*S_Z_GPU(kc,2) &
+Q_GPU_kc(KC,3,4)*S_Z_GPU(kc,3) &
+Q_GPU_kc(KC,4,4)*S_Z_GPU(kc,4) &
+Q_GPU_kc(KC,5,4)*S_Z_GPU(kc,5) &
+Q_GPU_kc(KC,6,4)*S_Z_GPU(kc,6) &
+Q_GPU_kc(KC,7,4)*S_Z_GPU(kc,7) &
+Q_GPU_kc(KC,8,4)*S_Z_GPU(kc,8)
END DO
!$acc end region
47, Generating compute capability 2.0 binary
49, Loop is parallelizable
Accelerator kernel generated
49, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
CC 2.0 : 63 registers; 4 shared, 208 constant, 0 local memory bytes; 33% occupancy
47: region entered 20 times
time(us): total=77350 init=5 region=77345
kernels=75924 data=0
w/o init: total=77345 max=3945 min=3847 avg=3867
49: kernel launched 20 times
grid: [3063] block: [256]
time(us): total=75924 max=3807 min=3778 avg=3796
The code with the original, “unordered” vector is the following:
!$acc region
DO kc =1, kcend
DELTA_Q_T(KC,1) = &
Q_GPU(PCELL_GPU(KC,1),2)*S_X_GPU(kc,1) &
+Q_GPU(PCELL_GPU(KC,2),2)*S_X_GPU(kc,2) &
+Q_GPU(PCELL_GPU(KC,3),2)*S_X_GPU(kc,3) &
+Q_GPU(PCELL_GPU(KC,4),2)*S_X_GPU(kc,4) &
+Q_GPU(PCELL_GPU(KC,5),2)*S_X_GPU(kc,5) &
+Q_GPU(PCELL_GPU(KC,6),2)*S_X_GPU(kc,6) &
+Q_GPU(PCELL_GPU(KC,7),2)*S_X_GPU(kc,7) &
+Q_GPU(PCELL_GPU(KC,8),2)*S_X_GPU(kc,8) &
+Q_GPU(PCELL_GPU(KC,1),3)*S_Y_GPU(kc,1) &
+Q_GPU(PCELL_GPU(KC,2),3)*S_Y_GPU(kc,2) &
+Q_GPU(PCELL_GPU(KC,3),3)*S_Y_GPU(kc,3) &
+Q_GPU(PCELL_GPU(KC,4),3)*S_Y_GPU(kc,4) &
+Q_GPU(PCELL_GPU(KC,5),3)*S_Y_GPU(kc,5) &
+Q_GPU(PCELL_GPU(KC,6),3)*S_Y_GPU(kc,6) &
+Q_GPU(PCELL_GPU(KC,7),3)*S_Y_GPU(kc,7) &
+Q_GPU(PCELL_GPU(KC,8),3)*S_Y_GPU(kc,8) &
+Q_GPU(PCELL_GPU(KC,1),4)*S_Z_GPU(kc,1) &
+Q_GPU(PCELL_GPU(KC,2),4)*S_Z_GPU(kc,2) &
+Q_GPU(PCELL_GPU(KC,3),4)*S_Z_GPU(kc,3) &
+Q_GPU(PCELL_GPU(KC,4),4)*S_Z_GPU(kc,4) &
+Q_GPU(PCELL_GPU(KC,5),4)*S_Z_GPU(kc,5) &
+Q_GPU(PCELL_GPU(KC,6),4)*S_Z_GPU(kc,6) &
+Q_GPU(PCELL_GPU(KC,7),4)*S_Z_GPU(kc,7) &
+Q_GPU(PCELL_GPU(KC,8),4)*S_Z_GPU(kc,8)
END DO
!$acc end region
49, Loop is parallelizable
Accelerator kernel generated
49, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
CC 2.0 : 54 registers; 4 shared, 232 constant, 0 local memory bytes; 33% occupancy
47: region entered 20 times
time(us): total=59540 init=3 region=59537
kernels=58114 data=0
w/o init: total=59537 max=3034 min=2950 avg=2976
49: kernel launched 20 times
grid: [3063] block: [256]
time(us): total=58114 max=2930 min=2876 avg=2905
Thank you very much!