 # Loop tuning

Hi

I am porting a large code to the GPU with the PGI Accelerator model. Currently running 14x…
Now I want to do some fine-tuning. I have five loops that are not performing well yet. The loops look like this:

``````
!$acc region

do NN=1,Number_of_nodes
  do CI=1,Number_of_cells_per_node(NN)

    face1 = face1(RNODE2(NN,CI))*I_GPU(RNODE1(NN,CI),1)    &
           +face2(RNODE2(NN,CI))*J_GPU(RNODE1(NN,CI),1)    &
           +face3(RNODE2(NN,CI))*K_GPU(RNODE1(NN,CI),1)
    face2 = face1(RNODE2(NN,CI))*I_GPU(RNODE1(NN,CI),2)    &
           +face2(RNODE2(NN,CI))*J_GPU(RNODE1(NN,CI),2)    &
           +face3(RNODE2(NN,CI))*K_GPU(RNODE1(NN,CI),2)
    face3 = face1(RNODE2(NN,CI))*I_GPU(RNODE1(NN,CI),3)    &
           +face2(RNODE2(NN,CI))*J_GPU(RNODE1(NN,CI),3)    &
           +face3(RNODE2(NN,CI))*K_GPU(RNODE1(NN,CI),3)

    ARRAY(NN,1) = ARRAY(NN,1) +(face1*F(RNODE1(NN,CI),1)     &
                             + face2*G(RNODE1(NN,CI),1)     &
                             + face3*H(RNODE1(NN,CI),1)     &
                             + 0.125/DXX(RNODE1(NN,CI))     &
                             * DELTA(RNODE1(NN,CI),1))
  end do
end do

!$acc end region

Number_of_nodes is a big number (2 Million)
Number_of_cells_per_node(NN) has values varying between 1 to 10 (mostly 8)
RNODE2(:,:) has values varying between 1:8
RNODE1(:,:) has values from 1 to 2M
``````

And the -Minfo output:

``````
    140, Generating compute capability 2.0 binary
    141, Loop is parallelizable
         Accelerator kernel generated
        141, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             Using register for 'rnode_cnt_gpu'
             CC 2.0 : 63 registers; 4 shared, 488 constant, 0 local memory bytes; 33% occupancy
    142, Complex loop carried dependence of 'dq_gpu' prevents parallelization
         Loop carried dependence of 'dq_gpu' prevents parallelization
         Loop carried backward dependence of 'dq_gpu' prevents vectorization
         Inner sequential loop scheduled on accelerator
``````

And the -ta=nvidia,time output is:

``````
    140: region entered 20 times
        time(us): total=177978 init=4 region=177974
                  kernels=176084 data=0
        w/o init: total=177974 max=8967 min=8854 avg=8898
        141: kernel launched 20 times
            grid:   block: 
            time(us): total=176084 max=8880 min=8766 avg=8804
``````

Is there anything I can do to gain performance? I guess the inner loop is the killer, right?
I would be happy for any kind of tip on tuning this loop.
Thank you so much!

Hi elephant,

What I’d try first is assigning the RNODE2(NN,CI) and RNODE1(NN,CI) values to temp variables and replacing each instance within the loop with the temp variable. For example:

``````
!$acc region

do NN=1,Number_of_nodes
  do CI=1,Number_of_cells_per_node(NN)
    rn1 = RNODE1(NN,CI)
    rn2 = RNODE2(NN,CI)
    face1 = face1(rn2)*I_GPU(rn1,1)    &
           +face2(rn2)*J_GPU(rn1,1)    &
           +face3(rn2)*K_GPU(rn1,1)
    ... continues
``````

The code is using a lot of registers. My best guess is that many of these registers are being used to hold the address calculations for each of the RNODE look-ups. Granted, the compiler may already be recognizing the redundant look-ups and replacing them with temp variables, in which case manually replacing them won’t matter. Worth a try, though.

Next, I’d use a temp variable to accumulate; otherwise, you’re storing “ARRAY(NN,1)” to global memory after each iteration of the inner loop.

``````
!$acc region

do NN=1,Number_of_nodes
  tempsum = ARRAY(NN,1)
  do CI=1,Number_of_cells_per_node(NN)
    ...
    tempsum = tempsum + (face1*F(RNODE1(NN,CI),1)     &
    ....
  enddo
  ARRAY(NN,1) = tempsum
enddo
!$acc end region
``````
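For reference, here is a minimal self-contained sketch that combines the two suggestions above (temp variables for the index look-ups, and a scalar accumulator) and checks the result against the original loop form on the host. The array sizes, the synthetic data, and the scalar names s1, s2, s3 (standing in for the scalar face values) are assumptions made for illustration, not taken from the real code. Compilers without PGI Accelerator support should treat the `!$acc` lines as comments.

```fortran
program loop_tuning_sketch
  implicit none
  ! Hypothetical small sizes; the real mesh in this thread has ~2 million nodes.
  integer, parameter :: dp = kind(1.0d0), nnodes = 1000, maxc = 10
  integer  :: ncells(nnodes), rnode1(nnodes,maxc), rnode2(nnodes,maxc)
  real(dp) :: face1(8), face2(8), face3(8)
  real(dp) :: i_gpu(nnodes,3), j_gpu(nnodes,3), k_gpu(nnodes,3)
  real(dp) :: f(nnodes,1), g(nnodes,1), h(nnodes,1), delta(nnodes,1), dxx(nnodes)
  real(dp) :: ref(nnodes,1), arr(nnodes,1)
  real(dp) :: s1, s2, s3, tempsum
  integer  :: nn, ci, rn1, rn2

  ! Synthetic data, only to exercise the loops.
  call random_number(face1); call random_number(face2); call random_number(face3)
  call random_number(i_gpu); call random_number(j_gpu); call random_number(k_gpu)
  call random_number(f); call random_number(g); call random_number(h)
  call random_number(delta); call random_number(dxx)
  dxx = dxx + 1.0_dp                                   ! keep the divisor away from zero
  do nn = 1, nnodes
    ncells(nn) = 1 + mod(nn, maxc)                     ! 1..10 cells per node
    do ci = 1, maxc
      rnode1(nn,ci) = 1 + mod(7*nn + 13*ci, nnodes)    ! indirect index, 1..nnodes
      rnode2(nn,ci) = 1 + mod(nn + ci, 8)              ! indirect index, 1..8
    end do
  end do

  ! Original form: repeated index look-ups, array element updated every iteration.
  ref = 0.0_dp
  do nn = 1, nnodes
    do ci = 1, ncells(nn)
      s1 = face1(rnode2(nn,ci))*i_gpu(rnode1(nn,ci),1) &
         + face2(rnode2(nn,ci))*j_gpu(rnode1(nn,ci),1) &
         + face3(rnode2(nn,ci))*k_gpu(rnode1(nn,ci),1)
      s2 = face1(rnode2(nn,ci))*i_gpu(rnode1(nn,ci),2) &
         + face2(rnode2(nn,ci))*j_gpu(rnode1(nn,ci),2) &
         + face3(rnode2(nn,ci))*k_gpu(rnode1(nn,ci),2)
      s3 = face1(rnode2(nn,ci))*i_gpu(rnode1(nn,ci),3) &
         + face2(rnode2(nn,ci))*j_gpu(rnode1(nn,ci),3) &
         + face3(rnode2(nn,ci))*k_gpu(rnode1(nn,ci),3)
      ref(nn,1) = ref(nn,1) + (s1*f(rnode1(nn,ci),1) + s2*g(rnode1(nn,ci),1) &
                + s3*h(rnode1(nn,ci),1) &
                + 0.125_dp/dxx(rnode1(nn,ci))*delta(rnode1(nn,ci),1))
    end do
  end do

  ! Tuned form: indices hoisted into rn1/rn2, sum kept in a scalar.
  arr = 0.0_dp
  !$acc region
  do nn = 1, nnodes
    tempsum = arr(nn,1)
    do ci = 1, ncells(nn)
      rn1 = rnode1(nn,ci)
      rn2 = rnode2(nn,ci)
      s1 = face1(rn2)*i_gpu(rn1,1) + face2(rn2)*j_gpu(rn1,1) + face3(rn2)*k_gpu(rn1,1)
      s2 = face1(rn2)*i_gpu(rn1,2) + face2(rn2)*j_gpu(rn1,2) + face3(rn2)*k_gpu(rn1,2)
      s3 = face1(rn2)*i_gpu(rn1,3) + face2(rn2)*j_gpu(rn1,3) + face3(rn2)*k_gpu(rn1,3)
      tempsum = tempsum + (s1*f(rn1,1) + s2*g(rn1,1) + s3*h(rn1,1) &
              + 0.125_dp/dxx(rn1)*delta(rn1,1))
    end do
    arr(nn,1) = tempsum                                ! one store per node
  end do
  !$acc end region

  ! Both forms should agree to rounding error.
  if (maxval(abs(arr - ref)) > 1.0e-12_dp) error stop "tuned loop differs from original"
  print *, "max abs difference:", maxval(abs(arr - ref))
end program loop_tuning_sketch
```

Whether this helps on the GPU depends on what the compiler was already doing, so compare the register count in the -Minfo output before and after.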

The next thing to try is setting the maximum register count to 16 (-Mcuda=maxregcount:16). This should boost the occupancy from 33% to 100%. Higher occupancy doesn’t always mean better performance, since fewer registers can mean more global memory fetches, but you don’t have a lot of data reuse, so it may be OK.
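To see where the 33% and 100% figures come from, here is a rough occupancy estimate as a small sketch. It assumes the compute capability 2.0 limits of 32768 registers, 1536 resident threads and 8 resident blocks per multiprocessor, uses the 256-thread blocks from the -Minfo output, and ignores shared memory and register allocation granularity, so treat it as an approximation rather than what the hardware scheduler does exactly. The program and function names are made up for illustration.

```fortran
program occupancy_sketch
  implicit none
  ! Compute capability 2.0 per-multiprocessor limits (assumed, see text above).
  integer, parameter :: regs_per_sm = 32768
  integer, parameter :: max_threads_per_sm = 1536
  integer, parameter :: max_blocks_per_sm = 8
  integer, parameter :: block_size = 256   ! vector(256) from the -Minfo output

  call report(63)   ! register count reported by -Minfo
  call report(16)   ! register count suggested via maxregcount

contains

  ! Estimated occupancy as a fraction of the maximum resident threads.
  function occupancy(regs_per_thread) result(occ)
    integer, intent(in) :: regs_per_thread
    real :: occ
    integer :: blocks_by_regs, blocks_by_threads, blocks
    blocks_by_regs    = regs_per_sm / (regs_per_thread * block_size)
    blocks_by_threads = max_threads_per_sm / block_size
    blocks = min(blocks_by_regs, blocks_by_threads, max_blocks_per_sm)
    occ = real(blocks * block_size) / real(max_threads_per_sm)
  end function occupancy

  subroutine report(regs_per_thread)
    integer, intent(in) :: regs_per_thread
    print '(a,i3,a,f6.1,a)', "registers/thread = ", regs_per_thread, &
          "  estimated occupancy = ", 100.0 * occupancy(regs_per_thread), " %"
  end subroutine report

end program occupancy_sketch
```

With 63 registers only two 256-thread blocks fit per multiprocessor (512 of 1536 threads, about 33%), while at 16 registers the thread limit becomes the binding constraint and six blocks fit (100%).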

Hope this helps,
Mat