Loop tuning

Hi

I am porting a large code to the GPU with the PGI Accelerator model; it is currently running about 14x faster…
Now I want to do some fine tuning. I have five loops that are not performing well yet. The loops look like this:

!$acc region

         do NN=1,Number_of_nodes
            do CI=1,Number_of_cells_per_node(NN)

               fc1 = face1(RNODE2(NN,CI))*I_GPU(RNODE1(NN,CI),1)    &
                    +face2(RNODE2(NN,CI))*J_GPU(RNODE1(NN,CI),1)    &
                    +face3(RNODE2(NN,CI))*K_GPU(RNODE1(NN,CI),1)
               fc2 = face1(RNODE2(NN,CI))*I_GPU(RNODE1(NN,CI),2)    &
                    +face2(RNODE2(NN,CI))*J_GPU(RNODE1(NN,CI),2)    &
                    +face3(RNODE2(NN,CI))*K_GPU(RNODE1(NN,CI),2)
               fc3 = face1(RNODE2(NN,CI))*I_GPU(RNODE1(NN,CI),3)    &
                    +face2(RNODE2(NN,CI))*J_GPU(RNODE1(NN,CI),3)    &
                    +face3(RNODE2(NN,CI))*K_GPU(RNODE1(NN,CI),3)

               ARRAY(NN,1) = ARRAY(NN,1) + (fc1*F(RNODE1(NN,CI),1)     &
                                          + fc2*G(RNODE1(NN,CI),1)     &
                                          + fc3*H(RNODE1(NN,CI),1)     &
                                          + 0.125/DXX(RNODE1(NN,CI))   &
                                          * DELTA(RNODE1(NN,CI),1))
            end do
         end do

!$acc end region



Number_of_nodes is large (about 2 million).
Number_of_cells_per_node(NN) has values from 1 to 10 (mostly 8).
RNODE2(:,:) has values from 1 to 8.
RNODE1(:,:) has values from 1 to 2 million.

And the -Minfo output:

    140, Generating compute capability 2.0 binary
    141, Loop is parallelizable
         Accelerator kernel generated
        141, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             Using register for 'rnode_cnt_gpu'
             CC 2.0 : 63 registers; 4 shared, 488 constant, 0 local memory bytes; 33% occupancy
    142, Complex loop carried dependence of 'dq_gpu' prevents parallelization
         Loop carried dependence of 'dq_gpu' prevents parallelization
         Loop carried backward dependence of 'dq_gpu' prevents vectorization
         Inner sequential loop scheduled on accelerator

And the -ta=nvidia,time output is:

    140: region entered 20 times
        time(us): total=177978 init=4 region=177974
                  kernels=176084 data=0
        w/o init: total=177974 max=8967 min=8854 avg=8898
        141: kernel launched 20 times
            grid: [3182]  block: [256]
            time(us): total=176084 max=8880 min=8766 avg=8804

Is there anything I can do to gain performance? I guess the inner loop is the killing part, right?
I would be happy for any kind of tip for tuning this loop.
Thank you so much!

Hi elephant,

What I’d try first is assigning the RNODE2(NN,CI) and RNODE1(NN,CI) values to temp variables and replacing each instance within the loop with the temp variable. For example:

!$acc region

         do NN=1,Number_of_nodes
            do CI=1,Number_of_cells_per_node(NN)
               rn1 = RNODE1(NN,CI)
               rn2 = RNODE2(NN,CI)
               fc1 = face1(rn2)*I_GPU(rn1,1)    &
                    +face2(rn2)*J_GPU(rn1,1)    &
                    +face3(rn2)*K_GPU(rn1,1)
... continues

The code is using a lot of registers. My best guess is that many of these registers are being used to hold the address calculations for the RNODE look-ups. Granted, the compiler may already be recognizing the redundant look-ups and replacing them with temp variables itself, in which case manually replacing them won’t matter. Worth a try, though.

Next, I’d use a temp variable to accumulate the sum; otherwise you’re storing “ARRAY(NN,1)” to global memory after each iteration of the inner loop:

!$acc region

         do NN=1,Number_of_nodes
            tempsum = ARRAY(NN,1)
            do CI=1,Number_of_cells_per_node(NN)
...
               tempsum = tempsum + (fc1*F(RNODE1(NN,CI),1)     &
...
            end do
            ARRAY(NN,1) = tempsum
         end do

!$acc end region
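
Putting both pieces together, the whole loop might look something like this (just a sketch, reusing the array names from your post and assuming rn1, rn2, fc1, fc2, fc3, and tempsum are declared as local scalars):

!$acc region

         do NN=1,Number_of_nodes
            tempsum = ARRAY(NN,1)                  ! accumulate in a local, store once
            do CI=1,Number_of_cells_per_node(NN)
               rn1 = RNODE1(NN,CI)                 ! hoist the redundant index look-ups
               rn2 = RNODE2(NN,CI)
               fc1 = face1(rn2)*I_GPU(rn1,1)    &
                    +face2(rn2)*J_GPU(rn1,1)    &
                    +face3(rn2)*K_GPU(rn1,1)
               fc2 = face1(rn2)*I_GPU(rn1,2)    &
                    +face2(rn2)*J_GPU(rn1,2)    &
                    +face3(rn2)*K_GPU(rn1,2)
               fc3 = face1(rn2)*I_GPU(rn1,3)    &
                    +face2(rn2)*J_GPU(rn1,3)    &
                    +face3(rn2)*K_GPU(rn1,3)
               tempsum = tempsum + (fc1*F(rn1,1)     &
                                  + fc2*G(rn1,1)     &
                                  + fc3*H(rn1,1)     &
                                  + 0.125/DXX(rn1)   &
                                  * DELTA(rn1,1))
            end do
            ARRAY(NN,1) = tempsum                  ! one global store per node
         end do

!$acc end region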

The next thing to try is setting the maxregcount to 16 (-Mcuda=maxregcount:16). This should boost the occupancy from 33% to 100%. Though, increasing the occupancy doesn’t always mean better performance, since using fewer registers can mean more global memory fetches. You don’t have a lot of data reuse, though, so it may be OK.
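
For example, the compile line might look something like this (“mycode.f90” is just a placeholder for your source file; keep whatever other flags you already use):

    pgf90 -ta=nvidia,time -Minfo=accel -Mcuda=maxregcount:16 mycode.f90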

Hope this helps,
Mat