compiler output -Minfo: one loop slower than the other

elephant · June 30, 2011, 2:56pm

Hi
I am porting a large CFD code to the GPU.
My strategy is to take a subroutine (sub1), rewrite it as a stand-alone program, accelerate it using the PGI Accelerator model, and then turn it back into a subroutine and call it from a “dummy main program” in a loop that simulates the iterations.

Now, sub1 had a speedup of 117x in the stand-alone version. There I declared the arrays static and used an !$acc data region to ensure that no data transfer occurs during the calculation.

For the main program version, where sub1 is a subroutine again, I have the exact same code for sub1, although the arrays are now declared dynamic (allocatable) with the !$acc mirror directive. Before entering the loop where sub1 is called, I allocate the arrays and transfer the data to the GPU using !$acc update device.
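The setup described above can be sketched roughly as follows. This is a minimal illustration, not the original code: the module name gpu_arrays, the arrays q and res, the subroutine sub1_sketch, and the sizes are all hypothetical, and the loop body is a placeholder. It assumes the PGI Accelerator model directives named in this post (mirror, update device, region). With a compiler that ignores the directives, it simply runs on the host.

```fortran
module gpu_arrays
  implicit none
  ! Hypothetical arrays standing in for the real CFD data
  real, allocatable :: q(:), res(:)
  ! mirror: a device copy is meant to be allocated whenever the host array is
  !$acc mirror(q, res)
end module gpu_arrays

subroutine sub1_sketch(n)
  use gpu_arrays
  implicit none
  integer, intent(in) :: n
  integer :: i
  ! The mirrored arrays are expected to be on the device already,
  ! so no transfer should be generated for this region
  !$acc region
  do i = 1, n
    res(i) = 2.0 * q(i)   ! placeholder computation
  end do
  !$acc end region
end subroutine sub1_sketch

program dummy_main
  use gpu_arrays
  implicit none
  integer, parameter :: n = 1000, niter = 10   ! assumed sizes
  integer :: it
  allocate(q(n), res(n))
  q = 1.0
  ! Copy the input to the device once, before the iteration loop
  !$acc update device(q)
  do it = 1, niter
    call sub1_sketch(n)
  end do
  ! Bring the result back once, after the loop
  !$acc update host(res)
  print *, res(1), res(n)
  deallocate(q, res)
end program dummy_main
```

The point of this layout is that allocation and transfer happen outside the iteration loop, so any remaining difference from the stand-alone version has to come from the generated kernels themselves.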

My problem is that I now only get 45x for sub1…

I checked the compiler output information (-Minfo): there is no data transfer during the calculation either, so this cannot be the reason for the slowdown…

My sub1 consists of 6 loops. When I measured the time used for each loop, I saw that the timings of the first 2 loops were almost identical in both versions, whereas for loop 3 there was a huge difference! (The other loops also slowed down.)

The -Minfo output for this loop 3 shows (CC 2.0)…

… for the stand-alone version (117x):
register: 47
shared: 4
constant: 112
local: 0
occupancy: 33%

… and for the version where sub1 is called from a main program (45x):
register: 41
shared: 4
constant: 304
local: 0
occupancy: 50%

It seems weird to me that the faster loop has the lower occupancy…
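As a rough check on those occupancy figures, the sketch below estimates how many threads fit on one multiprocessor when registers are the limiting resource. The per-multiprocessor limits are those of compute capability 2.0 (32768 registers, 1536 threads, 8 blocks). The block size of 256 threads is an assumption, since the actual schedule is not shown in this thread, and register allocation granularity is ignored. Under those assumptions the estimate reproduces the reported 33% and 50%.

```fortran
program occupancy_estimate
  implicit none
  ! Per-multiprocessor limits for compute capability 2.0
  integer, parameter :: regs_per_sm        = 32768
  integer, parameter :: max_threads_per_sm = 1536
  integer, parameter :: max_blocks_per_sm  = 8
  ! ASSUMPTION: 256 threads per block (the real schedule is not given)
  integer, parameter :: block_size = 256

  call report('stand-alone (117x)', 47)
  call report('called from main (45x)', 41)

contains

  subroutine report(label, regs_per_thread)
    character(len=*), intent(in) :: label
    integer, intent(in) :: regs_per_thread
    integer :: by_regs, by_threads, blocks, resident

    ! Whole blocks that fit given the register budget (integer division)
    by_regs    = regs_per_sm / (regs_per_thread * block_size)
    ! Whole blocks that fit given the thread limit
    by_threads = max_threads_per_sm / block_size
    blocks     = min(by_regs, by_threads, max_blocks_per_sm)
    resident   = blocks * block_size

    print '(a, ": ", i0, " resident threads, occupancy ", i0, "%")', &
          label, resident, (100 * resident) / max_threads_per_sm
  end subroutine report

end program occupancy_estimate
```

With 47 registers per thread only two such blocks fit (512 threads), while 41 registers allow three (768 threads). Occupancy only measures how many threads are resident; it does not by itself say how fast each thread runs.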

Can you explain from the information provided why this could happen?
To measure the time of loop 3, I captured the time before and after the loop, so I think the difference in how the arrays are allocated can't cause the slowdown: when the loop is entered, in both cases the arrays are already allocated on the device and the data is already on the device…

Or could it be due to this difference after all? At compile time the compiler does not know the size of the arrays, so it cannot generate an optimal strategy for where to place each array (register, shared, constant, …).

Thank you very much!!!

elephant · June 30, 2011, 3:05pm

By the way, loop 3 looks like this:

!$acc region

DO kc=1,kcend

   inttemp1 = pcell_T(kc,1)
   inttemp2 = pcell_T(kc,2)
   inttemp3 = pcell_T(kc,3)
   inttemp4 = pcell_T(kc,4)
   inttemp5 = pcell_T(kc,5)
   inttemp6 = pcell_T(kc,6)
   inttemp7 = pcell_T(kc,7)
   inttemp8 = pcell_T(kc,8)

   ArrTemp4(kc,1,1) = Q_T(inttemp1,1) &
                    + Q_T(inttemp2,1) &
                    + Q_T(inttemp3,1) &
                    + Q_T(inttemp4,1) &
                    + Q_T(inttemp5,1) &
                    + Q_T(inttemp6,1) &
                    + Q_T(inttemp7,1) &
                    + Q_T(inttemp8,1)

   ArrTemp4(kc,2,1) = Q_T(inttemp1,2) &
                    + Q_T(inttemp2,2) &
                    + Q_T(inttemp3,2) &
                    + Q_T(inttemp4,2) &
                    + Q_T(inttemp5,2) &
                    + Q_T(inttemp6,2) &
                    + Q_T(inttemp7,2) &
                    + Q_T(inttemp8,2)

   ArrTemp4(kc,3,1) = Q_T(inttemp1,3) &
                    + Q_T(inttemp2,3) &
                    + Q_T(inttemp3,3) &
                    + Q_T(inttemp4,3) &
                    + Q_T(inttemp5,3) &
                    + Q_T(inttemp6,3) &
                    + Q_T(inttemp7,3) &
                    + Q_T(inttemp8,3)

   ArrTemp4(kc,4,1) = Q_T(inttemp1,4) &
                    + Q_T(inttemp2,4) &
                    + Q_T(inttemp3,4) &
                    + Q_T(inttemp4,4) &
                    + Q_T(inttemp5,4) &
                    + Q_T(inttemp6,4) &
                    + Q_T(inttemp7,4) &
                    + Q_T(inttemp8,4)

   ArrTemp4(kc,5,1) = Q_T(inttemp1,5) &
                    + Q_T(inttemp2,5) &
                    + Q_T(inttemp3,5) &
                    + Q_T(inttemp4,5) &
                    + Q_T(inttemp5,5) &
                    + Q_T(inttemp6,5) &
                    + Q_T(inttemp7,5) &
                    + Q_T(inttemp8,5)

   ArrTemp4(kc,6,1) = CMU_GPU(inttemp1) &
                    + CMU_GPU(inttemp2) &
                    + CMU_GPU(inttemp3) &
                    + CMU_GPU(inttemp4) &
                    + CMU_GPU(inttemp5) &
                    + CMU_GPU(inttemp6) &
                    + CMU_GPU(inttemp7) &
                    + CMU_GPU(inttemp8)

   ArrTemp4(kc,7,1) = CMUT_GPU(inttemp1) &
                    + CMUT_GPU(inttemp2) &
                    + CMUT_GPU(inttemp3) &
                    + CMUT_GPU(inttemp4) &
                    + CMUT_GPU(inttemp5) &
                    + CMUT_GPU(inttemp6) &
                    + CMUT_GPU(inttemp7) &
                    + CMUT_GPU(inttemp8)

   ArrTemp4(kc,8,1) = QT_T(inttemp1,1) &
                    + QT_T(inttemp2,1) &
                    + QT_T(inttemp3,1) &
                    + QT_T(inttemp4,1) &
                    + QT_T(inttemp5,1) &
                    + QT_T(inttemp6,1) &
                    + QT_T(inttemp7,1) &
                    + QT_T(inttemp8,1)

   ArrTemp4(kc,9,1) = Y_GPU(inttemp1) &
                    + Y_GPU(inttemp2) &
                    + Y_GPU(inttemp3) &
                    + Y_GPU(inttemp4) &
                    + Y_GPU(inttemp5) &
                    + Y_GPU(inttemp6) &
                    + Y_GPU(inttemp7) &
                    + Y_GPU(inttemp8)

   ArrTemp4(kc,10,1) = Z_GPU(inttemp1) &
                     + Z_GPU(inttemp2) &
                     + Z_GPU(inttemp3) &
                     + Z_GPU(inttemp4) &
                     + Z_GPU(inttemp5) &
                     + Z_GPU(inttemp6) &
                     + Z_GPU(inttemp7) &
                     + Z_GPU(inttemp8)

   ArrTemp4(kc,11,1) = DQT_T(inttemp1,1) &
                     + DQT_T(inttemp2,1) &
                     + DQT_T(inttemp3,1) &
                     + DQT_T(inttemp4,1) &
                     + DQT_T(inttemp5,1) &
                     + DQT_T(inttemp6,1) &
                     + DQT_T(inttemp7,1) &
                     + DQT_T(inttemp8,1)

END DO

!$acc end region

MatColgrove · July 6, 2011, 5:56pm

Hi elephant,

While I can’t tell without doing an in-depth investigation, my best guess is that it’s the array descriptors that account for the difference. With static arrays, the compiler is able to optimize the address calculations, but this is more difficult with dynamic arrays. Improving this is something our engineers are investigating, however.

I’d be interested in seeing the output from the basic profiling information (i.e. -ta=nvidia,time), in particular the actual schedule used. I believe the increase in occupancy is due to the decrease in register usage and therefore an increase in the number of threads per block. It’s possible that the increased number of threads causes other resource constraints, and that it’s better to reduce the number via the “!$acc do vector(nnn)” directive.

Mat
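For reference, the “!$acc do vector(nnn)” suggestion might be applied along these lines. This is only a sketch on a toy loop: the arrays a and b, the size n, and the width 128 are placeholders chosen for illustration, not values recommended for loop 3. With a compiler that ignores the directives it runs on the host.

```fortran
program vector_sketch
  implicit none
  integer, parameter :: n = 4096   ! assumed problem size
  real :: a(n), b(n)
  integer :: i

  b = 1.0

  !$acc region
  ! Request a vector width of 128 for this loop (the value is a placeholder;
  ! other widths are worth trying and comparing)
  !$acc do vector(128)
  do i = 1, n
    a(i) = 2.0 * b(i)
  end do
  !$acc end region

  print *, a(1), a(n)
end program vector_sketch
```

Building with -ta=nvidia,time and -Minfo, as mentioned above, should then show which schedule was actually chosen and how long each kernel took, so different widths can be compared directly.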