How openacc improves register utilization in openaccfortran

When I was using the openacc optimizer to optimize my fortran code, pgprof reminded me that the number of registers used per thread is too large and gives optimization suggestions.:Use the -maxrregcount flag or the __launch_bounds__qualifier to decrease the number of register used by each thread.on devices With compute capability 5.2 turnning global cache off can invcrease the occupancy limited by register usage. The first choice I have tried, the program will stop running, then how to operate :"__launch_bounds__qualifier" in openacc? What can I do to: “turnning global cache off”?
The experimental GPU I am using is K40, and the final program is running on the P40.

code:

	!$acc kernels present(fsj2,fsj3,fsj4,fsj5,u1,v1,w1,t1,six,siy,siz,sjx,sjy,sjz,skx,sky,skz,vol,fmuet) 
	!$acc loop collapse(3) 
	  do  k=2,k3

	  do  i=2,i3


	  do  j=2,j2
          
		jm = j-1
		jp = j+1
		im = i-1
		ip = i+1
		km = k-1	
            	kp = k+1

			uxa(i,j,k) = ((0.5d0*(six(ip,j,k)+six(ip,jm,k)))*((u1(i,j,k)+u1(i,jm,k)+u1(ip,j,k)+u1(ip,jm,k))*0.25d0) - (0.5d0*(six(i,j,k)+&
six(i,jm,k)))*((u1(i,j,k)+u1(i,jm,k)+u1(im,j,k)+u1(im,jm,k))*0.25d0) + (0.5d0*(sjx(i,j,k)+sjx(i,jp,k)))*u1(i,j,k) - (0.5d0*(sjx(i,j,k)+&
sjx(i,jm,k)))*u1(i,jm,k) + (0.5d0*(skx(i,j,kp)+skx(i,jm,kp)))*((u1(i,j,k)+u1(i,jm,k)+u1(i,j,kp)+u1(i,jm,kp))*0.25d0) - (0.5d0*(skx(i,j,k)+&
skx(i,jm,k)))*((u1(i,j,k)+u1(i,jm,k)+u1(i,j,km)+u1(i,jm,km))*0.25d0))*(2.d0/(vol(i,j,k)+vol(i,jm,k)))


			uya(i,j,k) = ((0.5d0*(siy(ip,j,k)+siy(ip,jm,k)))*((u1(i,j,k)+u1(i,jm,k)+u1(ip,j,k)+u1(ip,jm,k))*0.25d0) - (0.5d0*(siy(i,j,k)+siy&
(i,jm,k)))*((u1(i,j,k)+u1(i,jm,k)+u1(im,j,k)+u1(im,jm,k))*0.25d0) + (0.5d0*(sjy(i,j,k)+sjy(i,jp,k)))*u1(i,j,k) - (0.5d0*(sjy(i,j,k)+sjy(i,jm,k)))*u1&
(i,jm,k) + (0.5d0*(sky(i,j,kp)+sky(i,jm,kp)))*((u1(i,j,k)+u1(i,jm,k)+u1(i,j,kp)+u1(i,jm,kp))*0.25d0) - (0.5d0*(sky(i,j,k)+sky(i,jm,k)))*((u1(i,j,k)&
+u1(i,jm,k)+u1(i,j,km)+u1(i,jm,km))*0.25d0))*(2.d0/(vol(i,j,k)+vol(i,jm,k)))
			uza(i,j,k) = ((0.5d0*(siz(ip,j,k)+siz(ip,jm,k)))*((u1(i,j,k)+u1(i,jm,k)+u1(ip,j,k)+u1(ip,jm,k))*0.25d0) - (0.5d0*(siz(i,j,k)+siz&
(i,jm,k)))*((u1(i,j,k)+u1(i,jm,k)+u1(im,j,k)+u1(im,jm,k))*0.25d0) + (0.5d0*(sjz(i,j,k)+sjz(i,jp,k)))*u1(i,j,k) - (0.5d0*(sjz(i,j,k)+sjz(i,jm,k)))*u1&
(i,jm,k) + (0.5d0*(skz(i,j,kp)+skz(i,jm,kp)))*((u1(i,j,k)+u1(i,jm,k)+u1(i,j,kp)+u1(i,jm,kp))*0.25d0) - (0.5d0*(skz(i,j,k)+skz(i,jm,k)))*((u1(i,j,k)&
+u1(i,jm,k)+u1(i,j,km)+u1(i,jm,km))*0.25d0))*(2.d0/(vol(i,j,k)+vol(i,jm,k)))
			



			vxa(i,j,k) = ((0.5d0*(six(ip,j,k)+six(ip,jm,k)))*((v1(i,j,k)+v1(i,jm,k)+v1(ip,j,k)+v1(ip,jm,k))*0.25d0) - (0.5d0*(six(i,j,k)+six&
(i,jm,k)))*((v1(i,j,k)+v1(i,jm,k)+v1(im,j,k)+v1(im,jm,k))*0.25d0) + (0.5d0*(sjx(i,j,k)+sjx(i,jp,k)))*v1(i,j,k) - (0.5d0*(sjx(i,j,k)+sjx(i,jm,k)))*v1&
(i,jm,k) + (0.5d0*(skx(i,j,kp)+skx(i,jm,kp)))*((v1(i,j,k)+v1(i,jm,k)+v1(i,j,kp)+v1(i,jm,kp))*0.25d0) - (0.5d0*(skx(i,j,k)+skx(i,jm,k)))*((v1(i,j,k)&
+v1(i,jm,k)+v1(i,j,km)+v1(i,jm,km))*0.25d0))*(2.d0/(vol(i,j,k)+vol(i,jm,k)))
			vya(i,j,k) = ((0.5d0*(siy(ip,j,k)+siy(ip,jm,k)))*((v1(i,j,k)+v1(i,jm,k)+v1(ip,j,k)+v1(ip,jm,k))*0.25d0) - (0.5d0*(siy(i,j,k)+siy&
(i,jm,k)))*((v1(i,j,k)+v1(i,jm,k)+v1(im,j,k)+v1(im,jm,k))*0.25d0) + (0.5d0*(sjy(i,j,k)+sjy(i,jp,k)))*v1(i,j,k) - (0.5d0*(sjy(i,j,k)+sjy(i,jm,k)))*v1&
(i,jm,k) + (0.5d0*(sky(i,j,kp)+sky(i,jm,kp)))*((v1(i,j,k)+v1(i,jm,k)+v1(i,j,kp)+v1(i,jm,kp))*0.25d0) - (0.5d0*(sky(i,j,k)+sky(i,jm,k)))*((v1(i,j,k)&
+v1(i,jm,k)+v1(i,j,km)+v1(i,jm,km))*0.25d0))*(2.d0/(vol(i,j,k)+vol(i,jm,k)))
			vza(i,j,k) = ((0.5d0*(siz(ip,j,k)+siz(ip,jm,k)))*((v1(i,j,k)+v1(i,jm,k)+v1(ip,j,k)+v1(ip,jm,k))*0.25d0) - (0.5d0*(siz(i,j,k)+siz&
(i,jm,k)))*((v1(i,j,k)+v1(i,jm,k)+v1(im,j,k)+v1(im,jm,k))*0.25d0) + (0.5d0*(sjz(i,j,k)+sjz(i,jp,k)))*v1(i,j,k) - (0.5d0*(sjz(i,j,k)+sjz(i,jm,k)))*v1&
(i,jm,k) + (0.5d0*(skz(i,j,kp)+skz(i,jm,kp)))*((v1(i,j,k)+v1(i,jm,k)+v1(i,j,kp)+v1(i,jm,kp))*0.25d0) - (0.5d0*(skz(i,j,k)+skz(i,jm,k)))*((v1(i,j,k)&
+v1(i,jm,k)+v1(i,j,km)+v1(i,jm,km))*0.25d0))*(2.d0/(vol(i,j,k)+vol(i,jm,k)))
			
enddo      
enddo
enddo
!$acc end kernels

In order to reduce the use of registers, I used a way to disassemble a computational process into multiple computational processes, using the intermediate variable “temp”, which played a role, reducing the use of registers by about 20 percent, but Still very high, each thread occupies about 140 registers, which is unacceptable to me.
code:

!$acc kernels present(v1,six,siy,siz,sjx,sjy,sjz,skx,sky,skz,vol) 
!$acc loop  collapse(3)
	do   k=2,k3
		 
	    do   j=2,j3
		 

		do   i=2,i2
			




			temp = 0.50*(six(i,j,k)+six(i+1,j,k))*v1(i ,j ,k )
			temp = temp- 0.50*(six(i,j,k)+six(i-1,j,k))*v1(i-1,j ,k ) 
			temp = temp+ 0.125*(sjx(i,j+1,k)+sjx(i-1,j+1,k))*(v1(i ,j ,k )+v1(i-1,j ,k )+v1(i ,j+1,k )+v1(i-1,j+1,k ))
			temp = temp - 0.125*(sjx(i,j,k)+sjx(i-1,j,k))*(v1(i ,j ,k )+v1(i-1,j ,k )+v1(i ,j-1,k )+v1(i-1,j-1,k ))
			temp = temp + 0.125*(skx(i,j,k+1)+skx(i-1,j,k+1))*(v1(i ,j ,k )+v1(i-1,j ,k )+v1(i ,j ,k+1)+v1(i-1,j ,k+1))
			temp = temp- 0.125*(skx(i,j,k)+skx(i-1,j,k))*(v1(i ,j ,k )+v1(i-1,j ,k )+v1(i ,j ,k-1)+v1(i-1,j ,k-1))
			vxa(i,j,k) =temp*(2.d0/(vol(i,j,k)+vol(i-1,j,k)))


			
			temp = 0.5d0*(siy(i,j,k)+siy(i+1,j,k))*v1(i ,j ,k )
			temp = temp- 0.5d0*(siy(i,j,k)+siy(i-1,j,k))*v1(i-1,j ,k ) 
			temp = temp+ 0.125*(sjy(i,j+1,k)+sjy(i-1,j+1,k))*(v1(i ,j ,k )+v1(i-1,j ,k )+v1(i ,j+1,k )+v1(i-1,j+1,k ))
			temp = temp - 0.125*(sjy(i,j,k)+sjy(i-1,j,k))*(v1(i ,j ,k )+v1(i-1,j ,k )+v1(i ,j-1,k )+v1(i-1,j-1,k ))
			temp = temp + 0.125*(sky(i,j,k+1)+sky(i-1,j,k+1))*(v1(i ,j ,k )+v1(i-1,j ,k )+v1(i ,j ,k+1)+v1(i-1,j ,k+1))
			temp = temp- 0.125*(sky(i,j,k)+sky(i-1,j,k))*(v1(i ,j ,k )+v1(i-1,j ,k )+v1(i ,j ,k-1)+v1(i-1,j ,k-1))
			vya(i,j,k) = temp*(2.d0/(vol(i,j,k)+vol(i-1,j,k)))



			temp = 0.5d0*(siz(i,j,k)+siz(i+1,j,k))*v1(i ,j ,k )
			temp = temp- 0.50*(siz(i,j,k)+siz(i-1,j,k))*v1(i-1,j ,k ) 
			temp = temp+ 0.125*(sjz(i,j+1,k)+sjz(i-1,j+1,k))*(v1(i ,j ,k )+v1(i-1,j ,k )+v1(i ,j+1,k )+v1(i-1,j+1,k ))
			temp = temp - 0.125*(sjz(i,j,k)+sjz(i-1,j,k))*(v1(i ,j ,k )+v1(i-1,j ,k )+v1(i ,j-1,k )+v1(i-1,j-1,k ))

			temp = temp + 0.125*(skz(i,j,k+1)+skz(i-1,j,k+1))*(v1(i ,j ,k )+v1(i-1,j ,k )+v1(i ,j ,k+1)+v1(i-1,j ,k+1))
			temp = temp- 0.125*(skz(i,j,k)+skz(i-1,j,k))*(v1(i ,j ,k )+v1(i-1,j ,k )+v1(i ,j ,k-1)+v1(i-1,j ,k-1))
			vza(i,j,k) =temp*(2.d0/(vol(i,j,k)+vol(i-1,j,k)))	
			
        enddo
	
   enddo 
enddo      
!$acc end kernels

Hi wanghr323,

Your next step would be to split the computation into separate kernels. i.e. compute the u variable in on loop, and the v variables in another. If the register usage is still too high, try splitting each array computation in it’s own loop (6 in all).

The problem here is that you have many intermediary computations each needing to be stored in a register. While you could set the flag “-ta=tesla:maxregcount=N” (where N is the number of registers), the intermediate values still need to be stored. Hence by limiting the registers, these values could then be spilled to global memory and hurt performance. The best way to reduce register usage is to limit the number of local variables used (implicit and explicit) and why I suggest using multiple loops.

The first choice I have tried, the program will stop running, then how to operate :"__launch_bounds__qualifier" in openacc?

Since OpenACC is not NVIDIA specific and meant to target generic accelerator, NVIDIA specific items such as launch_bounds are not included in the standard.

Note that the PGI compiler does implicitly include launch_bounds with the number of threads when generating the device code. However, we don’t set the second argument, blocks per SM, which is more what you’re looking for. I did a study a number of years ago to see if the compiler could find the optimal blocks per SM. But since it required run time information, I found that the compiler simply didn’t have enough information during the static compilation to do this optimally. Instead it was determined that it’s more an algorithmic issue needing involvement of the programmer.

What can I do to: “turnning global cache off”?

Try setting “-ta=tesla:loadcache:L1”. The default is to use L2 for cache.

Hope this helps,
Mat