Ways to reduce registers per thread in CUDA Fortran?

Dyoun · August 10, 2017, 4:20pm

Hi,

I’m porting a program written in CUDA C to CUDA Fortran and trying to reduce the number of registers per thread without using the register limit feature. The original C code uses 45 registers per thread, but the Fortran code uses over 100, which limits the theoretical occupancy to 12.5%.

For the C code, I used __launch_bounds__ and kept local variable scopes as tight as possible to decrease the register count. I use texture memory and shared memory heavily in both codes. Are there other ways to reduce register usage in CUDA Fortran?

Thanks in advance.
Youn

brentl · August 10, 2017, 6:22pm

Unfortunately, there is not as much control over register usage in CUDA Fortran as in CUDA C, because there is basically another level of translation/compilation.

How are you compiling (which options)? In my experience, both the CUDA version and the compiler options can make a difference. For instance, I’ve seen code built for cc30 use fewer registers than the same code built for cc35, even though both versions run fine.

We do have an open RFE for launch bounds in CUDA Fortran, and I’ll add you to the list.

Dyoun · August 14, 2017, 3:17pm

Thank you for your answer.

I’m using the latest version of Visual Fortran, and the CUDA Fortran Compute Capability setting is set to automatic. Could you tell me how to set the compute capability manually from the command line?

Also, I wrote a simple vector addition code to compare the number of registers in CUDA C and CUDA Fortran. The CUDA C code uses 12 registers and the CUDA Fortran code uses 16. That’s fine, but I found that the Fortran code uses 22 registers if I declare the device variables in the module ‘vectorDevice’ (so that I don’t need to pass them as arguments to the kernel). Below is the version that declares the device variables in the host code and passes them to the kernel as arguments.

module vectorDevice
	contains
		attributes(global) subroutine vectorAddKernel(C, A, B, size)
			implicit none

			integer, value :: size
			integer :: i, n

			integer, dimension(:) :: A, B, C

			i = (blockIdx%x - 1) * blockDim%x + threadIdx%x

			if (i .le. size) then
				do n = 1, 20
					C(i) = C(i) + A(i) + B(i)
				end do
			end if

		end subroutine vectorAddKernel
end module vectorDevice

program vectorFortran
	use cudafor
	use vectorDevice
	implicit none

	integer :: inputSize, gridSize, blockSize, n, m, cValue
	logical :: valid

	integer, dimension(:), allocatable :: h_A, h_B, h_C
	integer, device, dimension(:), allocatable :: d_A, d_B, d_C

	inputSize = 100000

	allocate(h_A(inputSize), h_B(inputSize), h_C(inputSize))
	allocate(d_A(inputSize), d_B(inputSize), d_C(inputSize))

	h_A = 1
	h_B = 2
	d_A = h_A
	d_B = h_B

	blockSize = 1024
	gridSize = ceiling(real(inputSize) / blockSize)

	call vectorAddKernel<<<gridSize, blockSize>>>(d_C, d_A, d_B, inputSize)

	h_C = d_C

	valid = .true.

	do n = 1, inputSize
		cValue = 0

		do m = 1, 20
			cValue = cValue + h_A(n) + h_B(n)
		end do

		if (h_C(n) .ne. cValue) then
			valid = .false.
			print *, "Invalid !!!", h_C(n), ":", cValue
			exit
		end if
	end do

	if (valid) then
		print *, "Valid !!!"
	end if

	deallocate(h_A, h_B, h_C)
	deallocate(d_A, d_B, d_C)

end program vectorFortran
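
For comparison, the module-scope variant described above (the one reported to use 22 registers) is not shown in this post. A minimal sketch of what it might look like follows, assuming the device arrays are declared as allocatable module variables. The names vectorDeviceModule, vectorAddKernelModule, and vectorFortranModule are hypothetical, and the register count this version produces has not been verified here.

```fortran
! Sketch only: device arrays at module scope instead of kernel arguments.
! Module, kernel, and program names are hypothetical (not from the post).
module vectorDeviceModule
	implicit none

	! Device arrays are module variables, so the kernel takes no array arguments.
	integer, device, dimension(:), allocatable :: d_A, d_B, d_C

	contains
		attributes(global) subroutine vectorAddKernelModule(size)
			integer, value :: size
			integer :: i, n

			i = (blockIdx%x - 1) * blockDim%x + threadIdx%x

			if (i .le. size) then
				do n = 1, 20
					d_C(i) = d_C(i) + d_A(i) + d_B(i)
				end do
			end if
		end subroutine vectorAddKernelModule
end module vectorDeviceModule

program vectorFortranModule
	use cudafor
	use vectorDeviceModule
	implicit none

	integer :: inputSize, gridSize, blockSize
	integer, dimension(:), allocatable :: h_C

	inputSize = 100000
	allocate(h_C(inputSize))
	allocate(d_A(inputSize), d_B(inputSize), d_C(inputSize))

	d_A = 1
	d_B = 2
	d_C = 0   ! zero the accumulator before the kernel adds to it

	blockSize = 1024
	gridSize = (inputSize + blockSize - 1) / blockSize

	call vectorAddKernelModule<<<gridSize, blockSize>>>(inputSize)

	h_C = d_C   ! device-to-host copy

	! Each element should be 20 * (1 + 2) = 60.
	if (all(h_C .eq. 60)) then
		print *, "Valid !!!"
	else
		print *, "Invalid !!!"
	end if

	deallocate(h_C)
	deallocate(d_A, d_B, d_C)
end program vectorFortranModule
```

Building both variants with the same options and comparing the reported register counts keeps the comparison like for like.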

Thank you for your help.

MatColgrove · August 14, 2017, 3:48pm

Could you tell me how to set the compute capability manually from the command line?

Sure. It’s “-Mcuda=ccXX”, where “XX” is the compute capability version. You can see a full list of available CC versions via the “help” command line option “pgfortran -help -Mcuda”.

Note that your program has an error: it never initializes the C array. This does not affect register usage, but it will give you incorrect results.
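
A minimal sketch of both points, assuming a source file named vectorFortran.cuf (a hypothetical name) and cc35 as an example target: the header comment shows a possible compile line, and the host code zeroes d_C before the launch. The ptxinfo suboption is included so that ptxas reports per-kernel register usage at compile time; check “pgfortran -help -Mcuda” to confirm which suboptions your release accepts.

```fortran
! Possible compile line (file name is hypothetical; cc35 is only an example):
!   pgfortran -Mcuda=cc35,ptxinfo vectorFortran.cuf
module vectorDevice
	contains
		attributes(global) subroutine vectorAddKernel(C, A, B, size)
			implicit none
			integer, value :: size
			integer, dimension(:) :: A, B, C
			integer :: i, n

			i = (blockIdx%x - 1) * blockDim%x + threadIdx%x

			if (i .le. size) then
				do n = 1, 20
					C(i) = C(i) + A(i) + B(i)
				end do
			end if
		end subroutine vectorAddKernel
end module vectorDevice

program vectorFortran
	use cudafor
	use vectorDevice
	implicit none

	integer :: inputSize, gridSize, blockSize
	integer, dimension(:), allocatable :: h_C
	integer, device, dimension(:), allocatable :: d_A, d_B, d_C

	inputSize = 100000
	allocate(h_C(inputSize))
	allocate(d_A(inputSize), d_B(inputSize), d_C(inputSize))

	d_A = 1
	d_B = 2
	d_C = 0   ! the missing initialization: the kernel accumulates into C

	blockSize = 1024
	gridSize = (inputSize + blockSize - 1) / blockSize

	call vectorAddKernel<<<gridSize, blockSize>>>(d_C, d_A, d_B, inputSize)

	h_C = d_C   ! device-to-host copy

	if (all(h_C .eq. 60)) then
		print *, "Valid !!!"
	else
		print *, "Invalid !!!"
	end if

	deallocate(h_C)
	deallocate(d_A, d_B, d_C)
end program vectorFortran
```

With d_C zeroed, every element of h_C should come back as 20 * (1 + 2) = 60.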

-Mat

Dyoun · August 14, 2017, 4:10pm

Thank you!

Daniel.Kokron · July 15, 2019, 9:29pm

Regarding the “RFE for launch bounds in CUDA Fortran”, please add me to the list of requestors.
