Ways to reduce registers per thread in CUDA Fortran?

Dyoun · August 10, 2017, 4:20pm

Hi,

I’m porting a program written in CUDA C to CUDA Fortran and trying to reduce the number of registers without relying on the register limit feature. The original C code uses 45 registers per thread; however, the Fortran code uses over 100, which limits the theoretical occupancy to 12.5%.

For the C code, I used __launch_bounds__ and kept local variable scopes as tight as possible to reduce the register count. I use texture memory and shared memory heavily in both codes. Are there other ways to reduce register usage in CUDA Fortran?

Thanks in advance.
Youn

brentl · August 10, 2017, 6:22pm

Unfortunately there is not as much control of register usage in CUDA Fortran as in CUDA C, as there is basically another level of translation/compilation.

How do you compile (what options)? In my experience, the CUDA version and the options you use can make a difference. For instance, I’ve seen cc30 use fewer registers than cc35, even though the code runs fine with both.

We do have an open RFE for launch bounds in CUDA Fortran, and I’ll add you to the list.
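Since the target compute capability can change the register count, it helps to know what the device itself reports before picking a ccXX value. A minimal sketch, assuming device 0 is the GPU of interest (the program name queryDevice is made up for illustration):

```fortran
program queryDevice
	use cudafor
	implicit none

	type(cudaDeviceProp) :: prop
	integer :: istat

	! Query device 0; a status other than cudaSuccess means the call failed.
	istat = cudaGetDeviceProperties(prop, 0)
	if (istat /= cudaSuccess) then
		print *, "cudaGetDeviceProperties failed: ", trim(cudaGetErrorString(istat))
		stop
	end if

	! Compute capability is reported as major.minor (e.g. 3 and 5 for cc35).
	print *, "Device:              ", trim(prop%name)
	print *, "Compute capability:  ", prop%major, ".", prop%minor
	print *, "Registers per block: ", prop%regsPerBlock
	print *, "Max threads per block:", prop%maxThreadsPerBlock
end program queryDevice
```

The registers-per-block figure, divided by the block size you launch with, gives a rough upper bound on the registers each thread can use before a launch fails outright.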

Dyoun · August 14, 2017, 3:17pm

Thank you for your answer.

I’m using the latest version of Visual Fortran, and the CUDA Fortran Compute Capability setting is set to automatic. Could you tell me how to set the compute capability manually from the command line?

Also, I wrote a simple vector addition code to compare the number of registers in CUDA C and CUDA Fortran. The CUDA C code uses 12 registers and the CUDA Fortran code uses 16. That’s fine, but I found that the code uses 22 registers if I declare the device variables in the module ‘vectorDevice’ (so I don’t need to pass them as arguments to the kernel). Below is the code that declares the device variables in the host code.

module vectorDevice
	contains
		attributes(global) subroutine vectorAddKernel(C, A, B, size)
			implicit none

			integer, value :: size
			integer :: i, n

			integer, dimension(:) :: A, B, C

			i = (blockIdx%x - 1) * blockDim%x + threadIdx%x

			if (i .le. size) then
				do n = 1, 20
					C(i) = C(i) + A(i) + B(i)
				end do
			end if

		end subroutine vectorAddKernel
end module vectorDevice

program vectorFortran
	use cudafor
	use vectorDevice
	implicit none

	integer :: inputSize, gridSize, blockSize, n, m, cValue
	logical :: valid

	integer, dimension(:), allocatable :: h_A, h_B, h_C
	integer, device, dimension(:), allocatable :: d_A, d_B, d_C

	inputSize = 100000

	allocate(h_A(inputSize), h_B(inputSize), h_C(inputSize))
	allocate(d_A(inputSize), d_B(inputSize), d_C(inputSize))

	h_A = 1
	h_B = 2
	d_A = h_A
	d_B = h_B

	blockSize = 1024
	gridSize = ceiling(real(inputSize) / blockSize)

	call vectorAddKernel<<<gridSize, blockSize>>>(d_C, d_A, d_B, inputSize)

	h_C = d_C

	valid = .true.

	do n = 1, inputSize
		cValue = 0

		do m = 1, 20
			cValue = cValue + h_A(n) + h_B(n)
		end do

		if (h_C(n) .ne. cValue) then
			valid = .false.
			print *, "Invalid !!!", h_C(n), ":", cValue
			exit
		end if
	end do

	if (valid) then
		print *, "Valid !!!"
	end if

	deallocate(h_A, h_B, h_C)
	deallocate(d_A, d_B, d_C)

end program vectorFortran
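
The module-level variant described above is not shown; a rough sketch of what it might look like, assuming the three device arrays simply move into the module and the kernel takes only the size argument (this layout is a guess, not the exact code that produced the 22-register figure):

```fortran
module vectorDevice
	implicit none

	! Device arrays at module scope: the kernel reaches them directly
	! instead of receiving them as dummy arguments.
	integer, device, dimension(:), allocatable :: d_A, d_B, d_C

contains
	attributes(global) subroutine vectorAddKernel(size)
		integer, value :: size
		integer :: i, n

		i = (blockIdx%x - 1) * blockDim%x + threadIdx%x

		if (i .le. size) then
			do n = 1, 20
				d_C(i) = d_C(i) + d_A(i) + d_B(i)
			end do
		end if
	end subroutine vectorAddKernel
end module vectorDevice

program vectorFortran
	use cudafor
	use vectorDevice
	implicit none

	integer, parameter :: inputSize = 100000, blockSize = 1024
	integer :: gridSize
	integer, dimension(:), allocatable :: h_C

	allocate(h_C(inputSize))
	allocate(d_A(inputSize), d_B(inputSize), d_C(inputSize))

	! Scalar assignments from host code fill the device arrays.
	d_A = 1
	d_B = 2
	d_C = 0

	gridSize = (inputSize + blockSize - 1) / blockSize
	call vectorAddKernel<<<gridSize, blockSize>>>(inputSize)

	h_C = d_C
	print *, "C(1) =", h_C(1)

	deallocate(h_C)
	deallocate(d_A, d_B, d_C)
end program vectorFortran
```

Compiling both versions with the same options makes the register counts directly comparable.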

Thank you for your help.

MatColgrove · August 14, 2017, 3:48pm

“Could you tell me how to set compute capability manually using command line?”

Sure. It’s “-Mcuda=ccXX”, where “XX” is the compute capability version. You can see a full list of available CC versions via the “help” command line option “pgfortran -help -Mcuda”.

Note that your program has an error in it: it never initializes the C array. This doesn’t affect register usage, but it will give you incorrect results.
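
A minimal sketch of the initialization fix, assuming the kernel from the original post stays unchanged: zero d_C on the device before the launch. The file name vectorFortran.cuf and the cc35 target in the comments are placeholders, and the ptxinfo suboption is included on the assumption that your compiler version prints per-kernel register usage with it.

```fortran
! Build (file name and cc35 are placeholders):
!   pgfortran -Mcuda=cc35,ptxinfo vectorFortran.cuf
module vectorDevice
contains
	attributes(global) subroutine vectorAddKernel(C, A, B, size)
		implicit none
		integer, value :: size
		integer, dimension(:) :: A, B, C
		integer :: i, n

		i = (blockIdx%x - 1) * blockDim%x + threadIdx%x

		if (i .le. size) then
			do n = 1, 20
				C(i) = C(i) + A(i) + B(i)
			end do
		end if
	end subroutine vectorAddKernel
end module vectorDevice

program vectorFortran
	use cudafor
	use vectorDevice
	implicit none

	integer, parameter :: inputSize = 100000, blockSize = 1024
	integer :: gridSize
	integer, dimension(:), allocatable :: h_C
	integer, device, dimension(:), allocatable :: d_A, d_B, d_C

	allocate(h_C(inputSize))
	allocate(d_A(inputSize), d_B(inputSize), d_C(inputSize))

	d_A = 1
	d_B = 2
	d_C = 0   ! the fix: the kernel accumulates into C, so C must start at zero

	gridSize = (inputSize + blockSize - 1) / blockSize
	call vectorAddKernel<<<gridSize, blockSize>>>(d_C, d_A, d_B, inputSize)

	! Copy back and check: 20 iterations of (1 + 2) should give 60 everywhere.
	h_C = d_C
	if (all(h_C == 60)) then
		print *, "Valid !!!"
	else
		print *, "Invalid !!!"
	end if

	deallocate(h_C)
	deallocate(d_A, d_B, d_C)
end program vectorFortran
```

Rebuilding with a different ccXX value and comparing the reported register counts is a quick way to see how much the target affects this kernel.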

-Mat

Dyoun · August 14, 2017, 4:10pm

Thank you!

Daniel.Kokron · July 15, 2019, 9:29pm

Regarding the “RFE for launch bounds in CUDA Fortran”, please add me to the list of requestors.
