Ways to reduce registers per thread in CUDA Fortran?


I’m porting a program written in CUDA C to CUDA Fortran and trying to reduce the number of registers without the register limit feature. The original C code uses 45 registers per thread, but the Fortran code uses over 100, which limits the theoretical occupancy to 12.5%.

For the C code, I used launch_bounds and kept local variable scopes as tight as possible to decrease the register count. I use texture memory and shared memory heavily in both codes. Are there other ways to reduce register usage in CUDA Fortran?

Thanks in advance.

Unfortunately, there is not as much control over register usage in CUDA Fortran as in CUDA C, as there is basically another level of translation/compilation.

How are you compiling (which options)? The CUDA version and the options you use can make a difference, in my experience. For instance, I’ve seen cc30 use fewer registers than cc35, even though both builds run fine.

We do have an open RFE for launch bounds in CUDA Fortran, and I’ll add you to the list.

Thank you for your answer.

I’m using the latest version of Visual Fortran and the CUDA Fortran Compute Capability setting is set to automatic. Could you tell me how to set the compute capability manually from the command line?

Also, I wrote a simple vector addition code to compare the number of registers in CUDA C and CUDA Fortran. The CUDA C code uses 12 registers and the CUDA Fortran code uses 16. That’s fine, but I found that the code uses 22 registers if I declare the device variables in the module ‘vectorDevice’ (so I don’t need to pass them as arguments to the kernel). Below is the version of the code that declares the device variables in the host code.

module vectorDevice
contains
		attributes(global) subroutine vectorAddKernel(C, A, B, size)
			implicit none

			integer, value :: size
			integer :: i, n

			integer, dimension(:) :: A, B, C

			i = (blockIdx%x - 1) * blockDim%x + threadIdx%x

			if (i .le. size) then
				do n = 1, 20
					C(i) = C(i) + A(i) + B(i)
				end do
			end if

		end subroutine vectorAddKernel
end module vectorDevice

program vectorFortran
	use cudafor
	use vectorDevice
	implicit none

	integer :: inputSize, gridSize, blockSize, n, m, cValue
	logical :: valid

	integer, dimension(:), allocatable :: h_A, h_B, h_C
	integer, device, dimension(:), allocatable :: d_A, d_B, d_C

	inputSize = 100000

	allocate(h_A(inputSize), h_B(inputSize), h_C(inputSize))
	allocate(d_A(inputSize), d_B(inputSize), d_C(inputSize))

	h_A = 1
	h_B = 2
	d_A = h_A
	d_B = h_B

	blockSize = 1024
	gridSize = ceiling(real(inputSize) / blockSize)

	call vectorAddKernel<<<gridSize, blockSize>>>(d_C, d_A, d_B, inputSize)

	h_C = d_C

	valid = .true.

	do n = 1, inputSize
		cValue = 0

		do m = 1, 20
			cValue = cValue + h_A(n) + h_B(n)
		end do

		if (h_C(n) .ne. cValue) then
			valid = .false.
			print *, "Invalid !!!", h_C(n), ":", cValue
		end if
	end do

	if (valid) then
		print *, "Valid !!!"
	end if

	deallocate(h_A, h_B, h_C)
	deallocate(d_A, d_B, d_C)

end program vectorFortran
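
The module-level variant mentioned above (the one that reported 22 registers) is not shown in this post. Here is a minimal sketch of what it might look like, assuming the device arrays are declared as allocatable module variables and the kernel takes only the size. The names vectorDeviceGlobals, vectorAddKernelGlobals and vectorFortranGlobals are made up for illustration and are not from the measured code.

```fortran
! Sketch only: module, kernel and program names are hypothetical.
module vectorDeviceGlobals
	implicit none
	! Device arrays at module scope, so the kernel needs no array arguments.
	integer, device, dimension(:), allocatable :: d_A, d_B, d_C
contains
	attributes(global) subroutine vectorAddKernelGlobals(size)
		integer, value :: size
		integer :: i, n

		i = (blockIdx%x - 1) * blockDim%x + threadIdx%x

		if (i .le. size) then
			do n = 1, 20
				d_C(i) = d_C(i) + d_A(i) + d_B(i)
			end do
		end if
	end subroutine vectorAddKernelGlobals
end module vectorDeviceGlobals

program vectorFortranGlobals
	use cudafor
	use vectorDeviceGlobals
	implicit none

	integer, parameter :: inputSize = 100000, blockSize = 1024
	integer, dimension(:), allocatable :: h_C
	integer :: gridSize

	allocate(h_C(inputSize))
	allocate(d_A(inputSize), d_B(inputSize), d_C(inputSize))

	d_A = 1
	d_B = 2
	d_C = 0   ! the kernel accumulates into d_C, so start from a known value

	! Integer round-up of inputSize / blockSize
	gridSize = (inputSize + blockSize - 1) / blockSize
	call vectorAddKernelGlobals<<<gridSize, blockSize>>>(inputSize)

	h_C = d_C   ! copy the result back to the host
	print *, "all elements equal 60:", all(h_C == 60)

	deallocate(h_C)
	deallocate(d_A, d_B, d_C)
end program vectorFortranGlobals
```

One plausible reason for the higher register count in this form (an assumption, not something measured here) is that the kernel has to fetch each array's descriptor from device memory instead of receiving it as an argument.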

Thank you for your help.

Could you tell me how to set the compute capability manually from the command line?

Sure. It’s “-Mcuda=ccXX”, where “XX” is the compute capability version. You can see a full list of available CC versions via the “help” command line option “pgfortran -help -Mcuda”.
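
If you are not sure which value to use for “XX”, a small program like the sketch below can print the compute capability of device 0 using cudaGetDeviceProperties from the cudafor module. The file name in the comment is made up, and the ptxinfo suboption shown there (which, as far as I know, prints the ptxas register counts per kernel) is an assumption that should be checked against the -help output for your compiler version.

```fortran
! Assumed build line (file name is hypothetical):
!   pgfortran -Mcuda=cc35,ptxinfo query_cc.cuf
program queryCC
	use cudafor
	implicit none

	type(cudaDeviceProp) :: prop
	integer :: istat

	! Query the properties of device 0
	istat = cudaGetDeviceProperties(prop, 0)
	if (istat /= cudaSuccess) then
		print *, "cudaGetDeviceProperties failed: ", trim(cudaGetErrorString(istat))
		stop 1
	end if

	print '(a,a)', "Device: ", trim(prop%name)
	! Prints the flag in the ccXX form, e.g. major 3 and minor 5 give cc35
	print '(a,i0,i0)', "-Mcuda=cc", prop%major, prop%minor
	print '(a,i0)', "Registers per block: ", prop%regsPerBlock
end program queryCC
```

Building the same kernel for two different targets and comparing the reported register counts is then a quick way to see whether the compute capability setting matters for your code.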

Note that your program has an error in it: the C array is never initialized before the kernel accumulates into it. This doesn’t affect register usage, but it will give you incorrect results.


Thank you!

Regarding the “RFE for launch bounds in CUDA Fortran”, please add me to the list of requestors.