CUDA strange register usage

Hello.

I’m optimizing register usage so that occupancy can be maximized.

But my kernel is using too many registers per thread.

I can’t understand why.

Please refer to the following code.

ATTRIBUTES(GLOBAL) SUBROUTINE cuRayTraceKernelOpt(cuRotRay, cuConst, cuDevice, phis, xst, src, jout,				&
											      CorePhiAngIn, CorePhiAngOut, lJout, lScat1, iz, iSweep, gb, ge)
USE PARAM
USE CUDATypeDef,  ONLY : cuRotRay_Type,		cuConst_Type,		cuDevice_Type
USE CUDAMOD,	  ONLY : cuTrackRayOpt
USE CUDAFOR
IMPLICIT NONE

TYPE(cuConst_Type) :: cuConst
TYPE(cuDevice_Type) :: cuDevice
TYPE(cuRotRay_Type) :: cuRotRay(cuConst%nRotRay)
!--- Read by Texture Cache
REAL(4), DEVICE, INTENT(IN) :: xst(cuConst%ng, cuConst%nCoreFsr)
REAL(4), DEVICE, INTENT(IN) :: src(cuConst%ng, cuConst%nCoreFsr)
REAL, DEVICE, INTENT(IN) :: CorePhiAngIn(cuConst%nPolarAngle, cuConst%ng, cuConst%nPhiAngSv)
!-------------------------
REAL, DEVICE :: phis(cuConst%nPolarAngle, cuConst%ng, cuConst%nCoreFsr)
REAL, DEVICE :: jout(3, cuConst%nPolarAngle, cuConst%ng, 4, cuConst%nxy)
REAL, DEVICE :: CorePhiAngOut(cuConst%nPolarAngle, cuConst%ng, cuConst%nPhiAngSv)
LOGICAL, VALUE :: lJout, lScat1
INTEGER, VALUE :: iz, iSweep, gb, ge

INTEGER :: i, iRotRay

DO i = 1, cuDevice%cuWorkset(blockIdx%x, iSweep)%nRotRay
  iRotRay = cuDevice%cuWorkset(blockIdx%x, iSweep)%RotRayList(i)
  CALL cuTrackRayOpt(cuRotRay(iRotRay), cuConst, phis, xst, src, jout,													&
			      CorePhiAngIn, CorePhiAngOut, lJout, iz, gb, ge)
ENDDO

END SUBROUTINE

ATTRIBUTES(DEVICE) SUBROUTINE cuTrackRayOpt(cuRotRay, cuConst, phis, xst, src, jout,								&
										    CorePhiAngIn, CorePhiAngOut, ljout, iz, gb, ge)
USE PARAM
USE CUDATypeDef,  ONLY : cuRotRay_Type,		cuConst_Type
USE CUDAFOR
IMPLICIT NONE

TYPE(cuRotRay_Type) :: cuRotRay
TYPE(cuConst_Type) :: cuConst
!--- Read by Texture Cache
REAL(4), DEVICE, INTENT(IN) :: xst(cuConst%ng, cuConst%nCoreFsr)
REAL(4), DEVICE, INTENT(IN) :: src(cuConst%ng, cuConst%nCoreFsr)
REAL, DEVICE, INTENT(IN) :: CorePhiAngIn(cuConst%nPolarAngle, cuConst%ng, cuConst%nPhiAngSv)
!-------------------------
REAL, DEVICE :: phis(cuConst%nPolarAngle, cuConst%ng, cuConst%nCoreFsr)
REAL, DEVICE :: jout(3, cuConst%nPolarAngle, cuConst%ng, 4, cuConst%nxy)
REAL, DEVICE :: CorePhiAngOut(cuConst%nPolarAngle, cuConst%ng, cuConst%nPhiAngSv)
LOGICAL :: lJout
INTEGER :: iz, gb, ge

INTEGER :: iazi, ipin, isurf, ireg
INTEGER :: j, k, ir
INTEGER :: ipol, ig, irot
INTEGER :: ExpAppIdx
REAL(4) :: temp, tau, phid, PhiAngOut, ExpApp, wt, wt2(4)

ipol = threadIdx%x
ig = threadIdx%y + (gb - 1)
irot = threadIdx%z

PhiAngOut = CorePhiAngIn(ipol, ig, cuRotRay%PhiAngInSvIdx(irot))

IF (irot .EQ. 1) THEN

DO j = 1, cuRotRay%nCoreRay
  iazi = cuRotRay%iAzi(j)
  wt = cuConst%wtang(ipol, iazi)
  wt2 = cuConst%wtang2(ipol, iazi, :)
  DO k = cuRotRay%CoreRayIdxSt(j), cuRotRay%CoreRayIdxSt(j + 1) - 1
    ipin = cuRotRay%PinIdx(k)
    IF (lJout) THEN
      isurf = cuRotRay%SurfIdx(1, k)
      !$ACC ATOMIC
      jout(1, ipol, ig, isurf, ipin) = jout(1, ipol, ig, isurf, ipin) + wt * PhiAngOut
      !$ACC ATOMIC
      jout(3, ipol, ig, isurf, ipin) = jout(3, ipol, ig, isurf, ipin) + wt2(isurf) * PhiAngOut
    ENDIF
    DO ir = cuRotRay%PinRayIdxSt(k), cuRotRay%PinRayIdxSt(k + 1) - 1
      ireg = cuRotRay%FsrIdx(ir)
      tau = - cuRotRay%LenSeg(ir) * xst(ig, ireg)
      ExpAppIdx = max(INT(tau), -40000); ExpAppIdx = min(0, ExpAppIdx)
      ExpApp = cuConst%EXPA(ipol, ExpAppIdx) * tau + cuConst%EXPB(ipol, ExpAppIdx)
      phid = (PhiAngOut - src(ig, ireg)) * ExpApp
      PhiAngOut = PhiAngOut - phid
      !$ACC ATOMIC
      phis(ipol, ig, ireg) = phis(ipol, ig, ireg) + wt * phid
    ENDDO
    IF (lJout) THEN
      isurf = cuRotRay%SurfIdx(2, k)
      !$ACC ATOMIC
      jout(2, ipol, ig, isurf, ipin) = jout(2, ipol, ig, isurf, ipin) + wt * PhiAngOut
      !$ACC ATOMIC
      jout(3, ipol, ig, isurf, ipin) = jout(3, ipol, ig, isurf, ipin) + wt2(isurf) * PhiAngOut
    ENDIF
  ENDDO
ENDDO

ELSE

DO j = cuRotRay%nCoreRay, 1, -1
  iazi = cuRotRay%iAzi(j)
  wt = cuConst%wtang(ipol, iazi)
  wt2 = cuConst%wtang2(ipol, iazi, :)
  DO k = cuRotRay%CoreRayIdxSt(j + 1) - 1, cuRotRay%CoreRayIdxSt(j), -1
    ipin = cuRotRay%PinIdx(k)
    IF (lJout) THEN
      isurf = cuRotRay%SurfIdx(2, k)
      !$ACC ATOMIC
      jout(1, ipol, ig, isurf, ipin) = jout(1, ipol, ig, isurf, ipin) + wt * PhiAngOut
      !$ACC ATOMIC
      jout(3, ipol, ig, isurf, ipin) = jout(3, ipol, ig, isurf, ipin) + wt2(isurf) * PhiAngOut
    ENDIF
    DO ir = cuRotRay%PinRayIdxSt(k + 1) - 1, cuRotRay%PinRayIdxSt(k), -1
      ireg = cuRotRay%FsrIdx(ir)
      tau = - cuRotRay%LenSeg(ir) * xst(ig, ireg)
      ExpAppIdx = max(INT(tau), -40000); ExpAppIdx = min(0, ExpAppIdx)
      ExpApp = cuConst%EXPA(ipol, ExpAppIdx) * tau + cuConst%EXPB(ipol, ExpAppIdx)
      phid = (PhiAngOut - src(ig, ireg)) * ExpApp
      PhiAngOut = PhiAngOut - phid
      !$ACC ATOMIC
      phis(ipol, ig, ireg) = phis(ipol, ig, ireg) + wt * phid
    ENDDO
    IF (lJout) THEN
      isurf = cuRotRay%SurfIdx(1, k)
      !$ACC ATOMIC
      jout(2, ipol, ig, isurf, ipin) = jout(2, ipol, ig, isurf, ipin) + wt * PhiAngOut
      !$ACC ATOMIC
      jout(3, ipol, ig, isurf, ipin) = jout(3, ipol, ig, isurf, ipin) + wt2(isurf) * PhiAngOut
    ENDIF
  ENDDO
ENDDO

ENDIF

CorePhiAngOut(ipol, ig, cuRotRay%PhiAngOutSvIdx(irot)) = PhiAngOut

END SUBROUTINE

With this kernel, each thread uses 48 registers even though I have only 23 four-byte local scalars. (In my code, the default real size is 8 bytes.)

What is consuming registers other than local scalars?

In debug mode, each thread consumes only 36 registers, so I suppose the compiler's optimizer is generating additional temporaries.

How do I track them down?
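(For reference, here is how I'm reading these numbers. This is a minimal sketch assuming the PGI `pgfortran` driver; the source file name is a placeholder. The `ptxinfo` suboption makes ptxas report per-kernel resource usage at compile time.)

```shell
# Compile with ptxas resource reporting enabled. "-Mcuda=ptxinfo"
# makes the compiler print per-kernel register, shared-memory, and
# local-memory (spill) usage. File name is a placeholder.
pgfortran -Mcuda=ptxinfo -c cuRayTrace.cuf
```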

What is more interesting is that if I remove those atomic directives, each thread uses 71 registers.

I can't understand why removing the atomic directives results in far more register consumption. Can you explain this behavior?

And please give me some tips to reduce register usage. I want each thread to use fewer than 32 registers.

Using shared memory could be a solution, but since shared memory is vulnerable to bank conflicts and much slower than registers, I would rather avoid it. Also, there's no guarantee that accesses to the same bank will always be multicast or broadcast.

Maybe inlining the device subroutine would help?

The extra registers are probably holding the array addresses so that they aren't recomputed on every array access.

Try using the flag "-Mcuda=maxregcount:32" to limit the number of registers per thread. You may get some spilling, but as long as the spills stay in the L1/L2 cache, it's not a problem.
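A minimal compile line for this (source file name is a placeholder); combining it with `ptxinfo` lets you see both the capped register count and any spills it introduces:

```shell
# Cap register usage at 32 per thread and report the resulting
# resource usage. Any spilling shows up in the ptxas output as
# "spill stores" / "spill loads" and local-memory bytes.
pgfortran -Mcuda=maxregcount:32,ptxinfo -c cuRayTrace.cuf
```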

Note that achieving 100% theoretical occupancy does not guarantee good performance.

  • Mat