More questions about CUDA

Hello.

Sorry to keep bothering you.

  1. For the third argument of a kernel call, i.e., the shared memory size, what should I specify? Do I have to specify the total shared memory size or the shared memory size required for each block?

  2. How can I define a thread-local automatic array?

  3. If I have a fixed-size local array in a device subroutine which is called within a kernel, is the local array allocated every time the subroutine is called? If so, is it better to define a fixed-size local array in the kernel and pass it as an argument?

  1. For the third argument of a kernel call, i.e., the shared memory size, what should I specify? Do I have to specify the total shared memory size or the shared memory size required for each block?

This is the size in bytes of the shared memory per thread block.
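
For example, here is a minimal sketch (the kernel name and launch dimensions are made up). A shared array declared with assumed size inside the kernel picks up its extent from that third argument, and the byte count you pass is per block, not per grid:

ATTRIBUTES(GLOBAL) SUBROUTINE demoKernel()
USE CUDAFOR
IMPLICIT NONE
REAL, SHARED :: buf(*)        ! assumed-size shared array, sized at launch
buf(threadIdx%x) = 0.0
END SUBROUTINE

! Host side:
TYPE(dim3) :: grid, tBlock
INTEGER :: nbytes
tBlock = dim3(128, 1, 1); grid = dim3(64, 1, 1)
nbytes = 4 * tBlock%x         ! shared memory in bytes PER THREAD BLOCK, not per grid
CALL demoKernel<<<grid, tBlock, nbytes>>>()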

  2. How can I define a thread-local automatic array?

Declare an automatic array in the kernel without the “shared” attribute. However, I strongly recommend against local automatic arrays: they require every thread to allocate data on the device, which is very poor for performance. Use fixed-size, global, or shared arrays if possible.
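
For reference, a sketch of what that looks like (names are made up); each thread gets its own copy of the automatic array, allocated from the device heap at runtime:

ATTRIBUTES(GLOBAL) SUBROUTINE autoDemo(n)
USE CUDAFOR
IMPLICIT NONE
INTEGER, VALUE :: n
REAL :: work(n)               ! automatic array: private to each thread, heap-allocated
REAL, SHARED :: buf(256)      ! fixed-size shared array by contrast (assumes blockDim%x <= 256)
INTEGER :: i
DO i = 1, n
  work(i) = REAL(i + threadIdx%x)
ENDDO
buf(threadIdx%x) = work(n)
END SUBROUTINE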

  3. If I have a fixed-size local array in a device subroutine which is called within a kernel, is the local array allocated every time the subroutine is called? If so, is it better to define a fixed-size local array in the kernel and pass it as an argument?

No, a fixed-size local array is not dynamically allocated.

Hope this helps,
Mat

But regarding question 2, I have a strange problem.

Please refer to the following code.

#define		MAX_RAY_SEG		227
#define		MAX_ASY_RAY		3

ATTRIBUTES(DEVICE) SUBROUTINE cuTrackRay(RayInfo, CoreInfo, cuConst, cuMOC, cuTrack, DcmpAsyRay, ljout, iz, gb, ge)
USE PARAM
USE TYPEDEF,	  ONLY : RayInfo_Type,		CoreInfo_Type,		CellRayInfo_Type,	DcmpAsyRayInfo_Type
USE CUDATypeDef,  ONLY : cuConst_Type,		cuMOC_Type,			cuTrack_Type
USE CUDAFOR
IMPLICIT NONE

TYPE(RayInfo_Type) :: RayInfo
TYPE(CoreInfo_Type) :: CoreInfo
TYPE(cuConst_Type) :: cuConst
TYPE(cuMOC_Type) :: cuMOC
TYPE(cuTrack_Type) :: cuTrack
TYPE(DcmpAsyRayInfo_Type) :: DcmpAsyRay
LOGICAL, VALUE :: ljout
INTEGER, VALUE :: iz, gb, ge

TYPE(CellRayInfo_Type), POINTER :: CellRay

INTEGER :: mp(2)
INTEGER :: iazi, irotray, iasyray, iceray, iray, irayseg, idir
INTEGER :: nAsyRay, nPinRay, nRaySeg, FsrIdxSt
INTEGER :: ipin, icel, ibcel, iasy, ireg, isurf1, isurf2
INTEGER :: iRaySegIdx
INTEGER :: j, l, ir
INTEGER :: jbeg, jend, jinc
INTEGER :: ipol, ig, irot, dir_tid, block_tid
INTEGER :: ExpAppIdx
REAL :: tau, phiobd, wt, wt2(4)

REAL :: PhiAngOut(MAX_RAY_SEG + 2, MAX_ASY_RAY)
REAL :: ExpApp(MAX_RAY_SEG, MAX_ASY_RAY)
REAL :: DelPhi(MAX_RAY_SEG, MAX_ASY_RAY)

INTEGER, SHARED :: CellRayIdxSt(cuConst%nMaxCellRay, cuConst%nMaxAsyRay, 2)
INTEGER, SHARED :: SurfIdx(cuConst%nMaxCellRay, cuConst%nMaxAsyRay, 2)
INTEGER, SHARED :: PinIdx(cuConst%nMaxCellRay, cuConst%nMaxAsyRay)
INTEGER, SHARED :: nTotRaySeg(cuConst%nMaxAsyRay)
INTEGER, SHARED :: nTotCellRay(cuConst%nMaxAsyRay)

mp = (/2, 1/)

ipol = threadIdx%x
ig = threadIdx%y + (gb - 1)
irot = threadIdx%z
dir_tid = threadIdx%x + (threadIdx%y - 1) * blockDim%x
block_tid = threadIdx%x + (threadIdx%y - 1) * blockDim%x + (threadIdx%z - 1) * blockDim%x * blockDim%y

nAsyRay = DcmpAsyRay%nAsyRay
iRotRay = DcmpAsyRay%iRotRay
iAsy = DcmpAsyRay%iAsy; iRay = DcmpAsyRay%iRay

DO j = 1, nAsyRay
  iRaySegIdx = 0
  iAsyRay = DcmpAsyRay%AsyRayList(j)
  nPinRay = RayInfo%AsyRay(iAsyRay)%nCellRay
  DO l = 1, nPinRay
    ipin = RayInfo%AsyRay(iAsyRay)%PinIdx(l)
    iceray = RayInfo%AsyRay(iAsyRay)%PinRayIdx(l)
    ipin = CoreInfo%Asy(iAsy)%GlobalPinIdx(ipin)
    icel = CoreInfo%Pin(ipin)%Cell(iz)
    FsrIdxSt = CoreInfo%Pin(ipin)%FsrIdxSt
    ibcel = CoreInfo%CellInfo(icel)%BaseCellStr
    CellRay => CoreInfo%CellInfo(ibcel)%CellRay(iceray)
    nRaySeg = CellRay%nSeg
    IF (block_tid .EQ. 1) THEN
      PinIdx(l, j) = ipin
      CellRayIdxSt(l, j, 1) = iRaySegIdx + nRaySeg
      CellRayIdxSt(l, j, 2) = iRaySegIdx + 1
      SurfIdx(l, j, 1) = RayInfo%AsyRay(iAsyRay)%PinRaySurf(2, l)
      SurfIdx(l, j, 2) = RayInfo%AsyRay(iAsyRay)%PinRaySurf(1, l)
      nTotCellRay(j) = nPinRay
    ENDIF
    DO iRaySeg = 1, nRaySeg
      ireg = FsrIdxSt + CellRay%LocalFsrIdx(iRaySeg) - 1
      iRaySegIdx = iRaySegIdx + 1
      IF (dir_tid .EQ. 1) cuTrack%FsrIdx(iRaySegIdx, j) = ireg
      tau = - CellRay%LenSeg(iRaySeg) * cuMOC%xst(ig, ireg)
      ExpAppIdx = max(INT(tau), -40000); ExpAppIdx = min(0, ExpAppIdx)
      ExpApp(iRaySegIdx, j) = cuConst%EXPA(ipol, ExpAppIdx) * tau + cuConst%EXPB(ipol, ExpAppIdx)
    ENDDO
  ENDDO
  IF (block_tid .EQ. 1) nTotRaySeg(j) = iRaySegIdx
ENDDO

CALL syncThreads()

jbeg = 1; jend = nAsyRay; jinc = 1
IF (irot .EQ. 2) THEN
  jbeg = nAsyRay; jend = 1; jinc = -1
ENDIF

IF (DcmpAsyRay%lRotRayBeg(irot)) THEN
  phiobd = cuMOC%CorePhiAngIn(ipol, ig, RayInfo%PhiAngInSvIdx(iRotRay, irot))
ELSE
  phiobd = cuMOC%AsyPhiAngIn(ipol, ig, DcmpAsyRay%prevRay(irot), DcmpAsyRay%prevAsy(irot), irot)
ENDIF
DO j = jbeg, jend, jinc
  idir = DcmpAsyRay%DirList(j); iazi = DcmpAsyRay%AziList(j)
  wt = cuConst%wtang(ipol, iazi)
  wt2 = cuConst%wtang2(ipol, iazi, :)
  nRaySeg = nTotRaySeg(j)
  IF (irot .EQ. 2) idir = mp(idir)
  IF (idir .EQ. 1) THEN
    PhiAngOut(1, j) = phiobd
    DO ir = 1, nRaySeg
      ireg = cuTrack%FsrIdx(ir, j)
      DelPhi(ir, j) = (PhiAngOut(ir, j) - cuMOC%src(ig, ireg)) * ExpApp(ir, j)
      PhiAngOut(ir + 1, j) = PhiAngOut(ir, j) - DelPhi(ir, j)
    ENDDO
    phiobd = PhiAngOut(nRaySeg + 1, j)
  ELSE
    PhiAngOut(nRaySeg + 2, j) = phiobd
    DO ir = nRaySeg, 1, -1
      ireg = cuTrack%FsrIdx(ir, j)
      DelPhi(ir, j) = (PhiAngOut(ir + 2, j) - cuMOC%src(ig, ireg)) * ExpApp(ir, j)
      PhiAngOut(ir + 1, j) = PhiAngOut(ir + 2, j) - DelPhi(ir, j)
    ENDDO
    phiobd = PhiAngOut(2, j)
  ENDIF
ENDDO
IF (DcmpAsyRay%lRotRayEnd(irot)) THEN
  cuMOC%CorePhiAngOut(ipol, ig, RayInfo%PhiAngOutSvIdx(iRotRay, irot)) = phiobd
ELSE
  cuMOC%AsyPhiAngOut(ipol, ig, iRay, iAsy, irot) = phiobd
ENDIF

DO j = jbeg, jend, jinc
  iazi = DcmpAsyRay%AziList(j)
  wt = cuConst%wtang(ipol, iazi)
  DO ir = 1, nTotRaySeg(j)
    ireg = cuTrack%FsrIdx(ir, j)
    !$ACC ATOMIC
    cuMOC%phis(ig, ireg) = cuMOC%phis(ig, ireg) + wt * DelPhi(ir, j)
  ENDDO
  IF (lJout) THEN
    idir = DcmpAsyRay%DirList(j)
    wt2 = cuConst%wtang2(ipol, iazi, :)
    IF (irot .EQ. 2) idir = mp(idir)
    IF (idir .EQ. 1) THEN
      DO ir = 1, nTotCellRay(j)
        icel = PinIdx(ir, j); isurf1 = SurfIdx(ir, j, 1); isurf2 = SurfIdx(ir, j, 2)
        !$ACC ATOMIC
        cuMOC%Jout(2, ig, isurf1, icel) = cuMOC%Jout(2, ig, isurf1, icel) + wt * PhiAngOut(CellRayIdxSt(ir, j, 1) + 1, j)
        !$ACC ATOMIC
        cuMOC%Jout(3, ig, isurf1, icel) = cuMOC%Jout(3, ig, isurf1, icel) + wt2(isurf1) * PhiAngOut(CellRayIdxSt(ir, j, 1) + 1, j)
        !$ACC ATOMIC
        cuMOC%Jout(1, ig, isurf2, icel) = cuMOC%Jout(1, ig, isurf2, icel) + wt * PhiAngOut(CellRayIdxSt(ir, j, 2), j)
        !$ACC ATOMIC
        cuMOC%Jout(3, ig, isurf2, icel) = cuMOC%Jout(3, ig, isurf2, icel) + wt2(isurf2) * PhiAngOut(CellRayIdxSt(ir, j, 2), j)
      ENDDO
    ELSE
      DO ir = 1, nTotCellRay(j)
        icel = PinIdx(ir, j); isurf1 = SurfIdx(ir, j, 1); isurf2 = SurfIdx(ir, j, 2)
        !$ACC ATOMIC
        cuMOC%Jout(2, ig, isurf2, icel) = cuMOC%Jout(2, ig, isurf2, icel) + wt * PhiAngOut(CellRayIdxSt(ir, j, 2) + 1, j)
        !$ACC ATOMIC
        cuMOC%Jout(3, ig, isurf2, icel) = cuMOC%Jout(3, ig, isurf2, icel) + wt2(isurf2) * PhiAngOut(CellRayIdxSt(ir, j, 2) + 1, j)
        !$ACC ATOMIC
        cuMOC%Jout(1, ig, isurf1, icel) = cuMOC%Jout(1, ig, isurf1, icel) + wt * PhiAngOut(CellRayIdxSt(ir, j, 1) + 2, j)
        !$ACC ATOMIC
        cuMOC%Jout(3, ig, isurf1, icel) = cuMOC%Jout(3, ig, isurf1, icel) + wt2(isurf1) * PhiAngOut(CellRayIdxSt(ir, j, 1) + 2, j)
      ENDDO
    ENDIF
  ENDIF
ENDDO

END SUBROUTINE

This code runs fine.

But if I change

REAL :: PhiAngOut(MAX_RAY_SEG + 2, MAX_ASY_RAY)
REAL :: ExpApp(MAX_RAY_SEG, MAX_ASY_RAY)
REAL :: DelPhi(MAX_RAY_SEG, MAX_ASY_RAY)

to automatic arrays,

REAL :: PhiAngOut(cuConst%nMaxRaySeg + 2, cuConst%nMaxAsyRay)
REAL :: ExpApp(cuConst%nMaxRaySeg, cuConst%nMaxAsyRay)
REAL :: DelPhi(cuConst%nMaxRaySeg, cuConst%nMaxAsyRay)

a segmentation fault occurs (error code 700).

cuConst%nMaxRaySeg and cuConst%nMaxAsyRay are exactly the same as MAX_RAY_SEG and MAX_ASY_RAY.

This is why I thought there might be some other way to define thread-local automatic arrays.

What could be the problem?

Hi CNJ,

My best guess is that you’re getting a heap overflow. The device-side heap defaults to 8 MB, which is relatively small.

What are the values of “cuConst%nMaxRaySeg” and “cuConst%nMaxAsyRay” and how many threads are being used?

If you are indeed using more than 8 MB, then you can increase this limit by calling “cudaDeviceSetLimit” at the beginning of your program.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#heap-memory-allocation
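
In CUDA Fortran that would look roughly like the following sketch (the 64 MB figure is just an illustration; size it to what your automatic arrays actually need):

USE CUDAFOR
INTEGER :: istat

! Raise the device-side heap from its 8 MB default before any kernels are launched
istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64_8 * 1024 * 1024)
IF (istat /= cudaSuccess) PRINT *, 'cudaDeviceSetLimit: ', cudaGetErrorString(istat)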

  • Mat

Oh, I didn’t know that.

You are right.

In total, those arrays consume ~30 MB.

Then, when are those automatic arrays allocated?

Are they allocated only at kernel launch, or every time that device subroutine is called?

And I have additional questions.

phis(gb : ge, :) = phis(gb : ge, :) + cuMOC(tid)%phis(gb : ge, :)

In this line, phis is on host while cuMOC(tid)%phis is on device.

Is this a good way to bring data back to host?

Or shall I make temporary storage on the host, copy the device data to it, and add it to the host data?

And why do !$ACC ATOMIC and atomicAdd() give different results?

Actually using !$ACC ATOMIC gives correct results, while changing the directives in that code to atomicAdd() gives incorrect ones.

Are they allocated only at kernel launch, or every time that device subroutine is called?

There’s no way to allocate these at kernel launch since the size would vary from call to call of the device routine. Instead, there’s an implicit allocation of the automatic arrays every time the device routine is called, by every thread. The arrays are then implicitly deallocated when the routine returns. Note that this is how automatics work on the host as well.


In this line, phis is on host while cuMOC(tid)%phis is on device.

This would be illegal since you can’t mix a host and a device array within the same expression. Instead, you’ll need to copy the device array back to a host array, which can then be used in the expression.
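
For example (a sketch with made-up array names), copying via a simple assignment does the device-to-host transfer, after which the host copy can be used freely:

REAL, DEVICE, ALLOCATABLE :: d_phis(:,:)     ! device data, already allocated
REAL, ALLOCATABLE :: h_phis(:,:)             ! host copy with the same shape

h_phis = d_phis                                      ! device-to-host copy via assignment
phis(gb:ge, :) = phis(gb:ge, :) + h_phis(gb:ge, :)   ! host-only arithmetic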


And why do !$ACC ATOMIC and atomicAdd() give different results?

Actually using !$ACC ATOMIC gives correct results, while changing the directives in that code to atomicAdd() gives incorrect ones.

What does your atomicAdd call look like? The routine returns the value of the first argument before it’s been modified. Hence, it’s an error to use something like:

cuMOC%Jout(2, ig, isurf1, icel) = atomicAdd(cuMOC%Jout(2, ig, isurf1, icel),wt * PhiAngOut(CellRayIdxSt(ir, j, 1) + 1, j))

Since you’d be overwriting the variable with its original value.

Instead, you’d want something like:

origval = atomicAdd(cuMOC%Jout(2, ig, isurf1, icel),wt * PhiAngOut(CellRayIdxSt(ir, j, 1) + 1, j))
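
(Here, origval is just a scratch variable that receives the value atomicAdd returns, i.e. the pre-update contents of Jout; it can simply be ignored.)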
  • Mat

Oh, I didn’t know that mixing host and device arrays is illegal, because it just worked.

I suppose my wording made it sound like cuMOC is a device variable.

cuMOC is a derived-type variable with the POINTER attribute, defined in a module.

It contains device allocatable arrays such as phis.

cuMOC is allocated on the host side with a size equal to the number of GPUs.

And cuMOC(i)%phis is allocated on the i-th device.

I think this is a typical technique used in multi-GPU programming.

So, in this case, which way of data merging is more efficient?

So, in this case, which way of data merging is more efficient?

Thanks for the clarification. In this case, the compiler will create a host-side temporary array, copy the device array to the host, and then perform the computation. If the sub-array is contiguous, the performance should be no different than if you manually copied the array. If the array is non-contiguous, the compiler would need to move the data in chunks, so it may be better to create a host temp array yourself and then copy the entire array at once.
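
A manual version of that merge might look like this sketch (the temporary’s name is made up; phis and cuMOC(tid)%phis are assumed to have the same shape):

REAL, ALLOCATABLE :: phis_tmp(:,:)

ALLOCATE(phis_tmp(SIZE(phis, 1), SIZE(phis, 2)))
phis_tmp = cuMOC(tid)%phis                                ! one contiguous device-to-host copy
phis(gb:ge, :) = phis(gb:ge, :) + phis_tmp(gb:ge, :)      ! accumulate entirely on the host
DEALLOCATE(phis_tmp)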

  • Mat