Strange data behaviors

Hi.

I’ve been working on my code for a while and now I’m now at the finalizing step.

However, I have some unresolved problems.

I will post three code sections. You don’t need to go through all of them, only some parts are related to today’s problems.

The problems are mainly related to the variable GPUWorkerDat%FsrIdx.

First, if I remove line (A), I get illegal address error.

Second, if I remove line (B) or (C), I get incorrect results.

(A) ~ (C) are merely printing lines, but why are they causing such troubles?

Third, if I use !$ACC ENTER DATA CREATE instead of COPYIN in code section 2, I get strange result at line (A). This is how the console window looks like:


17 7
24 7
9 7
8 7
7 7
12 8
20 8
22 8
40 8
9 8

That is, ireg is not properly written to GPUWorkerDat%FsrIdx. The values of GPUWorkerDat%FsrIdx become some random integers from 1 ~ 8.

If I change the loop from vector loop to sequential loop, this problem vanishes. However, at line (B) or (C), the values of GPUWorkerDat%FsrIdx come back to those unintended random integers.

Why the compiler change values by its own will?

Hi CNJ,

This is a very complex and challenging code. I like it!

I can only guess as to what’s wrong, but my feeling for A, B, and C, is that you have synchronization issue. Both the “nRaySeq” and “ifsr” are private to a worker but shared between vectors. Given that on NVIDIA targets, vectors are grouped into warp, it’s possible that the threads on one warp is updating these variables while another warp is reading them. Granted, all the vectors should be updating the variable with the same value, but I’d like to explore this idea.

In general I recommend to not put scalars in a “private” clause. In OpenACC, scalars are private by default. Only if you have case where there’s a live-out, the scalar is global, or if you’re passing the scalar by address to a routine, do you need to add them to a “private” clause. By putting a scalar in a “private” clause you are creating a global array where each vector or worker has it’s own copy. This can hurt performance. Automatic privatization will allow the variable to be declared locally in the kernel and increase it’s chance of being stored in a local register, thus helping performance.

I would like you to try removing all of the scalars in your “private” clauses and to see if it helps.

For the section 2 issue, “create” vs “copyin”, does “Dmalloc” initialize the arrays, like to zero? If not, then I agree, using create or copyin shouldn’t matter. But if it does, then you have an uninitialized memory read (UMR) someplace.

Hope this helps,
Mat

Okay. I will try removing private and let you know the result.

However, as you can see, there are many scalars that don’t need to be allocated for each vector unit. If they use up the register, isn’t the number of threads I can launch limited?

And Dmalloc doesn’t initialize values.

What I suspect is the optimization process that you called dead line elimination. Compiler is making wrong decisions, that is, if I remove (A), compiler may think that GPUWorkerDat = ireg line is not needed and thus removing this line. Thereby ifsr becomes uninitialized, and cause illegal address error at reading Src(ig, ifsr). How do you think?

FYI, when I tested this code with OpenACC directives disabled, the result was fine.


#include <defines.h>
SUBROUTINE RayTrace_GPU(RayInfo, CoreInfo, phis, PhiAngIn, xst, src, jout, iz, mygb, myge, ljout)
USE PARAM
USE TYPEDEF, ONLY : RayInfo_Type, Coreinfo_type
USE MOC_MOD, ONLY : nMaxRaySeg,     nMaxCellRay,    nMaxAsyRay,     nMaxCoreRay,    &
                    EXPAPolar,      EXPBPolar,      wtangP0
USE PE_MOD,  ONLY : PE, GPUControl
IMPLICIT NONE
TYPE(RayInfo_Type) :: RayInfo
TYPE(CoreInfo_Type) :: CoreInfo
!$ACC DECLARE PRESENT(RayInfo, CoreInfo, GPUControl, EXPAPolar, EXPBPolar, wtangP0)
REAL(8), POINTER :: phis(:, :), PhiAngIn(:, :, :), xst(:, :), src(:, :), jout(:, :, :, :)
INTEGER :: iz, mygb, myge
LOGICAL :: ljout

INTEGER :: iGang, iWorker, iRay
INTEGER :: i, j, k, l, m, jbeg, jend, jinc, irw, irw1, irv, ig

REAL(8) :: wttemp, wt(RayInfo%nPolarAngle), tau
REAL(8) :: phiobd(Rayinfo%nPolarAngle, mygb : myge), phia(mygb : myge), phid, phiocel1, phiocel2
INTEGER :: nPolarAngle, nAziAngle, nPhiAngSv
INTEGER :: iazi, ipol, PhiAnginSvIdx, PhiAngOutSvIdx
INTEGER :: nRotRay, nCoreRay, nAsyRay, nPinRay, nRaySeg
INTEGER :: irotray, icoreray, iasyray, iceray, irayseg
INTEGER :: ipin, icelg, icelw, icelv, iasy, ireg, isurf1, isurf2, irot, idir, ifsr
INTEGER :: irsegidx, icellrayidx, FsrIdxSt

INTEGER :: mp(2) = (/ 2, 1 /)

! Tracking Data Storages
INTEGER :: nTotRaySeg(nMaxCoreRay), nTotCellRay(nMaxCoreRay)
INTEGER :: CellRayIdxSt(nMaxCellRay, nMaxCoreRay, 2)
INTEGER :: PinIdx(nMaxCellRay, nMaxCoreRay)
INTEGER :: SurfIdx(nMaxCellRay, nMaxCoreRay, 2)
INTEGER :: ExpAppIdx(mygb : myge, nMaxRaySeg, nMaxCoreRay)
INTEGER :: FsrIdx(nMaxRaySeg, nMaxCoreRay)
REAL(8) :: ExpApp(RayInfo%nPolarAngle, mygb : myge, nMaxRaySeg, nMaxCoreRay)
REAL(8) :: OptLenList(mygb : myge, nMaxRaySeg, nMaxCoreRay)
REAL(8) :: PhiAngOut(RayInfo%nPolarAngle, mygb : myge, nMaxRaySeg + 2)
    
nAziAngle = RayInfo%nAziAngle
nPolarAngle = RayInfo%nPolarAngle
nPhiAngSv = RayInfo%nPhiAngSv

!$ACC ENTER DATA COPYIN(xst(mygb : myge, :), src(mygb : myge, :), PhiAngIn(:, mygb : myge, :))
!$ACC ENTER DATA COPYIN(phis(mygb : myge, :), Jout(mygb : myge, :, :, :))

!$ACC DATA PRESENT(phis(mygb : myge, :), Jout(mygb : myge, :, :, :))
!$ACC KERNELS
  phis(mygb : myge, :) = 0._8
!$ACC END KERNELS
IF (ljout) THEN
  !$ACC KERNELS
    jout(mygb : myge, :, :, :) = 0._8
  !$ACC END KERNELS
ENDIF
!$ACC END DATA

!$ACC DATA PRESENT(xst(mygb : myge, :), src(mygb : myge, :), PhiAngIn(:, mygb : myge, :),                                    &
!$ACC              phis(mygb : myge, :), Jout(mygb : myge, :, :, :))
!$ACC PARALLEL &
!$ACC NUM_GANGS(GPUControl(1)%nGang) NUM_WORKERS(GPUControl(1)%nWorker) VECTOR_LENGTH(GPUControl(1)%nVector)
!$ACC LOOP INDEPENDENT GANG PRIVATE(iGang)
DO iGang = 1, GPUControl(1)%nGang
  !$ACC LOOP INDEPENDENT WORKER PRIVATE(wt, phiobd, phia, FsrIdxSt, nTotRaySeg, nTotCellRay,                                 &
  !$ACC                                 CellRayIdxSt, PinIdx, SurfIdx, ExpAppIdx, FsrIdx, ExpApp, OptLenList, PhiAngOut)
  DO iWorker = 1, GPUControl(1)%nWorker     
    !$ACC CACHE(wt, phia)
    !$ACC LOOP SEQ
    DO iRay = 1, GPUControl(1)%OpenACCRayList(0, iWorker, iGang)
      iRotRay = GPUControl(1)%OpenACCRayList(iRay, iWorker, iGang)

      !!!!!!!!!!!!!!!! Inlined Tracking Subroutine !!!!!!!!!!!!!!!!
      
      nCoreRay = RayInfo%RotRay(iRotRay)%nRay
      PRINT *, '1st', iRotRay, nCoreRay
      !$ACC LOOP SEQ
      DO j = 1, nCoreRay         
        irsegidx = 0;   icellrayidx = 0
        iCoreRay = RayInfo%RotRay(iRotRay)%RayIdx(j)
        nAsyRay = RayInfo%CoreRay(iCoreRay)%nRay
        PRINT *, '2nd', iCoreRay, nAsyRay
        !$ACC LOOP SEQ
        DO k = 1, nAsyRay 
          iasyray = RayInfo%CoreRay(iCoreRay)%AsyRayIdx(k)
          iasy = RayInfo%CoreRay(iCoreRay)%AsyIdx(k)
          IF(iasy .EQ. 0)  CYCLE   !Skip Dummy Assembly
          nPinRay = RayInfo%AsyRay(iAsyRay)%nCellRay
          PRINT *, '3rd', iasyray, iasy, nPinRay
          !$ACC LOOP SEQ
          DO l = 1, nPinRay
            ipin = RayInfo%AsyRay(iAsyRay)%PinIdx(l)
            iceray = RayInfo%AsyRay(iAsyRay)%PinRayIdx(l)
            ipin = CoreInfo%Asy(iAsy)%GlobalPinIdx(ipin)
            icelw = CoreInfo%Pin(ipin)%Cell(iz)
            FsrIdxSt = CoreInfo%Pin(ipin)%FsrIdxSt
            icellrayidx = icellrayidx + 1
            PRINT *, '4th', ipin, iceray, icelw, FsrIdxSt, icellrayidx
            PinIdx(icellrayidx, j) = ipin
            CellRayIdxSt(icellrayidx, j, 2) = irsegidx + 1
            nRaySeg = CoreInfo%CellInfo(icelw)%CellRay(iceray)%nSeg
            !$ACC LOOP INDEPENDENT VECTOR
            DO iRaySeg = 1, nRaySeg
              irv = irsegidx + iRaySeg
              ireg = FsrIdxSt + CoreInfo%CellInfo(icelw)%CellRay(iceray)%LocalFsrIdx(iRaySeg) - 1
              FsrIdx(irv, j) = ireg
              !$ACC LOOP SEQ
              DO ig = mygb, myge
                tau = - CoreInfo%CellInfo(icelw)%CellRay(iceray)%LenSeg(iRaySeg) * xst(ig, ireg)
                OptLenList(ig, irv, j) = tau
                ExpAppIdx(ig, irv, j) = min(0, max(INT(tau), -40000))
              ENDDO
            ENDDO
            irsegidx = irsegidx + nRaySeg
            CellRayIdxSt(icellrayidx, j, 1) = irsegidx
            SurfIdx(icellRayIdx, j, 1) = RayInfo%AsyRay(iAsyRay)%PinRaySurf(2, l)
            SurfIdx(icellRayIdx, j, 2) = RayInfo%AsyRay(iAsyRay)%PinRaySurf(1, l)
          ENDDO
        ENDDO
        nTotRaySeg(j) = irsegidx
        nTotCellRay(j) = icellRayIdx
        !$ACC LOOP INDEPENDENT COLLAPSE(3) VECTOR
        DO iRaySeg = 1, nTotRaySeg(j)
          DO ig = mygb, myge
            DO ipol = 1, nPolarAngle
              ExpApp(ipol, ig, iRaySeg, j) &
              = EXPAPolar(ipol, ExpAppIdx(ig, iRaySeg, j)) * OptLenList(ig, iRaySeg, j) + EXPBPolar(ipol, ExpAppIdx(ig, iRaySeg, j))
            ENDDO
          ENDDO
        ENDDO
      ENDDO

      !$ACC LOOP SEQ
      DO irot = 1, 2
        PhiAnginSvIdx = RayInfo%PhiAngInSvIdx(iRotRay, irot)
        PhiAngOutSvIdx = RayInfo%PhiangOutSvIdx(iRotRay, irot)
        phiobd(:, :) = PhiAngIn(:, :, PhiAnginSvIdx)
        jinc = 1; jbeg = 1; jend = nCoreRay
        IF(irot .EQ. 2) THEN
          jinc = -1; jbeg = nCoreRay; jend = 1
        ENDIF
        !$ACC LOOP SEQ
        DO j = jbeg, jend, jinc
          idir = RayInfo%RotRay(iRotRay)%DIR(j);
          iazi = RayInfo%CoreRay(RayInfo%RotRay(iRotRay)%RayIdx(j))%iang
          wt(:) = wtangP0(:, iazi)      
          IF(irot .EQ. 2) idir = mp(idir)
          nRaySeg = nTotRaySeg(j)   
          IF(idir .EQ. 1) THEN
            PhiAngOut(:, :, 1) = phiobd(:, :)
            !$ACC LOOP SEQ
            DO irw = 1, nRaySeg
              ifsr = FsrIdx(irw, j)
              phia(:) = 0._8
              !$ACC LOOP INDEPENDENT COLLAPSE(2) VECTOR
              DO ig = mygb, myge
                DO ipol = 1, nPolarAngle
                  phid = (PhiAngOut(ipol, ig, irw) - src(ig, ifsr)) * ExpApp(ipol, ig, irw, j)
                  PhiAngOut(ipol, ig, irw + 1) = PhiAngOut(ipol, ig, irw) - phid
                  !$ACC ATOMIC UPDATE
                  ! phis(ig, ifsr) = phis(ig, ifsr) + wt(ipol) * phid
                  phia(ig) = phia(ig) + wt(ipol) * phid
                  !$ACC END ATOMIC
                ENDDO
              ENDDO
              !$ACC LOOP INDEPENDENT VECTOR
              DO ig = mygb, myge
                !$ACC ATOMIC UPDATE
                phis(ig, ifsr) = phis(ig, ifsr) + phia(ig)
                !$ACC END ATOMIC
              ENDDO
            ENDDO
            phiobd(:, :) = PhiAngOut(:, :, nRaySeg + 1)
            IF(ljout) THEN
              !$ACC LOOP INDEPENDENT COLLAPSE(2) VECTOR
              DO irv = 1, nTotCellRay(j)
                DO ig = mygb, myge
                  icelv = PinIdx(irv, j); isurf1 = SurfIdx(irv, j, 1); isurf2 = SurfIdx(irv, j, 2)
                  phiocel1 = 0._8; phiocel2 = 0._8
                  !$ACC LOOP SEQ
                  DO ipol = 1, nPolarAngle
                    phiocel1 = phiocel1 + wt(ipol) * PhiAngOut(ipol, ig, CellRayIdxSt(irv, j, 1) + 1)
                    phiocel2 = phiocel2 + wt(ipol) * PhiAngOut(ipol, ig, CellRayIdxSt(irv, j, 2))
                  ENDDO
                  !$ACC ATOMIC UPDATE
                  Jout(ig, 2, isurf1, icelv) = Jout(ig, 2, isurf1, icelv) + phiocel1
                  Jout(ig, 1, isurf2, icelv) = Jout(ig, 1, isurf2, icelv) + phiocel2
                  !$ACC END ATOMIC
                ENDDO
              ENDDO
            ENDIF
          ELSE
            PhiAngOut(:, :, nRaySeg + 2) = phiobd(:, :)
            irw = nRaySeg + 1
            !$ACC LOOP SEQ
            DO irw1 = 1, nRaySeg
              irw = irw - 1
              ifsr = FsrIdx(irw, j)
              phia(:) = 0._8
              !$ACC LOOP INDEPENDENT COLLAPSE(2) VECTOR
              DO ig = mygb, myge
                DO ipol = 1, nPolarAngle
                  phid = (PhiAngOut(ipol, ig, irw + 2) - src(ig, ifsr)) * ExpApp(ipol, ig, irw, j)
                  PhiAngOut(ipol, ig, irw + 1) = PhiAngOut(ipol, ig, irw + 2) - phid
                  !$ACC ATOMIC UPDATE
                  ! phis(ig, ifsr) = phis(ig, ifsr) + wt(ipol) * phid
                  phia(ig) = phia(ig) + wt(ipol) * phid
                  !$ACC END ATOMIC
                ENDDO
              ENDDO
              !$ACC LOOP INDEPENDENT VECTOR
              DO ig = mygb, myge
                !$ACC ATOMIC UPDATE
                phis(ig, ifsr) = phis(ig, ifsr) + phia(ig)
                !$ACC END ATOMIC
              ENDDO
            ENDDO
            phiobd(:, :) = PhiAngOut(:, :, 2)
            IF(lJout) THEN
              !$ACC LOOP INDEPENDENT COLLAPSE(2) VECTOR
              DO irv = 1, nTotCellRay(j)
                DO ig = mygb, myge
                  icelv = PinIdx(irv, j); isurf1 = SurfIdx(irv, j, 1); isurf2 = SurfIdx(irv, j, 2)
                  phiocel1 = 0._8; phiocel2 = 0._8
                  !$ACC LOOP SEQ
                  DO ipol = 1, nPolarAngle
                    phiocel1 = phiocel1 + wt(ipol) * PhiAngOut(ipol, ig, CellRayIdxSt(irv, j, 1) + 2)
                    phiocel2 = phiocel2 + wt(ipol) * PhiAngOut(ipol, ig, CellRayIdxSt(irv, j, 2) + 1)
                  ENDDO
                  !$ACC ATOMIC UPDATE
                  Jout(ig, 1, isurf1, icelv) = Jout(ig, 1, isurf1, icelv) + phiocel1
                  Jout(ig, 2, isurf2, icelv) = Jout(ig, 2, isurf2, icelv) + phiocel2
                  !$ACC END ATOMIC
                ENDDO
              ENDDO
            ENDIF
          ENDIF
        ENDDO
        PhiAngIn(:, :, PhiAngOutSvIdx) = phiobd(:, :)
      ENDDO

      !!!!!!!!!!!!!!!! Inlined Tracking Subroutine !!!!!!!!!!!!!!!!

    ENDDO  
  ENDDO
ENDDO

!$ACC LOOP INDEPENDENT GANG PRIVATE(j, FsrIdxSt, icelg)
DO j = 1, CoreInfo%nxy
  FsrIdxSt = CoreInfo%Pin(j)%FsrIdxSt; icelg = CoreInfo%Pin(j)%Cell(iz)
  !$ACC LOOP INDEPENDENT WORKER PRIVATE(i, ireg)
  DO i = 1, CoreInfo%CellInfo(icelg)%nFsr
    ireg = FsrIdxSt + i - 1
    !$ACC LOOP INDEPENDENT VECTOR
    DO ig = mygb, myge
      phis(ig, ireg) = phis(ig, ireg) / (xst(ig, ireg) * CoreInfo%CellInfo(icelg)%vol(i)) + src(ig, ireg)
    ENDDO
  ENDDO
ENDDO
!$ACC END PARALLEL
!$ACC END DATA

!$ACC EXIT DATA DELETE(xst(mygb : myge, :), src(mygb : myge, :))
!$ACC EXIT DATA COPYOUT(phis(mygb : myge, :), Jout(mygb : myge, :, :, :), PhiAngIn(:, mygb : myge, :))

PAUSE

END SUBROUTINE

I eliminated the derived type variable. I thought this would be safer. Also, I tidied up the code a little bit.

The problem is not fixed yet, though.

Removing private declarations on scalars didn’t make any change.

While I was examining some features, I found an odd thing.

The value of icellrayidx is not properly incremented in “Release mode” only.

icellrayidx has some values like 272043665.

Why?

And do you have any other suggestions about strange behavior of FsrIdx?

I’m getting pessimistic about when I will be able to finish this code. I hope there is a solution.

icellrayidx has some values like 272043665. Why?

This looks like a compiler bug to me where it’s not computing the counter correctly when inside nested “seq” loops. It appears ok with a single level “seq” loop. I’ve added a problem report, TPR#22358, and sent it engineering for investigation.

As a work around, can you compute icellrayidx instead if incrementing it?

  icellrayidx = ( ( k-1 ) *nPinRay ) +l



What I suspect is the optimization process that you called dead line elimination. Compiler is making wrong decisions, that is, if I remove (A), compiler may think that GPUWorkerDat = ireg line is not needed and thus removing this line. Thereby ifsr becomes uninitialized, and cause illegal address error at reading Src(ig, ifsr). How do you think?

Doubtful. Dead code elimination wouldn’t apply here since that only gets rid of code that’s not actually used. On the host side, the compiler may promote a variable to a register but the code itself wouldn’t be removed.

The above problem you identified with the counter variables inside of multi-level nested “seq” loops may be the core issue. Looks like you have a few other spots such as “irsegidx”, that may be effected. Can you try changing these to be computed as well?

  • Mat

Other variables are fine.

Strange behavior of Fsridx doesn’t seem to be related with it.

Hi CNJ,

Are you able to share the code with us? I think we’ll need to debug the issue here in order to get to the core problem. Unfortunately I’ll be away for the next two weeks so wont be able to dig into it myself. Though, if you can send a note to PGI Customer Service (trs@pgroup.com) hopefully we can get another engineer to find the issue.

  • Mat

Sorry. I can’t share all the entire project.

It contains more than 200 source files and redistribution is limited by government law.

Anyhow, I could resolve all the problems and now my code is working.

From now on, I have to work on optimization.

During debugging, I could find a very serious bug.

!$ACC LOOP INDEPENDENT GANG
DO j = 1, CoreInfo%nxy
FsrIdxSt = CoreInfo%Pin(j)%FsrIdxSt; icelg = CoreInfo%Pin(j)%Cell(iz)
!$ACC LOOP INDEPENDENT COLLAPSE(2) WORKER VECTOR
DO i = 1, CoreInfo%CellInfo(icelg)%nFsr
DO ig = mygb, myge
ireg = FsrIdxSt + i - 1
phis(ig, ireg) = phis(ig, ireg) / (xst(ig, ireg) * CoreInfo%CellInfo(icelg)%vol(i)) + src(ig, ireg)
ENDDO
ENDDO
ENDDO

I found that this loop interrupts preceding loop. That is, this loop is executed before preceding loop finishes. This should never happen since two loops are perfectly isolated.

So I had to separate parallel region to prevent interrupting.

#include <defines.h>
SUBROUTINE RayTrace_GPU(RayInfo, CoreInfo, phis, PhiAngIn, xst, src, jout, iz, mygb, myge, ljout)
USE PARAM
USE TYPEDEF, ONLY : RayInfo_Type, Coreinfo_type
USE MOC_MOD, ONLY : nMaxRaySeg,     nMaxCellRay,    nMaxAsyRay,     nMaxCoreRay,    &
                    EXPAPolar,      EXPBPolar,      wtangP0
USE PE_MOD,  ONLY : PE, GPUControl
IMPLICIT NONE
TYPE(RayInfo_Type) :: RayInfo
TYPE(CoreInfo_Type) :: CoreInfo
!$ACC DECLARE PRESENT(RayInfo, CoreInfo, GPUControl, EXPAPolar, EXPBPolar, wtangP0)
REAL(8), POINTER :: phis(:, :), PhiAngIn(:, :, :), xst(:, :), src(:, :), jout(:, :, :, :)
INTEGER :: iz, mygb, myge
LOGICAL :: ljout

INTEGER :: iRay
INTEGER :: i, j, k, l, m, jbeg, jend, jinc, irw, irw1, irv, ig

REAL(8) :: wttemp, wt(RayInfo%nPolarAngle), tau
REAL(8) :: phiobd(Rayinfo%nPolarAngle, mygb : myge), phia(mygb : myge, 2), phid, phiocel1, phiocel2
INTEGER :: nPolarAngle, nAziAngle, nPhiAngSv
INTEGER :: iazi, ipol, PhiAnginSvIdx, PhiAngOutSvIdx
INTEGER :: nCoreRay, nAsyRay, nPinRay, nRaySeg
INTEGER :: irotray, icoreray, iasyray, iceray, irayseg
INTEGER :: ipin, icelg, icelw, icelv, iasy, ireg, isurf1, isurf2, irot, idir, ifsr
INTEGER :: irsegidx, icellrayidx, FsrIdxSt

INTEGER :: mp(2) = (/ 2, 1 /)

! Tracking Data Storages
INTEGER :: nTotRaySeg(nMaxCoreRay), nTotCellRay(nMaxCoreRay)
INTEGER :: CellRayIdxSt(nMaxCellRay, nMaxCoreRay, 2)
INTEGER :: PinIdx(nMaxCellRay, nMaxCoreRay)
INTEGER :: SurfIdx(nMaxCellRay, nMaxCoreRay, 2)
INTEGER :: ExpAppIdx(mygb : myge, nMaxRaySeg, nMaxCoreRay)
INTEGER :: FsrIdx(nMaxRaySeg, nMaxCoreRay)
REAL(8) :: ExpApp(RayInfo%nPolarAngle, mygb : myge, nMaxRaySeg, nMaxCoreRay)
REAL(8) :: OptLenList(mygb : myge, nMaxRaySeg, nMaxCoreRay)
REAL(8) :: PhiAngOut(RayInfo%nPolarAngle, mygb : myge, nMaxRaySeg + 2)
    
nAziAngle = RayInfo%nAziAngle
nPolarAngle = RayInfo%nPolarAngle
nPhiAngSv = RayInfo%nPhiAngSv

!$ACC ENTER DATA COPYIN(xst(mygb : myge, :), src(mygb : myge, :), PhiAngIn(:, mygb : myge, :))
!$ACC ENTER DATA CREATE(phis(mygb : myge, :), Jout(mygb : myge, :, :, :))

!$ACC DATA PRESENT(phis(mygb : myge, :), Jout(mygb : myge, :, :, :))
!$ACC KERNELS
  phis(mygb : myge, :) = 0._8
!$ACC END KERNELS
IF (ljout) THEN
  !$ACC KERNELS
    jout(mygb : myge, :, :, :) = 0._8
  !$ACC END KERNELS
ENDIF
!$ACC END DATA
!$ACC DATA PRESENT(xst(mygb : myge, :), src(mygb : myge, :), PhiAngIn(:, mygb : myge, :),                        &
!$ACC              phis(mygb : myge, :), Jout(mygb : myge, :, :, :))
!$ACC PARALLEL NUM_GANGS(GPUControl(1)%nGang) NUM_WORKERS(2) VECTOR_LENGTH(GPUControl(1)%nVector)
!$ACC LOOP INDEPENDENT GANG PRIVATE(phia, irsegidx, icellrayidx, nTotRaySeg, nTotCellRay, CellRayIdxSt,          &
!$ACC                               PinIdx, SurfIdx, ExpAppIdx, FsrIdx, ExpApp, OptLenList)
DO iRay = 1, GPUControl(1)%nRay
  !$ACC CACHE(phia)
  iRotRay = GPUControl(1)%RayList(iRay)

  !!!!!!!!!!!!!!!! Inlined Tracking Subroutine !!!!!!!!!!!!!!!!
      
  nCoreRay = RayInfo%RotRay(iRotRay)%nRay
  !$ACC LOOP SEQ
  DO j = 1, nCoreRay         
    irsegidx = 0; icellrayidx = 0
    iCoreRay = RayInfo%RotRay(iRotRay)%RayIdx(j)
    nAsyRay = RayInfo%CoreRay(iCoreRay)%nRay
    !$ACC LOOP SEQ
    DO k = 1, nAsyRay 
      iasyray = RayInfo%CoreRay(iCoreRay)%AsyRayIdx(k)
      iasy = RayInfo%CoreRay(iCoreRay)%AsyIdx(k)
      IF(iasy .EQ. 0) CYCLE
      nPinRay = RayInfo%AsyRay(iAsyRay)%nCellRay
      !$ACC LOOP SEQ
      DO l = 1, nPinRay
        ipin = RayInfo%AsyRay(iAsyRay)%PinIdx(l)
        iceray = RayInfo%AsyRay(iAsyRay)%PinRayIdx(l)
        ipin = CoreInfo%Asy(iAsy)%GlobalPinIdx(ipin)
        icelw = CoreInfo%Pin(ipin)%Cell(iz)
        FsrIdxSt = CoreInfo%Pin(ipin)%FsrIdxSt
        irw = icellrayidx + l
        PinIdx(irw, j) = ipin
        CellRayIdxSt(irw, j, 2) = irsegidx + 1
        nRaySeg = CoreInfo%CellInfo(icelw)%CellRay(iceray)%nSeg
        !$ACC LOOP INDEPENDENT WORKER VECTOR
        DO iRaySeg = 1, nRaySeg
          irv = irsegidx + iRaySeg
          ireg = FsrIdxSt + CoreInfo%CellInfo(icelw)%CellRay(iceray)%LocalFsrIdx(iRaySeg) - 1
          FsrIdx(irv, j) = ireg
          !$ACC LOOP SEQ
          DO ig = mygb, myge
            tau = - CoreInfo%CellInfo(icelw)%CellRay(iceray)%LenSeg(iRaySeg) * xst(ig, ireg)
            OptLenList(ig, irv, j) = tau
            ExpAppIdx(ig, irv, j) = min(0, max(INT(tau), -40000))
          ENDDO
        ENDDO
        irsegidx = irsegidx + nRaySeg
        CellRayIdxSt(irw, j, 1) = irsegidx
        SurfIdx(irw, j, 1) = RayInfo%AsyRay(iAsyRay)%PinRaySurf(2, l)
        SurfIdx(irw, j, 2) = RayInfo%AsyRay(iAsyRay)%PinRaySurf(1, l)
      ENDDO
      icellrayidx = icellrayidx + nPinRay
    ENDDO
    nTotRaySeg(j) = irsegidx
    nTotCellRay(j) = icellRayIdx
    !$ACC LOOP INDEPENDENT WORKER
    DO iRaySeg = 1, nTotRaySeg(j)
      !$ACC LOOP INDEPENDENT COLLAPSE(2) VECTOR
      DO ig = mygb, myge
        DO ipol = 1, nPolarAngle
          ExpApp(ipol, ig, iRaySeg, j) &
          = EXPAPolar(ipol, ExpAppIdx(ig, iRaySeg, j)) * OptLenList(ig, iRaySeg, j) + EXPBPolar(ipol, ExpAppIdx(ig, iRaySeg, j))
        ENDDO
      ENDDO
    ENDDO
  ENDDO

  !$ACC LOOP INDEPENDENT WORKER PRIVATE(wt, phiobd, PhiAngOut) 
  DO irot = 1, 2
    !$ACC CACHE(wt)
    PhiAnginSvIdx = RayInfo%PhiAngInSvIdx(iRotRay, irot)
    PhiAngOutSvIdx = RayInfo%PhiangOutSvIdx(iRotRay, irot)
    phiobd(:, :) = PhiAngIn(:, :, PhiAnginSvIdx)
    jinc = 1; jbeg = 1; jend = nCoreRay
    IF(irot .EQ. 2) THEN
      jinc = -1; jbeg = nCoreRay; jend = 1
    ENDIF
    !$ACC LOOP SEQ
    DO j = jbeg, jend, jinc
      idir = RayInfo%RotRay(iRotRay)%DIR(j);
      iazi = RayInfo%CoreRay(RayInfo%RotRay(iRotRay)%RayIdx(j))%iang
      wt(1 : nPolarAngle) = wtangP0(1 : nPolarAngle, iazi)
      IF(irot .EQ. 2) idir = mp(idir)
      nRaySeg = nTotRaySeg(j)   
      IF(idir .EQ. 1) THEN
        PhiAngOut(:, :, 1) = phiobd(:, :)
        !$ACC LOOP SEQ
        DO irw = 1, nRaySeg
          ifsr = FsrIdx(irw, j)
          phia(:, irot) = 0._8
          !$ACC LOOP INDEPENDENT COLLAPSE(2) VECTOR
          DO ig = mygb, myge
            DO ipol = 1, nPolarAngle
              phid = (PhiAngOut(ipol, ig, irw) - src(ig, ifsr)) * ExpApp(ipol, ig, irw, j)
              PhiAngOut(ipol, ig, irw + 1) = PhiAngOut(ipol, ig, irw) - phid
              !$ACC ATOMIC UPDATE
              phia(ig, irot) = phia(ig, irot) + wt(ipol) * phid
              !$ACC END ATOMIC
            ENDDO
          ENDDO
          !$ACC LOOP INDEPENDENT VECTOR
          DO ig = mygb, myge
            !$ACC ATOMIC UPDATE
            phis(ig, ifsr) = phis(ig, ifsr) + phia(ig, irot)
            !$ACC END ATOMIC
          ENDDO
        ENDDO
        phiobd(:, :) = PhiAngOut(:, :, nRaySeg + 1)
        IF(ljout) THEN
          !$ACC LOOP INDEPENDENT COLLAPSE(2) VECTOR
          DO irv = 1, nTotCellRay(j)
            DO ig = mygb, myge
              icelv = PinIdx(irv, j); isurf1 = SurfIdx(irv, j, 1); isurf2 = SurfIdx(irv, j, 2)
              phiocel1 = 0._8; phiocel2 = 0._8
              !$ACC LOOP SEQ
              DO ipol = 1, nPolarAngle
                phiocel1 = phiocel1 + wt(ipol) * PhiAngOut(ipol, ig, CellRayIdxSt(irv, j, 1) + 1)
                phiocel2 = phiocel2 + wt(ipol) * PhiAngOut(ipol, ig, CellRayIdxSt(irv, j, 2))
              ENDDO
              !$ACC ATOMIC UPDATE
              Jout(ig, 2, isurf1, icelv) = Jout(ig, 2, isurf1, icelv) + phiocel1
              !$ACC END ATOMIC
              !$ACC ATOMIC UPDATE
              Jout(ig, 1, isurf2, icelv) = Jout(ig, 1, isurf2, icelv) + phiocel2
              !$ACC END ATOMIC
            ENDDO
          ENDDO
        ENDIF
      ELSE
        PhiAngOut(:, :, nRaySeg + 2) = phiobd(:, :)
        irw = nRaySeg + 1
        !$ACC LOOP SEQ
        DO irw1 = 1, nRaySeg
          irw = irw - 1
          ifsr = FsrIdx(irw, j)
          phia(:, irot) = 0._8
          !$ACC LOOP INDEPENDENT COLLAPSE(2) VECTOR
          DO ig = mygb, myge
            DO ipol = 1, nPolarAngle
              phid = (PhiAngOut(ipol, ig, irw + 2) - src(ig, ifsr)) * ExpApp(ipol, ig, irw, j)
              PhiAngOut(ipol, ig, irw + 1) = PhiAngOut(ipol, ig, irw + 2) - phid
              !$ACC ATOMIC UPDATE
              phia(ig, irot) = phia(ig, irot) + wt(ipol) * phid
              !$ACC END ATOMIC
            ENDDO
          ENDDO
          !$ACC LOOP INDEPENDENT VECTOR
          DO ig = mygb, myge
            !$ACC ATOMIC UPDATE
            phis(ig, ifsr) = phis(ig, ifsr) + phia(ig, irot)
            !$ACC END ATOMIC
          ENDDO
        ENDDO
        phiobd(:, :) = PhiAngOut(:, :, 2)
        IF(lJout) THEN
          !$ACC LOOP INDEPENDENT COLLAPSE(2) VECTOR
          DO irv = 1, nTotCellRay(j)
            DO ig = mygb, myge
              icelv = PinIdx(irv, j); isurf1 = SurfIdx(irv, j, 1); isurf2 = SurfIdx(irv, j, 2)
              phiocel1 = 0._8; phiocel2 = 0._8
              !$ACC LOOP SEQ
              DO ipol = 1, nPolarAngle
                phiocel1 = phiocel1 + wt(ipol) * PhiAngOut(ipol, ig, CellRayIdxSt(irv, j, 1) + 2)
                phiocel2 = phiocel2 + wt(ipol) * PhiAngOut(ipol, ig, CellRayIdxSt(irv, j, 2) + 1)
              ENDDO
              !$ACC ATOMIC UPDATE
              Jout(ig, 1, isurf1, icelv) = Jout(ig, 1, isurf1, icelv) + phiocel1
              !$ACC END ATOMIC
              !$ACC ATOMIC UPDATE
              Jout(ig, 2, isurf2, icelv) = Jout(ig, 2, isurf2, icelv) + phiocel2
              !$ACC END ATOMIC
            ENDDO
          ENDDO
        ENDIF
      ENDIF
    ENDDO
    PhiAngIn(:, :, PhiAngOutSvIdx) = phiobd(:, :)
  ENDDO

  !!!!!!!!!!!!!!!! Inlined Tracking Subroutine !!!!!!!!!!!!!!!!

ENDDO  
!$ACC END PARALLEL
!$ACC END DATA

!$ACC DATA PRESENT(xst(mygb : myge, :), src(mygb : myge, :), phis(mygb : myge, :))
!$ACC PARALLEL
!$ACC LOOP INDEPENDENT GANG
DO j = 1, CoreInfo%nxy
  FsrIdxSt = CoreInfo%Pin(j)%FsrIdxSt; icelg = CoreInfo%Pin(j)%Cell(iz)
  !$ACC LOOP INDEPENDENT COLLAPSE(2) WORKER VECTOR
  DO i = 1, CoreInfo%CellInfo(icelg)%nFsr
    DO ig = mygb, myge
      ireg = FsrIdxSt + i - 1
      phis(ig, ireg) = phis(ig, ireg) / (xst(ig, ireg) * CoreInfo%CellInfo(icelg)%vol(i)) + src(ig, ireg)
    ENDDO
  ENDDO
ENDDO
!$ACC END PARALLEL
!$ACC END DATA

!$ACC EXIT DATA DELETE(xst(mygb : myge, :), src(mygb : myge, :))
!$ACC EXIT DATA COPYOUT(phis(mygb : myge, :), Jout(mygb : myge, :, :, :), PhiAngIn(:, mygb : myge, :))

END SUBROUTINE