Why does !$ACC PARALLEL give an access violation error?

SUBROUTINE RayTrace_OACC(RayInfo, CoreInfo, phis, PhiAngIn, xst, src, jout, iz, mygb, myge, ljout)
USE PARAM
USE TYPEDEF, ONLY : RayInfo_Type, Coreinfo_type, Pin_Type, Cell_Type
USE MOC_MOD, ONLY : nMaxRaySeg,     nMaxCellRay,    nMaxAsyRay,     nMaxCoreRay,    &
                    EXPAPolar,      EXPBPolar,      GPUWorkerDat
USE PE_MOD,  ONLY : PE
IMPLICIT NONE

TYPE(RayInfo_Type) :: RayInfo
TYPE(CoreInfo_Type) :: CoreInfo
REAL(8), POINTER :: phis(:, :), PhiAngIn(:, :, :), xst(:, :), src(:, :), jout(:, :, :, :)
INTEGER :: iz, mygb, myge
LOGICAL :: ljout

!Pointing Variable

TYPE(Cell_Type), POINTER :: Cell(:)
TYPE(Pin_Type), POINTER :: Pin(:)

INTEGER :: iGang, iWorker, iRay
INTEGER :: i, j, k, l, m, jbeg, jend, jinc, ir, ir1, ig

REAL(8) :: wttemp, wtang(10, 100), wt(RayInfo%nPolarAngle), tau
REAL(8) :: phiobd(Rayinfo%nPolarAngle, mygb : myge), phia(mygb : myge), phid, phiocel
INTEGER :: nPolarAngle, nAziAngle, nPhiAngSv
INTEGER :: iazi, ipol, PhiAnginSvIdx, PhiAngOutSvIdx
INTEGER :: nRotRay, nCoreRay, nAsyRay, nPinRay, nRaySeg
INTEGER :: irotray, icoreray, iasyray, iceray, irayseg
INTEGER :: nFsr, nxy
INTEGER :: ipin, icel, iasy, ireg, isurf, irot, itype, idir
INTEGER :: irsegidx, icellrayidx, FsrIdxSt

INTEGER :: mp(2)
INTEGER :: nTotRaySeg(nMaxCoreRay), nTotCellRay(nMaxCoreRay)

nAziAngle = RayInfo%nAziAngle
nPolarAngle = RayInfo%nPolarAngle
nPhiAngSv = RayInfo%nPhiAngSv
      
DO iazi = 1, nAziAngle
  wttemp = RayInfo%AziAngle(iazi)%weight * RayInfo%AziAngle(iazi)%del
  DO ipol = 1, nPolarAngle
    wtang(ipol, iazi) = wttemp * RayInfo%PolarAngle(ipol)%weight * RayInfo%PolarAngle(ipol)%sinv
  ENDDO
ENDDO

!$ACC ENTER DATA COPYIN(xst(mygb : myge, :), src(mygb : myge, :), PhiAngIn(:, mygb : myge, :))
!$ACC ENTER DATA CREATE(phis(mygb : myge, :), Jout(mygb : myge, :, :, :))

!$ACC DATA PRESENT(PE, CoreInfo, RayInfo, GPUWorkerDat, EXPAPolar, EXPBPolar)
!$ACC HOST_DATA USE_DEVICE(phis, Jout)
phis(mygb : myge, :) = 0._8
IF (ljout) jout(mygb : myge, :, :, :) = 0._8
!$ACC END HOST_DATA

!$ACC PARALLEL NUM_GANGS(PE%nGang) NUM_WORKERS(PE%nWorker) VECTOR_LENGTH(32)

! Blah Blah

An access violation error occurs at the !$ACC PARALLEL line.

I tried replacing PE%nGang and PE%nWorker with literal numbers, but the error still occurs.

The parallel region is quite big; the worker-level loop is especially long, about 200 lines. So I first suspected that this might be the cause.

However, the error doesn’t vanish even after I remove all loops from the parallel region.

More interestingly, if I increase the system stack size, the following error occurs during the memory copy (before execution even reaches the parallel region):

call to cuCtxCreate returned error 304: other

Why does !$ACC PARALLEL produce access violation errors? And why does the behavior on the GPU change with the system stack size?

!$ACC HOST_DATA USE_DEVICE(phis, Jout) 
phis(mygb : myge, :) = 0._8 
IF (ljout) jout(mygb : myge, :, :, :) = 0._8 
!$ACC END HOST_DATA

This is probably where your error is coming from. The host data directive says to use the device pointer on the host and is mainly there for compatibility with native languages such as CUDA. So here you’re saying to use the device pointer for “phis” and “jout” but trying to access them on the host. Your program will get a segmentation violation since the device address isn’t valid on the host.

What are you trying to do here? If you just want to set these values on the device, then the assignments need to go inside a compute region. For implicit loops, such as those that occur with array syntax, you’ll want to use the “kernels” directive so the compiler is allowed to discover the loop parallelism. With the “parallel” directive, the user tells the compiler where the parallelism is, so it can only be used on explicit loops.

!$acc kernels
phis(mygb : myge, :) = 0._8
!$acc end kernels
IF (ljout) THEN
  !$acc kernels
  jout(mygb : myge, :, :, :) = 0._8
  !$acc end kernels
ENDIF
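
As a minimal, self-contained sketch of that distinction (the program name, the arrays `a` and `b`, and the sizes are made up for illustration and are not from the code above), array syntax goes inside “kernels”, while “parallel” is paired with explicit loops:

```fortran
program zero_on_device
  implicit none
  integer, parameter :: n = 8, m = 4   ! hypothetical sizes
  real(8) :: a(n, m), b(n, m)
  integer :: i, j

  a = 1._8
  b = 1._8

  !$acc data copy(a, b)

  ! Array syntax: let the compiler discover the implicit loops.
  !$acc kernels
  a(:, :) = 0._8
  !$acc end kernels

  ! Explicit loops: the user states where the parallelism is.
  !$acc parallel loop collapse(2)
  do j = 1, m
    do i = 1, n
      b(i, j) = 0._8
    end do
  end do

  !$acc end data

  ! Host-side check after the data region has copied results back.
  if (any(a /= 0._8) .or. any(b /= 0._8)) error stop "arrays not zeroed"
  print *, "ok"
end program zero_on_device
```

Built without OpenACC enabled, the directives are treated as comments and the same program runs on the host, which makes it easy to compare the two builds.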
  • Mat

I tried it, but I still get the same error at the !$ACC PARALLEL region.

I tested the following code:

SUBROUTINE RayTrace_GPU(RayInfo, CoreInfo, phis, PhiAngIn, xst, src, jout, iz, mygb, myge, ljout)
USE PARAM
USE TYPEDEF, ONLY : RayInfo_Type, Coreinfo_type
USE MOC_MOD, ONLY : nMaxRaySeg,     nMaxCellRay,    nMaxAsyRay,     nMaxCoreRay,    &
                    EXPAPolar,      EXPBPolar,      wtangP0,        GPUWorkerDat
USE PE_MOD,  ONLY : PE, GPUControl
IMPLICIT NONE
TYPE(RayInfo_Type) :: RayInfo
TYPE(CoreInfo_Type) :: CoreInfo
!$ACC DECLARE PRESENT(RayInfo, CoreInfo, GPUWorkerDat, GPUControl, EXPAPolar, EXPBPolar, wtangP0)
REAL(8), POINTER :: phis(:, :), PhiAngIn(:, :, :), xst(:, :), src(:, :), jout(:, :, :, :)
INTEGER :: iz, mygb, myge
LOGICAL :: ljout

!$ACC ENTER DATA COPYIN(xst(mygb : myge, :), src(mygb : myge, :), PhiAngIn(:, mygb : myge, :))
!$ACC ENTER DATA CREATE(phis(mygb : myge, :), Jout(mygb : myge, :, :, :))

!$ACC DATA PRESENT(xst(mygb : myge, :), src(mygb : myge, :), PhiAngIn(:, mygb : myge, :), &
!$ACC              phis(mygb : myge, :), Jout(mygb : myge, :, :, :))
!$ACC KERNELS
  phis(mygb : myge, :) = 0._8
!$ACC END KERNELS
IF (ljout) THEN
  !$ACC KERNELS
    jout(mygb : myge, :, :, :) = 0._8
  !$ACC END KERNELS
ENDIF

!$ACC PARALLEL &
!$ACC NUM_GANGS(GPUControl(1)%nGang) NUM_WORKERS(GPUControl(1)%nWorker) VECTOR_LENGTH(GPUControl(1)%nVector)
!$ACC LOOP INDEPENDENT GANG
DO j = 1, CoreInfo%nxy
  FsrIdxSt = CoreInfo%Pin(j)%FsrIdxSt; icel = CoreInfo%Pin(j)%Cell(iz);
  !$ACC LOOP INDEPENDENT COLLAPSE(2) WORKER VECTOR
  DO i = 1, CoreInfo%CellInfo(icel)%nFsr
    DO ig = mygb, myge
      ireg = FsrIdxSt + i - 1
      phis(ig, ireg) = phis(ig, ireg) / (xst(ig, ireg)) + src(ig, ireg) ! No Runtime Error Without This Line
    ENDDO
  ENDDO
ENDDO
!$ACC END PARALLEL
!$ACC END DATA

!$ACC EXIT DATA DELETE(xst(mygb : myge, :), src(mygb : myge, :))
!$ACC EXIT DATA COPYOUT(phis(mygb : myge, :), Jout(mygb : myge, :, :, :), PhiAngIn(:, mygb : myge, :))

What am I doing wrong with phis, xst or src? I tried specifying the sub-array ranges explicitly, but that doesn’t work either.

I suspect that xst, src, phis, Jout and PhiAngIn are not being copied in correctly.

This is simplified code, so if I can figure out what is wrong with this parallel region, I might get some ideas about the full version.

Dead-code elimination may remove unused code, so just because you remove the one line, it doesn’t necessarily mean that that’s where the segv comes from. It could be, but may not be.

In this case, “phis”, “xst”, and “src” all look ok, but if “ireg” is too big, you could be accessing off the end of the array. Have you confirmed that “ireg” stays within the bounds of the arrays?
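
One way to check this on the host before launching the kernel is sketched below. It is only an illustration: `fsr_idx_st` and `n_fsr` are hypothetical stand-ins for CoreInfo%Pin(j)%FsrIdxSt and CoreInfo%CellInfo(icel)%nFsr, and the sizes are invented.

```fortran
program check_ireg_bounds
  implicit none
  ! Hypothetical stand-ins for the derived-type members in the thread.
  integer, parameter :: nxy = 3, ng = 7
  integer :: fsr_idx_st(nxy), n_fsr(nxy)
  real(8), allocatable :: phis(:, :)
  integer :: j, ireg_min, ireg_max

  fsr_idx_st = [1, 5, 9]
  n_fsr      = [4, 4, 4]
  allocate(phis(ng, 12))

  ! ireg = FsrIdxSt + i - 1 with i = 1..nFsr, so track both extremes.
  ireg_min = huge(0)
  ireg_max = -huge(0)
  do j = 1, nxy
    ireg_min = min(ireg_min, fsr_idx_st(j))
    ireg_max = max(ireg_max, fsr_idx_st(j) + n_fsr(j) - 1)
  end do

  print *, "ireg range:", ireg_min, ireg_max
  print *, "phis dim 2:", lbound(phis, 2), ubound(phis, 2)
  if (ireg_min < lbound(phis, 2) .or. ireg_max > ubound(phis, 2)) then
    error stop "ireg would index outside phis"
  end if
end program check_ireg_bounds
```
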

FsrIdxSt = CoreInfo%Pin(j)%FsrIdxSt; icel = CoreInfo%Pin(j)%Cell(iz);

I see where you have put “CoreInfo” in a “present” clause but can you post how you’re building this structure on device? The Pin array may not be correct and thus causing the illegal memory access.

Can you post the compiler feedback messages (-Minfo=accel) for this section? This will tell us more about what the compiler is doing.

  • Mat

The derived type variables are copied in correctly.

I tried the following:

!$ACC LOOP INDEPENDENT GANG PRIVATE(j, FsrIdxSt, icel)
DO j = 1, CoreInfo%nxy
  FsrIdxSt = CoreInfo%Pin(j)%FsrIdxSt; icel = CoreInfo%Pin(j)%Cell(iz)
  !$ACC LOOP INDEPENDENT WORKER PRIVATE(i, ireg)
  DO i = 1, CoreInfo%CellInfo(icel)%nFsr
    ireg = FsrIdxSt + i - 1
    !$ACC LOOP INDEPENDENT VECTOR PRIVATE(ig)
    DO ig = mygb, myge
      IF(ig .EQ. 7) PRINT *, j, FsrIdxSt, icel, i, ireg, ig
!      phis(ig, ireg) = phis(ig, ireg) / (xst(ig, ireg)) + src(ig, ireg)
    ENDDO
  ENDDO
ENDDO

All loop iterators are printed out correctly.

This also shows that ireg doesn’t go outside the array bounds.

The following is the compiler output you requested.


src\RayTracing_GPU.f90

c:\program files\pgi\win32\15.10\bin\pgfortran.exe -Hx,123,8 -Hx,123,0x40000 -Hx,0,0x40000000 -Mx,0,0x40000000 -Hx,0,0x20000000 -Mpreprocess -g -Bstatic -Mbackslash -mp -acc -Mmpi=msmpi -Mcuda=debug,nollvm -Mfree -Mchkstk -I"C:\Users\chirayu\Desktop\nTracerV100_PVF\/src" -I"C:\Users\chirayu\Desktop\nTracerV100_PVF\/src/SP3SENM" -I"C:\Users\chirayu\Desktop\nTracerV100_PVF\/MATRA/Debug" -I"c:\program files\pgi\win32\15.10\include" -I"C:\Program Files\PGI\Microsoft Open Tools 12\include" -I"C:\Program Files\Windows Kits\8.1\Include\shared" -I"C:\Program Files\Windows Kits\8.1\Include\um" -ta=tesla -Minform=warn -module "Debug" -Minfo=accel,mp -o "Debug\RayTracing_GPU.obj" -c "C:\Users\chirayu\Desktop\nTracerV100_PVF\src\RayTracing_GPU.f90"

Command exit code: 0

Command output: [raytrace_gpu:

11, Generating present(rayinfo,coreinfo,gpuworkerdat(:,:),gpucontrol(:),expapolar(:,:),expbpolar(:,:),wtangp0(:,:))
38, Generating enter data copyin(phiangin(:,mygb:myge,:),src(mygb:myge,:),xst(mygb:myge,:))
39, Generating enter data create(jout(mygb:myge,:,:,:),phis(mygb:myge,:))
41, Generating present(xst(mygb:myge,:),src(mygb:myge,:),phiangin(:,mygb:myge,:),phis(mygb:myge,:),jout(mygb:myge,:,:,:))
44, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
44, !$acc loop gang ! blockidx%y
!$acc loop gang, vector(128) ! blockidx%x threadidx%x
48, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
48, !$acc loop gang ! blockidx%y
!$acc loop gang, vector(128) ! blockidx%x threadidx%x
51, Accelerator kernel generated
Generating Tesla code
252, !$acc loop gang(gpucontrol%ngang) ! blockidx%x
255, !$acc loop worker(gpucontrol%nworker) ! threadidx%y
258, !$acc loop vector(gpucontrol%nvector) ! threadidx%x
51, Generating copy(phis)
255, Loop is parallelizable
258, Loop is parallelizable
267, Generating exit data delete(src(mygb:myge,:),xst(mygb:myge,:))
268, Generating exit data copyout(phiangin(:,mygb:myge,:),jout(mygb:myge,:,:,:),phis(mygb:myge,:))

51, Generating copy(phis)

This is quite strange. Why does the compiler make another copy of phis at the start of the parallel region? Shouldn’t this be suppressed by the present clause?

I think this is what causes the crash in GPU memory.



Also, I found something really odd.

If I remove the kernels region, the parallel region works fine.

And if I place the kernels region after the parallel region, the kernels region makes another copy of phis and starts to produce the access violation error.

However, if I split the data region into two, both accelerator regions work fine.

It seems that, for a variable in the present clause that is used in multiple accelerator regions, !$ACC DATA PRESENT only takes effect for the first accelerator region.

This is completely weird. What is going on here?

To summarize,

!$ACC DATA PRESENT(xst(mygb : myge, :), src(mygb : myge, :), PhiAngIn(:, mygb : myge, :),                       &
!$ACC              phis(mygb : myge, :), Jout(mygb : myge, :, :, :))

!$ACC KERNELS
  phis(mygb : myge, :) = 0._8
!$ACC END KERNELS
IF (ljout) THEN
  !$ACC KERNELS
    jout(mygb : myge, :, :, :) = 0._8
  !$ACC END KERNELS
ENDIF

!$ACC PARALLEL  ! Another copy of phis and access violation
!$ACC LOOP INDEPENDENT GANG PRIVATE(j, FsrIdxSt, icel)
DO j = 1, CoreInfo%nxy
  FsrIdxSt = CoreInfo%Pin(j)%FsrIdxSt; icel = CoreInfo%Pin(j)%Cell(iz)
  !$ACC LOOP INDEPENDENT WORKER PRIVATE(i, ireg)
  DO i = 1, CoreInfo%CellInfo(icel)%nFsr
    ireg = FsrIdxSt + i - 1
    !$ACC LOOP INDEPENDENT VECTOR PRIVATE(ig)
    DO ig = mygb, myge
      phis(ig, ireg) = phis(ig, ireg) / (xst(ig, ireg) * CoreInfo%CellInfo(icel)%vol(i)) + src(ig, ireg)
    ENDDO
  ENDDO
ENDDO
!$ACC END PARALLEL
!$ACC END DATA



!$ACC DATA PRESENT(xst(mygb : myge, :), src(mygb : myge, :), PhiAngIn(:, mygb : myge, :),                       &
!$ACC              phis(mygb : myge, :), Jout(mygb : myge, :, :, :))

!$ACC PARALLEL
!$ACC LOOP INDEPENDENT GANG PRIVATE(j, FsrIdxSt, icel)
DO j = 1, CoreInfo%nxy
  FsrIdxSt = CoreInfo%Pin(j)%FsrIdxSt; icel = CoreInfo%Pin(j)%Cell(iz)
  !$ACC LOOP INDEPENDENT WORKER PRIVATE(i, ireg)
  DO i = 1, CoreInfo%CellInfo(icel)%nFsr
    ireg = FsrIdxSt + i - 1
    !$ACC LOOP INDEPENDENT VECTOR PRIVATE(ig)
    DO ig = mygb, myge
      phis(ig, ireg) = phis(ig, ireg) / (xst(ig, ireg) * CoreInfo%CellInfo(icel)%vol(i)) + src(ig, ireg)
    ENDDO
  ENDDO
ENDDO
!$ACC END PARALLEL

!$ACC KERNELS ! Another copy of phis and access violation
  phis(mygb : myge, :) = 0._8
!$ACC END KERNELS
IF (ljout) THEN
  !$ACC KERNELS ! jout is used for the first time, so no problem
    jout(mygb : myge, :, :, :) = 0._8
  !$ACC END KERNELS
ENDIF

!$ACC END DATA



! Works fine
!$ACC DATA PRESENT(phis(mygb : myge, :), Jout(mygb : myge, :, :, :))
!$ACC KERNELS
  phis(mygb : myge, :) = 0._8
!$ACC END KERNELS
IF (ljout) THEN
  !$ACC KERNELS
    jout(mygb : myge, :, :, :) = 0._8
  !$ACC END KERNELS
ENDIF
!$ACC END DATA

!$ACC DATA PRESENT(xst(mygb : myge, :), src(mygb : myge, :), PhiAngIn(:, mygb : myge, :),                       &
!$ACC              phis(mygb : myge, :), Jout(mygb : myge, :, :, :))
!$ACC PARALLEL
!$ACC LOOP INDEPENDENT GANG PRIVATE(j, FsrIdxSt, icel)
DO j = 1, CoreInfo%nxy
  FsrIdxSt = CoreInfo%Pin(j)%FsrIdxSt; icel = CoreInfo%Pin(j)%Cell(iz)
  !$ACC LOOP INDEPENDENT WORKER PRIVATE(i, ireg)
  DO i = 1, CoreInfo%CellInfo(icel)%nFsr
    ireg = FsrIdxSt + i - 1
    !$ACC LOOP INDEPENDENT VECTOR PRIVATE(ig)
    DO ig = mygb, myge
      phis(ig, ireg) = phis(ig, ireg) / (xst(ig, ireg) * CoreInfo%CellInfo(icel)%vol(i)) + src(ig, ireg)
    ENDDO
  ENDDO
ENDDO
!$ACC END PARALLEL
!$ACC END DATA

Is this a compiler bug?

Hi CNJ,

Just so I’m understanding, you have posted three code blocks. The first two are the failing cases, where the only difference is that you reversed the order of the compute regions. In both cases, the access violation occurs in the compute region that comes second. In the third case, the execution succeeds, but you need to identify “phis” as present twice.

Is there any code between these two loops? In particular, I’m wondering if there are calls which include “phis” as an argument? In this case given “phis” is a pointer, the compiler would have to assume that what “phis” points to has changed so must copy in the array. This would explain the extra “copyin(phis)” feedback message.

Is this a compiler bug?

Possible, but without a reproducing example I can’t tell.

The extra “copyin(phis)” feedback message is strange and is most likely causing the problem. “copyin” does check for the presence on the device but since you’ve only copied over a sub-array, not the whole array, it’s most likely copying it in. Usually the compiler will only try to copy in the minimal amount of data, but perhaps given you have a computed index, it may not be able to determine the range so copies in the whole thing.

If you can write up a reproducing example, I can then try to determine why the compiler is emitting the extra “copyin(phis)”. In the meantime, it seems that the workaround is to add a “present(phis)” clause to your compute regions so this copy isn’t made. Adding the second data region works as well, but is probably overkill.
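
A rough sketch of that workaround follows. It is not your code: the program name and sizes are invented, and it uses whole arrays rather than the sub-array sections from your routine, so whether it suppresses the extra copy in the original case is untested.

```fortran
program present_on_compute
  implicit none
  integer, parameter :: ng = 4, nreg = 6   ! hypothetical sizes
  real(8), pointer :: phis(:, :), xst(:, :), src(:, :)
  integer :: ig, ireg

  allocate(phis(ng, nreg), xst(ng, nreg), src(ng, nreg))
  xst = 2._8
  src = 1._8

  !$acc enter data copyin(xst, src) create(phis)

  ! present(...) on each compute region asks the compiler not to
  ! generate its own implicit copy of the pointer arrays.
  !$acc kernels present(phis)
  phis(:, :) = 0._8
  !$acc end kernels

  !$acc parallel loop collapse(2) present(phis, xst, src)
  do ireg = 1, nreg
    do ig = 1, ng
      phis(ig, ireg) = phis(ig, ireg) / xst(ig, ireg) + src(ig, ireg)
    end do
  end do

  !$acc exit data copyout(phis) delete(xst, src)

  ! With phis zeroed first, every element should equal src = 1.
  if (any(abs(phis - 1._8) > 1.e-12_8)) error stop "unexpected phis"
  print *, "ok"
  deallocate(phis, xst, src)
end program present_on_compute
```

If the -Minfo=accel output still shows a “Generating copy(phis)” line for a region that carries the present clause, that would be useful to include in the reproducer.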

  • Mat

Yes, I have a long loop between the two, but it is entirely inside the accelerator region, and I have disabled it with the preprocessor for the time being to simplify the problem.

I don’t think disabling a code block with the preprocessor should affect anything.