NVFORTRAN SEGMENTATION FAULT (CORE DUMPED) in OPENACC DATA REGION

Hello,

nvfortran compiles the code (a dummy example is given below), but when it runs, it fails with a segmentation fault (core dumped) in the OpenACC data region.

The main program looks like this:

program main

  USE MOD
  
  ! counters
  INTEGER T, I, J, K
  INTEGER :: J_P = 2
  
  ! counters +
  INTEGER OS, OE, IG
  
  ! variables
  INTEGER*8 M
  INTEGER N
  INTEGER TP,ATP
  INTEGER :: ACP = 0
  
  DOUBLE PRECISION AN(MAT_DDT(1)%NUMCSR)
  DOUBLE PRECISION CN(NON_DDT%NUMCSR,HFKS)
  
  ! ALLOCATABLE ARRAYS
  DOUBLE PRECISION, ALLOCATABLE :: FAN(:,:)
  DOUBLE PRECISION, ALLOCATABLE :: FCN(:,:,:)
  
  ! -----------------------------------------------------------
  ! Allocatables are allocated
  ! -----------------------------------------------------------

  DO T = 0,180

  !$acc data copyin(FAN(:,:), FCN(:,:,:)) copy(AN(:), CN(:,:))
  
  CALL COPY2DEVICE_MAT(J_P)
  CALL COPY2DEVICE_NON(J_P)
  
  !$acc parallel loop present(MAT(1),NON) collapse(2)  
  DO I = OS,OE
  !************************
    ! ASSEMBLE
    DO J = 1,IG
      DO K = 1,IG

        M = (I-1)*IG*IG + (J-1)*IG + (K-1) + 1
        N = (MIN(J,K)-1)*IG + (MAX(J,K)-1) + 1

        TP = MAT(1)%TAL(J_P)%CAL(M)
        ATP = TP ; IF (ACP .EQ. 1) ATP = NON%TAL(J_P)%CAL(M)

        IF (NOCSR .GT. 0) THEN ! (ATP>0 AS WELL)
          AN(TP) = AN(TP) + FAN(I-OS+1,N)            !*** SYMMETRY IS ASSUMED ***
          CN(ATP,:) = CN(ATP,:) + FCN(I-OS+1,N,:)    !*** SYMMETRY IS ASSUMED ***
        END IF
      END DO
    END DO
  !************************
  END DO !  I
  !$acc end parallel loop
  
  CALL COPY2HOST_MAT(J_P)
  CALL COPY2HOST_NON(J_P)
  
  !$acc end data
  
  ! -----------------------------------------------------------
  ! Allocatables are allocated
  ! -----------------------------------------------------------

  END DO !  T

end program main

The MOD module:
MODULE MOD
IMPLICIT NONE

TYPE T02
  INTEGER, ALLOCATABLE, DIMENSION(:) :: CAL
END TYPE T02

TYPE T01
  TYPE (T02), ALLOCATABLE, DIMENSION(:) :: TAL
END TYPE T01

TYPE (T01), ALLOCATABLE, DIMENSION(:) :: MAT
TYPE (T01) :: NON


CONTAINS

!FOR KEY

SUBROUTINE COPY2DEVICE_MAT(J_P)

	INTEGER :: i
	INTEGER :: J_P
	
	!$acc enter data copyin(MAT(1))
	!$acc enter data copyin(MAT(1)%TAL(:))
	DO i = 1,J_P
		!$acc enter data copyin(MAT(1)%TAL(i)%CAL(:))
	END DO

END SUBROUTINE

SUBROUTINE COPY2HOST_MAT(J_P)

	INTEGER :: i
	INTEGER :: J_P
	
	DO i = 1,J_P
		!$acc exit data copyout(MAT(1)%TAL(i)%CAL(:))
	END DO
	!$acc exit data copyout(MAT(1)%TAL(:))
	!$acc exit data copyout(MAT(1))

END SUBROUTINE

SUBROUTINE COPY2DEVICE_NON_PAT_KEY(J_P)

	INTEGER :: i
	INTEGER :: J_P
	
	!$acc enter data copyin(NON)
	!$acc enter data copyin(NON%TAL(:))
	DO i = 1,J_P
		!$acc enter data copyin(NON%TAL(i)%CAL(:))
	END DO

END SUBROUTINE


	
SUBROUTINE COPY2HOST_NON(J_P)

	INTEGER :: i
	INTEGER :: J_P
	
	DO i = 1,J_P
		!$acc exit data copyout(NON%TAL(i)%CAL(:))
	END DO
	!$acc exit data copyout(NON%TAL(:))
	!$acc exit data copyout(NON)

END SUBROUTINE

END MODULE MOD

When I compiled and ran the code without the parallel loop, as shown below, it still gave the segmentation fault.

program main

  USE MOD
  
  ! counters
  INTEGER T, I, J, K
  INTEGER :: J_P = 2
  
  ! counters +
  INTEGER OS, OE, IG
  
  ! variables
  INTEGER*8 M
  INTEGER N
  INTEGER TP,ATP
  INTEGER :: ACP = 0
  
  DOUBLE PRECISION AN(MAT_DDT(1)%NUMCSR)
  DOUBLE PRECISION CN(NON_DDT%NUMCSR,HFKS)
  
  ! ALLOCATABLE ARRAYS
  DOUBLE PRECISION, ALLOCATABLE :: FAN(:,:)
  DOUBLE PRECISION, ALLOCATABLE :: FCN(:,:,:)
  
  ! -----------------------------------------------------------
  ! Allocatables are allocated
  ! -----------------------------------------------------------

  DO T = 0,180

  !$acc data copyin(FAN(:,:), FCN(:,:,:)) copy(AN(:), CN(:,:))
  
  CALL COPY2DEVICE_MAT(J_P)
  CALL COPY2DEVICE_NON(J_P)
  

  DO I = OS,OE
  !************************
    ! ASSEMBLE
    DO J = 1,IG
      DO K = 1,IG

        M = (I-1)*IG*IG + (J-1)*IG + (K-1) + 1
        N = (MIN(J,K)-1)*IG + (MAX(J,K)-1) + 1

        TP = MAT(1)%TAL(J_P)%CAL(M)
        ATP = TP ; IF (ACP .EQ. 1) ATP = NON%TAL(J_P)%CAL(M)

        IF (NOCSR .GT. 0) THEN ! (ATP>0 AS WELL)
          AN(TP) = AN(TP) + FAN(I-OS+1,N)            !*** SYMMETRY IS ASSUMED ***
          CN(ATP,:) = CN(ATP,:) + FCN(I-OS+1,N,:)    !*** SYMMETRY IS ASSUMED ***
        END IF
      END DO
    END DO
  !************************
  END DO !  I
  
  CALL COPY2HOST_MAT(J_P)
  CALL COPY2HOST_NON(J_P)
  
  !$acc end data
  
  ! -----------------------------------------------------------
  ! Allocatables are allocated
  ! -----------------------------------------------------------

  END DO !  T

end program main

Hi yunus.altintop.2,

I get syntax errors when trying to compile these files. I'm not sure if they are cut-and-paste errors, a forum posting issue, or you pared down the example too much. For example, “MAT_DDT” and “NON_DDT” aren't declared anywhere:

  DOUBLE PRECISION AN(MAT_DDT(1)%NUMCSR)
  DOUBLE PRECISION CN(NON_DDT%NUMCSR,HFKS)

A seg fault would be coming from the host, so look at the data directives. Here you’re using “MAT” in a data clause, but it hasn’t been allocated, so using “MAT(1)%TAL(:)” would cause a seg fault. Granted, I can’t be sure this is the actual problem since the example is incomplete, but it is a problem.
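To make the allocation point concrete: every level that the enter data directives name (MAT, MAT(1)%TAL, and each MAT(1)%TAL(i)%CAL for i = 1..J_P) has to exist on the host before COPY2DEVICE_MAT runs. Below is a minimal sketch of allocating and checking that structure, assuming a single MAT element and CAL arrays of a fixed length NCAL. The module MAT_ALLOC_SKETCH, the routines ALLOCATE_MAT and MAT_READY, and NCAL are hypothetical names; the type layout only mirrors T01/T02 from the post.

```fortran
! Sketch only: the type layout mirrors T01/T02 from the post, but the
! module name, routine names and NCAL are hypothetical.
module mat_alloc_sketch
  implicit none

  type t02
    integer, allocatable, dimension(:) :: cal
  end type t02

  type t01
    type(t02), allocatable, dimension(:) :: tal
  end type t01

  type(t01), allocatable, dimension(:) :: mat

contains

  ! Allocate every level the enter data directives touch:
  ! MAT, MAT(1)%TAL(1:J_P) and each MAT(1)%TAL(i)%CAL(1:NCAL).
  subroutine allocate_mat(j_p, ncal)
    integer, intent(in) :: j_p, ncal
    integer :: i

    if (.not. allocated(mat)) allocate(mat(1))
    if (.not. allocated(mat(1)%tal)) allocate(mat(1)%tal(j_p))
    do i = 1, j_p
      if (.not. allocated(mat(1)%tal(i)%cal)) then
        allocate(mat(1)%tal(i)%cal(ncal))
        mat(1)%tal(i)%cal = 0   ! placeholder values
      end if
    end do
  end subroutine allocate_mat

  ! Returns .false. if any level COPY2DEVICE_MAT would reference is
  ! missing. Each test is a separate statement because Fortran does not
  ! guarantee short-circuit evaluation of .and.
  logical function mat_ready(j_p) result(ok)
    integer, intent(in) :: j_p
    integer :: i

    ok = .false.
    if (.not. allocated(mat)) return
    if (size(mat) < 1) return
    if (.not. allocated(mat(1)%tal)) return
    if (size(mat(1)%tal) < j_p) return
    do i = 1, j_p
      if (.not. allocated(mat(1)%tal(i)%cal)) return
    end do
    ok = .true.
  end function mat_ready

end module mat_alloc_sketch
```

Calling MAT_READY(J_P) just before COPY2DEVICE_MAT and stopping when it returns .FALSE. would turn a silent seg fault into an explicit error message.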

-Mat

How can I debug my program? Which tools and flags should I use to figure it out?

You can set the environment variable “NV_ACC_NOTIFY=3” or “NV_ACC_DEBUG=1”. These make the runtime print diagnostic messages, with “NV_ACC_DEBUG” being very verbose. They won’t tell you why the segv is occurring, but they should point you to the data directive where it occurs.

Of course, this assumes the problem is in how you’re using the data directives. Since the fault is on the host, it could also be a general issue with your code. In that case, compile with debugging enabled (-g) and then use a debugger such as gdb or cuda-gdb.

When I do so, it gives the error:

Thread 1 "qage" received signal SIGSEGV, Segmentation fault.
0x0000000000406420 in mod_csr::copy2device_non (j_p=<optimized out>)
at Modules/mod.f90:56
56 !$acc enter data copyin(NON%TAL(i)%CAL(:))
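The trace stops at the enter data for NON%TAL(i)%CAL, which is consistent with the allocation explanation: if NON%TAL, or one of the CAL components for i = 1..J_P, is not allocated on the host, the directive is likely working from an unallocated component. One way to confirm this is a guarded variant of the copy routine that checks each level with ALLOCATED before any directive runs. This is a sketch only; the module NON_COPY_SKETCH, the routine COPY2DEVICE_NON_CHECKED and its IERR codes are hypothetical, and the types just mirror T01/T02 from the post.

```fortran
! Sketch only: the type layout mirrors T01/T02 from the post; the module,
! routine name and IERR codes are hypothetical.
module non_copy_sketch
  implicit none

  type t02
    integer, allocatable, dimension(:) :: cal
  end type t02

  type t01
    type(t02), allocatable, dimension(:) :: tal
  end type t01

  type(t01) :: non

contains

  ! Guarded version of COPY2DEVICE_NON. IERR = 0 on success,
  ! -1 if NON%TAL is unallocated, -2 if NON%TAL has fewer than J_P
  ! elements, or i if NON%TAL(i)%CAL is the first unallocated array.
  subroutine copy2device_non_checked(j_p, ierr)
    integer, intent(in)  :: j_p
    integer, intent(out) :: ierr
    integer :: i

    ierr = 0
    if (.not. allocated(non%tal)) then
      ierr = -1
      return
    end if
    if (size(non%tal) < j_p) then
      ierr = -2
      return
    end if
    do i = 1, j_p
      if (.not. allocated(non%tal(i)%cal)) then
        ierr = i
        return
      end if
    end do

    ! Every level exists on the host, so the manual deep copy can run.
    !$acc enter data copyin(non)
    !$acc enter data copyin(non%tal(:))
    do i = 1, j_p
      !$acc enter data copyin(non%tal(i)%cal(:))
    end do
  end subroutine copy2device_non_checked

end module non_copy_sketch
```

When compiled without -acc the !$acc lines are treated as comments, so the checks can also be exercised in a plain host build before turning OpenACC back on.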

There is a typo in the post: the subroutine is not SUBROUTINE COPY2DEVICE_NON_PAT_KEY(J_P), it is SUBROUTINE COPY2DEVICE_NON(J_P).

The typo is in the second code block (the MOD module) of my first post.

Sorry