Unexpected NAN given from subroutine called in a parallel region of fortran file?

Dear developers,

I am parallelizing one of my Fortran software using Openacc. Most of the sections are working quite well, except for one section where I called several subroutine within a parallel region, I found NaN return from the call. To illustrate this idea, I will post a light-weighted version of my software for reference



MODULE TVD_MODULE        
USE DEFINITION
IMPLICIT NONE

contains

SUBROUTINE TVDVL_reconx
USE RIEMANN_MODULE
USE DEFINITION  
IMPLICIT NONE

! Integer !
INTEGER :: j, k, l, i, p

!$ACC ROUTINE (EOSEPSILON_NM) SEQ
!$ACC ROUTINE (EOSSOUNDSPEED) SEQ

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

! We first interpolate density  !
!$OMP PARALLEL
!$OMP DO COLLAPSE(4) SCHEDULE (STATIC)
!$ACC PARALLEL DEFAULT (PRESENT)
!$ACC LOOP GANG WORKER VECTOR COLLAPSE(4)
DO l = nz_min_2 - 1, nz_part_2 + 1
	DO k = ny_min_2 - 1, ny_part_2 + 1
		DO j = nx_min_2 - 1, nx_part_2 + 1
			DO i = imin2, imax2
				CALL TVD_VL (x2cen(j-1), x2cen(j), x2cen(j+1), xF2(j-1), xF2(j), dx2(j), & 
										 prim2(i,j-1,k,l), prim2(i,j,k,l), prim2(i,j+1,k,l), & 
										 primR2(i,j-1,k,l), primL2(i,j,k,l))
			END DO
		END DO
	END DO
END DO
!$OMP END DO

! Get energy and sound speed !
  !$OMP DO COLLAPSE(3) SCHEDULE (STATIC)
	!$ACC LOOP GANG WORKER VECTOR COLLAPSE(3)
	DO l = nz_min_2 - 1, nz_part_2 + 1
		DO k = ny_min_2 - 1, ny_part_2 + 1
			DO j = nx_min_2 - 1, nx_part_2 
				CALL EOSEPSILON_NM (primR2(irho2,j,k,l), primR2(itau2,j,k,l), eps2R(j,k,l))
				CALL EOSEPSILON_NM (primL2(irho2,j,k,l), primL2(itau2,j,k,l), eps2L(j,k,l))
				CALL EOSSOUNDSPEED (primR2(itau2,j,k,l), primR2(irho2,j,k,l), eps2R(j,k,l), cs2R(j,k,l))
				CALL EOSSOUNDSPEED (primL2(itau2,j,k,l), primL2(irho2,j,k,l), eps2L(j,k,l), cs2L(j,k,l))
			END DO
		END DO
  END DO
  !$OMP END DO
END IF
!$ACC END PARALLEL
!$OMP END PARALLEL

!$ACC SERIAL 
WRITE (*,*) primL2(itau2,25,49,0), primL2(irho2,25,49,0), eps2L(25,49,0), cs2L(25,49,0), 'left 0'
WRITE (*,*) primL2(itau2,25,49,1), primL2(irho2,25,49,1), eps2L(25,49,1), cs2L(25,49,1), 'left 1' 
WRITE (*,*) primL2(itau2,25,49,2), primL2(irho2,25,49,2), eps2L(25,49,2), cs2L(25,49,2), 'left 2'
WRITE (*,*) primR2(itau2,25,49,0), primR2(irho2,25,49,0), eps2R(25,49,0), cs2R(25,49,0), 'right 0'
WRITE (*,*) primR2(itau2,25,49,1), primR2(irho2,25,49,1), eps2R(25,49,1), cs2R(25,49,1), 'right 1'
WRITE (*,*) primR2(itau2,25,49,2), primR2(irho2,25,49,2), eps2R(25,49,2), cs2R(25,49,2), 'right 2'
WRITE (*,*) 
!$ACC END SERIAL
STOP

END SUBROUTINE

END MODULE

So all arrays are properly transferred to the device. The ACC SERIAL region is for me to check if there are unreasonable output from the loop. I find that eps2L(25,49,1), cs2L(25,49,1), eps2R(25,49,1), cs2R(25,49,1) are all NaNs. I pin-point the source of NaN is from the subroutine EOSSOUNDSPEED and EOSEPSILON_NM, let me show them here



SUBROUTINE EOSSOUNDSPEED (p_in, rho_in, eps_in, cs_out)

!$ACC ROUTINE SEQ

USE DEFINITION

IMPLICIT NONE

! Input density !

REAL*8, INTENT (IN) :: p_in, rho_in, eps_in

! Output value !

REAL*8, INTENT (OUT) :: cs_out

! Local variables !

REAL*8 :: dpdden, dpdeps

dpdden = eps_in * (ggas2 - 1.0D0)

dpdeps = rho_in * (ggas2 - 1.0D0)

cs_out = sqrt(dpdden+dpdeps*p_in/rho_in**(2.0D0))


END SUBROUTINE

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

SUBROUTINE EOSEPSILON_NM (rho_in, p_in, eps_out)

!$ACC ROUTINE SEQ

USE DEFINITION

IMPLICIT NONE

! Input density !

REAL*8, INTENT (IN) :: rho_in, p_in

! Output value !

REAL*8, INTENT (OUT) :: eps_out

eps_out = p_in/rho_in/(ggas2 - 1.0D0)

END SUBROUTINE

ggas2 is a variable from the definition module and has been properly declared using !$ACC DECLARE before the start of the program. Also, note that EOSSOUNDSPEED and EOSEPSILON_NM are not in the tvd_module but are in another separate file. I try a number of combinations to look for where the issue is.

  1. most trivial solution, mute the CALL EOSSOUNDSPEED and CALL EOSEPSILON_NM statement - successful
  2. move the EOSSOUNDSPEED and EOSSOUNDSPEED subroutine to the tvd_module - fail
  3. mute anything related to ggas2, i.e. eps_out = p_in/rho_in!/(ggas2 - 1.0D0) - fail
  4. print out the value of eps_out and cs_out in both subroutines - successful (huh???)
  5. replace all division operators in EOSSOUNDSPEED and EOSSOUNDSPEED with the multiplication operator - sucessful

I produce two successful cases, but the pattern is very non-trivial. Any clue about what is happening? I am using HPC_SDK 22.3, and built a hdf5 fortran (h5fc) compiler using the pgi compiler. The program is compiled with the corresponding h5fc compiler.

Without a reproducing example, it’s hard to tell what’s wrong, but are there any dependency between the first and second loops?

When you have two gang loops in the same parallel region, there’s no barrier between them. Hence some gangs may move to computing the second loop before other gangs have finished the first.

Hence one thing to try is using “PARALLEL LOOP” on each loop and remove the outer PARALLEL / END PARALLEL region. This will create two CUDA kernels rather than one, with the second not being launch until the first is complete.

If that doesn’t help, can you provide a reproducing example?

-Mat