Runtime error with nvfortran 20.7


My group is working on porting a CFD code to GPU with OpenACC. As a first step, we wanted to compile the code on CPU but issues came up.

On a Linux x86 system, the code segfaults when built with nvfortran -g -Ktrap=fp -r8 -O0. In some tests, it reports the line number of an array index out-of-bounds error, but the reported array bounds make no sense to me:

0: Subscript out of range for array mhdflux_v (FaceFlux.f90: 2983)
    subscript=2, lower bound=1640695287988, upper bound=1640695287990, dimension=1

In some other tests, it does not show the line number, just messages like

[ny01:09070] *** Process received signal ***
[ny01:09070] Signal: Segmentation fault (11)
[ny01:09070] Signal code:  (128)
[ny01:09070] Failing at address: (nil)

The code that is causing this issue is related to the usage of derived types in Fortran. We have a large derived type declared in one module and used in another module, with a mixture of scalars and vectors that looks like

type, public :: Param
     integer :: iLeft,  jLeft, kLeft
     integer :: iRight, jRight, kRight
     integer :: iBlockFace
     integer :: iDimFace
     integer :: iFluidMin = 1, iFluidMax = nFluid
     integer :: iVarMin   = 1, iVarMax   = nVar
     integer :: iEnergyMin = nVar+1, iEnergyMax = nVar + nFluid

     integer :: iFace, jFace, kFace   

     real :: CmaxDt
     real :: Area2, AreaX, AreaY, AreaZ, Area = 0.0
     real :: DeltaBnL, DeltaBnR
     real :: DiffBb ! (1/4)(BnL-BnR)^2
     real :: StateLeft_V(nVar)
     real :: StateRight_V(nVar)
     real :: FluxLeft_V(nVar+nFluid), FluxRight_V(nVar+nFluid)

     real :: Normal_D(3), NormalX, NormalY, NormalZ
     real :: Tangent1_D(3), Tangent2_D(3)
     real :: B0n, B0t1, B0t2
     real :: UnL, Ut1L, Ut2L, B1nL, B1t1L, B1t2L
     real :: UnR, Ut1R, Ut2R, B1nR, B1t1R, B1t2R

     real :: MhdFlux_V(     RhoUx_:RhoUz_)
     real :: MhdFluxLeft_V( RhoUx_:RhoUz_)
     real :: MhdFluxRight_V(RhoUx_:RhoUz_)

     real :: Enormal
     real :: Unormal_I(nFluid+1) = 0.0
     real :: UnLeft_I(nFluid+1)
     real :: UnRight_I(nFluid+1)
     real :: EtaJx, EtaJy, EtaJz, Eta     
     real :: InvDxyz, HallCoeff
     real :: HallJx, HallJy, HallJz
     logical :: UseHallGradPe = .false.
     real :: BiermannCoeff, GradXPeNe, GradYPeNe, GradZPeNe
     real :: DiffCoef, EradFlux=0.0, RadDiffCoef
     real :: HeatFlux, IonHeatFlux, HeatCondCoefNormal     
     real :: bCrossArea_D(3) = 0.0
     real :: B0x=0.0, B0y=0.0, B0z=0.0
     real :: ViscoCoeff
     logical :: IsBoundary
     real :: InvClightFace, InvClight2Face
     logical :: DoTestCell = .false.
     logical :: IsNewBlockVisco = .true.
     logical :: IsNewBlockGradPe = .true.
     logical :: IsNewBlockCurrent = .true.
     logical :: IsNewBlockHeatCond = .true.
     logical :: IsNewBlockIonHeatCond = .true.
     logical :: IsNewBlockRadDiffusion = .true.
     logical :: IsNewBlockAlfven = .true.
  end type Param

An object of this derived type is passed between several subroutines to set the parameters and intermediate values.
One of the arrays with declared range 2:4 in this derived type caused the issue. I have tried several different approaches to resolve this issue:

  • turn off OpenMP
  • use local array (copy) instead of pointer to the vectors
  • direct call with p%MhdFlux_V, etc., without using the associate block
  • change vector range from 2:4 to 1:3
  • move the vectors into a separate type declaration

However, none of these worked. An older version of this module that does not use derived types compiles and runs without issue, which suggests that the problem is related to the use of the derived type.

With -O2 or above, the code does not generate a runtime error, but the results are wrong. We have confirmed that the same code has no issues with gfortran, nagfor, and ifort. We also ran valgrind with gcc, and it showed no memory issues.

Would you be able to provide a reproducing example that we can use to investigate?

If not, can you post the section of code where the out-of-bound error occurs?
If you run the code through a debugger, does the seg fault occur in the same spot?

In your type, the MhdFlux_V array, which is the one named in the out-of-bounds error, is declared as:

 real :: MhdFlux_V(     RhoUx_:RhoUz_)
 real :: MhdFluxLeft_V( RhoUx_:RhoUz_)
 real :: MhdFluxRight_V(RhoUx_:RhoUz_)

However, I don’t see where “RhoUx_” or “RhoUz_” are declared. Where are these variables declared, and what are their values?

The indices RhoUx_ and RhoUz_ are constant parameters declared in another module. I tried to reproduce the issue with a simple program, but have had no success yet. Do you mind if I share the entire code with a makefile and instructions to run?
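For readers following along, a minimal sketch of that arrangement is shown here: index constants in one module, a derived type in a second module whose array component uses them as bounds, and a program that checks the bounds. The module names ModIndexSketch and ModParamSketch are hypothetical, and the values 2 and 4 are assumed from the "declared range 2:4" mentioned earlier. The thread does not establish whether a program this small triggers the problem on any given compiler version.

```fortran
! Hypothetical module holding the index constants (names are illustrative).
module ModIndexSketch
  implicit none
  ! Values assumed from the "declared range 2:4" mentioned in the thread.
  integer, parameter :: RhoUx_ = 2, RhoUz_ = 4
end module ModIndexSketch

! Hypothetical module declaring a cut-down version of the derived type.
module ModParamSketch
  use ModIndexSketch, only: RhoUx_, RhoUz_
  implicit none
  type, public :: Param
     real :: MhdFlux_V(RhoUx_:RhoUz_)
  end type Param
end module ModParamSketch

program check_bounds
  use ModIndexSketch, only: RhoUx_, RhoUz_
  use ModParamSketch, only: Param
  implicit none
  type(Param) :: p

  p%MhdFlux_V = 0.0

  ! The component is a whole array, so lbound/ubound should return the
  ! declared bounds RhoUx_ and RhoUz_.
  if (lbound(p%MhdFlux_V, 1) /= RhoUx_ .or. &
      ubound(p%MhdFlux_V, 1) /= RhoUz_) then
     print *, 'unexpected bounds: ', &
          lbound(p%MhdFlux_V, 1), ubound(p%MhdFlux_V, 1)
     error stop 1
  end if
  print *, 'bounds OK: ', lbound(p%MhdFlux_V, 1), ubound(p%MhdFlux_V, 1)
end program check_bounds
```

A check like this separates "the declared bounds are wrong" from "the bounds are only wrong when seen through some other construct", which helps narrow down where a reproducer needs to look.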


No, the full source is fine. If we can better understand why it’s erroring, it may be easier to write a reproducer, assuming that it’s a compiler issue.

Since our code is not fully open-source yet, is there a way I can share the files privately rather than posting them publicly?

It turned out that nvfortran 20.7 incorrectly handles the associate construct in a contained subroutine that accesses the derived type components.

Hi hyzhou,

We were able to reduce the original issue to the following small reproducer. The problem appears when an internal procedure is called from within an associate block and the internal procedure contains an identical associate. If the associate expression (o%x in this case) is changed to something else (say, o%y), the code works fine. However, the original version should be correct, and engineering has fixed the problem in our 22.3 release.

For example:

% cat test.F90
program p
  type t
    integer :: x(10)
    integer :: y(10)
  end type
  type(t) :: o
  associate (z=>o%x) ! this fails if child also has associate (z=>o%x)
  ! associate (z=>o%y) ! this works if child has associate (z=>o%x)
    call child
  end associate
contains
  subroutine child()
    associate (z=>o%x) ! this fails if parent has associate (z=>o%x)
    ! associate (z=>o%y) ! this works if parent has associate (z=>o%x)
      print *, lbound(z),ubound(z)
    end associate
  end subroutine
end program p

Fails in 22.2:
% nvfortran test.F90 -V22.2 -fast; a.out
Segmentation fault

Works correctly in 22.3:
% nvfortran test.F90 -V22.3 -fast ; a.out
            1           10
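If it helps to guard against this kind of regression, the reproducer can be turned into a self-checking test. The following is only a sketch under that assumption: the program name associate_check and the PASS/FAIL messages are made up for illustration, and it has not been run against the affected compiler versions. On an affected version the program may still crash before the check is reached.

```fortran
program associate_check
  implicit none
  type t
     integer :: x(10)
  end type t
  type(t) :: o

  o%x = 0
  associate (z => o%x)      ! parent associate on o%x
    call child
  end associate

contains

  subroutine child()
    associate (z => o%x)    ! identical associate in the internal procedure
      ! The bounds seen through the associate name should match the
      ! declared bounds of the component, 1 and 10.
      if (lbound(z, 1) /= 1 .or. ubound(z, 1) /= 10) then
         print *, 'FAIL: bounds are ', lbound(z, 1), ubound(z, 1)
         error stop 1
      end if
      print *, 'PASS'
    end associate
  end subroutine child

end program associate_check
```

Because the failure path uses error stop with a nonzero code, a build script can detect either a crash or wrong bounds from the exit status without parsing the output.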