NVFORTRAN-F-0000-Internal compiler error. child tinfo should have been created at outlining function for host

We are building a large Fortran code base that is GPU-accelerated with OpenMP target offload. It works with recent gcc/gfortran. I am in the process of “adapting” the code so that it also compiles with the NVIDIA compiler, but I am struggling with lots of internal compiler errors. Here is a reduced example of code that refuses to compile. The setup is the following:

% cat /etc/redhat-release 
Rocky Linux release 9.4 (Blue Onyx)

% nvfortran --version
nvfortran 24.7-0 64-bit target on x86-64 Linux -tp znver2 
NVIDIA Compilers and Tools
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

% nvfortran -mp=gpu -c test.f90
NVFORTRAN-F-0000-Internal compiler error. child tinfo should have been created at outlining function for host     324  (test.f90: 19)
NVFORTRAN/x86-64 Linux 24.7-0: compilation aborted

And the code is:

subroutine ctoprim(mem,prim,lb,ub,ii)
  !$omp declare target
  real, dimension(:,:,:,:,:)  :: mem, prim
  integer:: i3, ii, lb, ub
  real, parameter :: smallr=1e-6
  
  !$omp parallel do shared(mem,prim,ii,lb,ub)
  do i3 = lb,ub
    prim(1,1,1,i3,ii) = max(mem(1,1,1,i3,ii),smallr)             ! density -> density
  end do
  
end subroutine

subroutine slopes(prim,ii,lb,ub)
  !$omp declare target
  integer :: i3, ii, lb, ub
  real, dimension(:,:,:,:,:) :: prim
  !
  !$omp parallel do
  do i3=lb,ub
    prim(1,1,1,i3,ii) = prim(1,1,1,i3,ii) - prim(1,1,1,i3-1,ii)
  enddo
end subroutine

Thanks for the report.

You’re encountering a known limitation where “parallel” can’t be used within a device subroutine. However, the compiler should be issuing an error rather than an ICE, so I’ve filed a problem report, TPR #36478, to see if we can address this.

To work around this, you can hoist the “parallel” out of the subroutine and add it around the call instead, or use “loop bind(parallel)”.

For example, hoisting “parallel”:

subroutine ctoprim(mem,prim,lb,ub,ii)
  !$omp declare target
  real, dimension(:,:,:,:,:)  :: mem, prim
  integer:: i3, ii, lb, ub
  real, parameter :: smallr=1e-6

  !$omp do
  do i3 = lb,ub
    prim(1,1,1,i3,ii) = max(mem(1,1,1,i3,ii),smallr)             ! density -> density
  end do

end subroutine

subroutine slopes(prim,ii,lb,ub)
  !$omp declare target
  integer :: i3, ii, lb, ub
  real, dimension(:,:,:,:,:) :: prim
  !
  !$omp do
  do i3=lb,ub
    prim(1,1,1,i3,ii) = prim(1,1,1,i3,ii) - prim(1,1,1,i3-1,ii)
  enddo
end subroutine

...

!$omp parallel
   call slopes(prim,ii,lb,ub)
!$omp end parallel

or, using “loop bind(parallel)”:

subroutine ctoprim(mem,prim,lb,ub,ii)
  !$omp declare target
  real, dimension(:,:,:,:,:)  :: mem, prim
  integer:: i3, ii, lb, ub
  real, parameter :: smallr=1e-6

  !$omp loop bind(parallel)
  do i3 = lb,ub
    prim(1,1,1,i3,ii) = max(mem(1,1,1,i3,ii),smallr)             ! density -> density
  end do

end subroutine

subroutine slopes(prim,ii,lb,ub)
  !$omp declare target
  integer :: i3, ii, lb, ub
  real, dimension(:,:,:,:,:) :: prim
  !
  !$omp loop bind(parallel)
  do i3=lb,ub
    prim(1,1,1,i3,ii) = prim(1,1,1,i3,ii) - prim(1,1,1,i3-1,ii)
  enddo
end subroutine

-Mat

Great explanation, and thanks a lot for the quick turn-around.

Just to make sure I have understood the rules correctly: if I open the parallel region outside the routines, with something like:

subroutine outer()

!$omp target teams distribute
do ii=1,npatches
  !$omp parallel shared(mem,prim,lb,ub,ii)
  call ctoprim(mem,prim,lb,ub,ii)
  call slopes(prim,ii,lb,ub)
  !$omp end parallel
enddo
!$omp end target teams distribute

then there is an implied barrier after each “!$omp do” loop in the called subroutines, and that barrier only applies to a single thread team (e.g. in the case of an NVIDIA GPU, the threads executing on a single streaming multiprocessor). Is that correct?

Or do I need to place an explicit barrier, like:

subroutine outer()

!$omp target teams distribute
do ii=1,npatches
  !$omp parallel shared(mem,prim,lb,ub,ii)
  call ctoprim(mem,prim,lb,ub,ii)
  !$omp barrier
  call slopes(prim,ii,lb,ub)
  !$omp end parallel
enddo
!$omp end target teams distribute

Also, what spawns a kernel? Is it the target directive, the teams distribute directive, or the parallel region directive?

Correct. There is an implicit barrier added after the “do” loop to sync the threads within a team.

An explicit “barrier” would apply globally across all teams and threads, but it should be avoided since it can have a severe negative performance impact.
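
To make that concrete, here is a sketch annotating the hoisted-do version of ctoprim from above; the “!$omp end do” line is optional and is only added here to mark where the implicit barrier sits (a nowait clause on it would remove the barrier, which you would not want here since slopes reads what ctoprim writes):

subroutine ctoprim(mem,prim,lb,ub,ii)
  !$omp declare target
  real, dimension(:,:,:,:,:)  :: mem, prim
  integer :: i3, ii, lb, ub
  real, parameter :: smallr=1e-6

  !$omp do
  do i3 = lb,ub
    prim(1,1,1,i3,ii) = max(mem(1,1,1,i3,ii),smallr)             ! density -> density
  end do
  !$omp end do   ! implicit barrier: the threads of this team wait here,
                 ! so a following call to slopes sees the updated prim

end subroutine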

Also, what spawns a kernel? Is it the target directive, the teams distribute directive, or the parallel region directive?

“target teams” defines the compute region and spawns the kernel.

“distribute” defines how to apply the workshare across teams.

“parallel” defines the region to apply thread parallelism with “do” or “for” defining the workshare for those threads.
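
Putting that together, here is a minimal sketch (using placeholder names such as outer and npatches from the snippets above, with data-mapping clauses omitted for brevity) with each directive’s role annotated:

subroutine outer(mem,prim,npatches,lb,ub)
  real, dimension(:,:,:,:,:) :: mem, prim
  integer :: npatches, lb, ub
  integer :: ii, i3
  real, parameter :: smallr=1e-6

  !$omp target teams                           ! spawns the kernel and creates the league of teams
  !$omp distribute                             ! workshares the patch loop across the teams
  do ii = 1, npatches
    !$omp parallel shared(mem,prim,lb,ub,ii)   ! activates the threads within each team
    !$omp do                                   ! workshares the inner loop across those threads
    do i3 = lb, ub
      prim(1,1,1,i3,ii) = max(mem(1,1,1,i3,ii), smallr)
    end do
    !$omp end parallel                         ! threads of the team join (implicit barrier)
  end do
  !$omp end target teams
end subroutine

In the workaround above, the “parallel” part lives in the caller and the “do” part lives inside ctoprim/slopes.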
