Improving performance when calling subroutines inside of OpenMP teams regions with nvfortran

I have a subroutine that calculates a 3-dimensional quantity by iterating over the k index of an array and then performing calculations over 2D arrays. Within this loop over k, the subroutine calls another subroutine that also performs work over 2D arrays. I would like to preserve this structure while offloading this calculation to a GPU using do concurrent loops and an OpenMP target teams loop directive.

Although the combination of a target teams loop over k with nested do concurrent loops improves my runtime by itself, when I add the subroutine call within the k loop, I see about a 6x slowdown relative to inlining the subroutine logic within the k loop. Why do subroutine calls within teams loop regions lead to such a decrease in performance?

I have put together a Minimal Reproducible Example (MRE) that mimics the structure of my subroutine. When I compile this MRE with -O0 and run it with a subroutine call inside of the target teams loop region, it runs in about 9 seconds:

Total time for 100  iterations: 9.348666906356812 seconds

However, inlining the logic of the subroutine results in a roughly 2x speedup in the MRE’s runtime.

Total time for 100  iterations: 4.667701005935669 seconds

Compiling with either -O2 or -O4 and inlining the same subroutine results in a 5x speedup in the MRE’s runtime.

These numbers were produced with nvfortran 25.5 with the following settings:

nvfortran -g -O0 -mp=gpu -stdpar=gpu -gpu=mem:separate -Mnofma -Minfo=all -r8 MRE.F90 -o subroutine_MRE.exe

and were run on NVIDIA A100s.

MRE

!> Minimal Reproducible Example (MRE) for testing calls to subroutines that
!! do work over layers that are called within k-loops inside of target teams
!! regions

module constants_mod
  implicit none

  ! Simulation parameters
  integer, parameter :: nx = 288, ny = 288, nz = 100
  integer, parameter :: iterations = 100

  contains

  !> Subroutine that contains a k-loop with do concurrents over i,j followed by
  !! a call to compute_grad_mre and more do concurrents
  subroutine advection_calc_mre(u, v, h, adv_u, adv_v, metric_x, metric_y)
    implicit none
    real, intent(in) :: u(nx, ny, nz), v(nx, ny, nz), h(nx, ny, nz)
    real, intent(out) :: adv_u(nx, ny, nz), adv_v(nx, ny, nz)
    real, intent(in) :: metric_x(nx, ny), metric_y(nx, ny)

    ! Local variables
    real :: ke(nx, ny), kex(nx, ny), key(nx, ny)
    real :: curl(nx, ny), coriolis(nx, ny)
    integer :: i, j, k
    real :: temp1, temp2

    real :: area_factor

    adv_u = 0.0
    adv_v = 0.0

    !$omp target enter data map(alloc: adv_u, adv_v)
    !$omp target enter data map(alloc: curl, coriolis, temp1, temp2)
    !$omp target enter data map(alloc: key, kex, ke)

    !$omp target enter data map(alloc: area_factor)


    ! Main k-loop structure
    !$omp target teams loop private(curl, coriolis, ke, kex, key)
    do k = 1, nz

      ! Calculate needed quantities 
      do concurrent (j=2:ny-1, i=2:nx-1)
        curl(i, j) = (v(i+1, j, k) - v(i-1, j, k)) * metric_x(i, j) - &
                         (u(i, j+1, k) - u(i, j-1, k)) * metric_y(i, j)
        coriolis(i, j) = curl(i, j) * h(i, j, k)
      enddo

      ! Calculate ke while accessing neighboring indices: i-1, i, j-1, j
      area_factor = 0.25 
      do concurrent (j=2:ny-1, i=2:nx-1)
        ke(i, j) = area_factor * ((u(i-1, j, k) * u(i-1, j, k) + u(i, j, k) * u(i, j, k)) + &
                                  (v(i, j-1, k) * v(i, j-1, k) + v(i, j, k) * v(i, j, k))) * &
                  (1.0 / max(h(i-1, j, k), 0.001)) + &
                  (1.0 / max(h(i, j, k), 0.001))
      enddo

      ! Calculate ke gradient in x-direction
      do concurrent (j=2:ny-1, i=2:nx-2)
        kex(i, j) = (ke(i+1, j) - ke(i, j)) * 1.0
      enddo

      ! Calculate ke gradient in y-direction
      do concurrent (j=2:ny-2, i=2:nx-1)
        key(i, j) = (ke(i, j+1) - ke(i, j)) * 1.0
      enddo
      
      ! ! Call to subroutine similar to compute_grad
      ! call compute_grad_mre(u, v, h, ke, kex, key, k)

      ! Do work after function call
      do concurrent (j=2:ny-1, i=2:nx-1)
        temp1 = coriolis(i, j) * v(i, j, k)
        temp2 = kex(i, j) + metric_x(i, j) * h(i, j, k)
        adv_u(i, j, k) = temp1 - temp2
    
        temp1 = coriolis(i, j) * u(i, j, k)
        temp2 = key(i, j) + metric_y(i, j) * h(i, j, k)
        adv_v(i, j, k) = -temp1 - temp2
      enddo

    enddo

    !$omp target exit data map(delete: key, kex, ke)
    !$omp target exit data map(delete: curl, coriolis, temp1, temp2)
    !$omp target exit data map(delete: adv_u, adv_v)
    !$omp target exit data map(delete: area_factor)

  end subroutine advection_calc_mre

  !> Subroutine containing do concurrents over i,j that access multiple indices
  subroutine compute_grad_mre(u, v, h, ke, kex, key, k)
    !$omp declare target
    implicit none

    integer, intent(in) :: k
    real, intent(in) :: u(nx, ny, nz), v(nx, ny, nz), h(nx, ny, nz)
    real, intent(out) :: ke(nx, ny), kex(nx, ny), key(nx, ny)

    ! Local variables
    integer :: i, j
    real :: area_factor

    ! Calculate ke while accessing neighboring indices: i-1, i, j-1, j
    area_factor = 0.25 
    do concurrent (j=2:ny-1, i=2:nx-1)
      ke(i, j) = area_factor * ((u(i-1, j, k) * u(i-1, j, k) + u(i, j, k) * u(i, j, k)) + &
                                (v(i, j-1, k) * v(i, j-1, k) + v(i, j, k) * v(i, j, k))) * &
                 (1.0 / max(h(i-1, j, k), 0.001)) + &
                 (1.0 / max(h(i, j, k), 0.001))
    enddo

    ! Calculate ke gradient in x-direction
    do concurrent (j=2:ny-1, i=2:nx-2)
      kex(i, j) = (ke(i+1, j) - ke(i, j)) * 1.0
    enddo

    ! Calculate ke gradient in y-direction
    do concurrent (j=2:ny-2, i=2:nx-1)
      key(i, j) = (ke(i, j+1) - ke(i, j)) * 1.0
    enddo

  end subroutine compute_grad_mre

end module constants_mod

program MRE_advection
  use constants_mod

  implicit none

  ! Array declarations - main arrays
  real, allocatable :: u(:,:,:), v(:,:,:), h(:,:,:)
  real, allocatable :: result_u(:,:,:), result_v(:,:,:)
  real, allocatable :: metric_x(:,:), metric_y(:,:)

  integer :: iter
  real :: start_time, end_time

  ! Allocate arrays
  allocate(u(nx, ny, nz))
  allocate(v(nx, ny, nz))
  allocate(h(nx, ny, nz))
  allocate(result_u(nx, ny, nz))
  allocate(result_v(nx, ny, nz))
  allocate(metric_x(nx, ny))
  allocate(metric_y(nx, ny))

  ! Initialize arrays with random values
  call random_seed()
  call random_number(u)
  call random_number(v)
  call random_number(h)
  call random_number(metric_x)
  call random_number(metric_y)

  u = u * 10.0
  v = v * 10.0
  h = h * 100.0 + 1.0  ! Ensure h > 1
  metric_x = metric_x + 0.5  ! Ensure > 0
  metric_y = metric_y + 0.5  ! Ensure > 0

  result_u = 0.0
  result_v = 0.0

  ! Map variables to device
  !$omp target enter data map(to: u,v,h)
  !$omp target enter data map(to: result_u, result_v)
  !$omp target enter data map(to: metric_x, metric_y)

  ! Timing wrapper
  call cpu_time(start_time)

  ! Main loop - iterations for performance testing
  do iter = 1, iterations
    call advection_calc_mre(u, v, h, result_u, result_v, metric_x, metric_y)
  enddo

  call cpu_time(end_time)

  !$omp target exit data map(from: u,v,h)
  !$omp target exit data map(from: result_u, result_v)
  !$omp target exit data map(from: metric_x, metric_y)

  print *, "Total time for ", iterations, " iterations: ", (end_time - start_time), " seconds"
  print *, "Average time per iteration: ", (end_time - start_time) / real(iterations), " seconds"

  deallocate(u, v, h, result_u, result_v, metric_x, metric_y)

end program MRE_advection

Hi utheriwagura and welcome!

Why do subroutine calls within teams loop regions lead to such a decrease in performance?

This is not unexpected. Subroutine calls add overhead, such as requiring additional registers (typically around 150, which in turn reduces the theoretical occupancy), and they also limit the scope over which compiler optimizations can be applied.

Note that in this case, the compiler is able to inline the subroutine for you via the flag “-Minline”.

Here are my times on an H100: without inlining, with your manual inline (I added a -DINLINED macro to enable it), and with the compiler inlining:

% nvfortran -V25.5 -O2 -mp=gpu -stdpar=gpu -gpu=mem:separate -Mnofma -r8 MRE.F90 -o mre_no_inline.out
% nvfortran -V25.5 -O2 -DINLINED -mp=gpu -stdpar=gpu -gpu=mem:separate -Mnofma -r8 MRE.F90 -o mre_manual_inline.out
% nvfortran -V25.5 -O2 -Minline -Minfo=inline  -mp=gpu -stdpar=gpu -gpu=mem:separate -Mnofma -r8 MRE.F90 -o mre_compiler_inline.out
advection_calc_mre:
     73, compute_grad_mre inlined, size=32, file MRE.F90 (96)
% ./mre_no_inline.out
 Total time for           100  iterations:     3.974095106124878       seconds
 Average time per iteration:    3.9740951061248780E-002  seconds
% ./mre_manual_inline.out
 Total time for           100  iterations:    0.5511779785156250       seconds
 Average time per iteration:    5.5117797851562504E-003  seconds
% ./mre_compiler_inline.out
 Total time for           100  iterations:    0.5584948062896729       seconds
 Average time per iteration:    5.5849480628967288E-003  seconds

-Mat

Thanks for getting back to me Mat!

I was able to get the faster runtimes in my MRE with the -Minline flag, which is nice. However, I have not been able to get the subroutine call in my main code base to inline, even when I pass the -Minline flag, turn the optimization up to -O4, and simplify the logic of the subroutine. The strange part is that -Minfo=inline does not give me any reason as to why it is not inlining, as it does for some other subroutines in my code base. I haven’t been able to replicate this behavior with my MRE yet, but do you have any ideas as to what may be causing this?

Regarding register usage: this does seem to be an issue with the subroutine in our main code base, at least based on some of the output from Nsight Compute. The report suggests that in the lead-up to the subroutine we use significantly more registers, and the average number of threads we run per warp goes down to 1. Is it fair to assume the decrease in threads is due to register pressure, or could there be other causes of this?

Since this can cause a lot of output, the negative info is under a separate flag, i.e. “-Mneginfo=inline”.

Why it’s not inlining could be due to the use of reshaped arrays, the routine being too big, or too many inline levels. In these cases, you might need to try adjusting some of the suboptions, such as “-Minline=reshape”, “-Minline=levels:10”, or one of the “size” options:

% nvfortran -help -Minline
-Minline[=lib:<inlib>|<maxsize>|<func>|except:<func>|name:<func>|pragma|maxsize:<n>|totalsize:<n>|smallsize:<n>|reshape|exclude:<sdir>]
                    Enable function inlining
    lib:<inlib>     Use extracted functions from inlib
    <maxsize>       Set maximum function size to inline
    <func>          Inline function func
    except:<func>   Do not inline function func
    name:<func>     Inline function func
    pragma          Inline procedures marked !NVF$ INLINE
    maxsize:<n>     Inline only functions smaller than n
    totalsize:<n>   Limit inlining to total size of n
    smallsize:<n>   Always inline functions smaller than n
    reshape         Allow inlining in Fortran even when array shapes do not match
    exclude:<sdir>  Not extract file in sdir
    -Minline        Inline all functions that were extracted
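
For example, several suboptions can be combined on a single compile line; the values here are just placeholders to adjust for your code:

% nvfortran -O2 -Minline=reshape,levels:10,maxsize:500 -Minfo=inline -Mneginfo=inline -mp=gpu -stdpar=gpu -gpu=mem:separate -Mnofma -r8 MRE.F90 -o mre_inline.out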

Also keep in mind that “-Minline” will only inline routines in the same file, or in multiple files when they are all added to the same compile line. If you want cross-file inlining with separate compilation, you need to do a two-pass compile. First, compile with “-Mextract=lib:myinllib” (where “myinllib” can be any name you like) to create the inline library. Then compile again with “-Minline=lib:myinllib” to use the inline library.
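
A rough sketch of what the two passes look like (the file names here are placeholders for your own sources):

% nvfortran -Mextract=lib:myinllib compute_grad.F90
% nvfortran -O2 -Minline=lib:myinllib -Minfo=inline -mp=gpu -stdpar=gpu -gpu=mem:separate -Mnofma -r8 advection.F90 compute_grad.F90 -o app.out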

Is it fair to assume the decrease in threads is due to register pressure, or could there be other causes of this?

Highly likely. A CUDA block has a maximum of 64K registers, which are divided amongst the CUDA threads. Hence, if each thread needs more registers, this limits the number of threads per block. A thread can use a max of 255 registers.
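
As a rough back-of-the-envelope illustration (ignoring register allocation granularity): on an A100 or H100, an SM has 65,536 registers and can hold up to 2,048 resident threads (64 warps). At around 150 registers per thread, only about 65,536 / 150 ≈ 436 threads (roughly 13 warps) can be resident, which caps theoretical occupancy at around 20%.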

You can use the flag “-gpu=maxregcount:” to set the max number of registers a thread is allowed, thus allowing for more threads. However, if the thread needs more memory than can fit in the allowed registers, the extra variables “spill” to local memory, which can decrease performance as data gets swapped between local memory (which is stored in the GPU’s global memory) and the registers. I’ll typically only use this if the registers per thread is close to a boundary, like 65, in which case I’ll set maxregcount=64.
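
As an example, capping the registers per thread looks something like this (the value 64 is just an illustration):

% nvfortran -O2 -mp=gpu -stdpar=gpu -gpu=mem:separate,maxregcount:64 -Mnofma -r8 MRE.F90 -o mre_maxreg.out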

In Nsight-Compute, what you want to look for is the “theoretical occupancy”. High occupancy does not guarantee better performance, nor does low occupancy mean you can’t get optimal performance. Typically for HPC apps, which often have larger kernels with many local variables, 50% or even 25% occupancy is ok. It’s really a balance between having enough registers per thread to accommodate the needs of the kernel and fully utilizing all the resources of the GPU. In some cases I’ve seen folks split larger kernels into multiple smaller kernels in order to help with occupancy. However, you need an algorithm that allows this, and the added cost of launching more kernels might offset any gains.

I should note that there’s also “achieved occupancy”. Theoretical is the max occupancy that could be achieved, with “Achieved” being what occupancy was actually achieved. If there is a difference, the warps are stalling for some reason, like “long scoreboard” where the threads are waiting on memory, or contention on some compute unit like FP64. ncu has a table which will show the warp stalls.

Finally, I’ve been assuming that the kernel has enough work to fully utilize the GPU. Low achieved occupancy could also mean that there’s simply not enough work.
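
If it helps, one way to collect this information with Nsight Compute is something like the following (the report name is arbitrary); the “Occupancy” and “Warp State Statistics” sections of the report show theoretical vs. achieved occupancy and the warp-stall breakdown:

% ncu --set full -o mre_report ./mre_no_inline.out
% ncu-ui mre_report.ncu-rep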

I tried -Minline=reshape, -Minline=levels:10, -Minline=func:<func-name>, -Minline=pragma, and -Minline=smallsize:500. None of these options seem to have been successful, though I can’t say for certain because the -Mneginfo=inline flag isn’t printing any extra information about why the function isn’t inlining either.

The subroutine I am trying to inline is in the same file it is called in.

Hi Mat. I looked into the inlining issues once more, and I think I understand why my code wasn’t inlining previously. I didn’t mention this in the previous post or MRE, but our code also passes several derived types to the subroutine we are trying to inline. It looks like when we include these derived types, the subroutine not only fails to inline, but the compiler doesn’t give any info as to why. Removing them allows the subroutine to inline.

I expanded the previous MRE from above to include two derived types that are passed to the subroutine but not used, and this seems to prevent any inlining from occurring. Is this expected behavior?

Compilation flags:

-O2 -Minline -Mneginfo=inline -mp=gpu -stdpar=gpu -gpu=mem:separate -Mnofma -Minfo=all -r8

To use the derived types, add the macro -DUSE_DERIVED_TYPES to the above flags as well.

Example:

!> Minimal Reproducible Example (MRE) for testing calls to subroutines that
!! do work over layers that are called within k-loops inside of target teams
!!  regions

module derived_types_mod
  implicit none

  #ifdef USE_DERIVED_TYPES
  type :: my_type1
    real, allocatable :: arr(:)
    integer :: n
    real :: scalar
  end type my_type1

  type :: my_type2
    real, allocatable :: matrix(:,:)
    integer :: m, n
    real :: value
  end type my_type2
  #endif

end module derived_types_mod

module constants_mod
  #ifdef USE_DERIVED_TYPES
  use derived_types_mod
  #endif
  implicit none

  ! Simulation parameters
  integer, parameter :: nx = 288, ny = 288, nz = 100
  integer, parameter :: iterations = 100

  contains

  !> Subroutine that contains a k-loop with do concurrents over i,j followed by
  !! a call to compute_grad_mre and more do concurrents
  #ifdef USE_DERIVED_TYPES
  subroutine advection_calc_mre(u, v, h, adv_u, adv_v, metric_x, metric_y, dt1, dt2)
  #else
  subroutine advection_calc_mre(u, v, h, adv_u, adv_v, metric_x, metric_y)
  #endif
    implicit none
    real, intent(in) :: u(nx, ny, nz), v(nx, ny, nz), h(nx, ny, nz)
    real, intent(out) :: adv_u(nx, ny, nz), adv_v(nx, ny, nz)
    real, intent(in) :: metric_x(nx, ny), metric_y(nx, ny)
    #ifdef USE_DERIVED_TYPES
    type(my_type1), intent(in) :: dt1
    type(my_type2), intent(in) :: dt2
    #endif

    ! Local variables
    real :: ke(nx, ny), kex(nx, ny), key(nx, ny)
    real :: curl(nx, ny), coriolis(nx, ny)
    integer :: i, j, k
    real :: temp1, temp2

    real :: area_factor

    adv_u = 0.0
    adv_v = 0.0

    !$omp target enter data map(alloc: adv_u, adv_v)
    !$omp target enter data map(alloc: curl, coriolis, temp1, temp2)
    !$omp target enter data map(alloc: key, kex, ke)

    !$omp target enter data map(alloc: area_factor)


    ! Main k-loop structure
    !$omp target teams loop private(curl, coriolis, ke, kex, key)
    do k = 1, nz

      ! Calculate needed quantities 
      do concurrent (j=2:ny-1, i=2:nx-1)
        curl(i, j) = (v(i+1, j, k) - v(i-1, j, k)) * metric_x(i, j) - &
                         (u(i, j+1, k) - u(i, j-1, k)) * metric_y(i, j)
        coriolis(i, j) = curl(i, j) * h(i, j, k)
      enddo

      ! ! Call to subroutine similar to compute_grad
      #ifdef USE_DERIVED_TYPES
      call compute_grad_mre(u, v, h, ke, kex, key, k, dt1, dt2)
      #else
      call compute_grad_mre(u, v, h, ke, kex, key, k)
      #endif

      ! Do work after function call
      do concurrent (j=2:ny-1, i=2:nx-1)
        temp1 = coriolis(i, j) * v(i, j, k)
        temp2 = kex(i, j) + metric_x(i, j) * h(i, j, k)
        adv_u(i, j, k) = temp1 - temp2
    
        temp1 = coriolis(i, j) * u(i, j, k)
        temp2 = key(i, j) + metric_y(i, j) * h(i, j, k)
        adv_v(i, j, k) = -temp1 - temp2
      enddo

    enddo

    !$omp target exit data map(delete: key, kex, ke)
    !$omp target exit data map(delete: curl, coriolis, temp1, temp2)
    !$omp target exit data map(delete: adv_u, adv_v)
    !$omp target exit data map(delete: area_factor)

  end subroutine advection_calc_mre

  !> Subroutine containing do concurrents over i,j that access multiple indices
  #ifdef USE_DERIVED_TYPES
  subroutine compute_grad_mre(u, v, h, ke, kex, key, k, dt1, dt2)
  #else
  subroutine compute_grad_mre(u, v, h, ke, kex, key, k)
  #endif
    !$omp declare target
    implicit none

    integer, intent(in) :: k
    real, intent(in) :: u(nx, ny, nz), v(nx, ny, nz), h(nx, ny, nz)
    real, intent(out) :: ke(nx, ny), kex(nx, ny), key(nx, ny)
     
    #ifdef USE_DERIVED_TYPES
    type(my_type1), intent(in) :: dt1
    type(my_type2), intent(in) :: dt2
    #endif

    ! Local variables
    integer :: i, j
    real :: area_factor

    ! Calculate ke while accessing neighboring indices: i-1, i, j-1, j
    area_factor = 0.25 
    do concurrent (j=2:ny-1, i=2:nx-1)
      ke(i, j) = area_factor * ((u(i-1, j, k) * u(i-1, j, k) + u(i, j, k) * u(i, j, k)) + &
                                (v(i, j-1, k) * v(i, j-1, k) + v(i, j, k) * v(i, j, k))) * &
                 (1.0 / max(h(i-1, j, k), 0.001)) + &
                 (1.0 / max(h(i, j, k), 0.001))
    enddo

    ! Calculate ke gradient in x-direction
    do concurrent (j=2:ny-1, i=2:nx-2)
      kex(i, j) = (ke(i+1, j) - ke(i, j)) * 1.0
    enddo

    ! Calculate ke gradient in y-direction
    do concurrent (j=2:ny-2, i=2:nx-1)
      key(i, j) = (ke(i, j+1) - ke(i, j)) * 1.0
    enddo

  end subroutine compute_grad_mre

end module constants_mod

program MRE_advection
  use constants_mod
  #ifdef USE_DERIVED_TYPES
  use derived_types_mod
  #endif

  implicit none

  ! Array declarations - main arrays
  real, allocatable :: u(:,:,:), v(:,:,:), h(:,:,:)
  real, allocatable :: result_u(:,:,:), result_v(:,:,:)
  real, allocatable :: metric_x(:,:), metric_y(:,:)

  integer :: iter
  real :: start_time, end_time
  
  #ifdef USE_DERIVED_TYPES
  type(my_type1) :: dt1
  type(my_type2) :: dt2
  #endif

  ! Allocate arrays
  allocate(u(nx, ny, nz))
  allocate(v(nx, ny, nz))
  allocate(h(nx, ny, nz))
  allocate(result_u(nx, ny, nz))
  allocate(result_v(nx, ny, nz))
  allocate(metric_x(nx, ny))
  allocate(metric_y(nx, ny))

  ! Initialize arrays with random values
  call random_seed()
  call random_number(u)
  call random_number(v)
  call random_number(h)
  call random_number(metric_x)
  call random_number(metric_y)

  ! Allocate and initialize derived type fields
  #ifdef USE_DERIVED_TYPES
  dt1%n = 10
  dt1%scalar = 1.23
  allocate(dt1%arr(dt1%n))
  dt1%arr = 42.0

  dt2%m = 5
  dt2%n = 8
  dt2%value = 3.14
  allocate(dt2%matrix(dt2%m, dt2%n))
  dt2%matrix = 7.0
  #endif

  u = u * 10.0
  v = v * 10.0
  h = h * 100.0 + 1.0  ! Ensure h > 1
  metric_x = metric_x + 0.5  ! Ensure > 0
  metric_y = metric_y + 0.5  ! Ensure > 0

  result_u = 0.0
  result_v = 0.0

  ! Map variables to device
  !$omp target enter data map(to: u,v,h)
  !$omp target enter data map(to: result_u, result_v)
  !$omp target enter data map(to: metric_x, metric_y)

  ! Timing wrapper
  call cpu_time(start_time)

  ! Main loop - iterations for performance testing
  do iter = 1, iterations
    #ifdef USE_DERIVED_TYPES
    call advection_calc_mre(u, v, h, result_u, result_v, metric_x, metric_y, dt1, dt2)
    #else
    call advection_calc_mre(u, v, h, result_u, result_v, metric_x, metric_y)
    #endif
  enddo

  call cpu_time(end_time)

  !$omp target exit data map(from: u,v,h)
  !$omp target exit data map(from: result_u, result_v)
  !$omp target exit data map(from: metric_x, metric_y)

  print *, "Total time for ", iterations, " iterations: ", (end_time - start_time), " seconds"
  print *, "Average time per iteration: ", (end_time - start_time) / real(iterations), " seconds"

  deallocate(u, v, h, result_u, result_v, metric_x, metric_y)

  #ifdef USE_DERIVED_TYPES
  deallocate(dt1%arr)
  deallocate(dt2%matrix)
  #endif

end program MRE_advection

I asked one of our compiler engineers who works on the inliner. She says that the missing neginfo is a bug. She also tried removing the restrictions on the inliner when a derived type is an argument, but encountered a number of issues having to do with the descriptor for the type.

She asked that I add a report, TPR #38226, and she’ll see what she can do.


Thanks for all the useful inlining info @MatColgrove. After getting inlining to work (by removing derived types and derived-type members from the procedure calls), I now run into ACC_HEAP_ALLOC errors. Is it the case that procedure-local arrays get allocated on the heap even after inlining?

My code looks something like this:

!$omp target teams loop private(arr1)
do j = ...
    
    call proc1(arr1)
    call proc2(arr1, arr2(:,j), ni)
enddo

subroutine proc1(arrin)
    real, intent(inout) :: arrin(:)

    !$omp loop bind(parallel)
    do i = ...
        arrin(i) = ...
    enddo

end subroutine proc1

subroutine proc2(arr1, arr2, ni)
    integer, intent(in) :: ni
    real, intent(in)  :: arr1(ni)
    real, intent(out) :: arr2(ni)
    real :: arr3(ni)

    call proc1(arr3)

    !$omp loop bind(parallel)
    do i = ...
        arr2(i) = ... ! do something with arr3
    enddo

end subroutine proc2

I can confirm that proc1 and proc2 get inlined with -Minline=reshape,levels:2 -Minfo=inline,mp but I run into the ACC_HEAP_ALLOC errors as mentioned. I’m not sure if the error comes from the private arrays in the teams loop, or the procedure local array in proc2.

The outer loop’s private “arr1” wouldn’t cause this. It might cause memory errors in that these private copies get allocated before the kernel launch, so you could run out of memory, but that allocation is on the host side.

The likely issue has to do with the automatic array, arr3. This does get allocated on the device so could encounter device heap errors.

These errors are easily worked around by setting the NV_ACC_CUDA_HEAPSIZE environment variable to a size large enough for the automatic allocation. There’s no limit on the size other than the available memory on the device.
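
For example, in a bash shell (the size here is arbitrary and given as a plain byte count):

export NV_ACC_CUDA_HEAPSIZE=134217728   # 128 MB device heap
./a.out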

Performance-wise, though, device-side allocation can be slow, so it should be avoided if possible. Are you able to make arr3 fixed size rather than automatic?
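
For example, assuming there’s a known compile-time upper bound (here called NI_MAX):

real :: arr3(NI_MAX)   ! fixed-size work array; only the first ni elements are used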

Thanks Mat. Making the array static could be an option. Would replacing arr3 with an argument also work?

Thanks,
Ed

Yep, basically hoist arr3 into the caller with its allocation before the offload region. Then put it in a “private” clause and pass the private copy of arr3 into the subroutine.
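
Roughly like this, building on your pseudocode above (loop bounds and the body of the i loop are placeholders):

! caller: arr3 is now declared here instead of as an automatic inside proc2
real :: arr1(ni), arr3(ni)

!$omp target teams loop private(arr1, arr3)
do j = 1, nj
    call proc1(arr1)
    call proc2(arr1, arr2(:,j), arr3, ni)   ! pass the private copy of arr3 in
enddo

subroutine proc2(arr1, arr2, arr3, ni)
    integer, intent(in)  :: ni
    real, intent(in)     :: arr1(ni)
    real, intent(out)    :: arr2(ni)
    real, intent(inout)  :: arr3(ni)   ! was an automatic array; now caller-provided
    integer :: i

    call proc1(arr3)

    !$omp loop bind(parallel)
    do i = 1, ni
        arr2(i) = arr1(i) * arr3(i)   ! placeholder for the real computation
    enddo

end subroutine proc2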
