Bug: NVHPC 25.X present table errors with Fortran do concurrent and kind-of nested type-bound procedures

Hi,

I’ve run into a very specific problem when using do concurrent with sort-of nested type-bound procedures. The error in the MRE below arises when:

  • NVHPC 25.1 or 25.3 are used
  • -O1 or higher is enabled
  • both -mp=gpu and -stdpar=gpu are added to the compile flags

module testm

    implicit none

    type:: base
    contains
        procedure:: an_elemental_function
        procedure:: a_2d_subroutine
    end type base

contains

    real elemental function an_elemental_function(this, input)
        class(base), intent(in):: this
        real, intent(in):: input
        an_elemental_function = 2.*input
    end function an_elemental_function

    subroutine a_2d_subroutine(this, input)
        class(base), intent(in):: this
        real, intent(inout):: input(:, :)
        integer:: i, j, s(2)
        s = shape(input)
        do concurrent(i=1:s(1), j=1:s(2))
            input(i,j) = an_elemental_function(this, input(i, j))
        enddo
    end subroutine a_2d_subroutine

end module testm

program test

    use testm

    implicit none
    type(base):: t
    real:: a(4, 4)

    a(:, :) = 2.

    call t%a_2d_subroutine(a)

    write(*, *) sum(a) == 64.0

end program test

Note the do concurrent in a_2d_subroutine, which calls an_elemental_function - both of which are methods of base. I get the error:

Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 8.9, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
host:0x4091e0 device:0x75feefafa200 size:80 presentcount:1+0 line:24 name:descriptor
host:0x4095c0 device:0x75feefafa000 size:64 presentcount:1+0 line:24 name:input(:,:)
host:0x409600 device:(nil) size:0 presentcount:1+0 line:24 name:this
allocated block device:0x75feefafa000 size:512 thread:1
allocated block device:0x75feefafa200 size:512 thread:1

Present table errors:
.O0001(:) lives at 0x4091e0 size 1180 partially present in
host:0x4091e0 device:0x75feefafa200 size:80 presentcount:1+0 line:24 name:descriptor file:/home/edwardy/test-simple.f90
host:0x4095c0 device:0x75feefafa000 size:64 presentcount:1+0 line:24 name:input(:,:) file:/home/edwardy/test-simple.f90
host:0x409600 device:(nil) size:0 presentcount:1+0 line:24 name:this file:/home/edwardy/test-simple.f90
FATAL ERROR: variable in data clause is partially present on the device: name=.O0001(:)
 file:/home/edwardy/test-simple.f90 a_2d_subroutine line:24

Note that if I replace input(i,j) = an_elemental_function(this, input(i, j)) with a class method call input(i,j) = this%an_elemental_function(input(i, j)), then it works fine.
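
For reference, the workaround described above could look like the following sketch. It is the module from the post with only the call inside the do concurrent loop changed to the type-bound form. The module name `testm_workaround` is made up here so that it does not clash with the original `testm`; the rest follows the post.

```fortran
module testm_workaround

    implicit none

    type :: base
    contains
        procedure :: an_elemental_function
        procedure :: a_2d_subroutine
    end type base

contains

    real elemental function an_elemental_function(this, input)
        class(base), intent(in) :: this
        real, intent(in) :: input
        an_elemental_function = 2.*input
    end function an_elemental_function

    subroutine a_2d_subroutine(this, input)
        class(base), intent(in) :: this
        real, intent(inout) :: input(:, :)
        integer :: i, j, s(2)
        s = shape(input)
        do concurrent (i = 1:s(1), j = 1:s(2))
            ! Type-bound call instead of an_elemental_function(this, input(i, j))
            input(i, j) = this%an_elemental_function(input(i, j))
        end do
    end subroutine a_2d_subroutine

end module testm_workaround
```

The driver program from the post should work with this module once its `use` line points at `testm_workaround`.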

Full compile command: nvfortran test.f90 -mp=gpu -stdpar=gpu -gpu=mem:separate -O1.

The example works as normal with NVHPC 24.9, regardless of flag combination.

Since the code is accessing host stack variables on the device, it requires full Unified Memory, i.e. “-gpu=mem:unified”.

Do your system and device support HMM, which is needed for full Unified Memory?

For example on a Grace-Hopper system:

% nvfortran -stdpar=gpu -gpu=mem:separate test.F90 ; a.out
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 9.0, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
host:0x417220 device:0x400419efa200 size:80 presentcount:1+0 line:24 name:descriptor
host:0x4175c0 device:0x400419efa000 size:64 presentcount:1+0 line:24 name:input(:,:)
host:0x417600 device:(nil) size:0 presentcount:1+0 line:24 name:this
allocated block device:0x400419efa000 size:512 thread:1
allocated block device:0x400419efa200 size:512 thread:1

Present table errors:
.O0001(:) lives at 0x417220 size 1180 partially present in
host:0x417220 device:0x400419efa200 size:80 presentcount:1+0 line:24 name:descriptor file:/home/mcolgrove/tmp/test.F90
host:0x4175c0 device:0x400419efa000 size:64 presentcount:1+0 line:24 name:input(:,:) file:/home/mcolgrove/tmp/test.F90
host:0x417600 device:(nil) size:0 presentcount:1+0 line:24 name:this file:/home/mcolgrove/tmp/test.F90
FATAL ERROR: variable in data clause is partially present on the device: name=.O0001(:)
 file:/home/mcolgrove/tmp/test.F90 a_2d_subroutine line:24

% nvfortran -stdpar=gpu -gpu=mem:unified test.F90 ; a.out
  T

Hi Mat,

No, we don’t have HMM-enabled systems yet.

Ok, though if you’re able to enable HMM you’ll be able to use a wider variety of code on the GPU. It does require a newer version of Linux and CUDA drivers as well as newer GPU architectures. Full details are in the article I linked above.

Performance-wise, full UM over PCIe isn’t the best but should be functional. It’s much better using NVLink on the Grace-Hopper systems if you have access.

Thanks Mat,

I should clarify that we’re not intending to use unified memory for the time being since we don’t have any HMM systems (and probably won’t for a while). We’re happy to manage memory manually with OpenMP directives for now.

I’m more concerned because the error mentioned seems to be new. We hope that we don’t get blocked from using newer nvfortran versions.

Sincere apologies! There was a flood of UF posts over the last few days, so I was moving too fast and missed that this is a regression.

I filed a problem report, TPR #37406, and will have engineering investigate.

Note that I’m now thinking that it might be a device inlining issue as adding “-Minline”, which is done by the front-end compiler, works around the problem.

A note on performance: as the code is written now, the compiler needs to implicitly copy the data each time it encounters the DC (do concurrent) loop. UM should help here, or you might consider adding OpenACC or OpenMP data regions to hoist the data movement earlier in the program.
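
As a rough illustration of hoisting the data movement with an OpenMP data region, here is a minimal sketch. The names `hoist_demo`, `double_in_place`, and `run_steps` are made up for this example, and it deliberately avoids type-bound procedures. It assumes that, when built with -mp=gpu -stdpar=gpu, the do concurrent loop reuses data already mapped by the enclosing `target data` region instead of copying it again; that matches the suggestion above but has not been verified here.

```fortran
module hoist_demo

    implicit none

contains

    ! Doubles every element of input with a do concurrent loop.
    subroutine double_in_place(input)
        real, intent(inout) :: input(:, :)
        integer :: i, j
        do concurrent (i = 1:size(input, 1), j = 1:size(input, 2))
            input(i, j) = 2.0*input(i, j)
        end do
    end subroutine double_in_place

    ! Calls double_in_place nsteps times inside a single data region, so
    ! the array is mapped once up front and copied back once at the end
    ! (assumption: the loop finds the array already present on the device).
    subroutine run_steps(a, nsteps)
        real, intent(inout) :: a(:, :)
        integer, intent(in) :: nsteps
        integer :: step
        !$omp target data map(tofrom: a)
        do step = 1, nsteps
            call double_in_place(a)
        end do
        !$omp end target data
    end subroutine run_steps

end module hoist_demo
```

Without OpenMP enabled the directives are treated as comments and the loops run on the host, so the results are the same either way; only the data movement differs.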

-Mat