Deviceptr vs present OpenACC directives

Dear all,

during the development of a Fortran library for easy handling of memory offloading to GPU devices, we ran into an issue with the OpenACC deviceptr clause. Please consider the following minimal test:

program test_deviceptr
use iso_c_binding
use openacc

implicit none

integer                   :: sizes(3)=[1,2,3]
real, pointer             :: a(:,:,:)=>null()
real, allocatable, target :: b(:,:,:)
type(c_ptr)               :: cptr
integer(c_size_t)         :: bytes
integer                   :: i, j, k

interface
   function acc_malloc_f(total_byte_dim) bind(c, name="acc_malloc")
   use iso_c_binding, only : c_ptr, c_size_t
   implicit none
   type(c_ptr)                          :: acc_malloc_f
   integer(c_size_t), value, intent(in) :: total_byte_dim
   endfunction acc_malloc_f

   subroutine acc_memcpy_from_device_f(host_ptr, dev_ptr, total_byte_dim) bind(c, name="acc_memcpy_from_device")
   use iso_c_binding, only : c_ptr, c_size_t
   implicit none
   type(c_ptr),       value :: host_ptr
   type(c_ptr),       value :: dev_ptr
   integer(c_size_t), value :: total_byte_dim
   endsubroutine acc_memcpy_from_device_f
endinterface

bytes = int(storage_size(a)/8, c_size_t) * int(product(sizes), c_size_t)
cptr = acc_malloc_f(bytes)
if (c_associated(cptr)) call c_f_pointer(cptr, a, shape=sizes)
!$acc parallel loop collapse(3) deviceptr(a)
do k=1, sizes(3)
   do j=1, sizes(2)
      do i=1, sizes(1)
         a(i,j,k) = (i + j + k) * 0.5
      enddo
   enddo
enddo
allocate(b(sizes(1),sizes(2),sizes(3)))
call acc_memcpy_from_device_f(c_loc(b), c_loc(a), bytes)
do k=1, sizes(3)
   do j=1, sizes(2)
      do i=1, sizes(1)
         print*, b(i,j,k)
      enddo
   enddo
enddo
endprogram test_deviceptr

The test allocates an array on the device, fills it in a parallel loop, copies it back to the host, and prints the result. If I compile this test with nvfortran (24.1-0), it builds and runs correctly (as expected):

   1.50000000    
   2.00000000    
   2.00000000    
   2.50000000    
   2.50000000    
   3.00000000

However, if I use GNU gfortran (13.1.0) I obtain:

compilers_proofs/oac/test_deviceptr.f90:34:42:

   34 | !$acc parallel loop collapse(3) deviceptr(a)
      |                                          1
Error: POINTER object ‘a’ in MAP clause at (1)

This led me to read the latest OpenACC specs more carefully, where I found the following statements:

deviceptr

The deviceptr clause may appear on structured data and compute constructs and declare directives.
The deviceptr clause is used to declare that the pointers in var-list are device pointers, so the
data need not be allocated or moved between the host and device for this pointer.

In C and C++, the vars in var-list must be pointer variables.

In Fortran, the vars in var-list must be dummy arguments (arrays or scalars), and may not have the
Fortran pointer, allocatable, or value attributes.
For data in shared memory, host pointers are the same as device pointers, so this clause has no effect.

To my understanding, in Fortran deviceptr should not accept pointer variables (as I use in the test), so it seems that GNU gfortran is right to raise the error. I then tried the present clause, which appears to accept pointer variables:

present

The present clause may appear on structured data and compute constructs and declare directives. The present clause specifies that vars in var-list are in shared memory or are already present in the current device memory due to data regions or data lifetimes that contain the construct on which the present clause appears.

For each var in var-list, if var is in shared memory, no action is taken; if var is not in shared memory,
the present clause behaves as follows:
• At entry to the region:
– An attach action is performed if var is a pointer reference, and a present increment
action with the structured reference counter is performed if var is not a null pointer.
• At exit from the region:
– If the structured reference counter for var is zero, no action is taken.
– Otherwise, a detach action is performed if var is a pointer reference, and a present decrement
action with the structured reference counter is performed if var is not a null pointer. If
both structured and dynamic reference counters are zero, a delete action is performed.

Substituting present for deviceptr makes GNU gfortran compile and run the test correctly, but with nvfortran, although the test compiles cleanly, running it produces the following error:

hostptr=0x79999d2fa000,stride=1,size=6,eltsize=4,name=a(:,:,:),flags=0x200=present,async=-1,threadid=1
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 7.5, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
...empty...
allocated block device:0x79999d2fa000 size:512 thread:1
FATAL ERROR: data in PRESENT clause was not found on device 1: name=a(:,:,:) host:0x79999d2fa000
 file:/home/stefano/fortran/FUNDAL/compilers_proofs/oac/test_present.f90 test_present line:34

Currently we use the deviceptr clause in our library because it works fine with nvfortran, but we are worried that this is not the right approach, given the OpenACC specs and gfortran's behavior.

Can you tell us whether we are out of spec in using deviceptr the above way?

Kind regards,
Stefano

Hi Stefano,

The behavior of “deviceptr” is an extension to support interoperability with CUDA Fortran.
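For reference, the spec's requirement that the deviceptr variable be a dummy argument can be met by moving the compute region into a routine, along these lines (an untested sketch; the subroutine name is illustrative):

```fortran
! Sketch: keeping deviceptr within the letter of the spec by making the
! variable a dummy argument without the pointer/allocatable/value attributes.
! The device buffer is still allocated with acc_malloc and associated with a
! Fortran pointer via c_f_pointer, as in your test; the pointer is then passed
! to a routine whose dummy is a plain explicit-shape array.
subroutine fill_on_device(a, n1, n2, n3)
   implicit none
   integer, intent(in) :: n1, n2, n3
   real                :: a(n1, n2, n3)   ! no pointer/allocatable attribute
   integer             :: i, j, k
   !$acc parallel loop collapse(3) deviceptr(a)
   do k = 1, n3
      do j = 1, n2
         do i = 1, n1
            a(i, j, k) = (i + j + k) * 0.5
         enddo
      enddo
   enddo
endsubroutine fill_on_device
```

The caller would invoke `call fill_on_device(a, sizes(1), sizes(2), sizes(3))` after the c_f_pointer association; since the pointer target is contiguous, the base address is passed without copy-in. Again, this is a sketch, not a guaranteed-portable recipe.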

However, using “present” here isn’t really correct. “present” checks that the variable is being managed by the OpenACC runtime via data directives or is in shared memory, which isn’t the case here. I don’t know why it works with gfortran, but technically it shouldn’t.

I don’t have a full understanding of your project, but this approach seems problematic to me. You’re effectively doing CUDA Fortran-style data management, splitting between host and device variables rather than using OpenACC-managed mirrored copies of the variables.

Another approach to consider is using unstructured data regions contained in a module like the simple example below. For a full library I’d extend this to use generic interfaces to handle different array types and dimensions.

module acc_data_manage

  contains

  subroutine acc_create_3d(arr)
        real, dimension(:,:,:) :: arr
        !$acc enter data create(arr)
  end subroutine acc_create_3d

  subroutine acc_delete_3d(arr)
        real, dimension(:,:,:) :: arr
        !$acc exit data delete(arr)
  end subroutine acc_delete_3d

  subroutine acc_update_device_3d(arr)
        real, dimension(:,:,:) :: arr
        !$acc update device(arr)
  end subroutine acc_update_device_3d

  subroutine acc_update_self_3d(arr)
        real, dimension(:,:,:) :: arr
        !$acc update self(arr)
  end subroutine acc_update_self_3d

end module acc_data_manage


program test_deviceptr
use acc_data_manage
implicit none

integer           :: sizes(3)=[1,2,3]
real, allocatable :: a(:,:,:)
integer           :: i, j, k
allocate(a(sizes(1),sizes(2),sizes(3)))
call acc_create_3d(a)

!$acc parallel loop collapse(3) present(a)
do k=1, sizes(3)
   do j=1, sizes(2)
      do i=1, sizes(1)
         a(i,j,k) = (i + j + k) * 0.5
      enddo
   enddo
enddo
call acc_update_self_3d(a)

do k=1, sizes(3)
   do j=1, sizes(2)
      do i=1, sizes(1)
         print*, a(i,j,k)
      enddo
   enddo
enddo
call acc_delete_3d(a)
deallocate(a)
endprogram test_deviceptr

This works with both nvfortran and gfortran and should make your library easier to use as well as implement.

-Mat

Dear Mat, thank you very much for your help, it is appreciated.

The behavior of “deviceptr” is an extension to support interoperability with CUDA Fortran.

Indeed, we are also integrating the library into a solver that already has a CUDA Fortran backend, so this was a “viable” approach for us.

However, using “present” here isn’t really correct. “present” checks that the variable is being managed by the OpenACC runtime via data directives or is in shared memory, which isn’t the case here. I don’t know why it works with gfortran, but technically it shouldn’t.

This is a crucial point for us, so present is not useful for our approach. I will ask the GNU gfortran developers for their interpretation of the OpenACC specs and why gfortran raises an error with deviceptr.

I don’t have a full understanding of your project, but this approach seems problematic to me. You’re effectively doing CUDA Fortran-style data management, splitting between host and device variables rather than using OpenACC-managed mirrored copies of the variables.

We aim to have full control of device memory by means of runtime routines instead of sitting on top of OpenACC-managed mirrored copies, just as we do with CUDA Fortran and OpenMP offload (the library aims to unify all three backends).

Another approach to consider is using unstructured data regions contained in a module like the simple example below. For a full library I’d extend this to use generic interfaces to handle different array types and dimensions…

This is interesting, thank you very much for the suggestion! I will study this approach to see whether it can be replicated with OpenMP offloading.

This works with both nvfortran and gfortran and should make your library easier to use as well as implement.

I agree; being able to compile with a large number of compilers is good development practice for us.

Thank you again, Mat.
Stefano

It should be easy to replicate in OpenMP, since it adopted a very similar approach using “target data” constructs. Both sets of directives can be added to the same routines, with the selection controlled either by compiler flags or by macro guards.
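Concretely, the module from the earlier example could carry both backends along these lines (a sketch; the USE_OPENACC/USE_OPENMP guards and routine names are illustrative, not from either library):

```fortran
module dev_data_manage
  ! Sketch: OpenACC unstructured data directives and their OpenMP
  ! "target enter/exit data" / "target update" counterparts in the
  ! same routines, selected at compile time by preprocessor guards.
  contains

  subroutine dev_create_3d(arr)
        real, dimension(:,:,:) :: arr
#if defined(USE_OPENACC)
        !$acc enter data create(arr)
#elif defined(USE_OPENMP)
        !$omp target enter data map(alloc: arr)
#endif
  end subroutine dev_create_3d

  subroutine dev_delete_3d(arr)
        real, dimension(:,:,:) :: arr
#if defined(USE_OPENACC)
        !$acc exit data delete(arr)
#elif defined(USE_OPENMP)
        !$omp target exit data map(delete: arr)
#endif
  end subroutine dev_delete_3d

  subroutine dev_update_device_3d(arr)
        real, dimension(:,:,:) :: arr
#if defined(USE_OPENACC)
        !$acc update device(arr)
#elif defined(USE_OPENMP)
        !$omp target update to(arr)
#endif
  end subroutine dev_update_device_3d

  subroutine dev_update_self_3d(arr)
        real, dimension(:,:,:) :: arr
#if defined(USE_OPENACC)
        !$acc update self(arr)
#elif defined(USE_OPENMP)
        !$omp target update from(arr)
#endif
  end subroutine dev_update_self_3d

end module dev_data_manage
```

The source would need a `.F90` extension (or an explicit preprocessing flag) so the macro guards are expanded.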

Hi Mat,

It should be easy to replicate in OpenMP, since it adopted a very similar approach using “target data” constructs. Both sets of directives can be added to the same routines, with the selection controlled either by compiler flags or by macro guards.

Yes, today we have discussed your suggestion and we agree that target data is equivalent. We will add this approach alongside the other.

If you are interested, the library is FOSS.

Stefano

Dear @MatColgrove and all,

aside from the issue with GNU gfortran and deviceptr, we are facing a problem with an OpenACC CUDA-aware MPI test.

Our current approach (mimicking a more complex scenario) is something like this:

...
print '(A)', mpih%myrankstr//'test MPI by means of device memory'
!$acc data deviceptr(a01,a11)
!$acc host_data use_device(a01,a11)
!$omp target data use_device_ptr(a01,a11)
if (mpih%myrank == 0_I4P) call MPI_IRECV(a01, 6, MPI_REAL8, 1, 100, MPI_COMM_WORLD, req_recv(1), mpih%ierr)
if (mpih%myrank == 1_I4P) call MPI_SEND( a11, 6, MPI_REAL8, 0, 100, MPI_COMM_WORLD,              mpih%ierr)
call MPI_WAITALL(mpih%procs_number, req_recv, MPI_STATUSES_IGNORE, mpih%ierr)
!$omp end target data
!$acc end host_data
!$acc end data
...

This snippet compiles correctly with NVIDIA nvfortran, but the data is not correctly passed by the CUDA-aware MPI under OpenACC (with CUDA Fortran it works as expected). Note that we have also tried OpenMP offloading (see the !$omp target data use_device_ptr(a01,a11) directive) on Intel GPUs, and it also works as expected. We suspect that the decorations !$acc data deviceptr(a01,a11) and !$acc host_data use_device(a01,a11) are wrong. Can you give us your opinion on this?

Stefano

I suspect the “host_data” construct isn’t needed here given a01 and a11 are already device pointers.

Also, do you have this example posted in your project? I looked but didn’t see it, though I could have missed it.

Since I’ve not seen this use case before, I’d like to investigate so I can give you a more definitive answer.

Dear Mat,

I am sorry, I forgot the link to the test. The test is the following

However, it uses the library, which is currently not so straightforward to build (there is some documentation, anyway).

I should have tested without host_data already, but I am not completely sure I did; I will try again ASAP.

Thank you very much for your great help.

Stefano

Hi Mat,

without host_data the test compiles, but running it I get the following error:

[enlil:58255:0:58255] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7cef732fa600)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
==== backtrace (tid:  58255) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000001a0840 __nss_database_lookup()  ???:0
 2 0x0000000000394f72 local_copy_i8()  nvcoFNRQnfblGcF.ll:0
 3 0x0000000000394e8d local_copy_i8()  nvcoFNRQnfblGcF.ll:0
 4 0x0000000000394e8d local_copy_i8()  nvcoFNRQnfblGcF.ll:0
 5 0x000000000039325f pgf90_copy_f77_argl_i8()  ???:0
 6 0x0000000000404fcc MAIN_()  /home/stefano/fortran/FUNDAL/src/tests/mpi/fundal_mpi_test.F90:168
 7 0x00000000004025b1 main()  ???:0
 8 0x0000000000029d90 __libc_init_first()  ???:0
 9 0x0000000000029e40 __libc_start_main()  ???:0
10 0x00000000004024a5 _start()  ???:0
=================================
[enlil:58255] *** Process received signal ***
[enlil:58255] Signal: Segmentation fault (11)
[enlil:58255] Signal code:  (-6)
[enlil:58255] Failing at address: 0x3e80000e38f
[enlil:58255] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7cefb0a42520]
[enlil:58255] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a0840)[0x7cefb0ba0840]
[enlil:58255] [ 2] /opt/nvidia/hpc_sdk-v12.3/Linux_x86_64/24.1/compilers/lib/libnvf.so(+0x394f72)[0x7cefb2d94f72]
[enlil:58255] [ 3] /opt/nvidia/hpc_sdk-v12.3/Linux_x86_64/24.1/compilers/lib/libnvf.so(+0x394e8d)[0x7cefb2d94e8d]
[enlil:58255] [ 4] /opt/nvidia/hpc_sdk-v12.3/Linux_x86_64/24.1/compilers/lib/libnvf.so(+0x394e8d)[0x7cefb2d94e8d]
[enlil:58255] [ 5] /opt/nvidia/hpc_sdk-v12.3/Linux_x86_64/24.1/compilers/lib/libnvf.so(pgf90_copy_f77_argl_i8+0x21f)[0x7cefb2d9325f]
[enlil:58255] [ 6] exe/fundal_mpi_test[0x404fcc]
[enlil:58255] [ 7] exe/fundal_mpi_test[0x4025b1]
[enlil:58255] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7cefb0a29d90]
[enlil:58255] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7cefb0a29e40]
[enlil:58255] [10] exe/fundal_mpi_test[0x4024a5]
[enlil:58255] *** End of error message ***
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
==== backtrace (tid:  58256) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000001a0840 __nss_database_lookup()  ???:0
 2 0x0000000000394f72 local_copy_i8()  nvcoFNRQnfblGcF.ll:0
 3 0x0000000000394e8d local_copy_i8()  nvcoFNRQnfblGcF.ll:0
 4 0x0000000000394e8d local_copy_i8()  nvcoFNRQnfblGcF.ll:0
 5 0x000000000039325f pgf90_copy_f77_argl_i8()  ???:0
 6 0x0000000000404cee MAIN_()  /home/stefano/fortran/FUNDAL/src/tests/mpi/fundal_mpi_test.F90:169
 7 0x00000000004025b1 main()  ???:0
 8 0x0000000000029d90 __libc_init_first()  ???:0
 9 0x0000000000029e40 __libc_start_main()  ???:0
10 0x00000000004024a5 _start()  ???:0
=================================
[enlil:58256] *** Process received signal ***
[enlil:58256] Signal: Segmentation fault (11)
[enlil:58256] Signal code:  (-6)
[enlil:58256] Failing at address: 0x3e80000e390
[enlil:58256] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x742414242520]
[enlil:58256] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a0840)[0x7424143a0840]
[enlil:58256] [ 2] /opt/nvidia/hpc_sdk-v12.3/Linux_x86_64/24.1/compilers/lib/libnvf.so(+0x394f72)[0x742416594f72]
[enlil:58256] [ 3] /opt/nvidia/hpc_sdk-v12.3/Linux_x86_64/24.1/compilers/lib/libnvf.so(+0x394e8d)[0x742416594e8d]
[enlil:58256] [ 4] /opt/nvidia/hpc_sdk-v12.3/Linux_x86_64/24.1/compilers/lib/libnvf.so(+0x394e8d)[0x742416594e8d]
[enlil:58256] [ 5] /opt/nvidia/hpc_sdk-v12.3/Linux_x86_64/24.1/compilers/lib/libnvf.so(pgf90_copy_f77_argl_i8+0x21f)[0x74241659325f]
[enlil:58256] [ 6] exe/fundal_mpi_test[0x404cee]
[enlil:58256] [ 7] exe/fundal_mpi_test[0x4025b1]
[enlil:58256] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x742414229d90]
[enlil:58256] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x742414229e40]
[enlil:58256] [10] exe/fundal_mpi_test[0x4024a5]
[enlil:58256] *** End of error message ***

The combination of !$acc data deviceptr(a01,a11) and !$acc host_data use_device(a01,a11) is the only one that compiles and runs smoothly, but the MPI copy from a11 to a01 does not happen, so the result is wrong and the test fails.

Stefano

Yes, I’m seeing the same behavior. I’m not convinced that using “data deviceptr” and then a “host_data” region is the correct approach. I tried passing the address of the device pointer directly to MPI_SEND/IRECV via “c_loc(a01)”. It appears to me that the device address is getting passed, but the values still aren’t copied between ranks.

I’ve spent a few hours on this trying various things, but still no luck, and I need to work on other things. I’ll try to get back to it at some point, but again, as we see from your first post, this hybrid CUDA Fortran-style approach will bring complications.

Dear Mat,

thank you for your help and your time, it is appreciated.

Stefano