How to use c_f_pointer in OpenACC

How to make the following code work with OpenACC?

    subroutine test_alloc_arr(mem_gpu)
    integer(kind=c_int8_t), dimension(:), pointer :: mem_gpu
    integer(kind=c_int8_t), dimension(:), pointer :: gpu_arr
    integer :: i,j
    type(c_devptr) :: mem_gpu_address

    !$acc declare device_resident(mem_gpu, gpu_arr, mem_gpu_address)
    allocate(mem_gpu(1024))

    !$acc data present(mem_gpu, gpu_arr, mem_gpu_address)
    !$acc parallel default(none) num_gangs(1) num_workers(1) vector_length(128)

    do i = 1, 128
        mem_gpu(i) = i
    end do
    ! Also tried with c_ptr and c_loc, but then compiler complains about c_f_pointer  
    ! not having acc routine clause
    
    mem_gpu_address = c_devloc(mem_gpu(11)) ! This works
    call c_f_pointer(mem_gpu_address, gpu_arr, shape=[10])  ! But this line does not compile!

    !$acc end parallel
    !$acc update host(mem_gpu, gpu_arr)
    !$acc end data

    print *,mem_gpu(11:20)
    print *,gpu_arr

end subroutine test_alloc_arr
This results in: NVFORTRAN-S-0074-Illegal number or type of arguments to c_f_pointer - keyword argument fptr

Compiled with NVFORTRAN from HPC SDK 20.11

Hi FrostyMike,

I’ve not seen anyone attempt to call c_f_pointer from device code before, but suspect the issue is that c_f_pointer would be expecting a host pointer but in this context, gpu_arr would have a device attribute.

Can you describe what you’re trying to accomplish with this code? It appears to me that you’re trying to associate pointer with a device sub-array, which there’s simpler ways to do.

Also, by using “device_resident” are you purposefully wanting these arrays to be device only?

-Mat

Hi Mat,

I am trying to create a sort of memory pool. I.e. first to allocate large region of memory (as 1d integer array) and then place a large number of other objects inside , so that I don’t need to allocate/deallocate them all the time, which is currently very costly in terms of performance. I’ve done similar things with CUDA and C/C++ in the past, but my current toolset is strictly limited to Fortran 2003 and OpenACC only. Unfortunately Fortran is not friendly towards such concepts, so it is proven more difficult.

Also, by using “device_resident” are you purposefully wanting these arrays to be device only?

That is the crux of the matter. I got it to work with combination of c_loc and c_f_pointer, but only when the arrays have a “mirror image” in host memory. I however only use GPUs for computation, so I don’t need the host memory version. If I allocate 4gb memory pool on GPU, it costs me additional 4gb of RAM, at no benefit, so I am looking for ways to get rid of it.

I am trying to create a sort of memory pool.

Got it. Yes, it can be a bit more challenging to do this Fortran. But if you don’t mind using a bit of CUDA Fortran, it should be doable.

I wrote a very rudimentary example below on how I’d approach the problem. Basically use a CUDA Fortran “device” array to hold the memory pool, and then use “acc_map_data” to associate a host array to a portion of the pool. Technically “acc_map_data” is a C/C++ only function in the OpenACC standard, but we extended it to work in Fortran and device pointers. Much easier than having to jump through hoops getting the C pointers.

% cat pool.f90

module devpool

  use cudafor
  use openacc
  use iso_c_binding

  integer(kind=c_int8_t), dimension(:), allocatable, device :: pool
  integer :: cnt, maxcnt

contains

  subroutine initpool(sze)
     integer :: sze
     maxcnt = sze
     cnt = 1
     allocate(pool(maxcnt))
  end subroutine initpool

  subroutine register_real(arr,sze)
     real :: arr(*)
     integer :: sze
     if ( (cnt+sze).gt.maxcnt) then
        print *, "Error: pool not large enough to meet request."
        stop
     end if
     call acc_map_data(arr,pool(cnt),sze)
     cnt=cnt+sze
   end subroutine

end module devpool

program foo
   use devpool
   real, dimension(:), allocatable :: a1, a2
   integer :: sze, i

   sze = 1024
   call initpool(sze*16)
   allocate(a1(sze))
   allocate(a2(sze))
   call register_real(a1,sze*4)
   call register_real(a2,sze*4)

!$acc parallel loop present(a1)
   do i=1,sze
      a1(i) = 1. / real(i)
   end do

!$acc parallel loop present(a1,a2)
   do i=1,sze
      a2(i) = a1(i) * real(i)
   end do

!$acc update self(a2)
    print *, a2(1:10)

end program foo

% nvfortran -acc -cuda pool.f90; a.out
    1.000000        1.000000        1.000000        1.000000
    1.000000        1.000000        1.000000        1.000000
    1.000000        1.000000

I like your example, it is much cleaner then my own. However, unfortunately it suffers from 2 problems:

  1. To give array pointer shape, you are forced to allocate it on cpu first.

    allocate(a1(sze))
    allocate(a2(sze))
  2. allocate() does not use the created pool, so CPU allocation not only adds memory overhead, but causes an expensive call to malloc().
    This is not a big problem though, because it can be fixed by using c_f_pointer().

Unfortunately I didn’t find any other way to give the pointer shape except either allocate() (which is a no go) or c_f_pointer (only works with CPU memory, so again it has to be allocated).

Do you think there is a chance that Nvidia/PGI will add support for c_f_pointer() in OpenACC/Cuda device code anytime soon?

Do you think there is a chance that Nvidia/PGI will add support for c_f_pointer() in OpenACC/Cuda device code anytime soon?

I can put in an RFE, but no promise on when or if it’s something we’d add.

Still seems to me that there’s simpler ways to accomplish this. Would pointer assignment of device pointers work for you?

% cat pool2.f90

module devpool

  use cudafor
  use openacc
  use iso_c_binding

  real, dimension(:), allocatable, device, target :: pool
  integer :: cnt, maxcnt

contains

  subroutine initpool(sze)
     integer :: sze
     maxcnt = sze
     cnt = 1
     allocate(pool(maxcnt))
  end subroutine initpool

  subroutine register_real_dev(arr,sze)
     real, device, pointer :: arr(:)
     integer :: sze, bytes
     if ( (cnt+sze).gt.maxcnt) then
        print *, "Error: pool not large enough to meet request."
        stop
     end if
     arr => pool(cnt:sze)
     cnt=cnt+sze
   end subroutine

end module devpool

program foo
   use devpool
   real, dimension(:), device, pointer :: a1
   real, dimension(:), allocatable :: a2
   integer :: sze, i

   sze = 1024
   call initpool(sze*16)
   allocate(a2(sze))
   call register_real_dev(a1,sze)

!$acc parallel loop deviceptr(a1)
   do i=1,sze
      a1(i) = 1. / real(i)
   end do

!$acc parallel loop copyout(a2) deviceptr(a1)
   do i=1,sze
      a2(i) = a1(i) * real(i)
   end do

    print *, a2(1:10)

end program foo

% nvfortran -acc -cuda pool2.f90; a.out
    1.000000        1.000000        1.000000        1.000000
    1.000000        1.000000        1.000000        1.000000
    1.000000        1.000000