How to make the following code work with OpenACC?
subroutine test_alloc_arr(mem_gpu)
integer(kind=c_int8_t), dimension(:), pointer :: mem_gpu
integer(kind=c_int8_t), dimension(:), pointer :: gpu_arr
integer :: i,j
type(c_devptr) :: mem_gpu_address
!$acc declare device_resident(mem_gpu, gpu_arr, mem_gpu_address)
allocate(mem_gpu(1024))
!$acc data present(mem_gpu, gpu_arr, mem_gpu_address)
!$acc parallel default(none) num_gangs(1) num_workers(1) vector_length(128)
do i = 1, 128
mem_gpu(i) = i
end do
! Also tried with c_ptr and c_loc, but then compiler complains about c_f_pointer
! not having acc routine clause
mem_gpu_address = c_devloc(mem_gpu(11)) ! This works
call c_f_pointer(mem_gpu_address, gpu_arr, shape=[10]) ! But this line does not compile!
!$acc end parallel
!$acc update host(mem_gpu, gpu_arr)
!$acc end data
print *,mem_gpu(11:20)
print *,gpu_arr
end subroutine test_alloc_arr
This results in:
NVFORTRAN-S-0074-Illegal number or type of arguments to c_f_pointer - keyword argument fptr
Compiled with NVFORTRAN from HPC SDK 20.11
Hi FrostyMike,
I’ve not seen anyone attempt to call c_f_pointer from device code before, but suspect the issue is that c_f_pointer would be expecting a host pointer but in this context, gpu_arr would have a device attribute.
Can you describe what you’re trying to accomplish with this code? It appears to me that you’re trying to associate pointer with a device sub-array, which there’s simpler ways to do.
Also, by using “device_resident” are you purposefully wanting these arrays to be device only?
-Mat
Hi Mat,
I am trying to create a sort of memory pool. I.e. first to allocate large region of memory (as 1d integer array) and then place a large number of other objects inside , so that I don’t need to allocate/deallocate them all the time, which is currently very costly in terms of performance. I’ve done similar things with CUDA and C/C++ in the past, but my current toolset is strictly limited to Fortran 2003 and OpenACC only. Unfortunately Fortran is not friendly towards such concepts, so it is proven more difficult.
Also, by using “device_resident” are you purposefully wanting these arrays to be device only?
That is the crux of the matter. I got it to work with combination of c_loc and c_f_pointer, but only when the arrays have a “mirror image” in host memory. I however only use GPUs for computation, so I don’t need the host memory version. If I allocate 4gb memory pool on GPU, it costs me additional 4gb of RAM, at no benefit, so I am looking for ways to get rid of it.
I am trying to create a sort of memory pool.
Got it. Yes, it can be a bit more challenging to do this Fortran. But if you don’t mind using a bit of CUDA Fortran, it should be doable.
I wrote a very rudimentary example below on how I’d approach the problem. Basically use a CUDA Fortran “device” array to hold the memory pool, and then use “acc_map_data” to associate a host array to a portion of the pool. Technically “acc_map_data” is a C/C++ only function in the OpenACC standard, but we extended it to work in Fortran and device pointers. Much easier than having to jump through hoops getting the C pointers.
% cat pool.f90
module devpool
use cudafor
use openacc
use iso_c_binding
integer(kind=c_int8_t), dimension(:), allocatable, device :: pool
integer :: cnt, maxcnt
contains
subroutine initpool(sze)
integer :: sze
maxcnt = sze
cnt = 1
allocate(pool(maxcnt))
end subroutine initpool
subroutine register_real(arr,sze)
real :: arr(*)
integer :: sze
if ( (cnt+sze).gt.maxcnt) then
print *, "Error: pool not large enough to meet request."
stop
end if
call acc_map_data(arr,pool(cnt),sze)
cnt=cnt+sze
end subroutine
end module devpool
program foo
use devpool
real, dimension(:), allocatable :: a1, a2
integer :: sze, i
sze = 1024
call initpool(sze*16)
allocate(a1(sze))
allocate(a2(sze))
call register_real(a1,sze*4)
call register_real(a2,sze*4)
!$acc parallel loop present(a1)
do i=1,sze
a1(i) = 1. / real(i)
end do
!$acc parallel loop present(a1,a2)
do i=1,sze
a2(i) = a1(i) * real(i)
end do
!$acc update self(a2)
print *, a2(1:10)
end program foo
% nvfortran -acc -cuda pool.f90; a.out
1.000000 1.000000 1.000000 1.000000
1.000000 1.000000 1.000000 1.000000
1.000000 1.000000
I like your example, it is much cleaner then my own. However, unfortunately it suffers from 2 problems:
- To give array pointer shape, you are forced to allocate it on cpu first.
allocate(a1(sze))
allocate(a2(sze))
- allocate() does not use the created pool, so CPU allocation not only adds memory overhead, but causes an expensive call to malloc().
This is not a big problem though, because it can be fixed by using c_f_pointer().
Unfortunately I didn’t find any other way to give the pointer shape except either allocate() (which is a no go) or c_f_pointer (only works with CPU memory, so again it has to be allocated).
Do you think there is a chance that Nvidia/PGI will add support for c_f_pointer() in OpenACC/Cuda device code anytime soon?
Do you think there is a chance that Nvidia/PGI will add support for c_f_pointer() in OpenACC/Cuda device code anytime soon?
I can put in an RFE, but no promise on when or if it’s something we’d add.
Still seems to me that there’s simpler ways to accomplish this. Would pointer assignment of device pointers work for you?
% cat pool2.f90
module devpool
use cudafor
use openacc
use iso_c_binding
real, dimension(:), allocatable, device, target :: pool
integer :: cnt, maxcnt
contains
subroutine initpool(sze)
integer :: sze
maxcnt = sze
cnt = 1
allocate(pool(maxcnt))
end subroutine initpool
subroutine register_real_dev(arr,sze)
real, device, pointer :: arr(:)
integer :: sze, bytes
if ( (cnt+sze).gt.maxcnt) then
print *, "Error: pool not large enough to meet request."
stop
end if
arr => pool(cnt:sze)
cnt=cnt+sze
end subroutine
end module devpool
program foo
use devpool
real, dimension(:), device, pointer :: a1
real, dimension(:), allocatable :: a2
integer :: sze, i
sze = 1024
call initpool(sze*16)
allocate(a2(sze))
call register_real_dev(a1,sze)
!$acc parallel loop deviceptr(a1)
do i=1,sze
a1(i) = 1. / real(i)
end do
!$acc parallel loop copyout(a2) deviceptr(a1)
do i=1,sze
a2(i) = a1(i) * real(i)
end do
print *, a2(1:10)
end program foo
% nvfortran -acc -cuda pool2.f90; a.out
1.000000 1.000000 1.000000 1.000000
1.000000 1.000000 1.000000 1.000000
1.000000 1.000000