acc_malloc() in Fortran to avoid host allocation

Hi,

I am trying to use acc_malloc() in a Fortran code to avoid host allocations for scratch variables on the GPU. So far my attempts have failed. Here is my latest attempt:

real(ESMF_KIND_R8), allocatable:: copyArray(:,:,:)
type(c_devptr)  :: dev_copyArray
...
dev_copyArray = acc_malloc(size)
call acc_map_data(copyArray, dev_copyArray, size)
...
!$acc data present(copyArray)
!$acc kernels
...
!$acc end kernels
!$acc end data

This fails with

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

However, when I explicitly allocate copyArray on the host side, the same code works, but then of course it is no different from using the create data clause.

Is it possible to avoid host allocation for GPU scratch arrays?

-Gerhard

Hi Gerhard,

Can you post a more complete example? This should work as expected (see my example below), so I suspect the error is coming from something in the omitted code, such as using “dev_copyArray” in the kernels region.

Here is an example:

% cat test.F90

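program foo
use iso_c_binding
use cudafor
use openacc
real(8), allocatable:: copyArray(:,:,:)
type(c_devptr)  :: dev_copyArray
integer :: nx,ny,nz,size,i,j,k

nx=32
ny=32
nz=32
! size in bytes: 8 bytes per real(8) element
size = nx*ny*nz*8
! host allocation: this is the host address that acc_map_data maps
allocate(copyArray(nx,ny,nz))
! allocate device memory and associate it with the host array
dev_copyArray = acc_malloc(size)
call acc_map_data(copyArray, dev_copyArray, size)
! copyArray is now present on the device through the mapping
!$acc data present(copyArray)
!$acc kernels
do k=1,nz
   do j=1,ny
      do i=1,nx
         copyArray(i,j,k)=1.0
      enddo
   enddo
enddo
!$acc end kernels
! copy the device values back into the host array
!$acc update self(copyArray)
!$acc end data
print *, copyArray(1:2,1,1)

end program foo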
% nvfortran -acc test.F90 -cuda -V22.3; a.out
    1.000000000000000         1.000000000000000

Also, since you’re using CUDA Fortran features anyway, you should be able to simplify things by making “dev_copyArray” a device array, especially if you’re using “dev_copyArray” in the kernels region.

% cat test.cuf


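program foo
use iso_c_binding
use cudafor
use openacc
real(8), allocatable:: copyArray(:,:,:)
! the device attribute places this array in GPU memory
real(8), allocatable, device :: dev_copyArray(:,:,:)
integer :: nx,ny,nz,i,j,k

nx=32
ny=32
nz=32
! the host array is only used here to copy the result back for printing
allocate(copyArray(nx,ny,nz))
allocate(dev_copyArray(nx,ny,nz))
! the device array is used directly in the compute region, no data clause
!$acc kernels
do k=1,nz
   do j=1,ny
      do i=1,nx
         dev_copyArray(i,j,k)=1.0
      enddo
   enddo
enddo
!$acc end kernels
! assignment from a device array to a host array copies device to host
copyArray=dev_copyArray
print *, copyArray(1:2,1,1)

end program foo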
% nvfortran -acc test.cuf -cuda -V22.3 ; a.out
    1.000000000000000         1.000000000000000
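
For a pure scratch array, the host copy used for printing in the example above is not strictly needed. The following is a minimal sketch, not a tested recipe, that keeps the scratch data on the device only and brings back a single scalar through an OpenACC reduction. It assumes the same "nvfortran -acc -cuda" setup as above; the names "scratch" and "total" are hypothetical and chosen only for illustration.

```fortran
program scratch_device_only
  use cudafor
  use openacc
  implicit none
  ! Device-only scratch array: there is no host counterpart at all.
  real(8), allocatable, device :: scratch(:,:,:)
  real(8) :: total
  integer :: nx, ny, nz, i, j, k

  nx = 32
  ny = 32
  nz = 32
  ! Allocates GPU memory only, because of the device attribute.
  allocate(scratch(nx,ny,nz))

  ! Fill the scratch array on the device.
  !$acc kernels
  do k = 1, nz
     do j = 1, ny
        do i = 1, nx
           scratch(i,j,k) = 1.0d0
        enddo
     enddo
  enddo
  !$acc end kernels

  ! Reduce on the device; only the scalar result returns to the host.
  total = 0.0d0
  !$acc parallel loop collapse(3) reduction(+:total)
  do k = 1, nz
     do j = 1, ny
        do i = 1, nx
           total = total + scratch(i,j,k)
        enddo
     enddo
  enddo

  ! With every element set to 1, total should equal nx*ny*nz.
  print *, total
  deallocate(scratch)
end program scratch_device_only
```

The trade-off is that the code then depends on the CUDA Fortran "device" attribute, so it is no longer portable OpenACC.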

-Mat

Hi Mat,

Thank you for your help! Sorry for taking some time to get back to this.

So from your first example, it looks like I must allocate the array on the host before I can call acc_map_data(). I had thought I could avoid the host allocation in that case. From your second example, I understand that what I was trying to do is possible with a “device array”. That is good info. Thank you.

I am exploring several options here, and have a new question about managed memory. I will start a new thread for that. Thanks again!

-Gerhard