OpenACC not allocating memory on GPU

Hi,

I am trying to allocate memory on the GPU that persists between subroutine calls. My understanding is that

c$acc declare device_resident(a, b)

when placed in a module will ensure that a and b, when allocated, exist on the GPU for the duration of the program. The program below compiles fine, but fails at run-time with the following error:
FATAL ERROR: data in PRESENT clause was not found: name=b
file:/lcpscratch/patnaik/openACC/tests/test2.f init line:21
My best guess is that the allocation is happening on the CPU rather than on the GPU. I do not want a data directive listing all GPU variables in the main program; I want to isolate them in modules. Please help.

Regards, Gopal

c
compile with: pgfortran -acc -Minfo=accel test2.f
c

      module acc_data

      integer, parameter :: NX = 100000, NY = 1000
c$acc declare device_resident(a, b)
      real, allocatable, save, dimension(:,:) :: a, b
      real, allocatable, save, dimension(:,:) :: c
      
      contains

      subroutine init

      integer :: i, j

      allocate (a(NX,NY),b(NX,NY))
      allocate (c(NX,NY))

c$acc kernels loop present (a,b)
      do j = 1, NY
         do i = 1, NX
            a(i,j) = 1.3
            b(i,j) = 3.4
         end do
      end do

      return
      end subroutine init

      end module acc_data

      program test2

      use acc_data
      implicit none
      integer :: i, j

      call init

c$acc kernels loop present (a,b) copyout(c)
      do j = 1, NY
         do i = 1, NX
            c(i,j) = a(i,j)**b(i,j)
         end do
      end do
c$acc end kernels loop

      write(*,*)sum(c(1,:))
      stop
      end

Hi Gopal,

Unfortunately, there are a few dangling OpenACC features yet to be implemented, including device_resident; the others are host_data and last private. In the meantime, you’ll need to reorganise your code a bit to use data regions instead.

For example:

% cat test2.f90 
c
compile with: pgfortran -acc -Minfo=accel test2.f
c
      module acc_data

      integer, parameter :: NX = 100000, NY = 1000
      real, allocatable, dimension(:,:) :: a, b
      real, allocatable, dimension(:,:) :: c
cacc declare device_resident(a, b)
     
      contains

      subroutine alloc 

      integer :: i, j
      allocate (a(NX,NY),b(NX,NY))
      allocate (c(NX,NY))

      end subroutine alloc
      subroutine init 

      integer :: i, j
c$acc kernels loop present (a,b)
      do j = 1, NY
         do i = 1, NX
            a(i,j) = 1.3
            b(i,j) = 3.4
         end do
      end do

      return
      end subroutine init

      end module acc_data

      program test2

      use acc_data
      implicit none
      integer :: i, j

      call alloc
c$acc data create(A(NX,NY), b(NX,NY))
      call init

c$acc kernels loop present (a,b) copyout(c)
      do j = 1, NY
         do i = 1, NX
            c(i,j) = a(i,j)**b(i,j)
         end do
      end do

c$acc end data

      write(*,*)sum(c(1,:))
      stop
      end 
% pgf90 -acc test2.f90 -Mfixed -Minfo=accel
init:
     26, Generating present(b(:,:))
         Generating present(a(:,:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     27, Loop is parallelizable
     28, Loop is parallelizable
         Accelerator kernel generated
         27, !$acc loop gang ! blockidx%y
         28, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
             CC 1.0 : 12 registers; 96 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 14 registers; 0 shared, 112 constant, 0 local memory bytes
test2:
     46, Generating local(b(:100000,:1000))
         Generating local(a(:100000,:1000))
     49, Generating present(b(:,:))
         Generating present(a(:,:))
         Generating copyout(c(:,:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     50, Loop is parallelizable
     51, Loop is parallelizable
         Accelerator kernel generated
         50, !$acc loop gang ! blockidx%y
         51, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
             CC 1.0 : 16 registers; 136 shared, 112 constant, 0 local memory bytes
             CC 2.0 : 19 registers; 0 shared, 192 constant, 0 local memory bytes
% a.out
    2440.103    
Warning: ieee_inexact is signaling
FORTRAN STOP

Best Regards,
Mat

Mat,

Thanks, that is similar to a workaround I found. I was hoping not to have to explicitly list all the device variables in the main program, as the actual code will have hundreds. I guess I’ll wait for the next update.
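For readers hitting the same problem, here is a minimal sketch of one way to keep the device allocation inside the module, assuming a compiler that supports the OpenACC 2.0 unstructured data directives (enter data / exit data), which this thread does not cover. The names acc_data2, alloc, dealloc and test3 are hypothetical and chosen only for illustration.

```fortran
      module acc_data2
      implicit none
      integer, parameter :: NX = 1000, NY = 100
      real, allocatable, dimension(:,:) :: a, b

      contains

      subroutine alloc
c     Allocate on the host, then create device copies that stay
c     resident until the matching exit data (assumes OpenACC 2.0).
      allocate (a(NX,NY), b(NX,NY))
c$acc enter data create(a, b)
      end subroutine alloc

      subroutine init
      integer :: i, j
c     a and b are already on the device, so present() succeeds.
c$acc kernels loop present(a, b)
      do j = 1, NY
         do i = 1, NX
            a(i,j) = 1.3
            b(i,j) = 3.4
         end do
      end do
      end subroutine init

      subroutine dealloc
c     Release the device copies before freeing the host arrays.
c$acc exit data delete(a, b)
      deallocate (a, b)
      end subroutine dealloc

      end module acc_data2

      program test3
      use acc_data2
      implicit none
      call alloc
      call init
c     Copy a back to the host so the result can be checked there.
c$acc update host(a)
      write(*,*) a(1,1)
      call dealloc
      end program test3
```

With this layout the main program never names the device arrays in a data clause; each module owns the lifetime of its own variables. Built without -acc, the directives are treated as comments and the same code runs on the host.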

Also, in your example, it seems that arrays a and b are first allocated on the host. That is not strictly required, but it makes sense if the code is also meant to run on the host alone. I guess this is a good design practice?

Regards,
Gopal

> I guess this is a good design practice?

I think so. One of the points of using directives is so you can turn them off. You could probably insert some logic in the code so that it would work either way, but I don’t think it would be worth it.
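As a rough illustration of what turning the directives off looks like, the sketch below builds either with or without -acc. It assumes the source is run through the preprocessor (for example a .F suffix or the compiler's preprocessing flag) so that the standard _OPENACC macro can be tested; the program name either_way and the array a are hypothetical.

```fortran
      program either_way
      implicit none
      integer, parameter :: n = 1000
      real, allocatable, dimension(:) :: a
      integer :: i

c     The host allocation is needed in both builds.
      allocate (a(n))

c     _OPENACC is defined only when OpenACC compilation is enabled.
#ifdef _OPENACC
      write(*,*) 'built with OpenACC enabled'
#else
      write(*,*) 'built for the host only'
#endif

c     Without -acc this line is an ordinary comment and the loop
c     runs on the host unchanged.
c$acc kernels loop copyout(a)
      do i = 1, n
         a(i) = 2.0 * real(i)
      end do

      write(*,*) a(n)
      end program either_way
```

The loop body is identical in both builds, which is why keeping the host allocation costs little: the same source remains a valid host-only program.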

- Mat