Fortran allocatable array creation and use only on the GPU

Hello,
I'm not able to solve this problem: allocating and using Fortran allocatable arrays only on the GPU (so no pre-allocation on the host). This example mimics a more complicated application.

      program vector
!
      implicit none
      integer  ::  n, i
      real,  allocatable, dimension (:)  ::  u,p
!
      read (5,*) n
      allocate (u(n) )     ! , p(n) )
      do i = 1, n
         u(i) = -1.0
      end do
!
!$ACC ENTER DATA COPYIN(u(1:n) ) CREATE(p(1:n) )
!
!$ACC KERNELS LOOP PRESENT (p)
      do i = 1, n
         p(i) = 10.0
      end do
!$ACC END KERNELS LOOP
!
!$ACC KERNELS LOOP PRESENT (u,p)
      do i = 1, n
         u(i) = u(i) + p(i)
      end do
!$ACC END KERNELS LOOP
!
!$ACC EXIT DATA COPYOUT(u) DELETE(p)
!
      write (6,*) (u(i),i=1,n)
!
      stop
      end program vector

I use PGI Fortran v. 19.1. The compilation sequence is:

 pgf90 -acc -O2 -g -Minfo forum.f -o vector.out 
vector:
      9, Memory set idiom, loop replaced by call to __c_mset4
     13, Generating enter data create(p(1:n))
         Generating enter data copyin(u(1:n))
     15, Generating present(p(:))
     16, Loop is parallelizable
         Generating Tesla code
         16, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     16, Memory set idiom, loop replaced by call to __c_mset4
     21, Generating present(u(:),p(:))
     22, Loop is parallelizable
         Generating Tesla code
         22, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     22, Generated vector simd code for the loop
     27, Generating exit data copyout(u(:))
         Generating exit data delete(p(:))

The error at runtime :

echo 10 | ./vector.out 
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Failing in Thread:1
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

If I also allocate array p on the host (line 8), everything’s fine.
Could someone explain to me what’s wrong in this example?

Regards,
Guy.

Hi Guy,

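This is expected, since data regions expect mirrored copies of variables on both the host and the device. Here “p” is never allocated on the host, so there is no host array for the device copy of “p” to mirror.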
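For comparison, here is a minimal sketch of the mirrored layout that the data directives expect, which corresponds to what you saw when you also allocated “p” on the host. The program name “vector_mirrored” and the merged loop are only for illustration; they are not taken from your application.

```fortran
      program vector_mirrored
      implicit none
      integer :: n, i
      real, allocatable, dimension(:) :: u, p

      read (5,*) n
      ! Allocate both arrays on the host so each device copy has a mirror.
      allocate (u(n), p(n))
      u = -1.0

      ! u is copied to the device; p is only created there (no transfer).
!$acc enter data copyin(u(1:n)) create(p(1:n))

!$acc kernels loop present(u, p)
      do i = 1, n
         p(i) = 10.0
         u(i) = u(i) + p(i)
      end do

      ! Bring u back to the host and release the device copy of p.
!$acc exit data copyout(u(1:n)) delete(p(1:n))

      write (6,*) (u(i), i=1,n)
      deallocate (u, p)
      end program vector_mirrored
```

The cost of this layout is that the host copy of “p” takes up memory without ever being used, which is what the device-only approaches avoid.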

There are a couple of ways to create a device-only copy of the data.

  1. Use CUDA Fortran “device” attribute

Here, add the CUDA Fortran device attribute to the declaration of “p”. When “p” is allocated, only a device copy will be created.

Note that I use a macro so the code will also compile and run correctly for CPU-only compilation.

% cat test.F90
#ifdef _CUDA
  #define DEVICE ,device
#else
  #define DEVICE
#endif
      program vector
!
      implicit none
      integer  ::  n, i
      real,  allocatable, dimension (:)  ::  u
      real,  allocatable, dimension (:) DEVICE  ::  p
!
      read (5,*) n
      allocate (u(n))
      allocate (p(n))
      do i = 1, n
         u(i) = -1.0
      end do
!
!$ACC ENTER DATA COPYIN(u(1:n) )
!
!$ACC KERNELS LOOP PRESENT (p)
      do i = 1, n
         p(i) = 10.0
      end do
!$ACC END KERNELS LOOP
!
!$ACC KERNELS LOOP PRESENT (u,p)
      do i = 1, n
         u(i) = u(i) + p(i)
      end do
!$ACC END KERNELS LOOP
!
!$ACC EXIT DATA COPYOUT(u)
!
      write (6,*) (u(i),i=1,n)
      deallocate(u)
      deallocate(p)
!
      stop
      end program vector
% pgfortran test.F90 -Mcuda  -ta=tesla -Minfo=accel
vector:
     20, Generating enter data copyin(u(1:n))
     23, Loop is parallelizable
         Generating Tesla code
         23, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     28, Generating present(u(:))
     29, Loop is parallelizable
         Generating Tesla code
         29, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     34, Generating exit data copyout(u(:))
% a.out
64
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
Warning: ieee_inexact is signaling
FORTRAN STOP

  2. Use “declare device_resident”

The “device_resident” clause is the pure OpenACC method to state that the array should only be allocated on the device. You still need to allocate “p”.

% cat test.2.F90
      program vector
!
      implicit none
      integer  ::  n, i
      real,  allocatable, dimension (:)  ::  u
      real,  allocatable, dimension (:)  ::  p
!$acc declare device_resident(p)

      read (5,*) n
      allocate (u(n))
      allocate (p(n))
      do i = 1, n
         u(i) = -1.0
      end do
!
!$ACC ENTER DATA COPYIN(u(1:n) )
!
!$ACC KERNELS LOOP PRESENT (p)
      do i = 1, n
         p(i) = 10.0
      end do
!$ACC END KERNELS LOOP
!
!$ACC KERNELS LOOP PRESENT (u,p)
      do i = 1, n
         u(i) = u(i) + p(i)
      end do
!$ACC END KERNELS LOOP
!
!$ACC EXIT DATA COPYOUT(u)
!
      write (6,*) (u(i),i=1,n)
      deallocate(u)
      deallocate(p)
!
      stop
      end program vector
% pgfortran test.2.F90 -ta=tesla -Minfo=accel
vector:
     16, Generating enter data copyin(u(1:n))
     19, Loop is parallelizable
         Generating Tesla code
         19, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     24, Generating present(u(:))
     25, Loop is parallelizable
         Generating Tesla code
         25, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     30, Generating exit data copyout(u(:))
% a.out
64
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
    9.000000        9.000000        9.000000        9.000000
Warning: ieee_inexact is signaling
FORTRAN STOP

Hope this helps,
Mat