OpenACC: cuStreamSynchronize crash when using pointers as parameters

I’m having trouble creating an accelerated routine for a user-defined function, MAX_VAL, that takes a logical array argument. Here is a rather contrived program, cut down from a larger one:

module GLOBAL
real(8), allocatable,target :: A(:)
logical,allocatable, target :: L(:)
end module GLOBAL

program MAIN
use GLOBAL
implicit none
real(8), pointer :: pA(:)
logical, pointer :: pL(:)
real(8) :: MAX
integer:: i

allocate(A(20),L(20))
pA  => A
pL  => L

!- Initialization of A,L
!$acc parallel loop default(present)
do i=1,20
pA(i) = i * 1.0
pL(i) = .TRUE.
end do
!$acc update host(A,L)

!- Call the accelerated routine from within a compute construct. (This is where it fails.)
!$acc serial copyout(MAX)
MAX=MAX_VAL(pA,mask=pL)
!$acc end serial
write(*,*) "MAX",MAX,maxval(A,mask=L)
write(*,*) ""

 contains

 !- User defined function of intrinsic MAXVAL, with a logical array parameter
 REAL(8) function MAX_VAL(list, mask) result(res)
 !$acc routine
   implicit none
   REAL(8), intent(in) :: list(:)
    logical,intent(in)   :: mask(:)
   REAL(8) :: res
   integer   :: i

   res = -1.0
   do i=1,size(list)
     if (mask(i).AND.(res.LT.list(i))) res = list(i)
   end do
 end function MAX_VAL

end program MAIN

I’ve been working on variants of the above. It crashes on the line that calls MAX_VAL as an accelerated routine. It appears to work when I call MAX_VAL(A,L) instead, but not when I pass pointers as the actual arguments.

The runtime error is:
Failing in Thread: 1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Any takers? I’m using 20.9.

Hi Richard,

The problem here is that in Fortran, when passing a sliced, non-contiguous assumed-shape array to a subroutine, the compiler must first create a contiguous temp array, copy the original array into the temp, and then pass in the temp. Even though we can see that pA and pL are contiguous, a pointer can point to a non-contiguous slice of an array, so the compiler must assume they are non-contiguous. It is this temp array that’s causing the error.

The workaround is to pass in the arrays as assumed-size, together with an explicit size argument, as shown below.

On a side note, we discourage users from using contained device subroutines. While it’s fine here, a contained subroutine is passed a hidden argument pointing to the parent’s stack so that the child can access the parent’s variables. Since that stack address is on the host, this would cause errors on the device if the child accessed one of the parent’s variables.

% cat test.F90
module GLOBAL
real(8), allocatable,target :: A(:)
logical,allocatable, target :: L(:)
end module GLOBAL

program MAIN
use GLOBAL
use openacc
implicit none
real(8), pointer :: pA(:)
logical, pointer :: pL(:)
real(8) :: MAX
integer:: i

allocate(A(20),L(20))
!$acc enter data create(A(:20),L(:20))
pA  => A
pL  => L

!- Initialization of A,L
!$acc parallel loop default(present)
do i=1,20
pA(i) = i * 1.0
pL(i) = .TRUE.
end do
!$acc update host(A,L)

!- Call the accelerated routine from within a compute construct. (This is where it fails.)
!$acc serial copyout(MAX) default(present)
MAX=MAX_VAL(20,pA,mask=pL)
!$acc end serial
write(*,*) "MAX",MAX,maxval(A,mask=L)
write(*,*) ""

contains

 !- User defined function of intrinsic MAXVAL, with a logical array parameter
 REAL(8) function MAX_VAL(sze,list, mask) result(res)
 !$acc routine
   implicit none
   integer, value :: sze
   logical :: mask(*)
   REAL(8) :: list(*)
   REAL(8) :: res
   integer   :: i
   res = -1.0
   do i=1,sze
     if (mask(i).AND.(res.LT.list(i))) res = list(i)
   end do
 end function MAX_VAL

end program MAIN
% nvfortran test.F90 -acc -V20.9 -Minfo=accel ; a.out
main:
     16, Generating enter data create(a(:20),l(:20))
     21, Generating Tesla code
         22, !$acc loop gang, vector(20) ! blockidx%x threadidx%x
     21, Generating default present(pa(1:20),pl(1:20))
     26, Generating update self(a(:),l(:))
     29, Generating implicit copy(.S0000) [if not already present]
         Accelerator serial kernel generated
         Generating Tesla code
         Generating implicit copy(pl,pa) [if not already present]
         Generating default present(pl(:),pa(:))
         Generating copyout(max) [if not already present]
max_val:
     38, Generating acc routine seq
         Generating Tesla code
 MAX    20.00000000000000         20.00000000000000
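
Regarding the side note on contained routines, here is a minimal sketch of one way to avoid them: put MAX_VAL in a module so it has no parent scope to reach into. The module name max_val_mod, the program name demo, and the variable mx are made up for illustration, and the data clauses are just one reasonable choice. I have only reasoned through this as host code, so treat the device behavior as an assumption.

```fortran
! Sketch only: MAX_VAL as a module procedure instead of a contained one.
! max_val_mod, demo and mx are hypothetical names, not from the original code.
module max_val_mod
  implicit none
contains
  real(8) function MAX_VAL(sze, list, mask) result(res)
    !$acc routine seq
    integer, value :: sze
    real(8), intent(in) :: list(*)   ! assumed-size, as in the workaround
    logical, intent(in) :: mask(*)
    integer :: i
    res = -1.0d0
    do i = 1, sze
      if (mask(i) .and. (res < list(i))) res = list(i)
    end do
  end function MAX_VAL
end module max_val_mod

program demo
  use max_val_mod
  implicit none
  real(8) :: A(20), mx
  logical :: L(20)
  integer :: i

  ! Fill the inputs on the host to keep the sketch short.
  do i = 1, 20
    A(i) = i * 1.0d0
    L(i) = .true.
  end do

  ! Explicit data clauses: inputs copied in, scalar result copied out.
  !$acc serial copyin(A, L) copyout(mx)
  mx = MAX_VAL(20, A, L)
  !$acc end serial

  ! Both values should agree (20 for this input).
  write(*,*) "MAX", mx, maxval(A, mask=L)
end program demo
```

Without -acc the directives are treated as comments, so the same file can be checked on the host first.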

Hope this helps,
Mat

It helps, but it also raises other issues. The above example was supposed to be a cut-down version of a larger program that is having problems, but apparently I did not recreate the issue correctly.

After making the modifications, the execution of the device routine produces a “FATAL ERROR: NON-STRIDE-ONE ARGUMENT”. The difference is that A and L are allocatable arrays defined in a module. And yes, MAX_VAL is a contained device subroutine.

I think I understand how to fix the contained device routine (Fortran isn’t my best language), but I am unclear what the other issue is about. “A” and “L” are definitely in device memory, but I’m using an alias to reference them (i.e., pA => A). The call to the device routine is …MAX_VAL(pA,pL,length), as above.

Any ideas?

Did you mean that A and L are multi-dimensional arrays? This error would occur if pA and pL are pointing to non-contiguous array slices. Something like:

pA  => A(1,:)
pL  => L(1,:)

Unfortunately, this would put us back to square one: in order to pass in pA and pL, the compiler would need to go back to creating contiguous temp arrays, copying pA and pL into them, and then passing the temp arrays to the subroutine. This is problematic on the device given the discrete memories, and it is also a performance issue on the host due to the overhead of the temp arrays. In general, you should avoid non-contiguous pointer slices.

Are you able to modify your code so that pA and pL point to contiguous slices?

pA  => A(:,1)
pL  => L(:,1)
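
If it is not obvious which slices are contiguous, one option is to check on the host with the Fortran 2008 intrinsic is_contiguous. This is only a sketch, assuming your compiler supports that intrinsic; the program name check_contig and the pointer names pRow and pCol are made up for illustration.

```fortran
! Sketch only: host-side check of which slices are contiguous.
! check_contig, pRow and pCol are hypothetical names.
program check_contig
  implicit none
  real(8), allocatable, target :: A(:,:)
  real(8), pointer :: pRow(:), pCol(:)

  allocate(A(20,20))

  ! Fortran arrays are column-major: the first index varies fastest.
  pRow => A(1,:)   ! one element from each column, 20 elements apart
  pCol => A(:,1)   ! the first column, 20 consecutive elements

  write(*,*) "A(1,:) contiguous? ", is_contiguous(pRow)   ! expected F
  write(*,*) "A(:,1) contiguous? ", is_contiguous(pCol)   ! expected T
end program check_contig
```

A check like this could go right after the pointer assignments in the larger program to confirm what is actually being passed to the device routine.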

Example of the failure when the pointers point to non-contiguous slices:

% cat test2.F90
module GLOBAL
real(8), allocatable,target :: A(:,:)
logical,allocatable, target :: L(:,:)
end module GLOBAL

program MAIN
use GLOBAL
use openacc
implicit none
real(8), pointer :: pA(:)
logical, pointer :: pL(:)
real(8) :: MAX
integer:: i

allocate(A(20,20),L(20,20))
!$acc enter data create(A(:20,:20),L(:20,:20))
pA  => A(1,:)
pL  => L(1,:)

!- Initialization of A,L
!$acc parallel loop default(present)
do i=1,20
pA(i) = i * 1.0
pL(i) = .TRUE.
end do
!$acc update host(A,L)

!- Call the accelerated routine from within a compute construct. (This is where it fails.)
!$acc serial copyout(MAX) default(present)
MAX=MAX_VAL(20,pA,mask=pL)
!$acc end serial
write(*,*) "MAX",MAX,maxval(A(1,:),mask=L(1,:))
write(*,*) ""

contains

 !- User defined function of intrinsic MAXVAL, with a logical array parameter
 REAL(8) function MAX_VAL(sze,list, mask) result(res)
 !$acc routine
   implicit none
   integer, value :: sze
   logical :: mask(*)
   REAL(8) :: list(*)
   REAL(8) :: res
   integer   :: i
   res = -1.0
   do i=1,sze
     if (mask(i).AND.(res.LT.list(i))) res = list(i)
   end do
 end function MAX_VAL

end program MAIN
% nvfortran test2.F90 -acc ; a.out
test2.F90:30 - main: FATAL ERROR: NON-CONTIGUOUS ARGUMENT
Failing in Thread:1
call to cuStreamSynchronize returned error 719: Launch failed (often invalid pointer dereference)

Modified version with pointers pointing at contiguous slices:

% cat test3.F90
module GLOBAL
real(8), allocatable,target :: A(:,:)
logical,allocatable, target :: L(:,:)
end module GLOBAL

program MAIN
use GLOBAL
use openacc
implicit none
real(8), pointer :: pA(:)
logical, pointer :: pL(:)
real(8) :: MAX
integer:: i

allocate(A(20,20),L(20,20))
!$acc enter data create(A(:20,:20),L(:20,:20))
pA  => A(:,1)
pL  => L(:,1)

!- Initialization of A,L
!$acc parallel loop default(present)
do i=1,20
pA(i) = i * 1.0
pL(i) = .TRUE.
end do
!$acc update host(A,L)

!- Call the accelerated routine from within a compute construct. (This is where it fails.)
!$acc serial copyout(MAX) default(present)
MAX=MAX_VAL(20,pA,mask=pL)
!$acc end serial
write(*,*) "MAX",MAX,maxval(A(:,1),mask=L(:,1))
write(*,*) ""

contains

 !- User defined function of intrinsic MAXVAL, with a logical array parameter
 REAL(8) function MAX_VAL(sze,list, mask) result(res)
 !$acc routine
   implicit none
   integer, value :: sze
   logical :: mask(*)
   REAL(8) :: list(*)
   REAL(8) :: res
   integer   :: i
   res = -1.0
   do i=1,sze
     if (mask(i).AND.(res.LT.list(i))) res = list(i)
   end do
 end function MAX_VAL

end program MAIN
% nvfortran test3.F90 -acc ; a.out
 MAX    20.00000000000000         20.00000000000000

This is helpful. Seems I’ve got more to learn.

Thanks Mat.