Failure when accelerating a Fortran DO loop with a variable increment

I have a Fortran subroutine containing an accelerated DO loop with a variable increment, something like:

SUBROUTINE FOO( A, N, ARR )

INTEGER, INTENT(IN) :: A, N
INTEGER, DIMENSION(N) :: ARR

INTEGER :: I, INC

INC = 1
IF (A.LT.10) INC = 10

!$acc kernels loop
DO I = 1,N,INC
! Various computations to set ARR
ARR(I) = 6
END DO

END SUBROUTINE FOO

INC is determined before the loop starts and does not change during the loop. When this routine is called from an accelerated executable, I get a runtime error:

Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
File: FOO.F90
Function: foo:1
Line: 11

Adding copyin(INC) to the loop does not help. If I remove INC from the DO loop control, the accelerated executable runs fine. I can mimic INC with an appropriate CYCLE statement inside the loop, but then the loop still performs all N iterations.
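For comparison with the CYCLE approach, here is a minimal sketch (an assumption, not code from this thread) that keeps unit stride in the loop itself and folds INC into the subscript, so only the strided loop's trip count of iterations runs instead of all N. The module and routine names (strided_loops, fill_normalized) are hypothetical. It uses `parallel loop` rather than `kernels loop` on the assumption that the compiler may not prove the computed subscripts 1 + J*INC independent on its own.

```fortran
module strided_loops
  implicit none
contains
  ! Hypothetical rewrite of FOO: the stride is moved into the
  ! subscript so the accelerated loop always has unit stride.
  subroutine fill_normalized(a, n, arr)
    integer, intent(in)    :: a, n
    integer, intent(inout) :: arr(n)
    integer :: j, inc, trips

    inc = 1
    if (a < 10) inc = 10

    ! Same iteration count Fortran gives DO I = 1, N, INC
    ! (integer division truncates toward zero; clamp at 0).
    trips = max((n - 1 + inc) / inc, 0)

    ! parallel loop asserts independence: each J writes a
    ! distinct element ARR(1 + J*INC).
    !$acc parallel loop copy(arr)
    do j = 0, trips - 1
      arr(1 + j * inc) = 6
    end do
  end subroutine fill_normalized
end module strided_loops
```

Without -acc the !$acc line is treated as a comment, so the sketch also builds as plain serial Fortran, which makes it easy to check against the original strided loop on the host.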

The OpenACC documentation says that the trip count of the do loop must be computable in constant time, and it certainly is.
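For reference, Fortran fixes the iteration count of DO I = m1, m2, m3 before the first iteration (INT truncates toward zero), so with INC set ahead of the loop the count is a single constant-time expression. Taking N = 1024 as an example value:

```latex
\begin{aligned}
\text{trips} &= \max\!\left(\operatorname{INT}\!\left(\frac{m_2 - m_1 + m_3}{m_3}\right),\ 0\right) \\
\text{INC} = 10:\quad \text{trips} &= \operatorname{INT}\!\left(\frac{1024 - 1 + 10}{10}\right) = 103 \qquad (I = 1, 11, \dots, 1021) \\
\text{INC} = 1:\quad \text{trips} &= \operatorname{INT}\!\left(\frac{1024 - 1 + 1}{1}\right) = 1024
\end{aligned}
```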

Am I doing something wrong, or is a variable loop increment (stride) not possible in OpenACC?

Hi chasmotron,

Using INC here should be fine, and I can't reproduce the error using your snippet as the basis, so I suspect something else is going on. Can you create a reproducing example that exhibits the error?

Here’s the example I wrote:

% cat test.F90

module test

contains
SUBROUTINE FOO( A, N, ARR )

INTEGER, INTENT(IN) :: A, N
INTEGER, DIMENSION(N) :: ARR

INTEGER :: I, INC

INC = 1
IF (A.LT.10) INC = 10

!$acc kernels loop copy(ARR)
DO I = 1,N,INC
! Various computations to set ARR
ARR(I) = 6
END DO

END SUBROUTINE FOO
end module test

program main
use test

INTEGER :: A, N
INTEGER, DIMENSION(:),allocatable :: ARR

N=1024
allocate(ARR(N))
A=1
ARR=1
call foo(A,N,ARR)
print *, ARR(1:20)
deallocate(ARR)
end program main

% nvfortran test.F90 -Minfo=accel -acc ; a.out
foo:
     15, Generating copy(arr(:)) [if not already present]
     16, Loop is parallelizable
         Generating NVIDIA GPU code
         16, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
            6            1            1            1            1            1
            1            1            1            1            6            1
            1            1            1            1            1            1
            1            1

-Mat