Accelerator Fatal Error: call to cuStreamSynchronize returned error 700: Tlledal address during kernel execution

Hi,

I am trying to accelerate FVCOM by openACC.
To ensure Correct result, i need to use Atomic operation.
But when i use it, error occurs.

!$acc parallel loop
DO I=1,NE
IA=IEC(I,1)
IB=IEC(I,2)
...
!$acc atomic
XFLUX(IA)=XFLUX(IA)+(XADV+FXX*EPOR(IA))*( 1.0 SP-ISBC TMP )*IUCP(IA)
!$acc atomic
YFLUX(IA)=YFLUX(IA)+(YADV+FYY*EPOR(IA))*( 1.0 SP-ISBC TMP )*IUCP(IA)
!$acc atomic
XFLUX(IB)=XFLUX(IB)-(XADV+FXX*EPOR(IB))*( 1.0 SP-ISBC TMP )*IUCP(IB)
!$acc atomic
YFLUX(IB)=YFLUX(IB)-(YADV+FYY*EPOR(IB))*(1.0 SP-ISBC TMP )*IUCP(IB)

Error:
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700: Tlledal address during kernel execution.

if i don’t use the atomic operation, the program works well, but the result is wrong.
Why it’s generating?how to turn it off?

Thanks,
wjx

Hi wjx,

An illegal address error means that a bad address is being use such as a host address on the device or out-of-bounds error. Though it’s rather generic and I can’t tell from this snip-it why it’s occurring here.

Can you create a minimal reproducing example so I can investigate?

Thanks,
Mat

Hello, thank you very much for your advice.
I guess I have identified the issue:

In Fortran, arrays have a default index starting from 1,
but if I define the array with an index starting from 0,
then when this array is passed to the device, the element with index 0 may encounter an exception (lost or inaccessible).

You can try the following example, where accessing a(0) on the device side will result in an error.

module simpleOps_m
contains
 attributes(global)subroutine increment(a)
  implicit none
  integer,intent(inout)::a(:)
  integer::i,xx
  i = threadIdx%x
     if(i==2) then
        do xx=0,10
            write(*,*)'gpu',a(xx)
         end do
     end if
 end subroutine increment
end module simpleOps_m


program incrementTestGPU
  use cudafor
  use simpleOps_m
  implicit none
  integer ,parameter :: n =4
  integer :: a(0:10),xx
  integer, allocatable,device :: ad(:)
  allocate(ad(0:10))

  a=2
  ad=a
        do xx=0,10
                write(*,*)'cpu',a(xx)
        end do
  call increment<<<1,n>>>(ad,b)
  a=ad
end program incrementTestGPU

If the index xx starts from 1, the program will execute normally. If it starts from 0, it will produce an error(FAILED: 700(an illegal memory access was encountered)).

In FVCOM, I encountered a similar issue (possibly more complex), where the arrays on the device side had an offset in their index compared to the arrays on the host side.
You can take a look at the following images, and this is the problem I’ve encountered, which is also the reason for the errors in CUDA Fortran or OpenACC.

Thanks,
wjx

Since the device code gets translated into CUDA C, the compiler needs to implicitly map the Fortran to C indexing (i.e. subtract one). The easy fix is to add the lower bound in the local declaration of “a”

 attributes(global)subroutine increment(a)
  implicit none
  integer,intent(inout)::a(0:)    !<<< Add 0 for the lower bound
  integer::i,xx
  i = threadIdx%x

-Mat

Okay, thank you very much.

If I’m using OpenACC in Fortran and I haven’t explicitly defined the array on the device side, how can I correct its index?

just like these

REAL(SP) :: XFLUX(0:NT),YFLUX(0:NT)
...

!$acc atomic
XFLUX(IA)=XFLUX(IA)+(XADV+FXX*EPOR(IA))*( 1.0 SP-ISBC TMP )*IUCP(IA)
!$acc atomic
YFLUX(IA)=YFLUX(IA)+(YADV+FYY*EPOR(IA))*( 1.0 SP-ISBC TMP )*IUCP(IA)
!$acc atomic
XFLUX(IB)=XFLUX(IB)-(XADV+FXX*EPOR(IB))*( 1.0 SP-ISBC TMP )*IUCP(IB)
!$acc atomic
YFLUX(IB)=YFLUX(IB)-(YADV+FYY*EPOR(IB))*(1.0 SP-ISBC TMP )*IUCP(IB)

How can I ensure that the indices for xflux start from 0?

Thanks,
wjx

Sorry, but I’m not clear on what you mean by this. Assume your not using implicit typing, the arrays will be defined someplace like the local routine or in a module. The definition of the array with the zero index should be available. The device copy of the arrays are created when you include them in an OpenACC data directive.

Hello, thank you for your response. I still haven’t resolved the issue I’ve encountered. I suspect the problem may be related to the atomic operations in CUDA Fortran.

In the following case, I use 5 threads to perform an addition operation on a(3). The correct result should be 5, but the result I obtain is actually 4.

module simpleOps_m
contains
 attributes(global)subroutine increment(a,b)
  implicit none
  integer,intent(inout)::a(0:)
  integer,value::b
  integer::i,xx
  i = threadIdx%x

  call syncthreads()
  a(3)=atomicAdd(a(3),b)

 end subroutine increment
end module simpleOps_m

program incrementTestGPU
  use cudafor
  use simpleOps_m
  implicit none
  integer :: a(0:10),b,xx
  integer,device :: ad(0:10)

  a=0
  b=1
  ad=a

  call increment<<<1,5>>>(ad,b)
  a=ad
  write(*,*)'res=',a(3)
end program incrementTestGPU

(And the same issue, under CUDA C, works correctly, and the result is indeed 5.)
I would like to understand why this is happening and what’s wrong with my code. How can I fix it?

Thanks,
wjx

Hi wjx,

“atomicAdd” returns the value of the first argument prior to the add. So by using “a(3) = atomicAdd”, you’re overwriting the add. Try using “xx = atomicAdd(a(3),b)” instead.

-Mat