Accelerator Fatal Error: call to cuStreamSynchronize returned error 700: Tlledal address during kernel execution

wen1664379144 · September 28, 2023, 8:05am

Hi,

I am trying to accelerate FVCOM by openACC.
To ensure Correct result, i need to use Atomic operation.
But when i use it, error occurs.

!$acc parallel loop
DO I=1,NE
IA=IEC(I,1)
IB=IEC(I,2)
...
!$acc atomic
XFLUX(IA)=XFLUX(IA)+(XADV+FXX*EPOR(IA))*( 1.0 SP-ISBC TMP )*IUCP(IA)
!$acc atomic
YFLUX(IA)=YFLUX(IA)+(YADV+FYY*EPOR(IA))*( 1.0 SP-ISBC TMP )*IUCP(IA)
!$acc atomic
XFLUX(IB)=XFLUX(IB)-(XADV+FXX*EPOR(IB))*( 1.0 SP-ISBC TMP )*IUCP(IB)
!$acc atomic
YFLUX(IB)=YFLUX(IB)-(YADV+FYY*EPOR(IB))*(1.0 SP-ISBC TMP )*IUCP(IB)

Error:
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700: Tlledal address during kernel execution.

if i don’t use the atomic operation, the program works well, but the result is wrong.
Why it’s generating?how to turn it off?

Thanks,
wjx

MatColgrove · September 28, 2023, 9:37pm

Hi wjx,

An illegal address error means that a bad address is being use such as a host address on the device or out-of-bounds error. Though it’s rather generic and I can’t tell from this snip-it why it’s occurring here.

Can you create a minimal reproducing example so I can investigate?

Thanks,
Mat

wen1664379144 · October 6, 2023, 8:13am

Hello, thank you very much for your advice.
I guess I have identified the issue:

In Fortran, arrays have a default index starting from 1,
but if I define the array with an index starting from 0,
then when this array is passed to the device, the element with index 0 may encounter an exception (lost or inaccessible).

You can try the following example, where accessing a(0) on the device side will result in an error.

module simpleOps_m
contains
 attributes(global)subroutine increment(a)
  implicit none
  integer,intent(inout)::a(:)
  integer::i,xx
  i = threadIdx%x
     if(i==2) then
        do xx=0,10
            write(*,*)'gpu',a(xx)
         end do
     end if
 end subroutine increment
end module simpleOps_m


program incrementTestGPU
  use cudafor
  use simpleOps_m
  implicit none
  integer ,parameter :: n =4
  integer :: a(0:10),xx
  integer, allocatable,device :: ad(:)
  allocate(ad(0:10))

  a=2
  ad=a
        do xx=0,10
                write(*,*)'cpu',a(xx)
        end do
  call increment<<<1,n>>>(ad,b)
  a=ad
end program incrementTestGPU

If the index xx starts from 1, the program will execute normally. If it starts from 0, it will produce an error(FAILED: 700(an illegal memory access was encountered)).

In FVCOM, I encountered a similar issue (possibly more complex), where the arrays on the device side had an offset in their index compared to the arrays on the host side.
You can take a look at the following images, and this is the problem I’ve encountered, which is also the reason for the errors in CUDA Fortran or OpenACC.

Thanks,
wjx

MatColgrove · October 6, 2023, 5:28pm

Since the device code gets translated into CUDA C, the compiler needs to implicitly map the Fortran to C indexing (i.e. subtract one). The easy fix is to add the lower bound in the local declaration of “a”

 attributes(global)subroutine increment(a)
  implicit none
  integer,intent(inout)::a(0:)    !<<< Add 0 for the lower bound
  integer::i,xx
  i = threadIdx%x

-Mat

wen1664379144 · October 7, 2023, 3:51am

Okay, thank you very much.

If I’m using OpenACC in Fortran and I haven’t explicitly defined the array on the device side, how can I correct its index?

just like these

REAL(SP) :: XFLUX(0:NT),YFLUX(0:NT)
...

!$acc atomic
XFLUX(IA)=XFLUX(IA)+(XADV+FXX*EPOR(IA))*( 1.0 SP-ISBC TMP )*IUCP(IA)
!$acc atomic
YFLUX(IA)=YFLUX(IA)+(YADV+FYY*EPOR(IA))*( 1.0 SP-ISBC TMP )*IUCP(IA)
!$acc atomic
XFLUX(IB)=XFLUX(IB)-(XADV+FXX*EPOR(IB))*( 1.0 SP-ISBC TMP )*IUCP(IB)
!$acc atomic
YFLUX(IB)=YFLUX(IB)-(YADV+FYY*EPOR(IB))*(1.0 SP-ISBC TMP )*IUCP(IB)

How can I ensure that the indices for xflux start from 0?

Thanks,
wjx

MatColgrove · October 7, 2023, 6:40pm

Sorry, but I’m not clear on what you mean by this. Assume your not using implicit typing, the arrays will be defined someplace like the local routine or in a module. The definition of the array with the zero index should be available. The device copy of the arrays are created when you include them in an OpenACC data directive.

wen1664379144 · October 12, 2023, 11:16am

Hello, thank you for your response. I still haven’t resolved the issue I’ve encountered. I suspect the problem may be related to the atomic operations in CUDA Fortran.

In the following case, I use 5 threads to perform an addition operation on a(3). The correct result should be 5, but the result I obtain is actually 4.

module simpleOps_m
contains
 attributes(global)subroutine increment(a,b)
  implicit none
  integer,intent(inout)::a(0:)
  integer,value::b
  integer::i,xx
  i = threadIdx%x

  call syncthreads()
  a(3)=atomicAdd(a(3),b)

 end subroutine increment
end module simpleOps_m

program incrementTestGPU
  use cudafor
  use simpleOps_m
  implicit none
  integer :: a(0:10),b,xx
  integer,device :: ad(0:10)

  a=0
  b=1
  ad=a

  call increment<<<1,5>>>(ad,b)
  a=ad
  write(*,*)'res=',a(3)
end program incrementTestGPU

(And the same issue, under CUDA C, works correctly, and the result is indeed 5.)
I would like to understand why this is happening and what’s wrong with my code. How can I fix it?

Thanks,
wjx

MatColgrove · October 12, 2023, 2:58pm

Hi wjx,

“atomicAdd” returns the value of the first argument prior to the add. So by using “a(3) = atomicAdd”, you’re overwriting the add. Try using “xx = atomicAdd(a(3),b)” instead.

-Mat

Topic		Replies	Views
Problem with acc array reduction nvc, nvc++ and nvfortran	2	356	May 24, 2022
Support for associate syntax in OpenACC kernel nvc, nvc++ and nvfortran nvbugs	4	713	October 9, 2020
Accelerating a variable fortran do loop increment failure nvc, nvc++ and nvfortran	1	187	April 16, 2024
OpenACC: cuStreamSynchronize crash when using pointers as parameters nvc, nvc++ and nvfortran	4	820	December 7, 2021
Openacc - nested loops - illegal address during kernel execution nvc, nvc++ and nvfortran	3	26	March 15, 2025
call to cuStreamSynchronize returned error 700: Illegal addr Legacy PGI Compilers	1	3501	March 5, 2018
The Fortran OpenACC acceleration code compiles successfully but still runs on the CPU nvc, nvc++ and nvfortran	14	34	December 28, 2024
Add OpenACC to a Fortran loop Legacy PGI Compilers	5	7151	December 3, 2015
cudaLaunchKernel returned status 98: invalid device function nvc, nvc++ and nvfortran	15	4181	January 31, 2023
OpenACC Accelerator restriction: call to 'function' with no acc routine information nvc, nvc++ and nvfortran	9	542	November 26, 2024

Accelerator Fatal Error: call to cuStreamSynchronize returned error 700: Tlledal address during kernel execution

Related topics