Cuda-memcheck error that I cannot figure out

I’ll try to keep the situation as simple as possible. I have two Fortran subroutines, one of which is called from within a kernel. I’m using PGI 20.4.

SUBROUTINE SUB1

  !$acc routine(sub2) vector

  !$acc update device(…)

  !$acc parallel loop independent
  DO I = 1,10
    CALL SUB2
  END DO

END SUBROUTINE SUB1

SUBROUTINE SUB2

  INTEGER :: I, N
  REAL, DIMENSION(1000) :: AA, BB, CC
  REAL :: MAG

  !$acc routine(sub2) vector

  N = 50 ! Some value computed earlier in routine, set to 50 here

  !$acc loop seq
  DO I = 1,N
    AA(I) = …
    BB(I) = …
    CC(I) = …
  END DO

  !$acc loop seq
  DO I = 1,N
    MAG = SQRT(AA(I)*AA(I) + BB(I)*BB(I) + CC(I)*CC(I))
  END DO

END SUBROUTINE SUB2

If I comment out the line that sets MAG, then I get zero cuda-memcheck errors. If I uncomment the line that sets MAG, then cuda-memcheck gives me this:

========= CUDA-MEMCHECK
========= Invalid global write of size 8
========= at 0x000029b8 in sub2_
========= by thread (0,0,0) in block (51,0,0)
========= Address 0x00000000 is out of bounds
========= Device Frame:sub1_709_gpu (sub1_709_gpu : 0x458)

A bounds error doesn’t make sense to me since the second loop has exactly the same bounds as the previous loop.

Any help appreciated.

One thing that doesn’t quite make sense is that the error occurs in block 51. Since the OpenACC gang will map to a block, and the gang loop has a trip count of 10, I’m not understanding why there are 51 blocks.

My initial thought is that it’s a stack overflow due to the size of the local arrays, but I’m not sure. What happens if you make these smaller? And/or try increasing the stack size via the environment variable “NV_ACC_CUDA_STACKSIZE=64MB”.
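As a rough sanity check on the stack theory, the footprint of the three local arrays in SUB2 can be estimated with a bit of shell arithmetic. This is only a sketch: it assumes the default 4-byte REAL (no promotion flag such as -r8) and that every thread executing the routine gets its own private copy of AA, BB and CC. The variable names are made up for illustration.

```shell
# Estimate the per-thread size of SUB2's local arrays AA, BB, CC.
# Assumption: default 4-byte REAL and one private copy per thread.
n_arrays=3          # AA, BB, CC
n_elems=1000        # DIMENSION(1000)
bytes_per_real=4    # assumed default REAL kind
bytes_per_thread=$((n_arrays * n_elems * bytes_per_real))
echo "local arrays per thread: ${bytes_per_thread} bytes"
```

If that figure is larger than the per-thread stack limit in effect for the run, a stack overflow would be a plausible explanation for the failure.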

If that’s not it, can you provide a complete small reproducer? That would make it easier to determine the issue.

-Mat

Thanks for the advice. Reducing the size of the local arrays did make the error go away, although NV_ACC_CUDA_STACKSIZE didn’t seem to do anything. Does that environment variable work with PGI v.20.4 executables?

Sorry, I missed that you are using a PGI-branded version, which doesn’t recognize the “NV” prefix. With 20.4 it will use the “PGI” prefix, i.e., “PGI_ACC_CUDA_STACKSIZE”. The PGI prefix can be used with the new NVHPC-branded compilers as well, but it is deprecated.

Note that the CUDA stack size does have a hard limit (I believe it’s 64MB, but I’m not 100% sure), so it could still fail.
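For reference, setting the variable in a bash-like shell before launching the program could look like the sketch below. The variable names and the 64MB value are the ones mentioned in this thread; the executable name ./a.out is a placeholder, so that line is left commented out.

```shell
# PGI-branded 20.4 runtime: use the PGI_ prefix.
# The value format (64MB) is copied from the suggestion earlier in this thread.
export PGI_ACC_CUDA_STACKSIZE=64MB

# With NVHPC-branded compilers the NV_ prefix would be used instead:
# export NV_ACC_CUDA_STACKSIZE=64MB

# Confirm what the program will see in its environment.
echo "PGI_ACC_CUDA_STACKSIZE=${PGI_ACC_CUDA_STACKSIZE}"

# Then rerun under the checker (./a.out is a hypothetical executable name):
# cuda-memcheck ./a.out
```

The variable has to be exported in the same shell (or job script) that launches the executable, otherwise the program will not see it.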