Cuda-memcheck error that I cannot figure out

chasmotron · June 28, 2021, 9:16pm

I’ll try to keep the situation as simple as possible: two Fortran subroutines, one of which is called from within a kernel, compiled with PGI 20.4.

SUBROUTINE SUB1

  !$acc routine(sub2) vector

  !$acc update device(…)

  !$acc parallel loop independent
  DO I = 1,10
    CALL SUB2
  END DO

END SUBROUTINE SUB1

SUBROUTINE SUB2

  INTEGER :: I, N
  REAL, DIMENSION(1000) :: AA, BB, CC
  REAL :: MAG

  !$acc routine(sub2) vector

  N = 50 ! Some value computed earlier in routine, set to 50 here

  !$acc loop seq
  DO I = 1,N
    AA(I) = …
    BB(I) = …
    CC(I) = …
  END DO

  !$acc loop seq
  DO I = 1,N
    MAG = SQRT(AA(I)*AA(I) + BB(I)*BB(I) + CC(I)*CC(I))
  END DO

END SUBROUTINE SUB2

If I comment out the line that sets MAG, then I get zero cuda-memcheck errors. If I uncomment the line that sets MAG, then cuda-memcheck gives me this:

========= CUDA-MEMCHECK
========= Invalid global write of size 8
========= at 0x000029b8 in sub2_
========= by thread (0,0,0) in block (51,0,0)
========= Address 0x00000000 is out of bounds
========= Device Frame:sub1_709_gpu (sub1_709_gpu : 0x458)

A bounds error doesn’t make sense to me since the second loop has exactly the same bounds as the previous loop.

Any help appreciated.

MatColgrove · June 29, 2021, 4:32pm

One thing that doesn’t quite make sense is that the error occurs in block 51. Since the OpenACC gang will map to a block, and the gang loop has a trip count of 10, I’m not understanding why there are 51 blocks.

My initial thought is that it’s a stack overflow due to the size of the local arrays, but I’m not sure. What happens if you make these smaller? And/or try increasing the stack size via the environment variable “NV_ACC_CUDA_STACKSIZE=64MB”.

If that’s not it, can you provide a complete small reproducer? That would make it easier to determine the issue.

-Mat

chasmotron · June 30, 2021, 4:15pm

Thanks for the advice. Reducing the size of the local arrays did make the error go away, although NV_ACC_CUDA_STACKSIZE didn’t seem to do anything. Does that environment variable work with PGI v.20.4 executables?
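
For a sense of scale, here is a rough estimate of how much per-thread storage the three local arrays in SUB2 ask for. It is only a sketch: it assumes the default REAL is 4 bytes (no flag promoting reals to 8 bytes) and that the compiler places AA, BB, and CC on each thread's stack, neither of which is confirmed in the thread.

```shell
# Rough per-thread footprint of SUB2's local arrays AA, BB, CC.
# Assumption: default REAL occupies 4 bytes.
n_arrays=3          # AA, BB, CC
n_elems=1000        # DIMENSION(1000)
bytes_per_real=4    # assumed size of a default REAL

stack_bytes=$((n_arrays * n_elems * bytes_per_real))
echo "${stack_bytes} bytes per thread for local arrays"
```

Every thread that executes SUB2 needs its own copy of these arrays, so the demand scales with the number of threads as well as the array size. That is consistent with smaller arrays making the error disappear.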

MatColgrove · June 30, 2021, 5:12pm

“Does that environment variable work with PGI v.20.4 executables?”

Sorry, I missed that you are using a PGI-branded version, which doesn’t recognize the “NV” prefix. With 20.4 it will use the “PGI” prefix, i.e., “PGI_ACC_CUDA_STACKSIZE”. The PGI prefix can be used with the new NVHPC-branded compilers as well, but is deprecated.

Note that the CUDA stack size does have a hard limit (I believe it’s 64MB, but I’m not 100% sure), so it could still fail.
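
As a minimal sketch of the suggestion above, the variable can be set in the shell before launching the program. The value “64MB” is the one mentioned in the thread. The executable name `./my_app` is hypothetical, and setting both prefixes at once is just a convenience assumed here so the same script works with either compiler branding.

```shell
# Stack size value taken from the thread; other accepted formats are not confirmed here.
STACKSIZE="64MB"

# PGI-branded compilers (such as 20.4) read the PGI_ prefix.
export PGI_ACC_CUDA_STACKSIZE="$STACKSIZE"

# NVHPC-branded compilers read the NV_ prefix.
export NV_ACC_CUDA_STACKSIZE="$STACKSIZE"

# Then rerun under cuda-memcheck (./my_app is a hypothetical executable name):
# cuda-memcheck ./my_app
```

If the error persists at the largest stack size the device allows, shrinking the local arrays remains the fallback that worked earlier in the thread.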