I’ll try to keep the situation as simple as possible: two Fortran subroutines, one of which is called from within a kernel in the other, compiled with PGI 20.4.
SUBROUTINE SUB1
  !$acc routine(sub2) vector
  !$acc update device(…)
  !$acc parallel loop independent
  DO I = 1,10
    CALL SUB2
  END DO
END SUBROUTINE SUB1

SUBROUTINE SUB2
  INTEGER :: I, N
  REAL, DIMENSION(1000) :: AA, BB, CC
  REAL :: MAG
  !$acc routine(sub2) vector
  N = 50 ! Some value computed earlier in routine, set to 50 here
  !$acc loop seq
  DO I = 1,N
    AA(I) = …
    BB(I) = …
    CC(I) = …
  END DO
  !$acc loop seq
  DO I = 1,N
    MAG = SQRT(AA(I)*AA(I) + BB(I)*BB(I) + CC(I)*CC(I))
  END DO
END SUBROUTINE SUB2
If I comment out the line that sets MAG, cuda-memcheck reports zero errors. If I uncomment that line, cuda-memcheck gives me this:
========= CUDA-MEMCHECK
========= Invalid global write of size 8
========= at 0x000029b8 in sub2_
========= by thread (0,0,0) in block (51,0,0)
========= Address 0x00000000 is out of bounds
========= Device Frame:sub1_709_gpu (sub1_709_gpu : 0x458)
A bounds error doesn’t make sense to me, since the second loop has exactly the same bounds as the first loop and the only thing it writes is the scalar MAG.
Any help appreciated.