Cuda fortran error: copyout Memcpy FAILED: 700(an illegal memory access was encountered)

When I was trying to compute 1D stencil with cuda fortran(using share memory), I got a illegal memory error. I have checked the program for a long time, but can not find the error.
`module reverse_m
implicit none
integer, device :: n_d
attributes(global) subroutine staticReverse(a, d)
real :: d(:), a(:), result
integer :: t, tr, t1, ra = 3, j
real, shared :: s(-ra+1:64 + ra)
!real, shared :: s(64 + 2*ra)

t = (blockIdx%x-1)*blockDim%x + threadIdx%x
!write(*,*) 't:', t, ra
if (t > ra .and. t < size(a) - ra) then
    t1 = threadIdx%x
    !write(*,*) 't:', t1, a(t)

    s(t1) = a(t)
    write(*,*) 's(t1)', t1, s(t1)
    if (t1 <= ra) then
            s(t1 - ra) = a(t - ra)
    if (t1 >= 64 - ra) then
            s(t1 + ra) = a(t + ra)
call syncthreads()
if (t > ra .and. t < size(a) - ra) then
    write(*,*) t,t1
    result = 0
    do j = -ra, ra
        result = result + s(t1 + j)
    d(t) = result
    write(*,*) 'size(d):', size(d)

end subroutine staticReverse
end module reverse_m

program sharedExample
use cudafor
use reverse_m

implicit none

integer, parameter :: n = 64, ra = 3
real :: a(n), r(n), d(n)
real, device :: d_a(n), d_d(n)
type(dim3) :: grid, tBlock
integer :: i, sizeInBytes, j

tBlock = dim3(64,1,1)
grid = dim3(1,1,1)

do i = 1, n
a(i) = i
do i = 1 + ra, n - ra
do j = -ra, ra
d(i) = d(i) + a(i - j)

sizeInBytes = sizeof(a(1))*tBlock%x

! run version with static shared memory
d_a = a
write(,) ‘Size(a):’, size(a)
call staticReverse<<<grid,tBlock>>>(d_a, d_d)
r = d_d
write(,) ‘Static case max error:’, maxval(abs(r-d))
if (maxval(abs(r-d)) .gt. 1.0e-7) then
write(,) “Test Failed”
write(,) “Test Passed”

end program sharedExample
When I was trying to print ‘s’ in the kernel function, It print nothing. It is weird. Can you help me find the error? thank you!

Hi Xll,

The problem here is that “s” is an automatic so dynamically allocated. For this, you’d need to pass in the size of the shared memory as the third argument in the chevron. See the “Dynamic Shared Memory” section at: Using Shared Memory in CUDA Fortran | NVIDIA Developer Blog

Though, the simpler solution here is to make “s” static by changing “ra” to be a parameter.

attributes(global) subroutine staticReverse(a, d)
real :: d(:), a(:), result
integer :: t, tr, t1, j
integer, parameter :: ra=3
real, shared :: s(-ra+1:64 + ra)

With this change, your program will run to completion, albeit fails verification.

% nvfortran test.cuf
% a.out
 Size(a):           64
 s(t1)           33    33.00000
 s(t1)           34    34.00000
 s(t1)           35    35.00000
 s(t1)           36    36.00000
... cut ...
 size(d):           64
 size(d):           64
 Static case max error:    427.0000
 Test Failed