Hello,
I have been unable to pass shared arrays from my kernel to device subroutines when the size of the shared array is also an argument. More specifically, the memcheck tool says there is an Invalid __global__ write of size 8 bytes
. Below is a reproducer to show the problem.
I compile and run it using nvfortran -cuda -g -gpu=debug -o reproducer_smem reproducer_smem.f90 && compute-sanitizer --tool=memcheck ./reproducer_smem
with nvhpc 24.1.
module cuda_module
use cudafor
implicit none
! Define a kind for real numbers
integer, parameter :: realType = kind(0.0d0)
contains
attributes(global) subroutine my_kernel(N)
integer, value :: N
real(kind=realType), dimension(N), shared :: d_s
integer :: i, tid
i = threadIdx%x
if (i <= N ) then
d_s(i) = 0.0_realType
endif
end subroutine my_kernel
attributes(global) subroutine my_calling_kernel(N)
integer, value :: N
real(kind=realType), dimension(N), shared :: d_s
call set(d_s, N, 0.0_realType)
end subroutine my_calling_kernel
attributes(global) subroutine my_calling_kernel_static(N)
integer, value :: N
real(kind=realType), dimension(2), shared :: d_s
call set(d_s, N, 0.0_realType)
end subroutine my_calling_kernel_static
attributes(device) subroutine set(arr, N, val)
real(kind=realType), dimension(N), shared :: arr
integer :: N
real(kind=realType) :: val
integer :: i
i = threadIdx%x
if (i <= N ) then
arr(i) = val
endif
end subroutine set
end module cuda_module
program main
use cudafor
use cuda_module
implicit none
! Define a kind for real numbers
integer, parameter :: nx = 2
integer :: i, smem
smem = nx*realType
write(*,'(A,I0,A)') 'running with smem of ',smem,' bytes'
call my_kernel<<<1, 1, smem>>>(nx)
call my_calling_kernel_static<<<1, 1, smem>>>(nx)
! **comment me out for no errors**
call my_calling_kernel<<<1, 1, smem>>>(nx)
end program main
I believe this should be possible based on my understanding of the documentation.
"Shared variables and arrays may be passed as actual arguments to from a device subprogram to another device subprogram, as long as the interface is explicit and the matching dummy argument has the shared attribute. "
This does work when I statically allocate the shared memory array in the kernel and this is the work around I have been using.
I have tried the following
- compiling with -gpu=nordc as suggested on this forum post
- launching the kernel with more shared memory
- adding combinations of the
shared
anddevice
attribute to theset
subroutine. The compiler warning ‘NVFORTRAN-W-0526-SHARED attribute ignored on dummy argument arr’ suggestsshared
is not needed.
Any help would be greatly appreciated, thank you!