Nvfortran: Passing shared arrays of variable size to device subroutine causes memory error

Hello,

I have been unable to pass shared arrays from my kernel to device subroutines when the size of the shared array is also an argument. Specifically, the memcheck tool reports an Invalid __global__ write of size 8 bytes. Below is a reproducer that shows the problem.
I compile and run it with nvhpc 24.1 using nvfortran -cuda -g -gpu=debug -o reproducer_smem reproducer_smem.f90 && compute-sanitizer --tool=memcheck ./reproducer_smem.


module cuda_module
  use cudafor
  implicit none

  ! Define a kind for real numbers
  integer, parameter :: realType = kind(0.0d0)

contains

  
  attributes(global) subroutine my_kernel(N)
   integer, value :: N
   real(kind=realType), dimension(N), shared :: d_s
   integer :: i, tid
  
   i = threadIdx%x 
   if (i <= N ) then
       d_s(i) = 0.0_realType
   endif
  
  end subroutine my_kernel



  attributes(global) subroutine my_calling_kernel(N)
   integer, value :: N

   real(kind=realType), dimension(N), shared :: d_s
  
    call set(d_s, N, 0.0_realType)
  
  end subroutine my_calling_kernel
  
  attributes(global) subroutine my_calling_kernel_static(N)
   integer, value :: N
   real(kind=realType), dimension(2), shared :: d_s
  
    call set(d_s, N, 0.0_realType)
  
  end subroutine my_calling_kernel_static

    
  attributes(device) subroutine set(arr, N, val)
    integer :: N
    real(kind=realType), dimension(N), shared :: arr
    real(kind=realType) :: val
    integer :: i
    
    
    i = threadIdx%x 
    if (i <= N ) then
        arr(i) = val
    endif
  end subroutine set

end module cuda_module


program main
  use cudafor
  use cuda_module
  implicit none

  ! Define a kind for real numbers
  integer, parameter :: nx = 2
  integer :: i, smem

  smem = nx*realType
  
  write(*,'(A,I0,A)') 'running with smem of ',smem,' bytes'
  
  call my_kernel<<<1, 1, smem>>>(nx)
  call my_calling_kernel_static<<<1, 1, smem>>>(nx)
  
  ! **comment me out for no errors**
  call my_calling_kernel<<<1, 1, smem>>>(nx)

end program main 

I believe this should be possible based on my understanding of the documentation.

"Shared variables and arrays may be passed as actual arguments from a device subprogram to another device subprogram, as long as the interface is explicit and the matching dummy argument has the shared attribute."
This does work when I statically allocate the shared memory array in the kernel, and this is the workaround I have been using.

I have tried the following:

  • compiling with -gpu=nordc, as suggested in this forum post
  • launching the kernel with more shared memory
  • adding combinations of the shared and device attributes to the set subroutine. The compiler warning ‘NVFORTRAN-W-0526-SHARED attribute ignored on dummy argument arr’ suggests shared is not needed.

Any help would be greatly appreciated, thank you!

Looks like this might have been a known issue. I can recreate it in our 24.1 through 24.5 releases, but it seems to be fixed in our latest 24.7 release.

% nvfortran -cuda -g -gpu=debug -o reproducer_smem reproducer_smem.f90 -V24.7
NVFORTRAN-W-0526-SHARED attribute ignored on dummy argument arr (reproducer_smem.f90: 44)
% compute-sanitizer --tool=memcheck ./reproducer_smem
========= COMPUTE-SANITIZER
running with smem of 16 bytes
========= ERROR SUMMARY: 0 errors

Mind updating your compiler version and testing against your project?

Thank you for your prompt reply. Unfortunately I’m getting the same issue with 24.7.

> nvfortran -V
nvfortran 24.7-0 64-bit target on x86-64 Linux -tp skylake-avx512 
NVIDIA Compilers and Tools
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
> compute-sanitizer -V
NVIDIA (R) Compute Sanitizer
Copyright (c) 2020-2024 NVIDIA Corporation
Version 2024.2.0.0 (build 34384353) (public-release)
> which compute-sanitizer
/software/LAVA_deps/NVIDIA_24.7/Linux_x86_64/24.7/compilers/bin/compute-sanitizer
> nvfortran -cuda -g -gpu=debug -o reproducer_smem reproducer_smem.f90
NVFORTRAN-W-0526-SHARED attribute ignored on dummy argument arr (reproducer_smem.f90: 51)
> compute-sanitizer --tool=memcheck ./reproducer_smem
========= COMPUTE-SANITIZER
running with smem of 16 bytes
========= Invalid __global__ write of size 8 bytes
=========     at cuda_module_set_+0x858 in /u/wk/janibal/repos/LAVA_RESEARCH/src/curv/reproducer_smem.f90:59
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f4d00007f4d is misaligned
=========     and is 2,317,451,443 bytes before the nearest allocation at 0x7f4d8a220000 of size 65,536 bytes
=========     Device Frame:cuda_module_my_calling_kernel_+0x5e0 in /u/wk/janibal/repos/LAVA_RESEARCH/src/curv/reproducer_smem.f90:37
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2dbe2f]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x15a13]
=========                in /software/LAVA_deps/NVIDIA_24.7/Linux_x86_64/24.7/cuda/12.5/lib64/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75750]
=========                in /software/LAVA_deps/NVIDIA_24.7/Linux_x86_64/24.7/cuda/12.5/lib64/libcudart.so.12
=========     Host Frame:__pgiLaunchKernel in /proj/build/24A/Linux_x86_64/rte/cudafor/src/pgi_cudalaunch.c:80 [0x9d90]
=========                in /software/LAVA_deps/NVIDIA_24.7/Linux_x86_64/24.7/compilers/lib/libcudafor.so
=========     Host Frame:MAIN_ in /u/wk/janibal/repos/LAVA_RESEARCH/src/curv/reproducer_smem.f90:83 [0x16cf]
=========                in /u/wk/janibal/repos/LAVA_RESEARCH/src/curv/./reproducer_smem
=========     Host Frame:main [0x1330]
=========                in /u/wk/janibal/repos/LAVA_RESEARCH/src/curv/./reproducer_smem
=========     Host Frame:__libc_start_main [0x3a7e4]
=========                in /lib64/libc.so.6
=========     Host Frame:_start [0x11dd]
=========                in /u/wk/janibal/repos/LAVA_RESEARCH/src/curv/./reproducer_smem
========= 
========= ERROR SUMMARY: 1 error

Below is the output of nvidia-smi. I am running this kernel on a Quadro P6000.

> nvidia-smi
Fri Aug 16 16:29:06 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro P6000                   Off |   00000000:3B:00.0 Off |                  Off |
| 26%   35C    P8              9W /  250W |     336MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      8206      G   /usr/libexec/Xorg                             183MiB |
|    0   N/A  N/A      8518      G   /usr/bin/gnome-shell                          117MiB |
+-----------------------------------------------------------------------------------------+

Ok, not sure why it works for me on an H100, but moving to a P100, I can recreate the error again.

I’ve added an issue report, TPR #36201, and have asked engineering to investigate.

Using automatic arrays with dynamic shared memory should work, though there might be something going on with the descriptor generated when the shared array is passed. As a workaround, you can switch to an assumed-size array so that no descriptor is created.

  attributes(global) subroutine my_calling_kernel(N)
   integer, value :: N
   real(kind=realType), dimension(*), shared :: d_s
    call set(d_s, N, 0.0_realType)
  end subroutine my_calling_kernel
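
If it helps, the same idea can be applied on the callee side. This is just a sketch, assuming the shared attribute can be dropped from the dummy as the 0526 warning suggests; declaring the dummy assumed-size as well keeps the callee interface descriptor-free and avoids the warning:

```fortran
  attributes(device) subroutine set(arr, N, val)
    integer :: N
    real(kind=realType), dimension(*) :: arr  ! assumed-size dummy: no descriptor is built
    real(kind=realType) :: val
    integer :: i

    i = threadIdx%x
    if (i <= N) then
        arr(i) = val
    endif
  end subroutine set
```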

Thank you for submitting that bug report.
Is there a site where I can see the status of the bug report you created, TPR #36201? I can only find places to report new bugs; I haven’t found anywhere to view the status of existing bug reports submitted by other users.

In my main program I have two shared memory arrays of different sizes.
The CUDA Fortran documentation mentions a restriction on using multiple assumed-size shared arrays.

if there is more than one assumed-size shared memory array, they are all implicitly equivalenced, starting at the same shared memory address

I’m not sure what this comment means, but I took it to mean that the assumed-size arrays would split the shared memory equally, and thus would need to be the same size given that they are the same type. Is that correct? I suppose I could over-allocate the shorter array, but I would rather save the shared memory.

For now using static allocation seems to be the easiest workaround for me in the short term.
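
For reference, the static-allocation workaround in my code looks roughly like this; MAXN here is a hypothetical compile-time upper bound, not a name from the reproducer:

```fortran
  integer, parameter :: MAXN = 128  ! hypothetical upper bound on N

  attributes(global) subroutine my_calling_kernel(N)
   integer, value :: N
   real(kind=realType), dimension(MAXN), shared :: d_s  ! static shared allocation

    call set(d_s, N, 0.0_realType)

  end subroutine my_calling_kernel
```

The cost is that the kernel always reserves MAXN elements of shared memory, even when N is small.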

Is there a site where I can see the status of the bug report you created, TPR #36201?

Not for our TPR system. It is a carry-over from our PGI days and is only visible internally to the NVHPC team.

For external views, you’d need to submit an NVBug. If you do submit an NVBug, let me know and I can add a note to the TPR so our team knows to update it.

Or if your organization happens to be on paid support, then you can submit it through your support portal.

Is that correct?

They would be equivalenced, meaning they both share the same base address, and it would be up to you to apply the correct index offsets so they don’t overlap.

Ideally you would use automatic arrays so the compiler knows the extents and can correctly partition the shared memory, but the compiler doesn’t know the size of an assumed-size array, so it can’t do this automatically.

You could still use this, provided you don’t pass the shared array as an argument to a device subroutine.
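
For example, one way to carve two logical arrays out of a single assumed-size shared array is to index with an explicit offset. This is just a sketch (two_arrays, n1, and n2 are made-up names, not from the reproducer):

```fortran
  attributes(global) subroutine two_arrays(n1, n2)
   integer, value :: n1, n2
   ! Both logical arrays live in one assumed-size shared pool;
   ! launch with at least (n1+n2) elements of dynamic shared memory.
   real(kind=realType), dimension(*), shared :: pool
   integer :: i

   i = threadIdx%x
   if (i <= n1) pool(i) = 1.0_realType       ! first array: pool(1:n1)
   if (i <= n2) pool(n1 + i) = 2.0_realType  ! second array: pool(n1+1:n1+n2)
  end subroutine two_arrays
```

Again, this holds provided the pool itself isn’t passed as a descriptor-carrying argument to a device subroutine.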

They would be equivalenced, meaning they both share the same base address, and it would be up to you to apply the correct index offsets so they don’t overlap.

Ahh, Ok. That makes sense now thank you.

In case it is helpful, I also had an issue with the reproducer on A100-SXM4-80GB and nvhpc 24.7.