Is cudaFuncAttributeMaxDynamicSharedMemorySize a supported attribute?

I need to use more than the default amount of shared memory on a V100 GPU. I understand that I need to explicitly enable use of more than 64 KB via cudaFuncSetAttribute; however, this fails under CUDA Fortran. I have verified that it works under CUDA C++ using the cudaTensorCoreGemm example code.

Is the cudaFuncAttributeMaxDynamicSharedMemorySize attribute supported under CUDA Fortran?

Here is my reproducer.

module smem
  use cudafor
  contains

attributes(global) subroutine kern(neq)
  implicit none
  integer, value :: neq

  ! declare shared memory
  real(kind=8), dimension(125,6,neq,3), shared  :: face_corr

    face_corr(threadIdx%x,threadIdx%y,threadIdx%z,1) = 0.D0
    face_corr(threadIdx%x,threadIdx%y,threadIdx%z,2) = 0.D0
    face_corr(threadIdx%x,threadIdx%y,threadIdx%z,3) = 0.D0
    call syncthreads()
end subroutine kern
end module smem



program main
  use smem
  implicit none
  integer :: ierr
  type(dim3)  :: tBlock, tGrid 

  tBlock = dim3(25, 6, 5 )
  tGrid  = dim3(1, 1,1 )

  ierr = cudaFuncSetAttribute(kern, cudaFuncAttributePreferredSharedMemoryCarveout, 100)
  if (ierr /= 0) then
     print *, 'main: error setting cudaFuncAttributePreferredSharedMemoryCarveout',cudaGetErrorString(ierr)
     stop
  endif

  ierr = cudaFuncSetAttribute(kern, cudaFuncAttributeMaxDynamicSharedMemorySize, 98304)
  if (ierr /= 0) then
     print *, 'main: error setting cudaFuncAttributeMaxDynamicSharedMemorySize',cudaGetErrorString(ierr)
     stop
  endif
  call kern<<<tGrid, tBlock, 98304, 0>>>(5)
  ierr = cudaDeviceSynchronize();
end program main

pgfortran --version
pgfortran 19.10-0 LLVM 64-bit target on x86-64 Linux -tp skylake

pgfortran -g -O0 -Mcuda=debug -Mcuda=lineinfo -Mbounds -traceback -Mcuda=charstring -Mcuda=cc70 -Minfo=accel -Mcuda=ptxinfo -o smem.x smemMod.cuf smem.cuf

PBS r101i0n0 263> ./smem.x
main: error from cudaFuncAttributeMaxDynamicSharedMemorySize
invalid argument
Warning: ieee_inexact is signaling
FORTRAN STOP
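
For context, the shared array in the kernel (face_corr, dimensioned 125 x 6 x neq x 3 in real(kind=8)) needs 90000 bytes at neq=5, well under the 98304 bytes requested in the cudaFuncSetAttribute call and the launch configuration. A quick sanity check of those numbers (Python sketch, using only the sizes taken from the reproducer):

```python
# Dynamic shared memory footprint of face_corr(125,6,neq,3), real(kind=8),
# using the dimensions from the reproducer kernel.
neq = 5                        # value passed to kern at launch
elem_bytes = 8                 # size of one real(kind=8) element
elements = 125 * 6 * neq * 3
footprint = elements * elem_bytes
print(footprint)               # 90000 bytes

requested = 98304              # bytes asked for in cudaFuncSetAttribute and <<<...>>>
print(footprint <= requested)  # True: the array fits in the requested allocation
```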

Hi dkokron,

It seems that this problem only occurs when device side debugging is enabled. Can you try without the debugging flags (i.e. remove “-g -Mcuda=debug” or use “-g -Mcuda=nodebug”).

I'm not sure why it fails with debugging enabled, so I added a problem report (TPR #28449) and sent it to engineering for further investigation.

% pgfortran -O0 -g -traceback -Mcuda=charstring -Mcuda=cc70 -Minfo=accel test.cuf ; a.out
 main: error setting cudaFuncAttributeMaxDynamicSharedMemorySize
 invalid argument                                                                                                    
Warning: ieee_inexact is signaling
FORTRAN STOP
% pgfortran -O0 -Mcuda=nodebug -g -traceback -Mcuda=charstring -Mcuda=cc70 -Minfo=accel test.cuf ; a.out
%

-Mat

Mat,
Thanks for the quick reply and the workaround.
Dan

Mat,
While the workaround ‘works’ in the reproducer, I still cannot get the cudaFuncSetAttribute() call to work in the main application when using cudaFuncAttributeMaxDynamicSharedMemorySize. Do you have any suggestions for further investigating this issue? I can provide the whole code if required.
Dan

Hi Dan,

I haven’t used that call myself, so don’t have experience with it. If you can provide your code, that would be helpful for me to investigate.

I’m mentoring our first virtual GPU Hackathon this week so do have limited time available, but should be able to get to it by Friday.

-Mat

Mat,
I’ve been working back and forth between a working example and a non-working example. At this point, the working example is just a slimmed down (not scientifically functional) copy of the non-working example. I can even get the non-working example to fail the cudaFuncSetAttribute() call by putting a return at various points in the kernel code. Does the cudaFuncSetAttribute() function interrogate the kernel routine and fail if it doesn’t meet some internal requirements?
Dan

edit
I can even get the non-working example to pass the cudaFuncSetAttribute() call by putting a return at various points in the kernel code.

We now have PGI-20.4 available for testing. This issue is still present with 20.4.

For the record, reducing the dynamic shared memory request from 98304 to 96000 in the cudaFuncSetAttribute() and launch options appears to have resolved all the weirdness.

cudaFuncSetAttribute(update0, cudaFuncAttributeMaxDynamicSharedMemorySize, 96000)
<<<tGrid, tBlock, 96000, 0>>>
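
Worth noting: 98304 bytes is exactly 96 KB, the documented per-block opt-in maximum for dynamic shared memory on cc70 (V100). Requesting the full hardware limit leaves no headroom for anything the toolchain itself may reserve, which may be why backing off to 96000 behaves; this is an inference from the numbers, not a confirmed explanation. A quick comparison (Python sketch; the 96 KB figure is the documented V100 opt-in limit):

```python
# Compare the failing and working dynamic shared memory requests
# against the V100 (cc70) per-block opt-in maximum.
V100_OPTIN_MAX = 96 * 1024   # 98304 bytes: documented cc70 opt-in limit
failing_request = 98304      # original request: exactly at the limit
working_request = 96000      # reduced request that resolved the weirdness

print(failing_request == V100_OPTIN_MAX)   # True: no headroom at all
print(V100_OPTIN_MAX - working_request)    # 2304 bytes of headroom
```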