It appears that whenever shared memory is used in CUDA Fortran, twice as much shared memory is allocated as was explicitly requested. Any idea why this is the case? A similar kernel written in CUDA C and compiled directly with nvcc does not exhibit the same behavior.
The following demonstrates the behavior. It was compiled with CUDA/6.5 and PGI/15.3.0.
$ cat ftran.cuf
attributes(global) subroutine test
Real(8), shared :: a(64)
a = 1.0
end subroutine test
$ pgfortran ftran.cuf -Mcuda=ptxinfo,cc35 -c
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'test_' for 'sm_35'
ptxas info : Function properties for test_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 1024 bytes smem, 320 bytes cmem
1024 bytes of shared memory were reserved, although only 512 bytes (64 elements × 8 bytes each) were explicitly requested. Using other data types and different amounts of shared memory produces the same “2x” behavior. Any help is greatly appreciated.
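For reference, a CUDA C counterpart of the kernel above can be sketched as follows. This is an illustrative reconstruction, not the exact code that was compiled: the file name ctest.cu is hypothetical, and the volatile qualifier is an assumption added so that the compiler does not discard the otherwise unused shared array.

```cuda
// ctest.cu (hypothetical file name): CUDA C sketch of the Fortran kernel above.
__global__ void test()
{
    // 64 doubles * 8 bytes = 512 bytes of statically allocated shared memory.
    // volatile is an assumption: it keeps the write-only array from being optimized away.
    __shared__ volatile double a[64];

    // Mirror the Fortran whole-array assignment "a = 1.0".
    for (int i = 0; i < 64; ++i) {
        a[i] = 1.0;
    }
}
```

Compiling with `nvcc -arch=sm_35 -Xptxas -v -c ctest.cu` makes ptxas print the register and smem usage in the same form as the pgfortran output above, so the two shared memory figures can be compared directly.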