Here is the ptxas information for the three functions:
1>ptxas info : Compiling entry function '_Z5Test3P6fInputPf'
1>ptxas info : Used 4 registers, 16+16 bytes smem
1>ptxas info : Compiling entry function '_Z5Test2P6float4Pf'
1>ptxas info : Used 4 registers, 16+16 bytes smem
1>ptxas info : Compiling entry function '_Z5Test1P6fInputPf'
1>ptxas info : Used 6 registers, 16+0 bytes lmem, 16+16 bytes smem
I was wondering why the Test1 kernel requires 16 bytes of lmem. If I put another float4 array in the
structure and read it, the lmem usage goes up to 32 bytes.
I pondered that for a while, and I can’t offer an answer. I presume that the compiler can’t resolve the scope of the pointer fetch for the float4 case, so it declines to use registers, although there doesn’t seem to be a good reason why. Whether this behaviour is a feature or a bug is anyone’s guess.
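For anyone following along without the kernels in front of them, here is a rough sketch of the two cases being compared. Only the signatures are taken from the mangled names in the ptxas log (`_Z5Test1P6fInputPf` is `Test1(fInput*, float*)` and `_Z5Test2P6float4Pf` is `Test2(float4*, float*)`). The layout of `fInput`, the member name `a`, and both kernel bodies are my assumptions, not the original poster's code, so this may not reproduce the lmem usage exactly.

```cpp
// Hypothetical reconstruction: signatures come from the ptxas log,
// everything else (struct layout, member name, bodies) is assumed.
#ifndef __CUDACC__
// Host-only stand-ins so the sketch also builds with a plain C++ compiler.
#define __global__
struct float4 { float x, y, z, w; };
#endif

// Assumed layout: a structure holding a float4 array, as described in the question.
struct fInput {
    float4 a[1];
};

// Test2(float4*, float*): reads the float4 directly through the pointer.
__global__ void Test2(float4 *in, float *out) {
    float4 v = in[0];
    out[0] = v.x + v.y + v.z + v.w;
}

// Test1(fInput*, float*): reads the same data through the structure member.
__global__ void Test1(fInput *in, float *out) {
    float4 v = in->a[0];
    out[0] = v.x + v.y + v.z + v.w;
}
```

Building a file like this with `nvcc -Xptxas -v` prints the per-kernel register, lmem and smem usage in the same form as the log above. Test3 has the same signature as Test1 according to its mangled name, so it is left out here.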
Just got confirmation from NVIDIA that this issue has been fixed in a development version and that the fix will ship in a future release.
“This issue has been fixed in a development version of the CUDA toolkit. The fix will be available in a future release of CUDA. I will update this bug when that release becomes available.”