Resource usage of a CUDA kernel, as seen by using --ptxas-options=-v, is
for Compute Capability 2.0. But for such devices, instead of using an odd number of registers, one might as well use the next even number. I believe the “4+0 bytes lmem” is for register spill-over – but if the 4 bytes of lmem really are register spills, why didn’t nvcc use 26 registers, thereby eliminating the need for any bytes in lmem?
My first guess would be that it is for a variable whose address is taken somewhere.
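A minimal sketch of that situation (the kernel and helper names are hypothetical; whether the compiler actually demotes the variable depends on inlining and its ability to resolve the pointer):

```
// Taking the address of a local variable can prevent the compiler from
// keeping it in a register, since registers are not addressable; the
// variable may then be placed in local memory (lmem) instead.
__device__ void accumulate(float *p)   // opaque use of the pointer
{
    *p += 1.0f;
}

__global__ void kernel(float *out)
{
    float x = out[threadIdx.x];
    accumulate(&x);        // address taken: x may be demoted to lmem
    out[threadIdx.x] = x;
}
```

In practice nvcc often inlines such calls and promotes the variable back to a register; compiling with `--ptxas-options=-v` shows whether any lmem bytes remain.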
Are you using trigonometric functions? I think their implementations also use local memory. In any case it should not be of much concern on 2.0 devices, as local memory accesses are cached.
I can’t believe the lmem usage is the reason for such a drastic drop in throughput, as the cache should make performance degrade gracefully.
Anyway, if the lmem usage really comes from the compiler trying to save registers, you might be able to change that using the __launch_bounds__() qualifier. See appendix B.17 of the Programming Guide for details.
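A minimal sketch of __launch_bounds__() usage (the block size of 256 and the blocks-per-multiprocessor value of 4 are assumptions; they must match your actual launch configuration):

```
// Declaring the launch configuration lets ptxas pick a register budget
// for it explicitly, instead of using its own heuristics. Depending on
// the values chosen, this can raise or lower the per-thread register
// limit, and thus the amount of spilling to local memory.
__global__ void
__launch_bounds__(256, 4)  // maxThreadsPerBlock, minBlocksPerMultiprocessor
kernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}
```

Recompiling with `--ptxas-options=-v` after adding the qualifier shows whether the register count and lmem bytes moved in the direction you wanted.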