I have a kernel where each thread performs nearly identical computations (using data stored in smem) on 8 or 16 floats.
These floats are stored in a per-thread array that is looped over. Given my occupancy and block size, I have plenty of registers to spare to keep the entire array in registers at all times. However, even with --maxrregcount set appropriately, the compiler insists on placing the array in local memory, which more than halves the performance of the kernel.
The slowdown is easy to verify by performing all the same computations while accessing only a single element of the array.
The profiler also indicates that a large number of local loads/stores take place when the array is used.
I do see a speed benefit when storing the data in smem instead, but this reduces my block size and/or occupancy.
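The pattern described above can be sketched roughly as follows. This is only an illustration, not the actual kernel: the names (scale_add_kernel, N_PER_THREAD, BLOCK_SIZE, max_abs_error_demo) and the arithmetic are hypothetical. The sketch assumes the array length is a compile-time constant and that every loop over the array is fully unrolled, so that each index becomes a constant after unrolling. In general the compiler can only keep a per-thread array in registers when its indexing can be resolved at compile time; whether it actually does so for a given kernel has to be checked in the ptxas output or the profiler.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sizes, chosen only for illustration.
constexpr int N_PER_THREAD = 8;    // 8 or 16 floats per thread, as in the post
constexpr int BLOCK_SIZE   = 128;  // must be >= N_PER_THREAD for the smem load below

// Each thread keeps N_PER_THREAD floats in a private array and combines each
// element with a coefficient read from shared memory.
__global__ void scale_add_kernel(const float* in, const float* coeff,
                                 float* out, int n_threads)
{
    __shared__ float s_coeff[N_PER_THREAD];
    if (threadIdx.x < N_PER_THREAD)
        s_coeff[threadIdx.x] = coeff[threadIdx.x];
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_threads) return;

    float v[N_PER_THREAD];  // per-thread array we would like kept in registers

    // Fixed trip count plus full unrolling: after unrolling, every v[i]
    // is accessed with a constant index.
    #pragma unroll
    for (int i = 0; i < N_PER_THREAD; ++i)
        v[i] = in[tid * N_PER_THREAD + i];

    #pragma unroll
    for (int i = 0; i < N_PER_THREAD; ++i)
        v[i] = v[i] * s_coeff[i] + 1.0f;

    #pragma unroll
    for (int i = 0; i < N_PER_THREAD; ++i)
        out[tid * N_PER_THREAD + i] = v[i];
}

// Runs the kernel on small test data and returns the largest absolute
// difference from a CPU reference, or -1.0f if any CUDA call failed.
float max_abs_error_demo()
{
    const int n_threads = 1000;
    const int n = n_threads * N_PER_THREAD;
    const size_t bytes = n * sizeof(float);
    const size_t coeff_bytes = N_PER_THREAD * sizeof(float);

    std::vector<float> h_in(n), h_out(n), h_coeff(N_PER_THREAD);
    for (int i = 0; i < n; ++i) h_in[i] = 0.001f * i;
    for (int i = 0; i < N_PER_THREAD; ++i) h_coeff[i] = 0.5f * (i + 1);

    float *d_in = nullptr, *d_out = nullptr, *d_coeff = nullptr;
    bool ok = true;
    ok = ok && cudaMalloc(&d_in, bytes) == cudaSuccess;
    ok = ok && cudaMalloc(&d_out, bytes) == cudaSuccess;
    ok = ok && cudaMalloc(&d_coeff, coeff_bytes) == cudaSuccess;
    ok = ok && cudaMemcpy(d_in, h_in.data(), bytes,
                          cudaMemcpyHostToDevice) == cudaSuccess;
    ok = ok && cudaMemcpy(d_coeff, h_coeff.data(), coeff_bytes,
                          cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int blocks = (n_threads + BLOCK_SIZE - 1) / BLOCK_SIZE;
        scale_add_kernel<<<blocks, BLOCK_SIZE>>>(d_in, d_coeff, d_out, n_threads);
        ok = cudaGetLastError() == cudaSuccess;
        // This blocking copy also waits for the kernel to finish.
        ok = ok && cudaMemcpy(h_out.data(), d_out, bytes,
                              cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree accepts null pointers, so this is safe on partial failure.
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_coeff);
    if (!ok) return -1.0f;

    float max_err = 0.0f;
    for (int i = 0; i < n; ++i) {
        float ref = h_in[i] * h_coeff[i % N_PER_THREAD] + 1.0f;
        max_err = std::fmax(max_err, std::fabs(h_out[i] - ref));
    }
    return max_err;
}
```

To try it, call max_abs_error_demo() from a main function and build with nvcc. Adding -Xptxas -v to the nvcc command line makes ptxas print the per-kernel resource usage, which shows whether any local memory is still being used for the array.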
Johan Seland, PhD