I installed the new 3.0 beta toolkit, compiled my kernel with it, and observed a much lower register count.
The register limit (maxrregcount) is set to 42.
With CUDA 2.3: 40 registers were used.
With CUDA 3.0b: 24 registers are used.
1>ptxas info : Used 24 registers, 48+16 bytes smem, 12 bytes cmem
I unrolled my kernel manually, so there is a lot of redundant code. While CUDA 2.3 probably
kept the calculated indexes in registers, CUDA 3.0 apparently recalculates these indexes every time.
This means many calculations have to be done again, but occupancy will be better.
Has anyone else observed this?