Using -maxrregcount does not increase local memory usage?

I have a kernel using 12 registers.

To achieve optimal performance on a CUDA compute capability 1.1 card I need to get down to 10 registers.

So I changed 6 int variables to short and used the -maxrregcount 10 compiler option to enforce the limit.
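For reference, here is a stripped-down sketch of what the change looks like (the names and the body are placeholders, not my real kernel):

// Illustrative sketch only -- placeholder code, not the actual kernel.
// Built with:
//   nvcc -arch=sm_11 -Xptxas -v kernel.cu                    (no limit)
//   nvcc -arch=sm_11 -Xptxas -v -maxrregcount 10 kernel.cu   (with limit)
__global__ void myKernel(const int *in, int *out)
{
    // the six locals that were changed from int to short
    short a = (short)in[threadIdx.x];
    short b = a + 1;
    short c = b * 2;
    short d = c - a;
    short e = d + b;
    short f = e ^ c;
    out[threadIdx.x] = a + b + c + d + e + f;
}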

The ptxas info output says the limit works, but no allocated local memory is visible?

Using 6 ints:

Without limit:

ptxas info    : Used 12 registers, 40+16 bytes smem, 8 bytes cmem[1]

With limit:

ptxas info    : Used 10 registers, 4+0 bytes lmem, 40+16 bytes smem, 8 bytes cmem[1]

Using 6 shorts:

Without limit:

ptxas info    : Used 12 registers, 40+16 bytes smem, 8 bytes cmem[1]

With limit:

ptxas info    : Used 10 registers, 40+16 bytes smem, 8 bytes cmem[1]

Where did those 2 registers go?

The compiler might be able to re-compute a value it would have held in a register if one were available.

If you don’t want to rely on second-guessing, you can find out for yourself using cuobjdump -sass.

Since spilling registers can be pretty expensive, when you squeeze the register allocation with -maxrregcount the compiler will first try to recompute values to save registers for temporaries, as tera points out. So you are trading reduced register usage for an increased dynamic instruction count, and the resulting code may or may not be faster.
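To illustrate the idea in source form (this is only a hand-written analogy; the actual transformation happens inside ptxas, not in your code):

__global__ void example(const float *in, float *out, int stride)
{
    int base = blockIdx.x * blockDim.x + threadIdx.x;

    // Keeping the derived value alive ties up a register across the kernel:
    //   int offset = base * stride;
    //   ... many instructions later ...
    //   out[offset] = in[offset];

    // Recomputing it at each use frees that register in between,
    // at the cost of extra multiply instructions:
    out[base * stride] = in[base * stride];
}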

If you squeeze the register allocation even more, you will get register spilling eventually. As tera recommends, look at the disassembly from cuobjdump if you want to see all the gory details.
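If you want to try it later, the workflow is just two commands (file names here are placeholders):

nvcc -arch=sm_11 -cubin -o kernel.cubin kernel.cu
cuobjdump -sass kernel.cubin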

Thanks for those answers, I did not know that the compiler could recompute values.
Right now I don’t have the time for cuobjdump, but I skimmed the NVIDIA PDF about it, and it looks promising.