Since spilling registers can be quite expensive, when you squeeze the register allocation with -maxrregcount the compiler will first try to recompute values instead of holding them in registers as temporaries, as tera points out. So you are trading reduced register usage for an increased dynamic instruction count, and the resulting code may or may not be faster.
If you squeeze the register allocation even further, you will eventually get register spilling. As tera recommends, look at the disassembly from cuobjdump if you want to see all the gory details.
Thanks for those answers, I did not know that the compiler could recompute values.
Right now I don’t have the time for cuobjdump, but I skimmed the NVIDIA PDF about it, and it looks promising.