I was wondering why the compiler puts the result of the “or” operation in a new register, even though the results of the “shr” and “shl” are not used again?
For example, I believe this would save one register for each of the rotate operations:
That way %r48 would be free for other use. Normally, saving one register wouldn’t make a significant difference, but there are at least 100 calls to the rotate macro in each of my threads.
nvcc always uses a new register for each result. This does not affect optimization, because the actual register allocation happens during PTX-to-CUBIN compilation, so having fewer registers in the PTX doesn’t mean you’ll have fewer of them in the compiled kernel.
Correct. nvcc outputs PTX using the “static single assignment” convention, which allows ptxas to do some further optimization and then the final register assignment.
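To make the single-assignment idea concrete, here is a rough C sketch of a rotate written so that every temporary gets its own name, much like the shl/shr/or sequence in the PTX. The rotate macro from the original post isn't shown in this thread, so `rotl_ssa` and the names `t1`, `t2`, `t3` are hypothetical, and the sketch assumes 0 < n < 32.

```c
#include <stdint.h>

/* Hypothetical rotate-left, written in single-assignment style:
   each temporary is assigned exactly once, so the "or" result gets
   a fresh name instead of overwriting t1 or t2.
   Assumes 0 < n < 32. */
static inline uint32_t rotl_ssa(uint32_t x, unsigned n)
{
    const uint32_t t1 = x << n;          /* plays the role of the shl result */
    const uint32_t t2 = x >> (32u - n);  /* plays the role of the shr result */
    const uint32_t t3 = t1 | t2;         /* plays the role of the or result  */
    return t3;
}
```

The three names say nothing about how many physical registers end up being used. A register allocator is free to put `t3` in the same place as `t1` once `t1` is dead, and in the CUDA toolchain that decision is made by ptxas, not by nvcc.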
Yes, you can use the unofficial decuda tool on the .cubin:
Indeed, registers are usually assigned efficiently by ptxas (although I’ve had a few strange cases…). BTW, it’s curious that x86 is one of the few architectures that has a native bitwise rotate instruction.
Something to keep in mind: according to the C/C++ standards, shifting a 32-bit variable is undefined when the shift amount is greater than or equal to 32 or is negative (a shift by 0 is well defined). We’ve had a number of people complain about how this doesn’t correspond to x86 behavior, but it is actually undefined.
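This matters for a rotate built from two shifts, because a rotate count of 0 turns the second shift into a shift by 32. One possible way around it, as a minimal sketch, is to mask both shift counts. The function name `rotl32` is an assumption for illustration and not the macro from the original post.

```c
#include <stdint.h>

/* Rotate x left by n bits without ever shifting by 32 or more.
   Masking with 31 keeps both shift counts in the range 0..31,
   so n == 0 and n >= 32 are handled without undefined behavior. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;                                   /* reduce the count modulo 32 */
    return (x << n) | (x >> ((32u - n) & 31u)); /* when n == 0 both shifts are by 0 */
}
```

With this form, `rotl32(x, 0)` and `rotl32(x, 32)` both return `x` unchanged. In a kernel the same body could be marked `__device__`; whether the extra masking costs anything after ptxas has optimized it is something to check on the actual output.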