Minor Compiler Optimization: Bit Rotation

In my program I make a lot of calls to a bit rotation macro:

#define rol(value, bits) (((value) << (bits)) | ((value) >> (32 - (bits))))

Since PTX has no built-in bit rotation instruction, the generated PTX contains many instruction sequences that look like this:

	shr.u32 	%r46, %r45, 27;
	shl.b32 	%r47, %r45, 5;
	or.b32 	%r48, %r46, %r47;

Why does the compiler put the result of the “or” operation in a new register, even though the results of the “shr” and “shl” are not used again?

For example, I believe this would save one register for each rotate operation:

	shr.u32 	%r46, %r45, 27;
	shl.b32 	%r47, %r45, 5;
	or.b32 	%r47, %r46, %r47;

That way %r48 would be free for other use. Normally, saving one register wouldn’t make a significant difference, but there are at least 100 calls to the rotate macro in each of my threads.

nvcc always uses a new register for each result. This does not affect optimization, because the actual register allocation happens during PTX-to-CUBIN compilation. Having fewer registers in the PTX doesn’t mean you’ll have fewer of them in the compiled kernel.

Does this mean that the registers specified in the PTX files are not the actual registers that will be used?

Is there any way to see what the executable is actually doing with the registers?

Correct. nvcc outputs PTX using the “static single assignment” convention, which leaves ptxas free to do some further optimization and then the final register assignment.

Yes, you can use the unofficial decuda tool on the .cubin:


I used decuda and found that the registers are being reused properly, just as I had suggested in my first post. Thanks for the fast replies!

Indeed, ptxas usually assigns registers efficiently (although I’ve had a few strange cases…). BTW, it’s curious that x86 is one of the few architectures that has a native bitwise rotate instruction.

Something to keep in mind: according to the C/C++ standards, shifting a 32-bit value is undefined when the shift amount is negative or greater than or equal to 32 (a shift by 0 is fine). We’ve had a number of people complain that this doesn’t match x86 behavior, but it is actually undefined.
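This matters for the rol macro from the first post: with a bit count of 0 it shifts right by 32 - 0 = 32, which is one of the undefined cases. A minimal sketch of a rotate that stays inside the defined range is below. The name rol32 is made up for illustration, and the code assumes the usual case where unsigned int is at least 32 bits wide.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by `bits`.
 * Both shift amounts are masked into 0..31, so neither shift can reach
 * the full width of the type, even when bits is 0 or 32 or larger. */
static inline uint32_t rol32(uint32_t value, unsigned int bits)
{
    bits &= 31u;
    return (value << bits) | (value >> ((32u - bits) & 31u));
}
```

With this version rol32(x, 0) and rol32(x, 32) both return x unchanged, and rol32(0x12345678u, 8) gives 0x34567812u.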