Minor Compiler Optimization Bit rotation

shifter1 · March 18, 2009, 1:20pm

In my program I make a lot of calls to a bit rotation macro:

#define rol(value, bits) (((value) << (bits)) | ((value) >> (32 - (bits))))

Since PTX has no built-in bit rotation instruction, the generated PTX contains many sequences that look like this:

shr.u32   %r46, %r45, 27;
shl.b32   %r47, %r45, 5;
or.b32    %r48, %r46, %r47;

I was wondering why the compiler puts the result of the “or” operation in a new register, even though the results of the “shr” and “shl” are not used again?

For example, I believe this would save one register for each rotate operation:

shr.u32   %r46, %r45, 27;
shl.b32   %r47, %r45, 5;
or.b32    %r47, %r46, %r47;

That way %r48 would be free for other use. Normally saving one register wouldn’t make a significant difference, but there are at least 100 calls to the rotate macro in each of my threads.

AndreiB · March 18, 2009, 1:25pm

nvcc always uses a new register for each result. This does not affect optimization, because the actual register allocation happens during PTX-to-CUBIN compilation, so having fewer registers in the PTX doesn’t mean you’ll have fewer of them in the compiled kernel.
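To make the point above concrete, here is a minimal C sketch of the two shapes being compared. The function names (rol5_fresh, rol5_reuse, check_same_result) are made up for illustration and do not come from nvcc or from the original program. The first function gives every result its own temporary, the way the emitted PTX does; the second reuses a temporary by hand, the way the question proposes. Both describe the same computation, so a later register allocation pass is free to map them onto the same physical registers.

```c
#include <assert.h>
#include <stdint.h>

/* One fresh temporary per result, mirroring the shape of the emitted PTX. */
uint32_t rol5_fresh(uint32_t x)
{
    uint32_t hi  = x >> 27;   /* shr.u32 %r46, %r45, 27   */
    uint32_t lo  = x << 5;    /* shl.b32 %r47, %r45, 5    */
    uint32_t out = hi | lo;   /* or.b32  %r48, %r46, %r47 */
    return out;
}

/* Hand-reused temporary, mirroring the PTX proposed in the question. */
uint32_t rol5_reuse(uint32_t x)
{
    uint32_t hi = x >> 27;
    uint32_t lo = x << 5;
    lo = hi | lo;             /* or.b32 %r47, %r46, %r47 */
    return lo;
}

/* Hypothetical self-check: both forms compute the same rotated value. */
void check_same_result(void)
{
    assert(rol5_fresh(0x80000001u) == rol5_reuse(0x80000001u));
    assert(rol5_fresh(0x80000001u) == 0x00000030u);
}
```

Calling check_same_result() passes silently. How many registers either form ends up using is decided by the compiler back end, not by the number of names in the source.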

shifter1 · March 18, 2009, 1:36pm

Does this mean that the registers specified in the PTX files are not the actual registers that will be used?

Is there any way to see what the executable is actually doing with the registers?

seibert · March 18, 2009, 1:50pm

Correct. nvcc outputs PTX using the “static single assignment” convention, which allows ptxas to do some further optimization and then the final register assignment.

Yes, you can use the unofficial decuda tool on the .cubin:

http://www.cs.rug.nl/~wladimir/decuda/

shifter1 · March 18, 2009, 2:18pm

I used decuda and found that the registers are in fact being reused the way I had suggested in my first post. Thanks for the fast replies!

wumpus · March 18, 2009, 8:08pm

Indeed, registers are usually assigned efficiently by ptxas (although I’ve had a few strange cases…). BTW, it’s curious that x86 is one of the few architectures that has a native bitwise rotate instruction.

tmurray · March 18, 2009, 9:45pm

Something to keep in mind: according to the C/C++ standards, shifting a 32-bit variable is undefined when the shift amount is negative or greater than or equal to 32. We’ve had a number of people complain about how this doesn’t correspond to x86 behavior, but it is actually undefined.
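For the rol macro from the first post this matters at the edges: with bits equal to 0 the right-hand shift becomes value >> 32, which falls in the undefined range. One possible way to avoid that, sketched in C, is to mask both shift counts to 0..31. The names rotl32 and check_rotl32 are made up for this example, and treating a count of 32 as "no rotation" is a choice of this sketch, not something the thread prescribes.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left with the count reduced modulo 32, so neither shift
   amount ever reaches the width of the type. */
uint32_t rotl32(uint32_t value, unsigned bits)
{
    bits &= 31u;  /* now in 0..31 */
    return (value << bits) | (value >> ((32u - bits) & 31u));
}

/* Hypothetical self-check covering the edge cases. */
void check_rotl32(void)
{
    assert(rotl32(0x12345678u, 0u)  == 0x12345678u);  /* count 0: unchanged */
    assert(rotl32(0x80000001u, 5u)  == 0x00000030u);
    assert(rotl32(0x80000001u, 32u) == 0x80000001u);  /* count 32 wraps to 0 */
    assert(rotl32(0x00000001u, 31u) == 0x80000000u);
}
```

When the count is a compile-time constant between 1 and 31, as in the PTX shown earlier, the original macro stays within the defined range and the masking is not needed.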
