Register economy when using constant make compiler use registers efficiently

Hi there,

I’ve just recently started to look more closely at generated PTX output by adding the -keep option to my build rules.

I found that the compiler is sometimes wasting registers by loading an often needed constant over and over into registers. Not just one register, but different ones.

Here is a simple code sample (not meaningful, but illustrative). Note that this can also apply to int and floats, as long as you reuse the the same constant (in this case true).

 bool __shared__ flag1, flag2, flag3;  

  int __shared__ out;

  

  out = (flag1 == true && flag2 == true && flag3 == true);

So what’s wrong with the generated PTX? The thing is, it re-loads the same constant (1 for “true”) over and over into different registers. Hey compiler, ever heard about register re-use? In this case r6 and r10 would not be needed at all, we could simply reuse r2.

ld.shared.s8  %r1, [flag3];    // id:7 flag3+0x0

	mov.s32  %r2, 1;              	// 

	set.eq.u32.s32  %r3, %r1, %r2;	// 

	neg.s32  %r4, %r3;            	// 

	ld.shared.s8  %r5, [flag1];    // id:8 flag1+0x0

	mov.s32  %r6, 1;              	// 

	set.eq.u32.s32  %r7, %r5, %r6;	// 

	neg.s32  %r8, %r7;            	// 

	ld.shared.s8  %r9, [flag2];    // id:9 flag2+0x0

	mov.s32  %r10, 1;              // 

	set.eq.u32.s32  %r11, %r9, %r10;	// 

	neg.s32  %r12, %r11;          	// 

	and.b32  %r13, %r8, %r12;      // 

	and.b32  %r14, %r4, %r13;      // 

	st.shared.s32  [out], %r14;    // id:10 out+0x0

So let’s see how we can coerce the compiler into storing the “true” in a register… and reusing it. Turns out to be quite simple. The trick is to generate the value “true” at run time, such that the compiler cannot determine it during compile time. Then we use the value for example by writing it to a dummy location. From now on the value is available in a register.

 bool __shared__ flag1, flag2, flag3;

  bool __shared__ trueval;

  

  // anything that evaluates to true at run time

  // that the compiler does not yet know at compile time

  trueval = threadIdx.x != 65535;

  

  out = (flag1 == trueval && flag2 == trueval  && flag3 == trueval );

See how it is now computing a value of 1 in register r1 and re-using it? We’ve added a bit of overhead to get the desired value into a register, but imagine this being done only once at the start of the kernel. The flag checks may have been in some tight inner computation loop, getting executed millions of times…

mov.u16  %rh1, %tid.x;        	// 

	mov.u16  %rh2, 65535;          // 

	setp.ne.u16  %p1, %rh1, %rh2;  // 

	selp.s32  %r1, 1, 0, %p1;      // 

	st.shared.s8  [trueval], %r1;  // id:13 trueval+0x0

	ld.shared.s8  %r2, [flag3];    // id:14 flag3+0x0

	set.eq.u32.s32  %r3, %r2, %r1;	// 

	neg.s32  %r4, %r3;            	// 

	ld.shared.s8  %r5, [flag1];    // id:15 flag1+0x0

	set.eq.u32.s32  %r6, %r5, %r1;	// 

	neg.s32  %r7, %r6;            	// 

	ld.shared.s8  %r8, [flag2];    // id:16 flag2+0x0

	set.eq.u32.s32  %r9, %r8, %r1;	// 

	neg.s32  %r10, %r9;            // 

	and.b32  %r11, %r7, %r10;      // 

	and.b32  %r12, %r4, %r11;      // 

	st.shared.s32  [out], %r12;    // id:17 out+0x0

The method of forcing registers to contain a wanted value may not be elegant - we could have used constant memory also to store “true” there.

But I wanted to illustrate the point that if you need one value stored in a register, you can coerce the compiler to do just that. Does not matter if the value you need is a pointer, an integer or FP constant or some other value. Compute the value and write it to some dummy location - and from now on it is available in a register. Most benefit is obtained when the computed value somehow depends on threadIdx or blockIdx because in this case using constant memory is not an option.

I used the register trick on an odd-even sorting kernel and I got the run time down from 42 seconds to 35 seconds for a specific problem size. The register count also decreased quite a bit. And this with only 2 minor tweaks like the one shown above.

Future CUDA versions may compile such code more efficiently, such that this hack may no longer show any benfits. Let me know if you know better tricks to make the PTX more efficient. I’m listening.

Christian

What version are you using? It seemed to me that CUDA2.0 seems to be a better job at optimizing that 1.1. At the same time I haven’t check this kind of things in the PTX. Don’t forget they could be improved when going from the PTX to the Cubin file. Cubin files are the only ones who contain the actual number of registers being used.

In this particular case I tried this both on CUDA 1.1 and CUDA 2.0 beta2 and I found that mentioned hacks worked equally well.

It’s not just about saving registers. The code gets more compact because one instruction for loading a constant into a register was saved. That’s 4 clock cycles saved per occurence or iteration ;)

You’re right: PTX is just an intermediate language, but somehow I doubt that a lot of extra optimizations are made when taking the final step to CUBIN format.

well, all register re-use is done in that step, so I guess it is a quite important one…

Actually, it is quite significant. I have kernels that generate ptx files with hundreds of registers that actually only use somewhere in 30s when compiled (I forget the exact number).

I think if you really want to optimize register usage then you should forget about PTX entirely and just use wumpus’ decuda and assembler.

When these C++ workarounds result in significant speedups, I am happy with them. In my particular case it was probably more the reduction of the instruction count that provided the most speedup.