I’ve just recently started to look more closely at generated PTX output by adding the -keep option to my build rules.
I found that the compiler is sometimes wasting registers by loading an often needed constant over and over into registers. Not just one register, but different ones.
Here is a simple code sample (not meaningful, but illustrative). Note that this can also apply to int and floats, as long as you reuse the the same constant (in this case true).
bool __shared__ flag1, flag2, flag3; int __shared__ out; out = (flag1 == true && flag2 == true && flag3 == true);
So what’s wrong with the generated PTX? The thing is, it re-loads the same constant (1 for “true”) over and over into different registers. Hey compiler, ever heard about register re-use? In this case r6 and r10 would not be needed at all, we could simply reuse r2.
ld.shared.s8 %r1, [flag3]; // id:7 flag3+0x0 mov.s32 %r2, 1; // set.eq.u32.s32 %r3, %r1, %r2; // neg.s32 %r4, %r3; // ld.shared.s8 %r5, [flag1]; // id:8 flag1+0x0 mov.s32 %r6, 1; // set.eq.u32.s32 %r7, %r5, %r6; // neg.s32 %r8, %r7; // ld.shared.s8 %r9, [flag2]; // id:9 flag2+0x0 mov.s32 %r10, 1; // set.eq.u32.s32 %r11, %r9, %r10; // neg.s32 %r12, %r11; // and.b32 %r13, %r8, %r12; // and.b32 %r14, %r4, %r13; // st.shared.s32 [out], %r14; // id:10 out+0x0
So let’s see how we can coerce the compiler into storing the “true” in a register… and reusing it. Turns out to be quite simple. The trick is to generate the value “true” at run time, such that the compiler cannot determine it during compile time. Then we use the value for example by writing it to a dummy location. From now on the value is available in a register.
bool __shared__ flag1, flag2, flag3; bool __shared__ trueval; // anything that evaluates to true at run time // that the compiler does not yet know at compile time trueval = threadIdx.x != 65535; out = (flag1 == trueval && flag2 == trueval && flag3 == trueval );
See how it is now computing a value of 1 in register r1 and re-using it? We’ve added a bit of overhead to get the desired value into a register, but imagine this being done only once at the start of the kernel. The flag checks may have been in some tight inner computation loop, getting executed millions of times…
mov.u16 %rh1, %tid.x; // mov.u16 %rh2, 65535; // setp.ne.u16 %p1, %rh1, %rh2; // selp.s32 %r1, 1, 0, %p1; // st.shared.s8 [trueval], %r1; // id:13 trueval+0x0 ld.shared.s8 %r2, [flag3]; // id:14 flag3+0x0 set.eq.u32.s32 %r3, %r2, %r1; // neg.s32 %r4, %r3; // ld.shared.s8 %r5, [flag1]; // id:15 flag1+0x0 set.eq.u32.s32 %r6, %r5, %r1; // neg.s32 %r7, %r6; // ld.shared.s8 %r8, [flag2]; // id:16 flag2+0x0 set.eq.u32.s32 %r9, %r8, %r1; // neg.s32 %r10, %r9; // and.b32 %r11, %r7, %r10; // and.b32 %r12, %r4, %r11; // st.shared.s32 [out], %r12; // id:17 out+0x0
The method of forcing registers to contain a wanted value may not be elegant - we could have used constant memory also to store “true” there.
But I wanted to illustrate the point that if you need one value stored in a register, you can coerce the compiler to do just that. Does not matter if the value you need is a pointer, an integer or FP constant or some other value. Compute the value and write it to some dummy location - and from now on it is available in a register. Most benefit is obtained when the computed value somehow depends on threadIdx or blockIdx because in this case using constant memory is not an option.
I used the register trick on an odd-even sorting kernel and I got the run time down from 42 seconds to 35 seconds for a specific problem size. The register count also decreased quite a bit. And this with only 2 minor tweaks like the one shown above.
Future CUDA versions may compile such code more efficiently, such that this hack may no longer show any benfits. Let me know if you know better tricks to make the PTX more efficient. I’m listening.