Overhead of warp divergence vs. extra multiplication by 0 or 1

I’m doing some GPGPU work. The algorithm I’m converting contains some small, really rather trivial if-else branching.

To avoid warp divergence, I have come up with several alternatives to troublesome if-else statements, but I am unsure which alternatives are best/fastest.

Context:

int myspecialID;
int a, x;
int b, y;
int c, z;

Problem: What is the best way to set a = b or c, and x = y or z, according to myspecialID, i.e.

if (myspecialID == 1)
{  a = b;  x = y;  }
else
{  a = c;  x = z;  }

Solution 1:

a = myspecialID == 1 ? b : c;
x = myspecialID == 1 ? y : z;

Solution 2:

a = b; x = y;
if (myspecialID != 1)
{  a = c;  x = z;  }

Solution 3:

bool is_ID_1 = (myspecialID == 1);
a = (b * is_ID_1) +  (c *  (!is_ID_1));
x = (y * is_ID_1) +  (z *  (!is_ID_1));

I was leaning towards Solutions 2 and 3. But I can’t decide which one is better because I don’t know whether the overhead of warp divergence is greater than that of multiplication by a bool (i.e. 1 or 0).

Does anybody know? Since the “multiplication” is restricted to 1 or 0, can this be optimized in any way? Indeed, it is not a typical multiplication operation - it is so trivial in either case.

Write your code in a natural and readable fashion. The compiler will take care of local branching. If you are still worried, you can inspect the generated machine code by running cuobjdump --dump-sass on the binary. You will likely see either predicated instructions or select-type instructions.

You are correct, and I agree: the priority is to write clear, readable code, avoid premature optimization, and trust the compiler’s optimization magic.

But this branching sits deep inside the double for-loop at the heart of my algorithm. So it matters a great deal whether multiplication by 1 or 0 (true or false), as in Solution 3, is faster than any sort of branching from the if-statements/conditionals in the other proposed solutions.

Thus, it really is an important question about the scheduler/instruction overhead of branching vs. multiplication by 0 or 1.

PS: Despite the branching deep inside the double for-loop, I maintain that the algorithm still stands to benefit greatly from a GPU implementation, because other, more costly, parallelizable calculations also happen inside that loop and do not involve branching. I didn’t show that part, of course, because it was not relevant to my specific question.

I’m leaning towards solution 1, or the original formulation.

Solution 3 is clearly inferior - it’s hard to read, and it would be a lot slower unless the compiler recognizes the idiom and transforms it back to one of the other alternatives.

All the other variants should generate the same code. Historically, though, there have been instances where Solution 1 generated better code.

As Norbert wrote: If in doubt just look at the generated code and/or benchmark it.

The correct answer is Solution 1. It’s the most natural and readable as well.

It should be 3 instructions with no divergence:

setp.eq.u32 p,mySpecialID,1;
selp        a,b,c,p;
selp        x,y,z,p;

The PTX manual reveals there are a few unique opcodes that simplify ternary conditionals.

As @njuffa notes, if you’re really concerned about optimizing a small sequence then just dump the PTX or SASS and see what it looks like.

…However…

If, by chance, you do what ‘allanmac’ suggests, and dump the PTX, only to find out that the compiler’s optimization is less “magic” than “tragic”, you might want to consider coding it this way:

int swtch = -( myspecialID == 1 );
a = c ^ ( ( b ^ c ) & swtch );
x = z ^ ( ( y ^ z ) & swtch );

No branching, no multiplication…

…Just saying…

Of course, the compiler might choose to use a branch to calc ‘swtch’, but if that’s the case, then the whole thing becomes kind of moot anyway…

A simple example can be seen here.

The PTX that is produced is as expected:

setp.eq.s32     %p1, %r5, 1;
selp.b32        %r8, %r1, %r2, %p1;
selp.b32        %r9, %r3, %r4, %p1;

The real test is to look at the SASS (machine code), using cuobjdump --dump-sass. PTX is intermediate code, and further optimizations are performed by PTXAS. PTXAS is a compiler that can and does perform if-conversion. In my experience, in certain situations it is actually better for performance for the PTX to retain a branch and rely on PTXAS’s if-conversion than for the front-end to convert the branch.

In the SASS, you may wind up with predicated instructions, or ICMPs (despite the name, a select-type instruction and not a comparison). Slightly larger if-then-else constructs are usually translated as a combination of a uniform branch (BRA.U) plus predication for best performance.

Note that these transformations are optimizations, and optimizations are disabled for debug builds, so you would likely see plain old branches for the code above in that case.

Your comment about “tragic” code generation implies that you are not getting the branchless result you desire. If so, could you show compilable repro code as well as the compiler switches used? How big is the performance difference between the code generated by the compiler and your hand-tweaked code?

I know nothing about the quality of code produced by any of the NVidia compilers, although judging by the comments made thus far, it sounds like it is quite high.

My use of the word “tragic” actually stems from my experiences with the Intel machine code generated by the Microsoft compilers - things like ‘sub eax,ebx’ followed by ‘test eax,0’ - that kind of thing. All of which has left me more than a little reluctant to rely on any machine-produced ‘optimizations’ for anything. Or maybe I’m just jaded (Sidebar: the latest research suggests that the human brain could have as many as 47 neurotransmitters - a computer has one)…

I certainly didn’t mean to imply that NVidia’s compilers are lacking in any way, shape, or form (in my defense though, I did say ‘If, by chance’)…

On the contrary, having very recently witnessed firsthand how absurdly speedy one of my own running kernels is, my respect for all things NVidia has pretty much gone through the roof. Of course, I wrote the thing in PTX, but that’s still ‘compiled’, right?

I have been working closely with the CUDA compiler team for almost 8 years, and have looked at a lot of generated SASS code over the years. There is always room for further improvement, both functional and in terms of performance. Bug reports and enhancement requests filed via the registered developer website by CUDA users help drive compiler improvements, especially when accompanied by self-contained repro code, and in the case of performance issues, documentation of the performance impact.

Yes, PTX code is compiled (either offline or by a JIT compiler embedded into the driver), so you may observe transformations like loop unrolling and if-conversion happening in addition to traditional back-end tasks such as instruction scheduling and register allocation.