Is it a good idea to convert all logical operators into bitwise operators to prevent short-circuiting and thereby reduce warp divergence?

Doesn't short-circuit evaluation of multiple logical AND/OR operations in the same expression cause warp divergence, even if the divergent section is short?

bool a, b, c, d, e;
init(a, b, c, d, warpLane);
// warp lanes 0..3:   short-circuit at a
// warp lanes 4..7:   short-circuit at b
// warp lanes 8..15:  short-circuit at c
// warp lanes 16..31: short-circuit at d
bool something = a && b && c && d && e;

This has 4-way divergence, right?

Is it enough to just convert the expression to bitwise operations to avoid the warp divergence? Does that have side effects or performance penalties in some cases?

unsigned int a, b, c, d, e;
// initialize differently per warp lane
init(a, b, c, d, warpLane);
unsigned int something = a & b & c & d & e; // assuming it's faster than a * b * c * d * e

I don’t think so. There is not necessarily divergence at all.

Even if there were, it might manifest as predicated execution, which is a fairly efficient method for the GPU to handle conditional behavior.

Rather than treating such instances of “divergence” as a crisis a priori, for my own work I would seek to write understandable, expressive, maintainable code, and only look for alternatives when the profiler tells me there is a need.


When the source code you linked is converted to the bitwise-AND version, some instructions become LOP3.LUT. I assume it's a kind of lookup table that computes the multiple AND operations more quickly (it uses two fewer instructions).

(&):

 IADD3 R10, P1, R10, c[0x0][0x160], RZ 
 LOP3.LUT R11, R6, R5, R2, 0x80, !PT 
 LOP3.LUT P0, RZ, R11, 0xff, R8, 0x80, !PT 
 IMAD.X R11, RZ, RZ, c[0x0][0x164], P1 
 SEL R13, RZ, 0x1, !P0 
 STG.E.U8 [R10.64], R13 

(&&):

 IADD3 R10, P1, R10, c[0x0][0x160], RZ 
 IMAD.X R11, RZ, RZ, c[0x0][0x164], P1 
 ISETP.NE.AND P0, PT, R4, RZ, PT 
 ISETP.NE.AND P0, PT, R2, RZ, P0 
 ISETP.NE.AND P0, PT, R6, RZ, P0 
 ISETP.NE.AND P0, PT, R8, RZ, P0 
 SEL R5, RZ, 0x1, !P0 
 STG.E.U8 [R10.64], R5 

But both versions use SEL, which is a predicated select.

LOP3.LUT is an instruction that implements any logical function of three inputs. There are 256 such functions, which can be identified by an 8-bit lookup table. The lookup table is baked into the instruction. In the disassembled SASS above it is the penultimate argument, 0x80.

In modern GPUs, the traditional 2-input logic operations have been replaced by the generic LOP3 instruction. In many cases, this allows fewer instructions to be generated. For example, with two LOP3 instructions one can implement more than half of all possible 4-input logic operations. A drawback of LOP3 is that it makes the generated machine code more difficult to read and to correlate with the logic operations in the source code.

Converting logical operations to bitwise ones can impact performance both positively and negatively, depending on contextual details, so a manual conversion needs to be analyzed on a case-by-case basis. Hence the recommendation to write the code in whatever coding style is most natural for the use case at hand, and to revisit it only when profiling indicates that it is implicated in a performance bottleneck.

In many cases an important function of code is to communicate the intentions of its author to other humans, not just to a machine. And in some cases conveying information to other humans is the most important aspect.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.