Shift direction and divergence

Within a warp, one strives to keep all threads executing the same instructions to avoid stalling threads.

If one thread is performing (x & y) >> z and another (a & b) << c, does the direction of the shift cause the threads to diverge?


Divergence refers to diverging control flow. From the scant information provided, one cannot tell whether there is divergent control flow in your code.

If we assume that your source code looks like this:

r = (cond) ? ((x & y) >> z) : ((a & b) << c);

there is a high likelihood that it would translate into machine code without any branches. Without a conditional branch, there cannot be control flow divergence.

Even if there were a local branch in the generated machine code, the compiler is very likely to insert a control-flow convergence point directly at the end of such an if-then-else construct (for older GPU architectures, these would be visible as a .s suffix on disassembled machine instructions). This kind of brief local divergence is not really harmful to performance, and programmers should not jump through hoops to avoid it.

I was thinking:
if(threadIdx.x == 0)
(x & y) >> z;
if(threadIdx.x == 1)
(a & b) << c;


repeating for a total of 32 threads, as I look to populate a shared memory array for the block to utilise, as opposed to just doing it all from thread 0, which would stall the rest of the block.

I believe it is pretty much guaranteed that those operations won’t be issued in the same (set) of instructions warp-wide. There will be conditional behavior evident in the SASS, either actual control divergence or else predication. Generally the compiler will lean towards predication, but I know of no way to make a guaranteed prediction by looking at source code. I don’t think you’re going to achieve this, for example: “one strives to keep all threads executing the same instructions” unless you mean “keep all threads execution the same instructions although some may be predicated off”

You can answer all of these questions yourself with the CUDA binary utilities.

Thanks, I hadn’t considered, “those operations won’t be issued in the same (set) of instructions warp-wide”. I was wondering if a few “shl” alongside “shr”'s in a warp would disrupt things, but it sounds like this won’t eventuate.

I’ll push on and see what happens.

Appreciate the help from you both.

The compiler backend, pxtas, performs machine-specific optimizations and has pretty accurate heuristics as to when to use branchless code based on predication and select-type instructions. On newer architectures is seems to prefer the use of select-type instructions from what I can tell.

From observation, if-conversion seems to be applied to singled-sided (if-then) or double-sided (if-then-else) conditionals with very small bodies (two to three instructions) only. For larger bodies, code based on (local) branching is faster by the compiler’s reckoning.

If I understand the example code correctly, your code assigns a different computation to every thread in a warp? If so, I think it is unlikly that if-conversion is applied. You can always check by looking at the generated machine code with cuobjdump --dump-sass.

Yes, this would be the case.

Given that, I am speculating it could be faster to have all the work done by a single thread, thus minimizing branching overhead. Instead of making assumptions, it seems best to run a quick experiment trying both variants.