Shift direction and divergence

rs277 · November 13, 2020, 7:44pm

Within a warp, one strives to keep all threads executing the same instructions to avoid stalling threads.

If one thread is performing (x & y) >> z and another (a & b) << c, does the direction of the shift cause the threads to diverge?

Thanks.

njuffa · November 13, 2020, 7:56pm

Divergence refers to diverging control flow. From the scant information provided, one cannot tell whether there is divergent control flow in your code.

If we assume that your source code looks like this:

r = (cond) ? ((x & y) >> z) : ((a & b) << c);

there is a high likelihood that it would translate into machine code without any branches. Without a conditional branch, there cannot be control flow divergence.

Even if there were a local branch in the generated machine code, the compiler is very likely to insert a control-flow convergence point directly at the end of such an if-then-else construct (for older GPU architectures, these would be visible as a .s suffix on disassembled machine instructions). This kind of brief local divergence is not really harmful to performance, and programmers should not jump through hoops to avoid it.

rs277 · November 13, 2020, 8:11pm

I was thinking:
if(threadIdx.x == 0)
(x & y) >> z;
if(threadIdx.x == 1)
(a & b) << c;

…

__syncthreads;

repeating for a total of 32 threads, as I look to populate a shared memory array for the block to utilise, as opposed to just doing it all from thread 0, which would stall the rest of the block.

Robert_Crovella · November 13, 2020, 8:16pm

I believe it is pretty much guaranteed that those operations won’t be issued in the same (set) of instructions warp-wide. There will be conditional behavior evident in the SASS, either actual control divergence or else predication. Generally the compiler will lean towards predication, but I know of no way to make a guaranteed prediction by looking at source code. I don’t think you’re going to achieve this, for example: “one strives to keep all threads executing the same instructions” unless you mean “keep all threads execution the same instructions although some may be predicated off”

You can answer all of these questions yourself with the CUDA binary utilities.

rs277 · November 13, 2020, 8:27pm

Thanks, I hadn’t considered, “those operations won’t be issued in the same (set) of instructions warp-wide”. I was wondering if a few “shl” alongside “shr”'s in a warp would disrupt things, but it sounds like this won’t eventuate.

I’ll push on and see what happens.

Appreciate the help from you both.

njuffa · November 13, 2020, 8:39pm

The compiler backend, pxtas, performs machine-specific optimizations and has pretty accurate heuristics as to when to use branchless code based on predication and select-type instructions. On newer architectures is seems to prefer the use of select-type instructions from what I can tell.

From observation, if-conversion seems to be applied to singled-sided (if-then) or double-sided (if-then-else) conditionals with very small bodies (two to three instructions) only. For larger bodies, code based on (local) branching is faster by the compiler’s reckoning.

If I understand the example code correctly, your code assigns a different computation to every thread in a warp? If so, I think it is unlikly that if-conversion is applied. You can always check by looking at the generated machine code with cuobjdump --dump-sass.

rs277 · November 13, 2020, 8:43pm

Yes, this would be the case.

njuffa · November 13, 2020, 9:28pm

Given that, I am speculating it could be faster to have all the work done by a single thread, thus minimizing branching overhead. Instead of making assumptions, it seems best to run a quick experiment trying both variants.

Topic		Replies	Views
Avoid branching ... CUDA Programming and Performance	3	3601	May 19, 2010
Is there efficient way to deal with if/else in the kernel CUDA Programming and Performance	4	13885	June 14, 2009
Must all threads execute the same code? "Branch divergence occurs only within a warp" CUDA Programming and Performance	5	2939	December 28, 2008
Thread divergence when block size is equal to warp size CUDA Programming and Performance	2	598	June 5, 2019
Branch divergence and executing serial could be misinterpretted. CUDA Programming and Performance	8	3933	December 21, 2016
Question about control flow divergence CUDA Programming and Performance	4	7311	July 24, 2008
Is there warp divergence in reduce0 kernel which is implemented in the CUDA sample Reduction? CUDA Programming and Performance	4	862	January 8, 2020
Thread Divergence CUDA Programming and Performance	2	2730	September 27, 2008
Wacking the CUDA performance Is this how you can screw up you CUDA CUDA Programming and Performance	16	21235	March 12, 2007
reduction optimization #1 Not agree with performances explanation CUDA Programming and Performance	8	6664	August 1, 2008

Shift direction and divergence

Related topics