Thread divergence when block size is equal to warp size


I have a condition in my kernel code that depends on some value from the first thread within a block.

It looks like this:

a = __shfl_sync(0xffffffff, value, 0);

if (a > 0) {
    // one path
} else {
    // other path
}
After analyzing the generated CUDA assembly (SASS), I discovered thread divergence that worsens performance. In my case the block size is (32,1,1), which, as far as I understand, guarantees that all threads in the block always follow the same path.

Is there any way to give the compiler a hint so it can optimize this?

The potential for divergence is there in the code. However, your arrangement guarantees that all threads in the warp will follow either the if path or the else path together, resulting in no divergence at runtime.

In general, only the potential for divergence can be analyzed statically. An actual determination of runtime divergence can, in the general case, only be made with knowledge of runtime data.

In this particular case, where you are broadcasting the same value to all threads in the warp, and then doing some boolean test on that value, there is no possibility for divergence at the warp level.
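As a sketch of the pattern under discussion (the kernel and variable names here are hypothetical, not from the original post), note that after the shuffle every lane holds lane 0's value, so the predicate evaluates identically across the warp:

```cuda
// Hypothetical illustration: a value is broadcast from lane 0 to the
// whole warp, so the branch predicate is uniform and the warp takes a
// single path at runtime.
__global__ void uniformBranchKernel(const int *in, int *out)
{
    int tid = threadIdx.x;      // block is (32,1,1): exactly one warp

    int value = in[tid];

    // Broadcast lane 0's value to every lane in the warp.
    int a = __shfl_sync(0xffffffff, value, 0);

    if (a > 0) {
        // Either all 32 lanes take this path together ...
        out[tid] = value * 2;
    } else {
        // ... or all 32 lanes take this path together.
        out[tid] = 0;
    }
}
```

Both sides of the branch still appear in the compiled SASS (which you can inspect with `cuobjdump --dump-sass` on the compiled object), but at runtime the whole warp executes only one of them.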

It’s not clear what optimization you would like to hint about.

Of course, both the if path and the else path must exist in the resultant compiler-generated code, along with some path-selection logic: either predication or a conditional jump/branch (or both).

Thank you, I may have been misinterpreting those SSY/SYNC instructions. In my case the threads do not actually diverge and always follow the same path.