Thread divergence when block size is equal to warp size


I have a condition in my kernel code that depends on some value from the first thread within a block.

It looks like this:

a = __shfl_sync(0xffffffff, value, 0);

if (a > 0) {
    // one path
} else {
    // other path
}
After analyzing the generated CUDA assembly (SASS), I discovered thread divergence that worsens performance. In my case the block size is (32,1,1), which, as far as I understand, guarantees that all threads in the block always follow the same path.

Is there any way to give the compiler a hint so it can optimize this?

The potential for divergence is there in the code. However, your arrangement guarantees that all threads in the warp will follow either the if path or the else path together, resulting in no divergence at runtime.

In general, only the potential for divergence can be analyzed statically. An actual determination of runtime divergence can, in the general case, only be made with knowledge of runtime data.

In this particular case, where you are broadcasting the same value to all threads in the warp, and then doing some boolean test on that value, there is no possibility for divergence at the warp level.
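As a sketch of the pattern under discussion (the kernel and variable names here are hypothetical, not from the original post), note that after the shuffle every lane holds lane 0's value, so the predicate evaluates identically across the warp:

```cuda
// Hypothetical illustration: a value is broadcast from lane 0 to the
// whole warp, so the branch predicate is uniform and the warp takes a
// single path at runtime.
__global__ void uniformBranchKernel(const int *in, int *out)
{
    int tid = threadIdx.x;      // block is (32,1,1): exactly one warp

    int value = in[tid];

    // Broadcast lane 0's value to every lane in the warp.
    int a = __shfl_sync(0xffffffff, value, 0);

    if (a > 0) {
        // Either all 32 lanes take this path together ...
        out[tid] = value * 2;
    } else {
        // ... or all 32 lanes take this path together.
        out[tid] = 0;
    }
}
```

Both sides of the branch still appear in the compiled SASS (which you can inspect with `cuobjdump --dump-sass` on the compiled object), but at runtime the whole warp executes only one of them.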

It’s not clear what optimization you would like to hint about.

Of course, both the if path and the else path must exist in the resultant compiler-generated code, along with some path-selection logic: either predication or a conditional jump/branch (or both).

Thank you, I may have been misinterpreting those SSY/SYNC instructions. In my case the threads do not actually diverge and always follow the same path.