Branch divergence and serial execution could be misinterpreted.

The documentation about branch divergence isn’t terribly clear.

It mentions that CUDA hardware will execute “branch divergence” serially.

But what is meant by this? It can be interpreted in (at least) two ways:

if (A < B)
{
    C = 1;
}
else
{
    C = 2;
}

This is a branch, so the threads might “diverge”. The documentation mentions it will be executed serially?!

But what is meant by this? Some possibilities:

  1. Each thread will execute serially, one after the other, thus 32 serial operations:

Thread 1 executes the branch after which
Thread 2 executes the branch after which
Thread 3 executes the branch after which
Thread 4 executes the branch after which
Thread 5 executes the branch after which
… and so forth… until…
Thread 32 executes the branch.

So the number of cycles needed to execute the branch goes from 1 to 32.

(assuming one instruction costs one cycle, or one branch costs only one cycle; in reality it’s probably a comparison + a jump + an assignment)
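To make interpretation 1 concrete, here is a rough host-side sketch of what I mean (my own illustration, with made-up names like WARP and lane; not a claim about the hardware): each lane finishes the whole branch before the next lane starts.

// Interpretation 1: each lane runs the entire branch before the next lane
// starts, i.e. 32 fully serial executions of the if/else.
#include <cstdio>

int main()
{
    const int WARP = 32;
    int A[WARP], B[WARP], C[WARP];
    for (int i = 0; i < WARP; ++i) { A[i] = i; B[i] = 16; }

    for (int lane = 0; lane < WARP; ++lane) {   // one lane at a time
        if (A[lane] < B[lane]) {
            C[lane] = 1;
        } else {
            C[lane] = 2;
        }
    }

    for (int lane = 0; lane < WARP; ++lane) printf("%d ", C[lane]);
    printf("\n");
    return 0;
}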

  2. However, perhaps something different was meant by the documentation:

All threads execute the full serial list of instructions, but in parallel with each other; the serial sequence is executed completely, just in parallel. This is a weird sentence and a weird concept to describe with only words.

Another way to describe it: all the (serial) instructions that the branch consists of are executed in parallel, step by step, one instruction executed by all threads in parallel, followed by the next one, and so forth.

This leads to a different concept:

Threads 1…32 execute the comparison in parallel, after which
Threads 1…32 execute (and/or jump to) target 1 of the branch (C=1) in parallel, after which
Threads 1…32 execute (and/or jump to) target 2 of the branch (C=2) in parallel.

During the second and third phase some executions might not take place because of a mask or predicate, or perhaps some results are simply thrown out. Though I have seen other documentation talk about a certain utilization of execution units?!
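To make interpretation 2 concrete as well, here is a rough host-side sketch (again my own illustration, with a made-up active mask; not a claim about the actual hardware): the whole warp steps through both sides, and a mask decides which lanes actually commit a result.

// Interpretation 2: one shared program counter for the warp; a per-lane
// mask decides which lanes commit results on each side of the branch.
#include <cstdio>

int main()
{
    const int WARP = 32;
    int A[WARP], B[WARP], C[WARP];
    for (int i = 0; i < WARP; ++i) { A[i] = i; B[i] = 16; }

    bool take_then[WARP];                       // hypothetical active mask
    for (int lane = 0; lane < WARP; ++lane)     // step 1: all lanes compare
        take_then[lane] = (A[lane] < B[lane]);

    for (int lane = 0; lane < WARP; ++lane)     // step 2: whole warp on "then";
        if (take_then[lane]) C[lane] = 1;       // masked-off lanes do not commit

    for (int lane = 0; lane < WARP; ++lane)     // step 3: whole warp on "else",
        if (!take_then[lane]) C[lane] = 2;      // with the mask inverted

    for (int lane = 0; lane < WARP; ++lane) printf("%d ", C[lane]);
    printf("\n");
    return 0;
}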

Now assuming the second interpretation is how the GPU actually works when executing these 32-thread warps… and the utilization of the ALU, FADD units and such is only 50%: can the GPU use the other 50% to process another warp? Or is the 50% that was not used simply wasted?

In the case of two-way divergence, such as might be experienced with a simple if…then…else construct, the GPU will force all threads (in lockstep) to execute both the “then” and the “else” path (assuming at least one thread would have chosen each path).

The mechanism under the hood to accomplish this may vary, but may involve something like predication, which you can read about in the PTX manual.

The net effect is that the code requires enough time to execute both the “then” and the “else” path, as if they were serial. It does not automatically diverge into 32 different threads of execution. The threads remain in lockstep. For the “then” path, those threads that would have selected that path will traverse it in a normal fashion (in lockstep), and the threads that would not have selected that path will usually be inactive (i.e. marching along with the other threads, but inactive). The reverse happens on the “else” path.
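As a minimal sketch (my own example, not taken from the documentation), consider a kernel where the two paths do different amounts of work; a warp whose lanes disagree on the condition pays roughly for both paths:

__global__ void divergent_if(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];
    if (x < 0.0f) {
        out[i] = -x;            // "then" path: cheap
    } else {
        out[i] = sqrtf(x) * x;  // "else" path: more work
    }
    // A warp with mixed-sign inputs walks both paths in lockstep (inactive
    // lanes along for the ride), so its cost is roughly the sum of the two;
    // a warp whose lanes all agree only pays for the path it takes.
}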

From a scheduling standpoint, as in all other situations, scheduling an instruction requires resources for all 32 threads in a warp, regardless of whether those threads are active or not.

There is no possibility to use “the other 50% to process another warp”. The execution resources for threads that are inactive or predicated off are simply wasted during those cycles.

I am not sure the masking of threads in a warp for divergent control flows and the predication of instructions as expressed in PTX or SASS really use the same mechanism, but I don’t think this is explained in detail in the available documentation. The predicate registers used to predicate instructions at the SASS level are not identical with the thread-activation mask used to handle divergent control flows. My understanding (don’t take this as gospel) is as follows:

The instruction-level predication works by issuing the instruction to an execution unit, and when the predicate turns out to be false, suppressing register write-back in the case of a register-to-register instruction, or suppressing the memory access in case of a register/memory instruction (to avoid out-of-bounds access; the address is computed but discarded). In other words, the instruction is essentially discarded at commit time. In pipelined execution, the processor cannot know the status of a predicate at instruction issue time, so “late squashing” of instructions that are predicated off is the best one can do without introducing bubbles into a pipeline.
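As a rough illustration of this “late squashing” (purely a sketch of my mental model, with made-up names like regs and predicated_add; not a description of the actual hardware):

#include <cstdio>

static int regs[8];  // stand-in for a register file

// Model of a predicated register-to-register instruction: the operation is
// issued and executed regardless; only the write-back is suppressed when
// the predicate turns out to be false.
void predicated_add(bool predicate, int dest, int a, int b)
{
    int alu_result = a + b;       // occupies the execution unit either way
    if (predicate) {              // predicate value is known only late
        regs[dest] = alu_result;  // commit only if predicated on
    }                             // otherwise the result is discarded
}

int main()
{
    predicated_add(true,  0, 2, 3);   // regs[0] becomes 5
    predicated_add(false, 1, 2, 3);   // work is done, result thrown away
    printf("%d %d\n", regs[0], regs[1]);
    return 0;
}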

Since the processor maintains an active-thread mask per warp, it knows at instruction issue time whether a particular thread in a warp is active or not. So one would hope (for power saving reasons, if no others), that no operations are actually issued to execution units for threads that are currently inactive, that is, “early squashing”. I don’t know whether this is what is actually taking place. Even assuming early squashing, due to the lockstep execution in fixed-sized groups that still means these execution units aren’t available to threads from other warps (so efficiency is negatively affected), but at least they would not be burning power, just idle.

Through divergent flow across multiple branches, one could certainly get to a point where only one thread in a warp is active, which would represent the maximum amount of serialization, and theoretically execution might not re-converge until the last instruction in the kernel. Such code would be functional but execute with very low efficiency. Part of the SASS optimizations performed by PTXAS is the placement of convergence points to improve code efficiency. I think these show up as SASS SYNC instructions in recent machine instruction sets.
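A contrived sketch of what I mean (my own example, nothing more): each nested branch splits the warp again, until only a single lane is active on the innermost path.

__global__ void narrowing(int *out)
{
    int lane = threadIdx.x % 32;
    int v = 0;
    if (lane < 16) {             // 16 of 32 lanes active
        if (lane < 2) {          //  2 of 32 lanes active
            if (lane < 1) {      //  1 of 32 lanes active: maximum serialization
                v = 3;
            } else {
                v = 2;
            }
        } else {
            v = 1;
        }
    }
    out[threadIdx.x] = v;
    // Re-convergence points placed by PTXAS ensure the full warp is back in
    // lockstep here, rather than staying split until the end of the kernel.
}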

I defer to @njuffa and have wordsmithed my previous statement a bit.

As I said, my understanding is limited, due to lack of detailed documentation of the GPU microarchitecture. I suspect one could dig out much detail from NVIDIA’s patents.

I don’t think the distinction between predication and inactive masking is essential to understanding how warp divergence is handled. You could imagine how it might work with predication, and end up with a reasonably good understanding of behavior. There may indeed be power differences in how it actually works, but these would be secondary considerations, IMO.

Since active-thread masks and predicate registers are physically different structures that get updated in distinct ways, I think one should not simply conflate the two, to avoid creating an incorrect mental model in the minds of programmers. This is why I decided to post in this thread, although I generally don’t like to comment on GPU hardware issues due to the deplorable state of documentation in that area, which tends to create uncertainty.

It should also be noted that predication is a “user-level” mechanism that is fully exposed to programmers at PTX level, while the manipulation of active-thread masks is not visible at PTX level but is handled by a combination of hardware mechanisms (for branch execution) and SASS-level instructions inserted by the compiler (e.g. SSY instructions and .S suffixes; I think the latter were replaced by a SYNC instruction in recent architectures, but I haven’t really checked the details as it is tedious to deal with).

Just to muddle the discussion further, Skybuck’s example of divergence doesn’t actually diverge control flow. This (common) situation, where a register is set to one of two provided values based on a predicate like (A<B), is a fundamental PTX selection operation and doesn’t cause code divergence. A similar example like “if (A<B) C=1; else D=1;” would be divergent because of the different output destinations.
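In kernel form, the two cases look like this (my own sketch; the kernel names are made up, and whether the compiler actually emits a select is up to it):

// Both sides write the same destination: the compiler can typically lower
// this to a compare + select (or predication) with no control flow at all.
__global__ void select_case(const int *A, const int *B, int *C)
{
    int i = threadIdx.x;
    if (A[i] < B[i]) {
        C[i] = 1;
    } else {
        C[i] = 2;
    }
}

// The two sides write different destinations: if lanes of a warp disagree
// on the condition, real (masked) divergence is required.
__global__ void divergent_case(const int *A, const int *B, int *C, int *D)
{
    int i = threadIdx.x;
    if (A[i] < B[i]) {
        C[i] = 1;
    } else {
        D[i] = 1;
    }
}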

This kind of analysis is exactly what njuffa and txbob are talking about in that the actual implementation (selection, predication, active thread masking) can be abstracted away from the CUDA programmer and it will just work, but as you peel open the layers and gain understanding, you do expose potential optimizations, especially in organizing data flow and execution in your very very inner loops where every clock matters.

I would say that from a practical performance optimization perspective, CUDA programmers should not worry about divergence unless the profiler tells them it has sizeable impact on the performance of their code.

Many questions and concerns about the performance impact of divergence I have seen raised on the internet are for code that is not actually impacted by divergence (since the branches in question get optimized out by the compiler), or only impacted in minor ways.