How many divergent branches can actually be discussed in parallel?

Hi,

I am trying to use CUDA for a problem which is not really data parallel. Thus, I have a lot of divergent branches, which causes serialization within the warps.

So, how many divergent branches can be executed in parallel per block? I guess that might be 8 (on the G80 for example, because the G80 multiprocessors have 8 scalar processors)? Does anyone know anything about it?

Thanks for your answers!

Erik

The question should be “how many branches can be executed in parallel per warp”. The answer is one. But the loss of performance depends heavily on the nature of the branches.

Let’s assume your block has 128 threads. This evaluates to 4 warps, threads 0-31 end up in warp 0, threads 32-63 in warp 1 etc.
When a warp hits an MP, all SPs execute the same single instruction (SIMD). Each SP takes care of four threads, so the MP processes a single instruction from a single warp at a time.

Now, if your branches are warp-aligned, meaning threads 0-31 take one path, threads 32-63 take another etc., you get full performance and don’t suffer from branching at all. If thread 0 takes one branch and threads 1-31 take another, the warp is serialized and you lose half of the performance (assuming here that all branches take the same time T to execute). If thread 0 takes one path, thread 1 takes another and threads 2-31 take a third control path, the execution time for this warp becomes 3T, and so on.

The maximum penalty for complete divergence (i.e. every thread of a warp follows a different control path) is 32x. In this pessimistic scenario, the 32 paths of each warp are processed one after another until the warp finishes.

Here’s my guess (and a kind of hand-wavy method)…
Each divergent branch is considered to be a separate instruction stream and executed accordingly. So, if all 32 threads in a warp diverge, it’ll take 32 × 4 = 128 clocks to issue each instruction slot of this warp, whereas without divergence it would have taken 4 clocks (assuming 8 SPs per SM). Hence, I would consider each such divergent branch to be a separate warp. :whistling:
So, if you had 32 divergent branches in a warp, I would count that as 32 separate warps => 32 × 32 threads running! This way, using the number of threads launched onto the system + a divergence factor + the maximum possible threads in a kernel, you could estimate the limit.

It takes 4 cycles to get a single instruction of a warp scheduled on the 8 SPs.
If a warp diverges, some threads get masked out and the branches are executed sequentially, instruction by instruction, each taking another 4 clocks to schedule (and a varying number of clocks to complete, depending on the instruction). Actually, counting the clocks exactly is not trivial. There were a bunch of threads about instructions-per-clock and it’s beyond the scope of this post ;)

Pictures make this easy to visualise:
Blue arrows represent threads. Fat means active, thin are masked out. Consecutive levels represent, roughly, clock cycles (or fours of them).


Note that whether non-taken branches really get skipped (through a jump) or are executed in a completely masked-out state (thus wasting clocks) depends on compiler voodoo. For big conditional blocks it’s better to be able to skip one entirely when no thread wants to execute it; for small blocks the guarding jump is an extra instruction to be processed (even when the jump isn’t taken), so it’s better to leave them as is, especially since arithmetic-logic instructions are cheap on GPUs. Which blocks get a jump placed before them is at the compiler’s discretion. The pictures assume the doX() calls are heavy functions and that the compiler placed conditional jumps.

Thank you all very much for your answers!

Is the branching done at runtime or at compile time? I am wondering how branching behaves in the following loop:

for (int i = 0; i < threadID + 5; i++) {
    fooBar();
}

All threads execute fooBar() at least 5 times. Are these 5 executions of fooBar() done in parallel, or are the threads serialized?

One other question: if all threads of a warp are on the same code path, shouldn’t all threads be synced after a common code line?

Best,

Erik

Branching is done at runtime, dynamically.

In this example, the 5 runs of the loop will execute in parallel. In some cases, more iterations will run in parallel, e.g. when the threadIDs of an active warp are in the range 64-95, 69 iterations will run in parallel.

I believe in current devices threads reconverge at the first possibility.