How many divergent branches can actually be discussed in parallel?

Hi,

I am trying to use CUDA for a problem which is not really data parallel. Thus, I have a lot of divergent branches, which causes serialization within the warps.

So, how many divergent branches can be executed in parallel per block? I guess that might be 8 (on the G80 for example, because the G80 multiprocessors have 8 scalar processors)? Does anyone know anything about it?

Thanks for your answers!

Erik

The question should be “how many branches can be executed in parallel per warp”. The answer is one. But the loss of performance depends heavily on the nature of the branches.

Let’s assume your block has 128 threads. This evaluates to 4 warps, threads 0-31 end up in warp 0, threads 32-63 in warp 1 etc.
When a warp hits an MP, all SPs execute the same single instruction (SIMD). Each SP takes care of four threads, so the MP processes a single instruction from a single warp at a time.

Now, if your branches are warp-aligned, meaning threads 0-31 take one path, threads 32-63 take another etc., you get full performance and don’t suffer from branching at all. If thread 0 takes one branch and threads 1-31 take another, the warp is serialized and you lose half of the performance (assuming here that all branches take the same time T to execute). If thread 0 takes one path, thread 1 takes another and threads 2-31 take a third control path, the execution time for this warp becomes 3T, and so on.

The maximum penalty for complete divergence (i.e. every thread of a warp follows a different control path) is 32x. In this pessimistic scenario, the 32 paths of each warp are processed one after another until the warp finishes.

Here’s my guess (and a kind of hand-wavy method)…
Each divergent branch is considered to be a separate instruction stream and executed accordingly. So, if all 32 threads in a warp diverge, it’ll take 32 × 4 = 128 clocks to issue each instruction slot of this warp, whereas without divergence it would have taken 4 clocks (assuming 8 SPs per SM). Hence, I would consider each such divergent branch to be a separate warp. :whistling:
So, if you had 32 divergent branches in a warp, I would count that as 32 separate warps => 32 × 32 threads running! This way, using the number of threads launched onto the system + a divergence factor + the maximum possible threads in a kernel, you could estimate the limit.

It takes 4 cycles to get a single instruction of a warp scheduled on the 8 SPs.
If a warp diverges, some threads get masked out and the branches are executed sequentially, instruction by instruction, each taking another 4 clocks to schedule (and a varying number of clocks to complete, depending on the instruction). Actually, counting the clocks exactly is not trivial. There were a bunch of threads about instructions-per-clock and it’s beyond the scope of this post ;)

Pictures make this easy to visualise:
Blue arrows represent threads. Fat means active, thin are masked out. Consecutive levels represent, roughly, clock cycles (or fours of them).


Note that whether non-taken branches really get skipped (through a jump) or are executed in a completely masked-out state (thus wasting clocks) depends on compiler voodoo. For big conditional blocks it’s better to be able to skip one entirely when no thread wants to execute it; for small blocks the guarding jump is an extra instruction to be processed (even when the jump isn’t taken), so it’s better to leave them as is, especially since arithmetic-logic instructions are cheap on GPUs. Which blocks get a jump placed before them is at the compiler’s discretion. The pictures assume the doX() calls are heavy functions and that the compiler placed conditional jumps.

Thank you all very much for your answers!

Is the branching done at runtime or at compile time? I am wondering how branching behaves in the following loop:

for (int i = 0; i < threadID + 5; i++) {
    fooBar();
}

All threads execute fooBar() at least 5 times. Are these 5 executions of fooBar() done in parallel, or are the threads serialized?

One other question: if all threads of a warp are on the same code path, shouldn’t all threads be synced after a common code line?

Best,

Erik

Branching is done at runtime, dynamically.

In this example, the 5 runs of the loop will execute in parallel. In some cases, more iterations will run in parallel, e.g. when the threadIDs of an active warp are in the range 64-95, 69 iterations will run in parallel.

I believe in current devices threads reconverge at the first possibility.