Branching in kernel

smokyboy · June 5, 2008, 11:25am

Hi,

I have the following kernel:

__global__ void func(char* table) {
    if (COND1) { …
    }
    if (COND2) { …
    }
    while (COND3) { …
    }
}

My understanding of GPU parallelism is as follows: a multiprocessor executes a block of threads, one warp of them at a time. Since a multiprocessor is SIMD, all of those threads should execute the same instruction. How does that fit with branching and looping, where each thread may go its own way? Is there a penalty, such that only the threads taking the same path run in parallel? Or does each processor have its own PC, so that threads that go separate ways still run concurrently?

kristleifur · June 5, 2008, 12:37pm

Divergent threads slow things down. I'm not sure how, or how much, but they do. You can check out the CUDA Visual Profiler; it will tell you if you have a lot of divergent branches.

This paper may be useful:
NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE
… but I don’t know if it discusses this exact topic.

Finally, is there something in the manual? Or Mark Harris’s lovely CUDA optimisation slides from Supercomputing 2007?

MisterAnderson42 · June 5, 2008, 12:42pm

When different warps take different branches, there is no performance penalty. When threads within a single warp diverge, there can be a penalty, because the warp executes each branch path in turn with the threads not on that path disabled. In practice, the hardware handles this very efficiently. Especially in the common case where your kernel is memory bound, the effect of divergent warps will be minimal. See also the CUDA Programming Guide, which explains in more detail what causes divergent warps.
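To make the distinction between divergence across warps and divergence within a warp concrete, here is a minimal sketch that times two kernels doing the same amount of work. Everything in it (the kernel names `branchPerWarp` and `branchPerThread`, the helpers `pathA`, `pathB` and `timeKernel`, the block size of 256) is a hypothetical illustration and not code from this thread. It assumes a CUDA-capable device and a block size that is a multiple of the warp size. How large the gap is, if any, depends on the hardware and compiler, so measure rather than assume.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two deliberately different code paths (hypothetical workloads).
__device__ float pathA(float x) {
    for (int i = 0; i < 256; ++i) x = x * 1.0001f + 0.5f;
    return x;
}
__device__ float pathB(float x) {
    for (int i = 0; i < 256; ++i) x = sqrtf(x * x + 1.0f);
    return x;
}

// Branch condition is the same for all threads of a warp:
// different warps diverge from each other, but no warp diverges internally.
__global__ void branchPerWarp(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) out[i] = pathA((float)i);
    else                                   out[i] = pathB((float)i);
}

// Branch condition alternates between neighbouring threads:
// every warp contains threads on both paths.
__global__ void branchPerThread(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) out[i] = pathA((float)i);
    else                      out[i] = pathB((float)i);
}

typedef void (*Kernel)(float*, int);

// Returns the elapsed time of one launch in milliseconds.
static float timeKernel(Kernel k, float* d_out, int n) {
    const int block = 256;                      // multiple of the warp size
    const int grid  = (n + block - 1) / block;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    k<<<grid, block>>>(d_out, n);               // warm-up launch
    cudaEventRecord(start);
    k<<<grid, block>>>(d_out, n);               // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 22;
    float* d_out = NULL;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    printf("uniform within warp  : %.3f ms\n", timeKernel(branchPerWarp, d_out, n));
    printf("divergent within warp: %.3f ms\n", timeKernel(branchPerThread, d_out, n));
    cudaFree(d_out);
    return 0;
}
```

Both kernels send half of the threads down each path, so any difference in the two timings comes from where the split falls relative to warp boundaries, not from the total amount of work.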

JHHPC · June 5, 2008, 1:25pm

Depending on the complexity of the kernel, you may run into problems with register usage. In my experience, a lot of if statements need a lot of registers.

As a result, fewer and fewer warps can run concurrently, and eventually the branching, or divergent branching, will take the last bit of performance out of your code.

But as long as enough warps run concurrently I would not concentrate on that issue.
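One way to check whether register pressure is limiting the number of concurrent warps is to ask the runtime directly. The sketch below assumes a reasonably recent CUDA toolkit, since `cudaOccupancyMaxActiveBlocksPerMultiprocessor` did not exist when this thread was written. The kernel `branchy` and the block size of 256 are hypothetical stand-ins, not the original poster's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with a few branches and a loop, for illustration only.
__global__ void branchy(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int v = in[i];
    if (v & 1) v = v * 3 + 1;
    if (v & 2) v ^= (v >> 3);
    while (v > 1000) v /= 2;
    out[i] = v;
}

int main() {
    // Registers per thread, as decided by the compiler for this kernel.
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, branchy) != cudaSuccess) {
        printf("cudaFuncGetAttributes failed\n");
        return 1;
    }

    // How many blocks of this size can be resident on one multiprocessor.
    const int blockSize = 256;   // assumed launch configuration
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, branchy,
                                                  blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("registers per thread : %d\n", attr.numRegs);
    printf("resident blocks / SM : %d\n", blocksPerSM);
    printf("resident warps  / SM : %d of %d\n", activeWarps, maxWarps);
    return 0;
}
```

Compiling with `nvcc -Xptxas -v` also prints the per-kernel register count at build time, which is a quick way to see whether adding a branch changed it.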
