Branching in kernel

Hi,

I have the following kernel:

__global__ void func(char* table) {
    if (COND1) { …
    }
    if (COND2) { …
    }
    while (COND3) { …
    }
}

My understanding of GPU parallelism is as follows: a multiprocessor executes a block of threads, one warp at a time. Since a multiprocessor is SIMD, all threads of a warp should execute the same instruction. How does that fit with branching and looping, where each thread can go its own way? Is there a penalty, such that only the threads that agree on a path run in parallel? Or does each processor have its own PC, so that threads that go separate ways still run concurrently?

Divergent threads slow things down. I'm not sure how, or by how much, but they do. You can check with the CUDA Visual Profiler; it will tell you if you have a lot of divergent branches.

This paper may be useful:
NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE
… but I don’t know if it discusses this exact topic.

Finally, is there something in the manual? Or Mark Harris’s lovely CUDA optimisation slides from Supercomputing 2007?

Different warps can branch differently with no performance penalty. If threads within a warp diverge, there can be a penalty: the warp executes each path that was taken one after the other, with the threads that are not on the current path disabled, and the threads reconverge afterwards. In practice, the hardware handles this very efficiently. Especially in the common case where your performance is memory bound, the effect of divergent warps will be minimal. See also the CUDA programming guide, which has a better explanation of what causes divergent warps.
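To make the warp-level distinction concrete, here is a minimal sketch (not taken from the original kernel; the kernel names `divergent` and `uniform`, the buffer `out`, and the sizes are made up for illustration). In the first kernel, neighbouring threads of one warp disagree on the condition; in the second, the condition is the same for every thread of a warp, so only different warps branch differently. For bodies this small the compiler may well use predication instead of real branches, so treat it as an illustration of the pattern rather than a benchmark.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Divergent within a warp: even and odd lanes take different paths,
// so a warp may have to run both paths one after the other.
__global__ void divergent(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) {
        out[i] = 2.0f * i;
    } else {
        out[i] = 0.5f * i;
    }
}

// Uniform within a warp: the condition depends only on the warp index
// (1D blocks assumed), so all threads of a warp take the same path.
__global__ void uniform(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) {
        out[i] = 2.0f * i;
    } else {
        out[i] = 0.5f * i;
    }
}

int main() {
    const int n = 1 << 20;       // illustrative size, a multiple of the block size
    const int block = 256;
    const int grid = (n + block - 1) / block;

    float* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));

    divergent<<<grid, block>>>(d_out, n);
    uniform<<<grid, block>>>(d_out, n);

    // Wait for both kernels and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Timing the two kernels, or looking at the divergent branch counter in the profiler, should show whether the difference matters for your case.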

Depending on the complexity of the kernel, you may run into problems with register usage. In my experience, a lot of if statements need a lot of registers.

As register usage grows, fewer and fewer warps can run concurrently, and in the end branching or divergent branching will take the last bit of performance out of your code.

But as long as enough warps run concurrently I would not concentrate on that issue.
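If you want to check how many registers a kernel uses and how many warps can be resident, something like the following sketch should work with a reasonably recent CUDA toolkit. The kernel body here is a hypothetical stand-in (the original body is elided), and the block size of 256 is just an assumption. Compiling with `nvcc --ptxas-options=-v` also prints the register count per kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel in the question.
__global__ void func(char* table) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (table[i] > 0) {
        table[i] = 0;
    }
}

int main() {
    // Registers per thread as compiled for the current device.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, func);
    if (err != cudaSuccess) {
        printf("error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread: %d\n", attr.numRegs);

    // How many blocks of this size fit on one multiprocessor at once.
    const int blockSize = 256;   // assumed launch configuration
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, func,
                                                  blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int residentWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("resident warps per SM: %d of %d\n", residentWarps, maxWarps);
    return 0;
}
```

If the resident warp count drops sharply after you add more branches, register pressure is a likely suspect.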