Avoid branching ...


I’m a CUDA beginner, currently reading the Programming Guide. In the section “Control Flow Instructions” (5.4.2) I found the following paragraph:

I don’t understand why the code does not diverge when using the example condition. How are the threads scheduled? Does the scheduler select for a warp only those threads for which (threadIdx / warpSize) is equal?
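To make the guide’s point concrete, here is a minimal sketch (kernel name and values are illustrative, not from the guide) contrasting a warp-uniform condition with a divergent one:

```cuda
__global__ void branchExample(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Warp-uniform condition: all 32 threads of a warp compute the same
    // value of (threadIdx.x / warpSize), so the entire warp takes the
    // same path -- the branch exists, but there is no divergence.
    if ((threadIdx.x / warpSize) % 2 == 0)
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;

    // Divergent condition: even and odd lanes of the SAME warp take
    // different paths, so the warp executes both branches one after
    // the other, with the inactive lanes masked off each time.
    if (threadIdx.x % 2 == 0)
        out[tid] += 10.0f;
    else
        out[tid] += 20.0f;
}
```

In the first `if`, threads are not rearranged into warps by the scheduler; the condition simply happens to evaluate identically for every thread of a given warp, so no lane is ever masked.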



That is pretty much correct. Divergence between different warps generally does not affect performance. Branch divergence within a warp can be a total performance killer.

So I can’t control which threads will be selected for execution in a warp? Is that correct?

I tended to think that threads are bundled into warps according to their IDs, meaning threads 0–31 execute in the first warp, threads 32–63 in the second warp, and so on.
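That mapping can be sketched with plain host code (assuming a warp size of 32, which is what current NVIDIA GPUs use):

```cuda
#include <cstdio>

int main()
{
    const int warpSize = 32;  // fixed at 32 on current NVIDIA hardware

    // Linear thread index within a block -> warp index and lane index.
    int sampleTids[] = {0, 31, 32, 63, 64};
    for (int i = 0; i < 5; ++i) {
        int tid = sampleTids[i];
        printf("thread %2d -> warp %d, lane %d\n",
               tid, tid / warpSize, tid % warpSize);
    }
    return 0;
}
```

So threads 0–31 land in warp 0, threads 32–63 in warp 1, and so on; the assignment is a fixed function of the thread index, not a scheduling decision.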

That is how it works. Only when there is intra-warp divergence will you start seeing penalties from branching.