Ternary operators and branching

Heya,

I’m looking for some pointers to understand how exactly branching works and how to avoid it so that the GPUs don’t start executing threads sequentially.

I need to solve two quandaries right now:

  1. I have a statement using a ternary operator: (x > 0.5) ? 1 : ((x < -0.5) ? -1 : 0). Would that branch? How can I rewrite it so that it doesn’t branch?

  2. If I have a conditional branch which executes every 1000th iteration of a kernel loop, that would branch, but would the GPU be smart enough to merge the different threads after the branch has finished? I’m talking about something like this:

for (int i = 0; i < 1000000; i++) {
	if (i % 1000 == 0) {
		counter++;
	}
	// rest of the loop. Ideally threads would merge here even if the condition matched for one of the threads.
}

To answer number two (and possibly number one as well): yes, the threads merge again. Divergent branches are handled via predication, so the entire warp still executes the branch; the threads that did not take it are simply masked out so that they have no effect (they don't write to registers, don't load, don't store, and don't set any error conditions). When the warp reaches the branch, the predication mask is set and the masked threads do nothing; when the branch ends, the mask is cleared. Voila, your warp is back to executing normally.
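To make the masking idea concrete, here is a minimal host-side sketch in plain C. It is only a model of the behaviour described above, not how the hardware is implemented: the names build_mask and predicated_increment are hypothetical, and a warp size of 32 is assumed.

```c
#include <stdint.h>

enum { WARP_SIZE = 32 };  /* assumption: 32 lanes per warp */

/* Build the predicate mask: bit `lane` is set when that lane's condition holds. */
uint32_t build_mask(const int cond[WARP_SIZE]) {
    uint32_t mask = 0;
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        if (cond[lane]) {
            mask |= (uint32_t)1 << lane;
        }
    }
    return mask;
}

/* Model of `counter++` issued once for the whole warp under a mask:
   every lane steps through the instruction, only active lanes commit. */
void predicated_increment(uint32_t mask, int counter[WARP_SIZE]) {
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        int active = (int)((mask >> lane) & 1u);
        counter[lane] += active;  /* masked-off lanes add 0, so nothing changes */
    }
}
```

After predicated_increment returns, the model has no leftover state: the next instruction is issued for all 32 lanes again, which is the "merge" the question asks about.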

Very short divergent branches usually aren’t a huge deal.
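As for the ternary in question one: a small expression like that is usually compiled to select or predicated instructions rather than a real jump, though that is up to the compiler and worth checking in the generated code. If you want to write it without any conditional at all, one option is to use the fact that a comparison in C evaluates to 0 or 1. The sketch below assumes x is a float; the function names are made up for illustration.

```c
/* The expression from the question, unchanged apart from float literals. */
int quantize_ternary(float x) {
    return (x > 0.5f) ? 1 : ((x < -0.5f) ? -1 : 0);
}

/* Same result with no conditional operator: each comparison yields 0 or 1,
   so the difference is 1, -1, or 0. */
int quantize_branchless(float x) {
    return (x > 0.5f) - (x < -0.5f);
}
```

Both versions return 0 at exactly 0.5 and -0.5, and also for NaN, since every comparison with NaN is false. In a kernel the same body could go into a __device__ function.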

I have a follow-up question to the above. I’m a little unclear on the following: what if all of the threads in the warp evaluate the conditional to be true, or all of them evaluate it to be false? Would this cause branching?

I think you mean: “Would this cause divergence?”

If all threads evaluate true (or all evaluate false) then the branch is not a problem - there is no divergence. Divergence occurs when some threads in a warp evaluate the condition as true and some as false. In that case the warp has to execute both paths one after the other (using predication to mask operations in the threads that did not take each path), and hence you get a performance hit.
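The rule above can be written as a check on a per-warp bitmask of which lanes took the branch. This is a host-side sketch in plain C with a hypothetical function name, assuming 32 lanes per warp; on the device a mask like this can be obtained with __ballot_sync.

```c
#include <stdint.h>

#define FULL_WARP_MASK 0xFFFFFFFFu  /* assumption: 32 lanes per warp */

/* Returns 1 when the lanes disagree (some bits set, some clear),
   0 when all lanes took the branch or none did. */
int warp_diverges(uint32_t taken_mask) {
    return taken_mask != 0u && taken_mask != FULL_WARP_MASK;
}
```

In the loop from the original question, the condition i % 1000 == 0 depends only on i. Assuming every thread runs the same loop with the same bounds, all lanes agree on every iteration, so the mask is either all zeros or all ones and the warp does not diverge there at all.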