Branching Performance Hit

I am wondering if writing the following:

x = 1*(foo == 1) + 2*(foo == 2) + (x + 1)*(foo != 1 && foo != 2);

is in any way better for performance than

if (foo == 1)
    x = 1;
else if (foo == 2)
    x = 2;

thank you

I’m also not sure about the branching penalties…but I generally try to find ways to avoid else statements…

I’m not sure if this was just a simplified example, and I’m not sure if it would be any better, but couldn’t you do something like this?

if (foo == 1 || foo == 2)
    x = foo - 1;



Yes it is a simplified example.

It has never made a difference in any of the kernels I have tried it on.

My guess is that, if the code contained within the if/else statements is short enough, both paths are being calculated but only the result of the path that is “true” will be stored.
Your simplified code also depends on comparisons (predicates), which are basically the same as an if/else test.
Maybe you should have a look at the difference in PTX code.


First, thanks for the clarification. Most of the discussion revolved around simple code, since that is what I posted, but what about the following code that I wrote last year? Back then I believed that conditionals were a performance hit regardless of their content, so my program contained none.

tripple = (a ^ b) & (b ^ c) & (c ^ d) & (d ^ e) & (e ^ f);
db1     = (a & d & ~(b | c | e | f));
db2     = (b & e & ~(a | c | d | f));
db3     = (c & f & ~(a | b | d | e));
ro      = flip_coin();

k1[x][y] = c1[x][y] ^ ((tripple | db1 | (ro & db2) | (~ro & db3)) & boundary[x][y]);
k2[x][y] = c2[x][y] ^ ((tripple | db2 | (ro & db1) | (~ro & db3)) & boundary[x][y]);
k3[x][y] = c3[x][y] ^ ((tripple | db3 | (ro & db1) | (~ro & db2)) & boundary[x][y]);
k4[x][y] = c4[x][y] ^ ((tripple | db1 | (ro & db2) | (~ro & db3)) & boundary[x][y]);
k5[x][y] = c5[x][y] ^ ((tripple | db2 | (ro & db1) | (~ro & db3)) & boundary[x][y]);
k6[x][y] = c6[x][y] ^ ((tripple | db3 | (ro & db1) | (~ro & db2)) & boundary[x][y]);

This code would branch if I did not use bitwise statements. What is the performance hit, though? As you can probably tell, this code is hard to understand.

P.S. In case you are wondering, this code performs a fluid dynamics simulation using lattice gas methods.

I think it would be easiest to just benchmark this…you have non-branching code. Why not convert the above code so that it is easier to read by adding branches and compare the execution time of both using the high precision GPU timers through events?

Are there general guidelines? I simply cannot benchmark every time I write an if statement.

You can put it through the profiler with zero effort after each new if statement.

Usually branches aren’t as evil as they seem, as long as your code is not compute bound. If it’s memory bound, like most kernels, there might even be no performance hit at all.

The following code would outperform both of yours since it contains no warp divergence:

tmp = (foo == 1)? 1 : x + 1;

x = (foo == 2)? 2 : tmp;

I think there is some confusion around branching performance.

When the CUDA manual says that divergent branches are expensive, it means that they can be as inefficient as predication is. That is, in the worst case, all paths have to be executed sequentially.

So if you write branches that may diverge, it may or may not be inefficient.
If you replace it with predication, it will be inefficient (you always execute all code paths even when there is no divergence).
Even worse, if you use boolean expressions to combine results instead of taking advantage of hardware predication, you will also pay the overhead of the boolean operators.

So, the general guideline is: just trust your compiler unless you have a very good reason not to do so.
The CUDA compiler will use predication for all short conditionals anyway.

So for example, if I have an if statement that is always true, it wouldn’t have much more performance impact than, say, a #define?

Using ifs it would be much easier to enable or disable specific features in the code…

If the condition of the if is known at compile-time (constant or template parameter), the compiler will remove the unused branch.

Otherwise the impact should still be minimal (from 1 to 3 extra instructions to fetch and decode).

What confuses me is that I haven’t seen any predication being used in the compiler-generated PTX, even though the PTX language specification certainly has predication features. I’ve always found branches in the PTX, even for short if/else expressions.

To prove me wrong, post some C++ and resulting PTX showing predication ;)


I posted exactly this same question a while back. wumpus (or was it some other decuda expert?) replied stating that the conversion is done by ptxas and only ends up in the cubin. I never got motivated enough to actually check for myself, though.

So when I’ve got no choice as a programmer, then Silvian Collange’s statement

“If you replace it with predication, it will be inefficient (you always execute all code paths even when there is no divergence).”

does not really help anybody ;) because the PTX-to-cubin assembler/optimizer chooses when or when not to apply predication.