Branching Performance Hit

adakkak · June 25, 2009, 6:20pm

I am wondering if writing the following:

x = 1*(foo == 1) + 2*(foo == 2) + (x+1)*(foo != 1 && foo != 2);

is in any way better for performance than

if (foo == 1)
    x = 1;
else if (foo == 2)
    x = 2;
else
    x++;

Thank you.

jph4599 · June 25, 2009, 7:00pm

I’m also not sure about the branching penalties…but I generally try to find ways to avoid else statements…

I’m not sure if this was just a simplified example? I’m not sure if this would be any better, but couldn’t you do something like this?

if (foo == 1 || foo == 2)
{
    x = foo - 1;
}
x++;

adakkak · June 25, 2009, 8:29pm

Yes it is a simplified example.

MisterAnderson42 · June 25, 2009, 8:37pm

It has never made a difference in any of the kernels I have tried it on.

Nico · June 25, 2009, 8:53pm

My guess is that, if the code contained within the if/else statements is short enough, both paths are being calculated but only the result of the path that is “true” will be stored.
Your simplified code is also dependent on comparisons (predicates) which are basically the same as if/else test.
Maybe you should have a look at the difference in PTX code.

N.

adakkak · June 26, 2009, 2:11pm

First, thanks for the clarification. Most of the discussion revolved around simple code, since that is what I posted, but what about the following code that I wrote last year? Back then I believed that conditionals were a performance hit regardless of their content, so my program contained none.

tripple = (a ^ b) & (b ^ c) & (c ^ d) & (d ^ e) & (e ^ f);
db1     = (a & d & ~(b | c | e | f));
db2     = (b & e & ~(a | c | d | f));
db3     = (c & f & ~(a | b | d | e));
ro      = flip_coin( );
k1[x][y] = c1[x][y] ^ ((tripple | db1 | (ro & db2) | (~ro & db3)) & boundary[x][y]);
k2[x][y] = c2[x][y] ^ ((tripple | db2 | (ro & db1) | (~ro & db3)) & boundary[x][y]);
k3[x][y] = c3[x][y] ^ ((tripple | db3 | (ro & db1) | (~ro & db2)) & boundary[x][y]);
k4[x][y] = c4[x][y] ^ ((tripple | db1 | (ro & db2) | (~ro & db3)) & boundary[x][y]);
k5[x][y] = c5[x][y] ^ ((tripple | db2 | (ro & db1) | (~ro & db3)) & boundary[x][y]);
k6[x][y] = c6[x][y] ^ ((tripple | db3 | (ro & db1) | (~ro & db2)) & boundary[x][y]);

Written without the bitwise operations, this code would branch. What is the performance hit of branching, though? I ask because, as you can probably tell, this code is hard to understand.

P.S. In case you are wondering, this code performs a fluid dynamics simulation using lattice gas methods.
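For readers trying to follow the pattern, the `(ro & db2) | (~ro & db3)` terms above are a per-bit select: where a bit of `ro` is 1 the result takes the bit from the first operand, otherwise from the second. The following is a minimal host-side C++ sketch of that idea, assuming `ro` is a 32-bit mask; the function names `select_bits` and `select_bits_branchy` are made up for illustration and are not part of the original program. It shows equivalence only and says nothing about which form is faster on a GPU.

```cpp
#include <cassert>
#include <cstdint>

// Branch-free per-bit select (hypothetical helper name):
// take the bit from `a` where `mask` is 1, and from `b` where `mask` is 0.
// This is the same shape as (ro & db2) | (~ro & db3) in the post.
inline std::uint32_t select_bits(std::uint32_t mask, std::uint32_t a, std::uint32_t b) {
    return (mask & a) | (~mask & b);
}

// Readable reference version: the same choice written with an explicit
// conditional for every bit position.
inline std::uint32_t select_bits_branchy(std::uint32_t mask, std::uint32_t a, std::uint32_t b) {
    std::uint32_t out = 0;
    for (int bit = 0; bit < 32; ++bit) {
        const std::uint32_t m = 1u << bit;
        if (mask & m) {
            out |= (a & m);   // mask bit set: copy this bit from a
        } else {
            out |= (b & m);   // mask bit clear: copy this bit from b
        }
    }
    return out;
}
```

Having a readable reference like this next to the bitwise version makes it possible to check the obfuscated form against the intent before worrying about speed.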

jph4599 · June 26, 2009, 2:23pm

I think it would be easiest to just benchmark this…you have non-branching code. Why not convert the above code so that it is easier to read by adding branches and compare the execution time of both using the high precision GPU timers through events?

adakkak · June 26, 2009, 6:22pm

Are there general guidelines? I simply cannot benchmark every time I write an if statement.

_Big_Mac · June 27, 2009, 10:42am

You can put it through the profiler with zero effort after each new if statement.

Usually branches aren’t as evil as they seem, as long as your code is not compute bound. If it’s memory bound, like most kernels, there might even be no performance hit at all.

cvnguyen · June 27, 2009, 12:13pm

The following code would outperform both of yours since it contains no warp divergence:

tmp = (foo == 1) ? 1 : x + 1;
x = (foo == 2) ? 2 : tmp;
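Before comparing speed, it is worth confirming that the variants posted in this thread compute the same thing. The sketch below is plain host-side C++ with hypothetical function names (`update_arith`, `update_branch`, `update_single_if`, `update_select`); each one wraps one of the forms above and returns the new value of `x`. It checks equivalence only and makes no claim about which form the CUDA compiler turns into faster code.

```cpp
#include <cassert>

// Form 1: arithmetic on comparison results (original question).
inline int update_arith(int foo, int x) {
    return 1 * (foo == 1) + 2 * (foo == 2) + (x + 1) * (foo != 1 && foo != 2);
}

// Form 2: plain if / else if / else (original question).
inline int update_branch(int foo, int x) {
    if (foo == 1)
        x = 1;
    else if (foo == 2)
        x = 2;
    else
        x++;
    return x;
}

// Form 3: single if followed by an unconditional increment (jph4599).
inline int update_single_if(int foo, int x) {
    if (foo == 1 || foo == 2) {
        x = foo - 1;
    }
    x++;
    return x;
}

// Form 4: two chained conditional expressions (cvnguyen).
inline int update_select(int foo, int x) {
    int tmp = (foo == 1) ? 1 : x + 1;
    return (foo == 2) ? 2 : tmp;
}
```

All four agree for any `foo` as long as `x + 1` does not overflow, so the choice between them can be made on readability and measured performance alone.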

Sylvain_Collange · June 27, 2009, 5:19pm

I think there is some confusion around branching performance.

When the CUDA manual says that divergent branches are expensive, it means that they can be as inefficient as predication is. That is, in the worst case, all paths will have to be executed sequentially.

So if you write branches that may diverge, it may or may not be inefficient.
If you replace it with predication, it will be inefficient (you always execute all code paths even when there is no divergence).
Even worse, if you use boolean expressions to combine results instead of taking advantage of hardware predication, you will also pay the overhead of the boolean operators.

So, the general guideline is: just trust your compiler unless you have a very good reason not to do so.
The CUDA compiler will use predication for all short conditionals anyway.
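The argument above can be made concrete with a toy cost model. The sketch below is host-side C++, not a measurement: it assumes a 32-lane warp, counts only the instructions issued for the two sides of an `if (cond) A else B`, and ignores the branch instructions themselves and any memory effects. The names `issued_with_branch` and `issued_with_predication` are hypothetical.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;  // assumed warp width

// Toy model of a real branch: a side is issued only if at least one lane
// in the warp takes it. A uniform warp pays for one side, a divergent warp
// pays for both sides, one after the other.
inline int issued_with_branch(const std::array<bool, kWarpSize>& cond,
                              int cost_then, int cost_else) {
    bool any_then = false;
    bool any_else = false;
    for (bool c : cond) {
        if (c) any_then = true;
        else   any_else = true;
    }
    return (any_then ? cost_then : 0) + (any_else ? cost_else : 0);
}

// Toy model of predication (or of combining results with boolean
// arithmetic): both sides are always issued, whatever the lanes decide.
inline int issued_with_predication(int cost_then, int cost_else) {
    return cost_then + cost_else;
}
```

In this model the branch never costs more than predication and costs less whenever the whole warp agrees, which is the point being made about hand-written branch-free code. For very short bodies the ignored branch overhead can outweigh that saving, which is consistent with the compiler choosing predication for short conditionals.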

_Tom · June 28, 2009, 1:38pm

So, for example, if I have an if statement that is always true, it wouldn’t have much more performance impact than, say, a #define?

Using ifs it would be much easier to enable or disable specific features in the code…

Sylvain_Collange · June 30, 2009, 1:55pm

If the condition of the if is known at compile-time (constant or template parameter), the compiler will remove the unused branch.

Otherwise the impact should still be minimal (from 1 to 3 extra instructions to fetch and decode).
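As an illustration of the compile-time case, here is a minimal C++ sketch of a feature switch passed as a template parameter. The function `step` and the flag `kEnableDamping` are invented for this example; the same pattern can be applied to a templated CUDA kernel, but this snippet is host-only and does not show what any particular compiler version emits.

```cpp
#include <cassert>

// Hypothetical feature flag passed as a template parameter. Within each
// instantiation the condition is a compile-time constant, so an optimizing
// compiler is free to drop the dead side of the if entirely.
template <bool kEnableDamping>
inline float step(float v) {
    v += 1.0f;                // work common to both configurations
    if (kEnableDamping) {     // constant per instantiation
        v *= 0.5f;            // only present when the feature is enabled
    }
    return v;
}
```

A caller picks the variant once, for example `step<true>(v)` or `step<false>(v)`, which keeps the feature toggles readable in the source without a preprocessor `#define` around each block.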

cbuchner1 · June 30, 2009, 2:01pm

What confuses me is that I haven’t ever seen predication used in the compiler-generated PTX, although the PTX language specification certainly has predication features. I’ve always found branches in the PTX, even for short if / else expressions.

To prove me wrong, post some C++ and resulting PTX showing predication ;)

Christian

MisterAnderson42 · June 30, 2009, 2:12pm

I posted exactly this same question a while back. wumpus (or was it some other decuda expert?) replied stating that the conversion is done by ptxas and only ends up in the cubin. I never got motivated enough to actually check for myself, though.

cbuchner1 · June 30, 2009, 2:21pm

So when I’ve got no choice as a programmer, then Sylvain Collange’s statement

“If you replace it with predication, it will be inefficient (you always execute all code paths even when there is no divergence).”

does not really help anybody. ;) Because the PTX->cubin assembler/optimizer chooses when or when not to apply predication.

Christian
