Branch or not

Does this cause a branch under nvcc?

double a = oldValue;
bool ind = x > y;

a += (newValue - oldValue) * ind;

It would be equivalent to:

double a = oldValue;
if(x > y) a = newValue;

Just thinking about tricks to avoid if statements and whether this can help, or if the two end up being compiled to the same assembler - i.e. ind = x > y requires a branch anyway…

If I look at PTX I can see the following:


//c:/TestBranch/ 	double a = oldValue;
	.loc	1 5 11
	mov.f64 	%fd5, %fd1;

//c:/TestBranch/ 	bool ind = x > y;
	.loc	1 6 11	%p1, %fd3, %fd4;
	selp.u16	%rs1, 1, 0, %p1;
	cvt.s16.s8 	%rs2, %rs1;

//c:/TestBranch/ 	a += (newValue - oldValue) * ind;
	.loc	1 8 2
	sub.f64 	%fd6, %fd2, %fd1;
	cvt.rn.f64.s16	%fd7, %rs2;
	mul.f64 	%fd8, %fd6, %fd7;
	add.f64 	%fd9, %fd5, %fd8;

//c:/TestBranch/ 	return a;
	.loc	1 9 2
	mov.f64 	%fd10, %fd9;
	st.param.f64	[func_retval0+0], %fd10;



//c:/TestBranch/ 	double a = oldValue;
	.loc	1 14 11
	mov.f64 	%fd5, %fd1;

//c:/TestBranch/ 	if (x > y) a = newValue;
	.loc	1 15 2	%p1, %fd3, %fd4;
	not.pred 	%p2, %p1;
	@%p2 bra 	BB7_2;
	bra.uni 	BB7_1;


//c:/TestBranch/ 	if (x > y) a = newValue;
	.loc	1 15 13
	mov.f64 	%fd6, %fd2;


//c:/TestBranch/ }
	.loc	1 16 1
	st.param.f64	[func_retval0+0], %fd7;

I am not fluent in PTX so would appreciate any expert views on whether the first snippet of PTX contains a branch or not… (I suspect the branch is the lines:

@%p2 bra 	BB7_2;
bra.uni 	BB7_1;

in the second block, which if so, my idea is proved correct, but I am not sure).

For anything relevant to performance, looking at PTX is meaningless. It’s a portable ISA and compiler intermediate representation. Look at the machine code (SASS) instead with cuobjdump --dump-sass.

To first order, do not worry about branches when writing CUDA code. Just write the code in a natural readable style. If the CUDA profiler tells you that branch divergence has a significant negative impact on performance, then look at the relevant branch(es). I predict that this will hardly ever be the case: there are typically “bigger fish to fry” when optimizing for the GPU.

Thanks, I will look at the machine code. Maybe I’m being too paranoid but I’m writing a library of mathematical special functions and performance will be important for its success. I like to know roughly how the compiler translates the code and how to minimise branching.

Local branches with small loop bodies typically translate into a select instruction (think C’s ternary operator) or predicated execution. The choice is up to undocumented compiler-internal heuristics that are subject to change, but examination of the generated machine code shows that the compiler’s choices make sense 95% of the time.

I have extensive experience writing CUDA code for performance-optimized elementary and special functions libraries. Branches at the source code level are a minor issue in my experience. The compiler mostly can be trusted to do the right thing, but that should be verified of course. Selecting the most appropriate algorithm is often the hardest part of an efficient implementation. I have spent weeks on that in some cases.

What kind of special functions are you working on?

it seems that you try to judge from CPU experience where unpredicted branch is a big hit. on GPU, all operations are predicated (similar to AVX512), so your code code will be executed as two operations:

p1 := (x > y) 
@p1  a := newValue  // assignment performed only if p1 is true

this code is as fast as possible. and while “if” body contains up to ~6 commands, compiler generates just a sequence of predicated operations

Thanks for the help, I’m writing a library that will cover special functions such as the incomplete beta function. I had a look at the sass assembler and you are both absolutely right about the predicated operation:

/*0188*/                   DSETP.GT.AND P0, PT, R8, R10, PT;   /* 0x5b84038000a70807 */
        /*0190*/                   PSETP.AND.AND P0, PT, !P0, PT, PT;  /* 0x50900380e0078007 */
        /*0198*/                   SSY 0x208;                          /* 0xe290000006800000 */
                                                                       /* 0x007fbc03fde01fef */
        /*01a8*/               @P0 SYNC;                               /* 0xf0f800000000000f */
        /*01b0*/                   BRA 0x1b8;                          /* 0xe24000000007000f */

(although undocumented assembler is a bit mysterious!).

If I can summarize, when writing CUDA code with performance in mind, branching is only an issue if you have a branches with large amounts of code within one branch so the cost of divergence is high - e.g. if one thread within a warp takes a long way through the branch all the other threads need to wait until the thread that took the long route executes and returns to the common code.

Also if you have a loop that may execute different numbers of iterations then the threads within the warp will effectively all loop as long as the thread with highest number of iterations. This makes:


loops potentially troublesome if the condition may take a long time to be reached in rare cases.

That’s a bit too simplistic. You will also have to take into account the probability of divergence across the threads in a warp. Often times, library code for floating-point functions will comprise a fastpath (that handles the common case) and a slowpath (that handles difficult corner cases). Many times, the probability of hitting the slow path is low. Depending on the source data distribution (in real-life use cases, there is often value locality), the chance of divergence may then be even lower.

For the most part, worries about branch divergence should be deferred until after a prototype implementation can be profiled in something approximating a real-life context. In my experience, branch divergence is rarely the most pressing performance issue in special function evaluation.

A couple of tips:

[1] I have found that one can often cut down on slowpath complexity by only implementing the fastpath first, then checking thoroughly which special cases it already happens to handle correctly, and adding slowpath code only for the remaining special cases. In other words, some special cases are in fact handled by the fastpath code going about its normal business. Do make note of that fact in comments, otherwise the next person touching the code may be in for a nasty surprise. This is a trade-off between performance and maintainability.

[2] The following pattern is often conducive to superior performance (credit goes to Alex Fit-Florea for pointing this out to me years ago). Compute the fast path result. Immediately after that, check whether a special case needs to be handled, and then invoke the slowpath if necessary, conditionally overwriting the fastpath result. This often results in better performance than a more traditional pattern that checks for special cases prior to invoking the fastpath.