Branch or not

John_Smith_Lon · February 26, 2018, 5:09pm

Does this cause a branch under nvcc?

double a = oldValue;
bool ind = x > y;

a += (newValue - oldValue) * ind;

It would be equivalent to:

double a = oldValue;
if(x > y) a = newValue;

Just thinking about tricks to avoid if statements and whether this can help, or if the two end up being compiled to the same assembler - i.e. ind = x > y requires a branch anyway…

John_Smith_Lon · February 26, 2018, 5:48pm

If I look at PTX I can see the following:

func_exec_begin6:

//c:/TestBranch/kernel.cu:5 	double a = oldValue;
	.loc	1 5 11
tmp12:
	mov.f64 	%fd5, %fd1;
tmp13:

//c:/TestBranch/kernel.cu:6 	bool ind = x > y;
	.loc	1 6 11
	setp.gt.f64	%p1, %fd3, %fd4;
	selp.u16	%rs1, 1, 0, %p1;
tmp14:
	cvt.s16.s8 	%rs2, %rs1;

//c:/TestBranch/kernel.cu:8 	a += (newValue - oldValue) * ind;
	.loc	1 8 2
	sub.f64 	%fd6, %fd2, %fd1;
	cvt.rn.f64.s16	%fd7, %rs2;
	mul.f64 	%fd8, %fd6, %fd7;
	add.f64 	%fd9, %fd5, %fd8;
tmp15:

//c:/TestBranch/kernel.cu:9 	return a;
	.loc	1 9 2
	mov.f64 	%fd10, %fd9;
	st.param.f64	[func_retval0+0], %fd10;
	ret;
tmp16:
func_end6:

versus

func_exec_begin7:

//c:/TestBranch/kernel.cu:14 	double a = oldValue;
	.loc	1 14 11
tmp17:
	mov.f64 	%fd5, %fd1;
tmp18:

//c:/TestBranch/kernel.cu:15 	if (x > y) a = newValue;
	.loc	1 15 2
	setp.gt.f64	%p1, %fd3, %fd4;
	not.pred 	%p2, %p1;
	@%p2 bra 	BB7_2;
	bra.uni 	BB7_1;

BB7_1:

//c:/TestBranch/kernel.cu:15 	if (x > y) a = newValue;
	.loc	1 15 13
tmp19:
	mov.f64 	%fd6, %fd2;
tmp20:

BB7_2:

//c:/TestBranch/kernel.cu:16 }
	.loc	1 16 1
	st.param.f64	[func_retval0+0], %fd7;
	ret;
tmp21:
func_end7:

I am not fluent in PTX so would appreciate any expert views on whether the first snippet of PTX contains a branch or not… (I suspect the branch is the lines:

@%p2 bra 	BB7_2;
bra.uni 	BB7_1;

in the second block, which if so, my idea is proved correct, but I am not sure).

njuffa · February 26, 2018, 6:52pm

For anything relevant to performance, looking at PTX is meaningless. It’s a portable ISA and compiler intermediate representation. Look at the machine code (SASS) instead with cuobjdump --dump-sass.

To first order, do not worry about branches when writing CUDA code. Just write the code in a natural readable style. If the CUDA profiler tells you that branch divergence has a significant negative impact on performance, then look at the relevant branch(es). I predict that this will hardly ever be the case: there are typically “bigger fish to fry” when optimizing for the GPU.

John_Smith_Lon · February 26, 2018, 10:55pm

Thanks, I will look at the machine code. Maybe I’m being too paranoid but I’m writing a library of mathematical special functions and performance will be important for its success. I like to know roughly how the compiler translates the code and how to minimise branching.

njuffa · February 26, 2018, 11:02pm

Local branches with small loop bodies typically translate into a select instruction (think C’s ternary operator) or predicated execution. The choice is up to undocumented compiler-internal heuristics that are subject to change, but examination of the generated machine code shows that the compiler’s choices make sense 95% of the time.

I have extensive experience writing CUDA code for performance-optimized elementary and special functions libraries. Branches at the source code level are a minor issue in my experience. The compiler mostly can be trusted to do the right thing, but that should be verified of course. Selecting the most appropriate algorithm is often the hardest part of an efficient implementation. I have spent weeks on that in some cases.

What kind of special functions are you working on?

BulatZiganshin · February 27, 2018, 7:11pm

it seems that you try to judge from CPU experience where unpredicted branch is a big hit. on GPU, all operations are predicated (similar to AVX512), so your code code will be executed as two operations:

p1 := (x > y) 
@p1  a := newValue  // assignment performed only if p1 is true

this code is as fast as possible. and while “if” body contains up to ~6 commands, compiler generates just a sequence of predicated operations

John_Smith_Lon · February 28, 2018, 9:00pm

Thanks for the help, I’m writing a library that will cover special functions such as the incomplete beta function. I had a look at the sass assembler and you are both absolutely right about the predicated operation:

/*0188*/                   DSETP.GT.AND P0, PT, R8, R10, PT;   /* 0x5b84038000a70807 */
        /*0190*/                   PSETP.AND.AND P0, PT, !P0, PT, PT;  /* 0x50900380e0078007 */
        /*0198*/                   SSY 0x208;                          /* 0xe290000006800000 */
                                                                       /* 0x007fbc03fde01fef */
        /*01a8*/               @P0 SYNC;                               /* 0xf0f800000000000f */
        /*01b0*/                   BRA 0x1b8;                          /* 0xe24000000007000f */

(although undocumented assembler is a bit mysterious!).

If I can summarize, when writing CUDA code with performance in mind, branching is only an issue if you have a branches with large amounts of code within one branch so the cost of divergence is high - e.g. if one thread within a warp takes a long way through the branch all the other threads need to wait until the thread that took the long route executes and returns to the common code.

Also if you have a loop that may execute different numbers of iterations then the threads within the warp will effectively all loop as long as the thread with highest number of iterations. This makes:

while(condition)
{
}

loops potentially troublesome if the condition may take a long time to be reached in rare cases.

njuffa · February 28, 2018, 9:26pm

That’s a bit too simplistic. You will also have to take into account the probability of divergence across the threads in a warp. Often times, library code for floating-point functions will comprise a fastpath (that handles the common case) and a slowpath (that handles difficult corner cases). Many times, the probability of hitting the slow path is low. Depending on the source data distribution (in real-life use cases, there is often value locality), the chance of divergence may then be even lower.

For the most part, worries about branch divergence should be deferred until after a prototype implementation can be profiled in something approximating a real-life context. In my experience, branch divergence is rarely the most pressing performance issue in special function evaluation.

A couple of tips:

[1] I have found that one can often cut down on slowpath complexity by only implementing the fastpath first, then checking thoroughly which special cases it already happens to handle correctly, and adding slowpath code only for the remaining special cases. In other words, some special cases are in fact handled by the fastpath code going about its normal business. Do make note of that fact in comments, otherwise the next person touching the code may be in for a nasty surprise. This is a trade-off between performance and maintainability.

[2] The following pattern is often conducive to superior performance (credit goes to Alex Fit-Florea for pointing this out to me years ago). Compute the fast path result. Immediately after that, check whether a special case needs to be handled, and then invoke the slowpath if necessary, conditionally overwriting the fastpath result. This often results in better performance than a more traditional pattern that checks for special cases prior to invoking the fastpath.

Topic		Replies	Views
Branching Performance Hit CUDA Programming and Performance	15	2678	June 30, 2009
Overhead of warp divergence vs. extra multiplication by 0 or 1 CUDA Programming and Performance	9	2224	February 27, 2013
do warp vote functions cause branching? CUDA Programming and Performance	16	3626	August 11, 2010
How to tell nvcc that some `if` must diverge and stop trying to fuse previous statements into it? CUDA Programming and Performance	20	463	March 3, 2024
Branch divergence and executing serial could be misinterpretted. CUDA Programming and Performance	8	3943	December 21, 2016
cant call any kernel function CUDA Programming and Performance	8	4834	June 6, 2011
[Solved] PTX ISA predicated execution and the warp divergence issue CUDA Programming and Performance	6	3013	January 14, 2014
How subject to performance loss is : if (idx < n) { .... } ? CUDA Programming and Performance	7	1471	July 13, 2015
Is dynamic code generation possible? CUDA Programming and Performance	20	4544	December 29, 2012
CUDA compiler needs too much help in order to use select instead of branch CUDA Programming and Performance	6	561	October 12, 2021

Branch or not

Related topics