Cost of bra instruction

Is there any overhead associated with the “bra” instruction (PTX)? Does anyone know this? It seems that there is none (even though it’s a control flow instruction), so I was just wondering.

For example, given some kernel code:

if (cond)
    // ...some ALU instructions...
// end of kernel

The generated PTX sets a predicate on the negation of cond and branches on it: if cond is NOT met, the warp branches to the end of the kernel.

The kernel runs faster executing the branch than it does executing the ALU instructions inside it. I'm just wondering why that is.

The cost of a branch instruction is the same as any other standard instruction: 4 clocks per warp. A conditional branch causes divergence only if not all threads in the warp share the same condition. That is, the warp executes both sides of the branch only if some of its threads need each side. In the uniform case, this is more efficient than standard predication, since only one side of the branch needs to be executed.

Thus, if you have a block like

if (cond) {
    // inner block
}

then the inner block is executed only if a thread in the warp actually passes cond. If none of the threads have cond == true, then the entire block can be skipped.

Ok, thanks. I would have expected the branch instruction to have some overhead associated with it.

So essentially, in CUDA there is no real reason to remove branches; you should just try to arrange threads so that those taking the same path end up in the same warp?

So no advantage can be had by removing control flow instructions!?

Correct. Branch instructions that do not cause warp divergence are negligible. If your code is constrained by memory bandwidth (as many kernels are), then FLOPS (or instructions per second in general) are basically free.

One reason there is no obvious branch penalty is because if you have lots of active threads, they will fill the stream processor pipeline with instructions from different threads which do not depend on the outcome of the branch. On a normal CPU, if you branch, there is a possibility you will need to flush the pipeline if the branch predictor guessed wrong.

Yes, I believe this kernel might be memory-bound, which would explain things. However, is a branch instruction really as fast to execute as an add instruction? If so, there must be some tricks in the PTX compiler; sadly, NVIDIA isn't revealing the underlying ISA.

It's not necessarily as fast as a float add. 4 cycles per warp is the time it takes to issue such an instruction; how long it takes to execute (that is, its latency) is not disclosed.

If there was no penalty associated with jumping, there’d be no need to have predication.

Well, that's what I thought. But even after creating some small ALU-bound benchmarks (execution time increases with additional ALU instructions), I find that the kernel with control-flow instructions still performs the same as the kernel without, even for an even distribution of the conditional. I just find these results somewhat surprising.

Predication is slightly faster in the divergent case since it requires one fewer instruction. A conditional branch would be something like

set_pred ~condition
bra continue

where simple predication only needs

set_pred condition

I suspect that all conditional flow control is implemented by predication of the branch instruction.

Anyhow, this is why the compiler automatically predicates branches with fewer than 4 instructions in the body: with a branch that small, the extra branch instruction outweighs the expected savings from the chance that an entire warp skips the block.

I think I might have found my answer.