Cost of bra instruction

Is there any overhead associated with the “bra” instruction (PTX)? Does anyone know this? It seems that there is none (even though it’s a control flow instruction), so I was just wondering.

For example, given some kernel code:

if (cond)
    // ...some ALU instructions...
// end of kernel

The generated PTX sets a predicate on the negation of cond and branches on it: if cond is NOT met, the warp branches to the end of the kernel.

The kernel runs faster executing the branch than it does executing the ALU instructions inside it. I'm just wondering why that is.

The cost of a branch instruction is the same as any other standard instruction: 4 clocks per warp. A conditional branch causes divergence only if not all threads in the warp share the same condition. That is, the warp executes both sides of the branch only if some of its threads need each side. In the uniform case, this is more efficient than standard predication, since only one side of the branch needs to be executed.

Thus, if you have a block like

if (cond) {
    // inner block
}

then the inner block is executed only if a thread in the warp actually passes cond. If none of the threads have cond == true, then the entire block can be skipped.

Ok, thanks. I would have expected the branch instruction to have some overhead associated with it.

So essentially, in CUDA there is no real reason to remove branches; you should just try to arrange threads so that those taking the same path end up in the same warp?

So no advantage can be had by removing control flow instructions!?

Correct. Branch instructions that do not cause warp divergence are negligible. If your code is constrained by memory bandwidth (as many kernels are), then FLOPS (or instructions per second in general) are basically free.

One reason there is no obvious branch penalty is because if you have lots of active threads, they will fill the stream processor pipeline with instructions from different threads which do not depend on the outcome of the branch. On a normal CPU, if you branch, there is a possibility you will need to flush the pipeline if the branch predictor guessed wrong.

Yes, I believe this kernel might be memory-bound, which would explain things. However, is a branch instruction really as fast to execute as an add instruction? If so, there must be some tricks in the PTX compiler; sadly, NVIDIA isn't revealing the underlying ISA.

It's not necessarily as fast as a float add. 4 cycles per warp is the time it takes to issue such an instruction; how long it takes to execute (that is, its latency) is not disclosed.

If there was no penalty associated with jumping, there’d be no need to have predication.

Well, that's what I thought. But even after creating some small ALU-bound benchmarks (execution time increases with additional ALU instructions), I find that the kernel with control-flow instructions still performs the same as the kernel without, even for an even distribution of the conditional. I just find these results somewhat surprising.

Predication is slightly faster in the divergent case since it requires one fewer instruction. A conditional branch would be something like

set_pred ~condition
bra continue

where simple predication only needs

set_pred condition

I suspect that all conditional flow control is implemented by predication of the branch instruction.

Anyhow, this is why the compiler automatically predicates branches with fewer than 4 instructions in the body: with a branch that small, the extra branch instruction outweighs the expected savings from the chance that an entire warp skips the block.

I think I might have found my answer.