Does anyone know whether there is any overhead associated with the PTX “bra” instruction? It seems that there is none, even though it’s a control flow instruction, so I was just wondering.
For example, given some kernel code:
end of kernel
The PTX code sets a predicate for the negation of cond and branches on that predicate. So if cond is NOT met, it branches to the end of the kernel.
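To make the pattern concrete, here is a minimal sketch of the kind of kernel being described. The kernel name `guarded`, the `flags` and `out` arrays, and the arithmetic in the body are all hypothetical placeholders, not the actual code in question. The PTX in the comments is only the approximate shape the compiler tends to emit, and register names and labels will differ.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: a condition guards a few ALU instructions.
// Approximate PTX shape for the guarded part (names are illustrative):
//     setp.eq.s32  %p1, %r1, 0;    // predicate = NOT cond
//     @%p1 bra     $L_end;         // cond not met: jump to end of kernel
//     ...ALU instructions...
//   $L_end:
//     ret;
__global__ void guarded(const int *flags, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;             // bounds guard

    if (flags[i] != 0) {            // "cond"
        float v = out[i];
        v = v * 2.0f + 1.0f;        // placeholder ALU work
        out[i] = v;
    }
    // end of kernel
}

int main()
{
    const int n = 8;
    int   h_flags[n] = {1, 0, 1, 0, 1, 0, 1, 0};
    float h_out[n];
    for (int i = 0; i < n; ++i) h_out[i] = static_cast<float>(i);

    int   *d_flags = nullptr;
    float *d_out   = nullptr;
    cudaMalloc(&d_flags, n * sizeof(int));
    cudaMalloc(&d_out,   n * sizeof(float));
    cudaMemcpy(d_flags, h_flags, n * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(d_out,   h_out,   n * sizeof(float), cudaMemcpyHostToDevice);

    guarded<<<1, n>>>(d_flags, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("out[%d] = %f (flag = %d)\n", i, h_out[i], h_flags[i]);

    cudaFree(d_flags);
    cudaFree(d_out);
    return 0;
}
```

Compiling with `nvcc -ptx` writes out the PTX so the predicate and `bra` can be inspected directly. Keep in mind that PTX is an intermediate form, so the machine code that actually runs (viewable with `cuobjdump -sass`) may handle the branch differently.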
The kernel runs faster when it takes the branch than when it executes the ALU instructions inside the conditional. Just wondering why this is.