Had some trouble with my code when profiling it with the Visual Profiler. According to the output I had a very high number of divergent branches, and I could not understand why. After a bit of searching I found the following thread:
suggesting that the use of powf could cause branching. So I tested the suggestion and changed all powf(x,2) to x*x instead, and the number of divergent branches decreased drastically.
Now to my question: why does powf cause divergent branches?
Because it’s using branching internally ;)
Use __powf() if you can live with a less accurate version that translates more directly to operations on the special function unit (SFU).
And in your case of powf(x,2) a simple floating point multiplication would be much faster anyway. E.g. on the G80/G92 chips there are 2 SFUs per 8 CUDA cores, hence most SFU operations take 16 clock cycles for a warp of 32 threads (32 threads / 2 SFUs), whereas a simple MUL, ADD or MAD takes 4 clock cycles for a warp (32 threads / 8 cores). Not so sure about G200 based chips.
Have a look in math_functions.h if you are interested - the “accurate” version of powf contains quite a lot of domain and range checking logic before calculating powf(a,b) = exp(b*log(a)).
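To make the difference concrete, here is a minimal sketch of why a general power function needs branches while x*x does not. It is not the code from math_functions.h: the names `HD`, `square_mul`, `square_pow`, `pow_core` and `pow_checked` are made up for illustration, and the checks shown are a heavily simplified stand-in for the real domain and range logic. The `HD` macro is an assumption that lets the same helpers compile as plain C++ on the host or be called from a kernel under nvcc.

```cpp
#include <math.h>

// Under nvcc, make the helpers callable from host and device code;
// with a plain C++ compiler the macro expands to nothing.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Squaring by multiplication: a single MUL, no branches at all.
HD inline float square_mul(float x) { return x * x; }

// Squaring through the general-purpose power function.
HD inline float square_pow(float x) { return powf(x, 2.0f); }

// The core identity only (hypothetical helper): valid for a > 0.
HD inline float pow_core(float a, float b) { return expf(b * logf(a)); }

// Simplified illustration of the kind of checks a general power
// function has to wrap around the core identity. Each `if` is a
// branch on per-thread data, so threads in one warp can diverge.
HD inline float pow_checked(float a, float b) {
    if (b == 0.0f) return 1.0f;            // anything to the power 0 is 1
    if (a < 0.0f) {
        if (b != floorf(b)) return NAN;    // negative base, non-integer exponent
        float mag = pow_core(-a, b);       // work on the magnitude
        bool odd = fmodf(b, 2.0f) != 0.0f; // odd exponent keeps the sign
        return odd ? -mag : mag;
    }
    return pow_core(a, b);                 // ordinary positive base
}
```

Inside a kernel you would call `square_mul(x)` where the exponent is known to be 2, and keep the general function only for exponents that really vary at run time.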