As a rule of thumb, the execution time of the C++ standard math function powf() is about twice that of the standard math functions exp()/expf() and log()/logf() combined. This applies to all platforms I am familiar with, including CUDA.
Why twice the time? The additional time is needed to (1) compute the logarithm to extended precision to guarantee an accurate pow()/powf() result, and (2) deal with the many special cases prescribed by the language standard.
The device-function intrinsic __powf() is not encumbered by these requirements: it neither computes the logarithm to extended precision nor makes any effort to handle the special cases correctly. If you compile with -use_fast_math (which includes -ftz=true), the implementation comprises three inlined instructions. Here is __powf() disassembled:
/*0020*/ MUFU.LG2 R0, c[0x0][0x160] ;
/*0040*/ FMUL.FTZ R0, R0, c[0x0][0x164] ;
/*0050*/ MUFU.EX2 R0, R0 ;
I don’t see how one could get faster code for a power function that takes two float arguments, and I find it hard to believe this would be a bottleneck in anything but a trivial test app. If you have profiler output to the contrary, please share it.
Your question suggests that your use case may be calling pow with integer exponents. A call to pow(double, int) or pow(float, int) will not go through the standard math functions, but instead uses a square-multiply algorithm that scans the exponent one bit at a time. The resulting code may be optimal for very small exponents, but it may also be slower than invoking __powf(float, float).