The execution time of the C++ standard math functions `pow()` and `powf()` is about twice the execution time of the standard math functions `exp()/expf()` and `log()/logf()` *combined*. This rule of thumb applies to all platforms I am familiar with, including CUDA.

Why twice the time? The additional time is needed to (1) compute the logarithm to extended precision to guarantee an accurate `pow()/powf()` result, and (2) deal with the many special cases prescribed by the language standard.

The device-function intrinsic `__powf()` is *not* encumbered by these requirements; that is, it neither computes the logarithm to extended precision nor makes any effort to get special cases correct. If you compile with `-ftz=true` (or `-use_fast_math`, which includes `-ftz=true`), the implementation comprises three inlined instructions (here is `__powf()` disassembled from `sm_70` code):

```
/*0020*/ MUFU.LG2 R0, c[0x0][0x160] ;
/*0040*/ FMUL.FTZ R0, R0, c[0x0][0x164] ;
/*0050*/ MUFU.EX2 R0, R0 ;
```

I don’t see how one could get any faster code for a power function that takes two `float` arguments. I find it hard to believe that this would be a bottleneck in anything but a trivial test app. If you have profiler output to the contrary, please share it.

Your question suggests that your use case may be calling `pow` with integer exponents. A call `pow (double, int)` or `pow (float, int)` will *not* go through the standard math functions, but instead uses a square-and-multiply algorithm that scans the exponent one bit at a time. This may result in code that is optimal for very small exponents, but it may be slower than invoking `__powf (float, float)`.