Implement faster cuda intrinsics for specific power functions

982630737 · November 4, 2020, 3:36am

Hello, I want to use a fast power function to calculate x^p. Cuda has provided an intrinsic called __powf(x, p) however it is still slower than I expect. I know the reason why power function is so slow is that there are many corner cases. However, for my case, x is a float between [0, 1] and p is a fixed large positive float number, for example 15 or 25. Can I write a custom intrinsic that calculates x^25 (for example)? Since I don’t know how __powf is implemented (I know it uses __log2f and __exp2f), it is hard for me to modify the implementation to fit my need. Can anyone help me? Thank you so much!

Note that I don’t care about high accuracy. For my case, it is enough that the maximum absolute error just be less than 1e-4.

njuffa · November 4, 2020, 5:41am

The execution time of the C++ standard math functions pow() and powf() is about twice that of the sum of the execution time of the standard math functions exp()/expf(), and log()\logf()combined. This rule of thumb applies to all platforms I am familiar with including CUDA.

Why twice the time? The additional time is need to (1) compute the logarithm to extended precision to guarantee an accurate pow()/powf() result (2) deal with the many special cases prescribed by the language standard.

The device-function intrinsic __powf() is not encumbered by these requirements, that is, it neither computes the logarithm to extended precision nor does it make any effort to get special cases correct. If you compile with -ftz=true (or -use_fast_math which includes -ftz=true), the implementation comprises three inlined instructions (here is __powf() disassembled from sm_70 code):

        /*0020*/                   MUFU.LG2 R0, c[0x0][0x160] ;
        /*0040*/                   FMUL.FTZ R0, R0, c[0x0][0x164] ;
        /*0050*/                   MUFU.EX2 R0, R0 ;

I don’t see how one can get any faster code for a power function that takes two float arguments. I find it hard to believe that this would be a bottleneck in anything but a trivial test app. If you have profiler output to the contrary please share it.

Your question suggests that your use case may be using pow with integer exponents. A call pow (double, int) or pow (float, int) will not go through the standard math functions, but uses a square-multiply algorithm that scans the exponent one bit at a time. This may result in code that is optimal for very small exponents, and it may be slower than invoking __powf (float, float).

982630737 · November 4, 2020, 7:08am

Thank you for your reply. Yes, I agree that the log-exp paradigm can not be faster than simply calling __powf(), but I wonder if there are other approaches. Since for my case p is fixed in advance, the function is thus unary. I would like to learn how other unary function intrinsics (for example, __expf and __logf) are implemented, and derive a similar approximation to power function. I has tried Taylor expansion, but to reach the accuracy the order needed is very large, which leads to a slow calculation.

I really need an extremely fast calculation of power function. In my case, the power operation takes at least 90% of time. I use power function to calculate p-norm of millions of high dimensional vectors.

njuffa · November 4, 2020, 7:26am

In my case, the power operation takes at least 90% of time

You mean __powf(), in code compiled with -ftz=true, takes 90% of the time? I find that hard to believe. There have to be at least a few other instructions, maybe even a load and a store somewhere.

It is very hard to beat the three instruction sequence I showed.

If the exponent is compile-time constant, the compiler’s constant propagation optimization may help. Take a look at the generated machine code with cuobjdump --dump-sass to check whether that is happening. You could implement specialized versions of pow by template metaprogramming that expand into the minimum number of multiplications. That is only advantageous for small integer exponents obviously.

I would like to learn how other unary function intrinsics (for example, __expf and __logf) are implemented

Compile a minimal kernel using the function under test in a release build, then use cudobdump --dump-sass to examine the generated machine code, and all will be revealed. __expf() and __logf in particular map to the MUFU.EX2 and MUFU.LG2 machine instructions with additional instructions for base conversion and (optionally) denormal support.

982630737 · November 4, 2020, 7:38am

I replace the p-norm function (which calculates (\sum_i (x_i)^p)^(1/p)) to the summation function (which calculates \sum_i x_i), and the latter only takes less than 20% time. I use __powf() but I does not add other configurations (-ftz or -use_fast_math). Will -ftz have effect on __powf() intrinsic?

Also, do you mean MUFU.LG2 and MUFU.EX2 are basic instructions? I thought MUFU.LG2 and MUFU.EX2 comprise of a series of additions and multiplications.

njuffa · November 4, 2020, 8:17am

Yes. -ftz=true will eliminate the portion of the code that is needed to handle denormal operands. I goofed in my earlier post: since mathematically, pow (x,y) = exp2 (y * log2 (x)), constant propagation cannot help when the exponent is compile-time constant.

Yes. The hardware performs quadratic interpolation in internal ROM-based tables for these. MUFU stands for MUlti Function Unit. See:

S. Oberman and M. Siu, “A High-Performance Area-Efficient Multifunction Interpolator.” In 17th IEEE Symposium on Computer Arithmetic, June 2005, pp. 272-279

If all your exponents are small non-negative integers known at compile time, you could use something like the following which relies on constant propagation and dead-code elimination:

/* compute a**b, where b in [0,28] */
__device__ float powf_int (float a, int b)
{
    float r, s, t, u, v;
    if (b == 0) {
        r = 1;
    } else if (b == 1) {
        r = a;
    } else if (b == 2) {
        r = a * a;
    } else if (b == 3) {
        r = a * a * a;
    } else if (b == 4) {
        s = a * a;
        r = s * s;
    } else if (b == 5) {
        s = a * a;
        r = a * s * s;
    } else if (b == 6) {
        s = a * a;
        r = s * s * s;
    } else if (b == 7) {
        s = a * a;
        t = s * s;
        r = a * s * t;
    } else if (b == 8) {
        s = a * a;
        t = s * s;
        r = t * t;
    } else if (b == 9) {
        t = a * a * a;
        r = t * t * t;
    } else if (b == 10) {
        s = a * a;
        t = s * s;
        r = s * t * t;
    } else if (b == 11) {
        s = a * a;
        t = s * s;
        u = t * t;
        r = a * s * u;
    } else if (b == 12) {
        s = a * a;
        t = s * s;
        r = t * t * t;
    } else if (b == 13) {
        s = a * a;
        t = s * s;
        u = t * t;
        r = a * t * u;
    } else if (b == 14) {
        s = a * a;
        t = s * s;
        u = t * t;
        r = s * t * u;
    } else if (b == 15) {
        s = a * a;
        t = a * s * s;
        r = t * t * t;
    } else if (b == 16) {
        s = a * a;
        t = s * s;
        u = t * t;
        r = u * u;
    } else if (b == 17) {
        s = a * a;
        t = s * s;
        u = t * t;
        r = a * u * u;
    } else if (b == 18) {
        t = a * a * a;
        u = t * t * t;
        r = u * u;
    } else if (b == 19) {
        s = a * a;
        t = s * s;
        u = t * t;
        v = u * u;
        r = a * s * v;
    } else if (b == 20) {
        s = a * a;
        t = s * s;
        u = t * t;
        r = t * u * u;
    } else if (b == 21) {
        s = a * a;
        t = s * s;
        u = t * t;
        v = u * u;
        r = a * t * v;
    } else if (b == 22) {
        s = a * a;
        t = s * s;
        u = t * t;
        v = u * u;
        r = s * t * v;
    } else if (b == 23) {
        s = a * a;
        t = a * s;
        u = s * t;
        v = u * u;
        r = t * v * v;
    } else if (b == 24) {
        s = a * a;
        t = s * s;
        u = t * t;
        r = u * u * u;
    } else if (b == 25) {
        s = a * a;
        t = s * s;
        u = t * t;
        v = u * u;
        r = a * u * v;
    } else if (b == 26) {
        s = a * a;
        t = s * s;
        u = t * t;
        v = u * u;
        r = s * u * v;
    } else if (b == 27) {
        t = a * a * a;
        u = t * t * t;
        r = u * u * u;
    } else if (b == 28) {
        s = a * a;
        t = s * s;
        u = t * t;
        v = u * u;
        r = t * u * v;
    }
    return r;
 }

system · March 24, 2022, 8:48am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Faster and more accurate implementation of log1pf() CUDA Programming and Performance	15	3301	January 25, 2017
__powf(): wrong behavior CUDA Programming and Performance	8	4678	October 23, 2009
A more accurate and faster implementation of powf() CUDA Programming and Performance	6	4349	February 13, 2025
More accurate version of exp2f() with no change in performance CUDA Programming and Performance	0	738	February 17, 2023
Fastmath functions Speed or accuracy CUDA Programming and Performance	8	21498	April 16, 2009
Accuracy-optimized performance-neutral implementation of erfcxf() CUDA Programming and Performance	2	416	September 4, 2022
Optimized version of single-precision error function, erff() CUDA Programming and Performance	21	4466	December 25, 2017
Faster and more accurate implementation of logf() CUDA Programming and Performance	10	2127	August 7, 2024
Register usage of pow() CUDA Programming and Performance	8	5508	June 20, 2011
Faster and more accurate implementation of log2f() CUDA Programming and Performance	5	1300	March 7, 2019

Implement faster cuda intrinsics for specific power functions

Related topics