What would nvcc do when deal with pow(x,2)?

EzPizzy · February 10, 2023, 3:00am

I thought nvcc would deal with __powf(x,2) by compile it to assembly that equals to x * x

But here is my test code

#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <math.h>

__global__ void pow_test(float *input, float *output)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    output[idx] = __powf(input[idx], 2);
}

__global__ void mul_test(float *input, float *output)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    output[idx] = input[idx] * input[idx];
}

int main()
{
    int M;
    std::cin >> M;
    float *a;
    cudaMallocManaged((void **)&a, M * sizeof(float));
    float *b;
    cudaMallocManaged((void **)&b, M * sizeof(float));
    float *c;
    cudaMallocManaged((void **)&c, M * sizeof(float));

    for(int i = 0; i < M; i++)
    {
        a[i] = rand() % 11;
    }

    int b_num = (M + 1023) / 1024;

    cudaEvent_t start, mid, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&mid);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    pow_test<<<b_num, 1024>>>(a, b);
    cudaDeviceSynchronize();

    cudaEventRecord(mid);

    mul_test<<<b_num, 1024>>>(a, c);
    cudaDeviceSynchronize();

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float u_t;
    cudaEventElapsedTime(&u_t, start, mid);
    printf("pow_time:%g\n", u_t);
    cudaEventElapsedTime(&u_t, mid, stop);
    printf("mul_time:%g\n", u_t);

    float max = - 100;
    for(int i = 0; i < M; i++)
    {
        max = (max < (b[i] - c[i])) ? (b[i] - c[i]) : max;
    }
    printf("%f\n", max);

    return 0;
}

And here is a running test

$ ./test
10000
pow_time:0.254016
mul_time:0.006464
0.000004

So is my test right? In fact, the nvcc wouldn’t treat __powf(x,2) as x * x, right?

If my test is right, I wonder why this would happen?

Because I thought this pow(x,2) compile optimization is not a difficult thing.

emmanuel.madrigal · February 16, 2023, 8:03pm

Hi Ez,

As you can see in the documentation here:
https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SINGLE.html#group__CUDA__MATH__INTRINSIC__SINGLE_1g2c2b295816185f6ce2423471df529974

The function is meant to approximate raise a float to the power of a float since you are using optimizations yourself by using this function which the GPU will optimize. For corner cases such as raising to an integer, you are probably better off doing the optimization manually.

EzPizzy · February 20, 2023, 7:34am

Sorry about my random number array generation mistake.
It was supposed to be this:

a[i] = (float)rand() / (float)(RAND_MAX / 3);

Then the input array a[i] should be random float rather than integer now.

And the test result almost the same:

$ ./test
10000
pow_time:0.20928
mul_time:0.007808
0.000002

Actually my only point is that I dont understand why nvcc didnt compile __powf(x,2) to x * x.

Because according to my test result, x * x is much more faster and I guess this compile method is not difficult.

Topic		Replies	Views
Compiling double precission I can't use pow function... CUDA Programming and Performance	3	3521	June 14, 2009
Strange behavior compiling an expression with powf() ?! CUDA Programming and Performance	8	1011	October 27, 2015
__powf(x,y) gives nan CUDA Programming and Performance	11	3085	January 31, 2013
pow function not compiling in CUDA kernel CUDA Programming and Performance	2	12774	September 21, 2009
CUDA and Exponentiation Legacy PGI Compilers	3	5328	August 18, 2010
Bug in the POW function? CUDA Programming and Performance	8	2720	December 5, 2021
pow function error with 1 float, 1 int input CUDA Programming and Performance	1	2962	June 26, 2007
Can't compile fast math functions CUDA Programming and Performance	2	2136	May 29, 2010
Implement faster cuda intrinsics for specific power functions CUDA Programming and Performance cuda	6	4280	November 4, 2020
why the result of powf(-0.07346,2) equal to nan? OptiX	7	1256	June 14, 2022

What would nvcc do when deal with pow(x,2)?

Related topics