What does nvcc do when compiling pow(x, 2)?

I thought nvcc would handle __powf(x, 2) by compiling it to assembly equivalent to x * x.

But here is my test code:

#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <math.h>

__global__ void pow_test(float *input, float *output)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    output[idx] = __powf(input[idx], 2);
}

__global__ void mul_test(float *input, float *output)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    output[idx] = input[idx] * input[idx];
}

int main()
{
    int M;
    std::cin >> M;
    float *a;
    cudaMallocManaged((void **)&a, M * sizeof(float));
    float *b;
    cudaMallocManaged((void **)&b, M * sizeof(float));
    float *c;
    cudaMallocManaged((void **)&c, M * sizeof(float));

    for(int i = 0; i < M; i++)
    {
        a[i] = rand() % 11;
    }

    int b_num = (M + 1023) / 1024;

    cudaEvent_t start, mid, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&mid);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    pow_test<<<b_num, 1024>>>(a, b);
    cudaDeviceSynchronize();

    cudaEventRecord(mid);

    mul_test<<<b_num, 1024>>>(a, c);
    cudaDeviceSynchronize();

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float u_t;
    cudaEventElapsedTime(&u_t, start, mid);
    printf("pow_time:%g\n", u_t);
    cudaEventElapsedTime(&u_t, mid, stop);
    printf("mul_time:%g\n", u_t);

    float max = - 100;
    for(int i = 0; i < M; i++)
    {
        max = (max < (b[i] - c[i])) ? (b[i] - c[i]) : max;
    }
    printf("%f\n", max);

    return 0;
}

And here is a test run:

$ ./test
10000
pow_time:0.254016
mul_time:0.006464
0.000004

So is my test right? Does this mean nvcc does not treat __powf(x, 2) as x * x?

If my test is right, why does this happen?

I would have thought that optimizing pow(x, 2) at compile time is not difficult.

Hi Ez,

As you can see in the documentation here:
https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SINGLE.html#group__CUDA__MATH__INTRINSIC__SINGLE_1g2c2b295816185f6ce2423471df529974

The function is meant to compute a fast approximation of a float raised to the power of a float. By calling it you are already choosing an optimization yourself, namely the fast approximate path that the GPU provides. For special cases such as raising to an integer power, you are probably better off doing the optimization manually.
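As a minimal sketch of what doing the optimization manually could look like: the CUDA programming guide describes __powf(x, y) as derived from exp2f(y * __log2f(x)), so an explicit multiply avoids that path for an integer exponent. The helper names square and ipow, and the kernel square_kernel, are hypothetical and not part of the CUDA math API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: square a value with one multiply instead of __powf(x, 2).
__host__ __device__ inline float square(float x)
{
    return x * x;
}

// Hypothetical helper: x^n for a non-negative integer n, by repeated squaring.
__host__ __device__ inline float ipow(float x, unsigned int n)
{
    float result = 1.0f;
    while (n != 0u) {
        if (n & 1u) result *= x;  // multiply in the current power when the bit is set
        x *= x;                   // x, x^2, x^4, ...
        n >>= 1;
    }
    return result;
}

__global__ void square_kernel(const float *input, float *output, int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n) {                // guard threads past the end of the array
        output[idx] = square(input[idx]);
    }
}

int main()
{
    const int n = 1024;
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));

    for (int i = 0; i < n; i++) in[i] = 0.5f * i;

    square_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    // Compare the device result with the host-side integer power helper.
    printf("x=%f square=%f ipow(x,2)=%f\n", in[3], out[3], ipow(in[3], 2u));

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Whether this is actually faster in a given kernel depends on how memory-bound the kernel is, so it is worth measuring rather than assuming.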

Sorry, I made a mistake in how I generated the random input array.
It was supposed to be this:

a[i] = (float)rand() / (float)(RAND_MAX / 3);

With this change, the input array a holds random floats rather than integers.

The test result is almost the same:

$ ./test
10000
pow_time:0.20928
mul_time:0.007808
0.000002

Actually, my only point is that I don't understand why nvcc didn't compile __powf(x, 2) to x * x.

According to my test results, x * x is much faster, and I would guess this optimization is not difficult for a compiler to implement.
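One caveat on the measurement itself: the first kernel launched in a program can include one-time costs (context setup, and page migration when managed memory is used), which would count against whichever kernel runs first. A rough sketch of a timing harness that tries to separate those costs follows. It assumes explicit device buffers, a warm-up launch, and averaging over several repetitions; the helper time_kernel, the repetition count, and the block size are illustrative choices, not taken from the original test.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

__global__ void pow_test(const float *input, float *output, int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n) output[idx] = __powf(input[idx], 2.0f);
}

__global__ void mul_test(const float *input, float *output, int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n) output[idx] = input[idx] * input[idx];
}

typedef void (*kernel_t)(const float *, float *, int);

// Hypothetical helper: average time per launch in milliseconds, after one warm-up launch.
static float time_kernel(kernel_t kernel, const float *in, float *out, int n, int reps)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<blocks, threads>>>(in, out, n);  // warm-up launch, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int r = 0; r < reps; r++) {
        kernel<<<blocks, threads>>>(in, out, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main()
{
    const int n = 10000;
    const int reps = 100;  // illustrative repetition count

    // Random floats in [0, 3], matching the corrected input generation.
    std::vector<float> host(n);
    for (int i = 0; i < n; i++) host[i] = (float)rand() / (float)(RAND_MAX / 3);

    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemcpy(in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    printf("pow_time:%g ms per launch\n", time_kernel(pow_test, in, out, n, reps));
    printf("mul_time:%g ms per launch\n", time_kernel(mul_test, in, out, n, reps));

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

To see what the compiler actually generated for each kernel, the PTX can be dumped with nvcc -ptx, or the SASS of the built executable with cuobjdump -sass.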