Calling __host__ function from __global__ function

While building someone else’s PyTorch cpp/CUDA extension (SparseConvNet) on Windows, I am getting errors of the following type. My guess is that quite a number of people have built this library on Linux and probably aren’t getting these errors. Any idea why these errors may be getting reported only on Windows?
Thanks.

folderPath\SparseConvNet\sparseconvnet\SCN\CUDA/BatchNormalization.cu(49): error: calling a __host__ function("pow<float, double, (int)0> ") from a __global__ function("BatchNormalization_f_train<float, (int)16, (int)64> ") is not allowed

folderPath\SparseConvNet\sparseconvnet\SCN\CUDA/BatchNormalization.cu(49): error: identifier "pow<float, double, (int)0> " is undefined in device code

folderPath\SparseConvNet\sparseconvnet\SCN\CUDA/BatchNormalization.cu(83): error: calling a __host__ function("pow<float, double, (int)0> ") from a __global__ function("BatchNormalization_f_test<float, (int)16, (int)64> ") is not allowed

folderPath\SparseConvNet\sparseconvnet\SCN\CUDA/BatchNormalization.cu(83): error: identifier "pow<float, double, (int)0> " is undefined in device code

folderPath\SparseConvNet\sparseconvnet\SCN\CUDA/BatchNormalization.cu(49): error: calling a __host__ function("pow<float, double, (int)0> ") from a __global__ function("BatchNormalization_f_train<float, (int)12, (int)64> ") is not allowed
  • Interestingly, “pow” is listed among the available “Standard Functions” in the documentation at https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#standard-functions, which says “The functions from this section can be used in both host and device code”. So why is nvcc treating “pow” as a host-only function?
  • Just as an experiment, I replaced “pow” with “powf” and the compiler no longer generated errors.

Any insights why this could be?
Thanks.

If I interpret the error message correctly, this is not a function signature supported by CUDA (or by standard C++, as best I know). The available overloads are:

pow (double, int)
pow (double, double)
pow (float, int)
pow (float, float)

If a function call in device code doesn’t match any of the available signatures and cannot be resolved to any particular existing signature (presumably because of ambiguity), the compiler may assume it refers to a host function. I am not entirely sure under which circumstances that happens.

The first line in question seems to be:

_saveInvStd = pow(_saveInvStd / nActive + eps, -0.5);

where _saveInvStd is of type T, nActive is of type int, and the literal -0.5 is of type double. So I am assuming you get this error message when you instantiate this with T=float, as that results in pow (float, double).
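The deduced argument types can be confirmed with a quick host-side check (a sketch: the variable names mirror the extension's code, but the values here are hypothetical):

```cpp
#include <type_traits>

// Hypothetical stand-ins for the variables in BatchNormalization.cu
// when the template is instantiated with T = float.
float _saveInvStd = 1.0f;
float eps = 1e-5f;
int nActive = 64;

// float / int + float deduces to float, while the literal -0.5 is a
// double, so the call becomes pow(float, double).
static_assert(std::is_same<decltype(_saveInvStd / nActive + eps), float>::value,
              "first argument deduces to float");
static_assert(std::is_same<decltype(-0.5), double>::value,
              "exponent literal is double");
```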

Here is what happens when I extract this into a simple standalone demo kernel:

#include <cstdio>
#include <cstdlib>

__global__ void kernel (float f, double d, int i)
{
    double r1, r2, r3, r4, r5, r6, r7, r8, r9;
    r1 = pow (f, i);
    r2 = pow (f, f);
    r3 = pow (d, i);
    r4 = pow (d, d);
    r5 = powf (f, i);
    r6 = powf (f, f);
    r7 = powf (d, i);
    r8 = powf (d, d);
    r9 = pow (f, d);   // the problematic mixed-type call: pow(float, double)
    printf ("%23.16e %23.16e %23.16e %23.16e %23.16e %23.16e %23.16e %23.16e\n",
            r1, r2, r3, r4, r5, r6, r7, r8);
}

int main (void)
{
    kernel<<<1,1>>>(2,3,4);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

The CUDA 9.2 toolchain on Windows complains:

pow.cu(15): error: more than one instance of overloaded function "pow" matches the argument list:
            function "pow(double, double)"
            function "pow(float, float)"
            function "pow(float, int)"
            argument types are: (float, double)

Interesting… The actual call is pow(_saveInvStd / nActive + eps, -0.5), which takes just two arguments. Perhaps the compiler is seeing some other templated declaration of “pow” somewhere else that takes three template parameters and is resolving the call to that instead of to the standard function. Is there some way to indicate that what we want is the standard function “pow”? A namespace qualifier or anything like that…

When instantiated with T = float, computing pow(_saveInvStd / nActive + eps, -0.5) turns into pow (float/int + float, double) ==> pow (float, double). This is not supported by the CUDA standard math library.
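If the goal is to keep the call generic over T, one possible fix (a sketch, not SparseConvNet's actual code; the helper name invStd is made up) is to cast the exponent to the template type so that both arguments agree:

```cpp
#include <cmath>

// Sketch only: the parameter names mirror BatchNormalization.cu,
// but this is not the library's actual code.
template <typename T>
T invStd(T saveInvStd, int nActive, T eps)
{
    // Casting the exponent to T turns the call into pow(float, float)
    // or pow(double, double), both of which exist in CUDA device code.
    return std::pow(saveInvStd / nActive + eps, T(-0.5));
}
```

In device code the same cast resolves the call to the CUDA math library's matching-type overload, which avoids the ambiguity entirely.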

I haven’t checked whether some host compilers provide a prototype for, and therefore accept, pow (float, double). If so, it would seem to be an extension over the standard; at least I haven’t found anything in the C++ standard requiring such an overload.
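For what it’s worth, a quick host-only check suggests the mixed call does compile on the host; C++11’s <cmath> requires “sufficient additional overloads” so that mixed float/double arguments promote to double:

```cpp
#include <cmath>
#include <type_traits>

// Host-only check: with a C++11 standard library, a float/double mix
// is accepted and evaluated as pow(double, double).
static_assert(std::is_same<decltype(std::pow(2.0f, 3.0)), double>::value,
              "mixed float/double pow promotes to double on the host");
```

Device code does not get these promoting overloads, which would explain why the same call that compiles on the host is ambiguous in a __global__ function.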

I have only very rudimentary knowledge of Python, but as far as I know it does not support a single-precision floating-point type, only double precision, which probably means that in this context the code should never be instantiated with T = float.