TL;DR: Does the line “g_foo<1>;” count as instantiating the global template function g_foo<1>? (If not, is there a way to lazily instantiate but not execute template functions?)
Context/details:
Suppose we have a host/device template function and we wish to specialize it based on where it was called. Our code might use the CUDA_ARCH macro like this:
    #include <cstdio>  // printf is used, not iostream

    template<int X>
    __global__ void g_foo();

    template<int X>
    __device__ __host__ void foo(){
    #ifndef __CUDA_ARCH__
        // Process locally if small; otherwise do in GPU
        g_foo<X><<<1, 1>>>();
    #else
        g_foo<X>; // !!!
        printf("%d\n", X);
    #endif
    }

    template<int X>
    __global__ void g_foo(){
        foo<X>();
    }

    int main(){
        foo<1>();
    }
The CUDA programming guide states: “If a global function template is instantiated and launched from the host, then the function template must be instantiated with the same template arguments irrespective of whether __CUDA_ARCH__ is defined and regardless of the value of __CUDA_ARCH__.”
Thus, if the host version of foo calls g_foo<X>, the device version of foo must also instantiate g_foo<X> with the same template arguments.
In practice, merely writing “g_foo<X>;” appears to instantiate it, and as far as I can tell it generates no executed code. Is this behaviour well-defined, or is it just a hack? If it is just a hack, is there a proper way to force the instantiation, or is this pattern simply unviable?