is eaten by nvcc. However, as soon as a function is a template, it disappears completely from the PTX assembly. So I guess it must be instantiated for some given template parameters somehow. But how? And what do I give cudaLaunch?
That's right: by definition (in C++ and CUDA), templates don't generate any code unless you instantiate them!
To invoke your template kernel, just pass the template parameters before the execution configuration, like this:
foo<float><<<threads, blocks, smem>>>(devicevar);
However, if the type of devicevar is obvious to the compiler, you should be able to drop the explicit template argument here, since the compiler can infer T from the type of devicevar. Try it both ways; the explicit form above will certainly work.
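A quick sketch of both forms side by side (the kernel body and launch parameters here are made up for illustration):

```cuda
// Hypothetical templated kernel: fills an array with its indices.
template <typename T>
__global__ void foo(T *devicevar)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    devicevar[i] = static_cast<T>(i);
}

void launch_both_ways(float *devicevar, dim3 grid, dim3 block, size_t smem)
{
    // Explicit template argument -- always works:
    foo<float><<<grid, block, smem>>>(devicevar);

    // Implicit: the compiler deduces T = float from devicevar's type:
    foo<<<grid, block, smem>>>(devicevar);
}
```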
Note that dynamic (extern) shared memory is tricky with kernels that are templatized on type. I have a solution for this that I will include in a future SDK sample…
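For reference, a common workaround for the dynamic shared memory problem is to hide the extern declaration behind a templated struct and give each element type its own uniquely named array via specialization, since two extern __shared__ arrays of different types may not share a name. A sketch (struct and kernel names are mine, not from an SDK sample):

```cuda
// Primary template deliberately left undefined: using an unsupported T
// becomes a compile-time error instead of a silent aliasing bug.
template <typename T> struct SharedMemory;

template <> struct SharedMemory<float>
{
    __device__ float *getPointer()
    {
        extern __shared__ float s_float[];
        return s_float;
    }
};

template <> struct SharedMemory<int>
{
    __device__ int *getPointer()
    {
        extern __shared__ int s_int[];
        return s_int;
    }
};

template <typename T>
__global__ void kernel(T *out)
{
    // Points at the dynamically sized shared memory given at launch.
    T *sdata = SharedMemory<T>().getPointer();
    // ... use sdata ...
}
```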
I usually use nvcc only on the global and device code and call cudaLaunch from C++. So I guess one way is to add a dummy host function, never actually called, that performs the needed invocations in the .cu file so the instantiations get generated, and then call cudaLaunch("foo") or something?
The global func is defined in the cu-file as follows:
template<typename T, unsigned int block_dim_log2, unsigned int tile_size_log2, unsigned int cols_log2>
__global__ void
foo( unsigned int *out, float t )
{
    // ...
}
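If you go the dummy-function route, something like this in the .cu file should force nvcc to emit code for the instantiation you want (the concrete template arguments below are placeholders, not values from your code):

```cuda
// Never called at runtime; its only purpose is to force an instantiation
// of foo so the compiled kernel exists for a later launch by name.
void dummy_instantiate(unsigned int *out, float t)
{
    foo<float, 8, 5, 4><<<1, 1>>>(out, t);
}
```

Note that the kernel's mangled name will encode the template arguments, so the string you pass to cudaLaunch will not simply be "foo".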
Using cudaLaunch is only necessary if you are using the driver API (cuda.dll, cu* functions). The syntax I provided should work for the runtime API (cudart.dll, cuda* functions).
Edit: I just noticed you are calling from a C++ file (i.e., one not compiled by nvcc), which means you do need to call cudaLaunch. Calling the kernel directly only works within a single compilation unit compiled by nvcc, currently.
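An alternative to cudaLaunch in this situation is a thin non-templated wrapper in the .cu file that performs the launch, exposed with a plain signature the C++ side can link against. A sketch, with made-up names and placeholder template arguments:

```cuda
// In the .cu file, compiled by nvcc:
extern "C" void launch_foo(unsigned int *out, float t,
                           dim3 grid, dim3 block, size_t smem)
{
    // Instantiates foo for this particular set of template arguments.
    foo<float, 8, 5, 4><<<grid, block, smem>>>(out, t);
}
```

The C++ file then just declares `extern "C" void launch_foo(...)` and calls it like any ordinary function, with no cudaLaunch or name-mangling concerns.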