I have sometimes observed templated CUDA kernel code:
CASE A) template <int M, int N> __global__ void dummyKernel(int a, int b, int c) { /* do something */ }
The above can also be written as:
CASE B) __global__ void dummyKernel(int M, int N, int a, int b, int c) { /* do something */ }
Is there any advantage to using a templated CUDA kernel? Is there any advantage to using CASE A instead of CASE B?
There is often a performance advantage for CASE A, as the CUDA compiler, or any C++ compiler for that matter, can trivially propagate the compile-time constant template parameters M and N into the function body, where they are propagated further by standard constant-propagation optimizations, which in turn can enable a whole slew of other optimizations. Beyond that, the use of templates generally allows some work to be shifted from run time to compile time by means of template metaprogramming.
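For illustration, here is a minimal sketch (the kernel name and body are hypothetical, not from the question) of the kind of optimization this enables: with N as a template parameter, the loop trip count is a compile-time constant, so the compiler can fully unroll the loop, whereas with a run-time bound it generally cannot.

```cpp
template <int N>
__global__ void scaleKernel(float *data, float factor)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * N;
    #pragma unroll  // trip count N is a compile-time constant: full unroll
    for (int i = 0; i < N; i++) {
        data[base + i] *= factor;
    }
}
```

An instantiation such as scaleKernel<4><<<grid, block>>>(d_data, 2.0f) then compiles to straight-line code with no loop overhead.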
For CASE B, some compilers can create clone instances of a function based on calls with compile-time constant function arguments M and N. I do not know whether the CUDA compiler does that, though. I know for certain that the Intel compilers have included this capability for a long time.
When CUDA transitioned from C to C++ (2008 to 2009, I think), template support was one of the first C++ features made available, if not the very first, because templates are a very powerful tool from a software engineering perspective, in particular when it comes to types as template parameters. As an example, the numerous GEMM variants in CUBLAS were an awful mess of pre-processor directives prior to template support in CUDA.
Modern CUDA is simply a dialect of C++, and most if not all of the best practices regarding the use of C++ language features apply.
@njuffa just a follow-up question: does a templated kernel help when we receive variables at run time and pass them to the kernel via templates? For example, accepting the block size and grid size at run time and passing them to the kernel via templates? Or does a templated kernel offer no added advantage when the variables are only known at run time rather than at compile time?
Are these completely random values that occur at run time, or is it one combo from a small pre-determined set of values (say, at most a dozen)?
For the latter case, the technique I have used in the past is to explicitly instantiate the template for all possible sets of values, store pointers to these functions in an array, then invoke the appropriate variant at run time through the function pointer.
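A minimal sketch of that dispatch scheme, assuming a hypothetical dummyKernel and a small pre-determined set of (M, N) combinations; launching a __global__ function through a host-side function pointer is supported by nvcc:

```cpp
#include <cstdio>

template <int M, int N>
__global__ void dummyKernel(int a)
{
    // M and N are compile-time constants inside the kernel body
    if (threadIdx.x == 0) printf("M=%d N=%d a=%d\n", M, N, a);
}

// host-side pointer type for the instantiated kernels
typedef void (*kernel_t)(int);

// explicit instantiations covering the pre-determined value combinations
static const kernel_t kernels[2][2] = {
    { dummyKernel<16, 16>, dummyKernel<16, 32> },
    { dummyKernel<32, 16>, dummyKernel<32, 32> },
};

// select the matching variant at run time, here for m, n in {16, 32}
void launchDummy(int m, int n, int a)
{
    kernel_t k = kernels[m == 32 ? 1 : 0][n == 32 ? 1 : 0];
    k<<<1, 32>>>(a);
}
```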
If the inputs (here: N, M) are completely random, templating is not applicable unless I am missing something.
To get the best of both worlds you could provide specialized versions (potentially faster) for the most common parameter combinations and one generic routine (potentially slower) that can deal with any combination of input parameters.
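A sketch of that hybrid approach (kernel names and the chosen combinations are hypothetical): dispatch to a specialized instantiation when the run-time sizes match a known combination, otherwise fall back to the generic kernel.

```cpp
template <int M, int N>
__global__ void fastKernel(int a)
{
    // M and N are compile-time constants here (potentially faster)
}

__global__ void genericKernel(int M, int N, int a)
{
    // M and N are ordinary run-time arguments (potentially slower)
}

void launch(int m, int n, int a)
{
    if      (m == 16 && n == 16) fastKernel<16, 16><<<1, 32>>>(a);
    else if (m == 32 && n == 32) fastKernel<32, 32><<<1, 32>>>(a);
    else                         genericKernel<<<1, 32>>>(m, n, a);
}
```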
You really have to understand why your specific kernel runs faster with one or more compile-time constant values. Perhaps it is not the block size or grid size that has to be constant for best performance, but some derived value, e.g. the number of iterations of an unrolled loop or a stride size. A compile-time constant either influences control flow (turning dynamic control flow into static control flow, or removing conditional code entirely), or it simplifies calculations, or it saves registers.
The best solution could also be a mixture, e.g. block size as template parameter and grid size as kernel function parameter.
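As a sketch of such a mixture (the reduction kernel below is illustrative, not from this thread): the block size is a template parameter, which fixes the shared-memory array size and makes the reduction loop bounds static, while the grid size remains an ordinary run-time launch parameter.

```cpp
template <int BLOCK_SIZE>
__global__ void sumKernel(const float *in, float *out, int n)
{
    __shared__ float smem[BLOCK_SIZE];  // size fixed at compile time
    int tid = threadIdx.x;
    int i = blockIdx.x * BLOCK_SIZE + tid;
    smem[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    #pragma unroll  // static loop bounds: the reduction unrolls fully
    for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = smem[0];
}

// grid size is computed at run time; block size is the template argument
void launchSum(const float *in, float *out, int n)
{
    const int block = 256;
    int grid = (n + block - 1) / block;
    sumKernel<256><<<grid, block>>>(in, out, n);
}
```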