I have sometimes observed templated CUDA kernel code:
CASE A) template <int M, int N> __global__ void dummyKernel(int a, int b, int c) { /* do something */ }
The above can also be written as:
CASE B) __global__ void dummyKernel(int M, int N, int a, int b, int c) { /* do something */ }
Is there any advantage to using a templated CUDA kernel? Is there any advantage to using CASE A instead of CASE B?
There is often a performance advantage for CASE A, as the CUDA compiler, or any C++ compiler for that matter, can trivially propagate the compile-time constant template parameters M and N into the function body, where they are propagated further by standard constant-propagation optimizations, which in turn can enable a whole slew of other optimizations. Beyond that, the use of templates generally allows some work to be shifted from run time to compile time by means of template metaprogramming.
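For illustration, here is a minimal sketch (the kernel name and body are hypothetical, not from the question) of the kind of optimization this enables: with N as a template parameter, the loop trip count is a compile-time constant, so the compiler can fully unroll the loop, whereas with a run-time bound it generally cannot.

```cpp
template <int N>
__global__ void scaleKernel(float *data, float factor)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * N;
    #pragma unroll  // trip count N is a compile-time constant: full unroll
    for (int i = 0; i < N; i++) {
        data[base + i] *= factor;
    }
}
```

An instantiation such as scaleKernel<4><<<grid, block>>>(d_data, 2.0f) then compiles to straight-line code with no loop overhead.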
For CASE B, some compilers can create clone instances of a function based on calls with compile-time constant function arguments M and N. I do not know whether the CUDA compiler does that, though. I know for certain that the Intel compilers have included this capability for a long time.
When CUDA transitioned from C to C++ (2008 to 2009, I think), template support was one of the first C++ features made available, if not the very first, because templates are a very powerful tool from a software engineering perspective, in particular when it comes to types as template parameters. As an example, the numerous GEMM variants in CUBLAS were an awful mess of pre-processor directives prior to template support in CUDA.
Modern CUDA is simply a dialect of C++, and most if not all of the best practices regarding the use of C++ language features apply.
@njuffa just a follow-up question: does a templated kernel help when we receive variables at run time and pass them to the kernel via templates? For example, accepting the block size and grid size at run time and passing them to the kernel via templates? Or does a templated kernel offer no added advantage when the variables are only known at run time rather than at compile time?
Are these completely random values that occur at run time, or is it one combo from a small pre-determined set of values (say, at most a dozen)?
For the latter case, the technique I have used in the past is to explicitly instantiate the template for all possible sets of values, store pointers to these functions in an array, then invoke the appropriate variant at run time through the function pointer.
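A minimal sketch of that dispatch scheme, assuming a hypothetical dummyKernel and a small pre-determined set of (M, N) combinations; launching a __global__ function through a host-side function pointer is supported by nvcc:

```cpp
#include <cstdio>

template <int M, int N>
__global__ void dummyKernel(int a)
{
    // M and N are compile-time constants inside the kernel body
    if (threadIdx.x == 0) printf("M=%d N=%d a=%d\n", M, N, a);
}

// host-side pointer type for the instantiated kernels
typedef void (*kernel_t)(int);

// explicit instantiations covering the pre-determined value combinations
static const kernel_t kernels[2][2] = {
    { dummyKernel<16, 16>, dummyKernel<16, 32> },
    { dummyKernel<32, 16>, dummyKernel<32, 32> },
};

// select the matching variant at run time, here for m, n in {16, 32}
void launchDummy(int m, int n, int a)
{
    kernel_t k = kernels[m == 32 ? 1 : 0][n == 32 ? 1 : 0];
    k<<<1, 32>>>(a);
}
```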
If the inputs (here: N, M) are completely random, templating is not applicable unless I am missing something.
To get the best of both worlds you could provide specialized versions (potentially faster) for the most common parameter combinations and one generic routine (potentially slower) that can deal with any combination of input parameters.
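A sketch of that hybrid approach (kernel names and the chosen combinations are hypothetical): dispatch to a specialized instantiation when the run-time sizes match a known combination, otherwise fall back to the generic kernel.

```cpp
template <int M, int N>
__global__ void fastKernel(int a)
{
    // M and N are compile-time constants here (potentially faster)
}

__global__ void genericKernel(int M, int N, int a)
{
    // M and N are ordinary run-time arguments (potentially slower)
}

void launch(int m, int n, int a)
{
    if      (m == 16 && n == 16) fastKernel<16, 16><<<1, 32>>>(a);
    else if (m == 32 && n == 32) fastKernel<32, 32><<<1, 32>>>(a);
    else                         genericKernel<<<1, 32>>>(m, n, a);
}
```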
You really have to understand why your specific kernel runs faster with one or more compile-time constant values. Perhaps it is not the block size or grid size that has to be constant for best performance, but some derived value, e.g. the number of iterations of an unrolled loop or a stride size. A compile-time constant either influences control flow (turning dynamic control flow into static control flow, or removing conditional code entirely), or it simplifies calculations, or it saves registers.
The best solution could also be a mixture, e.g. block size as template parameter and grid size as kernel function parameter.
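As a sketch of such a mixture (the reduction kernel below is illustrative, not from this thread): the block size is a template parameter, which fixes the shared-memory array size and makes the reduction loop bounds static, while the grid size remains an ordinary run-time launch parameter.

```cpp
template <int BLOCK_SIZE>
__global__ void sumKernel(const float *in, float *out, int n)
{
    __shared__ float smem[BLOCK_SIZE];  // size fixed at compile time
    int tid = threadIdx.x;
    int i = blockIdx.x * BLOCK_SIZE + tid;
    smem[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    #pragma unroll  // static loop bounds: the reduction unrolls fully
    for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = smem[0];
}

// grid size is computed at run time; block size is the template argument
void launchSum(const float *in, float *out, int n)
{
    const int block = 256;
    int grid = (n + block - 1) / block;
    sumKernel<256><<<grid, block>>>(in, out, n);
}
```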