Different CUDA kernel variations based on runtime context


I’m trying to work out the best way of handling CUDA code in production scenarios where you need to execute the same kernel with small variations depending on user parameters, context, or platform capabilities. A similar problem exists in graphics shader code, where you ultimately have to deal with a combinatorial explosion of code paths.

Are NVRTC with macro configurations, or building kernel strings and compiling them at runtime as with OpenCL, the only viable paths here? Or are there techniques friendlier to the CUDA runtime API (“triple chevron” launches) and offline compilation, perhaps using heavily templatized kernels?
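For context, the NVRTC-plus-macro route I have in mind looks roughly like this (the kernel, the `USE_SATURATE` switch, and `buildVariant` are just made-up illustrations; error checking and CUDA context setup are omitted):

```cuda
#include <nvrtc.h>
#include <cuda.h>
#include <string>
#include <vector>

// Hypothetical kernel source with one variant selected by a preprocessor macro.
static const char *kSource = R"(
extern "C" __global__ void scale(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
#if USE_SATURATE
        out[i] = __saturatef(in[i] * 2.0f);
#else
        out[i] = in[i] * 2.0f;
#endif
    }
}
)";

// Compile one variant at run time; assumes a CUDA context already exists.
CUfunction buildVariant(bool saturate)
{
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kSource, "scale.cu", 0, nullptr, nullptr);

    std::string def = saturate ? "-DUSE_SATURATE=1" : "-DUSE_SATURATE=0";
    const char *opts[] = { def.c_str() };
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    CUmodule mod;
    CUfunction fn;
    cuModuleLoadData(&mod, ptx.data());
    cuModuleGetFunction(&fn, mod, "scale");
    return fn;  // launch later with cuLaunchKernel
}
```

This works, but it pulls the driver API and a runtime compiler dependency into the application, which is what I was hoping to avoid.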

A pointer to some open-source libraries with good designs would be ideal.

Thanks in advance.

How many different variants of the code are we talking about? I have used templated kernels, invoked via function pointers, with up to one hundred or so different variants. One obvious drawback is the increase in compilation time. The obvious alternative is online code generation, either at the PTX level or at the HLL level.
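A minimal sketch of the templated-kernel approach, assuming two made-up variant axes (clamping on/off and an unroll factor); every instantiation is compiled offline and the right one is picked from a table at run time:

```cuda
#include <cuda_runtime.h>

// One template parameter per variant axis; each additional axis multiplies
// the number of instantiations (and the compile time).
template <bool kClamp, int kUnroll>
__global__ void transform(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < kUnroll; ++k) {
        int idx = i * kUnroll + k;
        if (idx < n) {
            float v = in[idx] * 2.0f;
            out[idx] = kClamp ? fminf(fmaxf(v, 0.0f), 1.0f) : v;
        }
    }
}

typedef void (*KernelFn)(float *, const float *, int);

// All variants instantiated at build time; indexed by run-time flags.
static const KernelFn kTable[2][2] = {
    { transform<false, 1>, transform<false, 4> },
    { transform<true,  1>, transform<true,  4> },
};

void launch(bool clamp, bool unroll4, float *out, const float *in, int n)
{
    KernelFn fn = kTable[clamp][unroll4];
    int threads = 256;
    int elemsPerBlock = threads * (unroll4 ? 4 : 1);
    int blocks = (n + elemsPerBlock - 1) / elemsPerBlock;
    fn<<<blocks, threads>>>(out, in, n);  // launching via the pointer is legal
}
```

This keeps everything inside the runtime API and offline compilation; the cost is that the table (and the binary) grows with the product of the variant axes.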

Reasonable guidance on how to proceed may be the way you are currently handling the same issue for CPU code, e.g. different optimized performance paths for SSE, AVX, AVX2, and AVX512, times user parameters, times contexts.

If the kernel changes are minor, the performance gains from templating can also be minor (2-3%), in which case you can cut down on the number of template parameters by simply handling some of the possible variants at run time, via plain old if-statements.
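Concretely, folding one axis of the earlier sketch into a run-time branch would look like this (again, a made-up example); since the flag is uniform across the grid, the branch costs little:

```cuda
#include <cuda_runtime.h>

// One variant axis handled by a run-time argument instead of a template
// parameter, halving the number of kernel instantiations.
__global__ void transform(float *out, const float *in, int n, bool clamp)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i] * 2.0f;
        // Uniform branch: all threads in a warp take the same path.
        out[i] = clamp ? fminf(fmaxf(v, 0.0f), 1.0f) : v;
    }
}
```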