I’m trying to figure out the best way of handling CUDA code in production scenarios where you need to execute the same kernel with small variations depending on user parameters, context, or platform capabilities. A similar problem exists with graphics shader code, where you ultimately have to deal with a combinatorial explosion of code paths.
Are NVRTC with macro configurations, or runtime kernel-string construction and compilation (as with OpenCL), the only viable paths here? Or are there techniques friendlier to the cuda_runtime “triple chevron” launch syntax and offline compilation, perhaps using heavily templatized kernels?
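To make the templatized-kernel idea concrete, here is a minimal sketch of what I mean (the kernel, its template parameters, and the `launch` dispatcher are hypothetical names I made up for illustration): one kernel source parameterized on template arguments, with all variants instantiated offline by nvcc and a host-side switch picking one at runtime via an ordinary triple-chevron launch.

```cuda
#include <cstdio>

// Hypothetical example: one kernel source, variants selected at compile time
// via template parameters. Every instantiation is compiled offline by nvcc;
// the dead branches (e.g. !UseBias) are eliminated per instantiation.
template <bool UseBias, int Unroll>
__global__ void scale_kernel(float* out, const float* in,
                             float scale, float bias, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * Unroll;
    #pragma unroll
    for (int k = 0; k < Unroll; ++k) {
        int idx = base + k;
        if (idx < n) {
            float v = in[idx] * scale;
            if (UseBias) v += bias;  // compiled out when UseBias == false
            out[idx] = v;
        }
    }
}

// Host-side dispatch: map a runtime flag onto the pre-compiled instantiations
// with a plain triple-chevron launch. With many flags this switch itself
// grows combinatorially, which is exactly the tension in my question.
void launch(float* out, const float* in, float scale, float bias,
            int n, bool useBias) {
    constexpr int kUnroll = 4;
    int threads = 256;
    int blocks  = (n / kUnroll + threads - 1) / threads;
    if (useBias)
        scale_kernel<true,  kUnroll><<<blocks, threads>>>(out, in, scale, bias, n);
    else
        scale_kernel<false, kUnroll><<<blocks, threads>>>(out, in, scale, bias, n);
}
```

The upside is full offline compilation and type checking; the downside is binary bloat and an explicit dispatch table once the number of template parameters grows, which is why I’m asking whether there are better-structured approaches.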
A pointer to some open-source libraries with good designs here would be ideal.
Thanks in advance.