Adding a template argument to a kernel slows down the program


I tried to find similar topics but haven’t found any, so…

When I turn an existing kernel into a template (adding only one integer template argument, which is not used in the code at all), the kernel slows down by about 30-40%. I looked at the PTX code and it is very similar, except that one instruction declaring shared memory is missing; I don’t know whether that indicates a problem.

So my question is: have you encountered the same thing (templated kernels running slower than non-templated ones)? What might I be doing wrong?

I am using CUDA 7.0, a Tesla K10, and Linux Mint 17.1.

Thank you for your help.

Could you provide an example or sample, perhaps?

If your code is being JIT-compiled because you didn’t specify the right native architecture, the template may slow down the compilation itself, which you’d see at runtime, since the code is effectively being compiled every time you run.

That would not affect kernel execution time, though; it may be that you’re just measuring wall-clock time.
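One way to separate the two effects is to time only the kernel with CUDA events, after a warm-up launch that absorbs context creation and any JIT compilation. The following is a rough sketch, not the original poster's code: the kernels scalePlain and scaleTemplated and the helper timeKernelMs are made-up stand-ins, and the build line assumes the Tesla K10's compute capability is 3.0.

```cuda
// Hypothetical reproducer; all kernel and helper names here are illustrative only.
// Assumed build line for a Tesla K10 (compute capability 3.0) with CUDA 7.0:
//   nvcc -std=c++11 -arch=sm_30 timing.cu -o timing
#include <cstdio>
#include <cuda_runtime.h>

// Plain kernel: doubles every element.
__global__ void scalePlain(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Same body, plus an integer template argument that is never used,
// mirroring the situation described in the question.
template <int Unused>
__global__ void scaleTemplated(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Times one launch with CUDA events and returns the elapsed milliseconds.
// The untimed warm-up launch keeps one-time costs (context setup, JIT)
// out of the measurement.
template <typename Launch>
float timeKernelMs(Launch launch)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    launch();                      // warm-up, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    launch();                      // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);    // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d, 0, n * sizeof(float));

    const int block = 256;
    const int grid = (n + block - 1) / block;

    float plainMs = timeKernelMs([=] { scalePlain<<<grid, block>>>(d, n); });
    float templMs = timeKernelMs([=] { scaleTemplated<0><<<grid, block>>>(d, n); });

    printf("plain:     %f ms\n", plainMs);
    printf("templated: %f ms\n", templMs);

    cudaFree(d);
    return 0;
}
```

If the two event-timed numbers are close while the wall-clock time of the whole program still differs, the extra time is probably being spent outside the kernel (for example in JIT compilation at startup) rather than in the templated kernel itself.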