I searched for similar topics but didn't find any, so here is my question.
When I turn an existing kernel into a template (adding only one template argument, an integer, which is not used anywhere in the code), the kernel slows down by about 30-40%. I looked at the PTX code and the two versions are very similar, except that one instruction for creating shared memory is missing. I don't know whether that indicates a problem.
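A minimal sketch of the kind of change described, to make the comparison concrete. The actual kernel is not shown in this post, so everything here is an assumption: the names `scale_plain`, `scale_templated`, `time_both`, the `BLOCK` size, and the placeholder body that uses a shared-memory tile are all hypothetical. Both variants are timed with CUDA events under the same launch configuration.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define BLOCK 256  // assumed block size for this sketch

// Hypothetical stand-in for the original, non-template kernel.
__global__ void scale_plain(const float* in, float* out, int n)
{
    __shared__ float tile[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

// Same body; the only change is an integer template argument that is never used.
template <int Unused>
__global__ void scale_templated(const float* in, float* out, int n)
{
    __shared__ float tile[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

// Times one launch of each variant (after a warm-up launch) and writes the
// elapsed milliseconds to the out-parameters. Returns false on a CUDA error.
bool time_both(int n, float* ms_plain, float* ms_templated)
{
    assert(n > 0);
    float* in = 0;
    float* out = 0;
    if (cudaMalloc(&in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&out, n * sizeof(float)) != cudaSuccess) { cudaFree(in); return false; }
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int grid = (n + BLOCK - 1) / BLOCK;

    // Warm-up launches so one-time initialization is not part of the timing.
    scale_plain<<<grid, BLOCK>>>(in, out, n);
    scale_templated<0><<<grid, BLOCK>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    scale_plain<<<grid, BLOCK>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms_plain, start, stop);

    cudaEventRecord(start);
    scale_templated<0><<<grid, BLOCK>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms_templated, start, stop);

    bool ok = (cudaGetLastError() == cudaSuccess);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return ok;
}
```

Calling `time_both(1 << 20, &a, &b)` from `main` and printing `a` and `b` gives the two timings side by side from a single build, so both measurements share the same compiler flags.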
So my question is: has anyone encountered the same thing (template kernels running slower than non-template kernels)? What could I be doing wrong?
I am using CUDA 7.0 on a Tesla K10 under Linux Mint 17.1.
Thank you for your help.