Performance slowdown when moving template parameter to function argument

Thanks, everyone, for the replies.

Tera - my code isn’t compartmentalized very well, but the source files above are entirely self-contained, and the templated kernel is in main.cuh. I wouldn’t worry about putting much time into it, though; I think the launch bounds were what I was missing. Interestingly, the Titan X is still substantially slower for the version that takes the value as a function argument instead of a template parameter. That could be related to the 16-bit support you mentioned.

Note to self - always check for store spills in compiler output :)
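
For anyone landing here later, a minimal sketch of the two things above: putting `__launch_bounds__` on both the templated kernel and the function-argument kernel, and asking the compiler to report spills. The kernel names, the block size of 256, and the loop body are all made up for illustration and are not taken from main.cuh.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical block size, chosen only for this sketch.
constexpr int kBlockSize = 256;

// Template version: N is a compile-time constant, so the loop can be fully unrolled.
template <int N>
__global__ void __launch_bounds__(kBlockSize)
iterateTemplated(float* data, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;
    float v = data[i];
    #pragma unroll
    for (int k = 0; k < N; ++k) v = v * 1.0001f + 0.5f;
    data[i] = v;
}

// Function-argument version: n is only known at run time, so the compiler
// has less information for unrolling and register allocation.
__global__ void __launch_bounds__(kBlockSize)
iterateRuntime(float* data, int count, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;
    float v = data[i];
    for (int k = 0; k < n; ++k) v = v * 1.0001f + 0.5f;
    data[i] = v;
}

int main()
{
    const int count = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, count * sizeof(float));
    cudaMemset(d, 0, count * sizeof(float));

    const int blocks = (count + kBlockSize - 1) / kBlockSize;
    iterateTemplated<8><<<blocks, kBlockSize>>>(d, count);
    iterateRuntime<<<blocks, kBlockSize>>>(d, count, 8);

    // Report any launch or execution error from either kernel.
    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

Compiling with `nvcc -Xptxas -v file.cu` makes ptxas print, per kernel, the register count along with the bytes of spill stores and spill loads, which is the output worth checking. A kernel this small probably won't spill at all; the point is only where the launch bounds go and where to look.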

Thanks for that