Thanks, everyone, for the replies.
Tera - my code isn’t compartmentalized very well, but the source files above are entirely self-contained, and the templated kernel is in main.cuh. I wouldn’t worry about putting much time into it, though; I think the launch bounds were what I was missing. Interestingly enough, the Titan X is still substantially slower for the parameterized function. That could be to do with the 16-bit support you mentioned.
Note to self - always check for spill stores in the compiler output :)
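
For anyone who finds this thread later, here is a rough sketch of the kind of change I mean. It is not the actual kernel from main.cuh: the kernel name `scale`, its body, and the block size of 256 are made up for illustration. It only shows where `__launch_bounds__` goes on a templated kernel and how to get ptxas to report spills.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel, NOT the one in main.cuh.
// __launch_bounds__(MAX_THREADS) tells the compiler that a block will never be
// launched with more than MAX_THREADS threads, so it can budget registers per
// thread for that block size.
template <int MAX_THREADS>
__global__ void __launch_bounds__(MAX_THREADS)
scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    constexpr int kThreads = 256;   // assumed block size, must not exceed the bound
    const int n = 1 << 20;

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Launch with a block size equal to the declared launch bound.
    scale<kThreads><<<(n + kThreads - 1) / kThreads, kThreads>>>(d, 2.0f, n);

    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}

// Build with verbose ptxas output to see register use and spills per kernel:
//   nvcc -Xptxas -v example.cu -o example
// and look at the "bytes spill stores" / "bytes spill loads" figures.
```

If the spill numbers are non-zero for a hot kernel, that is worth investigating before comparing timings between cards.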