Any advice on adjusting code for Maxwell when coming from Kepler?

Looks like we are right back to the integer multiplies, which are emulated on Maxwell. You would definitely want to look at the SASS to see how many additional instructions the emulation produces. The expansion may create more instructions than Maxwell's higher instruction throughput can absorb, leading to an overall slowdown.
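To get a feel for why an emulated multiply expands into several instructions, here is a minimal host-side sketch in C++. It assumes the hardware multiplier only handles 16-bit operands natively (my understanding is that this is what Maxwell's XMAD does) and rebuilds the low 32 bits of a 32-bit product from 16-bit pieces. The function name mul32_lo_from_16bit is made up for illustration, and this is not the actual sequence the compiler emits.

```cpp
#include <cstdint>

// Illustrative only: low 32 bits of a 32x32-bit product, built from
// 16x16-bit partial products. Hypothetical helper, not compiler output.
constexpr std::uint32_t mul32_lo_from_16bit(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t a_lo = a & 0xFFFFu;
    const std::uint32_t a_hi = a >> 16;
    const std::uint32_t b_lo = b & 0xFFFFu;
    const std::uint32_t b_hi = b >> 16;

    // a*b = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) << 16) + ((a_hi*b_hi) << 32)
    // The a_hi*b_hi term lies entirely above bit 31, so it is dropped.
    const std::uint32_t low   = a_lo * b_lo;                // 1st 16-bit multiply
    const std::uint32_t cross = a_lo * b_hi + a_hi * b_lo;  // 2nd and 3rd multiplies
    // Unsigned wraparound in 'cross' is harmless: only its low 16 bits
    // survive the shift into the low 32 bits of the result.
    return low + (cross << 16);
}

// Compile-time sanity check against the native multiply.
static_assert(mul32_lo_from_16bit(7u, 6u) == 7u * 6u, "small operands");
```

So a single 32-bit multiply becomes roughly three multiply-type operations plus shifts and adds, and the .hi and 64-bit variants need more. To count the real instructions, run cuobjdump -sass on the binaries built for each architecture and compare.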

As I recall, PTX offers about 25 different flavors of IMUL and IMAD. I think it is possible that some of the emulation sequences are not fully optimized, in which case you may want to consider filing a bug.

If Maxwell follows the precedent set by previous GPUs, the 64-bit integer conversion instructions are handled by the double-precision execution pipe, as that is the only execution path that can consume and deliver 64-bit data. I think the DP ratio (double-precision throughput relative to single-precision throughput) of Maxwell consumer parts is lower than that of Kepler consumer parts. If so, that could also play a role.