Why is compute 3.5 faster than sm_61 on TITAN X (Pascal) with CUDA 8.0?

Any ideas why a program compiled with -arch sm_35 should run slightly faster
on a TITAN X (Pascal) than when it is compiled with -arch sm_61?
The code was optimised for compute capability 3.5 under CUDA 6 and has not been
optimised since.

The difference is not huge, about 5%, and the executables produce the same answers,
but the elapsed times seem to me to be the wrong way round.

I am expecting the bulk of the runtime to be spent in two CUDA kernels
which are limited by reading global memory.

Many thanks
Bill

What you are observing is an artifact of CUDA’s two-stage compilation process, in which each stage uses an optimizing compiler (NVVM in the first stage, PTXAS in the second). The optimizations of the two stages interact in ways that are hard to predict, so you can just as easily get the opposite effect, that is, a slowdown of 5%. Basically, this is noise in a complex process consisting of numerous phases.

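The other effect you may be observing comes from JIT compilation. An executable built with -arch sm_35 contains no machine code for sm_61, so on the TITAN X (Pascal) the CUDA driver JIT-compiles the embedded PTX when the program loads. The PTXAS component in the CUDA driver gets updated in between CUDA releases, so JIT compilation via PTXAS may result in slightly faster executables than offline compilation, as PTXAS optimizations are improved over time.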
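One way to check which path a given build takes is to ask the runtime what it actually loaded. The following is only a minimal sketch, not taken from the original program: probe_kernel is a hypothetical stand-in for one of the real kernels, encode_version is an illustrative helper, and device 0 is assumed to be the TITAN X (Pascal). It prints the ptxVersion and binaryVersion fields reported by cudaFuncGetAttributes next to the device's compute capability.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute one of the real kernels.
__global__ void probe_kernel(float *out)
{
    if (out != 0) out[threadIdx.x] = 0.0f;
}

// Illustrative helper: encode a compute capability the same way the
// runtime encodes ptxVersion and binaryVersion (major * 10 + minor).
static int encode_version(int major, int minor)
{
    return major * 10 + minor;
}

int main()
{
    // Assumption: device 0 is the GPU of interest.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Query the versions of the code that the runtime loaded for this kernel.
    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, probe_kernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("device compute capability = %d\n",
                encode_version(prop.major, prop.minor));
    std::printf("ptxVersion                = %d\n", attr.ptxVersion);
    std::printf("binaryVersion             = %d\n", attr.binaryVersion);
    return 0;
}
```

Building this once with `nvcc -arch sm_35` and once with `nvcc -arch sm_61` and running both on the same card should show the difference. A ptxVersion lower than the device's compute capability, together with a binaryVersion that matches the device, is consistent with the driver having JIT-compiled the PTX, assuming the plain -arch shorthand was used rather than explicit -gencode options.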

Thanks njuffa.