Any advice on adjusting code for Maxwell when coming from Kepler?

OK, so I am still trying to wrap my head around adjusting code for Maxwell. As an example, my updated permutation code still runs far faster on the GTX 780 Ti than on the GTX 980, and I do not understand why:

GTX 780ti


GPU timing for 13!: 5.489 seconds.
6227020800 permutations generated, took apx 1052366515200 iterations/calc on gpu!
==4332== Profiling application: ConsoleApplication1.exe
==4332== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
193.35ms  5.39775s          (95016 1 1)       (256 1 1)        26        0B        0B         -           -  GeForce GTX 780         1         7  void _gpu_perm_basic<int=65536>(int) [181]
5.59110s  3.1851ms              (1 1 1)       (256 1 1)        25        0B        0B         -           -  GeForce GTX 780         1         7  _gpu_perm_basic_last_step(__int64, int, __int64) [186]
GTX 980

GPU timing for 13!: 7.439 seconds.
6227020800 permutations generated, took apx 1052366515200 iterations/calc on gpu!
==1276== Profiling application: ConsoleApplication1.exe
==1276== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
208.78ms  7.33532s          (95016 1 1)       (256 1 1)        27        0B        0B         -           -  GeForce GTX 980         1         7  void _gpu_perm_basic<int=65536>(int) [180]
7.54411s  2.1879ms              (1 1 1)       (256 1 1)        26        0B        0B         -           -  GeForce GTX 980         1         7  _gpu_perm_basic_last_step(__int64, int, __int64) [185]
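
To put numbers on the gap, here is a quick Python sketch that uses only the figures from the two profiler runs above. The variable and function names are just mine for illustration. If I have the arithmetic right, the iteration count is exactly 13! x 13^2 (169 iterations per permutation), and the main kernel takes roughly 1.36x as long on the 980.

```python
from math import factorial

# Figures copied from the two profiler runs above; nothing here is measured anew.
ITERATIONS = 1052366515200          # "iterations/calc on gpu" reported by the app
KERNEL_SECONDS = {                  # duration of _gpu_perm_basic<65536>
    "GTX 780 Ti": 5.39775,
    "GTX 980": 7.33532,
}


def iterations_per_second(card):
    """Main-kernel throughput for one card, in iterations per second."""
    return ITERATIONS / KERNEL_SECONDS[card]


def slowdown(slow_card, fast_card):
    """How many times longer the main kernel runs on slow_card."""
    return KERNEL_SECONDS[slow_card] / KERNEL_SECONDS[fast_card]


if __name__ == "__main__":
    permutations = factorial(13)
    print("permutations:", permutations)
    print("iterations per permutation:", ITERATIONS // permutations)
    for card in KERNEL_SECONDS:
        print(f"{card}: {iterations_per_second(card):.3e} iterations/s")
    print(f"980 vs 780 Ti slowdown: {slowdown('GTX 980', 'GTX 780 Ti'):.2f}x")
```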

When I look at the generated PTX, most of it is more or less the same, apart from this section of the main kernel:

GTX 780 Ti, CUDA 6.0, compute 3.5:

BB5_3:
	st.local.u32 	[%rd5], %r36;
	sub.s64 	%rd37, %rd6, %rd35;
	setp.lt.s32	%p3, %r2, 0;
	@%p3 bra 	BB5_15;

	mov.u32 	%r28, 1;
	shl.b32 	%r38, %r28, %r36;


	mov.u32 	%r44, %r2;


BB5_5:

	mov.u32 	%r39, %r44;
	mov.u32 	%r9, %r39;
	cvt.s64.s32	%rd27, %r9;
	mul.wide.s32 	%rd28, %r9, 8;
	add.s64 	%rd30, %rd23, %rd28;
	ld.const.u64 	%rd12, [%rd30];
	mul.lo.s64 	%rd38, %rd12, %rd27;
	setp.le.s64	%p4, %rd38, %rd37;
	mov.u32 	%r42, %r9;
	mov.u32 	%r43, %r9;
	@%p4 bra 	BB5_7;

GTX 980, CUDA 6.5, compute 5.2:

BB2_3:
	st.local.u32 	[%rd5], %r36;




	mov.u32 	%r28, 1;
	shl.b32 	%r38, %r28, %r36;
	sub.s64 	%rd37, %rd6, %rd35;
	setp.lt.s32	%p3, %r2, 0;
	mov.u32 	%r44, %r2;
	@%p3 bra 	BB2_14;


BB2_4:
	mov.u32 	%r39, %r44;
	mov.u32 	%r9, %r39;
	cvt.s64.s32	%rd27, %r9;
	mul.wide.s32 	%rd28, %r9, 8;
	add.s64 	%rd30, %rd23, %rd28;
	ld.const.u64 	%rd12, [%rd30];
	mul.lo.s64 	%rd38, %rd12, %rd27;
	setp.le.s64	%p4, %rd38, %rd37;
	mov.u32 	%r42, %r9;
	mov.u32 	%r43, %r9;
	@%p4 bra 	BB2_6;

It looks like things got reordered in that first sub-section, but overall the PTX looks very similar. I know it is better to look at the SASS, but I am not sure how to view it.

At this point I am guessing this implementation was over-engineered for Kepler, and I am not making the correct adjustments for the new Maxwell architecture.