OK, so I am still trying to wrap my head around adjusting code for Maxwell. As an example, my updated permutation code still runs far faster on the GTX 780 Ti than on the GTX 980 (5.489 s vs. 7.439 s for 13!), and I do not understand why:
GTX 780 Ti
GPU timing for 13!: 5.489 seconds.
6227020800 permutations generated, took apx 1052366515200 iterations/calc on gpu!
==4332== Profiling application: ConsoleApplication1.exe
==4332== Profiling result:
Start     Duration  Grid Size    Block Size  Regs*  SSMem*  DSMem*  Size  Throughput  Device           Context  Stream  Name
193.35ms  5.39775s  (95016 1 1)  (256 1 1)   26     0B      0B      -     -           GeForce GTX 780  1        7       void _gpu_perm_basic<int=65536>(int) [181]
5.59110s  3.1851ms  (1 1 1)      (256 1 1)   25     0B      0B      -     -           GeForce GTX 780  1        7       _gpu_perm_basic_last_step(__int64, int, __int64) [186]
GTX 980
GPU timing for 13!: 7.439 seconds.
6227020800 permutations generated, took apx 1052366515200 iterations/calc on gpu!
==1276== Profiling application: ConsoleApplication1.exe
==1276== Profiling result:
Start     Duration  Grid Size    Block Size  Regs*  SSMem*  DSMem*  Size  Throughput  Device           Context  Stream  Name
208.78ms  7.33532s  (95016 1 1)  (256 1 1)   27     0B      0B      -     -           GeForce GTX 980  1        7       void _gpu_perm_basic<int=65536>(int) [180]
7.54411s  2.1879ms  (1 1 1)      (256 1 1)   26     0B      0B      -     -           GeForce GTX 980  1        7       _gpu_perm_basic_last_step(__int64, int, __int64) [185]
When I look at the generated PTX, the two versions are nearly identical apart from this section of the main kernel:
GTX 780 Ti, CUDA 6.0, compute 3.5
BB5_3:
    st.local.u32 [%rd5], %r36;
    sub.s64 %rd37, %rd6, %rd35;
    setp.lt.s32 %p3, %r2, 0;
    @%p3 bra BB5_15;
    mov.u32 %r28, 1;
    shl.b32 %r38, %r28, %r36;
    mov.u32 %r44, %r2;
BB5_5:
    mov.u32 %r39, %r44;
    mov.u32 %r9, %r39;
    cvt.s64.s32 %rd27, %r9;
    mul.wide.s32 %rd28, %r9, 8;
    add.s64 %rd30, %rd23, %rd28;
    ld.const.u64 %rd12, [%rd30];
    mul.lo.s64 %rd38, %rd12, %rd27;
    setp.le.s64 %p4, %rd38, %rd37;
    mov.u32 %r42, %r9;
    mov.u32 %r43, %r9;
    @%p4 bra BB5_7;
GTX 980, CUDA 6.5, compute 5.2
BB2_3:
    st.local.u32 [%rd5], %r36;
    mov.u32 %r28, 1;
    shl.b32 %r38, %r28, %r36;
    sub.s64 %rd37, %rd6, %rd35;
    setp.lt.s32 %p3, %r2, 0;
    mov.u32 %r44, %r2;
    @%p3 bra BB2_14;
BB2_4:
    mov.u32 %r39, %r44;
    mov.u32 %r9, %r39;
    cvt.s64.s32 %rd27, %r9;
    mul.wide.s32 %rd28, %r9, 8;
    add.s64 %rd30, %rd23, %rd28;
    ld.const.u64 %rd12, [%rd30];
    mul.lo.s64 %rd38, %rd12, %rd27;
    setp.le.s64 %p4, %rd38, %rd37;
    mov.u32 %r42, %r9;
    mov.u32 %r43, %r9;
    @%p4 bra BB2_6;
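It looks like the instructions in that first sub-section were reordered (in the compute 5.2 version the mov/shl pair and the mov into %r44 are hoisted above the branch), but overall the PTX is very similar. I know it is better to look at the SASS, but I am not sure how to view it.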
At this point my guess is that this implementation was over-engineered for Kepler and that I am not making the right adjustments for the new Maxwell architecture.