Any advice on adjusting code for Maxwell when coming from Kepler

CudaaduC · November 6, 2014, 5:38am

Ok, so I am still trying to wrap my head around adjusting code for Maxwell. As an example, my updated permutation code still runs far faster on the GTX 780ti than on the GTX 980, and I do not understand why:

GTX 780ti


GPU timing for 13!: 5.489 seconds.
6227020800 permutations generated, took apx 1052366515200 iterations/calc on gpu!
==4332== Profiling application: ConsoleApplication1.exe
==4332== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
193.35ms  5.39775s          (95016 1 1)       (256 1 1)        26        0B        0B         -           -  GeForce GTX 780         1         7  void _gpu_perm_basic<int=65536>(int) [181]
5.59110s  3.1851ms              (1 1 1)       (256 1 1)        25        0B        0B         -           -  GeForce GTX 780         1         7  _gpu_perm_basic_last_step(__int64, int, __int64) [186]

GTX 980

GPU timing for 13!: 7.439 seconds.
6227020800 permutations generated, took apx 1052366515200 iterations/calc on gpu!
==1276== Profiling application: ConsoleApplication1.exe
==1276== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
208.78ms  7.33532s          (95016 1 1)       (256 1 1)        27        0B        0B         -           -  GeForce GTX 980         1         7  void _gpu_perm_basic<int=65536>(int) [180]
7.54411s  2.1879ms              (1 1 1)       (256 1 1)        26        0B        0B         -           -  GeForce GTX 980         1         7  _gpu_perm_basic_last_step(__int64, int, __int64) [185]
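To put a number on the gap, here is a small sketch that works out the slowdown and the iteration rate from the figures above. The durations are the `_gpu_perm_basic` rows of the two profiler runs, and the iteration count is the approximate value the application prints; the function and constant names are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Durations of _gpu_perm_basic<65536> copied from the two profiler runs above.
constexpr double kKernelSeconds780 = 5.39775;
constexpr double kKernelSeconds980 = 7.33532;
// Approximate iteration count printed by the application for 13!.
constexpr std::uint64_t kIterations = 1052366515200ULL;

// Iterations per second for a run of the given length.
double iterations_per_second(std::uint64_t iterations, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(iterations) / seconds;
}

// How many times longer the second run took than the baseline run.
double slowdown(double baseline_seconds, double other_seconds) {
    assert(baseline_seconds > 0.0);
    return other_seconds / baseline_seconds;
}
```

`slowdown(kKernelSeconds780, kKernelSeconds980)` comes out at roughly 1.36, so the main kernel takes about 36% longer on the GTX 980 with the same grid and block size and nearly the same register count.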

When I look at the generated PTX, most of it is more or less the same apart from this section of the main kernel:

GTX 780ti CUDA 6.0 compute 3.5

BB5_3:
	st.local.u32 	[%rd5], %r36;
	sub.s64 	%rd37, %rd6, %rd35;
	setp.lt.s32	%p3, %r2, 0;
	@%p3 bra 	BB5_15;

	mov.u32 	%r28, 1;
	shl.b32 	%r38, %r28, %r36;


	mov.u32 	%r44, %r2;


BB5_5:

	mov.u32 	%r39, %r44;
	mov.u32 	%r9, %r39;
	cvt.s64.s32	%rd27, %r9;
	mul.wide.s32 	%rd28, %r9, 8;
	add.s64 	%rd30, %rd23, %rd28;
	ld.const.u64 	%rd12, [%rd30];
	mul.lo.s64 	%rd38, %rd12, %rd27;
	setp.le.s64	%p4, %rd38, %rd37;
	mov.u32 	%r42, %r9;
	mov.u32 	%r43, %r9;
	@%p4 bra 	BB5_7;

GTX 980 CUDA 6.5 compute 5.2

BB2_3:
	st.local.u32 	[%rd5], %r36;




	mov.u32 	%r28, 1;
	shl.b32 	%r38, %r28, %r36;
	sub.s64 	%rd37, %rd6, %rd35;
	setp.lt.s32	%p3, %r2, 0;
	mov.u32 	%r44, %r2;
	@%p3 bra 	BB2_14;


BB2_4:
	mov.u32 	%r39, %r44;
	mov.u32 	%r9, %r39;
	cvt.s64.s32	%rd27, %r9;
	mul.wide.s32 	%rd28, %r9, 8;
	add.s64 	%rd30, %rd23, %rd28;
	ld.const.u64 	%rd12, [%rd30];
	mul.lo.s64 	%rd38, %rd12, %rd27;
	setp.le.s64	%p4, %rd38, %rd37;
	mov.u32 	%r42, %r9;
	mov.u32 	%r43, %r9;
	@%p4 bra 	BB2_6;
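For anyone who does not read PTX, the loop head that is identical in both listings (`BB5_5` and `BB2_4`) does roughly the following when written as C++. This is a hand translation of the PTX shown above, not the kernel source; `table`, `i`, and `remaining` are placeholder names for `%rd23`, `%r9`, and `%rd37`, and only the part of the loop visible in the listings is covered.

```cpp
#include <cassert>
#include <cstdint>

// Hand translation of the loop head at BB5_5 / BB2_4.
// `table` stands in for the constant-memory array at %rd23, `i` for %r9,
// and `remaining` for %rd37; all three names are placeholders.
bool product_fits(const std::uint64_t* table, int i, std::int64_t remaining) {
    assert(i >= 0);  // the PTX skips the loop when the starting index %r2 is negative
    std::int64_t wide_i = i;         // cvt.s64.s32 %rd27, %r9
    std::uint64_t entry = table[i];  // mul.wide.s32 + add.s64 + ld.const.u64
    // mul.lo.s64 keeps the low 64 bits, so multiply unsigned and reinterpret.
    std::int64_t product =
        static_cast<std::int64_t>(entry * static_cast<std::uint64_t>(wide_i));
    return product <= remaining;     // setp.le.s64 %p4, %rd38, %rd37
}
```

The point of writing it out is that each pass through the loop costs a constant-memory load, a 64-bit integer multiply, and a 64-bit compare, and both compilers emit exactly the same sequence for it.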

It looks like the instructions in that first sub-section got reordered, but overall the PTX looks very similar. I know it is better to look at the SASS, but I am not sure how to view it.

At this point I am guessing this implementation was over-engineered for Kepler, and that I am not making the correct adjustments for the new Maxwell architecture.
