Any advice on adjusting code for Maxwell when coming from Kepler

I am evaluating Maxwell for a number of projects using code originally developed on Kepler, and am wondering, given the different architecture, how to determine which code modifications to make to optimize for it.

Going from the GTX 780ti to the GTX 980, the majority of my cherished benchmarking code is running faster on the GTX 780ti, and not all of that is memory-bound code.

For example, how would one handle going from 15 SMs with 2880 cores to 16 SMs with 2048 cores?

The CUDA-Z numbers are much better with the GTX 980 (except for DP), but even the CUDA 6.5 matrixMulCUBLAS sample runs significantly slower than on the GTX 780ti.

So other than digging into the assembly (working on it, but I still need to learn more), how can I determine how to adjust the launch configuration etc. for Maxwell by profiling?

Other than memory-bound codes, what are some other codes that run slower on current Maxwell GPUs compared to Kepler? What properties do these codes have?

The Maxwell ISA frequently causes static and dynamic instruction counts to be higher compared to Kepler. I have seen code size differences of up to a factor of two in the worst case. From what I can see looking at the generated SASS, this is due both to increased emulation (for example, 32-bit integer multiplies are emulated now), as well as other ISA changes (for example, double-precision instructions apparently can’t take constant memory references any more, or at least the compiler does not seem to be generating them for Maxwell). It is conceivable that larger code takes some (presumably minor) hit from increased ICache misses. Whether there could be issues with instruction decode limits, I can’t say due to lack of hands-on exposure to Maxwell.

Have you had a chance to check out the Maxwell Tuning Guide:

http://docs.nvidia.com/cuda/maxwell-tuning-guide

Well my fabulous permutation code which goes through all permutations of an array in local memory (using factorial decomposition) is much slower on the GTX 980 than the GTX 780ti.

Using CUDA 6.0 on the GTX 780ti, generating all 13! permutations of an array takes exactly 6.3 seconds, while on the GTX 980 using CUDA 6.5 with code generation for compute 5.2 it takes 7.7 seconds.

This is the exact code (Windows sorry, for linux there has to be a casting adjustment for the long long type…):

http://pastebin.com/jVGGM46x
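For readers who do not follow the link, here is a minimal host-side C++ sketch of what "factorial decomposition" means: an index k in [0, n!) is split into factorial-base digits, and each digit picks the next element of the permutation. This is only an illustration of the general technique, not the kernel from the pastebin; the names `factorial` and `nth_permutation` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: n! as a 64-bit value (valid for n <= 20).
std::uint64_t factorial(int n) {
    assert(n >= 0 && n <= 20);
    std::uint64_t f = 1;
    for (int i = 2; i <= n; ++i) f *= static_cast<std::uint64_t>(i);
    return f;
}

// Hypothetical helper: the k-th permutation (0-based, lexicographic order)
// of the values 0..n-1, found by decomposing k into factorial-base digits.
std::vector<int> nth_permutation(int n, std::uint64_t k) {
    assert(k < factorial(n));
    std::vector<int> pool(n);
    for (int i = 0; i < n; ++i) pool[i] = i;

    std::vector<int> perm;
    perm.reserve(n);
    for (int d = n - 1; d >= 0; --d) {
        const std::uint64_t f = factorial(d);
        const std::uint64_t digit = k / f;   // factorial-base digit for this position
        k %= f;                              // remainder feeds the next position
        const std::ptrdiff_t idx = static_cast<std::ptrdiff_t>(digit);
        perm.push_back(pool[idx]);
        pool.erase(pool.begin() + idx);      // element is used up
    }
    return perm;
}
```

Note that 13! = 6227020800 does not fit in 32 bits, which is why 64-bit integer arithmetic shows up in this kind of code.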

What is even more odd is that a bunch of my work-related projects, which I assumed were more memory bound, are actually running 10-30% faster on the GTX 980, so it is an interesting situation.

In the end I will play around with fine tuning to see what happens, but in general just want to understand how to adjust existing code for the new arch.

Thanks, that helps

I do not want to jump to conclusions (especially since I don’t have a Maxwell-based GPU to run on), but your permutation code includes what looks like 64-bit integer multiplies fairly deep inside a loop nest. An analysis may start by simply checking on the number of instructions in the “for (d=digits-2 …)” loop. Comparing various statistics in the profiler should also prove insightful.

That makes sense, and will do as you recommend.

Thanks

Haven’t had a chance to look at your code yet, but this statement sounds suspicious:

“permutation code which goes through all permutations of an array in local memory”

With Maxwell the writable L1 is gone; writes, including register spills, now go to L2. If all you’re doing is reading, then make sure LDG.CI is what is being used so that the texture cache is getting some use.

But yah, I’d focus on your inner loops too, especially if you’re using 64 bit integer multiply in them.

Thanks to the advice of both njuffa and scottgray it appears I got the running time for 13! down to 4.8 seconds on the GTX 980.

I had to adjust (increase) the work done per thread block, and cache some of the intermediate values in shared memory to reduce the frequency of 64-bit integer multiplication.

One thing I have noticed about Maxwell is that any application I have which uses a moderate amount of shared memory tends to have an ‘out-of-the-box’ performance improvement when compared to Kepler.

Maxwell has more shared memory than Kepler (per SM). Individual threadblocks are still limited to 48KB max, but big Maxwell has 96KB per SM, and little Maxwell has 64KB per SM (vs. Kepler at max 48KB per SM), so there are definite opportunities for occupancy improvement when shared memory usage is an occupancy limiter (and by extension, for performance improvement when occupancy is a performance limiter).

http://docs.nvidia.com/cuda/maxwell-tuning-guide/#shared-memory-capacity
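To make the occupancy argument concrete, here is a rough host-side C++ sketch that counts how many blocks fit on one SM when shared memory is the only limiter considered. The per-SM capacities are the ones quoted above; the helper name `blocks_per_sm_by_smem` and the 16 KB per block figure are made up for illustration, and the resident-block caps passed in (16 for Kepler, 32 for Maxwell) are my understanding of the hardware limits, so treat them as assumptions. Registers, thread counts and allocation granularity are ignored.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: resident blocks per SM if shared memory were the only
// limiter. All sizes are in KB; max_blocks_per_sm is the hardware block cap.
int blocks_per_sm_by_smem(int smem_per_sm_kb, int smem_per_block_kb,
                          int max_blocks_per_sm) {
    assert(smem_per_sm_kb > 0 && smem_per_block_kb >= 0 && max_blocks_per_sm > 0);
    if (smem_per_block_kb == 0) return max_blocks_per_sm;  // no smem, no smem limit
    return std::min(max_blocks_per_sm, smem_per_sm_kb / smem_per_block_kb);
}
```

For a kernel using 16 KB of shared memory per block, this gives 3 blocks per SM with 48 KB (Kepler), 4 with 64 KB (little Maxwell) and 6 with 96 KB (big Maxwell), which is the kind of headroom that can show up as an out-of-the-box speedup.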

@CudaaduC: Nice to hear about the out-of-the-box benefits for kernels with shared memory when going from Kepler to Maxwell. Do you have some kernels with texture memory, and if so, did you notice the same effect for texture memory? We are leaning heavily on texture reads in combination with ‘register blocking’ in register arrays in our kernels (mostly 2-D convolutions with small kernels).

I never explicitly use the texture cache, rather I use __ldg() instead. In general the GTX 980 underperforms the GTX 780ti in memory operations. For 32-bit float computation the GTX 980 outperforms the GTX 780ti by a considerable margin.

Actually the application I was referring to was a convolution where the main kernel loads and manipulates a large cache of complex values in shared memory. With Maxwell those kernels operate more efficiently, presumably because of the architecture changes regarding shared memory and its relation to occupancy.

So to answer your question try both approaches, but I have found that in general it usually is better to choose shared over other options.

The GTX 780 Ti and the GTX 980 use identical memory technology, but since the width of the memory interface on the GTX 980 is only 2/3 that of the GTX 780 Ti memory interface, memory throughput is likewise only 2/3. Abstracting from this particular case: Generally speaking, over the past ten years the memory bandwidth of GPUs has increased more slowly than the FLOPS, so for use cases where there is a choice, leaning towards increased computation and reduced memory accesses is usually a good way of future-proofing one’s code.

http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-780-ti/specifications
Memory Interface Width 384-bit
Memory Bandwidth (GB/sec) 336

http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-980/specifications
Memory Interface Width 256-bit
Memory Bandwidth (GB/sec) 224
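The two bandwidth figures above follow directly from the interface width. A small C++ sketch of the arithmetic, with a made-up helper name `peak_bandwidth_gbs`; the 7 Gbit/s per-pin data rate is not stated in the spec excerpts above but is inferred from them (336 / (384 / 8) = 7), so treat it as an assumption:

```cpp
#include <cassert>

// Hypothetical helper: theoretical peak memory bandwidth in GB/s.
// bus_width_bits / 8 gives bytes per transfer; data_rate_gbps is the
// per-pin data rate in Gbit/s.
double peak_bandwidth_gbs(int bus_width_bits, double data_rate_gbps) {
    assert(bus_width_bits > 0 && bus_width_bits % 8 == 0);
    return bus_width_bits / 8.0 * data_rate_gbps;
}
```

With the same per-pin rate, `peak_bandwidth_gbs(384, 7.0)` gives 336 and `peak_bandwidth_gbs(256, 7.0)` gives 224, i.e. the 2/3 ratio mentioned above.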

Regarding shared memory

Not only has the size increased, but the LD/ST unit pressure has also decreased a bit: the ratio of LD/ST units to cores per SM goes from 32/192 = 1/6 to 32/128 = 1/4, a 50% improvement.

In fact looking at the aggregate bandwidth:

Now given that the GM204 has 16 * 32 LD/ST units, that would give us 16 * 32 * 1.126 GHz ~ 576.5 GWORD/s * 4 bytes/word = 2306 GB/s of bandwidth (right?).

Meanwhile the GTX 780 Ti has 15 SMs * 32 LD/ST * 0.875 GHz ~ 420 GWORD/s * 4 bytes/word = 1680 GB/s

So if I haven’t fat fingered my calculator there should be 2306/1680 ~ 1.37x more shared memory bandwidth.

Overall, Maxwell seems like a much more balanced architecture than Kepler was, from both a bandwidth and a memory resource point of view; really an appreciated step back towards Fermi, IMO.

I had a dinner discussion with a couple of guys from the arch. group at Nvidia and they too felt that they’d built a much more balanced architecture.

Side note:

I’ve noticed some performance drops in a few kernels where I had excessive register spilling. For Kepler/Fermi that was fine, since all of it landed in the L1 cache; now it makes more sense to use slightly fewer registers and instead cache a bit more in SMEM, if possible.

Yep, I reduced those 64-bit multiplications by a large constant by doing 64-bit integer subtractions rather than conditional multiplications. Changing the work done per thread also made a difference, as did setting the maximum register count compile option.
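As a rough idea of what trading multiplications for subtractions can look like when extracting one factorial-base digit, here is a host-side C++ sketch. It is an assumption about the general shape of such a change, not the actual modification made to the kernel; `DigitResult`, `digit_by_multiply` and `digit_by_subtract` are made-up names.

```cpp
#include <cassert>
#include <cstdint>

struct DigitResult {
    int digit;           // how many times f fits into the input
    std::uint64_t rest;  // what is left over
};

// Multiply-and-compare variant: try candidates q = max_digit, ..., 0 and
// keep the first one with q * f <= rem. One 64-bit multiply per candidate.
DigitResult digit_by_multiply(std::uint64_t rem, std::uint64_t f, int max_digit) {
    assert(f > 0 && max_digit >= 0);
    int q = max_digit;
    while (q > 0 && static_cast<std::uint64_t>(q) * f > rem) --q;
    return {q, rem - static_cast<std::uint64_t>(q) * f};
}

// Subtraction-only variant: peel off f until less than f remains.
// Uses 64-bit compares and subtracts, no multiplies.
DigitResult digit_by_subtract(std::uint64_t rem, std::uint64_t f) {
    assert(f > 0);
    int q = 0;
    while (rem >= f) {
        rem -= f;
        ++q;
    }
    return {q, rem};
}
```

Both variants return the same digit as long as the true digit does not exceed `max_digit`. Whether the subtraction loop is actually faster depends on how large the digits get, so it is worth timing both.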

Oddly the GTX 780ti is still beating the GTX 980 in running time for this problem, and I am wondering if the 64 bit integer operations on the GTX 980 are somehow slower.

Unfortunately this is not true: each Kepler SM has 32 shared memory banks that are 8 bytes wide, and can deliver 8 bytes per bank per clock.
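To put numbers on this, here is a small C++ sketch that redoes the aggregate shared memory bandwidth estimate with the bank width as a parameter. The SM counts and clocks are the ones used in the estimate earlier in the thread; the helper name `smem_bandwidth_gbs` is made up, and the 4-byte bank width used for the GTX 980 is simply carried over from that estimate.

```cpp
#include <cassert>

// Hypothetical helper: aggregate peak shared memory bandwidth in GB/s,
// i.e. SMs * banks per SM * bytes per bank per clock * clock in GHz.
double smem_bandwidth_gbs(int sms, int banks_per_sm, int bytes_per_bank,
                          double clock_ghz) {
    assert(sms > 0 && banks_per_sm > 0 && bytes_per_bank > 0 && clock_ghz > 0.0);
    return sms * banks_per_sm * bytes_per_bank * clock_ghz;
}
```

With 8-byte banks, the GTX 780 Ti figure becomes `smem_bandwidth_gbs(15, 32, 8, 0.875)` = 3360 GB/s rather than 1680 GB/s, against roughly 2306 GB/s for `smem_bandwidth_gbs(16, 32, 4, 1.126)`. So on paper Kepler only loses when a kernel accesses 4 bytes per bank per clock.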

Re: Oddly the GTX 780ti is still beating the GTX 980 in running time for this problem, and I am wondering if the 64 bit integer operations on the GTX 980 are somehow slower.

64-bit integer arithmetic has always been emulated using 32-bit operations. Other than the changes prompted by the removal of native 32-bit integer multiplies, which likely had a negative impact on the performance of 64-bit integer multiplies and divisions, I do not know of any changes in Maxwell that would negatively affect the performance of 64-bit integer operations. I am afraid you will have to dive into the details of your code to pinpoint the issue.

Note that I have not measured the relative speed of 64-bit integer multiplications and divisions on Maxwell, the working hypothesis of some degree of slowdown is based on instruction count of the emulation code. The “proof of the pudding is in the eating”, so I would recommend measuring execution times. I cannot do that because I do not have a Maxwell GPU at my disposal.
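For illustration, this is roughly what emulating a 64-bit multiply with 32-bit operations involves, written as host-side C++. It is a sketch of the arithmetic only, with a made-up function name `mul64_via_32`, and does not claim to match the instruction sequence the compiler actually generates for either architecture.

```cpp
#include <cstdint>

// Hypothetical helper: low 64 bits of a * b built from 32-bit halves.
// Needs one 32x32->64 multiply and two 32x32->32 multiplies, plus adds.
std::uint64_t mul64_via_32(std::uint64_t a, std::uint64_t b) {
    const std::uint32_t a_lo = static_cast<std::uint32_t>(a);
    const std::uint32_t a_hi = static_cast<std::uint32_t>(a >> 32);
    const std::uint32_t b_lo = static_cast<std::uint32_t>(b);
    const std::uint32_t b_hi = static_cast<std::uint32_t>(b >> 32);

    // Full 64-bit product of the low halves.
    const std::uint64_t lo_lo = static_cast<std::uint64_t>(a_lo) * b_lo;
    // Cross terms: only their low 32 bits survive the shift by 32.
    const std::uint32_t cross = a_lo * b_hi + a_hi * b_lo;
    // a_hi * b_hi lands entirely above bit 63 and is dropped.
    return lo_lo + (static_cast<std::uint64_t>(cross) << 32);
}
```

If each of those 32-bit multiplies is itself emulated, the instruction count of a single 64-bit multiply grows accordingly, which is the basis for the working hypothesis above.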

Ah indeed! I forgot that Kepler had the extra wide 8-byte banks!

I wonder how many people managed to use these banks while working in SP, though? It feels like DP-related functionality.

If someone needs to operate on complex numbers (float2), then those 8-byte shared memory banks come into play as well.

Ok, so I am still trying to wrap my head around adjusting code for Maxwell. As an example, my updated permutation code still runs far faster on the GTX 780ti, and I do not understand why:

GTX 780ti


GPU timing for 13!: 5.489 seconds.
6227020800 permutations generated, took apx 1052366515200 iterations/calc on gpu!
==4332== Profiling application: ConsoleApplication1.exe
==4332== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
193.35ms  5.39775s          (95016 1 1)       (256 1 1)        26        0B        0B         -           -  GeForce GTX 780         1         7  void _gpu_perm_basic<int=65536>(int) [181]
5.59110s  3.1851ms              (1 1 1)       (256 1 1)        25        0B        0B         -           -  GeForce GTX 780         1         7  _gpu_perm_basic_last_step(__int64, int, __int64) [186]
GTX 980

GPU timing for 13!: 7.439 seconds.
6227020800 permutations generated, took apx 1052366515200 iterations/calc on gpu!
==1276== Profiling application: ConsoleApplication1.exe
==1276== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
208.78ms  7.33532s          (95016 1 1)       (256 1 1)        27        0B        0B         -           -  GeForce GTX 980         1         7  void _gpu_perm_basic<int=65536>(int) [180]
7.54411s  2.1879ms              (1 1 1)       (256 1 1)        26        0B        0B         -           -  GeForce GTX 980         1         7  _gpu_perm_basic_last_step(__int64, int, __int64) [185]

When I look at the generated PTX, most of it is more or less the same other than this section of the main kernel;

GTX 780ti CUDA 6.0 compute 3.5

BB5_3:
	st.local.u32 	[%rd5], %r36;
	sub.s64 	%rd37, %rd6, %rd35;
	setp.lt.s32	%p3, %r2, 0;
	@%p3 bra 	BB5_15;

	mov.u32 	%r28, 1;
	shl.b32 	%r38, %r28, %r36;


	mov.u32 	%r44, %r2;


BB5_5:

	mov.u32 	%r39, %r44;
	mov.u32 	%r9, %r39;
	cvt.s64.s32	%rd27, %r9;
	mul.wide.s32 	%rd28, %r9, 8;
	add.s64 	%rd30, %rd23, %rd28;
	ld.const.u64 	%rd12, [%rd30];
	mul.lo.s64 	%rd38, %rd12, %rd27;
	setp.le.s64	%p4, %rd38, %rd37;
	mov.u32 	%r42, %r9;
	mov.u32 	%r43, %r9;
	@%p4 bra 	BB5_7;

GTX 980 CUDA 6.5 compute 5.2

BB2_3:
	st.local.u32 	[%rd5], %r36;




	mov.u32 	%r28, 1;
	shl.b32 	%r38, %r28, %r36;
	sub.s64 	%rd37, %rd6, %rd35;
	setp.lt.s32	%p3, %r2, 0;
	mov.u32 	%r44, %r2;
	@%p3 bra 	BB2_14;


BB2_4:
	mov.u32 	%r39, %r44;
	mov.u32 	%r9, %r39;
	cvt.s64.s32	%rd27, %r9;
	mul.wide.s32 	%rd28, %r9, 8;
	add.s64 	%rd30, %rd23, %rd28;
	ld.const.u64 	%rd12, [%rd30];
	mul.lo.s64 	%rd38, %rd12, %rd27;
	setp.le.s64	%p4, %rd38, %rd37;
	mov.u32 	%r42, %r9;
	mov.u32 	%r43, %r9;
	@%p4 bra 	BB2_6;

It looks like things got re-ordered in that first sub-section, but overall the PTX looks very similar. I know that it is better to look at the SASS, but I am not sure how to view it.

At this point I am guessing this implementation was over-engineered for Kepler, and I am not making the correct adjustments for the new Maxwell arch.