Why is there only a small improvement in performance for a code running on the K20 as compared to the M2070?

We have been running some tests with a magnetohydrodynamics (MHD) code which uses numerical stabilisation to control the solution in shocked regions of the grid.

For the code in question we have achieved good performance, and it scales well with the size of the grid.

However, when we compared the performance of the code on the K20 and the M2070, the improvement was much less than the factor of two that has been claimed for many applications.

We would like to know if the following reason might be plausible.

Basically, the K20 has 2496 cores compared with the 448 cores of the M2070, and a memory bandwidth of 208 GB/s compared with the M2070's 144 GB/s.

Is it plausible that, with its increased number of cores, the K20 places a much greater demand on memory bandwidth, and that this is not fully compensated for by the K20's higher bandwidth? Our problem is particularly demanding of bandwidth because we employ a large number of fields, plus temporary fields used for the numerical stabilisation.
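To make the access pattern concrete, here is a minimal sketch of the kind of pointwise update we have in mind (the field names and the stabilisation term are invented for illustration, not taken from our actual code). Each thread streams several arrays in and out, so there are only a few flops per byte of global memory traffic:

// Hypothetical, simplified field update: every thread reads four fields and
// writes three, so roughly 56 bytes of DRAM traffic for a handful of flops.
// Kernels like this are limited by memory bandwidth, not by core count.
__global__ void update_fields(const double *rho, const double *mom,
                              const double *energy, const double *visc,
                              double *rho_new, double *mom_new,
                              double *energy_new, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double damp   = visc[i];                      // numerical stabilisation term
    rho_new[i]    = rho[i]    - damp * rho[i];
    mom_new[i]    = mom[i]    - damp * mom[i];
    energy_new[i] = energy[i] - damp * energy[i];
}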

What other reasons might there be for our observed performance?
Many Thanks

Code sometimes needs to be (re-)tuned to show improvement on Kepler. Check out the Kepler Tuning Guide.

If your thread blocks use a lot of shared memory, it is difficult to achieve good occupancy on Kepler devices. I noticed this in a piece of crypto software that I was trying to optimize across various NVIDIA GPU generations: I only got my expected Kepler bonus once I reduced the shared memory used per block, which unfortunately required a major refactoring of the code.
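As a toy illustration (the numbers are an example, not from my actual kernel): with 16 KB of shared memory per block and 256 threads per block, only three blocks fit into the 48 KB of shared memory per SM, i.e. 24 resident warps. That is 50% of Fermi's 48-warp limit but only 37.5% of Kepler's 64, so the same shared memory budget hurts noticeably more on Kepler:

// Hypothetical kernel whose occupancy is limited by shared memory per block.
//   48 KB shared per SM / 16 KB per block = 3 blocks = 24 resident warps
//   Fermi  (48-warp limit): 24/48 = 50%   occupancy
//   Kepler (64-warp limit): 24/64 = 37.5% occupancy
// Halving the per-block tile doubles the resident warps on both architectures.
#define SMEM_PER_BLOCK (16 * 1024)

__global__ void tiled_kernel(const float *in, float *out, int n)
{
    __shared__ float tile[SMEM_PER_BLOCK / sizeof(float)];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];      // stage data through shared memory
    __syncthreads();
    if (i < n)
        out[i] = tile[threadIdx.x] * 2.0f;
}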

FLOPS increase exponentially over time while memory bandwidth increases only linearly. This is not news; it is the memory wall. If you are DRAM bandwidth bound, the best improvement you can expect is 208/144 ≈ 1.44, i.e. about 44% faster.
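A quick way to test whether you are actually up against that ceiling is to time a plain streaming copy over a large buffer and compare its effective GB/s with what your own kernels sustain. A minimal sketch (untuned, error checking omitted, just indicative):

// Streaming copy to estimate achievable DRAM bandwidth on the current device.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void stream_copy(const double *in, double *out, size_t n)
{
    // Grid-stride loop so the same launch configuration works on any device.
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x)
        out[i] = in[i];
}

int main()
{
    const size_t n = 1 << 26;                   // 64M doubles = 512 MB per array
    double *in, *out;
    cudaMalloc((void **)&in,  n * sizeof(double));
    cudaMalloc((void **)&out, n * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    stream_copy<<<1024, 256>>>(in, out, n);     // warm-up
    cudaEventRecord(start);
    stream_copy<<<1024, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gb = 2.0 * n * sizeof(double) / 1e9;  // one read + one write per element
    printf("Effective bandwidth: %.1f GB/s\n", gb / (ms / 1e3));

    cudaFree(in); cudaFree(out);
    return 0;
}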

If you are memory bandwidth bound, but have a small working set then the larger cache on K20 can provide better scaling than this estimate. For example, my code (HOOMD-blue) falls into this category and runs 83% faster on K20X vs M2070.

I think the factor-of-two improvements have been cherry-picked, and that most applications do not attain this level of speedup on the K20X vs the M2070.

I’ve measured everything from 2x improvement to a slight loss of performance in moving different programs from Fermi to Kepler (GTX 580 to 680, no access to a K20 or Titan yet!), even after retuning things like block sizes.

Seibert, do you have a 680? Could you check one little thing for me? There is a CUPTI example called callback_metric. When I run it with the following parameters:

./callback_metric 0 achieved_occupancy

I get achieved_occupancy = 1.620684, i.e. occupancy > 100%. I wonder if it is reproducible.

Yes, it is reproducible.

Here is my GTX 580:

CUDA Device Number: 0
CUDA Device Name: GeForce GTX 580
Launching kernel: blocks 196, thread/block 256
Duration = 4992ns
Pass 0
Launching kernel: blocks 196, thread/block 256
	active_cycles = 31511 (2054, 2029, 1878, 1944, 1898, 1949, 1900, 1978, 1956, 1946, 1942, 1965, 1987, 2030, 2018, 2037)
	active_cycles (normalized) (31511 * 16) / 16 = 31511
	active_warps = 1145072 (78808, 72062, 69499, 69029, 69386, 72167, 68262, 71415, 72428, 68306, 71546, 71174, 72093, 74327, 72006, 72564)
	active_warps (normalized) (1145072 * 16) / 16 = 1145072
Metric achieved_occupancy = 0.757058

And my GTX 680:

CUDA Device Number: 3
CUDA Device Name: GeForce GTX 680
Launching kernel: blocks 196, thread/block 256
Duration = 7008ns
Pass 0
Launching kernel: blocks 196, thread/block 256
	active_cycles = 38233 (4642, 4873, 4673, 4666, 4999, 4640, 4998, 4742)
	active_cycles (normalized) (38233 * 8) / 8 = 38233
	active_warps = 4049628 (497948, 515104, 500332, 503792, 514776, 499456, 514160, 504060)
	active_warps (normalized) (4049628 * 8) / 8 = 4049628
Metric achieved_occupancy = 1.654995
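Assuming the metric is simply average active warps per active cycle divided by the per-SM warp limit, the GTX 580 numbers are self-consistent but the GTX 680 numbers are not:

GTX 580: 1145072 / 31511 ≈ 36.3 active warps per cycle; 36.3 / 48 (Fermi limit) ≈ 0.757, matching the reported 0.757058.
GTX 680: 4049628 / 38233 ≈ 105.9 active warps per cycle; 105.9 / 64 (Kepler limit) ≈ 1.655, matching the reported 1.654995 - but 105.9 resident warps per cycle is impossible on an SMX that can hold at most 64, so the raw counters themselves look wrong on Kepler.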

CUPTI HAS A BUG!

Seibert, thanks for checking!

Thanks, the responses to this thread have been very helpful.
I have used the Kepler Tuning Guide; I found that its clarity could be improved by providing some examples such as those mentioned in this thread.

One of the issues I have had is with the utilisation of shared memory, so I have had to review that part of the code. Although that has not improved performance, it has improved the numerical agreement between the results of the algorithm when run on the M2070 and the K20.

Vasily: Same for a GTX Titan:

CUDA Device Number: 0
CUDA Device Name: GeForce GTX TITAN
Launching kernel: blocks 196, thread/block 256
Duration = 8672ns
Pass 0
Launching kernel: blocks 196, thread/block 256
	active_cycles = 38363 (2703, 2751, 2653, 2759, 2758, 2718, 2751, 2755, 2720, 2737, 2767, 2740, 2790, 2761)
	active_cycles (normalized) (38363 * 14) / 14 = 38363
	active_warps = 3981504 (279664, 285632, 274460, 284924, 286108, 281444, 286704, 286776, 282152, 283452, 288472, 281380, 292592, 287744)
	active_warps (normalized) (3981504 * 14) / 14 = 3981504
Metric achieved_occupancy = 1.621641

Tera: Thanks! If I were at NVIDIA, I'd submit a bug report.

Concerning the achieved_occupancy error: we've confirmed that it is reproducible in CUDA 5.0. It is fixed in the upcoming 5.5 release, and you should be able to verify this using the RC release (which will be made available to you if you are a registered developer).

When will 5.5 be released?

How about the Multiprocessor Efficiency metric - will that be fixed as well, or shall I extract a test case?

The 5.5 release is available now. Can you try it and see if your issue is resolved? Thanks.