Tesla K20 vs GTX 680 benchmarks...!!!

Just got our Supermicro workstation with a new K20 GPU. We tried a few benchmarks of our own code and of the NVIDIA SDK samples, and the K20 was always 10-40% slower than the GTX 680 depending on the workload… Is that possible, or have we done something wrong setting it up?

The following recent thread has some discussion you may find useful:

https://devtalk.nvidia.com/default/topic/527247/cuda-programming-and-performance/noob-alert-tesla-k20-slower-than-gtx-580-/

To check your setup: Make sure the power supply and cooling are adequate. Are both power connectors plugged in? Check the clock frequencies with nvidia-smi while a compute load is executing. Check the bandwidth of PCIe transfers and of GPU memory (the SDK's bandwidthTest sample measures both).

Are you compiling with -arch=sm_35 when building for the K20?

If your application is limited by memory throughput, compare the memory bandwidth of the GTX 680 (no ECC) with that of the K20 (ECC enabled by default, which reduces app-usable bandwidth somewhat), and see whether the difference in memory bandwidth matches the difference in app-level performance.

I read that discussion carefully. I replied in this thread because this is a GK104 vs GK110 (Kepler vs Kepler) case.

  • Checked the cable connections; they were fine.
  • Disabled ECC and restarted (no major change; the code is compute-bound).
  • Yes, compiling with -arch=sm_35 for the K20.

Benchmarks, K20 vs GTX 680:

  • Our code - compute-bound, with O(3CN) global memory atomic writes on N elements -> 1-40% slower
    (performance improves proportionally with the number of elements)
  • SDK transpose sample: K20 10-90% faster
  • SDK simpleAtomicIntrinsics: K20 10x slower !!! (why?)
  • Other SDK samples tried: -10% to +30% difference

How do I check the clock frequencies with nvidia-smi while executing a compute load?

At the end of the nvidia-smi output, there should be three sections for clocks. As I recall (I am not sitting in front of a CUDA-capable machine right now), the first of those sections shows current clocks. When the GPU is idling, these clocks are very low, as the GPU will be in one of the power-saving modes. While a CUDA computation is running, the clocks should reflect the full-performance state; on a K20c they should then show 705 MHz for the core and 2600 MHz for the memory. I don't know what the clocks would be for other variants of the K20. Here is the tail of the output from nvidia-smi -q while running a CUDA app on my K20c:

Clocks
        Graphics                : 705 MHz
        SM                      : 705 MHz
        Memory                  : 2600 MHz
    Applications Clocks
        Graphics                : 705 MHz
        Memory                  : 2600 MHz
    Max Clocks
        Graphics                : 758 MHz
        SM                      : 758 MHz
        Memory                  : 2600 MHz
    Compute Processes
        Process ID              : xxxxx
            Name                : besseli0
            Used GPU Memory     : 2180 MB

The SDK examples are not constructed as benchmarks suitable for comparing performance across GPUs. They are usually written in a simplified manner to keep the focus on whatever feature they are designed to demonstrate. So the fact that simpleAtomicIntrinsics seems to run much slower on the GK110 may or may not be related to the slowdown you are observing with your application. It certainly looks intriguing, but it may be a red herring.

I am not aware of any slowdowns in global memory atomics between GK104 and GK110, and if the code is bound by the throughput of computation, the performance of atomics should be secondary. To avoid any misunderstanding, I classify global-memory atomics as memory operations, not computational ones, as the memory part of an atomic operation is the slow part.

Off the top of my head I do not have any ideas why a computationally bound code would be slower on a GK110 compared to GK104, provided there is enough parallelism to fill the larger GK110 well. I have never had a chance to compare a GK104 and a GK110 side-by-side. Is this computation mostly integer, single-precision, or double-precision?

It is mostly single precision with a small percentage of integer math (including bit masks and bit shifts). According to what I have read and seen in videos and demonstrations of the GK110, there should be no reason for it to be slower. Therefore I suspect setup-related issues…

I just checked the nvidia-smi -q output and it is similar to yours, regardless of whether a CUDA app is running or not… isn't that odd too?

P.S.
I can send you my code to try on a K20 if that helps…

The peak single-precision throughput of the K20c is about 14% higher than that of the GTX 680:

GTX 680: 1536 cores @ 1006 MHz => 3.090 TFLOPS
K20c: 2496 cores @ 705 MHz => 3.519 TFLOPS

With ECC off (which is what you tried), the theoretical memory bandwidth of the K20c is about 8% higher than that of the GTX 680:

GTX 680: 192 GB/sec
K20c: 208 GB/sec

Given that, I have no explanation for the slowdown you observe at the app level. I assume that you are performing a tightly controlled experiment, where you simply swap the GPUs in the same system, so all other system components (hardware and software) are exactly the same. I also assume you are compiling the app with the appropriate architecture switches, i.e. -arch=sm_30 for the GTX 680 and -arch=sm_35 for the K20c.

You stated that the app slows down by between 1% and 40%. What are the differences between the best-case and worst-case scenarios? That may provide some hints as to what the performance-limiting factor is. I would suggest profiling the app on both platforms to see whether anything jumps out. The latest version of the profiler offers quite a bit of in-depth analysis.

I checked with colleagues knowledgeable about global memory atomics, and the entire Kepler family shares the same implementation, so I would not expect massive performance differences between the GTX 680 and the K20c in this regard. The 10x difference observed may be an artifact of the way the SDK example works; as I said, these example apps are not designed as benchmarks.