Noob Alert: Tesla K20 slower than GTX 580?

I have just taken over a CUDA project here at work. I have not done CUDA programming other than a hello world, or even much in the way of C++ before, so I am expecting a big learning curve. However, even without touching a line of code, I have found that our application runs about 30% slower on a brand-new system with a Tesla K20 than on my own beat-up developer workstation with a GTX 580. Also, most of the samples I’ve tried that are included with the 5.0 toolkit run slower on the Tesla machine. Both computers are running Windows 7 64-bit.

It was expected that we might have to tweak our code to take advantage of some of the new CUDA features in 5.0 (and the Kepler architecture), but I was not expecting this kind of performance disparity. I guess my question is two-fold:

  1. How can I make sure the Tesla is not damaged, and is functioning correctly?
  2. Does anyone have a “well, duh” explanation for this behavior?

I appreciate your help!


There can be a number of reasons for that. First, check the raw performance of the cards: SMs × cores per SM × clock. Which Tesla model do you have?

There are multiple versions of K20. Can you show the output of nvidia-smi -q ?

When Kepler first came out, quite a few apps ran slower than Fermi GPUs. It took months for actual testing and development to show that in many, if not most, of those cases, the problem was simply tuning. The radically different SP per SM counts, the new ratios of compute to shared memory size/bandwidth, the different cache behavior, the different register counts… all of these were a much bigger change than the Tesla->Fermi transition was.

But experience has shown that Kepler really is just as good as Fermi for CUDA apps. Most apps which had performance drops need retuning and sometimes reorganization. It’s not that Kepler is slower, it’s just different. A Kepler-tuned kernel will run slower on Fermi as well.

It’s not ideal… we’d all love to have every code automatically retune itself for every GPU… in fact there’s been research on that. Now, after 6 months of playing with Kepler, I prefer Kepler’s enhancements, even though I had to rebalance a lot of Tesla/Fermi code to get my performance back.

One thing to keep in mind when comparing Tesla cards to consumer cards in general is that Tesla cards come with ECC. When ECC is enabled, memory bandwidth available to applications is reduced. So if your application is bandwidth-bound, you may want to temporarily turn off ECC on the Tesla card for a like-to-like comparison with the GTX 580. Specifications for the theoretical memory bandwidth of these two GPUs can be found here:

GTX 580:
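The theoretical bandwidth figures can also be worked out from bus width and memory clock. Here is a rough sketch: the bus widths and effective memory clocks below are the published specs for each card, and the ECC overhead is an assumed ~15% (the real penalty varies by access pattern):

```python
# Back-of-the-envelope theoretical memory bandwidth (GB/s).
# bus width (bits) x effective data rate (MHz) / 8 -> GB/s

def peak_bandwidth_gbs(bus_width_bits, effective_mem_clock_mhz):
    return bus_width_bits * effective_mem_clock_mhz / 8 / 1000

gtx580 = peak_bandwidth_gbs(384, 4008)  # 384-bit bus, 4008 MHz effective GDDR5
k20    = peak_bandwidth_gbs(320, 5200)  # 320-bit bus, 2600 MHz x2 (per nvidia-smi)

print(f"GTX 580: {gtx580:.1f} GB/s")    # ~192.4 GB/s
print(f"K20:     {k20:.1f} GB/s")       # ~208.0 GB/s
print(f"K20, ECC on (~15% assumed overhead): {k20 * 0.85:.1f} GB/s")
```

With ECC on, the K20's effective bandwidth can actually land below the GTX 580's, which is why the like-to-like comparison matters for bandwidth-bound code.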

As SPWorley points out, there are various architectural differences (many driven by requirements for increased energy efficiency) between the Fermi and Kepler families. Some retuning of existing code may be necessary. This is particularly likely for code that has been very tightly tuned to the Fermi architecture. One possible scenario is that the existing code simply does not expose sufficient parallelism to fully utilize the much “wider” Kepler architecture, which provides many more functional units (running at lower clock speeds) than previous GPU architectures.

Thanks everyone for the thoughtful replies. I have some of the requested information below -

Is ‘number’ the number of cores per SM? I guess this would be (figures from deviceQuery):

Tesla K20: 13 x 192 x  706 = 1,762,176
GTX 580:   16 x  32 x 1544 =   790,528
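Yes, that product is cores × clock. To turn it into a peak single-precision GFLOP/s figure, multiply by 2 for FMA (one fused multiply-add per core per cycle). A quick sanity-check sketch, using the deviceQuery numbers above (note the GTX 580 figure is its 1544 MHz shader clock, not the graphics clock):

```python
# Rough single-precision peak: SMs x cores/SM x clock(MHz) x 2 (FMA) -> GFLOP/s

def peak_sp_gflops(sms, cores_per_sm, clock_mhz):
    return sms * cores_per_sm * clock_mhz * 2 / 1000

k20    = peak_sp_gflops(13, 192, 706)   # ~3524 GFLOP/s
gtx580 = peak_sp_gflops(16, 32, 1544)   # ~1581 GFLOP/s
print(f"K20 / GTX 580 raw SP ratio: {k20 / gtx580:.2f}x")  # ~2.23x
```

So the raw SP ceiling is a bit over 2x, not 10x — and only for code that actually saturates the wider SMX.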

Yikes, can we expect no significant performance gains? I sense an uncomfortable conversation with my boss in the near future, who expected a 10-fold speed increase with the new card and assigned me to make it happen…

I believe it is a K20c. The results of nvidia-smi -q are below:

Timestamp                       : Thu Jan 10 09:08:37 2013
Driver Version                  : 307.45

Attached GPUs                   : 2
GPU 0000:02:00.0
    Product Name                : Tesla K20c
    Display Mode                : Disabled
    Persistence Mode            : N/A
    Driver Model
        Current                 : TCC
        Pending                 : TCC
    Serial Number               : 0324712003072
    GPU UUID                    : GPU-e82f0cdb-0e58-1387-ee59-6cc969b6610b
    VBIOS Version               :
    Inforom Version
        Image Version           : 2081.0204.00.07
        OEM Object              : 1.1
        ECC Object              : 3.0
        Power Management Object : N/A
    GPU Operation Mode
        Current                 : N/A
        Pending                 : N/A
    PCI
        Bus                     : 0x02
        Device                  : 0x00
        Domain                  : 0x0000
        Device Id               : 0x102210DE
        Bus Id                  : 0000:02:00.0
        Sub System Id           : 0x098210DE
        GPU Link Info
            PCIe Generation
                Max             : 2
                Current         : 1
            Link Width
                Max             : 16x
                Current         : 16x
    Fan Speed                   : 30 %
    Performance State           : P8
    Clocks Throttle Reasons
        Idle                    : Active
        User Defined Clocks     : Not Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    Memory Usage
        Total                   : 4799 MB
        Used                    : 13 MB
        Free                    : 4786 MB
    Compute Mode                : Default
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
        Aggregate
            Single Bit
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
    Temperature
        Gpu                     : 38 C
    Power Readings
        Power Management        : Supported
        Power Draw              : 16.61 W
        Power Limit             : 225.00 W
        Default Power Limit     : 225.00 W
        Min Power Limit         : 150.00 W
        Max Power Limit         : 225.00 W
    Clocks
        Graphics                : 324 MHz
        SM                      : 324 MHz
        Memory                  : 324 MHz
    Applications Clocks
        Graphics                : 705 MHz
        Memory                  : 2600 MHz
    Max Clocks
        Graphics                : 758 MHz
        SM                      : 758 MHz
        Memory                  : 2600 MHz
    Compute Processes           : None

Thanks folks,


The K20 is about 10 times faster in double-precision performance per watt.

Nothing in the nvidia-smi output indicates a problem to me. The output clearly identifies this K20 as a K20c, i.e. an actively cooled workstation-class part, and otherwise looks very much like the output from my K20c at work (except that you seem to be running under Windows). I don’t know what performance gains you expected compared to your previous Tesla solution (a C2050 or C2075?). In general you will observe larger performance improvements for compute-bound tasks than for tasks bound by memory throughput.

Thanks for taking a look at it. This is our first Tesla - our application was developed on and currently runs on Fermi-based cards - but I did expect it to be faster on the Tesla. I disabled ECC per your earlier post and saw no significant gains, so I am guessing the processing is not bandwidth-bound.

It sounds like restructuring the app is the thing to do now. I appreciate everyone’s time and comments.


You are probably aware of it at this point, but I figured I should point it out just in case: A change to the ECC settings requires a reboot to take effect.

It seems like the best thing to do now is to find out where the bottlenecks are in the application with the help of the profiler, then drill down on those.

If your application was:

  • using double precision
  • compute bound
  • running on a consumer Fermi

you could, in the best-case scenario, expect an ~8x speedup if it scales perfectly to the new architecture.

Even if you fulfill all three requirements, your application could still go from being compute bound to bandwidth bound, given that the K20 offers only a marginal increase in memory bandwidth over, for example, the GTX 580, while radically increasing SP and DP throughput.
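The double-precision gap can be sanity-checked with a back-of-the-envelope calculation, under the common assumption that consumer (GeForce) Fermi cards cap DP at 1/8 of SP rate, while a K20 SMX has 64 dedicated DP units. These are spec-sheet figures, not benchmarks:

```python
# Best-case DP gap: GTX 580 (GeForce Fermi, DP capped at 1/8 of SP)
# vs K20 (13 SMX x 64 DP units x 706 MHz x 2 for FMA).

gtx580_sp_gflops = 16 * 32 * 1544 * 2 / 1000   # ~1581 GFLOP/s
gtx580_dp_gflops = gtx580_sp_gflops / 8        # GeForce DP cap -> ~198 GFLOP/s

k20_dp_gflops = 13 * 64 * 706 * 2 / 1000       # ~1175 GFLOP/s

print(f"Peak DP ratio: {k20_dp_gflops / gtx580_dp_gflops:.1f}x")  # ~5.9x
```

So even the pure-DP, compute-bound best case tops out in the 6–8x range depending on which numbers you plug in, and any bandwidth limitation eats into that quickly.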

My 2 cents …

I have access to a server with two Tesla K20X cards installed (the most powerful Kepler solution NVIDIA provides) and am taking it for a test drive right now.

My conclusion is very frustrating: even the Tesla K20X is approximately 10-15% SLOWER than the GTX 580 for tasks that use single-precision math and employ lots of random-access reads from global memory. When it comes to cache efficiency, Kepler looks very weak.

I have also run a number of tests with the GTX 680; that card is 2.5 times slower on my tasks.

You need to re-tweak your program, though don’t expect a big speedup compared with the GTX 580. Those chips are nearly the same size, AFAIK; Kepler runs at a lower frequency, and is therefore lower-power and easier to produce.

I’ll jump in too…

A Kepler SMX has quite a different “shape” than a Fermi SM. A block can have twice as many 63-register warps but only half as much shared memory per warp. That right there has a huge impact on preexisting Fermi kernels.

In a decent-sized project I finished last fall, each Kepler thread block was achieving 2.2x the throughput on 2x the workload of a maxed out Fermi block. The advantage decreases on larger problems as the device becomes memory bound.

At the least, any highly tuned Kepler kernel should be focused on fully utilizing the SMX since each one is quite a chunk of silicon by itself.

I would classify my Kepler kernel design style as Volkovian with an extreme focus on abusing SHFL and minimizing shared and device memory transactions. Baroque Neo-Volkovian? :)

Summary: Kepler is a monster if you design your kernels to fully utilize what it offers.

Unfortunately, “Volkovian” style can’t be used for all kinds of problems … It can’t for mine. Vasily Volkov was very kind to check one of my kernels out to see whether I’m missing something serious that impacts the performance - his verdict is simple: my sort of calculations is just not too Kepler-friendly.

I’m guessing that some of the suffering kernels are shared memory intensive?

I’ve identified this as a significant bottleneck, and it makes sense when looking at the number of load/store units per FPU.

My kernels are both shared memory intensive and “global memory random access” intensive. Both aspects are crucial on Kepler comparing with Fermi.

I have not observed any regression in global memory atomics. And NV touted a 9x improvement for atomics in this generation.

Atomics on shared memory, however, suffer from the limited number of load/store units.

The first part of your uncomfortable conversation with your boss should be about setting reasonable expectations for hardware improvements. No generation of GPU (or CPU for that matter) has ever been 10x faster than the previous generation. You’re lucky if you see that kind of improvement after 3 or 4 generations. :)

Now, that said, Kepler’s changes relative to Fermi go in both directions. Comparing a GTX 680 (which is, ignoring double precision, about 13% slower than a K20 assuming naive clock*CUDA cores scaling) to a GTX 580:

  • Atomic operations are much faster
  • Single precision floating point throughput is 2x greater
  • Hardware special function throughput is 2.5x greater
  • Memory bandwidth is about the same
  • Integer operations are a little slower
  • Shared memory, registers, and L1/L2 cache per thread are going to be lower due to the need for much larger blocks to maximize throughput on Kepler

When I first ran my programs on the GTX 680, they ran a lot slower than the GTX 580 because my block sizes were way too small for the GTX 680. Once I fixed that, I found that my programs ran anywhere from 15% slower to 2x faster on the GTX 680, depending on the mix of operations I was using.
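The block-size effect is easy to see on paper. A rough sketch, using the per-SM limits for compute capability 2.0 (GTX 580) vs 3.5 (K20), and a hypothetical 96-thread block size tuned for Fermi:

```python
# Why Fermi-sized blocks can starve a Kepler SMX: resident threads per SM
# are capped by both max blocks and max threads, but the SMX has 6x the cores.

fermi  = {"cores": 32,  "max_threads": 1536, "max_blocks": 8}   # CC 2.0 SM
kepler = {"cores": 192, "max_threads": 2048, "max_blocks": 16}  # CC 3.5 SMX

block_size = 96  # hypothetical block size that worked well on Fermi

for name, sm in (("Fermi SM", fermi), ("Kepler SMX", kepler)):
    resident = min(sm["max_blocks"] * block_size, sm["max_threads"])
    print(f"{name}: {resident} resident threads, "
          f"{resident / sm['cores']:.0f} threads per core")
```

With that block size, a Fermi SM carries 24 resident threads per core while the SMX carries only 8, leaving it far less latency-hiding per core; hence the need for much larger blocks (or more blocks per SMX) on Kepler.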