Tesla C2070 Performance

Comparing Tesla C2070 performance to GeForce GTX

We observe a strange discrepancy between the basic benchmarks for the Tesla C2070 on the one hand and the GeForce GTX 285 on the other. Specifically, the Tesla C2070 gives worse performance than the GeForce GTX 285 for matrixMul (from the SDK).

GTX 285: 226 GFLOP/s
C2070: 183 GFLOP/s !!

The bandwidth test also gives worse results on the C2070.

Has anyone seen similar results? Any idea what could be done to improve the performance? It appears that the $250 card is way better than the $4000 card. Are we missing something?

The C2070 is in a Padova system with the following specs:

2x Nehalem 4C E5530 @ 2.4 GHz
24GB @ 1333 MHz memory
Ubuntu 10.10 64bit

The GTX card is on:
Intel Xeon @2.00GHz
8GB memory
Debian 64bit

The memory bandwidth of the 285 is actually higher than that of the 2070. See http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units

Also, for benchmarking sgemm or dgemm I would use the actual BLAS routines, not the SDK example. The SDK example isn’t nearly as highly optimized and won’t necessarily give a true picture of sgemm/dgemm performance between the two cards.

I am positive that dgemm performance on the 2070 will trounce the 285.
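
When comparing numbers from different benchmarks, it helps to compute the rate the same way for every run. Here is a minimal sketch in Python, assuming the usual convention of counting 2*m*n*k floating-point operations for a matrix multiply. The function name and the timing below are made up for illustration; they are not taken from the SDK or from any BLAS library.

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOP/s for C = A*B, with A of size m x k and B of size k x n."""
    # Each of the m*n outputs needs k multiplies and k adds: 2*m*n*k flops.
    flops = 2.0 * m * n * k
    return flops / seconds / 1e9


# Hypothetical timing: a 4096 x 4096 x 4096 SGEMM that took 0.5 s.
print(f"{gemm_gflops(4096, 4096, 4096, 0.5):.1f} GFLOP/s")
```

Plugging in the measured wall-clock time of the same sgemm or dgemm call on each card gives figures that can be compared directly.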

I had a similar experience with the C2050. After a few hours of hair pulling (I needed to explain to my boss why we bought it for over 2000 USD…), I found out that ECC is on by default and that it seemed to be reducing memory bandwidth by over 50%…

To turn it off (and it took a while to find out how to do this, since it wasn’t in any manuals we had) I used the Nvidia control panel.

hope this helps,

eldad.

PS - I have to say, I recently installed the GTX 480 and, except for the amount of RAM, I don’t find the GTX 480 performance much below the C2050.

I think the issues here boil down to a few common points of confusion:

  1. The SDK examples usually make for terrible benchmarks. They are written with an eye toward demonstrating a particular technique in isolation rather than efficiently solving a real problem.

  2. The Tesla cards are not faster overall than top-of-the-line GeForce cards. If your kernel is limited by single precision, integer, or memory bandwidth performance, you will find the Tesla to be slower than a GTX 480 or 580. If you are limited by double precision performance (and be sure it is not just memory bandwidth), then Tesla will be faster.

You should not buy a Tesla for raw computational performance (except double precision), but rather because you want the other features: better QA testing for 24/7 use, more memory, ECC, bidirectional DMA transfers over the PCI-Express bus, the Windows TCC driver that lets you bypass the overhead of WDDM, and better technical support.

  3. ECC really seems to be a memory bandwidth performance killer, and many kernels are memory bandwidth limited, not computationally limited.
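
To make the bandwidth-limited versus compute-limited distinction concrete, here is a rough roofline-style estimate in Python. The peak and bandwidth figures are approximate published single-precision numbers and should be treated as assumptions, and the 20% ECC penalty is purely hypothetical (measure it on your own card). The function and dictionary names are made up for this sketch.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Upper bound on throughput: capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Assumed, approximate (peak SP GFLOP/s, memory bandwidth in GB/s) per card.
cards = {
    "GTX 285": (708.0, 159.0),
    "C2070, ECC off": (1030.0, 144.0),
    # Hypothetical 20% bandwidth loss with ECC enabled.
    "C2070, ECC on": (1030.0, 144.0 * 0.8),
}

# SAXPY-like kernel: 2 flops per element, 12 bytes moved (two reads, one write).
intensity = 2.0 / 12.0

for name, (peak, bandwidth) in cards.items():
    bound = attainable_gflops(peak, bandwidth, intensity)
    print(f"{name}: at most {bound:.1f} GFLOP/s")
```

For a kernel with this little arithmetic per byte, the bound is set entirely by bandwidth, so the card with the higher peak FLOP rate does not come out ahead.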

Thank you all for the help! I never received an e-mail about responses, so I didn’t check the replies earlier. We also found that if you take advantage of the registers on the Fermi architecture rather than just using shared memory for matrix multiplication, you get much better performance. This is described in the following paper:

www.netlib.org/lapack/lawnspdf/lawn227.pdf
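
One way to see why keeping more of the work in registers can help is to count shared-memory loads per multiply-add. The Python sketch below is a back-of-the-envelope count under the assumption that each thread accumulates an r x r block of the output in registers; it is not the kernel from the paper, and the function name and block sizes are illustrative.

```python
def shared_loads_per_fma(r):
    """Shared-memory loads per multiply-add for an r x r register block of C."""
    loads = 2 * r   # per k step: r values from the A tile, r from the B tile
    fmas = r * r    # every loaded A value is paired with every loaded B value
    return loads / fmas


for r in (1, 2, 4, 8):
    print(f"{r} x {r} register block: {shared_loads_per_fma(r):.2f} loads per FMA")
```

Under this count, a larger register block means fewer shared-memory loads for the same amount of arithmetic.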

Disabling ECC also helped :)