We observe a strange discrepancy between the basic benchmark results for the Tesla C2070 on the one hand and the GeForce GTX 285 on the other. To be specific, the Tesla C2070 gives worse performance than the GeForce GTX 285 for matrixMul (from the SDK).
GTX 285: 226 Gflop/s
C2070: 183 Gflop/s !!
The bandwidth test also gives worse results on the C2070.
Has anyone seen similar results? Any idea what could be done to improve the performance? It appears that the $250 card is way better than the $4000 card. Are we missing something?
The C2070 is a Padova system with the following specs:
Also, for benchmarking sgemm or dgemm I would use the actual BLAS routines (CUBLAS), not the SDK example. The SDK example isn't nearly as highly optimized and won't necessarily give a true picture of sgemm/dgemm performance between the two cards; see the sketch at the end of this post.
I am positive that dgemm performance on the C2070 will trounce that of the GTX 285.
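Something like this is the kind of timing loop I mean: a rough sketch using the CUBLAS v1 API (cublasInit/cublasSgemm) from the CUDA 3.x/4.x toolkits. The matrix size and iteration count are placeholder choices, not numbers from anyone's setup.

```
/* Rough SGEMM throughput check with CUBLAS (legacy v1 API).
   Build with: nvcc -o sgemm_bench sgemm_bench.cu -lcublas */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas.h>

int main(void)
{
    const int N = 4096;          /* matrix dimension (arbitrary choice) */
    const int iters = 10;        /* timing iterations (arbitrary choice) */
    const size_t bytes = (size_t)N * N * sizeof(float);

    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    for (int i = 0; i < N * N; ++i) { h_A[i] = 1.0f; h_B[i] = 1.0f; }

    cublasInit();

    float *d_A, *d_B, *d_C;
    cublasAlloc(N * N, sizeof(float), (void**)&d_A);
    cublasAlloc(N * N, sizeof(float), (void**)&d_B);
    cublasAlloc(N * N, sizeof(float), (void**)&d_C);
    cublasSetMatrix(N, N, sizeof(float), h_A, N, d_A, N);
    cublasSetMatrix(N, N, sizeof(float), h_B, N, d_B, N);

    /* Warm-up call so the timed loop excludes one-time overhead. */
    cublasSgemm('N', 'N', N, N, N, 1.0f, d_A, N, d_B, N, 0.0f, d_C, N);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cublasSgemm('N', 'N', N, N, N, 1.0f, d_A, N, d_B, N, 0.0f, d_C, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    /* 2*N^3 floating-point operations per SGEMM call. */
    double gflops = 2.0 * N * N * N * iters / (ms / 1e3) / 1e9;
    printf("SGEMM N=%d: %.1f Gflop/s\n", N, gflops);

    cublasFree(d_A); cublasFree(d_B); cublasFree(d_C);
    free(h_A); free(h_B);
    cublasShutdown();
    return 0;
}
```

Swapping cublasSgemm for cublasDgemm (and float for double) gives the double precision comparison, which is where the C2070 should pull ahead.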
I think the issues here boil down to a few common points of confusion:
The SDK examples usually make for terrible benchmarks. They are written with an eye toward demonstrating a particular technique in isolation rather than efficiently solving a real problem.
The Tesla cards are not faster overall than top-of-the-line GeForce cards. If your kernel is limited by single precision, integer, or memory bandwidth performance, you will find the Tesla to be slower than a GTX 480 or 580. If you are limited by double precision performance (and be sure it is not just memory bandwidth), then Tesla will be faster.
You should not buy Tesla for raw computational performance (except double precision), but rather for the other features: better QA testing for 24/7 use, more memory, ECC, bidirectional DMA transfers over the PCI-Express bus, the Windows TCC driver that lets you bypass the overhead of the WDDM, and better technical support.
ECC really seems to be a memory bandwidth performance killer, and many kernels are memory bandwidth limited, not computationally limited.
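To make that last point concrete, here is a rough sketch that times a plain device-to-device copy kernel and reports effective bandwidth. A kernel like this does no real arithmetic, so its throughput is set almost entirely by memory bandwidth; running it with ECC on and then off (ECC can be toggled with nvidia-smi on Tesla boards, followed by a reboot) shows the cost directly. The buffer size and launch configuration are arbitrary choices on my part.

```
/* Effective-bandwidth check via a device-to-device copy kernel.
   Build with: nvcc -o bwtest bwtest.cu */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void copyKernel(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

int main(void)
{
    const int n = 1 << 24;                  /* 16M floats = 64 MB per buffer */
    const size_t bytes = n * sizeof(float);
    const int iters = 20;

    float *d_src, *d_dst;
    cudaMalloc((void**)&d_src, bytes);
    cudaMalloc((void**)&d_dst, bytes);
    cudaMemset(d_src, 0, bytes);

    dim3 block(512);
    dim3 grid((n + block.x - 1) / block.x); /* 32768 blocks, within grid limits */

    copyKernel<<<grid, block>>>(d_dst, d_src, n);   /* warm-up */
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        copyKernel<<<grid, block>>>(d_dst, d_src, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    /* Each copy reads and writes the buffer once. */
    double gbps = 2.0 * bytes * iters / (ms / 1e3) / 1e9;
    printf("Effective bandwidth: %.1f GB/s\n", gbps);

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}
```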
Thank you all for the help! I never received an e-mail about responses, so I didn't check the replies earlier. We also found that if you take advantage of the registers on the Fermi architecture rather than just using shared memory for matrix multiplication, you get much better performance. This is described in the following paper:
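In case it helps anyone else, here is a minimal sketch of the register-blocking idea (my own illustration, not code from the paper): each thread keeps a 2x2 sub-block of C in registers while tiles of A and B are staged through shared memory, so every value read from shared memory is reused for two multiply-adds. The tile sizes, names, and the assumption that N is a multiple of 32 are simplifications for illustration.

```
#define TILE 32          /* C tile computed per thread block            */
#define THREADS 16       /* 16x16 threads per block, 2x2 outputs each   */

__global__ void matmulRegTile(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int tid = ty * THREADS + tx;
    const int rowBase = blockIdx.y * TILE;
    const int colBase = blockIdx.x * TILE;

    /* Per-thread accumulators kept in registers. */
    float c00 = 0.f, c01 = 0.f, c10 = 0.f, c11 = 0.f;

    for (int kTile = 0; kTile < N; kTile += TILE) {
        /* Stage one TILE x TILE block of A and B into shared memory.
           256 threads load 1024 elements, i.e. 4 each, coalesced. */
        for (int i = 0; i < 4; ++i) {
            int idx = tid + i * THREADS * THREADS;
            int r = idx / TILE, c = idx % TILE;
            As[r][c] = A[(rowBase + r) * N + kTile + c];
            Bs[r][c] = B[(kTile + r) * N + colBase + c];
        }
        __syncthreads();

        /* Each thread accumulates a 2x2 sub-block of C in registers,
           reusing each shared-memory value for two multiply-adds. */
        for (int k = 0; k < TILE; ++k) {
            float a0 = As[ty][k];
            float a1 = As[ty + THREADS][k];
            float b0 = Bs[k][tx];
            float b1 = Bs[k][tx + THREADS];
            c00 += a0 * b0;  c01 += a0 * b1;
            c10 += a1 * b0;  c11 += a1 * b1;
        }
        __syncthreads();
    }

    C[(rowBase + ty) * N + colBase + tx]                     = c00;
    C[(rowBase + ty) * N + colBase + tx + THREADS]           = c01;
    C[(rowBase + ty + THREADS) * N + colBase + tx]           = c10;
    C[(rowBase + ty + THREADS) * N + colBase + tx + THREADS] = c11;
}

/* Launch (N must be a multiple of TILE in this sketch):
   dim3 block(THREADS, THREADS);
   dim3 grid(N / TILE, N / TILE);
   matmulRegTile<<<grid, block>>>(d_A, d_B, d_C, N);
*/
```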