Tesla C2070 vs. GX2 speed test

Hi everyone,

I was doing some speed tests on our new Tesla C2070. Let me begin by saying that for most of my actual code the Tesla is between one and three times as fast as one of our old GeForce 9800 GX2 cards. However, for small problem sizes the GeForce is often considerably faster. This prompted me to run speed tests with the SDK code examples, with the following results:

== clock ==
[clock] starting…

Using CUDA device [0]: Tesla C2070
time = 390322
[clock] test results…
PASSED

[clock] starting…

Using CUDA device [1]: GeForce 9800 GX2
time = 20950
[clock] test results…
PASSED

== eigenvalues ==
[eigenvalues] starting…

Using CUDA device [0]: Tesla C2070
Matrix size: 2048 x 2048
Precision: 0.000010
Iterations to be timed: 100
Result filename: ‘eigenvalues.dat’
Gerschgorin interval: -2.894310 / 2.923303
Average time step 1: 34.177879 ms
Average time step 2, one intervals: 11.691751 ms
Average time step 2, mult intervals: 0.006080 ms
Average time TOTAL: 45.962509 ms
[eigenvalues] test results…
PASSED

[eigenvalues] starting…

Using CUDA device [1]: GeForce 9800 GX2
Matrix size: 2048 x 2048
Precision: 0.000010
Iterations to be timed: 100
Result filename: ‘eigenvalues.dat’
Gerschgorin interval: -2.894310 / 2.923303
Average time step 1: 12.691402 ms
Average time step 2, one intervals: 3.923561 ms
Average time step 2, mult intervals: 0.003820 ms
Average time TOTAL: 16.648130 ms
[eigenvalues] test results…
PASSED

Is this normal? Should the $2500 Tesla be (much!) slower than the $500 GX2?

Are you compiling your tests with -arch sm_20 for the Tesla? If not, the driver will have to recompile your kernel from the PTX at load time, and for short benchmarks that might be significant, depending on how the timing is being done.

The cards are in different (though otherwise identical) computers. I thought the driver would not need to JIT-compile from PTX on a machine with only Tesla cards, but in any case, rebuilding the tests with -arch sm_20 for the Tesla gives the same results.

I agree that these NVIDIA examples are very short (a couple of ms), so the overhead of setting up the cards etc., which I’ve seen take longer on Teslas, may play a big role. Indeed, when I use the matrixMul example it reports a higher throughput for the Tesla. (The matrices are 2x larger in every dimension on the Tesla, which scales the operation count by a factor of 8; the Tesla takes about 4x as long, so per operation it is roughly twice as fast, in line with the reported GFlop/s figures.)

GX2:
Using Matrix Sizes: A(320 x 480), B(320 x 320), C(320 x 480)

CUDA matrixMul Throughput = 99.0369 GFlop/s, Time = 0.00099 s, Size = 98304000 Ops, NumDevsUsed = 1, Workgroup = 256

Tesla:
Using Matrix Sizes: A(640 x 960), B(640 x 640), C(640 x 960)

CUDA matrixMul Throughput = 186.6738 GFlop/s, Time = 0.00421 s, Size = 786432000 Ops, NumDevsUsed = 1, Workgroup = 1024

Still, for every one of my actual problems that uses fewer than about 500 threads, the GX2 is considerably faster; above 500 threads the Tesla quickly pulls ahead. How much faster should the Tesla be for a typical problem? Or does this depend strongly on the type of problem?

‘eigenvalues’ is a poor representative of card performance. In the example you provided, all kernels are launched as <<<1,256>>>, which is woefully insufficient to load the Tesla (you need roughly 20x more threads in flight to achieve full performance). For starters, with only one block it can occupy just 1 SM out of 14. Nor is the 9800 GX2 loaded to any significant degree (also just 1 SM, out of the 16 on the G92 that CUDA device [1] corresponds to; the GX2’s two GPUs show up as separate devices). That still does not explain why the 9800 does the job faster; I’m not sure where the bottleneck is in that program, only that the results aren’t very meaningful.

The 9800 GX2 is an older architecture with fewer compute cores, but its individual cores are faster (1500 MHz vs. 1150 MHz). Apart from the limitations of compute capability 1.1, it’s quite a capable card; its single-precision FLOPS rating is almost equal to the Tesla’s.

Thanks for the explanation, Hamster. Indeed, we bought the Tesla card principally for the ECC and the double precision. What intrigues me is that it seems very hard to put one’s finger on the exact circumstances under which each card performs best. I agree that the eigenvalues and probably also the clock examples are not representative of most real-world problems. Nevertheless, I have run into real-world situations of my own where the GX2 was significantly faster. Your remark that “you need about 20x more threads (than 256) in flight to achieve full performance” probably explains most of this, as those problems used up to 500 threads, not 5000. However, this does not mean that a similar problem is “a poor representative of card performance”. It just means that the optimal choice of computing platform depends on your problem, and that no single “representative of card performance” exists (but that is of course somewhat of an open door…).