Tesla S2050 double precision performance too low

The Tesla S2050 delivers questionable double-precision performance here (CentOS 5.3, 64 bit, CUDA Toolkit 3.2 RC, driver 260.24, -arch=sm_20).
BlackScholes reports a throughput of 1.7755 GOptions/s on the Tesla vs. 3.9667 GOptions/s on a GTX 480, and in the matrixMul test (converted to double) the GTX 480 is about 20% faster than the Tesla (~89 DP GFLOPS on the Tesla vs. ~109 on the GTX 480). The same picture extends to my own application, in which the GTX 480 consistently beats the Tesla GPU by about 20%.

I have installed the most recent driver and toolkit and even switched ECC off for testing, but the Tesla remains slow (memory bandwidth is also low: 85 GB/s device-to-device for the Tesla vs. 111 GB/s for the GTX 480).
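
(For reference, the kind of measurement I mean by "device-to-device bandwidth" is sketched below; the buffer size and repeat count are arbitrary choices, and the SDK's bandwidthTest is the more standard tool.)

// Minimal device-to-device bandwidth check (sketch only).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 << 20;   // 64 MB per buffer (arbitrary)
    const int    reps  = 100;        // arbitrary repeat count

    char *src, *dst;
    cudaMalloc((void**)&src, bytes);
    cudaMalloc((void**)&dst, bytes);
    cudaMemset(src, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each copy reads and writes 'bytes', hence the factor of 2.
    double gbps = 2.0 * bytes * reps / (ms * 1.0e6);
    printf("device-to-device bandwidth: %.1f GB/s\n", gbps);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}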

What can be done to improve the Tesla's double-precision performance?

In principle, the memory bandwidth is lower on Tesla boards than on GeForce cards (they use a lower memory clock). Hence, if you are bandwidth limited, the GeForce will give better performance. (And if you take a look at your numbers, the 89 vs. 109 GFLOPS correspond very nicely to the measured transfer rates.) Also, the GTX 480 has one more SM (15 instead of 14), which, combined with the higher clock of its cores, should give it a nice lead in single-precision and integer compute performance.

Bottom line: in most applications I would expect the GTX 480 to beat the C2050 by 20-25%. Only if you are really double-precision bound might this not be the case.
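
For a rough sanity check of the theoretical peaks, something along the lines of the sketch below can be run on each card. The 32 cores per SM and the 1/2 vs. 1/8 double-precision ratios are assumptions taken from the published Fermi specifications, not values the runtime can report; with the published clocks they work out to roughly 515 DP GFLOPS for the C2050/S2050 and roughly 168 DP GFLOPS for the GTX 480.

// Back-of-the-envelope peak estimate for a Fermi (sm_20) part (sketch).
// The 32 cores/SM and the DP ratios below are assumptions from published
// specs; only the SM count and core clock are queried from the device.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);

    double sp_gflops = p.multiProcessorCount    // SMs (15 on GTX 480, 14 on C2050)
                     * 32.0                     // CUDA cores per SM on sm_20
                     * 2.0                      // FMA counts as 2 flops
                     * (p.clockRate * 1.0e-6);  // clockRate is in kHz -> GHz

    printf("%s: ~%.0f SP GFLOPS peak\n", p.name, sp_gflops);
    printf("  DP peak at 1/2 rate (Tesla C/S2050):   ~%.0f GFLOPS\n", sp_gflops / 2.0);
    printf("  DP peak at 1/8 rate (GeForce GTX 480): ~%.0f GFLOPS\n", sp_gflops / 8.0);
    return 0;
}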

Best Regards

Ceearem

Thank you for your quick response. I understand that memory bandwidth favors the GTX 480, but it seems to me that it must be extremely hard to make use of the additional DP units when even NVIDIA demo programs such as BlackScholes cannot use them effectively.

Is there some generally accepted CUDA benchmark for double-precision computation, just so I can verify that our hardware and software setup is OK and that the Tesla is, at least in principle, capable of delivering 500-600 DP GFLOP/s?

The bottom line is that I recommended buying the S2050 instead of a bunch of GTX 480s, and I will face some serious discussions if the Tesla keeps delivering only 80% of a GTX 480's performance instead of the expected 300-400%. ;-)
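
Failing an official benchmark, a compute-only sanity check along the lines of the sketch below (grid size and iteration count are arbitrary choices) should at least show how close a card gets to its DP peak once memory traffic is taken out of the picture.

// Compute-bound double-precision sanity check (sketch only, not an
// official benchmark). Each thread runs a chain of dependent FMAs;
// grid size and iteration count are arbitrary choices.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dp_fma(double *out, int iters)
{
    double a = 1.0 + threadIdx.x * 1.0e-9;
    double b = 1.0 - blockIdx.x  * 1.0e-9;
    for (int i = 0; i < iters; ++i) {
        a = a * b + 0.5;   // 1 DP FMA
        b = b * a + 0.5;   // 1 DP FMA
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
}

int main()
{
    const int blocks = 4096, threads = 256, iters = 10000;
    double *out;
    cudaMalloc((void**)&out, (size_t)blocks * threads * sizeof(double));

    dp_fma<<<blocks, threads>>>(out, iters);   // warm-up launch

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0, 0);
    dp_fma<<<blocks, threads>>>(out, iters);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);

    // 2 FMAs per loop iteration, 2 flops per FMA
    double flops = 2.0 * 2.0 * iters * (double)blocks * threads;
    printf("~%.0f DP GFLOPS\n", flops / (ms * 1.0e6));

    cudaFree(out);
    return 0;
}

Compiled with -arch=sm_20 and run on each card in turn, this should be limited almost entirely by the double-precision FMA rate rather than by memory.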

Maybe it is the ECC thing that is limiting the performance.

Does BlackScholes really use doubles a lot?

By the way, did you know that NVIDIA artificially downclocked the Tesla to discourage gamers from buying Teslas? NVIDIA offers full support for Tesla, and they do not want to get many calls from gamers.

That's total nonsense (perhaps you are kidding and I'm being obtuse). I think the Tesla price tag of $2500 is going to discourage gamers quite effectively. :) The downclocking in the Tesla series is there to increase reliability and stability.

Yes, you are right, BlackScholes is not a useful example in this case, since it is based on float.

Also, Black-Scholes is not arithmetically intensive compared to discrete lattice methods like binomial trees. We have seen earlier that Black-Scholes yields the worst speedup, since it is primarily dominated by memcopies…

In the binomialOptions sample I get these results:

************************* single precision:

Using CUDA device [0]: Tesla S2050
Using single precision…
Generating input data…
Running GPU binomial tree…
Options count : 512
Time steps : 2048
binomialOptionsGPU() time: 27.306999 msec
Options per second : 18749.771666

Using CUDA device [0]: GeForce GTX 480
Using single precision…
Generating input data…
Running GPU binomial tree…
Options count : 512
Time steps : 2048
binomialOptionsGPU() time: 21.336000 msec
Options per second : 23996.999877

************************ double precision:

Using CUDA device [0]: Tesla S2050
Using double precision…
Generating input data…
Running GPU binomial tree…
Options count : 512
Time steps : 2048
binomialOptionsGPU() time: 45.983002 msec
Options per second : 11134.549311

Using CUDA device [0]: GeForce GTX 480
Using double precision…
Generating input data…
Running GPU binomial tree…
Options count : 512
Time steps : 2048
binomialOptionsGPU() time: 40.305000 msec
Options per second : 12703.138472

The differences in single precision can be explained by the higher SM count, higher core clock, etc., but again the Tesla doesn't live up to the expected performance in double-precision mode.

Is there any possibility of nvcc producing code that is not aware of the Tesla's extra DP units? (I used the supplied makefile and also built the example explicitly with -arch sm_20.)
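
(One thing that can be checked from inside the application itself is which architecture the kernel binary was actually built for; a sketch of such a check is below, where myKernel is just a placeholder for any kernel from the real code.)

// Sketch: check which architecture a kernel was actually built for.
// 'myKernel' is a placeholder; substitute any kernel from the application.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(double *x) { x[threadIdx.x] *= 2.0; }

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("device: %s, compute capability %d.%d\n", p.name, p.major, p.minor);

    cudaFuncAttributes a;
    cudaFuncGetAttributes(&a, myKernel);
    // Both values are reported as major*10 + minor; a build that only
    // targeted compute_13 would show up here as ptxVersion 13 even though
    // the JIT still produces an sm_20 binary for the device.
    printf("kernel ptx version %d, binary version %d\n",
           a.ptxVersion, a.binaryVersion);
    return 0;
}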
