Tesla S2050 double precision performance too low

The Tesla S2050 delivers questionable double-precision performance here (CentOS 5.3, 64 bit, CUDA Toolkit 3.2 RC, driver 260.24, -arch=sm_20).
BlackScholes reports a throughput of 1.7755 GOptions/s on the Tesla vs. 3.9667 GOptions/s on a GTX 480, and in the matrixMul test (converted to double) the GTX 480 is about 20% faster than the Tesla (~89 DP GFLOPS on the Tesla vs. ~109 on the GTX 480). The same picture extends to my own application, in which the GTX 480 consistently beats the Tesla GPU by about 20%.

I have installed the most recent driver and toolkit and even switched ECC off for testing, but the Tesla remains slow (memory bandwidth is also low: 85 GB/s device-to-device for the Tesla vs. 111 GB/s for the GTX 480).
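
(For reference, the kind of measurement I mean by "device-to-device bandwidth" is sketched below; the buffer size and repeat count are arbitrary choices, and the SDK's bandwidthTest is the more standard tool.)

// Minimal device-to-device bandwidth check (sketch only).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 << 20;   // 64 MB per buffer (arbitrary)
    const int    reps  = 100;        // arbitrary repeat count

    char *src, *dst;
    cudaMalloc((void**)&src, bytes);
    cudaMalloc((void**)&dst, bytes);
    cudaMemset(src, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each copy reads and writes 'bytes', hence the factor of 2.
    double gbps = 2.0 * bytes * reps / (ms * 1.0e6);
    printf("device-to-device bandwidth: %.1f GB/s\n", gbps);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}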

What can be done to improve the Tesla's double-precision performance?

In principle, the memory bandwidth is lower on Tesla boards than on GeForce cards (they use a lower memory clock). Hence, if you are bandwidth limited, the GeForce will give better performance. (And if you take a look at your numbers, the 89 vs. 109 GFLOPS correspond very nicely to the measured transfer rates.) Also, the GTX 480 has one more SM (15 instead of 14), which, combined with the higher clock of its cores, should give it a nice lead in single-precision and integer compute performance.

Bottom line: in most applications I would expect the GTX 480 to beat the C2050 by 20-25%. Only if you are really double-precision bound might this not be the case.
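
For a rough sanity check of the theoretical peaks, something along the lines of the sketch below can be run on each card. The 32 cores per SM and the 1/2 vs. 1/8 double-precision ratios are assumptions taken from the published Fermi specifications, not values the runtime can report; with the published clocks they work out to roughly 515 DP GFLOPS for the C2050/S2050 and roughly 168 DP GFLOPS for the GTX 480.

// Back-of-the-envelope peak estimate for a Fermi (sm_20) part (sketch).
// The 32 cores/SM and the DP ratios below are assumptions from published
// specs; only the SM count and core clock are queried from the device.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);

    double sp_gflops = p.multiProcessorCount    // SMs (15 on GTX 480, 14 on C2050)
                     * 32.0                     // CUDA cores per SM on sm_20
                     * 2.0                      // FMA counts as 2 flops
                     * (p.clockRate * 1.0e-6);  // clockRate is in kHz -> GHz

    printf("%s: ~%.0f SP GFLOPS peak\n", p.name, sp_gflops);
    printf("  DP peak at 1/2 rate (Tesla C/S2050):   ~%.0f GFLOPS\n", sp_gflops / 2.0);
    printf("  DP peak at 1/8 rate (GeForce GTX 480): ~%.0f GFLOPS\n", sp_gflops / 8.0);
    return 0;
}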

Best Regards

Ceearem

Thank you for your quick response. I understand that memory bandwidth favors the GTX 480, but it seems to me that it must be extremely hard to make use of the additional DP units when even NVIDIA demo programs such as BlackScholes cannot use them effectively.

Is there some generally accepted CUDA benchmark for double-precision computation, just so I can verify that our hardware and software setup is OK and that the Tesla is, at least in principle, capable of delivering 500-600 DP GFLOP/s?

The bottom line is that I recommended buying the S2050 instead of a bunch of GTX 480s, and I will face some serious discussions if the Tesla keeps delivering only 80% of a GTX 480's performance instead of the expected 300-400%. ;-)
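
Failing an official benchmark, a compute-only sanity check along the lines of the sketch below (grid size and iteration count are arbitrary choices) should at least show how close a card gets to its DP peak once memory traffic is taken out of the picture.

// Compute-bound double-precision sanity check (sketch only, not an
// official benchmark). Each thread runs a chain of dependent FMAs;
// grid size and iteration count are arbitrary choices.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dp_fma(double *out, int iters)
{
    double a = 1.0 + threadIdx.x * 1.0e-9;
    double b = 1.0 - blockIdx.x  * 1.0e-9;
    for (int i = 0; i < iters; ++i) {
        a = a * b + 0.5;   // 1 DP FMA
        b = b * a + 0.5;   // 1 DP FMA
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
}

int main()
{
    const int blocks = 4096, threads = 256, iters = 10000;
    double *out;
    cudaMalloc((void**)&out, (size_t)blocks * threads * sizeof(double));

    dp_fma<<<blocks, threads>>>(out, iters);   // warm-up launch

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0, 0);
    dp_fma<<<blocks, threads>>>(out, iters);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);

    // 2 FMAs per loop iteration, 2 flops per FMA
    double flops = 2.0 * 2.0 * iters * (double)blocks * threads;
    printf("~%.0f DP GFLOPS\n", flops / (ms * 1.0e6));

    cudaFree(out);
    return 0;
}

Compiled with -arch=sm_20 and run on each card in turn, this should be limited almost entirely by the double-precision FMA rate rather than by memory.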

Maybe it is the ECC thing that is limiting the performance.

Does BlackScholes really use doubles a lot?

By the way, did you know that NVIDIA artificially downclocked the Tesla to discourage gamers from buying Teslas? NVIDIA offers full support for Tesla, and they do not want to get many calls from gamers.

That's total nonsense (perhaps you are kidding and I'm being obtuse). I think the Tesla price tag of $2500 is going to discourage gamers quite effectively. :) The downclocking in the Tesla series is there to increase reliability and stability.

Yes, you are right, BlackScholes is not a useful example in this case, since it is based on float.

Also, Black-Scholes is not arithmetically intensive compared to discrete lattice methods like binomial trees. We have seen earlier that Black-Scholes yields the worst speedup, since it is primarily dominated by memcopies…

In the binomialOptions sample I get these results:

************************* single precision:

Using CUDA device [0]: Tesla S2050
Using single precision…
Generating input data…
Running GPU binomial tree…
Options count : 512
Time steps : 2048
binomialOptionsGPU() time: 27.306999 msec
Options per second : 18749.771666

Using CUDA device [0]: GeForce GTX 480
Using single precision…
Generating input data…
Running GPU binomial tree…
Options count : 512
Time steps : 2048
binomialOptionsGPU() time: 21.336000 msec
Options per second : 23996.999877

************************ double precision:

Using CUDA device [0]: Tesla S2050
Using double precision…
Generating input data…
Running GPU binomial tree…
Options count : 512
Time steps : 2048
binomialOptionsGPU() time: 45.983002 msec
Options per second : 11134.549311

Using CUDA device [0]: GeForce GTX 480
Using double precision…
Generating input data…
Running GPU binomial tree…
Options count : 512
Time steps : 2048
binomialOptionsGPU() time: 40.305000 msec
Options per second : 12703.138472

The differences in single precision can be explained by the higher SM count, higher core clock, etc., but again the Tesla doesn't live up to the expected performance in double-precision mode.

Is there any possibility of nvcc producing code that is not aware of the Tesla's extra DP units? (I used the supplied makefile and also built the example explicitly with -arch sm_20.)
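
(One thing that can be checked from inside the application itself is which architecture the kernel binary was actually built for; a sketch of such a check is below, where myKernel is just a placeholder for any kernel from the real code.)

// Sketch: check which architecture a kernel was actually built for.
// 'myKernel' is a placeholder; substitute any kernel from the application.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(double *x) { x[threadIdx.x] *= 2.0; }

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("device: %s, compute capability %d.%d\n", p.name, p.major, p.minor);

    cudaFuncAttributes a;
    cudaFuncGetAttributes(&a, myKernel);
    // Both values are reported as major*10 + minor; a build that only
    // targeted compute_13 would show up here as ptxVersion 13 even though
    // the JIT still produces an sm_20 binary for the device.
    printf("kernel ptx version %d, binary version %d\n",
           a.ptxVersion, a.binaryVersion);
    return 0;
}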
