Tesla 20-Series Features and Advantages

Updated: Aug 11: There is now a web page with details:
http://www.nvidia.com/object/why-choose-tesla.html

A common question we get is: “Why should I buy Tesla instead of GeForce?”
Here are some things to consider, written with Tesla 20-series / Fermi products in mind:

Tesla 20-series (Fermi-based) products are designed for high-performance
scientific and technical GPU computing.

They therefore offer features, testing, and support over and above our consumer
GeForce GTX 470 and 480 (Fermi-based) products, such as:

  • Double-precision throughput is 1/2 of single-precision throughput on Tesla 20-series,
    whereas it is 1/8 of single-precision throughput on GeForce GTX 470/480
  • ECC is available only on Tesla
  • Tesla 20-series has 2 DMA Engines (copy engines). GeForce has 1 DMA Engine. This
    means that CUDA applications can overlap computation and communication on Tesla using
    bi-directional communication over PCI-e.
  • Tesla products have larger memory on board (3GB and 6GB)
  • Cluster management software is only supported on Tesla products
  • The TCC (Tesla Compute Cluster) driver for Windows is only supported on Tesla
  • OEMs offer integrated workstations and servers with Tesla products only
  • HPC ISV software is tested, certified, and supported only on Tesla products
  • Tesla products are built for reliable long running computing applications and
    undergo intense stress testing and burn-in. In fact, we create a margin in
    memory and core clocks (by using lower clocks) to increase reliability and long life.
  • Tesla products are manufactured by NVIDIA and come with a 3-year warranty
  • Tesla customers receive enterprise support and have higher priority for CUDA bugs
    and requests for enhancements
  • Tesla products have long availability cycles ranging from 18 to 24 months and NVIDIA
    gives its customers a 6 month EOL notice before discontinuing a Tesla product.
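
To make the copy-engine bullet above concrete, here is a minimal sketch of a steady-state timing model for an application that streams data through the GPU in chunks. It is only a back-of-the-envelope model, not a measurement: the function name `time_per_chunk` and all timings are hypothetical, and real behavior also depends on pinned memory, stream usage, and the driver.

```python
# Simplified steady-state model of a chunked GPU pipeline.
# All names and timings here are hypothetical, chosen only for illustration.

def time_per_chunk(h2d_ms, kernel_ms, d2h_ms, copy_engines):
    """Approximate steady-state time per chunk when copies and kernels are pipelined."""
    if copy_engines >= 2:
        # Upload, kernel, and download can all run concurrently.
        return max(h2d_ms, kernel_ms, d2h_ms)
    if copy_engines == 1:
        # Upload and download share one engine, so they serialize,
        # but the kernel can still overlap with whichever copy is running.
        return max(h2d_ms + d2h_ms, kernel_ms)
    # No copy/compute overlap at all: everything runs back to back.
    return h2d_ms + kernel_ms + d2h_ms

one_engine_ms = time_per_chunk(4.0, 5.0, 4.0, copy_engines=1)
two_engines_ms = time_per_chunk(4.0, 5.0, 4.0, copy_engines=2)
print(f"1 copy engine:  {one_engine_ms} ms per chunk")
print(f"2 copy engines: {two_engines_ms} ms per chunk")
```

With these made-up numbers the single-engine case is bound by the two serialized transfers, while the dual-engine case is bound by the kernel. If the kernel dominates the transfers, the second engine makes no difference in this model.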

Learn more at http://www.nvidia.com/tesla
CUDA Software Development tools are linked from: http://www.nvidia.com/object/tesla_software.html

Knowledgebase entry that will be kept up to date:
http://nvidia.custhelp.com/cgi-bin/nvidia…hp?p_faqid=2595

And, perhaps most critically for many applications, Tesla comes with much more memory than GeForce cards… 3 or 6 GB, versus 1.5 GB on the GeForce GTX 480.

Thanks for reminding me of this very important feature! I updated the main post.

Sumit,

Can you provide links for “TCC” and “Cluster Manager Software”?

Best regards,
Sarnath

All Tesla product drivers are at http://www.nvidia.com/drivers

Select Tesla 1U System -> S1070 -> Windows 2008 (R2) (x64) and you will come to TCC

We will release Tesla C1060 TCC drivers soon for Windows Vista and Windows 7

Cluster software links are on the SW Tools page (link in original post)

Thank you Sumit. It was useful.

Nice to see auto-parallelizing software like “Goose”, “HMPP” and the like.

What do you mean by this? I’m regularly using the Sabalcore on-demand cluster, which has GTX 285 cards attached to a number of nodes, and I’m able to use them through the TORQUE resource manager without any issues…

“Double precision is 1/2 of single precision for Tesla 20-series”

Is this an artificial cap to get people to buy Tesla? Not that there’s anything wrong with it, but I question its long-term effect on GPGPU adoption. If capped, a GTX 480 delivers about 168 double-precision Gflops at roughly $3 / Gflop, compared to maybe $8 / Gflop for CPUs. If it were uncapped, and hence about $0.74 / Gflop, that would be a major attraction.
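
For anyone checking the arithmetic, the figures in this post can be reproduced with a short calculation. This is only a sketch under assumptions that are not stated in the thread: 480 CUDA cores, a 1.401 GHz shader clock, and a $499 launch price for the GTX 480. The CPU figure is not recomputed here.

```python
# Assumed GTX 480 figures (not stated in the thread):
# 480 CUDA cores, 1.401 GHz shader clock, $499 launch price.
CORES = 480
SHADER_CLOCK_GHZ = 1.401
PRICE_USD = 499.0

# One fused multiply-add per core per clock counts as 2 flops.
sp_gflops = CORES * 2 * SHADER_CLOCK_GHZ

def dollars_per_dp_gflop(dp_to_sp_ratio):
    """Price per peak double-precision Gflop for a given DP:SP throughput ratio."""
    return PRICE_USD / (sp_gflops * dp_to_sp_ratio)

capped = dollars_per_dp_gflop(1 / 8)    # GeForce ratio from the original post
uncapped = dollars_per_dp_gflop(1 / 2)  # Tesla 20-series ratio

print(f"peak DP at 1/8 rate: {sp_gflops / 8:.0f} Gflops, ${capped:.2f} per Gflop")
print(f"peak DP at 1/2 rate: {sp_gflops / 2:.0f} Gflops, ${uncapped:.2f} per Gflop")
```

These are theoretical peak numbers, so real applications will see a higher cost per delivered Gflop in both cases.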

We have no official word on that from NVIDIA and I doubt we’d ever hear “yes, we capped GeForces to drive Tesla sales” anyway, but there’s a good possibility this isn’t an artificial cap. It might be a legitimate way of increasing yields (which are pretty poor I hear). If some DP FPUs fail to work, don’t throw away the die, disable the duds and make it a GeForce. Gamers don’t need DP so why make them pay for hand-picked all-working chips? That would be the same motivation that made Cell a 7-core processor in PlayStation 3 while “premium” blade servers had all 8 cores active.

On the other hand, the capping theory may be true as well. One way of confirming it would be if someone made some sort of flash hack showing that the missing ALUs can be activated. I don’t expect NVIDIA to admit it even if this were true.

I doubt they would disable ALUs, because I think single precision and double precision are done in the same unit, according to “A New Architecture For Multiple-Precision Floating-Point Multiply-Add Fused Unit Design” from 2007. The abstract says

Sharing the ALUs for float and double definitely makes sense, since it gives the flexibility of using all units at once instead of leaving half of them idle. I believe Int24 operations use the single precision FPU, since

  1. Their throughputs are the same as float.

  2. Float has 24 significant bits.
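
The 24 in the second point refers to the width of the single-precision significand in bits (23 stored bits plus one implicit leading bit), which is why 24-bit integers fit exactly. A quick way to see this, sketched here with Python's standard `struct` module rather than on a GPU (the helper name `to_float32` is made up for this example):

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

largest_24_bit = 2**24 - 1        # 16777215, needs exactly 24 bits
first_25_bit_odd = 2**24 + 1      # 16777217, needs 25 bits

# Every integer up to 24 bits wide survives the round trip unchanged.
print(to_float32(float(largest_24_bit)) == largest_24_bit)

# An odd 25-bit integer cannot be represented and gets rounded.
print(to_float32(float(first_25_bit_odd)) == first_25_bit_odd)
```

This only shows that a 24-bit integer multiply could reuse the float significand datapath. It does not prove that the hardware actually does so.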

If DP and SP are done by the same physical ALUs, the yield increase theory doesn’t seem to make sense anymore…

It didn’t make much sense before, to be honest. It would be a pretty fortuitous defect distribution that would leave exactly the right number of working double precision ALUs in each MP and leave the rest of the die otherwise intact…

Working on the theory that there is currently only one Fermi die (i.e. all GF100s are born equal), it seems much more likely that the strategy long used with OpenGL acceleration on Quadro boards has now been extended to the compute APIs on Fermi, i.e. if you want the full feature set, buy the professional board.

  1. Because we learned our lesson from the GT200 and have crippled the GF100 to MAKE YOU buy the Tesla.

  2. See answer number 1.

:rolleyes:

I would think that seeing one 480 core device (Fermi) vs. two 240 core devices (Tesla–/GTX295) is a big win for many developers that aren’t investing in distributing their application across devices.

We currently have ~20 S1070 Tesla systems, with 2 Teslas per server machine. Another advantage of Fermi, in addition to what you say, is that I can cut the server count by half to achieve roughly the same computational power.

It also means more computational power per PCI slot, which is a limiting factor today.

All in all, if Fermi delivers ~2x the performance, I think it’s an exciting change…

eyal

Sumit,

OK. Here’s the $64,000 question.

Are the DP FPUs on the Fermi chip deliberately turned off or destroyed for the consumer GTX 470/480 chips, or is this a yield issue, where otherwise good Fermi chips with some faulty DP FPU units are then salvaged by putting them in the consumer gaming cards?

Or, in other words, are you all deliberately making your Fermi chips less powerful than they are, or is this a question of availability of fully functioning chips? If the former, how should someone on a computational budget most effectively spend their dollars? If the latter, can we expect improvements in the process to eliminate this issue in the future?

Regards,

Martin

It’s possible (albeit a little far-fetched) to imagine that the ALU is the fabrication weak link on the chip, and that many of the otherwise good chips have ALU fault levels varying from, say, 30-60%, so by shutting down 75% of them, all of those chips are made usable. However, I thought that each ALU is specific to a CUDA core or a group of cores, and not free-floating.

But I suspect there’s more truth in your latter point, and that nVidia’s approach is “We’ve put X $$$ into CUDA development, and must sell computational cards at $Y to make that division profitable. And we’ve spent $A on development for gaming, so the gaming cards must sell at $B for profitability, given our projected sales.” (quotes mine.) I wish they would realize that HPC is a smaller market, with more cost-conscious consumers, and try to make that division merely break even.

It’s tempting to go out and buy one share of nVidia stock, simply to get their annual reports mailed to me. (I know people do this with Berkshire Hathaway, to get the inside info, even though one share runs about $10,000. People also do this with Sam’s Club, since you can shop there if you’re a stockholder.) Perhaps then I could get more inside info. Couple it with one share of AMD to monitor the whole market.

Regards,

Martin

As far as I understand, a DP unit is actually 2 SP units with some extra logic, so there are no separate DP units anymore as there were on GT200. That is why each Fermi multiprocessor runs only 1 DP warp in 2 clock cycles, while it runs 2 SP warps in 2 clock cycles.
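
Taking the warp figures in this post at face value, the 1/2 DP-to-SP ratio quoted for Tesla follows directly. A tiny sketch of that arithmetic, where the only added assumption is the usual CUDA warp size of 32 threads and the function name `ops_per_clock` is made up for this example:

```python
WARP_SIZE = 32  # threads per warp in CUDA

def ops_per_clock(warps, clocks):
    """Per-multiprocessor instruction throughput, in thread-operations per clock."""
    return warps * WARP_SIZE / clocks

sp_rate = ops_per_clock(warps=2, clocks=2)  # two SP warps every 2 clocks
dp_rate = ops_per_clock(warps=1, clocks=2)  # one DP warp every 2 clocks

print(f"SP: {sp_rate} ops/clock, DP: {dp_rate} ops/clock, ratio: {dp_rate / sp_rate}")
```

The 1/8 ratio on GeForce would then be a further factor of four on top of this, which this calculation does not explain.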

Since Fermi’s DP is Real Deal DP, full IEEE 754-2008, not an approximation, it’s likely more accurate to say that it’s fundamentally a DP unit with extra logic to alternatively let it do 2 SP results in parallel. I think the extra computational bits are also cleverly used to implement FMA.

So with all this nice new tech rolled in, it’s all the more disappointing to have 3/4 of the performance capped on the consumer cards.

Martin