A common question we get is: why should I buy Tesla instead of GeForce?
Here are some things to consider, written with Tesla 20-series / Fermi products in mind:
Tesla 20-series (Fermi-based) products are designed for high-performance
scientific and technical GPU computing.
They thus have features, testing, and support over and above our consumer
GeForce GTX 470 and 480 (Fermi-based) products, such as:
Double-precision throughput is 1/2 of single-precision throughput on Tesla 20-series, whereas
it is 1/8 of single-precision throughput on GeForce GTX 470/480
ECC is available only on Tesla
Tesla 20-series has 2 DMA engines (copy engines); GeForce has 1. This
means that CUDA applications on Tesla can overlap computation with simultaneous
bi-directional transfers over PCIe, while a single engine can move data in only one direction at a time.
Tesla products have larger memory on board (3GB and 6GB)
Cluster management software is only supported on Tesla products
The TCC (Tesla Compute Cluster) driver for Windows is only supported on Tesla
OEMs offer integrated workstations and servers with Tesla products only
HPC ISV software is tested, certified, and supported only on Tesla products
Tesla products are built for reliable, long-running computing applications and
undergo intense stress testing and burn-in. In fact, we build in a margin on
memory and core clocks (by using lower clocks) to increase reliability and extend product life.
Tesla products are manufactured by NVIDIA and come with a 3-year warranty
Tesla customers receive enterprise support and have higher priority for CUDA bugs
and requests for enhancements
Tesla products have long availability cycles ranging from 18 to 24 months and NVIDIA
gives its customers a 6 month EOL notice before discontinuing a Tesla product.
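Several of the claims above (on-board memory size, ECC, and the number of copy engines) can be checked on any card through the CUDA runtime API. The sketch below is a minimal example, assuming a reasonably recent toolkit; the asyncEngineCount field, for instance, superseded the older deviceOverlap flag.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA device found\n");
        return 0;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s\n", d, prop.name);
        /* On-board memory: 3 GB / 6 GB on Tesla 20-series,
           1.5 GB on a GeForce GTX 480. */
        printf("  Global memory: %.1f GB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        /* ECC is available only on Tesla parts. */
        printf("  ECC enabled:   %s\n", prop.ECCEnabled ? "yes" : "no");
        /* 2 on Tesla 20-series, 1 on GeForce. */
        printf("  Copy engines:  %d\n", prop.asyncEngineCount);
    }
    return 0;
}
```

When asyncEngineCount reports 2, host-to-device and device-to-host cudaMemcpyAsync calls issued on separate streams can proceed in both directions at once, alongside kernel execution.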
And, perhaps most critically for many applications, Tesla comes with much more memory than GeForce cards: 3 or 6 GB, versus 1.5 GB on the GeForce GTX 480.
What do you mean by this? I’m regularly using the Sabalcore on-demand cluster, which has GTX 285 cards attached to a number of nodes, and I am able to utilize them through the TORQUE resource manager without any issues…
“Double precision is 1/2 of single precision for Tesla 20-series”
Is this an artificial cap to get people to buy Tesla? Not that there’s anything wrong with that, but
I question its long-term effect on GPGPU adoption. If capped, a GTX 480 delivers 168 double-precision Gflops at roughly $3/Gflop, compared to maybe $8/Gflop for CPUs. If uncapped, and hence roughly $0.74/Gflop, that would be a major attraction.
We have no official word on that from NVIDIA and I doubt we’d ever hear “yes, we capped GeForces to drive Tesla sales” anyway, but there’s a good possibility this isn’t an artificial cap. It might be a legitimate way of increasing yields (which are pretty poor I hear). If some DP FPUs fail to work, don’t throw away the die, disable the duds and make it a GeForce. Gamers don’t need DP so why make them pay for hand-picked all-working chips? That would be the same motivation that made Cell a 7-core processor in PlayStation 3 while “premium” blade servers had all 8 cores active.
On the other hand, the capping theory may be true as well. One way of confirming it would be if someone produced some sort of flash hack showing that the missing ALUs can be activated. I don’t expect NVIDIA to admit it if this were true.
I doubt they would disable ALUs, because I think single precision and double precision are handled by the same unit, according to
“A New Architecture For Multiple-Precision Floating-Point Multiply-Add Fused Unit Design” from 2007.
Sharing the ALUs between float and double definitely makes sense, since it allows all units to be used at once instead of leaving half of them idle. I believe Int24 operations also use the single-precision FPU.
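For what it’s worth, CUDA exposes the 24-bit integer path directly through the __mul24 intrinsic; a minimal kernel sketch (kernel and parameter names are mine). On pre-Fermi parts __mul24 was the fast path, while Fermi multiplies 32-bit integers natively.

```cuda
__global__ void mul24_demo(const int *a, const int *b, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        /* __mul24 multiplies the low 24 bits of each operand and
           returns the low 32 bits of the product. */
        out[i] = __mul24(a[i], b[i]);
    }
}
```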
It didn’t make much sense before, to be honest. It would be a pretty fortuitous defect distribution that would leave exactly the right number of working double precision ALUs in each MP and leave the rest of the die otherwise intact…
Working on the theory that there is currently only one Fermi die (i.e. all GF100s are born equal), it seems much more likely that the strategy long used for OpenGL acceleration on Quadro boards has now been extended to the compute APIs on Fermi: if you want the full feature set, buy the professional board.
I would think that having one 480-core device (Fermi) instead of two 240-core devices (Tesla / GTX 295) is a big win for many developers who aren’t investing in distributing their application across devices.
Are the DP FPUs on the Fermi chip deliberately turned off or destroyed for the consumer GTX 470/480 chips, or is this a yield issue, where otherwise good Fermi chips with some faulty DP FPU units are then salvaged by putting them in the consumer gaming cards?
Or, in other words, are you all deliberately making your Fermi chips less powerful than they are, or is this a question of availability of fully functioning chips? If the former, how should someone on a computational budget most effectively spend their dollars? If the latter, can we expect improvements in the process to eliminate this issue in the future?
It’s possible (albeit a little farfetched) to imagine that the ALU is the fabrication weak link on the chip, and that many of the otherwise good chips have ALU fault levels varying from, say, 30-60%, so by shutting down 75% of them, all of those chips are made usable. However, I thought that each ALU is specific to a CUDA core or a group of cores, and not free floating.
But I suspect there’s more truth in your latter point, and that NVIDIA’s approach is “We’ve put X $$$ into CUDA development, and must sell computational cards at $Y to make that division profitable. And we’ve spent $A on development for gaming, so the gaming cards must sell at $B for profitability, given our projected sales.” (quotes mine.) I wish they would realize that HPC is a smaller market, with more cost-conscious consumers, and try to make that division merely break even.
It’s tempting to go out and buy one share of NVIDIA stock, simply to get their annual reports mailed to me. (I know people do this with Berkshire Hathaway, to get the inside info, even though one share runs about $10,000. People also do this with Sam’s Club, since you can shop there if you’re a stockholder.) Perhaps then I could get more inside info. Couple it with one share of AMD to monitor the whole market.
As far as I understand, a DP unit is actually two SP units with some extra logic, so there are no separate DP units anymore as there were on GT200. That is why only one DP warp issues every 2 clock cycles, while two SP warps issue every 2 clock cycles, on each Fermi multiprocessor.
Since Fermi’s DP is the real deal (full IEEE 754-2008, not an approximation), it’s likely more accurate to say that it’s fundamentally a DP unit with extra logic that alternatively lets it produce two SP results in parallel. I think the extra computational bits are also cleverly used to implement FMA.
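The single-rounding FMA mentioned here is exposed directly in CUDA device code; a minimal sketch contrasting it with a separately rounded multiply-then-add (kernel and array names are mine):

```cuda
__global__ void fma_compare(const double *a, const double *b,
                            const double *c, double *fused,
                            double *separate, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        /* One rounding step: IEEE 754-2008 fused multiply-add. */
        fused[i] = fma(a[i], b[i], c[i]);
        /* Two rounding steps: product rounded, then sum rounded. */
        separate[i] = __dmul_rn(a[i], b[i]) + c[i];
    }
}
```

On inputs where the intermediate product cannot be represented exactly in a double, the two results can differ in the last bit, which is exactly the precision advantage the comment refers to.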