Nice to see that the Tesla C2050/70 will stay in the same power envelope as the current C1060 (190W). Interesting that they are now quoting only double precision FLOPS (520-630 GFLOPS). Based on the previously announced 512 stream processors, the 50% rate for double precision and the factor of 2 for the FMAD, that suggests the shader clock will be between 1.0 and 1.25 GHz. Also, the Tesla now gets a video connector.
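A quick back-of-the-envelope check of that clock estimate, using the rumoured 512 cores, half-rate double precision and 2 flops per FMAD (none of which NVIDIA has officially broken down):

```
#include <cstdio>

int main() {
    const double cores   = 512.0;   // rumoured stream processor count
    const double dp_rate = 0.5;     // double precision at half the single precision rate
    const double fma     = 2.0;     // one fused multiply-add counts as two flops
    const double dp_gflops[2] = { 520.0, 630.0 };   // quoted DP range

    for (int i = 0; i < 2; ++i) {
        // GFLOPS = cores * dp_rate * fma * clock(GHz)  =>  solve for the clock
        double clock_ghz = dp_gflops[i] / (cores * dp_rate * fma);
        printf("%.0f DP GFLOPS -> ~%.2f GHz shader clock\n", dp_gflops[i], clock_ghz);
    }
    return 0;
}
```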
Prices are also quoted (surprising this far ahead of time): $2499 for the 3 GB card and $3999 for the 6 GB card. The statement is also made that GeForce cards based on Fermi will come first (Q1 2010), with Tesla in Q2.
(Here’s hoping for a GTX 380 with 1.5 GB of RAM running at 1.25 GHz for $400 in 3 months…)
I was also surprised that they announced prices, but disappointed by the Q2 shipping date. I hope the GeForce version can come out earlier (before the end of February) so that we can test the new CUDA 3.0 features on Fermi. The power consumption is pretty nice. It looks like they managed to find a balance between performance and watts. I can’t wait to get one!!!
What is perhaps more interesting is that the otherwise perfectly accurate leaked information about Fermi published elsewhere suggested that the tape-out performance targets were 750 double precision GFLOPS and 1.5 single precision TFLOPS. The fact that they have come in lower than that implies that maybe TSMC’s 40 nanometre process isn’t working out as well as hoped and shader clocks are lower than the design goal.
That’s curious. Does it mean single precision will be barely above 1 TFLOPS?
I know how fallible FLOPS are as a performance metric, but from a marketing POV this is really bad. It makes the cards look only marginally better than the old ones, and much slower than AMD’s (~2 TFLOPS AFAIK).
That is certainly what it sounds like. Of course, a counter-argument could be made that with a flat memory model, cache, multiple kernel support and all the other new stuff, the computational efficiency of Fermi will be a lot better than either the GT200 or the comparable AMD part. But the extrapolated headline single precision number does look rather modest.
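For what it’s worth, the extrapolation is simple enough, assuming 512 cores, 2 flops per clock per core via FMA, and the 1.0-1.25 GHz clock range inferred above (none of these are confirmed numbers):

```
#include <cstdio>

int main() {
    // 512 cores * 2 flops per clock (FMA) * shader clock in GHz
    printf("at 1.00 GHz: %.0f SP GFLOPS\n", 512 * 2 * 1.00);   // ~1024
    printf("at 1.25 GHz: %.0f SP GFLOPS\n", 512 * 2 * 1.25);   // ~1280
    return 0;
}
```

So yes, "barely above 1 TFLOPS" is exactly what the quoted DP numbers imply, unless the consumer parts clock higher.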
It might also be that the non-ECC version of the core can run at higher memory and shader clocks, so the consumer GPU versions might well be considerably faster. The die size is the biggest concern, though. The rumour sites have it pegged at about 23x23 mm, i.e. roughly 530 mm^2, which is gigantic. It can’t be a cheap die to fab, probably considerably more expensive than the GT200, despite the transition to TSMC’s 40 nanometer process.
Well, who actually exploited that teraflop on the old cards? If it is actually more reachable on the new cards, then they could still come up with something like Quantiflops, as AMD did with QuantiSpeed when they bailed out of the GHz race.
I don’t know anything about CUDA or Tesla [other than that it’s used for high-end design work and in hospitals], but that seems a bit pricey to me :blink: .
Seems to me that nVidia are opting for a more elegant solution and trying to squeeze every ounce of potential out of their card; then again, look at the die size :huh: .
That’d be your problem then! The S20x0 comprises 4 GPUs, each with more memory than a GeForce, and each certified to a much higher level. The markup is probably higher than on GeForces, but the production runs are probably lower. If you compare it to anything else in the HPC market it’s really quite cheap!
Thanks for the enlightenment Tigga :). Yes, I see from the links that these cards have multiple GPUs with up to 6 GB of GDDR5 per GPU :blink: . I guess it does work out as good value; I suppose it’s because I’m used to thinking of things in terms of GeForce and the 3D games arena.
Some clarification here: the S2050/70 are 1U rackmount enclosures, each with four C2050/70 cards inside. That’s why the price is a little more than 4x the cost of one C2050/70.
Keep in mind that lots of people do CUDA work with GeForce cards, which have similar performance, less memory, and less quality assurance (and way lower cost). In the GeForce 8 and GT200 eras, the associated Tesla cards used the same GPU as the high end GeForce card. It remains to be seen if this will hold true for the Fermi generation of cards.
If you are willing to tolerate a card failure once in a while, GeForce + CUDA is a nice match. :) (If you are building a large cluster, where dealing with the QA on a hundred GeForce cards is not cost effective, then Tesla is a good choice. Or if you need a LOT of GPU memory…)
I am guessing that the consumer cards won’t have ECC memory support. Whether NVIDIA fab a simplified memory controller, or whether they just don’t QA and connect the on-die ECC circuits, is open to speculation, but it would certainly be one way to lower costs and potentially die size. On the other hand, the rumour sites (which have been mostly accurate) haven’t been talking about taping out anything other than the Fermi die, although apparently there are some other lower-power designs which have taped out this quarter and which should see the light of day next year.
Now you’ve got me interested; I may look into some of this CUDA business. It would be nice to put my hardware to good use… There’s only so many times you can play Crysis ^_^
For me the point of the new Fermi is not its peak theoretical performance.
The key points are:
Easier porting of CPU C (or C++) source code to the GPU without having to consider the underlying architecture (registers, shared memory, …)
C++ support
Ability to effectively reach the peak TFLOP performance level
Real support for double-precision arithmetic, which is mandatory for many problems
If you take a look at the other OpenCL GPU vendor, you will see that its newest architecture cannot cope with these 4 points in real-world applications, but will only shine on pure MAD GFLOPS benchmarks. Exactly as their 4xxx series shone compared to the GeForce 8xxx, 9xxx etc., but was unable to cope with real-world OpenCL code.
To sum up, Fermi will enable many more developers to use OpenCL technology (derived from CUDA) and port their existing code to the GPU with great real-world performance. And that’s invaluable.
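To make the first and last points concrete, here is roughly what a straight port of a double-precision CPU loop looks like in CUDA: the kernel body is essentially the original loop body, and on Fermi the doubles are supposed to run natively instead of being emulated or heavily cut down. (This is just my own sketch for illustration, not NVIDIA sample code, and error checking is omitted.)

```
// Double-precision axpy (y = a*x + y), ported almost line-for-line from the CPU version.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // same statement as the CPU loop body
}

int main() {
    const int n = 1 << 20;
    double *x, *y;
    cudaMalloc((void **)&x, n * sizeof(double));
    cudaMalloc((void **)&y, n * sizeof(double));
    // ... fill x and y, e.g. with cudaMemcpy from host arrays ...

    daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);
    cudaThreadSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```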
AMD’s OpenCL support is out and it has rough edges… painful but expected for a first release.
R700 performance is worse than anyone expected. It’s really terrible.
The reason is local memory (like CUDA shared memory). The R700’s local memory behavior makes it unsuitable for OpenCL’s local memory use… basically, a thread’s writes aren’t visible to its neighbors. Reads are OK.
So R700 maps all local memory accesses to device memory, and now your one-clock reads and writes get latencies of 200-500 clocks and may also have throughput/bandwidth issues. Some algorithms that don’t depend on local memory work OK, but the majority of programs are just crap… the CPU emulator beats even high-end R700 cards.
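To make it concrete, this is the kind of pattern that’s at stake, written in CUDA terms since that’s what most of us here know: each thread writes its own slot of on-chip memory and then, after a barrier, reads slots written by other threads. According to the above, it’s exactly those cross-thread reads of freshly written data that R700’s local memory can’t provide, hence the fallback to device memory. (My own illustrative sketch, not vendor code.)

```
#include <cstdio>
#include <cuda_runtime.h>

// Block-wide sum reduction: each thread stages one element in shared (local)
// memory, then reads elements written by *other* threads after a barrier.
__global__ void block_sum(const float *in, float *out) {
    __shared__ float tile[256];                      // assumes 256 threads per block
    int tid = threadIdx.x;

    tile[tid] = in[blockIdx.x * blockDim.x + tid];   // write own slot
    __syncthreads();                                 // make the writes visible

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];         // read a neighbour thread's write
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = tile[0];
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = n / threads;
    float *in, *out;
    cudaMalloc((void **)&in, n * sizeof(float));
    cudaMalloc((void **)&out, blocks * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));            // placeholder input

    block_sum<<<blocks, threads>>>(in, out);
    cudaThreadSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```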
It’s a hardware issue, not a driver problem. R800 is unaffected.
This really sucks for AMD since most of their installed base is R600 (with no OpenCL support), and the remainder is R700. R800 has almost no installed base yet… it’s only a month old and cards are scarce.
To be more exact: all threads may read from the whole local memory, but each thread only has a designated region to which it can write. That this local memory has to go unused in OpenCL is unfortunate because, while less useful than the local memory we’ve had on Nvidia chips since G80, it would still be much better than not having local memory at all.
Of your points, only the one about C++ support is valid. I know of people who have had to program for both vendors, and at least for what they were doing, RV770 was often showing better performance than GT200 (not using OpenCL but AMD’s proprietary language). Keep in mind that the AMD/ATI chips don’t have to reach their peak FLOP rate to be at least as effective as Fermi. And with double precision it is actually realistic for them to reach the peak FLOP rate (since then they do not depend on the compiler being able to map code to effective VLIW instructions). In terms of hardware I don’t see them much behind Fermi, maybe even ahead performance-wise.
But the gist of the matter is that Nvidia offers very decent developer support and actually takes the HPC business seriously. There is a lot more to it than the performance of the bare hardware.
I agree. At the end of the day, the peak FLOPS you see in the marketing mean nothing. What does matter is whether the hardware is “simple” enough and whether you have the development tools that will let you get the performance out of the chip. This seems to be what Nvidia understands.