Nice to see that the Tesla C2050/70 will stay in the same power envelope as the current C1060 (190W). Interesting that they are now quoting only double precision FLOPS (520-630 GFLOPS). Based on the previously announced 512 stream processors, the 50% rate for double precision and the factor of 2 for the FMAD, that suggests the shader clock will be between 1.0 and 1.25 GHz. Also, the Tesla now gets a video connector.
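A quick back-of-the-envelope check of that clock estimate, using the rumoured 512 cores, half-rate double precision and 2 flops per FMAD (none of which NVIDIA has officially broken down):

```
#include <cstdio>

int main() {
    const double cores   = 512.0;   // rumoured stream processor count
    const double dp_rate = 0.5;     // double precision at half the single precision rate
    const double fma     = 2.0;     // one fused multiply-add counts as two flops
    const double dp_gflops[2] = { 520.0, 630.0 };   // quoted DP range

    for (int i = 0; i < 2; ++i) {
        // GFLOPS = cores * dp_rate * fma * clock(GHz)  =>  solve for the clock
        double clock_ghz = dp_gflops[i] / (cores * dp_rate * fma);
        printf("%.0f DP GFLOPS -> ~%.2f GHz shader clock\n", dp_gflops[i], clock_ghz);
    }
    return 0;
}
```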
Prices are also quoted (surprising this far ahead of time): $2499 for the 3 GB card and $3999 for the 6 GB card. The statement is also made that GeForce cards based on Fermi will come first (Q1 2010), with Tesla in Q2.
(Here’s hoping for a GTX 380 with 1.5 GB of RAM running at 1.25 GHz for $400 in 3 months…)
I was also surprised that they announced prices, but disappointed by the Q2 shipping date. I hope the GeForce version can come out earlier (before the end of February) so that we can test the new CUDA 3.0 features on Fermi. The power consumption is pretty nice. It looks like they managed to find a balance between performance and watts. I can’t wait to get one!!!
What is perhaps more interesting is that the otherwise perfectly accurate leaked information about Fermi published elsewhere suggested that the tape-out performance targets were 750 double precision GFLOPS and 1.5 single precision TFLOPS. The fact that they have come in lower than that implies that maybe TSMC’s 40 nanometre process isn’t working out as well as hoped and shader clocks are lower than the design goal.
That’s curious. Does it mean single precision will be barely above 1 TFLOPS?
I know how fallible FLOPS are as a performance metric, but from a marketing POV this is really bad. It makes the cards look only marginally better than the old ones, and much slower than AMD’s (~2 TFLOPS AFAIK).
That is certainly what it sounds like. Of course, a counter-argument could be made that with a flat memory model, cache, multiple kernel support and all the other new stuff, the computational efficiency of Fermi will be a lot better than either the GT200 or the comparable AMD part. But the extrapolated headline single precision number does look rather modest.
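For what it’s worth, the extrapolation is simple enough, assuming 512 cores, 2 flops per clock per core via FMA, and the 1.0-1.25 GHz clock range inferred above (none of these are confirmed numbers):

```
#include <cstdio>

int main() {
    // 512 cores * 2 flops per clock (FMA) * shader clock in GHz
    printf("at 1.00 GHz: %.0f SP GFLOPS\n", 512 * 2 * 1.00);   // ~1024
    printf("at 1.25 GHz: %.0f SP GFLOPS\n", 512 * 2 * 1.25);   // ~1280
    return 0;
}
```

So yes, "barely above 1 TFLOPS" is exactly what the quoted DP numbers imply, unless the consumer parts clock higher.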
It might also be that the non-ECC version of the core can run at higher memory and shader clocks, so the consumer GPU versions might well be considerably faster. The die size is the biggest concern, though. The rumour sites have it pegged at about 23x23 mm, i.e. roughly 530 mm^2, which is gigantic. It can’t be a cheap die to fab, probably considerably more expensive than the GT200, despite the transition to TSMC’s 40 nanometer process.
Well, who actually exploited that teraflop on the old cards? If it is actually more reachable on the new cards, then they could still come up with something like Quantiflops, as AMD did with QuantiSpeed when they bailed out of the GHz race.
I don’t know anything about CUDA or Tesla [other than that it’s used for high-end design work and in hospitals], but that seems a bit pricey to me :blink: .
Seems to me that nVidia are opting for a more elegant solution and trying to squeeze every ounce of potential out of their card; then again, look at the die size :huh: .
That’d be your problem then! The S20x0 comprises 4 GPUs, each with more memory than a GeForce, and each certified to a much higher level. The markup is probably higher than on GeForces, but the production runs are probably lower. If you compare it to anything else in the HPC market it’s really quite cheap!
Thanks for the enlightenment Tigga :). Yes, I see from the links that these cards have multiple GPUs with up to 6 GB of GDDR5 per GPU :blink: . I guess it does work out as good value; I suppose it’s because I’m used to thinking of things in terms of GeForce and the 3D games arena.
Some clarification here: the S2050/70 are 1U rackmount enclosures, each with four C2050/70 cards inside. That’s why the price is a little more than 4x the cost of one C2050/70.
Keep in mind that lots of people do CUDA work with GeForce cards, which have similar performance, less memory, and less quality assurance (and way lower cost). In the GeForce 8 and GT200 eras, the associated Tesla cards used the same GPU as the high end GeForce card. It remains to be seen if this will hold true for the Fermi generation of cards.
If you are willing to tolerate a card failure once in a while, GeForce + CUDA is a nice match. :) (If you are building a large cluster, where dealing with the QA on a hundred GeForce cards is not cost effective, then Tesla is a good choice. Or if you need a LOT of GPU memory…)
I am guessing that the consumer cards won’t have ECC memory support. Whether NVIDIA fab a simplified memory controller, or whether they just don’t QA and connect the on-die ECC circuits, is open to speculation, but it would certainly be one way to lower costs and potentially die size. On the other hand, the rumour sites (which have been mostly accurate) haven’t been talking about taping out anything other than the Fermi die, although apparently there are some other lower-power designs which have taped out this quarter and which should see the light of day next year.
Now you’ve got me interested; I may look into some of this CUDA business. It would be nice to put my hardware to good use… There’s only so many times you can play Crysis ^_^
For me the point of the new Fermi is not its peak theoretical performance.
The key points are:
Easier porting of CPU C (or C++) source code to the GPU without having to consider the underlying architecture (registers, shared memory, …)
C++ support
Ability to effectively reach the peak TFLOP performance level
Real support for double-precision arithmetic, which is mandatory for many problems
If you take a look at the other OpenCL GPU vendor, you will see that its newest architecture cannot cope with these 4 points in real-world applications, but will only shine on pure MAD GFLOPS benchmarks. Exactly as their 4xxx series shone compared to the GeForce 8xxx, 9xxx etc., but was unable to cope with real-world OpenCL code.
To sum up, Fermi will enable many more developers to use OpenCL technology (derived from CUDA) and port their existing code to the GPU with great real-world performance. And that’s invaluable.
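To make the first and last points concrete, here is roughly what a straight port of a double-precision CPU loop looks like in CUDA: the kernel body is essentially the original loop body, and on Fermi the doubles are supposed to run natively instead of being emulated or heavily cut down. (This is just my own sketch for illustration, not NVIDIA sample code, and error checking is omitted.)

```
// Double-precision axpy (y = a*x + y), ported almost line-for-line from the CPU version.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // same statement as the CPU loop body
}

int main() {
    const int n = 1 << 20;
    double *x, *y;
    cudaMalloc((void **)&x, n * sizeof(double));
    cudaMalloc((void **)&y, n * sizeof(double));
    // ... fill x and y, e.g. with cudaMemcpy from host arrays ...

    daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);
    cudaThreadSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```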
AMD’s OpenCL support is out and it has rough edges… painful but expected for a first release.
R700 performance is worse than anyone expected. It’s really terrible.
The reason is local memory (like CUDA shared memory). The R700’s local memory behavior makes it unsuitable for OpenCL’s local memory use… basically, a thread’s writes aren’t visible to its neighbors. Reads are OK.
So R700 maps all local memory accesses to device memory, and now your one-clock reads and writes get latencies of 200-500 clocks and may also have throughput/bandwidth issues. Some algorithms that don’t depend on local memory work OK, but the majority of programs are just crap… the CPU emulator beats even high-end R700 cards.
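To make it concrete, this is the kind of pattern that’s at stake, written in CUDA terms since that’s what most of us here know: each thread writes its own slot of on-chip memory and then, after a barrier, reads slots written by other threads. According to the above, it’s exactly those cross-thread reads of freshly written data that R700’s local memory can’t provide, hence the fallback to device memory. (My own illustrative sketch, not vendor code.)

```
#include <cstdio>
#include <cuda_runtime.h>

// Block-wide sum reduction: each thread stages one element in shared (local)
// memory, then reads elements written by *other* threads after a barrier.
__global__ void block_sum(const float *in, float *out) {
    __shared__ float tile[256];                      // assumes 256 threads per block
    int tid = threadIdx.x;

    tile[tid] = in[blockIdx.x * blockDim.x + tid];   // write own slot
    __syncthreads();                                 // make the writes visible

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];         // read a neighbour thread's write
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = tile[0];
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = n / threads;
    float *in, *out;
    cudaMalloc((void **)&in, n * sizeof(float));
    cudaMalloc((void **)&out, blocks * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));            // placeholder input

    block_sum<<<blocks, threads>>>(in, out);
    cudaThreadSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```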
It’s a hardware issue, not a driver problem. R800 is unaffected.
This really sucks for AMD since most of their installed base is R600 (with no OpenCL support), and the remainder is R700. R800 has almost no installed base yet… it’s only a month old and cards are scarce.
To be more exact: all threads may read from the whole local memory, but each thread only has a designated region to which it can write. That this local memory has to go unused in OpenCL is unfortunate because, while less useful than the local memory we’ve had on Nvidia chips since G80, it would still be much better than not having local memory at all.
Of your points, only the one about C++ support is valid. I know of people who have had to program for both vendors, and at least for what they were doing, RV770 was often showing better performance than GT200 (not using OpenCL but AMD’s proprietary language). Keep in mind that the AMD/ATI chips don’t have to reach their peak FLOP rate to be at least as effective as Fermi. And with double precision it is actually realistic for them to reach the peak FLOP rate (since then they do not depend on the compiler being able to map code to effective VLIW instructions). In terms of hardware I don’t see them much behind Fermi, maybe even ahead performance-wise.
But the gist of the matter is that Nvidia offers very decent developer support and actually takes the HPC business seriously. There is a lot more to it than the performance of the bare hardware.
I agree. At the end of the day, the peak FLOPS you see in the marketing mean nothing. What does matter is whether the hardware is “simple” enough and whether you have the development tools that will let you get the performance out of the chip. This seems to be what Nvidia understands.