Fermi? Sounds interesting...

Looks like press reports are starting to come out for the next CUDA architecture, which NVIDIA is calling “Fermi”:

http://www.pcmag.com/article2/0,2817,2353608,00.asp

The article is light on low-level details, but we do learn:

  • 512 stream processors
  • Fermi gets a 384-bit memory interface (presumably GDDR5, to ensure it still has more bandwidth than the 512-bit interface we have now on the GTX 285)
  • Double precision is 8 times faster, which is hard to parse given that we don't know the clock rate for Fermi. One possible interpretation is that each multiprocessor can now execute 2 or 4 double precision instructions per clock, rather than just 1 as now.
  • We’re getting a multi-level cache system like Larrabee. Presumably this also contributes to the better C++ support. (although perhaps there are also new PTX features to make stuff like vtables work better)
  • Multiple kernels running on the same device.

Let’s just say I’m excited, though curious how much of this will trickle down to consumer cards. I am amazed (in a good way) that you can get all the compute capability of a Tesla in a $350 GTX 285 card. I hope that trend continues…

Yeah, "double precision is 8 times faster" sounds exciting for scientific computation to me too.

Me too! What I choose to do for graduate work depends heavily on whether double precision floating point is available.

Anandtech (as usual) has a much better discussion of the new architecture:

http://www.anandtech.com/video/showdoc.aspx?i=3651 ("NVIDIA's Fermi: Architected for Tesla, 3 Billion Transistors in 2010")

More good stuff:

  • Fully IEEE spec floating point
  • 32-bit integer multiplies at the same rate as single precision MAD
  • double precision at half the rate of single precision (!!!)
  • Instead of 8 processors per multiprocessor, there are now 32. (whoa, keeping that pipeline full could be tough…)
  • Shared memory and a real L1 cache now inhabit the same 64 kB space on each multiprocessor. You can select either 16 or 48 kB shared memory, and the rest goes to L1. (a configuration sketch follows this list)
  • 768 kB L2 cache + faster atomic memory ops (no need to go off-chip)
  • Scheduling is still done at the half-warp level (odd…)
  • 10x faster switching between host contexts (yay for multiuser systems)
  • ECC for Tesla (good, Tesla needs some more differentiation from the GeForce)
  • Unified 64-bit address space. All the different memory types (local, shared, global) are given separate address blocks.
  • Support of C++ features like virtual functions, exceptions(!!!), new/delete (!!!) on the device
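On the shared memory / L1 bullet above: here is a minimal sketch of how a per-kernel split might be selected, assuming the runtime exposes some kind of cache-preference call. The cudaFuncSetCacheConfig name and both kernels here are my own invention for illustration, not anything NVIDIA has announced.

```cpp
// Sketch only: assumes the runtime exposes a per-kernel cache preference
// call (called cudaFuncSetCacheConfig here) for choosing between
// 48 KB shared / 16 KB L1 and 16 KB shared / 48 KB L1 on each SM.
#include <cuda_runtime.h>

// A kernel that stages data in a big shared-memory tile.
__global__ void tileKernel(const float* in, float* out)
{
    __shared__ float tile[12288];                 // 48 KB of shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[threadIdx.x] * 2.0f;
}

// A pointer-chasing kernel that would rather have the bigger L1.
__global__ void gatherKernel(const int* idx, const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[idx[i]];
}

void configureCaches()
{
    // Ask for mostly shared memory for the tiled kernel, mostly L1 for the
    // gather kernel; the hardware would reconfigure the 64 kB per SM.
    cudaFuncSetCacheConfig(tileKernel,   cudaFuncCachePreferShared);
    cudaFuncSetCacheConfig(gatherKernel, cudaFuncCachePreferL1);
}
```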

This sounds absolutely amazing… Where do I get one? :)

Just back from the first day of the conference.
There were a lot of small mentions of Fermi's features but no specific tech presentation.

But the chip diagram Jen-Hsun showed had 32 SPs per SM.

It is speculation, pure speculation on my part, this is not confirmed, don't trust this, but the phrase "more flexible SMs" was used once, and putting this tiny mention together with the 32 SPs makes me think that maybe SMs now have 64K of shared memory and not 16K. That is still the same amount of shared memory per SP. The net effect would indeed be more flexible designs, since current kernels would run fine, just with more blocks per SM, but you'd have the option of using more shared memory (at the cost of fewer blocks per SM). Please note again this is my personal speculation entirely… It could also be that "shared memory" is now just gone and it's a per-SM cache of the new flat address space.
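To make the "more flexible" part concrete (still entirely speculative, same as above; the 48 KB figure below is just an example size I picked):

```cpp
// Pure speculation, matching the guess above: existing kernels keep their
// current shared-memory usage and simply get more resident blocks per SM,
// while a new kernel could ask for a much larger dynamic allocation.
__global__ void bigTileKernel(const float* in, float* out, int n)
{
    extern __shared__ float buf[];                // sized at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buf[threadIdx.x] = in[i];
        out[i] = buf[threadIdx.x];
    }
}

// Hypothetical launch asking for 48 KB of shared memory per block --
// legal only if an SM really has more than today's 16 KB to hand out.
// bigTileKernel<<<blocks, 256, 48 * 1024>>>(d_in, d_out, n);
```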

Jen-Hsun also mentioned “C++ support.” He didn’t explain what this meant. This makes me think that it means virtual functions and function pointers… which would make sense if the memory address space is now merged and flattened. It’s hard to do with G200 since a class in registers doesn’t HAVE an address, so you can’t do the indirections.
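Purely as an illustration of that guess, device code along these lines would become possible; nothing about the feature or the syntax is confirmed, and the class here is made up:

```cpp
// Illustration only: what device-side virtual dispatch might look like if
// "C++ support" means vtables and function pointers work on the GPU.
struct Shape {
    __device__ virtual float area() const = 0;
};

struct Circle : public Shape {
    float r;
    __device__ Circle(float r_) : r(r_) {}
    __device__ virtual float area() const { return 3.14159265f * r * r; }
};

__global__ void areaKernel(float* out, float radius)
{
    // The object has to live somewhere addressable (local/global memory)
    // so the vtable pointer and the indirect call have something to point
    // at -- exactly the "registers don't have addresses" problem on G200.
    Circle c(radius);
    Shape* s = &c;                 // indirection through a base pointer
    out[threadIdx.x] = s->area();
}
```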

A question I'll try to find out is whether the memory address space is flat 64 bit or 32 bit. I would hope it's 64 bit, even with the painful size of the pointers, since we're hitting 32-bit limits hard already. It'd also allow nicer memory mapping games with the CPU since you have so much address flexibility. Larrabee is 64-bit addressing (only).

The double support was full IEEE-754/2008.

Another surprise… the ECC support is NOT for just device memory. It’s also applied at all caches, to shared memory, and to registers too! I never knew ECC would even apply to registers, but I guess it makes sense.

Jen-Hsun said the first silicon was delivered "only a few days ago" and praised the engineers for getting it working so fast, "though this isn't at full speed yet, so it's only 6X and not 8X faster." It was unclear whether he meant the drivers needed more tuning for the new circuit timings, etc., or whether the silicon itself was slow, but I think he meant the low-level BIOS timings since he mentioned it in the context of getting the software running.

I think the other speakers talked about watts per flop more than anything else. Everyone on the high end is completely obsessed with electricity usage, which is killing them. It makes me think work has gone into Fermi's energy use as well.

Finally, and this is an aside, about 1/3 of the keynote was in 3D. It was surprisingly effective… not just a gimmick. Even Jen-Hsun’s slides were in 3D and it was used nicely. The 3D live video projection of Jen-Hsun on the side screens as he spoke was especially vivid!

Also a nice thumbs up to the RTT guys for a great augmented reality live demo of adding virtual rims and brake shoes to a Ferrari tire. It was excellent, with Jen-Hsun moving the tire and a hand-held light and in video it really, truly, looked like the tire’s rims were there. It was so well done it was hard to understand what he was demoing at first until you saw his REAL tire was rimless.

Later, a Tegra demo of augmented reality made me realize that Tegra may be a big deal too; I had always kind of ignored it since cell phones aren't my interest.

Looks like that Anandtech article answered a lot of my fuzzy speculations. And it’s all good news except having to wait for the cards!

Seibert, Thanks for the link.

So there are no texture processing clusters anymore (I heard there were 16 units for a 512 proc card)

Then we really need to hope for MIMD capability and multiple instruction decoders per SM, as the ratio of instruction decoders to processors will otherwise go down.

Multiple kernels running at the same time is also big news for me: we have a lot of processing kernels that individually don't have enough work to fill the card, but we could run a bunch of them at the same time to load it up a bit more.
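As a host-side sketch (assuming the concurrency ends up riding on the existing stream API, which is just my guess), it could look like this:

```cpp
// Sketch only: assumes concurrent kernel execution ends up being exposed
// through the existing stream API, so small kernels that each underfill
// the card can be launched into separate streams and overlap.
#include <cuda_runtime.h>

__global__ void smallKernelA(float* d) { d[threadIdx.x] += 1.0f; }
__global__ void smallKernelB(float* d) { d[threadIdx.x] *= 2.0f; }

void launchConcurrently(float* dA, float* dB)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Neither launch fills the card on its own; with concurrent kernel
    // execution they could run side by side instead of back to back.
    smallKernelA<<<4, 128, 0, s0>>>(dA);
    smallKernelB<<<4, 128, 0, s1>>>(dB);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```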

One thing I noticed though is that there are fewer threads resident at the same time on Fermi than on GT200 (16 × 1536 = 24576 vs. 30 × 1024 = 30720), but I guess there is less need for that with the caches.

Hey, found more tech details from NV in a whitepaper overview.

And David Kanter’s excellent take!
"Inside Fermi: Nvidia's HPC Push" on Real World Tech

Wooha, this is not just an incremental improvement over the previous architecture. This is a good reason for me to skip the upgrade from G80/G92 to GT200 and wait for Fermi ;)

The renowned German c’t magazine reasons that (assuming the same clock rate as the Tesla C1060) the theoretical peak flop rate should reach 2 Teraflop/s (SP) and 630 GFlop/s (DP), in comparison to the 933 GFlop/s (SP) and 75 GFlop/s (DP) of the C1060.

From my point of view, it would be interesting to know when this tech makes it into laptop chips in some scaled down form. It’s still hard to get an affordable GT200 architecture based laptop graphics chip these days.

Christian

Thanks for sharing the links.

some features are attractive to me, including

  1. ECC support

  2. 64-bit memory addressing

  3. double precision at half the rate of single precision

  4. fused multiply-add (FMA) to maintain accuracy (see the sketch after this list)

  5. a newly designed integer ALU that supports full 32-bit precision for all instructions, consistent with standard programming language requirements, so modulo operations can be faster

  6. six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory (bigger memory than the Tesla C1060)
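On item 4, a minimal sketch using the fmaf() device math function: a fused multiply-add rounds a*b+c once instead of twice, which is where the accuracy benefit comes from. The kernel itself is a made-up single-thread loop, just for illustration.

```cpp
// Minimal illustration of item 4: fmaf(x, y, z) computes x * y + z with a
// single rounding step, whereas a separate multiply and add round twice.
// One thread does the whole loop here purely to keep the example short.
__global__ void dotKernel(const float* a, const float* b, float* result, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < n; ++i)
            sum = fmaf(a[i], b[i], sum);   // fused: one rounding per step
        *result = sum;
    }
}
```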

Question: the whitepaper mentions "Out of Order thread block execution" but never explains it. Does anyone have any idea what "Out of Order thread block execution" means?

I cannot wait to buy one, but I hope the price is kept on the ground rather than lifted into the heavens, even though "Last quarter the Tesla business unit made $10M".

Does 64-bit addressing mean we will have access to as much system memory as we like from within a single kernel execution via zero-copy memory? That could be quite interesting for visualising very large volumetric datasets, for instance, although the fact that you can't write into 3D textures or invalidate the texture cache from inside a kernel might still be a problem.
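For reference, a minimal sketch of what the zero-copy path already looks like with mapped pinned memory; a flat 64-bit space would presumably just remove the current size limits. The volume kernel here is a placeholder, not real sampling code.

```cpp
// Sketch of the zero-copy path as it exists today: pinned host memory is
// mapped into the device address space and read directly from a kernel.
// A flat 64-bit address space would let the same trick scale to much
// larger datasets. The "sampling" here is just a placeholder copy.
#include <cuda_runtime.h>

__global__ void sampleVolume(const float* volume, float* out, size_t n)
{
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        out[i] = volume[i];
}

int main()
{
    const size_t n = (size_t)256 * 256 * 256;   // a modest volume for the demo
    cudaSetDeviceFlags(cudaDeviceMapHost);      // must precede any other CUDA call

    float *h_volume, *d_volume, *d_out;
    cudaHostAlloc((void**)&h_volume, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_volume, h_volume, 0);
    cudaMalloc((void**)&d_out, n * sizeof(float));

    sampleVolume<<<1024, 256>>>(d_volume, d_out, n);  // reads host memory over PCIe
    cudaDeviceSynchronize();

    cudaFree(d_out);
    cudaFreeHost(h_volume);
    return 0;
}
```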

Looks like this "8x" is a bit of a marketing thing. Originally it said "up to 8x faster double precision", and as I understand it, double precision on GT300 is 4x faster than on GT200, but NVIDIA implemented FMAD for DPFP and that's how the "up to 8x" figure was born. The DPFP speed is impressive anyway.

Holy crap, I got all excited!

Now I only hope the power requirements and the price won't be too high.

“Shared memory and a real L1 cache now inhabit the same 64 kB space on each multiprocessor. You can select either 16 or 48 kB shared memory, and the rest goes to L1.”

Can someone clarify what that means for constant memory and the size of registers?
Thanks.

Will Fermi eventually replace the Tesla brand? Or is Fermi just an internal code name…

OK, I want Fermi, who doesn't. So I will want to know what the factors are that will impact building a Fermi box. I have learned a lot building CUDA boxes - namely that it's not that easy to find a motherboard, power supply, and case that will let you pack a lot of Teslas or G285s in one box.

At the moment the most I can do is three Teslas or G285s on the Asus P6T7 motherboard, otherwise the dual size cards obstruct and squash headers on the edge of the motherboard. I wouldn’t have enough power cables for a fourth, since I’m using a Thermaltake Toughpower 1200W supply. And it was annoying to fit that into the ABS/Canyon 595 case.

Then there is the issue of how many x16 PCI-E 2.0 slots can live on the motherboard.

Obviously, the various costs of CUDA computation go down the more CUDA cards can be stuffed in a single box; otherwise one card per box is probably the fastest (until they can virtualize a bigger CUDA computation automatically across multiple cards).

The gaming sites pretty much all conclude that 3-way and 4-way SLI isn’t really that important, and it’s obvious why game designers aren’t going to rush to develop games that require that sort of system to play well. So the PC gamers (which seem to drive the market up to now - although the threat from consoles is clear) aren’t going to demand a lot in this direction, especially in the current economic situation. So I expect fairly limited improvement in this direction during the months we will be waiting for Fermi to become available.

So what are people thinking of using for case, motherboard, and power supply? And how many Fermis per box are they aiming at?

512 cores SP for Fermi means 256 ‘cores’ DP.

240 cores SP for GT200 means 30 ‘cores’ DP.

256/30 > 8.

DP FMAD is already there on GT200, SP FMAD is new for Fermi.

Then I’m curious why they haven’t used “over 8.5x” instead of “up to 8x” :).

Yeah, right, my mistake, I’ve got confused with MAD vs FMAD.

The new dual-issue thing is also not so clear to me.

But peak performance is listed as 512 FMA ops/clock, so does "dual issue" not mean "two instructions per clock"? Or is FMA an exception, and is it possible to perform, for example, two SPFP MULs within a single clock?