Fermi? Sounds interesting...

Has anyone noticed this card has SLI fingers on it?

And a Tesla card with a DVI output?

Check out the discussion here; no one caught it till the 3rd page, so read through:

http://forums.slizone.com/index.php?showtopic=39522&st=0

The DVI output is nice, since fast motherboards with large numbers of x16 PCI-E2 slots don’t have onboard graphics, so you can’t just pack C1060 cards into all of them. That leads to the annoying problem that many single-width PCI-E2 graphics cards have heat sinks which protrude too far off the back of the card to fit next to a Tesla. And even if (like mine) the machine is a CUDA server where you only need that graphics card to configure the BIOS, install the OS, etc., you still have to stick it in there at some point.

I see some people expecting that Fermi will be less than 300 W per card. That means a four-Fermi CUDA server might be feasible. Now if only I knew of a 1200-1500 W single-rail power supply.

If you are speaking of a 1200-watt PSU with a single 12-volt rail, there are a couple, I believe…

1200 watt 100 amp single rail…

http://www.newegg.com/Product/ProductRevie…N82E16817815005

I would not read too much into a prototype board being held up during a talk. They build those cards in configurations designed for testing, not necessarily anything like what goes to market. When NVIDIA was sending out samples of the GT200 boards, some of us got 192 SP, 1 GHz, 512-bit bus, 1 GB cards. That was a GPU configuration that never made it into a shipping product.

The specifications say it has four 12V rails. It also has no more outputs than my Thermaltake Toughpower 1200W, so I don’t see how it will do four big PCI-E2 cards.

I have the same PSU you have, and sorry about that, I dropped the wrong link… But I will dig around and see if I can find something better… This is the link I meant to leave you:

http://www.pcpower.com/power-supply/turbo-cool-1200.html

That looks more like it.

awesomeness…!!

Any tentative date when this baby hits the market?

Santa Claus would know.

But my guess is he will have to ask the Easter Bunny, or perhaps even the mid-summer faeries to be sure…

Any Santas at NVIDIA who can "HINT" at the release date (at least the month) and/or approximate price ;) ?

I wouldn’t be much surprised if they themselves didn’t know the precise release date with much certainty. And I’m pretty sure they wouldn’t disclose anything here before making an official statement elsewhere, just like with this Fermi hardware.

You don’t need to have all the Tesla cards plugged in while installing the OS; you can configure it, then plug in the cards and SSH into the server ;-)

I actually tried that.

Oddly enough, the BIOS stops somewhere when I add the fourth card. Possibly it can tell there is no graphics card? I can’t easily figure out what the issue is when I can’t see it. Needless to say, putting a graphics card back in prevents the hang-up from repeating.

What BIOS are you using?

If NVIDIA wants to make Fermi more like a CPU, why don’t they put in a larger cache? Is it technically impossible? I am curious since current CPUs can have cache sizes of several MB.

Caches are static RAM, which needs multiple transistors per bit to implement (typically six for SRAM), so they consume a lot of die space. Considering that in modern CPUs the caches easily take two thirds of the available die space, NVIDIA may have chosen the less expensive route here.

CPUs depend a lot more on the availability of caches to obtain peak performance, whereas on a GPU they are a bit less critical for decent performance (in CUDA you can usually achieve the same with clever programming and by making best use of the available 16 kB of user-managed cache, i.e. shared memory).

For graphics applications you have good locality in the cache accesses anyway, which means the caches can be smaller. And GPUs also have 5-10 times higher memory bandwidth to compensate for the smaller caches.
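To make the "user-managed cache" point concrete, here is a minimal sketch; the kernel name, tile size and the 3-point average are made up for illustration, and it assumes the block size equals TILE:

[code]
// Minimal sketch of using shared memory as a user-managed cache: each block
// stages a tile of the input (plus a one-element halo on each side) into
// shared memory, then every thread reads its three neighbours from the tile
// instead of from global memory. Launch with blockDim.x == TILE.
#define TILE 256

__global__ void smooth3(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                        // +1 leaves room for the left halo

    tile[lid] = (gid < n) ? in[gid] : 0.0f;           // main element
    if (threadIdx.x == 0)                             // left halo
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)                // right halo
        tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;

    __syncthreads();                                  // the tile now acts as a small cache

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
[/code]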

Christian

Yes, but now with Fermi here it will be interesting to see how all our codes perform, and what kind of programming strategy will be more suitable: more cache, more shared memory, etc… Exciting times :)

Also, I just read somewhere that we can run 16 kernels at once on Fermi. Is that number only for single precision, or is it valid even for double precision?

Because in double precision we cannot schedule threads from two different warps at once on Fermi, but we can do that with single-precision FP operations. Just curious… ;)

GPUs have also always excelled at tasks where the working set you have to loop over would blow through any cache (well, at least any cache you are likely to see in the next 5 years). That’s why they spend the $$$ on super-fast interfaces to global memory, rather than even more on-chip cache. I expect that trend will continue to be true even for Fermi, but now the extra on-chip cache will improve the case where you need to stream a huge dataset, but keep some random-access data structures close by.

I am especially curious to see how something like histogramming performs with the simple algorithm of atomically incrementing bin counters in global memory. (Histogramming big data sets is probably my most common computational task.) Now that atomic ops can happen on chip in the L2 cache, that might actually be reasonably quick even without doing anything fancy.
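For reference, the "simple algorithm" I have in mind is literally just this; the bin count, input type and grid-stride loop are placeholders for illustration:

[code]
// One global-memory atomic per input element: each thread reads one value,
// treats it as the bin index, and atomically bumps the corresponding counter.
// bins must hold NBINS counters; data values are the bin indices.
#define NBINS 256

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (; i < n; i += stride)            // grid-stride loop over the whole data set
        atomicAdd(&bins[data[i]], 1u);    // on Fermi this should be resolved in the L2 cache
}
[/code]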

If I understand the whitepaper, the multiple kernel thing has nothing to do with single or double precision. There is a top-level scheduler on the chip that issues blocks from active kernels to the multiprocessors. Each multiprocessor then has two local schedulers issuing half-warps into the various load/store, single, double, and special function pipelines. The single/double precision issue just has to do with individual instructions. Individual kernels aren’t double or single precision, but they contain a mix of single/double/integer/SFU/load/store instructions. Issuing a double precision instruction requires both schedulers, reducing throughput temporarily, but that’s it.
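For what it’s worth, from the host side you would presumably express concurrent kernels the same way we already express overlapping copies and compute, i.e. by putting independent kernels into separate streams and letting the hardware decide whether to overlap them. A toy sketch with made-up kernels:

[code]
// Toy sketch of how concurrent kernels are expressed from the host side:
// independent kernels go into independent streams, and the chip-level
// scheduler is free to run blocks from both at once. Kernels are dummies.
#include <cuda_runtime.h>

__global__ void kernelA(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] += 1.0f; }
__global__ void kernelB(float *y, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) y[i] *= 2.0f; }

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    kernelA<<<(n + 255) / 256, 256, 0, s0>>>(x, n);   // may overlap ...
    kernelB<<<(n + 255) / 256, 256, 0, s1>>>(y, n);   // ... with kernelA on Fermi

    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
[/code]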

What I am curious to see is whether blocks from different kernels can run on the same multiprocessor. I can see pros and cons to this:

  • Blocks from different kernels will have different instruction streams and put more pressure on the local instruction cache.

  • Blocks from different kernels will require different amounts of chip resources (shared memory, registers, etc), which means keeping track of more information in each multiprocessor.

  • However, if you have blocks from kernels which require different resources, it’s possible you could make more efficient use of the chip by packing a large and small block onto one multiprocessor where two large blocks would not fit.

Another thing that struck me today: The new cache hierarchy eliminates much of the difference between global, constant and texture memory. Assuming the L1 cache can broadcast efficiently, there is no need to distinguish global and constant memory at all. Texture reads from linear memory are now the same as global memory reads as well. Texturing from a CUDA array is still a little different, since that involves a space-filling curve and potentially some interpolation, but I wouldn’t be surprised if that also uses the L1 cache.
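To illustrate the two cases I mean, here is a toy pair of kernels (names and coefficient count made up): today the first is the "right" way because the constant cache broadcasts, and the question is whether the second now behaves just as well through the L1/L2 hierarchy.

[code]
// Toy comparison: the same polynomial evaluation with coefficients in
// __constant__ memory versus a plain global pointer. Names and the
// coefficient count are made up for illustration.
__constant__ float coeff_c[8];

__global__ void poly_constant(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 7; k >= 0; --k)        // Horner's rule; every thread in the
            acc = acc * x[i] + coeff_c[k];  // warp reads the same constant address
        y[i] = acc;
    }
}

__global__ void poly_global(const float *x, float *y, const float *coeff, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 7; k >= 0; --k)        // same read through a plain pointer,
            acc = acc * x[i] + coeff[k];    // now served by the L1/L2 hierarchy
        y[i] = acc;
    }
}
[/code]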

This brings up two questions:

  • Will there be a way to make global memory reads bypass the cache hierarchy entirely? Large, coalesced, single-use reads do not benefit from being cached and might flush useful things out of the cache. Some way to avoid that would be nice.

  • Are the L1 caches coherent on different multiprocessors? With current CUDA semantics, I’m not sure that it matters as long as atomic operations go direct to the L2…

(I realize these are rhetorical questions since I doubt anyone is allowed to answer them, unless I missed the answer in the whitepaper.)