Building the best CUDA machine: what hardware should be used?

I’m building a highly scalable computing program using CUDA, and I think it will benefit from multiple GPUs tremendously.

I plan to create a CUDA context on each GPU; however, as I understand it, my primary display device shouldn’t be used for GPU computing.

Is 3 CUDA GPUs the best I can expect with today’s technology? The best motherboard I can find is here:
http://www.newegg.com/Product/Product.asp?..N82E16813131146

However, it only has 4 PCI Express slots, one of which will presumably be needed for the primary display adapter (leaving only 3 for CUDA G80s).

Generally, the application I am building benefits most from more streams. In this case I am anticipating I’ll have 128 x 3 = 384 streams to work with, provided we get the CUDA context management working as we want it to.

Is 384 the maximum number of streams I can expect to achieve with today’s technology?

The motherboard you link to will not accept three 8800 cards along with an additional graphics card to be the primary video adapter. All of the 8800 cards I have seen include a very large and heavy heatsink/fan cooling system on top of the GPU and memory. It takes up two slots, with the second slot used as a direct vent outside the case. Installing three cards will cover up every other available slot on that motherboard.
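To check a slot layout on paper before buying parts, the dual-slot constraint can be sketched in a few lines of C++. This is only a rough model: the `Card` struct, the `cards_fit` helper, and the 7-bay layout in the usage note are all hypothetical, and they do not describe the linked motherboard. It assumes each card's cooler covers the bay(s) immediately after the slot it plugs into.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model: expansion bays are numbered 0..total_bays-1.
// A card plugged into bay `bay` with `width` 2 also covers bay + 1.
struct Card {
    int bay;    // bay index of the slot the card plugs into
    int width;  // number of bays the card and its cooler occupy
};

// Returns true if every card fits inside the case and no two cards
// claim the same bay.
bool cards_fit(const std::vector<Card>& cards, int total_bays) {
    assert(total_bays > 0);
    std::vector<bool> used(total_bays, false);
    for (const Card& c : cards) {
        for (int b = c.bay; b < c.bay + c.width; ++b) {
            if (b < 0 || b >= total_bays || used[b]) {
                return false;  // off the board or colliding with another card
            }
            used[b] = true;
        }
    }
    return true;
}
```

For an assumed board with long slots at bays 1, 3, 5 and 6, three dual-slot cards at bays 1, 3 and 5 fit, but they cover bay 6, so a fourth single-slot card there does not.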

That said, if you are planning to use Linux (which is probably a good idea with such an unusual video configuration), you could easily work with the system remotely via ssh. A loss of video interactivity on a headless system won’t matter to you. I think it was reported in another thread that you can use CUDA on Linux without starting X servers for every card. You just need to ensure the kernel drivers are loaded. (I haven’t tested this personally, so you might want to confirm that.)

Your next limiting factor is going to be power for the cards. I think the GTX is rated at peak power usage of 185W, so you’re going to want a power supply that can at least do 3x that. Of course, it will be more subtle than that, since what matters is the amount of power supplied on the rails that power the cards, and not the total power.
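As a back-of-the-envelope check, the arithmetic can be sketched as below. The helper names are made up for illustration, and the sketch assumes the cards draw essentially all of their power from 12 V rails, which you should confirm against the card and PSU specifications.

```cpp
#include <cassert>

// Combined peak draw of `cards` identical GPUs, in watts.
double gpu_peak_watts(int cards, double watts_per_card) {
    assert(cards >= 0 && watts_per_card >= 0.0);
    return cards * watts_per_card;
}

// Current a rail must supply to deliver `watts` at `rail_volts`.
// Assumption: the whole load comes from rails at this voltage (12 V here).
double rail_amps(double watts, double rail_volts) {
    assert(rail_volts > 0.0);
    return watts / rail_volts;
}
```

With the 185 W figure above, `gpu_peak_watts(3, 185.0)` gives 555 W, and `rail_amps(555.0, 12.0)` gives about 46 A on the 12 V side for the GPUs alone, before counting the CPU, drives, and fans.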

I think running 3 cards in one system is going to be a bit of a challenge, but you might get it to work. (If so, let us know what parts you used!) I’m almost certain 4 cards would need some custom hardware.

(Edited to fix power consumption of GTX.)

We’ve had success building two systems based on the Asus P5N32-E SLI motherboard.
They have Intel Core 2 Quad CPUs, 8GB memory, and 3 GeForce 8800GTX cards each.
The hard part of building one of these has been getting our hands on high capacity PSUs.
With all three GPUs and all four CPU cores running flat out, our test system uses 700 watts, as measured by a Kill A Watt meter that we have the machine plugged into. We’re running Linux and I’ve had no trouble running code on all 3 GPUs at once, regardless of whether X was running. I think that with X running you have to keep the GPU runtime of each kernel invocation below 5 seconds, but all of my current test kernels run in 5 seconds or less (per invocation) now, so that hasn’t been a problem.

John Stone

Are there any adapters for PCI Express x16? Something to maybe extend the slot out a bit so I can get the 3 cards installed.

Our development environment is XP, so switching to Linux to get the display-less setup going isn’t a very attractive option. Ideally we would have 3 non-primary displays in XP, on a motherboard with 4 PCI Express x16 slots.

I am still leaning towards 4 PCI Express slots. To be honest, switching to a Linux development environment is unattractive because of our current Windows-based toolset.

I am looking at this here:
http://www.adexelec.com/pciexp.htm

Specifically, the “PE-FLEX16” flexible extender (to help get 3 G80s in one machine). Can anyone please comment on whether this could have an impact on the reliability of the computations?

This motherboard appears to have the right slot layout to fit 3 GeForce 8800 cards, and one single slot video card at the end to be the primary display:

http://www.gigabyte.com.tw/Products/Mother…%20Quad%20Royal

Note that only the blue slots are x16. The black slots (even the full length ones) are x1, so one of your 3 CUDA cards would be handicapped in data transfers to and from the CPU. Another potential problem is that the full length cards would possibly hit the RAM slots (rightmost card in the photo), or cover up the IDE and/or SATA connectors.
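To get a feel for how much the x1 slot would hurt, here is a rough sketch of the transfer-time arithmetic. It assumes first-generation PCI Express at a theoretical peak of about 250 MB/s per lane in each direction. The function names and the 256 MB buffer in the usage note are illustrative only, and real transfers will be slower than these peak numbers.

```cpp
#include <cassert>

// Assumption: PCI Express 1.x, theoretical peak per lane, per direction.
constexpr double kMBPerSecPerLane = 250.0;

// Theoretical peak bandwidth of a link with the given lane count, in MB/s.
double link_mb_per_sec(int lanes) {
    assert(lanes > 0);
    return lanes * kMBPerSecPerLane;
}

// Best-case time, in seconds, to move `megabytes` over a link of `lanes` lanes.
double transfer_seconds(double megabytes, int lanes) {
    assert(megabytes >= 0.0);
    return megabytes / link_mb_per_sec(lanes);
}
```

Under these assumptions, copying a 256 MB buffer takes roughly 0.064 s over x16 and roughly 1 s over x1, a factor of 16. Whether that matters depends on how often your kernels need data moved to and from the host.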