Best PSU solution for 4 x GTX 295

These are custom cases manufactured for us by Protocase.com. The illustrations show the three-card version (seven PC AT slots in the back). The disk drives (two at most, usually 1TB in a RAID mirror) are screwed to the top of the case and hang over the CPU, which puts them in the path of the airflow from the front of the case to the back. The power supply bay is oversized so there’s a bit of extra room on all sides for a larger power supply; all the power supplies mentioned in this thread fit easily. There’s no room for floppy or DVD drives, but anybody who needs one can just use a USB drive or connect over the network to a machine that has a DVD.

For four GTX 295 cards, the company is having Protocase build a slightly modified version of this case that’s one PC AT slot wider (adding about an inch of width), with one more fan added in front for a total of three 120mm fans. They plan on using 110 CFM fans, and I think they expect to get the first units in this week.

The existing seven-slot version fits into a 20" international standard airline carry-on bag (a Tumi roll-on) and is very durable. I’ve personally taken demo units four or five times to Europe and Asia. The eight-slot version won’t fit into a 20" roll-on, but it will fit into a US standard 22" roll-on. So if you want to take one of those babies to Europe as carry-on luggage you’ll have to fly Business Class or First Class, where almost all airlines allow 22". :-)

If all goes OK, Manifold is going to give Protocase permission to build these for whoever wants them, as we expect a lot of our customers and partners are going to want to use four-card rigs. Pricing is not bad at all: under $200 at quantity 12, and I think even a single unit is under $350. Our engineers love Protocase - easy to work with, total reliability, amazingly fast turnaround, good prices and great quality. Plus, once you get used to having custom hardware built exactly to your tastes, that’s hard to give up! :-)

If anyone wants to pursue this, keep in mind these are experimental cases. So far no problems with heating, but who knows long term with four cards. Also, they are bare metal with ATX mounting posts but no power supply or fans. Bring your own screws, power switch, etc.
[attached image: demobox.jpg]

Wow, that is great! I’m working on a side project which might result in me needing to build a few GPU workstations for my collaborators in a few months, and this would be perfect. I’ve been racking my brain trying to figure out how to get the case size down and the GPU count up, and this is the ideal solution. Definitely post more info when Protocase is ready to take outside orders.

As the FASTRA folks said regarding long term reliability: “Clearly, such high temperatures will restrict the lifetime of this system. However, for its price, a new system can easily be bought within one or two years.” This also applies to our usage. If we burn out our system every 2 years, we still save money with CUDA. :)

Well, the cases work great and fit fine. (see attached image) They even fit into a 20" airline carry-on if you use a Tumi.

My next task is to get my demos ready for a presentation I’m doing in Denver next week. My problem now is that I can’t seem to get four GTX 295s recognized by CUDA applications (not the SDK samples, not our stuff) on an ASRock X58 Supercomputer motherboard. Only three are recognized. I’m bummed, because our engineers are using a different motherboard, so I put together a unit with a spare ASRock board to fool with. I’ve got the latest BIOS in it, etc., but still it only sees three cards (for a total of six GPUs). Any suggestions would be greatly appreciated… :-)

What is your OS? If Linux, what does “ls /dev/nv*” give?

You’ve probably tried shuffling cards. Any three of the cards work in any three slots?

LEDs burning bright on the back of each GPU when four cards in?

Probably a BIOS issue.
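
For what it’s worth, a quick way to see how many devices the driver is actually exposing to CUDA, independent of the SDK samples, is a tiny enumeration program. A minimal sketch using just the CUDA runtime API (nothing here comes from the SDK; with four GTX 295s you’d expect a count of 8):

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("CUDA devices visible: %d\n", count);   // expect 8 with four GTX 295s

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("  device %d: %s, %d multiprocessors\n",
                    i, prop.name, prop.multiProcessorCount);
    }
    return 0;
}
[/code]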

Hello,

I have a couple of questions for those who are experienced with building multi-GPU systems from commodity hardware:

  1. Looking at the pictures of multi-GPU systems, such as the ones attached to this thread, I can see that the GPU boards are placed adjacent to each other inside a typical multi-GPU box. I also see a pretty big fan on the “palm” side of my GTX 280 board (in addition to a smaller fan next to the board’s DVI socket). Obviously, for three out of four GPU boards in a quad-GPU setup, all the airflow generated by the larger fan is completely blocked by the adjacent board. Does the board’s smaller fan (the one next to the DVI socket) provide sufficient cooling for running tasks reliably 24x7?

  2. Has anybody explored off-the-shelf liquid cooling solutions? Are they necessary at all, and is there any measurable benefit to justify the expense?

It would be awesome to have some kind of “hardware wiki” (nvidiki?) to cover multi-GPU systems. From a commercial standpoint, besides being a resource for promoting NVIDIA’s sales, it could perhaps even earn some ad $s from vendors of GPU boards and related hardware.

I have one of these ASRock motherboards now, but all it has is a GTX 295 and an early GT200 card in it. All of my other devices are busy churning away on jobs to hopefully produce results for a conference (oddly, also in Denver next week) so I can’t try 4 cards in one system.

I do have to say that while the hardware features are nice, I was disappointed in how quirky the BIOS on this motherboard has been. It took far too much fussing to get it working with RHEL 5.3, even after I finally figured out how to reflash the BIOS without a floppy or Windows. I have been more impressed with Gigabyte (although that’s a Phenom motherboard, not a Core i7), and will probably use their Core i7 board for my next system.

For the current NVIDIA reference design that actually isn’t the case. If you look at the fan on a standard GTX 260/275/280/285, you will see that the fan and rear part of the card “cowling” are actually at an angle to the rest of the card. When multiple cards are placed next to one another, there is a wedge-shaped gap in front of each card’s main fan, which lets air get in. I believe the current NVIDIA build guidelines recommend placing a case fan to blow air at the fan end of the cards from either the side or rear, so as to make sure sufficient cool air is supplied to the card fans. I too was somewhat skeptical, but it seems to work fine in all the multi-GPU systems I have tinkered with.

I’m running Windows XP x64. All four LEDs burn bright green. The BIOS is set to use PCI-E for the video card. I agree it is almost certainly a BIOS issue. Some info, numbering the PCI-E slots with “1” closest to the CPU and “4” closest to the mobo edge, and “card” meaning a GTX 295:

Running three cards in slots 1, 2, 3 works fine. Plugging a card into slot 4, it does not appear to Windows, the CUDA SDK samples (device query, Monte Carlo, etc.), or our code.

Plugging a single card into slot 4 and rebooting, Windows did not recognize it as an NVIDIA card at all, recognizing it only as a VGA display. After re-installing GeForce drivers (downrevving to 181.xx in dim remembrance of a thread somewhere…) the system recognized it as an NVIDIA card and it ran OK for CUDA.

Plugging a second card into slot 3 and rebooting, POST comes up on the slot 3 card, as does the initial Windows loading screen, but then Windows launches using the slot 4 card as the console. Strange. After that, I plugged two monitors into the slot 4 card and one monitor into the top DVI connector of whatever was the lowest-numbered slot’s card. CUDA runs fine with two cards.

Plugging a third card into slot 2 and rebooting, POST comes up on slot 2 as does the initial Windows loading screen with the console coming up on the slot 4 card again. CUDA runs fine with three cards.

Plugging a fourth card into slot 1 and rebooting, POST comes up on slot 1, as does the initial Windows loading screen, with the console coming up on the slot 4 card again. However, a blue screen crash occurs at an apparently random point between POST and, on about one trial in ten, after logging into Windows. This is interesting because adding cards in the 1, 2, 3, 4 sequence using the latest GeForce drivers just left the fourth card unrecognized but never produced a blue screen.

This is where we left it last night. Today we’ll try taking out the fourth card, re-installing the latest WHQL GeForce driver and seeing if the blue screen still happens.


Regarding cooling: the case we are using is a variation of a custom Protocase.com build we’ve been using for a couple of years. The original version has two 120mm approx 50 CFM fans in the front and routinely hosts three GTX 295s. No overheating issues. The new case, illustrated in the prior post above, adds a third 120mm fan that blows directly onto the GPU cards. It also upgrades the fans to 110 CFM (they found a fan that is not too loud!) with the idea of pressurizing the interior a bit to assist flow through the GPU cards. We won’t be able to tell how well this will work out until all four GPU cards get put into continuous, intensive usage, but we are guardedly optimistic it will be OK.

I grant that these case layouts are special (ahem…) cases. :-) The compact case was originally designed to allow multi-GPU rigs to be easily transported for demos, meetings with partners, etc. But they ended up becoming popular for general development use given the compact size. There has been some talk about water cooling but that doesn’t work well for us when hardware gets replaced a lot with whatever is the latest new thing, different test configurations, etc. There’s also the capital cost adder when you’re talking a few dozen systems with four GTX 295’s each.

Larger cases would allow you to toss more fans at the problem. But what really counts is good airflow and you can often do better with a more closely cowled design. Aircraft engines, for example, are sometimes more tightly cowled than frontal area aerodynamics would require, to ensure good cooling airflow. A well-designed compact box can have better airflow than a poorly designed big case. If pressurizing flow through the GPU cards can be achieved, the compact case might even have better cooling than an open frame.

We have many customers who are interested in putting together four-card rigs for their work, so once we get all this sorted out I’m going to publish a set of illustrated, step-by-step instructions that covers everything from bare case to software configuration with a complete bill of materials for every component used, vendors selling the stuff, etc. I’ve been hanging around the lab taking lots of pictures, grabbing labels and packaging material and taking careful notes. That should save a lot of time in the future for everyone. :-)

Dimitri,

Your effort will indeed be very much appreciated.

Ah-ha!

Now I understand!

Thanks!

An update…

It turns out some of my (former) fellow countrymen have a four GTX 295 rig running on an ASRock x58 motherboard, and have reported their success in the Russian language forum at http://total-oc.com/forum/viewtopic.php?f=…04&start=30

For those who don’t yet read Russian, a quick translation/summary of the operative advice:

I switched down to one card in slot 1. I then got 178.13 from guru3d and modified the .inf as above. I installed it and the card booted into Windows OK. Although the CUDA SDK samples could not detect any CUDA devices, our code did detect both GPUs. :-)

I then added a second card. Now the display was stuck in VGA mode showing a “%NVIDIA_GT200.DEV_05E0.1%” device, and neither the CUDA SDK nor our code could see any GPUs at all. :-( After much tinkering with different settings, I gave up.

I then loaded the current GeForce WHQL driver, verified it worked as before, and then loaded the newest April 8 beta GeForce driver (185.something…). This worked with three cards, but plugging in a fourth card led to a blue screen. :-(

Not sure what to do next. Our engineers tell me I’m being hardcore and I should just chill out, use one of the other motherboards (like the MSI Jaak is using…) for now, and that soon either the ASRock’s BIOS issues will get sorted out or there will be an alternative like the P6T7 in general production. That’s probably all true, but I hate being patient now that I’ve got my teeth into it. Probably should have started this as a new thread… sorry to hijack the topic.

I’d be grateful for any advice!

Problem solved. It works! I made a totally stupid mistake in altering the 178.13 INF. The attached images show Manifold (our humble product…) running with 8 GPUs, the SDK Monte Carlo demo running with 8 GPUs, and the device query reporting 8 GPUs. :-)

Also attached are instructions for modifying the INF.

To get this running, I uninstalled NVIDIA drivers, correctly hacked the INF and then with three cards in the machine installed 178.13 using the hacked INF. Booted and verified the three cards worked OK. Shut down and installed the fourth card. Booted and Windows reported a new card, so I just re-installed 178.13, rebooted and it all worked OK. [Well, the CUDA 2.1 SDK samples didn’t run so I dropped down to the CUDA 2.0 SDK and CUDA 2.0 Toolkit].
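
For anyone repeating this, a quick way to go beyond just counting devices when verifying the cards work OK is to launch a trivial kernel on each GPU and check for errors. A minimal sketch, assuming a reasonably current CUDA runtime where cudaSetDevice can be switched inside a loop (the old 2.x toolkits needed one host thread per device):

[code]
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: each thread writes its own index.
__global__ void touch(int *out) { out[threadIdx.x] = threadIdx.x; }

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);                       // switch to this GPU
        int *d = 0;
        cudaError_t err = cudaMalloc(&d, 32 * sizeof(int));
        if (err == cudaSuccess) {
            touch<<<1, 32>>>(d);                  // exercise the device
            err = cudaDeviceSynchronize();        // surface any launch failure
            cudaFree(d);
        }
        std::printf("device %d: %s\n", dev,
                    err == cudaSuccess ? "OK" : cudaGetErrorString(err));
    }
    return 0;
}
[/code]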

Thanks for the help and advice from everyone. Thanks especially to the guys in Russia who figured out the trick of using 178.13 and hacking the INF in the first place!

[Edit]: Corrected a typo in the .txt, deleting an initial single quote character at the beginning of one of the lines…

PC Power & Cooling released a new 910 watt power supply for quad-PCIe graphics card systems today.
88% efficiency, too, which is a nice bonus.

I’ve never used it of course, just saw the marketing press release today.

Edit: Ha, careful of marketing. “Power connectors for 4 PCIe graphics cards” means there are two 6-pin and two 6/8-pin connectors, enough for two modern cards.
Marketers are good at this kind of weaselling, since it is true in theory that you could run four entry-level (single power plug) cards with it.

Just wanted to report… we are now building machines for our internal use with four GTX 295 cards as a matter of routine, using the new “E” case, ASRock motherboard and Enermax PSU.

Complete specs and detailed build instructions are at [url=“http://www.manifold.net/downloads/Building_an_E_Box.pdf”]http://www.manifold.net/downloads/Building_an_E_Box.pdf[/url] - there’s also a news note on how these were used at a recent Manifold user meeting at [url=“http://www.manifold.net/info/pr_gpu_record2.shtml”]http://www.manifold.net/info/pr_gpu_record2.shtml[/url]

There are many warnings in the build instructions to not leave these unattended. I have to agree there is wisdom in the notion of being very careful with hardware that is not UL approved and has been tossed together in a more or less ad hoc manner. But despite all that, the machines run great with no fires so far.

Also, it turns out that they will fit into a 20" airline carry-on (at least the Tumi 20"). Very handy to take your big CUDA rig on business trips!

This is great! Thanks for sharing this info with us. The ASRock motherboard on my desk may soon find its way into one of these cases. (Though not with four GTX 295s for a while.)

The guide is great! It’s interesting, informative, and you know it’s well written when the introduction has great practical advice like
“Do not try to build a system like this if you’re a moron.” and “Creating hardware configurations like this is probably illegal in some stupid nanny state somewhere.”
It’s truly great!

Thanks for your build notes and updates here on the forum, too. You may be helping a lot of people as they build their own boxes, even if they’re different flavors.

Yeah, this made me laugh, although I don’t know if things are really quite so bad as they sound. I can understand Manifold.net wanting to cover themselves in case something happens, but these cases are no more or less “uncertified” than the beige-box computers people build from parts all the time. The most important thing is that the power supply itself be UL-listed. I don’t know that the rest of the computer needs to be certified. For example, my MacBook does not appear to have a UL certification, but the power brick does. Probably the biggest danger here is destroying your $3000 in computer parts. :)

Anyway, with something this unusual, it’s better to cover yourself in terms of liability, so the disclaimer is both worthwhile and entertaining.

Dimitri (Dima if I may),

Thanks for the entertaining and detailed document! I was anxiously waiting for it. I know I should first submit my questions to the UN’s subcommittee on coconut tree tarantula preservation in Oceania, but I thought you might have a sec to answer. Do you have any experience with running this setup with fewer than 4 cards, i.e. will it be “safe” (I mean OK), with respect to the overheating/burning-the-house-down issues, to run them with 3 (or 2) cards unattended? We might have some money to build several CUDA rigs, and I want to find the most feasible/space-saving way of doing it. These puppies seem to be an interesting solution because they are compact and powerful, but volatile :)

Cheers,

 Demq

I’m neither a lawyer nor am I a hardware engineer so the following should not be taken as anything but inexpert guesswork…

There have been many “B” boxes in use at our company for a long time running continuously with no problems. “E” boxes hosting 4 GTX 295 cards per box have now been run for long hours per day for months with no problems either. But then the boxes tend to get only sporadic use, as is typical in software development, where most of the time the GPUs aren’t doing anything. I guess if they were doing something continuous like folding all the time they’d heat up more. As it is, they don’t seem to heat up much at all.

All of our new builds are using E boxes as they are just slightly bigger than the B box and nobody wants to give up the possibility of running four GPU cards. People also seem to like the three fans. Some configurations have been built with switches on the fans so they can be set to a lower, quiet speed if populated with fewer cards. An E box with only two GPU cards in it runs very cool.

I personally wouldn’t worry at all about an “E” box with three cards in it either. For that matter, I’ve forgotten plenty of times to turn off the four-card E box I get to play with and have left it running unattended, at one point for a week or two (oops…), with no problems or overheating at all. But that could just be dumb luck.

One more thing: for new builds they’re going to be trying the ASUS P6T7 motherboard. If the GPU cards ever got down to single width and one aux power cord, that would allow eight GPU cards. :-)

Are there any watchdog utilities that can monitor GPU temps and shut down the system if they get too high?
Such things exist for CPUs… often in the BIOS, so they’re even OS-independent.

If you wanted to be fancy, you could even make the watchdog run a very small 1 ms CUDA compute every 30 seconds, and if that compute failed, count that as a shutdown event as well, since it might be another stability issue that the temperature sensors alone can’t tell you about.
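
A minimal sketch of that idea, using only the CUDA runtime API. The temperature check is left out (it could be layered on via NVML or by parsing nvidia-smi output), and the “trip” action here is just a placeholder exit rather than a real shutdown command:

[code]
#include <cstdio>
#include <cuda_runtime.h>

#ifdef _WIN32
#include <windows.h>
static void sleep_seconds(int s) { Sleep(s * 1000); }
#else
#include <unistd.h>
static void sleep_seconds(int s) { sleep(s); }
#endif

// Tiny "heartbeat" kernel: write a single value so there is real work to fail.
__global__ void heartbeat(int *out) { *out = 1; }

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA devices visible; tripping watchdog.\n");
        return 1;
    }
    for (;;) {
        for (int dev = 0; dev < count; ++dev) {
            cudaSetDevice(dev);
            int *d = 0;
            cudaError_t err = cudaMalloc(&d, sizeof(int));
            if (err == cudaSuccess) {
                heartbeat<<<1, 1>>>(d);
                err = cudaDeviceSynchronize();   // surface launch or stability failures
                cudaFree(d);
            }
            if (err != cudaSuccess) {
                std::printf("Device %d failed its heartbeat: %s\n",
                            dev, cudaGetErrorString(err));
                // Placeholder trip action: a real watchdog could log, alert,
                // or invoke the OS shutdown command here instead of exiting.
                return 1;
            }
        }
        sleep_seconds(30);   // poll interval suggested above
    }
}
[/code]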