Manifold Custom Case Rev. 2 Success! 8 CUDA devices in your carry-on luggage

I wanted to report that today I successfully built one of the Manifold Custom Revision 2 cases described here:

http://forums.nvidia.com/index.php?s=&…st&p=539172

I followed much of Dmitri Rotow’s parts list with a few deviations:

  • Enermax Galaxy 1250W PSU (in case my employer is reading this: yes, this power supply is UL-listed. Safety first! :) )
  • Intel Core i7 920 (2.67 GHz)
  • 4x EVGA GTX 295 cards (the new single PCB variety)
  • Manifold Custom Case Rev 2 from Protocase.com
  • 3x Thermaltake A2018 120mm fans (blue LED and various speed control options)
  • Rosewill RFT-120 120mm fan filter
  • Patriot Viper 12 GB RAM PC-10666 kit (6x 2GB modules)
  • Intel X25-M (G1) 80 GB Solid State Disk
  • Asus P6T7 WS Supercomputer motherboard

I have no affiliation with Manifold, so take this as the perspective of an outsider working with the Manifold case. (I do have lots of experience assembling computers, so I’ll mostly focus on the unique features of this system.)

Although I want to write a much more complete document in the future (with photos!), here are some of my initial impressions before I forget them. Many of these things are in the “Building an E Box” PDF, but I’m repeating them here because I didn’t appreciate their importance when I read through the document the first time. You should not take this post as a substitute for reading the Manifold document, however. It’s very informative!

======

The Case:

  • Protocase offers truly amazing service. I highly recommend that you browse around the documentation PDFs on their website. You’ll learn a lot about enclosure design and working with sheet metal. Protocase also recently popped up in the news as the manufacturer of the Backblaze Storage Pod, which holds 45 disks in a $750 4U rackmount enclosure.

  • Ordering the case was pretty easy, though the sales rep I emailed for a quote initially did not know what the “Manifold Custom Case Rev 2” was. I pointed her to Dmitri Rotow’s forum post above, and then in a day I had a quote in hand. $343 for one case, dropping to $204 each for an order of 10.

  • My case shipped from Nova Scotia, but FedEx International shipping was included in the above prices, so I didn’t realize it until I got the tracking number.

  • The case build quality is impressive, especially with the powder coat on all surfaces. (I went for leaf green, just because I was tired of black and beige.) Everyone in the office spent a few minutes admiring the case before it was whisked off to the lab.

======

Assembly:

  • The case comes with only the screws required to hold the sheet metal together, but none for mounting the computer parts. The Thermaltake fans come with suitable screws, nuts and washers, and the Enermax power supply has its own screws as well. You will need to supply 9 screws for the motherboard, 4 screws to hold down the graphics cards, and whatever screws are required to hold down the disks you install. (I didn’t use any hard disk screws, but more on that later.) Motherboard standoffs are built into the case bottom. If you have a bag of miscellaneous computer screws, you should be in good shape.

  • There is not a wasted cubic centimeter in this case! You need to read the assembly order in the Manifold documentation. I made my life a little difficult by installing all four GTX 295 cards before installing the SATA cable. Fortunately the P6T7 has angled connectors, so you can get under the graphics card and finesse the cable in if you have thin fingers. (or forceps)

  • Getting the GTX 295 cards installed is a rather harrowing game of 3D Tetris. As the documentation states, you do need to flex the back of the case gently to get the card backplate lip around the obstructions. Everything springs back just fine, though.

  • The X25-M is a 2.5" form factor, which means the screw holes in the case lid are not spaced appropriately to mount it directly. Instead my plan was to put the drive into an Icy Dock 2.5" to 3.5" SATA converter. I’ve used this enclosure before, and it was great. However, it seems to be a little longer than a normal 3.5" drive and collided with the PSU when oriented with the cables facing away from the PSU. Turning it around just barely worked if you flexed the cables at a hard angle going into the connector.

  • The manual isn’t kidding about needing standoffs for the hard drive. It is impossible to get the connectors in (especially since the Enermax SATA power angled connectors bend the wrong way) if the drive is flush. It turns out that if you have some motherboard standoffs lying around, those work in a pinch instead of nylon ones. (Nylon would provide better vibration isolation, but whatever. :) )

  • As it happened, my Icy Dock enclosure was defective, so I finally decided to just velcro the X25-M (which is very small and light) against the back of the case through the 40 mm optional fan holes. This works really well, and I would highly recommend doing this if you use an SSD instead of a rotating disk.

  • You can reach the power switch jumper on the P6T7 from the front-left corner of the case if you crack the lid. This is handy if you need to short the power switch jumper horizontally with a screwdriver (be careful!) because you forgot to set the BIOS to auto power-on before installing everything. :)

======

Operations:

  • For other reasons, I had to install Scientific Linux 5.3 (this is a RHEL 5.3 rebuild, like CentOS). I turned off the Marvell SAS controller and set the SATA controller to AHCI mode in the BIOS.

  • RHEL 5.3 and the CUDA 2.3 driver had no problem recognizing all 8 devices in the P6T7 motherboard. There is one BIOS update on the Asus website that I did not apply since everything worked first time.

  • There was some unusual CPU clock ramping behavior initially. /proc/cpuinfo said the cores were stuck at 1.6 GHz, even under load. Strangely, top showed the 8 single-threaded CPU-bound processes each using 250% CPU, so clearly the system was in some kind of confused superposition of max and idle clock rate. I finally just forced the CPU to run at full clock all the time in the BIOS. (Perhaps the BIOS update helps here. I haven’t tried it.)

  • Idle, the system draws about 600W at the plug. (Note this is with the clock rate forced to max.)

  • As you activate more CUDA devices, the power usage ramps up. I observed a maximum of 1100W, but my test jobs might not have loaded every single CUDA device simultaneously. Consider that value a lower bound on the power usage. :)

  • Power usage immediately ramps down when the card is idle. When all the jobs finished, the power draw was back at 600W.

  • I’m still trying to figure out how to monitor the temperature sensors in the system. sensors-detect in the lm_sensors package was able to detect the ADT7473 chips on the GTX 295 cards. (There appears to be one per card, not one per device.) However, the kernel shipped with RHEL 5.3 did not have a driver for this chip. It does appear in later Linux kernels, and it looks like the driver might be backported to the RHEL 5.4 kernel (which has the same version as 5.3, but Red Hat modifies the stock kernel quite a bit). The nvidia-smi tool does not seem to be able to read temperatures from a GTX 295.

  • The back of the computer gets really warm when the GPUs are operating, hence my interest in monitoring the temperature more closely. I’ve run GPUs this hot for extended periods of time, so I’m not immediately concerned, but it is worth keeping an eye on.
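
For anyone who wants to watch for the stuck-clock behavior described above without eyeballing /proc/cpuinfo, here is a minimal Python sketch. It assumes the usual x86 Linux layout, where each logical core has a "cpu MHz" line; the function names and the 5% tolerance are made up for illustration and are not from any existing tool.

```python
import re


def parse_cpu_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value reported for each logical core, in file order."""
    matches = re.findall(r"^cpu MHz\s*:\s*([0-9.]+)", cpuinfo_text, flags=re.MULTILINE)
    return [float(m) for m in matches]


def stuck_cores(mhz_values, expected_mhz, tolerance=0.05):
    """Return indices of cores running more than `tolerance` below the expected clock.

    `expected_mhz` and `tolerance` are assumptions the caller supplies,
    e.g. 2670.0 for a Core i7 920 at its nominal clock.
    """
    floor = expected_mhz * (1.0 - tolerance)
    return [i for i, mhz in enumerate(mhz_values) if mhz < floor]
```

To use it, read /proc/cpuinfo while the machine is under load, pass the text to parse_cpu_mhz, and hand the result to stuck_cores with the nominal clock. Keep in mind that in my case the reported 1.6 GHz may not have been the true clock, so treat the output as a hint rather than a measurement.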

======

Misc Hardware:

  • The Asus P6T7 is a really fantastic motherboard! I also have an ASRock Supercomputer motherboard, and it has not impressed me. I found the ASRock BIOS buggy, hard to update without a floppy drive or Windows, and even after finally updating it, only slightly less buggy. (In RHEL5.3, it throws spurious ATA timeout errors constantly on ports with no devices attached. There is also really weird BIOS interaction between the on-board ethernet and the firewire port for some inexplicable reason.) In contrast, the Asus motherboard worked nearly flawlessly (not sure whose fault the CPU clock rate issues are), and is really well built. The P6T7 costs more, but if you have the budget, it is well worth it.

  • This isn’t CUDA-related, but the X25-M is a mind-blowing device. I’m a little late to the SSD party, but the performance improvement over magnetic disk is amazing. Even when I deliberately oversubscribed the virtual memory, the system stayed responsive (though a little sluggish) under constant swapping to the SSD. Running out of real memory is no longer quite the EPIC FAIL that it used to be with rotating disk.

======

Anyway, sorry for the length of the post. As I mentioned, I want to study the power and heat profile of this system so I can decide under what load it is suitable for 24/7 operation. (Shortening part lifetimes is acceptable, but locking up and getting wrong answers is not.) I will probably be switching to Ubuntu 9.04 in order to get the lm_sensors support. Among other things, I will also be investigating failure modes (like a dead case fan) to see what sort of automated shutdown settings are required. My goal is graceful failure and recovery, rather than 100% uptime. At ~$3800 each, this system could completely break every year, and still save us tons of money. :)

I’d like to thank Dmitri Rotow and Manifold.com for publishing their case design! It’s helped give me a big head-start on my project (and introduced me to the exciting world of custom fabrication).

Wow, thanks for the build report! I keep thinking about “my next box” and always worry about making sure I’ll be able to deal with 3 double cards (4 is better but always more tricky!) This is pretty far beyond what I’m considering (for now) anyway.

I think I’m even more interested in your eventual heat reports… but can you give a quick impression now? What do the GPU sensors read? Does the air coming out the back burn your hand? (I’m serious, actually, it’s quite possibly 70C)
How about case fan noise?

What kind of math are you running on these beasts? Is it going to be a 24/7 kind of endless compute, or will it be transient loads like the Manifold guys?

I could imagine writing a kind of Linux status daemon that not only periodically queries the GPU temperatures, but also works as a kind of health status permission manager. You’d have a library that your compute code would query (or poll) asking if the GPU health is OK to continue to run kernels. The daemon would have GPU temp thresholds, telling query clients to wait if the temps are too high. The daemon could even run its own CUDA kernels, doing a memtest and an FFT or something to look for errors compared to a reference result… if such a mismatch occurs, a status flag is set, alarm sounded, and all compute from client queries is denied until a manual human check and reset. You’d of course have logs of all temps and such.
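
To make that idea a bit more concrete, here is a rough Python sketch of just the decision logic. Everything in it is hypothetical: the class name, the 85 C threshold, and the three status strings are placeholders, and it deliberately leaves out how temperatures are read and how the self-test kernels are run, since those depend on what the drivers expose.

```python
import time


class GpuHealthGate:
    """Decides whether clients may launch GPU work, based on temperatures and self-tests."""

    def __init__(self, max_temp_c=85.0):
        self.max_temp_c = max_temp_c  # hypothetical threshold, tune per card
        self.latched_fault = False    # set on a self-test mismatch, cleared only by reset()
        self.last_temps = {}          # most recent {sensor_name: degrees C}
        self.log = []                 # (timestamp, event, detail) tuples

    def record_temps(self, temps_c):
        """Store and log the latest temperature readings."""
        self.last_temps = dict(temps_c)
        self.log.append((time.time(), "temps", dict(temps_c)))

    def record_selftest(self, result, reference):
        """Compare a self-test result with its reference; latch a fault on mismatch."""
        if result != reference:
            self.latched_fault = True
            self.log.append((time.time(), "selftest_mismatch", (result, reference)))

    def status(self):
        """Return 'denied', 'wait', or 'ok' for a client asking to run kernels."""
        if self.latched_fault:
            return "denied"
        if any(t > self.max_temp_c for t in self.last_temps.values()):
            return "wait"
        return "ok"

    def reset(self):
        """Manual reset, to be called only after a human has checked the machine."""
        self.latched_fault = False
        self.log.append((time.time(), "manual_reset", None))
```

The fault is latched on purpose: a hot card recovers by itself once it cools, but a wrong answer from the memtest or FFT check stays "denied" until someone calls reset(). A real daemon would wrap this in a polling loop and some kind of socket or file interface for the client library.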

Total build costs were?

Also I dig your article for the occurrence of “EPIC FAIL”, luckily unrelated to your build results ;)

He mentioned $3800.

Yeah, of the $3800, a little over $2k is for all the graphics cards. If you used the cheaper ASRock Motherboard and a 1 TB disk, you would knock about $250 off the parts price. (More now, since it looks like the cost of the 80 GB X25-M has shot up by $100 since we bought ours. Crazy demand!)

I didn’t query the GPU temperature directly (after nvidia-smi and lm_sensors failed, I forgot to try the nvidia-settings app). The air definitely doesn’t burn my hand, but the metal bracket is uncomfortably warm.

As for case fan noise, it’s hard to say, since this is sitting on a table next to a rack of computers. The HP DL185 disk server is so obnoxiously loud, I can’t hear the GPU computer over it, even close by. :) I guess that means it’s quieter than a rack mount computer thanks to the 120mm fans.

My future application is the computing for our physics experiment, which hopefully starts next year. Each event recorded by our detector needs to be processed through a chain of analysis modules, which calculate different important quantities. One of these modules is a lot faster with CUDA, but there isn’t much benefit for the rest of the code. (Yet… :) ) Putting a CUDA device in each compute node is therefore not efficient because it will only be used about 20% of the time. It is also impractical because at this stage, we need to use whatever computers we already have lying around, none of which can run a decent CUDA device. And a Tesla S1070 for each existing computer is definitely not in the budget…

Instead, the idea is to take a cluster of existing compute nodes and add one of these GPU nodes to the mix, and offload just this one stage of the event processing to the GPU over Gigabit. (It’s still a long enough calculation to cover for the ethernet latency.)

The processing load in this case is likely “transient 24/7”. The system needs to be able to handle higher than normal event rates, so in most cases the system will be running constantly, but with a less than 100% duty cycle.
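
As a back-of-the-envelope aid for that duty-cycle picture, here is a small Python sketch that interpolates between the idle and loaded wall power I measured (about 600W and 1100W, the latter being a lower bound). The linear model and the function names are my own simplification, and any duty cycle you plug in is a guess until the real event rate is known.

```python
def average_power_w(idle_w, load_w, duty_cycle):
    """Average wall power, assuming the box is either fully idle or fully loaded.

    duty_cycle is the fraction of time spent under load, between 0 and 1.
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return idle_w + duty_cycle * (load_w - idle_w)


def energy_kwh_per_day(avg_power_w):
    """Energy used over 24 hours at a constant average power, in kWh."""
    return avg_power_w * 24.0 / 1000.0
```

For example, a hypothetical 50% duty cycle gives average_power_w(600, 1100, 0.5), which works out to 850W, or roughly 20 kWh per day. Since the 1100W figure may understate full load, the real number could be higher.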

At the same time, we might deploy a few of these around the collaboration (several universities involved) for development purposes, where the usage will be more like several day bursts of 100% load.

(The task itself is actually an algorithm which reconstructs the location of an event, like the recoil of a nucleus, inside a spherical detector where the light pattern is observed near the surface of the sphere. Unfortunately, the optics are not as simple as we would like, so I have to do some 2D integrals over the possible “photon histories” to work out the most probable event location. It’s ridiculously slow, but CUDA eats it up.)

Yeah, this is exactly what I’m thinking here! This computer will be providing a network-accessible compute service to a pool of non-CUDA capable computers. The calculation can always be performed on the non-CUDA nodes, just a lot slower. The server can then self-regulate its temperature quite easily by refusing clients if temperatures leave the acceptable range. I would need a way to associate device IDs from lm_sensors with CUDA device numbers. I suppose the easiest way would be at startup to run short jobs on each device separately and watch which sensor gets warm. :)
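
The "watch which sensor gets warm" trick boils down to picking, for each CUDA device, the sensor with the largest temperature rise over an idle baseline. Here is a minimal Python sketch of that matching step only. The function name and the dictionary layout are assumptions for illustration; collecting the readings (and running the short job on each device) is left out because I don't yet have a working way to read these sensors.

```python
def match_sensors_to_devices(baseline, after_load):
    """Map each CUDA device index to the sensor that warmed up most under its load.

    baseline:   {sensor_id: temp_c} read while everything is idle.
    after_load: {device_index: {sensor_id: temp_c}} read after loading
                only that device for a short while.
    """
    mapping = {}
    for device, temps in after_load.items():
        # Temperature rise per sensor, using only sensors present in both readings.
        rises = {s: temps[s] - baseline[s] for s in baseline if s in temps}
        mapping[device] = max(rises, key=rises.get)
    return mapping
```

Since there seems to be one ADT7473 per card rather than one per device, both devices on a GTX 295 should end up mapped to the same sensor, which is fine for throttling purposes. Letting the cards cool back to baseline between runs would make the rises easier to tell apart.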

A followup to building 8-GPU systems: RenderStream is now building custom CUDA boxes, including 8-GPU configurations with a lot of customization options. I have never purchased from them, but they are the only place I know that offers 8-GPU boxes designed for CUDA. (There are a couple of builders who offer quad-Tesla, but not quad GTX 295.)

Since RenderStream spun out of 8 years of building cluster supercomputers for their own chip fab lithography computations, they have direct experience that I doubt other system builders can offer.

They’re also cosponsoring the CUDA contest going on now.

I just saw this thread and thought I’d contribute a couple of photos of an ASUS P6T7 motherboard in an E box.

The attached photos show a P6T7 in an E box after loading up software and just before the test video card (some sort of cheapo GT Nvidia card) gets ripped out and replaced with a couple of GTX 295s. As can be seen there is room enough along the sides of the motherboard and in front and under the fans so cables can be led and tied out of the way to keep the region over the motherboard uncluttered. This provides good airflow and allows service access to components.

I agree the ASUS P6T7 is a fine motherboard and that protocase.com is an outstanding vendor. The E box / P6T7 combo has become very popular at manifold.net even for those who aren’t going to install four GTX 295 cards.
Ebox_P6T7_2.jpg
Ebox_P6T7.jpg

EDIT: Nevermind. Found the link I was looking for: (protocase.com)

seibert,
I know it’s an old post, but do you have the design template for the Manifold case in Protocase? I’m also interested in building a 4U case to hold 4 GPUs.