What do I need for a 4 GPU CUDA Setup?

Hi,
I’m going to do a CUDA project for scientific research. I’m currently planning the system and thinking about 4 GeForce GTX 280 cards.

I’m wondering which mainboard would be able to take 4 graphics cards. It doesn’t matter whether it is AMD or Intel. For Intel I only found boards with 3x PCIe x16 ports; for AMD I found some with 4 ports, but with an AMD chipset.
So is the chipset important for CUDA?

The next problem is the operating system. It must be Windows, because I only want to extend an existing program with CUDA, and that program is already written for Windows. But it doesn’t matter which version of Windows. So what would be better: Vista 32, Vista 64, XP, or XP 64? 4 GB of RAM would be nice but isn’t necessary.

Last question:
Is it necessary to connect all the graphics cards to a monitor in order to use them?

Thx for your help.

PS: Tesla is not an option, because the whole system price should stay under 3000 Euro.

You might want to check out our recently announced new “Personal Supercomputer” systems:
http://www.nvidia.com/object/tesla_supercomputer_wtb.html

I don’t have any monitors hooked up to my CUDA box, but it is running Linux, which may be why that is possible. I just don’t run an X server.

Boards with 4 PCIe slots:

  1. Intel Skulltrail (BOXD5400XS): 4x PCIe x16 1.0, Intel 5400 chipset.

  2. ASUS L1N64-SLI WS/B: four PCIe slots that are physically x16 but electrically x16/x8/x16/x8, NVIDIA 680a chipset, also PCIe 1.0.

These are both dual-CPU boards (Xeon or Opteron), but you can run them with a single quad-core, which will give you one CPU core per GPU. You will get somewhat reduced CPU/GPU memory bandwidth with the x8 slots.

If your budget is 3000 Euro, the ASUS will be much cheaper; be sure to get BUFFERED memory, as their manual is confusing. The Skulltrail only supports high-end (US$1000) Xeons and expensive registered memory. Also, make sure your PSU has enough PCIe connectors, and note that the Skulltrail requires TWO CPU power cables.

Skippy

You should take a look at the FASTRA page for information about the issues involved:

http://fastra.ua.ac.be/en/index.html

The page says 8 GPUs, but it is actually 4 cards because they built it using 9800 GX2 cards, each of which appears as two CUDA devices. Their parts are a little old now, but would be a great starting point for such a project. Definitely read their Technical FAQ (under “Specs and Benchmarks”) for some discussion of the cooling challenges.

Their system was built with 4000 euros at a time when the 9800 GX2 cost more than the GTX 280 does now, so your 3000 euro goal is probably doable.

Take a look at the ASUS P5N64: it has 2x PCIe 2.0 x16 plus one x16 and one x8, so you will need a riser card to use 4 GTX 280s.

Instead of two power supplies, buy a “video card” PSU that slots into a 5.25" bay, such as this one: http://www.newegg.com/Product/Product.aspx…p;Tpk=gpu%20psu (you’ll need one of these per card).

Don’t use Vista, because last I heard it doesn’t activate video cards that don’t have monitors attached (although this may have been fixed by now). Alternatively, you can wire up a dongle that looks like a fake monitor. There’s no problem going 64-bit, but as always you may uncover some bugs in your existing code, and you also waste some resources (all GPU pointers become 64-bit too).

Simon, that ‘personal supercomputer’ is over six grand. Come on. You can do the same thing for one and a half (much less than EUR 3000).

skippy1729, the Skulltrail uses “Fully Buffered” memory (aka FB-DIMM). This is very different from simply “buffered” memory.

On Newegg, it’s very easy to search for 4x PCIe 2.0 x16. Go to Motherboards, then pick Intel or AMD (AMD has more options), and go to Advanced Search.

Most do not have four slots spaced two apart, which is what you need. But here are two:

http://www.newegg.com/Product/Product.aspx…N82E16813186152
http://www.newegg.com/Product/Product.aspx…N82E16813130136

They are both very cheap. They wire up as x8 when you use four slots, but honestly you don’t need more bandwidth than that. (It’s the same bandwidth as the Skulltrail’s PCIe 1.0 x16 anyway.)

Last step is finding a case with 8 slots. (The standard is 7.)

Please let us know when you get this done, and post pics! With all the money you save (use GTX260), maybe you should build two and make a cluster?

Many thanks to all of you.

That’s a lot of information… I’m going to check all of it.

I will post back when I’ve decided which configuration to use.

But I’m still not sure about the operating system. If I take 3 GB of main memory and 4 graphics cards with 1 GB of RAM each (7 GB of memory in total), can 32-bit Windows address it? Or will I get compatibility problems?

Again, many thanks. This seems to be a really good community.

A 32-bit OS is capable of addressing 3 GB of RAM, and the 4x 1 GB of video memory is not accessible by the OS memory manager anyway; it is handled differently from conventional RAM.

Don’t use Vista because of limitations Alex wrote about; use Linux or XP.

Use good PSU, at least 1.5 kW.

Another problem not mentioned here is cooling: you’ll have to install additional case fans at the very least.

Yeah, it’s no problem. BTW, I wanted to say that even if you get a 32-bit OS you should stock up on RAM (e.g. 8 GB). 32-bit OSes can still make use of up to 64 GB (e.g. for swap space and disk cache); it’s just that a single 32-bit application can’t access that much.

That’s not really true as far as I know (unless you use something like PAE, which is nice and slow).

Honestly, if you’re not using a 64-bit OS as your primary development platform at this point, you should be.

Oops, dangerous remark when the only platform that has a CUDA debugger is 32 bit ;) :D Other than that you are 100% right.

There was much wailing and gnashing of teeth when they told me that it’s 32-bit only for the moment; they promised 64-bit would follow as soon as possible. :P

(This is why I used the word “primary”: I still have a secondary 32-bit install for testing and the debugger.)

It’s absolutely true. There’s nothing wrong with PAE. You say it’s “slow,” which it may be for certain use cases (e.g. developing applications with it), but for swap and disk caching it is blindingly fast (since your other reference point for performance is a hard drive).

Since most people use the extra RAM only for swap and disk cache anyway, >4 GB of RAM on a 32-bit OS is almost as good as on a 64-bit OS. However, for development, use 64-bit so you’re not surprised by pointer bugs down the line. (Unless you’re porting existing 32-bit code and would rather not deal with them.) Also note that you might have to set up both a 32-bit and a 64-bit build environment, because the 64-bit CUDA Toolkit can’t compile 32-bit code (for distribution to 32-bit users).

Honestly, I really wish this whole topic wasn’t a point for discussion and source of issues in this day and age. Sigh.

Last time I installed 32-bit Windows XP, it could only handle 3.5 GB of system memory. I know there have been extra address lines since the Pentium generation, but normal Windows XP doesn’t use them.
So you’re saying an application could use more RAM than Windows XP can supply?

But system memory is not my problem; the application only uses about 200 MB so far.
I’m afraid, though, that there could be problems with the huge amount of graphics memory.
I always thought the 32-bit address lines were used to address all the memory available in the system?!

Take a look how to enable PAE: http://www.microsoft.com/whdc/system/platf…PAE/PAEdrv.mspx

Btw, I stand corrected. On XP, Microsoft placed a limit of 4 GB of physical RAM (if you turn on PAE, the OS will see 4 GB instead of 2 or 3). On 32-bit server OSes, you can see more. Also, this won’t let a single application use more than 3 GB, but it lets the system use the extra RAM for useful things.

If you could address all available memory, you wouldn’t need cudaMemcpy(); you’d just write straight to device pointers. But since you do need cudaMemcpy(), you can have a host pointer 0x12345678 and a device pointer 0x12345678 (and three more such device pointers belonging to three other CUDA contexts).

I researched this some more, and the question of “how can you have 16 GB of GPU memory on a 32-bit OS” actually makes a lot of sense. Indeed, although you’re not allowed to access device memory from the host, all of the GPU’s memory is in fact mapped into the OS and available to drivers.

What I discovered is that the only way a 32-bit OS can access 16 GB of video card memory is to turn on PAE in the first place, which enables 64-bit page table entries in the 32-bit OS.

Also, whether you go 32-bit PAE or 64-bit, the chipset itself must support wide addressing. All Opteron/Athlon 64 chipsets do, but among Intel chipsets only those newer than the 975X, P965, and 955X support the capability.

See here: http://support.microsoft.com/kb/929605/en-us