High GPU/CPU ratio cluster?

Demq · December 1, 2009, 9:09am

Hi.

I’ve been away from CUDA world for a while, but now I need to get specs for a small GPU cluster we want to build at our university. The idea is to purchase off the shelf rack-mountable system with server-grade hardware and maybe add just gaming GPUs (waiting till Fermi-based ones come out in a month or so?), as TESLAs seem to be stiffly priced (especially Fermi once), and we don’t really need ECC. We need a high GPU/CPU(core) ratio. The best I found from Super-Micro chassis was U4 4-GPU/2-CPU(8 cores) solution, where you don’t have to buy GPUs. Colfax seems to have a 8 GPU/ 2 CPU solution (Colfax International | HPC & AI Solutions), but they require min 4 TESLA config. Does anyone know of a similar solution where you are not tied to also purchase GPUs? I know it might be a stretch to have 8 GPUs, but the performance of type of code we are going to run on it scales up linearly with the # of GPUs.

Cheers

eyalhir74 · December 1, 2009, 9:32am

Hi.

I’ve been away from CUDA world for a while, but now I need to get specs for a small GPU cluster we want to build at our university. The idea is to purchase off the shelf rack-mountable system with server-grade hardware and maybe add just gaming GPUs (waiting till Fermi-based ones come out in a month or so?), as TESLAs seem to be stiffly priced (especially Fermi once), and we don’t really need ECC. We need a high GPU/CPU(core) ratio. The best I found from Super-Micro chassis was U4 4-GPU/2-CPU(8 cores) solution, where you don’t have to buy GPUs. Colfax seems to have a 8 GPU/ 2 CPU solution (Colfax International | HPC & AI Solutions), but they require min 4 TESLA config. Does anyone know of a similar solution where you are not tied to also purchase GPUs? I know it might be a stretch to have 8 GPUs, but the performance of type of code we are going to run on it scales up linearly with the # of GPUs.

Cheers

Hi,

Nice to see people are having the same problem as I do :)

First a few thoughts and clarifications:

1. Fermi as Teslas will probably be ready only in ~4-6 months and not one month.

2. You still need to see how Fermi will scale your application performance prior to purchasing massive amount of it.

3. nVidia recommends a ratio of 1:1 CPU:GPU and to have at least as much CPU RAM as GPU RAM. So for 8 GPUs per server,

   you'd want 8 CPU cores and 32GB CPU RAM.

That said… After a long period of testing this is what we’ve decided:

We selected the SuperMicro machine you’ve specified in order to allow for 2 S1070 (and potentially 4 S1070).
HP DL380 was also a good candidate allowing you to connect max 2 S1070.
Both servers showed the same performance and reliability
You can now use nVidia cards called DHIC to connect 1 S1070 to a single PCI slot - but it will halve the PCI bandwidth per GPU.
It really depends on your algorithm - I didnt have any problem using more GPUs than CPUs and had much less CPU RAM then the

 total of GPU RAM in the system. Also PCI overhead was not a big issue (~10% of overall time) and therefore didnt suffer from PCI 

 issues and the DHIC might even be a good solution to increase the amount of GPUs per server.

Bottom line - you’d probably be better off by purchasing/getting try&buy of the two/three leading servers/configuration and test

your real code on them in order to figure whats best for your needs.

Hope that helped :)

eyal

avidday · December 1, 2009, 9:45am

You dilemma is roughly similar to what our group was facing a little while back. Our conclusion was that rack mounted gear only make sense if you need high density, and/or don’t have much space, or already have rack based machine room infrastructure you can or need to piggyback off. We were targeting a 1:1 cpu:gpu ratio with 8-16 cpus total, which is a bit different to you, but it still worked out to be a lot more cost effective and simpler to go with pedestal mount nodes than rack mounted ones.

SPWorley · December 1, 2009, 9:49am

Also, it will be very application dependent, but a dual CPU server will never be as efficient in PCIe transfers as a single CPU machine. The PCIe lanes usually hang off of one CPU and the other CPU has to route its PCIe transfers through it. Likely this won’t affect throughput but it will affect latency. This may be an issue especially if you’re using zero-copy memory or run < 5 millisecond kernels.

seibert · December 1, 2009, 4:13pm

Has anyone benchmarked (maybe tmurray did this when they first came out…) the dual X58 motherboards? With appropriate CPU-affinity settings, those sound like they should offer full PCI-e bandwidth to processes running on both CPUs.

seibert · December 1, 2009, 4:17pm

I agree here as well. If “small” means less than 10, you’ll save a lot of money just getting tower cases. With flat shelves, you could even still fit them on a single rack. Our 4x GTX 295 system in a very compact case lives on a shelf in our rack along side the stack of 1U XServes (don’t ask, I inherited that part of the cluster).

avidday · December 1, 2009, 10:39pm

If my memory serves me correctly, there were several thread posted here where people were getting very strange and asymmetrical results on dual Tylersberg motherboards, even with what should be the correct cpu affinity/numa settings. There was one very attractive sounding board (either Tyan or Supermicro I can’t remember which) which lots of lanes in x16 slots, but gave poor bandwidth under some conditions with GT200 cards.

tmurray · December 1, 2009, 11:33pm

This has been my experience with all dual X58 boxes. Not sure whether it’s a reference BIOS issue or what.

Demq · December 5, 2009, 6:29pm

Thanks everyone for your input. We will probably go with Supermicro’s 4-U chassis for 2:1 GPU o CPU ratio. I would personally prefer using just usual tower cases, but the system is going to be a prototype for a large-scale machine we are going to build later and has to be “server grade” rack-mountable. Hopefully the new Fermi based consumer cards will come out soon so we can start on building:)