GPU cluster


I want to build a small GPU cluster. I plan to run 3 cards on a mainboard with 4 PCI-E slots and to use the last slot for an InfiniBand network card like…start.shtml#HBA or any other (latency is 1.2 us). First I want to connect two computers directly with a dedicated cable, and if that works I want to build a cluster of up to 12 computers (using a switch).

Does anybody have experience with that? Which network technology has the lowest latency? Is there another way to connect the computers with even lower latency (and larger bandwidth)? Any suggestions are welcome!

Why build your own when NVIDIA already offers ready-made ones?

There are a few different solutions on there; have a read.

Tesla S870 GPU Computing System

  • Four GPUs (128 thread processors per GPU)

  • 6 GB of system memory (1.5 GB dedicated memory per GPU)

  • Standard 19”, 1U rack-mount chassis

  • Connects to host via cabling to a low power PCI Express x8 or x16 adapter card

  • Configuration: 2 PCI Express connectors driving 2 GPUs each (4 GPUs total)

I don’t understand:

Do I connect 4 GPUs to the host through one PCI-E slot?

But I can’t access the 6 GB of memory in a shared manner, so I have to access the different GPUs individually (as I do now just by plugging 3 cards into a mainboard).

Is the communication between the GPUs faster than FSB+PCI-E?

How many Tesla S870 units (4 GPUs each) can one combine in one computer?

Best, Jonas

The S870 requires two PCI-E slots for 4 GPUs.

Each S870 uses two x16 PCIe connections, each connected to two cards. There is no programming difference (that I’m aware of) compared to having 4 cards in a single computer; either way you have to run each one within a different context and collect the results yourself. That is, the 6 GB of memory would be split across the individual cards either way. The advantages of using it over many single cards inside the system are: 1) it is rack-mountable for nice cluster management, and 2) it has its own power supply, so you don’t need monster 1500W power supplies in each system.

The disadvantage is that two cards share one PCIe connection, meaning less CPU<->GPU communication bandwidth is available to you. Since your main question is about the best way to limit communication latency, this may be important to you.

I will point out that the minimum latency for a CPU<->GPU memory transfer is about 20 us, and bandwidth is limited to 3 GiB/s under ideal conditions. The minimum latency for launching a kernel is about 5 us in my tests, but that increases as the size of the kernel launch gets larger. Be very careful in your choice of motherboard too. Some switch to x8 PCIe when you add more than two cards, lowering this theoretical peak further. I’ve also seen some reported bandwidth tests that copied from 2 GPUs simultaneously, and the effective bandwidth for each dropped to ~1 GiB/s.
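To put those numbers in perspective, here is a minimal sketch of a latency-plus-bandwidth cost model. It assumes the roughly 20 us minimum latency and 3 GiB/s peak quoted above and a simple linear model (time = latency + size / bandwidth); the function names are made up for illustration, and real transfers will deviate from this.

```python
GIB = 1024 ** 3  # bytes in one GiB


def transfer_time_us(size_bytes, latency_us=20.0, bandwidth_gib_s=3.0):
    """Estimated time in microseconds for one CPU<->GPU copy.

    Assumed linear model: fixed latency plus size divided by bandwidth.
    The defaults are the figures quoted in the post, not measurements.
    """
    return latency_us + size_bytes / (bandwidth_gib_s * GIB) * 1e6


def effective_bandwidth_gib_s(size_bytes, latency_us=20.0, bandwidth_gib_s=3.0):
    """Bandwidth actually achieved once the fixed latency is included."""
    seconds = transfer_time_us(size_bytes, latency_us, bandwidth_gib_s) * 1e-6
    return size_bytes / GIB / seconds


if __name__ == "__main__":
    for size in (4 * 1024, 64 * 1024, 1024 * 1024, 16 * 1024 * 1024):
        print(
            f"{size:>10d} bytes: "
            f"{transfer_time_us(size):10.1f} us, "
            f"{effective_bandwidth_gib_s(size):5.2f} GiB/s effective"
        )
```

Under this model small copies are dominated by the fixed latency, so batching many small transfers into one larger buffer should help.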

Finally, I will answer your original question. Myrinet is another high-speed interconnect that is often used. I’ve worked on a half-dozen systems with InfiniBand and one with Myrinet. My experience (keep in mind the small sample size) is that InfiniBand is more stable when running long jobs, and I had fewer software issues setting it up. Make sure your power supplies are very stable, however, or even use redundant supplies: a string of random crashes I experienced due to InfiniBand errors turned out to be a power supply issue, according to the admin of that system.

The PCI-e adapters of the S870 are Gen2. If you have the right motherboard, you can transfer at Gen2 speed from the motherboard to the internal switch (also Gen2) in the S870.

Another good high-speed solution is Quadrics.

I am not sure your back-to-back configuration will work. As far as I know, InfiniBand needs a switch; you cannot just connect two boxes with a cable.

Oh, that is good to know. From the marketing material and brochures, it was never obvious that the PCIe was Gen2. This basically removes the communication disadvantage from my previous post, since Gen2 has double the bandwidth.
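For reference, the "double the bandwidth" figure can be checked with a little arithmetic. This is a minimal sketch, assuming the standard per-lane signalling rates (2.5 GT/s for Gen1, 5 GT/s for Gen2) and 8b/10b encoding; the function names are made up for illustration, and the results are theoretical per-direction peaks, not what a copy will actually achieve.

```python
# Raw signalling rate per lane in transfers (bits) per second.
TRANSFERS_PER_S = {1: 2_500_000_000, 2: 5_000_000_000}


def pcie_peak_bytes_per_s(gen, lanes):
    """Theoretical peak payload rate in bytes/s for one direction.

    Gen1 and Gen2 use 8b/10b encoding, so only 8 of every 10 bits
    on the wire carry data. Integer arithmetic keeps the result exact.
    """
    data_bits_per_lane = TRANSFERS_PER_S[gen] * 8 // 10
    return data_bits_per_lane // 8 * lanes


def negotiated_gen(slot_gen, device_gen):
    """A link runs at the slower generation of the slot and the device."""
    return min(slot_gen, device_gen)


if __name__ == "__main__":
    for gen in (1, 2):
        for lanes in (8, 16):
            gb_s = pcie_peak_bytes_per_s(gen, lanes) / 1e9
            print(f"Gen{gen} x{lanes}: {gb_s:.1f} GB/s per direction")
```

Note that a Gen2 x8 link has the same theoretical peak as a Gen1 x16 link, which matters for motherboards that drop to x8 when several slots are populated.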

Though, I’d still be curious to know what kind of throughput can be obtained when copying a buffer (say a 1 MB one) from each of the 4 GPUs in the S870 to the CPU “simultaneously” (I guess it wouldn’t be truly simultaneous with the switch). It would also be nice to know which motherboard/chipset the test was performed on.

Guys, please correct me if I am wrong… I thought I had been told (and had read in the NVIDIA HPC brief) that the S870 could be ordered in two configurations, with one of the options being all 4 GPUs connected over one PCIx interface.

We would like to build a system with 2 S870s. I’d like to use a Sun or Dell workstation with two PCIx slots as the host for the 2 S870s. If two adapters are required for each S870, that kinda shoots that plan in the foot.

Also, does anyone know of plans to support Win64 on the S870? Right now, it is Linux only, correct?


The S870 needs two PCI-e connectors (not PCI-X).

Also, you need to remember that it is recommended to have one CPU (or core) for each GPU, so for such a configuration you would need an eight-core computer!

Is this still true with the new 1.1 API, which supports asynchronous operation?

Yes, it seems to be true with 1.1. The problem is that only one context is allowed per thread, and there are some issues with synchronization functions (cudaThreadSynchronize()) which still cause high CPU usage.

The same goes for when you queue up more than 16 kernels asynchronously.
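The one-host-thread-per-GPU pattern described in this thread can be sketched as follows. This is only a CPU stand-in, assuming 4 GPUs as in one S870: no CUDA calls are made, the "kernel" is a plain sum of squares, and all function names are hypothetical. In a real program each worker thread would bind to its own device and context before doing any GPU work.

```python
import threading


def run_on_gpu(device_id, chunk, results):
    """Worker for one GPU (hypothetical stand-in, runs on the CPU).

    A real worker would select device `device_id`, copy `chunk` to it,
    launch kernels, and copy the result back. Here we just compute a
    sum of squares so the sketch runs anywhere.
    """
    results[device_id] = sum(x * x for x in chunk)


def split(data, parts):
    """Cut `data` into `parts` contiguous chunks of near-equal size."""
    n = len(data)
    bounds = [n * i // parts for i in range(parts + 1)]
    return [data[bounds[i]:bounds[i + 1]] for i in range(parts)]


def run_all(data, num_gpus=4):
    """Start one host thread per GPU, wait for all, combine the results."""
    results = [None] * num_gpus
    threads = [
        threading.Thread(target=run_on_gpu, args=(device_id, chunk, results))
        for device_id, chunk in enumerate(split(data, num_gpus))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # the host collects the partial results itself
    return sum(results)


if __name__ == "__main__":
    print(run_all(list(range(1000)), num_gpus=4))
```

Since each worker thread may spin while waiting on its GPU, this layout is where the one-core-per-GPU recommendation comes from.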


I’m trying to devise a relatively simple way to utilize two S870s. My initial work and learning curve for multi-GPU apps would only use one S870, but eventually I’d like to be able to scale up to two or more.

So, can I drive one S870 from one Sun Ultra24 workstation, which would utilize both x16 Gen2 slots? I can get the Ultra24 with the Intel Core2 quad-core processor (so, 4 cores for 4 GPUs).

Then, can I link two of these workstation-based setups together using one of the network technologies referred to in this topic to form a small cluster?

Any insight or ideas you have are certainly appreciated!


So if the S870 connects via two second-generation PCIe buses, can the bus advantages also be used for the C870?
Can I use a C870 in a board with second-generation PCIe and utilize its advantages?

The C870 is a Gen1 device. If you plug it in a Gen2 slot, the slot will run at Gen1 speed.

Thanks for the clarification!