Does anybody have experience with that? What is the network technology with the lowest latency? Is there another possibility to connect the computers with even lower latency (and larger bandwidth)? Any suggestions are welcome!
Each S870 uses two 16x PCIe connections, each connected to two of its cards. There is no programming difference (that I’m aware of) compared to having 4 cards in a single computer; either way you have to run each card within its own context and collect the results yourself (a minimal sketch of this is given below). I.e. the 6GB of memory would be split across the individual cards either way. The advantages of using it over many single cards inside the system are 1) it is rack mountable for nice cluster management, and 2) it has its own power supply, so you don’t need monster 1500W power supplies in each system.
The disadvantage is that it shares two cards over one PCIe link, meaning less CPU<->GPU communication bandwidth is available to you. Since your main question is about the best way to limit communication latency, this may be important to you.
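To make the "one context per card" point concrete, here is a minimal sketch (my own, not from the NVIDIA SDK) of the usual pattern: one host thread per detected device, each thread getting its own context via cudaSetDevice(). The kernel and the buffer size are placeholders.

```
/* Minimal sketch: one host thread per GPU, each with its own context.
   dummyKernel and the buffer size N are placeholders. Build with nvcc
   and link against pthreads. */
#include <pthread.h>
#include <stdio.h>
#include <cuda_runtime.h>

#define N (1 << 20)

__global__ void dummyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

static void *workerThread(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);           /* creates this thread's context on device 'dev' */

    float *d_buf;
    cudaMalloc((void **)&d_buf, N * sizeof(float));
    cudaMemset(d_buf, 0, N * sizeof(float));

    dummyKernel<<<(N + 255) / 256, 256>>>(d_buf, N);
    cudaThreadSynchronize();      /* wait for this device only; results from each
                                     card must be collected and merged by the host */
    cudaFree(d_buf);
    return NULL;
}

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    pthread_t threads[16];
    int ids[16];
    for (int i = 0; i < count && i < 16; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, workerThread, &ids[i]);
    }
    for (int i = 0; i < count && i < 16; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
```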
I will point out that the minimum latency for a CPU<->GPU memory transfer is about 20us, and bandwidth is limited to 3GiB/s under ideal conditions. The minimum latency for launching a kernel is about 5us in my tests, but that increases as the size of the kernel launch gets larger. Be very careful in your choice of motherboard too: some switch to 8x PCIe when you add more than two cards, slowing this theoretical peak further. I’ve also seen some reported bandwidth tests that copied from two GPUs simultaneously, and the effective bandwidth for each dropped to ~1GiB/s.
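For reference, a rough sketch of how numbers like these can be measured with the runtime API; the buffer sizes and iteration counts here are my own arbitrary choices, not the ones behind the figures quoted above, and pinned memory is used to approach the ideal case.

```
/* Rough latency/bandwidth probe for a single GPU. */
#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

static double wallTime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    const int    small = 4;            /* 4-byte copy to expose latency   */
    const size_t large = 64 << 20;     /* 64 MiB copy to expose bandwidth */
    const int    iters = 1000;

    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, large);     /* pinned host memory */
    cudaMalloc(&d_buf, large);

    /* Warm-up copy so context creation is not timed */
    cudaMemcpy(d_buf, h_buf, small, cudaMemcpyHostToDevice);

    /* Latency: average time of many tiny host-to-device copies */
    double t0 = wallTime();
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_buf, h_buf, small, cudaMemcpyHostToDevice);
    double latency_us = (wallTime() - t0) / iters * 1e6;

    /* Bandwidth: one large host-to-device copy */
    t0 = wallTime();
    cudaMemcpy(d_buf, h_buf, large, cudaMemcpyHostToDevice);
    double bw = ((double)large / (1 << 30)) / (wallTime() - t0);   /* GiB/s */

    printf("transfer latency ~%.1f us, bandwidth ~%.2f GiB/s\n", latency_us, bw);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```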
Finally, I will answer your original question. Myrinet is another high-speed interconnect that is often used. I’ve worked on a half-dozen systems with InfiniBand and one with Myrinet. My experience (keep in mind the small sample size) is that InfiniBand is more stable when running long jobs, and I had fewer software issues setting it up. Make sure your power supplies are very stable, however, or even use redundant supplies: a string of random crashes I experienced that showed up as InfiniBand errors turned out to be a power supply issue, according to the admin of that system.
The PCIe adapters of the S870 are Gen2. If you have the right motherboard, you can transfer at Gen2 speed from the motherboard to the internal switch (also Gen2) in the S870.
Another good high-speed solution is Quadrics.
I am not sure your back-to-back configuration will work; InfiniBand needs a switch, you cannot just connect two boxes with a cable.
Oh, that is good to know. From the marketing material and brochures, it was never obvious that the PCIe was Gen2. This basically removes the communication disadvantage from my previous post then, since Gen2 has double the bandwidth.
Though I’d still be curious to know what kind of throughput can be obtained copying a buffer (say a 1MB one) from each of the 4 GPUs in the S870 to the CPU “simultaneously” (I guess it wouldn’t be truly simultaneous with the switch). Information on what motherboard/chipset the test was performed on would also be nice to know.
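For what it’s worth, one way to run that experiment (my own sketch, not a published benchmark) would be one host thread per GPU, each timing repeated 1 MiB device-to-host copies so the transfers overlap across the shared links; the iteration count is arbitrary.

```
/* Concurrent device-to-host bandwidth probe, one thread per GPU. */
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

#define BUF   (1 << 20)   /* 1 MiB per copy */
#define ITERS 1000

static double wallTime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static void *copyTest(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);

    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, BUF);      /* pinned host buffer */
    cudaMalloc(&d_buf, BUF);
    cudaMemcpy(h_buf, d_buf, BUF, cudaMemcpyDeviceToHost);   /* warm-up */

    double t0 = wallTime();
    for (int i = 0; i < ITERS; ++i)
        cudaMemcpy(h_buf, d_buf, BUF, cudaMemcpyDeviceToHost);
    double gib_s = ((double)BUF * ITERS / (1 << 30)) / (wallTime() - t0);

    printf("GPU %d: ~%.2f GiB/s device-to-host while all GPUs copy\n", dev, gib_s);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return NULL;
}

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    pthread_t t[16];
    int ids[16];
    for (int i = 0; i < count && i < 16; ++i) {
        ids[i] = i;
        pthread_create(&t[i], NULL, copyTest, &ids[i]);
    }
    for (int i = 0; i < count && i < 16; ++i)
        pthread_join(t[i], NULL);
    return 0;
}
```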
Guys, please correct me if I am wrong… I thought I had been told (and read in the NVIDIA HPC brief) that the S870 could be ordered in two configurations, with one of the options being all 4 GPUs connected over one PCIe interface.
We would like to build a system with two S870s. I’d like to use a Sun or Dell workstation with two PCIe slots as the host for the two S870s. If two adapters are required for each S870, that kinda shoots that plan in the foot.
Also, does anyone know of plans to support Win64 on the S870? Right now it is Linux only, correct?
Also, you need to remember that it is recommended to have one CPU (or core) for each GPU, so for such a configuration you would need an octa-core computer!
Yes, that seems to be true with 1.1. The problem is that only one context is allowed per thread, and there are some issues with the synchronization functions (cudaThreadSynchronize()), which still cause high CPU usage.
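As a side note on the CPU-usage point: toolkits later than 1.1 added device scheduling flags that make synchronization block or yield instead of spin. This is only a hedged sketch of that workaround and does not apply to 1.1 itself; check your driver/toolkit version before relying on it.

```
/* Sketch of the blocking-sync workaround available in later CUDA toolkits
   (not in 1.1). The flag must be set before the context is created in this
   thread, i.e. before the first real CUDA call. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Ask the runtime to block (give up the CPU) while waiting on the GPU
       instead of busy-spinning; cudaDeviceScheduleYield is a softer option. */
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    cudaSetDevice(0);

    float *d_buf;
    cudaMalloc((void **)&d_buf, 1 << 20);
    cudaMemset(d_buf, 0, 1 << 20);

    /* With the blocking-sync flag, this wait should not peg a CPU core.
       (In the 1.x-era API the equivalent call was cudaThreadSynchronize().) */
    cudaDeviceSynchronize();

    cudaFree(d_buf);
    return 0;
}
```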
I’m trying to devise a relatively simple way to utilize two S870s. My initial work and learning curve for multi-GPU apps would only use one S870, but eventually I’d like to be able to scale up to two or more.
So, can I drive one S870 from one Sun Ultra24 workstation, which would utilize both x16 Gen2 slots? I can get the Ultra24 with an Intel Core 2 quad-core processor (so, 4 cores for 4 GPUs)?
Then, can I link two (2) of these workstation-based setups together using one of the network topologies referred to in this topic to form a small cluster?
Any insight or ideas you have are certainly appreciated!
So if the S870 connects via two second-generation PCIe buses, can the bus advantages also be used for the C870? Can I use a C870 in a board with 2nd-generation PCIe and utilize its advantages?