Each S870 uses two 16x PCIe connections, each connected to two of its cards. There is no programming difference (that I'm aware of) compared to having 4 cards in a single computer; either way you have to run each card within a different context and collect the results yourself, i.e. the 6GB of memory is split across the individual cards either way (see the sketch below). The advantages of using it over many single cards inside one system are 1) it is rack mountable, which makes for nice cluster management, and 2) it has its own power supply, so you don't need a monster 1500W power supply in each node.
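As a concrete illustration of "one context per card, collect results yourself", here is a minimal sketch (my own, not tested on an S870) that drives each device from its own host thread with cudaSetDevice and gathers the output on the host afterwards:

// One host thread per GPU, each with its own context/device.
// Hedged sketch: kernel and sizes are placeholders, error checking omitted.
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <vector>

__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

void workOnDevice(int dev, std::vector<float> *out) {
    cudaSetDevice(dev);                       // bind this thread to one card
    const int n = 1 << 20;
    out->assign(n, 1.0f);

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, out->data(), n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    cudaMemcpy(out->data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);               // would report 4 on an S870
    std::vector<std::vector<float>> results(count);
    std::vector<std::thread> threads;
    for (int dev = 0; dev < count; ++dev)
        threads.emplace_back(workOnDevice, dev, &results[dev]);
    for (auto &t : threads) t.join();         // collect results yourself
    printf("first element from device 0: %f\n", results[0][0]);
    return 0;
}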
The disadvantage is that it shares 2 cards over one PCIe connection, meaning less CPU<->GPU communication bandwidth is available to you. Since your main question is about the best way to limit communication latency, this may be important to you.
I will point out that the minimum latency for a CPU<->GPU memory transfer is about 20us, and bandwidth is limited to roughly 3GiB/s under ideal conditions. The minimum latency for launching a kernel is about 5us in my tests, but it increases as the size of the kernel launch grows. Be very careful in your choice of motherboard too: some switch to 8x PCIe when you add more than two cards, lowering that theoretical peak further. I've also seen reported bandwidth tests that copied from 2 GPUs simultaneously, and the effective bandwidth for each dropped to ~1GiB/s. It is worth measuring these numbers on your own hardware; see the sketch below.
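This is roughly how I would measure transfer latency and bandwidth on a given motherboard (a sketch using pinned host memory and cudaEvent timing; the transfer sizes are arbitrary choices, and your numbers will vary):

// Measure H->D latency with many tiny copies and bandwidth with one big copy.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int reps = 1000;
    const size_t small = 4;                    // latency-dominated transfer
    const size_t large = 64 * 1024 * 1024;     // bandwidth-dominated transfer

    char *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, large);                 // pinned memory for peak bandwidth
    cudaMalloc(&d, large);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    // Latency: average over many tiny copies.
    cudaEventRecord(t0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d, h, small, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("per-transfer latency: ~%.1f us\n", 1000.0f * ms / reps);

    // Bandwidth: one large copy.
    cudaEventRecord(t0);
    cudaMemcpy(d, h, large, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("host->device bandwidth: ~%.2f GiB/s\n",
           (large / (1024.0 * 1024.0 * 1024.0)) / (ms / 1000.0));

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}

Running two copies of this against different GPUs at the same time is how you would check for the shared-link bandwidth drop I mentioned.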
Finally, I will answer your original question. Myrinet is another high-speed interconnect that is often used. I've worked on a half-dozen systems with InfiniBand and one with Myrinet. My experience (keep in mind the small sample size) is that InfiniBand is more stable when running long jobs, and I had fewer software issues setting it up. Make sure your power supplies are very stable, though, or even use redundant supplies: a string of random crashes I experienced that showed up as InfiniBand errors turned out to be a power supply issue, according to the admin of that system.