I have been trying to set up a small computing cluster of four computers using only ConnectX-5 HCAs. We have one master computer with two cards (one dual-port and one single-port) and three slave computers, each with a single-port card. With just the master and one slave connected, it works fine, but when I start adding more slaves, the connection to the other system drops.

Do I need to have some sort of connection manager set up? Any advice on how to set this up would be greatly appreciated (or at least a pointer to the documentation would be nice).

I bet we are considering the same thing.

We are building a “no-hop” grid, but out of 100GbE links.

From posts I found while googling, one way to test the throughput and cabling is to connect the CU QSFP28 directly to another one in another PC and run dd or another tool to do a direct point-to-point transfer.

This validates the cable and ports without any switch in the way. Why wouldn’t the same approach work natively, as you also suggest? There would be no switch latency.
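For what it's worth, here is a minimal sketch of that kind of point-to-point transfer test, assuming plain TCP over the direct link and using only the Python standard library. The function names (`receive_all`, `send_bytes`, `loopback_demo`) are made up for illustration, and the demo runs both ends on 127.0.0.1 so it can be tried on one machine. On real hardware the receiver would run on one PC and the sender on the other, using the IP address assigned to the directly cabled port.

```python
import socket
import threading
import time


def receive_all(server_sock, result):
    """Accept one connection and count bytes until the sender closes it."""
    conn, _ = server_sock.accept()
    received = 0
    with conn:
        while True:
            data = conn.recv(1 << 20)
            if not data:
                break
            received += len(data)
    result["bytes"] = received


def send_bytes(host, port, total_bytes, chunk_size=1 << 20):
    """Send total_bytes of zeros to host:port; return (bytes_sent, seconds).

    The timing is taken on the sender side only and includes connection
    setup, so it is a rough figure rather than a precise measurement.
    """
    payload = bytes(chunk_size)
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port)) as sock:
        while sent < total_bytes:
            n = min(chunk_size, total_bytes - sent)
            sock.sendall(payload[:n])
            sent += n
    elapsed = time.perf_counter() - start
    return sent, elapsed


def loopback_demo(total_bytes=8 << 20):
    """Run both ends on 127.0.0.1 (assumption: stands in for the real link)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}
    receiver = threading.Thread(target=receive_all, args=(server, result))
    receiver.start()
    sent, elapsed = send_bytes("127.0.0.1", port, total_bytes)
    receiver.join()
    server.close()
    gbit_per_s = sent * 8 / elapsed / 1e9
    return sent, result["bytes"], gbit_per_s


if __name__ == "__main__":
    sent, received, rate = loopback_demo()
    print(f"sent={sent} received={received} approx {rate:.2f} Gbit/s")
```

A single Python TCP stream is unlikely to get anywhere near 100GbE line rate, so this only shows that the link carries data end to end. For real throughput numbers a dedicated benchmarking tool is the better choice.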

The rates we need limit the number of links to target to 10-12, which is doable with 5-6 ConnectX-4 cards in the right box.
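The link budget can be sanity-checked with a few lines of arithmetic. This sketch assumes dual-port cards (the card count above only adds up to 10-12 links if each card has two ports) and, for the switchless case, that every pair of machines that must talk needs its own cable. The function names are hypothetical.

```python
def ports_available(cards, ports_per_card=2):
    """Total direct links one box can terminate (assumes dual-port cards)."""
    return cards * ports_per_card


def full_mesh_links(nodes):
    """Cables needed if every node is wired directly to every other node."""
    return nodes * (nodes - 1) // 2


def ports_per_node_full_mesh(nodes):
    """Ports each node needs in a full mesh: one per peer."""
    return nodes - 1


if __name__ == "__main__":
    for cards in (5, 6):
        print(cards, "cards ->", ports_available(cards), "links")
    # Four machines fully meshed: cables in total, ports per machine.
    print(full_mesh_links(4), ports_per_node_full_mesh(4))
```

For the four-computer setup in the first post, a full mesh would need three ports on every machine. With single-port slaves, only a star around the three-port master is possible.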