I only have two nodes so I have not been able to test the below.
As we all know, today we can build 2-node and 3-node clusters using direct DAC cables, which keeps the latency very low. With 2 nodes we get good scaling on tensor-parallel workloads; with 3 nodes you are stuck with pipeline-parallel workloads, which add no speed. If you want 4 nodes or more, a switch is required, and the added latency hurts tensor-parallel workloads.
So what if we could do 4 nodes without the latency penalty?
With the two ConnectX-7 ports per unit we can do 3 nodes using 3 QSFP56 DAC cables. If you try 200G QSFP56 to 2x 100G QSFP56 breakout cables, you run out of ports in the nodes before you have a full mesh, since you can’t recombine the spare 100G ends.
Let me introduce an option that can: 200GBASE-SR4 transceivers. They are the optical version of the DAC cable above, but the lanes can be broken out and recombined as you want using basic MPO-12 to LC-LC breakout cables and LC-LC duplex couplers.
The connectivity is pretty straightforward. Plug one transceiver into each of your 4 nodes, then plug one breakout cable into each transceiver.
Join the breakout cables as follows (LC-LC 5 and 6 are unused):
| Source node | Source cable | Destination node | Destination cable |
|---|---|---|---|
| Node-1 | LC-LC 1 | Node-2 | LC-LC 3 |
| Node-1 | LC-LC 2 | Node-2 | LC-LC 4 |
| Node-2 | LC-LC 1 | Node-3 | LC-LC 3 |
| Node-2 | LC-LC 2 | Node-3 | LC-LC 4 |
| Node-3 | LC-LC 1 | Node-4 | LC-LC 3 |
| Node-3 | LC-LC 2 | Node-4 | LC-LC 4 |
| Node-4 | LC-LC 1 | Node-1 | LC-LC 3 |
| Node-4 | LC-LC 2 | Node-1 | LC-LC 4 |
This gives us a ring of 100G links. For a full mesh we can add two regular QSFP56 DAC cables, one from Node-1 to Node-3 and one from Node-2 to Node-4.
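To sanity-check the wiring on paper, here is a minimal Python sketch that lists the links described above and reports which node pairs still lack a direct connection. The link speeds come from the post (100G ring links, 200G DACs); the helper names are my own and purely illustrative.

```python
from itertools import combinations

NODES = ["Node-1", "Node-2", "Node-3", "Node-4"]

# Ring built from the SR4 breakout fibres (two LC-LC pairs per neighbour).
# Speeds are in Gbit/s, as stated in the post.
RING_LINKS = [
    ("Node-1", "Node-2", 100),
    ("Node-2", "Node-3", 100),
    ("Node-3", "Node-4", 100),
    ("Node-4", "Node-1", 100),
]

# The two regular QSFP56 DAC cables across the diagonals.
DAC_LINKS = [
    ("Node-1", "Node-3", 200),
    ("Node-2", "Node-4", 200),
]


def missing_pairs(nodes, links):
    """Return the node pairs that have no direct link between them."""
    present = {frozenset((a, b)) for a, b, _speed in links}
    return [pair for pair in combinations(nodes, 2)
            if frozenset(pair) not in present]


if __name__ == "__main__":
    print("ring only:", missing_pairs(NODES, RING_LINKS))
    print("ring + DAC:", missing_pairs(NODES, RING_LINKS + DAC_LINKS))
```

With only the ring, the two diagonal pairs are reported as missing; adding the two DAC cables leaves the list empty, meaning a full mesh.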
With the full physical mesh established, configure /31 ranges on each point-to-point link, as with the 2-node and 3-node clusters. If you used port 2 for the ring, then enP2p1s0f0np0 will go to enP2p1s0f1np1 in one direction and enP2p1s0f1np1 to enP2p1s0f0np0 in the opposite direction.
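As a rough illustration of the addressing step, the sketch below carves one /31 per point-to-point link out of a base range using Python's standard `ipaddress` module. The base range `192.168.100.0/28` and the function name `allocate_p2p` are assumptions made up for this example, not something from a tested setup.

```python
import ipaddress

# The six direct links of the 4-node mesh: four ring links plus two diagonals.
LINKS = [
    ("Node-1", "Node-2"),
    ("Node-2", "Node-3"),
    ("Node-3", "Node-4"),
    ("Node-4", "Node-1"),
    ("Node-1", "Node-3"),
    ("Node-2", "Node-4"),
]


def allocate_p2p(base_cidr, links):
    """Give each point-to-point link its own /31 taken from base_cidr."""
    base = ipaddress.ip_network(base_cidr)
    subnets = list(base.subnets(new_prefix=31))
    if len(links) > len(subnets):
        raise ValueError("base range is too small for the number of links")
    plan = []
    for (a, b), net in zip(links, subnets):
        plan.append({
            "link": (a, b),
            "subnet": str(net),
            "a_addr": f"{net[0]}/31",  # address for the first node of the link
            "b_addr": f"{net[1]}/31",  # address for the second node
        })
    return plan


if __name__ == "__main__":
    # Hypothetical base range; pick whatever fits your network.
    for entry in allocate_p2p("192.168.100.0/28", LINKS):
        a, b = entry["link"]
        print(f"{a} {entry['a_addr']} <-> {b} {entry['b_addr']}")
```

If each logical interface on a 200G DAC gets its own address, as in the 2-node setups, those links would need an extra entry each in `LINKS`.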
This entire thing could be done using custom QSFP56 DAC cables as well, but as a quick test to see what the performance gains are, the above can be done using off-the-shelf parts.
The eagle-eyed among you have probably already figured out that this approach could be used to make a 5-node cluster, or offer 4x100G connectivity to a NAS.
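The arithmetic behind the 5-node remark can be sketched as follows, assuming each 200G port is split into two 100G links and each peer gets exactly one of them. That reading is my interpretation of the post, and the function names are illustrative only.

```python
PORTS_PER_NODE = 2   # two ConnectX-7 ports per unit, per the post
LINKS_PER_PORT = 2   # assumption: each 200G port is split into two 100G links


def max_full_mesh_nodes(ports_per_node, links_per_port):
    """Largest full mesh when every peer gets exactly one split link."""
    return ports_per_node * links_per_port + 1


def full_mesh_link_count(nodes):
    """Number of point-to-point links in a full mesh of the given size."""
    return nodes * (nodes - 1) // 2


if __name__ == "__main__":
    n = max_full_mesh_nodes(PORTS_PER_NODE, LINKS_PER_PORT)
    print("max nodes:", n)
    print("links needed:", full_mesh_link_count(n))
```

Under those assumptions each node has four 100G links to hand out, which is enough for four peers (a 5-node mesh) or for four links to a single NAS.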
Just curious - have you validated this in practice, especially wrt thermals and sustained throughput?
I’m asking because I mapped out a connectivity plan for my cluster using fiber instead of DAC, which I much prefer (for reasons that trump cost), but I have been holding back because the online pundits all advocate DAC due to thermals.
In general I prefer SFP transceivers and fiber. Costs more upfront but there’s a lot more flexibility on the backend when you separate the cables from the endpoints. Different lengths? Just get a new fiber? Different breakouts? New harness. Different switch port? New transceiver. Also fiber is more compact inside the rack, and can get run to a nearby (or far!) location outside the rack without needing new transceivers. With DAC, every darn change is an entire new cable…
I can’t say if this would work or not, but I love the direction of exploring more cluster configurations and using reduced bandwidth. I’ve been posting about the possibility of a 16x cluster and am interested in having cluster size be dynamically configurable, so a larger cluster could be easily split into subclusters or recombined on the fly. When you say LC-LC couplers, are you saying something like the ones below?
It seems my idea of splitting and merging the optical outputs wasn’t that bad. After all, I just found these OAC H cross connects, which do exactly what this Frankenstein mod would have achieved:
From your original idea, note that 3-node is a mesh, not a ring. If you make it a ring (w/ 4 sparks), then you’ll end up needing to add hops whenever you need communication across the ones not directly connected. It sorta defeats the purpose of trying so hard to go without a switch. Node 1 → Node 3 communication would be degraded and either fail or would need to hop over another spark (which would have the dual effect of adding a hop and stealing bandwidth/throughput from another link). So inferior to switch there.
I don’t think most of what you’re proposing would work. Switches allow more complex configuration of the underlying channels, but there is negotiation involved in agreeing on signals. It isn’t just like a simple electrical conductor. Typically a host network adapter won’t support all of that. It might be possible to make it work if you hack around in the network kernel drivers? But functionally rewiring the channels is a lot more complicated than just splicing up the connections.
And e.g. if someone is talking about 8-16x sparks and e.g. being able to dynamically reconfigure them, etc. etc… just get a switch. Switch is cheap compared to 8-16x sparks. It’s literally the solution to the problem. I’d only pursue this kind of network sorcery if you’re basically just dying to dive deep into the networking part.
Not to be a buzzkill or negative, but I’ve been doing this stuff a while (both the networking and the Sparks) – and it sounds tempting to e.g. do a breakout cable and then recombine them or things like that (whether optical or copper isn’t really much difference)-- but it’s just not that straightforward in practice.
It is a full mesh. The ring can be either the Frankenstein option from post 1 or the H cross connect OAC cable in the post above yours. The 200G paths are bog-standard DACs, as used for any 2-node cluster.
If a cable breaks it isn’t worse than if a cable facing a switch breaks.
The Sparks already have two logical interfaces per physical port, mapped to lanes 1-2 and 3-4 respectively, which is pretty much the reason we need to put IPs on both to get the full 200G on a single cable.
This topology can run as 4 nodes, as two 2-node clusters (1-3 and 2-4), or as singles, all without the added latency of a switch in the middle.
If I decide to go with 4 Sparks down the road I will go there directly, time will tell.
As for use case, this is just a hobby to learn something new and keep me busy while recovering from an upcoming heart surgery. Whatever I learn will assist in my career progression.
Also I don’t want a noisy switch taking up space in my office. :)
And yes a lot of people are very happy with the switch option, as that is how every single 4-node or larger cluster is built today.
But it is probably also why going from 2 to 4 nodes isn’t scaling as well as going from 1 to 2 nodes. However, until someone tests this, or upgrades the Mikrotik to a low-latency switch like Nvidia/Mellanox, we won’t know.
A ring would work with standard DAC cables, but traffic to any node that is not directly connected picks up extra latency. And that is the goal of this exercise: to see what 4 nodes can do once the latency component is removed from the equation.
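The extra hop in a plain ring is easy to see with a small breadth-first search. This is only an illustrative sketch: the adjacency tables are written by hand from the topology described in this thread, and `hops` is a made-up helper name.

```python
from collections import deque

# Plain 4-node ring: each node only reaches its two neighbours directly.
RING = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [3, 1]}

# Ring plus the two diagonal DAC cables, i.e. the full mesh.
MESH = {1: [2, 3, 4], 2: [1, 3, 4], 3: [1, 2, 4], 4: [1, 2, 3]}


def hops(adjacency, src, dst):
    """Return the minimum number of links between src and dst (BFS)."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # dst is unreachable from src


if __name__ == "__main__":
    print("ring 1 -> 3:", hops(RING, 1, 3))
    print("mesh 1 -> 3:", hops(MESH, 1, 3))
```

In the ring, Node 1 to Node 3 takes two links and passes through another node; in the mesh every pair is one link apart.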
As you can see from my later replies in this thread, this was never about cost savings and the cable solution is actually more expensive than a switch.
This was about two things: removing the need for a fairly large and noisy switch, and reducing the latency to see how that would impact the scaling going from 2 to 4 nodes, which isn’t linear like going from 1 to 2.
BTW, why is this forum randomly not quoting posts when told to?
What benefit would this bring, given that the connection will limit throughput? You should get the same result as with 3 boxes, or even worse, due to the limited throughput. The boxes are 260+, and there is no benefit to splitting 200 into 2 x 100, as that needs more coordination.