4-node DGX Spark cluster without a switch

I only have two nodes so I have not been able to test the below.

Today, as we all know, we can build 2-node and 3-node clusters using direct DAC cables, which keeps the latency very low. With 2 nodes we get good scaling on tensor-parallel workloads; with 3 nodes you are stuck with pipeline-parallel workloads, which add no speed. If you want 4 nodes or more, a switch is required, and the added latency hurts tensor-parallel workloads.

So what if we could do 4-nodes without the latency penalty?

With the two ConnectX-7 ports per unit we can do 3 nodes using 3 QSFP56 DAC cables. If you try 200G QSFP56 to 2x 100G QSFP56 breakout cables, you run out of ports on the nodes before you have a full mesh, since you can’t recombine the spare 100G ends.

Let me introduce an option that can: 200GBASE-SR4 transceivers. They are the optical version of the DAC cable above, but you can break out and recombine the lanes as you want using basic MPO-12 to LC-LC breakout cables and LC-LC duplex couplers.

The connectivity is pretty straightforward. Plug one transceiver into each of your 4 nodes, then plug one breakout cable into each transceiver.

Join the breakout cables as follows (LC-LC 5 and 6 are unused):

| Source node | Source cable | Destination node | Destination cable |
| --- | --- | --- | --- |
| Node-1 | LC-LC 1 | Node-2 | LC-LC 3 |
| Node-1 | LC-LC 2 | Node-2 | LC-LC 4 |
| Node-2 | LC-LC 1 | Node-3 | LC-LC 3 |
| Node-2 | LC-LC 2 | Node-3 | LC-LC 4 |
| Node-3 | LC-LC 1 | Node-4 | LC-LC 3 |
| Node-3 | LC-LC 2 | Node-4 | LC-LC 4 |
| Node-4 | LC-LC 1 | Node-1 | LC-LC 3 |
| Node-4 | LC-LC 2 | Node-1 | LC-LC 4 |

This gives us a ring of 100G links. For a full mesh we can add two regular QSFP56 DAC cables, one from Node-1 to Node-3 and one from Node-2 to Node-4.
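As a sanity check on the wiring, here is a minimal Python sketch that models the ring plus the two DAC diagonals and confirms that every pair of nodes ends up directly connected. It assumes each LC-LC duplex pair carries one 50 Gb/s lane of the 200GBASE-SR4 transceiver, so two pairs per neighbour give 100G. The function names are made up for illustration, and none of this has been run against real hardware.

```python
from itertools import combinations

LANE_GBPS = 50  # 200GBASE-SR4 and 200G QSFP56 DACs both carry 4 lanes of 50 Gb/s

NODES = ["Node-1", "Node-2", "Node-3", "Node-4"]

# Ring links: two LC-LC duplex pairs (assumed one lane each) to the next node.
ring_links = {
    frozenset((NODES[i], NODES[(i + 1) % 4])): 2 * LANE_GBPS for i in range(4)
}

# Diagonals: regular QSFP56 DAC cables using all four lanes.
diagonal_links = {
    frozenset(("Node-1", "Node-3")): 4 * LANE_GBPS,
    frozenset(("Node-2", "Node-4")): 4 * LANE_GBPS,
}

links = {**ring_links, **diagonal_links}


def is_full_mesh(nodes, links):
    """True if every pair of nodes has a direct link."""
    return all(frozenset(pair) in links for pair in combinations(nodes, 2))


def nominal_gbps_per_node(nodes, links):
    """Sum of nominal link speeds attached to each node."""
    return {n: sum(g for pair, g in links.items() if n in pair) for n in nodes}


print("ring only is a full mesh:", is_full_mesh(NODES, ring_links))
print("ring + diagonals is a full mesh:", is_full_mesh(NODES, links))
print("nominal Gb/s attached per node:", nominal_gbps_per_node(NODES, links))
```

The ring alone fails the check because Node-1/Node-3 and Node-2/Node-4 have no direct link, which is exactly what the two DAC cables add.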

With the full physical mesh established, configure /31 ranges on each point-to-point link, as with the 2-node and 3-node clusters. If you used port 2 for the ring, then enP2p1s0f0np0 goes to enP2p1s0f1np1 in one direction and enP2p1s0f1np1 to enP2p1s0f0np0 in the opposite direction.
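For the addressing, something like the following Python sketch could generate the /31 plan for the four ring links. The 192.168.100.0/24 pool is a placeholder I picked, and `plan_p2p` is a hypothetical helper. It also assumes LC-LC 1 and 2 land on enP2p1s0f0np0 and LC-LC 3 and 4 on enP2p1s0f1np1, which I have not verified. The two DAC diagonals would need their own /31 ranges on the other port, allocated the same way.

```python
import ipaddress

# Hypothetical address pool; use any private range that is free on your network.
POOL = ipaddress.ip_network("192.168.100.0/24")

# Assumed mapping: lanes 1-2 leave on f0 and arrive on the neighbour's f1.
RING_IF_OUT = "enP2p1s0f0np0"
RING_IF_IN = "enP2p1s0f1np1"

# (node_a, interface_a, node_b, interface_b): Node-1 -> Node-2 -> Node-3 -> Node-4 -> Node-1
ring = [
    (f"Node-{i}", RING_IF_OUT, f"Node-{i % 4 + 1}", RING_IF_IN)
    for i in range(1, 5)
]


def plan_p2p(pool, links):
    """Assign one /31 per link; the two addresses go to the two ends."""
    plan = []
    for (a, if_a, b, if_b), net in zip(links, pool.subnets(new_prefix=31)):
        plan.append((a, if_a, f"{net[0]}/31"))
        plan.append((b, if_b, f"{net[1]}/31"))
    return plan


for node, ifname, addr in plan_p2p(POOL, ring):
    print(f"{node:7} {ifname:15} {addr}")
```

Each node ends up with one address on each of its two ring-facing interfaces, and no two links share a subnet.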

This entire thing could be done using custom QSFP56 DAC cables as well, but as a quick test to see what the performance gains are, the above can be done using off-the-shelf parts.

The eagle-eyed among you have probably already figured out that this approach could also be used to make a 5-node cluster, or to offer 4x100G connectivity to a NAS.

Just curious - have you validated this in practice, especially wrt thermals and sustained throughput?

I’m asking because I mapped out a connectivity plan for my cluster using fiber instead of DAC, which I much prefer (for reasons that trump cost), but I have been holding back because the online pundits all advocate DAC on thermal grounds.

In general I prefer SFP transceivers and fiber. It costs more upfront, but there’s a lot more flexibility on the backend when you separate the cables from the endpoints. Different lengths? Just get a new fiber. Different breakouts? New harness. Different switch port? New transceiver. Also, fiber is more compact inside the rack, and can be run to a nearby (or far!) location outside the rack without needing new transceivers. With DAC, every darn change is an entire new cable…

The added heat load should be 5 W per node, less than the idle power saving implemented on the ConnectX-7s a while ago.

I agree that a custom DAC would be the ideal solution. Not sure how hard it would be to get something like that made as a bespoke request.

I can’t say if this would work or not, but I love the direction of exploring more cluster configurations and using reduced bandwidth. I’ve been posting about the possibility of a 16x cluster and am interested in having cluster size be dynamically configurable, so a larger cluster could be easily split into subclusters or recombined on the fly. When you say LC-LC couplers, are you saying something like the ones below?

https://www.fs.com/uk/products/76105.html

That one uses single-mode fibres; you want multimode OM4 for these transceivers.

Here is a double width version: https://www.fs.com/uk/products/68522.html?attribute=58255&id=3467685

It seems my idea of splitting and merging the optical outputs wasn’t that bad. After all, I just found these OAC H cross connects that do exactly what this Frankenstein mod would have achieved:

FS version

Naddod version

I have had no luck finding a DAC version of this.

From your original idea, note that 3-node is a mesh, not a ring. If you make it a ring (w/ 4 sparks), then you’ll end up needing to add hops whenever you need communication across the ones not directly connected. It sorta defeats the purpose of trying so hard to go without a switch. Node 1 → Node 3 communication would be degraded and either fail or would need to hop over another spark (which would have the dual effect of adding a hop and stealing bandwidth/throughput from another link). So inferior to switch there.
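To put numbers on the hop argument, here is a small breadth-first search sketch in Python comparing a plain 4-node ring with a 4-node full mesh. The adjacency lists are just the two topologies under discussion, and `hops` is a throwaway helper name, not anything from the Spark tooling.

```python
from collections import deque


def hops(adjacency, src):
    """Breadth-first search: number of links crossed from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


# Plain ring: each node only reaches its two neighbours directly.
ring = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [3, 1]}

# Full mesh: every node has a direct link to every other node.
mesh = {n: [m for m in range(1, 5) if m != n] for n in range(1, 5)}

print("ring, worst case from Node-1:", max(hops(ring, 1).values()))  # 2 (Node-3)
print("mesh, worst case from Node-1:", max(hops(mesh, 1).values()))  # 1
```

In the ring, Node-1 to Node-3 crosses two links, so it has to pass through another Spark. In a full mesh every destination is one link away.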

I don’t think most of what you’re proposing would work. Switches allow more complex configuration of the underlying channels, but there is negotiation involved in agreeing on signals. It isn’t just like a simple electrical conductor. Typically a host network adapter won’t support all of that. It might be possible to make it work if you hack around in the network kernel drivers? But functionally rewiring the channels is a lot more complicated than just splicing up the connections.

And e.g. if someone is talking about 8-16x sparks and e.g. being able to dynamically reconfigure them, etc. etc… just get a switch. Switch is cheap compared to 8-16x sparks. It’s literally the solution to the problem. I’d only pursue this kind of network sorcery if you’re basically just dying to dive deep into the networking part.

Not to be a buzzkill or negative, but I’ve been doing this stuff a while (both the networking and the Sparks) – and it sounds tempting to e.g. do a breakout cable and then recombine them or things like that (whether optical or copper isn’t really much difference)-- but it’s just not that straightforward in practice.

Yes, I think your order of upgrades…

1 Spark

1 Spark, one cable (check cable is detected and interfaces appear)

2 Spark, one cable (get the cluster working)

3 Spark, 3 Cables (not optimal as you oversubscribe the ConnectX-7)

1 Switch, 3 Spark (possibly new cables needed now)

1 Switch, 4 Spark

And so on.

If you get to 4 Spark, you may start wishing you had gone with 1 RTX 6000 Pro, depending on your use case.

To make things a bit more clear here is a diagram:

It is a full mesh, the ring can either be the Frankenstein option as per post 1 or use the H cross connect OAC cable in the post above yours. The 200G paths are bog standard DACs used for any 2-node clusters.

If a cable breaks it isn’t worse than if a cable facing a switch breaks.

The Sparks already have two logical interfaces per physical port, which are mapped to lane 1-2 and 3-4 respectively and pretty much the reason we need to put IPs on both to get the full 200G on a single cable.

This topology can either run as 4-nodes, or two 2-nodes (1-3 and 2-4) or just singles, all without the added latency of a switch in the middle.

I went directly to 2 Sparks, one cable.

If I decide to go with 4 Sparks down the road I will go there directly, time will tell.

As for use case, this is just a hobby to learn something new and keep me busy while recovering from an upcoming heart surgery. Whatever I learn will assist in my career progression.

Also I don’t want a noisy switch taking up space in my office. :)

The total bandwidth across both ConnectX-7 interfaces is 200Gbps. You are oversubscribing the bandwidth available by 100% (400Gbps on a 200Gbps bus).
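The arithmetic behind that figure, as a quick Python sketch. The 200 Gb/s limit is the number quoted above; the link lists are my reading of the two setups in this thread (two 200G cables per node in the 3-node case, two 100G ring links plus one 200G DAC per node in the 4-node proposal), and `oversubscription_pct` is just an illustrative helper.

```python
AVAILABLE_GBPS = 200  # total across both ConnectX-7 interfaces, as quoted above


def oversubscription_pct(attached_gbps, available_gbps):
    """How far the attached link capacity exceeds what the node can move, in percent."""
    return (attached_gbps / available_gbps - 1) * 100


# Nominal link speeds attached to a single node in each setup (assumed).
three_node_direct = [200, 200]
four_node_mesh = [100, 100, 200]

for name, attached in [("3-node", three_node_direct), ("4-node", four_node_mesh)]:
    total = sum(attached)
    pct = oversubscription_pct(total, AVAILABLE_GBPS)
    print(f"{name}: {total} Gb/s attached, oversubscribed by {pct:.0f}%")
```

Both cases come out at 400 Gb/s of attached links per node against 200 Gb/s available, so the links can only all run at full rate if traffic is not hitting every neighbour at once.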

Correct, so either the ConnectX-7 interfaces will throttle the traffic or your switch will with 4 nodes on 200G each.

I still think that the Mikrotik is a small price to pay to ensure all 4 nodes have an even share of the available bandwidth: MikroTik CRS804-4DDQ-hRM 400G Cloud Router Switch (4x QSFP56-DD 400G ports, dual 10G Ethernet, RouterOS v7, redundant PSU, 1U rackmount), listed on Amazon.co.uk.

Just realized the cables are MORE expensive than getting a switch. :-D

Crazy indeed.

And yes a lot of people are very happy with the switch option, as that is how every single 4-node or larger cluster is built today.

But it is probably also why going from 2 to 4 nodes isn’t scaling as well as going from 1 to 2 nodes. However, until someone tests this, or swaps the Mikrotik for a low-latency switch like Nvidia/Mellanox, we won’t know.

My understanding is that you could do a ring.

A ring would work with standard DAC cables. But any node not directly connected will add latency. And that is the goal of this exercise, to see what 4 nodes can do once the latency component is removed from the equation.

Yes, at 4 Sparks, the CRS804 DDQ switch is roughly 7% TCO of the whole deployment.

As you can see from my later replies in this thread, this was never about cost savings and the cable solution is actually more expensive than a switch.

This was about two things: removing the need for a fairly large and noisy switch, and reducing the latency to see how that would impact the scaling going from 2 to 4 nodes, which isn’t linear like going from 1 to 2.

BTW, why is this forum randomly not quoting posts when told to?

What benefit would this bring, given that the connection will limit throughput? You should get the same result as with 3 boxes, or even worse, because of the limited throughput. The boxes are 260+, and there is no benefit in splitting 200 into 2 x 100, since that needs more coordination.