We can still fund the 4th spark :D
That’s money well spent.
I totally agree, I’m in
Let’s wait, otherwise there won’t be an incentive to try 3-node configurations :) We already know that 4-node configurations work well, although of course I wouldn’t mind getting a 4th one at some point :)
By the way, thanks to the community and the enthusiasm shown for Spark, the third Spark is on the way. The question is: do we connect them in an A–B–C triangle (direct mesh) or through a router/switch (like the Mikrotik CRS804 or similar)? Thanks in advance.
A switch would be more straightforward, but I want to try a mesh. With 3 Sparks you can get a direct connection between every pair by using both ports. The downside is that there will be three different subnets.
The playbook is up: Connect Three DGX Spark in a Ring Topology | DGX Spark
with an interesting note: In a three node ring topology all four interfaces on each node must be assigned an IP address to form a symmetric cluster
I’m not sure this makes any sense. Why would they need to assign IPs to both “twins” on each port? Also, they assign IPs from the same subnet to both “twins” (which are separate interfaces from the OS standpoint) — this is bad practice from a networking standpoint and can cause lots of weird issues (and will break autodiscovery).
Ah, that makes a lot of sense — just to get you and the other experts here to come up with a community solution, as always… ;)
Thank you very much for the replies, I’ll play a bit as soon as I get around to it.
Patrick from ServeTheHome had some recommendations in that thread, as they tested switches with Sparks. A new video on that topic will be up shortly.
Watching progress on a 3-node cluster setup with great interest :)
Not even a question for me - as soon as i see the link - I’m straight up paying Eugr back for all his support! The man earned it more than those @nvidia marketing clankers talking crap.
In a mesh scenario you’d have to use two ports on each Spark, which, from what I understand, would drop the speed to 100G on each.
”To achieve 200 Gbps, the CX7 uses multi-host mode aggregating both PCIe x4 links — so both ports are always active together even through a single cable.”
”One cable = 200G (both PCIe x4 links feeding that one port)”
I just caught up to this – wasn’t really browsing forums yesterday. Interesting. You’re right that doesn’t make sense…
Maybe they intended for parallelism to still treat it as 2x NCCL/RoCEv2 interfaces even though our cap is still 100G? Although in that case, we’d need to use 6x separate subnets instead of 3… Not that I’m convinced that would bring any benefit – in fact I’m pretty certain that would only be a drawback.
And if they didn’t intend that (to leverage both partitions to be 2x interfaces per interconnect), then yeah it’s just superfluous… but I don’t think it would really hurt either. So I don’t know if it’s wrong but it is unnecessary.
Well, and I was right. The bandwidth was half of what it should be with this kind of setup (with 2 or 3 nodes). With a proper setup (each subinterface has its own subnet), everything works as it should.
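For anyone replicating the “proper” setup, here is a minimal sketch of the addressing scheme. The interface names and subnets below are assumptions for illustration (check your actual names with `ip link`), not values from the playbook — the point is just that each point-to-point link gets its own subnet, so no two interfaces on a node share one:

```shell
# Sketch only: interface names and subnets are hypothetical examples.
# One subnet per link:
#   spark  <-> spark2 : 10.0.12.0/24
#   spark  <-> spark3 : 10.0.13.0/24
#   spark2 <-> spark3 : 10.0.23.0/24

# On spark (first port cabled toward spark2, second toward spark3):
sudo ip addr add 10.0.12.1/24 dev enP2p1s0f0np0   # hypothetical name
sudo ip addr add 10.0.13.1/24 dev enP2p1s0f1np1   # hypothetical name
sudo ip link set enP2p1s0f0np0 up
sudo ip link set enP2p1s0f1np1 up

# Repeat on spark2 (10.0.12.2 / 10.0.23.1) and spark3 (10.0.13.2 / 10.0.23.2).
```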
It does require a special NCCL build - stay tuned.
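In the meantime, this is roughly how you can point NCCL at the mesh when running nccl-tests. The environment variables are standard NCCL knobs; the interface and HCA names are placeholders you’d replace with your own:

```shell
# Placeholders: substitute your own interface/HCA names.
export NCCL_SOCKET_IFNAME=enP2p1s0f0np0,enP2p1s0f1np1  # bootstrap/TCP interfaces
export NCCL_IB_HCA=rocep1s0f0,rocep1s0f1               # RoCE devices for data
export NCCL_DEBUG=INFO                                  # confirm RoCE is selected

# Same 16 GiB all_gather as in the logs below, one rank per node:
mpirun -np 3 -H spark,spark2,spark3 \
  ./build/all_gather_perf -b 16G -e 16G -g 1 -n 20
```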
When set up as in the playbook:
```
# nccl-tests version 2.17.9 nccl-headers=22903 nccl-library=22907
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 3(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank 0 Group 0 Pid 62831 on  spark device 0 [000f:01:00] NVIDIA GB10
#  Rank 1 Group 0 Pid  4016 on spark2 device 0 [000f:01:00] NVIDIA GB10
#  Rank 2 Group 0 Pid  8382 on spark3 device 0 [000f:01:00] NVIDIA GB10
#
#                                            out-of-place                    in-place
#        size       count  type redop root   time  algbw  busbw #wrong     time  algbw  busbw #wrong
#         (B)  (elements)                    (us)  (GB/s) (GB/s)           (us)  (GB/s) (GB/s)
  17179869168  1431655764 float  none    -1 656678  26.16  17.44      0  651586  26.37  17.58      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 17.5093
#
# Collective test concluded: all_gather_perf
```
When set up properly:
```
# nccl-tests version 2.17.9 nccl-headers=22903 nccl-library=22907
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 3(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank 0 Group 0 Pid 65784 on  spark device 0 [000f:01:00] NVIDIA GB10
#  Rank 1 Group 0 Pid  7094 on spark2 device 0 [000f:01:00] NVIDIA GB10
#  Rank 2 Group 0 Pid  9011 on spark3 device 0 [000f:01:00] NVIDIA GB10
#
#                                            out-of-place                    in-place
#        size       count  type redop root   time  algbw  busbw #wrong     time  algbw  busbw #wrong
#         (B)  (elements)                    (us)  (GB/s) (GB/s)           (us)  (GB/s) (GB/s)
  17179869168  1431655764 float  none    -1 477778  35.96  23.97      0  472288  36.38  24.25      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 24.1112
#
# Collective test concluded: all_gather_perf
#
```
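As a sanity check on those columns: nccl-tests computes algbw as bytes moved per unit time, and for all_gather busbw = algbw × (n−1)/n with n ranks, so with 3 nodes busbw is exactly two thirds of algbw. Reproducing the two out-of-place rows above:

```python
# Recompute algbw/busbw from the size and time logged in both runs.
# nccl-tests reports these columns in decimal GB (1e9 bytes).

def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth: bytes moved divided by elapsed time, in GB/s."""
    return size_bytes / (time_us * 1e-6) / 1e9

def busbw_gbps(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for all_gather: algbw * (n - 1) / n."""
    return algbw * (n_ranks - 1) / n_ranks

size = 17179869168  # bytes per iteration, from the log
for label, t_us in [("playbook setup", 656678), ("proper setup", 477778)]:
    alg = algbw_gbps(size, t_us)
    print(f"{label}: algbw={alg:.2f} GB/s, busbw={busbw_gbps(alg, 3):.2f} GB/s")
# playbook setup: algbw=26.16 GB/s, busbw=17.44 GB/s
# proper setup: algbw=35.96 GB/s, busbw=23.97 GB/s
```

The numbers match the logs, which confirms the “proper” mesh run really is moving ~37% more data per second, not just reporting differently.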
OK, I’ve got 3 node mesh working with my internal builds. For the most part, it makes more sense to just use 2 nodes to run larger models and a third one to run embedding/small models and other stuff.
Tensor-parallel doesn’t work with 3 nodes, as most (all?) models don’t support this configuration — the tensor-parallel size generally has to divide the model’s attention head count, which is almost always a power of two. However, there are other options.
I tried to run Qwen3-397B-int4-autoround with pipeline-parallel-size=3, and it fits the model plus ~500K of context with gpu-memory-utilization 0.7. The performance is obviously lower than running on dual Sparks, but somewhat usable. I want to try Expert Parallel to see if it works better for this config.
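For reference, the launch would look roughly like this. This is a sketch, not my exact command: the flags are standard vLLM options, but the model identifier (taken from the benchmark table below), the context length, and the multi-node setup (a Ray cluster already running across the three Sparks) are assumptions:

```shell
# Sketch: assumes a Ray cluster spanning spark, spark2, and spark3,
# with this node as the head. Context length is approximate.
vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound \
  --pipeline-parallel-size 3 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 500000
```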
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 | 868.36 ± 136.55 | 2431.50 ± 429.00 | 2427.07 ± 429.00 | 2431.64 ± 428.99 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 | 18.45 ± 0.46 | 19.33 ± 0.47 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_pp @ d4096 | 781.30 ± 530.48 | 46168.34 ± 60274.66 | 46163.91 ± 60274.66 | 46168.48 ± 60274.67 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_tg @ d4096 | 17.90 ± 0.29 | 18.67 ± 0.47 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 @ d4096 | 364.60 ± 17.69 | 5635.19 ± 282.86 | 5630.77 ± 282.86 | 5635.47 ± 282.86 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 @ d4096 | 18.04 ± 0.12 | 19.00 ± 0.00 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_pp @ d8192 | 1180.34 ± 8.33 | 6945.97 ± 48.72 | 6941.54 ± 48.72 | 6946.23 ± 48.78 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_tg @ d8192 | 17.77 ± 0.23 | 19.00 ± 0.00 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 @ d8192 | 366.17 ± 2.00 | 5597.64 ± 30.64 | 5593.21 ± 30.64 | 5597.74 ± 30.64 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 @ d8192 | 17.85 ± 0.08 | 18.00 ± 0.00 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_pp @ d16384 | 1146.14 ± 3.18 | 14300.90 ± 39.49 | 14296.47 ± 39.49 | 14301.15 ± 39.63 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_tg @ d16384 | 17.50 ± 0.47 | 19.00 ± 1.41 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 @ d16384 | 339.53 ± 0.78 | 6036.26 ± 13.88 | 6031.84 ± 13.88 | 6036.51 ± 13.77 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 @ d16384 | 17.86 ± 0.23 | 18.33 ± 0.47 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_pp @ d32768 | 1061.60 ± 1.94 | 30872.52 ± 56.36 | 30868.09 ± 56.36 | 30872.70 ± 56.47 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_tg @ d32768 | 17.76 ± 0.36 | 18.67 ± 0.47 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 @ d32768 | 306.20 ± 3.18 | 6693.63 ± 69.29 | 6689.20 ± 69.29 | 6693.76 ± 69.31 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 @ d32768 | 17.45 ± 0.02 | 18.33 ± 0.47 |
llama-benchy (0.3.5)
date: 2026-03-25 12:03:17 | latency mode: api | pp basis: ttfr
The mesh configuration breaks 2-node clusters unless the patched NCCL is used, so until I fully test it and merge it into main, I have to reconfigure my cluster every night so my nightly pipeline succeeds.
I’m very tempted to buy a third Spark and am eagerly waiting to see if you manage to squeeze out a little more performance. Even just relieving memory pressure and being able to fit more context would be nice.
We have updated the Three sparks in a Ring playbook to allow higher bandwidth. Please check it out now: Connect Three DGX Spark in a Ring Topology | DGX Spark
