Should we as a community gofundme one Spark for Eugr's nightly builds?

We can still fund the 4th spark :D
That’s money well spent.

2 Likes

I totally agree, I’m in

Let’s wait, otherwise there won’t be an incentive to try 3-node configurations :) We already know that 4-node configurations work well, although of course I wouldn’t mind getting a 4th one at some point :)

3 Likes

By the way, thanks to the community and the enthusiasm shown for Spark, the third Spark is on the way. The question is, do we use an ABC triangle or a router (like the Mikrotik CRS804 or similar) to connect them? Thanks in advance.

A switch would be more straightforward, but I want to try a mesh. With 3 Sparks you can get a direct connection between any pair of nodes by utilizing both ports. The downside is that there will be three different subnets.
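
For anyone planning the mesh, here's a sketch of what the per-link addressing could look like, with one small subnet per direct link. Everything in it is hypothetical for illustration (the node names, the "ifaceA"/"ifaceB" interface names, and the 10.0.x.x/30 subnets are mine); check the real port names with `ip link` first.

```shell
# Hypothetical addressing for a 3-node full mesh, one /30 subnet per link.
# Link A: spark1 <-> spark2 on 10.0.12.0/30
# Link B: spark2 <-> spark3 on 10.0.23.0/30
# Link C: spark1 <-> spark3 on 10.0.13.0/30
# "ifaceA"/"ifaceB" are placeholders for each node's two 200G ports.

# on spark1
sudo ip addr add 10.0.12.1/30 dev ifaceA   # port cabled to spark2
sudo ip addr add 10.0.13.1/30 dev ifaceB   # port cabled to spark3

# on spark2
sudo ip addr add 10.0.12.2/30 dev ifaceA   # port cabled to spark1
sudo ip addr add 10.0.23.1/30 dev ifaceB   # port cabled to spark3

# on spark3
sudo ip addr add 10.0.23.2/30 dev ifaceA   # port cabled to spark2
sudo ip addr add 10.0.13.2/30 dev ifaceB   # port cabled to spark1
```

With /30s each link is its own subnet, so every pair talks directly and routing stays unambiguous.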

2 Likes

The playbook is up: Connect Three DGX Spark in a Ring Topology | DGX Spark

with an interesting note: “In a three-node ring topology, all four interfaces on each node must be assigned an IP address to form a symmetric cluster.”

1 Like

I’m not sure this makes any sense. Why would they need to assign IPs to both “twins” on each port? Also, they assign IPs from the same subnet to both “twins” (which are separate interfaces from the OS standpoint) - this is a bad practice from the networking standpoint and can cause lots of weird issues (and will break autodiscovery).
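
If you do follow the playbook’s same-subnet layout anyway, the classic symptom is ARP flux (either interface answering ARP for either IP). The usual mitigation below is generic Linux advice, not anything Spark-specific, and it only reduces the weirdness rather than making the design good practice:

```shell
# Generic Linux mitigation for two interfaces sharing one subnet (ARP flux).
sudo sysctl -w net.ipv4.conf.all.arp_ignore=1    # answer ARP only if the target IP lives on the receiving interface
sudo sysctl -w net.ipv4.conf.all.arp_announce=2  # always source ARP from the best local address for the target
```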

1 Like

Ah, that makes a lot of sense — just to get you and the other experts here to come up with a community solution, as always… ;)

2 Likes

Thank you very much for the replies, I’ll play a bit as soon as I get around to it.

Patrick from ServeTheHome had some recommendations in that thread, as they tested switches with Sparks. A new video on that topic will be up shortly.

2 Likes

Watching progress on a 3-node cluster setup with great interest :)

Not even a question for me - as soon as I see the link, I’m straight up paying Eugr back for all his support! The man earned it more than those @nvidia marketing clankers talking crap.

1 Like

In a mesh scenario you’d have to use both ports on each Spark, which from what I understand would drop the speed to 100G on each.
“To achieve 200 Gbps, the CX7 uses multi-host mode aggregating both PCIe x4 links — so both ports are always active together even through a single cable. One cable = 200G (both PCIe x4 links feeding that one port).”

1 Like

I just caught up to this – wasn’t really browsing forums yesterday. Interesting. You’re right, that doesn’t make sense…

Maybe they intended for parallelism to still treat it as 2x NCCL/RoCEv2 interfaces even though our cap is still 100G? Although in that case, we’d need to use 6x separate subnets instead of 3… Not that I’m convinced that would bring any benefit – in fact I’m pretty certain that would only be a drawback.

And if they didn’t intend that (to leverage both partitions to be 2x interfaces per interconnect), then yeah it’s just superfluous… but I don’t think it would really hurt either. So I don’t know if it’s wrong but it is unnecessary.

2 Likes

Well, I was right. The bandwidth was half of what it should be with that kind of setup (with 2 or 3 nodes). With a proper setup (each subinterface has its own subnet), everything works as it should.

It does require a special NCCL build - stay tuned.
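
For reference, NCCL’s standard selection knobs can pin traffic to the intended interfaces while testing. The interface and HCA names below are placeholders, not the Spark’s actual device names:

```shell
# Standard NCCL environment knobs (names here are placeholders).
export NCCL_SOCKET_IFNAME=ifaceA,ifaceB   # interfaces for bootstrap/socket traffic
export NCCL_IB_HCA=mlx5_0,mlx5_1          # RDMA devices NCCL may use
export NCCL_DEBUG=INFO                    # log which interfaces/paths NCCL picks
```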

When set up as in the playbook:

# nccl-tests version 2.17.9 nccl-headers=22903 nccl-library=22907
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 3(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  62831 on      spark device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid   4016 on     spark2 device  0 [000f:01:00] NVIDIA GB10
#  Rank  2 Group  0 Pid   8382 on     spark3 device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 17179869168    1431655764     float    none      -1   656678   26.16   17.44       0   651586   26.37   17.58       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 17.5093
#
# Collective test concluded: all_gather_perf

When set up properly:

# nccl-tests version 2.17.9 nccl-headers=22903 nccl-library=22907
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 3(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  65784 on      spark device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid   7094 on     spark2 device  0 [000f:01:00] NVIDIA GB10
#  Rank  2 Group  0 Pid   9011 on     spark3 device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 17179869168    1431655764     float    none      -1   477778   35.96   23.97       0   472288   36.38   24.25       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 24.1112
#
# Collective test concluded: all_gather_perf
#
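
For anyone sanity-checking these logs: nccl-tests computes all_gather bus bandwidth as algbw × (n − 1)/n, and algbw is simply bytes moved divided by time. A quick awk recheck using the out-of-place times from the two runs above:

```shell
# Recompute algbw/busbw from the out-of-place times above (3 ranks).
size=17179869168   # bytes per all_gather
for t_us in 656678 477778; do
  awk -v s="$size" -v t="$t_us" 'BEGIN {
    algbw = s / (t * 1e-6) / 1e9   # GB/s
    busbw = algbw * 2 / 3          # all_gather factor (n-1)/n with n=3
    printf "time=%sus  algbw=%.2f  busbw=%.2f\n", t, algbw, busbw
  }'
done
# Prints:
# time=656678us  algbw=26.16  busbw=17.44
# time=477778us  algbw=35.96  busbw=23.97
```

These match the 26.16/17.44 and 35.96/23.97 figures in the logs, so the proper-subnet setup really is moving ~37% more data per unit time.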
5 Likes

3-spark’s Eugr be like

6 Likes

OK, I’ve got a 3-node mesh working with my internal builds. For the most part, it makes more sense to just use 2 nodes to run larger models and a third one to run embedding/small models and other stuff.

Tensor-parallel doesn’t work with 3 nodes as most (all?) models don’t support this configuration. However, there are other options.

I tried running Qwen3-397B-int4-autoround with pipeline-parallel-size=3; it fits the model plus ~500K of context with gpu-memory-utilization 0.7. Performance is obviously lower than running on dual Sparks, but it’s somewhat usable. I want to try Expert Parallel to see if it works better for this config.
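
Those flags read like vLLM’s, so the launch presumably resembles the sketch below. This is an assumption on my part: the Ray prerequisite, the backend flag, and reusing the model id from the benchmark table are all guesses, not the exact command used.

```shell
# Hypothetical 3-node pipeline-parallel launch, assuming vLLM with a Ray
# cluster already spanning the three Sparks (run `ray start` on every node first).
vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound \
  --pipeline-parallel-size 3 \
  --gpu-memory-utilization 0.7 \
  --distributed-executor-backend ray
```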

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 | 868.36 ± 136.55 | | 2431.50 ± 429.00 | 2427.07 ± 429.00 | 2431.64 ± 428.99 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 | 18.45 ± 0.46 | 19.33 ± 0.47 | | | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_pp @ d4096 | 781.30 ± 530.48 | | 46168.34 ± 60274.66 | 46163.91 ± 60274.66 | 46168.48 ± 60274.67 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_tg @ d4096 | 17.90 ± 0.29 | 18.67 ± 0.47 | | | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 @ d4096 | 364.60 ± 17.69 | | 5635.19 ± 282.86 | 5630.77 ± 282.86 | 5635.47 ± 282.86 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 @ d4096 | 18.04 ± 0.12 | 19.00 ± 0.00 | | | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_pp @ d8192 | 1180.34 ± 8.33 | | 6945.97 ± 48.72 | 6941.54 ± 48.72 | 6946.23 ± 48.78 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_tg @ d8192 | 17.77 ± 0.23 | 19.00 ± 0.00 | | | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 @ d8192 | 366.17 ± 2.00 | | 5597.64 ± 30.64 | 5593.21 ± 30.64 | 5597.74 ± 30.64 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 @ d8192 | 17.85 ± 0.08 | 18.00 ± 0.00 | | | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_pp @ d16384 | 1146.14 ± 3.18 | | 14300.90 ± 39.49 | 14296.47 ± 39.49 | 14301.15 ± 39.63 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_tg @ d16384 | 17.50 ± 0.47 | 19.00 ± 1.41 | | | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 @ d16384 | 339.53 ± 0.78 | | 6036.26 ± 13.88 | 6031.84 ± 13.88 | 6036.51 ± 13.77 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 @ d16384 | 17.86 ± 0.23 | 18.33 ± 0.47 | | | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_pp @ d32768 | 1061.60 ± 1.94 | | 30872.52 ± 56.36 | 30868.09 ± 56.36 | 30872.70 ± 56.47 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_tg @ d32768 | 17.76 ± 0.36 | 18.67 ± 0.47 | | | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 @ d32768 | 306.20 ± 3.18 | | 6693.63 ± 69.29 | 6689.20 ± 69.29 | 6693.76 ± 69.31 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 @ d32768 | 17.45 ± 0.02 | 18.33 ± 0.47 | | | |

llama-benchy (0.3.5)
date: 2026-03-25 12:03:17 | latency mode: api | pp basis: ttfr

3 Likes

The mesh configuration breaks 2-node clusters unless a patched NCCL is used, so until I fully test it and merge it into main, I have to reconfigure my cluster every night so my nightly pipeline succeeds.

4 Likes

I’m very tempted to buy a third Spark and eagerly await to see if you manage to squeeze out a little more performance. Even just releasing memory pressure and being able to fit more context would be nice.

1 Like

We have updated the Three sparks in a Ring playbook to allow higher bandwidth. Please check it out now: Connect Three DGX Spark in a Ring Topology | DGX Spark

1 Like