We can still fund the 4th spark :D
That’s money well spent.
I totally agree, I’m in
Let’s wait, otherwise there won’t be an incentive to try 3-node configurations :) We already know that 4-node configurations work well, although of course I wouldn’t mind getting a 4th one at some point :)
By the way, thanks to the community and the enthusiasm shown for Spark, the third Spark is on the way. The question is: do we connect them in an A–B–C triangle (direct mesh) or through a router/switch (like the Mikrotik CRS804 or similar)? Thanks in advance.
A switch would be more straightforward, but I want to try a mesh. With 3 Sparks you can get a direct connection between every pair by using both ports. The downside is that there will be three different subnets.
The playbook is up: Connect Three DGX Spark in a Ring Topology | DGX Spark
with an interesting note: In a three node ring topology all four interfaces on each node must be assigned an IP address to form a symmetric cluster
I’m not sure this makes any sense. Why would they need to assign IPs to both “twins” on each port? Also, they assign IPs from the same subnet to both “twins” (which are separate interfaces from the OS standpoint) — this is bad practice from a networking standpoint and can cause lots of weird issues (and will break autodiscovery).
Ah, that makes a lot of sense — just to get you and the other experts here to come up with a community solution, as always… ;)
Thank you very much for the replies, I’ll play a bit as soon as I get around to it.
Patrick from ServeTheHome had some recommendations in that thread, as they tested switches with Sparks. A new video on that topic will be up shortly.
Watching progress on a 3-node cluster setup with great interest :)
Not even a question for me - as soon as i see the link - I’m straight up paying Eugr back for all his support! The man earned it more than those @nvidia marketing clankers talking crap.
In a mesh scenario you’d have to use two ports on each Spark, which, from what I understand, would drop the speed to 100G on each.
”To achieve 200 Gbps, the CX7 uses multi-host mode aggregating both PCIe x4 links — so both ports are always active together even through a single cable.”
”One cable = 200G (both PCIe x4 links feeding that one port)”
I just caught up to this – wasn’t really browsing forums yesterday. Interesting. You’re right that doesn’t make sense…
Maybe they intended for parallelism to still treat it as 2x NCCL/RoCEv2 interfaces even though our cap is still 100G? Although in that case, we’d need to use 6x separate subnets instead of 3… Not that I’m convinced that would bring any benefit – in fact I’m pretty certain that would only be a drawback.
And if they didn’t intend that (to leverage both partitions to be 2x interfaces per interconnect), then yeah it’s just superfluous… but I don’t think it would really hurt either. So I don’t know if it’s wrong but it is unnecessary.
Well, and I was right. The bandwidth was half of what it should be with this kind of setup (with 2 or 3 nodes). With a proper setup (each subinterface has its own subnet), everything works as it should.
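For anyone replicating the “proper” setup, here is a minimal sketch of the addressing scheme. The interface names and subnets below are assumptions for illustration (check your actual names with `ip link`), not values from the playbook — the point is just that each point-to-point link gets its own subnet, so no two interfaces on a node share one:

```shell
# Sketch only: interface names and subnets are hypothetical examples.
# One subnet per link:
#   spark  <-> spark2 : 10.0.12.0/24
#   spark  <-> spark3 : 10.0.13.0/24
#   spark2 <-> spark3 : 10.0.23.0/24

# On spark (first port cabled toward spark2, second toward spark3):
sudo ip addr add 10.0.12.1/24 dev enP2p1s0f0np0   # hypothetical name
sudo ip addr add 10.0.13.1/24 dev enP2p1s0f1np1   # hypothetical name
sudo ip link set enP2p1s0f0np0 up
sudo ip link set enP2p1s0f1np1 up

# Repeat on spark2 (10.0.12.2 / 10.0.23.1) and spark3 (10.0.13.2 / 10.0.23.2).
```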
It does require a special NCCL build - stay tuned.
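In the meantime, this is roughly how you can point NCCL at the mesh when running nccl-tests. The environment variables are standard NCCL knobs; the interface and HCA names are placeholders you’d replace with your own:

```shell
# Placeholders: substitute your own interface/HCA names.
export NCCL_SOCKET_IFNAME=enP2p1s0f0np0,enP2p1s0f1np1  # bootstrap/TCP interfaces
export NCCL_IB_HCA=rocep1s0f0,rocep1s0f1               # RoCE devices for data
export NCCL_DEBUG=INFO                                  # confirm RoCE is selected

# Same 16 GiB all_gather as in the logs below, one rank per node:
mpirun -np 3 -H spark,spark2,spark3 \
  ./build/all_gather_perf -b 16G -e 16G -g 1 -n 20
```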
When set up as in the playbook:
```
# nccl-tests version 2.17.9 nccl-headers=22903 nccl-library=22907
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 3(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank 0 Group 0 Pid 62831 on  spark device 0 [000f:01:00] NVIDIA GB10
#  Rank 1 Group 0 Pid  4016 on spark2 device 0 [000f:01:00] NVIDIA GB10
#  Rank 2 Group 0 Pid  8382 on spark3 device 0 [000f:01:00] NVIDIA GB10
#
#                                            out-of-place                    in-place
#        size       count  type redop root   time  algbw  busbw #wrong     time  algbw  busbw #wrong
#         (B)  (elements)                    (us)  (GB/s) (GB/s)           (us)  (GB/s) (GB/s)
  17179869168  1431655764 float  none    -1 656678  26.16  17.44      0  651586  26.37  17.58      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 17.5093
#
# Collective test concluded: all_gather_perf
```
When set up properly:
```
# nccl-tests version 2.17.9 nccl-headers=22903 nccl-library=22907
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 3(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank 0 Group 0 Pid 65784 on  spark device 0 [000f:01:00] NVIDIA GB10
#  Rank 1 Group 0 Pid  7094 on spark2 device 0 [000f:01:00] NVIDIA GB10
#  Rank 2 Group 0 Pid  9011 on spark3 device 0 [000f:01:00] NVIDIA GB10
#
#                                            out-of-place                    in-place
#        size       count  type redop root   time  algbw  busbw #wrong     time  algbw  busbw #wrong
#         (B)  (elements)                    (us)  (GB/s) (GB/s)           (us)  (GB/s) (GB/s)
  17179869168  1431655764 float  none    -1 477778  35.96  23.97      0  472288  36.38  24.25      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 24.1112
#
# Collective test concluded: all_gather_perf
#
```
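As a sanity check on those columns: nccl-tests computes algbw as bytes moved per unit time, and for all_gather busbw = algbw × (n−1)/n with n ranks, so with 3 nodes busbw is exactly two thirds of algbw. Reproducing the two out-of-place rows above:

```python
# Recompute algbw/busbw from the size and time logged in both runs.
# nccl-tests reports these columns in decimal GB (1e9 bytes).

def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth: bytes moved divided by elapsed time, in GB/s."""
    return size_bytes / (time_us * 1e-6) / 1e9

def busbw_gbps(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for all_gather: algbw * (n - 1) / n."""
    return algbw * (n_ranks - 1) / n_ranks

size = 17179869168  # bytes per iteration, from the log
for label, t_us in [("playbook setup", 656678), ("proper setup", 477778)]:
    alg = algbw_gbps(size, t_us)
    print(f"{label}: algbw={alg:.2f} GB/s, busbw={busbw_gbps(alg, 3):.2f} GB/s")
# playbook setup: algbw=26.16 GB/s, busbw=17.44 GB/s
# proper setup: algbw=35.96 GB/s, busbw=23.97 GB/s
```

The numbers match the logs, which confirms the “proper” mesh run really is moving ~37% more data per second, not just reporting differently.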
OK, I’ve got 3 node mesh working with my internal builds. For the most part, it makes more sense to just use 2 nodes to run larger models and a third one to run embedding/small models and other stuff.
Tensor-parallel doesn’t work with 3 nodes, as most (all?) models don’t support this configuration — the tensor-parallel size generally has to divide the model’s attention head count, which is almost always a power of two. However, there are other options.
I tried to run Qwen3-397B-int4-autoround with pipeline-parallel-size=3, and it fits the model plus ~500K of context with gpu-memory-utilization 0.7. The performance is obviously lower than running on dual Sparks, but somewhat usable. I want to try Expert Parallel to see if it works better for this config.
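For reference, the launch would look roughly like this. This is a sketch, not my exact command: the flags are standard vLLM options, but the model identifier (taken from the benchmark table below), the context length, and the multi-node setup (a Ray cluster already running across the three Sparks) are assumptions:

```shell
# Sketch: assumes a Ray cluster spanning spark, spark2, and spark3,
# with this node as the head. Context length is approximate.
vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound \
  --pipeline-parallel-size 3 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 500000
```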
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 | 868.36 ± 136.55 | 2431.50 ± 429.00 | 2427.07 ± 429.00 | 2431.64 ± 428.99 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 | 18.45 ± 0.46 | 19.33 ± 0.47 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_pp @ d4096 | 781.30 ± 530.48 | 46168.34 ± 60274.66 | 46163.91 ± 60274.66 | 46168.48 ± 60274.67 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_tg @ d4096 | 17.90 ± 0.29 | 18.67 ± 0.47 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 @ d4096 | 364.60 ± 17.69 | 5635.19 ± 282.86 | 5630.77 ± 282.86 | 5635.47 ± 282.86 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 @ d4096 | 18.04 ± 0.12 | 19.00 ± 0.00 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_pp @ d8192 | 1180.34 ± 8.33 | 6945.97 ± 48.72 | 6941.54 ± 48.72 | 6946.23 ± 48.78 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_tg @ d8192 | 17.77 ± 0.23 | 19.00 ± 0.00 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 @ d8192 | 366.17 ± 2.00 | 5597.64 ± 30.64 | 5593.21 ± 30.64 | 5597.74 ± 30.64 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 @ d8192 | 17.85 ± 0.08 | 18.00 ± 0.00 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_pp @ d16384 | 1146.14 ± 3.18 | 14300.90 ± 39.49 | 14296.47 ± 39.49 | 14301.15 ± 39.63 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_tg @ d16384 | 17.50 ± 0.47 | 19.00 ± 1.41 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 @ d16384 | 339.53 ± 0.78 | 6036.26 ± 13.88 | 6031.84 ± 13.88 | 6036.51 ± 13.77 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 @ d16384 | 17.86 ± 0.23 | 18.33 ± 0.47 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_pp @ d32768 | 1061.60 ± 1.94 | 30872.52 ± 56.36 | 30868.09 ± 56.36 | 30872.70 ± 56.47 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | ctx_tg @ d32768 | 17.76 ± 0.36 | 18.67 ± 0.47 | |||
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 @ d32768 | 306.20 ± 3.18 | 6693.63 ± 69.29 | 6689.20 ± 69.29 | 6693.76 ± 69.31 | |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | tg32 @ d32768 | 17.45 ± 0.02 | 18.33 ± 0.47 |
llama-benchy (0.3.5)
date: 2026-03-25 12:03:17 | latency mode: api | pp basis: ttfr
The mesh configuration breaks 2-node clusters unless the patched NCCL is used, so until I fully test it and merge it into main, I have to reconfigure my cluster every night so my nightly pipeline succeeds.
I’m very tempted to buy a third Spark and am eagerly waiting to see if you manage to squeeze out a little more performance. Even just relieving memory pressure and being able to fit more context would be nice.
We have updated the Three sparks in a Ring playbook to allow higher bandwidth. Please check it out now: Connect Three DGX Spark in a Ring Topology | DGX Spark
