Should we as a community GoFundMe one Spark for Eugr's nightly builds?

I’m pretty happy with how the 3-Spark mesh support is working with our community Docker now, so I’ll be releasing it later this week, after I update the documentation and coordinate with @dbsci so that Sparkrun supports it on day 1 as well.

My build pipeline has been running the updated version for a whole week now.


@eugr can you preview some insights for the 3 Spark setup? What performance do you see? Can you run the common models like Qwen 3.5 397b or does the TP=3 setup cause issues? I’m very tempted

See my post above, I benchmarked Qwen 3.5 397B Autoround.

TL;DR: you can’t do tp=3; none of the existing models support it.

  • --pipeline-parallel 3 will let you run a model that can’t fit on dual Sparks, but without additional speed improvements (total throughput in concurrent settings may still improve, though).

  • --data-parallel 3 (possibly with --enable-expert-parallel) will let you run a model that fits on a single Spark, while allowing for better concurrency.

You can also run models with --tensor-parallel 2 in a 3-node configuration - in this case only two nodes will be utilized, so basically you run it as a two-node cluster.
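For reference, here is a rough sketch of those three launch modes using vLLM's long-form flag names (the model names and ports are placeholders, and I'm assuming the multi-node wiring, Ray plus NCCL interface configuration, is handled by the cluster scripts as in spark-vllm-docker/sparkrun):

```shell
# Sketch only: long-form vLLM flags for the three modes above.
# Model names are placeholders, not recommendations.

# pp=3: split layers across three Sparks (fits bigger models, single-Spark speed)
vllm serve my-org/big-model --pipeline-parallel-size 3

# dp=3 (+ expert parallel for MoE models): three replicas for concurrency
vllm serve my-org/small-model --data-parallel-size 3 --enable-expert-parallel

# tp=2 on a 3-node mesh: only two of the three nodes do the work
vllm serve my-org/mid-model --tensor-parallel-size 2
```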

I think for the majority of users, a 3-node mesh would be best used for one large(r) model on two nodes, and embedding + reranking + a small fast model + maybe TTS on the third node.
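That split could look something like this (all model names and ports are made up for illustration; the llama-server flags for embedding and reranking are from llama.cpp, which some people prefer for the small utility models):

```shell
# Hypothetical 3-node layout. Every path/name below is a placeholder.

# Nodes 1+2: one larger model with tensor parallelism across the pair
vllm serve my-org/large-model --tensor-parallel-size 2 --port 8000

# Node 3: embedding + reranking + a small fast chat model via llama-server
llama-server -m embedding-model.gguf --embedding --port 8001 &
llama-server -m reranker-model.gguf --reranking --port 8002 &
llama-server -m small-chat-model.gguf --port 8003
```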

Also, when/if NVFP4 gets full hardware support, something like Qwen3.5-397B-NVFP4 in pp=3 would be a viable choice too. I want to actually try it next (just need to clean up some space first).

I’m primarily interested in running models at decent performance (e.g., what I get with my dual cluster) but ideally with less memory pressure. Your benchmark looked a little worse than what I get with TP=2 right now. Do you expect to see more improvements? Otherwise I may have to hunt for a fourth spark in addition to the third I am considering 😂

Only tensor-parallel can give you better performance with more Sparks. Unless someone implements uneven splits for TP in vLLM, it will only work for even numbers (and for most models, a power of 2: 2, 4, 8).

Pipeline Parallelism splits layers, so it lets you run bigger models, but at a speed equal to (or slightly below) a single Spark’s. Data Parallelism lets you serve more users by running a clone of the model on each node.

So yeah, you’d need 4 nodes to get more memory AND better performance.
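The even-split constraint comes down to divisibility: vLLM shards the attention heads across TP ranks, so the rank count has to divide the head count. A quick sketch with 64 heads (a common head count for large models, used here only as an example):

```shell
# Why tp=3 fails while tp=2 and tp=4 work: 64 heads must split evenly.
heads=64
for tp in 1 2 3 4; do
  if [ $((heads % tp)) -eq 0 ]; then
    echo "tp=$tp: $((heads / tp)) heads per rank"
  else
    echo "tp=$tp: no even split ($heads heads, remainder $((heads % tp)))"
  fi
done
```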


I run this type of setup, sort of, with two Sparks ganged together running a large model and an HP Z2 Strix Halo running embedding & reranking. Not as elegant as three Sparks, but it does the job.

I have a Strix Halo as well, it currently runs gpt-oss-120b and qwen3-vl-8b in llama.cpp.
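For anyone curious, a minimal sketch of serving both of those with llama-server (GGUF paths and ports are placeholders; vision models like qwen3-vl also need their multimodal projector file passed via --mmproj):

```shell
# Placeholder paths: substitute your actual local GGUF files.
llama-server -m gpt-oss-120b.gguf --port 8080 &
llama-server -m qwen3-vl-8b.gguf --mmproj qwen3-vl-8b-mmproj.gguf --port 8081
```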

OK, 3 node support is officially out: Three node Spark clusters (without a switch) are now supported in spark-vllm-docker and sparkrun!


Too soon to try the NVFP4 version of Qwen 397b on 3-node?

No, actually, it’s a good time to try it. Today’s build includes some improvements to FlashInfer that finally get rid of the autotuner errors, and NVFP4 seems to be quite stable now. It may improve prompt processing performance too, but token generation will likely be the same or a bit slower than int4 AutoRound.

Do you think the performance on 3 nodes would be the same with Ring as with a router? Thanks

Theoretically, the mesh should be faster since there is no switching overhead: every node is connected directly to the other two. So it really depends on whether NCCL has any mesh-related inefficiencies.
