Should we as a community GoFundMe one Spark for Eugr's nightly builds?

I’m pretty happy with how the 3-Spark mesh support is working with our community Docker now, so I’ll be releasing it later this week, after I update the documentation and coordinate with @dbsci so that Sparkrun supports it on day 1 as well.

My build pipeline has been running the updated version for a whole week now.


@eugr can you preview some insights for the 3 Spark setup? What performance do you see? Can you run the common models like Qwen 3.5 397b or does the TP=3 setup cause issues? I’m very tempted

See my post above, I benchmarked Qwen 3.5 397B Autoround.

TL;DR: you can’t do tp=3; none of the existing models support it.

  • --pipeline-parallel 3 will let you run a model that can’t fit on dual Sparks, but without additional speed improvements (total throughput in concurrent settings may still improve, though).

  • --data-parallel 3 (possibly with --enable-expert-parallel) will let you run a model that fits on a single Spark, while allowing for better concurrency.

You can also run models with --tensor-parallel 2 in a 3-node configuration - in this case only two nodes will be utilized, so basically you run it as a two-node cluster.
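For reference, here is a rough sketch of those three launch modes using vLLM's long-form flag names (the model names and ports are placeholders, and I'm assuming the multi-node wiring, Ray plus NCCL interface configuration, is handled by the cluster scripts as in spark-vllm-docker/sparkrun):

```shell
# Sketch only: long-form vLLM flags for the three modes above.
# Model names are placeholders, not recommendations.

# pp=3: split layers across three Sparks (fits bigger models, single-Spark speed)
vllm serve my-org/big-model --pipeline-parallel-size 3

# dp=3 (+ expert parallel for MoE models): three replicas for concurrency
vllm serve my-org/small-model --data-parallel-size 3 --enable-expert-parallel

# tp=2 on a 3-node mesh: only two of the three nodes do the work
vllm serve my-org/mid-model --tensor-parallel-size 2
```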

I think for the majority of users, a 3-node mesh would be best used for one large(r) model on two nodes, and embedding + reranking + a small fast model + maybe TTS on the third node.
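That split could look something like this (all model names and ports are made up for illustration; the llama-server flags for embedding and reranking are from llama.cpp, which some people prefer for the small utility models):

```shell
# Hypothetical 3-node layout. Every path/name below is a placeholder.

# Nodes 1+2: one larger model with tensor parallelism across the pair
vllm serve my-org/large-model --tensor-parallel-size 2 --port 8000

# Node 3: embedding + reranking + a small fast chat model via llama-server
llama-server -m embedding-model.gguf --embedding --port 8001 &
llama-server -m reranker-model.gguf --reranking --port 8002 &
llama-server -m small-chat-model.gguf --port 8003
```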

Also, when/if NVFP4 gets full hardware support, something like Qwen3.5-397B-NVFP4 in pp=3 would be a viable choice too. I want to actually try it next (just need to clean up some space first).

I’m primarily interested in running models at decent performance (e.g., what I get with my dual cluster) but ideally with less memory pressure. Your benchmark looked a little worse than what I get with TP=2 right now. Do you expect to see more improvements? Otherwise I may have to hunt for a fourth spark in addition to the third I am considering 😂

Only tensor-parallel can give you better performance with more Sparks. Unless someone implements uneven splits for TP in vLLM, it will only work for even numbers (and for most models, a power of 2: 2, 4, 8).

Pipeline Parallelism splits layers, so it lets you run bigger models, but at a speed equal to (or slightly below) a single Spark’s. Data Parallelism lets you serve more users by running a clone of the model on each node.

So yeah, you’d need 4 nodes to get more memory AND better performance.
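The even-split constraint comes down to divisibility: vLLM shards the attention heads across TP ranks, so the rank count has to divide the head count. A quick sketch with 64 heads (a common head count for large models, used here only as an example):

```shell
# Why tp=3 fails while tp=2 and tp=4 work: 64 heads must split evenly.
heads=64
for tp in 1 2 3 4; do
  if [ $((heads % tp)) -eq 0 ]; then
    echo "tp=$tp: $((heads / tp)) heads per rank"
  else
    echo "tp=$tp: no even split ($heads heads, remainder $((heads % tp)))"
  fi
done
```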


I run this type of setup, sort of, with two Sparks ganged together running a large model and an HP Z2 Strix Halo running embedding & reranking. Not as elegant as three Sparks, but it does the job.

I have a Strix Halo as well, it currently runs gpt-oss-120b and qwen3-vl-8b in llama.cpp.
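For anyone curious, a minimal sketch of serving both of those with llama-server (GGUF paths and ports are placeholders; vision models like qwen3-vl also need their multimodal projector file passed via --mmproj):

```shell
# Placeholder paths: substitute your actual local GGUF files.
llama-server -m gpt-oss-120b.gguf --port 8080 &
llama-server -m qwen3-vl-8b.gguf --mmproj qwen3-vl-8b-mmproj.gguf --port 8081
```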

OK, 3 node support is officially out: Three node Spark clusters (without a switch) are now supported in spark-vllm-docker and sparkrun!


Too soon to try the NVFP4 version of Qwen 397b on 3-node?

No, actually, it’s a good time to try it. Today’s build includes some improvements to FlashInfer that finally get rid of the autotuner errors, and NVFP4 seems to be quite stable now. It may improve prompt processing performance too, but token generation will likely be the same or a bit slower than int4 AutoRound.

Do you think the performance on 3 nodes would be the same with Ring as with a router? Thanks

Theoretically, the mesh should be faster since there is no switching overhead: every node is connected directly to the other two. So it really depends on whether NCCL has any mesh-related inefficiencies.
