First of all, hello to everyone, I’m Danny, new to the forum and excited to be here! I’m thinking of purchasing 2 DGX Sparks to run in a cluster… Primary use case is running my OpenClaw build’s brain and vibe coding web and mobile apps. My thought process is that hosting the best LLM I can possibly run for OpenClaw and vibe coding should provide the best output… I know that might not be accurate, but I also want this thing to be somewhat future-proof, at least for a few years, and as LLMs grow, a cluster should let me run the majority of the LLMs I’d want. Of course, being able to throw a 3rd or 4th Spark into the cluster later is also an option if needed.
So with all that being said… please chime in on whether a cluster would be overkill for my use case. What is the largest LLM you have been able to run successfully on a cluster? Also, I’m thinking these things are probably going to go up in price soon, given NVIDIA just increased it on their end and the ASUS Ascent is still going for $3,000 for the 1TB version… Something tells me that price won’t last.
My favorites for a dual-Spark setup so far have been MiniMax-M2.5 (229B) and QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ.
You can also run openai/gpt-oss-120b, Qwen/Qwen3.5-122B-A10B and Qwen/Qwen3-Coder-Next-FP8 at really good performance on a cluster.
You could also run unsloth/Qwen3.5-397B-A17B-GGUF:Q3_K_M, but since that means 3-bit quantization, you’ll get better results with the first two (MiniMax-M2.5 and Qwen3-VL-235B).
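If it helps, a rough way to sanity-check what fits is to estimate the weight size from parameter count and bits per weight. Here is a minimal Python sketch, not a real sizing tool. The assumptions are mine, not from the posts above: 128 GB of unified memory per Spark, 20% of memory held back for KV cache, activations and the OS, and approximate effective bits-per-weight figures (about 4 for a 4-bit quant, about 3.9 for Q3_K_M). The function names are made up for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, nodes: int,
         gb_per_node: float = 128.0, reserve_fraction: float = 0.2) -> bool:
    """True if the weights fit in the memory left after the reserve.

    gb_per_node and reserve_fraction are assumptions; tune them for your setup.
    """
    usable_gb = nodes * gb_per_node * (1 - reserve_fraction)
    return weight_gb(params_billion, bits_per_weight) <= usable_gb


if __name__ == "__main__":
    # (label, parameters in billions, assumed effective bits per weight)
    candidates = [
        ("229B at ~4 bits", 229, 4.0),
        ("397B at ~3.9 bits (Q3_K_M, assumed)", 397, 3.9),
        ("397B at ~4.5 bits (assumed 4-bit GGUF)", 397, 4.5),
    ]
    for label, params, bits in candidates:
        size = weight_gb(params, bits)
        print(f"{label}: ~{size:.0f} GB, "
              f"1 Spark: {fits(params, bits, 1)}, 2 Sparks: {fits(params, bits, 2)}")
```

Under these assumptions a 397B model only squeezes onto two Sparks at around 3 bits, which is why the smaller models at 4-bit are the better trade. Context length eats into the reserve too, so treat the result as a first filter only.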
Try Step 3.5 Flash: 196B parameters, 10B active, and a highly efficient attention mechanism. Two quants work great on the Spark: the official Q4_K_S and the IQ4_XS.
The official one is stable but limited to ~190k context, getting a bit over 20 t/s and decreasing to about 8 t/s near the context limit. The other is a little smaller, so you can have 256k context, and it is actually better in perplexity.
The IQ4_XS from ubergarm mixes precisions and thus isn’t quite as fast as the official one: 17 t/s up to around 80k context, then it drops to around 3-4 t/s at 256k (but stays stable). It also has a broken Jinja template that you have to fix for tool calling.
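To put those t/s numbers in perspective, here is a quick Python sketch that turns a decode speed into wall-clock time for one reply. The 1000-token reply length is a hypothetical figure, the 3.5 t/s value is just the midpoint of the 3-4 t/s range above, and the function name is made up. It ignores prompt processing time, which gets significant at long context.

```python
def generation_seconds(new_tokens: int, tokens_per_second: float) -> float:
    """Time to decode new_tokens at a steady rate (prompt processing not included)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return new_tokens / tokens_per_second


if __name__ == "__main__":
    REPLY_TOKENS = 1000  # hypothetical reply length, not a measured value

    # Decode speeds quoted in this thread for Step 3.5 Flash on one Spark.
    rates = {
        "Q4_K_S, short context": 20.0,
        "Q4_K_S, near ~190k": 8.0,
        "IQ4_XS, up to ~80k": 17.0,
        "IQ4_XS, at 256k (midpoint of 3-4)": 3.5,
    }
    for label, tps in rates.items():
        seconds = generation_seconds(REPLY_TOKENS, tps)
        print(f"{label}: ~{seconds:.0f} s for {REPLY_TOKENS} tokens")
```

So the same reply goes from under a minute at short context to several minutes at 256k, which matters a lot for agentic or tool-calling loops.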
Several other releases in the last two weeks overshadowed Step 3.5’s release, but it is awesome on 1x Spark. I’m planning to make a post about this model, but I want to try a new AutoRound quant with vLLM before I do.