Just got a second DGX Spark and stacked both over a 200 Gb QSFP56 ConnectX-7 link using NVIDIA’s stacked-Sparks guide.
Single-node setup has been solid: GB10 Blackwell, 121 GiB unified memory, Ubuntu 24.04 ARM, vLLM Docker, NCCL over direct attach. I’m currently running AEON-7’s Qwen3.6-27B AEON Ultimate Uncensored Multimodal in NVFP4 with DFlash + MTP, getting around 301 tok/sec aggregate at 128 concurrent users with 262K context. Peak memory is roughly 95 to 110 GiB.
My workload is agentic, not normal chat. I run OpenClaw with about two dozen agents handling supervisor work, sneaker/business tasks, mail/calendar, vision, multimodal, and long-context tool use.
Now that I have two Sparks, I’m deciding between:
Scaling the same 27B for more parallel sessions and throughput
Running a larger supervisor model in the 70B, MoE, or 100B+ range across both nodes
Curious what others are running on 2-node Spark setups:
Model and quant?
Tensor parallel, pipeline parallel, or KV cache sharding?
Any DFlash, EAGLE, or MTP speculative decoding success across nodes?
For agentic work, are dense models like Qwen 70B or DeepSeek preferred over MoE models like Mixtral or GLM?
Has anyone tried MiniMax M2.7 or GLM-5.1 across two Sparks?
I care most about controllability, long context, structured output, tool use, and keeping “thinking” off for worker agents while saving reasoning for the supervisor layer.
Happy to share single-node bench numbers and Compose files if useful.
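For the sizing question above (same 27B versus something in the 70B to 100B+ range), a rough weight-only estimate can help frame what fits. This is a minimal sketch, not a measurement: the function names are hypothetical, it assumes an even split of weights across nodes, and it ignores KV cache, activations, quantization scales, and runtime overhead, so real usage will be higher.

```python
# Rough weight-only memory estimate for quantized models.
# Assumption: weights dominate and are split evenly across nodes.
# KV cache, activations, and runtime overhead are NOT included.

GIB = 1024 ** 3


def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


def per_node_gib(params_billions: float, bits_per_weight: float, nodes: int = 2) -> float:
    """Weights held by each node if they are divided evenly."""
    return weight_gib(params_billions, bits_per_weight) / nodes


# Illustrative parameter counts at 4 bits per weight.
for label, size_b in [("27B", 27), ("70B", 70), ("397B", 397)]:
    total = weight_gib(size_b, 4)
    print(f"{label} @ 4-bit: ~{total:.1f} GiB total, ~{per_node_gib(size_b, 4):.1f} GiB per node")
```

Whatever is left after the weights has to cover the KV cache for long context and many concurrent sessions, which is the trade-off between the two options.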
For large models across two Sparks right now, the best options seem to be MiniMax 2.7 and Qwen 3.5 397B.
In my experience, MiniMax is the stronger model for language tasks, while Qwen is currently the best large multimodal model you can run on two Sparks.
Both handle OpenClaw very well, but running them heavily limits your available Spark resources, and you generally can’t run additional side models above 8B alongside them.
Also worth noting: if you’re running these on a two-Spark setup, vLLM is basically the primary deployment route, and it consumes nearly all available VRAM.
If you’re running smaller models, I’d recommend SGLang instead. It handles multiple users and agent calls much better. Ollama is not ideal for that use case, whereas vLLM and SGLang both support it well.
If you use SGLang, I like pairing it with LM Studio on the Sparks. You can install LM Studio on both units and run one model per Spark. It is not always the most optimal setup, since some models still need community support and tuning, but it can make deployment and management a bit easier.
I have a comparable use case and also upgraded to a second Spark two weeks ago.
I am still using MiniMax 2.5, which does a decent job on two Sparks. For coding tasks I prefer qwen3-coder-next, also on two Sparks. Nemotron-3-super would also be a nice candidate, but I don’t like its personality. For research tasks I often use Nemotron-3-nano, served by LM Studio. Most of the other LLMs run in eugr’s vLLM Docker setup.
I do have the option, though, of moving the main agent (qwen3.6:35B) to an RTX 5090. That way I can use the RAM of both Sparks and have a really quick main agent that manages the subagents for coding, research, etc. I like that setup very much.
To isolate my OpenClaw, I am using a Strix Halo machine with 128 GB. This is maybe the best part of my setup, because 90% of the time my agent is doing pretty much nothing or only standard tasks. During those times all other machines are off and the main agent runs on the AMD Strix Halo. That machine draws only 13 W at idle. My Sparks have never drawn less than 65 W each at idle since I connected them. The AMD is ~30% slower than the Sparks, but it does its job very efficiently.
By the way: have you noticed that OpenClaw does not realize when you change models in LM Studio? That’s a very nice way of testing different models, because you never have to change OpenClaw’s config. When using vLLM, it seems OpenClaw always has to know which LLM it is talking to.
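One way to check what a server is actually serving, rather than hard-coding it, is to ask the endpoint. Both vLLM and LM Studio expose an OpenAI-compatible `GET /v1/models` route. The sketch below is only an illustration: the base URL is an assumption (use whatever host and port your server listens on), the helper names are hypothetical, and how the result is fed back into OpenClaw’s config is left open.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Query an OpenAI-compatible server for the models it currently serves."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))


# Usage (assumed address, adjust to your setup):
#   print(list_served_models("http://localhost:8000/v1"))
```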
I’m also using MiniMax 2.7 (cyankiwi/MiniMax-M2.7-AWQ-4bit, 128K context, via spark-vllm-docker and a custom recipe) on a dual cluster for my totally local OpenClaw. I get ~2,800 tokens/s pp and ~42 tokens/s tg (apologies for my initial typo with a crazy 58, I misread my notes) according to llama-benchy. Overall this feels very good in OpenClaw.
I’m running OpenClaw on the worker-node in the cluster, which can e.g. spawn a local whisper.cpp for ASR and other utilities. On the primary cluster-node, I also run a small llama.cpp server for embeddings for OpenClaw’s memory-search, and maybe other things in the future.
I’m very interested in DeepSeek V4, once it’s stable in spark-vllm-docker.
A follow-up question for you all: since MiniMax via vLLM, OpenClaw, and the llama.cpp embeddings server consume almost all the memory on the cluster (primary node 95.6/121 GB, secondary node 94.4/121 GB), has anybody successfully run TurboQuant or some other memory-saving technique? I set “gpu_memory_utilization: 0.7” in my recipe.
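To put those numbers side by side, here is a small back-of-the-envelope sketch. It assumes vLLM fills its whole budget and that the budget is simply the utilization fraction times the 121 GB reported per node; both are simplifications, and the function names are hypothetical.

```python
# Back-of-the-envelope memory budget per node.
# Assumptions: vLLM uses its full gpu_memory_utilization budget, and the
# budget is computed against the 121 GB total quoted above.

TOTAL_GB = 121.0
GPU_MEMORY_UTILIZATION = 0.7


def vllm_budget_gb(total_gb: float, utilization: float) -> float:
    """Memory vLLM is allowed to claim on one node."""
    return total_gb * utilization


def outside_vllm_gb(used_gb: float, total_gb: float, utilization: float) -> float:
    """Observed usage that is not covered by the vLLM budget."""
    return used_gb - vllm_budget_gb(total_gb, utilization)


for node, used in [("primary", 95.6), ("secondary", 94.4)]:
    budget = vllm_budget_gb(TOTAL_GB, GPU_MEMORY_UTILIZATION)
    other = outside_vllm_gb(used, TOTAL_GB, GPU_MEMORY_UTILIZATION)
    free = TOTAL_GB - used
    print(f"{node}: vLLM budget ~{budget:.1f} GB, other ~{other:.1f} GB, free ~{free:.1f} GB")
```

Under these assumptions most of the observed usage is the vLLM reservation itself, so lowering `gpu_memory_utilization` (at the cost of KV cache space) is the most direct lever.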
From my understanding of the DGX Spark cluster specs, MiniMax is obviously taking up a large amount of RAM, and vLLM is going to use RAM regardless, so TurboQuant will not really help reduce RAM usage in this case. The bigger issue seems to be that vLLM itself is extremely memory hungry. It is not just because you are running a very large model. Even if you switch to a different model size, vLLM still tends to consume a similar proportion of available VRAM because it wants to reserve as much as possible and then manage that memory allocation internally.
Could you post your recipe on the Spark Arena leaderboard? 58 tokens per second is by far the fastest I have heard reported for that model.
I am not an expert here, Andrea, but I will say this: with MiniMax’s speed and the fact that I can use vLLM and run multiple agents on it, I don’t really need any other model running.
Why are you using llama.cpp for an embedding model? With the no-Ray setup you can run the embedding model on vLLM. I’m doing this for mxbai-embed-large and allot 1 GiB of RAM per node to accommodate it.
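Whichever server hosts the embedding model, the client side can stay the same, since both vLLM and llama.cpp’s llama-server offer an OpenAI-compatible `/v1/embeddings` route. A minimal sketch, assuming a base URL and model id that are placeholders (the id your server expects may differ), with hypothetical helper names:

```python
import json
import urllib.request


def build_embedding_request(base_url: str, model: str, texts: list[str]) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/embeddings endpoint."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_vectors(response: dict) -> list[list[float]]:
    """Return embeddings in input order from an OpenAI-style response body."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Usage (assumed address and model id, adjust to your setup):
#   req = build_embedding_request("http://localhost:8000/v1", "mxbai-embed-large", ["hello"])
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       vectors = extract_vectors(json.load(resp))
```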
@Alexander-F - Thanks for the turboquant clarification. Sorry, the speed was me misreading my notes (also corrected it in the post above). I just tested it again (after my cluster was running for a day), nothing special:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| minimaxai/MiniMax-M2.7 | pp2048 | 2987.25 ± 7.40 | | 691.73 ± 1.70 | 685.58 ± 1.70 | 691.85 ± 1.67 |
| minimaxai/MiniMax-M2.7 | tg32 | 41.84 ± 0.04 | 43.20 ± 0.04 | | | |
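The pp2048 rate is consistent with the est_ppt column: 2048 tokens in about 685.58 ms is roughly 2987 tokens/s. As a rough sketch of what these rates mean for an agent turn, the snippet below estimates prefill and decode time. It assumes the measured rates stay constant at other prompt lengths and that nothing is served from a prefix cache; both are simplifications, and prefill usually slows down at long context. The function names are hypothetical.

```python
# Turn-latency estimate from the benchmark rates above.
# Assumption: pp and tg rates are constant regardless of prompt length,
# which is optimistic for very long prompts.

PP_TOKENS_PER_S = 2987.25  # prompt processing rate (pp2048)
TG_TOKENS_PER_S = 41.84    # token generation rate (tg32)


def prefill_seconds(prompt_tokens: int, pp_rate: float = PP_TOKENS_PER_S) -> float:
    """Time to process the prompt before the first output token."""
    return prompt_tokens / pp_rate


def decode_seconds(output_tokens: int, tg_rate: float = TG_TOKENS_PER_S) -> float:
    """Time to generate the output tokens."""
    return output_tokens / tg_rate


def turn_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Total estimated time for one request."""
    return prefill_seconds(prompt_tokens) + decode_seconds(output_tokens)


for prompt, output in [(2048, 32), (32_000, 500), (128_000, 500)]:
    print(f"{prompt} in / {output} out: ~{turn_seconds(prompt, output):.1f} s")
```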
My cobbled-together recipe, which might be inconsistent since I just copy/pasted it:
Some references say --override-generation-config '{{"top_k":40,"top_p":0.95,"temperature":1.0,"min_p":0.01}}' \ might also be useful; I’m still testing whether this is true.
I don’t like vLLM containers and am way more comfortable just running llama.cpp. I “grew up” on generative AI with llama.cpp on my Macs, Jetsons, and PC. Yes, vLLM is better for containers, production, … but I hate its glued-togetherness/brittleness/complexity. If llama.cpp supported DGX Spark clusters nicely, I’d rather run only llama.cpp’s llama-server.
My embeddings llama-server uses 640MB, starts up very quickly (GGUF model) and for me, it runs like a charm.
I’ve been running Qwen3.6 35b and MiniMax M2.7 the last month and have really been enjoying that combo.
MiniMax is the “engineer” with its superior coding ability, while Qwen3.6 is more of an “assistant” with its multimodal capabilities and much faster TPS.
The models I use, that fit simultaneously with full context, are: