Now running 2x DGX Spark stacked over QSFP56 looking for model recs for agentic workloads (Hermes / OpenClaw)

Hey Guys

Just got a second DGX Spark and stacked both over a 200 Gb QSFP56 ConnectX-7 link using NVIDIA’s stacked-Sparks guide.

Single-node setup has been solid: GB10 Blackwell, 121 GiB unified memory, Ubuntu 24.04 ARM, vLLM Docker, NCCL over direct attach. I’m currently running AEON-7’s Qwen3.6-27B AEON Ultimate Uncensored Multimodal in NVFP4 with DFlash + MTP, getting around 301 tok/sec aggregate at 128 concurrent users with 262K context. Peak memory is roughly 95 to 110 GiB.

My workload is agentic, not normal chat. I run OpenClaw with about two dozen agents handling supervisor work, sneaker/business tasks, mail/calendar, vision, multimodal, and long-context tool use.

Now that I have two Sparks, I’m deciding between:

  1. Scaling the same 27B for more parallel sessions and throughput

  2. Running a larger supervisor model in the 70B, MoE, or 100B+ range across both nodes

Curious what others are running on 2-node Spark setups:

  • Model and quant?

  • Tensor parallel, pipeline parallel, or KV cache sharding?

  • Any DFlash, EAGLE, or MTP speculative decoding success across nodes?

  • For agentic work, are dense models like Qwen 70B or DeepSeek preferred over MoE models like Mixtral or GLM?

  • Has anyone tried MiniMax M2.7 or GLM-5.1 across two Sparks?

I care most about controllability, long context, structured output, tool use, and keeping “thinking” off for worker agents while saving reasoning for the supervisor layer.

Happy to share single-node bench numbers and Compose files if useful.

Very curious to hear other opinions on this.

For large models across two Sparks right now, the best options seem to be MiniMax 2.7 and Qwen 3.5 397B.

In my experience, MiniMax is the stronger model for language tasks, while Qwen is currently the best large multimodal model you can run on two Sparks.

Both handle OpenCLAW very well, but running them heavily limits your available Spark resources, and you generally can’t run additional side models above 8B alongside them.

Also worth noting: if you’re running these on a two Spark setup, vLLM is basically the primary deployment route, and it consumes nearly all available VRAM.

If you’re running smaller models, I’d recommend SGLang instead. It handles multiple users and agent calls much better. Ollama is not ideal for that use case, whereas vLLM and SGLang both support it well.

If you use SGLang, I like pairing it with LM Studio on the Sparks. You can install LM Studio on both units and run one model per Spark. It is not always the most optimal setup, since some models still need community support and tuning, but it can make deployment and management a bit easier.

It seems like Minimax is they way and yes like you said it takes everything memory wise. What is the specific recipe ?

This will get you 38-40 t/s, max context around 128k you can push it to 180k

description: vLLM serving MiniMax-M2.7-AWQ with Ray distributed backend
model: cyankiwi/MiniMax-M2.7-AWQ-4bit
container: vllm-node
mods:
defaults:
port: 8000
host: 0.0.0.0
tensor_parallel: 2
gpu_memory_utilization: 0.7
max_model_len: 128000
env: {}
command: |
vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit
–trust-remote-code
–port {port}
–host {host}
–gpu-memory-utilization {gpu_memory_utilization}
-tp {tensor_parallel}
–distributed-executor-backend ray
–max-model-len {max_model_len}
–load-format fastsafetensors
–enable-auto-tool-choice
–tool-call-parser minimax_m2
–reasoning-parser minimax_m2
recipe_version: ‘1’
name: MiniMax-M2.7-AWQ
cluster_only: true

I have a comparable use case and also upgraded to a second Spark two weeks ago.

I am still using MiniMax2.5, which does a decent job on two Sparks. For coding tasks I prefer qwen3-coder-next, also on two Sparks. Nemotron-3-super would also be a nice candidate, but I don’t like its personality. For research tasks I often use Nemotron-3-nano that is served by LM-Studio. Most of the other LLMs are running in eugrs vllm docker.

I do have the possibility though to move the main agent (qwen3.6:35B) to a RTX5090. By doing so I can use the RAM of both Sparks and have a really quick main-agent who manages the subagents for coding / research etc. I like that setup very much.

To isolate my OpenClaw I am using a Strix-Halo 128GB. This is maybe the best part of my setup, because 90% of the time my agent is doing pretty much nothing or only standard tasks. During those times all other machines are off and the main-agent runs on an AMD Strix-Halo. That machine only consumes 13W electric power in idle. My Sparks are consuming never less than 65W each in idle, since I connected them. The AMD is ~30% slower than the Sparks, but it’s doing its job very efficiently.

By the way: Have you discovered that OpenClaw does not realize if you change models in LM-Studio? That’s a very nice way of testing different models, because you never have to change OpenClaws config. When using vllm, it seems OpenClaw always has to know to which LLM it’s communicating.

Alex I used your recipe and have gotten great result.

Did you increase its context window? I’m currently trying 190k. Minimax does degrade pretty hard the more context to give it.

Nah I did the 198k. Also was wondering have you tried Deepseek v4 Flash ? with the GGUF it can run at decent BIT. I haven’t tried it yet.

I’m also using MiniMax 2.7 (cyankiwi/MiniMax-M2.7-AWQ-4bit, 128K context, via spark-vllm-docker and a custom recipe) on a dual cluster for my totally local OpenClaw. I get ~2.800 token/s pp and ~42 token/s tg (apologies for my initial typo with crazy 58, I misread my notes) according to llama-benchy. Overall this feels very good in OpenClaw.

I’m running OpenClaw on the worker-node in the cluster, which can e.g. spawn a local whisper.cpp for ASR and other utilities. On the primary cluster-node, I also run a small llama.cpp server for embeddings for OpenClaw’s memory-search, and maybe other things in the future.

I’m very interested about DeepSeek V4, once it’s stable in spark-vllm-docker.

So it seems minimax is the way then

A follow-up question to you all, since MiniMax via LLM, OpenClaw, Embeddings-llama.cpp consumes almost all the memory on the cluster (primary node 95.6/121GB, secondary node 94.4/121GB), has anybody successfully run turboquant or some other memory-saving techniques? I limit “gpu_memory_utilization: 0.7” in my recipe.

P.S: I run my vllm-cluster with no-ray

From my understanding of the DGX Spark cluster specs, MiniMax is obviously taking up a large amount of RAM, and the VLM is going to use RAM regardless, so TurboQuant will not really help reduce RAM usage in this case. The bigger issue seems to be that vLLM itself is extremely memory hungry. It is not just because you are running a very large model. Even if you move to a larger model, vLLM still tends to consume a similar proportion of available VRAM because it wants to reserve as much as possible and then manage that memory allocation internally.

Could you post your recipe on the Spark Arena leaderboard? 58 tokens per second is by far the fastest I have heard reported for that model.

So I am not a expert here Andrea but I will say this, with Minimax speed and the fact I can use vLLM and tun multiple agents on it. I don’t need any other model to really to be running.

I kinda wish, that we could run minimax and Gemma 4 would be awesome. But not enough vram with only two sparks

Why are you using llama.cpp for an embedding model? With no-ray run the embedding on vLLM. I’m doing this for mxbai-embed-large and spec 1GiB RAM per node to accommodate.

I’d also love to see 58 tokens/s on MiniMax-M2.7-AWQ-4bit. My max is 41.

@Alexander-F - Thanks for the turboquant clarification. Sorry, the speed was me misreading my notes (also corrected it in the post above). I just tested it again (after my cluster was running for a day), nothing special:

model test t/s peak t/s ttfr (ms) est_ppt (ms) e2e_ttft (ms)
minimaxai/MiniMax-M2.7 pp2048 2987.25 ± 7.40 691.73 ± 1.70 685.58 ± 1.70 691.85 ± 1.67
minimaxai/MiniMax-M2.7 tg32 41.84 ± 0.04 43.20 ± 0.04

My clobbered-together recipe, might be inconsistent, I just copy/pasted:

description: vLLM serving MiniMax-M2.7-AWQ 4Bit (neu)
model: cyankiwi/MiniMax-M2.7-AWQ-4bit
container: vllm-node-tf5
mods: []
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.7
  max_model_len: 128000
env:
  VLLM_USE_DEEP_GEMM: 0
  VLLM_USE_FLASHINFER_SAMPLER: 0
  VLLM_FLOAT32_MATMUL_PRECISION: "high"
  VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS: 1
  OMP_NUM_THREADS: 8
command: |
  vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
      --served-model-name minimaxai/MiniMax-M2.7 \
      --trust-remote-code \
      --port {port} \
      --host {host} \
      --gpu-memory-utilization {gpu_memory_utilization} \
      -tp {tensor_parallel} \
      --distributed-executor-backend ray \
      --max-model-len {max_model_len} \
      --load-format fastsafetensors \
      --enable-auto-tool-choice \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2 \
      --max-num-seqs 4 \
      --max-num-batched-tokens 8192 \
      --enable-prefix-caching \
      --enable-chunked-prefill \
      --kv-cache-dtype fp8 \
      --attention-backend flashinfer \
      --dtype auto \
      --disable-custom-all-reduce
recipe_version: '1'
name: MiniMax-M2.7-AWQ
cluster_only: true

some references say --override-generation-config '{{"top_k":40,"top_p":0.95,"temperature":1.0,"min_p":0.01}}' \ might also be useful, I’m still testing if this is true.

@TechnoTim apologies!!! My bad, I misread my notes. See my other comments. 41 is also my max.

@jrsphd - its just my personal taste.

I don’t like vLLM-containers and am way more comfortable with just running llama.cpp. I “grew up” on generative-AI with llama.cpp on my Macs, Jetsons, PC. Yes, vLLM is better for containers, production,… but I hate its glued-togetherness/brittleness/complexity. If llama.cpp would support DGX Spark clusters nicely, I’d rather run only llama.cpp’s llama-server.

My embeddings llama-server uses 640MB, starts up very quickly (GGUF model) and for me, it runs like a charm.

I’ve been running Qwen3.6 35b and MiniMax M2.7 the last month and have really been enjoying that combo.

MiniMax is the “engineer” with its superior coding ability, while Qwen3.6 is more of an “assistant” with its multimodal capabilities and much faster TPS.

The models I use, that fit simultaneously with full context, are:

  • cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit
  • cyankiwi/MiniMax-M2.7-AWQ-4bit