3 x DGX SPARKs
Wanted to share a build that took some real work to land: running Xiaomi’s MiMo V2.5 (310B total / 15B active MoE, NVFP4) FULLY omnimodal (text, image, video, audio) at 1,000,000 token context across 3x DGX Spark over RoCE, no switch.
The challenge:
MiMo has 64 attention heads and 4 KV heads. Neither divides by 3, so stock vLLM cannot tensor-parallel shard it across 3 nodes (the QKV split asserts on exact divisibility). PP=3 works but kills MTP and tanks single-stream throughput.
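To make the divisibility problem concrete, here is a minimal sketch of the kind of check that fails. It is a simplified stand-in, not vLLM's actual code: the function name `tp_shardable` is hypothetical, and it ignores the KV-head replication vLLM can do when the TP size exceeds the KV head count.

```python
def tp_shardable(num_q_heads: int, num_kv_heads: int, tp_size: int) -> bool:
    """True when both head counts split evenly across tp_size ranks."""
    return num_q_heads % tp_size == 0 and num_kv_heads % tp_size == 0


# Head counts from the post: 64 query heads, 4 KV heads.
MIMO_Q_HEADS, MIMO_KV_HEADS = 64, 4

for tp_size in (1, 2, 3, 4):
    ok = tp_shardable(MIMO_Q_HEADS, MIMO_KV_HEADS, tp_size)
    print(f"TP={tp_size}: {'ok' if ok else 'not divisible'}")
```

TP=2 and TP=4 pass, but TP=3 does not, which is why a 3-node cluster needs either pipeline parallelism or padded head counts.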
The fix (virtual-head padding):
We pad the heads to 96 query / 6 KV so they divide cleanly by 3 (32 q / 2 KV per rank), then zero-mask the pad heads so they contribute nothing. This is the same approach used for MiniMax-M3’s TP=3, ported onto MiMo’s attention class AND the MTP draft config. Two more fixes were needed: a FusedMoE zero-fill (the uninitialized padded MoE tail was corrupting NVFP4 output until it was made truly zero-equivalent), and an attention_sink_bias padding fix for the MTP draft (the loader computed 64//3=21 entries per rank while the virtual sink pads to 32).
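The padding arithmetic can be sketched in a few lines. This is an illustration only, not the actual patch: it assumes the pad keeps the original 16 query heads per KV head, that pad heads are appended after the real ones, and that masking is a per-head multiply. The names `padded_head_counts` and `head_mask` are hypothetical.

```python
import random


def padded_head_counts(num_q_heads: int, num_kv_heads: int, tp_size: int):
    """Round KV heads up to a multiple of tp_size, keeping the q-per-KV ratio."""
    group_size = num_q_heads // num_kv_heads           # 64 // 4 = 16
    kv_padded = -(-num_kv_heads // tp_size) * tp_size  # ceiling to a multiple of tp_size
    return kv_padded * group_size, kv_padded


Q_REAL, KV_REAL, TP = 64, 4, 3
q_padded, kv_padded = padded_head_counts(Q_REAL, KV_REAL, TP)
per_rank = (q_padded // TP, kv_padded // TP)
print(q_padded, kv_padded, per_rank)

# Assumption: pad heads sit after the real heads; the real patch may lay them out differently.
head_mask = [1.0] * Q_REAL + [0.0] * (q_padded - Q_REAL)

# Toy per-head outputs: pad heads may compute arbitrary finite values.
rng = random.Random(0)
real_out = [rng.uniform(-1.0, 1.0) for _ in range(Q_REAL)]
pad_out = [rng.uniform(-1.0, 1.0) for _ in range(q_padded - Q_REAL)]

unpadded_sum = sum(real_out)
masked_sum = sum(m * o for m, o in zip(head_mask, real_out + pad_out))
print(abs(masked_sum - unpadded_sum) < 1e-12)
```

With the mask applied, the combined output over 96 heads matches the original 64-head result, which is the property the zero-masking is meant to guarantee.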
Results (69-scenario tool-calling eval, thinking OFF, 3-run avg):
- Quality 97.3, Responsiveness 96.4, Deployability 97.3
- Decode 38.8 tok/s, effective 35.1 tok/s
- Median answer latency ~1.2s
- KV cache 3,127,938 tokens at 1M context (3.13x concurrency)
- All 4 modalities verified live (described a real video clip + its audio track)
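The concurrency figure follows directly from the KV cache size and the context length. A quick check, using only the numbers reported above:

```python
KV_CACHE_TOKENS = 3_127_938     # reported KV cache capacity
MAX_CONTEXT_TOKENS = 1_000_000  # configured context length

# Concurrency = how many full-length contexts fit in the KV cache at once.
concurrency = KV_CACHE_TOKENS / MAX_CONTEXT_TOKENS
full_contexts = KV_CACHE_TOKENS // MAX_CONTEXT_TOKENS

print(f"{concurrency:.2f}x concurrency, {full_contexts} full 1M-token contexts")
```

In other words, the cache holds three complete 1M-token requests at once, with a little room to spare.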
Thinking ON vs OFF (6 full runs): OFF wins clearly, 97.3 vs 88.9 quality and 2x lower answer latency. Thinking ON only posts higher raw tokens/sec because it generates internal reasoning tokens. For agentic / tool-calling work, run thinking OFF.
Infra notes: per-node HCA mapping for NCCL, Ray executor with object-store-memory capped to 1GB + memory-monitor disabled (the GB10 unified memory sits near full when loaded, which is normal), worker-first launch, MTU 9000.
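As a rough sketch of how those settings could be assembled per node: the node names and HCA device names below are placeholders, not values from this build, and the helper functions are hypothetical. `NCCL_IB_HCA`, `RAY_memory_monitor_refresh_ms` and the `ray start --object-store-memory` flag are standard NCCL and Ray settings. The exact launch order and values used here are in the repo recipe.

```python
OBJECT_STORE_BYTES = 1_000_000_000  # "capped to 1GB"; assumed decimal GB

# Placeholder mapping: which RoCE devices each node should use for NCCL.
# Replace with the device names your own nodes report.
NODE_HCAS = {
    "spark-0": "hca_a,hca_b",
    "spark-1": "hca_a,hca_b",
    "spark-2": "hca_a,hca_b",
}


def node_env(node: str) -> dict:
    """Environment variables for one node."""
    return {
        "NCCL_IB_HCA": NODE_HCAS[node],        # per-node HCA selection
        "RAY_memory_monitor_refresh_ms": "0",  # 0 disables Ray's memory monitor
    }


def ray_start_command(head_address=None) -> list:
    """Build a `ray start` command for the head node or for a worker node."""
    cmd = ["ray", "start"]
    if head_address is None:
        cmd.append("--head")
    else:
        cmd.append(f"--address={head_address}")
    cmd.append(f"--object-store-memory={OBJECT_STORE_BYTES}")
    return cmd


print(node_env("spark-0"))
print(" ".join(ray_start_command()))
print(" ".join(ray_start_command("10.0.0.1:6379")))  # placeholder address
```

Disabling the memory monitor matters on this hardware because a loaded GB10 sits near full unified memory by design, and Ray would otherwise treat that as memory pressure and kill workers.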
Full recipe, the patch mod, and every raw benchmark JSON (reproducible) are on GitHub: tonyd2wild/MiMo-V2.5-Omni-3x-DGX-Spark-TP-3-MTP (MiMo V2.5 Omni on 3x DGX Spark: TP=3 + MTP2 + 1M context + full omni, recipe and full benchmarks).