MiMo V2.5 Omni on 3x DGX Spark: TP=3 + MTP + 1M context 39 tok/s


Wanted to share a build that took some real work to land: running Xiaomi’s MiMo V2.5 (310B total / 15B active MoE, NVFP4) FULLY omnimodal (text, image, video, audio) at 1,000,000 token context across 3x DGX Spark over RoCE, no switch.

The challenge:

MiMo has 64 attention heads and 4 KV heads. Neither divides by 3, so stock vLLM cannot tensor-parallel shard it across 3 nodes (the QKV split asserts on exact divisibility). PP=3 works but kills MTP and tanks single-stream throughput.
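To make the divisibility problem concrete, here is a minimal sketch of the kind of check that trips at TP=3. It is a simplified model, not vLLM's actual code: the function name `tp_shardable` is hypothetical, and the rule encoded is an assumption (query heads must split evenly, and KV heads must either split evenly or be replicated a whole number of times).

```python
def tp_shardable(num_q_heads: int, num_kv_heads: int, tp_size: int) -> bool:
    """Simplified model of a QKV tensor-parallel divisibility check (assumed rule)."""
    # Query heads must split evenly across ranks.
    if num_q_heads % tp_size != 0:
        return False
    # KV heads either split evenly, or each one is replicated across
    # a whole number of ranks when there are more ranks than KV heads.
    if tp_size >= num_kv_heads:
        return tp_size % num_kv_heads == 0
    return num_kv_heads % tp_size == 0


# MiMo's stock shape from the post: 64 query heads, 4 KV heads.
ok_sizes = [tp for tp in range(1, 9) if tp_shardable(64, 4, tp)]
print(ok_sizes)                  # [1, 2, 4, 8]
print(tp_shardable(64, 4, 3))    # False
print(tp_shardable(96, 6, 3))    # True
```

Under this model only power-of-two TP sizes pass with the stock head counts, which is why a 3-node layout needs a different head count.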

The fix (virtual-head padding):

We pad the heads to 96 query / 6 KV so they divide cleanly by 3 (32 q / 2 KV per rank), then zero-mask the pad heads so they contribute nothing. This is the same approach used for MiniMax-M3’s TP=3, ported onto MiMo’s attention class AND the MTP draft config. Two more fixes were needed: a FusedMoE zero-fill (the uninitialized padded MoE tail was corrupting NVFP4 output until made truly zero-equivalent), and an attention_sink_bias padding fix for the MTP draft (the loader did 64//3=21 while the virtual sink pads to 32).
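The padding arithmetic and the zero-mask can be sketched as below. This is an illustration only, not the patch from the repo: the helper names are hypothetical, placing the real heads first and the pad heads last is an assumption, and the output projection is reduced to a toy masked sum. It also assumes the padded counts were chosen by rounding KV heads up to a multiple of the TP size while keeping the 16 query heads per KV head, which reproduces the 96 / 6 figures above.

```python
import math


def padded_head_counts(num_q_heads: int, num_kv_heads: int, tp_size: int):
    """Smallest padded (q, kv) head counts divisible by tp_size, keeping the GQA group size."""
    group = num_q_heads // num_kv_heads                      # query heads per KV head
    padded_kv = math.ceil(num_kv_heads / tp_size) * tp_size  # round KV heads up
    return padded_kv * group, padded_kv


def head_mask(num_real: int, num_padded: int):
    """1.0 for real heads, 0.0 for pad heads (real heads assumed to come first)."""
    return [1.0] * num_real + [0.0] * (num_padded - num_real)


def combine_heads(head_outputs, mask):
    """Toy stand-in for the output projection: a masked sum over per-head values."""
    return sum(out * m for out, m in zip(head_outputs, mask))


q, kv = padded_head_counts(64, 4, 3)
print(q, kv, q // 3, kv // 3)   # 96 6 32 2

# Pad heads may hold arbitrary values; the mask removes their contribution.
real = [float(i) for i in range(64)]
padded = real + [123.0] * (q - 64)
print(combine_heads(padded, head_mask(64, q)) == sum(real))   # True
```

The last line is the property that matters: whatever ends up in the pad heads, the masked result matches the unpadded model.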

Results (69-scenario tool-calling eval, thinking OFF, 3-run avg):

  • Quality 97.3, Responsiveness 96.4, Deployability 97.3

  • Decode 38.8 tok/s, effective 35.1 tok/s

  • Median answer latency ~1.2s

  • KV cache 3,127,938 tokens at 1M context (3.13x concurrency)

  • All 4 modalities verified live (described a real video clip + its audio track)
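The 3.13x concurrency figure follows from the KV cache size and the context length. A minimal sketch, assuming the figure is simply cache capacity divided by the per-request context limit (which matches the numbers above); the function name is hypothetical:

```python
def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """How many full-length requests fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


print(f"{max_concurrency(3_127_938, 1_000_000):.2f}x")   # 3.13x
```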

Thinking ON vs OFF (6 full runs): OFF wins clearly, 97.3 vs 88.9 quality and 2x lower answer latency. Thinking ON only posts higher raw tokens/sec because it generates internal reasoning tokens. For agentic / tool-calling work, run thinking OFF.

Infra notes: per-node HCA mapping for NCCL, Ray executor with object-store-memory capped to 1GB + memory-monitor disabled (the GB10 unified memory sits near full when loaded, which is normal), worker-first launch, MTU 9000.
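As a rough sketch of how those settings might be expressed per node, the helpers below build the environment and commands without running anything. The device and interface names (`mlx5_0`, `eth0`) and the head address are placeholders, reading "1GB" as 1,000,000,000 bytes is an assumption, and the actual launch scripts in the repo may differ. `NCCL_IB_HCA`, `NCCL_SOCKET_IFNAME`, `RAY_memory_monitor_refresh_ms` and `ray start --object-store-memory` are the standard NCCL and Ray knobs.

```python
def node_env(hca_names, iface_name):
    """Per-node environment: pin NCCL to this node's HCAs, disable Ray's memory monitor."""
    return {
        "NCCL_IB_HCA": ",".join(hca_names),      # which RDMA devices NCCL may use
        "NCCL_SOCKET_IFNAME": iface_name,        # interface for NCCL bootstrap traffic
        "RAY_memory_monitor_refresh_ms": "0",    # 0 disables Ray's memory monitor
    }


def ray_start_cmd(head_address=None, object_store_bytes=1_000_000_000):
    """`ray start` argv with the object store capped (value is in bytes)."""
    cmd = ["ray", "start", f"--object-store-memory={object_store_bytes}"]
    if head_address is None:
        cmd += ["--head", "--port=6379"]
    else:
        cmd.append(f"--address={head_address}")
    return cmd


def mtu_cmd(iface_name, mtu=9000):
    """Command to set jumbo frames on the RoCE interface."""
    return ["ip", "link", "set", "dev", iface_name, "mtu", str(mtu)]


# Placeholder names; substitute the devices reported on each node.
print(node_env(["mlx5_0", "mlx5_1"], "eth0"))
print(" ".join(ray_start_cmd(head_address="192.0.2.10:6379")))
print(" ".join(mtu_cmd("eth0")))
```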

Full recipe, the patch mod, and every raw benchmark JSON (reproducible) are on GitHub: tonyd2wild/MiMo-V2.5-Omni-3x-DGX-Spark-TP-3-MTP (TP=3 + MTP2 + 1M context + full omni, recipe and full benchmarks).

How does it compare to an FP8 Qwen 3.6 27B? In my testing the 27B has been far, far more reliable than any MiniMax M2.7 quant I’ve tested. Qwen 3.5 397B AWQ 4-bit also fell flat on its face, while the 27B does solid work even with difficult vision tasks.

Great job, @tonyd615 !
Have you run perf tests with context above 64k? This seems to be where the 2-node variant starts choking heavily, and being non-MLA could be a model architecture challenge.

In my experience the 27B can be an amazing model (same as the 122B) if you compensate for its smaller weight count with very detailed prompting and a harness that forces it to retrieve information instead of inventing answers to fill its knowledge gaps. It tends to think a lot, though, so it inflates context, and E2E is double that of the 122B even if generation speed were the same (it isn’t: the 122B is 1.75x faster). And it has a 256k session limit, not 1M. But it’s excellent as a sub-agent, where an agent with a large context and weight count, like ds4f or similar, generates prompts for it enriched with all the details it knows.