MiMo V2.5 Omni on 3x DGX Spark: TP=3 + MTP + 1M context 39 tok/s

tonyd615 · June 21, 2026, 2:34am

3 x DGX SPARKs

Wanted to share a build that took some real work to land: running Xiaomi’s MiMo V2.5 (310B total / 15B active MoE, NVFP4) FULLY omnimodal (text, image, video, audio) at 1,000,000 token context across 3x DGX Spark over RoCE, no switch.

The challenge:

MiMo has 64 attention heads and 4 KV heads. Neither divides by 3, so stock vLLM cannot tensor-parallel shard it across 3 nodes (the QKV split asserts on exact divisibility). PP=3 works but kills MTP and tanks single-stream throughput.

The fix (virtual-head padding):

We pad the heads to 96 query / 6 KV so they divide cleanly by 3 (32 q / 2 KV per rank), then zero-mask the pad heads so they contribute nothing. This is the same approach used for MiniMax-M3’s TP=3, ported onto MiMo’s attention class AND the MTP draft config. Two more fixes were needed: a FusedMoE zero-fill (the uninitialized padded MoE tail was corrupting NVFP4 output until made truly zero-equivalent), and an attention_sink_bias padding fix for the MTP draft (the loader did 64//3=21 while the virtual sink pads to 32).
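For anyone who wants to sanity-check the head arithmetic, here is a minimal sketch of how the padded counts fall out. It is not the actual patch: the helper name `pad_heads_for_tp` is made up for illustration, and it assumes the padding keeps the original GQA group size (64 / 4 = 16 query heads per KV head). How the pad heads are laid out across ranks is not covered here.

```python
def pad_heads_for_tp(num_q_heads: int, num_kv_heads: int, tp_size: int) -> tuple[int, int]:
    """Hypothetical helper: smallest padded (q, kv) head counts divisible by tp_size.

    Assumes the query-per-KV group size is preserved when padding.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    group_size = num_q_heads // num_kv_heads           # 64 // 4 = 16
    # Round the KV head count up to the next multiple of tp_size.
    padded_kv = -(-num_kv_heads // tp_size) * tp_size  # ceil(4 / 3) * 3 = 6
    padded_q = padded_kv * group_size                  # 6 * 16 = 96
    return padded_q, padded_kv


TP = 3
padded_q, padded_kv = pad_heads_for_tp(64, 4, TP)
q_per_rank, kv_per_rank = padded_q // TP, padded_kv // TP

# The sink-bias mismatch from the post: sharding the unpadded head count
# gives 21 entries per rank, while the padded layout expects 32.
unpadded_sink_per_rank = 64 // TP
padded_sink_per_rank = padded_q // TP

print(padded_q, padded_kv, q_per_rank, kv_per_rank)
print(unpadded_sink_per_rank, padded_sink_per_rank)
```

The same helper shows why TP=2 or TP=4 needs no padding for this model: 4 KV heads already divide evenly, so the counts come back unchanged.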

Results (69-scenario tool-calling eval, thinking OFF, 3-run avg):

Quality 97.3, Responsiveness 96.4, Deployability 97.3
Decode 38.8 tok/s, effective 35.1 tok/s
Median answer latency ~1.2s
KV cache 3,127,938 tokens at 1M context (3.13x concurrency)
All 4 modalities verified live (described a real video clip + its audio track)
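A quick note on where the 3.13x comes from. As far as I understand it, the concurrency figure is just the KV cache capacity in tokens divided by the maximum context per request. The sketch below assumes that definition and uses the 1,000,000 token context from the first post:

```python
# Assumption: concurrency = KV cache capacity / max tokens per request.
kv_cache_tokens = 3_127_938
max_context_tokens = 1_000_000

concurrency = kv_cache_tokens / max_context_tokens
print(f"{concurrency:.2f}x")
```

In other words, a bit over three full 1M-token requests fit in the cache at once, and many more if the requests are shorter.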

Thinking ON vs OFF (6 full runs): OFF wins clearly, 97.3 vs 88.9 quality and 2x lower answer latency. Thinking ON only posts higher raw tokens/sec because it generates internal reasoning tokens. For agentic / tool-calling work, run thinking OFF.

Infra notes: per-node HCA mapping for NCCL, Ray executor with object-store-memory capped to 1GB + memory-monitor disabled (the GB10 unified memory sits near full when loaded, which is normal), worker-first launch, MTU 9000.
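To make the infra notes a bit more concrete, here is a rough sketch of the per-node settings, not the exact recipe from the repo. `NCCL_IB_HCA`, `RAY_memory_monitor_refresh_ms` and `ray start --object-store-memory` are real NCCL / Ray knobs; the hostnames, HCA device names and port are placeholders, and reading "1GB" as 10^9 bytes is an assumption.

```python
# Hypothetical hostname -> RoCE HCA device mapping. The real device names
# differ per node; check `ibv_devices` on each machine.
NODE_HCA = {
    "spark-0": "hca_placeholder_0",
    "spark-1": "hca_placeholder_1",
    "spark-2": "hca_placeholder_2",
}

OBJECT_STORE_BYTES = 1_000_000_000  # "1GB" cap, assumed to mean 10^9 bytes


def node_env(hostname: str) -> dict[str, str]:
    """Environment variables to export on one node before starting Ray."""
    return {
        # Pin NCCL to this node's own HCA.
        "NCCL_IB_HCA": NODE_HCA[hostname],
        # A refresh interval of 0 disables Ray's memory monitor, so it does
        # not kill workers while unified memory sits near full.
        "RAY_memory_monitor_refresh_ms": "0",
    }


def ray_start_args(is_head: bool, head_addr: str = "spark-0:6379") -> list[str]:
    """Build the `ray start` command line for a head or worker node."""
    args = ["ray", "start", f"--object-store-memory={OBJECT_STORE_BYTES}"]
    if is_head:
        args += ["--head", "--port=6379"]
    else:
        args.append(f"--address={head_addr}")
    return args


for host in NODE_HCA:
    print(host, node_env(host), " ".join(ray_start_args(is_head=(host == "spark-0"))))
```

The launch order (worker-first) and the MTU 9000 setting on the RoCE interfaces are handled outside this sketch.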

Full recipe, the patch mod, and every raw benchmark JSON (reproducible) are in the GitHub repo tonyd2wild/MiMo-V2.5-Omni-3x-DGX-Spark-TP-3-MTP (MiMo V2.5 Omni on 3x DGX Spark: TP=3 + MTP2 + 1M context + full omni, recipe and full benchmarks).

corbett_korbett · June 22, 2026, 11:00am

How does it compare to an fp8 qwen 3.6 27b? In my testing the 27b has been far, far more reliable than any minimax m2.7 quant I’ve tested. qwen 3.5 397b awq 4bit also fell flat on its face, while the 27b does well even on difficult vision tasks.

0rand · June 28, 2026, 6:34am

Great job, @tonyd615 !
Have you run perf tests with context above 64k? This seems to be where the 2-node variant starts choking heavily, and being non-MLA could be a model architecture challenge.

0rand · June 28, 2026, 6:44am

In my experience 27b can be an amazing model (same as 122b) if you compensate for its smaller weight count with very detailed prompting and a harness that forces it to retrieve information instead of inventing answers to fill its knowledge gaps. It tends to think a lot, though, so it inflates context, and E2E time is double that of 122b even if generation speed were the same (it is not: 122b is 1.75x faster). And it has a 256k session limit, not 1M. But it’s excellent as a sub-agent, where an agent with a large context and weight size, like ds4f or similar, generates prompts for it enriched with all the details it knows.
