MiMo-V2.5-NVFP4 on 2x Spark Cluster - Recipe, findings, fixes, benchmarks

a3refaat · May 18, 2026, 12:43am

Hey guys, I decided to stop lurking and make an effort to contribute in this forum. I got lukealonso/MiMo-V2.5-NVFP4 running on a 2× DGX Spark / GB10 cluster with vLLM, including Omni/multimodal serving and MTP speculative decoding.

Special thanks to @CyberTen and @mclenithan for getting the native model quant off the ground and providing me with a starting point for getting this working. This model is looking very promising from a quality and throughput standpoint for 2xSpark users. Multimodal capabilities are a big plus as well, although this setup has only validated image processing so far. Audio/Video could work, but I haven't tested this yet. Hopefully this can provide a good baseline for further testing & optimization from the community.

Patch bundle / reproducible recipe:

https://github.com/eugr/spark-vllm-docker/pull/251

Working configuration

Item	Value
Hardware	2× DGX Spark / GB10, TP=2
Model	`lukealonso/MiMo-V2.5-NVFP4`
vLLM	`0.21.1rc1.dev39`, CUDA 13.2 build
Load format	`instanttensor`
Attention	`TRITON_ATTN_DIFFKV`
Dense GEMM	FlashInfer-CUTLASS MXFP8
MoE	FlashInfer-CUTLASS NVFP4
KV cache	`fp8_e4m3`
Context	`131072`
Serving	Omni, image input validated
Spec decode	MiMo MTP, `num_speculative_tokens=2`

Expected startup markers:

Resolved architecture: MiMoV2OmniForCausalLM
Resolved architecture: MiMoV2OmniMTPModel
Using FlashInferCutlassMxfp8LinearKernel for MXFP8 GEMM
Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend
Using TRITON_ATTN_DIFFKV
cache_dtype="fp8_e4m3"

Launch shape

vllm serve lukealonso/MiMo-V2.5-NVFP4 \
  --served-model-name MiMo-V2.5-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --load-format instanttensor \
  --hf-overrides '{"architectures":["MiMoV2OmniForCausalLM"]}' \
  --limit-mm-per-prompt '{"image":4,"video":1,"audio":1}' \
  --attention-backend triton_attn_diffkv \                                              
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 131072 \ 
  --max-num-batched-tokens 16384 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \ 
  --no-async-scheduling \
  --enable-auto-tool-choice \ 
  --tool-call-parser mimo \
  --reasoning-parser mimo                                                             

                                                                                   
export VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1                                             
export VLLM_USE_RAY_COMPILED_DAG_OVERLAP_COMM=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1                                                
export NCCL_CUMEM_ENABLE=0
export NCCL_NVLS_ENABLE=0                                                            
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=1                                                  
export VLLM_FLASHINFER_MOE_BACKEND=throughput

Important fixes

The main correctness issue was not performance tuning; it was checkpoint/layout compatibility.

1. ModelOpt mixed MXFP8/NVFP4 dispatch

The checkpoint mixes MXFP8 dense layers with NVFP4 experts. The ModelOpt mixed-precision path needed to dispatch MXFP8 linear layers to the MXFP8 method instead of falling through.

if quant_algo == "MXFP8":
    return ModelOptMxFp8LinearMethod(self.mxfp8_config)
if quant_algo == "NVFP4":
    return ModelOptNvFp4LinearMethod(self.nvfp4_config)

2. Do not invert weight_scale_inv

For this checkpoint, weight_scale_inv is UE8M0 MXFP8 scale metadata. Despite the name, it should not be reciprocal-inverted. I aliased it to vLLM’s expected weight_scale parameter.

layer.register_parameter("weight_scale", weight_scale)
layer.register_parameter("weight_scale_inv", weight_scale)

3. Fix fused QKV TP loading

This was the biggest text-quality bug. The NVFP4 checkpoint uses a deinterleaved QKV layout, but the loader was blindly chunking the fused QKV tensor by TP rank. That can load Q rows into K/V slots.

Bad pattern:

loaded_weight = loaded_weight.chunk(tp_size, dim=0)[tp_rank]
default_weight_loader(param, loaded_weight)

Fix: use the parameter’s QKV-aware loader.

param = params_dict[name]
weight_loader = getattr(param, "weight_loader", default_weight_loader)
weight_loader(param, loaded_weight)

This same fix is needed for the MTP draft model.

4. Fix Omni + MTP quant metadata mapping

Omni remaps target model keys from:

model.* -> language_model.model.*

But MTP draft modules still live under:

model.mtp.*

So the MTP draft lost its quant metadata. The fix remaps MTP metadata back for the draft model:

hf_to_vllm_mapper = WeightsMapper(
    orig_to_new_prefix={
        "language_model.model.mtp.": "model.mtp.",
    }
)

After this, Omni MTP acceptance recovered.

Validation

Text generation:       OK
Tool calling:          OK
Reasoning parser:      OK
Image prompt:          OK
Omni MTP depth 2:      OK
Non-eager execution:   OK

Example MTP acceptance after the Omni quant fix:

Text prompt:       42 / 78 = 53.85%
Image prompt:      38 / 52 = 73.08%
Tool eval run:  15621 / 21221 = 73.61%

Benchmarks

tool-eval-bench

score:           89 / 100
points:          123 / 138
rating:          ★★★★ Good
safety warnings: 0

llama-benchy

Run shape: pp=2048, tg=32, prefix caching enabled, concurrency 1/2, 3 runs.

Depth	Concurrency	Prefill total t/s	Decode total t/s	TTFR ms
0	1	3734.84	34.14	785.03
0	2	3001.78	62.64	1295.67
4096	1	2854.95	36.69	954.03
4096	2	2335.61	52.77	1690.58
8192	1	2484.35	33.19	1061.04
8192	2	2088.01	53.65	1900.97
16384	1	1983.69	32.75	1269.15
16384	2	1728.92	53.77	2310.42
32768	1	1404.16	29.55	1695.20
32768	2	1272.26	44.78	3170.00
65536	1	895.17	21.69	2524.51
98304	1	660.81	19.62	3336.00
114688	1	581.64	19.32	3757.79

The exact depth=131072 point with tg=32 is invalid because prompt + requested output exceeds the 131072-token model window:

input tokens: at least 131041
requested output tokens: 32
total: at least 131073 > 131072

Also, concurrency=2 at very long context shows a large latency increase. That is expected here: without additional long-context attention optimizations, decode becomes dominated by scanning a large KV cache. For this 2× Spark setup, c2 is useful at lower/mid context; c1 is the cleaner view at 65k+ context.

Known caveats

Image input is validated; audio/video paths still need separate validation.
FP8 E4M3 KV cache works, but this run used calculate_kv_scales=False.
The exact 131072-depth benchmark needs smaller tg or slightly lower prompt depth.
This is currently packaged as a reproducible runtime mod/recipe, not an upstream-clean vLLM PR series.

Minimal replication checklist

Use a vLLM build with MiMo V2.5 and TRITON_ATTN_DIFFKV.
Apply the two mods from the PR:
- mods/fix-modelopt-mixed-mxfp8
- mods/fix-mimo-v2-vllm
Launch the recipe with --load-format instanttensor.
Confirm MXFP8, NVFP4, DiffKV, Omni, and MTP startup markers.
Run a text prompt, an image prompt, and a tool-call prompt.

Thanks to the Xiaomi/MiMo team, the NVFP4 export author, the vLLM contributors, and the prior MiMo V2.5 DGX Spark notes that helped guide this setup.

Alexander-F · May 18, 2026, 2:31am

I will give it a shot, thank you for all the hard work!

eparin82 · May 18, 2026, 6:37am

── Run 1/2 ──────────────────────────────────────
[Q&A] 157 tokens in 8.95s = 17.5 tok/s (prompt: 37)
[Code] 512 tokens in 13.37s = 38.2 tok/s (prompt: 44)
[JSON] 732 tokens in 17.66s = 41.4 tok/s (prompt: 62)
[Math] 64 tokens in 1.83s = 34.9 tok/s (prompt: 43)
[LongCode] 2048 tokens in 52.25s = 39.1 tok/s (prompt: 51)

── Run 2/2 ──────────────────────────────────────
[Q&A] 161 tokens in 4.36s = 36.9 tok/s (prompt: 37)
[Code] 512 tokens in 12.82s = 39.9 tok/s (prompt: 44)
[JSON] 736 tokens in 17.56s = 41.9 tok/s (prompt: 62)
[Math] 64 tokens in 1.91s = 33.5 tok/s (prompt: 43)
[LongCode] 2048 tokens in 53.56s = 38.2 tok/s (prompt: 51)

tool-eval-bench: MiMo-V2.5-NVFP4

Final Score: 88/100 (88.4%)
Points: 122/138
Rating: ★★★★ Good
Deployability: 74
Responsiveness: 41
Median Turn: 3783.6ms
Token Efficiency: 0.4
Total Tokens: 304436
Error Rate: 0.0%
Worst Category: M Autonomous Planning (67%)

arctic.gus · May 18, 2026, 8:09am

great work, many thanks. Me and claude spent half a day trying to mod Luke’s SGLang images, but at best I could only get garbled output with cuda graph turned off. Will give this a shot later on today.

serapis · May 18, 2026, 9:58am

I spent some time bashing my head against this model but couldn’t get it to work. Well done!

Pushing this beyond 128K context window didn’t succeed. It seems like these quants are really pushing the boundaries of our systems.

I tried this with both lukealonso/MiMo-V2.5-NVFP4 · Hugging Face and shadowlilac/MiMo-V2.5-NVFP4 · Hugging Face – here is what I got:

╭──────────────────────── ⚡ llama-benchy Throughput Benchmark ────────────────────────╮
│ lukealonso/MiMo-V2.5-NVFP4                                                           │
│ pp=[2048]  tg=[128]  depth=[0, 4096, 8192]  concurrency=[1, 2, 4]  runs=3            │
│ latency=generation                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────╯

  ✓ Complete ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 0:04:40

  llama-benchy 0.3.7
  Estimated latency: 234.2 ms

                                  llama-benchy Results
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━
┃ Test                 ┃  c  ┃   pp t/s ┃   tg t/s ┃ TTFT (ms) ┃ Total (ms) ┃    Tokens
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━
│ pp2048 tg128 @ d0    │ c1  │    3,577 │     34.0 │       869 │      4,335 │  2048+128
│ pp2048 tg128 @ d0    │ c2  │    2,828 │     54.3 │     1,449 │      5,612 │  2048+128
│ pp2048 tg128 @ d0    │ c4  │    2,717 │     51.4 │     2,841 │     11,212 │  2048+128
│ pp2048 tg128 @ d4096 │ c1  │    3,149 │     34.6 │     2,247 │      5,650 │  2048+128
│ pp2048 tg128 @ d4096 │ c2  │    2,949 │     51.0 │     4,168 │      8,733 │  2048+128
│ pp2048 tg128 @ d4096 │ c4  │    3,002 │     53.6 │     8,113 │     15,046 │  2048+128
│ pp2048 tg128 @ d8192 │ c1  │    3,026 │     32.1 │     3,685 │      7,371 │  2048+128
│ pp2048 tg128 @ d8192 │ c2  │    2,731 │     50.1 │     7,502 │     11,318 │  2048+128
│ pp2048 tg128 @ d8192 │ c4  │    2,866 │     39.3 │    13,425 │     20,530 │  2048+128
└──────────────────────┴─────┴──────────┴──────────┴───────────┴────────────┴───────────


                                 Speculative Decoding Results
┏━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━
┃ Prompt     ┃ Dep… ┃ Eff t/s ┃    α % ┃ Waste ┃ τ len ┃ Win ┃ Draft t/s ┃ TTFT ms ┃ Tot
┡━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━
│ filler     │    0 │    27.8 │  63.2% │   37% │   1.3 │   2 │      24.7 │       8 │
│ code       │    0 │    35.2 │  74.5% │   25% │   1.5 │   2 │      28.0 │       6 │
│ structured │    0 │    35.2 │  82.3% │   18% │   1.6 │   2 │      26.4 │      11 │
│ filler     │   4K │    21.5 │  62.3% │   38% │   1.2 │   2 │      19.1 │      17 │
│ code       │   4K │    35.4 │  74.5% │   25% │   1.5 │   2 │      28.2 │       7 │
│ structured │   4K │    35.3 │  79.6% │   20% │   1.6 │   2 │      27.0 │      18 │
│ filler     │   8K │    19.8 │  58.5% │   42% │   1.2 │   2 │      18.3 │      22 │
│ code       │   8K │    36.5 │  80.6% │   19% │   1.6 │   2 │      27.9 │       7 │
│ structured │   8K │    35.8 │  82.3% │   18% │   1.6 │   2 │      26.9 │       7 │
└────────────┴──────┴─────────┴────────┴───────┴───────┴─────┴───────────┴─────────┴────

  Highest acceptance: structured (82.3%)  Lowest: filler (58.5%)
  Draft window: 1.5/2 positions used (73% utilization)  Avg waste: 27%

Something seems to be off with the tool calling - I see a bunch of this in the logs:

(APIServer pid=255) INFO 05-18 09:54:41 [qwen3xml_tool_parser.py:1157] vLLM Successfully import tool parser Qwen3XMLToolParser !
(APIServer pid=255) WARNING 05-18 09:54:44 [qwen3xml_tool_parser.py:303] Error when parsing XML elements: not well-formed (invalid token): line 6, column 1
(APIServer pid=255) INFO 05-18 09:54:44 [qwen3xml_tool_parser.py:1157] vLLM Successfully import tool parser Qwen3XMLToolParser !

and this:

(APIServer pid=255) WARNING 05-18 09:55:29 [serving.py:911] MTP truncation detected for request chatcmpl-bf4ad8dea81bdb9a: finished with 'stop' but tools configured and only reasoning produced.

ekkis · May 18, 2026, 11:40am

I’ve been wanting to try this model for a while now, great job getting this working!

I pulled your PR and got it up and running in no time. I’m running a slightly newer VLLM build (v0.21.1rc1.dev50+) and had no issues running higher context, testing with 261k now. VLLM reports max concurrency of 1.54x at 261440 max-num-len. I only commented out PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True in the recipe as I seem to remember it causing issues with the Qwen 3.5 397b recipe.

Audio tokenizer is not found so that’s not working, Opencode running Kimi K2.6 claimed it was an issue with the safetensor loading not accepting subfolders, not sure about that one but audio isn’t very important to me anyway.

I also see the Error when parsing XML elements: not well-formed (invalid token) error, but quite infrequently.

I got a similar tool-eval-bench score as above, this is with max-num-len at 261k:

MiMo-V2.5-NVFP4 — tool-eval-bench v1.7.0

Score: 88/100 (122/138 points) ★★★★ Good

Deployability: 75/100 | Responsiveness: 45/100 (median turn: 3.4s)

Breakdown: 48 pass, 17 partial, 4 fail out of 69 scenarios.

serapis · May 18, 2026, 11:50am

I found the issue for the tool calling error – hoping to get this merged soon: [Bugfix] Fix duplicate Qwen3 XML function close recovery by SeraphimSerapis · Pull Request #42969 · vllm-project/vllm · GitHub

serapis · May 18, 2026, 11:58am

So you run 261K with max_num_seqs 1?

I might give that a try.

ekkis · May 18, 2026, 1:08pm

I didn’t set max-num-seqs at all, I tried initially but it limits the cuda graphs precompile and appeared to tank the performance, may need to retry.

I did get a crash using opencode and openclaw simultaneously with 100k+ contexts on each so it’s not perfectly stable.

mclenithan · May 18, 2026, 3:36pm

Nice work @a3refaat !!

a3refaat · May 18, 2026, 6:46pm

Applied this fix and I’m getting modestly improved results. Good find! Something else worth noting, this quant checkpoint doesn’t have any calibrated scale for fp8_e4m3 kv cache. I don’t have substantive benchmarks to support this claim, but anecdotally these low precision types need adequate scale factors from a broad calibration dataset, even for kv cache. The checkpoint provides these calibrated scales for MXFP8 weights (E8M0) but not for E4M3 kv cache type. After some testing I found that my peak concurrent request support decreased from 2.18x → 2x at 196k ctx when using BF16 kv_cache_dtype. Tg throughput was similar at lower context lengths. So for long context workflows where quality is important, I will probably stick with bf16 for now

tonyd615 · May 19, 2026, 4:39am

finally got it working. Would love to get the context up if we can.

ekkis · May 19, 2026, 4:42am

I played around with this model a fair bit yesterday but by the end of the day I returned to Minimax M2.7. The main issues are with tool calling in Opencode and extreme thinking loops once you go much past 100k context. Despite the good tool-eval-bench score I find M2.7 to be much more reliable both in Opencode and OpenClaw.

The tool calling issues are probably from the Qwen template, but even after applying both @serapis PR from above and the Qwen 3.5 template mod and / or using qwen3_coder as tool-call-parser the issue persists. The failure modes are either «Expected function.name to be a string» or simply just stopping mid work, both familiar issues from the Qwen models.

renek · May 19, 2026, 7:19am

Claude Code made some reliability real work tests and deepseek v4 Flash gets it better than m2.7 and Mimo is on par with deepseek with Omni…

tonyd615 · May 19, 2026, 5:28pm

Love Minimax but I’d tell you check out DS4

Alexander-F · May 20, 2026, 12:00pm

I tried MiMo-V2.5-NVFP4 on my 2 Spark cluster, and everything is up and running.

Huge thanks to everyone involved for all the hard work that went into making this possible. It’s genuinely appreciated.

That said, based on my testing and real-world use, it just doesn’t quite hold up against M2.7 for my needs.

pfnguyen · May 20, 2026, 5:45pm

This is exciting, I haven’t been motivated to try out new models recently, been putting off trying to get this one going. Been very happy with 397B in general, other than the very limited memory footprint/concurrency available. I really want to try out mimo to get more room for context.

tonyd615 · May 24, 2026, 7:11pm

Has anyone done any additional testing ? I kind of feel back to DS4 and Minimax 2.7

vr8vr8 · May 25, 2026, 4:02pm

So those 2 was better for you?

tonyd615 · May 25, 2026, 8:23pm

Yea Minimax seems to be the MOST stable but I feel like a more optimizsed DS4 will win as the best current model for 2 clusters.

Topic		Replies	Views
Mimo V2.5 Flash on 2 Nodes DGX Spark / GB10 deepseek	165	3222	July 10, 2026
MiMo-V2.5 (New model) DGX Spark / GB10	57	6276	July 9, 2026
Mimo 2.5 Pro NVFP4 on 8xGB10 cluster DGX Spark / GB10	10	1060	June 9, 2026
MiMo-V2.5 Omni · TP=2 · 1M context · NVFP4 KV on 2× DGX Spark DGX Spark / GB10	26	966	June 26, 2026
MiniMax M2.5 released (not available on HuggingFace as of now) -- is DGX Spark ready? DGX Spark / GB10	92	6718	April 12, 2026
MiniMax M3 NVFP4 and NVFP4 REAP 50 for 4x & 2x DGX Sparks DGX Spark / GB10 Projects	53	3590	July 2, 2026
MiniMax M3 : NVFP4 for Quad DGX Spark DGX Spark / GB10 agentic-ai , deepseek	116	7509	June 25, 2026
MiniMax M2.7 NFVP4 Recipe & Benchmarks DGX Spark / GB10 llama	125	12494	July 9, 2026
MiMo-V2.5-Pro-FP4-DFlash DGX Spark / GB10	13	946	June 26, 2026
Qwen3.5-122B-A10B NVFP4 Quantized for DGX Spark — 234GB → 75GB, Runs on 128GB DGX Spark / GB10 Projects	44	11976	April 9, 2026