Hey guys, I decided to stop lurking and make an effort to contribute in this forum. I got lukealonso/MiMo-V2.5-NVFP4
running on a 2ร DGX Spark / GB10 cluster with vLLM, including Omni/multimodal serving and MTP speculative decoding.
Patch bundle / reproducible recipe:
https://github.com/eugr/spark-vllm-docker/pull/251
Working configuration
| Item | Value |
|---|---|
| Hardware | 2ร DGX Spark / GB10, TP=2 |
| Model | lukealonso/MiMo-V2.5-NVFP4 |
| vLLM | 0.21.1rc1.dev39, CUDA 13.2 build |
| Load format | instanttensor |
| Attention | TRITON_ATTN_DIFFKV |
| Dense GEMM | FlashInfer-CUTLASS MXFP8 |
| MoE | FlashInfer-CUTLASS NVFP4 |
| KV cache | fp8_e4m3 |
| Context | 131072 |
| Serving | Omni, image input validated |
| Spec decode | MiMo MTP, num_speculative_tokens=2 |
Expected startup markers:
Resolved architecture: MiMoV2OmniForCausalLM
Resolved architecture: MiMoV2OmniMTPModel
Using FlashInferCutlassMxfp8LinearKernel for MXFP8 GEMM
Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend
Using TRITON_ATTN_DIFFKV
cache_dtype="fp8_e4m3"
Launch shape
vllm serve lukealonso/MiMo-V2.5-NVFP4 \
--served-model-name MiMo-V2.5-NVFP4 \
--trust-remote-code \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--load-format instanttensor \
--hf-overrides '{"architectures":["MiMoV2OmniForCausalLM"]}' \
--limit-mm-per-prompt '{"image":4,"video":1,"audio":1}' \
--attention-backend triton_attn_diffkv \
--kv-cache-dtype fp8_e4m3 \
--max-model-len 131072 \
--max-num-batched-tokens 16384 \
--enable-prefix-caching \
--enable-chunked-prefill \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
--no-async-scheduling \
--enable-auto-tool-choice \
--tool-call-parser mimo \
--reasoning-parser mimo
export VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1
export VLLM_USE_RAY_COMPILED_DAG_OVERLAP_COMM=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export NCCL_CUMEM_ENABLE=0
export NCCL_NVLS_ENABLE=0
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_FLASHINFER_MOE_BACKEND=throughput
Important fixes
The main correctness issue was not performance tuning; it was checkpoint/layout compatibility.
1. ModelOpt mixed MXFP8/NVFP4 dispatch
The checkpoint mixes MXFP8 dense layers with NVFP4 experts. The ModelOpt mixed-precision path needed to dispatch MXFP8 linear layers to the MXFP8 method instead of falling through.
if quant_algo == "MXFP8":
return ModelOptMxFp8LinearMethod(self.mxfp8_config)
if quant_algo == "NVFP4":
return ModelOptNvFp4LinearMethod(self.nvfp4_config)
2. Do not invert weight_scale_inv
For this checkpoint, weight_scale_inv is UE8M0 MXFP8 scale metadata. Despite the name, it should not be reciprocal-inverted. I aliased it to vLLMโs expected weight_scale parameter.
layer.register_parameter("weight_scale", weight_scale)
layer.register_parameter("weight_scale_inv", weight_scale)
3. Fix fused QKV TP loading
This was the biggest text-quality bug. The NVFP4 checkpoint uses a deinterleaved QKV layout, but the loader was blindly chunking the fused QKV tensor by TP rank. That can load Q rows into K/V slots.
Bad pattern:
loaded_weight = loaded_weight.chunk(tp_size, dim=0)[tp_rank]
default_weight_loader(param, loaded_weight)
Fix: use the parameterโs QKV-aware loader.
param = params_dict[name]
weight_loader = getattr(param, "weight_loader", default_weight_loader)
weight_loader(param, loaded_weight)
This same fix is needed for the MTP draft model.
4. Fix Omni + MTP quant metadata mapping
Omni remaps target model keys from:
model.* -> language_model.model.*
But MTP draft modules still live under:
model.mtp.*
So the MTP draft lost its quant metadata. The fix remaps MTP metadata back for the draft model:
hf_to_vllm_mapper = WeightsMapper(
orig_to_new_prefix={
"language_model.model.mtp.": "model.mtp.",
}
)
After this, Omni MTP acceptance recovered.
Validation
Text generation: OK
Tool calling: OK
Reasoning parser: OK
Image prompt: OK
Omni MTP depth 2: OK
Non-eager execution: OK
Example MTP acceptance after the Omni quant fix:
Text prompt: 42 / 78 = 53.85%
Image prompt: 38 / 52 = 73.08%
Tool eval run: 15621 / 21221 = 73.61%
Benchmarks
tool-eval-bench
score: 89 / 100
points: 123 / 138
rating: โ
โ
โ
โ
Good
safety warnings: 0
llama-benchy
Run shape: pp=2048, tg=32, prefix caching enabled, concurrency 1/2, 3 runs.
| Depth | Concurrency | Prefill total t/s | Decode total t/s | TTFR ms |
|---|---|---|---|---|
| 0 | 1 | 3734.84 | 34.14 | 785.03 |
| 0 | 2 | 3001.78 | 62.64 | 1295.67 |
| 4096 | 1 | 2854.95 | 36.69 | 954.03 |
| 4096 | 2 | 2335.61 | 52.77 | 1690.58 |
| 8192 | 1 | 2484.35 | 33.19 | 1061.04 |
| 8192 | 2 | 2088.01 | 53.65 | 1900.97 |
| 16384 | 1 | 1983.69 | 32.75 | 1269.15 |
| 16384 | 2 | 1728.92 | 53.77 | 2310.42 |
| 32768 | 1 | 1404.16 | 29.55 | 1695.20 |
| 32768 | 2 | 1272.26 | 44.78 | 3170.00 |
| 65536 | 1 | 895.17 | 21.69 | 2524.51 |
| 98304 | 1 | 660.81 | 19.62 | 3336.00 |
| 114688 | 1 | 581.64 | 19.32 | 3757.79 |
The exact depth=131072 point with tg=32 is invalid because prompt + requested output exceeds the 131072-token model window:
input tokens: at least 131041
requested output tokens: 32
total: at least 131073 > 131072
Also, concurrency=2 at very long context shows a large latency increase. That is expected here: without additional long-context attention optimizations, decode becomes dominated by scanning a large KV cache. For this 2ร Spark setup, c2 is useful at lower/mid context; c1 is the cleaner view at 65k+ context.
Known caveats
- Image input is validated; audio/video paths still need separate validation.
- FP8 E4M3 KV cache works, but this run used
calculate_kv_scales=False. - The exact 131072-depth benchmark needs smaller
tgor slightly lower prompt depth. - This is currently packaged as a reproducible runtime mod/recipe, not an upstream-clean vLLM PR series.
Minimal replication checklist
- Use a vLLM build with MiMo V2.5 and
TRITON_ATTN_DIFFKV. - Apply the two mods from the PR:
mods/fix-modelopt-mixed-mxfp8mods/fix-mimo-v2-vllm
- Launch the recipe with
--load-format instanttensor. - Confirm MXFP8, NVFP4, DiffKV, Omni, and MTP startup markers.
- Run a text prompt, an image prompt, and a tool-call prompt.
Thanks to the Xiaomi/MiMo team, the NVFP4 export author, the vLLM contributors, and the prior MiMo V2.5 DGX Spark notes that helped guide this setup.