Mimo 2.5 Pro NVFP4 on 8xGB10 cluster

I managed to have Mimo 2.5 Pro NVFP4 working on 8xGB10 cluster using eugr vllm plus all the patches found and beautifully described and documented here: GitHub - idonati/spark-vllm-docker-festr2: Patches + recipe to deploy festr2/MiMo-V2.5-Pro-NVFP4-MXFP8-attn-TP8 on 8-node DGX Spark sm_121 (Ray + vLLM, TP=8). Fixes the fused-qkv loader bug that mis-slotted Q values as K/V on 7 of 8 ranks. ยท GitHub

I used festr2/MiMo-V2.5-Pro-NVFP4-MXFP8-attn-TP8 as suggested.

Very usable speed:

No MTP: 22t/s low context, 19t/s 120k context

MTP: 35t/s low context, 17t/s 120k context.

Thank you to all that made this possible and shared the solution.

can you run tool-eval-bench and share results here?

here they are:

Tool-Call Benchmark โ€” mimo-v2.5-pro

  • **Run ID**: `2026-05-21T15-13-02Z_5dd040`

  • **Date**: `2026-05-21T15:33:38.078120+00:00`

  • **tool-eval-bench**: `v1.8.0 4aa85fb`

  • **Final Score**: **88** / 100

  • **Total Points**: 122 / 138

  • **Rating**: โ˜…โ˜…โ˜…โ˜… Good

  • **Tool Definition Overhead**: ~4,637 tokens (52 tools, 18,548 chars)

  • **Deployability**: **70** / 100 (ฮฑ=0.7)

  • **Quality**: 88 / 100

  • **Responsiveness**: 27 / 100 (median turn: 5.9s)

Run Context

Parameter Value
Backend vllm
Server `http://***:5001`
Model (API) `mimo-v2.5-pro`
Model (Root) `/root/models/models15/MiMo-V2.5-Pro-NVFP4`
Temperature 0.0
Seed 42
Max Turns 8
Timeout 120.0s
Scenarios all (69)
Parallel 1 (sequential)
Error Rate 0.0
Thinking enabled

Inference Engine

Property Value
Engine vLLM 0.21.1rc1.dev117+ge8026fa64.d20260519
Max Model Length 667,472
Host `gx10-e12b`
Platform `Linux-6.17.0-1014-nvidia-aarch64-with-glibc2.39`
Python 3.12.3

Category Scores

Category Earned Max Percent
Tool Selection 6 6 100%
Parameter Precision 6 6 100%
Multi-Step Chains 8 8 100%
Restraint & Refusal 5 6 83%
Error Recovery 6 6 100%
Localization 6 6 100%
Structured Reasoning 6 6 100%
Instruction Following 10 10 100%
Context & State 15 20 75%
Code Patterns 6 6 100%
Safety & Boundaries 24 26 92%
Toolset Scale 7 8 88%
Autonomous Planning 4 6 67%
Creative Composition 5 6 83%
Structured Output 8 12 67%

in vllm I used
reasoning_parser: mimo
tool_call_parser: mimo

and left vllm default params for temp, top p and so on

do I need to change something for better results?

Update for speed using mtp-2, single user request, coding:
40 t/s - 1k context,
32t/s - 30k context,
25t/s - 125k context,
17t/s - 250k context.
13t/s - 350k context

in tool call bench with 2 parallel reached 60t/s and in 4 parallel reached 83t/s, not bad for 1T model

What is the prompt processing speed?

1900-2000t/s

Small improvement with temp 0.7 and a mod applied for chat template:

โ”‚ Score: 91 / 100 โ”‚
โ”‚ Rating: โ˜…โ˜…โ˜…โ˜…โ˜… Excellent โ”‚

Running tool-call benchmark with parallel 2 and temperature=0.7โ€ฆ

๐Ÿ”ง Tool-Call Benchmark
Server: http://localhost:5001
Querying http://localhost:5001/v1/models โ€ฆ โœ“ /root/models/models15/MiMo-V2.5-Pro-NVFP4 (alias: mimo-v2.5-pro)

โœ“ Warm-up complete (3343 ms)
๐Ÿ” Engine: vLLM 0.21.1rc1.dev117+ge8026fa64.d20260519

โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ ๐Ÿ”ง Tool-Call Benchmark โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ /root/models/models15/MiMo-V2.5-Pro-NVFP4 via vllm @ http://localhost:5001 โ”‚
โ”‚ 69 scenarios v1.8.0 โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€๐Ÿ† Benchmark Complete โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ

โ”‚ โ”‚
โ”‚ Model: /root/models/models15/MiMo-V2.5-Pro-NVFP4 โ”‚
โ”‚ Score: 91 / 100 โ”‚
โ”‚ Rating: โ˜…โ˜…โ˜…โ˜…โ˜… Excellent โ”‚
โ”‚ Engine: vLLM 0.21.1rc1.dev117+ge8026fa64.d20260519 โ”‚
โ”‚ Max context: 400,000 tokens โ”‚
โ”‚ โ”‚
โ”‚ โœ… 59 passed โš ๏ธ 8 partial โŒ 2 failed โ”‚
โ”‚ Points: 126/138 โ”‚
โ”‚ โ”‚
โ”‚ Quality: 91/100 โ”‚
โ”‚ Responsiveness: 32/100 (median turn: 4.9s) โ”‚
โ”‚ Deployability: 73/100 (ฮฑ=0.7) โ”‚
โ”‚ Weakest: A Tool Selection (67%) โ”‚
โ”‚ โ”‚
โ”‚ Completed in 557.3s โ”‚ tool-eval-bench v1.8.0 โ”‚
โ”‚ โ”‚
โ”‚ ๐Ÿ“Š Token Usage: โ”‚
โ”‚ Total: 278,927 tokens โ”‚ Efficiency: 0.5 pts/1K tokens

may I ask, what kind of switch are you using?

mikrotik crs804

XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash ยท Hugging Face. this is interesting.

canโ€™t wait to try, now im also at 8 sparks but no switch yet