I managed to have Mimo 2.5 Pro NVFP4 working on 8xGB10 cluster using eugr vllm plus all the patches found and beautifully described and documented here: GitHub - idonati/spark-vllm-docker-festr2: Patches + recipe to deploy festr2/MiMo-V2.5-Pro-NVFP4-MXFP8-attn-TP8 on 8-node DGX Spark sm_121 (Ray + vLLM, TP=8). Fixes the fused-qkv loader bug that mis-slotted Q values as K/V on 7 of 8 ranks. ยท GitHub
I used festr2/MiMo-V2.5-Pro-NVFP4-MXFP8-attn-TP8 as suggested.
Very usable speed:
No MTP: 22t/s low context, 19t/s 120k context
MTP: 35t/s low context, 17t/s 120k context.
Thank you to all that made this possible and shared the solution.
can you run tool-eval-bench and share results here?
here they are:
Tool-Call Benchmark โ mimo-v2.5-pro
-
**Run ID**: `2026-05-21T15-13-02Z_5dd040`
-
**Date**: `2026-05-21T15:33:38.078120+00:00`
-
**tool-eval-bench**: `v1.8.0 4aa85fb`
-
**Final Score**: **88** / 100
-
**Total Points**: 122 / 138
-
**Rating**: โ
โ
โ
โ
Good
-
**Tool Definition Overhead**: ~4,637 tokens (52 tools, 18,548 chars)
-
**Deployability**: **70** / 100 (ฮฑ=0.7)
-
**Quality**: 88 / 100
-
**Responsiveness**: 27 / 100 (median turn: 5.9s)
Run Context
| Parameter |
Value |
| Backend |
vllm |
| Server |
`http://***:5001` |
| Model (API) |
`mimo-v2.5-pro` |
| Model (Root) |
`/root/models/models15/MiMo-V2.5-Pro-NVFP4` |
| Temperature |
0.0 |
| Seed |
42 |
| Max Turns |
8 |
| Timeout |
120.0s |
| Scenarios |
all (69) |
| Parallel |
1 (sequential) |
| Error Rate |
0.0 |
| Thinking |
enabled |
Inference Engine
| Property |
Value |
| Engine |
vLLM 0.21.1rc1.dev117+ge8026fa64.d20260519 |
| Max Model Length |
667,472 |
| Host |
`gx10-e12b` |
| Platform |
`Linux-6.17.0-1014-nvidia-aarch64-with-glibc2.39` |
| Python |
3.12.3 |
Category Scores
| Category |
Earned |
Max |
Percent |
| Tool Selection |
6 |
6 |
100% |
| Parameter Precision |
6 |
6 |
100% |
| Multi-Step Chains |
8 |
8 |
100% |
| Restraint & Refusal |
5 |
6 |
83% |
| Error Recovery |
6 |
6 |
100% |
| Localization |
6 |
6 |
100% |
| Structured Reasoning |
6 |
6 |
100% |
| Instruction Following |
10 |
10 |
100% |
| Context & State |
15 |
20 |
75% |
| Code Patterns |
6 |
6 |
100% |
| Safety & Boundaries |
24 |
26 |
92% |
| Toolset Scale |
7 |
8 |
88% |
| Autonomous Planning |
4 |
6 |
67% |
| Creative Composition |
5 |
6 |
83% |
| Structured Output |
8 |
12 |
67% |
in vllm I used
reasoning_parser: mimo
tool_call_parser: mimo
and left vllm default params for temp, top p and so on
do I need to change something for better results?
Update for speed using mtp-2, single user request, coding:
40 t/s - 1k context,
32t/s - 30k context,
25t/s - 125k context,
17t/s - 250k context.
13t/s - 350k context
in tool call bench with 2 parallel reached 60t/s and in 4 parallel reached 83t/s, not bad for 1T model
What is the prompt processing speed?
Small improvement with temp 0.7 and a mod applied for chat template:
โ Score: 91 / 100 โ
โ Rating: โ
โ
โ
โ
โ
Excellent โ
Running tool-call benchmark with parallel 2 and temperature=0.7โฆ
๐ง Tool-Call Benchmark
Server: http://localhost:5001
Querying http://localhost:5001/v1/models โฆ โ /root/models/models15/MiMo-V2.5-Pro-NVFP4 (alias: mimo-v2.5-pro)
โ Warm-up complete (3343 ms)
๐ Engine: vLLM 0.21.1rc1.dev117+ge8026fa64.d20260519
โญโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ ๐ง Tool-Call Benchmark โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ /root/models/models15/MiMo-V2.5-Pro-NVFP4 via vllm @ http://localhost:5001 โ
โ 69 scenarios v1.8.0 โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ๐ Benchmark Complete โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ โ
โ Model: /root/models/models15/MiMo-V2.5-Pro-NVFP4 โ
โ Score: 91 / 100 โ
โ Rating: โ
โ
โ
โ
โ
Excellent โ
โ Engine: vLLM 0.21.1rc1.dev117+ge8026fa64.d20260519 โ
โ Max context: 400,000 tokens โ
โ โ
โ โ
59 passed โ ๏ธ 8 partial โ 2 failed โ
โ Points: 126/138 โ
โ โ
โ Quality: 91/100 โ
โ Responsiveness: 32/100 (median turn: 4.9s) โ
โ Deployability: 73/100 (ฮฑ=0.7) โ
โ Weakest: A Tool Selection (67%) โ
โ โ
โ Completed in 557.3s โ tool-eval-bench v1.8.0 โ
โ โ
โ ๐ Token Usage: โ
โ Total: 278,927 tokens โ Efficiency: 0.5 pts/1K tokens
may I ask, what kind of switch are you using?
p_noch
11
canโt wait to try, now im also at 8 sparks but no switch yet