MiMo-V2.5-NVFP4 on 2x Spark Cluster - Recipe, findings, fixes, benchmarks

Hey guys, I decided to stop lurking and make an effort to contribute in this forum. I got lukealonso/MiMo-V2.5-NVFP4 running on a 2ร— DGX Spark / GB10 cluster with vLLM, including Omni/multimodal serving and MTP speculative decoding.

Special thanks to @CyberTen and @mclenithan for getting the native model quant off the ground and providing me with a starting point for getting this working. This model is looking very promising from a quality and throughput standpoint for 2xSpark users. Multimodal capabilities are a big plus as well, although this setup has only validated image processing so far. Audio/Video could work, but I haven't tested this yet. Hopefully this can provide a good baseline for further testing & optimization from the community.

Patch bundle / reproducible recipe:

https://github.com/eugr/spark-vllm-docker/pull/251

Working configuration

ItemValue
Hardware2ร— DGX Spark / GB10, TP=2
Modellukealonso/MiMo-V2.5-NVFP4
vLLM0.21.1rc1.dev39, CUDA 13.2 build
Load formatinstanttensor
AttentionTRITON_ATTN_DIFFKV
Dense GEMMFlashInfer-CUTLASS MXFP8
MoEFlashInfer-CUTLASS NVFP4
KV cachefp8_e4m3
Context131072
ServingOmni, image input validated
Spec decodeMiMo MTP, num_speculative_tokens=2

Expected startup markers:

Resolved architecture: MiMoV2OmniForCausalLM
Resolved architecture: MiMoV2OmniMTPModel
Using FlashInferCutlassMxfp8LinearKernel for MXFP8 GEMM
Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend
Using TRITON_ATTN_DIFFKV
cache_dtype="fp8_e4m3"

Launch shape

vllm serve lukealonso/MiMo-V2.5-NVFP4 \
  --served-model-name MiMo-V2.5-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --load-format instanttensor \
  --hf-overrides '{"architectures":["MiMoV2OmniForCausalLM"]}' \
  --limit-mm-per-prompt '{"image":4,"video":1,"audio":1}' \
  --attention-backend triton_attn_diffkv \                                              
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 131072 \ 
  --max-num-batched-tokens 16384 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \ 
  --no-async-scheduling \
  --enable-auto-tool-choice \ 
  --tool-call-parser mimo \
  --reasoning-parser mimo                                                             

                                                                                   
export VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1                                             
export VLLM_USE_RAY_COMPILED_DAG_OVERLAP_COMM=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1                                                
export NCCL_CUMEM_ENABLE=0
export NCCL_NVLS_ENABLE=0                                                            
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=1                                                  
export VLLM_FLASHINFER_MOE_BACKEND=throughput

Important fixes

The main correctness issue was not performance tuning; it was checkpoint/layout compatibility.

1. ModelOpt mixed MXFP8/NVFP4 dispatch

The checkpoint mixes MXFP8 dense layers with NVFP4 experts. The ModelOpt mixed-precision path needed to dispatch MXFP8 linear layers to the MXFP8 method instead of falling through.

if quant_algo == "MXFP8":
    return ModelOptMxFp8LinearMethod(self.mxfp8_config)
if quant_algo == "NVFP4":
    return ModelOptNvFp4LinearMethod(self.nvfp4_config)
2. Do not invert weight_scale_inv

For this checkpoint, weight_scale_inv is UE8M0 MXFP8 scale metadata. Despite the name, it should not be reciprocal-inverted. I aliased it to vLLMโ€™s expected weight_scale parameter.

layer.register_parameter("weight_scale", weight_scale)
layer.register_parameter("weight_scale_inv", weight_scale)
3. Fix fused QKV TP loading

This was the biggest text-quality bug. The NVFP4 checkpoint uses a deinterleaved QKV layout, but the loader was blindly chunking the fused QKV tensor by TP rank. That can load Q rows into K/V slots.

Bad pattern:

loaded_weight = loaded_weight.chunk(tp_size, dim=0)[tp_rank]
default_weight_loader(param, loaded_weight)

Fix: use the parameterโ€™s QKV-aware loader.

param = params_dict[name]
weight_loader = getattr(param, "weight_loader", default_weight_loader)
weight_loader(param, loaded_weight)

This same fix is needed for the MTP draft model.

4. Fix Omni + MTP quant metadata mapping

Omni remaps target model keys from:

model.* -> language_model.model.*

But MTP draft modules still live under:

model.mtp.*

So the MTP draft lost its quant metadata. The fix remaps MTP metadata back for the draft model:

hf_to_vllm_mapper = WeightsMapper(
    orig_to_new_prefix={
        "language_model.model.mtp.": "model.mtp.",
    }
)

After this, Omni MTP acceptance recovered.


Validation

Text generation:       OK
Tool calling:          OK
Reasoning parser:      OK
Image prompt:          OK
Omni MTP depth 2:      OK
Non-eager execution:   OK

Example MTP acceptance after the Omni quant fix:

Text prompt:       42 / 78 = 53.85%
Image prompt:      38 / 52 = 73.08%
Tool eval run:  15621 / 21221 = 73.61%

Benchmarks

tool-eval-bench

score:           89 / 100
points:          123 / 138
rating:          โ˜…โ˜…โ˜…โ˜… Good
safety warnings: 0

llama-benchy

Run shape: pp=2048, tg=32, prefix caching enabled, concurrency 1/2, 3 runs.

DepthConcurrencyPrefill total t/sDecode total t/sTTFR ms
013734.8434.14785.03
023001.7862.641295.67
409612854.9536.69954.03
409622335.6152.771690.58
819212484.3533.191061.04
819222088.0153.651900.97
1638411983.6932.751269.15
1638421728.9253.772310.42
3276811404.1629.551695.20
3276821272.2644.783170.00
655361895.1721.692524.51
983041660.8119.623336.00
1146881581.6419.323757.79

The exact depth=131072 point with tg=32 is invalid because prompt + requested output exceeds the 131072-token model window:

input tokens: at least 131041
requested output tokens: 32
total: at least 131073 > 131072

Also, concurrency=2 at very long context shows a large latency increase. That is expected here: without additional long-context attention optimizations, decode becomes dominated by scanning a large KV cache. For this 2ร— Spark setup, c2 is useful at lower/mid context; c1 is the cleaner view at 65k+ context.


Known caveats

  • Image input is validated; audio/video paths still need separate validation.
  • FP8 E4M3 KV cache works, but this run used calculate_kv_scales=False.
  • The exact 131072-depth benchmark needs smaller tg or slightly lower prompt depth.
  • This is currently packaged as a reproducible runtime mod/recipe, not an upstream-clean vLLM PR series.

Minimal replication checklist

  1. Use a vLLM build with MiMo V2.5 and TRITON_ATTN_DIFFKV.
  2. Apply the two mods from the PR:
    • mods/fix-modelopt-mixed-mxfp8
    • mods/fix-mimo-v2-vllm
  3. Launch the recipe with --load-format instanttensor.
  4. Confirm MXFP8, NVFP4, DiffKV, Omni, and MTP startup markers.
  5. Run a text prompt, an image prompt, and a tool-call prompt.

Thanks to the Xiaomi/MiMo team, the NVFP4 export author, the vLLM contributors, and the prior MiMo V2.5 DGX Spark notes that helped guide this setup.

I will give it a shot, thank you for all the hard work!

โ”€โ”€ Run 1/2 โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
[Q&A] 157 tokens in 8.95s = 17.5 tok/s (prompt: 37)
[Code] 512 tokens in 13.37s = 38.2 tok/s (prompt: 44)
[JSON] 732 tokens in 17.66s = 41.4 tok/s (prompt: 62)
[Math] 64 tokens in 1.83s = 34.9 tok/s (prompt: 43)
[LongCode] 2048 tokens in 52.25s = 39.1 tok/s (prompt: 51)

โ”€โ”€ Run 2/2 โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
[Q&A] 161 tokens in 4.36s = 36.9 tok/s (prompt: 37)
[Code] 512 tokens in 12.82s = 39.9 tok/s (prompt: 44)
[JSON] 736 tokens in 17.56s = 41.9 tok/s (prompt: 62)
[Math] 64 tokens in 1.91s = 33.5 tok/s (prompt: 43)
[LongCode] 2048 tokens in 53.56s = 38.2 tok/s (prompt: 51)

tool-eval-bench: MiMo-V2.5-NVFP4

Final Score: 88/100 (88.4%)
Points: 122/138
Rating: โ˜…โ˜…โ˜…โ˜… Good
Deployability: 74
Responsiveness: 41
Median Turn: 3783.6ms
Token Efficiency: 0.4
Total Tokens: 304436
Error Rate: 0.0%
Worst Category: M Autonomous Planning (67%)

great work, many thanks. Me and claude spent half a day trying to mod Lukeโ€™s SGLang images, but at best I could only get garbled output with cuda graph turned off. Will give this a shot later on today.

I spent some time bashing my head against this model but couldnโ€™t get it to work. Well done!

Pushing this beyond 128K context window didnโ€™t succeed. It seems like these quants are really pushing the boundaries of our systems.

I tried this with both lukealonso/MiMo-V2.5-NVFP4 ยท Hugging Face and shadowlilac/MiMo-V2.5-NVFP4 ยท Hugging Face โ€“ here is what I got:

โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โšก llama-benchy Throughput Benchmark โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ lukealonso/MiMo-V2.5-NVFP4                                                           โ”‚
โ”‚ pp=[2048]  tg=[128]  depth=[0, 4096, 8192]  concurrency=[1, 2, 4]  runs=3            โ”‚
โ”‚ latency=generation                                                                   โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

  โœ“ Complete โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 27/27 0:04:40

  llama-benchy 0.3.7
  Estimated latency: 234.2 ms

                                  llama-benchy Results
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
โ”ƒ Test                 โ”ƒ  c  โ”ƒ   pp t/s โ”ƒ   tg t/s โ”ƒ TTFT (ms) โ”ƒ Total (ms) โ”ƒ    Tokens
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
โ”‚ pp2048 tg128 @ d0    โ”‚ c1  โ”‚    3,577 โ”‚     34.0 โ”‚       869 โ”‚      4,335 โ”‚  2048+128
โ”‚ pp2048 tg128 @ d0    โ”‚ c2  โ”‚    2,828 โ”‚     54.3 โ”‚     1,449 โ”‚      5,612 โ”‚  2048+128
โ”‚ pp2048 tg128 @ d0    โ”‚ c4  โ”‚    2,717 โ”‚     51.4 โ”‚     2,841 โ”‚     11,212 โ”‚  2048+128
โ”‚ pp2048 tg128 @ d4096 โ”‚ c1  โ”‚    3,149 โ”‚     34.6 โ”‚     2,247 โ”‚      5,650 โ”‚  2048+128
โ”‚ pp2048 tg128 @ d4096 โ”‚ c2  โ”‚    2,949 โ”‚     51.0 โ”‚     4,168 โ”‚      8,733 โ”‚  2048+128
โ”‚ pp2048 tg128 @ d4096 โ”‚ c4  โ”‚    3,002 โ”‚     53.6 โ”‚     8,113 โ”‚     15,046 โ”‚  2048+128
โ”‚ pp2048 tg128 @ d8192 โ”‚ c1  โ”‚    3,026 โ”‚     32.1 โ”‚     3,685 โ”‚      7,371 โ”‚  2048+128
โ”‚ pp2048 tg128 @ d8192 โ”‚ c2  โ”‚    2,731 โ”‚     50.1 โ”‚     7,502 โ”‚     11,318 โ”‚  2048+128
โ”‚ pp2048 tg128 @ d8192 โ”‚ c4  โ”‚    2,866 โ”‚     39.3 โ”‚    13,425 โ”‚     20,530 โ”‚  2048+128
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€


                                 Speculative Decoding Results
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”
โ”ƒ Prompt     โ”ƒ Depโ€ฆ โ”ƒ Eff t/s โ”ƒ    ฮฑ % โ”ƒ Waste โ”ƒ ฯ„ len โ”ƒ Win โ”ƒ Draft t/s โ”ƒ TTFT ms โ”ƒ Tot
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”
โ”‚ filler     โ”‚    0 โ”‚    27.8 โ”‚  63.2% โ”‚   37% โ”‚   1.3 โ”‚   2 โ”‚      24.7 โ”‚       8 โ”‚
โ”‚ code       โ”‚    0 โ”‚    35.2 โ”‚  74.5% โ”‚   25% โ”‚   1.5 โ”‚   2 โ”‚      28.0 โ”‚       6 โ”‚
โ”‚ structured โ”‚    0 โ”‚    35.2 โ”‚  82.3% โ”‚   18% โ”‚   1.6 โ”‚   2 โ”‚      26.4 โ”‚      11 โ”‚
โ”‚ filler     โ”‚   4K โ”‚    21.5 โ”‚  62.3% โ”‚   38% โ”‚   1.2 โ”‚   2 โ”‚      19.1 โ”‚      17 โ”‚
โ”‚ code       โ”‚   4K โ”‚    35.4 โ”‚  74.5% โ”‚   25% โ”‚   1.5 โ”‚   2 โ”‚      28.2 โ”‚       7 โ”‚
โ”‚ structured โ”‚   4K โ”‚    35.3 โ”‚  79.6% โ”‚   20% โ”‚   1.6 โ”‚   2 โ”‚      27.0 โ”‚      18 โ”‚
โ”‚ filler     โ”‚   8K โ”‚    19.8 โ”‚  58.5% โ”‚   42% โ”‚   1.2 โ”‚   2 โ”‚      18.3 โ”‚      22 โ”‚
โ”‚ code       โ”‚   8K โ”‚    36.5 โ”‚  80.6% โ”‚   19% โ”‚   1.6 โ”‚   2 โ”‚      27.9 โ”‚       7 โ”‚
โ”‚ structured โ”‚   8K โ”‚    35.8 โ”‚  82.3% โ”‚   18% โ”‚   1.6 โ”‚   2 โ”‚      26.9 โ”‚       7 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€

  Highest acceptance: structured (82.3%)  Lowest: filler (58.5%)
  Draft window: 1.5/2 positions used (73% utilization)  Avg waste: 27%

Something seems to be off with the tool calling - I see a bunch of this in the logs:

(APIServer pid=255) INFO 05-18 09:54:41 [qwen3xml_tool_parser.py:1157] vLLM Successfully import tool parser Qwen3XMLToolParser !
(APIServer pid=255) WARNING 05-18 09:54:44 [qwen3xml_tool_parser.py:303] Error when parsing XML elements: not well-formed (invalid token): line 6, column 1
(APIServer pid=255) INFO 05-18 09:54:44 [qwen3xml_tool_parser.py:1157] vLLM Successfully import tool parser Qwen3XMLToolParser !

and this:

(APIServer pid=255) WARNING 05-18 09:55:29 [serving.py:911] MTP truncation detected for request chatcmpl-bf4ad8dea81bdb9a: finished with 'stop' but tools configured and only reasoning produced.

Iโ€™ve been wanting to try this model for a while now, great job getting this working!

I pulled your PR and got it up and running in no time. Iโ€™m running a slightly newer VLLM build (v0.21.1rc1.dev50+) and had no issues running higher context, testing with 261k now. VLLM reports max concurrency of 1.54x at 261440 max-num-len. I only commented out PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True in the recipe as I seem to remember it causing issues with the Qwen 3.5 397b recipe.

Audio tokenizer is not found so thatโ€™s not working, Opencode running Kimi K2.6 claimed it was an issue with the safetensor loading not accepting subfolders, not sure about that one but audio isnโ€™t very important to me anyway.

I also see the Error when parsing XML elements: not well-formed (invalid token) error, but quite infrequently.

I got a similar tool-eval-bench score as above, this is with max-num-len at 261k:

MiMo-V2.5-NVFP4 โ€” tool-eval-bench v1.7.0

Score: 88/100 (122/138 points) โ˜…โ˜…โ˜…โ˜… Good

Deployability: 75/100 | Responsiveness: 45/100 (median turn: 3.4s)

Breakdown: 48 pass, 17 partial, 4 fail out of 69 scenarios.

I found the issue for the tool calling error โ€“ hoping to get this merged soon: [Bugfix] Fix duplicate Qwen3 XML function close recovery by SeraphimSerapis ยท Pull Request #42969 ยท vllm-project/vllm ยท GitHub

So you run 261K with max_num_seqs 1?

I might give that a try.

I didnโ€™t set max-num-seqs at all, I tried initially but it limits the cuda graphs precompile and appeared to tank the performance, may need to retry.

I did get a crash using opencode and openclaw simultaneously with 100k+ contexts on each so itโ€™s not perfectly stable.

Nice work @a3refaat !!

Applied this fix and Iโ€™m getting modestly improved results. Good find! Something else worth noting, this quant checkpoint doesnโ€™t have any calibrated scale for fp8_e4m3 kv cache. I donโ€™t have substantive benchmarks to support this claim, but anecdotally these low precision types need adequate scale factors from a broad calibration dataset, even for kv cache. The checkpoint provides these calibrated scales for MXFP8 weights (E8M0) but not for E4M3 kv cache type. After some testing I found that my peak concurrent request support decreased from 2.18x โ†’ 2x at 196k ctx when using BF16 kv_cache_dtype. Tg throughput was similar at lower context lengths. So for long context workflows where quality is important, I will probably stick with bf16 for now

finally got it working. Would love to get the context up if we can.

I played around with this model a fair bit yesterday but by the end of the day I returned to Minimax M2.7. The main issues are with tool calling in Opencode and extreme thinking loops once you go much past 100k context. Despite the good tool-eval-bench score I find M2.7 to be much more reliable both in Opencode and OpenClaw.

The tool calling issues are probably from the Qwen template, but even after applying both @serapis PR from above and the Qwen 3.5 template mod and / or using qwen3_coder as tool-call-parser the issue persists. The failure modes are either ยซExpected function.name to be a stringยป or simply just stopping mid work, both familiar issues from the Qwen models.

Claude Code made some reliability real work tests and deepseek v4 Flash gets it better than m2.7 and Mimo is on par with deepseek with Omniโ€ฆ

Love Minimax but Iโ€™d tell you check out DS4

I tried MiMo-V2.5-NVFP4 on my 2 Spark cluster, and everything is up and running.

Huge thanks to everyone involved for all the hard work that went into making this possible. Itโ€™s genuinely appreciated.

That said, based on my testing and real-world use, it just doesnโ€™t quite hold up against M2.7 for my needs.

This is exciting, I havenโ€™t been motivated to try out new models recently, been putting off trying to get this one going. Been very happy with 397B in general, other than the very limited memory footprint/concurrency available. I really want to try out mimo to get more room for context.

Has anyone done any additional testing ? I kind of feel back to DS4 and Minimax 2.7

So those 2 was better for you?

Yea Minimax seems to be the MOST stable but I feel like a more optimizsed DS4 will win as the best current model for 2 clusters.