MiMo-V2.5 (New model)

The HF page looks more like a marketing page. Has anyone tried this family of models? Some benchmarks look very impressive. It should fit on 2 Sparks.

Thanks for bringing this up. I am afraid that, with its 310B FP8 parameters, this will not fit on just two Sparks unless you quantise it first.

I meant the 4-bit version; the NVFP4 build will be around 174 GB, which still leaves some space for context.
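The fit question above can be checked with rough arithmetic. This is a minimal sketch that counts weights only (no KV cache, activations, or OS overhead); the 119.61 GB per-node figure is the unified-memory size reported further down in this thread, and the helper names are illustrative.

```python
# Back-of-envelope weight footprint vs. cluster memory (weights only).
NODE_MEMORY_GB = 119.61  # unified memory per GB10 node, as reported in this thread


def weights_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), ignoring scales and metadata."""
    return params_billions * bits_per_param / 8


def headroom_gb(model_gb: float, nodes: int) -> float:
    """Memory left across the cluster after loading weights (negative: does not fit)."""
    return nodes * NODE_MEMORY_GB - model_gb


fp8_gb = weights_gb(310, 8)  # 310B parameters at 8 bits each
print(f"FP8 weights: {fp8_gb:.0f} GB")
print(f"FP8 on 2 nodes:            {headroom_gb(fp8_gb, 2):+.1f} GB")
print(f"FP8 on 4 nodes:            {headroom_gb(fp8_gb, 4):+.1f} GB")
print(f"NVFP4 (174 GB) on 2 nodes: {headroom_gb(174, 2):+.1f} GB")
```

Whatever headroom remains still has to cover the KV cache, the runtime, and the OS, so a positive number here is necessary but not sufficient.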

I have 4x GB10’s… Should I try SGLang or vLLM for this model? I think SGLang is going to give me more capabilities for all the multi modal bells and whistles but let me know if this is the wrong path. Will start setting up tonight. (Cancelling my DeepSeek v4 Flash setup :D too many models, too fast)

I spent a lot of today trying to quantize this model on my hardware. No luck at all. Not enough ram to handle it, couldn’t do layer by layer, and the model.safetensors.index.json is completely out-of-date vs. the safetensors files. How are people even able to load the model?? I had to either rename all the safetensors or hack the index to have the correct naming in order for transformers to even look at the model.

# MiMo V2.5 (310B FP8 MoE) on a 4-Node DGX Spark Cluster — Working SGLang Config

A reproducible, end-to-end recipe for serving **`XiaomiMiMo/MiMo-V2.5`** (310B total / 15B active, native FP8, omni-modal) across a **4-node ASUS Ascent GX10 (DGX Spark / GB10) cluster** with `lmsysorg/sglang:dev-cu13-mimo-v2.5`, TP=4 over 200 Gbps RoCE RDMA.

The cluster comes up clean, accepts requests, and serves at full 256K context. There are two non-obvious gotchas (parser routing + a `torchcodec` ABI mismatch) that you’ll hit if you copy the cookbook recipe verbatim — both are documented below with the exact fix.

This post exists so anyone else with a 4× GB10 cluster can skip the multi-day debugging cycle and just run the model.

---

## TL;DR — What works

| Setting | Value |
|---|---|
| **Image** | `lmsysorg/sglang:dev-cu13-mimo-v2.5` |
| **Topology** | 4 nodes × 1 GB10 each, TP=4, nnodes=4 |
| **Interconnect** | 200 Gbps RoCE RDMA via MikroTik CRS812, MTU 9000 |
| **Memory** | `--mem-fraction-static 0.70` (≈30 GiB free per node after warmup) |
| **Context** | `max_total_num_tokens=318,687`, `context_len=262,144` |
| **TTFT** | ~0.46 s on short prompts |
| **Throughput** | ~31.5 tok/s decode (TP=4 over RoCE) |
| **Reasoning parser** | **`mimo`**, NOT `qwen3` (this is a real bug, see below) |
| **Tool-call parser** | `mimo` |
| **Attention backend** | `triton` (FP8 CUTLASS dispatch is broken on sm_121a) |
| **MM attention backend** | `triton_attn` |
| **MoE runner** | `auto` |
| **KV cache dtype** | `fp8_e4m3` |
| **EAGLE speculative decoding** | **DISABLED** (see Gotcha 3 below) |

---

## Hardware

- **4× ASUS Ascent GX10** (DGX Spark variant)

- GB10 GPU, compute capability `sm_121a`

- 119.61 GB unified memory per node (CPU + GPU shared — no separate GPU OOM killer)

- Internal RoCE fabric: `enP2p1s0f0np0` interface on every node, **MTU 9000**

- Switch: MikroTik CRS812 (jumbo-frame capable), 200 Gbps per port

- RoCE IPs: `192.168.100.10`, `.11`, `.12`, `.13` for ranks 0–3

> **MTU 9000 must be persistent across reboots** (set in netplan). MTU mismatch on any node will silently hang NCCL collective init — the cluster will look “stuck on barrier” with no error.
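Since an MTU mismatch fails silently, it may be worth checking every node before launch. A minimal sketch, assuming Linux exposes the interface MTU under `/sys/class/net/<iface>/mtu` (standard sysfs); the function names are illustrative and the interface name is the one from the hardware list above.

```python
from pathlib import Path


def read_mtu(iface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Return the MTU Linux reports for a network interface via sysfs."""
    return int(Path(sysfs_root, iface, "mtu").read_text().strip())


def mtu_ok(iface: str, expected: int = 9000, sysfs_root: str = "/sys/class/net") -> bool:
    """True when the interface MTU matches the expected jumbo-frame size."""
    return read_mtu(iface, sysfs_root) == expected


if __name__ == "__main__":
    iface = "enP2p1s0f0np0"  # RoCE interface name from the hardware list
    try:
        status = "OK" if mtu_ok(iface) else "MISMATCH"
        print(f"{iface}: MTU {read_mtu(iface)} {status}")
    except FileNotFoundError:
        print(f"interface {iface} not found on this host")
```

Running this on all four nodes (for example over SSH) before starting the containers turns a silent NCCL hang into an explicit message.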

---

## OS hygiene (every node)

```bash
# CRITICAL: kill the desktop GUI. It eats GPU memory and can deadlock
# during weight load on Grace-Blackwell unified memory.
sudo systemctl set-default multi-user.target
sudo reboot
# (or temporarily: killall gnome-shell Xorg firefox gjs mutter-x11-frames)

# Drop page caches before launch (UMA `cudaMemGetInfo` underreports
# free memory by 3–4 GiB while page cache is full)
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# Lower swappiness to keep model weights resident
echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Disable services you don't need
sudo systemctl disable --now bluetooth ModemManager wpa_supplicant avahi-daemon fwupd
```

---

## Common environment variables (per node)

These match the 397B Qwen recipe and are unchanged for MiMo V2.5. The container runs with `--network host --gpus all --ipc host --privileged --shm-size 10.24g` and inherits these vars:

```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
TORCH_CUDA_ARCH_LIST=12.1a # GB10 is sm_121a, not sm_120
NCCL_IGNORE_CPU_AFFINITY=1
NCCL_SOCKET_IFNAME=enP2p1s0f0np0
NCCL_IB_HCA=roceP2p1s0f0 # RoCE adapter name on Ascent
NCCL_IB_DISABLE=0
NCCL_CUMEM_ENABLE=0 # CRITICAL: prevents init deadlock
GLOO_SOCKET_IFNAME=enP2p1s0f0np0
TP_SOCKET_IFNAME=enP2p1s0f0np0
OMP_NUM_THREADS=4
MKL_NUM_THREADS=4
TORCH_NUM_THREADS=4
SGLANG_ENABLE_SPEC_V2=True
VLLM_HOST_IP=<this node's RoCE IP> # 192.168.100.10..13
MODEL_LOADER_EXTRA_CONFIG={"enable_multithread_load":"true","num_threads":64}
```

`NCCL_CUMEM_ENABLE=0` is the single most important variable — without it the cluster will hang during NCCL init on Grace-Blackwell.
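Only the host IP changes from node to node, so the network-related part of the block can be generated per rank. The sketch below is one hypothetical way to do that: `node_env` and `docker_env_flags` are illustrative names, the values are copied from the block above, and `-e KEY=VALUE` is the standard `docker run` flag for environment variables.

```python
ROCE_IFACE = "enP2p1s0f0np0"
BASE_IP_PREFIX = "192.168.100."  # ranks 0-3 map to .10-.13, as listed under Hardware


def node_env(rank: int) -> dict[str, str]:
    """Network-related environment for one rank; only the host IP varies."""
    if not 0 <= rank <= 3:
        raise ValueError("rank must be in 0..3 for this 4-node cluster")
    return {
        "NCCL_SOCKET_IFNAME": ROCE_IFACE,
        "GLOO_SOCKET_IFNAME": ROCE_IFACE,
        "TP_SOCKET_IFNAME": ROCE_IFACE,
        "NCCL_IB_HCA": "roceP2p1s0f0",
        "NCCL_IB_DISABLE": "0",
        "NCCL_CUMEM_ENABLE": "0",  # the variable called out as critical above
        "VLLM_HOST_IP": f"{BASE_IP_PREFIX}{10 + rank}",
    }


def docker_env_flags(env: dict[str, str]) -> list[str]:
    """Flatten an env dict into ['-e', 'KEY=VALUE', ...] for `docker run`."""
    flags: list[str] = []
    for key, value in env.items():
        flags += ["-e", f"{key}={value}"]
    return flags


print(" ".join(docker_env_flags(node_env(0))))
```

Generating the flags from one place makes it harder for a single node to end up with a stale interface name or a missing `NCCL_CUMEM_ENABLE=0`.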

---

## SGLang launch arguments

```text
python3 -m sglang.launch_server \
  --trust-remote-code \
  --model-path /models/mimo-v2.5 \
  --tp 4 \
  --nnodes 4 \
  --dist-init-addr 192.168.100.10:29500 \
  --attention-backend triton \
  --mm-attention-backend triton_attn \
  --moe-runner-backend auto \
  --mem-fraction-static 0.70 \
  --kv-cache-dtype fp8_e4m3 \
  --max-running-requests 2 \
  --chunked-prefill-size 8192 \
  --max-prefill-tokens 8192 \
  --swa-full-tokens-ratio 0.3 \
  --reasoning-parser mimo \
  --tool-call-parser mimo \
  --schedule-policy lpm \
  --model-loader-extra-config "$MODEL_LOADER_EXTRA_CONFIG" \
  --host 0.0.0.0 \
  --port 30000 \
  --node-rank <0|1|2|3>
```

### Why `--mem-fraction-static 0.70` and not 0.85+?

The `cudaMemGetInfo` API is broken on Grace-Blackwell unified memory — it underreports free memory because the Linux page cache is counted as “used” but is kernel-reclaimable. SGLang’s allocator does an early pre-flight check using this API, and overshoots get killed by the host before model load completes. 0.70 is the stable upper bound across our 4 nodes; 0.78 occasionally OOMs the head node during the second torch fork-storm.
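To see how much memory is tied up in reclaimable cache before launching, one can compare `MemFree` and `MemAvailable` in `/proc/meminfo`. This is a minimal sketch; treating that gap as a proxy for what `cudaMemGetInfo` fails to count is an assumption based on the description above, not something verified against the driver.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo content into {field: value}; sizes are in kB."""
    info: dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def reclaimable_gap_gib(info: dict[str, int]) -> float:
    """GiB the kernel reports as available but not strictly free (mostly page cache)."""
    return (info["MemAvailable"] - info["MemFree"]) / (1024 ** 2)


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        gap = reclaimable_gap_gib(parse_meminfo(meminfo.read_text()))
        print(f"reclaimable but not free: {gap:.1f} GiB")
```

If the gap is several GiB, dropping page caches before launch (as in the OS hygiene section) should shrink it.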

### Why `triton` instead of FlashInfer / CUTLASS?

`sm_121a` (GB10) is **not** `sm_120` (consumer Blackwell, e.g. the RTX 50 series), and datacenter Blackwell such as B200 is a different target again (`sm_100`), even though they're all "Blackwell". Many prebuilt FP8 CUTLASS kernels are compiled only for those other Blackwell targets and silently miss on `sm_121a`. FlashInfer also has incomplete sm_121a coverage as of this image. Forcing `triton` for both attention and MM-attention bypasses the issue entirely with a small (~10–15%) throughput hit.

---

## Two gotchas you WILL hit

### Gotcha 1: `--reasoning-parser qwen3` misroutes everything

The official cookbook for some related Xiaomi/Qwen models uses `--reasoning-parser qwen3`. **Do not do this for MiMo V2.5.** With the qwen3 parser:

- `enable_thinking=false` → V2.5's chat template injects an empty `<think></think>` pair. The qwen3 parser dumps **the entire response** into `delta.reasoning_content`, leaving `delta.content` empty. It looks like the model produced nothing.

- `enable_thinking=true` → the qwen3 parser never sees the closing `</think>` because MiMo's reasoning output is shaped differently. Result: a 32K-token runaway "reasoning loop" that hits `finish_reason=length` with no answer.

This matches the symptom pattern reported for related models in SGLang issue #20786 ("[Bug] Qwen3.5-0.8B return content=null with --reasoning-parser qwen3", sgl-project/sglang on GitHub).

**Fix:** `--reasoning-parser mimo`. Both vLLM and SGLang ship a `mimo` parser. With it:

- thinking-OFF returns clean content in `delta.content`

- thinking-ON correctly splits `reasoning_content` from `content` and terminates on the close tag
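The behaviour expected from a working parser can be written down as a small reference function, which is handy when testing a deployment. This is an illustration only, not SGLang's `mimo` parser; it assumes reasoning is delimited by `<think>` and `</think>` tags, and `split_reasoning` is a hypothetical helper name.

```python
def split_reasoning(text: str, open_tag: str = "<think>", close_tag: str = "</think>") -> tuple[str, str]:
    """Split raw model output into (reasoning, content)."""
    start = text.find(open_tag)
    if start == -1:
        # No reasoning block at all: everything is user-facing content.
        return "", text.strip()
    body_start = start + len(open_tag)
    end = text.find(close_tag, body_start)
    if end == -1:
        # Unterminated block: the model never closed its reasoning.
        return text[body_start:].strip(), ""
    reasoning = text[body_start:end].strip()
    content = (text[:start] + text[end + len(close_tag):]).strip()
    return reasoning, content


# Thinking off: the template's empty pair must not swallow the answer.
print(split_reasoning("<think></think>Hello there."))
# Thinking on: reasoning and answer end up in separate fields.
print(split_reasoning("<think>2 + 2 = 4</think>The answer is 4."))
```

With the wrong parser, the first case is the one that comes back with an empty `content`, and the unterminated case corresponds to the runaway loop described above.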

### Gotcha 2: `torchcodec` ABI mismatch on launch

`MiMoV2OmniForCausalLM` (the only architecture SGLang actually implements for V2.5) imports `torchcodec` at module load — even when you only intend to use text. The container has no `torchcodec` installed, so import fails. The naive fix — `pip install torchcodec` — does not work either:

- Only `torchcodec >= 0.11.x` ships wheels for `cp312` / `aarch64`

- `torchcodec 0.11.x` has an ABI mismatch with the bundled PyTorch 2.9.1+cu130:

```
undefined symbol: torch_dtype_float4_e2m1fn_x2
```

- `torchcodec==0.7.0` (which would match) does not have a wheel for this Python/arch combo on PyPI.

**Fix (text-only deployments):** install the real wheel for its dist-info, then overwrite the Python files with stubs so the broken `.so` libraries never load. `transformers/audio_utils.py` calls `importlib.metadata.version("torchcodec")` at import, which is the only reason the dist-info matters.

```bash
# Replace X with the rank number; run once per container.
docker exec sglang-rank-X bash -c '
pip install --quiet --no-cache-dir torchcodec 2>&1 | tail -1
SP=$(python3 -c "import site; print(site.getsitepackages()[0])")
mkdir -p "$SP/torchcodec/decoders"

# Top-level: skip submodule imports + .so loads
printf "%s\n" "__version__ = \"0.11.1\"" > "$SP/torchcodec/__init__.py"

# Stub decoders.AudioDecoder (only thing mimo_v2_omni imports)
printf "%s\n%s\n%s\n%s\n" \
  "class AudioDecoder:" \
  "    def __init__(self, *a, **kw):" \
  "        raise NotImplementedError(\"stub torchcodec.decoders.AudioDecoder: text-only\")" \
  "" > "$SP/torchcodec/decoders/__init__.py"

# Verify: the dist-info version resolves and the stub imports cleanly
python3 -c "
import importlib.metadata
print(\"metadata version:\", importlib.metadata.version(\"torchcodec\"))
from torchcodec.decoders import AudioDecoder
print(\"stub OK\", AudioDecoder)
"
'
```

We first tried swapping the architecture string to `MiMoV2ForCausalLM` (no MM imports). It fails: there is no SGLang implementation for that arch in this image. We also tried `--language-only`, but it requires `--encoder-urls` to be set, which is incompatible with a text-only deploy. The stub is the only working path we found.

If you actually want audio decoding, you need to source-build a `torchcodec` against `pytorch 2.9.1+cu130` for `sm_121a`. We have not done this.
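If the nested shell quoting is awkward to maintain, the same stub files can be written from Python inside the container. This is a sketch of an equivalent approach, not the method used for the results in this post; `write_torchcodec_stub` is a hypothetical helper, and it assumes `pip install torchcodec` has already been run so the dist-info exists.

```python
from pathlib import Path

STUB_INIT = '__version__ = "0.11.1"\n'
STUB_DECODERS = (
    "class AudioDecoder:\n"
    "    def __init__(self, *args, **kwargs):\n"
    '        raise NotImplementedError("stub torchcodec.decoders.AudioDecoder: text-only")\n'
)


def write_torchcodec_stub(site_packages: str) -> Path:
    """Overwrite torchcodec's Python entry points so no native library is loaded."""
    pkg = Path(site_packages) / "torchcodec"
    (pkg / "decoders").mkdir(parents=True, exist_ok=True)
    (pkg / "__init__.py").write_text(STUB_INIT)
    (pkg / "decoders" / "__init__.py").write_text(STUB_DECODERS)
    return pkg
```

Inside a container this would be called as `write_torchcodec_stub(site.getsitepackages()[0])` after `import site`, once per rank.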

### Gotcha 3 (advisory): EAGLE speculative decoding kills rank 0

The cookbook recipe enables EAGLE:

```text
--speculative-algorithm EAGLE
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4
```

On Grace-Blackwell unified memory this consistently SIGKILLs the head node during init. Cause: SGLang’s EAGLE setup forks a **second** multi-thread loader on rank 0 to load the draft model, and the fork-storm trips a `multiprocessing.resource_tracker leaked semaphore` pattern that takes the head process down. Rank 0 dies silently; ranks 1–3 sit forever on the NCCL barrier.

We removed EAGLE entirely. Re-introducing it later via `--speculative-draft-model-path` (out-of-process draft loader) may work; we have not tested it. If you need EAGLE on GB10, treat this as the first thing to investigate.

---

## Launch order

Order matters. Workers must be online before the head node starts the all-reduce barrier, otherwise rank 0 will time out.

```text
1. Start sleep-infinity containers on ALL 4 nodes
2. Install the torchcodec stub on ALL 4 nodes (must be on every rank,
   not just head: every container imports MiMoV2OmniForCausalLM)
3. Launch SGLang on ranks 1, 2, 3 (the workers); they wait for rank 0
4. Wait ~10 seconds
5. Launch SGLang on rank 0 (the head)
6. Watch /tmp/sglang-rank0.log for "The server is fired up and ready to roll!"
```
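Step 6 can be automated with a small poller instead of tailing the log by hand. A minimal sketch: the log path and banner text come from the steps above, while the function name, timeout, and polling interval are illustrative assumptions.

```python
import time
from pathlib import Path

READY_BANNER = "The server is fired up and ready to roll!"


def wait_for_banner(log_path: str, banner: str = READY_BANNER,
                    timeout_s: float = 1800.0, poll_s: float = 5.0) -> bool:
    """Poll a log file until the banner appears; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    path = Path(log_path)
    while True:
        # Re-read the whole file each time; simple, and fine for a startup log.
        if path.exists() and banner in path.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

For example, `wait_for_banner("/tmp/sglang-rank0.log")` returns `True` once rank 0 is serving, and `False` if the banner never shows up within the timeout, which is a hint that a worker or the head node died during init.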

Full launcher (workers + head, RoCE addressing, env wiring): [`scripts/experiments/the-giant-launch.sh`](https://github.com/dizyx/nockerl-local-models/blob/main/scripts/experiments/the-giant-launch.sh) in the public repo.

---

## Health check

Once `rank 0` logs the ready banner:

```bash
curl -s http://192.168.100.10:30000/health | python3 -m json.tool
curl -s http://192.168.100.10:30000/v1/models | python3 -m json.tool
curl -s http://192.168.100.10:30000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mimo-v2.5",
    "messages": [{"role":"user","content":"Say hello in one short sentence."}],
    "max_tokens": 64,
    "chat_template_kwargs": {"enable_thinking": false}
  }' | python3 -m json.tool
```

`enable_thinking` MUST go inside `chat_template_kwargs` — it is not a top-level field. The `mimo` reasoning parser will produce clean `content` and an empty `reasoning_content` when thinking is off.
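The same request can be built from Python with only the standard library. This sketch mirrors the curl call; `build_chat_request` is a hypothetical helper, and the default model name is simply the one used in the curl example.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str, enable_thinking: bool,
                       model: str = "mimo-v2.5", max_tokens: int = 64) -> urllib.request.Request:
    """Build a chat-completions request with enable_thinking nested correctly."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # enable_thinking lives inside chat_template_kwargs, not at the top level.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("http://192.168.100.10:30000", "Say hello in one short sentence.", False)
print(req.full_url)
print(req.data.decode("utf-8"))
```

Sending it is then `urllib.request.urlopen(req)`; the response body is JSON and can be parsed with `json.load`.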

---

## Smoke results (post-fix)

After applying the parser fix and the torchcodec stub, both modes work:

| Mode | TTFT | Decode tok/s | `content` produced | `reasoning_content` |
|---|---|---|---|---|
| `enable_thinking=false` | 0.46 s | ~31.5 tok/s | clean | empty |
| `enable_thinking=true` (narrow Q) | 0.5–1 s | ~30 tok/s | clean | bounded |

For broad open-ended prompts in thinking mode, the model can produce very long reasoning chains (tens of thousands of reasoning tokens). This is a model-intrinsic property, not a parser bug — doubling `max_tokens` from 16K to 32K just doubles reasoning output. Set tight reasoning budgets if you serve open-ended workloads.
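To put the reasoning budget in wall-clock terms, the decode rate measured above gives a rough estimate. A small sketch, assuming a steady ~31.5 tok/s for a single request and reading 16K and 32K as 16,384 and 32,768 tokens; prefill time and queueing are ignored.

```python
def decode_minutes(tokens: int, tok_per_s: float) -> float:
    """Minutes spent decoding `tokens` at a steady single-request rate."""
    return tokens / tok_per_s / 60


for budget in (2_048, 16_384, 32_768):
    print(f"{budget:>6} tokens at 31.5 tok/s: about {decode_minutes(budget, 31.5):.1f} min")
```

At this decode rate a fully used 32K budget means a wait of well over a quarter of an hour before any answer appears, which is why a tight budget matters for interactive use.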

---

## What we have NOT tested

- Vision input end-to-end (the omni MM processor is wired but we never validated image paths beyond import-time)

- Audio decoding (we explicitly stubbed `AudioDecoder`)

- EAGLE speculative decoding (removed — see Gotcha 3)

- DeepEP path

- Multi-token-prediction (MTP) with custom draft

If you get any of these working on a 4× GB10 cluster, please reply on the forum thread — would love to compare notes.

---

## Why we wrote this up

We spent multiple debugging cycles on the parser and torchcodec issues and couldn’t find them documented anywhere for the GB10 + 4-node case. The cookbook recipes target H100 / B200, where neither bug surfaces. If you have a 4-Spark cluster you should be able to copy the launcher and the env block above and have a working server in under 30 minutes once the image is pulled.

---

## References

- SGLang day-0 image: `lmsysorg/sglang:dev-cu13-mimo-v2.5`

- Model: `XiaomiMiMo/MiMo-V2.5` on Hugging Face

- SGLang issue #20786: same parser-misroute pattern on related models

- NVIDIA DGX Spark forum: DGX Spark / GB10 User Forum (NVIDIA Developer Forums)

- Companion DGX Spark recipes & wheels: `eugr/spark-vllm-docker` on GitHub (Docker configuration for running vLLM on dual DGX Sparks)

---

*Cluster: 4× ASUS Ascent GX10 (GB10 / sm_121a), 200 Gbps RoCE via MikroTik CRS812, MTU 9000, NCCL 2.x, PyTorch 2.9.1+cu130, SGLang `dev-cu13-mimo-v2.5`.*

Major progress here!
I'll do a formal write-up soon, but for those of you with 4 GB10's this is a hell of a model once you apply a bunch of patches. Right now I am testing MTP options so I can show llama-benchy and tool testing results. Check this out: "MiMo V2.5 on 4-node Grace-Blackwell DGX Spark: 5 SGLang PRs and 2 doc notes" (discussion on the XiaomiMiMo/MiMo-V2.5 Hugging Face page).

Thanks for your updates @mclenithan!

I am really interested in this model. From the testing that I have done using the cloud it seems like a good candidate to run locally. I was wondering whether the community was interested in this model at all, since I have not seen much discussion about it - seems like DeepSeek V4 got most of the attention.

Is there potential for this model to work on a cluster of 2 GB10 in a quantized form in the future?

Based on the size of lukealonso/MiMo-V2.5-NVFP4 on Hugging Face, I don't see any issues with running it on two devices.

what Claude says about this:

The core issue: --moe-runner-backend triton (the only reliable backend on SM121a) does not support NVFP4. The Triton fused MoE kernel can load FP8 and BF16 weights, but cannot dequantize FP4. On non-Blackwell (SM90), NVFP4 falls back to Marlin W4A16 — but Marlin also lacks SM121a support.
flashinfer_dsl would be the only theoretical candidate — it generates Triton kernels via DSL and is explicitly documented for NVFP4/ModelOpt. But whether the generated kernels work on SM121a is completely untested and undocumented.
Conclusion: You can’t just copy the 4-node recipe 1:1 and plug in NVFP4 weights instead of FP8 ones—the MoE kernel path is different, and none of the working SM121a backends support NVFP4. You’d have to test flashinfer_dsl, but that’s a shot in the dark with no guarantee of success, and you’d probably still need 4 nodes due to memory constraints.

So if somebody wants to test, would be keen to know. :D

can we run the pro one with cluster of 8?

It seems people are running Q3 on a single DGX Spark 8) (according to a Reddit thread).

Any recipes ready for 2x cluster?

I started wondering how it will compare to Qwen 3.5 397B.

So far, this model is incredibly sensitive to sampling parameters such as temperature and repetition penalties. I am still not convinced I have the settings right, or whether this model simply overthinks (even compared to Qwen3.6 27B). I tried to get MTP working on the unquantized 2.5 version, but my computers keep locking up from OOM; it seems MTP grabs too much memory. There are several other strategies left to try, but I won't be physically near my computers for the next two weeks and can't risk locking them up again while I am away :D. (If anyone has good tips on how to prevent OOM issues while launching vLLM or SGLang with untested memory specifications, let me know. I've tried watchers and things, but so far they have been unreliable.)

I think it’d probably have to be quantized down a little. But with all the new patterns to get things to fit/work, I am sure there is a way to do it.

I think it’d be possible with the right quant, would have to investigate something like running Intel’s Autoround on it. Looks like no one has attempted Autoround on huggingface. Maybe $100 or less to rent the stack on RunPod to run Autoround and test.

Big update (see "MiMo V2.5 on 4-node Grace-Blackwell DGX Spark: 5 SGLang PRs and 2 doc notes" on the XiaomiMiMo/MiMo-V2.5 Hugging Face page): I got vision to work cleanly!

RE: SGLang sampling parameters, they are super sensitive with this model, but once you install those 6 patches above ^^^ (lol) and use these exact parameters, the model is outstanding! My parameter data:

{
  "model": "MiMo-V2.5",
  "source": "Xiaomi official recommendations + community Thought Loop mitigation (do NOT use Qwen3 settings)",
  "recommended_defaults": {
    "temperature": 0.6,
    "top_p": 0.95,
    "repetition_penalty": 1.2,
    "enable_thinking": true
  },
  "full_schema": {
    "temperature":        { "default": 0.6,  "min": 0,    "max": 2,   "step": 0.05 },
    "top_p":              { "default": 0.95, "min": 0,    "max": 1,   "step": 0.01 },
    "top_k":              { "default": -1,   "min": -1,   "max": 200, "step": 1 },
    "min_p":              { "default": 0,    "min": 0,    "max": 1,   "step": 0.01 },
    "presence_penalty":   { "default": 0,    "min": -2,   "max": 2,   "step": 0.1 },
    "frequency_penalty":  { "default": 0,    "min": -2,   "max": 2,   "step": 0.1 },
    "repetition_penalty": { "default": 1.2,  "min": 0.5,  "max": 2,   "step": 0.05 },
    "enable_thinking":    { "default": true }
  },
  "supported_params": [
    "temperature",
    "top_p",
    "top_k",
    "min_p",
    "presence_penalty",
    "frequency_penalty",
    "repetition_penalty",
    "enable_thinking"
  ],
  "notes": [
    "temperature=0.6 + top_p=0.95 are Xiaomi's published recommendations for MiMo V2.5.",
    "repetition_penalty=1.2 is a community-validated mitigation for the 'Thought Loop' failure mode where the model recurses inside <think> blocks.",
    "DO NOT copy Qwen3 sampling settings (temperature=0.7, top_p=0.8, no repetition_penalty) — they trigger the Thought Loop on MiMo.",
    "top_k=-1 means disabled (no top-k truncation). Leave as-is unless you have a specific reason.",
    "enable_thinking=true keeps the chain-of-thought <think> block on; set false only if you want shorter, non-reasoning responses."
  ]
}
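Because the model is this sensitive to sampling settings, it can help to validate parameters against the ranges above before sending a request. A minimal sketch: the ranges and defaults are copied from the JSON, while `sampling_params` and the constant names are hypothetical.

```python
# (min, max) per parameter, taken from "full_schema" above.
SCHEMA = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "top_k": (-1, 200),
    "min_p": (0.0, 1.0),
    "presence_penalty": (-2.0, 2.0),
    "frequency_penalty": (-2.0, 2.0),
    "repetition_penalty": (0.5, 2.0),
}

# Numeric entries of "recommended_defaults" above.
DEFAULTS = {"temperature": 0.6, "top_p": 0.95, "repetition_penalty": 1.2}


def sampling_params(**overrides: float) -> dict[str, float]:
    """Merge overrides into the recommended defaults and range-check every value."""
    params = {**DEFAULTS, **overrides}
    for name, value in params.items():
        if name not in SCHEMA:
            raise KeyError(f"unsupported sampling parameter: {name}")
        low, high = SCHEMA[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return params


print(sampling_params())
print(sampling_params(top_k=20))
```

The returned dict can be merged into a chat-completions payload; `enable_thinking` is left out on purpose because it belongs in `chat_template_kwargs` rather than next to the sampling fields.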

Final things:

Tools

Tool-Call Benchmark — /models/mimo-v2.5

  • Run ID: 2026-05-05T15-20-52Z_331c93

  • Date: 2026-05-05T15:36:06.044830+00:00

  • tool-eval-bench: v1.5.0

  • Final Score: 89 / 100

  • Total Points: 123 / 138

  • Rating: ★★★★ Good

  • Tool Definition Overhead: ~4,637 tokens (52 tools, 18,548 chars)

  • Deployability: 75 / 100 (α=0.7)

  • Quality: 89 / 100

  • Responsiveness: 41 / 100 (median turn: 3.8s)

Run Context

| Backend | sglang |
| Server | http://localhost:30000/v1 |
| Model (API) | /models/mimo-v2.5 |
| Temperature | 0.0 (extra_params override below) |
| Seed | 42 |
| Max Turns | 8 |
| Timeout | 60.0s |
| Scenarios | all (69) |
| Parallel | 1 (sequential) |
| Error Rate | 0.0 |
| Thinking | enabled |
| Extra Params | {"temperature": 0.6, "top_p": 0.95, "top_k": 20} |

Inference Engine

| Max Model Length | 1,048,576 |
| Host | gx10-c66d |
| Platform | Linux-6.17.0-1014-nvidia-aarch64-with-glibc2.39 |
| Python | 3.12.3 |

Category Scores

| Tool Selection | 12 / 12 | 100% |
| Multi-Step Chains | 6 / 8 | 75% |
| Restraint & Refusal | 6 / 6 | 100% |
| Error Recovery | 6 / 6 | 100% |
| Localization | 6 / 6 | 100% |
| Structured Reasoning | 6 / 6 | 100% |
| Instruction Following | 10 / 10 | 100% |
| Context & State | 12 / 20 | 60% | ← worst category
| Code Patterns | 5 / 6 | 83% |
| Safety & Boundaries | 25 / 26 | 96% |
| Toolset Scale | 8 / 8 | 100% |
| Autonomous Planning | 6 / 6 | 100% |
| Creative Composition | 5 / 6 | 83% |
| Structured Output | 10 / 12 | 83% |

benchy

(Note: this was before I gave it a lot more context and more concurrency)

# llama-benchy 0.3.7 | model=/models/mimo-v2.5 | latency=2.70 ms | prefix_caching=True
#
# pp = prompt-processing throughput  (tokens/sec)
# tg = token-generation throughput   (tokens/sec)
# *_req = per-request slice of the same number
# TTFR = time to first response token  (ms)
# e2e_TTFT = wall-clock TTFT incl. queueing  (ms)

 depth conc     pp_thr    pp_req     tg_thr    tg_req     TTFR_ms   e2e_TTFT_ms
------------------------------------------------------------------------------
     0    1     2247.2    2247.2       33.7      33.7    836521.8      868670.7
     0    2     2494.2    1366.1       50.4      25.2   1336977.3     1456566.3
     0    5      659.2     672.2       39.5      26.8   6460108.4     6494164.3
     0   10      660.6     377.0       41.5      25.1  14395862.1    14445583.7
  4096    1     2407.6    2407.6       33.4      33.4    853610.9      885888.2
  4096    2     2390.8    1494.9       50.3      25.2   1420111.8     1713474.0
  4096    5      713.8     743.9       39.3      26.8   6711230.9     6849996.4
  4096   10      727.6     413.3       41.0      25.1  14853935.2    14940618.9
  8192    1     2276.6    2276.6       32.6      32.6    902337.2      934160.2
  8192    2     2315.9    1458.3       49.9      24.9   1458966.6     1768587.5
  8192    5      702.5     709.8       38.8      26.6   6865853.0     7006656.3
  8192   10      457.7     182.3       28.4      24.8  21742073.4    22514395.5
 16384    1     2110.0    2110.0       32.3      32.3    973433.3     1006635.3
 16384    2     2200.1    1378.4       49.0      24.5   1541974.0     1862829.8
 16384    5      373.3     479.8       23.4      26.2  12219349.5    13369921.5
 16384   10      287.4     112.7       18.8      24.6  35507262.1    37589973.3
 32768    1     1846.5    1846.5       30.6      30.6   1112567.5     1146487.5
 32768    2     2012.5    1257.0       47.5      23.7   1691238.7     2037614.0
 32768    5      172.2      93.8       12.9      25.3  31700569.5    33982943.0
 32768   10      157.7      84.7       10.7      23.7  66674730.9    71420339.9
 65535    1     1496.4    1496.4       28.5      28.5   1373040.4     1405799.9
 65535    2     1610.1     986.3       44.7      22.4   2141855.3     2543717.8
 65535    5       87.6      45.4        6.9      22.9  63172657.5    68202452.0
 65535   10       79.0      53.7        5.5      22.3 132376342.1   143246568.8

Run Command

docker exec sglang-rank-${RANK} python3 -m sglang.launch_server \
  --trust-remote-code \
  --model-path /models/mimo-v2.5 \
  --tp 4 \
  --nnodes 4 \
  --node-rank ${RANK} \
  --dist-init-addr xxxxxxxxxxxxxxxxx \
  --attention-backend triton \
  --mm-attention-backend triton_attn \
  --moe-runner-backend auto \
  --mem-fraction-static 0.88 \
  --kv-cache-dtype fp8_e4m3 \
  --max-running-requests 24 \
  --chunked-prefill-size 8192 \
  --max-prefill-tokens 8192 \
  --swa-full-tokens-ratio 0.3 \
  --reasoning-parser qwen3 \
  --tool-call-parser mimo \
  --quantization fp8 \
  --schedule-policy lpm \
  --host 0.0.0.0 \
  --port 30000 \
  --model-loader-extra-config "$MODEL_LOADER_EXTRA_CONFIG"

Nice to see the first benchmarks of this on a Spark system.

I had high hopes for running this on a dual node cluster but given the throughput I see in your benchmarks, it might not perform well enough to give it a try unless Intel or others provide an AutoRound or equivalent quant.

I’ve been wanting to run this badly since the model dropped. As soon as I figure out how to build sglang with b12x (lukealonso’s fork) on sparks, I’m going to give this a run for myself and see how it performs.

I REALLY want to give my opinion of this model officially, but I am still doing a lot of configuration. This model is not that popular compared to DeepSeek, Qwen, KimiK, etc., which means documentation is laughable (but I also wanted a unique challenge, which is why I have dedicated so much time to it). Currently, I can say it's really difficult to get set up right, but the payoff is great.

This MiMo 2.5 is best with SGLang (vLLM looked a bit behind in support when I was trying to configure it). It has a super cool multimodal audio experience: instead of just hooking up STT capabilities (which this model also does), you can feed it audio to process directly, meaning it can sense tone, background noise, multiple speakers, etc. It can process videos really well. Its vision capabilities are fantastic.

If you think MiMo 2.5 sounds cool, I could use some help figuring out MTP. Whenever I try to get it working, it causes OOM on all my servers, even when I pull back the memory allocation dramatically; not sure what is going on. A) I need some advice on how people experiment on their GB10's while protecting them from OOM locks. B) Why does MTP need so much more RAM for this model? Still investigating that. If I can get that right, I might be able to hit 50 t/s or more.