MiMo-V2.5-Pro-FP4-DFlash

Was wondering if there is any possibility MiMo-V2.5-Pro-FP4-DFlash could be run on 4-node DGX Spark. The output speed seems promising.

The weights in that repo are 570GB… so, probably not going to fit on a 4 node cluster. But, the end of the blog post does say they’re working on the same improvements for the regular MiMo-V2.5 model.

I managed to run the dflash vllm Mimo2.5 Pro from above but on 8x sparks 62t/s in a create a snake game test:

bash manageMimo2.5ProB12x-Nccl2304-Ray.sh game-bench
=== Game Benchmark (Single-Stream, temp=0) ===
Waiting for server to be ready…
Server ready after 1s
Running game benchmark (Snake game generation)…
Completion tokens: 1500
Prompt tokens: 61
Total tokens: 1561
Wall time: 23.95s
tok/s: 62.62

Not bad for a 1T model

Wow, that’s amazing. How about the prefill speed?

1500t/s prefill

@ciprianveg that’s really impressive. Any chance you could you share your recipe?

Does 62t/s hold up over ctx length? With DFlash decode falls off a cliff at ~ 16k for me and becomes slower than no spec decode not long after. I was wondering if a model with DFlash intergrated would fair better.

I been waiting on the regular model to drop dflash man

I need help from you I am going to mention you

it holds better than mtp or eagle in my coding test. Do you have a specific code related benchmark for long context you want me to run?

I tested on a 65k context java prompt and it was above 40t/s and without dflash it is at cca 20tps

yes, sure, this evening I will clean up the code and make it easier to reproduce and share it afterwards as eugr yaml amd mods

# DFlash draft model: MiMo-V2.5-Pro-NVFP4/dflash/ (5-layer Qwen3 BF16)
# Target model: MiMo-V2.5-Pro-NVFP4 (70-layer MiMoV2 NVFP4)
# =============================================================================

recipe_version: "1"
name: MiMo-V2.5-Pro-DFlash-Nccl2304-8xGB10
description: "Ray v2 + DFlash speculative decoding for MiMo V2.5 Pro NVFP4 on 8× GB10"

model: festr2/MiMo-V2.5-Pro-NVFP4-MXFP8-attn-TP8
container: vllm-node-mimo-nightly-dflash:latest

cluster_only: true
mods:
  - mods/fix-mimo-v2-vllm

defaults:
  port: 5001
  host: 0.0.0.0
  tensor_parallel: 8
  pipeline_parallel: 1
  gpu_memory_utilization: 0.69
  block_size: 32
  max_model_len: 510000
  max_num_batched_tokens: 4096
  max_num_seqs: 4
  served_model_name: mimo-v2.5-pro
  speculative_draft_tensor_parallel_size: 1

env:
  # ── NCCL 2.30.4 wedge fix ──
  LD_PRELOAD: /opt/nccl-2.30.4/lib/libnccl.so.2
  LD_LIBRARY_PATH: /opt/nccl-2.30.4/lib

  # ── MoE / compilation ──
  NCCL_NVLS_ENABLE: "1"
  VLLM_MARLIN_USE_ATOMIC_ADD: "1"
  VLLM_USE_FLASHINFER_MOE_FP4: "0"

  # ── FlashInfer / compilation ──
  FLASHINFER_DISABLE_VERSION_CHECK: "1"
  TILELANG_CLEANUP_TEMP_FILES: "1"
  TORCH_CUDA_ARCH_LIST: "12.1a"
  FLASHINFER_CUDA_ARCH_LIST: "12.1a"

  # ── JIT cache dirs ──
  DG_JIT_CACHE_DIR: /cache/huggingface/deepgemm-cache
  TORCHINDUCTOR_CACHE_DIR: /cache/huggingface/torchinductor-cache
  TRITON_CACHE_DIR: /cache/huggingface/triton-cache
  TORCH_EXTENSIONS_DIR: /cache/huggingface/torch_extensions
  VLLM_CACHE_ROOT: /cache/huggingface/vllm-cache
  DG_JIT_USE_NVRTC: "0"
  DG_JIT_PRINT_COMPILER_COMMAND: "1"

  # ── Long context support ──
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"

  # ── NCCL / InfiniBand (RoCE v2) ──
  NCCL_NET: "IB"
  NCCL_IB_DISABLE: "0"
  NCCL_DEBUG: "WARN"
  NCCL_ASYNC_ERROR_HANDLING: "1"
  NCCL_BLOCKING_WAIT: "0"
  NCCL_IB_ROCE_VERSION_NUM: "2"
  NCCL_IB_QPS_PER_CONNECTION: "8"
  NCCL_IB_SPLIT_DATA_ON_QPS: "1"
  NCCL_MIN_NCHANNELS: "32"
  NCCL_IB_PCI_RELAXED_ORDERING: "1"
  NCCL_IB_MERGE_NICS: "1"
  NCCL_NET_PLUGIN: "none"

  # ── Offline / cache ──
  HF_HOME: /cache/huggingface
  HF_HUB_OFFLINE: "1"
  TRANSFORMERS_OFFLINE: "1"

  # ── Threading ──
  OMP_NUM_THREADS: "8"

  # ── Memory ──
  PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"

  # ── Ray v2 executor + compiled DAG ──
  VLLM_USE_RAY_V2_EXECUTOR_BACKEND: "1"
  VLLM_USE_RAY_COMPILED_DAG_OVERLAP_COMM: "1"
  RAY_CGRAPH_get_timeout: "1800"

command: |
  vllm serve /root/models/models17/Mimo-2.5-Pro-FP4-Dflash \
    --trust-remote-code \
    -tp {tensor_parallel} \
    -pp {pipeline_parallel} \
    --pipeline-parallel-size {pipeline_parallel} \
    --enable-prefix-caching \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --max-num-seqs {max_num_seqs} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --load-format safetensors \
    --attention-backend flashinfer \
    --moe-backend marlin \
    --reasoning-parser mimo \
    --tool-call-parser mimo \
    --enable-auto-tool-choice \
    --served-model-name {served_model_name} \
    --host {host} \
    --port {port} \
    --distributed-executor-backend ray \
    --generation-config vllm \
    --chat-template /workspace/mods/fix-mimo-v2-vllm/chat_template.jinja \
    --default-chat-template-kwargs '{"keep_all_reasoning":true}' \
    --reasoning-config '{"reasoning_start_str":"<think>","reasoning_end_str":"</think>"}' \
    --block-size 32 \
    --enable-flashinfer-autotune \
    --speculative-config '{"method":"dflash","model":"/root/models/models17/Mimo-2.5-Pro-FP4-Dflash/dflash","num_speculative_tokens":7,"attention_backend":"triton_attn","draft_tensor_parallel_size":1}' \
    --performance-mode balanced --no-async-scheduling 

DockerfileMimoDflash.zip (1.6 KB)

fix-mimo-v2-vllm.zip (174.2 KB)

=== Long-Context Benchmark ===
Type: coding
Target input: 1000 tokens
Output: 1500 tokens

Actual tokens: 1000 tokens (confirmed by server)

Sending request (streaming via httpx)…

============ Result ============
Input tokens: 1025
Output tokens: 1500
Wall time: 32.18s

TTFT: 1420.1 ms
Prefill tok/s: 721.8
Gen tok/s: 48.7
Mean ITL: 20.5 ms

Saved to bench_long_coding_1000.json

PEAK server-side gen throughput: 65.2 tok/s

=== Long-Context Benchmark ===
Type: coding
Target input: 32000 tokens
Output: 1500 tokens

Actual tokens: 31996 tokens (confirmed by server)

Sending request (streaming via httpx)…

============ Result ============
Input tokens: 32021
Output tokens: 1500
Wall time: 55.39s

TTFT: 18908.0 ms
Prefill tok/s: 1693.5
Gen tok/s: 41.1
Mean ITL: 24.3 ms

Saved to bench_long_coding_32000.json

PEAK server-side gen throughput: 45.1 tok/s

=== Long-Context Benchmark ===
Type: coding
Target input: 128000 tokens
Output: 1500 tokens

Actual tokens: 127978 tokens (confirmed by server)

Sending request (streaming via httpx)…

============ Result ============
Input tokens: 128003
Output tokens: 1500
Wall time: 160.37s

TTFT: 111930.6 ms
Prefill tok/s: 1143.6
Gen tok/s: 30.9
Mean ITL: 32.3 ms

Saved to bench_long_coding_128000.json

PEAK server-side gen throughput: 36.3 tok/s

=== Long-Context Benchmark ===
Type: coding
Target input: 256000 tokens
Output: 1500 tokens

Actual tokens: 255973 tokens (confirmed by server)

Sending request (streaming via httpx)…

============ Result ============
Input tokens: 255998
Output tokens: 1500
Wall time: 377.27s

TTFT: 309331.6 ms
Prefill tok/s: 827.6
Gen tok/s: 22.1
Mean ITL: 45.3 ms

Saved to bench_long_coding_256000.json

PEAK server-side gen throughput: 25.9 tok/s

The speed at 256k context with dflash is still bigger than the speed at 0 context without dflash :)

i did a small update at the yaml above and updated the results, 10% faster at 256k context

Did also a create a snake game speed benchmark (tok/s: 63.39):

=== Game Benchmark (Single-Stream, temp=0) ===
Waiting for server to be ready…
Server ready after 1s
Running game benchmark (Snake game generation)…
Completion tokens: 1500
Prompt tokens: 61
Total tokens: 1561
Wall time: 23.66s
tok/s: 63.39