Was wondering if there is any possibility MiMo-V2.5-Pro-FP4-DFlash could be run on 4-node DGX Spark. The output speed seems promising.
The weights in that repo are 570GB… so, probably not going to fit on a 4 node cluster. But, the end of the blog post does say they’re working on the same improvements for the regular MiMo-V2.5 model.
I managed to run the dflash vllm Mimo2.5 Pro from above but on 8x sparks 62t/s in a create a snake game test:
bash manageMimo2.5ProB12x-Nccl2304-Ray.sh game-bench
=== Game Benchmark (Single-Stream, temp=0) ===
Waiting for server to be ready…
Server ready after 1s
Running game benchmark (Snake game generation)…
Completion tokens: 1500
Prompt tokens: 61
Total tokens: 1561
Wall time: 23.95s
tok/s: 62.62
Not bad for a 1T model
Wow, that’s amazing. How about the prefill speed?
1500t/s prefill
@ciprianveg that’s really impressive. Any chance you could you share your recipe?
Does 62t/s hold up over ctx length? With DFlash decode falls off a cliff at ~ 16k for me and becomes slower than no spec decode not long after. I was wondering if a model with DFlash intergrated would fair better.
I been waiting on the regular model to drop dflash man
I need help from you I am going to mention you
it holds better than mtp or eagle in my coding test. Do you have a specific code related benchmark for long context you want me to run?
I tested on a 65k context java prompt and it was above 40t/s and without dflash it is at cca 20tps
yes, sure, this evening I will clean up the code and make it easier to reproduce and share it afterwards as eugr yaml amd mods
# DFlash draft model: MiMo-V2.5-Pro-NVFP4/dflash/ (5-layer Qwen3 BF16)
# Target model: MiMo-V2.5-Pro-NVFP4 (70-layer MiMoV2 NVFP4)
# =============================================================================
recipe_version: "1"
name: MiMo-V2.5-Pro-DFlash-Nccl2304-8xGB10
description: "Ray v2 + DFlash speculative decoding for MiMo V2.5 Pro NVFP4 on 8× GB10"
model: festr2/MiMo-V2.5-Pro-NVFP4-MXFP8-attn-TP8
container: vllm-node-mimo-nightly-dflash:latest
cluster_only: true
mods:
- mods/fix-mimo-v2-vllm
defaults:
port: 5001
host: 0.0.0.0
tensor_parallel: 8
pipeline_parallel: 1
gpu_memory_utilization: 0.69
block_size: 32
max_model_len: 510000
max_num_batched_tokens: 4096
max_num_seqs: 4
served_model_name: mimo-v2.5-pro
speculative_draft_tensor_parallel_size: 1
env:
# ── NCCL 2.30.4 wedge fix ──
LD_PRELOAD: /opt/nccl-2.30.4/lib/libnccl.so.2
LD_LIBRARY_PATH: /opt/nccl-2.30.4/lib
# ── MoE / compilation ──
NCCL_NVLS_ENABLE: "1"
VLLM_MARLIN_USE_ATOMIC_ADD: "1"
VLLM_USE_FLASHINFER_MOE_FP4: "0"
# ── FlashInfer / compilation ──
FLASHINFER_DISABLE_VERSION_CHECK: "1"
TILELANG_CLEANUP_TEMP_FILES: "1"
TORCH_CUDA_ARCH_LIST: "12.1a"
FLASHINFER_CUDA_ARCH_LIST: "12.1a"
# ── JIT cache dirs ──
DG_JIT_CACHE_DIR: /cache/huggingface/deepgemm-cache
TORCHINDUCTOR_CACHE_DIR: /cache/huggingface/torchinductor-cache
TRITON_CACHE_DIR: /cache/huggingface/triton-cache
TORCH_EXTENSIONS_DIR: /cache/huggingface/torch_extensions
VLLM_CACHE_ROOT: /cache/huggingface/vllm-cache
DG_JIT_USE_NVRTC: "0"
DG_JIT_PRINT_COMPILER_COMMAND: "1"
# ── Long context support ──
VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
# ── NCCL / InfiniBand (RoCE v2) ──
NCCL_NET: "IB"
NCCL_IB_DISABLE: "0"
NCCL_DEBUG: "WARN"
NCCL_ASYNC_ERROR_HANDLING: "1"
NCCL_BLOCKING_WAIT: "0"
NCCL_IB_ROCE_VERSION_NUM: "2"
NCCL_IB_QPS_PER_CONNECTION: "8"
NCCL_IB_SPLIT_DATA_ON_QPS: "1"
NCCL_MIN_NCHANNELS: "32"
NCCL_IB_PCI_RELAXED_ORDERING: "1"
NCCL_IB_MERGE_NICS: "1"
NCCL_NET_PLUGIN: "none"
# ── Offline / cache ──
HF_HOME: /cache/huggingface
HF_HUB_OFFLINE: "1"
TRANSFORMERS_OFFLINE: "1"
# ── Threading ──
OMP_NUM_THREADS: "8"
# ── Memory ──
PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
# ── Ray v2 executor + compiled DAG ──
VLLM_USE_RAY_V2_EXECUTOR_BACKEND: "1"
VLLM_USE_RAY_COMPILED_DAG_OVERLAP_COMM: "1"
RAY_CGRAPH_get_timeout: "1800"
command: |
vllm serve /root/models/models17/Mimo-2.5-Pro-FP4-Dflash \
--trust-remote-code \
-tp {tensor_parallel} \
-pp {pipeline_parallel} \
--pipeline-parallel-size {pipeline_parallel} \
--enable-prefix-caching \
--max-model-len {max_model_len} \
--max-num-batched-tokens {max_num_batched_tokens} \
--max-num-seqs {max_num_seqs} \
--gpu-memory-utilization {gpu_memory_utilization} \
--load-format safetensors \
--attention-backend flashinfer \
--moe-backend marlin \
--reasoning-parser mimo \
--tool-call-parser mimo \
--enable-auto-tool-choice \
--served-model-name {served_model_name} \
--host {host} \
--port {port} \
--distributed-executor-backend ray \
--generation-config vllm \
--chat-template /workspace/mods/fix-mimo-v2-vllm/chat_template.jinja \
--default-chat-template-kwargs '{"keep_all_reasoning":true}' \
--reasoning-config '{"reasoning_start_str":"<think>","reasoning_end_str":"</think>"}' \
--block-size 32 \
--enable-flashinfer-autotune \
--speculative-config '{"method":"dflash","model":"/root/models/models17/Mimo-2.5-Pro-FP4-Dflash/dflash","num_speculative_tokens":7,"attention_backend":"triton_attn","draft_tensor_parallel_size":1}' \
--performance-mode balanced --no-async-scheduling
DockerfileMimoDflash.zip (1.6 KB)
fix-mimo-v2-vllm.zip (174.2 KB)
=== Long-Context Benchmark ===
Type: coding
Target input: 1000 tokens
Output: 1500 tokens
Actual tokens: 1000 tokens (confirmed by server)
Sending request (streaming via httpx)…
============ Result ============
Input tokens: 1025
Output tokens: 1500
Wall time: 32.18s
TTFT: 1420.1 ms
Prefill tok/s: 721.8
Gen tok/s: 48.7
Mean ITL: 20.5 ms
Saved to bench_long_coding_1000.json
PEAK server-side gen throughput: 65.2 tok/s
=== Long-Context Benchmark ===
Type: coding
Target input: 32000 tokens
Output: 1500 tokens
Actual tokens: 31996 tokens (confirmed by server)
Sending request (streaming via httpx)…
============ Result ============
Input tokens: 32021
Output tokens: 1500
Wall time: 55.39s
TTFT: 18908.0 ms
Prefill tok/s: 1693.5
Gen tok/s: 41.1
Mean ITL: 24.3 ms
Saved to bench_long_coding_32000.json
PEAK server-side gen throughput: 45.1 tok/s
=== Long-Context Benchmark ===
Type: coding
Target input: 128000 tokens
Output: 1500 tokens
Actual tokens: 127978 tokens (confirmed by server)
Sending request (streaming via httpx)…
============ Result ============
Input tokens: 128003
Output tokens: 1500
Wall time: 160.37s
TTFT: 111930.6 ms
Prefill tok/s: 1143.6
Gen tok/s: 30.9
Mean ITL: 32.3 ms
Saved to bench_long_coding_128000.json
PEAK server-side gen throughput: 36.3 tok/s
=== Long-Context Benchmark ===
Type: coding
Target input: 256000 tokens
Output: 1500 tokens
Actual tokens: 255973 tokens (confirmed by server)
Sending request (streaming via httpx)…
============ Result ============
Input tokens: 255998
Output tokens: 1500
Wall time: 377.27s
TTFT: 309331.6 ms
Prefill tok/s: 827.6
Gen tok/s: 22.1
Mean ITL: 45.3 ms
Saved to bench_long_coding_256000.json
PEAK server-side gen throughput: 25.9 tok/s
The speed at 256k context with dflash is still bigger than the speed at 0 context without dflash :)
i did a small update at the yaml above and updated the results, 10% faster at 256k context
Did also a create a snake game speed benchmark (tok/s: 63.39):
=== Game Benchmark (Single-Stream, temp=0) ===
Waiting for server to be ready…
Server ready after 1s
Running game benchmark (Snake game generation)…
Completion tokens: 1500
Prompt tokens: 61
Total tokens: 1561
Wall time: 23.66s
tok/s: 63.39