My experience is different. I have tried dflash = 5, but the performance degrades precipitously as the context window grows. Instead I am finding that the trusty old standard MTP=3 gets the best performance.
Here I have highlighted what I look for. Acceptance rate is 87-90% – this means that the small Qwen model is processing 3 tokens, then 122b is processing the same 3 tokens in one go, and 90% of the time it matches and accepts the result. When it fails, 122b reverts to processing the three tokens 1 at a time which is all the more slower, because we have the MTP overhead, so we want high acceptance rates.
The next thing I track is the peak, and average t/s performance. Peak gives me an idea of raw performance – this config starts out at 70-80 t/s then as the context window grows above 25k it starts to settle in to the average of around 38 t/s.
What this tell me is if I can brake work up into lots of small concurrent tasks with short context windows the inference is very performant. This is how we get to a 250 t/s aggregate out of a single GB10.
This is the EC model running on the flashqla docker
# Recipe: Intel Qwen3.5-122B-A10B-int4-AutoRound-EC
recipe_version: "1"
name: Qwen3.5-122B-A10B-int4-AutoRound-EC
description: vLLM serving Qwen3.5-122B-A10B-int4-AutoRound-EC
# HuggingFace model to download (optional, for --download-model)
model: shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC
solo_only: true
# Container image to use
container: vllm-node-dflash
mods:
- mods/flashqla
- mods/fix-qwen3.5-enhanced-chat-template
# Default settings (can be overridden via CLI)
defaults:
port: 8000
host: 0.0.0.0
max_model_len: 196608
gpu_memory_utilization: 0.76
max_num_batched_tokens: 16384
max-num-seqs: 16
served_model_name: qwen/qwen3.5-122b
speculative_mtp: '{"method": "mtp", "num_speculative_tokens": 3}'
speculative_dflash: '{"method": "dflash", "model":"z-lab/Qwen3.5-122B-A10B-DFlash", "num_speculative_tokens": 5}'
coding_config: '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'
# Environment variables
env:
HF_HUB_OFFLINE: 1
TRANSFORMERS_OFFLINE: 1
VLLM_MARLIN_USE_ATOMIC_ADD: 1
VLLM_ENABLE_CUDAGRAPH_GC: 1
VLLM_USE_FLASHINFER_SAMPLER: 1
# The vLLM serve command template
command: |
vllm serve shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC \
--served-model-name {served_model_name} \
--max-model-len {max_model_len} \
--gpu-memory-utilization {gpu_memory_utilization} \
--max-num-batched-tokens {max_num_batched_tokens} \
--max-num-seqs {max-num-seqs} \
--dtype bfloat16 \
--attention-backend FLASHINFER \
--port {port} \
--host {host} \
--load-format fastsafetensors \
--enable-prefix-caching \
--enable-chunked-prefill \
--speculative-config '{speculative_mtp}' \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--chat-template qwen3.5-enhanced.jinja \
--reasoning-parser qwen3 \
--generation-config auto \
--override-generation-config '{coding_config}'
# --kv-cache-dtype fp8_e4m3
# --language-model-only
