Bfloat16 Quality = Speed?

My experience is different. I have tried dflash = 5, but the performance degrades precipitously as the context window grows. Instead I am finding that the trusty old standard MTP=3 gets the best performance.

Here I have highlighted what I look for. Acceptance rate is 87-90% – this means that the small Qwen model is processing 3 tokens, then 122b is processing the same 3 tokens in one go, and 90% of the time it matches and accepts the result. When it fails, 122b reverts to processing the three tokens 1 at a time which is all the more slower, because we have the MTP overhead, so we want high acceptance rates.

The next thing I track is the peak, and average t/s performance. Peak gives me an idea of raw performance – this config starts out at 70-80 t/s then as the context window grows above 25k it starts to settle in to the average of around 38 t/s.

What this tell me is if I can brake work up into lots of small concurrent tasks with short context windows the inference is very performant. This is how we get to a 250 t/s aggregate out of a single GB10.

This is the EC model running on the flashqla docker

# Recipe: Intel Qwen3.5-122B-A10B-int4-AutoRound-EC

recipe_version: "1"
name: Qwen3.5-122B-A10B-int4-AutoRound-EC
description: vLLM serving Qwen3.5-122B-A10B-int4-AutoRound-EC

# HuggingFace model to download (optional, for --download-model)
model: shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC

solo_only: true

# Container image to use
container: vllm-node-dflash

mods:
  - mods/flashqla
  - mods/fix-qwen3.5-enhanced-chat-template

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  max_model_len: 196608
  gpu_memory_utilization: 0.76
  max_num_batched_tokens: 16384
  max-num-seqs: 16
  served_model_name: qwen/qwen3.5-122b
  speculative_mtp: '{"method": "mtp", "num_speculative_tokens": 3}'
  speculative_dflash: '{"method": "dflash", "model":"z-lab/Qwen3.5-122B-A10B-DFlash", "num_speculative_tokens": 5}'
  coding_config: '{"temperature": 0.7,  "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'

# Environment variables
env:
  HF_HUB_OFFLINE: 1
  TRANSFORMERS_OFFLINE: 1
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
  VLLM_ENABLE_CUDAGRAPH_GC: 1
  VLLM_USE_FLASHINFER_SAMPLER: 1

# The vLLM serve command template
command: |
  vllm serve shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC \
  --served-model-name {served_model_name} \
  --max-model-len {max_model_len} \
  --gpu-memory-utilization {gpu_memory_utilization} \
  --max-num-batched-tokens {max_num_batched_tokens} \
  --max-num-seqs {max-num-seqs} \
  --dtype bfloat16 \
  --attention-backend FLASHINFER \
  --port {port} \
  --host {host} \
  --load-format fastsafetensors \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --speculative-config '{speculative_mtp}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --chat-template qwen3.5-enhanced.jinja \
  --reasoning-parser qwen3 \
  --generation-config auto \
  --override-generation-config '{coding_config}'

#  --kv-cache-dtype fp8_e4m3
#  --language-model-only