Bfloat16 Quality = Speed?

whpthomas · May 21, 2026, 5:24pm

My experience is different. I have tried dflash = 5, but the performance degrades precipitously as the context window grows. Instead I am finding that the trusty old standard MTP=3 gets the best performance.

Here I have highlighted what I look for. Acceptance rate is 87-90% – this means that the small Qwen model is processing 3 tokens, then 122b is processing the same 3 tokens in one go, and 90% of the time it matches and accepts the result. When it fails, 122b reverts to processing the three tokens 1 at a time which is all the more slower, because we have the MTP overhead, so we want high acceptance rates.

The next thing I track is the peak, and average t/s performance. Peak gives me an idea of raw performance – this config starts out at 70-80 t/s then as the context window grows above 25k it starts to settle in to the average of around 38 t/s.

What this tell me is if I can brake work up into lots of small concurrent tasks with short context windows the inference is very performant. This is how we get to a 250 t/s aggregate out of a single GB10.

This is the EC model running on the flashqla docker

# Recipe: Intel Qwen3.5-122B-A10B-int4-AutoRound-EC

recipe_version: "1"
name: Qwen3.5-122B-A10B-int4-AutoRound-EC
description: vLLM serving Qwen3.5-122B-A10B-int4-AutoRound-EC

# HuggingFace model to download (optional, for --download-model)
model: shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC

solo_only: true

# Container image to use
container: vllm-node-dflash

mods:
  - mods/flashqla
  - mods/fix-qwen3.5-enhanced-chat-template

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  max_model_len: 196608
  gpu_memory_utilization: 0.76
  max_num_batched_tokens: 16384
  max-num-seqs: 16
  served_model_name: qwen/qwen3.5-122b
  speculative_mtp: '{"method": "mtp", "num_speculative_tokens": 3}'
  speculative_dflash: '{"method": "dflash", "model":"z-lab/Qwen3.5-122B-A10B-DFlash", "num_speculative_tokens": 5}'
  coding_config: '{"temperature": 0.7,  "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'

# Environment variables
env:
  HF_HUB_OFFLINE: 1
  TRANSFORMERS_OFFLINE: 1
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
  VLLM_ENABLE_CUDAGRAPH_GC: 1
  VLLM_USE_FLASHINFER_SAMPLER: 1

# The vLLM serve command template
command: |
  vllm serve shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC \
  --served-model-name {served_model_name} \
  --max-model-len {max_model_len} \
  --gpu-memory-utilization {gpu_memory_utilization} \
  --max-num-batched-tokens {max_num_batched_tokens} \
  --max-num-seqs {max-num-seqs} \
  --dtype bfloat16 \
  --attention-backend FLASHINFER \
  --port {port} \
  --host {host} \
  --load-format fastsafetensors \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --speculative-config '{speculative_mtp}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --chat-template qwen3.5-enhanced.jinja \
  --reasoning-parser qwen3 \
  --generation-config auto \
  --override-generation-config '{coding_config}'

#  --kv-cache-dtype fp8_e4m3
#  --language-model-only

Topic		Replies	Views
Qwen3.6-27B is out! DGX Spark / GB10 agentic-ai	297	26429	June 16, 2026
Fastest Qwen 3.5 122B Int4 recipe on DGX Spark tested and published on Spark-Arena DGX Spark / GB10 llama	59	2848	June 3, 2026
Qwen/Qwen3.6-35B-A3B (and FP8) has landed DGX Spark / GB10 agentic-ai	308	26789	June 9, 2026
Qwen3.5 27B optimisation thread starting at 30+ t/s TP=1 DGX Spark / GB10 llama , agentic-ai	23	2775	May 11, 2026
What's the best speed we can get with Qwen 3.6 27B without quantizing? DGX Spark / GB10	32	16103	June 16, 2026
Qwen3.6-27B-Dflash link DGX Spark / GB10 Projects	20	4239	April 29, 2026
DFlash LLM for DGX Spark - too good to be true? DGX Spark / GB10	37	3282	April 17, 2026
Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark) DGX Spark / GB10 cuda , performance , docker , performance-tuning , llm	430	21043	June 18, 2026
Does Qwen3.5-35B-A3B on GB10 leave a lot of performance on the table? DGX Spark / GB10 agentic-ai	40	5969	March 16, 2026
Qwen3.5-122B-A10B NVFP4 Quantized for DGX Spark — 234GB → 75GB, Runs on 128GB DGX Spark / GB10 Projects	44	11215	April 9, 2026

Bfloat16 Quality = Speed?

Related topics