Fastest Qwen 3.5 122B Int4 recipe on DGX Spark tested and published on Spark-Arena

whpthomas · June 2, 2026, 4:51pm

# Recipe: Intel Qwen3.5-122B-A10B-int4-AutoRound-EC

recipe_version: "1"
name: Qwen3.5-122B-A10B-int4-AutoRound-EC
description: vLLM serving Qwen3.5-122B-A10B-int4-AutoRound-EC

# HuggingFace model to download (optional, for --download-model)
model: shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC

solo_only: true

# Container image to use
container: vllm-node-tf5

mods:
  - mods/fix-qwen3.5-enhanced-chat-template

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  max_model_len: 196608
  gpu_memory_utilization: 0.75
  max_num_batched_tokens: 16384
  max-num-seqs: 8
  served_model_name: qwen/qwen3.5-122b
  speculative_mtp: '{"method": "mtp", "num_speculative_tokens": 3}'
  speculative_dflash: '{"method": "dflash", "model":"z-lab/Qwen3.5-122B-A10B-DFlash", "num_speculative_tokens": 5}'
  coding_config: '{"temperature": 0.7,  "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'

# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
  FLASHINFER_DISABLE_VERSION_CHECK: 1
  VLLM_ENABLE_CUDAGRAPH_GC: 1
  VLLM_USE_FLASHINFER_SAMPLER: 1

# The vLLM serve command template
command: |
  vllm serve shieldstar/Qwen3.5-122B-A10B-int4-AutoRound-EC \
  --served-model-name {served_model_name} \
  --max-model-len {max_model_len} \
  --gpu-memory-utilization {gpu_memory_utilization} \
  --max-num-batched-tokens {max_num_batched_tokens} \
  --max-num-seqs {max-num-seqs} \
  --dtype bfloat16 \
  --attention-backend flash_attn \
  --port {port} \
  --host {host} \
  --load-format fastsafetensors \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-prompt-tokens-details \
  --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
  --speculative-config '{speculative_mtp}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --chat-template qwen3.5-enhanced.jinja \
  --reasoning-parser qwen3 \
  --generation-config auto \
  --override-generation-config '{coding_config}'

#  --language-model-only

See earlier in this tread for instructions on setting up the chat template

https://forums.developer.nvidia.com/t/fastest-qwen-3-5-122b-int4-recipe-on-dgx-spark-tested-and-published-on-spark-arena/370834/13

Topic		Replies	Views
Qwen/Qwen3.6-35B-A3B (and FP8) has landed DGX Spark / GB10 agentic-ai	308	25816	June 9, 2026
Qwen3.6-27B is out! DGX Spark / GB10 agentic-ai	284	24296	June 3, 2026
Qwen3.5-397B-A17B + DGX Spark (duo) DGX Spark / GB10 Projects	59	5726	June 2, 2026
Qwen3.5-397B-A17B run in dual spark! but I have a concern DGX Spark / GB10	236	9011	June 6, 2026
Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D DGX Spark / GB10	340	16521	March 24, 2026
Bfloat16 Quality = Speed? DGX Spark / GB10	106	5313	May 26, 2026
Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark) DGX Spark / GB10 cuda , performance , docker , performance-tuning , llm	417	19660	June 9, 2026
Qwen3.5-122B-A10B NVFP4 Quantized for DGX Spark — 234GB → 75GB, Runs on 128GB DGX Spark / GB10 Projects	44	10954	April 9, 2026
Qwen3.5 27B optimisation thread starting at 30+ t/s TP=1 DGX Spark / GB10 llama , agentic-ai	23	2695	May 11, 2026
Introducing the Spark Arena DGX Spark / GB10	128	8534	April 10, 2026

Fastest Qwen 3.5 122B Int4 recipe on DGX Spark tested and published on Spark-Arena

Related topics