Absolutely agree.. this model with the patches has made the investment in the spark worthwhile for me too. Pair this with 3.6 27b at its slow speed for really hard tasks, it’s a great combo!
Check this setup for 3.6 27B dgx-spark/spark-vllm-docker/rdtand-Qwen3.6-27B-PrismaQuant-5.5bit-vllm at main · technigmaai/dgx-spark · GitHub
yep, that’s actually the one I am running. Both this and that one are great. If I had the spare cash, I’d get a second spark to run both models so I could switch between them quickly.
=> [vllm-builder 2/7] RUN --mount=type=cache,id=repo-cache,target=/repo-cache cd /repo-cache && if [ ! -d "vllm" ]; then echo "Cache miss: Cloning vLLM from sc 1.4s
=> [vllm-builder 3/7] WORKDIR /workspace/vllm/vllm 0.0s
=> ERROR [vllm-builder 4/7] RUN if [ -n "40898" ]; then git config --global user.email "builder@example.com"; git config --global user.name "Docker Builder"; 1.7s
------
> [vllm-builder 4/7] RUN if [ -n "40898" ]; then git config --global user.email "builder@example.com"; git config --global user.name "Docker Builder"; echo "Applying PRs: 40898"; for pr in 40898; do echo "Fetching and merging PR #$pr..."; git fetch origin pull/${pr}/head:pr-${pr}; git merge pr-${pr} --no-edit; done; fi:
0.157 Applying PRs: 40898
0.157 Fetching and merging PR #40898...
1.415 From https://github.com/vllm-project/vllm
1.415 * [new ref] refs/pull/40898/head -> pr-40898
1.696 Auto-merging tests/v1/worker/test_gpu_model_runner.py
1.696 Auto-merging vllm/config/speculative.py
1.696 Auto-merging vllm/model_executor/models/qwen3_dflash.py
1.696 CONFLICT (content): Merge conflict in vllm/model_executor/models/qwen3_dflash.py
1.696 Auto-merging vllm/transformers_utils/configs/speculators/algos.py
1.696 Auto-merging vllm/v1/core/kv_cache_utils.py
1.696 Auto-merging vllm/v1/core/sched/scheduler.py
1.696 Auto-merging vllm/v1/spec_decode/llm_base_proposer.py
1.696 Auto-merging vllm/v1/worker/gpu_model_runner.py
1.698 Automatic merge failed; fix conflicts and then commit the result.
------
ERROR: failed to build: failed to solve: process "/bin/sh -c if [ -n \"$VLLM_PRS\" ]; then git config --global user.email \"builder@example.com\"; git config --global user.name \"Docker Builder\"; echo \"Applying PRs: $VLLM_PRS\"; for pr in $VLLM_PRS; do echo \"Fetching and merging PR #$pr...\"; git fetch origin pull/${pr}/head:pr-${pr}; git merge pr-${pr} --no-edit; done; fi" did not complete successfully: exit code: 1
vLLM build failed — restoring previous wheels...
This is the thread for Albond’s 122B optimizations. Please move to a new thread or the long Qwen3.6-27B thread.
That said, the reason this is happening is that the key PR to enable SWA for DFlash currently has a merge conflict. The author from z-lab was pinged earlier today. Meanwhile, you can see the other thread for a working pin of base vLLM where the PR can be cleanly applied, or just wait a day or so.
Amazing work! I wouldn’t mind helping keep this “alive” now that you’re done. My hopes are with Qwen 3.7 we can convince them for another 122b as that would be amazing for us.
Awesome work!
But I have question, during vllm launch i get:
(EngineCore pid=151) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (64). This may indicate the input
s were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore pid=151) return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore pid=151) /usr/local/lib/python3.12/dist-packages/triton/language/core.py:2284: UserWarning: tl.make_block_ptr is deprecated. Use TensorDescriptor or tl.make_tensor_descriptor instead.
(EngineCore pid=151) warn("tl.make_block_ptr is deprecated. Use TensorDescriptor or tl.make_tensor_descriptor instead.")
Are they save to ignore?
Enterprise/server-grade cards are disproportionately more expensive partly because they feature NVLink, which provides a massive performance scaling boost when pooled together.
I suspect that if you ran a benchmark comparison on a 2-to-4 node Spark setup versus a 2-to-4 node H200 cluster, the performance gap would be far greater than just a 5x difference.
I have to say I agree 1,000% Thank you @Albond and other contributors to this thread.
Can someone share working docker config for this with --max-model-len 262144? Mine is crashing a lot…
Yes, I have been struggling to find the actual incantation for the optimal qwen3.5-122B-A10B-FP8 recipe.
I am using this with spark-vllm-docker at the moment:
# Recipe: Qwen3.5-122B-A10B-FP8
# Qwen3.5-122B model in native FP8 quantization
recipe_version: "1"
name: Qwen3.5-122B-FP8
description: vLLM serving Qwen3.5-122B-FP8
# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.5-122B-A10B-FP8
# Only cluster is supported
cluster_only: true
# Container image to use
container: vllm-node
# No mods required
mods:
- mods/fix-qwen3.5-chat-template
# Default settings (can be overridden via CLI)
defaults:
port: 8000
host: 0.0.0.0
tensor_parallel: 2
gpu_memory_utilization: 0.7
max_model_len: 262144
max_num_batched_tokens: 8192
# Environment variables
env: {}
# The vLLM serve command template
command: |
vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
--max-model-len {max_model_len} \
--gpu-memory-utilization {gpu_memory_utilization} \
--port {port} \
--host {host} \
--load-format fastsafetensors \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3 \
--chat-template unsloth.jinja \
-tp {tensor_parallel} --distributed-executor-backend ray \
--speculative-config '{{"method": "dflash", "model": "z-lab/Qwen3.5-122B-A10B-DFlash", "num_speculative_tokens": 4}}' \
--max-num-batched-tokens {max_num_batched_tokens}
It is pretty much the default, with a speculative-config extra directive.
This is for 2 nodes?
Yes. Sorry mine is FP8 for two nodes. For single node switch to int4-autoround I guess
Hi everybody,
greatly appreciate the efforts to optimize the seutp to utilize this wonderful model.
btw. I needed the guidance of qwen36-35b on hermes to compile the vllm container.
I want to share my docker-compose script, which runs quite smooth (50 t/s) and fits well in hermes.
#
# Qwen3.5-122b-hybrid-int4fp8
#
# API endpoint: http://localhost:11435/v1
# Model name served as: qwen
#
services:
qwen36-122b-intfp8:
image: vllm-qwen35-v2:latest
container_name: qwen35-122b
restart: unless-stopped
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
ipc: host
dns:
- 8.8.8.8
- 8.8.4.4
shm_size: 64gb
ulimits:
memlock: -1
stack: 67108864
ports:
- "11435:11435"
volumes:
- /home/topo/models:/models # modify to your needs
# - ./qwen3.6_chat_template.jinja:/chat_template.jinja:ro ## this may be worth to test
environment:
- VLLM_LOGGING_LEVEL=${VLLM_LOGGING_LEVEL:-INFO}
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
- VLLM_USE_FLASHINFER_SAMPLER=0
- VLLM_MARLIN_USE_ATOMIC_ADD=1
- VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
command: >
serve /models/qwen35-122b-hybrid-int4fp8
--served-model-name qwen
--port 11435
--max-model-len 262144
--gpu-memory-utilization 0.90
--reasoning-parser qwen3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--attention-backend FLASHINFER
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11435/health"]
interval: 30s
timeout: 10s
retries: 5
start_period: 240s
logging:
driver: "json-file"
options:
max-size: "100m"
max-file: "5"
networks:
default:
name: qwen35-network
You left your HF_TOKEN in the message.
thanks
I tried to launch NVFP4 version which Nvidia dropped recently with the same setting which make 36 35b work at 100tps. Didn’t really work, got 20 tps.
I tried a few recipes a couple of days ago, got the same-ish result. Decided not to fight it and kept using Albond’s hybrid still as usual as it’s awesome at 50+ t/sec
When I need fast, iterating work, I switch to 35b-a3b-nvfp4
Thank you for the work on this. While going through the installs for both Spark Founders and GX10 I captured my processes and results in a runbook and will keep it updated with issues I find and solutions - drewid74/optimized-qwen35-hybrid-v2-runbook-public: Production runbook for Qwen3.5-122B hybrid INT4+FP8 on NVIDIA DGX Spark GB10 — optimization stack, PD firmware wedge diagnosis, bench results with aggregated 105.3 tok/s mean between the two ndoes.
Bench results — albond’s harness, isolated, 2026-06-16
Per-prompt warm-cache best (tok/s):
| Test | sparky1 (DGX Founders) | sparka (ASUS GX10) | albond ref |
|---|---|---|---|
| Q&A 256 | 52.0 | 53.6 | 51.3 |
| Code 512 | 53.8 | 55.4 | 52.8 |
| JSON 1024 | 53.5 | 53.9 | 51.1 |
| Math 64 | 48.4 | 50.0 | 47.8 |
| LongCode 2048 | 55.5 | 57.0 | 54.9 |
| Mean | 52.0 | 53.3 | 51.6 |
Thank you for this. I’ve accidentally deleted all my docker images a few days ago while cleaning up, and I couldn’t rebuild the Albond’s one. I got it back up and running now thanks to your writeup :)
BTW @a.fairaizl Have you tried MTP=3 for Speculative Decoding setting? I get a bit of a bump in performance with 3 tokens against 2 with still a very high acceptance rate. Might worth the shot for you.