Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark)

Absolutely agree.. this model with the patches has made the investment in the spark worthwhile for me too. Pair this with 3.6 27b at its slow speed for really hard tasks, it’s a great combo!

Check this setup for 3.6 27B dgx-spark/spark-vllm-docker/rdtand-Qwen3.6-27B-PrismaQuant-5.5bit-vllm at main · technigmaai/dgx-spark · GitHub

yep, that’s actually the one I am running. Both this and that one are great. If I had the spare cash, I’d get a second spark to run both models so I could switch between them quickly.

 => [vllm-builder 2/7] RUN --mount=type=cache,id=repo-cache,target=/repo-cache     cd /repo-cache &&     if [ ! -d "vllm" ]; then         echo "Cache miss: Cloning vLLM from sc  1.4s
 => [vllm-builder 3/7] WORKDIR /workspace/vllm/vllm                                                                                                                               0.0s
 => ERROR [vllm-builder 4/7] RUN if [ -n "40898" ]; then         git config --global user.email "builder@example.com";         git config --global user.name "Docker Builder";    1.7s
------
 > [vllm-builder 4/7] RUN if [ -n "40898" ]; then         git config --global user.email "builder@example.com";         git config --global user.name "Docker Builder";                 echo "Applying PRs: 40898";         for pr in 40898; do             echo "Fetching and merging PR #$pr...";             git fetch origin pull/${pr}/head:pr-${pr};             git merge pr-${pr} --no-edit;         done;     fi:
0.157 Applying PRs: 40898
0.157 Fetching and merging PR #40898...
1.415 From https://github.com/vllm-project/vllm
1.415  * [new ref]             refs/pull/40898/head -> pr-40898
1.696 Auto-merging tests/v1/worker/test_gpu_model_runner.py
1.696 Auto-merging vllm/config/speculative.py
1.696 Auto-merging vllm/model_executor/models/qwen3_dflash.py
1.696 CONFLICT (content): Merge conflict in vllm/model_executor/models/qwen3_dflash.py
1.696 Auto-merging vllm/transformers_utils/configs/speculators/algos.py
1.696 Auto-merging vllm/v1/core/kv_cache_utils.py
1.696 Auto-merging vllm/v1/core/sched/scheduler.py
1.696 Auto-merging vllm/v1/spec_decode/llm_base_proposer.py
1.696 Auto-merging vllm/v1/worker/gpu_model_runner.py
1.698 Automatic merge failed; fix conflicts and then commit the result.
------
ERROR: failed to build: failed to solve: process "/bin/sh -c if [ -n \"$VLLM_PRS\" ]; then         git config --global user.email \"builder@example.com\";         git config --global user.name \"Docker Builder\";                 echo \"Applying PRs: $VLLM_PRS\";         for pr in $VLLM_PRS; do             echo \"Fetching and merging PR #$pr...\";             git fetch origin pull/${pr}/head:pr-${pr};             git merge pr-${pr} --no-edit;         done;     fi" did not complete successfully: exit code: 1
vLLM build failed — restoring previous wheels...

This is the thread for Albond’s 122B optimizations. Please move to a new thread or the long Qwen3.6-27B thread.

That said, the reason this is happening is that the key PR to enable SWA for DFlash currently has a merge conflict. The author from z-lab was pinged earlier today. Meanwhile, you can see the other thread for a working pin of base vLLM where the PR can be cleanly applied, or just wait a day or so.

Amazing work! I wouldn’t mind helping keep this “alive” now that you’re done. My hopes are with Qwen 3.7 we can convince them for another 122b as that would be amazing for us.

Awesome work!

But I have question, during vllm launch i get:
(EngineCore pid=151) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (64). This may indicate the input
s were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore pid=151) return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore pid=151) /usr/local/lib/python3.12/dist-packages/triton/language/core.py:2284: UserWarning: tl.make_block_ptr is deprecated. Use TensorDescriptor or tl.make_tensor_descriptor instead.
(EngineCore pid=151) warn("tl.make_block_ptr is deprecated. Use TensorDescriptor or tl.make_tensor_descriptor instead.")

Are they save to ignore?

Enterprise/server-grade cards are disproportionately more expensive partly because they feature NVLink, which provides a massive performance scaling boost when pooled together.

I suspect that if you ran a benchmark comparison on a 2-to-4 node Spark setup versus a 2-to-4 node H200 cluster, the performance gap would be far greater than just a 5x difference.

I have to say I agree 1,000% Thank you @Albond and other contributors to this thread.

Can someone share working docker config for this with --max-model-len 262144? Mine is crashing a lot…

Yes, I have been struggling to find the actual incantation for the optimal qwen3.5-122B-A10B-FP8 recipe.

I am using this with spark-vllm-docker at the moment:

# Recipe: Qwen3.5-122B-A10B-FP8
# Qwen3.5-122B model in native FP8 quantization

recipe_version: "1"
name: Qwen3.5-122B-FP8
description: vLLM serving Qwen3.5-122B-FP8

# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.5-122B-A10B-FP8

# Only cluster is supported
cluster_only: true

# Container image to use
container: vllm-node

# No mods required
mods:
  - mods/fix-qwen3.5-chat-template

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 8192

# Environment variables
env: {}

# The vLLM serve command template
command: |
  vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
    --max-model-len {max_model_len} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --port {port} \
    --host {host} \
    --load-format fastsafetensors \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 \
    --chat-template unsloth.jinja \
    -tp {tensor_parallel} --distributed-executor-backend ray \
    --speculative-config '{{"method": "dflash", "model": "z-lab/Qwen3.5-122B-A10B-DFlash", "num_speculative_tokens": 4}}' \
    --max-num-batched-tokens {max_num_batched_tokens}

It is pretty much the default, with a speculative-config extra directive.

This is for 2 nodes?

Yes. Sorry mine is FP8 for two nodes. For single node switch to int4-autoround I guess

Hi everybody,

greatly appreciate the efforts to optimize the seutp to utilize this wonderful model.

btw. I needed the guidance of qwen36-35b on hermes to compile the vllm container.

I want to share my docker-compose script, which runs quite smooth (50 t/s) and fits well in hermes.

# 
# Qwen3.5-122b-hybrid-int4fp8 
#
# API endpoint: http://localhost:11435/v1
# Model name served as: qwen
#

services:
  qwen36-122b-intfp8:
    image: vllm-qwen35-v2:latest
    container_name: qwen35-122b
    restart: unless-stopped

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    ipc: host
    dns:
      - 8.8.8.8
      - 8.8.4.4
    shm_size: 64gb

    ulimits:
      memlock: -1
      stack: 67108864

    ports:
      - "11435:11435"

    volumes:
      - /home/topo/models:/models      # modify to your needs
#      - ./qwen3.6_chat_template.jinja:/chat_template.jinja:ro    ## this may be worth to test

    environment:
      - VLLM_LOGGING_LEVEL=${VLLM_LOGGING_LEVEL:-INFO}
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - VLLM_USE_FLASHINFER_SAMPLER=0
      - VLLM_MARLIN_USE_ATOMIC_ADD=1
      - VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1

    command: >
      serve /models/qwen35-122b-hybrid-int4fp8
      --served-model-name qwen
      --port 11435
      --max-model-len 262144
      --gpu-memory-utilization 0.90
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --attention-backend FLASHINFER
      --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11435/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 240s

    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "5"


networks:
  default:
    name: qwen35-network

You left your HF_TOKEN in the message.

thanks

I tried to launch NVFP4 version which Nvidia dropped recently with the same setting which make 36 35b work at 100tps. Didn’t really work, got 20 tps.

I tried a few recipes a couple of days ago, got the same-ish result. Decided not to fight it and kept using Albond’s hybrid still as usual as it’s awesome at 50+ t/sec

When I need fast, iterating work, I switch to 35b-a3b-nvfp4

Thank you for the work on this. While going through the installs for both Spark Founders and GX10 I captured my processes and results in a runbook and will keep it updated with issues I find and solutions - drewid74/optimized-qwen35-hybrid-v2-runbook-public: Production runbook for Qwen3.5-122B hybrid INT4+FP8 on NVIDIA DGX Spark GB10 — optimization stack, PD firmware wedge diagnosis, bench results with aggregated 105.3 tok/s mean between the two ndoes.

Bench results — albond’s harness, isolated, 2026-06-16

Per-prompt warm-cache best (tok/s):

Test sparky1 (DGX Founders) sparka (ASUS GX10) albond ref
Q&A 256 52.0 53.6 51.3
Code 512 53.8 55.4 52.8
JSON 1024 53.5 53.9 51.1
Math 64 48.4 50.0 47.8
LongCode 2048 55.5 57.0 54.9
Mean 52.0 53.3 51.6

Thank you for this. I’ve accidentally deleted all my docker images a few days ago while cleaning up, and I couldn’t rebuild the Albond’s one. I got it back up and running now thanks to your writeup :)

BTW @a.fairaizl Have you tried MTP=3 for Speculative Decoding setting? I get a bit of a bump in performance with 3 tokens against 2 with still a very high acceptance rate. Might worth the shot for you.