Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark)

I intentionally left this out since Qwen has official sampling recommendations (Qwen/Qwen3.5-122B-A10B · Hugging Face) that are worth following closely.

For agentic/coding workflows specifically, they recommend Thinking mode for precise coding tasks:

  • temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

For more general or conversational tasks, Thinking mode general:

  • temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Personally I stick close to these and only adjust slightly based on feel — in my experience Qwen3.5 becomes noticeably unstable when you stray too far from the recommended ranges. The presence_penalty is worth tuning between 0–2 to control repetition, but going too high can cause language mixing.

In my own software products that call the OpenAI-compatible API, I don’t lock in a single set of parameters at startup — I change them dynamically per request depending on the task type. So a coding request, a reasoning task, and a general chat message can each hit the model with a different preset within the same session.

Are there some pointers to where to add or setup this dynamic switching task type? I use OpenWebUI and Vllm… should/can I add some router that assesses the task type at hand, that then adds the best set of parameters dynamically? And, do agent harnassess do these kinda (probably quite essential) tweaks?

Following along at home. Ran into an issue on Step 3
git clone GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks · GitHub
cd spark-vllm-docker
docker build -t vllm-sm121 .
cd ..

~/git/spark-vllm-docker$ docker build -t vllm-sm121 .

[+] Building 0.7s (19/19) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 13.58kB 0.0s
=> resolve image config for docker-image://docker.io/docker/dockerfile:1.6 0.3s
=> CACHED docker-image://docker.io/docker/dockerfile:1.6@sha256:ac85f380a63b13dfcefa89046420e1781752bab202122f8f 0.0s
=> [internal] load metadata for docker.io/nvidia/cuda:13.2.0-devel-ubuntu24.04 0.2s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 515B 0.0s
=> [base 1/5] FROM docker.io/nvidia/cuda:13.2.0-devel-ubuntu24.04@sha256:f9492f2eea77fbc3d0c14fa8738f35946b42da7 0.0s
=> CACHED [base 2/5] RUN apt update && apt install -y --no-install-recommends curl vim cmake build-essen 0.0s
=> CACHED [base 3/5] RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv uv pip install torch torchvi 0.0s
=> CACHED [base 4/5] WORKDIR /workspace/vllm 0.0s
=> CACHED [base 5/5] RUN git clone -b dgxspark-3node-ring GitHub - zyang-dev/nccl: Optimized primitives for collective multi-GPU communication · GitHub && cd nccl & 0.0s
=> CACHED [runner 2/9] RUN --mount=type=bind,from=base,source=/workspace/vllm/nccl/build/pkg/deb,target=/workspa 0.0s
=> CACHED [runner 3/9] WORKDIR /workspace/vllm 0.0s
=> CACHED [runner 4/9] RUN mkdir -p tiktoken_encodings && wget -O tiktoken_encodings/o200k_base.tiktoken "ht 0.0s
=> CACHED [runner 5/9] RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv uv pip install torch torch 0.0s
=> CACHED [runner 6/9] RUN --mount=type=bind,source=wheels,target=/workspace/wheels --mount=type=cache,id=uv 0.0s
=> CACHED [runner 7/9] RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv uv pip install ray[default] 0.0s
=> CACHED [runner 8/9] RUN rm /usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2 && ln -s 0.0s
=> ERROR [runner 9/9] COPY build-metadata.yaml /workspace/build-metadata.yaml 0.0s

[runner 9/9] COPY build-metadata.yaml /workspace/build-metadata.yaml:


ERROR: failed to build: failed to solve: failed to compute cache key: failed to calculate checksum of ref cc04f87f-b550-47f3-912a-571f68748e6a::eqaon0j0fxegtdc4ccm2z9i8e: “/build-metadata.yaml”: not found

It might be better to use his build-and-copy.sh with teh appropriate flags
A nice to have for some might be steps to set up a venv and a requirments file.

Try

cd spark-vllm-docker
git checkout 49d6d9fefd7cd05e63af8b28e4b514e9d30d249f
./build-and-copy.sh -t vllm-sm121 --vllm-ref v0.19.0 --tf5
cd ..

49d6d9fefd7cd05e63af8b28e4b514e9d30d249f - my version and should work fine and I will update README soon.

Need to recheck.

That produces problems with multiple of the vLLM patches. The vLLM version must be incorrect

ERROR [vllm-builder 5/8] RUN curl -fsL https://patch-diff.githubusercontent.com/raw/vllm-project/vllm/pull/35568.diff -o pr35568.diff

For coding I use the following, I leave the reasoning parser off because I find including thinking traces in the multi-turn-chain-of-thought context helps the model with coding tasks – less dumb. UMMV

docker run -it --name vllm-qwen35 \
  --gpus all --net=host --ipc=host \
  -v ~/models:/models \
  vllm-qwen35-v2 \
  serve /models/qwen35-122b-hybrid-int4fp8 \
  --served-model-name qwen/qwen3.5 \
  --max-model-len 196608 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.88 \
  --port 8000 \
  --host 0.0.0.0 \
  --load-format fastsafetensors \
  --attention-backend FLASHINFER \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --generation-config auto \
  --override-generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'

Updated – the temperature and coding config is good. 256k context rots after about 130K so I limit to 196k ceiling

That commit has its own issues with build-and-copy.sh

./build-and-copy.sh -t vllm-sm121 --vllm-ref v0.19.0 --tf5
Commit hash matches (e7f630c8) — wheels are up to date.
All flashinfer wheels are up to date — skipping download.
FlashInfer wheels ready.
Rebuilding vLLM wheels (–vllm-ref specified)…
vLLM build command: docker build --target vllm-export --output type=local,dest=./wheels --build-arg BUILD_JOBS=16 --build-arg TORCH_CUDA_ARCH_LIST=12.1a --build-arg FLASHINFER_CUDA_ARCH_LIST=12.1a --build-arg VLLM_REF=v0.19.0 --build-arg CACHEBUST_VLLM=1775649384 .
[+] Building 3.7s (15/19)                                                                                docker:default
=> [internal] load build definition from Dockerfile                                                               0.0s
=> => transferring dockerfile: 13.58kB                                                                            0.0s
=> resolve image config for docker-image://docker.io/docker/dockerfile:1.6                                        0.2s
=> CACHED docker-image://docker.io/docker/dockerfile:1.6@sha256:ac85f380a63b13dfcefa89046420e1781752bab202122f8f  0.0s
=> [internal] load metadata for docker.io/nvidia/cuda:13.2.0-devel-ubuntu24.04                                    0.2s
=> [internal] load .dockerignore                                                                                  0.0s
=> => transferring context: 2B                                                                                    0.0s
=> [base 1/5] FROM docker.io/nvidia/cuda:13.2.0-devel-ubuntu24.04@sha256:f9492f2eea77fbc3d0c14fa8738f35946b42da7  0.0s
=> CACHED [base 2/5] RUN apt update &&     apt install -y --no-install-recommends     curl vim cmake build-essen  0.0s
=> CACHED [base 3/5] RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv      uv pip install torch torchvi  0.0s
=> CACHED [base 4/5] WORKDIR /workspace/vllm                                                                      0.0s
=> CACHED [base 5/5] RUN git clone -b dgxspark-3node-ring 
 &&     cd nccl &  0.0s
=> CACHED [vllm-builder 1/8] WORKDIR /workspace/vllm                                                              0.0s
=> [vllm-builder 2/8] RUN --mount=type=cache,id=repo-cache,target=/repo-cache     cd /repo-cache &&     if [ ! -  2.3s
=> [vllm-builder 3/8] WORKDIR /workspace/vllm/vllm                                                                0.0s
=> [vllm-builder 4/8] RUN if [ -n “” ]; then         echo "Applying PRs: ";         for pr in ; do             e  0.2s
=> ERROR [vllm-builder 5/8] RUN curl -fsL https://patch-diff.githubusercontent.com/raw/vllm-project/vllm/pull/35  0.5s

[vllm-builder 5/8] RUN curl -fsL https://patch-diff.githubusercontent.com/raw/vllm-project/vllm/pull/35568.diff -o pr35568.diff     && if git apply --reverse --check pr35568.diff 2>/dev/null; then          echo “PR 35568 already applied, skipping.”;        else          echo “Applying PR 35568…”;          git apply -v pr35568.diff;        fi     && rm pr35568.diff:
0.445 Applying PR 35568…
0.445 Checking patch csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/scaled_mm.cuh…
0.445 error: csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/scaled_mm.cuh: No such file or directory
0.445 Checking patch csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/scaled_mm_sm120_fp8_dispatch.cuh…
0.445 error: csrc/libtorch_stable/quantization/w8a8/cutlass/c3x/scaled_mm_sm120_fp8_dispatch.cuh: No such file or directory
0.445 Checking patch csrc/moe/marlin_moe_wna16/generate_kernels.py…
0.445 Checking patch csrc/moe/marlin_moe_wna16/ops.cu…
0.445 Checking patch csrc/quantization/marlin/generate_kernels.py…
0.445 Checking patch tests/kernels/moe/test_moe.py…
0.445 Checking patch tests/kernels/quantization/test_marlin_gemm.py…
0.445 Checking patch vllm/model_executor/layers/quantization/utils/marlin_utils.py…




ERROR: failed to build: failed to solve: process “/bin/sh -c curl -fsL https://patch-diff.githubusercontent.com/raw/vllm-project/vllm/pull/35568.diff -o pr35568.diff     && if git apply --reverse --check pr35568.diff 2>/dev/null; then          echo "PR 35568 already applied, skipping.";        else          echo "Applying PR 35568…";          git apply -v pr35568.diff;        fi     && rm pr35568.diff” did not complete successfully: exit code: 1
vLLM build failed — restoring previous wheels…

I also tried to do a docker build -t vllm-sm121 . on that commit and got the same error as above.

Let me check and update.

I need to update project with fork eugr/spark-vllm-docker - vLLM to unstable each day :)

Got it running now with the latest eugr build.

── Run 2/2 ──────────────────────────────────────
[Q&A] 256 tokens in 4.59s = 55.7 tok/s (prompt: 23)

[Code] 502 tokens in 8.83s = 56.8 tok/s (prompt: 30)
[JSON] 1024 tokens in 18.35s = 55.8 tok/s (prompt: 48)
[Math] 64 tokens in 1.24s = 51.6 tok/s (prompt: 29)
[LongCode] 2048 tokens in 34.71s = 59.0 tok/s (prompt: 37)

But I get a lot of warnings during startup I think (could also be specific to that version)

(EngineCore pid=139) [rank0]:W0408 13:06:12.560000 139 torch/_inductor/triton_bundler.py:242] Failed to reload cubin file statically launchable autotuner triton_poi_fused_clone_copy_index_select_slice_split_5
(EngineCore pid=139) [rank0]:W0408 13:06:12.560000 139 torch/_inductor/triton_bundler.py:242] Traceback (most recent call last): (EngineCore pid=139) [rank0]:W0408 13:06:12.560000 139 torch/_inductor/triton_bundler.py:242] File “/usr/local/lib/python3.12/dist-packages/torch/_inductor/triton_bundler.py”, line 240, in load_autotuners
(EngineCore pid=139) [rank0]:W0408 13:06:12.560000 139 torch/_inductor/triton_bundler.py:242] compile_result.reload_cubin_path()
(EngineCore pid=139) [rank0]:W0408 13:06:12.560000 139 torch/_inductor/triton_bundler.py:242] File “/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py”, line 1879, in reload_cubin_path
(EngineCore pid=139) [rank0]:W0408 13:06:12.560000 139 torch/_inductor/triton_bundler.py:242] raise RuntimeError( (EngineCore pid=139) [rank0]:W0408 13:06:12.560000 139 torch/_inductor/triton_bundler.py:242] RuntimeError: (‘Cubin file saved by TritonBundler not found at %s’, ‘/tmp/torchinductor_root/triton/0/EYN56BI7P5GDER2UBHTXVLBW3D5TZGX3KOBVTFCUG
RGZVNJWAUAQ/triton_poi_fused_clone_copy_index_select_slice_split_5.cubin’)

But its running and the outputs seem to be fine so far. Still investigating. I had to patch the inc.py patch file a bit. patch.diff.txt (988 Bytes)

I am also running this just as mod currently. Just put everything in a directory and copy the patch files and add this run.sh

#!/bin/bash
set -e
patch -p1 -d /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/ < inc.py.patch2
python3 patch_int8_lmhead.py

My current recipe file I am running:

speedup.txt (1.7 KB)

But yeah in general, vLLM is not helping with its many patches and instabilities etc. As soon as you have a properly running version, best to freeze it at this point sadly :D

But many thanks for your efforts @Albond ! Quite the speedup now :) moves this into really usable.

Please, check latest README in github. Main changes:

  1. Fix huggingface-cli
  2. Add 2 “sed” in step 3 to remove RUN blocks from spark-vllm-docker:
    sed -i ‘/# TEMPORARY PATCH for broken FP8 kernels/,/&& rm pr35568.diff/d’ Dockerfile
    sed -i ‘/# TEMPORARY PATCH for broken compilation/,/&& rm pr38919.diff/d’ Dockerfile

I am in progress to review the latest README from scratch and probably it works.

I have finished all the steps successfully
on the step of building spark-vllm-docker I have used the command

./build-and-copy.sh

becouse the command

docker build -t vllm-sm121 .

give me error

=> CANCELED [base 1/5] FROM docker.io/nvidia/cuda:13.2.0-devel-ubuntu24.04@sha256:f9492f2eea77fbc3d0c14fa8738f359 0.0s
=> ERROR [runner 9/9] COPY build-metadata.yaml /workspace/build-metadata.yaml

the next step

docker build -t vllm-qwen35-v2 -f docker/Dockerfile.v2 .

works good for me

but when I try to start this docker with command:

docker run -d --name vllm-qwen35
–gpus all --net=host --ipc=host
-v /home/xqdev/Models:/models
vllm-qwen35-v019-v2
serve /models/qwen35-122b-hybrid-int4fp8
–served-model-name qwen
–port 8000
–max-model-len 262144
–gpu-memory-utilization 0.90
–reasoning-parser qwen3
–attention-backend FLASHINFER
–speculative-config ‘{“method”:“mtp”,“num_speculative_tokens”:2}’

I am getting the error

[v2] Applying INT8 LM Head patch…
OK: INT8 LM Head v2 patch applied (clean)
[v2] Starting vLLM…
Traceback (most recent call last):
File “/usr/local/bin/vllm”, line 4, in
from vllm.entrypoints.cli.main import main
File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/init.py”, line 3, in
from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/benchmark/latency.py”, line 5, in
from vllm.benchmarks.latency import add_cli_args, main
File “/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/latency.py”, line 15, in
from vllm.engine.arg_utils import EngineArgs
File “/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py”, line 35, in
from vllm.config import (
File “/usr/local/lib/python3.12/dist-packages/vllm/config/init.py”, line 6, in
from vllm.config.compilation import (
File “/usr/local/lib/python3.12/dist-packages/vllm/config/compilation.py”, line 22, in
from vllm.platforms import current_platform
File “/usr/local/lib/python3.12/dist-packages/vllm/platforms/init.py”, line 279, in getattr
_current_platform = resolve_obj_by_qualname(platform_cls_qualname)()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/vllm/utils/import_utils.py”, line 111, in resolve_obj_by_qualname
module = importlib.import_module(module_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/lib/python3.12/importlib/init.py”, line 90, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py”, line 21, in
import vllm._C # noqa
^^^^^^^^^^^^^^
ImportError: /usr/local/lib/python3.12/dist-packages/vllm/_C.abi3.so: undefined symbol: _ZN2at4cuda24getCurrentCUDABlasHandleEv

what am I doing wrong? how to start this new container?

I’ve updated README and provide some steps to avoid this issue already, but you ran bare ./build-and-copy.sh, which uses defaults --vllm-ref main --tag vllm-node and doesn’t enable --tf5. The README explicitly requires:

git checkout 49d6d9fefd7cd05e63af8b28e4b514e9d30d249f
sed -i '/# TEMPORARY PATCH for broken FP8 kernels/,/&& rm pr35568.diff/d' Dockerfile
sed -i '/# TEMPORARY PATCH for broken compilation/,/&& rm pr38919.diff/d' Dockerfile
./build-and-copy.sh -t vllm-sm121 --vllm-ref v0.19.0 --tf5

Without --vllm-ref v0.19.0 the script builds from main, which is past the torch ABI we tested against. The build “succeeds” but the resulting image is broken at import time.

Fix:

1. cd spark-vllm-docker && git checkout 49d6d9fefd7cd05e63af8b28e4b514e9d30d249f
2. Apply the two sed commands from the README
3. ./build-and-copy.sh -t vllm-sm121 --vllm-ref v0.19.0 --tf5 --no-cache (the --no-cache is to nuke any stale BuildKit layers from the previous bare build)
4. Rebuild Step 4: docker build -t vllm-qwen35-v2 -f docker/Dockerfile.v2 . --no-cache
5. Then Step 5 will work.

Not sure what I did wrong, I used the Intel AutoRound INT4 by default, then just added the mtp flag since I dont want to deal with the hybrid stuff. The llama-benchy shows I have performance regression?

I am using latest eugr’s docker build

Withput mtp

spark-vllm-docker/run-recipe.sh 
qwen3.5-122b-int4-autoround \
–solo \
-e HF_HOME=/llm/llm_models/vllm_models \
--max-num-seqs 2 \
–gpu-memory-utilization 0.85 \
–kv-cache-dtype fp8 --host 0.0.0.0
| 122ba10b | pp2048 | 2062.46 ± 4.14 |              | 908.97 ± 11.81 | 907.82 ± 11.81 |  909.04 ± 11.82 |
| 122ba10b|   tg32 |   28.86 ± 0.08 | 29.00 ± 0.00 |

With mtp and flashinfer

spark-vllm-docker/run-recipe.sh 
qwen3.5-122b-int4-autoround \
–solo \
-e HF_HOME=/llm/llm_models/vllm_models \
--max-num-seqs 2 \
–gpu-memory-utilization 0.85 \
–kv-cache-dtype fp8 --host 0.0.0.0 \
–speculative-config ‘{“method”:“mtp”,“num_speculative_tokens”:1}’ \
–attention-backend FLASHINFER
| 122ba10b | pp2048 | 1904.37 ± 11.08 |              | 987.20 ± 22.92 | 985.32 ± 22.92 |  987.27 ± 22.93 |
| 122ba10b |   tg32 |    20.79 ± 0.48 | 22.09 ± 0.51 |

build-and-copy from eugr repo doesn’t have –no-cache flags:

Unknown parameter passed: --no-cache

llama-benchy counts steps, not tokens. With MTP each step produces 2–3 tokens, so steps get slower but you get more tokens per step — the benchmark just can’t see that.
Your real tg32 is closer to 20.79 × ~1.5–2 = 31–41 tok/s, which is actually faster than 28.86.

That is interesting. I guess current implementation of llama-benchy does not take into consideration of MTPs?

Also, just to verify, for people who just want to use the MTP, they don’t need to do anything but download Intel AutoRound INT4 and pass the flag?

–speculative-config ‘{“method”:“mtp”,“num_speculative_tokens”:1}’

This is correct, llama-benchy basically does not work with MTP. Yet.

We need a good MTP enabled benchmark. I know eugr is aware.

Right now yes (about MTP in vLLM), but this may change depending on the vLLM version.