I intentionally left this out since Qwen has official sampling recommendations (Qwen/Qwen3.5-122B-A10B · Hugging Face) that are worth following closely.
For agentic/coding workflows specifically, they recommend Thinking mode for precise coding tasks:
Personally I stick close to these and only adjust slightly based on feel — in my experience Qwen3.5 becomes noticeably unstable when you stray too far from the recommended ranges. The presence_penalty is worth tuning between 0–2 to control repetition, but going too high can cause language mixing.
In my own software products that call the OpenAI-compatible API, I don’t lock in a single set of parameters at startup — I change them dynamically per request depending on the task type. So a coding request, a reasoning task, and a general chat message can each hit the model with a different preset within the same session.
Are there some pointers to where to add or setup this dynamic switching task type? I use OpenWebUI and Vllm… should/can I add some router that assesses the task type at hand, that then adds the best set of parameters dynamically? And, do agent harnassess do these kinda (probably quite essential) tweaks?
ERROR: failed to build: failed to solve: failed to compute cache key: failed to calculate checksum of ref cc04f87f-b550-47f3-912a-571f68748e6a::eqaon0j0fxegtdc4ccm2z9i8e: “/build-metadata.yaml”: not found
It might be better to use his build-and-copy.sh with teh appropriate flags
A nice to have for some might be steps to set up a venv and a requirments file.
For coding I use the following, I leave the reasoning parser off because I find including thinking traces in the multi-turn-chain-of-thought context helps the model with coding tasks – less dumb. UMMV
But yeah in general, vLLM is not helping with its many patches and instabilities etc. As soon as you have a properly running version, best to freeze it at this point sadly :D
But many thanks for your efforts @Albond ! Quite the speedup now :) moves this into really usable.
Please, check latest README in github. Main changes:
Fix huggingface-cli
Add 2 “sed” in step 3 to remove RUN blocks from spark-vllm-docker:
sed -i ‘/# TEMPORARY PATCH for broken FP8 kernels/,/&& rm pr35568.diff/d’ Dockerfile
sed -i ‘/# TEMPORARY PATCH for broken compilation/,/&& rm pr38919.diff/d’ Dockerfile
I am in progress to review the latest README from scratch and probably it works.
[v2] Applying INT8 LM Head patch…
OK: INT8 LM Head v2 patch applied (clean)
[v2] Starting vLLM…
Traceback (most recent call last):
File “/usr/local/bin/vllm”, line 4, in
from vllm.entrypoints.cli.main import main
File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/init.py”, line 3, in
from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/benchmark/latency.py”, line 5, in
from vllm.benchmarks.latency import add_cli_args, main
File “/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/latency.py”, line 15, in
from vllm.engine.arg_utils import EngineArgs
File “/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py”, line 35, in
from vllm.config import (
File “/usr/local/lib/python3.12/dist-packages/vllm/config/init.py”, line 6, in
from vllm.config.compilation import (
File “/usr/local/lib/python3.12/dist-packages/vllm/config/compilation.py”, line 22, in
from vllm.platforms import current_platform
File “/usr/local/lib/python3.12/dist-packages/vllm/platforms/init.py”, line 279, in getattr
_current_platform = resolve_obj_by_qualname(platform_cls_qualname)()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/vllm/utils/import_utils.py”, line 111, in resolve_obj_by_qualname
module = importlib.import_module(module_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/lib/python3.12/importlib/init.py”, line 90, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py”, line 21, in
import vllm._C # noqa
^^^^^^^^^^^^^^
ImportError: /usr/local/lib/python3.12/dist-packages/vllm/_C.abi3.so: undefined symbol: _ZN2at4cuda24getCurrentCUDABlasHandleEv
what am I doing wrong? how to start this new container?
I’ve updated README and provide some steps to avoid this issue already, but you ran bare ./build-and-copy.sh, which uses defaults --vllm-ref main --tag vllm-node and doesn’t enable --tf5. The README explicitly requires:
git checkout 49d6d9fefd7cd05e63af8b28e4b514e9d30d249f
sed -i '/# TEMPORARY PATCH for broken FP8 kernels/,/&& rm pr35568.diff/d' Dockerfile
sed -i '/# TEMPORARY PATCH for broken compilation/,/&& rm pr38919.diff/d' Dockerfile
./build-and-copy.sh -t vllm-sm121 --vllm-ref v0.19.0 --tf5
Without --vllm-ref v0.19.0 the script builds from main, which is past the torch ABI we tested against. The build “succeeds” but the resulting image is broken at import time.
Fix:
1. cd spark-vllm-docker && git checkout 49d6d9fefd7cd05e63af8b28e4b514e9d30d249f
2. Apply the two sed commands from the README
3. ./build-and-copy.sh -t vllm-sm121 --vllm-ref v0.19.0 --tf5 --no-cache (the --no-cache is to nuke any stale BuildKit layers from the previous bare build)
4. Rebuild Step 4: docker build -t vllm-qwen35-v2 -f docker/Dockerfile.v2 . --no-cache
5. Then Step 5 will work.
Not sure what I did wrong, I used the Intel AutoRound INT4 by default, then just added the mtp flag since I dont want to deal with the hybrid stuff. The llama-benchy shows I have performance regression?
llama-benchy counts steps, not tokens. With MTP each step produces 2–3 tokens, so steps get slower but you get more tokens per step — the benchmark just can’t see that.
Your real tg32 is closer to 20.79 × ~1.5–2 = 31–41 tok/s, which is actually faster than 28.86.