Success with QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ

Over the past month, I’ve been trying the various Qwen 3.5 models, looking for a stable coding platform. Early on, I discovered the model’s tendency to overthink, often ending up in endless loops on trivial questions. I also ran into instances where there’d be a massive slow-down a few minutes into a session, to the point that the model became useless.

Recently, the Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled and Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 models have popped up, and they seem to offer a solution to these problems.

Here’s the configuration for running QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ on a two-node cluster.

cd spark-vllm-docker
./launch-cluster.sh \
  -e HF_TOKEN \
  -e HF_HUB_OFFLINE=1 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -t vllm-node:tf5 \
  --apply-mod mods/fix-qwen3.5-chat-template \
  exec vllm serve QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ \
    --tensor-parallel-size 2 \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --trust-remote-code \
    --gpu-memory-utilization 0.7 \
    --mm-processor-cache-gb 0 \
    --kv-cache-dtype fp8 \
    --max-model-len auto \
    --attention-backend flashinfer \
    --load-format fastsafetensors \
    --distributed-executor-backend ray \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
    --served-model-name qwen35-27b \
    --enable-chunked-prefill \
    --host 0.0.0.0 \
    --port 8000
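
Once the server is up, a quick sanity check against the OpenAI-compatible endpoint looks like this (a sketch; it assumes you’re running it from the head node, so the server is reachable on localhost:8000, and uses the `qwen35-27b` name set via `--served-model-name` above):

```shell
# Minimal smoke test of the running vLLM server.
# Requires the server above to be up and listening on port 8000.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen35-27b",
    "messages": [{"role": "user", "content": "Write a one-line Python hello world."}],
    "max_tokens": 128
  }'
```

If the response comes back with a `choices` array rather than an error object, the serve stack (template, parsers, tensor parallelism) is wired up correctly.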

The model gives excellent performance and supports speculative decoding.

Initially, the model would quit after the response in a number of harnesses, such as OpenCode and Qwen Code. It turns out that the chat template provided by default is much simpler than the original Qwen 3.5 template. Applying the “fix-qwen3.5-chat-template” mod seems to fix that.
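
If you can’t use the mod, an alternative I haven’t tested here is to hand vLLM a full template file directly via its `--chat-template` flag. This is only a sketch: the `.jinja` filename below is hypothetical, and you’d first need to export the complete template from the upstream Qwen 3.5 repository.

```shell
# Sketch: override the simplified bundled template with a full one.
# ./qwen3.5-full-template.jinja is a placeholder path -- save the
# original Qwen 3.5 chat template there before launching.
vllm serve QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ \
  --chat-template ./qwen3.5-full-template.jinja \
  ...
```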


Do you have any benchmarks that speak to the performance, especially given that you utilize MTP? How does it hold up with a large context?
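
For reference, something along these lines would answer that (a sketch, assuming a recent vLLM build that ships the `vllm bench serve` subcommand and the server from the post running on localhost:8000; the long random input length is meant to probe large-context behavior):

```shell
# Rough throughput/latency measurement against the running server.
# Long random prompts approximate large-context workloads.
vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model qwen35-27b \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 512 \
  --num-prompts 32
```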
