Success with QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ

Over the past month, I’ve been trying the various Qwen 3.5 models, looking for a stable coding platform. Early on, I discovered the model’s tendency to overthink, often ending up in endless loops on trivial questions. I also ran into instances where there’d be a massive slow-down a few minutes into a session, to the point that the model became useless.

Recently, the Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled and Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 models have popped up, and they seem to offer a solution to these problems.

Here’s the configuration for running QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ on a two-node cluster.

cd spark-vllm-docker
./launch-cluster.sh \
  -e HF_TOKEN \
  -e HF_HUB_OFFLINE=1 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -t vllm-node:tf5 \
  --apply-mod mods/fix-qwen3.5-chat-template \
  exec vllm serve QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ \
    --tensor-parallel-size 2 \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --trust-remote-code \
    --gpu-memory-utilization 0.7 \
    --mm-processor-cache-gb 0 \
    --kv-cache-dtype fp8 \
    --max-model-len auto \
    --attention-backend flashinfer \
    --load-format fastsafetensors \
    --distributed-executor-backend ray \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
    --served-model-name qwen35-27b \
    --enable-chunked-prefill \
    --host 0.0.0.0 \
    --port 8000
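
Once the server is up, a quick sanity check against the OpenAI-compatible endpoint looks like this (a sketch; it assumes you’re running it from the head node, so the server is reachable on localhost:8000, and uses the `qwen35-27b` name set via `--served-model-name` above):

```shell
# Minimal smoke test of the running vLLM server.
# Requires the server above to be up and listening on port 8000.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen35-27b",
    "messages": [{"role": "user", "content": "Write a one-line Python hello world."}],
    "max_tokens": 128
  }'
```

If the response comes back with a `choices` array rather than an error object, the serve stack (template, parsers, tensor parallelism) is wired up correctly.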

The model gives excellent performance and supports speculative decoding.

Initially, the model would quit after the response in a number of harnesses, such as OpenCode and Qwen Code. It turns out that the chat template provided by default is much simpler than the original Qwen 3.5 template. Applying the “fix-qwen3.5-chat-template” mod seems to fix that.
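
If you can’t use the mod, an alternative I haven’t tested here is to hand vLLM a full template file directly via its `--chat-template` flag. This is only a sketch: the `.jinja` filename below is hypothetical, and you’d first need to export the complete template from the upstream Qwen 3.5 repository.

```shell
# Sketch: override the simplified bundled template with a full one.
# ./qwen3.5-full-template.jinja is a placeholder path -- save the
# original Qwen 3.5 chat template there before launching.
vllm serve QuantTrio/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ \
  --chat-template ./qwen3.5-full-template.jinja \
  ...
```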


Do you have any benchmarks that speak to the performance, especially given that you utilize MTP? How does it hold up with a large context?
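
For reference, something along these lines would answer that (a sketch, assuming a recent vLLM build that ships the `vllm bench serve` subcommand and the server from the post running on localhost:8000; the long random input length is meant to probe large-context behavior):

```shell
# Rough throughput/latency measurement against the running server.
# Long random prompts approximate large-context workloads.
vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model qwen35-27b \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 512 \
  --num-prompts 32
```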
