Just wanted to chime in with the Qwen/Qwen3.6-35B-A3B-FP8 recipe that has been really solid for me with OpenCode for a couple of days now.
cat recipes/qwen3.6-35b-a3b-fp8.yaml:
# Recipe: Qwen/Qwen3.6-35B-A3B model in native FP8 format
recipe_version: "1"
name: Qwen36-35B-A3B
description: vLLM serving Qwen3.6-35B-A3B-FP8
# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.6-35B-A3B-FP8
solo_only: true
# Container image to use
container: vllm-node-tf5
# Mods to apply before serving
mods:
  - mods/fix-qwen3.6-chat-template
# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  gpu_memory_utilization: 0.85
  max_model_len: 262144
  max_num_batched_tokens: 32768
# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
# The vLLM serve command template
command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --served-model-name qwen36 \
    --host {host} \
    --port {port} \
    --kv-cache-dtype bfloat16 \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --trust-remote-code \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --attention-backend flashinfer \
    --load-format instanttensor \
    --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
    --override-generation-config '{{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}}' \
    --speculative-config '{{"method":"mtp","num_speculative_tokens":3}}' \
    --max-num-seqs 4 \
    --language-model-only \
    --enable-prefix-caching
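Once the server is up, a quick way to sanity-check it is to hit the standard OpenAI-compatible routes vLLM exposes (swap in your host/port; "qwen36" is the --served-model-name from the recipe):

curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen36", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 32}'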
cat mods/fix-qwen3.6-chat-template/run.sh (which could probably be cleaner/smarter):
#!/bin/bash
set -e
# WORKSPACE_DIR is provided by the recipe environment
CHAT_TEMPLATE="qwen3.5-enhanced.jinja"
# -s: file exists and is non-empty
if [ -s "${CHAT_TEMPLATE}" ]; then
  cp "${CHAT_TEMPLATE}" "${WORKSPACE_DIR}/${CHAT_TEMPLATE}"
  echo "=======> to apply chat template, use --chat-template ${CHAT_TEMPLATE}"
else
  echo "# See https://github.com/allanchan339/vLLM-Qwen3.5-27B/tree/main and"
  echo "# https://github.com/allanchan339/vLLM-Qwen3.5-27B/blob/main/qwen3.5-enhanced.jinja"
  exit 1
fi
Then I run with ./run-recipe.sh qwen3.6-35b-a3b-fp8 --chat-template qwen3.5-enhanced.jinja -e HF_TOKEN=${HF_TOKEN}
If it matters, the vLLM build I'm running reports version 0.19.1rc1.dev374+g1174723eb.d20260417.
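If you want to compare against your own install, this one-liner should work on any recent vLLM:

python -c 'import vllm; print(vllm.__version__)'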
Token generation is usually in the 30-40 tokens/sec range, which I'm very happy with!
For OpenCode, I read a hint somewhere that setting "npm": "@ai-sdk/anthropic" would help reduce tool-call failures; since I set that, I have not encountered a single tool-call failure.
opencode.json (YMMV on the various reserved/context/input/output values):
{
  "$schema": "https://opencode.ai/config.json",
  "compaction": {
    "auto": true,
    "prune": true,
    "reserved": 16384
  },
  "model": "local/qwen36",
  "provider": {
    "local": {
      "npm": "@ai-sdk/anthropic",
      "name": "local",
      "options": {
        "baseURL": "http://PUT_YOUR_IP_ADDRESS_HERE:8000/v1",
        "apiKey": "dummy"
      },
      "models": {
        "qwen36": {
          "name": "qwen36",
          "limit": {
            "context": 212992,
            "input": 180224,
            "output": 32768
          }
        }
      }
    }
  },
  "agent": {
    "build": {
      "temperature": 0.6,
      "top_p": 0.95,
      "max_tokens": 32768
    },
    "plan": {
      "temperature": 0.6,
      "top_p": 0.95,
      "max_tokens": 32768
    }
  }
}
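In case it helps with tuning: at least in my config the limit numbers line up, with context = input + output, and context sitting below the recipe's max_model_len. A quick shell check with the values above:

# values copied from opencode.json and the recipe defaults
max_model_len=262144
context=212992; input=180224; output=32768
echo $(( input + output ))          # 212992, matches context
echo $(( max_model_len - context )) # 49152 tokens of headroom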
Feedback welcome!