I spent a couple of hours today looking into chat templates for Qwen3.6 and Qwen3.5, especially used here on the forum and it seems like they are optimized for the commonly used benchmark tool tool-eval-bench --short. I might be wrong, but I noticed a gap between tool-eval statistics posted by some people and performance in real coding tasks.
TC-11 Simple Math
fix-qwen3.6-chat-template > Do not call calculators for trivial arithmetic or tautological same-unit conversions unless tool use is explicitly required.
TC-03 Implicit Tool Need
fix-qwen3.6-chat-template > When the user asks to send, forward, notify, share, or email something, and an email-sending tool is available, use the email tool once recipient, subject, and body are known or safely infer
And many more cases, that can be found in the template. So does that make tool-eval benchmarks useless? Especially the --short version. If yes, where do we go from here? Having trusted members of the community run BlackBox benchmarks?
I can’t be the only one who noticed that.
Here are the Template benchmarks for Qwen/Qwen3.6-27B-FP8:
37/100 _ NO CHEAT _ froggeric/Qwen-Fixed-Chat-Templates/chat_template.jinja
/root/.cache/huggingface/hub/models--froggeric--Qwen-Fixed-Chat-Templates/snapshots/c31fd393e531dbacd92b6deb99a2037cc949f950/chat_template.jinja
13/100 _ NO CHEAT _ froggeric/Qwen-Fixed-Chat-Templates/archive/qwen3.6/chat_template-v13.jinja
/root/.cache/huggingface/hub/models--froggeric--Qwen-Fixed-Chat-Templates/snapshots/c31fd393e531dbacd92b6deb99a2037cc949f950/archive/qwen3.6/chat_template-v13.jinja
50/100 _ NO CHEAT _ unsloth/Qwen3.6-27B-NVFP4/chat_template.jinja
/root/.cache/huggingface/templates/unsloth/Qwen3.6-27B-NVFP4/raw/main/chat_template.jinja
97/100 _ NO CHEAT _ random file _ chat_template.jinja
/root/.cache/huggingface/templates/works.jinja
60/100 _ NO CHEAT _ allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix
/root/.cache/huggingface/templates/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/refs/heads/main/chat-template/qwen3.6-enhanced.jinja
100/100 _ CHEAT _ technigmaai/nvidia-Qwen3.6-35B-A3B-NVFP4/fix-qwen3.6-chat-template
/root/.cache/huggingface/templates/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/refs/heads/main/chat-template/qwen3.6-enhanced.jinja
–
Recipe for the baseline:
nano recipes/test.yaml change chat_template
./build-and-copy.sh
./run-recipe.sh test --solo
name: Qwen3.6-27B-FP8
recipe_version: "1"
description: "vLLM serving Qwen3.6-27B in FP8 with MTP speculative decoding, 262K context, tool calling"
model: Qwen/Qwen3.6-27B-FP8
container: vllm-node-tf5
build_args:
- --tf5
defaults:
port: **INSERT**
host: 0.0.0.0
gpu_memory_utilization: 0.7
max_model_len: 262144
max_num_batched_tokens: 16384
max_num_seqs: 4
chat_template: **INSERT**
env:
VLLM_MARLIN_USE_ATOMIC_ADD: 1
command: |
vllm serve Qwen/Qwen3.6-27B-FP8 \
-O3 \
--max-model-len {max_model_len} \
--max-num-seqs {max_num_seqs} \
--served-model-name **INSERT** \
--enable-prefix-caching \
--gpu-memory-utilization {gpu_memory_utilization} \
--port {port} \
--host {host} \
--load-format instanttensor \
--enable-chunked-prefill \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--max-num-batched-tokens {max_num_batched_tokens} \
--trust-remote-code \
--chat-template {chat_template} \
--default-chat-template-kwargs '{{"preserve_thinking": true}}' \
--speculative-config '{{"method": "qwen3_next_mtp", "num_speculative_tokens": 3}}' \
--override-generation-config '{{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}}'
Thanks for starting this thread! I’ve worked on tool-eval-bench to give the community means to test-drive their systems and smoke out issues early on to ensure that they can run stable models for their agentic deployments.
The --short flag is meant as a quick sanity check, not a definitive evaluation. The full-blown suite (and especially the addition of GSM8K, MMLU, IFEval in the latest release) help tackle the gap between tool-eval scores and real coding tasks. There is still lots to do and I always welcome contributions to help ensure we have an easy to use and comprehensive way to evaluate models for our DGX systems.
Warm regards,
Tim
Your work in the community and the tool are awesome. The question is how can we prevent people from cheating on your benchmark.