Are BlackBox evaluation benchmarks required?

I spent a couple of hours today looking into chat templates for Qwen3.6 and Qwen3.5, especially the ones used here on the forum, and it seems they are optimized for the commonly used benchmark command tool-eval-bench --short. I might be wrong, but I noticed a gap between the tool-eval statistics some people post and performance in real coding tasks.

TC-11 Simple Math
fix-qwen3.6-chat-template > Do not call calculators for trivial arithmetic or tautological same-unit conversions unless tool use is explicitly required.

TC-03 Implicit Tool Need
fix-qwen3.6-chat-template > When the user asks to send, forward, notify, share, or email something, and an email-sending tool is available, use the email tool once recipient, subject, and body are known or safely infer

There are many more such cases in the template. So does that make tool-eval benchmarks useless, especially the --short version? If so, where do we go from here? Should trusted members of the community run BlackBox benchmarks?

I can’t be the only one who noticed that.
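If anyone wants to check their own template for this kind of scenario-specific instruction, here is a rough sketch in Python. The keyword list is my own assumption, taken from the two quotes above, and the function name find_suspect_lines is made up for illustration. A hit only means the line deserves a manual look, not that the template cheats.

```python
import re

# Hypothetical keyword list based on the two quoted template lines;
# extend it with whatever the benchmark scenarios talk about.
SUSPECT_KEYWORDS = ["calculator", "arithmetic", "email", "notify"]


def find_suspect_lines(template_text, keywords=SUSPECT_KEYWORDS):
    """Return (line_number, stripped_line) pairs that mention any keyword."""
    pattern = re.compile(
        "|".join(re.escape(word) for word in keywords), re.IGNORECASE
    )
    hits = []
    for number, line in enumerate(template_text.splitlines(), start=1):
        if pattern.search(line):
            hits.append((number, line.strip()))
    return hits
```

To use it, read a chat_template.jinja file into a string, pass it to find_suspect_lines, and print the returned pairs.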

Here are the Template benchmarks for Qwen/Qwen3.6-27B-FP8:

37/100 _ NO CHEAT _ froggeric/Qwen-Fixed-Chat-Templates/chat_template.jinja

/root/.cache/huggingface/hub/models--froggeric--Qwen-Fixed-Chat-Templates/snapshots/c31fd393e531dbacd92b6deb99a2037cc949f950/chat_template.jinja

13/100 _ NO CHEAT _ froggeric/Qwen-Fixed-Chat-Templates/archive/qwen3.6/chat_template-v13.jinja

/root/.cache/huggingface/hub/models--froggeric--Qwen-Fixed-Chat-Templates/snapshots/c31fd393e531dbacd92b6deb99a2037cc949f950/archive/qwen3.6/chat_template-v13.jinja

50/100 _ NO CHEAT _ unsloth/Qwen3.6-27B-NVFP4/chat_template.jinja

/root/.cache/huggingface/templates/unsloth/Qwen3.6-27B-NVFP4/raw/main/chat_template.jinja

97/100 _ NO CHEAT _ random file _ chat_template.jinja

/root/.cache/huggingface/templates/works.jinja

60/100 _ NO CHEAT _ allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix

/root/.cache/huggingface/templates/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/refs/heads/main/chat-template/qwen3.6-enhanced.jinja

100/100 _ CHEAT _ technigmaai/nvidia-Qwen3.6-35B-A3B-NVFP4/fix-qwen3.6-chat-template

/root/.cache/huggingface/templates/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/refs/heads/main/chat-template/qwen3.6-enhanced.jinja

Recipe for the baseline:
nano recipes/test.yaml   # change chat_template
./build-and-copy.sh
./run-recipe.sh test --solo


name: Qwen3.6-27B-FP8
recipe_version: "1"
description: "vLLM serving Qwen3.6-27B in FP8 with MTP speculative decoding, 262K context, tool calling"

model: Qwen/Qwen3.6-27B-FP8

container: vllm-node-tf5

build_args:
  - --tf5

defaults:
  port: **INSERT**
  host: 0.0.0.0
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 16384
  max_num_seqs: 4
  chat_template: **INSERT**
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

command: |
  vllm serve Qwen/Qwen3.6-27B-FP8 \
    -O3 \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --served-model-name **INSERT** \
    --enable-prefix-caching \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --port {port} \
    --host {host} \
    --load-format instanttensor \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --trust-remote-code \
    --chat-template {chat_template} \
    --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
    --speculative-config '{{"method": "qwen3_next_mtp", "num_speculative_tokens": 3}}' \
    --override-generation-config '{{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}}'
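To avoid editing the recipe by hand for every template, the three steps above could be scripted roughly like this. It is only a sketch: it assumes the recipe has exactly one chat_template: key (as in the recipe above) and that the two shell scripts are run from the repository root. The helper names set_chat_template and run_with_template are hypothetical.

```python
import re
import subprocess


def set_chat_template(recipe_text, template_path):
    """Return recipe_text with the value of its chat_template: key replaced."""
    # Match only the YAML key line, not the --chat-template CLI flag.
    pattern = re.compile(r"^([ \t]*chat_template:[ \t]*).*$", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda match: match.group(1) + template_path, recipe_text
    )
    if count != 1:
        raise ValueError(f"expected one chat_template key, found {count}")
    return new_text


def run_with_template(template_path, recipe_path="recipes/test.yaml"):
    """Rewrite the recipe, then build and start it with the commands above."""
    with open(recipe_path, encoding="utf-8") as handle:
        recipe_text = handle.read()
    with open(recipe_path, "w", encoding="utf-8") as handle:
        handle.write(set_chat_template(recipe_text, template_path))
    subprocess.run(["./build-and-copy.sh"], check=True)
    subprocess.run(["./run-recipe.sh", "test", "--solo"], check=True)
```

Calling run_with_template once per template path from the list above keeps everything else in the recipe identical between runs, so only the template changes.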

Thanks for starting this thread! I’ve worked on tool-eval-bench to give the community a means to test-drive their systems and smoke out issues early, so that they can run stable models for their agentic deployments.

The --short flag is meant as a quick sanity check, not a definitive evaluation. The full-blown suite (and especially the addition of GSM8K, MMLU, and IFEval in the latest release) helps tackle the gap between tool-eval scores and real coding tasks. There is still lots to do, and I always welcome contributions to help ensure we have an easy-to-use and comprehensive way to evaluate models for our DGX systems.

Warm regards,
Tim

Your work in the community and the tool itself are awesome. The question is how we can prevent people from cheating on your benchmark.