BlackBox Evaluation Benchmarks required?

nvidiaspark · June 1, 2026, 3:23am

I spent a couple of hours today looking into chat templates for Qwen3.6 and Qwen3.5, especially those used here on the forum, and it seems they are optimized for the commonly used benchmark command tool-eval-bench --short. I might be wrong, but I noticed a gap between the tool-eval statistics some people post and performance in real coding tasks.

TC-11 Simple Math
fix-qwen3.6-chat-template > Do not call calculators for trivial arithmetic or tautological same-unit conversions unless tool use is explicitly required.

TC-03 Implicit Tool Need
fix-qwen3.6-chat-template > When the user asks to send, forward, notify, share, or email something, and an email-sending tool is available, use the email tool once recipient, subject, and body are known or safely infer

There are many more such cases in the template. So does that make tool-eval benchmarks useless, especially the --short version? If so, where do we go from here? Should trusted members of the community run BlackBox benchmarks?

I can’t be the only one who noticed that.
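A rough way to spot instructions like the two quoted above is to scan a template for imperative, scenario-specific phrasing. The following is only a minimal sketch: the phrase patterns and the function name `find_injected_instructions` are my own assumptions, not part of tool-eval-bench or any of the templates, and a match is a hint to read the line, not proof of cheating.

```python
import re

# Hypothetical phrase patterns (assumption): extend or tune these yourself.
SUSPICIOUS_PATTERNS = [
    r"\bdo not call\b",
    r"\bunless tool use is explicitly required\b",
    r"\bwhen the user asks to\b",
    r"\buse the \w+ tool\b",
]


def find_injected_instructions(template_text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that match any pattern."""
    hits = []
    for lineno, line in enumerate(template_text.splitlines(), start=1):
        lowered = line.lower()
        if any(re.search(pattern, lowered) for pattern in SUSPICIOUS_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "{%- if tools %}\n"
        "Do not call calculators for trivial arithmetic.\n"
        "{{ message.content }}\n"
    )
    for lineno, line in find_injected_instructions(sample):
        print(f"{lineno}: {line}")
```

Running it over the text of a chat_template.jinja file lists the lines worth a manual look. A template with no hits can still be tuned to a benchmark in other ways.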

nvidiaspark · June 1, 2026, 6:22am

Here are the Template benchmarks for Qwen/Qwen3.6-27B-FP8:

37/100 _ NO CHEAT _ froggeric/Qwen-Fixed-Chat-Templates/chat_template.jinja

/root/.cache/huggingface/hub/models--froggeric--Qwen-Fixed-Chat-Templates/snapshots/c31fd393e531dbacd92b6deb99a2037cc949f950/chat_template.jinja

13/100 _ NO CHEAT _ froggeric/Qwen-Fixed-Chat-Templates/archive/qwen3.6/chat_template-v13.jinja

/root/.cache/huggingface/hub/models--froggeric--Qwen-Fixed-Chat-Templates/snapshots/c31fd393e531dbacd92b6deb99a2037cc949f950/archive/qwen3.6/chat_template-v13.jinja

50/100 _ NO CHEAT _ unsloth/Qwen3.6-27B-NVFP4/chat_template.jinja

/root/.cache/huggingface/templates/unsloth/Qwen3.6-27B-NVFP4/raw/main/chat_template.jinja

97/100 _ NO CHEAT _ random file _ chat_template.jinja

/root/.cache/huggingface/templates/works.jinja

60/100 _ NO CHEAT _ allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix

/root/.cache/huggingface/templates/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/refs/heads/main/chat-template/qwen3.6-enhanced.jinja

100/100 _ CHEAT _ technigmaai/nvidia-Qwen3.6-35B-A3B-NVFP4/fix-qwen3.6-chat-template

/root/.cache/huggingface/templates/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/refs/heads/main/chat-template/qwen3.6-enhanced.jinja

–

Recipe for the baseline:
nano recipes/test.yaml   # change chat_template
./build-and-copy.sh
./run-recipe.sh test --solo


name: Qwen3.6-27B-FP8
recipe_version: "1"
description: "vLLM serving Qwen3.6-27B in FP8 with MTP speculative decoding, 262K context, tool calling"

model: Qwen/Qwen3.6-27B-FP8

container: vllm-node-tf5

build_args:
  - --tf5

defaults:
  port: **INSERT**
  host: 0.0.0.0
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 16384
  max_num_seqs: 4
  chat_template: **INSERT**
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

command: |
  vllm serve Qwen/Qwen3.6-27B-FP8 \
    -O3 \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --served-model-name **INSERT** \
    --enable-prefix-caching \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --port {port} \
    --host {host} \
    --load-format instanttensor \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --trust-remote-code \
    --chat-template {chat_template} \
    --default-chat-template-kwargs '{{"preserve_thinking": true}}' \
    --speculative-config '{{"method": "qwen3_next_mtp", "num_speculative_tokens": 3}}' \
    --override-generation-config '{{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}}'
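Switching templates by hand for every run gets tedious. A small helper could rewrite the chat_template value in the recipe text before each run. This is only a sketch under assumptions: `set_chat_template` is a hypothetical name, it treats the recipe as plain text rather than parsing YAML, and it expects exactly one `chat_template:` key, as in the recipe above.

```python
import re

# Matches the `chat_template:` key line only, not `--chat-template {chat_template}`.
_CHAT_TEMPLATE_LINE = re.compile(
    r"^([ \t]*chat_template:[ \t]*).*$", flags=re.MULTILINE
)


def set_chat_template(recipe_text: str, template_path: str) -> str:
    """Return recipe_text with the value of the chat_template key replaced."""
    # A function replacement avoids backslash-escape surprises in the path.
    new_text, count = _CHAT_TEMPLATE_LINE.subn(
        lambda match: match.group(1) + template_path, recipe_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one chat_template key, found {count}")
    return new_text


if __name__ == "__main__":
    recipe = "defaults:\n  max_num_seqs: 4\n  chat_template: old.jinja\nenv:\n"
    print(set_chat_template(recipe, "/tmp/candidate.jinja"))
```

The rewritten text can then be saved to recipes/test.yaml before running ./build-and-copy.sh and ./run-recipe.sh test --solo as described above.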

serapis · June 1, 2026, 7:09am

Thanks for starting this thread! I’ve worked on tool-eval-bench to give the community a means to test-drive their systems and smoke out issues early, so that they can run stable models for their agentic deployments.

The --short flag is meant as a quick sanity check, not a definitive evaluation. The full-blown suite (and especially the addition of GSM8K, MMLU, and IFEval in the latest release) helps tackle the gap between tool-eval scores and real coding tasks. There is still lots to do, and I always welcome contributions to help ensure we have an easy-to-use and comprehensive way to evaluate models for our DGX systems.

Warm regards,
Tim

nvidiaspark · June 1, 2026, 7:18am

Your work in the community and the tool are awesome. The question is how we can prevent people from cheating on your benchmark.
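One possible check, sketched here under assumptions, is to compare the pass rate on the public scenarios with the pass rate on a private, held-out set that template authors have never seen. The function names and the 0.2 threshold are hypothetical and not part of tool-eval-bench. Each list holds one pass/fail result per scenario.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of scenarios that passed."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def overfit_gap(public: list[bool], held_out: list[bool]) -> float:
    """Public pass rate minus held-out pass rate; large values are suspicious."""
    return pass_rate(public) - pass_rate(held_out)


def looks_overfit(
    public: list[bool], held_out: list[bool], threshold: float = 0.2
) -> bool:
    """Flag a template whose public score exceeds its held-out score by more
    than threshold (an arbitrary assumption, to be calibrated on real runs)."""
    return overfit_gap(public, held_out) > threshold


if __name__ == "__main__":
    public_results = [True] * 10
    held_out_results = [True] * 4 + [False] * 6
    print(looks_overfit(public_results, held_out_results))
```

A large gap does not prove intent, since a template can fit the public scenarios by accident, but it shows which results deserve a closer look.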
