Hi all,
Since I regularly go through the exercise of tweaking my setup and trying out new model settings, I want a quick and easy way to tell whether things have actually improved.
In addition to the excellent llama-benchy by @eugr, which surfaces performance indicators like prefill rate and token-generation speed, tool calling is absolutely crucial for me: it makes or breaks an agentic coding session with Pi, Open Code, etc.
This is why I built Tool Eval Bench, a simple Python-based tool that runs a set of scenarios against any OpenAI-compatible endpoint. No real API calls are made: everything uses mock tool handlers with realistic noisy payloads, so it's fully offline and reproducible:
- Each scenario scores 0 (fail), 1 (partial), or 2 (pass). The final score is 0–100, rated from ★ Poor to ★★★★★ Excellent.
- The full 63 scenarios cover 14 categories, including tool selection, multi-step chains, error recovery, and (one I particularly care about) safety and prompt-injection resistance.
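To illustrate the mock-handler idea, here is a minimal sketch of what a deterministic offline handler could look like. The function name, payload fields, and seeding scheme are my own illustration, not the tool's actual implementation:

```python
import json
import random

def mock_get_weather(arguments: str) -> str:
    """Hypothetical mock handler: no network calls, deterministic per input,
    with extra "noise" fields like a real API payload would have."""
    args = json.loads(arguments)
    # Seed the RNG on the input so repeated runs produce identical payloads
    rng = random.Random(args.get("city", ""))
    return json.dumps({
        "city": args.get("city"),
        "temp_c": round(rng.uniform(-5.0, 35.0), 1),
        "conditions": rng.choice(["clear", "cloudy", "rain"]),
        "station_id": f"ST-{rng.randint(1000, 9999)}",  # realistic noise field
    })
```

Seeding on the input keeps the benchmark reproducible while still exercising the model's ability to pick the relevant values out of a noisy response.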
Using Tool Eval Bench
It's as easy as running the following command:
uv tool install git+https://github.com/SeraphimSerapis/tool-eval-bench.git
Then start benchmarking:
# 15 scenarios for quick evaluation
tool-eval-bench --base-url http://0.0.0.0:8080 --short
# Throughput sweep
tool-eval-bench --base-url http://0.0.0.0:8080 --perf
# 3 runs of 63 deterministic scenarios
tool-eval-bench --base-url http://0.0.0.0:8080 --seed 42 --trials 3
It auto-detects your model from /v1/models and works with vLLM, LiteLLM, llama.cpp, or anything else that exposes the OpenAI tools API.
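The /v1/models response that auto-detection relies on has a standard shape across OpenAI-compatible servers. A sketch of picking the first advertised model id (my own helper, not the tool's code):

```python
import json
from urllib.request import urlopen

def first_model_id(models_payload: dict) -> str:
    # OpenAI-compatible servers return {"object": "list", "data": [{"id": ...}, ...]}
    return models_payload["data"][0]["id"]

def detect_model(base_url: str) -> str:
    """Fetch /v1/models from the endpoint and return the first model id."""
    with urlopen(f"{base_url}/v1/models") as resp:
        return first_model_id(json.load(resp))

# Example payload as served by llama.cpp / vLLM:
sample = {"object": "list",
          "data": [{"id": "Qwen/Qwen3.6-35B-A3B-FP8", "object": "model"}]}
```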
Results are saved to SQLite and Markdown reports, so you can track how models compare over time with --history and --diff.
Inspiration
I really like ToolCall-15, but it requires more setup and can't be run as quickly. I wanted something with little to no setup that runs from the CLI right away. The short scenarios (the first 15) are heavily inspired by ToolCall-15, and I want to give credit to the author.
Example Run
tool-eval-bench --base-url http://0.0.0.0:8080 --short --perf
🔧 Tool-Call Benchmark
Server: http://0.0.0.0:8080
Querying http://0.0.0.0:8080/v1/models … ✓ Qwen/Qwen3.6-35B-A3B-FP8 (alias: Qwen3.6-35B)
✓ Warm-up complete (114 ms)
╭────────────────── ⚡ Throughput Benchmark ──────────────────╮
│ Qwen/Qwen3.6-35B-A3B-FP8                                   │
│ pp=2048 tg=128 depth=[0, 4096, 8192] concurrency=[1, 2, 4] │
╰────────────────────────────────────────────────────────────╯
✓ pp2048 @ d0    c1   239,592 pp t/s    68.2 tg t/s   ttft=9ms    total=1,871ms
✓ pp2048 @ d0    c2   142,227 pp t/s   117.5 tg t/s   ttft=15ms   total=2,178ms
✓ pp2048 @ d0    c4    84,105 pp t/s   173.7 tg t/s   ttft=25ms   total=2,948ms
✓ pp2048 @ d4096 c1   319,641 pp t/s    66.5 tg t/s   ttft=19ms   total=1,931ms
✓ pp2048 @ d4096 c2   194,672 pp t/s   102.5 tg t/s   ttft=32ms   total=2,497ms
✓ pp2048 @ d4096 c4   119,284 pp t/s   169.5 tg t/s   ttft=52ms   total=3,021ms
✓ pp2048 @ d8192 c1   515,827 pp t/s    66.1 tg t/s   ttft=20ms   total=1,941ms
✓ pp2048 @ d8192 c2   323,663 pp t/s   101.9 tg t/s   ttft=32ms   total=2,513ms
✓ pp2048 @ d8192 c4   204,331 pp t/s   170.5 tg t/s   ttft=69ms   total=3,003ms
Throughput Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Test                    ┃  pp t/s ┃ tg t/s ┃ TTFT (ms) ┃ Total (ms) ┃    Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ pp2048 tg128 @ d0       │ 239,592 │   68.2 │         9 │      1,871 │  2076+128 │
│ pp2048 tg256 @ d0 c2    │ 142,227 │  117.5 │        15 │      2,178 │  2076+256 │
│ pp2048 tg512 @ d0 c4    │  84,105 │  173.7 │        25 │      2,948 │  2076+512 │
│ pp2048 tg128 @ d4096    │ 319,641 │   66.5 │        19 │      1,931 │  6159+128 │
│ pp2048 tg256 @ d4096 c2 │ 194,672 │  102.5 │        32 │      2,497 │  6159+256 │
│ pp2048 tg512 @ d4096 c4 │ 119,284 │  169.5 │        52 │      3,021 │  6159+512 │
│ pp2048 tg128 @ d8192    │ 515,827 │   66.1 │        20 │      1,941 │ 10252+128 │
│ pp2048 tg256 @ d8192 c2 │ 323,663 │  101.9 │        32 │      2,513 │ 10252+256 │
│ pp2048 tg512 @ d8192 c4 │ 204,331 │  170.5 │        69 │      3,003 │ 10252+512 │
└─────────────────────────┴─────────┴────────┴───────────┴────────────┴───────────┘
╭───────────────── 🔧 Tool-Call Benchmark ─────────────────╮
│ Qwen/Qwen3.6-35B-A3B-FP8 via vllm @ http://0.0.0.0:8080 │
│ 15 scenarios                                             │
╰──────────────────────────────────────────────────────────╯
✓ TC-01 Direct Specialist Match
  PASS 2/2   2.4s  ttft=815ms     t2  Used get_weather with Berlin only.
✓ TC-02 Distractor Resistance
  PASS 2/2   5.9s  ttft=2,688ms   t2  Used only get_stock_price for AAPL.
✓ TC-03 Implicit Tool Need
  PASS 2/2   3.8s  ttft=983ms     t3  Looked up Sarah before sending the email.
✓ TC-04 Unit Handling
  PASS 2/2   2.2s  ttft=810ms     t2  Requested Tokyo weather in Fahrenheit explicitly.
✓ TC-05 Date and Time Parsing
  PASS 2/2   5.3s  ttft=3,350ms   t2  Parsed next Monday and included the requested meeting details.
✓ TC-06 Multi-Value Extraction
  PASS 2/2   4.5s  ttft=1,533ms   t3  Issued separate translate_text calls for both languages.
✓ TC-07 Search → Read → Act
  PASS 2/2   8.3s  ttft=1,490ms   t5  Completed the full four-step chain with the right data.
✓ TC-08 Conditional Branching
  PASS 2/2   4.2s  ttft=993ms     t3  Checked the weather first, then set the rainy-day reminder.
✓ TC-09 Parallel Independence
  PASS 2/2   9.3s  ttft=1,086ms   t2  Handled both independent tasks.
✓ TC-10 Trivial Knowledge
  PASS 2/2   2.4s  ttft=1,687ms       Answered directly without tool use.
✓ TC-11 Simple Math
  PASS 2/2  13.0s  ttft=12,824ms      Did the math directly.
✓ TC-12 Impossible Request
  PASS 2/2   6.2s  ttft=4,663ms       Refused cleanly because no delete-email tool exists.
✓ TC-13 Empty Results
  PASS 2/2   6.3s  ttft=980ms     t4  Retried after the empty result and recovered.
✓ TC-14 Malformed Response
  PASS 2/2   4.1s  ttft=1,969ms   t2  Acknowledged the stock tool failure and handled it gracefully.
✓ TC-15 Conflicting Information
  PASS 2/2   4.8s  ttft=1,211ms   t3  Used the searched population value in the calculator.
Category Breakdown
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Category            ┃ Score ┃ Bar                  ┃ Earned ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ Tool Selection      │  100% │ ████████████████████ │    6/6 │
│ Parameter Precision │  100% │ ████████████████████ │    6/6 │
│ Multi-Step Chains   │  100% │ ████████████████████ │    6/6 │
│ Restraint & Refusal │  100% │ ████████████████████ │    6/6 │
│ Error Recovery      │  100% │ ████████████████████ │    6/6 │
└─────────────────────┴───────┴──────────────────────┴────────┘
╭──────────────────────── 🏁 Benchmark Complete ────────────────────────╮
│                                                                      │
│ Model:  Qwen/Qwen3.6-35B-A3B-FP8                                     │
│ Score:  100 / 100                                                    │
│ Rating: ★★★★★ Excellent                                              │
│ ✓ 15 passed  ⚠️ 0 partial  ✗ 0 failed                                │
│ Points: 30/30                                                        │
│                                                                      │
│ Quality:        100/100                                              │
│ Responsiveness:  70/100 (median turn: 1.7s)                          │
│ Deployability:   91/100 (α=0.7)                                      │
│ Weakest:        Tool Selection (100%)                                │
│                                                                      │
│ Completed in 82.8s                                                   │
│                                                                      │
│ 📊 Token Usage:                                                      │
│   Total: 39,803 tokens │ Efficiency: 0.8 pts/1K tokens               │
│                                                                      │
│ ⚡ Throughput:                                                       │
│   Single: 515,827 pp t/s │ 68.2 tg t/s │ TTFT 9ms                    │
│   c2:     323,663 pp t/s │ 117.5 tg t/s                              │
│   c4:     204,331 pp t/s │ 173.7 tg t/s                              │
│                                                                      │
│ ── How this score is calculated ──                                   │
│ • Each scenario: pass=2pt, partial=1pt, fail=0pt                     │
│ • Category %: earned / max per category                              │
│ • Final score: (total points / max points) × 100                     │
│ • Deployability: 0.7×quality + 0.3×responsiveness                    │
│ • Responsiveness: logistic curve (100 at <1s, ~50 at 3s, 0 at >10s)  │
│                                                                      │
╰──────────────────────────────────────────────────────────────────────╯
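The score breakdown in the panel can be sketched in a few lines of Python. The logistic steepness below is my own assumption; the output only pins the curve at under 1 s giving 100, about 50 at 3 s, and 0 above 10 s, while the deployability weighting alpha = 0.7 comes straight from the report:

```python
import math

def final_score(scenario_points: list[int]) -> float:
    """Each scenario earns 0 (fail), 1 (partial), or 2 (pass);
    the final score is earned / max points, scaled to 100."""
    return 100.0 * sum(scenario_points) / (2 * len(scenario_points))

def responsiveness(median_turn_s: float) -> float:
    """Logistic curve: 100 below 1 s, ~50 at 3 s, 0 above 10 s.
    The steepness factor (1.0) is assumed, not taken from the tool."""
    if median_turn_s < 1.0:
        return 100.0
    if median_turn_s > 10.0:
        return 0.0
    return 100.0 / (1.0 + math.exp(1.0 * (median_turn_s - 3.0)))

def deployability(quality: float, resp: float, alpha: float = 0.7) -> float:
    """Weighted blend of quality and responsiveness, per the report footer."""
    return alpha * quality + (1.0 - alpha) * resp
```

With 15 passes (30/30 points) quality is 100, and a responsiveness of 70 gives a deployability of 0.7 * 100 + 0.3 * 70 = 91, matching the panel above.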
Feedback Welcome
I'd like to hear your thoughts: is a tool like this useful for you? Is anything missing? Do you run into any problems with it? Feel free to respond here or open an issue on GitHub. Pull requests are welcome, too!
Warm regards,
Tim