Toolery 0.1.0 - a deterministic tool-calling benchmark for local LLMs

arctic.gus · May 31, 2026, 9:05am

Very cool, thanks for making this. It's definitely good to have additional evals to grill local models with.
Noticed an occasional minor bug where the History tab doesn't show current or previous runs until you close and re-open the TUI.
A few ideas on my wishlist that would be awesome to have:

Tool call traces that are readable in the TUI.
Average tokens/s calculated from the tool test section and captured in the results, as I sometimes skip llama-benchy to complete the run quicker.
Not sure if it's possible, but pulling recipe params would be super neat, so that I could compare how changing different params affects performance…
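
For the tokens/s item, here is a rough sketch of the calculation I have in mind. It assumes each tool-test request records its completion token count and generation time; the `ToolCallTiming` record and its field names are made up for illustration and are not Toolery's actual trace format.

```python
from dataclasses import dataclass


@dataclass
class ToolCallTiming:
    # Hypothetical per-request record; the real trace format may differ.
    completion_tokens: int
    generation_seconds: float


def average_tokens_per_second(timings):
    """Token-weighted average: total tokens divided by total generation time."""
    total_tokens = sum(t.completion_tokens for t in timings)
    total_seconds = sum(t.generation_seconds for t in timings)
    if total_seconds <= 0:
        return None  # nothing measured, so no meaningful rate
    return total_tokens / total_seconds


# Made-up numbers, only to show the shape of the input.
runs = [
    ToolCallTiming(120, 2.0),
    ToolCallTiming(60, 1.0),
    ToolCallTiming(300, 6.0),
]
print(f"{average_tokens_per_second(runs):.1f} tok/s")  # 53.3 tok/s
```

Dividing total tokens by total time, rather than averaging the per-request rates, keeps very short tool calls from skewing the number.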
Thanks again for making it.
