Toolery 0.1.0 - a deterministic tool-calling benchmark for local LLMs

Hey folks!

Long-time lurker, finally got something worth posting. For the last month I’ve been tinkering away on a little home lab for comparing local LLMs — half because I genuinely needed it, half because, well, you know how it goes once a project grabs you.

The result is Toolery, and I’d love for you to take it for a spin. My hope is that it actually earns a spot in your toolkit rather than collecting dust in your bookmarks.

Fair warning, though: like every free tool and every benchmark ever made, it’s not perfect — but it can hand you the answers you’ve been hunting for. So grab a coffee, point it at your endpoint, and let’s see what your rig is really made of.

If you’ve spent any real time building out a homelab, you know the loop. There’s a box humming in the closet, maybe a node or two, vLLM or llama.cpp serving a model on port 8000. You pull a fresh quant, swap the served model, nudge a sampler setting, and then… what? You fire off a handful of prompts, it feels about right, and you move on. That gut-feel step is the weakest link in the whole setup, and it’s exactly the gap I built Toolery to close.

Toolery is a deterministic benchmark for LLM tool-calling. It runs a model through 143 hand-written scenarios across four difficulty tiers — 40 easy, 45 medium, 34 hard, 24 very-hard — and scores every run with plain assertions over the tool calls, arguments, and final text. No model-as-judge anywhere in the pipeline, which means two things that matter for a home setup: it costs nothing to run, and the same run gives you the same number every time. You can diff results instead of squinting at them.
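To make "plain assertions" a bit more concrete, here is a minimal sketch of what an assertion-based grader could look like. This is not Toolery's actual code: the `ToolCall`, `Transcript`, and `grade` names, the `get_weather` tool, and the expected values are all hypothetical and only illustrate the idea of checking tool calls, arguments, and final text without a judge model.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class Transcript:
    tool_calls: list = field(default_factory=list)
    final_text: str = ""


def grade(transcript: Transcript) -> list:
    """Return the list of failed checks; an empty list means the scenario passed."""
    failures = []
    calls = transcript.tool_calls
    # Check 1: exactly one call, and it goes to the expected (hypothetical) tool.
    if [c.name for c in calls] != ["get_weather"]:
        failures.append("expected exactly one call to get_weather")
    # Check 2: the arguments must match exactly, not approximately.
    elif calls[0].arguments != {"city": "Berlin", "unit": "celsius"}:
        failures.append("wrong arguments for get_weather")
    # Check 3: the final answer has to report the value the mock tool returned.
    if "18" not in transcript.final_text:
        failures.append("final text does not report the temperature")
    return failures
```

Because every check is a pure comparison, grading the same transcript twice always yields the same result, which is what makes the output diffable.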

It’s built for people serving their own models. Point it at any OpenAI-compatible endpoint and it works, but the whole thing is shaped around local serving — vLLM, llama.cpp, SGLang — and it’s cluster-aware, so the same model on a single node versus a multi-node topology gets tracked as a separate config rather than smeared together. There’s a terminal dashboard (TUI) that probes your common ports, finds the live endpoint, launches the run in the background, and shows progress as it goes. You don’t leave the console.

The part I find most useful is the capability matrix. Instead of one blurry “score,” you get separate columns for things like agentic planning, error recovery, parameter precision, state tracking, instruction following, restraint, and calibration/hallucination. That’s where you actually learn something — a model can look great overall and still fall apart on, say, recovering from a bad tool result, which is the kind of thing that quietly wrecks an agent in production.

One thing worth saying plainly: treat the numbers as a compass, not a verdict. 143 scenarios is a solid signal, but it isn’t your workload, your tools, or your latency budget. The README itself flags that gaps under ~2 percentage points are noise. Use Toolery to shortlist candidates, catch a regression after you change a quant, or sanity-check a new model before you trust it — not to crown a single “best” model and stop thinking. The real test is still your own prompts running against your own tools.
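If you compare two runs by hand, the ~2 percentage point rule of thumb can be applied mechanically. A minimal sketch, assuming scores are percentages; the `compare` helper and its return labels are made up for illustration and are not part of Toolery:

```python
NOISE_PP = 2.0  # gaps smaller than ~2 percentage points are treated as noise


def compare(score_a: float, score_b: float, noise_pp: float = NOISE_PP) -> str:
    """Say which run is ahead, or 'tie' when the gap is within the noise band."""
    gap = score_a - score_b
    if abs(gap) < noise_pp:
        return "tie"
    return "a" if gap > 0 else "b"
```

With this, 71.3 vs 70.1 counts as a tie, while 74.0 vs 70.1 counts as a real difference.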

Getting started is short:

git clone https://github.com/karolpalys/toolery.git
cd toolery
uv sync
export TOOLERY_BASE_URL=http://localhost:8000
uv run toolery run --model my-model --adapter raw --tier easy --trials 3
uv run toolery tui

That last line opens the dashboard once you’ve got a run or two recorded.

It’s MIT-licensed and on GitHub: https://github.com/karolpalys/toolery — happy to hear what breaks, what’s confusing, or what scenarios you’d want added. If you’re already serving models at home, I think you’ll get something out of it.

I’m wide open to any feedback. Just a heads-up that my bandwidth’s going to be a bit thin for a while, so be gentle if I’m slow to reply. I’m absolutely planning to keep developing this, but what’s already in there should be useful as is.

One last thing worth pointing you at: the Profiles tab. That’s where it gets genuinely handy: instead of staring at a wall of numbers, you can pick a local LLM based on what you actually need it for. It rebalances the weighting across capabilities depending on the job, so a model tuned for agentic coding bubbles up for one profile and a tidy structured-output workhorse wins for another. Saves you the mental gymnastics of deciding which columns matter today.
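The rebalancing idea can be pictured as a weighted average over the capability columns. The sketch below is an assumption about the general technique, not Toolery's real profile logic: `profile_score`, the capability keys, the weights, and the example scores are all invented for illustration.

```python
def profile_score(capabilities: dict, weights: dict) -> float:
    """Weighted average of capability scores (0-100) under one profile's weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("profile weights must sum to a positive number")
    # Capabilities missing from a model's results count as 0 for that profile.
    return sum(capabilities.get(name, 0.0) * w for name, w in weights.items()) / total


# Hypothetical capability scores for two models.
model_a = {"agentic_planning": 80, "parameter_precision": 60}
model_b = {"agentic_planning": 60, "parameter_precision": 90}

# Two hypothetical profiles that weight the same columns differently.
agent_profile = {"agentic_planning": 3, "parameter_precision": 1}
structured_profile = {"agentic_planning": 1, "parameter_precision": 3}
```

Under `agent_profile`, model A scores 75.0 against 67.5 for model B; under `structured_profile` the order flips to 65.0 against 82.5, which is the "different winner per job" effect described above.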

That’s it from me. Go break it, and let me know how it goes.

This is a comment without regard to the tool itself, and actually spans a bigger topic… Did you reach out to (I think) @serapis? If not him, there’s a member who also created a tool to rank capabilities which has been popular around here.

I’m all for homebrewed solns, esp when they provide real value, but the disconnectedness doesn’t help. It was the same when spark-vllm-docker/sparkrun started. Those dudes have since merged efforts and then started collecting everything under spark-arena. It’d make sense to have a single, badass benchmarking tool, hosted under a single platform, which is developed by the Spark crowd.

I appreciate the mention! I’ve been working on tool-eval-bench (GitHub: SeraphimSerapis/tool-eval-bench) for some time. It’s a tool-calling quality benchmark for LLM serving stacks: 65+ deterministic scenarios testing multi-turn orchestration, safety boundaries, and structured output, with support for vLLM, LiteLLM, and llama.cpp. I’ve also discussed an integration into the sparkrun and Spark Arena ecosystem with @raphael.amorim and @dbsci to avoid working on our projects in isolation.

@karol.spark this looks like a great project. Thanks for sharing it! I’ll make sure to give it a spin. I also welcome any contributions to tool-eval-bench that may come to mind if you’re interested in collaborating!

Thanks for the warm post. I’ve worked with tool-eval-bench myself and it was actually my main inspiration, so thanks to you, @serapis.

At some point I realised that most models are already getting close to 100%, which can be a bit misleading, as it makes them look like they’re all at the same level, while in reality that’s not always the case. That’s why I decided to build something more challenging, where not every model can easily hit 60–70%, and where there’s still room to properly test upcoming models.

As you can see in the attached screenshots, I managed to test GPT‑5.5 on Codex. It reached around 80% in coding and debugging, which is a great result, but there are still tasks that remain unsolved.

The whole idea behind this project was to create my own “home lab” to test different LLM quantizations and compare them directly. For example, I’m running tests on Minimax 2.7 in AWQ and NVFP4, which lets me clearly see where the quality differences are.

Each test run can be configured so that every task is executed 5 times using C4. Interestingly, not all of those runs end successfully, which actually highlights some instability in the tested models.

Overall, for me this has turned into a really useful tool: it’s easy to use and great for building your own library of tested models. There’s no such thing as a perfect model, just like there’s no perfect benchmark (there are already hundreds out there), but it’s always worth trying to find the best model for your specific needs. That’s also where profiles come in.

There were also reported issues with MIMO V2.5: when it switched into “thinking mode,” it would leave traces that led to incorrect outputs being read. I ran into a similar problem with Minimax 2.7, but it’s now been resolved, and both models are currently passing the tests correctly.

Each configuration takes available compute (sparks) into account, so there’s quite a lot of flexibility.

Feel free to test it out, and I’m very open to any constructive feedback.

Very cool, thanks for making this, definitely good to have additional evals to grill local models with.
Noticed an occasional minor bug where the history tab doesn’t show current or previous runs until you close and re-open the TUI.
Few ideas on my wishlist that would be awesome to have:

  • tool call traces to be readable in the TUI
  • average tokens/s calculated from the tool test section also being captured, as sometimes I skip llama-benchy to complete the run quicker
  • not sure if possible, but pulling recipe params would be super neat, so that I could compare how changing different params affects performance…

Thanks again for making it.

In the history tab, you can press the ‘R’ key to refresh the page without having to close the TUI, or you can click on ‘R - Refresh’ in the bottom left corner—it does exactly the same thing.

Thanks for the 3 suggestions you provided; I’ll try to implement as much as I can in the next update. It seems to me that making tool call traces readable in the TUI will be the fastest to implement.

Hi, the tool looks promising. I’m already benefiting from it, thanks.

Have you considered adding an option to control thinking_mode? It would give the flexibility to assess the model’s behavior without needing to relaunch it. Since this option is controllable at runtime (via chat template kwargs), IMHO it is feasible.

It’s not that easy with thinking mode. Right now, thinking output is only passively cleaned up:

toolery/adapters/openai_raw.py:103-119 — If the model spits out <think>...</think> on its own, the adapter strips it out (strip_reasoning_tags) so that the structured-output checks don’t crash on valid responses.
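For readers who haven't opened that file: the general technique is a regex pass over the response text. The snippet below is a stand-in sketch, not the adapter's actual `strip_reasoning_tags` code (which isn't shown in this thread); the `strip_think_blocks` name and the exact pattern are assumptions.

```python
import re

# Non-greedy match so multiple <think> blocks are removed one by one;
# DOTALL lets "." span the newlines inside a reasoning block.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_think_blocks(text: str) -> str:
    """Remove complete <think>...</think> blocks and trim surrounding whitespace."""
    return _THINK_BLOCK.sub("", text).strip()
```

Note that this only removes properly closed blocks; an unterminated `<think>` is left untouched, which is one way leftover traces could still reach a structured-output check.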

The biggest catch: there is no single universal toggle

This is the reason why this isn’t trivial. The method for enabling/disabling thinking varies per model and per backend:

| Backend / Model | How to disable/enable thinking |
|---|---|
| vLLM (Qwen3, etc.) | chat_template_kwargs: {"enable_thinking": false} |
| Qwen3 / some others | /no_think or /think token in the prompt |
| OpenAI-style reasoning (GPT-o, MiniMax-M2) | reasoning_effort: low/medium/high |
| DeepSeek / QwQ | Often impossible to disable: the model always thinks |
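To show why a single toggle doesn't fit, here is a rough sketch of how a request payload would have to be adjusted per mechanism. It is purely illustrative and not something Toolery does today: the `with_thinking` helper and the `style` labels are hypothetical, and mapping on/off to `reasoning_effort` high/low is an assumption (low effort reduces reasoning rather than disabling it).

```python
def with_thinking(payload: dict, style: str, enabled: bool) -> dict:
    """Return a copy of a chat-completions payload with thinking toggled."""
    out = dict(payload)
    if style == "chat_template_kwargs":
        # vLLM-style: forwarded to the model's chat template.
        out["chat_template_kwargs"] = {"enable_thinking": enabled}
    elif style == "prompt_token":
        # Soft switch appended to the last user message.
        token = "/think" if enabled else "/no_think"
        messages = [dict(m) for m in out.get("messages", [])]
        for message in reversed(messages):
            if message.get("role") == "user":
                message["content"] = message["content"] + " " + token
                break
        out["messages"] = messages
    elif style == "reasoning_effort":
        # Assumption: treat "off" as the lowest effort level.
        out["reasoning_effort"] = "high" if enabled else "low"
    elif style == "always_on":
        if not enabled:
            raise ValueError("this model cannot run with thinking disabled")
    else:
        raise ValueError(f"unknown thinking style: {style}")
    return out
```

Each branch touches a different part of the request, so the benchmark would need per-model knowledge of which branch applies before it could expose one clean switch.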

What’s landed since 0.1.0 (now 0.4.1):

  • DGX Spark topology is a first-class ranking axis — the same model+adapter on single / dual / triple / quad / octa appears as separate rows, with prefill + decode t/s tracked per topology, so you can see exactly what extra Sparks buy you.
  • Empirical re-tiering of all 143 scenarios by measured pass-rate across three local models instead of hand-guessed difficulty.
  • --timeout-scale, so long chain-of-thought reasoning models aren’t killed mid-answer; this removed a whole class of false “timeout” failures.
  • Determinism / fairness fixes: stateful mocks with cross-tool gates, grader robustness to output style, a golden_probe passability guard, and repair of every scenario that scored 0% across all models (they were measuring mock bugs, not the model).
  • TUI polish: frozen rank/model columns, difficulty + category filters, and a rock-stable 5 s live refresh (no more horizontal-scroll jumps, jitter, or lost row selection).
  • Easier scenario browsing: the Scenarios tab now has Difficulty and Category filter dropdowns and shows each scenario’s live tier/category/tags, so you can quickly segregate and drill into specific scenarios instead of scrolling the whole 143-entry list.
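For anyone curious what "empirical re-tiering" from the list above can look like in practice, here is a minimal sketch: average each scenario's pass rate across the models and bucket it. The `retier` function and the cutoff values are assumptions chosen for illustration; the real thresholds aren't stated in this thread.

```python
def retier(pass_rates: dict, cutoffs: tuple = (0.85, 0.60, 0.30)) -> dict:
    """Map each scenario to a tier from its mean pass rate across models.

    pass_rates: scenario id -> list of per-model pass rates in [0, 1].
    cutoffs: hypothetical lower bounds for easy, medium, and hard.
    """
    tiers = {}
    for scenario, rates in pass_rates.items():
        mean = sum(rates) / len(rates)
        if mean >= cutoffs[0]:
            tiers[scenario] = "easy"
        elif mean >= cutoffs[1]:
            tiers[scenario] = "medium"
        elif mean >= cutoffs[2]:
            tiers[scenario] = "hard"
        else:
            tiers[scenario] = "very-hard"
    return tiers
```

A scenario that three models pass 100%, 100%, and 80% of the time lands in "easy", while one they pass 20%, 0%, and 40% of the time lands in "very-hard", regardless of what the hand-assigned tier used to be.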

@arctic.gus
  • tool call traces to be readable in the TUI ✅
  • average tokens/s calculated from the tool test section also being captured, as sometimes I skip llama-benchy to complete the run quicker ✅
  • history bug fixed as well ✅

As I’ve said from the very start, the whole point of this tool is to gauge what models can actually do on real tasks: not vibes, not reputation, just measured behavior.

And yes, I know some of you won’t like it, because you’ve got a different opinion and your favorite model sits too low while the one you can’t stand sits too high. That’s fine. One of the core ideas here is exactly that you don’t have to take my word for it: you can run your favorite model yourself, across several quantization variants, and decide for yourself which one is best for you, i.e. which one loses the least.

And since guidelines and benchmarks all love to move the goalposts over time, my Toolery has had an upgrade too.

One more note: GPT-5.5 was run through every scenario only once, hence just 143 trials. Every other model runs each scenario 5 times (5 × 143 = 715 trials) and the score is averaged. Even so, it makes the gap between local LLMs and the ones from the largest providers pretty clear.
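The trial arithmetic is easy to reproduce if you want to sanity-check your own numbers. A minimal sketch, assuming each scenario's trials are recorded as a list of pass/fail booleans; the `summarize` helper and its output keys are hypothetical, not Toolery's result format:

```python
def summarize(results: dict) -> dict:
    """Pool all trials into one score and list scenarios with mixed outcomes."""
    trials = sum(len(outcomes) for outcomes in results.values())
    passed = sum(sum(outcomes) for outcomes in results.values())
    # A scenario is "flaky" if it passed at least once but not every time.
    flaky = sorted(
        name for name, outcomes in results.items()
        if 0 < sum(outcomes) < len(outcomes)
    )
    return {"trials": trials, "score_pct": 100.0 * passed / trials, "flaky": flaky}
```

With 143 scenarios at 5 trials each, `trials` comes out to 715; a single-pass run gives 143 and can never populate `flaky`, which is one reason the single GPT-5.5 pass carries less information per scenario.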

And that gap isn’t just my tool talking, it’s in line with the latest refreshed Artificial Analysis (artificialanalysis.ai) benchmark, same as it was before:

and how it is today:

And one last thing: the charts make it pretty clear that the Chinese models aren’t even close to the top-tier solutions. I’ll also admit I’m a little disappointed by MiniMax-M3: it’s a 2× larger model than MiniMax-M2.7, yet it still trips over some of the tasks in my test.

Just please remember: do your own research first!

very nice, will update mine later on and give it another run, thanks.