Hey folks!
Long-time lurker, finally got something worth posting. For the last month I’ve been tinkering away on a little home lab for comparing local LLMs — half because I genuinely needed it, half because, well, you know how it goes once a project grabs you.
The result is Toolery, and I’d love for you to take it for a spin. My hope is that it actually earns a spot in your toolkit rather than collecting dust in your bookmarks.
Fair warning, though: like every free tool and every benchmark ever made, it’s not perfect — but it can hand you the answers you’ve been hunting for. So grab a coffee, point it at your endpoint, and let’s see what your rig is really made of.
If you’ve spent any real time building out a homelab, you know the loop. There’s a box humming in the closet, maybe a node or two, vLLM or llama.cpp serving a model on port 8000. You pull a fresh quant, swap the served model, nudge a sampler setting, and then… what? You fire off a handful of prompts, it feels about right, and you move on. That gut-feel step is the weakest link in the whole setup, and it’s exactly the gap I built Toolery to close.
Toolery is a deterministic benchmark for LLM tool-calling. It runs a model through 143 hand-written scenarios across four difficulty tiers — 40 easy, 45 medium, 34 hard, 24 very-hard — and scores every run with plain assertions over the tool calls, arguments, and final text. No model-as-judge anywhere in the pipeline, which means two things that matter for a home setup: it costs nothing to run, and the same run gives you the same number every time. You can diff results instead of squinting at them.
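To make "plain assertions" concrete, here's a minimal illustrative sketch of what assertion-based scoring can look like. It is not lifted from the Toolery codebase: the `ToolCall` and `Transcript` types, the `get_weather` tool, and the scenario itself are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class Transcript:
    # Hypothetical record of one model run: the tool calls it made and its final answer.
    tool_calls: list = field(default_factory=list)
    final_text: str = ""


def score_weather_scenario(t: Transcript) -> bool:
    """Hypothetical scenario: 'What's the weather in Oslo, in celsius?'"""
    if len(t.tool_calls) != 1:
        return False  # exactly one tool call expected, no extras
    call = t.tool_calls[0]
    checks = [
        call.name == "get_weather",                             # right tool
        call.arguments == {"city": "Oslo", "unit": "celsius"},  # exact arguments
        "oslo" in t.final_text.lower(),                         # answer mentions the city
    ]
    return all(checks)
```

Every check is a plain boolean over the recorded transcript, so scoring the same transcript twice can't produce two different answers, and nothing in the scoring step needs a second model.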
It’s built for people serving their own models. Point it at any OpenAI-compatible endpoint and it works, but the whole thing is shaped around local serving — vLLM, llama.cpp, SGLang — and it’s cluster-aware, so the same model on a single node versus a multi-node topology gets tracked as a separate config rather than smeared together. There’s a terminal dashboard (TUI) that probes your common ports, finds the live endpoint, launches the run in the background, and shows progress as it goes. You don’t leave the console.
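If you're curious what "probes your common ports" boils down to, here's a rough sketch of the idea using only the standard library. The port list is my guess at typical defaults for local servers, and `find_open_ports` is a made-up helper for illustration, not what the TUI actually runs.

```python
import socket

# Assumption: typical default ports for local inference servers; adjust for your setup.
COMMON_PORTS = (8000, 8080, 30000)


def find_open_ports(host="127.0.0.1", ports=COMMON_PORTS, timeout=0.25):
    """Return the ports on `host` that accept a TCP connection."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception.
            if s.connect_ex((host, port)) == 0:
                open_ports.append(port)
    return open_ports
```

An open port only tells you something is listening. A real probe would presumably follow up with a request to the server's model-listing route (usually `/v1/models` on OpenAI-compatible servers) to confirm it's actually an LLM endpoint.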
The part I find most useful is the capability matrix. Instead of one blurry “score,” you get separate columns for things like agentic planning, error recovery, parameter precision, state tracking, instruction following, restraint, and calibration/hallucination. That’s where you actually learn something — a model can look great overall and still fall apart on, say, recovering from a bad tool result, which is the kind of thing that quietly wrecks an agent in production.
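Here's a small sketch of why the per-capability breakdown matters. The input shape (a list of `(capability, passed)` pairs) and the numbers are invented for illustration and are not Toolery's real result format.

```python
from collections import defaultdict


def capability_matrix(results):
    """Turn (capability, passed) pairs into {capability: pass rate in percent}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for capability, passed in results:
        totals[capability] += 1
        passes[capability] += int(passed)
    return {cap: 100.0 * passes[cap] / totals[cap] for cap in totals}


# Made-up results: strong everywhere except recovering from bad tool results.
results = (
    [("agentic_planning", True)] * 4
    + [("parameter_precision", True)] * 4
    + [("error_recovery", True)] + [("error_recovery", False)] * 3
)

overall = 100.0 * sum(passed for _, passed in results) / len(results)
matrix = capability_matrix(results)
```

With this toy data the overall pass rate is 75%, which looks respectable, while `matrix["error_recovery"]` sits at 25%. A single score hides exactly that kind of hole.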
One thing worth saying plainly: treat the numbers as a compass, not a verdict. 143 scenarios is a solid signal, but it isn’t your workload, your tools, or your latency budget. The README itself flags that gaps under ~2 percentage points are noise. Use Toolery to shortlist candidates, catch a regression after you change a quant, or sanity-check a new model before you trust it — not to crown a single “best” model and stop thinking. The real test is still your own prompts running against your own tools.
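One way to apply the ~2-point rule when diffing two runs is sketched below. The dict shape, the `compare_runs` helper, and treating the threshold as a hard cutoff are my assumptions for the example. Nothing here is read from Toolery's output files.

```python
NOISE_FLOOR_PP = 2.0  # gaps under ~2 percentage points are treated as noise


def compare_runs(before, after, noise_floor=NOISE_FLOOR_PP):
    """Label each capability shared by both runs as noise, regression, or improvement."""
    verdicts = {}
    for cap in sorted(before.keys() & after.keys()):
        delta = after[cap] - before[cap]
        if abs(delta) < noise_floor:
            verdicts[cap] = "noise"
        elif delta < 0:
            verdicts[cap] = "regression"
        else:
            verdicts[cap] = "improvement"
    return verdicts


# Made-up numbers: same model before and after swapping to a smaller quant.
before = {"error_recovery": 62.0, "restraint": 80.0, "state_tracking": 71.0}
after = {"error_recovery": 55.0, "restraint": 81.0, "state_tracking": 75.0}
verdicts = compare_runs(before, after)
```

Here the 1-point bump in `restraint` gets labeled noise, while the 7-point drop in `error_recovery` is flagged as a regression worth a closer look.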
Getting started is short:
git clone https://github.com/karolpalys/toolery.git
cd toolery
uv sync
export TOOLERY_BASE_URL=http://localhost:8000
uv run toolery run --model my-model --adapter raw --tier easy --trials 3
uv run toolery tui
That last line opens the dashboard once you’ve got a run or two recorded.
It’s MIT-licensed and on GitHub: https://github.com/karolpalys/toolery — happy to hear what breaks, what’s confusing, or what scenarios you’d want added. If you’re already serving models at home, I think you’ll get something out of it.
I’m wide open to any feedback. Just a heads-up that my bandwidth’s going to be a bit thin for a while, so be gentle if I’m slow to reply. I’m absolutely planning to keep developing this, but what’s already in there should be useful as is.
One last thing worth pointing you at: the Profiles tab. That’s where it gets genuinely handy: instead of staring at a wall of numbers, you can pick a local LLM based on what you actually need it for. It rebalances the weighting across capabilities depending on the job, so a model tuned for agentic coding bubbles up for one profile and a tidy structured-output workhorse wins for another. Saves you the mental gymnastics of deciding which columns matter today.
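The idea behind profile weighting can be sketched as a weighted average over capability scores. The profile names, weights, and model scores below are invented for illustration. The real profiles and their weights may look quite different.

```python
def profile_score(capabilities, weights):
    """Weighted average of capability scores; `weights` maps capability -> weight."""
    total_weight = sum(weights.values())
    return sum(capabilities[cap] * w for cap, w in weights.items()) / total_weight


# Made-up capability scores (percent) for two hypothetical models.
model_a = {"agentic_planning": 90.0, "instruction_following": 70.0}
model_b = {"agentic_planning": 70.0, "instruction_following": 90.0}

# Made-up profiles: each one just shifts which capability counts more.
agentic_coding = {"agentic_planning": 3, "instruction_following": 1}
structured_output = {"agentic_planning": 1, "instruction_following": 3}

coding_winner = max(("model_a", model_a), ("model_b", model_b),
                    key=lambda m: profile_score(m[1], agentic_coding))[0]
structured_winner = max(("model_a", model_a), ("model_b", model_b),
                        key=lambda m: profile_score(m[1], structured_output))[0]
```

Same two models, same raw scores, but `model_a` wins the coding-style profile (85 vs 75) and `model_b` wins the structured-output one. That flip is the whole point of picking a profile first.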
That’s it from me. Go break it, and let me know how it goes.







