Toolery 0.1.0 - a deterministic tool-calling benchmark for local LLMs

Very cool, thanks for making this. It's definitely good to have additional evals to grill local models with.
Noticed an occasional minor bug where the history tab doesn't show current or previous runs until you close and re-open the TUI.
A few ideas on my wishlist that would be awesome to have:

  • tool call traces to be readable in the TUI
  • average tokens/s from the tool test section captured as well, since I sometimes skip llama-benchy to finish the run quicker
  • not sure if it's possible, but pulling recipe params would be super neat, so that I could compare how changing different params affects performance…

Thanks again for making it.