New tool: llama-benchy - llama-bench style benchmarking for ANY LLM backend (vLLM, SGLang, llama.cpp, etc.)

Why I built this

I’ve been happily using llama-bench to benchmark the performance of local models running in llama.cpp. One great feature is that it can evaluate performance at different context lengths and present the output in an easy-to-digest table format.

However, llama.cpp is not the only inference engine I use; I also use SGLang and vLLM. llama-bench only works with llama.cpp, and the other benchmarking tools I found are more focused on concurrency and total throughput.

Also, llama-bench performs its measurements against the C++ engine directly, which is not representative of the end-user experience, and the two can be quite different in practice.

vLLM has its own powerful benchmarking tool, and while it can be used with other inference engines, it has a few issues:

  • You can’t easily measure how prompt processing speed degrades as context grows. You can use vllm bench sweep serve, but it only works well against vLLM itself with prefix caching disabled on the server. Even with random prompts, it reuses the same prompt across multiple runs, which hits the cache in llama-server, for instance, so you get very low median TTFT times and unrealistically high prompt processing speeds.
  • The TTFT it measures is not actually the time to the first usable token; it’s the time to the very first data chunk from the server, which may not contain any generated tokens in /v1/chat/completions mode (see the sketch after this list).
  • The random dataset is the only one that lets you specify an arbitrary number of tokens, but a randomly generated token sequence doesn’t let you adequately measure speculative decoding/MTP.
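
To make the TTFT point concrete, here is a minimal sketch (not llama-benchy's actual implementation) of the difference between "time to first chunk" and "time to first usable token" when streaming from an OpenAI-compatible endpoint; the URL, API key, and model name are placeholders:

```python
# Sketch: time-to-first-chunk vs time-to-first-content-token on a streaming
# /v1/chat/completions request. The first SSE chunk often carries only the
# assistant role (delta.content is None), so timing the first chunk can
# understate the real TTFT. URL/model below are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

start = time.perf_counter()
first_chunk = first_token = None
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    now = time.perf_counter()
    if first_chunk is None:
        first_chunk = now - start          # what "first data chunk" TTFT measures
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta and first_token is None:
        first_token = now - start          # time to the first usable token
        break

print(f"first chunk: {first_chunk:.3f}s, first content token: {first_token:.3f}s")
```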

As of today, I haven’t been able to find any existing benchmarking tool that brings llama-bench style measurements at different context lengths to any OpenAI-compatible endpoint.

What is llama-benchy?

It’s a CLI benchmarking tool that:

  • Measures Prompt Processing (pp) and Token Generation (tg) speeds at different context lengths.
  • Lets you benchmark the context prefill and the follow-up prompt separately.
  • Reports additional metrics, such as time to first response, estimated prompt processing time, and end-to-end time to first token.

It works with any OpenAI-compatible endpoint that exposes /v1/chat/completions and also:

  • Supports configurable prompt length (--pp), generation length (--tg), and context depth (--depth).
  • Can run multiple iterations (--runs) and report mean ± std.
  • Uses HuggingFace tokenizers for accurate token counts.
  • Downloads a book from Project Gutenberg to use as source text for prompts, which makes benchmarking of speculative decoding/MTP models more realistic (see the sketch after this list).
  • Supports executing a command after each run (e.g., to clear cache).
  • Offers a configurable latency measurement mode to estimate server/network overhead and provide more accurate prompt processing numbers.
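
To illustrate the tokenizer and Project Gutenberg points above, here is a minimal sketch of the general idea (not llama-benchy's actual code); the file name is a placeholder, and gpt2 is used only because it is a public tokenizer (and the fallback mentioned further down the thread):

```python
# Sketch of the general idea: trim natural text (e.g. a Project Gutenberg book)
# to an exact token budget with a HuggingFace tokenizer, instead of sampling
# random token IDs. Not llama-benchy's actual code; the file name is a placeholder.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # substitute your model's tokenizer
text = open("gutenberg_book.txt", encoding="utf-8").read()

ids = tok(text, add_special_tokens=False)["input_ids"][:2048]   # a pp2048-style budget
prompt = tok.decode(ids)

# Re-encoding after decode can drift by a token or two, so the count is ~2048.
print(len(tok(prompt, add_special_tokens=False)["input_ids"]))
```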

Quick Demo

Benchmarking MiniMax M2.1 AWQ running on my dual Spark cluster with up to 100,000 tokens of context:

# Run without installation
uvx llama-benchy \
  --base-url http://spark:8888/v1 \
  --model cyankiwi/MiniMax-M2.1-AWQ-4bit \
  --depth 0 4096 8192 16384 32768 65535 100000 \
  --adapt-prompt \
  --latency-mode generation \
  --enable-prefix-caching

Output:

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| cyankiwi/MiniMax-M2.1-AWQ-4bit | pp2048 | 3544.10 ± 37.29 | 688.41 ± 6.09 | 577.93 ± 6.09 | 688.45 ± 6.10 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | tg32 | 36.11 ± 0.06 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_pp @ d4096 | 3150.63 ± 7.84 | 1410.55 ± 3.24 | 1300.06 ± 3.24 | 1410.58 ± 3.24 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_tg @ d4096 | 34.36 ± 0.08 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | pp2048 @ d4096 | 2562.47 ± 21.71 | 909.77 ± 6.75 | 799.29 ± 6.75 | 909.81 ± 6.75 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | tg32 @ d4096 | 33.41 ± 0.05 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_pp @ d8192 | 2832.52 ± 12.34 | 3002.66 ± 12.57 | 2892.18 ± 12.57 | 3002.70 ± 12.57 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_tg @ d8192 | 31.38 ± 0.06 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | pp2048 @ d8192 | 2261.83 ± 10.69 | 1015.96 ± 4.29 | 905.48 ± 4.29 | 1016.00 ± 4.29 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | tg32 @ d8192 | 30.55 ± 0.08 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_pp @ d16384 | 2473.70 ± 2.15 | 6733.76 ± 5.76 | 6623.28 ± 5.76 | 6733.80 ± 5.75 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_tg @ d16384 | 27.89 ± 0.04 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | pp2048 @ d16384 | 1824.55 ± 6.32 | 1232.96 ± 3.89 | 1122.48 ± 3.89 | 1233.00 ± 3.89 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | tg32 @ d16384 | 27.21 ± 0.04 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_pp @ d32768 | 2011.11 ± 2.40 | 16403.98 ± 19.43 | 16293.50 ± 19.43 | 16404.03 ± 19.43 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_tg @ d32768 | 22.09 ± 0.07 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | pp2048 @ d32768 | 1323.21 ± 4.62 | 1658.25 ± 5.41 | 1547.77 ± 5.41 | 1658.29 ± 5.41 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | tg32 @ d32768 | 21.81 ± 0.07 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_pp @ d65535 | 1457.71 ± 0.26 | 45067.98 ± 7.94 | 44957.50 ± 7.94 | 45068.01 ± 7.94 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_tg @ d65535 | 15.72 ± 0.04 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | pp2048 @ d65535 | 840.36 ± 2.35 | 2547.54 ± 6.79 | 2437.06 ± 6.79 | 2547.60 ± 6.80 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | tg32 @ d65535 | 15.63 ± 0.02 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_pp @ d100000 | 1130.05 ± 1.89 | 88602.31 ± 148.70 | 88491.83 ± 148.70 | 88602.37 ± 148.70 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_tg @ d100000 | 12.14 ± 0.02 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | pp2048 @ d100000 | 611.01 ± 2.50 | 3462.39 ± 13.73 | 3351.90 ± 13.73 | 3462.42 ± 13.73 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | tg32 @ d100000 | 12.05 ± 0.03 | | | |

llama-benchy (0.1.0)
date: 2026-01-06 11:44:49 | latency mode: generation
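
A note on reading the table, based on my own back-of-the-envelope check (assuming ttfr is the time to first response and est_ppt is the estimated prompt processing time after subtracting the measured server/network overhead): the reported pp speed is consistent with being derived from est_ppt rather than the raw ttfr. For the first pp2048 row:

```python
# Values copied from the pp2048 row above.
ttfr_ms, est_ppt_ms = 688.41, 577.93
overhead_ms = ttfr_ms - est_ppt_ms        # ~110.5 ms of estimated server/network overhead
pp_tps = 2048 / (est_ppt_ms / 1000)       # ~3543.7 t/s, consistent with the reported 3544.10 ± 37.29
print(f"{overhead_ms:.2f} ms, {pp_tps:.1f} t/s")
```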

GitHub

https://github.com/eugr/llama-benchy


Thanks for one more amazing contribution @eugr

New major update is out - v0.2.0.

It brings:

  • Concurrency support.
  • JSON and CSV outputs; the JSON output contains per-run values in addition to mean/std values.
  • The ability to save results to a file.

Example of concurrency testing. This is for a model running on a single Spark with the following server parameters:

  • gpu_memory_utilization: 0.7
  • max_model_len: 202752
  • max_num_batched_tokens: 4096
  • max_num_seqs: 64

llama-benchy \
  --base-url http://spark:8888/v1 \
  --model cyankiwi/GLM-4.7-Flash-AWQ-4bit \
  --served-model-name glm-4.7-flash \
  --depth 0 4096 \
  --adapt-prompt \
  --concurrency 1 2 10

Maximum supported concurrency is:

GPU KV cache size: 1,239,088 tokens
Maximum concurrency for 202,752 tokens per request: 6.11x

| model | test | t/s (total) | t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | pp2048 (c1) | 5326.13 ± 38.60 | 5326.13 ± 38.60 | 336.01 ± 5.86 | 331.75 ± 5.86 | 336.14 ± 5.82 |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | tg32 (c1) | 41.75 ± 0.03 | 41.75 ± 0.03 | | | |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | pp2048 (c2) | 5772.72 ± 310.43 | 2953.01 ± 141.62 | 632.95 ± 22.99 | 628.68 ± 22.99 | 633.00 ± 22.99 |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | tg32 (c2) | 73.74 ± 1.31 | 37.38 ± 0.63 | | | |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | pp2048 (c10) | 6118.58 ± 28.10 | 897.80 ± 400.77 | 2336.27 ± 705.73 | 2332.00 ± 705.73 | 2336.31 ± 705.72 |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | tg32 (c10) | 87.65 ± 2.76 | 15.33 ± 4.11 | | | |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | pp2048 @ d4096 (c1) | 5066.12 ± 9.45 | 5066.12 ± 9.45 | 1097.99 ± 18.53 | 1093.72 ± 18.53 | 1098.06 ± 18.53 |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | tg32 @ d4096 (c1) | 39.27 ± 0.11 | 39.27 ± 0.11 | | | |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | pp2048 @ d4096 (c2) | 5274.22 ± 15.83 | 2665.63 ± 38.23 | 2042.41 ± 37.06 | 2038.14 ± 37.06 | 2042.50 ± 37.06 |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | tg32 @ d4096 (c2) | 68.02 ± 0.32 | 34.73 ± 0.66 | | | |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | pp2048 @ d4096 (c10) | 5217.03 ± 15.20 | 1051.40 ± 608.75 | 6666.76 ± 2782.40 | 6662.49 ± 2782.40 | 6666.81 ± 2782.39 |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | tg32 @ d4096 (c10) | 31.33 ± 0.21 | 8.14 ± 5.03 | | | |

llama-benchy (0.2.0)
date: 2026-02-05 15:49:22 | latency mode: api

Please note that t/s (total) is measured over the entire concurrent batch run and does not indicate peak throughput. Peak throughput will be added in the next release.
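
For what it's worth, here is a quick consistency check on the tg rows above (my reading, assuming t/s (req) is the per-request rate averaged across requests and t/s (total) is total tokens over the batch wall time):

```python
# Values copied from the tg32 rows above (depth 0).
total_c2, per_req_c2 = 73.74, 37.38
print(total_c2 / 2)      # ~36.9 t/s: at c2 the per-request rate is roughly total/concurrency

total_c10, per_req_c10 = 87.65, 15.33
print(total_c10 / 10)    # ~8.8 t/s, well below the 15.33 ± 4.11 per-request average:
# at c10 the requests finish at very different times, so the two figures diverge,
# which is another reason not to read t/s (total) as peak throughput.
```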


Very nice!

BTW are you familiar with aiperf?

I've been playing around with it lately. Maybe a source of inspiration.

Yes, I’ve looked at it, as well as a few others, but it’s closer to vllm bench serve: lots of functionality. I wanted to create something like llama-bench, but for all backends, so we could see the effect of context on follow-up requests, and with simple, easy-to-digest output.

Thank you for the share, eugr. I’m curious whether the oss-120b benchmark results on GitHub were obtained on a single DGX Spark setup.

It’s a mix. Most were dual Sparks, with the exception of the concurrency example, which was on a single Spark. The numbers for dual Sparks are lower than they should be; I was running some other stuff on those Sparks at the time, and it slowed things down a bit. On a good day I see up to 77-78 t/s on single requests with low context.

Feel free to check the latest numbers for some models here as well.

PSA: llama-benchy will now try to infer the HF model name (and served-model-name, if applicable) from the endpoint when the --model parameter is not specified. This works with vLLM and SGLang.

It also works with llama.cpp if the server was launched with the -hf parameter (passing an HF model name), but depending on the GGUF model repository, it may not be able to load a tokenizer. In that case you can pass an HF model name for a non-GGUF repository that contains a tokenizer. It will still work, but it may report slightly incorrect results depending on how different that tokenizer is from the default gpt2 one.
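
To make that last point concrete: token counts (and therefore the reported t/s) depend on the tokenizer, so a mismatched tokenizer shifts the numbers. A small illustration (the model names are arbitrary public examples, not a recommendation):

```python
# The same text encodes to different token counts under different tokenizers,
# which is why a mismatched tokenizer skews token-based speed numbers.
from transformers import AutoTokenizer

text = "Benchmarking local LLMs at different context depths with llama-benchy."
for name in ("gpt2", "Qwen/Qwen2.5-0.5B-Instruct"):   # arbitrary public tokenizers
    tok = AutoTokenizer.from_pretrained(name)
    n = len(tok(text, add_special_tokens=False)["input_ids"])
    print(f"{name}: {n} tokens")
```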

thanks for doing what you are doing @eugr , more helpful than nvidia staff - hope they send you some sparks for free as a thank you at least
