New tool: llama-benchy - llama-bench style benchmarking for ANY LLM backend (vLLM, SGLang, llama.cpp, etc.)

Why I built this

I’ve been happily using llama-bench to benchmark the performance of local models running in llama.cpp. One great feature is that it can evaluate performance at different context lengths and present the output in an easy-to-digest table format.

However, llama.cpp is not the only inference engine I use; I also use SGLang and vLLM. llama-bench only works with llama.cpp, and the other benchmarking tools I found are more focused on concurrency and total throughput.

Also, llama-bench performs its measurements against the C++ engine directly, which is not representative of the end-user experience, and the two can be quite different in practice.

vLLM has its own powerful benchmarking tool, and while it can be used with other inference engines, it has a few issues:

  • You can’t easily measure how prompt processing speed degrades as context grows. You can use vllm bench sweep serve, but it only works well against vLLM itself with prefix caching disabled on the server. Even with random prompts, it reuses the same prompt across multiple runs, which hits the cache in llama-server, for instance, so you get very low median TTFT times and unrealistically high prompt processing speeds.
  • The TTFT it measures is not actually the time to the first usable token; it’s the time to the very first data chunk from the server, which may not contain any generated tokens in /v1/chat/completions mode (see the sketch after this list).
  • The random dataset is the only one that lets you specify an arbitrary number of tokens, but a randomly generated token sequence doesn’t let you adequately measure speculative decoding/MTP.
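
To make the TTFT point concrete, here is a minimal sketch (not llama-benchy's actual implementation) of the difference between "time to first chunk" and "time to first usable token" when streaming from an OpenAI-compatible endpoint; the URL, API key, and model name are placeholders:

```python
# Sketch: time-to-first-chunk vs time-to-first-content-token on a streaming
# /v1/chat/completions request. The first SSE chunk often carries only the
# assistant role (delta.content is None), so timing the first chunk can
# understate the real TTFT. URL/model below are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

start = time.perf_counter()
first_chunk = first_token = None
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    now = time.perf_counter()
    if first_chunk is None:
        first_chunk = now - start          # what "first data chunk" TTFT measures
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta and first_token is None:
        first_token = now - start          # time to the first usable token
        break

print(f"first chunk: {first_chunk:.3f}s, first content token: {first_token:.3f}s")
```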

As of today, I haven’t been able to find any existing benchmarking tool that brings llama-bench style measurements at different context lengths to any OpenAI-compatible endpoint.

What is llama-benchy?

It’s a CLI benchmarking tool that:

  • Measures Prompt Processing (pp) and Token Generation (tg) speeds at different context lengths.
  • Lets you benchmark the context prefill and the follow-up prompt separately.
  • Reports additional metrics, such as time to first response, estimated prompt processing time, and end-to-end time to first token.

It works with any OpenAI-compatible endpoint that exposes /v1/chat/completions and also:

  • Supports configurable prompt length (--pp), generation length (--tg), and context depth (--depth).
  • Can run multiple iterations (--runs) and report mean ± std.
  • Uses HuggingFace tokenizers for accurate token counts.
  • Downloads a book from Project Gutenberg to use as source text for prompts, which makes benchmarking of speculative decoding/MTP models more realistic (see the sketch after this list).
  • Supports executing a command after each run (e.g., to clear cache).
  • Offers a configurable latency measurement mode to estimate server/network overhead and provide more accurate prompt processing numbers.
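
To illustrate the tokenizer and Project Gutenberg points above, here is a minimal sketch of the general idea (not llama-benchy's actual code); the file name is a placeholder, and gpt2 is used only because it is a public tokenizer (and the fallback mentioned further down the thread):

```python
# Sketch of the general idea: trim natural text (e.g. a Project Gutenberg book)
# to an exact token budget with a HuggingFace tokenizer, instead of sampling
# random token IDs. Not llama-benchy's actual code; the file name is a placeholder.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # substitute your model's tokenizer
text = open("gutenberg_book.txt", encoding="utf-8").read()

ids = tok(text, add_special_tokens=False)["input_ids"][:2048]   # a pp2048-style budget
prompt = tok.decode(ids)

# Re-encoding after decode can drift by a token or two, so the count is ~2048.
print(len(tok(prompt, add_special_tokens=False)["input_ids"]))
```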

Quick Demo

Benchmarking MiniMax M2.1 AWQ running on my dual Spark cluster with up to 100,000 tokens of context:

# Run without installation
uvx llama-benchy \
  --base-url http://spark:8888/v1 \
  --model cyankiwi/MiniMax-M2.1-AWQ-4bit \
  --depth 0 4096 8192 16384 32768 65535 100000 \
  --adapt-prompt \
  --latency-mode generation \
  --enable-prefix-caching

Output:

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| cyankiwi/MiniMax-M2.1-AWQ-4bit | pp2048 | 3544.10 ± 37.29 | 688.41 ± 6.09 | 577.93 ± 6.09 | 688.45 ± 6.10 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | tg32 | 36.11 ± 0.06 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_pp @ d4096 | 3150.63 ± 7.84 | 1410.55 ± 3.24 | 1300.06 ± 3.24 | 1410.58 ± 3.24 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_tg @ d4096 | 34.36 ± 0.08 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | pp2048 @ d4096 | 2562.47 ± 21.71 | 909.77 ± 6.75 | 799.29 ± 6.75 | 909.81 ± 6.75 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | tg32 @ d4096 | 33.41 ± 0.05 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_pp @ d8192 | 2832.52 ± 12.34 | 3002.66 ± 12.57 | 2892.18 ± 12.57 | 3002.70 ± 12.57 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_tg @ d8192 | 31.38 ± 0.06 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | pp2048 @ d8192 | 2261.83 ± 10.69 | 1015.96 ± 4.29 | 905.48 ± 4.29 | 1016.00 ± 4.29 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | tg32 @ d8192 | 30.55 ± 0.08 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_pp @ d16384 | 2473.70 ± 2.15 | 6733.76 ± 5.76 | 6623.28 ± 5.76 | 6733.80 ± 5.75 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_tg @ d16384 | 27.89 ± 0.04 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | pp2048 @ d16384 | 1824.55 ± 6.32 | 1232.96 ± 3.89 | 1122.48 ± 3.89 | 1233.00 ± 3.89 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | tg32 @ d16384 | 27.21 ± 0.04 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_pp @ d32768 | 2011.11 ± 2.40 | 16403.98 ± 19.43 | 16293.50 ± 19.43 | 16404.03 ± 19.43 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_tg @ d32768 | 22.09 ± 0.07 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | pp2048 @ d32768 | 1323.21 ± 4.62 | 1658.25 ± 5.41 | 1547.77 ± 5.41 | 1658.29 ± 5.41 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | tg32 @ d32768 | 21.81 ± 0.07 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_pp @ d65535 | 1457.71 ± 0.26 | 45067.98 ± 7.94 | 44957.50 ± 7.94 | 45068.01 ± 7.94 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_tg @ d65535 | 15.72 ± 0.04 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | pp2048 @ d65535 | 840.36 ± 2.35 | 2547.54 ± 6.79 | 2437.06 ± 6.79 | 2547.60 ± 6.80 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | tg32 @ d65535 | 15.63 ± 0.02 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_pp @ d100000 | 1130.05 ± 1.89 | 88602.31 ± 148.70 | 88491.83 ± 148.70 | 88602.37 ± 148.70 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | ctx_tg @ d100000 | 12.14 ± 0.02 | | | |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | pp2048 @ d100000 | 611.01 ± 2.50 | 3462.39 ± 13.73 | 3351.90 ± 13.73 | 3462.42 ± 13.73 |
| cyankiwi/MiniMax-M2.1-AWQ-4bit | tg32 @ d100000 | 12.05 ± 0.03 | | | |

llama-benchy (0.1.0)
date: 2026-01-06 11:44:49 | latency mode: generation
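
A note on reading the table, based on my own back-of-the-envelope check (assuming ttfr is the time to first response and est_ppt is the estimated prompt processing time after subtracting the measured server/network overhead): the reported pp speed is consistent with being derived from est_ppt rather than the raw ttfr. For the first pp2048 row:

```python
# Values copied from the pp2048 row above.
ttfr_ms, est_ppt_ms = 688.41, 577.93
overhead_ms = ttfr_ms - est_ppt_ms        # ~110.5 ms of estimated server/network overhead
pp_tps = 2048 / (est_ppt_ms / 1000)       # ~3543.7 t/s, consistent with the reported 3544.10 ± 37.29
print(f"{overhead_ms:.2f} ms, {pp_tps:.1f} t/s")
```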

GitHub

https://github.com/eugr/llama-benchy


Thanks for one more amazing contribution @eugr

New major update is out - v0.2.0.

It brings:

  • Concurrency support.
  • JSON and CSV outputs; the JSON output contains per-run values in addition to mean/std values.
  • The ability to save results to a file.

Example of concurrency testing. This is for a model running on a single Spark with the following server parameters:

  • gpu_memory_utilization: 0.7
  • max_model_len: 202752
  • max_num_batched_tokens: 4096
  • max_num_seqs: 64

llama-benchy \
  --base-url http://spark:8888/v1 \
  --model cyankiwi/GLM-4.7-Flash-AWQ-4bit \
  --served-model-name glm-4.7-flash \
  --depth 0 4096 \
  --adapt-prompt \
  --concurrency 1 2 10

Maximum supported concurrency is:

GPU KV cache size: 1,239,088 tokens
Maximum concurrency for 202,752 tokens per request: 6.11x

| model | test | t/s (total) | t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | pp2048 (c1) | 5326.13 ± 38.60 | 5326.13 ± 38.60 | 336.01 ± 5.86 | 331.75 ± 5.86 | 336.14 ± 5.82 |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | tg32 (c1) | 41.75 ± 0.03 | 41.75 ± 0.03 | | | |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | pp2048 (c2) | 5772.72 ± 310.43 | 2953.01 ± 141.62 | 632.95 ± 22.99 | 628.68 ± 22.99 | 633.00 ± 22.99 |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | tg32 (c2) | 73.74 ± 1.31 | 37.38 ± 0.63 | | | |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | pp2048 (c10) | 6118.58 ± 28.10 | 897.80 ± 400.77 | 2336.27 ± 705.73 | 2332.00 ± 705.73 | 2336.31 ± 705.72 |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | tg32 (c10) | 87.65 ± 2.76 | 15.33 ± 4.11 | | | |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | pp2048 @ d4096 (c1) | 5066.12 ± 9.45 | 5066.12 ± 9.45 | 1097.99 ± 18.53 | 1093.72 ± 18.53 | 1098.06 ± 18.53 |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | tg32 @ d4096 (c1) | 39.27 ± 0.11 | 39.27 ± 0.11 | | | |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | pp2048 @ d4096 (c2) | 5274.22 ± 15.83 | 2665.63 ± 38.23 | 2042.41 ± 37.06 | 2038.14 ± 37.06 | 2042.50 ± 37.06 |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | tg32 @ d4096 (c2) | 68.02 ± 0.32 | 34.73 ± 0.66 | | | |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | pp2048 @ d4096 (c10) | 5217.03 ± 15.20 | 1051.40 ± 608.75 | 6666.76 ± 2782.40 | 6662.49 ± 2782.40 | 6666.81 ± 2782.39 |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit | tg32 @ d4096 (c10) | 31.33 ± 0.21 | 8.14 ± 5.03 | | | |

llama-benchy (0.2.0)
date: 2026-02-05 15:49:22 | latency mode: api

Please note that t/s (total) is measured over the entire concurrent batch run and does not indicate peak throughput. Peak throughput will be added in the next release.
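
For what it's worth, here is a quick consistency check on the tg rows above (my reading, assuming t/s (req) is the per-request rate averaged across requests and t/s (total) is total tokens over the batch wall time):

```python
# Values copied from the tg32 rows above (depth 0).
total_c2, per_req_c2 = 73.74, 37.38
print(total_c2 / 2)      # ~36.9 t/s: at c2 the per-request rate is roughly total/concurrency

total_c10, per_req_c10 = 87.65, 15.33
print(total_c10 / 10)    # ~8.8 t/s, well below the 15.33 ± 4.11 per-request average:
# at c10 the requests finish at very different times, so the two figures diverge,
# which is another reason not to read t/s (total) as peak throughput.
```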


Very nice!

BTW are you familiar with aiperf?

I've been playing around with it lately. Maybe a source of inspiration.

Yes, I’ve looked at it, as well as a few others, but it’s closer to vllm bench serve: lots of functionality. I wanted to create something like llama-bench, but for all backends, so we could see the effect of context on follow-up requests, and with simple, easy-to-digest output.

Thank you for the share, eugr. I’m curious whether the oss-120b benchmark results on GitHub were obtained on a single DGX Spark setup.

It’s a mix. Most were dual Sparks, with the exception of the concurrency example, which was on a single Spark. The numbers for dual Sparks are lower than they should be; I was running some other stuff on those Sparks at the time, and it slowed things down a bit. On a good day I see up to 77-78 t/s on single requests with low context.

Feel free to check the latest numbers for some models here as well.

PSA: llama-benchy will now try to infer the HF model name (and served-model-name, if applicable) from the endpoint when the --model parameter is not specified. This works with vLLM and SGLang.

It also works with llama.cpp if the server was launched with the -hf parameter (passing an HF model name), but depending on the GGUF model repository, it may not be able to load a tokenizer. In that case you can pass an HF model name for a non-GGUF repository that contains a tokenizer. It will still work, but it may report slightly incorrect results depending on how different that tokenizer is from the default gpt2 one.
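
To make that last point concrete: token counts (and therefore the reported t/s) depend on the tokenizer, so a mismatched tokenizer shifts the numbers. A small illustration (the model names are arbitrary public examples, not a recommendation):

```python
# The same text encodes to different token counts under different tokenizers,
# which is why a mismatched tokenizer skews token-based speed numbers.
from transformers import AutoTokenizer

text = "Benchmarking local LLMs at different context depths with llama-benchy."
for name in ("gpt2", "Qwen/Qwen2.5-0.5B-Instruct"):   # arbitrary public tokenizers
    tok = AutoTokenizer.from_pretrained(name)
    n = len(tok(text, add_special_tokens=False)["input_ids"])
    print(f"{name}: {n} tokens")
```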

thanks for doing what you are doing @eugr , more helpful than nvidia staff - hope they send you some sparks for free as a thank you at least
