Looks like we’re about to get some extra PP performance in llama.cpp via this PR https://github.com/ggml-org/llama.cpp/pull/17906.
Nice boost! Which compilation flag did you use?
-DCMAKE_CUDA_ARCHITECTURES="120f"?
I had to specify -DCMAKE_CUDA_ARCHITECTURES=121a-real
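For context, the full configure/build looks roughly like this (swap in whichever architecture value your CUDA toolkit accepts):

```bash
# Rough build sketch for the Spark's GB10 GPU; the architecture value may need
# adjusting (e.g. 121a-real vs. "120f") depending on your CUDA toolkit version.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real
cmake --build build --config Release -j
```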
Getting a very nice boost on gpt-oss-120b:
December 19, 2025
Spark:
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 1937.39 ± 9.65 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 59.05 ± 0.13 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 1842.81 ± 2.61 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 55.90 ± 0.45 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 1761.17 ± 5.50 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 53.14 ± 0.62 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1555.17 ± 5.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 49.49 ± 0.30 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1269.19 ± 4.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 42.87 ± 0.18 |
build: 74e05131e (7486)
December 24, 2025
Spark:
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 2438.11 ± 13.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 57.81 ± 0.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 2294.32 ± 12.61 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 54.68 ± 0.52 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 2149.21 ± 8.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 51.75 ± 0.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1824.37 ± 8.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 48.29 ± 0.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1415.53 ± 9.85 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 41.42 ± 0.17 |
build: f5acfb2ff (7535)
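For anyone wanting to reproduce these tables, a llama-bench invocation along these lines should give the same pp2048/tg32 tests at the listed depths (the model path is illustrative; point it at your local GGUF):

```bash
# Assumed reproduction command; adjust the model path to your local MXFP4 GGUF.
./build/bin/llama-bench \
  -m ~/models/gpt-oss-120b-mxfp4.gguf \
  -fa 1 -p 2048 -n 32 \
  -d 0,4096,8192,16384,32768
```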
I am surprised that I only achieve a maximum of approx. 12 t/s with the MXFP4 version via the llama.cpp WebUI (via llama-bench I get 59 t/s, tg128 @ d4096). With the Q8-XL version, I get 54 t/s via llama-bench as well as via the WebUI (and continue.dev).
My command line (I use the long-form arguments because they are easier for me to understand even after some time has passed):
/home/cjg/Projekte/01_llama.cpp/llama.cpp/build/bin/llama-server -hf unsloth/gpt-oss-120b-GGUF:Q8_K_XL --alias "gpt-oss-120b|Q8-XL" --jinja --gpu-layers 999 --ctx-size 128000 --host 0.0.0.0 --port 51011 --flash-attn 1 --batch-size 2048 --ubatch-size 2048 --no-mmap --log-file /home/cjg/.cache/llama.cpp/log/llama-server.log --log-timestamps --log-verbosity 3
Is this also the case for you, or is my command line incorrect?
The command you provided isn’t for MXFP4 with llama-server (it loads the Q8_K_XL quant). There’s no reason it should be any slower with llama-server than with llama-bench. I recommend trying the ggml-org quant: ggml-org/gpt-oss-120b-GGUF · Hugging Face
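A sketch of what that could look like, mirroring your Q8-XL command (the alias and port below are placeholders, not tested settings):

```bash
# Sketch only: MXFP4 variant of the command above; alias and port are placeholders.
llama-server -hf ggml-org/gpt-oss-120b-GGUF \
  --alias "gpt-oss-120b|MXFP4" --jinja --gpu-layers 999 \
  --ctx-size 128000 --flash-attn 1 \
  --host 0.0.0.0 --port 51012
```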
After checking the simpler command line (thanks), I noticed that I had forgotten to reset the --gpu-layers parameter to 999 …
Actually, it can be. llama-bench runs inference directly through the C++ engine, without all the extra overhead of the network, the HTTP layer, Jinja templates, etc. I see quite a difference between the numbers reported by llama-bench and what you get on the client side.
Over the past couple of days I’ve been working on my own benchmarking tool that works with any OpenAI-compatible endpoint and outputs results in a format similar to llama-bench. It also tries to estimate server overhead so the prefill numbers are more accurate. Once I’m happy with the logic, I’ll publish it on GitHub.
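The tool isn’t published yet, but as a rough illustration of the kind of client-side timing it relies on: with a streaming request against any OpenAI-compatible endpoint, the time to the first byte already includes the HTTP/templating/network overhead that llama-bench never sees. A minimal sketch (endpoint URL and model name are placeholders):

```bash
# Illustration only, not the actual tool: with "stream": true, curl's
# time_starttransfer approximates the client-visible time to first response,
# which includes HTTP/templating/network overhead on top of raw prefill time.
curl -s -o /dev/null \
  -w 'ttfr=%{time_starttransfer}s total=%{time_total}s\n' \
  http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-oss-120b","stream":true,"max_tokens":32,"messages":[{"role":"user","content":"benchmark prompt here"}]}'
```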
I developed one a couple of years ago: GitHub - coder543/llm-speed-benchmark: A tool that can be used to measure the sequential performance of any OpenAI-compatible LLM API
I was thinking about getting it back out the other day and updating it.
I’d seen it and a few others before deciding to develop yet another one. I couldn’t find any that could measure at different context lengths, reliably work around caching issues, and handle MTP/speculative decoding well, while presenting the numbers the same way as llama-bench and estimating prefill speeds as accurately as possible.
Mine doesn’t have plotting and the like, though, or even concurrency yet.
Well, here are the results from my new benchmarking tool. The numbers are lower than llama-bench’s, even though I’ve done my best to account for the extra processing and network latency needed to serve the request. But this is more indicative of actual performance, and the reported numbers are pretty close to what llama-server itself reports in its logs.
llama-bench
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 2449.83 ± 10.27 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 57.85 ± 0.44 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 2293.59 ± 8.99 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 54.81 ± 0.30 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 2147.98 ± 10.64 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 52.14 ± 0.50 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1845.71 ± 7.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 48.53 ± 0.36 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1404.70 ± 7.36 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 41.72 ± 0.18 |
build: f5acfb2ff (7535)
llama-benchy (my tool)
| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| openai/gpt-oss-120b | pp2048 | 2016.80 ± 55.46 | 1053.99 ± 27.56 | 1016.23 ± 27.56 | 1114.54 ± 27.00 |
| openai/gpt-oss-120b | tg32 | 53.02 ± 0.75 | | | |
| openai/gpt-oss-120b | ctx_pp @ d4096 | 1993.70 ± 72.04 | 2095.26 ± 74.49 | 2057.50 ± 74.49 | 2162.88 ± 76.13 |
| openai/gpt-oss-120b | ctx_tg @ d4096 | 47.67 ± 1.73 | | | |
| openai/gpt-oss-120b | pp2048 @ d4096 | 1745.20 ± 96.72 | 1214.85 ± 64.89 | 1177.10 ± 64.89 | 1283.75 ± 66.40 |
| openai/gpt-oss-120b | tg32 @ d4096 | 46.60 ± 1.12 | | | |
| openai/gpt-oss-120b | ctx_pp @ d8192 | 1812.17 ± 29.03 | 4559.83 ± 72.16 | 4522.08 ± 72.16 | 4636.35 ± 71.30 |
| openai/gpt-oss-120b | ctx_tg @ d8192 | 41.05 ± 1.22 | | | |
| openai/gpt-oss-120b | pp2048 @ d8192 | 1332.17 ± 65.34 | 1578.81 ± 75.80 | 1541.05 ± 75.80 | 1657.52 ± 77.52 |
| openai/gpt-oss-120b | tg32 @ d8192 | 40.58 ± 1.16 | | | |
| openai/gpt-oss-120b | ctx_pp @ d16384 | 1501.69 ± 22.91 | 10951.33 ± 165.94 | 10913.57 ± 165.94 | 11043.11 ± 169.81 |
| openai/gpt-oss-120b | ctx_tg @ d16384 | 34.38 ± 1.64 | | | |
| openai/gpt-oss-120b | pp2048 @ d16384 | 989.61 ± 60.90 | 2115.09 ± 127.44 | 2077.34 ± 127.44 | 2208.19 ± 131.26 |
| openai/gpt-oss-120b | tg32 @ d16384 | 33.86 ± 1.50 | | | |
| openai/gpt-oss-120b | ctx_pp @ d32768 | 1145.69 ± 41.67 | 28678.42 ± 1066.31 | 28640.66 ± 1066.31 | 28799.58 ± 1073.84 |
| openai/gpt-oss-120b | ctx_tg @ d32768 | 25.87 ± 1.72 | | | |
| openai/gpt-oss-120b | pp2048 @ d32768 | 662.55 ± 55.96 | 3150.71 ± 259.44 | 3112.95 ± 259.44 | 3273.24 ± 267.15 |
| openai/gpt-oss-120b | tg32 @ d32768 | 25.60 ± 1.70 | | | |
llama-benchy (build: 72470d9)
date: 2026-01-04 09:56:05 | latency mode: generation
Running on vLLM (single Spark)
You can see that vLLM is much, much better at prompt processing, which confirms my subjective observations. Too bad that token generation is initially slower, though it catches up at longer contexts.
| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| openai/gpt-oss-120b | pp2048 | 4663.70 ± 42.51 | 521.55 ± 4.03 | 439.17 ± 4.03 | 614.72 ± 3.29 |
| openai/gpt-oss-120b | tg32 | 33.55 ± 0.05 | | | |
| openai/gpt-oss-120b | ctx_pp @ d4096 | 4057.10 ± 8.24 | 1091.97 ± 2.05 | 1009.59 ± 2.05 | 1186.76 ± 1.51 |
| openai/gpt-oss-120b | ctx_tg @ d4096 | 32.72 ± 0.07 | | | |
| openai/gpt-oss-120b | pp2048 @ d4096 | 3172.51 ± 16.16 | 727.94 ± 3.29 | 645.56 ± 3.29 | 821.97 ± 3.97 |
| openai/gpt-oss-120b | tg32 @ d4096 | 32.63 ± 0.02 | | | |
| openai/gpt-oss-120b | ctx_pp @ d8192 | 3548.29 ± 7.68 | 2391.10 ± 4.99 | 2308.73 ± 4.99 | 2489.23 ± 4.45 |
| openai/gpt-oss-120b | ctx_tg @ d8192 | 31.22 ± 0.04 | | | |
| openai/gpt-oss-120b | pp2048 @ d8192 | 2687.84 ± 9.43 | 844.34 ± 2.68 | 761.96 ± 2.68 | 941.50 ± 2.64 |
| openai/gpt-oss-120b | tg32 @ d8192 | 31.50 ± 0.10 | | | |
| openai/gpt-oss-120b | ctx_pp @ d16384 | 2931.35 ± 8.62 | 5671.66 ± 16.44 | 5589.28 ± 16.44 | 5778.30 ± 16.48 |
| openai/gpt-oss-120b | ctx_tg @ d16384 | 28.77 ± 0.04 | | | |
| openai/gpt-oss-120b | pp2048 @ d16384 | 2044.47 ± 8.49 | 1084.12 ± 4.17 | 1001.75 ± 4.17 | 1186.10 ± 4.70 |
| openai/gpt-oss-120b | tg32 @ d16384 | 29.55 ± 0.01 | | | |
| openai/gpt-oss-120b | ctx_pp @ d32768 | 2210.17 ± 0.54 | 14908.39 ± 3.60 | 14826.02 ± 3.60 | 15031.63 ± 4.44 |
| openai/gpt-oss-120b | ctx_tg @ d32768 | 24.97 ± 0.03 | | | |
| openai/gpt-oss-120b | pp2048 @ d32768 | 1398.80 ± 3.97 | 1546.50 ± 4.15 | 1464.13 ± 4.15 | 1659.47 ± 4.98 |
| openai/gpt-oss-120b | tg32 @ d32768 | 26.65 ± 0.01 | | | |
llama-benchy (0.1.1.dev1+g7646c3141.7646c3141)
date: 2026-01-06 12:52:13 | latency mode: generation
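For anyone wanting to try the same comparison, a vLLM launch generally looks something like the sketch below; the exact flags used for the run above weren’t posted, so treat everything here as an assumption:

```bash
# Hypothetical vLLM launch; the flags for the benchmarked run above were not posted.
vllm serve openai/gpt-oss-120b \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 65536
```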
This is a great tool, thanks for sharing. Here are my results on an MSI EdgeXpert Spark with the latest llama.cpp and these options:
~/llama.cpp/build/bin/llama-server -m ~/models/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  --port 8000 \
  --host 0.0.0.0 \
  --alias "gpt-oss-120b" \
  --jinja \
  --temp 1.0 \
  --top-p 1.0 \
  --top-k 0 \
  --ctx-size 0 \
  -b 2048 -ub 2048 \
  --reasoning-format auto
| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------------------|----------------:|-----------------:|-------------------:|-------------------:|-------------------:|
| openai/gpt-oss-120b | pp2048 | 1714.43 ± 124.48 | 1254.94 ± 90.62 | 1201.14 ± 90.62 | 1323.57 ± 92.01 |
| openai/gpt-oss-120b | tg32 | 45.70 ± 0.23 | | | |
| openai/gpt-oss-120b | pp2048 @ d4096 | 1935.03 ± 55.36 | 3231.89 ± 90.99 | 3178.09 ± 90.99 | 3307.32 ± 91.15 |
| openai/gpt-oss-120b | tg32 @ d4096 | 41.33 ± 0.83 | | | |
| openai/gpt-oss-120b | pp2048 @ d8192 | 1739.80 ± 25.70 | 5941.18 ± 86.18 | 5887.38 ± 86.18 | 6024.52 ± 86.94 |
| openai/gpt-oss-120b | tg32 @ d8192 | 36.39 ± 1.23 | | | |
| openai/gpt-oss-120b | pp2048 @ d16384 | 1434.99 ± 40.09 | 12909.05 ± 352.39 | 12855.25 ± 352.39 | 13006.15 ± 355.23 |
| openai/gpt-oss-120b | tg32 @ d16384 | 30.69 ± 0.84 | | | |
| openai/gpt-oss-120b | pp2048 @ d32768 | 1127.11 ± 40.12 | 30983.80 ± 1120.50 | 30930.00 ± 1120.50 | 31108.35 ± 1128.23 |
| openai/gpt-oss-120b | tg32 @ d32768 | 24.00 ± 1.40 | | | |
llama-benchy (0.1.1)
date: 2026-01-07 01:54:54 | latency mode: generation
I’ve got a second Spark on order. Is vLLM missing these same Blackwell optimizations for the Spark, and do you think they will be added?
Yes, vLLM is still missing Blackwell-optimized paths for FP4; I hope they will be added soon, though.
Thanks for all your contributions. I look forward to trying your vLLM-optimized Docker container setup soon; I want to try out Minimax M2.1.
