llama.cpp experimental native MXFP4 support for Blackwell PR

Looks like we're about to get some extra prompt-processing (PP) performance in llama.cpp via this PR: https://github.com/ggml-org/llama.cpp/pull/17906.


Nice boost! Which compilation flag did you use?
-DCMAKE_CUDA_ARCHITECTURES="120f"?

I had to specify -DCMAKE_CUDA_ARCHITECTURES=121a-real.
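For reference, a minimal configure/build sketch using that flag might look like the following (assuming llama.cpp's standard GGML_CUDA CMake option; paths and job count are up to you):

```bash
# Build llama.cpp with CUDA for the Spark, using the
# architecture value quoted above (121a-real).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121a-real
cmake --build build --config Release -j
```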
Getting a very nice boost on gpt-oss-120b:

December 19, 2025

Spark:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 1937.39 ± 9.65 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 59.05 ± 0.13 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 1842.81 ± 2.61 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 55.90 ± 0.45 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 1761.17 ± 5.50 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 53.14 ± 0.62 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1555.17 ± 5.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 49.49 ± 0.30 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1269.19 ± 4.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 42.87 ± 0.18 |

build: 74e05131e (7486)

December 24, 2025

Spark:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 2438.11 ± 13.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 57.81 ± 0.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 2294.32 ± 12.61 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 54.68 ± 0.52 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 2149.21 ± 8.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 51.75 ± 0.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1824.37 ± 8.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 48.29 ± 0.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1415.53 ± 9.85 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 41.42 ± 0.17 |

build: f5acfb2ff (7535)


I am surprised that I only achieve a maximum of approx. 12 t/s with the MXFP4 version via the llama.cpp WebUI (via llama-bench I get 59 t/s, tg128 @ d4096). With the Q8-XL version, I get 54 t/s via llama-bench as well as via the WebUI (and continue.dev).

My command line (I use the long-form arguments because they are easier for me to understand even after some time has passed):

/home/cjg/Projekte/01_llama.cpp/llama.cpp/build/bin/llama-server -hf unsloth/gpt-oss-120b-GGUF:Q8_K_XL --alias "gpt-oss-120b|Q8-XL" --jinja --gpu-layers 999 --ctx-size 128000 --host 0.0.0.0 --port 51011 --flash-attn 1 --batch-size 2048 --ubatch-size 2048 --no-mmap --log-file /home/cjg/.cache/llama.cpp/log/llama-server.log --log-timestamps --log-verbosity 3

Is this also the case for you, or is my command line incorrect?

The command you provided isn't for MXFP4 with llama-server. There's no reason it should be any slower with llama-server than with llama-bench. I recommend trying the ggml-org quant: ggml-org/gpt-oss-120b-GGUF on Hugging Face.
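For example, a minimal llama-server invocation for that quant could look like the sketch below (it simply mirrors the flags from the command above; the alias string, port and context size are placeholders):

```bash
# Sketch: serve the ggml-org MXFP4 quant instead of the Q8_K_XL one.
# Alias, port and ctx-size are placeholders; adjust to taste.
llama-server \
  -hf ggml-org/gpt-oss-120b-GGUF \
  --alias "gpt-oss-120b|MXFP4" \
  --jinja \
  --gpu-layers 999 \
  --ctx-size 128000 \
  --flash-attn 1 \
  --host 0.0.0.0 --port 51011
```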

After checking that simpler command line (thanks), I noticed that I had forgotten to reset the --gpu-layers parameter to 999 …

Actually, it can be. llama-bench runs inference directly in the C++ engine, without all the extra overhead: network, HTTP layer, Jinja templates, etc. I see quite a difference between the numbers reported by llama-bench and what you get on the client side.

Over the past couple of days I've been working on my own benchmarking tool that works against any OpenAI-compatible endpoint and outputs results in a format similar to llama-bench's. It also tries to estimate server overhead for more accurate prefill numbers. Once I'm happy with the logic, I'll publish it on GitHub.
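To illustrate the kind of measurement involved (this is not the tool itself), even a single streaming request against an OpenAI-compatible endpoint shows time-to-first-byte (roughly prefill plus server overhead) versus total time. A rough probe, assuming llama-server is listening on localhost:8000 and reports that model name:

```bash
# Rough probe (not llama-benchy): with streaming enabled, time_starttransfer
# approximates prefill + server overhead; time_total includes the 32 tokens.
curl -s -o /dev/null \
  -w 'ttfb: %{time_starttransfer}s  total: %{time_total}s\n' \
  http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"openai/gpt-oss-120b","stream":true,"max_tokens":32,"messages":[{"role":"user","content":"Say hello."}]}'
```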


I developed one a couple of years ago: coder543/llm-speed-benchmark on GitHub, a tool for measuring the sequential performance of any OpenAI-compatible LLM API.

I was thinking about getting it back out the other day and updating it.

I looked at it and a few others before deciding to develop yet another one. I couldn't find any that could measure at different context lengths, reliably get around caching issues, and work well with MTP/speculative decoding, while presenting the numbers the same way as llama-bench and getting prefill speeds as accurately as possible.
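The simplest way to get around prompt caching between runs, for instance, is to prefix each prompt with a unique nonce so the server cannot reuse a cached prefix. A minimal curl sketch (endpoint and model name are placeholders):

```bash
# Each run gets a unique prefix, so no cached prompt prefix matches.
NONCE=$(uuidgen)
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "{\"model\": \"openai/gpt-oss-120b\", \"max_tokens\": 32, \"messages\": [{\"role\": \"user\", \"content\": \"[$NONCE] Summarize the benchmark setup.\"}]}"
```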

Mine doesn't have plotting etc. yet, though, or even concurrency.


Well, here are the results from my new benchmarking tool. The numbers are lower than llama-bench's, even though I've done everything to account for the extra processing and network latency needed to serve a request. But this is more indicative of actual performance, and the reported numbers are pretty close to what llama-server itself reports in its logs.

llama-bench

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 2449.83 ± 10.27 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 57.85 ± 0.44 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 2293.59 ± 8.99 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 54.81 ± 0.30 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 2147.98 ± 10.64 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 52.14 ± 0.50 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1845.71 ± 7.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 48.53 ± 0.36 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1404.70 ± 7.36 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 41.72 ± 0.18 |

build: f5acfb2ff (7535)

llama-benchy (my tool)

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- |
| openai/gpt-oss-120b | pp2048 | 2016.80 ± 55.46 | 1053.99 ± 27.56 | 1016.23 ± 27.56 | 1114.54 ± 27.00 |
| openai/gpt-oss-120b | tg32 | 53.02 ± 0.75 | | | |
| openai/gpt-oss-120b | ctx_pp @ d4096 | 1993.70 ± 72.04 | 2095.26 ± 74.49 | 2057.50 ± 74.49 | 2162.88 ± 76.13 |
| openai/gpt-oss-120b | ctx_tg @ d4096 | 47.67 ± 1.73 | | | |
| openai/gpt-oss-120b | pp2048 @ d4096 | 1745.20 ± 96.72 | 1214.85 ± 64.89 | 1177.10 ± 64.89 | 1283.75 ± 66.40 |
| openai/gpt-oss-120b | tg32 @ d4096 | 46.60 ± 1.12 | | | |
| openai/gpt-oss-120b | ctx_pp @ d8192 | 1812.17 ± 29.03 | 4559.83 ± 72.16 | 4522.08 ± 72.16 | 4636.35 ± 71.30 |
| openai/gpt-oss-120b | ctx_tg @ d8192 | 41.05 ± 1.22 | | | |
| openai/gpt-oss-120b | pp2048 @ d8192 | 1332.17 ± 65.34 | 1578.81 ± 75.80 | 1541.05 ± 75.80 | 1657.52 ± 77.52 |
| openai/gpt-oss-120b | tg32 @ d8192 | 40.58 ± 1.16 | | | |
| openai/gpt-oss-120b | ctx_pp @ d16384 | 1501.69 ± 22.91 | 10951.33 ± 165.94 | 10913.57 ± 165.94 | 11043.11 ± 169.81 |
| openai/gpt-oss-120b | ctx_tg @ d16384 | 34.38 ± 1.64 | | | |
| openai/gpt-oss-120b | pp2048 @ d16384 | 989.61 ± 60.90 | 2115.09 ± 127.44 | 2077.34 ± 127.44 | 2208.19 ± 131.26 |
| openai/gpt-oss-120b | tg32 @ d16384 | 33.86 ± 1.50 | | | |
| openai/gpt-oss-120b | ctx_pp @ d32768 | 1145.69 ± 41.67 | 28678.42 ± 1066.31 | 28640.66 ± 1066.31 | 28799.58 ± 1073.84 |
| openai/gpt-oss-120b | ctx_tg @ d32768 | 25.87 ± 1.72 | | | |
| openai/gpt-oss-120b | pp2048 @ d32768 | 662.55 ± 55.96 | 3150.71 ± 259.44 | 3112.95 ± 259.44 | 3273.24 ± 267.15 |
| openai/gpt-oss-120b | tg32 @ d32768 | 25.60 ± 1.70 | | | |

llama-benchy (build: 72470d9)
date: 2026-01-04 09:56:05 | latency mode: generation

Running on vLLM (single Spark)

You can see that vLLM is much, much better at prompt processing, which confirms my subjective observations. Too bad token generation is slower at short contexts, though it catches up as context grows.

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- |
| openai/gpt-oss-120b | pp2048 | 4663.70 ± 42.51 | 521.55 ± 4.03 | 439.17 ± 4.03 | 614.72 ± 3.29 |
| openai/gpt-oss-120b | tg32 | 33.55 ± 0.05 | | | |
| openai/gpt-oss-120b | ctx_pp @ d4096 | 4057.10 ± 8.24 | 1091.97 ± 2.05 | 1009.59 ± 2.05 | 1186.76 ± 1.51 |
| openai/gpt-oss-120b | ctx_tg @ d4096 | 32.72 ± 0.07 | | | |
| openai/gpt-oss-120b | pp2048 @ d4096 | 3172.51 ± 16.16 | 727.94 ± 3.29 | 645.56 ± 3.29 | 821.97 ± 3.97 |
| openai/gpt-oss-120b | tg32 @ d4096 | 32.63 ± 0.02 | | | |
| openai/gpt-oss-120b | ctx_pp @ d8192 | 3548.29 ± 7.68 | 2391.10 ± 4.99 | 2308.73 ± 4.99 | 2489.23 ± 4.45 |
| openai/gpt-oss-120b | ctx_tg @ d8192 | 31.22 ± 0.04 | | | |
| openai/gpt-oss-120b | pp2048 @ d8192 | 2687.84 ± 9.43 | 844.34 ± 2.68 | 761.96 ± 2.68 | 941.50 ± 2.64 |
| openai/gpt-oss-120b | tg32 @ d8192 | 31.50 ± 0.10 | | | |
| openai/gpt-oss-120b | ctx_pp @ d16384 | 2931.35 ± 8.62 | 5671.66 ± 16.44 | 5589.28 ± 16.44 | 5778.30 ± 16.48 |
| openai/gpt-oss-120b | ctx_tg @ d16384 | 28.77 ± 0.04 | | | |
| openai/gpt-oss-120b | pp2048 @ d16384 | 2044.47 ± 8.49 | 1084.12 ± 4.17 | 1001.75 ± 4.17 | 1186.10 ± 4.70 |
| openai/gpt-oss-120b | tg32 @ d16384 | 29.55 ± 0.01 | | | |
| openai/gpt-oss-120b | ctx_pp @ d32768 | 2210.17 ± 0.54 | 14908.39 ± 3.60 | 14826.02 ± 3.60 | 15031.63 ± 4.44 |
| openai/gpt-oss-120b | ctx_tg @ d32768 | 24.97 ± 0.03 | | | |
| openai/gpt-oss-120b | pp2048 @ d32768 | 1398.80 ± 3.97 | 1546.50 ± 4.15 | 1464.13 ± 4.15 | 1659.47 ± 4.98 |
| openai/gpt-oss-120b | tg32 @ d32768 | 26.65 ± 0.01 | | | |

llama-benchy (0.1.1.dev1+g7646c3141.7646c3141)
date: 2026-01-06 12:52:13 | latency mode: generation


This is a great tool, thanks for sharing. Here are my results on an MSI EdgeXpert Spark with the latest llama.cpp, using these options:

~/llama.cpp/build/bin/llama-server -m ~/models/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  --port 8000 \
  --host 0.0.0.0 \
  --alias "gpt-oss-120b" \
  --jinja \
  --temp 1.0 \
  --top-p 1.0 \
  --top-k 0 \
  --ctx-size 0 \
  -b 2048 -ub 2048 \
  --reasoning-format auto

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- |
| openai/gpt-oss-120b | pp2048 | 1714.43 ± 124.48 | 1254.94 ± 90.62 | 1201.14 ± 90.62 | 1323.57 ± 92.01 |
| openai/gpt-oss-120b | tg32 | 45.70 ± 0.23 | | | |
| openai/gpt-oss-120b | pp2048 @ d4096 | 1935.03 ± 55.36 | 3231.89 ± 90.99 | 3178.09 ± 90.99 | 3307.32 ± 91.15 |
| openai/gpt-oss-120b | tg32 @ d4096 | 41.33 ± 0.83 | | | |
| openai/gpt-oss-120b | pp2048 @ d8192 | 1739.80 ± 25.70 | 5941.18 ± 86.18 | 5887.38 ± 86.18 | 6024.52 ± 86.94 |
| openai/gpt-oss-120b | tg32 @ d8192 | 36.39 ± 1.23 | | | |
| openai/gpt-oss-120b | pp2048 @ d16384 | 1434.99 ± 40.09 | 12909.05 ± 352.39 | 12855.25 ± 352.39 | 13006.15 ± 355.23 |
| openai/gpt-oss-120b | tg32 @ d16384 | 30.69 ± 0.84 | | | |
| openai/gpt-oss-120b | pp2048 @ d32768 | 1127.11 ± 40.12 | 30983.80 ± 1120.50 | 30930.00 ± 1120.50 | 31108.35 ± 1128.23 |
| openai/gpt-oss-120b | tg32 @ d32768 | 24.00 ± 1.40 | | | |

llama-benchy (0.1.1)

date: 2026-01-07 01:54:54 | latency mode: generation

I've got a second Spark on order. Is vLLM missing these same Blackwell optimizations for the Spark, and do you think they will be added?


Yes, vLLM is missing Blackwell-optimized paths for FP4 for now; I hope they will be added soon, though.

Thanks for all your contributions. I look forward to trying your vLLM-optimized Docker container setup soon; I want to try out MiniMax M2.1.
